Multi-modal modulated signal recognition method based on optimal transport and attention mechanism

A multimodal modulation recognition method based on optimal transport and attention mechanisms aligns and fuses the features of modulated signals. It addresses the cumbersome feature engineering of traditional methods and the low efficiency of single-modal recognition, improving recognition performance and robustness.

CN122226558A · Pending · Publication Date: 2026-06-16 · 刘慧玲

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
刘慧玲
Filing Date
2026-01-23
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies for modulated signal recognition suffer from problems such as cumbersome traditional feature engineering, inability to adapt to the era of big data, and inability to fully utilize the complementary relationships between multiple modalities when using a single modal input. Furthermore, there is a lack of effective multimodal feature alignment and fusion algorithms.

Method used

A multimodal modulation recognition method based on optimal transport and attention mechanisms is adopted. Multi-level feature processing achieves distribution alignment followed by fine-grained interactive alignment. The network is optimized with a combination of loss functions, traditional features serve as complementary multimodal data, and difference information is used to enhance the feature representation, thereby achieving feature alignment and fusion.

Benefits of technology

It improves the performance and robustness of modulation recognition networks, especially maintaining good recognition results in complex environments, avoiding cumbersome feature engineering, and making full use of the complementarity of multimodal information.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a multimodal modulated signal recognition method based on optimal transport and an attention mechanism, which transfers feature alignment and fusion techniques to the field of communication signal processing. Initial features are extracted by feature extraction networks that do not share weights. Modal features are then aligned at multiple levels: first, optimal transport is used to softly align the features in their global distribution; next, a cross-attention mechanism provides fine-grained interactive alignment of the features; then, differential attention strengthens the understanding of feature differences in order to mine modality-specific features; finally, the concatenated features are used for recognition. The application emphasizes the importance of alignment before fusion in multimodal recognition, improves the performance and robustness of the multimodal modulation recognition network by jointly optimizing the network with multiple losses, and can be used for communication signal recognition in complex environments.

Description

Technical Field

[0001] This invention belongs to the field of wireless communication and signal processing, and further relates to an intelligent modulation signal recognition method using feature alignment and fusion technology under multimodal conditions, which can be used to improve the performance and robustness of modulation recognition networks in complex environments.

Background Technology

[0002] In digital communication systems, to improve the reliability and efficiency of communication transmission, the signal transmitter typically modulates the signal, and the signal receiver demodulates the received signal to obtain the source information. Demodulation requires obtaining the modulation pattern of the signal. Traditional modulation identification methods mainly transform the signal to wavelet domain, frequency domain, and time-frequency domain using techniques such as cyclostationary characteristics, time-frequency distribution, spectral correlation, power spectrum, wavelet transform, instantaneous phase, instantaneous frequency, and higher-order cumulants. Traditional methods can selectively extract modulation information, but feature analysis requires strong domain knowledge, making it difficult for non-professionals to replicate expert experience, and they cannot cope with the challenges of the dramatic increase in data volume in the era of big data.

[0003] Deep learning excels at discovering the latent structures and patterns in high-dimensional data, saving significant time on feature engineering. Using deep learning for signal modulation recognition has become a major trend; however, most recognition tasks focus on single-modal inputs, failing to fully utilize the complementarity of information between multiple modalities in real-world scenarios. While some research has improved recognition performance using multimodal inputs, most studies concentrate on using complex fusion methods instead of simple summation or concatenation, neglecting the inherent semantic and distributional differences in the embedding spaces of different modalities. This can lead to fusion results that are potentially inferior to single-modal recognition results.

[0004] In its patent application, "A Modulation Recognition Method Based on Phase Entropy Blind Frequency Offset Compensation," the University of Electronic Science and Technology of China (UESTC) utilizes polar coordinate transformation to extract features such as higher-order cumulants, skewness, peak-to-peak ratio, and power spectrum for signal processing. These features are then used as input to a neural network to obtain classification confidence, and the class is estimated by combining phase entropy. This invention cleverly combines traditional features with deep features, achieving excellent modulation recognition. However, its shortcomings include relying heavily on expert knowledge during signal processing, resulting in lengthy feature engineering. Furthermore, it uses only a single mode as input, neglecting the information complementarity between temporal features and other inputs, such as traditional features.

[0005] In its patent application, "A Multimodal Modulation Signal Recognition Method Based on Gated Attention Fusion and Weighted Loss," the Army Engineering University of the Chinese People's Liberation Army employs multiple feature extraction networks to extract features from multimodal modulation signals separately. It then uses a gated attention mechanism to fuse the extracted features and constructs a triple-weighted loss function to optimize the entire network. This invention uses attention weights to adaptively fuse multimodal features, improving signal recognition accuracy under low signal-to-noise ratio conditions. However, it directly performs the fusion operation during feature fusion without considering the inconsistencies between different modalities and the network at the feature embedding space level.

[0006] While many scholars have researched modal feature alignment and fusion techniques for multimodal recognition tasks, most studies have focused on image, speech, and text data processing, and research on multimodal feature alignment and fusion algorithms in the field of communication signals is lacking. Therefore, a novel modulation signal recognition method is needed that avoids extremely cumbersome feature engineering, fully utilizes the complementarity between different modalities, and aligns the feature embedding space. This method would transfer feature alignment and fusion to the field of communication signal processing, thereby improving the performance of modulation recognition networks.

Summary of the Invention

[0007] The purpose of this invention is to provide a multimodal modulation recognition method based on optimal transport and attention mechanisms that overcomes two shortcomings of existing modulation recognition technology: traditional feature engineering is cumbersome and cannot adapt to the era of big data, and a single input cannot fully utilize the complementary relationship between multiple modalities. The method thereby achieves feature alignment and fusion in the field of communication signals and improves the performance and robustness of recognition networks.

[0008] Compared with the prior art, the present invention has the following advantages:

[0009] 1. Multi-level feature processing: From distribution alignment to fine-grained interactive alignment, it not only achieves soft alignment on the global distribution, but also completes sample-level feature interaction to achieve progressive feature alignment.

[0010] 2. Complementary multimodal input information: Commonly used traditional features are used as multimodal data to avoid cumbersome traditional feature engineering, and at the same time, they serve as complementary information to improve recognition performance.

[0011] 3. Full utilization of differential information: Treating intermodal differences as useful information rather than noise, feature representation is enhanced through differential attention to make full use of specific information.

[0012] 4. Joint optimization of multiple losses: The distribution alignment is incorporated into the loss part to reduce the computational complexity in the forward propagation process. Feature consistency loss is used to ensure the correct direction of feature processing, and the network is jointly optimized by the basic classification loss.

Attached Figure Description

[0013] Figure 1 is a flowchart of the implementation of the present invention;

[0014] Figure 2 is an overall structural diagram of the multimodal modulation recognition network model of the present invention;

[0015] Figure 3 shows the confusion matrix of the simulation experiment of the present invention on dataset one;

[0016] Figure 4 shows the confusion matrix of the simulation experiment of the present invention on dataset two.

Detailed Implementation

[0017] The implementation and effects of the invention will be further described below with reference to the accompanying drawings.

[0018] With reference to Figure 1, the implementation steps of this invention are as follows:

[0019] Step 1: Obtain the multimodal dataset.

[0020] The original dataset RML2016.10a is used. For each IQ mode signal in the original dataset, its amplitude, phase, and frequency features are extracted and stacked along the channel dimension to form the APF mode data. The IQ mode and APF mode data together constitute the multimodal dataset. The samples of the multimodal dataset with SNR from 0 dB to 18 dB form dataset one, and the samples with SNR from -20 dB to 18 dB form dataset two.
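The construction of the APF mode data in Step 1 can be sketched as follows. This is a minimal illustration, not the filing's implementation: the function name `iq_to_apf` is hypothetical, and the instantaneous frequency is assumed to be the first difference of the unwrapped phase, which the source does not specify.

```python
import math

def iq_to_apf(i_samples, q_samples):
    """Return (amplitude, phase, frequency) sequences for one IQ frame."""
    amplitude = [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]
    phase = [math.atan2(q, i) for i, q in zip(i_samples, q_samples)]
    # Unwrap the phase so jumps across +/-pi do not appear as frequency spikes.
    unwrapped = [phase[0]]
    for p in phase[1:]:
        delta = p - unwrapped[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + delta)
    # Assumed definition: frequency is the first difference of the unwrapped
    # phase (radians per sample); the first value is repeated to keep the length.
    diffs = [b - a for a, b in zip(unwrapped, unwrapped[1:])]
    frequency = [diffs[0]] + diffs if diffs else [0.0]
    return amplitude, phase, frequency

# A complex tone at 0.1 cycles per sample, 128 samples long.
n = 128
i_sig = [math.cos(2 * math.pi * 0.1 * k) for k in range(n)]
q_sig = [math.sin(2 * math.pi * 0.1 * k) for k in range(n)]
apf_frame = iq_to_apf(i_sig, q_sig)   # three channels, each of length 128
```

Stacking the three returned sequences as channels gives one APF sample that pairs with the two-channel IQ sample it was derived from.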

[0021] Step 2: Divide the experimental dataset.

[0022] 2.1) For the modulation data in each category, the IQ mode data, APF mode data, and category label are matched one-to-one to form a signal sample pair; these pairs together constitute the paired sample set;

[0023] 2.2) Dataset one and dataset two are divided using the same ratio. 50% of the samples in the paired sample set are randomly selected to form the training sample set. The remaining 50% are randomly divided into the validation sample set and the test sample set in a ratio of 2:3, giving an overall training : validation : test ratio of 5:2:3.
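The division in Step 2 can be sketched as follows. This is a hedged illustration: `split_pairs`, the seed, and the placeholder sample names are hypothetical, and the sketch shuffles the whole paired set at once rather than per category.

```python
import random

def split_pairs(pairs, seed=0):
    """Shuffle paired samples and split them 50% / 20% / 30%."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) // 2          # 50% for training
    rest = shuffled[n_train:]
    n_val = len(rest) * 2 // 5            # 2 parts of the 2:3 split
    return shuffled[:n_train], rest[:n_val], rest[n_val:]

# Hypothetical (IQ sample, APF sample, label) triples standing in for real data.
pairs = [("iq_%d" % k, "apf_%d" % k, k % 11) for k in range(1000)]
train_set, val_set, test_set = split_pairs(pairs)
```

Because each triple is moved as a unit, the IQ sample, APF sample, and label of a pair always land in the same subset.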

[0024] Step 3: Construct a multimodal modulation recognition network model.

[0025] 3.1) Feature extraction module.

[0026] The feature extraction module uses a ResNet18 network structure. To adapt it to one-dimensional input data, the two-dimensional convolutional layers (Conv2d) of the original ResNet18 are replaced with one-dimensional convolutional layers (Conv1d), the two-dimensional pooling layers (MaxPool2d) with one-dimensional pooling layers (MaxPool1d), and the two-dimensional batch normalization layers (BatchNorm2d) with one-dimensional batch normalization layers (BatchNorm1d). The kernel sizes and block configuration are otherwise unchanged. Both modalities use the same feature extraction network structure, but the two networks do not share parameters.

[0027] 3.2) Optimal transport distribution alignment module.

[0028] After the feature extraction module, the initial deep features of the IQ mode and APF mode are obtained. The Euclidean distance between each pair of IQ mode and APF mode features within a batch is calculated and used as the corresponding element of the cost matrix. Then, the Sinkhorn algorithm with entropy regularization is used to solve the optimal transport problem and obtain the optimal transport plan matrix. The sum of the element-wise products of the cost matrix and the transport plan matrix is calculated as the Wasserstein distance, which measures the difference between the two distributions. It is used as part of the loss function to optimize the network parameters, thus completing the soft alignment of the feature distributions.

[0029] 3.3) Cross-attention interaction alignment module.

[0030] Using the initial deep features as input, a multi-head attention mechanism with 4 heads is employed to calculate the cross-attention from IQ modality features to APF modality features and the cross-attention from APF modality features to IQ modality features. Specifically, the cross-attention from IQ modality features to APF modality features uses IQ modality features as the query Q and APF modality features as the key K and value V; the cross-attention from APF modality features to IQ modality features uses APF modality features as the query Q and IQ modality features as the key K and value V. Finally, a normalization layer and a fully connected layer are used as feedforward networks to enhance the feature representation, resulting in a fine-grained interactively aligned modality feature representation.
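The bidirectional query/key/value assignment in 3.3) can be illustrated with a single-head scaled dot-product cross-attention. This is a simplified sketch, not the 4-head module of the invention: the learned projection matrices, the normalization layer, and the fully connected layer are omitted, and the token values are made up for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

# Hypothetical token sequences for the two modalities.
iq_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
apf_tokens = [[0.5, 0.5], [2.0, 0.0]]
iq_aligned = cross_attention(iq_tokens, apf_tokens, apf_tokens)   # IQ queries APF
apf_aligned = cross_attention(apf_tokens, iq_tokens, iq_tokens)   # APF queries IQ
```

Each output row has the length of the query sequence but is built entirely from the other modality's values, which is what lets one modality absorb information from the other.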

[0031] 3.4) Differential attention enhancement module.

[0032] Using the output aligned by cross-attention interaction as input, the difference representation between the two modal features is calculated, including direct difference, absolute difference, and similarity difference. These difference representations are combined along the column dimension. The combined difference representation is then concatenated with the original features along the channel dimension and input into a multi-head attention mechanism with 4 heads to learn the complex relationships between features. Normalization layers and fully connected layers are used as feedforward networks to enhance the feature representation, resulting in enhanced features. The enhanced features are then separated along the channel dimension to obtain enhanced differential features and the original features. Finally, a gating mechanism consisting of single-layer linear layers is used to adaptively control the fusion of differential features and the original features, resulting in feature representations that emphasize the specificity of each modality.
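The difference representation and the gate in 3.4) can be sketched as follows. Several details are assumptions: the source does not define the "similarity difference", so the sketch uses the per-dimension contribution to the cosine similarity; the intermediate multi-head attention and feedforward stages are skipped; and the gate weights are hypothetical fixed values rather than learned parameters.

```python
import math

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def difference_features(f_iq, f_apf):
    """Concatenate direct, absolute, and (assumed) similarity differences."""
    direct = [a - b for a, b in zip(f_iq, f_apf)]
    absolute = [abs(d) for d in direct]
    # Assumed form: element-wise product of the L2-normalised features.
    similarity = [a * b for a, b in zip(l2_normalise(f_iq), l2_normalise(f_apf))]
    return direct + absolute + similarity

def gated_fusion(original, enhanced, gate_weights, gate_bias):
    """Blend two equal-length vectors with a sigmoid gate from one linear layer."""
    joint = original + enhanced
    fused = []
    for d in range(len(original)):
        logit = sum(w * x for w, x in zip(gate_weights[d], joint)) + gate_bias[d]
        g = 1.0 / (1.0 + math.exp(-logit))
        fused.append(g * enhanced[d] + (1.0 - g) * original[d])
    return fused

diff = difference_features([1.0, 2.0], [1.0, 0.0])
# Zero weights and bias give a gate of 0.5, i.e. an equal blend.
fused = gated_fusion([1.0, 2.0], [3.0, 4.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
```

With learned gate weights, the gate value would vary per dimension and per sample, letting the network decide how much modality-specific difference information to keep.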

[0033] 3.5) Classification module.

[0034] The outputs of the differential attention enhancement module are concatenated and used as the input to the classification module, which consists of fully connected layers of sizes 256, 128, and 64, a batch normalization layer, and a ReLU activation function.

[0035] 3.6) Set the loss function and optimization algorithm for the recognition network model.

[0036] 3.6.1) The loss function of the recognition network consists of three parts: classification loss, optimal transport alignment loss, and feature consistency loss. The classification loss uses the cross-entropy loss function, as shown in the following formula:

[0037] $$L_{cls} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

[0038] where $C$ is the number of modulation types during training, $\hat{y}$ is the class vector predicted for the sample by the recognition network, and $y$ is the one-hot representation of the true class label of the sample;

[0039] The optimal transport alignment loss uses the Sinkhorn approximation of the Wasserstein distance as the loss, as shown in the following formula:

[0040] $$L_{OT} = \min_{P \in \Pi(\mu, \nu)} \langle C, P \rangle - \varepsilon H(P)$$

[0041] where $C$ is the cost matrix, computed as the Euclidean distance between the features of the two modalities; $P$ is a transport plan; $\Pi(\mu, \nu)$ is the set of transport plans, that is, the feasible solutions for moving feature distribution $\mu$ to feature distribution $\nu$; $\varepsilon$ is the entropy regularization coefficient, set to 0.1; and $H(P)$ is the entropy of the transport plan matrix, used as a regularizer to make the problem smooth and solvable;

[0042] The consistency loss uses the cosine similarity loss function, as shown in the following formula:

[0043] $$L_{cons} = 1 - \frac{f_{pre} \cdot f_{post}}{\lVert f_{pre} \rVert \, \lVert f_{post} \rVert}$$

[0044] where $f_{pre}$ is the feature representation before module processing and $f_{post}$ is the feature representation after module processing. The consistency loss is applied to the cross-attention interaction alignment module and the differential attention enhancement module. The average of the losses over the different modules and different modalities is taken as the final consistency loss value.
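The consistency loss and its averaging over modules and modalities can be sketched as follows. The sketch assumes the common form of one minus the cosine similarity; the function names and toy vectors are hypothetical.

```python
import math

def cosine_consistency_loss(before, after):
    """One minus the cosine similarity between two feature vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(before, after))
    norm = math.sqrt(sum(a * a for a in before)) * math.sqrt(sum(b * b for b in after))
    return 1.0 - dot / norm

def mean_consistency_loss(pairs):
    """Average the loss over (before, after) pairs from all modules and modalities."""
    return sum(cosine_consistency_loss(b, a) for b, a in pairs) / len(pairs)

pairs = [
    ([1.0, 0.0], [2.0, 0.0]),   # same direction: loss 0
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal: loss 1
]
cons_loss = mean_consistency_loss(pairs)
```

The loss ignores feature magnitude and penalizes only changes in direction, so a module may rescale features freely as long as their orientation is preserved.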

[0045] The formula for the total loss function is as follows:

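[0046] $$L = L_{cls} + \lambda_1 L_{OT} + \lambda_2 L_{cons}$$

[0047] where $L_{cls}$ is the classification loss, $L_{OT}$ is the optimal transport alignment loss, $L_{cons}$ is the consistency loss, and $\lambda_1$ and $\lambda_2$ are the weights of the OT loss and the consistency loss, respectively.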
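The combination of the three losses can be sketched as follows. The source does not state the weight values, so the defaults of 0.1 below are placeholders, and the logits and loss values in the example are made up for illustration.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against a one-hot label index."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def total_loss(cls_loss, ot_loss, cons_loss, ot_weight=0.1, cons_weight=0.1):
    # ot_weight and cons_weight are placeholder values, not taken from the source.
    return cls_loss + ot_weight * ot_loss + cons_weight * cons_loss

cls = cross_entropy([2.0, 0.0, 0.0], label=0)
loss = total_loss(cls, ot_loss=0.5, cons_loss=0.2)
```

Because the alignment terms enter only through the loss, they add no layers to the forward pass at inference time.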

[0048] 3.6.2) The optimization algorithm used is Adam, an optimization algorithm based on adaptive moment estimation.

[0049] 3.7) Set the training step size and number of iterations for the recognition network:

[0050] The training step size of the network refers to the number of training samples that are fed into the network in each batch for training. Set Batch_size to 256. The number of iterations refers to the number of times the training samples are repeatedly fed into the network for training. Set Epoch to 100.
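The training schedule can be sketched as follows. Claim 1 (2g) additionally specifies an initial learning rate of 0.001 and a cosine annealing scheduler; the minimum learning rate of 0 and the function names below are assumptions, and the sample count in the example is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=100, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (min_lr assumed 0)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

def batches_per_epoch(n_train, batch_size=256):
    """Number of batches needed to pass once over the training sample set."""
    return math.ceil(n_train / batch_size)

schedule = [cosine_annealing_lr(e) for e in range(101)]
steps = batches_per_epoch(1000)   # hypothetical training set of 1000 samples
```

The rate starts at 0.001, passes through half that value at epoch 50, and decays smoothly toward the minimum by epoch 100.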

[0051] Step 4: Train the multimodal modulation recognition network model.

[0052] 4.1) Randomly shuffle the order of the samples in the training sample set, and input the shuffled training samples into the multimodal modulation recognition network model in batches according to the training step size for iterative network training to obtain the predicted output of the input samples.

[0053] 4.2) Substitute the predicted output and actual labels obtained in 4.1) into the loss function to calculate the error loss of the training sample set; backpropagate the error loss and use the optimization function in 3.6.2) to perform gradient optimization, thereby adjusting the network parameters;

[0054] 4.3) Each iteration of the training sample set yields a candidate network. At this point, the order of the samples in the validation sample set is randomly shuffled. The shuffled validation samples are then input into the network model in batches according to the training step size for network validation. The error loss of the candidate network on the validation sample set is obtained. If the error loss is lower than that of the previous iteration, the current candidate network is saved; otherwise, it is not saved.

[0055] 4.4) Repeatedly train the multimodal modulation recognition network model until the number of training iterations reaches 100, obtain the optimal network in the iteration process, and complete the training process of the multimodal modulation recognition network model.
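The selection of the optimal network in 4.3) and 4.4) can be sketched as follows. The sketch assumes that a candidate is saved whenever its validation loss is the lowest seen so far; the function name and loss values are hypothetical.

```python
def select_best_epoch(val_losses):
    """Return (epoch, loss) of the candidate with the lowest validation loss."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # In a real training loop the network weights would be saved here.
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss

val_losses = [1.20, 0.95, 1.01, 0.90, 0.93]
best = select_best_epoch(val_losses)   # (4, 0.9)
```

Tracking the running minimum keeps the epoch-4 candidate here even though the loss rises again at epoch 5.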

[0056] Step 5: Identify the test sample set.

[0057] The test sample set is input into the multimodal modulation recognition network model trained in step 4 to obtain the prediction vector output by the model. The position of the maximum value in the prediction vector is taken as the type of modulation signal predicted for the test sample, and the multimodal modulation recognition process is completed.
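The decision rule in Step 5 can be sketched as follows. The class names and prediction values are illustrative placeholders.

```python
def predict_class(prediction_vector, class_names):
    """Return the class whose position holds the maximum prediction value."""
    best_index = max(range(len(prediction_vector)),
                     key=prediction_vector.__getitem__)
    return class_names[best_index]

class_names = ["BPSK", "QPSK", "8PSK"]      # illustrative subset of labels
predicted = predict_class([0.1, 0.7, 0.2], class_names)   # "QPSK"
```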

[0058] The effects of the present invention will be further described below with reference to simulation experiments.

[0059] 1. Simulation Experiment Conditions

[0060] The hardware platform for the simulation experiments of this invention is an HP Z840 server with an Intel i7-9700 processor and a TITAN V graphics card with 12 GB of video memory. The software platform is Ubuntu 22.04.5 LTS, PyTorch 1.12, CUDA 11.3 with cuDNN 8.3.2, and Python 3.10.

[0061] 2. Simulation Content and Results

[0062] Seven experiments were evaluated: Experiment 1 uses the complete proposed method; Experiment 2 removes the optimal transport distribution alignment module; Experiment 3 removes the cross-attention interaction alignment module; Experiment 4 removes the differential attention enhancement module; Experiment 5 uses simple concatenation fusion; Experiment 6 uses IQ single-modal input; and Experiment 7 uses APF single-modal input. The evaluation metrics are overall accuracy (OA), average accuracy (AA), and the Kappa coefficient.
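The three evaluation metrics can be computed from a confusion matrix as sketched below. The sketch assumes rows are true classes and columns are predicted classes, takes AA as the mean per-class recall, and uses a made-up two-class matrix rather than any result from the experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) for a square confusion matrix (rows = true class)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n_classes))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[r][c] for r in range(n_classes))
                for c in range(n_classes)]
    oa = correct / total                                   # overall accuracy
    aa = sum(confusion[k][k] / row_sums[k]
             for k in range(n_classes)) / n_classes        # mean per-class recall
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)             # chance-corrected agreement
    return oa, aa, kappa

# Made-up example: 20 samples, 2 classes.
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

Here OA and AA coincide because both classes have the same number of samples; on imbalanced data AA weights every class equally while OA is dominated by the larger classes.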

[0063] The present invention first conducted experiments on dataset one. For the same dataset, the experimental parameters for different experiments were set the same. The evaluation results of the above experiments on dataset one are shown in Table 1.

[0064] Table 1. Evaluation results of the multimodal modulation recognition network model on Dataset 1.

[0065]

[0066] To demonstrate the robustness of the present invention, the same experiment was conducted on dataset two. For the same dataset, the experimental parameters for different experiments were set the same. The evaluation results of the above experiments on dataset two are shown in Table 2.

[0067] Table 2 Evaluation results of the multimodal modulation recognition network model on dataset 2

[0068]

[0069] The confusion matrix results of the method proposed in this invention on dataset one and dataset two are shown in Figure 3 and Figure 4. The simulation results above show that all metrics in Experiment 1 are superior to those in Experiments 2, 3, and 4, demonstrating that each processing module proposed in this invention contributes to the overall recognition process. Experiment 1 also outperforms Experiments 5, 6, and 7, highlighting the advantages of multimodal input and the necessity of modal feature alignment and fusion in the proposed method. Even on a dataset containing extremely low signal-to-noise ratio data, the proposed method maintains good performance and is suitable for modulation signal recognition in complex environments.

[0070] In summary, this invention feeds common traditional features into one branch of a deep network and extracts features with a deep network suited to each modality, eliminating the need for complex feature extraction based on expert knowledge. It fully utilizes the complementary information of two different modalities, thereby improving recognition performance. In particular, this invention employs a multi-level feature processing approach that combines global distribution alignment and fine-grained interactive alignment to align features from different representation spaces, thus transferring feature alignment and fusion from non-communication domains such as images to the field of communication signals and signal processing.

Claims

1. A method for recognizing multimodal modulated signals based on optimal transport and attention mechanisms, characterized in that it includes the following:
(1) Construct an experimental dataset from the radio signal modulation recognition dataset RML2016.10a:
(1a) Obtain basic traditional features as multimodal data;
(1b) Divide the experimental dataset;
(2) Construct a multimodal modulation recognition network model:
(2a) ResNet18 is used as the initial feature extraction network, and the network parameters are not shared among multiple modalities;
(2b) Utilize Optimal Transport (OT) theory to softly align the global distribution of multimodal features;
(2c) Use a cross-attention mechanism to interactively align multimodal fine-grained features;
(2d) Introduce a differential attention mechanism to extract modality-specific features;
(2e) Construct the classification module and set the parameters of the fully connected layer network;
(2f) Set the loss function of the network model to be a combination of optimal transport alignment loss, feature consistency loss, and classification loss, and set the network optimization algorithm to Adam;
(2g) Set the training step size Batch_size of the network to 256, the number of training iterations Epoch to 100, the initial learning rate to 0.001, and the learning rate scheduler to a cosine annealing scheduler;
(3) Train the multimodal modulation recognition network model:
(3a) Randomly shuffle the order of the samples in the training sample set in (1) and input them into the network model in batches according to the training step size set in (2g) for iterative network training;
(3b) Calculate the error loss of the training sample set according to the loss function set in (2f);
(3c) Backpropagate the error loss and perform gradient optimization using the optimization algorithm set in (2f) to adjust the network parameters;
(3d) For each iteration over the training sample set, the order of the samples in the validation sample set is randomly shuffled and input into the network model in batches according to the training step size set in (2g) to perform network validation, so as to obtain the optimal network parameter weights during the iteration process;
(3e) Train the network model until the number of training iterations set in (2g) is reached, complete the training process of the network model, and obtain and save the network parameter weights;
(4) Input the test sample set into the multimodal modulation recognition network model trained in (3). The position of the maximum value in the prediction vector output by the model is used as the predicted signal modulation type, and the modulation signal recognition process is completed.

2. The method according to claim 1, characterized in that: The steps for obtaining traditional features as multimodal data as described in (1a) are as follows: Traditional features, including amplitude, phase, and frequency features, are obtained from the raw IQ data; the different traditional features are concatenated along the channel dimension to obtain multimodal data, which is called APF modal data.

3. The method according to claim 1, characterized in that: The steps for partitioning the experimental dataset as described in (1b) are as follows: For a multimodal sample set consisting of IQ modal data and APF modal data, the training sample set, validation sample set, and test sample set are respectively divided in a ratio of 5:2:3.

4. The method according to claim 1, characterized in that: The steps for softly aligning the global distribution of multimodal features using optimal transport theory as described in (2b) are as follows: Calculate the cost matrix between the two modal features, then solve for the optimal transport plan matrix (TransportPlan). Based on the cost matrix and the transport plan matrix, calculate the inner product of the two matrices as the optimal transport alignment loss, which is used to optimize network parameters and complete the soft alignment of feature distribution.

5. The method according to claim 1, characterized in that: The steps for cross-attention interaction alignment described in (2c) are as follows: A bidirectional multi-head cross-attention mechanism is used between the two modal features. The IQ modal features are used as queries to search for relevant information in the APF modal feature space, and the APF modal features are used as queries to search for relevant information in the IQ modal feature space. The multi-head attention mechanism is used to process the feature relationships of different subspaces in parallel to obtain modal features that incorporate information from the other modality, thus completing fine-grained interactive alignment of features.

6. The method according to claim 1, characterized in that: The steps for enhancing specific features using differential attention as described in (2d) are as follows: Feature differences are measured from different dimensions. The direct difference, absolute difference, and similarity difference between the two modal features are calculated separately. The difference features are combined and the complex relationship between the features is learned by using a multi-head attention mechanism. A gating mechanism is used to dynamically control the fusion of the difference features and the original features to complete the feature enhancement using specific information.

7. The method according to claim 1, characterized in that: The loss function described in (2f) involves the joint optimization of multiple losses, specifically including: The classification loss is based on the supervision signal of any one of the modalities, and the cross-entropy loss is used as the classification loss function; the optimal transport alignment loss is based on the Wasserstein distance for moving one feature distribution to the other in optimal transport theory; the feature consistency loss ensures the consistency of feature semantics across the feature processing stages, and the cosine similarity loss is used as the consistency loss function.

8. A system for implementing the method according to any one of claims 1 to 7, characterized in that it comprises: a feature extraction module, used to extract the initial deep features of the IQ mode and the APF mode; an optimal transport distribution alignment module, embodied in the loss function, used to achieve soft alignment of the distributions of different modal features; a cross-attention interaction alignment module, used to achieve fine-grained interactive alignment of features from different modalities; a differential attention enhancement module, used to extract modality-specific information and enhance feature representation; and a classification module, used for final classification prediction.