A speech emotion recognition method based on attention MCNN combined with gender information

By introducing attention-based MCNN and gender information into speech emotion recognition, and combining coordinated attention mechanism and Bi-GRU, the problem of low emotion recognition rate and weak generalization ability caused by differences in male and female acoustic features is solved, thereby improving recognition accuracy and robustness.

CN116453548B · Active · Publication Date: 2026-06-30 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHONGQING UNIV OF POSTS & TELECOMM
Filing Date: 2023-03-28
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing voice emotion recognition technologies suffer from low recognition rates and weak generalization capabilities in practical applications, especially due to the differences in how men and women express emotions, which leads to performance degradation.

Method used

We employ an attention-based MCNN combined with gender information approach. After preprocessing the speech signal, we use MCNN for gender recognition and introduce a coordinated attention mechanism and a bidirectional gated recurrent unit (Bi-GRU) to capture emotional features and temporal information. Finally, we use the softmax function for emotion classification.

Benefits of technology

It improves the accuracy and robustness of voice emotion recognition, and solves the problems of low emotion recognition rate and weak generalization ability caused by differences in male and female acoustic characteristics.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention claims protection for a speech emotion recognition method based on attention-based MCNN combined with gender information. The method includes the following steps: S1, obtaining three-dimensional dynamic Mel-frequency cepstral coefficients (MFCCs) from the preprocessed speech signal as input to a gender recognition network; S2, using MCNN for gender recognition and classifying the speech signal into male and female categories; S3, based on the output of the gender classification, extracting three-dimensional dynamic MFCC features from the male and female speech signals as input to the emotion recognition model. To focus on channel and spatial location information and address long-term dependency issues, a coordinated attention mechanism is introduced into the original MCNN model to establish the speech emotion recognition model; S4, to better capture emotional features and temporal information, A_GRUs are added to the emotion recognition model, and finally, a softmax function is used for emotion classification, providing emotion recognition results for different genders. This invention effectively solves the problems of low recognition rate and weak generalization ability of emotion recognition models caused by differences in male and female acoustic features, improving the accuracy and robustness of emotion recognition.

Description

Technical Field

[0001] This invention belongs to the field of speech signal processing and pattern recognition, and in particular relates to a speech emotion recognition method based on attention MCNN combined with gender information.

Background Technology

[0002] Speech emotion recognition is an important branch of speech recognition. Its purpose is to enable machines to learn and remember human pronunciation or voice to identify and understand the emotional state of the speaker in the transmission of speech signals. The process can be summarized as: speech signal preprocessing, feature extraction, feature selection, recognition model matching, and recognition completion. In recent years, important applications that rely on user emotional states, such as intelligent robots, dialogue systems, medical care, audio monitoring, in-vehicle driving, criminal investigations, automated smart home appliances, and music or movie recommendation systems, can all be achieved through a system that automatically detects and recognizes user emotional states from speech.

[0003] Deep learning-based speech emotion recognition systems have made significant contributions in many areas, but the performance of existing speech emotion recognition technologies in practical applications still lags far behind the emotional information perceived by human hearing. Recognizing emotions in human voices is difficult because human emotions lack unique temporal boundaries, and different people express emotions in different ways. Differences in acoustic characteristics between individuals are one of the main factors affecting the performance of speech emotion recognition systems. Since men and women express emotions differently and have unique vocal systems, this indicates that gender differences affect the overall performance of speech emotion recognition systems. In past research on speech emotion recognition, few studies have explored emotion recognition through the lens of gender differences. Previous studies, such as Narayanan et al., have demonstrated that gender-based emotion recognizers outperform gender-independent emotion recognizers. Bisio et al. proposed an emotion classification algorithm based on pitch features to establish gender recognition, aiming to provide prior information about the speaker's gender. Anish et al. employed multi-task learning with gender recognition as an auxiliary task for emotion recognition. Wang Lin et al. proposed using gender awareness features in speech emotion recognition systems. They proposed modifying the speech spectrogram with the speaker's gender information and then using it as input to train a single CNN-BLSTM classifier. This method provides better recognition results than not using gender information at all.

[0004] This invention addresses the issues of poor generalization ability in emotion recognition models and low emotion recognition rates due to differences in how men and women express emotions. It improves the generalization ability of the models by providing mixed male and female samples to the gender classifiers for training, and independently training female and male emotion classifiers using only female samples and only male samples respectively. The output of the gender recognition model determines which gender-specific emotion recognition model to use.
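The routing described above, where the gender classifier's output selects a gender-specific emotion classifier, can be sketched as follows. This is only an illustration: `recognize_emotion`, `toy_gender_model`, and `toy_emotion_models` are hypothetical names, and the toy models stand in for the trained networks, which the patent does not publish.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[Sequence[float]]  # placeholder for 3D dynamic MFCC features


def recognize_emotion(
    features: Features,
    gender_model: Callable[[Features], str],
    emotion_models: Dict[str, Callable[[Features], str]],
) -> Dict[str, str]:
    """Pick the emotion model that matches the predicted gender, then classify."""
    gender = gender_model(features)             # expected to be "male" or "female"
    emotion = emotion_models[gender](features)  # gender-specific emotion classifier
    return {"gender": gender, "emotion": emotion}


# Toy stand-ins (assumptions): real models would be the trained MCNN / CA-MCNN networks.
def toy_gender_model(features: Features) -> str:
    total = sum(sum(row) for row in features)
    return "female" if total > 0 else "male"


toy_emotion_models = {
    "male": lambda f: "neutral",
    "female": lambda f: "happy",
}

result = recognize_emotion([[0.2, 0.4], [0.1, -0.1]], toy_gender_model, toy_emotion_models)
print(result)
```

Keeping the two emotion models behind a dictionary keyed by gender mirrors the description: the gender classifier is trained on mixed samples, while each emotion classifier only ever sees samples of one gender.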

[0005] CN113643723A discloses a speech emotion recognition method based on attention-based CNN and Bi-GRU fusion of visual information, comprising the following steps: S1, preprocessing the speech signal to obtain a three-dimensional logarithmic Mel-spectrum; S2, pre-training a 3DRACNN speech network using the three-dimensional logarithmic Mel-spectrum to improve generalization ability; S3, extracting static facial appearance features and geometric features using CNN and AGRUs respectively; S4, to reduce the low recognition rate of speech features, a fusion model is used to fuse speech features with facial features sequentially to obtain mixed features, and irrelevant features are filtered out by KLDA; S5, during model training, the loss is minimized by updating parameters, and the algorithm is optimized, with emotion classification finally performed by a softmax layer. This invention can effectively solve the problems of low recognition rate and weak generalization ability of emotion recognition models, improving recognition accuracy and robustness.

[0006] The above patent (CN113643723A) improves the robustness and generalization ability of the speech emotion recognition system to a certain extent and achieves good recognition results, but it has three limitations. First, its 3DRACNN network uses a traditional CNN: to obtain a larger receptive field during feature extraction, the convolutional kernels must be enlarged, which increases computational complexity. The present invention combines traditional convolution with dilated convolution to form a hybrid convolutional neural network (MCNN), achieving a larger receptive field while reducing computational overhead. Second, the convolutional block attention module (CBAM) used in that patent relies on extensive pooling and captures only local relevance from location information, which makes long-term dependencies difficult to model. The present invention instead uses a coordinated attention module that embeds location information into channel attention, enlarging the spatial attention range and eliminating the loss of location information caused by two-dimensional global pooling in the convolutional attention module. This module also considers the channel and spatial aspects in parallel and effectively addresses the long-term dependency problem. Finally, that patent fuses speech features with facial features to obtain hybrid features for emotion recognition, whereas the present invention addresses the differences in acoustic characteristics and emotional expression between men and women and provides emotion recognition results according to the speaker's gender.

Summary of the Invention

[0007] This invention aims to solve the problems of the prior art. It proposes a speech emotion recognition method based on attention-based MCNN combined with gender information. The technical solution of this invention is as follows:

[0008] A speech emotion recognition method based on attention-based MCNN combined with gender information includes the following steps:

[0009] S1, preprocessing the original speech signal including framing, windowing, Fourier transform, and difference to obtain three-dimensional dynamic MFCC features;

[0010] S2, the three-dimensional dynamic MFCC obtained after the preprocessing in step S1 is input into the gender recognition network, and the MCNN (hybrid convolutional neural network) model is used to perform gender recognition and classify the speech signal into male and female;

[0011] S3, based on the output of gender recognition, extracts three-dimensional MFCC features from male and female speech data and inputs them into the emotion recognition model, while introducing a coordinated attention mechanism into MCNN;

[0012] S4. In order to capture emotional features and temporal information, bidirectional gated recurrent units (Bi-GRUs) combined with attention layers (Attention-GRUs, A-GRUs) are added to the emotion recognition model. Finally, the softmax function is used for emotion classification to provide emotion recognition results for different genders.

[0013] Furthermore, step S1 performs preprocessing on the original speech signal, including framing, windowing, Fourier transform, and differencing. Specifically, the given speech signal is divided into frames, with a time length of 5-10ms between consecutive frames; before performing Fourier transform on each frame, a Hamming window is used, with the window length equal to the frame length; a short-time Fourier transform is performed on each frame, and the power spectrum is obtained by summing the squares; MFCC features are obtained through discrete cosine transform of the log-Mel spectrum; to obtain dynamic information, velocity and acceleration features are added by performing differencing operations on the input MFCC features along the time axis to form three-dimensional dynamic features.
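The last part of step S1, adding velocity and acceleration features by differencing along the time axis, can be illustrated with a small sketch. The patent does not give the differencing formula, so a plain first-order difference with a zero first frame is assumed here; `delta` and `stack_dynamic` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]  # indexed as [frame][coefficient]


def delta(features: Matrix) -> Matrix:
    """First-order difference along time; the first frame has no predecessor, so it gets zeros."""
    out = [[0.0] * len(features[0])]
    for t in range(1, len(features)):
        out.append([cur - prev for cur, prev in zip(features[t], features[t - 1])])
    return out


def stack_dynamic(mfcc: Matrix) -> List[Matrix]:
    """Return [static, velocity, acceleration], the three 'channels' of the 3D feature."""
    velocity = delta(mfcc)
    acceleration = delta(velocity)
    return [mfcc, velocity, acceleration]


mfcc = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]  # 3 frames, 2 coefficients (toy values)
static, velocity, acceleration = stack_dynamic(mfcc)
print(velocity)       # [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
print(acceleration)   # [[0.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
```

Stacking the three matrices gives an image-like tensor with three channels, which is what allows it to be resized and fed to a convolutional network.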

[0014] Furthermore, step S2 inputs the preprocessed 3D dynamic MFCC features into the gender recognition network, and uses the MCNN model for gender recognition and speech signal classification, specifically including:

[0015] (1) Hybrid convolutional layers combine standard convolution and dilated convolution in the same layer and can utilize the same convolutional kernel. Hybrid convolutional layers are formed as follows:

[0016] $[\sigma(\omega_s);\ \sigma(\omega_d)]$ (1)

[0017] where $\omega_s$ and $\omega_d$ are the parameters of the standard convolution and the dilated convolution, respectively, and $\sigma$ is the combination of a group normalization (GN) layer and a rectified linear unit (ReLU);

[0018] (2) The hybrid convolutional block consists of a hybrid convolutional layer, a group normalization layer (GN), and a linear rectified unit (ReLU), and is used for feature acquisition;

[0019] (3) The MCNN architecture for gender recognition includes 5 hybrid layers, 1 max pooling layer, and 2 fully connected layers.
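A one-dimensional toy version of the hybrid convolutional layer in formula (1) may help make the idea concrete: the same kernel is applied once with dilation 1 and once with a larger dilation, and the two activated results are kept side by side as separate output channels. This is a simplified sketch under several assumptions: it is 1-D rather than 2-D, group normalization is omitted so that σ reduces to ReLU, no padding is used, and the dilation rate of 2 is illustrative because the patent does not state it.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Valid (unpadded) 1-D cross-correlation with the given dilation."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered by one output
    return [
        sum(w * x[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(x) - span + 1)
    ]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def hybrid_conv1d(x: List[float], kernel: List[float], dilation: int = 2) -> List[List[float]]:
    """[sigma(standard conv); sigma(dilated conv)] with one shared kernel."""
    standard = relu(conv1d(x, kernel, dilation=1))
    dilated = relu(conv1d(x, kernel, dilation=dilation))
    return [standard, dilated]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shared_kernel = [-1.0, 0.0, 1.0]
channels = hybrid_conv1d(signal, shared_kernel)
print(channels)  # [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0]]
```

Without padding the two branches have different lengths; a real layer would pad them to the same size before concatenating along the channel axis.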

[0020] Furthermore, the gender recognition MCNN architecture specifically includes: adjusting the size of the 3D MFCC features to 224×224×3 as the input to the MCNN network; the first convolutional layer has a kernel size of 2×2, a stride of 2, 3 input channels, and 32 output channels; the max pooling layer has a kernel size of 2×2, a stride of 2, 32 input channels, and 32 output channels; the second convolutional layer has a kernel size of 1×1, a stride of 1, 3 input channels, and 32 output channels; the third convolutional layer has a kernel size of 1×1, a stride of 1, 3 input channels, and 96 output channels; the fourth convolutional layer has a kernel size of 2×2, a stride of 2, 96 input channels, and 96 output channels; the fifth convolutional layer has a kernel size of 1×1, a stride of 1, 96 input channels, and 96 output channels; the first fully connected layer consists of 1000 neurons, and the second fully connected layer is a classification layer with 2 neurons corresponding to male or female.
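As a quick sanity check on the layer list above, the spatial size of the feature map can be traced through the network. The sketch assumes no padding and that the layers are applied in the order listed, neither of which the text states explicitly; only kernel size, stride, and output channels are used.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


# (name, kernel, stride, output channels) as listed in the description
layers = [
    ("conv1", 2, 2, 32),
    ("maxpool", 2, 2, 32),
    ("conv2", 1, 1, 32),
    ("conv3", 1, 1, 96),
    ("conv4", 2, 2, 96),
    ("conv5", 1, 1, 96),
]

size = 224  # input is resized to 224 x 224 x 3
trace = []
for name, kernel, stride, channels in layers:
    size = conv_out(size, kernel, stride)
    trace.append((name, size, channels))

flattened = size * size * trace[-1][2]  # values entering the first fully connected layer
for row in trace:
    print(row)
print("flattened:", flattened)
```

Under these assumptions the fifth layer produces a 28×28×96 map, so the 1000-neuron fully connected layer would receive 75,264 inputs.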

[0021] Furthermore, step S3 utilizes an attention mechanism to weight the image feature space and channel weight parameters, and then fuses the shallow and deep features in the feature layer, specifically including:

[0022] (1) In the MCNN architecture for gender recognition, the last two fully connected layers are removed, and two hybrid convolutional layers are added as the sixth and seventh layers, along with an average pooling layer. The sixth layer has a convolutional kernel size of 2×2, a stride of 2, 96 input channels, and 288 output channels. The seventh layer has a convolutional kernel size of 1×1, a stride of 1, 288 input channels, and 288 output channels. The average pooling layer has a convolutional kernel size of 2×2, a stride of 1, 288 input channels, and 288 output channels.

[0023] (2) Three coordinated attention modules are integrated into the network: one between the 3rd and 4th hybrid convolutional layers, one between the 5th and 6th hybrid convolutional layers, and one between the 7th hybrid convolutional layer and the average pooling layer. The operation of the coordinated attention module can be divided into two parts: coordinate information embedding and coordinate attention generation. Coordinate information embedding encodes channel information along the horizontal and vertical coordinates, while coordinate attention generation captures position information and generates weight values.

[0024] Furthermore, step S3 utilizes a coordinated attention mechanism to focus on channel and spatial location information, and its calculation process specifically includes:

[0025] (1) Coordinate information embedding steps:

[0026] (2) Coordinate attention generation steps.

[0027] Furthermore, the coordinate information embedding step specifically includes:

[0028] For a given input $X \in \mathbb{R}^{C \times H \times W}$, pooling kernels of size (H,1) and (1,W) are used to encode each channel along the horizontal and vertical directions, respectively. Here $X$ is a three-dimensional tensor, $x_c$ is the feature map of the c-th channel, and C is the number of channels. The output of the c-th channel at height h is:

[0029] $z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (2)

[0030] Similarly, the output of the c-th channel at width w is:

[0031] $z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (3)

[0032] These two formulas generate a pair of direction-aware feature maps, which embeds the coordinate information.

[0033] Here i and j are the summation indices along the width and height, respectively.

[0034] Furthermore, the coordinate attention generation step specifically includes:

[0035] By concatenating the two encoded features in the spatial dimension, the length becomes (H+W), and then using the shared convolutional transform function F1, we obtain:

[0036] $f = \delta(F_1([z^h, z^w]))$ (4)

[0037] where $[z^h, z^w]$ denotes concatenation along the spatial dimension, $z^h$ and $z^w$ are the output vectors along the height and width, respectively, and $\delta$ is a non-linear activation function. $f$ is the intermediate feature map that encodes spatial information in the horizontal and vertical directions, and r is the reduction ratio that controls the block size, generally 32. The number of channels of $f$ is reduced according to formula (5):

[0038] $C_{out} = \max(8,\ C_{in} / r)$ (5)

[0039] where $C_{in}$ is the number of input channels. $f$ is then split along the spatial dimension into two independent tensors, $f^h$ and $f^w$, the vertical and horizontal tensors, respectively. Two 1×1 convolutional transformations are applied to $f^h$ and $f^w$ so that they have the same number of channels as the input X; the sigmoid activation function (written $\delta$ in formula (6)) is then applied to obtain $g^h$ and $g^w$:

[0040] $g^h = \delta(F_h(f^h))$

[0041] $g^w = \delta(F_w(f^w))$ (6)

[0042] where $F_h$ and $F_w$ are two 1×1 convolutions, and $g^h$ and $g^w$ are the attention weights in the two directions. Finally, $g^h$ and $g^w$ are fused with the input feature X to obtain the output of the coordinate attention module:

[0043] $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$ (7)
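The two stages, coordinate information embedding and coordinate attention generation, can be traced end to end on a tiny tensor. The sketch below uses nested Python lists and is illustrative only: the weight matrices `w1`, `w_h`, and `w_w` are hypothetical stand-ins for the learned 1×1 convolutions, ReLU is assumed for the unspecified non-linearity in formula (4), and any normalization inside the shared transform is left out.

```python
import math
from typing import List

Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]  # indexed as [channel][height][width]


def pool_along_width(x: Tensor3) -> Matrix:
    """z^h: average each row over the width, giving [C][H]."""
    return [[sum(row) / len(row) for row in channel] for channel in x]


def pool_along_height(x: Tensor3) -> Matrix:
    """z^w: average each column over the height, giving [C][W]."""
    return [
        [sum(row[w] for row in channel) / len(channel) for w in range(len(channel[0]))]
        for channel in x
    ]


def conv1x1(features: Matrix, weight: Matrix) -> Matrix:
    """A 1x1 convolution mixes channels at each position; weight is [C_out][C_in]."""
    length = len(features[0])
    return [
        [sum(w * features[c][p] for c, w in enumerate(row)) for p in range(length)]
        for row in weight
    ]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x: Tensor3, w1: Matrix, w_h: Matrix, w_w: Matrix) -> Tensor3:
    height, width = len(x[0]), len(x[0][0])
    z_h, z_w = pool_along_width(x), pool_along_height(x)
    concat = [a + b for a, b in zip(z_h, z_w)]                       # [C][H+W]
    f = [[max(0.0, v) for v in row] for row in conv1x1(concat, w1)]  # shared transform + ReLU
    f_h = [row[:height] for row in f]                                # split back into H part
    f_w = [row[height:] for row in f]                                # and W part
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(f_h, w_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(f_w, w_w)]
    return [
        [[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(width)] for i in range(height)]
        for c in range(len(x))
    ]


x = [[[1.0, 2.0], [3.0, 4.0]]]  # C=1, H=2, W=2
y = coordinate_attention(x, w1=[[1.0]], w_h=[[0.0]], w_w=[[0.0]])
print(pool_along_width(x), pool_along_height(x))  # [[1.5, 3.5]] [[2.0, 3.0]]
print(y)  # zero weights give gates of 0.5, so every value is scaled by 0.25
```

With zero gate weights both sigmoid outputs are 0.5, which makes the result easy to verify by hand; trained weights would instead emphasize particular rows and columns of the feature map.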

[0044] Furthermore, step S4 incorporates an A_GRUs model into the established emotion recognition model to capture emotional cues, specifically including:

[0045] The high-level features extracted by CA-MCNN (Coordinated Attention-Hybrid Convolutional Neural Network) are passed to Bi-GRU (Bidirectional Gated Recurrent Unit) to capture temporal information; an attention layer is added to focus on the emotion-related parts of the speech features; the Bi-GRU is configured with 512 bidirectional hidden units, and a new sequence of shape L×1024 is created and put into the attention layer to finally generate a new sequence H; sequence H is first input into a fully connected layer for pre-classification, and finally the softmax function is used to achieve the final emotion classification result.
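The shape bookkeeping and the attention step can be sketched in a few lines. With 512 hidden units per direction, concatenating the forward and backward states gives 1024 features per time step, which matches the L×1024 sequence mentioned above. The patent does not spell out the attention formula, so the sketch assumes a common form: a learned scoring vector `u` (hypothetical), softmax-normalized weights over time, and a weighted sum of the Bi-GRU outputs.

```python
import math
from typing import List


def attention_pool(sequence: List[List[float]], u: List[float]) -> List[float]:
    """Weight each time step by softmax(u . h_t) and return the weighted sum."""
    scores = [sum(ui * hi for ui, hi in zip(u, h)) for h in sequence]
    peak = max(scores)                                # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequence[0])
    return [sum(w * h[d] for w, h in zip(weights, sequence)) for d in range(dim)]


hidden_per_direction = 512
feature_dim = 2 * hidden_per_direction  # forward + backward states concatenated

# Toy stand-in for the L x 1024 Bi-GRU output: L = 3 time steps, 2 features each.
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention_pool(toy_sequence, u=[0.0, 0.0])  # zero scores give uniform weights
print(feature_dim, context)
```

This variant collapses the time axis into a single vector before the fully connected layer; whether the patent's sequence H keeps the time axis is not stated, so treat that as a design assumption of the sketch.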

[0046] Furthermore, the formula for the Softmax function is as follows:

[0047] $P(S_i) = \dfrac{e^{g_i}}{\sum_{k=1}^{n} e^{g_k}}$

[0048] where $P(S_i)$ is the classification output, i.e. the classification probability of the i-th category; n is the number of categories; the n categories are denoted by numerical values $S_k$, $k \in (0, n]$; i is one particular category among them; and $g_i$ is the value of that category.
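For reference, the softmax step can be written in a few lines of plain Python. The scores below are made-up values for a three-class toy case, not results from the patent; subtracting the maximum score before exponentiation is a standard numerical safeguard and does not change the probabilities.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw class scores g_1..g_n into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(g - peak) for g in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])          # hypothetical scores for three emotion classes
predicted = probs.index(max(probs))       # index of the most probable class
print(probs, predicted)
```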

[0049] The advantages and beneficial effects of this invention are as follows:

[0050] The main innovations of this invention are concentrated in three steps: S2, S3, and S4. This invention provides a speech emotion recognition method based on attention-based MCNN combined with gender information. Under the same experimental conditions, this method improves the low emotion recognition rate caused by differences in male and female emotional expression. First, the speech signal is preprocessed to obtain three-dimensional dynamic Mel-frequency cepstral coefficients (MFCCs), which are used as input to the gender recognition network. Second, MCNN is used for gender recognition, classifying the speech signal into male and female categories. Third, based on the output of the gender classification, three-dimensional dynamic MFCC features are extracted from the male and female speech signals and input into the speech emotion recognition model, and a coordinated attention mechanism is introduced into the original MCNN model. Finally, to better capture emotional features and temporal information, A_GRUs are added to the emotion recognition model, and a softmax function is used for emotion classification, providing emotion recognition results for different genders. The proposed model effectively addresses the low recognition rate and weak generalization ability of emotion recognition models caused by differences in male and female acoustic features, improving the accuracy and robustness of emotion recognition.

Attached Figure Description

[0051] Figure 1 is a framework diagram of a preferred embodiment of the speech emotion recognition method based on attention MCNN combined with gender information provided by the present invention;

[0052] Figure 2 is a diagram of the MCNN architecture used for gender recognition;

[0053] Figure 3 is a diagram of the CA-MCNN network architecture used for emotion recognition.

Detailed Implementation

[0054] The technical solutions of the embodiments of the present invention will be clearly and thoroughly described below with reference to the accompanying drawings. The described embodiments are merely some embodiments of the present invention.

[0055] The technical solution of the present invention to solve the above-mentioned technical problems is:

[0056] As shown in Figure 1, this invention provides a speech emotion recognition method based on attention-based MCNN combined with gender information, which includes the following steps:

[0057] S1: The original speech signal undergoes preprocessing, including framing, windowing, Fourier transform, and differencing, to obtain three-dimensional dynamic MFCC features. The process is as follows: the given speech signal is divided into frames of approximately 20 ms, with a time interval of 5-10 ms between consecutive frames. Before the Fourier transform is performed on each frame, a Hamming window is applied, with the window length equal to the frame length. A short-time Fourier transform is performed on each frame, and the power spectrum is obtained by summing the squares. The MFCC features are obtained through the discrete cosine transform of the log-Mel spectrum. To obtain dynamic information, the MFCC features are differenced along the time axis, and the resulting delta and double-delta features are added to form three-dimensional dynamic features.
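The front half of this pipeline (framing, Hamming windowing, and the power spectrum) can be sketched with the standard library alone. The 16 kHz sample rate and the 10 ms frame interval are assumptions chosen for the example, the 1 kHz test tone is synthetic, and the naive DFT is used only to keep the sketch dependency-free; a practical implementation would use an FFT routine and continue with the Mel filterbank and DCT.

```python
import cmath
import math
from typing import List


def frame_signal(signal: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[s:s + frame_len] for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n: int) -> List[float]:
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame: List[float]) -> List[float]:
    """Window the frame, take a naive DFT, and return |X[k]|^2 (real^2 + imag^2) per bin."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    bins = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(windowed))
        bins.append(abs(coeff) ** 2)
    return bins


sample_rate = 16000                     # assumption: 16 kHz speech
frame_len = sample_rate * 20 // 1000    # 20 ms frame -> 320 samples
hop = sample_rate * 10 // 1000          # 10 ms interval -> 160 samples

tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]  # 1 kHz test tone
frames = frame_signal(tone, frame_len, hop)
spectrum = power_spectrum(frames[0])
peak_bin = spectrum.index(max(spectrum))
print(len(frames), len(spectrum), peak_bin * sample_rate / frame_len)
```

The peak of the power spectrum lands on the bin corresponding to 1000 Hz, which is a simple way to confirm that the framing and windowing behave as expected.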

[0058] S2: The 3D dynamic MFCC obtained after step S1 is input into the gender recognition network. The MCNN model is used for gender recognition, and the speech signal is classified into male and female categories. Specifically, this includes:

[0059] (1) Hybrid convolutional layers combine standard convolution and dilated convolution in the same layer and can utilize the same convolutional kernel. Hybrid convolutional layers are formed as follows:

[0060] $[\sigma(\omega_s);\ \sigma(\omega_d)]$ (1)

[0061] where $\omega_s$ and $\omega_d$ are the parameters of the standard convolution and the dilated convolution, respectively, and $\sigma$ is the combination of a group normalization (GN) layer and a rectified linear unit (ReLU).

[0062] (2) The hybrid convolutional block consists of a hybrid convolutional layer, a group normalization layer (GN) and a linear rectified unit (ReLU), and is used for feature acquisition;

[0063] (3) The MCNN architecture for gender recognition includes 5 hybrid layers, 1 max pooling layer, and 2 fully connected layers.
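The reason for mixing dilated convolution into each layer, a wider receptive field without extra weights, can be checked with simple arithmetic. The dilation rate is not given in the text, so the value 2 below is purely illustrative, and `effective_kernel` is a hypothetical helper name.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Side length of the input region covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def weights_2d(kernel: int) -> int:
    """Number of weights in a square 2-D kernel (per input/output channel pair)."""
    return kernel * kernel


for kernel in (2, 3):
    covered = effective_kernel(kernel, dilation=2)
    print(
        f"{kernel}x{kernel} kernel, dilation 2: covers {covered}x{covered} "
        f"with {weights_2d(kernel)} weights (a dense {covered}x{covered} kernel needs {weights_2d(covered)})"
    )
```

For example, a 2×2 kernel with dilation 2 spans a 3×3 region using 4 weights instead of 9, and a 3×3 kernel spans 5×5 using 9 weights instead of 25.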

[0064] S3: Based on gender recognition output, 3D MFCC features are extracted from male and female speech data and input into the emotion recognition model. Simultaneously, a coordinated attention mechanism is introduced into the original MCNN architecture to address long-term dependency issues. Specifically, this includes:

[0065] (1) In the MCNN architecture for gender recognition, the last two fully connected layers are removed, and two hybrid convolutional layers are added as the sixth and seventh layers, along with an average pooling layer. The sixth layer has a convolutional kernel size of 2×2, a stride of 2, 96 input channels, and 288 output channels. The seventh layer has a convolutional kernel size of 1×1, a stride of 1, 288 input channels, and 288 output channels. The average pooling layer has a convolutional kernel size of 2×2, a stride of 1, 288 input channels, and 288 output channels.

[0066] (2) Three coordinated attention modules are integrated into the network: one between layers 3 and 4, one between layers 5 and 6, and one between layer 7 of the hybrid convolutional layers and the average pooling layer. The operation of the coordinated attention module can be divided into two parts: coordinate information embedding and coordinate attention generation. Coordinate information embedding encodes channel information along the horizontal and vertical coordinates, while coordinate attention generation captures position information and generates weight values.
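The effect of the added layers on the feature-map size can be traced in the same way as for the gender network. The sketch assumes no padding and a 28×28 map leaving the fifth layer (what an unpadded 224×224 input would give); both are assumptions, since the text does not state them, and the attention modules are omitted because they do not change the shape.

```python
def out_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


# (name, kernel, stride, output channels) for the layers added for emotion recognition
tail = [
    ("hybrid6", 2, 2, 288),
    ("hybrid7", 1, 1, 288),
    ("avgpool", 2, 1, 288),
]

size = 28  # assumed spatial size after the fifth hybrid layer
shapes = []
for name, kernel, stride, channels in tail:
    size = out_size(size, kernel, stride)
    shapes.append((name, size, size, channels))

for shape in shapes:
    print(shape)
```

Under these assumptions the average pooling layer hands a 13×13×288 map to the recurrent part of the model.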

[0067] Coordinate information embedding:

[0068] For a given input $X \in \mathbb{R}^{C \times H \times W}$, pooling kernels of size (H, 1) and (1, W) are used to encode each channel along the horizontal and vertical directions, respectively. The output of the c-th channel at height h is:

[0069] $z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (2)

[0070] Similarly, the output of the c-th channel at width w is:

[0071] $z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (3)

[0072] These two formulas generate a pair of direction-aware feature maps, which embeds the coordinate information.

[0073] Here i and j are the summation indices along the width and height, respectively.

[0074] Coordinate attention generation:

[0075] By concatenating the two encoded features in the spatial dimension, the length becomes (H+W). Then, using the shared convolutional transform function F1, we obtain:

[0076] $f = \delta(F_1([z^h, z^w]))$ (4)

[0077] where $[z^h, z^w]$ denotes concatenation along the spatial dimension, $z^h$ and $z^w$ are the output vectors along the height and width, respectively, and $\delta$ is a non-linear activation function. $f$ is the intermediate feature map that encodes spatial information in the horizontal and vertical directions, and r is the reduction ratio that controls the block size, generally 32. The number of channels of $f$ is reduced according to formula (5):

[0078] $C_{out} = \max(8,\ C_{in} / r)$ (5)

[0079] where $C_{in}$ is the number of input channels. $f$ is then split along the spatial dimension into two independent tensors, $f^h$ and $f^w$, the vertical and horizontal tensors, respectively. Two 1×1 convolutional transformations are applied to $f^h$ and $f^w$ so that they have the same number of channels as the input X. The sigmoid activation function (written $\delta$ in formula (6)) is then applied to obtain $g^h$ and $g^w$:

[0080] $g^h = \delta(F_h(f^h))$

[0081] $g^w = \delta(F_w(f^w))$ (6)

[0082] where $F_h$ and $F_w$ are two 1×1 convolutions, and $g^h$ and $g^w$ are the attention weights in the two directions. Finally, $g^h$ and $g^w$ are fused with the input feature X to obtain the output of the coordinate attention module:

[0083] $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$ (7)
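Formula (5) is easy to work through for the three attention modules in this network. The channel counts below are taken from the layer descriptions (96 channels after layers 3 and 5, 288 after layer 7); integer division and the helper name `reduced_channels` are assumptions of this sketch.

```python
def reduced_channels(c_in: int, r: int = 32, floor: int = 8) -> int:
    """C_out = max(8, C_in / r), using integer division."""
    return max(floor, c_in // r)


module_inputs = {
    "between layers 3 and 4": 96,
    "between layers 5 and 6": 96,
    "between layer 7 and average pooling": 288,
}

reduced = {name: reduced_channels(c) for name, c in module_inputs.items()}
for name, c_out in reduced.items():
    print(f"{name}: {module_inputs[name]} -> {c_out} channels")
```

With r = 32, the two 96-channel modules fall back to the floor of 8 channels (96 / 32 = 3), while the 288-channel module is reduced to 9.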

[0084] S4: To capture emotional features and temporal information, Bidirectional Gated Recurrent Units (Bi-GRUs) combined with attention layers (Attention-GRUs, A-GRUs) are added to the emotion recognition model. Finally, a softmax function is used for emotion classification, providing emotion recognition results for different genders. Specifically, this includes:

[0085] The high-level features extracted by CA-MCNN are passed to Bi-GRU to capture temporal information. An attention layer is added to focus on the emotion-related parts of the speech features. The Bi-GRU is configured with 512 bidirectional hidden units, and a new sequence of shape L×1024 is created and fed into the attention layer to generate a new sequence H. Sequence H is first fed into a fully connected layer for pre-classification, and finally, the softmax function is used to achieve the final emotion classification result.

[0086] The formula for the Softmax function is as follows:

[0087] $P(S_i) = \dfrac{e^{g_i}}{\sum_{k=1}^{n} e^{g_k}}$

[0088] where $P(S_i)$ is the classification output, i.e. the classification probability of the i-th category; n is the number of categories; the n categories are denoted by numerical values $S_k$, $k \in (0, n]$; i is one particular category among them; and $g_i$ is the value of that category.

[0089] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions.

[0090] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0091] The above embodiments should be understood as illustrative only and not as limiting the scope of protection of the present invention. After reading the description of the present invention, those skilled in the art can make various alterations or modifications to the present invention, and these equivalent changes and modifications also fall within the scope defined by the claims of the present invention.

Claims

1. A speech emotion recognition method based on attention-based MCNN combined with gender information, characterized in that it includes the following steps: S1, preprocessing the original speech signal, including framing, windowing, Fourier transform, and differencing, to obtain three-dimensional dynamic MFCC features; S2, inputting the three-dimensional dynamic MFCC obtained after the preprocessing in step S1 into the gender recognition network, performing gender recognition using the hybrid convolutional neural network MCNN, and classifying the speech signal into male and female; S3, based on the output of gender recognition, extracting three-dimensional MFCC features from male and female speech data and inputting them into the emotion recognition model, while introducing a coordinated attention mechanism into MCNN; S4, in order to capture emotional features and temporal information, adding a bidirectional gated recurrent unit (Bi-GRU) combined with an attention layer (A-GRU) to the emotion recognition model, and finally using the softmax function for emotion classification to provide emotion recognition results for different genders; wherein step S2 inputs the preprocessed 3D dynamic MFCC features into the gender recognition network and uses the MCNN model for gender recognition and speech signal classification, specifically including: (1) hybrid convolutional layers combine standard convolution and dilated convolution in the same layer and can utilize the same convolution kernel; hybrid convolutional layers are formed as follows: $[\sigma(\omega_s);\ \sigma(\omega_d)]$ (1), where $\omega_s$ and $\omega_d$ are the parameters of the standard convolution and the dilated convolution, respectively, and $\sigma$ is a combination of a group normalization (GN) layer and a rectified linear unit (ReLU); (2) the hybrid convolutional block consists of a hybrid convolutional layer, a group normalization layer (GN), and a rectified linear unit (ReLU), and is used for feature acquisition; (3) the MCNN architecture for gender recognition includes 5 hybrid layers, 1 max pooling layer, and 2 fully connected layers; the specific gender recognition MCNN architecture includes: adjusting the size of the 3D MFCC features to 224×224×3 as the input to the MCNN network; the first convolutional layer has a kernel size of 2×2, a stride of 2, 3 input channels, and 32 output channels; the max pooling layer has a kernel size of 2×2, a stride of 2, 32 input channels, and 32 output channels; the second convolutional layer has a kernel size of 1×1, a stride of 1, 3 input channels, and 32 output channels; the third convolutional layer has a kernel size of 1×1, a stride of 1, 3 input channels, and 96 output channels; the fourth convolutional layer has a kernel size of 2×2, a stride of 2, 96 input channels, and 96 output channels; the fifth convolutional layer has a kernel size of 1×1, a stride of 1, 96 input channels, and 96 output channels; the first fully connected layer consists of 1000 neurons, and the second fully connected layer is a classification layer with 2 neurons corresponding to male or female.

2. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 1, characterized in that step S1, in which the original speech signal is preprocessed by framing, windowing, Fourier transform, and differencing, specifically includes: the given speech signal is divided into frames, with a time interval of 5-10 ms between consecutive frames; before the Fourier transform, a Hamming window whose length equals the frame length is applied to each frame; a short-time Fourier transform is performed on each frame, and the power spectrum is obtained as the sum of the squared real and imaginary parts of the transform; the MFCC features are obtained through the discrete cosine transform of the log-Mel spectrum; to obtain dynamic information, velocity and acceleration features are added by differencing the MFCC features along the time axis, forming the three-dimensional dynamic features.
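
The framing, windowing, and differencing parts of this step can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the 25 ms frame length and 16 kHz sampling rate in the usage note are not given in the claim, the delta is a simple first-order difference (practical systems often use a regression-based delta), and the Fourier transform, Mel filter bank, and discrete cosine transform are omitted, so the MFCC matrix is taken as given. All function names are hypothetical.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a 1-D sample list into overlapping frames; an incomplete tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window of length n (equal to the frame length, as in the claim)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


def delta(features):
    """First-order difference along the time axis.

    features is a list of frames, each a list of coefficients. The first
    frame is differenced against itself, so its delta is zero.
    """
    out = []
    for t, cur in enumerate(features):
        prev = features[t - 1] if t > 0 else cur
        out.append([c - p for c, p in zip(cur, prev)])
    return out


def stack_dynamic(static):
    """Return the three channels: static MFCC, velocity, acceleration."""
    velocity = delta(static)
    acceleration = delta(velocity)
    return [static, velocity, acceleration]
```

For example, at an assumed 16 kHz sampling rate, `frame_signal(samples, 400, 160)` gives 25 ms frames with a 10 ms interval between consecutive frames, and `stack_dynamic(mfcc)` turns a frames-by-coefficients MFCC matrix into a three-channel input.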

3. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 1, characterized in that step S3 uses an attention mechanism to weight the spatial and channel parameters of the feature maps and then fuses the shallow and deep features in the feature layer, specifically including: (1) in the MCNN architecture for gender recognition, the last two fully connected layers are removed, and two hybrid convolutional layers, as the sixth and seventh layers, and an average pooling layer are added; the sixth layer has a kernel size of 2×2, a stride of 2, 96 input channels, and 288 output channels; the seventh layer has a kernel size of 1×1, a stride of 1, 288 input channels, and 288 output channels; the average pooling layer has a kernel size of 2×2, a stride of 1, 288 input channels, and 288 output channels; (2) three coordinate attention modules are integrated between the third and fourth layers, between the fifth and sixth layers, and between the seventh hybrid convolutional layer and the average pooling layer, respectively; the operation of the coordinate attention module is divided into two parts, coordinate information embedding and coordinate attention generation; coordinate information embedding encodes the channel information along the horizontal and vertical coordinates, and coordinate attention generation captures the position information and generates the weight values.

4. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 1, characterized in that step S3 uses the coordinated attention mechanism to focus on channel and spatial location information, and its calculation process specifically includes: (1) a coordinate information embedding step; (2) a coordinate attention generation step.
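
The two steps named above can be sketched for a single feature map stored as nested lists. This is a minimal sketch, not the patented implementation. It assumes that the shared transform and the two per-direction transforms are 1×1 convolutions expressed as weight matrices, that the non-linear activation is ReLU, and that no normalization is applied; the claims do not fix these choices. The names `conv1x1` and `coordinate_attention` and the weight arguments are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    return max(0.0, v)


def conv1x1(weights, feats):
    """1x1 convolution on a C_in x L map: weights is C_out x C_in, feats is C_in x L."""
    length = len(feats[0])
    return [[sum(wt * feats[c][p] for c, wt in enumerate(row)) for p in range(length)]
            for row in weights]


def coordinate_attention(x, w1, w_h, w_w):
    """Apply coordinate attention to x, a C x H x W nested list.

    w1 is (C/r) x C (shared transform); w_h and w_w are C x (C/r).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Step 1, coordinate information embedding: average each channel
    # over the width (one value per row) and over the height (one per column).
    z_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    z_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)] for c in range(C)]
    # Step 2, coordinate attention generation: concatenate along the spatial
    # dimension (length H + W), apply the shared transform and the activation.
    concat = [z_h[c] + z_w[c] for c in range(C)]
    f = [[relu(v) for v in row] for row in conv1x1(w1, concat)]
    # Split back into the height part and the width part.
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    # Restore the channel count and squash to (0, 1) attention weights.
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, f_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, f_w)]
    # Re-weight every input element by its row weight and its column weight.
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

Because both weight maps lie in (0, 1), the module only attenuates the input, and the output has the same shape as the input, so it can be placed between existing layers without changing the layer sizes.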

5. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 4, characterized in that the coordinate information embedding step specifically includes: for a given input $X$, pooling kernels of size $(H, 1)$ and $(1, W)$ are used to encode the information of each channel along the horizontal and vertical directions, respectively, where $X \in \mathbb{R}^{C \times H \times W}$ is a three-dimensional matrix, $x_c$ is the feature of the $c$-th channel, and $C$ is the number of channels; the output of the $c$-th channel at height $h$ is: $z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$ (2); similarly, the output of the $c$-th channel at width $w$ is: $z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$ (3); these two formulas generate a pair of direction-aware feature maps, thereby embedding the coordinate information; $i$ and $j$ are the traversal indices along the width and height, respectively.

6. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 5, characterized in that the coordinate attention generation step specifically includes: the two encoded features are concatenated along the spatial dimension, so that the length becomes $H + W$, and the shared convolutional transform function $F_1$ is then applied to obtain: $f = \delta(F_1([z^h, z^w]))$ (4); where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension, $z^h$ and $z^w$ are the output vectors at height and width, respectively, $\delta$ is a non-linear activation function, and $f$ is the intermediate feature map that encodes spatial information in the horizontal and vertical directions; $r$ is the reduction ratio controlling the block size, typically 32, and reduces the number of channels as in formula (5): $f \in \mathbb{R}^{C/r \times (H + W)}$ (5); where $C$ is the number of input channels; $f$ is split along the spatial dimension into two independent tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which are the vertical and horizontal tensors, respectively; two $1 \times 1$ convolutional transforms $F_h$ and $F_w$ are applied to $f^h$ and $f^w$ so that the resulting tensors have the same number of channels as the input $X$; they are then passed through the sigmoid activation function $\sigma$ to obtain $g^h$ and $g^w$, as in formula (6): $g^h = \sigma(F_h(f^h)), \quad g^w = \sigma(F_w(f^w))$ (6); where $F_h$ and $F_w$ are the two convolutions, and $g^h$ and $g^w$ are the attention weights in the two dimensions; finally, $g^h$, $g^w$, and the input feature $X$ are fused to give the output of the coordinate attention module: $y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$ (7).

7. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 6, characterized in that step S4, in which an A-GRU model is added to the established emotion recognition model to capture emotional cues, specifically includes: the high-level features extracted by the coordinated attention hybrid convolutional neural network (CA-MCNN) are passed to the bidirectional gated recurrent unit (Bi-GRU) to capture temporal information; an attention layer is added to focus on the emotion-related parts of the speech features; the Bi-GRU is configured with 512 hidden units in each direction, producing a new sequence of shape L×1024 that is fed into the attention layer to generate a new sequence H; the sequence H is first input into a fully connected layer for pre-classification, and the softmax function then produces the final emotion classification result.
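
The Bi-GRU and attention stage can be illustrated with a small pure-Python sketch. It is an illustration under stated assumptions, not the patented implementation: the GRU cell omits biases and uses random untrained weights, the attention layer is a dot product with a query vector followed by a weighted sum, and the sizes in the usage note are kept tiny for readability. The claim does not specify the attention form. All names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class GRUCell:
    """Minimal GRU cell without biases; weights are random placeholders."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.hidden_size = hidden_size
        self.w = {gate: mat(hidden_size, input_size) for gate in "zrn"}
        self.u = {gate: mat(hidden_size, hidden_size) for gate in "zrn"}

    def step(self, x, h):
        # Update gate z, reset gate r, candidate state n.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w["z"], x), matvec(self.u["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w["r"], x), matvec(self.u["r"], h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b)
             for a, b in zip(matvec(self.w["n"], x), matvec(self.u["n"], gated))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def run(cell, sequence):
    """Run a cell over a sequence from a zero state; return all hidden states."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in sequence:
        h = cell.step(x, h)
        states.append(h)
    return states


def bi_gru(sequence, forward_cell, backward_cell):
    """Concatenate forward and backward states: L x (2 * hidden_size)."""
    fwd = run(forward_cell, sequence)
    bwd = run(backward_cell, sequence[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def attention_pool(states, query):
    """Softmax-normalized dot-product scores over time, then a weighted sum."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(wt * state[d] for wt, state in zip(weights, states)) for d in range(dim)]
    return weights, context
```

With `hidden_size=4`, a sequence of length 5 yields five 8-dimensional states, mirroring how 512 hidden units per direction yield the L×1024 sequence in the claim. The attention weights sum to 1 across time steps, so frames scored as more emotion-relevant contribute more to the pooled vector.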

8. The speech emotion recognition method based on attention MCNN combined with gender information according to claim 7, characterized in that the softmax function is given by: $S_i = \frac{e^{z_i}}{\sum_{k=1}^{n} e^{z_k}}$ (8); where $z$ is the classification output, $n$ is the number of categories, $k = 1, \dots, n$ indexes the categories, $i$ is one of these categories, $z_i$ is the value for category $i$, and $S_i$ is the classification probability of the $i$-th element.
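
A softmax over the pre-classification outputs can be sketched as follows. This is a minimal sketch. Subtracting the maximum before exponentiation is a standard numerical-stability measure that does not change the result, and the emotion labels are hypothetical placeholders, since the claims do not list the categories.

```python
import math


def softmax(values):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label set for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def classify(scores, labels=EMOTIONS):
    """Return the most probable label and the full probability list."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

For instance, `classify([2.0, 0.5, 1.0, -1.0])` selects the first label, because the largest score always receives the largest probability.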