A speaker recognition method based on Fca-Res2Net with self-attention fusion

By combining improved MFCC feature extraction and the Fca-Res2Net model, the problems of insufficient feature extraction and low recognition accuracy in existing speaker recognition systems are solved, achieving higher recognition accuracy and better generalization ability, especially in utilizing high-frequency signals and dynamic features.

CN116469395B · Active · Publication Date: 2026-06-30 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHONGQING UNIV OF POSTS & TELECOMM
Filing Date: 2023-04-10
Publication Date: 2026-06-30


Abstract

This invention claims protection for a speaker recognition method based on Fca-Res2Net with self-attention fusion, comprising the following steps: S1, preprocessing the speech signal and passing it through different filters, and performing differential operations to obtain a set of two-dimensional log-Mel spectrograms, fully utilizing the dynamic and static information of the high and low frequency bands of the speech; S2, pre-training Fca-Res2Net using the log-Mel spectrograms; S3, introducing a frequency domain channel attention network: FcaNet, on the baseline Res2Net model, using residual modules to fuse shallow and deep speaker feature information to obtain different feature channel weight information; S4, proposing an Fca-Res2Net model combining a self-attention mechanism to combine speaker spatial features with temporal features, capturing speech features over a long time span; S5, minimizing the loss by updating parameters during model training, while simultaneously optimizing through algorithms, and finally classifying the speaker using a softmax layer. This invention can effectively solve the problems of low recognition rate and weak generalization ability of speaker recognition models, improving recognition accuracy and robustness.

Description

Technical Field

[0001] This invention belongs to the field of speech signal processing and pattern recognition, and in particular relates to a speaker recognition method based on Fca-Res2Net fused with self-attention.

Background Technology

[0002] Speaker recognition, also known as voiceprint recognition, is the process of identifying a speaker through their vocal characteristics, and it largely depends on the speaker. Differences between speakers mostly stem from their speaking style, vocal cord and voice quality, and the variations in how they convey specific meanings. These features are utilized in state-of-the-art speaker recognition systems, primarily for security systems such as credit card voiceprint protection, telephone banking customer verification, and police evidence collection. With the increasing demands of various applications and the continuous maturation of speaker recognition technologies, improving system accuracy has become a hot research topic in recent years.

[0003] The basic framework of speaker recognition systems mainly consists of feature extraction and speaker model building. Feature extraction involves extracting the speaker's speech signal feature vectors as input to the speaker model, enabling it to fully reflect individual speaker differences and improve recognition accuracy. Common time-domain features among speaker features include amplitude and energy. However, these features are usually obtained directly from the original speech signal through filters. Although simple to process, they have poor stability and are therefore rarely used in recent years. Common transform-domain features include linear prediction coefficients (LPC), filter bank (Fbank) features, and Mel-frequency cepstral (MFCC) coefficients. Since Mel filters are based on the structure of the human ear and can better fit the characteristics of signals received by the human ear, fully reflecting the speaker's characteristics, speaker recognition systems often use MFCC feature extraction methods.

[0004] With the development of deep learning, speaker recognition methods based on deep learning have become increasingly popular due to their excellent modeling capabilities. Deep learning improves the accuracy of classification or recognition by learning more useful features through building models with hidden layers and training on a large amount of parameter data. However, shallow networks have weak feature representation capabilities, so researchers have focused on using deeper networks to extract deeper network parameters. However, increasing the number of network layers can lead to gradient explosion and gradient vanishing problems. Based on deep residual networks, Gao et al. proposed Res2Net. This network structure uses layered connections of residual blocks to better integrate speaker feature information from shallow and deep layers as well as different channels, achieving more powerful feature extraction capabilities without increasing the parameter computation load.

[0005] In previous deep learning networks used for speaker recognition, different features and channels were assigned the same weights. The resulting speaker recognition models failed to focus on the most important voiceprint features. Therefore, introducing attention mechanisms into speaker recognition has become a hot topic. Channel attention is a mechanism that directly learns the weights for different channels to emphasize regions of interest while suppressing irrelevant background regions. The traditional channel attention mechanism, SE-Net (Squeeze and Excitation Net), uses scalars to represent channels and employs Global Average Pooling (GAP) to retain only the lowest frequency information, discarding all components from other frequencies. To better compress channels and introduce more information, researchers at Zhejiang University proposed a new attention mechanism: a frequency domain channel attention network called FcaNet.

[0006] The feature extraction method used in the prior-art patent extracts only a single type of feature information, which cannot characterize the speaker's high-frequency signals and dynamic features. This invention proposes an improved MFCC feature extraction method that combines inverse Mel-frequency cepstral coefficients (IMFCC) with MFCC and performs a difference operation on both, so that the fused features more fully reflect the speaker's characteristics. The model used in the prior-art patent is the ResNet34 network, which has been shown to have lower classification performance than ResNet50. This invention, while inheriting the advantages of ResNet50, adopts Res2Net50. This network structure uses hierarchically connected residual blocks, which better fuse speaker feature information from shallow and deep layers and from different channels, achieving stronger feature extraction without increasing the parameter computation load. The prior-art patent uses a simple channel attention module on the backbone network, which retains only the lowest frequency component and loses the other frequency components, degrading speaker recognition performance. This invention uses a novel channel attention network, FcaNet, which retains the frequency information of all channels, and finally fuses in a self-attention mechanism that attends to global temporal information in addition to extracting speaker spatial information, thus improving speaker recognition performance.

[0007] This invention introduces FcaNet into the lightweight network Res2Net and proposes a speaker recognition model based on FcaNet-Res2Net. First, an improved MFCC feature extraction method is proposed, combining inverse Mel-frequency cepstral coefficients (IMFCC) with MFCC. Then, a novel speaker recognition network incorporating attention mechanisms is designed, including frequency domain channel attention and a self-attention fusion mechanism, which assign corresponding weights to different channel information and segment-level features to improve speaker recognition accuracy.

Summary of the Invention

[0008] This invention aims to solve the problems of the prior art. It proposes a speaker recognition method based on Fca-Res2Net fusion with self-attention. The technical solution of this invention is as follows:

[0009] A speaker recognition method based on Fca-Res2Net fused with self-attention includes the following steps:

[0010] S1. Perform pre-emphasis, framing, and windowing preprocessing on the original speech signal. Pass the preprocessed speech signal through different filters and perform differential operations to obtain a set of two-dimensional logarithmic Mel spectrograms with horizontal length related to signal duration and vertical length related to filter bank.

[0011] S2. The two-dimensional log-Mel spectrum processed in step S1 is used to pre-train Fca-Res2Net. Fca-Res2Net is an improved residual network model that integrates a frequency domain channel attention network and a speaker recognition network to improve generalization ability. Res2Net is an improvement on ResNet. While inheriting the advantages of ResNet, it does not increase the amount of parameter computation. It improves the feature extraction ability of the convolutional neural network by increasing the size of the receptive field.

[0012] S3. By fusing the frequency domain channel attention network FcaNet with the improved residual network Res2Net, the speaker feature information of the shallow and deep layers is obtained and used to obtain the weight information of different feature channels.

[0013] S4. We propose an Fca-Res2Net model that combines self-attention mechanism to combine speaker spatial features with temporal features and capture speech features over a long time span.

[0014] S5. During model training, the cross-entropy error function is used as the training objective function. The cross-entropy loss is minimized by updating the parameters. At the same time, the Adam algorithm is used for optimization to obtain the final network model. Finally, the softmax layer is used for speaker classification.

[0015] Furthermore, S1: The original speech signal is preprocessed by pre-emphasis, framing, windowing, etc., to obtain a three-dimensional log-Mel spectrum. The specific steps are as follows:

[0016] (1) Use a high-pass filter as shown in the following formula to boost the high-frequency part:

[0017] $$H(z) = 1 - \mu z^{-1} \tag{1}$$

[0018] Where H(z) is the transfer function in the z-domain and z is the z-domain variable. The z-domain describes discrete-time systems: the z-transform is the counterpart of the Laplace transform for sampled signals, and the domain in which the sampled signal of a continuous-time system is processed is called the z-domain. μ represents the pre-emphasis coefficient, and the output after pre-emphasis is x(n);

[0019] (2) The pre-emphasized output x(n) is framed. To solve the problem of discontinuity at the endpoints after frame segmentation, a Hamming window is used for windowing:

[0020] $$w(n, a) = (1 - a) - a\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1 \tag{2}$$

[0021] y(n) represents the framed speech signal, w(n,a) represents the window function of the Hamming window, where a takes the value 0.46, n = 0, 1, ..., N-1, and N is the frame length. The windowed speech signal is: s(n) = y(n) × w(n);

[0022] (3) After removing silent segments by endpoint detection, the energy distribution of the speech signal in the frequency domain is obtained by discrete Fourier transform. The output is a complex number S(k) containing N frequency bands, representing the amplitude and phase of a certain frequency in the original signal, as shown in the following formula:

[0023] $$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N - 1 \tag{3}$$

[0024] (4) Based on the sensitivity of the human ear, the spectrum is divided into multiple Mel filter banks and multiple inverse Mel filter banks. After passing through M different filters, the frequency response $H_m(k)$ is obtained, and the logarithmic energy of the output of each filter bank is then calculated:

[0025] $$E(m) = \ln\left(\sum_{k=0}^{N-1} |S(k)|^{2} H_m(k)\right), \quad 1 \le m \le M \tag{4}$$

[0026] (5) Then, the corresponding first-order difference is obtained by difference operation. The logarithmic spectrum and its first-order difference are superimposed together. The dynamic and static information of the high and low frequency bands of speech is fully utilized to obtain the logarithmic Mel spectrum with horizontal length related to signal duration and vertical length related to filter bank.
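Steps (1) to (3) above can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the pre-emphasis coefficient 0.96 and the Hamming parameter a = 0.46 come from the text, while the 16 kHz sampling rate, the 400-sample frame length, the 160-sample hop, and all function names are assumptions made for the example. Endpoint detection is omitted.

```python
import numpy as np

def preemphasis(signal, mu=0.96):
    """Apply H(z) = 1 - mu * z^-1, i.e. x(n) = s(n) - mu * s(n-1)."""
    signal = np.asarray(signal, dtype=float)
    return np.concatenate(([signal[0]], signal[1:] - mu * signal[:-1]))

def frame_signal(x, frame_len, hop_len):
    """Split x into overlapping frames of frame_len samples (the tail is dropped)."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def hamming(frame_len, a=0.46):
    """w(n, a) = (1 - a) - a * cos(2*pi*n / (N - 1)), n = 0..N-1."""
    n = np.arange(frame_len)
    return (1.0 - a) - a * np.cos(2.0 * np.pi * n / (frame_len - 1))

def power_spectrum(signal, frame_len=400, hop_len=160, mu=0.96):
    """Pre-emphasis -> framing -> Hamming window -> |DFT|^2 per frame."""
    frames = frame_signal(preemphasis(signal, mu), frame_len, hop_len)
    windowed = frames * hamming(frame_len)
    spectrum = np.fft.rfft(windowed, axis=1)  # S(k) for non-negative frequencies
    return np.abs(spectrum) ** 2

# One second of a 440 Hz tone at an assumed 16 kHz sampling rate.
sr = 16000
t = np.arange(sr) / sr
power = power_spectrum(np.sin(2 * np.pi * 440 * t))
print(power.shape)  # (98, 201): 98 frames, 201 frequency bins
```

With a 400-sample frame at 16 kHz each bin spans 40 Hz, so the tone's energy concentrates in bin 11.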

[0027] Furthermore, step S2 uses the two-dimensional log-Mel spectrum processed in step S1 to pre-train Fca-Res2Net. Fca-Res2Net is an improved residual network model that fuses a frequency domain channel attention network, specifically including:

[0028] (1) A convolutional block consists of a convolutional layer, a group normalization layer (GN), and a linear rectified unit (ReLU), and is used to obtain features;

[0029] (2) The attention module uses a novel channel attention network, FcaNet, to assist Res2Net in capturing refined features in both spatial and channel aspects;

[0030] (3) By combining the skip residual connection technique of ResNet, four residual attention blocks Fca-Res2Net blocks were designed to learn deep fusion shallow features in sequence.

[0031] Furthermore, in step (3), four residual attention blocks (Fca-Res2Net blocks) were designed using ResNet's skip residual connection technique to sequentially learn deep-layer fusion of shallow-layer features, specifically including:

[0032] The log-Mel spectrogram is resized to 300×256×4 and used as input to Fca-Res2Net. The first convolutional layer has a 7×7 kernel with a stride of 2; the max pooling layer is 3×3 with a stride of 2, preserving the most salient features. Each residual block is then processed in the same way and is followed by the channel attention module (Fca block), which applies channel attention to the features produced by that residual block. The first residual block passes through 3 Res2Net blocks with a convolutional layer of stride 1; the second residual block passes through 4 Res2Net blocks with a convolutional layer of stride 1; the third residual block passes through 6 Res2Net blocks with a convolutional layer of stride 1; the fourth residual block passes through 3 Res2Net blocks with a convolutional layer of stride 1. The number of output channels increases across the four residual blocks. Finally, a global average pooling layer with a stride of 1×2×2 is applied to describe the global features of the channels.

[0033] Furthermore, step S3 involves fusing the frequency domain channel attention network FcaNet with the improved residual network Res2Net to obtain fused shallow and deep speaker feature information, which is then used to acquire different feature channel weight information. Specifically:

[0034] (1) The improved residual network Res2Net utilizes channel-based residual skip-layer connections. By increasing the size of the receptive field, the hierarchical parallel network structure also increases the model's receptive field and integrates speaker information from different layers across channels.

[0035] (2) Add an attention module Fca-block to the backbone network Res2Net-50 and connect it to Res2Net-block to weight speech information and suppress information in the input features that is not related to the speaker feature extraction.

[0036] (3) Fca-block is a novel attention mechanism based on SE-block. Fca-block uses two-dimensional discrete cosine transform 2D-DCT to compress feature maps while preserving other frequency components.

[0037] The designed Fca-Res2Net network extracts local-global and spatial-channel features to enrich the representation of speaker features.
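The hierarchical connection inside a Res2Net block described in (1) can be illustrated with a small sketch. It follows the published Res2Net formulation (the input channels are split into s groups; the first group passes through, and each later group is transformed after adding the previous group's output). The `smooth` function is only a stand-in for a 3×3 convolution, and the surrounding 1×1 convolutions and the block-level shortcut are omitted; all names are hypothetical.

```python
import numpy as np

def res2net_split(x, transforms):
    """Hierarchical residual-like connections inside one Res2Net block.

    x: array of shape (channels, ...); transforms: s - 1 callables K_2..K_s.
    y_1 = x_1, y_2 = K_2(x_2), y_i = K_i(x_i + y_{i-1}) for i > 2.
    """
    groups = np.split(x, len(transforms) + 1, axis=0)
    outputs = [groups[0]]                  # first group passes through untouched
    prev = None
    for x_i, k_i in zip(groups[1:], transforms):
        inp = x_i if prev is None else x_i + prev
        prev = k_i(inp)                    # later groups see earlier outputs
        outputs.append(prev)
    return np.concatenate(outputs, axis=0)

def smooth(t):
    """Stand-in for a 3x3 convolution: 3-tap average along the last axis."""
    padded = np.pad(t, [(0, 0)] * (t.ndim - 1) + [(1, 1)], mode="edge")
    return (padded[..., :-2] + padded[..., 1:-1] + padded[..., 2:]) / 3.0

x = np.arange(8 * 6, dtype=float).reshape(8, 6)   # 8 channels, 6 time steps
y = res2net_split(x, [smooth, smooth, smooth])    # scale s = 4
print(y.shape)  # (8, 6)
```

Because each group's output feeds the next, the last group has passed through three stacked transforms, which is how the receptive field grows without adding parameters beyond the per-group filters.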

[0038] Furthermore, the Fca-Res2Net model designed in step S4, which incorporates a self-attention mechanism, combines speaker spatial features with temporal features to capture speech features over a long time span. The specific steps are as follows:

[0039] (1) After preprocessing the original speech segment, the input spectrogram is segmented and input into Fca-Res2Net in parallel to create multiple output channels;

[0040] (2) After straightening and encoding multiple outputs, they are simultaneously fed into the self-attention block to obtain a long speech feature time series related to the entire time series. After decoding, speaker feature information is used to filter out the most discriminative features.

[0041] The features output by self-attention are the weighted sum of all samples in the time series. It can see all the input time samples and focus on the speaker's own time attention points according to different weights.

[0042] Furthermore, the self-attention mechanism can capture long-term dependencies in sequences of arbitrary length. It takes the features extracted by Fca-Res2Net as input and encodes them into a series of input vectors. Each input vector is multiplied by the matrices $W_q$, $W_k$, and $W_v$ to obtain $q_i$, $k_i$, and $v_i$ for that input. The inner product of $q_i$ and $k_j$ is computed for each pair of inputs, and the softmax function is applied to obtain the similarity matrix between the inputs. The similarity weights are then multiplied by $v_i$ and summed to obtain the output sequence for each input. Stacking the $q_i$, $k_i$, and $v_i$ of all inputs into the matrices Q, K, and V, the output matrix is calculated as shown in formula (5).

[0043] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{5}$$

[0044] where $d_k$ is the common dimension (length) of $q_i$ and $k_i$, and Attention is the output attention;

[0045] Self-attention is used to decode the output matrix to obtain the final output feature values.
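The self-attention computation over segment-level features can be sketched as scaled dot-product attention. The feature dimensions, the random projection matrices, and the function names below are assumptions for illustration; in the described method the inputs would be the encoded Fca-Res2Net outputs and the projections would be learned.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (time, dim). Returns the attended sequence and the weight matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # q_i, k_i, v_i stacked as Q, K, V
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights               # each output is a weighted sum over time

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 16))              # 6 segment vectors of dimension 16
w_q, w_k, w_v = (rng.standard_normal((16, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)               # (6, 8) (6, 6)
```

Each row of `weights` sums to 1, so every output vector is a convex combination of all time steps, which is what lets the model attend across the whole utterance.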

[0046] Furthermore, in step S5, during model training, the cross-entropy error function is used as the training objective function, and the Adam algorithm is used for optimization. Finally, a softmax layer is used for speaker classification, specifically including:

[0047] The cross-entropy algorithm is defined as follows:

[0048]

[0049] Where m represents the number of samples, $y_i$ represents the true value of the i-th sample, $\hat{y}_i$ represents the predicted output value of the i-th sample, and L represents the loss value.

[0050] The Adaptive Moment Estimation (Adam) algorithm combines the Stochastic Gradient Descent with Momentum (SGDM) and Root Mean Square Prop (RMSprop) algorithms. The final weights are defined as follows:

[0051] $$w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

[0052] where $\hat{m}_t$ is the bias-corrected exponentially weighted average from Momentum, $\hat{v}_t$ is the bias-corrected exponentially weighted average from RMSprop, and α and ε are hyperparameters.
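A single Adam update can be written out as below. This is a sketch of the standard algorithm rather than the patent's training code: the decay rates β1 = 0.9 and β2 = 0.999 are commonly used defaults and are assumptions here, and the quadratic objective is a toy problem chosen only to show the update converging.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # Momentum: average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop: average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimise f(w) = ||w - target||^2.
target = np.array([1.0, -2.0])
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * (w - target)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
print(np.round(w, 2))
```

On the first step the bias correction makes the update magnitude equal to roughly α regardless of the gradient scale, which is one reason Adam is less sensitive to the initial learning rate than plain gradient descent.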

[0053] The formula for the Softmax function is as follows:

[0054] $$S_i = \frac{e^{g_i}}{\sum_{k=1}^{n} e^{g_k}}$$

[0055] n represents the number of categories; the n categories are indexed by k, with k ∈ (0, n]; i denotes a particular category among them; $g_i$ represents the value of that category; and $S_i$ represents the classification probability of the i-th element.
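The softmax output and the cross-entropy objective can be combined in a short sketch. The loss below is the categorical cross-entropy with integer speaker labels, which is an assumption: the source formula for the loss did not survive extraction, so this is one common form rather than a confirmed reproduction. The logits and labels are made-up values.

```python
import numpy as np

def softmax(g):
    """S_i = exp(g_i) / sum_k exp(g_k), computed row-wise and stably."""
    e = np.exp(g - g.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over m samples."""
    m = len(labels)
    return -np.log(probs[np.arange(m), labels]).mean()

# Two utterances, three candidate speakers (hypothetical network outputs).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])                 # true speaker index per utterance

probs = softmax(logits)
loss = cross_entropy(probs, labels)
predicted = probs.argmax(axis=1)          # speaker with the highest probability
print(predicted)                          # [0 1]
```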

[0056] The advantages and beneficial effects of this invention are as follows:

[0057] This invention provides an Fca-Res2Net speaker recognition model that incorporates a self-attention mechanism. Under the same experimental conditions, the model alleviates the poor generalization ability and low recognition rate of existing speaker recognition models. The specific steps are as follows. First, improved Mel-frequency cepstral coefficients (MFCC) are used as the system feature input: inverse Mel-frequency cepstral coefficients (IMFCC) are combined with MFCC to extract more representative speech spectral features, and their difference parameters are then fused in to fully utilize the dynamic and static information of the high and low frequency bands. These features are used as input to pre-train the Fca-Res2Net speaker network, and the network's weight parameters are transferred to the subsequent learning process to obtain a better weight initialization and reduce the possibility of overfitting. Second, a frequency domain channel attention network, FcaNet, is introduced on the baseline Res2Net model. The network uses residual modules to fuse shallow and deep speaker feature information, better acquiring the weight information of different feature channels without increasing the number of parameters. Third, to better incorporate temporal information and capture speech features over long time spans, this invention adds a self-attention mechanism to enhance long-span modeling of speech characteristics. Finally, the classification output is used to identify the speaker, improving speaker recognition accuracy.

Attached Figure Description

[0058] Figure 1 is a general block diagram of the speaker recognition method based on Fca-Res2Net fused with self-attention, provided by the preferred embodiment of the present invention;

[0059] Figure 2 is a network structure diagram of the improved feature extraction method;

[0060] Figure 3 is a diagram of the Fca-Res2Net model network structure;

[0061] Figure 4 is a structural diagram of the speaker network model fused with the self-attention mechanism.

Detailed Implementation

[0062] The technical solutions of the embodiments of the present invention will be clearly and thoroughly described below with reference to the accompanying drawings. The described embodiments are merely some embodiments of the present invention.

[0063] The technical solution of the present invention to solve the above-mentioned technical problems is:

[0064] As shown in Figure 1, this invention provides a speaker recognition method based on Fca-Res2Net fused with self-attention, characterized by the following steps:

[0065] S1: The original speech signal is preprocessed by pre-emphasis, framing, windowing, etc., to obtain a three-dimensional log-Mel spectrum. The specific steps are as follows:

[0066] (1) Use a high-pass filter as shown in the following formula to boost the high-frequency part:

[0067] $$H(z) = 1 - \mu z^{-1} \tag{1}$$

[0068] Where H(z) is the transfer function in the z-domain, μ represents the pre-emphasis coefficient, which is 0.96 in this invention, and the output after pre-emphasis is x(n);

[0069] (2) Due to the short-time stationary nature of the speech signal, the pre-emphasized output x(n) needs to be framed. To address the discontinuity at the endpoints after framing, a Hamming window is used for windowing:

[0070] $$w(n, a) = (1 - a) - a\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1 \tag{2}$$

[0071] y(n) represents the framed speech signal, w(n,a) represents the Hamming window function, where a takes the value 0.46, n = 0, 1, ..., N-1, and N is the frame length. The windowed speech signal is: s(n) = y(n) × w(n);

[0072] (3) After removing silent segments by endpoint detection, the energy distribution of the speech signal in the frequency domain is obtained by discrete Fourier transform. The output is a complex number S(k) containing N frequency bands, representing the amplitude and phase of a certain frequency in the original signal, as shown in the following formula:

[0073] $$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N - 1 \tag{3}$$

[0074] (4) Based on the sensitivity of the human ear, the spectrum is divided into multiple Mel filter banks and multiple inverse Mel filter banks. After passing through M different filters, the frequency response $H_m(k)$ is obtained, and the logarithmic energy of the output of each filter bank is then calculated:

[0075] $$E(m) = \ln\left(\sum_{k=0}^{N-1} |S(k)|^{2} H_m(k)\right), \quad 1 \le m \le M \tag{4}$$

[0076] (5) Then, the corresponding first-order difference is obtained by difference operation. The logarithmic spectrum and its first-order difference are superimposed together. The dynamic and static information of the high and low frequency bands of speech is fully utilized to obtain the logarithmic Mel spectrum with horizontal length related to signal duration and vertical length related to filter bank.
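Steps (4) and (5) can be sketched as follows, starting from a per-frame power spectrum. Several choices here are assumptions rather than details from the text: the triangular filter shape, the Mel formula 2595·log10(1 + f/700), the construction of the inverse Mel bank by mirroring the Mel bank along the frequency axis (one common construction for IMFCC), and the ordering of the four stacked channels. The random input is a placeholder for a real power spectrum.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters H_m(k), spaced evenly on the Mel scale."""
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    fb = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_energy(power, fb, eps=1e-10):
    """Log energy of each filter output; power has shape (frames, bins)."""
    return np.log(power @ fb.T + eps)

def delta(feat):
    """First-order difference along the frame axis, same shape as feat."""
    return np.diff(feat, axis=0, prepend=feat[:1])

def stacked_features(power, n_filters, n_fft, sr):
    mel_fb = mel_filterbank(n_filters, n_fft, sr)
    inv_fb = mel_fb[::-1, ::-1]        # assumed inverse-Mel bank: mirrored in frequency
    mel = log_energy(power, mel_fb)    # static, emphasises low frequencies
    inv = log_energy(power, inv_fb)    # static, emphasises high frequencies
    return np.stack([mel, inv, delta(mel), delta(inv)], axis=-1)

rng = np.random.default_rng(0)
power = rng.random((50, 201))          # placeholder: 50 frames, n_fft = 400
features = stacked_features(power, n_filters=40, n_fft=400, sr=16000)
print(features.shape)                  # (50, 40, 4)
```

The result has time on the first axis, filter index on the second, and four channels (two static, two dynamic), matching the description of a spectrogram whose length follows signal duration and whose height follows the filter bank.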

[0077] S2: The preprocessed two-dimensional log-Mel group spectrogram is used to pre-train the Fca-Res2Net speaker recognition network to improve the model's generalization ability. The improved Fca-Res2Net extracts shallow and deep features from the two-dimensional log-Mel group spectrogram and fuses them with a residual neural network (Res2Net) and a frequency domain channel attention network (FcaNet). Specifically, this includes:

[0078] (1) A convolutional block consists of a convolutional layer, a group normalization layer (GN), and a linear rectified unit (ReLU), and is used to obtain features;

[0079] (2) The attention module uses a novel channel attention network (FcaNet) to assist Res2Net in capturing refined features in both spatial and channel aspects;

[0080] (3) By combining the skip residual connection technique of ResNet, four residual attention blocks (Fca-Res2Net blocks) were designed to learn deep fusion shallow features in sequence.

[0081] The log-Mel spectrogram is resized to 300×256×4 and used as input to Fca-Res2Net. The first convolutional layer has a 7×7 kernel with a stride of 2. The max pooling layer is 3×3 with a stride of 2, preserving the most salient features. Each residual block is then processed in the same way and is followed by a channel attention module (Fca block), which applies channel attention to the features produced by that residual block. The first residual block passes through 3 Res2Net blocks with a convolutional layer of stride 1. The second residual block passes through 4 Res2Net blocks with a convolutional layer of stride 1. The third residual block passes through 6 Res2Net blocks with a convolutional layer of stride 1. The fourth residual block passes through 3 Res2Net blocks with a convolutional layer of stride 1. The number of output channels increases across the four residual blocks. Finally, a global average pooling layer with a stride of 1×2×2 is applied to describe the global features of the channels.
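The spatial sizes implied by the stem can be checked with a little arithmetic. The kernel sizes, strides, input size, and block counts come from the paragraph above; the padding values (3 for the 7×7 convolution, 1 for the 3×3 pooling) are assumptions borrowed from the usual ResNet stem, since the text does not state them.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_output(height, width):
    """7x7 stride-2 convolution followed by 3x3 stride-2 max pooling."""
    h = conv_out(height, kernel=7, stride=2, padding=3)   # padding is assumed
    w = conv_out(width, kernel=7, stride=2, padding=3)
    h = conv_out(h, kernel=3, stride=2, padding=1)        # padding is assumed
    w = conv_out(w, kernel=3, stride=2, padding=1)
    return h, w

stage_blocks = [3, 4, 6, 3]        # Res2Net blocks in the four residual stages
h, w = stem_output(300, 256)
print(h, w)                        # 75 64
print(sum(stage_blocks))           # 16
```

Under these assumptions the 300×256 input reaches the first residual stage as a 75×64 map, and because every stage uses stride 1 as described, that spatial size is carried through all 16 Res2Net blocks.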

[0082] S3: Using a pre-designed speaker recognition method based on Fca-Res2Net fusion self-attention, this method fuses a frequency domain channel attention network (FcaNet) with an improved residual network (Res2Net) to obtain fused shallow and deep speaker feature information, and better acquires the weight information of different feature channels. The specific content is as follows:

[0083] (1) Improved Residual Network (Res2Net) utilizes channel-based residual skip-layer connections, inheriting the advantages of ResNet without increasing parameter computation. It improves the feature extraction capability of convolutional neural networks by increasing the size of the receptive field. The hierarchical parallel network structure also greatly increases the model's receptive field and allows for cross-channel fusion of speaker information from different layers.

[0084] (2) Add an attention module Fca-block to the backbone network Res2Net-50 and connect it to Res2Net-block to reduce the weight of low-quality speech information and suppress information in the input features that is not related to the speaker feature extraction.

[0085] (3) Fca-block is a novel attention mechanism based on SE-block (Squeeze and Excitation block). It was first applied to object detection. Unlike SE-block, which uses global average pooling (GAP) to compress feature maps, Fca-block uses two-dimensional discrete cosine transform (2D-DCT) to compress feature maps, preserving other frequency components. Using only GAP will cause the feature channels to retain only the lowest frequency component and lose other frequency component features.

[0086] The designed Fca-Res2Net network extracts local-global and spatial-channel features to enrich the representation of speaker features.
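The difference between the GAP squeeze of SE-block and the 2D-DCT squeeze of Fca-block can be shown numerically. The sketch below is illustrative only: the choice of frequency pairs, the tiny excitation network with random weights, and all function names are hypothetical, and the DCT normalisation constants are dropped.

```python
import numpy as np

def dct_basis(h, w, u, v):
    """2D-DCT basis B[i, j] = cos(pi*u*(i+0.5)/h) * cos(pi*v*(j+0.5)/w)."""
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    return np.cos(np.pi * u * (i + 0.5) / h) * np.cos(np.pi * v * (j + 0.5) / w)

def gap_squeeze(x):
    """SE-style squeeze: one scalar per channel via global average pooling."""
    return x.mean(axis=(1, 2))

def dct_squeeze(x, freqs):
    """Fca-style squeeze: each channel group is projected onto its own DCT basis."""
    c, h, w = x.shape
    groups = np.split(np.arange(c), len(freqs))
    out = np.empty(c)
    for idx, (u, v) in zip(groups, freqs):
        out[idx] = (x[idx] * dct_basis(h, w, u, v)).sum(axis=(1, 2))
    return out

def channel_weights(squeezed, w1, w2):
    """Excitation: FC -> ReLU -> FC -> sigmoid, one weight per channel."""
    hidden = np.maximum(squeezed @ w1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ w2)))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 7))               # (channels, height, width)
freqs = [(0, 0), (0, 1), (1, 0), (1, 1)]         # hypothetical frequency choice
z = dct_squeeze(x, freqs)
w1, w2 = rng.standard_normal((8, 2)), rng.standard_normal((2, 8))
y = x * channel_weights(z, w1, w2)[:, None, None]  # reweighted feature map
print(y.shape)                                   # (8, 5, 7)
```

With the single frequency pair (0, 0) the basis is all ones, so the DCT squeeze reduces to GAP scaled by height × width. That is the sense in which GAP keeps only the lowest frequency component, while additional frequency pairs bring in the components GAP discards.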

[0087] S4: The designed Fca-Res2Net model, which incorporates a self-attention mechanism, combines speaker spatial features with temporal features to capture speech features over a long time span. The specific steps are as follows:

[0088] (1) After preprocessing the original speech segment, the input spectrogram is segmented and input into Fca-Res2Net in parallel to create multiple output channels.

[0089] (2) After straightening and encoding multiple outputs, they are simultaneously fed into the self-attention block to obtain a long speech feature time series related to the entire time series. After decoding, it is used as speaker feature information, giving full play to the correlation between different features and selecting the most discriminative features.

[0090] As an attention mechanism, self-attention can capture long-term dependencies in sequences of arbitrary length. This invention takes the features extracted by Fca-Res2Net as input and encodes them into a series of input vectors. Each input vector is multiplied by the matrices $W_q$, $W_k$, and $W_v$ to obtain $q_i$, $k_i$, and $v_i$ for that input. The inner product of $q_i$ and $k_j$ is computed for each pair of inputs, and the softmax function is applied to obtain the similarity matrix between the inputs. The similarity weights are then multiplied by $v_i$ and summed to obtain the output sequence for each input. Stacking the $q_i$, $k_i$, and $v_i$ of all inputs into the matrices Q, K, and V, the output matrix is calculated as shown in formula (5).

[0091] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{5}$$

[0092] where $d_k$ is the common dimension (length) of $q_i$ and $k_i$, and Attention is the output attention.

[0093] Since contextual temporal information should also be considered in speaker recognition, this invention adopts self-attention. The output matrix is ​​decoded to obtain the final output feature value. The feature output by self-attention is the weighted sum of all samples in the time series. It can see all the input time samples and focus on the speaker's own time attention points according to different weights.

[0094] S5: During model training, the cross-entropy error function is used as the training objective function, and the Adam algorithm is used for optimization. Finally, a softmax layer is used for speaker classification, specifically including:

[0095] The cross-entropy algorithm is defined as follows:

[0096]

[0097] Where m represents the number of samples, $y_i$ represents the true value of the i-th sample, $\hat{y}_i$ represents the predicted output value of the i-th sample, and L represents the loss value.

[0098] The Adam algorithm is actually a combination of the Momentum and RMSprop algorithms, and the final updated weights are defined as follows:

[0099] $$w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

[0100] where $\hat{m}_t$ is the bias-corrected exponentially weighted average from Momentum, $\hat{v}_t$ is the bias-corrected exponentially weighted average from RMSprop, and α and ε are hyperparameters, with ε typically set to $10^{-8}$.

[0101] The formula for the Softmax function is as follows:

[0102] $$S_i = \frac{e^{g_i}}{\sum_{k=1}^{n} e^{g_k}}$$

[0103] n represents the number of categories; the n categories are indexed by k, with k ∈ (0, n]; i denotes a particular category among them; $g_i$ represents the value of that category; and $S_i$ represents the classification probability of the i-th element.

[0104] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0105] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0106] The above embodiments should be understood as illustrative only and not as limiting the scope of protection of the present invention. After reading the description of the present invention, those skilled in the art can make various alterations or modifications to the present invention, and these equivalent changes and modifications also fall within the scope defined by the claims of the present invention.

Claims

1. A speaker recognition method based on Fca-Res2Net fusion self-attention, characterized in that it comprises the following steps:
S1. Perform pre-emphasis, framing, and windowing preprocessing on the original speech signal; pass the preprocessed speech signal through different filters and perform differential operations to obtain a set of two-dimensional log-Mel spectrograms whose horizontal length is related to the signal duration and whose vertical length is related to the filter bank.
S2. Use the two-dimensional log-Mel spectrograms processed in step S1 to pre-train Fca-Res2Net. Fca-Res2Net is an improved residual network model that integrates a frequency domain channel attention network and a speaker recognition network to improve generalization ability. Res2Net is an improvement on ResNet: while inheriting the advantages of ResNet, it does not increase the amount of parameter computation, and it improves the feature extraction ability of the convolutional neural network by increasing the size of the receptive field.
S3. Fuse the frequency domain channel attention network FcaNet with the improved residual network Res2Net to obtain speaker feature information from the shallow and deep layers, and use it to obtain the weight information of different feature channels.
S4. Provide an Fca-Res2Net model combined with a self-attention mechanism, which combines speaker spatial features with temporal features and captures speech features over a long time span.
S5. During model training, use the cross-entropy error function as the training objective function and minimize the cross-entropy loss by updating the parameters, while using the Adam algorithm for optimization to obtain the final network model; finally, use the softmax layer for speaker classification.
In step S2, in which the two-dimensional log-Mel spectrograms processed in step S1 are used to pre-train Fca-Res2Net, Fca-Res2Net is an improved residual network model that fuses a frequency domain channel attention network, specifically including: (1) a convolutional block, consisting of a convolutional layer, a group normalization (GN) layer, and a rectified linear unit (ReLU), used to obtain features; (2) an attention module, which uses a novel channel attention network, FcaNet, to assist Res2Net in capturing refined features in both the spatial and channel aspects; (3) four residual attention blocks (Fca-Res2Net blocks), designed by combining the skip residual connection technique of ResNet, to learn in sequence deep features fused with shallow features.
In step S4, the Fca-Res2Net model incorporating a self-attention mechanism combines speaker spatial features with temporal features to capture speech features over a long time span, with the following specific steps: (1) after the original speech segment is preprocessed, the input spectrogram is segmented and the segments are input into Fca-Res2Net in parallel to create multiple output channels; (2) the multiple outputs are flattened and encoded, and then fed simultaneously into the self-attention block to obtain a long-duration speech feature sequence related to the entire time series; after decoding, the speaker feature information is used to select the most discriminative features. The features output by self-attention are the weighted sum of all samples in the time series, so that self-attention sees all input time samples and, according to the different weights, focuses on the time points characteristic of the speaker.

2. The speaker recognition method based on Fca-Res2Net fusion self-attention according to claim 1, characterized in that, in step S1, the original speech signal is preprocessed by pre-emphasis, framing, windowing, etc., to obtain a three-dimensional log-Mel spectrum, with the following specific steps:
(1) The high-frequency part is boosted using the high-pass filter shown in formula (1):
$$ H(z) = 1 - \mu z^{-1} \tag{1} $$
where $H(z)$ is the transfer function in the z-domain and $z$ represents a coordinate value of the z-domain; the z-domain is a description of a discrete-time system, the z-transform is a modification of the Laplace transform for sampled functions, and the domain in which a sampled continuous-time system is processed is referred to as the z-domain; $\mu$ represents the pre-emphasis coefficient; and, with $x(n)$ denoting the input speech signal, the output after pre-emphasis is $y(n) = x(n) - \mu x(n-1)$.
(2) The pre-emphasized output $y(n)$ is divided into frames, and a Hamming window is applied to address the discontinuity at the frame endpoints after segmentation:
$$ w(n) = (1 - a) - a \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{2} $$
where $s(n)$ represents the audio signal after framing, the window function $w(n)$ is the Hamming window, $a$ takes the value 0.46, and N is the frame length; the windowed audio signal is $s_w(n) = s(n)\, w(n)$.
(3) After silent segments are removed by endpoint detection, the energy distribution of the speech signal in the frequency domain is obtained by the discrete Fourier transform; the output is a complex number $X(k)$ containing N frequency bands, representing the amplitude and phase at a certain frequency in the original signal, as shown in the following formula:
$$ X(k) = \sum_{n=0}^{N-1} s_w(n)\, e^{-j 2\pi n k / N}, \quad 0 \le k \le N-1 \tag{3} $$
(4) Based on the sensitivity of the human ear, the spectrum is divided by multiple Mel filter banks and multiple inverse Mel filter banks. After the spectrum passes through the M different filters with frequency responses $H_m(k)$, the logarithmic energy of the output of each filter is calculated:
$$ E(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)|^2 H_m(k)\right), \quad 1 \le m \le M \tag{4} $$
(5) The corresponding first-order difference is then obtained by a difference operation, and the logarithmic spectrum and its first-order difference are stacked together. The dynamic and static information of the high and low frequency bands of speech is thereby fully utilized to obtain the log-Mel spectrum whose horizontal length is related to the signal duration and whose vertical length is related to the filter bank.
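As a non-limiting illustration, the preprocessing chain of this claim (pre-emphasis, framing with a Hamming window, discrete Fourier transform, logarithmic filter energy, and first-order difference) can be sketched in plain Python. The pre-emphasis coefficient 0.97, the hop size, the energy floor, and all function names are assumptions made for illustration; the Mel and inverse Mel filter weights are taken as given inputs rather than constructed here, and only the non-redundant half of the spectrum is kept.

```python
import cmath
import math

def pre_emphasis(x, mu=0.97):
    """y(n) = x(n) - mu * x(n-1); mu = 0.97 is an assumed, commonly used value."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def hamming(N, a=0.46):
    """w(n) = (1 - a) - a * cos(2*pi*n / (N - 1)) for 0 <= n <= N - 1."""
    return [(1 - a) - a * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frames(y, N, hop):
    """Split y into overlapping frames of length N and apply the Hamming window."""
    w = hamming(N)
    return [[y[s + n] * w[n] for n in range(N)]
            for s in range(0, len(y) - N + 1, hop)]

def power_spectrum(frame):
    """|X(k)|^2 from a direct O(N^2) DFT, keeping bins 0..N/2 of a real signal."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * n * k / N)
                    for n in range(N))) ** 2
            for k in range(N // 2 + 1)]

def log_energies(power, filters, floor=1e-10):
    """Log energy of each filter output; `filters` is a list of weight rows."""
    return [math.log(max(sum(h * p for h, p in zip(row, power)), floor))
            for row in filters]

def delta(features):
    """First-order difference along time; the first frame's delta is set to zero."""
    zeros = [0.0] * len(features[0])
    return [zeros] + [[b - a for a, b in zip(prev, cur)]
                      for prev, cur in zip(features, features[1:])]
```

Stacking the output of `log_energies` for every frame together with its `delta` gives one static and one dynamic feature map per filter bank, which corresponds to the stacked log-Mel representation described in step (5).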

3. The speaker recognition method based on Fca-Res2Net fusion self-attention as described in claim 1, characterized in that item (3) of step S2, in which four residual attention blocks (Fca-Res2Net blocks) are designed by combining the skip residual connection technique of ResNet to sequentially learn deep features fused with shallow features, specifically includes: the log-Mel spectrogram is resized to 300×256×4 and used as the input to Fca-Res2Net. The first convolutional layer has a 7×7 kernel with a stride of 2; the max pooling layer has a 3×3 window with a stride of 2, preserving the salient features. Next, each residual block is processed in the same way and connected in sequence to the channel attention module (Fca block), which applies comprehensive channel attention to the features obtained from the residual block. The first residual block passes through 3 Res2Net blocks with stride-1 convolutional layers; the second residual block passes through 4 Res2Net blocks with stride-1 convolutional layers; the third residual block passes through 6 Res2Net blocks with stride-1 convolutional layers; and the fourth residual block passes through 3 Res2Net blocks with stride-1 convolutional layers. The number of output channels increases across the four residual blocks. Finally, a global average pooling layer with a stride of 1×2×2 is applied to describe the global features of the channels.
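As a non-limiting illustration, the spatial size of the feature map after the first convolution and the max pooling layer can be worked out as follows. The claim states only kernel sizes and strides; the padding values (3 for the 7×7 convolution, 1 for the 3×3 pooling) are assumptions borrowed from standard ResNet stems, and the names `conv_out` and `stem_output` are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_output(height=300, width=256):
    # 7x7 convolution with stride 2 (padding 3 is an assumption)
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    # 3x3 max pooling with stride 2 (padding 1 is an assumption)
    return conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)

# Number of Res2Net blocks in each of the four residual blocks, as stated in the claim.
RES2NET_BLOCKS_PER_STAGE = [3, 4, 6, 3]
```

Under these padding assumptions, a 300×256 input is reduced to 150×128 by the convolution and to 75×64 by the pooling layer before entering the first residual block.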

4. The speaker recognition method based on Fca-Res2Net fusion self-attention as described in claim 1, characterized in that step S3, in which the frequency domain channel attention network FcaNet is fused with the improved residual network Res2Net to obtain speaker feature information from the shallow and deep layers and this information is used to acquire the weights of different feature channels, specifically includes: (1) the improved residual network Res2Net uses channel-wise residual skip connections; its hierarchical parallel structure enlarges the model's receptive field and integrates speaker information from different layers across channels; (2) an attention module, Fca-block, is added to the backbone network Res2Net-50 and connected to the Res2Net-block to weight the speech information and suppress information in the input features that is unrelated to speaker feature extraction; (3) Fca-block is a novel attention mechanism based on SE-block, which uses the two-dimensional discrete cosine transform (2D-DCT) to compress feature maps while retaining other frequency components. The designed Fca-Res2Net network extracts local-global and spatial-channel features to enrich the representation of speaker features.
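As a non-limiting illustration, the 2D-DCT compression used by the Fca-block can be sketched as follows. Each channel's feature map is reduced to one scalar by projecting it onto a DCT basis function of an assigned frequency (u, v). The sketch is simplified: the fully connected layers that an SE-style block places between the descriptor and the sigmoid gate are omitted, and all function names are hypothetical.

```python
import math

def dct2_coefficient(x, u, v):
    """2D-DCT coefficient of feature map x (H rows, W columns) at frequency (u, v)."""
    H, W = len(x), len(x[0])
    return sum(
        x[i][j]
        * math.cos(math.pi * u * (i + 0.5) / H)
        * math.cos(math.pi * v * (j + 0.5) / W)
        for i in range(H)
        for j in range(W)
    )

def channel_descriptors(feature_maps, freqs):
    """One scalar per channel; channel c is compressed with frequency freqs[c]."""
    return [dct2_coefficient(x, u, v) for x, (u, v) in zip(feature_maps, freqs)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reweight(feature_maps, gates):
    """Scale every value of channel c by its attention weight gates[c]."""
    return [[[g * val for val in row] for row in x]
            for x, g in zip(feature_maps, gates)]
```

At frequency (0, 0) the coefficient equals the sum of the feature map, which is proportional to the global average pooling used by SE-block; assigning other frequencies to other channels is what retains the additional frequency components.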

5. The speaker recognition method based on Fca-Res2Net fusion self-attention as described in claim 1, characterized in that the self-attention mechanism can capture long-term dependencies in sequences of arbitrary length. It takes the features extracted by Fca-Res2Net as input, which are encoded to obtain a series of input vectors. Each input vector is multiplied by the matrices $W^Q$, $W^K$, and $W^V$ to obtain the vectors $q$, $k$, and $v$ of that input. Between each pair of inputs, the inner product of $q$ and $k$ is taken and the softmax function is applied to obtain the similarity matrix $A$ between the inputs; $A$ and $v$ are then multiplied and summed to obtain the output sequence for each input. The $q$, $k$, and $v$ of all inputs are combined into the input matrices Q, K, and V, and the output matrix is calculated as shown in formula (5):
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{5} $$
where $d_k$ is the common length dimension of Q and K, and Attention is the output attention. The output matrix of the self-attention is decoded to obtain the final output feature values.
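As a non-limiting illustration, the scaled dot-product self-attention of this claim can be sketched in plain Python. The inputs are represented as a list of row vectors `X`, and the projection matrices `Wq`, `Wk`, `Wv` are illustrative placeholders for the learned matrices; all names are hypothetical.

```python
import math

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    """Apply softmax to every row of M."""
    out = []
    for row in M:
        shift = max(row)  # numerical safeguard; does not change the result
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, Wq, Wk, Wv):
    """Return (output, A): A is the similarity matrix, output = A V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, Kt)]
    A = softmax_rows(scores)
    return matmul(A, V), A
```

Each output row is a weighted sum of all rows of V, so every time step draws on the whole input sequence, with the weights in `A` determining which time steps receive the most attention.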

6. The speaker recognition method based on Fca-Res2Net fusion self-attention as described in claim 5, characterized in that, in step S5, during model training, the cross-entropy error function is used as the training objective function, the Adam algorithm is used for optimization, and a softmax layer finally performs speaker classification, specifically including:
The cross-entropy loss is defined as follows:
$$ L = -\frac{1}{m} \sum_{i=1}^{m} y_i \log \hat{y}_i \tag{6} $$
where m represents the number of samples, $y_i$ represents the true value of the i-th sample, $\hat{y}_i$ represents the predicted output value of the i-th sample, and $L$ represents the loss value.
Adam, the adaptive moment estimation algorithm, combines stochastic gradient descent with momentum (SGDM) and the root mean square propagation (RMSprop) gradient descent algorithm. The final weight update is defined as follows:
$$ w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \tag{7} $$
where $\hat{m}_t$ represents the bias-corrected exponentially weighted average of Momentum, $\hat{v}_t$ represents the bias-corrected exponentially weighted average of RMSprop, and $\alpha$ and $\varepsilon$ are hyperparameters.
The formula for the Softmax function is as follows:
$$ S_i = \frac{e^{g_i}}{\sum_{k=1}^{n} e^{g_k}} \tag{8} $$
where n represents the number of categories; there are n categories in total, represented by the values $S_k$, $k \in (0, n]$; i represents a particular category in k; $g_i$ represents the value of that category; and $S_i$ represents the classification probability of the i-th element.