Far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning
By combining self-distillation pre-training and meta-learning fine-tuning, robust deep features are generated, solving the performance degradation problem in far-field speaker recognition and enabling the ability to recognize user speech in the far field in smart home systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2023-06-12
- Publication Date
- 2026-06-23
Figure CN116863937B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of speech signal processing technology, specifically to a far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning.
Background Technology
[0002] Speaker verification aims to determine whether a test speech segment comes from the same speaker as a registered speech segment. In recent years, deep neural network-based speaker verification methods have made significant progress, achieving satisfactory results under controlled conditions (e.g., close-range conversation scenarios with minimal interference). However, under far-field conditions, attenuation of the speech signal, room reverberation, and environmental noise significantly degrade the performance of existing speaker verification methods. To mitigate the impact of far-field speech on speaker verification performance, existing techniques mainly fall into two groups: front-end speech signal enhancement and model domain adaptation.
[0003] Front-end speech signal enhancement introduces additional front-end processing modules to amplify, denoise, and dereverberate the input speech. Typical methods include algorithms based on traditional digital signal processing, such as Wiener filtering, Kalman filtering, and weighted prediction error, as well as algorithms based on deep neural networks. The advantage of these techniques is that existing speaker verification models can be used directly without modification; the disadvantage is that they increase the number of parameters and the computational cost. Furthermore, they may damage the speaker information in the speech sample while filtering out noise.
[0004] Domain adaptation techniques treat far-field speaker verification as a domain adaptation problem, transferring models trained on near-field datasets to far-field datasets in the target domain through methods such as domain adversarial training and maximum mean discrepancy. While these techniques can achieve some performance improvements, they suffer from convergence difficulties during training. Furthermore, domain adaptation requires collecting a certain number of far-field speech samples from the target domain in advance, which limits its applicability. These methods also focus primarily on the domain mismatch between the training and test sets and neglect the domain mismatch between registered and test speech in speaker verification.
Summary of the Invention
[0005] The purpose of this invention is to address the performance degradation of speaker verification methods caused by inconsistencies between the recording scenarios of registered and test speech. It provides a far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning. The method combines self-distillation learning, meta-learning, and existing deep neural networks for speaker recognition to support near-field registration with far-field testing. Self-distillation learning is a training method that improves the performance of deep neural networks: it uses the output of the last layer of the network as additional supervisory information to guide the training of the intermediate layers, so that the network generates more robust deep features. Meta-learning is a training strategy that improves the generalization of deep neural networks: by simulating different noise environments in the support set and the query set, it pulls the deep features of speech samples from the same speaker recorded in different noise environments as close together as possible in the feature space, while pushing the deep features of speech samples from different speakers as far apart as possible, giving the network the ability to generate domain-invariant features.
[0006] The objective of this invention can be achieved by adopting the following technical solutions:
[0007] A far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning, the far-field speaker verification method comprising the following steps:
[0008] S1. Establish the speech dataset: Divide the speech dataset into near-field training speech of the pre-training dataset, far-field training speech of the fine-tuning dataset, near-field registration speech of the evaluation dataset, and far-field test speech of the evaluation dataset.
[0009] S2. Extracting Log-Mel Spectrum Features: Extracting log-Mel spectrum features from the near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, the near-field registration speech of the evaluation dataset, and the far-field test speech, respectively. The extraction process includes pre-emphasis, framing, windowing, Fourier transform, Mel filtering, logarithmic operation, and normalization.
[0010] S3. Construct and initialize the self-distillation learning framework: The self-distillation learning framework includes a backbone neural network and branch neural networks used only for the self-distillation pre-training stage;
[0011] S4. Self-distillation pre-training of the backbone neural network: The log-Mel spectrum features of the near-field training speech are input into the backbone neural network in the self-distillation learning framework. With the help of the branch neural networks, the classification loss function and the self-distillation loss function are optimized, so that the output of the last layer of the backbone neural network serves as additional supervision information to guide the training of its intermediate layers. The pre-trained backbone neural network is obtained through iterative updates.
[0012] S5. Meta-learning fine-tuning of the backbone neural network: The log-Mel spectrum features of the far-field training speech are input into the pre-trained backbone neural network. The network parameters of the pre-trained backbone neural network are fine-tuned through meta-learning methods, and the backbone neural network is iteratively updated until convergence.
[0013] S6. Speaker verification: Near-field registration speech and far-field test speech from the evaluation dataset are combined into test sample pairs, including positive sample pairs and negative sample pairs. In a positive sample pair, the two speech samples belong to the same speaker, while in a negative sample pair, the two speech samples belong to different speakers. The log-Mel spectrum features of the test sample pairs are input into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pairs. The similarity between the two deep features of each pair is calculated. If the similarity is greater than a preset threshold, the two speech samples are considered to come from the same speaker; otherwise, they are considered to come from different speakers.
[0014] Furthermore, the process of step S2 is as follows:
[0015] S2.1 Pre-emphasis: Pre-emphasis is used to enhance, i.e., compensate, the high-frequency components of the signal. A first-order high-pass filter is used to pre-emphasize the near-field training speech of the input pre-training dataset, the far-field training speech of the fine-tuning dataset, and the near-field registration speech and far-field test speech of the evaluation dataset. The transfer function of the filter is $H(z) = 1 - \alpha z^{-1}$, where $0.9 \le \alpha \le 1$;
[0016] S2.2 Frame Segmentation: The pre-emphasized near-field training speech, far-field training speech, near-field registration speech, and far-field test speech are framed to obtain short-time speech frames. The reason for this is that the frequency of the signal changes over time. In order to avoid the loss of the frequency profile of the signal over time, it is necessary to perform frame segmentation on the signal, assuming that the signal within each frame is short-term invariant.
[0017] S2.3 Windowing: Windowing is used to smoothly attenuate the two ends of the frame, reduce the intensity of the side lobes of the subsequent Fourier transform, and thus obtain a higher quality spectrum; windowing is performed on short-time speech frames, Hamming window is selected as the window function, and it is multiplied with each frame of speech to obtain the windowed short-time speech frame.
[0018] S2.4 Extracting the logarithmic Mel spectrum: Perform a discrete Fourier transform on the windowed short-time speech frame to obtain the corresponding linear spectrum, then use a Mel filter to convert the linear spectrum into a Mel spectrum, and finally take the logarithm of the Mel spectrum to obtain the logarithmic Mel spectrum.
[0019] S2.5 Normalization: The log-Mel spectrum features are processed using the local cepstral mean normalization method to obtain normalized features. Normalizing the acoustic feature vectors scales the energy of each frequency band to the same level, which makes the speech features of different speakers more consistent and comparable in the frequency domain and enables better classification and recognition of different speakers.
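Steps S2.1 to S2.5 can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in the invention. The 16 kHz sampling rate, the 512-point FFT, the 40 Mel bands, the 25 ms / 10 ms framing defaults, and all function names are assumptions. The normalization here subtracts the mean over the whole utterance, which is a simplification of the local cepstral mean normalization named in S2.5.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fbank = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def log_mel_features(signal, sample_rate=16000, alpha=0.95,
                     frame_ms=25, shift_ms=10, n_fft=512, n_mels=40):
    # S2.1 pre-emphasis: y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # S2.2 framing into overlapping short-time frames
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # S2.3 Hamming window applied to every frame
    frames = frames * np.hamming(frame_len)
    # S2.4 DFT -> power spectrum -> Mel spectrum -> logarithm
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    log_mel = np.log(mel + 1e-10)
    # S2.5 mean normalization per Mel band (whole-utterance simplification)
    return log_mel - log_mel.mean(axis=0, keepdims=True)
```

The function returns an array of shape (number of frames, number of Mel bands), which is the kind of input the backbone neural network in step S3 would consume.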
[0020] Furthermore, the process of step S3 is as follows:
[0021] S3.1 Constructing the backbone neural network: The backbone neural network includes a sequentially connected convolutional input layer, a first convolutional module, a second convolutional module, ..., an Nth convolutional module, a convolutional dimensionality reduction layer, an attention statistical pooling layer, and a fully connected layer. The output of the fully connected layer is the final speaker deep feature. The number N of convolutional modules and their specific structure are determined by the selected backbone neural network. The convolutional input layer maps the low-dimensional input acoustic features to a high-dimensional feature map rich in semantic information. The first to Nth convolutional modules learn and extract from this feature map the key features that distinguish different speakers. The convolutional dimensionality reduction layer, the attention statistical pooling layer, and the fully connected layer remove redundant information from the high-dimensional feature map and map it to a low-dimensional feature space for easy recognition and classification.
[0022] S3.2 Constructing the branch neural networks: Each branch neural network includes a bottleneck module, a statistical pooling layer, and a fully connected layer connected in sequence. The branch neural networks are used only in the self-distillation pre-training stage, to calculate the self-distillation loss. Their inputs are the outputs of the first to (N-1)th convolutional modules in the backbone neural network: the branch neural network that takes the output of the first convolutional module as its input is called the first branch neural network, the one that takes the output of the second convolutional module as its input is called the second branch neural network, and so on, up to the (N-1)th branch neural network, which takes the output of the (N-1)th convolutional module as its input. Branch neural networks are used because the purity of the speaker information contained in the output of each convolutional module differs, so letting these outputs interact directly with the output of the backbone neural network would harm the training of the backbone neural network.
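The data flow through one branch neural network (bottleneck, statistical pooling, fully connected layer) can be sketched as below. This is only an illustration under assumptions: the bottleneck is modeled as a per-frame linear map with ReLU, the feature map is simplified to shape (channels, time), and the names `statistics_pooling` and `branch_forward` and the weight shapes are hypothetical. The actual bottleneck structure is defined by the drawings of the embodiments.

```python
import numpy as np

def statistics_pooling(feature_map):
    """Pool a (channels, time) map into a fixed-length (2 * channels,) vector."""
    mean = feature_map.mean(axis=1)
    std = feature_map.std(axis=1)
    return np.concatenate([mean, std])

def branch_forward(feature_map, w_bottleneck, w_fc):
    # Bottleneck (assumed form): per-frame linear projection followed by ReLU.
    hidden = np.maximum(w_bottleneck @ feature_map, 0.0)
    # Statistical pooling removes the time axis, so any utterance length works.
    pooled = statistics_pooling(hidden)
    # Fully connected layer produces the branch deep feature.
    return w_fc @ pooled
```

Because pooling collapses the time axis, the branch deep feature has the same dimension as the backbone deep feature regardless of utterance length, which is what allows the two to be compared in the distillation loss.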
[0023] Furthermore, the process of step S4 is as follows:
[0024] S4.1 Extracting the backbone deep features: Input the log-Mel spectrum features of the near-field training speech extracted in step S2 into the backbone neural network, and take the output of its fully connected layer as the backbone deep features.
[0025] S4.2 Extracting the branch deep features: For the same input, the outputs of the first to (N-1)th convolutional modules in the backbone neural network are respectively input into the first to (N-1)th branch neural networks to obtain the first to (N-1)th branch deep features;
[0026] S4.3 Updating the parameters of the backbone neural network and the branch neural networks: Calculate the classification loss function and the distillation loss function based on the backbone deep features output by the backbone neural network and the branch deep features output by the first to (N-1)th branch neural networks. Update the parameters of the backbone neural network and the first to (N-1)th branch neural networks simultaneously using the backpropagation algorithm. The classification loss and distillation loss functions are defined as follows:
[0027] Classification loss function: Linear classifiers with the same structure are connected after the fully connected layers of the first to (N-1)th branch neural networks and of the backbone neural network, giving the first to (N-1)th linear classifiers and the Nth linear classifier, respectively. For any one of these linear classifiers c, let its input be a deep feature z of dimension d whose true label is speaker $y_i \in \{1, 2, \ldots, K\}$, where K is the number of speaker categories. Given the parameters of the linear classifier, the probability that z is classified as speaker $y_i$ is:
[0028] Where s and m are the scaling factor and the margin parameter, respectively, and $\theta_{y_i}$ is the angle between the parameters of linear classifier c that correspond to speaker $y_i$ and the input deep feature z; this angle is used to calculate the predicted probability distribution of each input deep feature. The role of s and m is to reduce the probability that the linear classifier assigns to the true label. To still obtain correct classification results, the first to (N-1)th branch neural networks and the backbone neural network must therefore generate more discriminative deep features, which makes the training process more effective.
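The effect of s and m can be illustrated with a small numerical sketch. The probability formula itself is not reproduced in this text, so the code below assumes one common margin-based form, the additive angular margin softmax, in which the true-class logit is s·cos(θ + m) and all other logits are s·cos(θ). The invention may place the margin differently; the function name and default values are hypothetical.

```python
import numpy as np

def margin_softmax_probs(z, weights, label, s=30.0, m=0.2):
    """z: (d,) deep feature; weights: (d, K) classifier parameters, one column per speaker."""
    z_unit = z / np.linalg.norm(z)
    w_unit = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    # Cosine of the angle between z and each speaker's parameter vector.
    cos = np.clip(w_unit.T @ z_unit, -1.0, 1.0)
    theta = np.arccos(cos)
    logits = s * cos
    # Assumed margin placement: penalize only the true class by adding m to its angle.
    logits[label] = s * np.cos(theta[label] + m)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

With m = 0 the function reduces to a plain scaled cosine softmax. With m > 0 and a feature already close to its true class, the true-class probability drops, so the network has to produce a tighter angle to reach the same confidence.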
[0029] The predicted probability distributions for the input deep features are calculated on each linear classifier, and the classification loss function is as follows:
[0030] Where CrossEntropy(·) represents the cross-entropy loss, $p_i$ is the predicted probability distribution output by the i-th linear classifier, and y is the true speaker category label.
[0031] The distillation loss function is the sum of two terms: the Kullback-Leibler divergence loss between the predicted probability distributions of the first to (N-1)th linear classifiers and the predicted probability distribution of the Nth linear classifier, and the weighted sum of the L2 distances between the first to (N-1)th branch deep features and the backbone deep features. The formula for calculating the distillation loss function is as follows:
[0032]
[0033] Where KL(·) represents the Kullback-Leibler divergence, $\|\cdot\|_2$ represents the L2 distance, $p_i$ and $p_N$ represent the predicted probability distributions of the i-th linear classifier and the N-th linear classifier, respectively, $F_i$ and $F_N$ represent the deep features of the i-th branch and the deep features of the backbone, respectively, and λ is the balancing hyperparameter. The overall loss function of the self-distillation pre-training is: $L_{total} = \beta \cdot L_{dis} + (1 - \beta) \cdot L_{cls}$
[0034] Here, β is a trade-off parameter used to balance the impact of the two losses on the network. The classification loss is calculated for the backbone neural network and for each convolutional module in the backbone neural network. Each convolutional module thus receives direct supervision from the true speaker category labels, which effectively alleviates the vanishing-gradient problem that backpropagation suffers from when a neural network has many layers. The parameters of each convolutional module are therefore trained more fully, generating more discriminative deep features. For the distillation loss, the backbone neural network can be divided into multiple parts from shallow to deep according to the order of the convolutional modules; as depth increases, the extracted deep features become more abstract and more discriminative. On the one hand, the Kullback-Leibler divergence loss between the predicted probability distributions of the first to (N-1)th linear classifiers and that of the Nth linear classifier passes the knowledge learned by the deepest layer of the backbone neural network to each convolutional module. It also prevents the backbone neural network from making overconfident predictions, thereby acting as a regularizer and effectively avoiding overfitting. On the other hand, the L2 distance between the first to (N-1)th branch deep features and the backbone deep features drives the branch deep features derived from each convolutional module to be as close as possible to the backbone deep features, making each branch deep feature more discriminative.
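The way the two losses combine can be sketched numerically as follows. The exact formulas are not reproduced in this text, so several details are assumptions and are marked as such in the comments: the classification loss is taken as the sum of cross-entropies over all N classifiers, the Kullback-Leibler term treats the Nth (backbone) classifier as the teacher, and the L2 term is not squared. Function names and default hyperparameter values are hypothetical.

```python
import numpy as np

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def self_distillation_loss(probs, feats, label, lam=0.1, beta=0.5):
    """probs, feats: lists of N distributions / deep features; the last entry is the backbone."""
    p_backbone, f_backbone = probs[-1], feats[-1]
    # Assumption: classification loss summed over all N linear classifiers.
    l_cls = sum(cross_entropy(p, label) for p in probs)
    # Assumption: backbone output acts as teacher in the KL term; plain (unsquared) L2 distance.
    l_dis = sum(
        kl_divergence(p_backbone, p_i) + lam * np.linalg.norm(f_i - f_backbone)
        for p_i, f_i in zip(probs[:-1], feats[:-1])
    )
    # Overall loss: L_total = beta * L_dis + (1 - beta) * L_cls
    total = beta * l_dis + (1.0 - beta) * l_cls
    return total, l_cls, l_dis
```

When every branch already reproduces the backbone's distribution and deep feature, the distillation term vanishes and only the weighted classification loss remains, which matches the role of β described above.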
[0035] Furthermore, the meta-learning method in step S5 is a training strategy that improves the generalization of deep neural networks. Conventional supervised learning updates the parameters of a deep neural network by calculating a classification loss on the training dataset. The network then tends to overfit the training dataset: it generates deep features that discriminate well between the categories seen in training but generalize poorly to new categories. The meta-learning method instead uses tasks as training units. Each task consists of a support set and a query set that together simulate the task scenario at test time. The support set contains K different categories with N samples in each category, and the query set contains the same K categories with M samples in each category. This training strategy is called the K-way, N-shot meta-learning strategy.
[0036] Furthermore, the process of step S5 is as follows:
[0037] S5.1 Constructing the meta-learning task: A K-way, N-shot meta-learning strategy is adopted. In each training iteration, K different speakers with N utterances each are drawn from the fine-tuning dataset as the support set, and M utterances from each of the same K speakers are drawn as the query set. The utterances in the query set are different from those in the support set, and the utterances in the support set and the query set come from different recording environments. The support-set utterances play the role of near-field registration speech, and the query-set utterances play the role of far-field test speech. Constructing support and query sets with different recording environments thus simulates the composition of the test sample pairs, so this meta-learning task lets the backbone neural network adapt better to the needs of real-world tasks.
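A single K-way, N-shot task with M query utterances per speaker could be drawn as in the sketch below. The data layout (a dictionary from speaker to recording environment to utterance identifiers), the environment labels, and the function name `sample_episode` are all hypothetical and chosen only for illustration.

```python
import random

def sample_episode(utterances, k_way, n_shot, m_query, support_env, query_env, rng):
    """utterances: dict speaker -> dict environment -> list of utterance ids."""
    # Keep only speakers with enough utterances in both recording environments.
    eligible = [
        spk for spk, envs in utterances.items()
        if len(envs.get(support_env, [])) >= n_shot
        and len(envs.get(query_env, [])) >= m_query
    ]
    speakers = rng.sample(eligible, k_way)
    # Support and query utterances come from different environments, so they never overlap.
    support = {spk: rng.sample(utterances[spk][support_env], n_shot) for spk in speakers}
    query = {spk: rng.sample(utterances[spk][query_env], m_query) for spk in speakers}
    return support, query
```

Calling the function once per training iteration with a seeded `random.Random` instance gives reproducible tasks in which the support set stands in for registration speech and the query set for test speech.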
[0038] S5.2 Calculating the central feature of each speaker in the support set: Input the log-Mel spectrum features of each speaker's utterances in the support set into the pre-trained backbone neural network to obtain the deep feature of each utterance. Then take the mean of the deep features of each speaker's utterances as that speaker's central feature, as shown in the following formula:
$$c_k = \frac{1}{|S_k|} \sum_{x \in S_k} x$$
[0039] where $S_k$ is the set of speech samples of speaker k in the support set, and x is the speaker deep feature output by the pre-trained backbone neural network;
[0040] S5.3 Update the backbone neural network parameters: Calculate the cosine distance between the deep features of each speaker's speech in the query set and the central features of each speaker in the support set. Based on the true labels corresponding to each speaker's speech in the query set, calculate the angular prototype loss, as shown in the following formula:
[0041]
[0042] Where M is the number of utterances in the query set, $c_k$ represents the central feature of speaker k in the support set, $x_j$ represents the deep feature of the j-th utterance in the query set, $c_j$ represents the central feature in the support set of the speaker corresponding to the true label of $x_j$, and w and b are the learnable scale factor and bias, respectively. During training, continuously reducing the loss value $L_{ap}$ reduces the distance between the deep features of each speaker's utterances in the query set and the central feature of the same speaker in the support set, increases the distance to the central features of different speakers in the support set, and aligns the deep features of speech recorded in different noise environments.
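Steps S5.2 and S5.3 can be sketched together as below. The loss formula is not reproduced in this text, so the code assumes the usual softmax form over scaled cosine similarities, with score w·cos(x_j, c_k) + b for query j and speaker k, averaged over the M query utterances. In training, w and b would be learnable; here they are fixed arguments, and the function name and default values are hypothetical.

```python
import numpy as np

def angular_prototype_loss(support, query, query_labels, w=10.0, b=-5.0):
    """support: (K, N, d) deep features; query: (M, d); query_labels: (M,) ints in [0, K)."""
    # S5.2: central feature of each speaker = mean of that speaker's support features.
    centers = support.mean(axis=1)                                   # (K, d)
    centers_unit = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
    # S5.3: scaled cosine similarity between every query feature and every center.
    scores = w * (query_unit @ centers_unit.T) + b                   # (M, K)
    scores = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Negative log-probability of the true speaker, averaged over the M queries.
    return float(-log_probs[np.arange(len(query_labels)), query_labels].mean())
```

The loss is small when each query feature points in the direction of its own speaker's central feature and large when it points toward another speaker's, which is the behavior described for $L_{ap}$ above.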
[0043] Furthermore, the process of step S6 is as follows:
[0044] S6.1 Generating test sample pairs: Combine the speech in the near field registration set of the evaluation speech dataset with the speech in the far field test set in pairs to generate test sample pairs, including positive sample pairs and negative sample pairs. In the positive sample pair, the two speech samples belong to the same speaker, and in the negative sample pair, the two speech samples belong to different speakers.
[0045] S6.2 Extracting deep features: Input the log-Mel spectrum features of the above test sample pairs into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pairs;
[0046] S6.3 Decision: Calculate the similarity between the deep features of the test samples, and determine whether the two voices in the test sample pair come from the same speaker based on a pre-set threshold.
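The decision in S6.3 can be sketched as follows. The text only speaks of a similarity and a preset threshold; cosine similarity and the threshold value 0.5 are assumptions made for this illustration, and the function names are hypothetical. In practice the threshold would be tuned on held-out data.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enroll_feature, test_feature, threshold=0.5):
    """Return (accepted, score) for one test sample pair of deep features."""
    score = cosine_similarity(enroll_feature, test_feature)
    # Same speaker if the similarity exceeds the preset threshold.
    return score > threshold, score
```

A pair whose deep features point in nearly the same direction is accepted as the same speaker, and a pair with nearly orthogonal deep features is rejected.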
[0047] The present invention has the following advantages and effects compared with the prior art:
[0048] (1) This invention uses a self-distillation method to pre-train a deep neural network. On the one hand, by adding the classification loss of intermediate layer features to the overall loss function of the deep neural network, the deep features generated by each intermediate layer of the deep neural network can be more class-discriminative, thereby improving the class discrimination of the deep features generated by the deep neural network. On the other hand, the distillation loss plays a regularization role, which can effectively avoid overfitting and improve the generalization of the deep features generated by the deep neural network. Applying self-distillation training to speaker identification based on deep neural networks effectively improves the performance of deep neural networks without increasing the training cost, enabling the deep neural network to generate more robust speaker deep features.
[0049] (2) This invention further uses a meta-learning method to fine-tune the pre-trained deep neural network, constructing a support set and a query set that contain speech samples recorded in different environments. Through the angular prototype loss function, the deep features of the same speaker in different recording environments become more compact in the feature space, and the deep features of different speakers move further apart. This effectively alleviates the performance degradation caused in practical applications by mismatched noise and reverberation conditions between registration and test speech. For example, in a smart home system, once a user has registered their voice at the terminal, the system can perform far-field recognition of that user's speech from any location in the home and at any distance from the terminal.
Description of the Drawings
[0050] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:
[0051] Figure 1 is a flowchart of the far-field speaker verification method in an embodiment of the present invention;
[0052] Figure 2 is a schematic diagram of the ResNet-34 structure in Embodiment 1 of the present invention;
[0053] Figure 3 is a schematic diagram of the residual convolution module structure in Embodiment 1 of the present invention;
[0054] Figure 4 is a schematic diagram of the branch network structure in Embodiment 1 of the present invention;
[0055] Figure 5 is a schematic diagram of self-distillation learning pre-training in Embodiment 1 of the present invention;
[0056] Figure 6 is a schematic diagram of the ECAPA-TDNN structure in Embodiment 2 of the present invention;
[0057] Figure 7 is a schematic diagram of the squeeze-and-excitation residual convolution module structure in Embodiment 2 of the present invention;
[0058] Figure 8 is a schematic diagram of the branch network structure in Embodiment 2 of the present invention;
[0059] Figure 9 is a schematic diagram of self-distillation learning pre-training in Embodiment 2 of the present invention;
[0060] Figure 10 is a flowchart of meta-learning fine-tuning in an embodiment of the present invention;
[0061] Figure 11 is a flowchart of the speaker verification system in an embodiment of the present invention.
Detailed Implementation
[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0063] Embodiment 1
[0064] This embodiment discloses a far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning. As shown in the flowchart of Figure 1, the method includes the following steps:
[0065] S1. Establish the speech dataset: Divide the speech dataset into near-field training speech of the pre-training dataset, far-field training speech of the fine-tuning dataset, near-field registration speech of the evaluation dataset, and far-field test speech of the evaluation dataset.
[0066] In this embodiment, step S1 is specifically as follows:
[0067] The training and development sets of the open-source English speech dataset VoxCeleb1 are used as the near-field training speech of the pre-training dataset, the training set of the open-source Chinese dataset FFSVC2020 is used as the far-field training speech of the fine-tuning dataset, and the test set of FFSVC2020 is used as the near-field registration speech and far-field test speech of the evaluation dataset.
[0068] S2. Extracting Log-Mel Spectrum Features: Extracting log-Mel spectrum features from the near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, the near-field registration speech of the evaluation dataset, and the far-field test speech, respectively. The extraction process includes pre-emphasis, framing, windowing, Fourier transform, Mel filtering, logarithmic operation, and normalization.
[0069] In this embodiment, step S2 is specifically as follows:
[0070] S2.1 Pre-emphasis: A first-order high-pass filter is used to pre-emphasize the near-field training speech of the input pre-training dataset, the far-field training speech of the fine-tuning dataset, the near-field registration speech of the evaluation dataset, and the far-field test speech. The transfer function of the filter is $H(z) = 1 - \alpha z^{-1}$, with α set to 0.95;
[0071] S2.2 Frame Segmentation: The pre-emphasized near-field training speech, far-field training speech, near-field registration speech, and far-field test speech are segmented into frames to obtain short-time speech frames, with a frame length of 25ms and a frame shift of 10ms during framing.
[0072] S2.3 Windowing: Windowing is performed on short-time speech frames: Hamming window is selected as the window function, and it is multiplied with each speech frame to obtain the windowed short-time speech frame;
[0073] S2.4 Extracting the logarithmic Mel spectrum: Perform a discrete Fourier transform on the windowed short-time speech frame to obtain the corresponding linear spectrum, then use a Mel filter to convert the linear spectrum into a Mel spectrum, and finally take the logarithm of the Mel spectrum to obtain the logarithmic Mel spectrum.
[0074] S2.5 Normalization: The log-Mel spectrum features are processed using the local cepstral mean normalization method to obtain normalized features.
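As a quick check of the framing parameters in this embodiment, the sketch below counts how many 25 ms frames with a 10 ms shift fit into an utterance. The 16 kHz sampling rate and the helper name `frame_count` are assumptions; the text does not state a sampling rate.

```python
def frame_count(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Number of complete frames obtained without padding the signal."""
    frame_len = sample_rate * frame_ms // 1000     # 400 samples at 16 kHz
    frame_shift = sample_rate * shift_ms // 1000   # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift
```

Under these assumptions, a 3-second utterance (48000 samples) yields 298 frames, so the log-Mel feature matrix for that utterance has 298 rows.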
[0075] S3. Initialize the self-distillation learning framework: The self-distillation learning framework includes a backbone neural network and branch neural networks used only in the self-distillation pre-training stage;
[0076] In this embodiment, step S3 is as follows:
[0077] S3.1 Constructing the backbone neural network: In this embodiment, the backbone neural network is a residual convolutional neural network containing 34 convolutional layers, hereinafter abbreviated as ResNet-34; its structure is shown in Figure 2. ResNet-34 includes a sequentially connected 2D convolutional input layer, a first residual convolutional module, a second residual convolutional module, a third residual convolutional module, a fourth residual convolutional module, a 2D convolutional dimensionality reduction layer, an attention statistical pooling layer, and a fully connected layer. The output of the fully connected layer is the final speaker deep feature. The first residual convolutional module consists of 3 basic convolutional modules, the second of 4, the third of 6, and the fourth of 3. The structure of the basic convolutional module is shown in Figure 3;
[0078] S3.2 Constructing the branch neural networks: In this embodiment, each branch neural network consists of a sequentially connected two-dimensional bottleneck module, a statistical pooling layer, and a fully connected layer, as shown in Figure 4. The branch neural networks are used only in the self-distillation pre-training stage, to calculate the self-distillation loss. Their inputs are the outputs of the first to third residual convolutional modules in the backbone neural network ResNet-34: the branch neural network that takes the output of the first residual convolutional module as its input is called the first branch neural network, the one that takes the output of the second residual convolutional module as its input is called the second branch neural network, and the one that takes the output of the third residual convolutional module as its input is called the third branch neural network.
[0079] S4. Self-distillation pre-trained backbone neural network: The log-Mel spectrum features of the near-field training speech are input into the backbone neural network in the self-distillation learning framework. By using the branch neural network and optimizing the classification loss function and the self-distillation loss function, the output of the last layer of the backbone neural network is used as additional supervision information to guide the training of the intermediate layers of the backbone neural network. The pre-trained backbone neural network is obtained through iterative updates.
[0080] In this embodiment, step S4 is specifically as follows:
[0081] S4.1 Extracting backbone depth features: Input the near-field training speech log Mel spectrum features extracted in step S1 into the backbone neural network ResNet-34, and obtain the output of the fully connected layer of the backbone neural network ResNet-34 as the backbone depth features.
[0082] S4.2 Extracting branch depth features: For the same input, the outputs of the first to third convolutional modules in the backbone neural network ResNet-34 are respectively input into the first to third branch neural networks to obtain the first branch depth features to the third branch depth features;
[0083] S4.3 Updating the parameters of the backbone neural network ResNet-34 and the branch neural networks: As shown in Figure 5, the classification loss function and the distillation loss function are calculated based on the backbone deep features output by the ResNet-34 backbone neural network and the first to third branch deep features output by the first to third branch neural networks. The parameters of the ResNet-34 backbone neural network and the first to third branch neural networks are updated simultaneously through the backpropagation algorithm. The classification loss and distillation loss functions are defined as follows:
[0084] Classification loss function: A linear classifier with the same structure is connected after the fully connected layer of each of the first to third branch neural networks and of the backbone neural network ResNet-34, giving the first to third linear classifiers and the fourth linear classifier, respectively. For any one linear classifier c, let its input be a deep feature $z$ of dimension d whose true label is speaker $y_i \in \{1, 2, \ldots, K\}$, where K is the number of speaker categories. Given the parameters of the linear classifier, the probability that $z$ is classified as speaker $y_i$ is:
[0085] Where s and m are the scaling factor and the margin parameter, respectively; in this embodiment, s is 64 and m is 0.4. The angle in the formula is the angle between the parameters corresponding to speaker $y_i$ in linear classifier c and the input deep feature $z$. The formula is used to calculate the predicted probability distribution of each input deep feature.
[0086] The predicted probability distribution of the input deep features is calculated for each linear classifier, and the classification loss function is as follows:
$$L_{cls} = \sum_{i=1}^{4} \mathrm{CrossEntropy}(p_i, y)$$
[0087] Where CrossEntropy(·) represents the cross-entropy loss, $p_i$ is the predicted probability distribution output by the i-th linear classifier, and $y$ is the true speaker category label.
[0088] The distillation loss function is the sum of two terms: the Kullback-Leibler divergence loss between the predicted probability distributions of the first to third linear classifiers and that of the fourth linear classifier, and the weighted sum of the L2 distances between the first to third branch deep features and the backbone deep features. The distillation loss function is calculated as follows:
[0089] $$L_{dis} = \sum_{i=1}^{3} \mathrm{KL}(p_i, p_4) + \lambda \sum_{i=1}^{3} \lVert F_i - F_4 \rVert_2$$
[0090] Where KL(·) represents the Kullback-Leibler divergence, $\lVert \cdot \rVert_2$ represents the L2 distance, $p_i$ and $p_4$ represent the predicted probability distributions of the i-th and fourth linear classifiers, respectively, $F_i$ and $F_4$ represent the deep features of the i-th branch and of the backbone, respectively, and $\lambda$ is the balancing hyperparameter. The overall loss function of self-distillation pre-training is: $L_{total} = \beta \cdot L_{dis} + (1 - \beta) \cdot L_{cls}$
[0091] Here, β is a trade-off parameter used to balance the impact of the two losses on the network.
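The way the classification and distillation terms combine can be sketched in plain Python for a single training sample. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the KL divergence is taken from each branch distribution to the backbone distribution, the L2 distance is not squared, and the values of `lam` and `beta` are placeholders because the text gives no numbers for them.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against an integer label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_distillation_loss(branch_probs, backbone_prob, branch_feats,
                           backbone_feat, label, lam=1.0, beta=0.5):
    """Total loss = beta * L_dis + (1 - beta) * L_cls for one sample."""
    # Classification loss: every classifier (branches and backbone) sees the label.
    l_cls = sum(cross_entropy(p, label) for p in branch_probs + [backbone_prob])
    # Distillation loss: branches are pulled toward the backbone's outputs.
    l_kl = sum(kl_divergence(p, backbone_prob) for p in branch_probs)
    l_feat = sum(l2_distance(f, backbone_feat) for f in branch_feats)
    l_dis = l_kl + lam * l_feat
    return beta * l_dis + (1 - beta) * l_cls
```

When every branch already matches the backbone, the distillation term vanishes and only the weighted classification loss remains, which is a quick sanity check for any implementation of this objective.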
[0092] S5. Meta-learning fine-tuning of the backbone neural network: The log-Mel spectrum features of the far-field training speech are input into the pre-trained backbone neural network. The network parameters of the pre-trained backbone neural network are fine-tuned through meta-learning methods, and the backbone neural network is iteratively updated until convergence.
[0093] In this embodiment, step S5 is specifically as follows:
[0094] S5.1 Constructing the meta-learning task: In this embodiment, a 20-way, 1-shot meta-learning strategy is adopted. In each training iteration, 20 different speakers are drawn from the fine-tuning dataset, and 1 utterance from each speaker forms the support set; for the same 20 speakers, 2 utterances from each speaker form the query set. For each speaker, the utterances in the query set differ from those in the support set, and the support-set and query-set utterances come from different recording environments.
[0095] S5.2 Calculating the central feature of each speaker in the support set: Input the log-Mel spectrum features of each speaker's speech in the support set into the pre-trained backbone neural network ResNet-34 to obtain the deep features of each speaker's speech. Then, calculate the mean of the deep features of each speaker's speech as the central feature of that speaker, as shown in the following formula:
$$c_k = \frac{1}{|S_k|} \sum_{x \in S_k} x$$
[0096] Where $S_k$ is the set of speech samples of speaker k in the support set, and $x$ is the speaker deep feature output by the pre-trained backbone neural network ResNet-34;
[0097] S5.3 Updating the parameters of the backbone neural network ResNet-34: As shown in Figure 10, the cosine distance between the deep features of each speaker's speech in the query set and the central features of each speaker in the support set is calculated. Based on the ground truth labels corresponding to each speaker's speech in the query set, the angular prototype loss is calculated using the following formula:
$$L_{ap} = -\frac{1}{M} \sum_{j=1}^{M} \log \frac{\exp\left(w \cdot \cos(x_j, c_{y_j}) + b\right)}{\sum_{k} \exp\left(w \cdot \cos(x_j, c_k) + b\right)}$$
[0098] Where M is the total number of speech samples in the query set, $c_k$ represents the central feature of speaker k in the support set (the sum over k runs over all speakers in the support set), $x_j$ represents the deep feature of the j-th speech sample in the query set, $c_{y_j}$ represents the speaker central feature in the support set that corresponds to the true label of $x_j$, and w and b are the learnable scale factor and bias, respectively. During training, continuously reducing the loss value $L_{ap}$ reduces the distance between the deep features of each speaker's speech in the query set and the central features of the same speaker in the support set, increases the distance between the deep features of different speakers' speech in the support set, and aligns the deep features of speech from different noise environments.
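Steps S5.2 and S5.3 can be illustrated with a small sketch that computes speaker centers from a support set and then an angular prototype loss over a query set. It assumes the loss is a softmax cross-entropy over scaled cosine similarities, as the definitions of w and b suggest; the starting values `w=10.0` and `b=-5.0`, the dictionary layout, and all function names are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def speaker_centers(support):
    """Map speaker id -> mean of that speaker's support-set deep features."""
    return {
        spk: [sum(column) / len(feats) for column in zip(*feats)]
        for spk, feats in support.items()
    }


def angular_prototype_loss(queries, centers, w=10.0, b=-5.0):
    """Mean negative log-softmax of the true speaker's scaled cosine score.

    queries is a list of (deep feature, true speaker id) pairs.
    """
    total = 0.0
    for feature, true_spk in queries:
        scores = {spk: w * cosine(feature, c) + b for spk, c in centers.items()}
        peak = max(scores.values())  # subtract the peak for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
        total += log_sum - scores[true_spk]
    return total / len(queries)


support = {"spk_a": [[1.0, 0.0]], "spk_b": [[0.0, 1.0]]}
centers = speaker_centers(support)
loss = angular_prototype_loss([([2.0, 0.0], "spk_a"), ([0.0, 3.0], "spk_b")], centers)
```

In the 1-shot setting each center is simply the single support feature; the mean only matters when more than one support utterance per speaker is used.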
[0099] S6. Speaker Confirmation: Near-field registered speech and far-field test speech are combined into test sample pairs, including positive sample pairs and negative sample pairs. The two speech samples in a positive sample pair belong to the same speaker, while the two speech samples in a negative sample pair belong to different speakers. The log-Mel spectrum features of the test sample pairs are input into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pairs. The similarity between the deep features of the test sample pairs is calculated. If the similarity between the two is greater than a preset threshold, the two speech samples are considered to come from the same speaker; otherwise, they are not.
[0100] In this embodiment, step S6 is specifically as follows:
[0101] S6.1 Generating test sample pairs: Combine the speech in the near field registration set of the evaluation speech dataset with the speech in the far field test set in pairs to generate test sample pairs, including positive sample pairs and negative sample pairs. In the positive sample pair, the two speech samples belong to the same speaker, and in the negative sample pair, the two speech samples belong to different speakers.
[0102] S6.2 Extracting deep features: Input the log-Mel spectrum features of the above test sample pairs into the pre-trained and fine-tuned backbone neural network ResNet-34 to obtain the deep features of the test sample pairs;
[0103] S6.3 Decision: Calculate the similarity between the deep features of the test samples, and determine whether the two voices in the test sample pair come from the same speaker based on a pre-set threshold.
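The decision in S6.3 reduces to comparing one similarity score with a threshold. The text does not name the similarity measure; the sketch below assumes cosine similarity, and the threshold value used in the example call is a placeholder, since in practice it would be tuned on held-out trials.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_speaker(enroll_feature, test_feature, threshold):
    """Accept the trial when the similarity exceeds the preset threshold."""
    return cosine_similarity(enroll_feature, test_feature) > threshold


# Hypothetical deep features for a near-field enrollment and a far-field test.
decision = same_speaker([1.0, 0.0], [0.9, 0.1], threshold=0.5)
```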
[0104] Through the above embodiments, a comparison was made between a ResNet-34 backbone neural network trained only by self-distillation, a ResNet-34 backbone neural network trained by self-distillation and fine-tuned by meta-learning, and a ResNet-34 backbone network trained using conventional supervised learning. Performance tests were conducted on the Voxceleb1 test set (Vox1-test) and the FFSVC2020 test set (FFSVC-test), and the results are as follows:
[0105] Table 1. Comparison of ResNet-34 trained by self-distillation pre-training and meta-learning fine-tuning with ResNet-34 trained by conventional supervised learning.
[0106]
[0107] As shown in the table, the ResNet-34 backbone neural network that was pre-trained by self-distillation and fine-tuned by meta-learning achieved an equal error rate on both test sets that was superior to that of the ResNet-34 backbone network trained using conventional supervised learning.
[0108] Example 2
[0109] This embodiment discloses a far-field speaker recognition method based on self-distillation learning pre-training and meta-learning fine-tuning. A flowchart of the method is shown in Figure 1, and the specific steps are as follows:
[0110] S1. Establish the speech dataset: Divide the speech dataset into near-field training speech of the pre-training dataset, far-field training speech of the fine-tuning dataset, near-field registration speech of the evaluation dataset, and far-field test speech of the evaluation dataset.
[0111] S2. Extracting Log-Mel Spectrum Features: Extracting log-Mel spectrum features from the near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, the near-field registration speech of the evaluation dataset, and the far-field test speech, respectively. The extraction process includes pre-emphasis, framing, windowing, Fourier transform, Mel filtering, logarithmic operation, and normalization.
[0112] In this embodiment, step S2 is specifically as follows:
[0113] S2.1 Pre-emphasis: A first-order high-pass filter is used to pre-emphasize the input near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, and the near-field registration speech and far-field test speech of the evaluation dataset. The transfer function of the filter is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is set to 0.97;
[0114] S2.2 Frame Segmentation: The pre-emphasized near-field training speech, far-field training speech, near-field registration speech, and far-field test speech are segmented into frames to obtain short-time speech frames, with a frame length of 20ms and a frame shift of 8ms during framing.
[0115] S2.3 Adding a window: Refer to step S2.3 in Example 1;
[0116] S2.4 Extracting the log-Mel spectrum: Refer to step S2.4 in Example 1;
[0117] S2.5, Normalization: Refer to step S2.5 in Example 1.
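Steps S2.1 to S2.3 can be sketched in a few lines of plain Python. The pre-emphasis coefficient 0.97, the 20 ms frame length, the 8 ms frame shift, and the Hamming window follow the text; the 16 kHz sampling rate and the synthetic sine input are assumptions made only for illustration, and trailing samples that do not fill a frame are dropped here.

```python
import math


def pre_emphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1."""
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


def frame_signal(signal, sample_rate=16000, frame_ms=20, shift_ms=8):
    """Split a signal into overlapping short-time frames."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


def hamming_window(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


# One second of a 440 Hz tone at the assumed 16 kHz sampling rate.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(pre_emphasize(signal))
window = hamming_window(len(frames[0]))
windowed = [[s * w for s, w in zip(frame, window)] for frame in frames]
```

Each windowed frame would then go through the discrete Fourier transform, Mel filtering, and logarithm of step S2.4.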
[0118] S3. Initialize the self-distillation learning framework: The self-distillation learning framework includes a backbone neural network and branch neural networks used only in the self-distillation pre-training stage;
[0119] In this embodiment, step S3 is as follows:
[0120] S3.1 Constructing the backbone neural network: In this embodiment, the backbone neural network is a time-delay neural network based on emphasized channel attention, propagation, and aggregation, abbreviated below as ECAPA-TDNN; its structure is shown in Figure 6. ECAPA-TDNN includes a sequentially connected one-dimensional convolutional input layer, a first squeeze-excitation residual convolutional module, a second squeeze-excitation residual convolutional module, a third squeeze-excitation residual convolutional module, a one-dimensional convolutional dimensionality reduction layer, an attention statistical pooling layer, and a fully connected layer. The output of the fully connected layer is the final speaker deep feature. The structure of the squeeze-excitation residual convolutional module is shown in Figure 7;
[0121] S3.2 Constructing a branch neural network: In this embodiment, the branch neural network consists of a sequentially connected one-dimensional bottleneck module, a statistical pooling layer, and a fully connected layer, as shown in Figure 7; the branch neural network is used only in the self-distillation pre-training stage to calculate the self-distillation loss; the inputs of the branch neural networks are the outputs of the first and second residual convolutional modules in the backbone neural network ECAPA-TDNN. The branch neural network that takes the output of the first residual convolutional module as its input is called the first branch neural network, and the branch neural network that takes the output of the second residual convolutional module as its input is called the second branch neural network.
[0122] S4. Self-distillation pre-trained backbone neural network: The log-Mel spectrum features of the near-field training speech are input into the backbone neural network in the self-distillation learning framework. By using the branch neural network and optimizing the classification loss function and the self-distillation loss function, the output of the last layer of the backbone neural network is used as additional supervision information to guide the training of the intermediate layers of the backbone neural network. The pre-trained backbone neural network is obtained through iterative updates.
[0123] In this embodiment, step S4 is specifically as follows:
[0124] S4.1 Extracting backbone depth features: Input the near-field training speech log Mel spectrum features extracted in step S1 into the backbone neural network ECAPA-TDNN, and obtain the fully connected layer output of the backbone neural network ECAPA-TDNN as the backbone depth features.
[0125] S4.2 Extracting branch depth features: For the same input, the outputs of the first and second convolutional modules in the backbone neural network ECAPA-TDNN are respectively input into the first and second branch neural networks to obtain the first branch depth features to the second branch depth features;
[0126] S4.3 Updating the parameters of the backbone neural network ECAPA-TDNN and the branch neural networks: As shown in Figure 8, the classification loss function and the distillation loss function are calculated based on the backbone deep features output by the ECAPA-TDNN backbone neural network and the first and second branch deep features output by the first and second branch neural networks. The parameters of the ECAPA-TDNN backbone neural network and the first and second branch neural networks are updated simultaneously through the backpropagation algorithm. The classification loss and distillation loss functions are defined as follows:
[0127] Classification loss function: A linear classifier with the same structure is connected after the fully connected layer of each of the first and second branch neural networks and of the backbone neural network ECAPA-TDNN, giving the first and second linear classifiers and the third linear classifier, respectively. For any one linear classifier c, let its input be a deep feature $z$ of dimension d whose true label is speaker $y_i \in \{1, 2, \ldots, K\}$, where K is the number of speaker categories. Given the parameters of the linear classifier, the probability that $z$ is classified as speaker $y_i$ is:
[0128]
[0129] Where s and m are the scaling factor and the margin parameter, respectively; in this embodiment, s is 64 and m is 0.2. The angle in the formula is the angle between the parameters corresponding to speaker $y_i$ in linear classifier c and the input deep feature $z$. The formula is used to calculate the predicted probability distribution of each input deep feature.
[0130] The predicted probability distribution of the input deep features is calculated for each linear classifier, and the classification loss function is as follows:
$$L_{cls} = \sum_{i=1}^{3} \mathrm{CrossEntropy}(p_i, y)$$
[0131] Where CrossEntropy(·) represents the cross-entropy loss, $p_i$ is the predicted probability distribution output by the i-th linear classifier, and $y$ is the true speaker category label.
[0132] The distillation loss function is the sum of two terms: the Kullback-Leibler divergence loss between the predicted probability distributions of the first and second linear classifiers and that of the third linear classifier, and the weighted sum of the L2 distances between the first and second branch deep features and the backbone deep features. The distillation loss function is calculated as follows:
[0133] $$L_{dis} = \sum_{i=1}^{2} \mathrm{KL}(p_i, p_3) + \lambda \sum_{i=1}^{2} \lVert F_i - F_3 \rVert_2$$
[0134] Where KL(·) represents the Kullback-Leibler divergence, $\lVert \cdot \rVert_2$ represents the L2 distance, $p_i$ and $p_3$ represent the predicted probability distributions of the i-th and third linear classifiers, respectively, $F_i$ and $F_3$ represent the deep features of the i-th branch and of the backbone, respectively, and $\lambda$ is the balancing hyperparameter. The overall loss function of self-distillation pre-training is: $L_{total} = \beta \cdot L_{dis} + (1 - \beta) \cdot L_{cls}$
[0135] Here, β is a trade-off parameter used to balance the impact of the two losses on the network.
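The role of the scaling factor s and the margin m in the classification loss can be illustrated with a margin-based softmax. The text gives s = 64 and m = 0.2 for this embodiment and describes an angle between the class parameters and the input feature, but the exact margin form is not stated here; the sketch below assumes an additive angular margin, cos(θ + m), applied to the target class only. The function names and toy weights are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_softmax_prob(z, class_weights, target, s=64.0, m=0.2):
    """Probability assigned to the target speaker under an assumed
    additive angular margin: the target logit is s * cos(theta + m)."""
    logits = []
    for k, weight in enumerate(class_weights):
        cos_theta = max(-1.0, min(1.0, cosine(z, weight)))  # guard acos domain
        if k == target:
            logits.append(s * math.cos(math.acos(cos_theta) + m))
        else:
            logits.append(s * cos_theta)
    peak = max(logits)  # subtract the peak for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    return exps[target] / sum(exps)


# Two speaker classes with orthogonal parameter vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
prob = margin_softmax_prob([1.0, 0.2], weights, target=0)
```

The margin lowers the target-class score during training, so a feature must lie closer in angle to its own class vector than to any other before the loss becomes small.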
[0136] S5. Meta-learning fine-tuning of the backbone neural network: The log-Mel spectrum features of the far-field training speech are input into the pre-trained backbone neural network. The network parameters of the pre-trained backbone neural network are fine-tuned through meta-learning methods, and the backbone neural network is iteratively updated until convergence.
[0137] In this embodiment, step S5 is specifically as follows:
[0138] S5.1 Constructing the meta-learning task: In this embodiment, a 30-way, 1-shot meta-learning strategy is adopted. In each training iteration, 30 different speakers are drawn from the fine-tuning dataset, and 1 utterance from each speaker forms the support set; for the same 30 speakers, 2 utterances from each speaker form the query set. For each speaker, the utterances in the query set differ from those in the support set, and the support-set and query-set utterances come from different recording environments.
[0139] S5.2 Calculating the central feature of each speaker in the support set: Input the log-Mel spectrum features of each speaker's speech in the support set into the pre-trained backbone neural network ECAPA-TDNN to obtain the deep features of each speaker's speech. Then, calculate the mean of the deep features of each speaker's speech as the central feature of that speaker, as shown in the following formula:
$$c_k = \frac{1}{|S_k|} \sum_{x \in S_k} x$$
[0140] Where $S_k$ is the set of speech samples of speaker k in the support set, and $x$ is the speaker deep feature output by the pre-trained backbone neural network ECAPA-TDNN;
[0141] S5.3 Updating the parameters of the backbone neural network ECAPA-TDNN: As shown in Figure 10, the cosine distance between the deep features of each speaker's speech in the query set and the central features of each speaker in the support set is calculated. Based on the ground truth labels corresponding to each speaker's speech in the query set, the angular prototype loss is calculated using the following formula:
$$L_{ap} = -\frac{1}{M} \sum_{j=1}^{M} \log \frac{\exp\left(w \cdot \cos(x_j, c_{y_j}) + b\right)}{\sum_{k} \exp\left(w \cdot \cos(x_j, c_k) + b\right)}$$
[0142] Where M is the total number of speech samples in the query set, $c_k$ represents the central feature of speaker k in the support set (the sum over k runs over all speakers in the support set), $x_j$ represents the deep feature of the j-th speech sample in the query set, $c_{y_j}$ represents the speaker central feature in the support set that corresponds to the true label of $x_j$, and w and b are the learnable scale factor and bias, respectively. During training, continuously reducing the loss value $L_{ap}$ reduces the distance between the deep features of each speaker's speech in the query set and the central features of the same speaker in the support set, increases the distance between the deep features of different speakers' speech in the support set, and aligns the deep features of speech from different noise environments.
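The episode construction of step S5.1 can be sketched as a sampler that draws a support set and a query set for the same speakers from different recording environments. The 30-way, 1-shot setting with 2 query utterances per speaker follows the text; the corpus layout (speaker, then environment, then utterance ids), the environment names, and the function name `sample_episode` are hypothetical.

```python
import random


def sample_episode(corpus, n_way=30, n_shot=1, n_query=2, rng=None):
    """Draw one meta-learning task from corpus[speaker][environment] -> utterance ids.

    Support and query utterances of a speaker come from two different
    environments, so they never overlap.
    """
    rng = rng or random.Random()
    speakers = rng.sample(sorted(corpus), n_way)
    support, query = [], []
    for spk in speakers:
        support_env, query_env = rng.sample(sorted(corpus[spk]), 2)
        support += [(spk, support_env, utt)
                    for utt in rng.sample(corpus[spk][support_env], n_shot)]
        query += [(spk, query_env, utt)
                  for utt in rng.sample(corpus[spk][query_env], n_query)]
    return support, query


# Toy corpus: 40 speakers, 2 recording environments, 3 utterances each.
corpus = {
    f"spk{i}": {env: [f"spk{i}_{env}_{j}" for j in range(3)]
                for env in ("room_a", "room_b")}
    for i in range(40)
}
support, query = sample_episode(corpus, rng=random.Random(0))
```

Drawing a fresh set of speakers and environments at every iteration is what lets each task imitate the enrollment-versus-test mismatch seen at evaluation time.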
[0143] S6. Speaker Confirmation: Near-field registered speech and far-field test speech are combined into test sample pairs, including positive sample pairs and negative sample pairs. The two speech samples in a positive sample pair belong to the same speaker, while the two speech samples in a negative sample pair belong to different speakers. The log-Mel spectrum features of the test sample pairs are input into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pairs. The similarity between the deep features of the test sample pairs is calculated. If the similarity between the two is greater than a preset threshold, the two speech samples are considered to come from the same speaker; otherwise, they are not.
[0144] In this embodiment, step S6 is specifically as follows:
[0145] S6.1 Generate test sample pairs: Refer to step S6.1 in Example 1;
[0146] S6.2 Extracting deep features: Input the log-Mel spectrum features of the above test sample pairs into the pre-trained and fine-tuned backbone neural network ECAPA-TDNN to obtain the deep features of the test sample pairs.
[0147] S6.3, Judgment: Refer to step S6.3 in Example 1.
[0148] Through the above embodiments, the ECAPA-TDNN backbone neural network, which was pre-trained by self-distillation and fine-tuned by meta-learning, was compared with the ECAPA-TDNN backbone network trained using conventional supervised learning. The performance of the three networks was tested using the same test set, and the results are as follows:
[0149] Table 2. Comparison of ECAPA-TDNN trained by self-distillation pre-training and meta-learning fine-tuning with ECAPA-TDNN trained by conventional supervised learning.
[0150]
[0151] As shown in the table, the ECAPA-TDNN backbone neural network, which was pre-trained by self-distillation and fine-tuned by meta-learning, achieved an equal error rate on both test sets that was superior to the ECAPA-TDNN backbone network trained using conventional supervised learning.
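The equal error rate used in the comparison is the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of how it can be estimated from trial scores follows; it sweeps the observed scores as candidate thresholds and accepts a trial when its score exceeds the threshold. The score lists are made-up illustrations, not results from the tables.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by finding the threshold where FAR and FRR are closest."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker pair scored at or below the threshold.
        frr = sum(s <= threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker pair scored above the threshold.
        far = sum(s > threshold for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical similarity scores for positive and negative sample pairs.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

A lower value means fewer errors at the balanced operating point, which is the sense in which one network's equal error rate is superior to another's.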
[0152] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.
Claims
1. A far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning, characterized in that, The far-field speaker confirmation method includes the following steps: S1. Establish the speech dataset: Divide the speech dataset into near-field training speech of the pre-training dataset, far-field training speech of the fine-tuning dataset, near-field registration speech of the evaluation dataset, and far-field test speech of the evaluation dataset. S2. Extracting Log-Mel Spectrum Features: Extracting log-Mel spectrum features from the near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, the near-field registration speech of the evaluation dataset, and the far-field test speech, respectively. The extraction process includes pre-emphasis, framing, windowing, Fourier transform, Mel filtering, logarithmic operation, and normalization. S3. Construct and initialize the self-distillation learning framework: The self-distillation learning framework includes a backbone neural network and branch neural networks used only for the self-distillation pre-training stage; S4. Self-distillation pre-trained backbone neural network: The log-Mel spectrum features of the near-field training speech are input into the backbone neural network in the self-distillation learning framework. By using the branch neural network and optimizing the classification loss function and the self-distillation loss function, the output of the last layer of the backbone neural network is used as additional supervision information to guide the training of the intermediate layers of the backbone neural network. The pre-trained backbone neural network is obtained through iterative updates. S5. Meta-learning fine-tuning of the backbone neural network: The log-Mel spectrum features of the far-field training speech are input into the pre-trained backbone neural network. 
The network parameters of the pre-trained backbone neural network are fine-tuned through meta-learning methods, and the backbone neural network is iteratively updated until convergence. S6. Speaker Confirmation: The near-field registered speech and far-field test speech from the evaluation dataset are combined into test sample pairs, including positive sample pairs and negative sample pairs. In a positive sample pair, the two speech samples belong to the same speaker, while in a negative sample pair, the two speech samples belong to different speakers. The log-Mel spectrum features of the test sample pairs are input into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pairs. The similarity between the deep features of the test sample pairs is calculated. If the similarity between the two is greater than a preset threshold, the near-field registered speech and far-field test speech are considered to come from the same speaker; otherwise, they are not.
2. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 1, characterized in that the process of step S2 is as follows: S2.1 Pre-emphasis: A first-order high-pass filter is used to pre-emphasize the input near-field training speech of the pre-training dataset, the far-field training speech of the fine-tuning dataset, and the near-field registration speech and far-field test speech of the evaluation dataset, and the transfer function of the filter is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the pre-emphasis coefficient; S2.2 Frame segmentation: The pre-emphasized near-field training speech, far-field training speech, near-field registration speech, and far-field test speech are segmented into frames to obtain short-time speech frames; S2.3 Windowing: A Hamming window is selected as the window function and multiplied with each short-time speech frame to obtain the windowed short-time speech frame; S2.4 Extracting the log-Mel spectrum: A discrete Fourier transform is performed on the windowed short-time speech frame to obtain the corresponding linear spectrum, a Mel filter then converts the linear spectrum into a Mel spectrum, and finally the logarithm of the Mel spectrum is taken to obtain the log-Mel spectrum; S2.5 Normalization: The log-Mel spectrum features are processed using the local cepstral mean normalization method to obtain normalized features.
3. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 1, characterized in that the process of step S3 is as follows: S3.1 Constructing the backbone neural network: The backbone neural network includes a sequentially connected convolutional input layer, a first convolutional module, a second convolutional module, ..., an N-th convolutional module, a convolutional dimensionality reduction layer, an attention statistical pooling layer, and a fully connected layer. The output of the fully connected layer is the final speaker deep feature. The number N of convolutional modules and the specific structure of the convolutional modules are determined based on the selected backbone neural network; S3.2 Constructing a branch neural network: The branch neural network includes a bottleneck module, a statistical pooling layer, and a fully connected layer connected in sequence; the branch neural network is used only in the self-distillation pre-training stage to calculate the self-distillation loss, and the inputs of the branch neural networks are the outputs of the first to (N-1)-th convolutional modules in the backbone neural network. The branch neural network that takes the output of the first convolutional module as input is called the first branch neural network; the branch neural network that takes the output of the second convolutional module as input is called the second branch neural network; and so on, the branch neural network that takes the output of the (N-1)-th convolutional module as input is called the (N-1)-th branch neural network.
4. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 1, characterized in that the process of step S4 is as follows: S4.1 Extracting the backbone deep features: Input the log-Mel spectrum features of the near-field training speech extracted in step S1 into the backbone neural network, and take the output of the fully connected layer of the backbone neural network as the backbone deep features. S4.2 Extracting the branch deep features: For the same input, the outputs of the first to (N-1)-th convolutional modules in the backbone neural network are respectively input into the first to (N-1)-th branch neural networks to obtain the first to (N-1)-th branch deep features; S4.3 Updating the parameters of the backbone neural network and the branch neural networks: The classification loss function and the distillation loss function are calculated based on the backbone deep features output by the backbone neural network and the first to (N-1)-th branch deep features output by the first to (N-1)-th branch neural networks, and the parameters of the backbone neural network and the first to (N-1)-th branch neural networks are updated simultaneously through the backpropagation algorithm, where the classification loss and distillation loss functions are defined as follows: Classification loss function: A linear classifier with the same structure is connected after the fully connected layer of each of the first to (N-1)-th branch neural networks and of the backbone neural network, giving the first to (N-1)-th linear classifiers and the N-th linear classifier, respectively. For any one linear classifier c, let its input be a deep feature $z$ of dimension d whose true label is speaker $y_i \in \{1, 2, \ldots, K\}$, where K is the number of speaker categories. The probability that $z$ is classified as speaker $y_i$ is computed from the scaling factor s, the margin parameter m, and the angle between the parameters corresponding to speaker $y_i$ in linear classifier c and the input deep feature $z$, and is used to calculate the predicted probability distribution of each input deep feature. The predicted probability distribution of the input deep features is calculated for each linear classifier, and the classification loss function is as follows: $L_{cls} = \sum_{i=1}^{N} \mathrm{CrossEntropy}(p_i, y)$, where CrossEntropy(·) represents the cross-entropy loss, $p_i$ is the predicted probability distribution output by the i-th linear classifier, and $y$ is the true speaker category label. The distillation loss function is the sum of two terms: the Kullback-Leibler divergence loss between the predicted probability distributions of the first to (N-1)-th linear classifiers and that of the N-th linear classifier, and the weighted sum of the L2 distances between the first to (N-1)-th branch deep features and the backbone deep features. The distillation loss function is calculated as follows: $L_{dis} = \sum_{i=1}^{N-1} \mathrm{KL}(p_i, p_N) + \lambda \sum_{i=1}^{N-1} \lVert F_i - F_N \rVert_2$, where KL(·) represents the Kullback-Leibler divergence, $\lVert \cdot \rVert_2$ represents the L2 distance, $p_i$ and $p_N$ represent the predicted probability distributions of the i-th and N-th linear classifiers, respectively, $F_i$ and $F_N$ represent the i-th branch deep features and the backbone deep features, respectively, and $\lambda$ is the balancing hyperparameter. The overall loss function of self-distillation pre-training is expressed as: $L_{total} = \beta \cdot L_{dis} + (1 - \beta) \cdot L_{cls}$, where $\beta$ is a trade-off parameter used to balance the impact of the two losses on the network.
5. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 1, characterized in that the meta-learning method in step S5 is a training strategy that improves the generalization ability of deep neural networks. The meta-learning method uses tasks as its training units. Each task consists of a support set and a query set and simulates the task scenario encountered at test time. The support set contains $K$ different categories with $N$ samples per category; the query set contains the same $K$ categories as the support set, with $M$ samples per category. This training strategy is called the $K$-way, $N$-shot meta-learning strategy.
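As a rough illustration of how one such task might be drawn, the following sketch samples a support set and a query set from a dictionary of utterances grouped by speaker. The function name `sample_episode`, the data layout, and the choice to keep support and query utterances disjoint are assumptions made for the example, not details taken from the claim.

```python
import random

def sample_episode(utts_by_speaker, k, n, m, rng=None):
    """Draw one K-way, N-shot task with M query utterances per speaker.

    utts_by_speaker maps a speaker ID to a list of utterance IDs (hypothetical layout).
    Returns (support, query), each a list of (utterance_id, episode_label) pairs.
    """
    rng = rng or random.Random()
    # Only speakers with enough utterances can fill both sets without overlap.
    eligible = [spk for spk, utts in utts_by_speaker.items() if len(utts) >= n + m]
    if len(eligible) < k:
        raise ValueError("not enough speakers with n + m utterances")
    speakers = rng.sample(eligible, k)
    support, query = [], []
    for label, spk in enumerate(speakers):
        chosen = rng.sample(utts_by_speaker[spk], n + m)  # sampled without replacement
        support += [(utt, label) for utt in chosen[:n]]
        query += [(utt, label) for utt in chosen[n:]]
    return support, query

# Toy data: 5 speakers with 6 utterances each.
data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(6)] for i in range(5)}
support, query = sample_episode(data, k=3, n=2, m=3, rng=random.Random(0))
print(len(support), len(query))
```

Labels are re-indexed from 0 to K-1 inside each task, so the same speaker can receive different labels in different tasks.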
6. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 5, characterized in that the process of step S5 is as follows:
S5.1 Constructing the meta-learning task: adopting the $K$-way, $N$-shot meta-learning strategy, in each training iteration draw $K$ different speakers from the fine-tuning dataset and take $N$ utterances of each speaker as the support set; take the same $K$ speakers as in the support set and $M$ utterances of each speaker as the query set. The utterances of each speaker in the query set differ from the utterances of that speaker in the support set, and the utterances in the support set and those in the query set come from different recording environments;
S5.2 Calculating the central feature of each speaker in the support set: input the log-Mel spectrum features of each speaker's utterances in the support set into the pre-trained backbone neural network to obtain the deep feature of each utterance, then take the mean of the deep features of each speaker's utterances as the central feature of that speaker, as shown in the following formula:
$$c_k = \frac{1}{|S_k|} \sum_{u \in S_k} f(u),$$
where $S_k$ is the set of speech samples of speaker $k$ in the support set and $f(u)$ is the deep feature of speech sample $u$ output by the pre-trained backbone neural network;
S5.3 Updating the backbone neural network parameters: calculate the cosine distance between the deep feature of each utterance in the query set and the central feature of each speaker in the support set. Based on the true labels of the utterances in the query set, calculate the angular prototypical loss, as shown in the following formula:
$$\mathcal{L}_{AP} = -\frac{1}{Q} \sum_{i=1}^{Q} \log \frac{e^{w\cos(x_i,\, c_{y_i}) + b}}{\sum_{k=1}^{K} e^{w\cos(x_i,\, c_k) + b}},$$
where $Q$ is the number of utterances in the query set, $c_k$ is the central feature of speaker $k$ in the support set, $x_i$ is the deep feature of the $i$-th utterance in the query set, $c_{y_i}$ is the central feature of the speaker in the support set that corresponds to the true label of $x_i$, $\cos(\cdot,\cdot)$ is the cosine of the angle between two features, and $w$ and $b$ are the learnable scale factor and bias, respectively. During training, continuously reducing the loss value $\mathcal{L}_{AP}$ reduces the distance between the deep feature of each utterance in the query set and the central feature of the same speaker in the support set, increases its distance from the deep features of different speakers' speech in the support set, and aligns the deep features of speech recorded in different noise environments.
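Steps S5.2 and S5.3 can be sketched in plain Python as follows. This is an illustrative reading of the angular prototypical loss: each query feature is scored against every speaker's central feature with a scaled and shifted cosine, and the negative log-softmax of the true speaker is averaged over the query set. The scale `w` and bias `b` are fixed constants here, whereas the claim describes them as learnable; the function names and toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prototypes(support_feats, support_labels, k):
    """Central feature of each speaker: mean of that speaker's support features."""
    dim = len(support_feats[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for feat, label in zip(support_feats, support_labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += feat[d]
    return [[v / counts[c] for v in sums[c]] for c in range(k)]

def angular_prototypical_loss(query_feats, query_labels, protos, w=10.0, b=-5.0):
    """Mean negative log-softmax of w * cos(query, prototype) + b at the true speaker."""
    total = 0.0
    for x, y in zip(query_feats, query_labels):
        logits = [w * cosine(x, c) + b for c in protos]
        peak = max(logits)  # log-sum-exp with the maximum factored out
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum - logits[y]
    return total / len(query_feats)

# Toy task: 2 speakers, 2 support and 1 query feature each (2-D for readability).
support_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
query_feats = [[0.8, 0.2], [0.2, 0.7]]
query_labels = [0, 1]

protos = prototypes(support_feats, support_labels, k=2)
print(angular_prototypical_loss(query_feats, query_labels, protos))
```

Because the query and support utterances in S5.1 come from different recording environments, minimizing this loss pulls features of the same speaker together across environments.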
7. The far-field speaker verification method based on self-distillation pre-training and meta-learning fine-tuning according to claim 1, characterized in that the process of step S6 is as follows:
S6.1 Generating test sample pairs: pair each utterance in the near-field registration set of the evaluation speech dataset with each utterance in the far-field test set to generate test sample pairs, including positive sample pairs and negative sample pairs. In a positive sample pair the two utterances belong to the same speaker, and in a negative sample pair the two utterances belong to different speakers;
S6.2 Extracting deep features: input the log-Mel spectrum features of the two utterances of each test sample pair into the pre-trained and fine-tuned backbone neural network to obtain the deep features of the test sample pair;
S6.3 Decision: calculate the similarity between the deep features of the two utterances in each test sample pair, and determine whether the two utterances come from the same speaker by comparing the similarity with a preset threshold.
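The decision in step S6.3 can be sketched as a similarity comparison. The example below assumes cosine similarity as the similarity measure and an arbitrary threshold of 0.5; the claim only requires some similarity and a preset threshold, and the function names and feature values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two deep feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def verify(registration_feat, test_feat, threshold=0.5):
    """Return (score, accepted) for one test sample pair.

    accepted is True when the score reaches the preset threshold,
    i.e. the two utterances are judged to come from the same speaker.
    """
    score = cosine_similarity(registration_feat, test_feat)
    return score, score >= threshold

# Toy deep features standing in for backbone outputs.
near_field_feat = [0.9, 0.1, 0.3]
far_field_same = [0.8, 0.2, 0.35]
far_field_other = [-0.2, 0.9, -0.4]

print(verify(near_field_feat, far_field_same))
print(verify(near_field_feat, far_field_other))
```

In practice the threshold would be chosen on held-out data to trade off false acceptances against false rejections.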