Synthesized speech identification method, apparatus and system, storage medium, and device

By constructing a multi-dimensional feature extraction and clustering identification model, the problem of low accuracy in identifying highly realistic AI synthesized speech in existing technologies has been solved, achieving effective recognition of synthesized speech from different speakers and improving the security of identity authentication.

WO2026123823A1 · PCT designated stage · Publication Date: 2026-06-18 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date: 2025-09-05
Publication Date: 2026-06-18

Abstract

The present application discloses a synthesized speech identification method, apparatus and system, a storage medium, and a device. The method comprises: constructing a target data set on the basis of real speech data and synthesized speech data of multiple speakers, wherein each piece of speech data has a corresponding speaker label; constructing an identification model, wherein the identification model comprises a feature extraction module, a classification module, and a determination module; using the target data set to train the identification model; and after the model training is completed, processing target speech data by means of the identification model to obtain an identification result, wherein the identification result is used for indicating whether the target speech data is a synthesized speech.

Description

Synthesized speech identification method, apparatus and system, storage medium, and device

[0001] Related applications

[0002] This application claims priority to the Chinese patent application filed on December 9, 2024, with application number 202411805567.6, entitled "A Synthetic Speech Identification Method, Apparatus, System and Product", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of speech authentication technology, and in particular to a method, apparatus, system, storage medium and device for synthesized speech identification.

Background

[0004] With the rapid development of Artificial Intelligence (AI) technology, speech synthesis technology has become increasingly mature. Current speech synthesis technology is widely used in human-computer interaction, media entertainment, education and training, and the automotive industry. For example, smart speakers and voice assistants, through speech synthesis technology, can engage in voice conversations with users, greatly enhancing the user experience. In the automotive field, intelligent speech synthesis can enable functions such as in-vehicle navigation and in-car device status updates, improving driver comfort and driving safety.

[0005] However, the widespread application of speech synthesis technology also brings security challenges. For example, where personal voiceprints are used for identity authentication, forged synthesized speech can impersonate a user's real voice to bypass authentication and gain access to the user's private data, device control, and so on, causing significant losses to users and related parties. Effective identification of synthesized speech is therefore crucial. Traditional synthesized speech identification technologies have relatively low accuracy on highly realistic AI-forged audio, and in particular struggle to identify synthesized speech from speakers with different voice characteristics. A method is therefore needed to accurately identify highly realistic synthesized speech from different speakers.

Summary of the Invention

[0006] In view of this, this application aims to propose a method, apparatus, system, storage medium and device for synthesized speech identification, so as to achieve accurate identification of highly realistic synthesized speech.

[0007] To achieve the above objectives, the technical solution of this application is as follows:

[0008] In a first aspect, this application provides a synthetic speech identification method. The method includes: constructing a target dataset based on real speech data and synthetic speech data from multiple speakers, wherein each piece of speech data has a corresponding speaker label; constructing an identification model, wherein the identification model includes a feature extraction module, a classification module, and a judgment module; the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors; the classification module is used to cluster all fusion vectors to generate multiple speaker categories, wherein each speaker category has a corresponding centroid and a similarity threshold; the judgment module is used to determine whether any speech data is synthetic speech based on each speaker category; training the identification model using the target dataset; and after the model training is completed, processing the target speech data through the identification model to obtain an identification result, wherein the identification result is used to indicate whether the target speech data is synthetic speech.

[0009] In some embodiments, constructing a target dataset based on real speech data and synthesized speech data of multiple speakers includes: acquiring real speech data and corresponding synthesized speech data of multiple speakers; adding perturbation noise to the synthesized speech data; the perturbation noise is generated based on the gradient of the target loss function of the discrimination model; adding corresponding speaker labels to each real speech data and synthesized speech data; and constructing the target dataset based on all speech data carrying speaker labels.

[0010] In some embodiments, training the discrimination model using the target dataset includes: extracting feature vectors of multiple dimensions from each speech data in the target dataset using the feature extraction module and generating corresponding fusion vectors; clustering all fusion vectors to obtain multiple speaker categories using the classification module and determining the centroid of each speaker's category; constructing a validation set for each speaker, wherein the validation set includes: the speaker's real speech data and / or synthesized speech data; determining a similarity threshold for the corresponding speaker's category based on the validation set for each speaker; constructing a target loss function and training the discrimination model based on the target loss function.

[0011] In some embodiments, constructing a target loss function includes: constructing a first loss function based on Euclidean distance and speaker labels; the first loss function is used to optimize the feature extraction accuracy of the feature extraction module; constructing a second loss function based on Gaussian density; the second loss function is used to optimize the clustering accuracy of the classification module; and constructing the target loss function based on the first loss function and the second loss function.
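One possible reading of this construction is sketched below. The source does not give the exact forms, so every symbol here is an assumption made for illustration: $z_i$ is the fusion vector of the $i$-th piece of speech data, $s_i$ its speaker label, $c_{s_i}$ the centroid of that speaker's category, $\pi_k, \mu_k, \Sigma_k$ the weight, mean and covariance of the $k$-th Gaussian component, and $\lambda$ a hypothetical balancing weight.

```latex
L_1 = \frac{1}{N}\sum_{i=1}^{N} \lVert z_i - c_{s_i} \rVert_2^2
\qquad
L_2 = -\frac{1}{N}\sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_i \mid \mu_k, \Sigma_k)
\qquad
L = L_1 + \lambda L_2
```

Under these assumptions, $L_1$ pulls each fusion vector toward the centroid of its labeled speaker (a Euclidean-distance term), while $L_2$ is the negative log-likelihood of the fusion vectors under the Gaussian mixture (a Gaussian-density term).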

[0012] In some embodiments, the multi-dimensional feature vector includes: a spectrogram, Mel frequency cepstral coefficients, and a fundamental frequency; extracting multi-dimensional feature vectors from each speech data in the target dataset and generating a corresponding fusion vector specifically includes: performing a short-time Fourier transform on the speech data to extract the spectrogram of the speech data; calculating the Mel frequency energy spectrum of the speech data and calculating the Mel frequency cepstral coefficients based on the Mel frequency energy spectrum; extracting the fundamental frequency of the speech data using the librosa library; convolving the spectrogram, Mel frequency cepstral coefficients, and fundamental frequency to generate multi-dimensional feature vectors; and fusing the multi-dimensional feature vectors to generate a fusion vector.

[0013] In some embodiments, clustering all fused vectors to obtain multiple speaker categories and determining the centroid of each speaker category includes: obtaining a pre-set number of clusters; using a Gaussian mixture model to construct a probability density function based on the number of clusters to cluster all fused vectors; using an expectation-maximization algorithm to determine the parameters of the probability density function; the parameters include: weights, mean, and variance; and determining the mean of each Gaussian component in the probability density function as the centroid of the corresponding speaker category.
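The clustering step above can be illustrated with a small expectation-maximization loop. This is a simplified sketch, not the claimed implementation: it assumes diagonal covariances, a fixed iteration count, and a deterministic strided initialization, and the names `diag_gaussian_pdf` and `fit_gmm` are hypothetical. The fitted component means play the role of the speaker category centroids.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def fit_gmm(vectors, n_clusters, n_iter=30, min_var=1e-6):
    """Fit weights, means and variances with EM; the means act as centroids."""
    n, dim = len(vectors), len(vectors[0])
    # Simplified deterministic init: evenly strided samples as starting means.
    means = [list(vectors[k * n // n_clusters]) for k in range(n_clusters)]
    variances = [[1.0] * dim for _ in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters

    for _ in range(n_iter):
        # E-step: responsibility of each component for each vector.
        resp = []
        for x in vectors:
            p = [w * diag_gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            if total > 0.0:
                resp.append([pk / total for pk in p])
            else:
                # All densities underflowed: fall back to equal shares.
                resp.append([1.0 / n_clusters] * n_clusters)
        # M-step: re-estimate weights, means and variances.
        for k in range(n_clusters):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / n
            means[k] = [sum(r[k] * x[d] for r, x in zip(resp, vectors)) / nk
                        for d in range(dim)]
            variances[k] = [
                max(min_var,
                    sum(r[k] * (x[d] - means[k][d]) ** 2
                        for r, x in zip(resp, vectors)) / nk)
                for d in range(dim)
            ]
    return weights, means, variances
```

For real fusion vectors, a library implementation with full covariances and a more robust initialization would likely be preferable; the loop above only shows how the weights, means, and variances are alternately updated.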

[0014] In some embodiments, determining the corresponding speaker category similarity threshold based on the validation set of each speaker includes: determining multiple candidate thresholds for each speaker category; traversing each candidate threshold and determining the error rate corresponding to each candidate threshold based on the validation set of the speaker; and in each round of training, determining the candidate threshold corresponding to the lowest error rate as the speaker category similarity threshold.

[0015] In some embodiments, determining the error rate corresponding to each candidate threshold based on the speaker's validation set includes: extracting the fusion vector of the speech data in the validation set through the feature extraction module; classifying the fusion vector of the speech data through the classification module to determine the speaker category closest to the speech data; obtaining the corresponding centroid and all candidate thresholds according to the speaker category; calculating the cosine distance between the centroid and the fusion vector through the judgment module; comparing the cosine distance with each candidate threshold to obtain the corresponding identification result; comparing the identification result with the label of the speech data; determining that the identification result is incorrect if the identification result is inconsistent with the label; determining the number of incorrect identification results for all speech data in the validation set for each candidate threshold; and calculating the error rate based on the number of incorrect identification results as the error rate corresponding to the candidate threshold.
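The threshold search described in the two paragraphs above can be sketched as follows. It assumes the scores between each validation utterance and its category centroid have already been computed by the judgment module, and that a score at or above the threshold means "real"; the function names and the sample numbers are hypothetical.

```python
def error_rate(scores, labels, threshold):
    """Fraction of validation items whose decision disagrees with the label.

    scores: score of each utterance against its category centroid.
    labels: True for real speech, False for synthesized speech.
    """
    errors = sum((s >= threshold) != is_real
                 for s, is_real in zip(scores, labels))
    return errors / len(scores)


def select_threshold(scores, labels, candidates):
    """Return the candidate threshold with the lowest error rate."""
    return min(candidates, key=lambda t: error_rate(scores, labels, t))


# Toy validation set for one speaker category (made-up numbers).
scores = [0.95, 0.90, 0.60, 0.40]
labels = [True, True, False, False]
best = select_threshold(scores, labels, [0.3, 0.5, 0.7, 0.99])
```

When two candidates tie, `min` keeps the first one in the list, so the order of the candidates acts as a tie-break in this sketch.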

[0016] In some embodiments, processing the target speech data using the discrimination model to obtain a discrimination result includes: extracting feature vectors of multiple dimensions of the target speech data using the feature extraction module and generating a fusion vector; classifying the fusion vector using the classification module to determine the speaker category closest to the target speech data; obtaining the corresponding centroid and similarity threshold based on the speaker category; calculating the cosine distance between the centroid and the fusion vector using the judgment module; and comparing the cosine distance with the similarity threshold to obtain a discrimination result.

[0017] In some embodiments, comparing the cosine distance with the similarity threshold to obtain an identification result includes: if the cosine distance is greater than or equal to the similarity threshold, determining that the target speech data is the speaker's real speech data; if the cosine distance is less than the similarity threshold, determining that the target speech data is synthetic speech data.
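The decision rule can be sketched as below. Two assumptions are made and should be read as such: the score called "cosine distance" in the text is computed here as cosine similarity, since that is the reading under which a larger value indicates a closer match to the speaker; and the closest speaker category is chosen by highest cosine similarity to the centroid, which is only one possible way to realize the classification step. The names `cosine_similarity` and `identify` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(fusion_vector, categories):
    """Return (closest speaker category, 'real' or 'synthetic').

    categories maps a category name to (centroid, similarity_threshold).
    """
    # Pick the category whose centroid is most similar to the input.
    best = max(categories,
               key=lambda name: cosine_similarity(fusion_vector,
                                                  categories[name][0]))
    centroid, threshold = categories[best]
    score = cosine_similarity(fusion_vector, centroid)
    # At or above the category's own threshold: treated as real speech.
    return best, ("real" if score >= threshold else "synthetic")


# Hypothetical two-speaker setup with per-category thresholds.
categories = {
    "spk_a": ([1.0, 0.0], 0.9),
    "spk_b": ([0.0, 1.0], 0.8),
}
```

Because each category carries its own threshold, the same score can be accepted for one speaker and rejected for another, which is the per-speaker adaptation the text describes.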

[0018] In a second aspect, this application provides a synthetic speech identification device. The device includes: a preprocessing module configured to construct a target dataset based on real speech data and synthetic speech data from multiple speakers, wherein each piece of speech data has a corresponding speaker label; an identification model, wherein the identification model includes a feature extraction module, a classification module, and a judgment module; the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors; the classification module is used to cluster all fusion vectors to generate multiple speaker categories, wherein each speaker category has a corresponding centroid and a similarity threshold; the judgment module is used to determine whether any speech data is synthetic speech based on each speaker category; a training module configured to train the identification model using the target dataset; and an identification module configured to process the target speech data through the identification model after the model training is completed to obtain an identification result, wherein the identification result is used to indicate whether the target speech data is synthetic speech.

[0019] In some embodiments, the preprocessing module is further configured to: acquire real speech data of multiple speakers and corresponding synthesized speech data; add perturbation noise to the synthesized speech data, wherein the perturbation noise is generated based on the gradient of the target loss function of the discrimination model; add corresponding speaker labels to each real speech data and synthesized speech data; and construct the target dataset based on all speech data carrying speaker labels.

[0020] In some embodiments, the training module is further configured to: extract feature vectors of multiple dimensions from each speech data in the target dataset using the feature extraction module and generate corresponding fusion vectors; cluster all fusion vectors to obtain multiple speaker categories using the classification module and determine the centroid of each speaker's category; construct a validation set for each speaker, wherein the validation set includes: the speaker's real speech data and / or synthesized speech data; determine the corresponding speaker category similarity threshold based on the validation set of each speaker; construct a target loss function and train the discrimination model based on the target loss function.

[0021] In some embodiments, the training module is further configured to: construct a first loss function based on Euclidean distance and speaker labels, wherein the first loss function is used to optimize the feature extraction accuracy of the feature extraction module; construct a second loss function based on Gaussian density, wherein the second loss function is used to optimize the clustering accuracy of the classification module; and construct the target loss function based on the first loss function and the second loss function.

[0022] In some embodiments, the multi-dimensional feature vector includes: a spectrogram, Mel frequency cepstral coefficients, and a fundamental frequency; the training module is further configured to: perform a short-time Fourier transform on the speech data to extract the spectrogram of the speech data; calculate the Mel frequency energy spectrum of the speech data, and calculate the Mel frequency cepstral coefficients based on the Mel frequency energy spectrum; extract the fundamental frequency of the speech data using the librosa library; convolve the spectrogram, Mel frequency cepstral coefficients, and fundamental frequency to generate a multi-dimensional feature vector; and fuse the multi-dimensional feature vectors to generate a fused vector.

[0023] In some embodiments, the training module is further configured to: obtain a pre-set number of clusters; based on the number of clusters, use a Gaussian mixture model to construct a probability density function to cluster all fused vectors; use an expectation-maximization algorithm to determine the parameters of the probability density function, the parameters including: weights, mean, and variance; and determine the mean of each Gaussian component in the probability density function as the centroid of the corresponding speaker category.

[0024] In some embodiments, the training module is further configured to: determine multiple candidate thresholds for each speaker category; traverse each candidate threshold and determine the error rate corresponding to each candidate threshold based on the speaker's validation set; and in each round of training, determine the candidate threshold corresponding to the lowest error rate as the similarity threshold of the speaker's category.

[0025] In some embodiments, the training module is further configured to: extract the fusion vector of the speech data in the validation set through the feature extraction module; classify the fusion vector of the speech data through the classification module to determine the speaker category closest to the speech data; obtain the corresponding centroid and all candidate thresholds according to the speaker category; calculate the cosine distance between the centroid and the fusion vector through the judgment module; compare the cosine distance with each candidate threshold to obtain the corresponding discrimination result; compare the discrimination result with the label of the speech data; determine that the discrimination result is incorrect if the discrimination result is inconsistent with the label; for each candidate threshold, determine the number of discrimination results that are incorrect for all speech data in the validation set; calculate the error rate based on the number of discrimination results that are incorrect, as the error rate corresponding to the candidate threshold.

[0026] In some embodiments, the identification module is further configured to: extract feature vectors of multiple dimensions of the target speech data through the feature extraction module and generate a fusion vector; classify the fusion vector through the classification module to determine the speaker category closest to the target speech data; obtain the corresponding centroid and similarity threshold according to the speaker category; calculate the cosine distance between the centroid and the fusion vector through the judgment module; and compare the cosine distance with the similarity threshold to obtain the identification result.

[0027] In some embodiments, the identification module is further configured to: determine that the target speech data is the speaker's real speech data when the cosine distance is greater than or equal to the similarity threshold; and determine that the target speech data is synthesized speech data when the cosine distance is less than the similarity threshold.

[0028] In a third aspect, this application also provides a synthetic speech identification system, comprising: a synthetic speech identification device, a data acquisition unit, and a speech conversion unit as provided in the second aspect of the embodiments of this application; the data acquisition unit is used to acquire real speech data of multiple speakers; and the speech conversion unit is configured to generate corresponding transcribed text based on the real speech data; and to process the transcribed text using a speech conversion model to generate corresponding synthetic speech data.

[0029] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method provided in the first aspect of the embodiments of this application.

[0030] In a fifth aspect, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when executed by the processor, the computer program implements the steps of the method provided in the first aspect of the embodiments of this application.

[0031] The synthesized speech identification method provided in this application first acquires multiple sets of real and synthesized speech data labeled with speaker tags, and constructs a target dataset based on all the speech data. An identification model is then built to determine whether the speech data is synthesized. This model specifically includes a feature extraction module, a classification module, and a judgment module. The feature extraction module extracts multi-dimensional feature vectors from the speech data and processes them to generate fusion vectors. The classification module clusters the fusion vectors of all speech data in the target dataset to generate multiple speaker categories. Each speaker category has a corresponding centroid and a similarity threshold. The judgment module judges any speech data based on the centroid and similarity threshold of each speaker category to determine whether the speech data is the real speech of a particular speaker or synthesized speech.

[0032] The synthetic speech identification method provided in this application constructs a target dataset based on real and synthetic speech data from different speakers. When extracting feature vectors, it fuses multiple dimensions of audio features to obtain richer feature representations. Based on this, it trains an identification model, enabling the model to learn richer speech information, improving its ability to recognize complex signals, and thus increasing the accuracy of the identification results. Because different speakers have different speech features, traditional detection models that use fixed thresholds or simple classifiers struggle to identify highly realistic AI-synthesized speech. This application instead clusters the fused features of multiple speakers into multiple speaker categories and then determines a corresponding similarity threshold for each speaker category. Compared to traditional methods, the identification model in this application can better adapt to the feature differences of different speakers, improving the accuracy and reliability of the identification model in recognizing the authenticity of different speakers' speech, and achieving effective recognition of highly realistic AI-synthesized speech.

Description of the Drawings

[0033] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0034] Figure 1 is a flowchart of a synthetic speech identification method proposed in an embodiment of this application.

[0035] Figure 2 is a flowchart of generating synthesized speech data in one embodiment of this application.

[0036] Figure 3 is a flowchart of training the discrimination model in one embodiment of this application.

[0037] Figure 4 is a schematic diagram of centroid clustering in one embodiment of this application.

[0038] Figure 5 is a schematic diagram of a synthesized speech identification device proposed in an embodiment of this application.

[0039] Figure 6 is a schematic diagram of a synthetic speech identification system proposed in an embodiment of this application.

Detailed Description

[0040] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0041] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

[0042] In the various embodiments of this application, it should be understood that the sequence number of each process described below does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0043] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects as detailed in this application.

[0044] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other.

[0045] Traditional synthesized speech identification methods often rely on a single type of feature for analysis and discrimination, which proves ineffective against highly realistic AI-synthesized speech. For example, methods based on acoustic feature analysis judge according to differences in traditional acoustic features between real and synthesized speech. However, these differences are subtle for highly realistic AI-synthesized speech, so identification accuracy is low. Similarly, deep learning-based methods use deep learning models to learn speech features automatically. While these models recognize known types of synthesized speech well, they struggle with synthesized speech generated by new or unknown methods.

[0046] This application provides a method that can effectively identify AI-synthesized speech, accurately distinguish the subtle differences between real speech and synthesized speech, adapt to the differences in the voice characteristics of different speakers, and effectively deal with synthesized speech generated by known or unknown synthesis methods.

[0047] The present application will now be described in detail with reference to the accompanying drawings and embodiments.

[0048] Figure 1 is a flowchart of a synthetic speech identification method proposed in an embodiment of this application. As shown in Figure 1, the method includes the following steps S1 to S4.

[0049] S1: Construct the target dataset based on real speech data and synthesized speech data from multiple speakers; where each speech data has a corresponding speaker label.

[0050] S2: Construct an identification model; the identification model includes: a feature extraction module, a classification module, and a judgment module; the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors; the classification module is used to cluster all fusion vectors to generate multiple speaker categories; wherein, each speaker category has a corresponding centroid and similarity threshold; the judgment module is used to determine whether any speech data is synthetic speech based on each speaker category.

[0051] S3: Train the discrimination model using the target dataset.

[0052] S4: After the model training is completed, the target speech data is processed by the discrimination model to obtain the discrimination result; the discrimination result is used to indicate whether the target speech data is synthetic speech.

[0053] In this embodiment, real speech data and corresponding synthesized speech data from multiple different speakers are first acquired. A target dataset is then constructed based on all the speech data to train the discrimination model. The discrimination model includes a feature extraction module, a classification module, and a judgment module. The feature extraction module extracts multi-dimensional feature data from the speech data and fuses the multi-dimensional feature data to generate a fusion vector. This allows the model to learn richer speech information and improves its ability to recognize complex signals.

[0054] The classification module clusters all fused vectors in the target dataset to obtain multiple speaker categories, which are later used to identify whether the target speech data is synthetic. Since each speaker's speech features are different, clustering and centroid determination by the classification module can distinguish the feature information of each speaker category. Based on this, a corresponding similarity threshold is determined for each speaker category, which helps the model better adapt to the feature differences of different speakers, improving the accuracy and reliability of the identification model in recognizing the authenticity of different speakers' speech.

[0055] The trained discrimination model processes the target speech data to be discriminated. It first classifies the target speech data to estimate the speaker category it most likely belongs to, and then determines whether the target speech data actually belongs to that speaker category based on the centroid and similarity threshold of that category. Finally, it outputs the discrimination result for the target speech data: either the target speech data is the speaker's real speech, or it is synthesized speech.

[0056] This embodiment extracts and fuses audio features from multiple dimensions of speech data to obtain richer feature representations, enabling the discrimination model to learn more speech information, improving its ability to recognize complex signals, and thus increasing the accuracy of the discrimination results. Furthermore, by clustering the fused features of multiple speakers to form multiple speaker categories, it achieves the purpose of distinguishing the features of different speakers. Based on this, a corresponding similarity threshold is determined for each speaker category, allowing the model to better adapt to the speech features of each speaker, improving the accuracy and reliability of the discrimination model in identifying the authenticity of different speakers' speech, and achieving effective recognition of highly realistic AI-synthesized speech.

[0057] As one embodiment of this application, the step of constructing a target dataset based on real speech data and synthesized speech data from multiple speakers may include:

[0058] acquiring real speech data from multiple speakers and the corresponding synthesized speech data;

[0059] adding perturbation noise to the synthesized speech data, where the perturbation noise is generated based on the gradient of the target loss function of the discrimination model;

[0060] adding a corresponding speaker label to each piece of real speech data and synthesized speech data; and

[0061] constructing the target dataset based on all speech data carrying speaker labels.

[0062] In some embodiments, fake synthetic speech data is generated using real speech data, and a small perturbation is added to the synthetic speech data to blur the boundary between real and fake audio, thereby enhancing the robustness of the identification model. Figure 2 is a flowchart of generating synthetic speech data in one embodiment of this application. As shown in Figure 2, this embodiment uses two Chinese telephone speech corpora, 873 hours of HKUST Mandarin Telephone Speech Corpus and 15 hours of CallHome Mandarin Chinese Speech Corpus, as real speech data. The corresponding high-quality transcribed text is input into the speech conversion model to generate fake synthetic speech.

[0063] Optionally, the speech conversion model can be a text-to-speech (TTS) model such as PaddleSpeech, Azure Cognitive Services, or Alibaba Cloud TTS. This embodiment does not impose any restrictions on this.

[0064] Traditional methods suffer from overfitting during model training, leading to insufficient generalization and an inability to accurately distinguish synthesized speech in different environments. To enhance the robustness of the discrimination model, perturbation noise is added to the generated fake synthesized speech to blur the boundary between real and fake audio, resulting in the final synthesized speech data used for model training. Specifically, the input synthesized speech data x is slightly modified to fill the gap between real and synthesized speech, resulting in the final synthesized speech data $x_a$:

[0065] $x_a = x + n$;

[0066] Where n represents the perturbation noise, which is generated from the coefficient α and the gradient $\nabla_x L(y, \hat{y})$; α controls the degree of perturbation and ranges over [-1, 1]; y represents the target true label for model training; $\hat{y}$ represents the predicted label for model training; and $\nabla_x L(y, \hat{y})$ represents the gradient of the model's target loss function with respect to the input data x.

[0067] In this embodiment, speaker labels are added to each piece of real speech data and synthesized speech data according to the corresponding speaker. A target dataset is constructed based on the labeled speech data, and the discrimination model is iteratively trained using the target dataset. At the beginning of training, no perturbation noise is added to the synthesized speech data (i.e., α = 0). As the number of training iterations increases, the perturbation level is gradually increased.
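The perturbation step and the gradual increase of α can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this application: the noise is assumed to take the sign-of-gradient form n = α · sign(∇ₓL), the discrimination model is replaced by a toy logistic classifier so that the input gradient has a closed form, and the linear ramp in `alpha_schedule`, the value `alpha_max`, and all function names are hypothetical.

```python
import numpy as np

def bce_input_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x
    for a toy logistic model p = sigmoid(w.x + b) (stand-in for the real model)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w  # dL/dx = (p - y) * w

def alpha_schedule(step, total_steps, alpha_max=0.01):
    """Assumed linear ramp: alpha = 0 at the first step, alpha_max at the last."""
    return alpha_max * step / max(total_steps - 1, 1)

def perturb(x, y, w, b, alpha):
    """x_a = x + n, with n = alpha * sign(gradient) (assumed form of the noise)."""
    n = alpha * np.sign(bce_input_gradient(x, y, w, b))
    return x + n

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # stand-in for one synthesized-speech feature vector
w = rng.normal(size=8)   # toy model weights
x_start = perturb(x, 1.0, w, 0.0, alpha_schedule(0, 10))  # alpha = 0: unchanged
x_end = perturb(x, 1.0, w, 0.0, alpha_schedule(9, 10))    # alpha = alpha_max
```

With this schedule the first training step leaves the synthesized sample untouched, and each element of the final sample differs from the original by at most `alpha_max`.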

[0068] Given that related technologies directly use synthesized and real speech for model training without perturbing the synthesized speech, the model tends to overfit to specific speech synthesis patterns and lacks generalization ability. In this embodiment, by introducing a perturbation based on the loss function gradient, the boundary between real and forged synthesized speech is blurred, thereby enhancing the model's ability to learn subtle differences. This allows the model to learn more robust discriminative features, enabling the discrimination model to effectively identify synthesized speech in complex environments and giving the model stronger generalization ability.

[0069] As one embodiment of this application, the step of training the discrimination model using the target dataset may include:

[0070] The feature extraction module extracts feature vectors from multiple dimensions of each speech data in the target dataset and generates corresponding fusion vectors.

[0071] The classification module clusters all fused vectors to obtain multiple speaker categories and determines the centroid of each speaker category.

[0072] Construct a validation set for each speaker; wherein the validation set includes: the speaker's real speech data and / or synthesized speech data;

[0073] Determine the corresponding speaker class similarity threshold based on the validation set of each speaker;

[0074] Construct a target loss function, and train the discrimination model based on the target loss function.

[0075] Figure 3 is a flowchart of training the discrimination model in one embodiment of this application. As shown in Figure 3, in some embodiments, the training process of the discrimination model includes: generating perturbed synthetic speech data from high-quality transcribed text of real speech data, and constructing a target dataset. The target dataset is fed into a feature extraction module to extract fused features, and then clustered by a classification module to obtain different speaker categories, and the centroid of each speaker category is determined. Then, a judgment module dynamically determines the similarity threshold corresponding to each speaker category based on the error rate of the training process, and jointly optimizes the feature extraction module and the classification module based on the target loss function to minimize the error rate. In this embodiment, based on all speakers corresponding to the speech data in the target dataset, the speech data (including real speech and / or synthetic speech) of each speaker is obtained, and a validation set for each speaker is constructed. The discrimination results of the validation set are used to determine the similarity threshold of the corresponding speaker category.

[0076] After the model training is completed, the target speech data to be identified is input into the identification model. The feature extraction module processes the data to obtain a fusion vector, and the classification module classifies the fusion vector to obtain the closest speaker category and its centroid. The judgment module determines whether the target speech data is synthetic speech or the real speech of the speaker based on the centroid and the similarity threshold of that speaker category.

[0077] As one embodiment of this application, the step of constructing the target loss function may include:

[0078] A first loss function is constructed based on Euclidean distance and speaker labels; this first loss function is used to optimize the feature extraction accuracy of the feature extraction module.

[0079] A second loss function is constructed based on Gaussian density; the second loss function is used to optimize the clustering accuracy of the classification module.

[0080] The target loss function is constructed based on the first loss function and the second loss function.

[0081] In one embodiment, a target loss function is constructed based on two different loss functions for training the discriminative model. Specifically, the target loss function is constructed based on a first loss function and a second loss function.

[0082] In this embodiment, the first loss function $L_{feature}$ is a feature extraction loss function used to optimize the extraction performance of the feature extraction module. It is a contrastive loss: the distance between similar samples (e.g., different samples of the same speaker) is minimized, while the distance between dissimilar samples is maximized. The specific expression is as follows:

[0083] Where distance is the Euclidean distance between feature vectors, and y is the output label of the model.
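A contrastive loss of this kind can be sketched as follows. The sketch uses the common margin-based formulation, which is an assumption: the pair label `same` (1 for two samples of the same speaker, 0 otherwise), the `margin` value, and the function name are hypothetical and are not stated in this application.

```python
import numpy as np

def contrastive_loss(a, b, same, margin=1.0):
    """Margin-based contrastive loss over pairs of feature vectors.

    a, b   : arrays of shape (n_pairs, dim)
    same   : 1 where the pair comes from the same speaker, else 0 (assumed labelling)
    margin : hypothetical margin; not specified in the source
    """
    dist = np.linalg.norm(a - b, axis=1)                      # Euclidean distance
    pull = same * dist ** 2                                   # similar pairs: shrink distance
    push = (1 - same) * np.maximum(margin - dist, 0.0) ** 2   # dissimilar pairs: push apart up to margin
    return float(np.mean(pull + push))

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[3.0, 4.0], [0.3, 0.4]])
same = np.array([1, 0])
loss = contrastive_loss(a, b, same)
```

In this example the similar pair is 5 apart and contributes 25, while the dissimilar pair is 0.5 apart and contributes (1 - 0.5)² = 0.25, so the mean loss is 12.625.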

[0084] The second loss function $L_{GMM}$ is a GMM log-likelihood loss function used to optimize the clustering performance of the classification module; it maximizes the log-likelihood of the data and improves the accuracy of model clustering. The specific expression is as follows:

[0085] Where $\pi_k$ represents the mixing coefficient of the k-th Gaussian component, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ represents the Gaussian density function.

[0086] The target loss function L is constructed based on the first and second loss functions, and its specific expression is as follows:

[0087] $L = L_{feature} + \lambda L_{GMM}$; where λ is the weight of the second loss function.
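The combination of the two losses can be sketched as follows. Because the total loss is minimized while the log-likelihood is to be maximized, the sketch assumes that $L_{GMM}$ is the negative log-likelihood, $-\frac{1}{N}\sum_n \log \sum_k \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$. The averaging over samples, the default `lam=0.5`, and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def gmm_nll(X, weights, means, covs):
    """Average negative log-likelihood of X under the mixture (assumed form of L_GMM)."""
    mix = sum(w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs))
    return float(-np.mean(np.log(mix)))

def target_loss(l_feature, l_gmm, lam=0.5):
    """L = L_feature + lambda * L_GMM; lam=0.5 is an arbitrary illustrative weight."""
    return l_feature + lam * l_gmm

# One sample at the mean of a single standard normal component.
nll = gmm_nll(np.zeros((1, 1)), [1.0], [np.zeros(1)], [np.eye(1)])
total = target_loss(1.0, nll)
```

A larger λ places more emphasis on the clustering quality relative to the feature extraction quality.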

[0088] Traditional methods use a single loss function to train speech detection models, which can only optimize performance in one dimension of the model, such as classification accuracy. This approach has limited impact on overall model performance. In this embodiment, however, a target loss function is constructed based on two different loss functions. This allows for simultaneous training of both feature extraction and classification performance, resulting in a significant improvement in the overall model performance.

[0089] As one embodiment of this application, the feature vectors of the multiple dimensions include: spectrogram, Mel frequency cepstral coefficients, and fundamental frequency.

[0090] The steps of extracting multi-dimensional feature vectors from each speech data point in the target dataset and generating corresponding fusion vectors may specifically include:

[0091] Perform a short-time Fourier transform on the speech data to extract the speech data's spectrogram;

[0092] Calculate the Mel frequency energy spectrum of the speech data, and calculate the Mel frequency cepstral coefficients based on the Mel frequency energy spectrum;

[0093] Use the librosa library to extract the fundamental frequency of the speech data;

[0094] Convolve the spectrogram, Mel frequency cepstral coefficients, and fundamental frequency to generate a multi-dimensional feature vector;

[0095] Feature vectors from multiple dimensions are fused to generate a fused vector.

[0096] In some embodiments, a feature extraction module extracts feature vectors of three dimensions (spectrogram, Mel frequency cepstral coefficients, and fundamental frequency) from the speech data. By extracting feature vectors of multiple dimensions, the model can learn the information in the speech data more comprehensively. Specifically, the steps for extracting feature vectors are as follows:

[0097] (1) Extract the spectrogram D. Acquire the speech data and its sampling rate sr. With a fixed frame length of 40ms and a frame shift of 20ms, divide the speech data into frames and denote the segmented speech data as y(t). Based on the speech data y(t) and the sampling rate sr, use the Short Time Fourier Transform (STFT) to obtain the spectrogram D, as follows:

[0098] Where mT is the frame shift (in seconds), and the time interval of each sample is 1/sr.

[0099] Therefore, the calculation of the short-time Fourier transform is directly related to the sampling rate, ensuring that the time and frequency axes of the spectrogram D are consistent with the sampling rate.
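The framing and transform in step (1) can be sketched as follows. This is a minimal NumPy illustration: the 40 ms frame length and 20 ms frame shift come from the description above, while the Hann window, the use of the magnitude of the transform, the 8 kHz test signal, and the function name are assumptions.

```python
import numpy as np

def stft_spectrogram(y, sr, frame_ms=40, hop_ms=20):
    """Magnitude spectrogram with a 40 ms frame and a 20 ms frame shift.
    The Hann window is an assumption; the source does not name a window."""
    frame_len = int(sr * frame_ms / 1000)   # samples per frame
    hop_len = int(sr * hop_ms / 1000)       # samples per frame shift
    n_frames = 1 + (len(y) - frame_len) // hop_len
    window = np.hanning(frame_len)
    frames = np.stack([y[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, n_frames)

sr = 8000                            # example telephone-band sampling rate
t = np.arange(sr) / sr               # one second of audio
y = np.sin(2 * np.pi * 1000 * t)     # 1 kHz test tone
D = stft_spectrogram(y, sr)

# Map the strongest bin of the first frame back to a frequency in Hz.
freqs = np.fft.rfftfreq(int(sr * 0.040), d=1 / sr)
peak_hz = freqs[np.argmax(D[:, 0])]
```

Both the number of samples per frame and the frequency of each bin are derived from `sr`, which is the dependence on the sampling rate noted above.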

[0100] (2) Extract Mel-frequency cepstral coefficients. Calculate the Mel-frequency energy spectrum $E_m$ of the speech data using Mel filters. After logarithmic transformation of $E_m$, $L_m = \log(E_m)$ is obtained.

[0101] The Mel-scale Frequency Cepstral Coefficients (MFCCs) are as follows:

[0102] Where M is the number of Mel filters and N is the number of MFCC coefficients required.
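The step from the log Mel energies $L_m$ to the cepstral coefficients can be sketched as follows. The sketch assumes the textbook discrete cosine transform, $c_n = \sum_{m=1}^{M} L_m \cos\left(\frac{\pi n (m - 0.5)}{M}\right)$ for n = 1, ..., N, without normalization; the exact form used in this application is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def mfcc_from_mel_energies(E, n_coeffs):
    """E: Mel filter energies of one frame, shape (M,), all entries > 0.
    Returns n_coeffs cepstral coefficients via a DCT-II of log(E)
    (unnormalized textbook form, assumed)."""
    M = len(E)
    L = np.log(E)                                # L_m = log(E_m)
    m = np.arange(1, M + 1)                      # filter index 1..M
    n = np.arange(1, n_coeffs + 1)[:, None]      # coefficient index 1..N
    basis = np.cos(np.pi * n * (m - 0.5) / M)    # shape (N, M)
    return basis @ L

# A flat Mel spectrum has no cepstral structure, so every coefficient vanishes.
flat = mfcc_from_mel_energies(np.full(26, np.e), n_coeffs=13)
```

The values M = 26 and N = 13 in the example are common choices in speech processing, not values taken from this application.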

[0103] (3) Extract the fundamental frequency. Based on the speech data y(t) and the sampling rate sr, the fundamental frequency f0 of the audio signal is extracted using the librosa library in Python.

[0104] Convolution operations are performed on the spectrogram D, Mel-frequency cepstral coefficients (MFCC), and fundamental frequency f0 to generate feature vectors of corresponding dimensions. After obtaining the three feature vectors, they are fused to obtain the corresponding fused vector, thus completing the vector extraction operation for the speech data.

[0105] Compared to traditional detection methods, this approach can fully capture the time-domain, frequency-domain, and acoustic features of speech data (such as telephone voice signals), obtaining richer feature representations. By extracting and fusing multiple speech feature vectors with significant discriminative differences through a convolutional neural network, the model can understand speech signals from multiple perspectives, improving its ability to recognize complex signals and enabling it to effectively recognize synthetic speech from different speakers, in different environments, and of different types.

[0106] Furthermore, the feature extraction module in this embodiment is scalable. By increasing the dimensions of feature extraction, new feature extraction methods can be easily integrated to adapt to new AI-synthesized speech technologies that may emerge in the future.

[0107] As one implementation of this application, clustering is performed on all fused vectors to obtain multiple speaker categories, and the centroid of each speaker category is determined, including:

[0108] Get the pre-set number of clusters;

[0109] Based on the number of clusters, a probability density function is constructed using a Gaussian mixture model to cluster all fused vectors;

[0110] The parameters of the probability density function are determined using the expectation-maximization algorithm; the parameters include: weights, mean, and variance.

[0111] The mean of each Gaussian component in the probability density function is determined as the centroid of the corresponding speaker category.

[0112] In one embodiment, the classification module uses a Gaussian Mixture Model (GMM) to cluster the fused vectors of the speech data and uses the Expectation-Maximization (EM) algorithm to determine the parameters of the probability density function.

[0113] First, the GMM model is used to cluster the fused vectors of real speech data from different speakers. Let $X = \{x_1, x_2, \ldots, x_N\}$ be the matrix composed of the real speech feature vectors of different speakers, where $x_i$ represents the feature vector of the i-th sample. The GMM probability density function is as follows:

$$p(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$

[0114] Where k is the number of clusters; $\pi_i$ is the weight of the i-th Gaussian component; $\mathcal{N}(x \mid \mu_i, \Sigma_i)$ is the i-th Gaussian distribution; $\mu_i$ is its mean; and $\Sigma_i$ is its variance.

[0115] In this embodiment, the voice data of each speaker is clustered to obtain a speaker category, that is, let k = N.

[0116] Then, the GMM parameters are estimated using the Expectation-Maximization (EM) algorithm, which consists of two steps: the E-step and the M-step. The E-step and M-step are repeated until the parameters converge or the maximum number of iterations is reached.

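[0117] The E-step calculates, for each sample $x_n$, the probability $\gamma_i(x_n)$ that it belongs to each Gaussian component i (i.e., the responsibility value):

$$\gamma_i(x_n) = \frac{\pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

[0118] The M-step is used to update the parameters of the GMM, including weights, mean, and variance, where N is the number of samples:

[0119] (1) Update weight $\pi_i$:

$$\pi_i = \frac{1}{N} \sum_{n=1}^{N} \gamma_i(x_n)$$

[0120] (2) Update the mean vector $\mu_i$:

$$\mu_i = \frac{\sum_{n=1}^{N} \gamma_i(x_n) \, x_n}{\sum_{n=1}^{N} \gamma_i(x_n)}$$

[0121] (3) Update the covariance matrix $\Sigma_i$:

$$\Sigma_i = \frac{\sum_{n=1}^{N} \gamma_i(x_n) \, (x_n - \mu_i)(x_n - \mu_i)^{T}}{\sum_{n=1}^{N} \gamma_i(x_n)}$$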
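The E-step and M-step can be sketched in NumPy as follows. This is an illustrative implementation of standard EM for a Gaussian mixture, not the code of this application: the caller-supplied initial means, the small regularization term `reg` added to each covariance diagonal, the log-likelihood stopping rule, and the synthetic two-cluster data are all assumptions.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def fit_gmm(X, init_means, n_iter=200, tol=1e-8, reg=1e-6):
    """EM for a GMM with one component per row of init_means.
    `reg` keeps covariances invertible (assumption for numerical stability)."""
    n, d = X.shape
    means = np.asarray(init_means, dtype=float).copy()
    k = len(means)
    covs = np.stack([np.cov(X.T).reshape(d, d) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of every component for every sample.
        dens = np.stack([weights[i] * gaussian_pdf(X, means[i], covs[i])
                         for i in range(k)], axis=1)
        total = dens.sum(axis=1, keepdims=True)
        gamma = dens / total
        # M-step: update weights, means, and covariances.
        Nk = gamma.sum(axis=0)
        weights = Nk / n
        means = (gamma.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (gamma[:, i, None] * diff).T @ diff / Nk[i] + reg * np.eye(d)
        # Stop when the log-likelihood no longer changes.
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, covs, gamma

rng = np.random.default_rng(1)
A = rng.normal(0.0, 0.5, size=(50, 2))   # stand-in for fused vectors of speaker 1
B = rng.normal(5.0, 0.5, size=(50, 2))   # stand-in for fused vectors of speaker 2
X = np.vstack([A, B])
weights, means, covs, gamma = fit_gmm(X, init_means=X[[0, -1]])
```

After fitting, each row of `means` is the mean of one Gaussian component and can serve as the centroid of the corresponding speaker category.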

[0122] Figure 4 is a schematic diagram of centroid clustering in one embodiment of this application. As shown in Figure 4, in this embodiment, a Gaussian mixture model (GMM) is used to cluster different feature clusters (speaker categories). Figure 4 shows three speaker categories (feature cluster 1, feature cluster 2, and feature cluster 3). For each speaker category, the mean vector $\mu_i$ of its corresponding Gaussian component serves as the feature centroid for that speaker category. The feature centroids of each speaker category are then used as reference centroids in subsequent training processes.

[0123] When the classification module processes the target speech data to be identified, it first classifies the fusion vector $x_{test}$ of the target speech data and calculates the corresponding responsibility value $\gamma_i(x_{test})$:

$$\gamma_i(x_{test}) = \frac{\pi_i \, \mathcal{N}(x_{test} \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_{test} \mid \mu_j, \Sigma_j)}$$

[0124] Furthermore, by calculating $\arg\max_i \gamma_i(x_{test})$, the closest speaker category for the target speech data is estimated. The `argmax()` function returns the position of the element with the maximum value, i.e., the speaker category. Based on the closest speaker category, the centroid $C_r$ of that category (i.e., the feature center of this category) is determined. Then, the judgment module calculates the cosine distance $d(x_{test}, C_r)$ between this centroid and the fusion vector of the target speech data:

[0125] The cosine distance is compared with the similarity threshold of the speaker category to determine whether the target speech data is the speaker's real speech or synthesized speech. Specifically, if the cosine distance is less than the similarity threshold, it is determined to be synthesized speech; if the cosine distance is greater than or equal to the similarity threshold, it is determined to be the speaker's real speech.
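The classification and judgment steps can be sketched as follows. The sketch assumes that the "cosine distance" is computed as the cosine similarity between the fusion vector and the centroid, since the decision rule treats a larger value as a closer match; the toy GMM parameters, the thresholds, and the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density at a single vector x."""
    d = len(x)
    diff = x - mean
    expo = -0.5 * diff @ np.linalg.inv(cov) @ diff
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def identify(x_test, weights, means, covs, thresholds):
    """Return (closest category index, similarity score, decision)."""
    dens = np.array([w * gaussian_pdf(x_test, m, c)
                     for w, m, c in zip(weights, means, covs)])
    gamma = dens / dens.sum()                 # responsibility values
    r = int(np.argmax(gamma))                 # closest speaker category
    centroid = means[r]                       # centroid C_r = component mean
    score = float(x_test @ centroid /
                  (np.linalg.norm(x_test) * np.linalg.norm(centroid)))
    label = "real" if score >= thresholds[r] else "synthetic"
    return r, score, label

# Toy model with two speaker categories (all values are illustrative).
weights = np.array([0.5, 0.5])
means = np.array([[1.0, 0.0], [0.0, 1.0]])
covs = np.stack([0.1 * np.eye(2)] * 2)
thresholds = [0.9, 0.9]

result_a = identify(np.array([0.9, 0.1]), weights, means, covs, thresholds)
result_b = identify(np.array([0.6, 0.5]), weights, means, covs, thresholds)
```

Both test vectors are assigned to the first category, but only the first lies close enough in direction to the centroid to pass its threshold; the second is judged synthetic.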

[0126] As one embodiment of this application, the step of determining the corresponding speaker class similarity threshold based on the validation set of each speaker may include:

[0127] For each speaker category, determine multiple candidate thresholds;

[0128] Iterate through each candidate threshold and determine the error rate corresponding to each candidate threshold based on the speaker's validation set;

[0129] In each round of training, the candidate threshold corresponding to the lowest error rate is determined as the similarity threshold for the speaker category.

[0130] Each speaker's validation set includes the speaker's real audio data and / or synthetic audio data. In some embodiments, during training, multiple candidate thresholds are obtained by taking values at uniform intervals within the range [0,1]. For example, using 0.01 as the interval in [0,1] yields 100 candidate thresholds (excluding 0).

[0131] In each round of model training, the discrimination model is used to process the speaker's validation set, and the error rate of the discrimination result is obtained based on each candidate threshold in a traversal manner. The candidate threshold with the smallest error rate is selected from the candidate thresholds and determined as the similarity threshold for that speaker's class.

[0132] In this embodiment, by dynamically adjusting the threshold for different speaker categories during the iterative training of the model, the model can improve its overall recognition performance and determine the most suitable judgment threshold based on the voice characteristics of different speakers, thereby improving the accuracy and reliability of the model in recognizing the authenticity of different speakers' speech.

[0133] As one embodiment of this application, the step of determining the error rate corresponding to each candidate threshold based on the speaker's validation set may include:

[0134] The feature extraction module extracts the fusion vector of the speech data in the validation set, and the classification module classifies the fusion vector of the speech data to determine the speaker category that is closest to the speech data.

[0135] Based on the speaker category, obtain the corresponding centroid and all candidate thresholds;

[0136] The judgment module calculates the cosine distance between the centroid and the fusion vector; the cosine distance is compared with each candidate threshold to obtain the corresponding identification result;

[0137] The identification result is compared with the label of the voice data; if the identification result is inconsistent with the label, the identification result is judged as incorrect.

[0138] For each candidate threshold, determine the number of incorrect identification results for all speech data in the validation set;

[0139] The error rate is calculated based on the number of identification results that are incorrect, and is used as the error rate corresponding to the candidate threshold.

[0140] In some embodiments, the error rate corresponding to each candidate threshold is determined as follows:

[0141] (1) For each speaker’s validation set, the fusion vector is extracted from the speech data in the validation set, and the responsibility value is calculated based on the fusion vector by the classification module, thereby determining the speaker category that the speech data is closest to.

[0142] (2) Obtain the centroid of that speaker category and calculate the cosine distance between the centroid and the fusion vector;

[0143] (3) Traverse all candidate thresholds for the speaker category, compare the cosine distance with each candidate threshold, and obtain the corresponding identification result;

[0144] (4) Compare the identification result with the label of the speech data. If they do not match, the identification result is determined to be incorrect. For example, if the identification result is real speech but the label is synthetic speech, the identification result is determined to be incorrect.

[0145] (5) For each candidate threshold, calculate the error rate based on the number of incorrect identification results for all speech data in the validation set:

[0146] Where FP is the number of synthesized speech samples mistakenly identified as real speech, and FN is the number of real speech samples mistakenly identified as synthesized speech.
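The traversal of candidate thresholds can be sketched as follows. The sketch assumes that the error rate is (FP + FN) divided by the total number of validation samples; the denominator, the example scores, and the function name are assumptions.

```python
import numpy as np

def best_threshold(scores, is_real):
    """scores : similarity of each validation sample to its category centroid
    is_real: boolean array, True for real speech, False for synthesized speech
    Returns the candidate threshold with the lowest error rate and that rate."""
    candidates = np.linspace(0.01, 1.0, 100)    # 0.01, 0.02, ..., 1.00 (0 excluded)
    error_rates = []
    for th in candidates:
        pred_real = scores >= th                # decision rule of the judgment module
        fp = np.sum(pred_real & ~is_real)       # synthesized judged as real
        fn = np.sum(~pred_real & is_real)       # real judged as synthesized
        error_rates.append((fp + fn) / len(scores))
    error_rates = np.array(error_rates)
    best = int(np.argmin(error_rates))          # first minimum if several tie
    return float(candidates[best]), float(error_rates[best])

scores = np.array([0.95, 0.90, 0.85, 0.605, 0.55, 0.40])
is_real = np.array([True, True, True, False, False, False])
threshold, error_rate = best_threshold(scores, is_real)
```

Here every candidate between 0.61 and 0.85 separates the two groups perfectly; `np.argmin` returns the first of them, so the selected threshold is 0.61. How ties are broken is a design choice that this application does not specify.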

[0147] As one embodiment of this application, the step of processing the target speech data through the discrimination model to obtain the discrimination result may include:

[0148] The feature extraction module extracts feature vectors of multiple dimensions from the target speech data and generates a fusion vector.

[0149] The fusion vector is categorized by the classification module to determine the speaker category that is closest to the target speech data;

[0150] Based on the speaker category, obtain the corresponding centroid and similarity threshold;

[0151] The steps of calculating the cosine distance between the centroid and the fusion vector through the judgment module and comparing the cosine distance with the similarity threshold to obtain the identification result may include: if the cosine distance is greater than or equal to the similarity threshold, determining that the target speech data is the speaker's real speech data; if the cosine distance is less than the similarity threshold, determining that the target speech data is synthetic speech data.

[0152] In one embodiment, the target speech data to be identified is input into a trained identification model, and the identification model processes the target speech data to obtain the identification result. The process is as follows:

[0153] First, a feature extraction module extracts multi-dimensional feature vectors from the target speech data and generates a fusion vector. Then, a classification module processes this fusion vector, calculates the corresponding responsibility value, and further estimates the closest speaker category based on the responsibility value. Next, the centroid of the speaker category and the corresponding similarity threshold are obtained. A judgment module processes the fusion vector, calculating the cosine distance between the centroid and the fusion vector. Finally, the cosine distance is compared with the similarity threshold of the speaker category. If the cosine distance is greater than or equal to the similarity threshold, the target speech data is determined to be the speaker's genuine speech; if the cosine distance is less than the similarity threshold, the target speech data is determined to be synthesized speech that imitates the speaker.

[0154] Based on the same inventive concept, one embodiment of this application provides a synthetic speech identification device. Figure 5 is a schematic diagram of a synthetic speech identification device 100 proposed in an embodiment of this application. As shown in Figure 5, the device includes: a preprocessing module 101, a training module 102, and an identification module 103.

[0155] The preprocessing module 101 is configured to construct a target dataset based on real speech data and synthesized speech data from multiple speakers; wherein each speech data has a corresponding speaker label; and to construct an identification model, wherein the identification model includes a feature extraction module, a classification module, and a judgment module; the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors; the classification module is used to cluster all fusion vectors to generate multiple speaker categories; wherein each speaker category has a corresponding centroid and a similarity threshold; and the judgment module is used to determine whether any speech data is synthesized speech based on each speaker category.

[0156] Training module 102 is configured to train the discrimination model using the target dataset.

[0157] The identification module 103 is configured to process the target speech data through the discrimination model after the model training is completed to obtain the discrimination result; wherein the discrimination result is used to indicate whether the target speech data is synthetic speech.

[0158] As one embodiment of this application, the preprocessing module 101 is further configured to:

[0159] Acquire real speech data from multiple speakers and their corresponding synthesized speech data;

[0160] Perturbation noise is added to the synthesized speech data; wherein the perturbation noise is generated based on the gradient of the target loss function of the discrimination model;

[0161] Add corresponding speaker tags to each real speech data and synthesized speech data;

[0162] The target dataset is constructed based on all speech data carrying speaker labels.

[0163] As one embodiment of this application, the training module 102 is further configured as follows:

[0164] The feature extraction module extracts feature vectors from multiple dimensions of each speech data in the target dataset and generates corresponding fusion vectors.

[0165] The classification module clusters all fused vectors to obtain multiple speaker categories and determines the centroid of each speaker category.

[0166] Construct a validation set for each speaker; wherein the validation set includes: the speaker's real speech data and / or synthesized speech data;

[0167] Determine the corresponding speaker class similarity threshold based on the validation set of each speaker;

[0168] Construct a target loss function and train the discrimination model based on the target loss function.

[0169] As one embodiment of this application, the training module 102 is further configured as follows:

[0170] A first loss function is constructed based on Euclidean distance and speaker labels; this first loss function is used to optimize the feature extraction accuracy of the feature extraction module.

[0171] A second loss function is constructed based on Gaussian density; the second loss function is used to optimize the clustering accuracy of the classification module.

[0172] The target loss function is constructed based on the first loss function and the second loss function.

[0173] As one embodiment of this application, the multi-dimensional feature vector includes: spectrogram, Mel frequency cepstral coefficients, and fundamental frequency.

[0174] Training module 102 is also configured as follows:

[0175] Perform a short-time Fourier transform on the speech data to extract the speech data's spectrogram;

[0176] Calculate the Mel frequency energy spectrum of the speech data, and calculate the Mel frequency cepstral coefficients based on the Mel frequency energy spectrum;

[0177] Use the librosa library to extract the fundamental frequency of the speech data;

[0178] Convolve the spectrogram, Mel frequency cepstral coefficients, and fundamental frequency to generate a multi-dimensional feature vector;

[0179] Feature vectors from multiple dimensions are fused to generate a fused vector.

[0180] In one embodiment of this application, the training module 102 is further configured as follows:

[0181] Get the pre-set number of clusters;

[0182] Based on the number of clusters, a probability density function is constructed using a Gaussian mixture model to cluster all fused vectors;

[0183] The expectation-maximization algorithm is used to determine the parameters of the probability density function, including weights, mean, and variance.

[0184] The mean of each Gaussian component in the probability density function is used as the centroid of the corresponding speaker category.

[0185] As one embodiment of this application, the training module 102 is further configured as follows:

[0186] For each speaker category, determine multiple candidate thresholds;

[0187] Iterate through each candidate threshold and determine the error rate corresponding to each candidate threshold based on the speaker's validation set;

[0188] In each round of training, the candidate threshold corresponding to the lowest error rate is determined as the similarity threshold for the speaker category.

[0189] As one embodiment of this application, the training module 102 is further configured as follows:

[0190] The feature extraction module extracts the fusion vector of the speech data in the validation set, and the classification module classifies the fusion vector of the speech data to determine the speaker category that is closest to the speech data.

[0191] Based on the speaker category, obtain the corresponding centroid and all candidate thresholds;

[0192] The judgment module calculates the cosine distance between the centroid and the fusion vector; the cosine distance is compared with each candidate threshold to obtain the corresponding identification result;

[0193] The identification result is compared with the label of the voice data; if the identification result is inconsistent with the label, the identification result is determined to be incorrect.

[0194] For each candidate threshold, determine the number of incorrect identification results for all speech data in the validation set;

[0195] The error rate is calculated based on the number of identification results that are incorrect, and is used as the error rate corresponding to the candidate threshold.

[0196] As one embodiment of this application, the identification module 103 is further configured as follows:

[0197] The feature extraction module extracts feature vectors from multiple dimensions of the target speech data and generates a fusion vector.

[0198] The fusion vectors are categorized by the classification module to determine the speaker category that is closest to the target speech data;

[0199] Based on the speaker category, obtain the corresponding centroid and similarity threshold;

[0200] The judgment module calculates the cosine distance between the centroid and the fusion vector; the cosine distance is compared with the similarity threshold to obtain the identification result.

[0201] As one embodiment of this application, the identification module 103 is further configured to: determine that the target speech data is the speaker's real speech data when the cosine distance is greater than or equal to the similarity threshold; and determine that the target speech data is synthetic speech data when the cosine distance is less than the similarity threshold.

[0202] Based on the same inventive concept, one embodiment of this application provides a synthetic speech identification system. Figure 6 is a schematic diagram of a synthetic speech identification system proposed in an embodiment of this application. As shown in Figure 6, the system includes: the synthetic speech identification device provided in the above embodiment, a data acquisition unit, and a speech conversion unit.

[0203] The data acquisition unit is used to collect real speech data from multiple speakers.

[0204] The speech conversion unit is configured to generate corresponding transcribed text based on real speech data, and to process the transcribed text using a speech conversion model to generate corresponding synthetic speech data.

[0205] In some embodiments, the system includes a data acquisition unit, a speech conversion unit, and a synthesized speech discrimination device. The data acquisition unit acquires raw speech from multiple speakers, the speech conversion unit generates high-quality transcribed text based on the raw speech, and then generates synthesized speech data based on the transcribed text. The data acquisition unit can continuously update the training samples in the target dataset, for example, by adding speech data samples from new speakers or increasing the number of existing speaker speech data samples. Based on this, the discrimination model can be continuously updated, enabling the model to cope with synthesized speech generated by new speech synthesis methods or the voice characteristics of new speakers.

[0206] An embodiment of this application also provides a readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps in the synthesized speech identification method as described in any of the above embodiments of this application.

[0207] One embodiment of this application also provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps in the synthesized speech identification method as described in any of the above embodiments of this application.

[0208] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0209] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

[0210] For the sake of simplicity, the method embodiments are described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and components involved are not necessarily essential to this application.

[0211] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0212] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more blocks of the flowchart illustrations and / or one or more blocks of the block diagrams.

[0213] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and / or one or more block diagrams.

[0214] These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable terminal equipment, provide steps for implementing the functions specified in one or more flowcharts and / or one or more block diagrams.

[0215] Although preferred embodiments of this application have been described, those skilled in the art, once they understand the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, this application is to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of this application.

[0216] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0217] The synthesized speech identification method, apparatus, system, storage medium, and device provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and its core ideas. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

[0218] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0219] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A synthesized speech identification method, comprising:
constructing a target dataset based on real speech data and synthesized speech data of multiple speakers, wherein each piece of speech data has a corresponding speaker label;
constructing an identification model comprising a feature extraction module, a classification module, and a judgment module, wherein the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors, the classification module is used to cluster all fusion vectors to generate multiple speaker categories, each speaker category having a corresponding centroid and similarity threshold, and the judgment module is used to determine, based on the speaker categories, whether any piece of speech data is synthesized speech;
training the identification model using the target dataset; and
after the model training is completed, processing target speech data through the identification model to obtain an identification result, wherein the identification result indicates whether the target speech data is synthesized speech.

2. The synthesized speech identification method according to claim 1, wherein constructing the target dataset based on real speech data and synthesized speech data of multiple speakers comprises:
acquiring real speech data of multiple speakers and the corresponding synthesized speech data;
adding perturbation noise to the synthesized speech data, wherein the perturbation noise is generated based on the gradient of the target loss function of the identification model;
adding a corresponding speaker label to each piece of real speech data and synthesized speech data; and
constructing the target dataset based on all speech data carrying speaker labels.
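The gradient-based perturbation in claim 2 can be illustrated with a minimal sketch. The claim only states that the noise is generated from the gradient of the target loss function; the sign-of-gradient form with a fixed magnitude `epsilon` used here is an assumption, and `toy_loss` (squared distance to a centroid) is a hypothetical stand-in for the actual target loss, chosen because its gradient can be written by hand.

```python
import numpy as np

def toy_loss(x, centroid):
    """Hypothetical stand-in for the target loss: squared distance to a centroid."""
    return float(np.sum((x - centroid) ** 2))

def toy_loss_gradient(x, centroid):
    """Analytic gradient of toy_loss with respect to the input x."""
    return 2.0 * (x - centroid)

def gradient_perturbation(gradient, epsilon):
    """Noise of magnitude epsilon along the sign of the gradient (assumed form)."""
    return epsilon * np.sign(gradient)

# Toy "synthesized speech" samples and a toy centroid, for illustration only.
synthesized = np.array([0.2, -0.1, 0.5])
centroid = np.array([0.0, 0.0, 0.5])

noise = gradient_perturbation(toy_loss_gradient(synthesized, centroid), epsilon=0.01)
perturbed = synthesized + noise  # perturbed sample that would enter the target dataset
```

In this toy setting the perturbation moves each sample in the direction that increases the loss, which is one way to obtain harder synthesized examples for training.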

3. The synthesized speech identification method according to claim 1, wherein training the identification model using the target dataset comprises:
extracting, by the feature extraction module, feature vectors of multiple dimensions from each piece of speech data in the target dataset and generating corresponding fusion vectors;
clustering, by the classification module, all fusion vectors to obtain multiple speaker categories and determining the centroid of each speaker category;
constructing a validation set for each speaker, wherein the validation set includes the speaker's real speech data and/or synthesized speech data;
determining the similarity threshold of the corresponding speaker category based on the validation set of each speaker; and
constructing a target loss function and training the identification model based on the target loss function.

4. The synthesized speech identification method according to claim 3, wherein constructing the target loss function comprises:
constructing a first loss function based on Euclidean distance and speaker labels, wherein the first loss function is used to optimize the feature extraction accuracy of the feature extraction module;
constructing a second loss function based on Gaussian density, wherein the second loss function is used to optimize the clustering accuracy of the classification module; and
constructing the target loss function based on the first loss function and the second loss function.
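Claim 4 does not fix the exact formulas, so the following is only one possible reading, sketched under stated assumptions: the first loss is taken as the mean squared Euclidean distance between each fusion vector and the centroid of its labelled speaker, the second as the negative mean log-likelihood under a diagonal-covariance Gaussian mixture, and the target loss as their weighted sum. The function names and the weights `alpha` and `beta` are hypothetical.

```python
import numpy as np

def euclidean_label_loss(embeddings, labels, centroids):
    """Mean squared Euclidean distance to the centroid of the labelled speaker."""
    diffs = embeddings - centroids[labels]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

def gaussian_density_loss(embeddings, weights, means, variances):
    """Negative mean log-likelihood under a diagonal-covariance Gaussian mixture."""
    diff = embeddings[:, None, :] - means[None, :, :]            # (n, k, d)
    log_pdf = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                      + np.sum(np.log(2 * np.pi * variances), axis=1)[None])
    weighted = np.log(weights)[None] + log_pdf                   # (n, k)
    peak = weighted.max(axis=1, keepdims=True)                   # stable log-sum-exp
    log_lik = peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))
    return float(-np.mean(log_lik))

def target_loss(embeddings, labels, centroids, weights, variances,
                alpha=1.0, beta=1.0):
    """Weighted sum of the two losses; the centroids double as mixture means."""
    first = euclidean_label_loss(embeddings, labels, centroids)
    second = gaussian_density_loss(embeddings, weights, centroids, variances)
    return alpha * first + beta * second

# Toy data: two speakers in a 2-D fusion space.
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 1])
weights = np.array([0.5, 0.5])
variances = np.ones((2, 2))
on_centroid = centroids.copy()
off_centroid = centroids + np.array([1.0, 0.0])
```

With this reading, vectors that sit on their speaker's centroid give a first loss of zero, and both terms grow as vectors drift away from the centroids.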

5. The synthesized speech identification method according to claim 3, wherein the feature vectors of multiple dimensions include a spectrogram, Mel-frequency cepstral coefficients, and a fundamental frequency, and wherein extracting feature vectors of multiple dimensions from each piece of speech data in the target dataset and generating corresponding fusion vectors comprises:
performing a short-time Fourier transform on the speech data to extract the spectrogram of the speech data;
calculating the Mel-frequency energy spectrum of the speech data, and calculating the Mel-frequency cepstral coefficients based on the Mel-frequency energy spectrum;
extracting the fundamental frequency of the speech data using the librosa library;
performing convolution on the spectrogram, the Mel-frequency cepstral coefficients, and the fundamental frequency to generate the feature vectors of multiple dimensions; and
fusing the feature vectors of multiple dimensions to generate a fusion vector.
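The feature pipeline of claim 5 can be sketched with NumPy alone. Several simplifications are assumptions of this sketch and not the claimed method: the fundamental frequency is estimated by plain autocorrelation instead of the librosa library, the convolution step is replaced by averaging over time, and fusion is plain concatenation. All function names and parameter defaults (`n_fft`, `hop`, `n_mels`, `n_mfcc`, `fmin`, `fmax`) are hypothetical.

```python
import numpy as np

def frame_signal(y, n_fft=512, hop=256):
    """Slice the waveform into overlapping frames of length n_fft."""
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx]                                        # (n_frames, n_fft)

def stft_magnitude(frames):
    """Magnitude spectrogram from a windowed short-time Fourier transform."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))  # (n_frames, n_fft//2+1)

def mel_filterbank(sr, n_fft, n_mels=26):
    """Triangular filters spaced evenly on the Mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(magnitude, sr, n_fft, n_mels=26, n_mfcc=13):
    """Mel energy spectrum -> log -> DCT-II gives the cepstral coefficients."""
    mel_energy = (magnitude ** 2) @ mel_filterbank(sr, n_fft, n_mels).T
    log_mel = np.log(mel_energy + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n[None, :] + 1)
                 / (2 * n_mels))
    return log_mel @ dct.T                               # (n_frames, n_mfcc)

def f0_autocorr(frames, sr, fmin=60.0, fmax=400.0):
    """Per-frame pitch from the autocorrelation peak (unvoiced frames not handled)."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
        f0[i] = sr / lag
    return f0

def fusion_vector(y, sr, n_fft=512, hop=256):
    """Time-average each feature and concatenate (stand-in for convolution + fusion)."""
    frames = frame_signal(y, n_fft, hop)
    spec = stft_magnitude(frames)
    coeffs = mfcc(spec, sr, n_fft)
    f0 = f0_autocorr(frames, sr)
    return np.concatenate([np.log(spec + 1e-10).mean(axis=0),
                           coeffs.mean(axis=0), [f0.mean()]])

# Usage on a synthetic 250 Hz tone, one second at 16 kHz.
sr = 16000
tone = np.sin(2 * np.pi * 250.0 * np.arange(sr) / sr)
fused = fusion_vector(tone, sr)   # 257 spectrogram bins + 13 MFCCs + 1 pitch value
```

The last entry of `fused` is the average pitch estimate, which for this tone should land close to 250 Hz; real speech would need voiced/unvoiced handling that this sketch omits.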

6. The synthesized speech identification method according to claim 3, wherein clustering all fusion vectors to obtain multiple speaker categories and determining the centroid of each speaker category comprises:
obtaining a preset number of clusters;
constructing, based on the number of clusters, a probability density function using a Gaussian mixture model to cluster all fusion vectors;
determining the parameters of the probability density function using the expectation-maximization algorithm, wherein the parameters include weights, means, and variances; and
determining the mean of each Gaussian component in the probability density function as the centroid of the corresponding speaker category.
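The clustering step of claim 6 can be sketched as a small expectation-maximization loop in NumPy. This is an illustrative sketch, assuming diagonal covariances and a deterministic farthest-point initialization of the means; neither choice is stated in the claim, and the function name `fit_gmm` and its parameters are hypothetical.

```python
import numpy as np

def fit_gmm(X, n_clusters, n_iter=50, eps=1e-6):
    """Fit a diagonal-covariance Gaussian mixture with EM.

    Returns (weights, means, variances); each row of means is the
    centroid of one speaker category.
    """
    # Farthest-point initialization: start from X[0], then repeatedly take
    # the sample farthest from the means chosen so far (an assumption).
    means = [X[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(d2)])
    means = np.array(means, dtype=float)
    variances = np.tile(X.var(axis=0) + eps, (n_clusters, 1))
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: responsibilities from weighted Gaussian log-densities.
        diff = X[:, None, :] - means[None, :, :]                  # (n, k, d)
        log_pdf = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                          + np.sum(np.log(2 * np.pi * variances), axis=1)[None])
        log_resp = np.log(weights)[None] + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff ** 2) / nk[:, None] + eps
    return weights, means, variances

# Toy fusion vectors from two well-separated "speakers".
rng = np.random.default_rng(0)
speaker_a = rng.normal(0.0, 0.5, size=(100, 2))
speaker_b = rng.normal(10.0, 0.5, size=(100, 2))
weights, centroids, variances = fit_gmm(np.vstack([speaker_a, speaker_b]), 2)
```

Each row of `centroids` is the mean of one Gaussian component, which is the quantity the claim uses as the centroid of a speaker category.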

7. The synthesized speech identification method according to claim 3, wherein determining the similarity threshold of the corresponding speaker category based on the validation set of each speaker comprises:
determining multiple candidate thresholds for each speaker category;
iterating through the candidate thresholds and determining the error rate corresponding to each candidate threshold based on the speaker's validation set; and
in each round of training, determining the candidate threshold with the lowest error rate as the similarity threshold of the speaker category.

8. The synthesized speech identification method according to claim 7, wherein determining the error rate corresponding to each candidate threshold based on the speaker's validation set comprises:
extracting, by the feature extraction module, the fusion vector of the speech data in the validation set, and classifying the fusion vector by the classification module to determine the speaker category closest to the speech data;
obtaining, based on the speaker category, the corresponding centroid and all candidate thresholds;
calculating, by the judgment module, the cosine distance between the centroid and the fusion vector, and comparing the cosine distance with each candidate threshold to obtain the corresponding identification result;
comparing the identification result with the label of the speech data, and determining that the identification result is incorrect if it is inconsistent with the label;
for each candidate threshold, determining the number of pieces of speech data in the validation set whose identification result is incorrect; and
calculating the error rate based on the number of incorrect identification results, and using it as the error rate corresponding to the candidate threshold.
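The threshold search of claims 7 and 8 can be sketched as follows. The sketch assumes that the score of each validation item against its centroid has already been computed, and that a score greater than or equal to the threshold is read as real speech (the comparison rule stated in claim 10). The function names and the toy values are hypothetical.

```python
def error_rate(scores, is_real_labels, threshold):
    """Fraction of validation items whose identification result contradicts the label."""
    errors = 0
    for score, is_real in zip(scores, is_real_labels):
        predicted_real = score >= threshold   # assumed comparison rule
        if predicted_real != is_real:
            errors += 1
    return errors / len(scores)

def select_threshold(scores, is_real_labels, candidates):
    """Return the candidate threshold with the lowest error rate and that rate."""
    best = min(candidates, key=lambda t: error_rate(scores, is_real_labels, t))
    return best, error_rate(scores, is_real_labels, best)

# Toy validation set for one speaker category.
scores = [0.95, 0.90, 0.70, 0.60]          # score of each item against the centroid
is_real_labels = [True, True, False, False]
candidates = [0.5, 0.65, 0.8, 0.92]

threshold, rate = select_threshold(scores, is_real_labels, candidates)
```

Here the candidate 0.8 separates the two real items from the two synthesized ones, so it is selected with an error rate of 0.0. When several candidates tie, `min` keeps the first one in the list.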

9. The synthesized speech identification method according to claim 1, wherein processing the target speech data through the identification model to obtain the identification result comprises:
extracting, by the feature extraction module, feature vectors of multiple dimensions from the target speech data and generating a fusion vector;
classifying the fusion vector by the classification module to determine the speaker category closest to the target speech data;
obtaining, based on the speaker category, the corresponding centroid and similarity threshold; and
calculating, by the judgment module, the cosine distance between the centroid and the fusion vector, and comparing the cosine distance with the similarity threshold to obtain the identification result.

10. The synthesized speech identification method according to claim 9, wherein comparing the cosine distance with the similarity threshold to obtain the identification result comprises:
if the cosine distance is greater than or equal to the similarity threshold, determining that the target speech data is the speaker's real speech data; and
if the cosine distance is less than the similarity threshold, determining that the target speech data is synthesized speech data.
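The decision step of claims 9 and 10 can be sketched as below. Two points are assumptions of this sketch: the quantity the claims call cosine distance is computed as the cosine of the angle between the two vectors, since larger values are treated as indicating real speech, and the closest speaker category is taken to be the one with the highest such score. The function names and the toy categories are hypothetical.

```python
import math

def cosine_score(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def identify(fusion_vector, categories):
    """categories maps a speaker name to (centroid, similarity_threshold)."""
    best_name, best_score = None, -math.inf
    for name, (centroid, _threshold) in categories.items():
        score = cosine_score(fusion_vector, centroid)
        if score > best_score:
            best_name, best_score = name, score
    _centroid, threshold = categories[best_name]
    # Rule of claim 10: at or above the threshold is real, below is synthesized.
    result = "real" if best_score >= threshold else "synthesized"
    return best_name, best_score, result

# Toy speaker categories in a 2-D fusion space.
categories = {
    "speaker_a": ([1.0, 0.0], 0.9),
    "speaker_b": ([0.0, 1.0], 0.8),
}
close_match = identify([0.9, 0.1], categories)
weak_match = identify([0.6, 0.5], categories)
```

In this toy example the first vector scores about 0.99 against `speaker_a` and is reported as real, while the second scores about 0.77 against the same category, falls below that category's threshold of 0.9, and is reported as synthesized.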

11. A synthesized speech identification device, comprising:
a preprocessing module configured to construct a target dataset based on real speech data and synthesized speech data of multiple speakers, wherein each piece of speech data has a corresponding speaker label, and to construct an identification model comprising a feature extraction module, a classification module, and a judgment module, wherein the feature extraction module is used to extract feature vectors of multiple dimensions from the speech data in the target dataset and generate fusion vectors, the classification module is used to cluster all fusion vectors to generate multiple speaker categories, each speaker category having a corresponding centroid and similarity threshold, and the judgment module is used to determine, based on the speaker categories, whether any piece of speech data is synthesized speech;
a training module configured to train the identification model using the target dataset; and
an identification module configured to process target speech data through the identification model after the model training is completed to obtain an identification result, wherein the identification result indicates whether the target speech data is synthesized speech.

12. The synthesized speech identification device according to claim 11, wherein the preprocessing module is further configured to:
acquire real speech data of multiple speakers and the corresponding synthesized speech data;
add perturbation noise to the synthesized speech data, wherein the perturbation noise is generated based on the gradient of the target loss function of the identification model;
add a corresponding speaker label to each piece of real speech data and synthesized speech data; and
construct the target dataset based on all speech data carrying speaker labels.

13. The synthesized speech identification device according to claim 11, wherein the training module is further configured to:
extract, through the feature extraction module, feature vectors of multiple dimensions from each piece of speech data in the target dataset and generate corresponding fusion vectors;
cluster, through the classification module, all fusion vectors to obtain multiple speaker categories and determine the centroid of each speaker category;
construct a validation set for each speaker, wherein the validation set includes the speaker's real speech data and/or synthesized speech data;
determine the similarity threshold of the corresponding speaker category based on the validation set of each speaker; and
construct a target loss function and train the identification model based on the target loss function.

14. The synthesized speech identification device according to claim 13, wherein the training module is further configured to:
construct a first loss function based on Euclidean distance and speaker labels, wherein the first loss function is used to optimize the feature extraction accuracy of the feature extraction module;
construct a second loss function based on Gaussian density, wherein the second loss function is used to optimize the clustering accuracy of the classification module; and
construct the target loss function based on the first loss function and the second loss function.

15. The synthesized speech identification device according to claim 13, wherein the feature vectors of multiple dimensions include a spectrogram, Mel-frequency cepstral coefficients, and a fundamental frequency, and the training module is further configured to:
perform a short-time Fourier transform on the speech data to extract the spectrogram of the speech data;
calculate the Mel-frequency energy spectrum of the speech data, and calculate the Mel-frequency cepstral coefficients based on the Mel-frequency energy spectrum;
extract the fundamental frequency of the speech data using the librosa library;
perform convolution on the spectrogram, the Mel-frequency cepstral coefficients, and the fundamental frequency to generate the feature vectors of multiple dimensions; and
fuse the feature vectors of multiple dimensions to generate a fusion vector.

16. The synthesized speech identification device according to claim 13, wherein the training module is further configured to:
obtain a preset number of clusters;
construct, based on the number of clusters, a probability density function using a Gaussian mixture model to cluster all fusion vectors;
determine the parameters of the probability density function using the expectation-maximization algorithm, wherein the parameters include weights, means, and variances; and
determine the mean of each Gaussian component in the probability density function as the centroid of the corresponding speaker category.

17. The synthesized speech identification device according to claim 13, wherein the training module is further configured to:
determine multiple candidate thresholds for each speaker category;
iterate through the candidate thresholds and determine the error rate corresponding to each candidate threshold based on the speaker's validation set; and
in each round of training, determine the candidate threshold with the lowest error rate as the similarity threshold of the speaker category.

18. The synthesized speech identification device according to claim 17, wherein the training module is further configured to:
extract, through the feature extraction module, the fusion vector of the speech data in the validation set, and classify the fusion vector through the classification module to determine the speaker category closest to the speech data;
obtain, based on the speaker category, the corresponding centroid and all candidate thresholds;
calculate, through the judgment module, the cosine distance between the centroid and the fusion vector, and compare the cosine distance with each candidate threshold to obtain the corresponding identification result;
compare the identification result with the label of the speech data, and determine that the identification result is incorrect if it is inconsistent with the label;
for each candidate threshold, determine the number of pieces of speech data in the validation set whose identification result is incorrect; and
calculate the error rate based on the number of incorrect identification results, and use it as the error rate corresponding to the candidate threshold.

19. The synthesized speech identification device according to claim 11, wherein the identification module is further configured to:
extract, through the feature extraction module, feature vectors of multiple dimensions from the target speech data and generate a fusion vector;
classify the fusion vector through the classification module to determine the speaker category closest to the target speech data;
obtain, based on the speaker category, the corresponding centroid and similarity threshold; and
calculate, through the judgment module, the cosine distance between the centroid and the fusion vector, and compare the cosine distance with the similarity threshold to obtain the identification result.

20. The synthesized speech identification device according to claim 19, wherein the identification module is further configured to:
if the cosine distance is greater than or equal to the similarity threshold, determine that the target speech data is the speaker's real speech data; and
if the cosine distance is less than the similarity threshold, determine that the target speech data is synthesized speech data.

21. A synthesized speech identification system, comprising the synthesized speech identification device according to claim 11, a data acquisition unit, and a speech conversion unit, wherein:
the data acquisition unit is used to collect real speech data from multiple speakers; and
the speech conversion unit is configured to generate corresponding transcribed text based on the real speech data, and to process the transcribed text using a speech conversion model to generate corresponding synthesized speech data.

22. A non-volatile computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method as described in any one of claims 1-10.

23. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method as claimed in any one of claims 1-10.