Synthesized speech identification method, apparatus and system, storage medium, and device

The application constructs an identification model that combines multi-dimensional feature extraction with clustering, addressing the low accuracy of existing technologies in identifying highly realistic AI-synthesized speech. It aims to recognize synthesized speech from different speakers and to improve the security of identity authentication.

WO2026123823A1 · Publication Date: 2026-06-18 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Application
Current Assignee / Owner: CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date: 2025-09-05
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Existing synthetic speech identification technologies have low accuracy on highly realistic AI-forged audio. In particular, they struggle to identify synthetic speech across speakers with different voice characteristics, which creates security risks for identity authentication.

Method used

A target dataset is constructed from real and synthetic speech data of multiple speakers. An identification model consisting of a feature extraction module, a classification module, and a determination module extracts feature vectors in multiple dimensions and clusters them to generate speaker categories. The authenticity of the speech data is then judged from each category's centroid and a similarity threshold.
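The centroid-and-threshold judgment described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names (`speaker_centroids`, `judge`), the use of cosine similarity, the threshold value of 0.9, and the toy two-dimensional vectors are all assumptions. Grouping by known speaker label stands in for the clustering step, and feature extraction is assumed to have already produced the vectors.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speaker_centroids(samples):
    """Average the feature vectors of each speaker category.

    `samples` is an iterable of (speaker_label, feature_vector) pairs taken
    from real speech. Grouping by label is an assumed stand-in for clustering.
    """
    grouped = defaultdict(list)
    for speaker, vector in samples:
        grouped[speaker].append(vector)
    return {
        speaker: [sum(column) / len(vectors) for column in zip(*vectors)]
        for speaker, vectors in grouped.items()
    }


def judge(vector, centroids, threshold=0.9):
    """Compare a target vector with the closest centroid.

    The speech is flagged as synthesized when its best similarity falls
    below the (assumed) threshold.
    """
    best_speaker, best_similarity = max(
        ((speaker, cosine_similarity(vector, centroid))
         for speaker, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )
    return {
        "speaker": best_speaker,
        "similarity": best_similarity,
        "is_synthesized": best_similarity < threshold,
    }


# Toy feature vectors; real embeddings would have many more dimensions.
real_samples = [
    ("alice", [1.0, 0.0]),
    ("alice", [0.9, 0.1]),
    ("bob", [0.0, 1.0]),
    ("bob", [0.1, 0.9]),
]
centroids = speaker_centroids(real_samples)
close_to_alice = judge([1.0, 0.05], centroids)
far_from_everyone = judge([1.0, 1.0], centroids)
```

In this sketch a vector near a speaker's centroid is accepted as real speech from that speaker, while a vector that is not close enough to any centroid is flagged as synthesized. How the threshold is chosen, and whether it is per speaker or global, is not stated in the summary.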

Benefits of technology

It improves the accuracy and reliability of identifying synthesized speech from different speakers, effectively recognizes highly realistic AI-synthesized speech, and enhances the security of identity authentication.

Generated by Eureka AI based on patent content.

Abstract

The present application discloses a synthesized speech identification method, apparatus and system, a storage medium, and a device. The method comprises: constructing a target data set on the basis of real speech data and synthesized speech data of multiple speakers, wherein each piece of speech data has a corresponding speaker label; constructing an identification model, wherein the identification model comprises a feature extraction module, a classification module, and a determination module; using the target data set to train the identification model; and after the model training is completed, processing target speech data by means of the identification model to obtain an identification result, wherein the identification result indicates whether the target speech data is synthesized speech.
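
The first step of the abstract, a target data set in which every piece of speech data carries a speaker label, might be represented as below. This is only a sketch under assumptions: the `SpeechSample` record, its field names, the `build_target_dataset` helper, and the file paths are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSample:
    """One piece of speech data with its speaker label (hypothetical layout)."""
    path: str
    speaker: str
    is_synthesized: bool


def build_target_dataset(real, synthesized):
    """Merge real and synthesized recordings into one labeled list.

    Both arguments map a speaker label to a list of audio file paths.
    """
    dataset = [
        SpeechSample(path, speaker, False)
        for speaker, paths in real.items()
        for path in paths
    ]
    dataset += [
        SpeechSample(path, speaker, True)
        for speaker, paths in synthesized.items()
        for path in paths
    ]
    return dataset


# Placeholder paths for illustration only.
dataset = build_target_dataset(
    real={"spk1": ["spk1_real_01.wav"]},
    synthesized={"spk1": ["spk1_tts_01.wav", "spk1_tts_02.wav"]},
)
```

Keeping the speaker label on both real and synthesized samples is what would let a later step group features per speaker before comparing target speech against them.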