Deep learning-based adaptive voice speed playback system, method therefor, and computer program therefor

The deep learning-based system dynamically adjusts phoneme speeds per sentence to enhance speech playback quality and naturalness, addressing the limitations of uniform speed ratios in existing systems.

WO2026127353A1 · Publication Date: 2026-06-18 · INDUSTRY UNIVERSITY COOPERATION FOUNDATION HANYANG UNIVERSITY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
INDUSTRY UNIVERSITY COOPERATION FOUNDATION HANYANG UNIVERSITY
Filing Date
2025-10-24
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing speech speed playback systems apply uniform speed ratios to all phonemes without considering sentence context, leading to signal distortion and poor performance, especially for synthesized speech.

Method used

A deep learning-based system in which a speech speed predictor estimates a phoneme-level pronunciation speed for each sentence from language features, and a generative model that integrates a variational autoencoder (VAE) with a flow model produces the speed-adjusted speech, so that speech speed adapts to sentence context.
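
As a rough illustration of how phoneme-level adjustment differs from a uniform ratio, the sketch below rescales per-phoneme durations by per-phoneme rates. The function name `scale_durations`, the example durations, and the convention that a rate above 1 shortens a phoneme are assumptions made for illustration. In the disclosed system the rates would come from the speech speed predictor rather than being hand-picked.

```python
def scale_durations(durations, rates):
    """Divide each phoneme duration (in frames) by its playback rate.

    Assumed convention: rate > 1 plays the phoneme faster (shorter duration),
    rate < 1 plays it slower (longer duration).
    """
    if len(durations) != len(rates):
        raise ValueError("durations and rates must have the same length")
    if any(r <= 0 for r in rates):
        raise ValueError("playback rates must be positive")
    return [d / r for d, r in zip(durations, rates)]


# Hypothetical durations, in frames, for three phonemes of one sentence.
durations = [4.0, 10.0, 6.0]

# Uniform ratio: every phoneme is sped up by the same factor.
uniform = scale_durations(durations, [2.0, 2.0, 2.0])

# Phoneme-level rates: here the long middle phoneme is compressed less
# than its neighbours (hand-picked values standing in for predictions).
adaptive = scale_durations(durations, [2.5, 1.25, 2.0])
```

With a uniform ratio the relative lengths of the phonemes never change, whereas per-phoneme rates let each phoneme contribute a different share of the shortened sentence.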

Benefits of technology

The system provides natural and flexible speech playback by adjusting phoneme speeds per sentence, improving sound quality and maintaining acoustic features like pitch and timbre, outperforming conventional methods in various evaluation metrics.

Generated by Eureka AI based on patent content.

Abstract

Disclosed are a deep learning-based adaptive voice speed playback system and a method therefor. The deep learning-based adaptive voice speed playback system according to one disclosed embodiment comprises: an encoding module extracting language features on the basis of an input text sequence, extracting acoustic features on the basis of original voice data, predicting a phoneme-level pronunciation playback rate for each sentence on the basis of the language features, and combining Gaussian upsampled language features and the acoustic features using the predicted phoneme-level pronunciation playback rate to output adaptive acoustic feature data in which the pronunciation speed of each phoneme is dynamically adjusted for each sentence; and a decoding module providing a speed-adjusted voice signal in the form of an original voice waveform on the basis of the adaptive acoustic feature data using a deep learning-based generative model, wherein the deep learning-based generative model is generated by integrating a variational autoencoder (VAE) for modeling a probability distribution of the adaptive acoustic feature data and a flow model for converting the probability distribution predicted by the VAE into a mel-spectrogram.
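
The abstract does not spell out how the Gaussian upsampling is computed. The sketch below shows one common formulation, in which each phoneme contributes to every output frame with a weight given by a normalized Gaussian centered on that phoneme's time span. The function name `gaussian_upsample`, the fixed `sigmas`, and the toy feature vectors are assumptions for illustration, not the patent's implementation. The durations passed in are assumed to have already been adjusted by the predicted phoneme-level playback rates.

```python
import math


def gaussian_upsample(features, durations, sigmas):
    """Expand per-phoneme feature vectors to per-frame feature vectors.

    features:  one vector (list of floats) per phoneme
    durations: per-phoneme duration in frames, already speed-adjusted
    sigmas:    per-phoneme Gaussian width in frames (must be positive)
    """
    # Center of each phoneme = its cumulative end time minus half its duration.
    centers, total = [], 0.0
    for d in durations:
        total += d
        centers.append(total - d / 2.0)

    n_frames = int(round(total))
    dim = len(features[0])
    frames = []
    for t in range(n_frames):
        pos = t + 0.5  # evaluate at the middle of the frame
        # Log of the Gaussian density, dropping the shared sqrt(2*pi) factor.
        log_w = [
            -((pos - c) ** 2) / (2.0 * s * s) - math.log(s)
            for c, s in zip(centers, sigmas)
        ]
        # Normalize in a numerically stable way so weights sum to 1 per frame.
        peak = max(log_w)
        w = [math.exp(v - peak) for v in log_w]
        norm = sum(w)
        w = [v / norm for v in w]
        # Each frame is a weighted mix of the phoneme vectors.
        frames.append(
            [sum(wi * f[k] for wi, f in zip(w, features)) for k in range(dim)]
        )
    return frames


# Two toy phonemes with one-dimensional features, two frames each.
frames = gaussian_upsample([[0.0], [1.0]], [2.0, 2.0], [0.5, 0.5])
```

Because the weights vary smoothly with frame position, the upsampled features blend across phoneme boundaries instead of jumping, and changing a phoneme's duration moves and stretches its region of influence without altering the feature vector itself.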