A method for generating enhanced natural-sounding speech in a text-to-speech system

ISH modulation enhances the naturalness of TTS systems by using inaudible subsonic frequencies to replicate human-like intonation, addressing the limitations of conventional methods in capturing subtle speech variations and emotional cues, resulting in a more natural and adaptable synthetic speech output.

WO2026132974A1 · PCT designated stage · Publication Date: 2026-06-25 · NOS INOVAÇÃO SA

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NOS INOVAÇÃO SA
Filing Date
2025-12-05
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Conventional Text-to-Speech (TTS) systems struggle to generate synthetic speech that captures the natural nuances and expressive qualities of human speech, often resulting in unnatural or robotic-sounding outputs due to direct manipulation of audible features, failing to replicate the complex interplay of linguistic, prosodic, and emotional factors.

Method used

The application of Intonation-aware Subsonic Harmonics (ISH) modulation, which integrates inaudible subsonic frequencies (below 20 Hz) into the TTS system through a neural network-based Intonation Learning Module, Subsonic Harmonics Module, and Contextual Feedback Module, dynamically adjusting these frequencies based on contextual and emotional cues to enhance the perceived naturalness of synthesized speech.

Benefits of technology

This approach produces a more fluid and natural-sounding speech signal by subtly influencing prosodic features without altering the audible frequency spectrum, effectively conveying appropriate nuances in various conversational settings and seamlessly integrating into existing TTS systems.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure IB2025062481_25062026_PF_FP_ABST
Patent Text Reader

Abstract

A method is disclosed for generating enhanced natural-sounding speech in a Text-to-Speech (TTS) system (1). The method includes a sequence of steps to be executed by an Intonation-Aware Subsonic Harmonics (ISH) unit (2, 2.1, 2.2, 2.3, 2.4), which is integrated with the TTS system (1), in order to generate and integrate inaudible subsonic frequencies (f.) with a synthesized speech waveform (g.), and to dynamically adjust them based on modulation parameters (e.) and contextual feedback (c.) derivable from the text inputted to the TTS system (1). This procedure influences prosodic features and results in a more natural-sounding, fluid, and expressive synthetic speech signal (h.).

Description

[0001] DESCRIPTION

[0002] A METHOD FOR GENERATING ENHANCED NATURAL-SOUNDING SPEECH IN A TEXT-TO-SPEECH SYSTEM

[0003] TECHNICAL FIELD

[0004] The present disclosure falls within the technical field of speech synthesis. More particularly, it relates to systems and methods for improving the naturalness of synthetic speech in the context of Text-to-Speech frameworks.

[0005] PRIOR ART

[0006] Advances in speech synthesis technology have led to significant improvements in the intelligibility and quality of synthesized speech. Text-to-speech (TTS) systems, in particular, have become increasingly sophisticated, allowing for the conversion of written text into audible speech. Examples of these developments can be found in WO 2024 / 058147 or BEN HAYES ET AL [1]. However, despite these advancements, a key challenge that persists is the lack of naturalness in the generated speech signal.

[0007] In fact, synthetic speech, while often highly intelligible, frequently falls short in capturing the natural nuances and expressive qualities that characterize human speech. As the quality of TTS systems continues to improve, the perceived naturalness of the synthesized output has become a critical factor for user acceptance and usability.

[0008] Conventional methods for enhancing the naturalness of synthetic speech, such as those disclosed in TAKAMICHI SHINNOSUKE ET AL [2] and AL MASUM SHAIKH M ET AL [3], have focused on directly modifying various audible features, such as pitch, duration, and intensity. These approaches aim to mimic the prosodic patterns and rhythmic characteristics of natural speech. However, this direct manipulation of the audible frequency spectrum can sometimes lead to unnatural or robotic-sounding speech, as the synthetic output may lack the subtle variations and seamless transitions that are characteristic of human speech.

[0009] The challenge of achieving natural-sounding synthetic speech is further deepened by the complex interplay between linguistic, prosodic, and emotional factors that contribute to the perceived naturalness of human speech. Factors such as sentence structure, dialogue context, and emotional expression all play a crucial role in shaping the natural flow and cadence of speech, which is difficult to replicate accurately using conventional TTS techniques.

[0010] These limitations of conventional methodologies highlight the need for more innovative approaches that can more effectively capture the nuances and expressive qualities of human speech, while also ensuring seamless integration into TTS systems.

[0011] The present solution is intended to innovatively overcome such issues.

[0012] SUMMARY OF THE DISCLOSURE

[0013] The present disclosure relates to a methodology for improving the perceived naturalness of synthesized speech generated by a Text-to-Speech (TTS) system, through the application of Intonation-aware Subsonic Harmonic (ISH) modulation.

[0014] At the core of this ISH-based methodology are three key modules: an Intonation Learning Module, a Subsonic Harmonics Module, and a Contextual Feedback Module. More precisely, the Intonation Learning Module is able to learn the relationship between subsonic harmonic profiles and intonation patterns across various emotional and contextual scenarios. The Subsonic Harmonics Module generates and integrates inaudible subsonic frequencies (below 20 Hz) with the synthesized speech waveform obtained from a TTS system, dynamically adjusting them based on the Intonation Learning Module's output and on feedback, given by the Contextual Feedback Module, about contextual information derivable from the text inputted to the TTS system. This indirect approach to influencing prosodic features results in a more natural-sounding, fluid, and expressive synthetic speech signal, without directly altering the audible frequency spectrum, thereby offering a singular approach for addressing the state-of-the-art issues.

[0015] In this context, an object of the present disclosure is a method for generating enhanced natural-sounding speech in the context of a Text-to-Speech (TTS) framework.

[0016] DESCRIPTION OF FIGURES

[0017] Figure 1 refers to a block diagram illustrating the ISH methodology and its integration with a TTS system, to generate an enhanced natural-sounding speech signal. The numeric references represent:

[0018] 1 - TTS system to which the ISH unit is to be integrated;

[0019] 2 - ISH unit;

[0020] 2.1 - Contextual Feedback Module;

[0021] 2.2 - Intonation Learning Module;

[0022] 2.3 - Subsonic Harmonics Module;

[0023] 2.4 - Integration Module;

a. - Input text;

b. - Contextual information;

c. - Dynamic contextual feedback data;

d. - Phonetic and prosodic structure information;

e. - Subsonic harmonic modulation parameters;

f. - Subsonic harmonic signal;

g. - Preliminary speech waveform;

h. - Enhanced natural-sounding speech signal.

[0024] Figure 2 is a block diagram representation of a TTS system comprising an ISH unit. The numeric references represent:

1 - TTS system;

[0025] 1.1 - Text Analysis and Pre-Processing Unit;

[0026] 1.2 - Linguistic Processing Unit;

[0027] 1.3 - Prosodic Modeling Unit;

[0028] 1.4 - Speech waveform generator;

[0029] 2 - ISH unit;

[0030] 2.1 - Contextual Feedback Module;

[0031] 2.2 - Intonation Learning Module;

[0032] 2.3 - Subsonic Harmonics Module;

[0033] 2.4 - Integration Module;

a. - Input text;

a1. - Text-analysis output;

a2. - Linguistic output;

a3. - Prosodic output;

b. - Contextual information;

c. - Dynamic contextual feedback data;

d. - Phonetic and prosodic structure information;

e. - Subsonic harmonic modulation parameters;

f. - Subsonic harmonic signal;

g. - Preliminary speech waveform;

h. - Enhanced natural-sounding speech signal.

[0034] DETAILED DESCRIPTION

[0035] The more general configurations of the present disclosure are described in the Summary of the disclosure. Such configurations are detailed below in accordance with other advantageous and / or preferred embodiments of implementation of the present disclosure.

[0036] It is disclosed an Intonation-aware Subsonic Harmonics (ISH) methodology for improving the perceived naturalness of synthesized speech signals generated by a Text-To-Speech framework. The application of ISH modulation leverages inaudible subsonic frequencies (below 20 Hz) to subtly influence the perceived intonation of synthesized speech, resulting in a more fluid and natural speech signal without directly altering the audible frequency spectrum.

[0037] The ISH unit (2) may be designed to be seamlessly integrated into an existing TTS pipeline as an independent processing layer. In this context, figure 1 illustrates the four processing blocks that embody the ISH methodology of the present disclosure, and which interact directly with a TTS system: the Contextual Feedback Module (2.1), the Intonation Learning Module (2.2), the Subsonic Harmonics Module (2.3), and the Integration Module (2.4). These components will be described below in order to emphasize their function conceptually and to present a practical realization of their implementation, without this being construed as limiting or exclusive. The order in which the components are described is intended to favor understanding of the methodology and of the advantages it offers.

[0038] The Intonation Learning Module (2.2) may be a sophisticated neural network trained on a vast corpus of emotional and conversational speech data, annotated for intonation and prosody. The neural network architecture, which may employ techniques like Long Short-Term Memory or Transformer models, may be designed to effectively handle the temporal dependencies and contextual information inherent in speech data, in order to learn the complex relationships between subsonic harmonic profiles and intonation patterns, so as to be able to capture a wide range of nuances across different emotional and contextual scenarios.

[0039] The Subsonic Harmonics Module (2.3), in its turn, is responsible for generating and integrating the inaudible subsonic frequencies with the preliminary synthesized speech waveform (g.), generated by a TTS system (1) to which the ISH unit (2) is to be integrated. Using mathematical models, this module (2.3) is able to create subsonic sine waves or other waveforms that correspond to the desired prosodic structures. The frequency and amplitude of these subsonic harmonic signals (f.) may then be dynamically adjusted to reflect the intended intonation patterns, based on the modulation parameters (e.) received from the Intonation Learning Module (2.2), and dynamic contextual feedback data (c.) provided by the Contextual Feedback Module (2.1).
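As a minimal sketch of how such a subsonic sine wave could be produced with time-varying frequency and amplitude, the Python fragment below accumulates phase sample by sample. The function name `subsonic_signal`, the 16 kHz sample rate and the numeric values are assumptions made for illustration only; the disclosure does not prescribe a particular mathematical model.

```python
import math

def subsonic_signal(freqs_hz, amps, sample_rate=16000):
    """Synthesize a sine wave whose frequency and amplitude vary per sample.

    freqs_hz and amps hold one value per output sample. Phase is accumulated
    sample by sample, so a change in frequency does not create a jump in the
    waveform.
    """
    if len(freqs_hz) != len(amps):
        raise ValueError("freqs_hz and amps must have the same length")
    phase = 0.0
    out = []
    for f, a in zip(freqs_hz, amps):
        if not 0.0 < f < 20.0:
            raise ValueError("frequency must stay in the subsonic range (0, 20) Hz")
        out.append(a * math.sin(phase))
        phase += 2.0 * math.pi * f / sample_rate
    return out

# One second at 2 Hz with constant amplitude 0.05 (illustrative values only).
sr = 16000
signal_f = subsonic_signal([2.0] * sr, [0.05] * sr, sample_rate=sr)
```

Passing per-sample sequences, rather than a single frequency and amplitude, is one way to let updated modulation parameters (e.) take effect while the signal is being generated.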

[0040] By continuously analyzing the input text (a.) to extract contextual and emotional cues, the Contextual Feedback Module (2.1) provides real-time feedback to the Intonation Learning Module (2.2) and to the Subsonic Harmonics Module (2.3), ensuring that the subsonic harmonic modulation parameters (e.) are tailored to the appropriate intonational patterns and emotional nuances, maintaining coherence with the speech synthesis process.

[0041] Finally, an Integration Module (2.4) may seamlessly combine the subsonic harmonic signals (f.) with the preliminary speech waveform (g.), in order to generate an enhanced natural-sounding speech signal (h.).

[0042] Although not a core part of the ISH methodology, a final post-processing stage may be used to refine and smooth any adjustments made to the speech waveform (h.), for example those resulting from the addition of the signals (g.) and (f.), ensuring a seamless integration of the subsonic harmonics and avoiding excessive amplitude variations or distortions.

[0043] Figure 2 illustrates an example of a TTS system comprising several units (1.1, 1.2, 1.3, 1.4) that establish several integration points for integrating the ISH unit (2) and the respective modules (2.1, 2.2, 2.3).
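One possible reading of this combination and refinement is a sample-wise addition followed by a peak rescale when the sum exceeds a limit. The sketch below assumes both signals are equal-length lists of floating-point samples in the range -1 to 1; the name `merge_and_normalize` and the `peak_limit` parameter are hypothetical.

```python
def merge_and_normalize(speech_g, subsonic_f, peak_limit=1.0):
    """Add the subsonic signal to the speech waveform, then rescale if needed."""
    if len(speech_g) != len(subsonic_f):
        raise ValueError("signals must have the same length")
    mixed = [g + f for g, f in zip(speech_g, subsonic_f)]
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > peak_limit:
        # Scale the whole signal so the largest sample sits on the limit.
        scale = peak_limit / peak
        mixed = [x * scale for x in mixed]
    return mixed

# Toy values: the first sample would reach 1.2 without rescaling.
enhanced_h = merge_and_normalize([0.9, -0.5, 0.2], [0.3, 0.1, -0.1])
```

Rescaling the whole signal, instead of clipping individual samples, keeps the waveform shape intact at the cost of a slightly lower overall level.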

[0044] In particular, as a first integration point, the Text Analysis and Pre-Processing Unit (1.1) of the TTS pipeline is connected to the Contextual Feedback Module (2.1) of the ISH unit (2). More particularly, the Text Analysis and Pre-Processing Unit (1.1) is configured to process an input text (a.) to extract linguistic and contextual features based on which a contextual understanding of the input text (a.) is determined, thereby generating a text-analysis output (a1.) including contextual information (b.). In this stage, the input text (a.) is analyzed for linguistic features, such as tokenization, part-of-speech tagging, and syntactic structure. More particularly, the Text Analysis and Pre-Processing Unit (1.1) may comprise a Text Analysis Module configured to extract linguistic and contextual features, which are then passed on to a Context Extraction Module to generate a comprehensive understanding of the text's context, including sentence type, dialogue state, and pragmatic analysis. The Contextual Feedback Module (2.1) consumes the output from the Text Analysis and Pre-Processing Unit (1.1), related to critical contextual information (b.), such as sentence type and emotion recognition, to generate dynamic contextual feedback data (c.), which is then used to guide the subsequent ISH processing stages. In fact, by integrating the Contextual Feedback Module (2.1) at this stage, the subsequent ISH processing can be guided by the appropriate contextual information, i.e., dynamic contextual feedback data (c.), ensuring that the subsonic modulation parameters (e.) are tailored to the desired intonational patterns and emotional nuances.

[0045] As a second integration point, the Linguistic Processing Unit (1.2) of the TTS pipeline is connected to the Intonation Learning Module (2.2) of the ISH unit (2). More particularly, the Linguistic Processing Unit (1.2) is configured to process the text-analysis output (a1.) to convert it into phonetic units, thereby generating a linguistic output (a2.) including phonetic and prosodic structure information (d.). Even more particularly, the text representation (a1.) is transformed into phonetic and prosodic structures (d.), identifying phonemes, syllables, and stress patterns. In its turn, the Intonation Learning Module (2.2) consumes the output of the Linguistic Processing Unit (1.2), related to phonetic and prosodic structure information (d.), along with the dynamic contextual feedback data (c.) from the Contextual Feedback Module (2.1), in order to generate the appropriate subsonic harmonic modulation parameters (e.). By integrating the Intonation Learning Module (2.2) at this point, the module (2.2) can use the complete insights into the phonetic and prosodic characteristics to create the most effective subsonic modulation parameters (e.), thereby enhancing the intonation patterns subtly and effectively.

[0046] These subsonic harmonic modulation parameters (e.) are used by the Subsonic Harmonics Module (2.3) in order to generate subsonic harmonic signals (f.).

[0048] This module (2.3) is integrated at a stage prior to the final waveform synthesis (h.), where the pitch contours and prosodic features are defined, which allows the subsonic modulations (e.) to align with the established prosodic patterns without altering the audible components, thereby preserving the intended prosody while enhancing naturalness.

[0049] In addition, based on the linguistic output (a2.), the Prosodic Modeling Unit (1.3) of the TTS pipeline is configured to design pitch contours, duration and intensity patterns for each phonetic unit, thereby generating prosodic output (a3.), which will then be consumed by the speech waveform generator (1.4), in combination with the linguistic output (a2.), to generate a preliminary speech waveform (g.). Therefore, the speech waveform generator (1.4) converts the prosodic and phonetic structures into an actual speech waveform, using, for example, techniques like concatenative synthesis, parametric synthesis, or neural vocoders.

[0050] Consequently, as a third integration point, an Integration Module (2.4) is configured to merge subsonic harmonic signals (f.) with the preliminary speech waveform (g.), thereby generating an enhanced natural-sounding speech signal (h.) which can be subtly adjusted with subsonic modulations, enhancing the natural rise and fall of intonation without disrupting the original waveform synthesis.

[0051] A Post-Processing Module may implement a final processing stage to refine and smooth any adjustments made to the enhanced speech waveform (h.), ensuring a seamless integration of the subsonic harmonic signals (f.). In this stage, standard post-processing techniques may be employed to finalize the speech waveform (h.) for output, ensuring that the subsonic harmonics are smoothly integrated without introducing any perceptual artifacts.

[0052] Consequently, by integrating the ISH Unit (2) at these key points in the TTS pipeline, it can be seamlessly incorporated into existing TTS systems, making it a flexible and adaptable solution for enhancing the naturalness of synthesized speech.

[0053] In fact, the ISH methodology offers several advantages and benefits.

[0054] Firstly, by subtly influencing the prosodic features through subsonic harmonic modulation, it provides a more human-like intonation, significantly enhancing the perceived naturalness of TTS systems (1). Secondly, the ability to dynamically adjust the subsonic modulations (e.) based on context and emotional cues ensures that the synthesized speech can effectively convey appropriate nuances in various conversational settings. Thirdly, the ISH methodology can be integrated into existing TTS pipelines as an independent, post-processing layer, making it flexible and adaptable without requiring major overhauls of current TTS systems. Finally, the indirect approach of using subsonic modulation to influence intonation avoids direct manipulation of the audible frequency spectrum, preserving the intended prosodic features and preventing unnatural or robotic-sounding speech.

[0055] In conclusion, the ISH methodology represents a novel and innovative approach to enhancing the naturalness of text-to-speech synthesis. By leveraging inaudible subsonic frequencies to subtly influence the perceived intonation, it is possible to achieve a more fluid and natural delivery of synthetic speech without directly altering the audible components. The integration of the Contextual Feedback Module (2.1), Intonation Learning Module (2.2) and Subsonic Harmonics Module (2.3) enables the ISH methodology to adapt to various emotional and contextual scenarios.

[0056] To illustrate the operation of a TTS system (1) enhanced with the intonation-aware subsonic harmonics (ISH) unit (2), an example workflow is shown, from text input to natural-sounding speech output.

[0057] Consider that the text inputted to the TTS system (1) is as follows: 'Hi John, are you free for a quick meeting at 2 PM today?'

[0058] First step - Text Analysis and Pre-processing & dynamic contextual feedback data:

[0059] The TTS system receives the input text and performs basic linguistic preprocessing operations, such as tokenization, part-of-speech tagging, and parsing, to generate text-analysis output (a1.):

• Tokenization: 'Hi' | 'John' | ',' | 'are' | 'you' | 'free' | 'for' | 'a' | 'quick' | 'meeting' | 'at' | '2' | 'PM' | 'today' | '?'

[0060] • Part-of-Speech Tagging: Hi (interjection), John (noun), are (verb), you (pronoun), free (adjective), for (preposition), a (article), quick (adjective), meeting (noun), at (preposition), 2 PM (time expression), today (adverb), ? (punctuation)

[0061] • Sentence Type: Question (identified by the sentence structure and punctuation)

[0062] • Sentiment & Emotion: Neutral, possibly urgent or formal based on keywords like 'quick' and 'meeting'.
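The tokenization and sentence-type items of this first step can be reproduced with a few lines of Python. This is only an illustrative sketch: the regular expression and the punctuation-based rule are assumptions, and a production text-analysis unit would normally rely on a full NLP toolkit for tagging and parsing.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_type(tokens):
    """Classify by final punctuation only (a deliberately crude rule)."""
    if tokens and tokens[-1] == "?":
        return "question"
    if tokens and tokens[-1] == "!":
        return "exclamation"
    return "statement"

tokens = tokenize("Hi John, are you free for a quick meeting at 2 PM today?")
print(tokens)
# ['Hi', 'John', ',', 'are', 'you', 'free', 'for', 'a', 'quick', 'meeting',
#  'at', '2', 'PM', 'today', '?']
print(sentence_type(tokens))
# question
```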

[0063] Contextual information (b.), related to at least sentence type and emotion recognition, is shared with the Contextual Feedback Module (2.1), based on which it generates dynamic contextual feedback data (c.), relatable to, at least:

[0064] • Context: Question,

[0065] • Emotional Cue: Slight urgency.
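How the Contextual Feedback Module (2.1) arrives at these two fields is not detailed in the text. The sketch below shows one very simple possibility, a keyword count, purely for illustration; the keyword set, the thresholds and the function name `contextual_feedback` are all hypothetical.

```python
# Hypothetical keyword list; the disclosure does not specify one.
URGENCY_KEYWORDS = {"quick", "urgent", "asap", "now", "immediately"}

def contextual_feedback(sentence_type, words):
    """Build dynamic contextual feedback data (c.) from contextual information (b.)."""
    hits = sum(1 for w in words if w.lower() in URGENCY_KEYWORDS)
    if hits == 0:
        cue = "neutral"
    elif hits == 1:
        cue = "slight urgency"
    else:
        cue = "urgency"
    return {"context": sentence_type, "emotional_cue": cue}

words = "Hi John are you free for a quick meeting at 2 PM today".split()
feedback_c = contextual_feedback("question", words)
print(feedback_c)
# {'context': 'question', 'emotional_cue': 'slight urgency'}
```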

[0066] Second step: Linguistic Processing & subsonic modulation parameters:

[0067] The Linguistic Processing Unit (1.2) processes the text-analysis output (a1.) to convert text into a sequence of phonemes and marks stress and prosodic features, thereby generating a linguistic output (a2.) that includes phonetic and prosodic structure information (d.), identifying phoneme sequences, syllable boundaries and stress patterns.

[0068] Based on the phonetic and prosodic structure information (d.), the Intonation Learning Module (2.2) determines suitable intonation patterns for a question with slight urgency, and generates subsonic harmonic modulation parameters (e.) for subsonic harmonic signals (f.), by increasing pitch on 'Hi John' and applying a subtle rise over 'are you free', with a more noticeable rise towards the end, emphasizing the question. This provides adjustments aligned with natural question intonation contours and adds subtle emotional cues.

Third step: Prosodic Modeling & subsonic harmonic signals:

[0069] The Prosodic Modeling Unit (1.3) processes the linguistic output (a2.) to develop pitch contours, duration and intensity patterns for each phonetic unit, thereby generating prosodic output (a3.):

[0070] • Define pitch contours:

[0071] - Slight rise in 'Hi John', signalling greeting;

[0072] - Maintain average pitch until '2 PM today', where there is a significant rise indicating a question.

[0073] • Duration adjustments: quick and fluid, but slightly extended last syllable to indicate questioning intonation.

[0074] Based on the subsonic modulation parameters (e.) determined by the Intonation Learning Module (2.2), the Subsonic Harmonics Module (2.3) generates subsonic harmonic signals (f.) :

[0075] • Generates a 1-4 Hz sine wave modulation during 'Hi John';

[0076] • Modulation intensifies gradually during 'are you free for a quick meeting' and peaks at 'today?'.
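The gradual intensification described above can be pictured as an amplitude envelope defined by a few breakpoints and interpolated linearly in between. The timings and amplitude values in the following sketch are invented for illustration, since the example does not state them, and `piecewise_linear` is a hypothetical helper.

```python
def piecewise_linear(breakpoints, rate_hz):
    """Expand (time_s, value) breakpoints into one value per control frame."""
    values = []
    for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
        n = round((t1 - t0) * rate_hz)
        values.extend(v0 + (v1 - v0) * i / n for i in range(n))
    values.append(breakpoints[-1][1])
    return values

# Hypothetical timing (seconds) and amplitudes for the example sentence.
amplitude_breakpoints = [
    (0.0, 0.02),  # start of 'Hi John': low, constant modulation
    (0.6, 0.02),  # start of 'are you free for a quick meeting'
    (2.2, 0.06),  # start of 'at 2 PM today?'
    (2.8, 0.08),  # end of the utterance, where the modulation peaks
]
envelope = piecewise_linear(amplitude_breakpoints, rate_hz=100)
```

An envelope of this kind would then scale a subsonic oscillator whose frequency stays within the 1-4 Hz band mentioned in the example.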

[0077] Fourth step: preliminary speech waveform generation & merging stage:

[0078] Based on the linguistic output (a2.) and on the prosodic output (a3.), a preliminary speech waveform signal (g.) is generated, with an intended intonation, using synthesis technology (e.g., neural vocoder).

[0079] An Integration Module (2.4) integrates the subsonic harmonic signals (f.) as an undercurrent to the preliminary speech waveform signal (g.), thereby generating an enhanced natural-sounding speech signal (h.). This subsonic modulation subtly influences the pitch perception, introducing a more natural and varied intonation pattern without overtly changing the audible properties. More particularly, the speech signal (h.) carries perceptible question intonation, subtle urgency, and natural flow influenced by subsonic harmonics.

Fifth step: Post-Processing stage:

[0080] The enhanced natural-sounding speech signal (h.) may be subject to smoothing and refinement to ensure seamless integration of subsonic harmonics:

[0081] • Apply smoothing algorithms to ensure transitions and modulations are fluid;

[0082] • Normalize amplitude levels for consistency;

[0083] • Final quality check to ensure the synthesized speech sounds natural and characterizes the intended questioning and slight urgency appropriately.
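The smoothing and normalization bullets above leave the concrete algorithms open. As one hedged illustration, the sketch below applies linear fades at the signal edges, so that the added low-frequency component does not start or stop abruptly, and then rescales the signal to a target RMS level. Both function names and the `target_rms` value are assumptions.

```python
import math

def apply_fades(signal, fade_len):
    """Apply a linear fade-in and fade-out over fade_len samples at each edge."""
    out = list(signal)
    n = min(fade_len, len(out) // 2)
    for i in range(n):
        gain = i / n
        out[i] *= gain
        out[-1 - i] *= gain
    return out

def normalize_rms(signal, target_rms=0.1):
    """Rescale the signal so that its RMS level equals target_rms."""
    if not signal:
        return []
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    if rms == 0.0:
        return list(signal)
    return [x * target_rms / rms for x in signal]

faded = apply_fades([1.0] * 10, fade_len=2)
refined = normalize_rms(faded, target_rms=0.1)
```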

[0084] The result is a natural-sounding speech that reads, 'Hi John, are you free for a quick meeting at 2 PM today?' with the following characteristics:

[0085] • Greeting and Respect: Slight elevation in pitch on 'Hi John' introduced by subsonic modulation that subtly enhances the natural rising intonation of a casual greeting.

[0086] • Main Question: A noticeable but smooth pitch rise on 'are you free' and a steady rise towards 'today?' This intonation mimics a real human questioning pattern, making the query sound genuine and clear.

[0087] • Subtle Urgency: The subsonic frequencies subtly modulate the overall rhythm and intonation to inject an implicit sense of urgency, indicating the speaker's need to schedule the meeting promptly.

[0088] EMBODIMENTS

[0089] A method for generating enhanced natural-sounding speech is disclosed.

[0090] According to a preferred embodiment, the method comprises the following steps:

i. receiving from a TTS system (1) contextual information (b.) and phonetic and prosodic structure information (d.), obtainable from an input text (a.);

ii. generating dynamic contextual feedback data (c.), by a Contextual Feedback Module (2.1), based on the contextual information (b.);

iii. feeding an Intonation Learning Module (2.2) with the dynamic contextual feedback data (c.) and phonetic and prosodic structure information (d.), to generate subsonic harmonic modulation parameters (e.);

iv. feeding a Subsonic Harmonics Module (2.3) with subsonic harmonic modulation parameters (e.) to generate subsonic harmonic signals (f.);

v. merging subsonic harmonic signals (f.) with a preliminary speech waveform (g.) obtainable from the TTS system (1), thereby generating an enhanced natural-sounding speech signal (h.);

the method steps i. to v. being continuously executed so that,

[0091] - the Contextual Feedback Module (2.1) continuously generates dynamic contextual feedback data,

[0092] - the Intonation Learning Module (2.2) generates updated subsonic harmonic modulation parameters (e.) based thereon; and

[0093] - the Subsonic Harmonics Module (2.3) adjusts the subsonic harmonic signals (f.) dynamically based on the updated subsonic harmonic modulation parameters (e.).
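The data flow of steps i. to v., executed repeatedly, may be sketched as follows. The three modules are passed in as plain callables and replaced here by trivial stubs, because their internals are not part of this illustration; all function names, the chunking of the input and the numeric values are assumptions.

```python
import math

def ish_step(info_b, structure_d, speech_g, feedback, intonation, harmonics):
    """Run steps ii. to v. once on inputs received from the TTS system (step i.)."""
    c = feedback(info_b)                          # step ii: feedback data (c.)
    e = intonation(c, structure_d)                # step iii: parameters (e.)
    f = harmonics(e, len(speech_g))               # step iv: subsonic signal (f.)
    return [g + s for g, s in zip(speech_g, f)]   # step v: enhanced signal (h.)

# Stub modules standing in for 2.1, 2.2 and 2.3 (illustrative only).
def feedback_stub(b):
    return {"context": b["sentence_type"], "emotional_cue": b["emotion"]}

def intonation_stub(c, d):
    amplitude = 0.05 if c["context"] == "question" else 0.02
    return {"frequency_hz": 2.0, "amplitude": amplitude}

def harmonics_stub(e, n, sample_rate=16000):
    # Phase restarts at each chunk; a real module would carry it across chunks.
    w = 2.0 * math.pi * e["frequency_hz"] / sample_rate
    return [e["amplitude"] * math.sin(w * i) for i in range(n)]

# Two consecutive chunks of silent "speech" with different contextual information.
chunks = [
    ({"sentence_type": "statement", "emotion": "neutral"}, ["HH", "AY"], [0.0] * 4000),
    ({"sentence_type": "question", "emotion": "slight urgency"}, ["T", "AH"], [0.0] * 4000),
]
enhanced_h = []
for b, d, g in chunks:
    enhanced_h.extend(
        ish_step(b, d, g, feedback_stub, intonation_stub, harmonics_stub)
    )
```

Because the feedback, parameters and signal are recomputed for every chunk, a change in context between chunks (here, from statement to question) immediately changes the modulation amplitude.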

[0094] According to one embodiment of the method, contextual information (b.) may relate to at least sentence type and emotion recognition, and the dynamic contextual feedback data (c.) may include information related to at least sentence types and emotional cues.

[0095] In addition, the subsonic harmonic modulation parameters (e.) may relate to at least frequency and amplitude of the subsonic harmonic signals (f.).

[0096] According to one embodiment of the method, the Subsonic Harmonics Module (2.3) may further execute the following steps:

- generating waveforms within the subsonic frequency range, thereby generating subsonic harmonic signals (f.);

- applying frequency modulation by adjusting frequency and amplitude of the generated signals (f.) based on the subsonic harmonic modulation parameters (e.), thereby aligning the subsonic harmonic signals (f.) with dynamic pitch contours of the preliminary speech waveform (g.).

[0097] In addition, the Subsonic Harmonics Module (2.3) may further execute the following steps:

[0098] - continuously processing incoming subsonic harmonic modulation parameters (e.);

[0099] - adjusting the subsonic harmonic signals (f.) dynamically by adapting amplitude and frequency parameters to match changes in the preliminary speech waveform (g.), identified by the phonetic and prosodic structure information (d.) and / or the dynamic contextual feedback data (c.).

[0100] According to another embodiment of the method, the Integration Module (2.4) may further execute the following steps:

[0101] - analyzing the preliminary speech waveform (g.) to classify pitch contours related to phonetic and prosodic events as suitable integration points;

[0102] - synchronizing the subsonic harmonic signals (f.) with the phonetic and prosodic events in the preliminary speech waveform (g.);

[0103] - adding the subsonic harmonic signals (f.) to the integration points identified in the preliminary speech waveform (g.) to adjust its prosodic features, i.e., pitch, rhythm and intonation.
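Assuming the integration points have already been identified and expressed as sample offsets (how they are classified is not shown here), the synchronizing and adding steps reduce to an offset-aligned addition. The function name and the representation of segments as `(start_sample, samples)` pairs are hypothetical.

```python
def add_at_integration_points(speech_g, segments):
    """Add subsonic segments to the speech waveform at given sample offsets.

    segments is a list of (start_sample, subsonic_samples) pairs. Samples that
    would fall beyond the end of the speech waveform are dropped.
    """
    out = list(speech_g)
    for start, segment in segments:
        if start < 0:
            raise ValueError("start_sample must be non-negative")
        for i, s in enumerate(segment):
            if start + i >= len(out):
                break
            out[start + i] += s
    return out

# Toy example: six samples of silence, two segments, the second one truncated.
enhanced_h = add_at_integration_points(
    [0.0] * 6, [(1, [0.1, 0.2]), (4, [0.3, 0.3, 0.3])]
)
print(enhanced_h)
# [0.0, 0.1, 0.2, 0.0, 0.3, 0.3]
```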

[0104] In addition, the Integration Module (2.4) may further execute the following step: applying amplitude normalization to avoid excessive amplitude variations arising from the addition of the subsonic harmonic signals (f.) to the preliminary speech waveform (g.).

According to another embodiment of the method, phonetic and prosodic structure information (d.) may relate to phonemes and syllables included in the input text (a.), and the Intonation Learning Module (2.2) implements a neural network trained to correlate subsonic harmonic profiles with intonation patterns derived from the dynamic contextual feedback data (c.), to generate subsonic harmonic modulation parameters (e.), defining an appropriate subsonic harmonic profile, for each phoneme and syllable.

[0105] More particularly, the Intonation Learning Module (2.2) may further execute the following steps:

[0106] - feeding the neural network with phonetic and prosodic structure information (d.) and with dynamic contextual feedback data (c.);

[0107] - capturing relationships between intonation patterns, subsonic harmonic frequencies, and the dynamic contextual feedback data (c.);

[0108] - outputting subsonic harmonic modulation parameters (e.) for each phoneme and syllable.
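The network itself is beyond a short example, but its output stage can be illustrated. One common way to keep predicted values inside a fixed range is to pass unbounded network outputs through a sigmoid and scale the result. The sketch below assumes this design, which the disclosure does not state, and the raw values stand in for real network outputs.

```python
import math
from dataclasses import dataclass

@dataclass
class ModulationParams:
    """Subsonic harmonic modulation parameters (e.) for one phoneme or syllable."""
    frequency_hz: float
    amplitude: float

def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def to_modulation_params(raw_freq, raw_amp, max_hz=20.0, max_amplitude=0.1):
    """Map unbounded raw outputs into the ranges 0..max_hz and 0..max_amplitude."""
    return ModulationParams(
        frequency_hz=max_hz * _sigmoid(raw_freq),
        amplitude=max_amplitude * _sigmoid(raw_amp),
    )

# Placeholder raw outputs for three syllables (not produced by a trained model).
raw_outputs = [(-2.0, -1.0), (0.0, 0.0), (1.5, 2.0)]
params_e = [to_modulation_params(rf, ra) for rf, ra in raw_outputs]
```

With this mapping, a raw output of zero lands in the middle of each range (10 Hz and half of `max_amplitude`), and no raw value can push the frequency above the 20 Hz ceiling.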

[0109] Even more particularly, the neural network may be trained on a training dataset comprised by a plurality of labelled speech data annotated for at least intonation, prosody and emotion.

[0110] According to another embodiment, the method may further comprise the step of: implementing a final smoothing and refinement signal processing stage by processing the enhanced natural-sounding speech signal (h.) to generate a refined speech signal.

REFERENCES

[0111] 1 - BEN HAYES ET AL: "A Review of Differentiable Digital Signal Processing for Music & Speech Synthesis", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 August 2023 (2023-08-29), XP0916004.

[0112] 2 - TAKAMICHI SHINNOSUKE ET AL: "Postfilters to Modify the Modulation Spectrum for Statistical Parametric Speech Synthesis", ARXIV:1806.04885V2, vol. 24, no. 4, 1 April 2016 (2016-04-01), pages 755-767, XP011602380.

[0113] 3 - AL MASUM SHAIKH M ET AL: "Emotional speech synthesis by sensing affective information from text", AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION AND WORKSHOPS, 2009. ACII 2009. 3RD INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 10 September 2009 (2009-09-10), pages 1-6, XP031577804.

Claims

CLAIMS

1. A method for generating enhanced natural-sounding speech comprising the following steps:

i. receiving from a text-to-speech (TTS) system (1) contextual information (b.) and phonetic and prosodic structure information (d.), obtainable from an input text (a.);

ii. generating dynamic contextual feedback data (c.), by a Contextual Feedback Module (2.1), based on the contextual information (b.);

iii. feeding an Intonation Learning Module (2.2) with the dynamic contextual feedback data (c.) and phonetic and prosodic structure information (d.), to generate subsonic harmonic modulation parameters (e.);

iv. feeding a Subsonic Harmonics Module (2.3) with subsonic harmonic modulation parameters (e.) to generate subsonic harmonic signals (f.);

v. merging subsonic harmonic signals (f.) with a preliminary speech waveform (g.) obtainable from the TTS system (1), thereby generating an enhanced natural-sounding speech signal (h.);

the method steps i. to v. being continuously executed so that,

- the Contextual Feedback Module (2.1) continuously generates dynamic contextual feedback data,

- the Intonation Learning Module (2.2) generates updated subsonic harmonic modulation parameters (e.) based thereon; and

- the Subsonic Harmonics Module (2.3) adjusts the subsonic harmonic signals (f.) dynamically based on the updated subsonic harmonic modulation parameters (e.).

2. The method according to claim 1, wherein contextual information (b.) relates to at least sentence type and emotion recognition; and the dynamic contextual feedback data (c.) includes information related to at least sentence types and emotional cues.

3. The method according to claims 1 or 2, wherein the subsonic harmonic modulation parameters (e.) relate to at least frequency and amplitude of the subsonic harmonic signals (f.).

4. The method according to claim 3, wherein the Subsonic Harmonics Module (2.3) further executes the following steps:- generating waveforms within the subsonic frequency range, thereby generating subsonic harmonic signals (f.);- applying frequency modulation by adjusting frequency and amplitude of the generated signals (f.) based on the subsonic harmonic modulation parameters (e.), thereby aligning the subsonic harmonic signals (f.) with dynamic pitch contours of the preliminary speech waveform (g.).

5. The method according to claim 4, wherein the Subsonic Harmonics Module (2.3) further executes the following steps:- continuously processing incoming subsonic harmonic modulation parameters (e.);- adjusting the subsonic harmonic signals (f.) dynamically by adapting amplitude and frequency parameters to match changes in the preliminary speech waveform (g.), identified by the phonetic and prosodic structure information (d.) and / or the dynamic contextual feedback data (c.).

6. The method according to any of the previous claims, wherein the Integration Module (2.4) further executes the following steps:- analyzing the preliminary speech waveform (g.) to classify pitch contours related to phonetic and prosodic events as suitable integration points;- synchronizing the subsonic harmonic signals (f.) with the phonetic and prosodic events in the preliminary speech waveform (g.);- adding the subsonic harmonic signals (f.) to the integration points identified in the preliminary speech waveform (g.) to adjust its prosodic features, i.e., pitch, rhythm and intonation.

7. The method according to claim 6, wherein the Integration Module (2.4) further executes the following steps:- applying amplitude normalization to avoid excessive amplitude variations arising from the addition of the subsonic harmonic signals (f.) to the preliminary speech waveform (g.).

8. The method according to any of the previous claims, wherein phonetic and prosodic structure information (d.) relates to phonemes and syllables included in the input text (a.), and wherein, the Intonation Learning Module (2.2) implements a neural network trained to correlate subsonic harmonic profiles with intonation patterns derived from the dynamic contextual feedback data (c.), to generate subsonic harmonic modulation parameters (e.), defining an appropriate subsonic harmonic profile, for each phoneme and syllable.

9. The method according to claim 8, wherein the Intonation Learning Module (2.2) further executes the following steps:- feeding the neural network with phonetic and prosodic structure information (d.) and with dynamic contextual feedback data (c.);- capturing relationships between intonation patterns, subsonic harmonic frequencies, and the dynamic contextual feedback data (c.);- outputting subsonic harmonic modulation parameters (e.) for each phoneme and syllable.

10. The method according to claims 8 or 9, wherein the neural network is trained on a training dataset comprised by a plurality of labelled speech data annotated for at least intonation, prosody and emotion.

11. The method according to any of the previous claims, further comprising:- implementing a final smooth and refinement signal processing stage by processing the enhanced natural-sounding speech signal (h.) to generate a refined speech signal.