Intonation-aware subsonic harmonics unit for a text-to-speech system
The ISH unit addresses the challenge of robotic-sounding TTS outputs by using subsonic harmonics to enhance naturalness through contextual and emotional cues, achieving more human-like intonation in synthetic speech.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NOS INOVAÇÃO SA
- Filing Date
- 2025-12-05
- Publication Date
- 2026-06-25
Abstract
Description
[0001] DESCRIPTION
[0002] INTONATION-AWARE SUBSONIC HARMONICS UNIT FOR A TEXT-TO-SPEECH SYSTEM
[0003] TECHNICAL FIELD
[0004] The present disclosure falls within the technical field of speech synthesis. More particularly, it relates to systems and methods for improving the naturalness of synthetic speech in the context of Text-to-Speech frameworks.
[0005] PRIOR ART
[0006] Advances in speech synthesis technology have led to significant improvements in the intelligibility and quality of synthesized speech. Text-to-speech (TTS) systems, in particular, have become increasingly sophisticated, allowing for the conversion of written text into audible speech. Examples of these developments can be found in WO 2024 / 058147 or in BEN HAYES ET AL [1]. However, despite these advancements, a key challenge that persists is the lack of naturalness in the generated speech signal.
[0007] In fact, synthetic speech, while often highly intelligible, frequently falls short in capturing the natural nuances and expressive qualities that characterize human speech. As the quality of TTS systems continues to improve, the perceived naturalness of the synthesized output has become a critical factor for user acceptance and usability.
[0008] Conventional methods for enhancing the naturalness of synthetic speech, such as those disclosed in TAKAMICHI SHINNOSUKE ET AL [2] and AL MASUM SHAIKH M ET AL [3], have focused on directly modifying various audible features, such as pitch, duration, and intensity. These approaches aim to mimic the prosodic patterns and rhythmic characteristics of natural speech. However, this direct manipulation of the audible frequency spectrum can sometimes lead to unnatural or robotic-sounding speech, as the synthetic output may lack the subtle variations and seamless transitions that are characteristic of human speech. The challenge of achieving natural-sounding synthetic speech is further compounded by the complex interplay between linguistic, prosodic, and emotional factors that contribute to the perceived naturalness of human speech. Factors such as sentence structure, dialogue context, and emotional expression all play a crucial role in shaping the natural flow and cadence of speech, which is difficult to replicate accurately using conventional TTS techniques.
[0009] These limitations of conventional methodologies highlight the need for approaches that can more effectively capture the nuances and expressive qualities of human speech, while also ensuring seamless integration into TTS systems.
[0010] The present solution is intended to overcome such issues.
[0011] SUMMARY OF THE DISCLOSURE
[0012] The present disclosure relates to a methodology for improving the perceived naturalness of synthesized speech generated by a Text-to-Speech (TTS) system, through the application of Intonation-aware Subsonic Harmonic (ISH) modulation.
[0013] At the core of this ISH-based methodology are three key modules: an Intonation Learning Module, a Subsonic Harmonics Module, and a Contextual Feedback Module. More precisely, the Intonation Learning Module is able to learn the relationship between subsonic harmonic profiles and intonation patterns across various emotional and contextual scenarios. The Subsonic Harmonics Module generates inaudible subsonic frequencies (below 20 Hz) and integrates them with the synthesized speech waveform obtained from a TTS system, dynamically adjusting them based on the Intonation Learning Module's output and on the feedback given by the Contextual Feedback Module, which is derived from contextual information extractable from the text input to the TTS system.
[0014] This indirect approach to influencing prosodic features results in a more natural-sounding, fluid, and expressive synthetic speech signal, without directly altering the audible frequency spectrum, thereby offering a distinct approach for addressing the issues of the state of the art.
[0015] In this context, an object of the present disclosure is an Intonation-Aware Subsonic Harmonics (ISH) unit adapted to be integrated into a Text-to-Speech (TTS) system.
[0016] DESCRIPTION OF FIGURES
[0017] Figure 1 refers to a block diagram illustrating the ISH methodology and its integration with a TTS system, to generate an enhanced natural-sounding speech signal. The numeric references represent:
[0018] 1 - TTS system to which the ISH unit is to be integrated;
[0019] 2 - ISH unit;
[0020] 2.1 - Contextual Feedback Module;
[0021] 2.2 - Intonation Learning Module;
[0022] 2.3 - Subsonic Harmonics Module;
[0023] 2.4 - Integration Module;
a. - Input text;
b. - Contextual information;
c. - Dynamic contextual feedback data;
d. - Phonetic and prosodic structure information;
e. - Subsonic harmonic modulation parameters;
f. - Subsonic harmonic signal;
g. - Preliminary speech waveform;
h. - Enhanced natural-sounding speech signal.
[0024] Figure 2 is a block diagram representation of a TTS system comprising an ISH unit. The numeric references represent:
[0025] 1 - TTS system;
[0026] 1.1 - Text Analysis and Pre-Processing Unit;
1.2 - Linguistic Processing Unit;
[0027] 1.3 - Prosodic Modeling Unit;
[0028] 1.4 - Speech waveform generator;
[0029] 2 - ISH unit;
[0030] 2.1 - Contextual Feedback Module;
[0031] 2.2 - Intonation Learning Module;
[0032] 2.3 - Subsonic Harmonics Module;
[0033] 2.4 - Integration Module;
a. - Input text;
a1. - Text-analysis output;
a2. - Linguistic output;
a3. - Prosodic output;
b. - Contextual information;
c. - Dynamic contextual feedback data;
d. - Phonetic and prosodic structure information;
e. - Subsonic harmonic modulation parameters;
f. - Subsonic harmonic signal;
g. - Preliminary speech waveform;
h. - Enhanced natural-sounding speech signal.
[0034] DETAILED DESCRIPTION
[0035] The more general configurations of the present disclosure are described in the Summary of the disclosure. Such configurations are detailed below in accordance with other advantageous and / or preferred embodiments of implementation of the present disclosure.
[0036] It is disclosed an Intonation-aware Subsonic Harmonics (ISH) methodology for improving the perceived naturalness of synthesized speech signals generated by a Text-To-Speech framework. The application of ISH modulation leverages inaudible subsonic frequencies (below 20 Hz) to subtly influence the perceived intonation of synthesized speech, resulting in a more fluid and natural speech signal without directly altering the audible frequency spectrum.
[0037] The ISH unit (2) may be designed to be seamlessly integrated into an existing TTS pipeline as an independent processing layer. In this context, figure 1 illustrates the four processing blocks that embody the ISH methodology of the present disclosure, and which interact directly with a TTS system: the Contextual Feedback Module (2.1), the Intonation Learning Module (2.2), the Subsonic Harmonics Module (2.3), and the Integration Module (2.4). These components are described below in order to emphasize their function conceptually and to present a practical realization of their implementation, without this being construed as limiting or exclusive. The order in which the components are described is chosen to aid understanding and to highlight the advantages offered by this methodology.
[0038] The Intonation Learning Module (2.2) may be a sophisticated neural network trained on a vast corpus of emotional and conversational speech data, annotated for intonation and prosody. The neural network architecture, which may employ techniques like Long Short-Term Memory or Transformer models, may be designed to effectively handle the temporal dependencies and contextual information inherent in speech data, in order to learn the complex relationships between subsonic harmonic profiles and intonation patterns, so as to be able to capture a wide range of nuances across different emotional and contextual scenarios.
[0039] The Subsonic Harmonics Module (2.3), in its turn, is responsible for generating and integrating the inaudible subsonic frequencies with the preliminary synthesized speech waveform (g.), generated by a TTS system (1) to which the ISH unit (2) is to be integrated. Using mathematical models, this module (2.3) is able to create subsonic sine waves or other waveforms that correspond to the desired prosodic structures. The frequency and amplitude of these subsonic harmonic signals (f.) may then be dynamically adjusted to reflect the intended intonation patterns, based on the modulation parameters (e.) received from the Intonation Learning Module (2.2), and dynamic contextual feedback data (c.) provided by the Contextual Feedback Module (2.1).
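By way of non-limiting illustration, the generation of a subsonic sine wave may be sketched as follows. The function name `subsonic_sine`, the 16 kHz sample rate and the example amplitude are assumptions made for this sketch only; the disclosure specifies neither a sample rate nor concrete amplitude values.

```python
import math

SUBSONIC_LIMIT_HZ = 20.0  # upper bound of the subsonic range stated in the text


def subsonic_sine(freq_hz, amplitude, duration_s, sample_rate=16000):
    """Return the samples of a constant-frequency sine wave below 20 Hz."""
    if not 0.0 < freq_hz < SUBSONIC_LIMIT_HZ:
        raise ValueError("frequency must lie in the subsonic range (0, 20) Hz")
    n_samples = int(round(duration_s * sample_rate))
    # Phase advance per sample for the requested frequency.
    step = 2.0 * math.pi * freq_hz / sample_rate
    return [amplitude * math.sin(step * n) for n in range(n_samples)]


# Hypothetical values: one second of a 4 Hz signal at a small amplitude.
signal_f = subsonic_sine(freq_hz=4.0, amplitude=0.05, duration_s=1.0)
```

In a complete system the frequency and amplitude arguments would be supplied by the modulation parameters (e.) rather than fixed by hand.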
[0040] By continuously analyzing the input text (a.) to extract contextual and emotional cues, the Contextual Feedback Module (2.1) provides real-time feedback to the Intonation Learning Module (2.2) and to the Subsonic Harmonics Module (2.3), ensuring that the subsonic harmonic modulation parameters (e.) are tailored to the appropriate intonational patterns and emotional nuances, maintaining coherence with the speech synthesis process.
[0041] Finally, an Integration Module (2.4) may seamlessly combine the subsonic harmonic signals (f.) with the preliminary speech waveform (g.), in order to generate an enhanced natural-sounding speech signal (h.).
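As a minimal sketch of one possible combination, assuming that "combine" means sample-wise addition of two signals at the same sample rate (the disclosure later refers to the addition of the signals), the merge may look as follows. The function name `merge_signals` and the `gain` parameter are hypothetical.

```python
def merge_signals(speech_g, subsonic_f, gain=1.0):
    """Add the subsonic signal (f.) to the speech waveform (g.) sample by sample.

    The output has the length of the speech waveform; a shorter subsonic
    signal only affects the samples it covers.
    """
    merged = list(speech_g)
    for i in range(min(len(merged), len(subsonic_f))):
        merged[i] += gain * subsonic_f[i]
    return merged


# Toy example with three speech samples and two subsonic samples.
enhanced_h = merge_signals([0.5, -0.5, 0.25], [0.25, 0.25])
```

Keeping the speech waveform as the length reference reflects the statement that the original waveform synthesis is not disrupted; other alignment policies are equally conceivable.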
[0042] Although not a core part of the ISH methodology, a final post-processing stage may be used to refine and smooth any adjustments made to the speech waveform (h.), for example those resulting from the addition of the signals (g.) and (f.), ensuring a seamless integration of the subsonic harmonics and avoiding excessive amplitude variations or distortions.
[0043] Figure 2 illustrates an example of a TTS system comprising several units (1.1, 1.2, 1.3, 1.4), which establish several integration points for the ISH unit (2) and its respective modules (2.1, 2.2, 2.3).
[0044] In particular, as a first integration point, the Text Analysis and Pre-Processing Unit (1.1) of the TTS pipeline is connected to the Contextual Feedback Module (2.1) of the ISH unit (2). More particularly, the Text Analysis and Pre-Processing Unit (1.1) is configured to process an input text (a.) to extract linguistic and contextual features, based on which a contextual understanding of the input text (a.) is determined, thereby generating a text-analysis output (a1.) including contextual information (b.). In this stage, the input text (a.) is analyzed through operations such as tokenization, part-of-speech tagging, and syntactic parsing. More particularly, the Text Analysis and Pre-Processing Unit (1.1) may comprise a Text Analysis Module configured to extract linguistic and contextual features, which are then passed on to a Context Extraction Module to generate a comprehensive understanding of the text's context, including sentence type, dialogue state, and pragmatic analysis. The Contextual Feedback Module (2.1) consumes the output of the Text Analysis and Pre-Processing Unit (1.1) relating to critical contextual information (b.), such as sentence type and emotion recognition, to generate dynamic contextual feedback data (c.), which is then used to guide the subsequent ISH processing stages. By integrating the Contextual Feedback Module (2.1) at this stage, the subsonic harmonic modulation parameters (e.) can be tailored to the desired intonational patterns and emotional nuances.
[0045] As a second integration point, the Linguistic Processing Unit (1.2) of the TTS pipeline is connected to the Intonation Learning Module (2.2) of the ISH unit (2). More particularly, the Linguistic Processing Unit (1.2) is configured to process the text-analysis output (a1.) to convert it into phonetic units, thereby generating a linguistic output (a2.) including phonetic and prosodic structure information (d.). Even more particularly, the text representation (a1.) is transformed into phonetic and prosodic structures (d.), identifying phonemes, syllables, and stress patterns. In its turn, the Intonation Learning Module (2.2) consumes the output of the Linguistic Processing Unit (1.2), related to phonetic and prosodic structure information (d.), along with the dynamic contextual feedback data (c.) from the Contextual Feedback Module (2.1), in order to generate the appropriate subsonic harmonic modulation parameters (e.). By being integrated at this point, the Intonation Learning Module (2.2) can use complete insight into the phonetic and prosodic characteristics to create effective subsonic harmonic modulation parameters (e.), thereby subtly enhancing the intonation patterns.
[0046] These subsonic harmonic modulation parameters (e.) are used by the Subsonic Harmonics Module (2.3) in order to generate subsonic harmonic signals (f.). This module (2.3) is integrated at a stage prior to the final waveform synthesis (h.), where the pitch contours and prosodic features are defined, which allows the subsonic modulations (e.) to align with the established prosodic patterns without altering the audible components, thereby preserving the intended prosody while enhancing naturalness.
[0047] In addition, based on the linguistic output (a2.), the Prosodic Modeling Unit (1.3) of the TTS pipeline is configured to design pitch contours, duration and intensity patterns for each phonetic unit, thereby generating a prosodic output (a3.), which is then consumed by the speech waveform generator (1.4), in combination with the linguistic output (a2.), to generate a preliminary speech waveform (g.). The speech waveform generator (1.4) thus converts the prosodic and phonetic structures into an actual speech waveform using, for example, techniques such as concatenative synthesis, parametric synthesis, or neural vocoders.
[0048] Consequently, as a third integration point, an Integration Module (2.4) is configured to merge subsonic harmonic signals (f.) with the preliminary speech waveform (g.), thereby generating an enhanced natural-sounding speech signal (h.) which can be subtly adjusted with subsonic modulations, enhancing the natural rise and fall of intonation without disrupting the original waveform synthesis.
[0049] A Post-Processing Module may implement a final processing stage to refine and smooth any adjustments made to the enhanced speech waveform (h.), ensuring a seamless integration of the subsonic harmonic signals (f.). In this stage, standard post-processing techniques may be employed to finalize the speech waveform (h.) for output, ensuring that the subsonic harmonics are smoothly integrated without introducing any perceptual artifacts.
[0050] Consequently, by integrating the ISH Unit (2) at these key points in the TTS pipeline, it can be seamlessly incorporated into existing TTS systems, making it a flexible and adaptable solution for enhancing the naturalness of synthesized speech.
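The data flow between the TTS units (1.1 to 1.4) and the ISH modules (2.1 to 2.4) described above may be summarized by the following wiring sketch. It only fixes the order in which the signals (a.) to (h.) are produced; the dictionary keys and the function name `run_tts_with_ish` are hypothetical, and each stage is assumed to be supplied as a callable by the surrounding system.

```python
def run_tts_with_ish(text_a, tts, ish):
    """Route the signals (a.) to (h.) through the TTS units and ISH modules.

    `tts` and `ish` are mappings from hypothetical stage names to callables.
    """
    a1, b = tts["text_analysis"](text_a)    # unit 1.1 -> (a1.), (b.)
    c = ish["contextual_feedback"](b)        # module 2.1 -> (c.)
    a2, d = tts["linguistic"](a1)            # unit 1.2 -> (a2.), (d.)
    e = ish["intonation_learning"](d, c)     # module 2.2 -> (e.)
    f = ish["subsonic_harmonics"](e, c)      # module 2.3 -> (f.)
    a3 = tts["prosodic_modeling"](a2)        # unit 1.3 -> (a3.)
    g = tts["waveform_generator"](a2, a3)    # unit 1.4 -> (g.)
    return ish["integration"](f, g)          # module 2.4 -> (h.)
```

Because the ISH stages only read outputs that the TTS units already produce, the TTS callables can remain unchanged, which is consistent with the stated goal of integration without major changes to an existing pipeline.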
[0051] In fact, the ISH methodology offers several advantages and benefits. Firstly, by subtly influencing the prosodic features through subsonic harmonic modulation, it provides a more human-like intonation, significantly enhancing the perceived naturalness of TTS systems (1). Secondly, the ability to dynamically adjust the subsonic modulations (e.) based on context and emotional cues ensures that the synthesized speech can effectively convey appropriate nuances in various conversational settings. Thirdly, the ISH methodology can be integrated into existing TTS pipelines as an independent, post-processing layer, making it flexible and adaptable without requiring major overhauls of current TTS systems. Finally, the indirect approach of using subsonic modulation to influence intonation avoids direct manipulation of the audible frequency spectrum, preserving the intended prosodic features and preventing unnatural or robotic-sounding speech.
[0052] In conclusion, the ISH methodology represents a novel approach to enhancing the naturalness of text-to-speech synthesis. By leveraging inaudible subsonic frequencies to subtly influence the perceived intonation, it is possible to achieve a more fluid and natural delivery of synthetic speech without directly altering the audible components. The integration of the Contextual Feedback Module (2.1), the Intonation Learning Module (2.2) and the Subsonic Harmonics Module (2.3) enables the ISH methodology to adapt to various emotional and contextual scenarios.
[0053] To illustrate the operation of a TTS system (1) enhanced with the intonation-aware subsonic harmonics (ISH) unit (2), an example workflow is shown, from text input to natural-sounding speech output.
[0054] Consider that the text input to the TTS system (1) is as follows: 'Hi John, are you free for a quick meeting at 2 PM today?'
[0055] First step - Text Analysis and Pre-processing & dynamic contextual feedback data:
[0056] The TTS system receives the input text and performs basic linguistic preprocessing operations, such as tokenization, part-of-speech tagging, and parsing, to generate the text-analysis output (a1.):
[0057] • Tokenization: 'Hi' | 'John' | ',' | 'are' | 'you' | 'free' | 'for' | 'a' | 'quick' | 'meeting' | 'at' | '2' | 'PM' | 'today' | '?'
• Part-of-Speech Tagging: Hi (interjection), John (noun), are (verb), you (pronoun), free (adjective), for (preposition), a (article), quick (adjective), meeting (noun), at (preposition), 2 PM (time expression), today (adverb), ? (punctuation)
[0058] • Sentence Type: Question (identified by the sentence structure and punctuation)
[0059] • Sentiment & Emotion: Neutral, possibly urgent or formal based on keywords like 'quick' and 'meeting'.
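The tokenization and sentence-type steps listed above can be reproduced with a small sketch. The regular expression and the punctuation-based rule are assumptions chosen for illustration; the disclosure does not prescribe a particular tokenizer or classifier.

```python
import re


def tokenize(text):
    """Split text into word/number tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def sentence_type(tokens):
    """Classify the sentence by its final punctuation mark (simplified rule)."""
    if tokens and tokens[-1] == "?":
        return "question"
    if tokens and tokens[-1] == "!":
        return "exclamation"
    return "statement"


tokens = tokenize("Hi John, are you free for a quick meeting at 2 PM today?")
kind = sentence_type(tokens)
```

For the example sentence this yields the fifteen tokens listed in the tokenization bullet and the sentence type "question". Part-of-speech tagging and emotion recognition would require a trained model and are not covered by this sketch.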
[0060] Contextual information (b.), related to at least sentence type and emotion recognition, is shared with the Contextual Feedback Module (2.1), based on which the module generates dynamic contextual feedback data (c.), relating to at least:
[0061] • Context: Question,
[0062] • Emotional Cue: Slight urgency.
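One possible representation of the dynamic contextual feedback data (c.) is a small record holding the two fields listed above. In the sketch below, the class `ContextualFeedback`, the keyword set `URGENCY_KEYWORDS` and the keyword-matching rule are all hypothetical stand-ins for whatever emotion recognition the TTS system actually provides.

```python
from dataclasses import dataclass

# Hypothetical cue words; the example text itself only mentions 'quick'.
URGENCY_KEYWORDS = {"quick", "urgent", "asap", "now"}


@dataclass(frozen=True)
class ContextualFeedback:
    context: str        # e.g. "question"
    emotional_cue: str  # e.g. "slight urgency"


def contextual_feedback(sentence_type, words):
    """Derive feedback data (c.) from contextual information (b.)."""
    lowered = {w.lower() for w in words}
    cue = "slight urgency" if lowered & URGENCY_KEYWORDS else "neutral"
    return ContextualFeedback(context=sentence_type, emotional_cue=cue)


feedback_c = contextual_feedback(
    "question", ["Hi", "John", "are", "you", "free", "for", "a", "quick", "meeting"]
)
```

An immutable record is used so that the same feedback object can be passed to both the Intonation Learning Module and the Subsonic Harmonics Module without risk of one module altering what the other sees.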
[0063] Second step: Linguistic Processing & subsonic modulation parameters:
[0064] The Linguistic Processing Unit (1.2) processes the text-analysis output (a1.) to convert the text into a sequence of phonemes and to mark stress and prosodic features, thereby generating a linguistic output (a2.) that includes phonetic and prosodic structure information (d.), identifying phoneme sequences, syllable boundaries and stress patterns.
[0065] Based on the phonetic and prosodic structure information (d.), the Intonation Learning Module (2.2) determines suitable intonation patterns for a question with slight urgency, and generates subsonic harmonic modulation parameters (e.) for the subsonic harmonic signals (f.), with an increased pitch on 'Hi John', a subtle rise over 'are you free', and a more noticeable rise towards the end, emphasizing the question. These adjustments are aligned with the natural intonation contour of a question and add subtle emotional cues.
[0066] Third step: Prosodic Modeling & subsonic harmonic signals: The Prosodic Modeling Unit (1.3) processes the linguistic output (a2.) to develop pitch contours, duration and intensity patterns for each phonetic unit, thereby generating the prosodic output (a3.):
[0068] • Define pitch contours:
[0069] - Slight rise in 'Hi John', signalling greeting;
[0070] - Maintain average pitch until '2 PM today', where there is a significant rise indicating a question.
[0071] • Duration adjustments: quick and fluid, but slightly extended last syllable to indicate questioning intonation.
[0072] Based on the subsonic harmonic modulation parameters (e.) determined by the Intonation Learning Module (2.2), the Subsonic Harmonics Module (2.3) generates subsonic harmonic signals (f.):
[0073] • Generates a 1-4 Hz sine wave modulation during 'Hi John';
[0074] • Modulation intensifies gradually during 'are you free for a quick meeting' and peaks at 'today?'.
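A gradually intensifying modulation of this kind may be sketched as a sine wave whose frequency and amplitude are interpolated linearly over the utterance. The linear ramp, the function name `ramped_subsonic`, the 16 kHz sample rate and the amplitude values are assumptions; the text gives only the 1-4 Hz range and the qualitative shape.

```python
import math


def ramped_subsonic(f_start, f_end, a_start, a_end, duration_s, sample_rate=16000):
    """Sine wave whose frequency and amplitude change linearly over time.

    The phase is accumulated sample by sample so that the changing
    frequency does not introduce discontinuities in the waveform.
    """
    n = int(round(duration_s * sample_rate))
    samples = []
    phase = 0.0
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.0
        freq = f_start + (f_end - f_start) * frac
        amp = a_start + (a_end - a_start) * frac
        samples.append(amp * math.sin(phase))
        phase += 2.0 * math.pi * freq / sample_rate
    return samples


# Hypothetical values: 1 Hz rising to 4 Hz, amplitude rising fivefold, over 2 s.
signal_f = ramped_subsonic(1.0, 4.0, 0.01, 0.05, duration_s=2.0)
```

Accumulating the phase, rather than evaluating sin(2πft) with a time-varying f, is a common way to keep a swept sine continuous.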
[0075] Fourth step: preliminary speech waveform generation & merging stage:
[0076] Based on the linguistic output (a2.) and on the prosodic output (a3.), a preliminary speech waveform signal (g.) is generated, with an intended intonation, using synthesis technology (e.g., a neural vocoder).
[0077] An Integration module (2.4) integrates the subsonic harmonic signals (f.) as an undercurrent to the preliminary speech waveform signal (g.), thereby generating enhanced natural-sounding speech signal (h.). This subsonic modulation subtly influences the pitch perception, introducing a more natural and varied intonation pattern without overtly changing the audible properties. More particularly, the speech signal (h.) carries perceptible question intonation, subtle urgency, and natural flow influenced by subsonic harmonics.
[0078] Fifth step: Post-Processing stage: The enhanced natural-sounding speech signal (h.) may be subject to smoothing and refinement to ensure seamless integration of subsonic harmonics:
[0079] • Apply smoothing algorithms to ensure transitions and modulations are fluid;
[0080] • Normalize amplitude levels for consistency;
[0081] • Final quality check to ensure the synthesized speech sounds natural and conveys the intended questioning tone and slight urgency appropriately.
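The first two post-processing operations may be sketched as follows, assuming a centred moving average as the smoothing algorithm (applied, for example, to a modulation amplitude contour) and peak scaling as the normalization. Both choices, and the names `moving_average` and `normalize_peak`, are illustrative assumptions; the disclosure refers only to standard post-processing techniques.

```python
def moving_average(values, window):
    """Centred moving average; near the edges only available samples are used."""
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def normalize_peak(samples, target_peak=0.9):
    """Scale the signal down only if its peak exceeds the target level."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= target_peak:
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


contour = moving_average([0.0, 0.0, 3.0, 0.0, 0.0], window=3)
limited = normalize_peak([0.5, -2.0, 1.0], target_peak=1.0)
```

Scaling only when the peak exceeds the target leaves signals that are already within range untouched, which avoids amplifying quiet passages.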
[0082] The result is a natural-sounding speech that reads, 'Hi John, are you free for a quick meeting at 2 PM today?' with the following characteristics:
[0083] • Greeting and Respect: Slight elevation in pitch on 'Hi John' introduced by subsonic modulation that subtly enhances the natural rising intonation of a casual greeting.
[0084] • Main Question: A noticeable but smooth pitch rise on 'are you free' and a steady rise towards 'today?' This intonation mimics a real human questioning pattern, making the query sound genuine and clear.
[0085] • Subtle Urgency: The subsonic frequencies subtly modulate the overall rhythm and intonation to inject an implicit sense of urgency, indicating the speaker's need to schedule the meeting promptly.
[0086] EMBODIMENTS
[0087] It is disclosed an Intonation-Aware Subsonic Harmonics (ISH) unit (2) adapted to be integrated into a Text-to-Speech (TTS) system (1).
[0088] According to a preferred embodiment, the ISH unit (2) comprises:
[0089] - A Subsonic Harmonics Module (2.3), configured to generate subsonic harmonic signals (f.) based on subsonic harmonic modulation parameters (e.);
[0090] - An Intonation Learning Module (2.2) configured to generate subsonic harmonic modulation parameters (e.) for feeding the Subsonic Harmonics Module (2.3), based on phonetic and prosodic structure information (d.) obtainable by a TTS system (1) from an input text (a.), and on dynamic contextual feedback data (c.);
[0091] - A Contextual Feedback Module (2.1) configured to continuously receive contextual information (b.) from the TTS system (1), extractable from the input text (a.), and to generate, based on the contextual information, dynamic contextual feedback data (c.) for feeding the Intonation Learning Module (2.2); wherein, the Intonation Learning Module (2.2) is configured to continuously receive dynamic contextual feedback data (c.) and to generate updated subsonic harmonic modulation parameters (e.), based on which the Subsonic Harmonics Module (2.3) is configured to adjust subsonic harmonic signals (f.) dynamically; and the ISH unit (2) further comprises:
[0092] - An integration Module (2.4) configured to merge subsonic harmonic signals (f.) with a preliminary speech waveform (g.) obtainable from the TTS system (1), thereby generating an enhanced natural-sounding speech signal (h.).
[0093] According to one embodiment of the ISH unit (2), the contextual information (b.), extractable from the input text (a.), may relate to at least sentence type and emotion recognition and the dynamic contextual feedback data (c.) may include information related to at least sentence types and emotional cues.
[0094] In addition, the subsonic harmonic modulation parameters (e.) relate to at least frequency and amplitude of the subsonic harmonic signals (f.).
[0095] According to another embodiment of the ISH unit (2), the Subsonic Harmonics Module (2.3) comprises processing means programmed to:
[0096] - generate waveforms within the subsonic frequency range, thereby generating subsonic harmonic signals (f.);
[0097] - apply frequency modulation, by adjusting frequency and amplitude of the generated signals (f.) based on subsonic harmonic modulation parameters (e.) obtained from the Intonation Learning Module (2.2), thereby aligning the subsonic harmonic signals (f.) with dynamic pitch contours of the preliminary speech waveform (g.).
[0098] More particularly, the Subsonic Harmonics Module (2.3) may be further configured to:
[0099] - continuously process incoming subsonic harmonic modulation parameters (e.);
[0100] - adjust subsonic harmonic signals (f.) dynamically, by adapting amplitude and frequency parameters, to match changes in the preliminary speech waveform (g.), identified by phonetic and prosodic structure information (d.) and / or on dynamic contextual feedback data (c.).
[0101] In another embodiment of the ISH unit (2), the Integration Module (2.4) is operable to merge subsonic harmonic signals (f.) with the preliminary speech waveform (g.), by being further configured to:
[0102] - analyse the preliminary speech waveform (g.) in order to classify pitch contours related to phonetic and prosodic events as suitable integration points;
[0103] - synchronize subsonic harmonic signals (f.) with the phonetic and prosodic events in the preliminary speech waveform (g.);
[0104] - add subsonic harmonic signals (f.) to the integration points identified in the preliminary speech waveform (g.) so as to adjust its prosodic features, i.e., pitch, rhythm and intonation.
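The synchronization and addition steps may be illustrated by a sketch in which each subsonic segment is added to the speech waveform at the sample offset of its integration point. The detection of integration points from pitch contours is not shown; the start times are assumed to be given, and the function name `add_at_events` is hypothetical.

```python
def add_at_events(speech_g, segments, sample_rate=16000):
    """Add subsonic segments to the speech waveform at given start times.

    `segments` is a list of (start_time_s, samples) pairs; samples that
    would fall outside the speech waveform are ignored.
    """
    out = list(speech_g)
    for start_s, samples in segments:
        offset = int(round(start_s * sample_rate))
        for i, value in enumerate(samples):
            if 0 <= offset + i < len(out):
                out[offset + i] += value
    return out


# Toy example at 2 samples per second so that the offsets are easy to follow.
enhanced_h = add_at_events(
    [0.0] * 6,
    [(0.5, [1.0, 2.0]), (2.0, [4.0, 8.0, 16.0])],
    sample_rate=2,
)
```

In the toy example the first segment starts at sample 1 and the second at sample 4; the last value of the second segment falls beyond the end of the waveform and is dropped.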
[0105] The Integration Module (2.4) may be further configured to:
[0106] - apply amplitude normalization, to avoid excessive amplitude variations arising from the addition of subsonic harmonic signals (f.) to the preliminary speech waveform (g.).
[0107] According to another embodiment of the ISH unit (2), phonetic and prosodic structure information (d.) may relate to phonemes and syllables included in the input text (a.). In addition, the Intonation Learning Module (2.2) may comprise a neural network trained to correlate subsonic harmonic profiles with intonation patterns derived from dynamic contextual feedback data (c.), thereby generating subsonic harmonic modulation parameters (e.), defining an appropriate subsonic harmonic profile, for each phoneme and syllable.
[0108] More particularly, the neural network of the Intonation Learning Module (2.2) may comprise:
[0109] - an input layer adapted for receiving phonetic and prosodic structure information (d.) and dynamic contextual feedback data (c.);
[0110] - a plurality of hidden layers of neurons, adapted to capture relationships between intonation patterns, subsonic harmonic frequencies, and dynamic contextual feedback data (c.);
[0111] - an output layer configured to generate subsonic harmonic modulation parameters (e.), for each phoneme and syllable.
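The layer structure listed above may be illustrated by the forward pass of a small fully connected network. This is a simplified stand-in: the weights are random and untrained, the numeric encoding of (d.) and (c.) is hypothetical, and the preferred recurrent or transformer architectures are not reproduced. The output is squashed so that the frequency stays below 20 Hz and the amplitude between 0 and 1.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases):
    """Weighted sum plus bias for every neuron of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, layers):
    """Map encoded (d.) and (c.) features to (frequency_hz, amplitude)."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = [math.tanh(v) for v in dense(hidden, weights, biases)]
    raw_freq, raw_amp = dense(hidden, *layers[-1])
    return 20.0 * sigmoid(raw_freq), sigmoid(raw_amp)


rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 2, rng)]
# Hypothetical encoding: [stress, syllable position, is_question, urgency].
freq_hz, amplitude = forward([1.0, 0.25, 1.0, 0.5], layers)
```

Calling `forward` once per phoneme or syllable would yield one pair of modulation parameters (e.) for each, matching the per-unit output described for the output layer.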
[0112] Even more particularly, the neural network of the Intonation Learning Module (2.2) may be trained on a training dataset comprising a plurality of labelled speech data annotated for at least intonation, prosody and emotion; preferably the neural network is a Long Short-Term Memory network or a transformer network.
[0113] In another embodiment, the ISH unit (2) may further comprise a Post-Processing module configured to implement a final smoothing and refinement signal processing stage. Said Post-Processing module may be configured to process the enhanced natural-sounding speech signal (h.) in order to generate a refined speech signal.
[0114] Finally, the present disclosure also relates to a TTS system (1) comprising the ISH unit (2) described.
REFERENCES
[0115] 1 - BEN HAYES ET AL: "A Review of Differentiable Digital Signal Processing for Music & Speech Synthesis", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 August 2023 (2023-08-29), XP0916004.
[0116] 2 - TAKAMICHI SHINNOSUKE ET AL: "Postfilters to Modify the Modulation Spectrum for Statistical Parametric Speech Synthesis", ARXIV:1806.04885V2, vol. 24, no. 4, 1 April 2016 (2016-04-01), pages 755-767, XP011602380.
[0117] 3 - AL MASUM SHAIKH M ET AL: "Emotional speech synthesis by sensing affective information from text", AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION AND WORKSHOPS, 2009. ACII 2009. 3RD INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 10 September 2009 (2009-09-10), pages 1-6, XP031577804.
Claims
CLAIMS
1. An Intonation-Aware Subsonic Harmonics - ISH - unit (2) adapted to be integrated into a Text-to-Speech - TTS - system (1); the ISH unit comprises:
- A Subsonic Harmonics Module (2.3), configured to generate subsonic harmonic signals (f.) based on subsonic harmonic modulation parameters (e.);
- An Intonation Learning Module (2.2) configured to generate subsonic harmonic modulation parameters (e.) for feeding the Subsonic Harmonics Module (2.3), based on phonetic and prosodic structure information (d.) obtainable by a TTS system (1) from an input text (a.), and on dynamic contextual feedback data (c.);
- A Contextual Feedback Module (2.1) configured to continuously receive contextual information (b.) from the TTS system (1), extractable from the input text (a.), and to generate, based on the contextual information, dynamic contextual feedback data (c.) for feeding the Intonation Learning Module (2.2);
wherein, the Intonation Learning Module (2.2) is configured to continuously receive dynamic contextual feedback data (c.) and to generate updated subsonic harmonic modulation parameters (e.), based on which the Subsonic Harmonics Module (2.3) is configured to adjust subsonic harmonic signals (f.) dynamically; and the ISH unit (2) further comprises:
- An integration Module (2.4) configured to merge subsonic harmonic signals (f.) with a preliminary speech waveform (g.) obtainable from the TTS system (1), thereby generating an enhanced natural-sounding speech signal (h.).
2. The ISH unit according to claim 1, wherein the contextual information (b.), extractable from the input text (a.), relates to at least sentence type and emotion recognition; the dynamic contextual feedback data (c.) including information related to at least sentence types and emotional cues.
3. The ISH unit according to claims 1 or 2, wherein the subsonic harmonic modulation parameters (e.) relate to at least frequency and amplitude of the subsonic harmonic signals (f.).
4. The ISH unit according to claim 3, wherein the Subsonic Harmonics Module (2.3) comprises processing means programmed to:
- generate waveforms within the subsonic frequency range, thereby generating subsonic harmonic signals (f.);
- apply frequency modulation, by adjusting frequency and amplitude of the generated signals (f.) based on subsonic harmonic modulation parameters (e.) obtained from the Intonation Learning Module (2.2), thereby aligning the subsonic harmonic signals (f.) with dynamic pitch contours of the preliminary speech waveform (g.).
5. The ISH unit according to claim 4, wherein the Subsonic Harmonics Module (2.3) is further configured to:
- continuously process incoming subsonic harmonic modulation parameters (e.);
- adjust subsonic harmonic signals (f.) dynamically, by adapting amplitude and frequency parameters, to match changes in the preliminary speech waveform (g.), identified by phonetic and prosodic structure information (d.) and / or on dynamic contextual feedback data (c.).
6. The ISH unit according to any of the previous claims, wherein the Integration Module (2.4) is operable to merge subsonic harmonic signals (f.) with the preliminary speech waveform (g.), by being further configured to:
- analyse the preliminary speech waveform (g.) in order to classify pitch contours related to phonetic and prosodic events as suitable integration points;
- synchronize subsonic harmonic signals (f.) with the phonetic and prosodic events in the preliminary speech waveform (g.);
- add subsonic harmonic signals (f.) to the integration points identified in the preliminary speech waveform (g.) so as to adjust its prosodic features, i.e., pitch, rhythm and intonation.
7. The ISH unit according to claim 6, wherein the Integration Module (2.4) is further configured to:
- apply amplitude normalization, to avoid excessive amplitude variations arising from the addition of subsonic harmonic signals (f.) to the preliminary speech waveform (g.).
8. The ISH unit according to any of the previous claims, wherein phonetic and prosodic structure information (d.) relates to phonemes and syllables included in the input text (a.), and the Intonation Learning Module (2.2) comprises a neural network trained to correlate subsonic harmonic profiles with intonation patterns derived from dynamic contextual feedback data (c.), thereby generating subsonic harmonic modulation parameters (e.), defining an appropriate subsonic harmonic profile, for each phoneme and syllable.
9. The ISH unit according to claim 8, wherein the neural network of the Intonation Learning Module (2.2) comprises:
- an input layer adapted for receiving phonetic and prosodic structure information (d.) and dynamic contextual feedback data (c.);
- a plurality of hidden layers of neurons, adapted to capture relationships between intonation patterns, subsonic harmonic frequencies, and dynamic contextual feedback data (c.);
- an output layer configured to generate subsonic harmonic modulation parameters (e.), for each phoneme and syllable.
10. The ISH unit according to claims 8 or 9, wherein the neural network is trained on a training dataset comprised by a plurality of labelled speech data annotated for at least intonation, prosody and emotion; preferably the neural network is a Long Short-Term Memory network or a transformer network.
11. The ISH unit according to any of the previous claims, further comprising a Post-Processing module configured to implement a final smooth and refinement signal processing stage; said Post-Processing module being configured to process the enhanced natural-sounding speech signal (h.) in order to generate a refined speech signal.
12. A TTS system comprising the ISH unit according to any of the previous claims 1 to 11.