A multi-level speech synthesis system and method for high-determinism professional scenarios

A multi-level speech synthesis system addresses three weaknesses of existing TTS technology in high-determinism professional scenarios: polyphonic-character ambiguity, coarse-grained emotion control, and insufficient vocal energy management. It achieves zero misreading of polyphonic characters, fine-grained controllable emotion, and closed-loop correction, thereby improving the accuracy and naturalness of speech synthesis. It is applicable to fields such as medicine, aviation, finance, and the judiciary.

CN122201250A · Pending · Publication Date: 2026-06-12 · 王秉钦

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: 王秉钦
Filing Date: 2026-03-23
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing TTS technology has several drawbacks in professional scenarios that demand deterministic output: ambiguous pronunciation of polyphonic characters, no determinism guarantee, coarse-grained emotion control, no closed-loop correction mechanism, and no energy decay model. These issues lead to a high risk of misreading and poor naturalness, so the technology fails to meet the exacting needs of fields such as medicine, aviation, finance, and the judiciary.

Method used

A multi-level speech synthesis system is adopted, comprising a G2P deterministic look-up table engine, a FAME bionic emotion engine, a VEDM sound energy attenuation model, and an SCAC speech correction audit chain, combined with a pluggable synthesis engine, to achieve deterministic pronunciation, natural speech quality, and closed-loop correction.

Benefits of technology

It achieves zero misreading of polyphonic characters, precise and controllable emotion, biomimetic energy management, and closed-loop auditability. This improves the accuracy and naturalness of speech synthesis and meets the stringent requirements of high-determinism professional fields such as finance, the judiciary, aviation, and medicine.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a multi-level speech synthesis system and method for high-determinism professional scenarios. The system comprises a G2P deterministic look-up table engine (530,000 entries, forward maximum matching, zero misreading of polyphonic characters), a FAME bionic emotion engine (8-dimensional continuous emotion space, pre-synthesis injection), a VEDM sound energy attenuation model (bionic energy management, micro-tremor and breathing simulation), an SCAC speech correction audit chain (closed-loop correction, digital watermarking, audit log), and a pluggable synthesis engine architecture. The application addresses the core problems of existing TTS systems in professional scenarios, namely a high misreading rate for polyphonic characters, no determinism guarantee, and no means of audit or tracing. It is suited to vertical fields such as medicine, aviation, finance, and the judiciary, which have zero tolerance for pronunciation errors.

Description

Technical Field

[0001] The present invention belongs to the technical field of text-to-speech (TTS), and particularly relates to a multi-level speech synthesis system and method for scenarios with strict determinism requirements, such as medicine, aviation, finance, and the judiciary.

Background Art

[0002] Deficiencies of the prior art: Current mainstream TTS technologies (such as Tacotron, VITS, and GPT-SoVITS) are mainly oriented towards general consumer scenarios and have the following core defects:

1. Polyphonic-character ambiguity cannot be eliminated: End-to-end neural network models infer pronunciation from statistical probability, so there is a systematic risk of misreading polyphonic characters in professional terms (such as 行, read "hang", in 银行行长, "bank president", and 会, read "kuai", in 会计, "accountant"), with an error rate of approximately 2 - 5%.

[0003] 2. No determinism guarantee: General-purpose TTS systems cannot promise users "zero misreading". In scenarios such as medicine (drug names), aviation (instruction codes), and finance (amount announcements), a single misreading may lead to a serious accident or a compliance risk.

[0004] 3. Coarse-grained emotion control: Existing emotional TTS systems support only coarse categories such as "happy / sad / angry" and cannot meet the finer emotional requirements of professional broadcasts, such as "authority", "urgency", and "soothing".

[0005] 4. No closed-loop correction mechanism: Existing systems generate speech in a single pass and cannot automatically detect and correct pronunciation deviations after generation.

[0006] 5. No energy decay model: In long-text synthesis the distribution of voice energy is unnatural, because nothing manages energy in a way that simulates human physiological characteristics.

[0007] Market demand: The global TTS market was worth 3.349 billion US dollars in 2024 and is expected to reach 8.588 billion US dollars in 2031 (CAGR 14.4%). Within it, high-determinism professional TTS is a seriously underestimated blue-ocean market segment:
- Medical voice market (operating room announcements, drug verification, patient notifications): estimated annual demand of 1.2 - 1.8 billion US dollars
- Aviation voice market (ground crew instructions, cabin announcements, emergency broadcasts): estimated annual demand of 0.8 - 1.2 billion US dollars
- Financial compliance voice market (transaction announcements, risk control reminders, customer notifications): estimated annual demand of 0.6 - 1 billion US dollars
- Judicial voice market (courtroom records, evidence announcements, legal document readings): estimated annual demand of 0.3 - 0.5 billion US dollars

Summary of the Invention

[0008] Technical problem to be solved: Provide a speech synthesis system that combines a deterministic pronunciation guarantee with natural speech quality, making it suitable for professional scenarios with zero tolerance for pronunciation errors.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the overall architecture of the multi-level speech synthesis system for high-determinism professional scenarios of the present invention, showing the data flow between the G2P deterministic look-up table engine, the FAME bionic emotion engine, the VEDM sound energy attenuation model, the SCAC speech correction audit chain, and the pluggable synthesis engine.

Technical Solution

[0009] The present invention discloses a multi-level speech synthesis system comprising the following five core modules:

I. G2P (Grapheme-to-Phoneme) deterministic look-up table engine

1. Multi-level look-up table architecture:
- First level: domain-specific dictionary (medical / aviation / finance / judiciary, etc.; highest priority)
- Second level: general phrase dictionary (369,583 Chinese phrases)
- Third level: single-character dictionary (41,923 single characters, with all polyphonic characters marked)
- Fourth level: English dictionary (126,052 entries from the CMU pronunciation dictionary)
- Fifth level: symbol mapping table (mathematical symbols, Greek letters, special symbols)
2. Forward maximum matching algorithm:
- Preferentially matches the longest phrase, so that "bank president" (银行行长) is matched as a whole as "yin2 hang2 hang2 zhang3" instead of being looked up character by character
- Query latency < 0.03 ms per lookup
3. Tone sandhi rule engine:
- Third-tone sandhi (when two third tones are consecutive, the first changes to the second tone)
- Tone sandhi rules for "yi" (一, "one") and "bu" (不, "not")
- Context-dependent prosodic adjustment
4. 4-level classification system for 10 major domains:
- 10 first-level domains → 23 second-level subclasses
- Domain classification takes priority over general classification
- Supports dynamic expansion of domain dictionaries
5. Bidirectional reuse:
- Forward: G2P (text → pronunciation) for TTS synthesis
- Reverse: P2G (pronunciation → text) for ASR post-processing
- The same look-up table with 537,000 records serves a closed TTS+ASR loop

II. FAME (Frequency-Amplitude Modulation Engine) bionic emotion engine

1. 8-dimensional emotion space model:
- Continuous parameterization of 8 independent emotion dimensions
- 10 preset emotion templates (authoritative, gentle, urgent, sad, joyful, etc.)
- Supports customizable emotion ratios
2. Pre-synthesis emotion injection:
- Emotion parameters are injected into the BERT semantic encoder before synthesis
- A CLAP model maps a text description of the emotion to an emotion vector
- Emotions are "baked" into the generation process, not added afterwards
3. Speech rate linkage control:
- Automatically adjusts speech rate according to emotion type (urgent → faster, sad → slower)
- Can be overridden manually

III. VEDM (Vocal Energy Decay Model) sound energy attenuation model

1. Bionic energy decay:
- Simulates the natural energy decay of the human vocal cords during prolonged vocalization
- Decay rate, recovery rate, and minimum energy are all configurable
- Default parameters: decay_rate=0.015, recovery_rate=0.3, min_energy=0.85
2. Micro-tremor simulation:
- tremor_intensity=0.0015: simulates the subtle vibrations caused by vocal cord fatigue
- The tremor frequency is randomized to avoid a mechanical feel
3. Breathing noise injection:
- breath_noise=0.00015: injects extremely weak breathing noise when energy is below the breathing threshold (0.88)
- Follows the biomimetic design principle of "making the interference as subtle as possible, just barely perceptible"
4. Streaming buffering function:
- As a post-processing layer, VEDM naturally forms a buffer / load mask for streaming output
- While the next synthesis is pending, VEDM's decay effect masks the loading delay

IV. SCAC (Speech Correction and Auditing Chain)

1. Closed-loop correction mechanism:
- Speech synthesis → ASR recognition → comparison with the original text → deviation detection → automatic re-synthesis
- The correction loop repeats until the deviation rate is 0 or the maximum number of iterations is reached
2. Audit log chain:
- Records a complete log for each synthesis (input text, G2P result, emotion parameters, VEDM parameters, output checksum)
- Supports traceability, auditability, and immutability
- Meets financial compliance and judicial evidence requirements
3. Digital watermark embedding:
- Embeds an inaudible digital watermark in the synthesized speech
- The watermark includes the synthesis timestamp, version number, and parameter signature
- Used for anti-counterfeiting and liability attribution

V. Pluggable synthesis engine architecture

1. Engine-independent design:
- The four modules G2P / FAME / VEDM / SCAC are independent of the specific synthesis engine
- Compatible with backends such as VITS, Bert-VITS2, CosyVoice, and Fish-Speech
- The synthesis engine is hot-swappable and requires no modification to the upper-level pipeline
2. Compact model feed format:
- Parallel int8 arrays; sentence-level fields are not repeated for each token
- Enumeration values are integer-encoded
- Minimizes memory and bandwidth overhead during inference

Beneficial Effects

[0010] 1. Zero misreading of polyphonic characters: By using deterministic look-up tables instead of probabilistic inference, the ambiguity of polyphonic characters is completely eliminated (100% accuracy in actual testing).
2. Uncompromising sound quality: The look-up table only determines the pronunciation; the naturalness of the speech is entirely guaranteed by the neural network synthesis engine.
3. Finely controlled emotion: An 8-dimensional continuous emotion space supports arbitrary emotion ratios.
4. Biomimetic energy management: VEDM significantly improves the naturalness of long-text synthesis.
5. Closed-loop auditability: The SCAC chain meets the most stringent compliance requirements.
6. Pluggable engine: The technology stack is not bound to any single model.
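
The layered look-up with forward maximum matching described in the technical solution can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the three tiny tables, the `max_len` window, the rule that level priority only breaks ties between spans of equal length, and the choice to raise an error on an unknown character are all assumptions.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[str]]

# Hypothetical miniature tables, assumed for illustration only.
DOMAIN_DICT: Table = {"银行行长": ["yin2", "hang2", "hang2", "zhang3"]}
PHRASE_DICT: Table = {"银行": ["yin2", "hang2"], "会计": ["kuai4", "ji4"]}
CHAR_DICT: Table = {"银": ["yin2"], "行": ["xing2"], "长": ["chang2"],
                    "会": ["hui4"], "计": ["ji4"]}
LEVELS: List[Table] = [DOMAIN_DICT, PHRASE_DICT, CHAR_DICT]  # highest priority first


def lookup(span: str) -> Optional[List[str]]:
    """Return the reading from the highest-priority level that contains span."""
    for table in LEVELS:
        if span in table:
            return table[span]
    return None


def g2p(text: str, max_len: int = 8) -> List[str]:
    """Forward maximum matching: at each position take the longest known span."""
    result: List[str] = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            reading = lookup(text[pos:pos + size])
            if reading is not None:
                result.extend(reading)
                pos += size
                break
        else:
            # Assumed policy: fail loudly instead of guessing a pronunciation.
            raise KeyError(f"no entry for {text[pos]!r} at position {pos}")
    return result
```

With these tables, `g2p("会计")` resolves through the phrase level to `kuai4 ji4`, whereas a character-by-character look-up would have returned the default reading `hui4` for 会.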

Claims

1. A speech synthesis system for high-determinism professional scenarios, characterized in that it comprises:
- a G2P deterministic look-up table module, which contains a multi-level look-up table architecture consisting, in order, of the domain-specific dictionary, the general phrase dictionary, the single-character dictionary, the English dictionary, and the symbol mapping table, and which queries using the forward maximum matching algorithm so that the query result is 100% deterministic;
- a FAME bionic emotion engine module, which adopts an 8-dimensional continuous emotion space model and injects emotion parameters into the semantic encoder before synthesis;
- a VEDM sound energy attenuation model module, which acts as a post-processing layer to simulate the energy attenuation characteristics of the human vocal cords, including energy attenuation, micro-tremor simulation, and breathing noise injection;
- an SCAC speech correction audit chain module, which realizes automatic correction through the closed loop "synthesis → recognition → comparison → re-synthesis" and generates a complete audit log chain;
- a pluggable synthesis engine interface, which enables the above four modules to operate independently of the specific neural network synthesis engine.
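
The audit log chain named in claim 1 is described as recording the input text, G2P result, emotion parameters, VEDM parameters, and an output checksum for each synthesis. The patent does not say how immutability is obtained; one common way to make such a log tamper-evident is a hash chain, sketched below. SHA-256, the JSON canonicalization, the field names, and the `GENESIS` sentinel are assumptions made for this example.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed sentinel used as prev_hash of the first record


def _digest(body: Dict[str, Any]) -> str:
    """Hash a record body in a canonical (key-sorted) JSON form."""
    canonical = json.dumps(body, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(chain: List[Dict[str, Any]], text: str, g2p_result: List[str],
                  emotion: List[float], vedm: Dict[str, float],
                  audio: bytes) -> Dict[str, Any]:
    """Append one synthesis record that commits to the previous record's hash."""
    body = {
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
        "input_text": text,
        "g2p_result": g2p_result,
        "emotion_params": emotion,
        "vedm_params": vedm,
        "output_checksum": hashlib.sha256(audio).hexdigest(),
    }
    record = dict(body, hash=_digest(body))
    chain.append(record)
    return record


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and link; any edited record breaks the chain."""
    prev = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Because each record commits to the hash of its predecessor, changing any earlier field invalidates every later link, which is what makes after-the-fact edits detectable.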

2. The system according to claim 1, characterized in that the multi-level look-up table architecture of the G2P deterministic look-up table module contains a 4-level classification system for 10 major domains, the domain-specific dictionary has a higher priority than the general dictionary, and the same look-up table can be reused bidirectionally for forward TTS synthesis and reverse ASR processing.
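
The bidirectional reuse in claim 2 can be pictured as inverting the same table: the forward direction maps text to a reading, and the reverse direction maps a reading back to candidate texts. The sketch below is an assumption-laden illustration; the table contents and the function name `build_reverse_index` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical forward (G2P) entries; 权利 and 权力 are homophones.
FORWARD: Dict[str, List[str]] = {
    "银行": ["yin2", "hang2"],
    "权利": ["quan2", "li4"],
    "权力": ["quan2", "li4"],
}


def build_reverse_index(forward: Dict[str, List[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Invert a G2P table into a P2G index keyed by the reading."""
    reverse: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for phrase, reading in forward.items():
        reverse[tuple(reading)].append(phrase)
    return dict(reverse)


P2G = build_reverse_index(FORWARD)
```

Unlike the forward direction, the reverse direction is one-to-many for homophones: here the reading `quan2 li4` returns both 权利 and 权力, so ASR post-processing would still need context to choose between them.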

3. The system according to claim 1, characterized in that the FAME bionic emotion engine module maps a text description of the emotion to an emotion vector through the CLAP model and injects it into the BERT semantic encoder, so that the emotion is "baked in" during the synthesis stage.

4. The system according to claim 1, characterized in that the bionic parameters of the VEDM sound energy attenuation model module include: decay rate (decay_rate), recovery rate (recovery_rate), minimum energy threshold (min_energy), micro-tremor intensity (tremor_intensity), breathing threshold (breath_threshold), and breathing noise intensity (breath_noise), and each parameter can be configured independently.
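
The patent names the six VEDM parameters and their defaults but gives no formulas, so the following is only one plausible reading. It assumes a per-frame energy envelope that falls linearly while voiced and recovers during pauses, rates expressed per second, a uniformly distributed tremor, and a 20 ms frame; `VEDMParams`, `FrameControl`, and `vedm_envelope` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class VEDMParams:
    decay_rate: float = 0.015        # energy lost per second of voicing (assumed unit)
    recovery_rate: float = 0.3       # energy regained per second of pause (assumed unit)
    min_energy: float = 0.85         # floor of the energy envelope
    tremor_intensity: float = 0.0015 # maximum relative gain jitter
    breath_threshold: float = 0.88   # breathing noise is added below this energy
    breath_noise: float = 0.00015    # breathing-noise amplitude


@dataclass(frozen=True)
class FrameControl:
    energy: float     # slow fatigue envelope in [min_energy, 1.0]
    gain: float       # envelope with micro-tremor applied
    noise_amp: float  # breathing-noise amplitude for this frame


def vedm_envelope(voiced: Sequence[bool], frame_sec: float = 0.02,
                  params: VEDMParams = VEDMParams(), seed: int = 0) -> List[FrameControl]:
    """Compute per-frame gain and noise controls from a voiced/unvoiced mask."""
    rng = random.Random(seed)
    energy = 1.0
    controls: List[FrameControl] = []
    for is_voiced in voiced:
        if is_voiced:
            energy = max(params.min_energy, energy - params.decay_rate * frame_sec)
        else:
            energy = min(1.0, energy + params.recovery_rate * frame_sec)
        tremor = rng.uniform(-params.tremor_intensity, params.tremor_intensity)
        noise = params.breath_noise if energy < params.breath_threshold else 0.0
        controls.append(FrameControl(energy, energy * (1.0 + tremor), noise))
    return controls
```

A post-processing layer would then multiply each frame's samples by `gain` and add noise scaled by `noise_amp`. Under these assumed units, continuous voicing reaches the 0.85 floor after 10 seconds.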

5. The system according to claim 1, characterized in that the SCAC speech correction audit chain module embeds an inaudible digital watermark in the synthesized speech, the watermark including the synthesis timestamp, version number, and parameter signature, for anti-counterfeiting verification and attribution of liability.
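
The watermark payload in claim 5 carries a timestamp, a version number, and a parameter signature. How the signature is computed and how the payload is hidden in the audio are not specified. The sketch below covers only the payload and assumes an HMAC-SHA256 over the timestamp, version, and synthesis parameters with a secret key; the function names and field names are hypothetical, and the audio embedding step is out of scope.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def _sign(timestamp: int, version: str, params: Dict[str, Any], key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON form of the signed fields."""
    message = json.dumps(
        {"timestamp": timestamp, "version": version, "params": params},
        sort_keys=True,
    ).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def build_payload(timestamp: int, version: str, params: Dict[str, Any],
                  key: bytes) -> Dict[str, Any]:
    """Assemble the data that a watermark embedder would hide in the audio."""
    return {
        "timestamp": timestamp,
        "version": version,
        "param_signature": _sign(timestamp, version, params, key),
    }


def verify_payload(payload: Dict[str, Any], params: Dict[str, Any], key: bytes) -> bool:
    """Check an extracted payload against the parameters recorded in the audit log."""
    expected = _sign(payload["timestamp"], payload["version"], params, key)
    return hmac.compare_digest(expected, payload["param_signature"])
```

Verification needs the original parameters, which is where an audit log entry for the same synthesis would come in.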

6. A speech synthesis method based on the system of claim 1, characterized in that it includes the following steps:
S1: Receive the input text and determine the domain to which the text belongs through the domain classifier.
S2: Send the text to the G2P deterministic look-up table module and perform forward maximum matching in the priority order "domain dictionary → phrase dictionary → single-character dictionary → English dictionary → symbol mapping" to obtain a deterministic pronunciation sequence.
S3: Generate an 8-dimensional emotion vector through the FAME module according to the emotion configuration and inject it into the semantic encoder.
S4: Send the pronunciation sequence and emotion parameters to the pluggable synthesis engine to generate the original speech signal.
S5: Apply VEDM sound energy attenuation processing to the original speech signal, including energy attenuation, micro-tremor simulation, and breathing noise injection.
S6: Send the synthesized speech to the SCAC module, recognize it through ASR, and compare the result with the original text. If the deviation rate exceeds the threshold, return to S4 for re-synthesis.
S7: Generate an audit log, embed a digital watermark, and output the final speech.
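
The correction loop of steps S4 to S6 can be sketched as below. The patent does not define the deviation rate, so this example assumes character-level edit distance divided by the reference length. The `synthesize` and `recognize` callables stand in for the pluggable engine (with VEDM applied) and the ASR system, and all names are hypothetical.

```python
from typing import Callable, Tuple, TypeVar

Audio = TypeVar("Audio")


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_ch in enumerate(ref, 1):
        cur = [i]
        for j, hyp_ch in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (ref_ch != hyp_ch)))  # substitution
        prev = cur
    return prev[-1]


def deviation_rate(ref: str, hyp: str) -> float:
    """Assumed metric: edit distance normalized by reference length."""
    if not ref:
        return float(bool(hyp))
    return edit_distance(ref, hyp) / len(ref)


def synthesize_with_correction(text: str,
                               synthesize: Callable[[str, int], Audio],
                               recognize: Callable[[Audio], str],
                               threshold: float = 0.0,
                               max_iters: int = 3) -> Tuple[Audio, float, int]:
    """Repeat S4-S6 until the deviation is within threshold or attempts run out."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    for attempt in range(1, max_iters + 1):
        audio = synthesize(text, attempt)               # S4 and S5
        rate = deviation_rate(text, recognize(audio))   # S6
        if rate <= threshold:
            break
    return audio, rate, attempt
```

The function returns the last audio together with its deviation rate and the number of attempts, so a caller can tell a clean result from one that merely hit the iteration limit.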

7. The method according to claim 6, characterized in that the tone sandhi rules in step S2 include the sandhi rule for consecutive third tones and the sandhi rules for "yi" and "bu", and the sandhi rules are applied after the table look-up and before synthesis.
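
The sandhi rules in claim 7 can be illustrated on numbered pinyin. This sketch applies the standard Mandarin rules in simplified form: a third tone before another third tone becomes second tone, 不 (bu4) becomes bu2 before a fourth tone, and 一 (yi1) becomes yi2 before a fourth tone and yi4 before a first, second, or third tone. It ignores prosodic grouping, neutral tones, and the cases where 一 keeps its first tone (for example in numbers). The function name is hypothetical.

```python
from typing import List, Sequence


def _tone(syllable: str) -> int:
    """Read the trailing tone digit of a numbered pinyin syllable."""
    return int(syllable[-1])


def apply_sandhi(chars: Sequence[str], pinyin: Sequence[str]) -> List[str]:
    """Apply simplified tone sandhi, deciding from the dictionary tones."""
    out = list(pinyin)
    for i in range(len(pinyin) - 1):
        nxt = _tone(pinyin[i + 1])
        if chars[i] == "不" and pinyin[i] == "bu4":
            if nxt == 4:
                out[i] = "bu2"
        elif chars[i] == "一" and pinyin[i] == "yi1":
            if nxt == 4:
                out[i] = "yi2"
            elif nxt in (1, 2, 3):
                out[i] = "yi4"
        elif _tone(pinyin[i]) == 3 and nxt == 3:
            out[i] = pinyin[i][:-1] + "2"
    return out
```

For example, `apply_sandhi("你好", ["ni3", "hao3"])` yields `["ni2", "hao3"]`. The function runs on the output of the table look-up, matching the ordering stated in the claim.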

8. The method according to claim 6, characterized in that the VEDM processing in step S5 follows the biomimetic design principle of "minimizing interference as much as possible", with the following default parameters: decay rate 0.015, recovery rate 0.3, minimum energy 0.85, micro-tremor intensity 0.0015, and breathing noise 0.00015.