An on-board high-quality Chinese speech synthesis method based on particle synthesis

By combining particle synthesis with deep neural networks, the real-time and robustness issues of speech synthesis in airborne environments were solved, achieving high-quality Chinese speech synthesis and improving the naturalness and clarity of the speech.

CN122201247A · Pending · Publication Date: 2026-06-12 · AVIC HUADONG OPTOELECTRONICS (SHANGHAI) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
AVIC HUADONG OPTOELECTRONICS (SHANGHAI) CO LTD
Filing Date
2026-02-25
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing Chinese speech synthesis technology suffers from insufficient real-time performance and robustness, as well as issues of sound quality distortion and incoherence in airborne environments, especially in complex environments where it is difficult to guarantee high-quality output.

Method used

Particle synthesis is combined with deep neural networks. Input text is preprocessed through word segmentation, punctuation processing, and number normalization; matching audio particles are retrieved from a pre-built high-quality pronunciation library; the particles are dynamically adjusted and spliced seamlessly; and the output is corrected in real time to optimize speech clarity.

Benefits of technology

It improves the naturalness and fluency of speech synthesis, ensures high-fidelity speech output in noisy airborne environments, adapts to changes in background noise, and maintains high intelligibility and environmental adaptability.


Abstract

The application provides an airborne high-quality Chinese speech synthesis method based on particle synthesis, which comprises the following steps: performing word segmentation, punctuation and digit / proper noun normalization processing on input Chinese text; converting into a sequence of pinyin with tones; predicting prosody parameters through a deep neural network; retrieving matched pinyin audio particles from a pronunciation library based on the parameters; performing time stretching, pitch adjustment and amplitude adjustment on the particles; dynamically combining the processed particles by using a particle synthesis algorithm, and realizing seamless splicing through cross-fading and time domain overlap; and correcting the output speech in real time according to the airborne environmental noise and reverberation characteristics. The application can improve speech intelligibility and robustness.

Description

Technical Field

[0001] This invention relates to the field of speech synthesis technology, and more specifically, to a particle synthesis-based airborne high-quality Chinese speech synthesis method.

Background Technology

[0002] Significant progress has been made in Chinese speech synthesis technology, primarily employing methods such as statistical parametric synthesis, concatenation synthesis, and end-to-end neural network synthesis. Traditional concatenation synthesis relies on pre-recorded audio units for splicing, which, while generating relatively natural speech, is prone to incoherence and sound quality distortion at the splicing points. End-to-end neural network synthesis, while achieving significant progress in speech naturalness and generation efficiency, still faces limitations in airborne environments due to the high demands for real-time performance, robustness, and sound quality. Furthermore, particle synthesis technology, as a sound effects processing method, has been widely used in music and environmental audio synthesis. Its core idea is to decompose sound into numerous tiny particles and recombine these particles through parameter control to achieve natural transitions and diverse effects. However, the application of particle synthesis technology in speech synthesis is still relatively limited, especially in high-quality Chinese speech synthesis in airborne environments.

[0003] In the course of implementing the embodiments of the present invention, the prior art was found to have at least the following problems or defects: traditional splicing synthesis methods are prone to incoherence and sound quality distortion at the splicing points, while end-to-end neural network synthesis struggles to meet the real-time and robustness requirements of airborne environments and occasionally drops sounds, which affects the reliability and stability of speech synthesis. Furthermore, existing speech synthesis methods still need improvement in the fineness of sound quality and the flexibility of emotional expression, especially in complex airborne environments, where high-quality speech output is difficult to guarantee.

Summary of the Invention

[0004] This invention provides an airborne high-quality Chinese speech synthesis method based on particle synthesis, comprising:
S1: Perform word segmentation, punctuation processing, and normalization of numbers and proper nouns on the input Chinese text;
S2: Convert the preprocessed text into a pinyin sequence with tone information;
S3: Predict the prosodic parameters of the pinyin sequence using a deep neural network, where the prosodic parameters include pitch envelope, duration, volume envelope, speech rate, and stress;
S4: Based on the prosodic parameters, retrieve matching pinyin audio particles from the pre-built pronunciation library;
S5: Perform time stretching, pitch adjustment, and amplitude adjustment on the retrieved audio particles;
S6: Use the particle synthesis algorithm to dynamically combine the processed audio particles in syllable order, and achieve seamless splicing through crossfading and temporal overlap;
S7: Based on the real-time noise level and reverberation characteristics of the airborne environment, correct the synthesized speech in real time and output it.

[0005] Furthermore, the normalization process in S1 includes: Convert numbers to Chinese pronunciation; Convert proper nouns to standard pronunciation.

[0006] Furthermore, the deep neural network in S3 takes the pinyin sequence and contextual information as input and outputs the prosodic parameters of each pinyin unit; The prosodic parameters are corrected or adaptively adjusted using historical data.

[0007] Furthermore, the pronunciation library is constructed through the following steps: Collect high-quality recordings of pinyin with different tones in a standard pronunciation environment; Divide the recordings into short-time audio particles; Label each audio particle with the corresponding pinyin and prosodic feature range metadata, and establish a retrieval database.

[0008] Furthermore, S5 specifically includes: Temporal scaling of audio particles is performed based on the target duration; Pitch shift is performed on audio particles based on the target pitch envelope; The amplitude of the audio particles is adjusted according to the target volume envelope.

[0009] Furthermore, the particle synthesis algorithm in S6 optimizes the naturalness of speech under different speech rates and emotional expressions by controlling the particle overlap rate, processing window length, and spectral smoothing parameters.

[0010] Furthermore, the dynamic combination in S6 includes: Arranging the audio particles in order of syllables and phrases; Eliminating splicing breaks through cross-fading and temporal overlap techniques.

[0011] Furthermore, real-time correction in S7 includes: Dynamically adjust the volume and spectrum of the output voice; Filtering and reverberation compensation algorithms are used to improve speech clarity.

[0012] Furthermore, the segmentation granularity of the short-time audio particles is smaller than the duration of a single pinyin syllable.

[0013] Furthermore, after S6, the method also includes: Applying digital filtering, denoising, and reverberation compensation to the synthesized speech.

[0014] The embodiments of the present invention have at least the following beneficial effects: 1. By combining particle synthesis technology with deep neural network prosody prediction, the problems of audio breaks and unnatural prosody in traditional splicing synthesis are effectively solved. Dynamic parameterization adjustment and seamless splicing of syllable-level particles are realized, improving the naturalness and fluency of synthesized speech.

[0015] 2. Based on a pre-built high-quality speech library and refined audio particle processing, it overcomes the defects of sound quality distortion and insufficient robustness that are prone to occur in end-to-end neural network synthesis under complex environments, ensuring that high-fidelity speech can still be output in noisy airborne environments.

[0016] 3. Through real-time noise monitoring and dynamic correction mechanisms, the problem of speech intelligibility degradation caused by airborne environmental noise interference is solved, enabling synthesized speech to adapt to changes in background noise and maintain high intelligibility and environmental adaptability.

Attached Figure Description

[0017] The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description taken in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated in the drawings by way of example and not limitation, wherein: Figure 1 is a flowchart illustrating an embodiment of the airborne high-quality Chinese speech synthesis method based on particle synthesis provided by the present invention.

Detailed Implementation

[0018] The principles and spirit of the invention will now be described with reference to several exemplary embodiments. It should be understood that these embodiments are provided merely to enable those skilled in the art to better understand and implement the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided to make the invention more thorough and complete, and to fully convey the scope of the invention to those skilled in the art.

[0019] Those skilled in the art will recognize that embodiments of the present invention can be implemented as a system, apparatus, device, method, or computer program product. Therefore, the present invention can be specifically implemented in the following forms: entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software.

[0020] It should be noted that the number of any elements in the accompanying drawings is for illustrative purposes only and not as a limitation, and any naming is for distinction only and has no limiting meaning.

[0021] Refer to Figure 1, which is a flowchart of the airborne high-quality Chinese speech synthesis method based on particle synthesis provided by an embodiment of the present invention. As shown in Figure 1, the method includes: S1: Perform word segmentation, punctuation processing, and normalization of numbers and proper nouns on the input Chinese text; S2: Convert the preprocessed text into a pinyin sequence with tone information; S3: Predict the prosody parameters of the pinyin sequence through a deep neural network, where the prosody parameters include pitch envelope, duration, volume envelope, speech rate, and stress; S4: Retrieve matching pinyin audio particles from a pre-constructed pronunciation library based on the prosody parameters; S5: Perform time stretching, pitch adjustment, and amplitude adjustment on the retrieved audio particles; S6: Use a particle synthesis algorithm to dynamically combine the processed audio particles in syllable order, and achieve seamless splicing through cross-fading and time-domain overlap; S7: Correct the synthesized speech in real time according to the real-time noise level and reverberation characteristics of the airborne environment, and output it.

[0022] First, perform word segmentation, punctuation processing, normalization of numbers and proper nouns on the input Chinese text. The purpose of this step is to convert the input text into a format suitable for subsequent speech synthesis processing. Word segmentation is to cut continuous text into independent lexical units to better understand the semantic structure of the text. Punctuation processing is to identify punctuation marks in the text, as punctuation marks affect intonation and pauses in speech synthesis. Normalization of numbers and proper nouns means converting numbers in the text into their corresponding Chinese pronunciation forms, for example, 123 is converted to one hundred and twenty-three, and at the same time converting proper nouns, such as names of people and places, into standard pronunciation forms to ensure the accuracy and naturalness of speech synthesis.

[0023] Specifically, word segmentation means cutting sentences in the text into individual words. For example, the sentence "今天天气很好" ("Today the weather is very good") is segmented into "今天 / 天气 / 很 / 好" ("today / weather / very / good"). Punctuation processing identifies punctuation marks such as commas and periods in the text, which affect pauses and intonation changes in speech synthesis. Number normalization converts Arabic numerals into Chinese pronunciations, for example, "2024" is converted to "two thousand and twenty-four". Proper noun normalization converts proper nouns in the text, such as names of people and places, into standard pronunciation forms, for example, "Beijing" is converted to "běi jīng". These processing steps ensure that the text can be correctly understood and pronounced during the speech synthesis process.

[0024] Preferably, word segmentation can adopt a rule-based method or a statistical model, such as the maximum matching method or a hidden Markov model. Punctuation processing can be achieved through predefined rules, such as adding a pause marker after a period. Number normalization can be performed according to the count and place value of the digits, for example, 123 is converted to one hundred and twenty-three. Proper noun normalization can be completed by using a pre-built dictionary that contains the standard pronunciations of common proper nouns. Through these specific technical means, the accuracy and efficiency of text preprocessing can be ensured, laying a foundation for subsequent speech synthesis steps.
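As an illustration of the rule-based option named above, the following is a minimal sketch of forward maximum matching, assuming a toy lexicon. The function name `forward_max_match`, the `max_len` parameter, and the `LEXICON` contents are hypothetical and do not appear in the application; a practical lexicon would be far larger.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


LEXICON = {"今天", "天气", "很", "好"}  # hypothetical toy lexicon
segments = forward_max_match("今天天气很好", LEXICON)
print(segments)
```

Greedy matching is fast and deterministic, which suits a real-time pipeline, but it cannot resolve ambiguous boundaries; that is the case where the statistical alternative mentioned in the paragraph would be preferred.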

[0025] In some embodiments, the normalization processing in S1 includes: Converting numbers to Chinese pronunciations; Converting proper nouns to standard pronunciations.

[0026] It should be noted that the normalization processing in S1 includes converting numbers to Chinese pronunciations and converting proper nouns to standard pronunciations. The purpose of the normalization processing is to ensure that the numbers and proper nouns in the text can be correctly pronounced during the speech synthesis process. Converting numbers to Chinese pronunciations means converting Arabic numerals or numerical expressions into corresponding Chinese pronunciation forms, for example, 123 is converted to one hundred and twenty-three. Converting proper nouns to standard pronunciations means converting proper nouns in the text, such as personal names, place names, and organization names, into their standard pronunciation forms, for example, Peking University is converted to běi jīng dà xué. Through this normalization processing, errors or unnatural pronunciations can be avoided when the speech synthesis system processes these special words.

[0027] Specifically, the process of converting numbers to Chinese pronunciations can be achieved through predefined rules. For example, a three-digit number can be split into hundreds, tens, and units, each of which is converted into the corresponding Chinese pronunciation. For example, 123 is split into 1 in the hundreds place (one hundred), 2 in the tens place (twenty), and 3 in the units place (three), which are combined into one hundred and twenty-three. Converting proper nouns to standard pronunciations can be completed by using a pre-built dictionary. The dictionary stores the standard pronunciations of common proper nouns, for example, Beijing corresponds to běi jīng, and Shanghai corresponds to shàng hǎi. When processing the text, the system looks up the dictionary to obtain the standard pronunciation of the proper noun. If a proper noun does not exist in the dictionary, the system can fall back to the default pinyin rules for conversion.

[0028] Preferably, the rules for converting numbers to Chinese pronunciations can be defined in detail according to the number of digits and positions of the numbers. For example, for numbers with more than four digits, they can be grouped and processed according to thousands, ten thousands, etc. to ensure the accuracy of the pronunciation. The standard pronunciations of proper nouns can be collected and sorted through a large-scale corpus to cover more proper nouns. In practical applications, the dictionary can be updated regularly to adapt to the emergence of new proper nouns.
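The place-value rules and the grouping by ten-thousands described above can be sketched as follows. This is an illustrative sketch only: the names `read_group` and `number_to_chinese` are hypothetical, the range is limited to 0 through 99,999,999, and the colloquial shortening of a leading 一十 to 十 is deliberately not handled.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def read_group(n):
    """Read 0 < n < 10000 digit by digit with the place units 千, 百, 十."""
    out = []
    pending_zero = False
    for power in (3, 2, 1, 0):
        digit = (n // 10 ** power) % 10
        if digit == 0:
            # An interior zero is voiced only if a nonzero digit follows it.
            pending_zero = bool(out)
            continue
        if pending_zero:
            out.append(DIGITS[0])
            pending_zero = False
        out.append(DIGITS[digit] + UNITS[power])
    return "".join(out)


def number_to_chinese(n):
    """Convert an integer to its Chinese reading, grouping by 万 (10,000)."""
    if not 0 <= n < 10 ** 8:
        raise ValueError("this sketch covers 0 .. 99,999,999 only")
    if n == 0:
        return DIGITS[0]
    high, low = divmod(n, 10000)
    if high == 0:
        return read_group(low)
    text = read_group(high) + "万"
    if low == 0:
        return text
    # A low group below 1000 skips the 千 place, which is voiced as 零.
    if low < 1000:
        text += DIGITS[0]
    return text + read_group(low)


print(number_to_chinese(123), number_to_chinese(2024), number_to_chinese(10001))
```

Dates, times, and fractions would need their own rules layered on top of this digit reader, since their spoken forms do not follow plain place-value reading.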

[0029] Furthermore, for certain special numerical expressions, such as dates, times, and fractions, specific rules can be designed for conversion. For example, a date such as June 15, 2024 that is written in numerals can be converted to its fully spoken year, month, and day reading. Through these refined processing steps, the accuracy and robustness of normalization can be further improved, ensuring high-quality speech synthesis.

[0030] In some embodiments, the deep neural network in S3 takes the pinyin sequence and context information as input and outputs the prosodic parameters of each pinyin unit; The prosodic parameters are corrected or adaptively adjusted using historical data.

[0031] It should be noted that the deep neural network in S3 takes the pinyin sequence and contextual information as input and outputs prosodic parameters for each pinyin unit. These prosodic parameters include pitch envelope, duration, volume envelope, speech rate, and stress. Deep neural networks are machine learning models capable of learning complex patterns and relationships from large amounts of data. In speech synthesis, prosodic parameters are key factors determining the naturalness and intelligibility of speech. The pitch envelope determines how the pitch varies, the duration determines how long each syllable lasts, the volume envelope determines how the loudness varies, the speech rate determines the speed of speech, and the stress determines which syllables are emphasized. These parameters are corrected or adaptively adjusted using historical data so that the synthesized speech closely matches the characteristics of real speech.

[0032] Specifically, the pinyin sequence is the text converted into pinyin with tone information, such as converting "Today's weather is good" into "jin1 tian1 tian1 qi4 hen3 hao3". Contextual information refers to the pinyin units surrounding the current pinyin unit, which provide a semantic and phonological context. The input to the deep neural network includes these pinyin sequences and their contextual information. After processing by multiple layers of the neural network, the network outputs prosodic parameters for each pinyin unit. For example, for "jin1", the network outputs an upward-trending pitch envelope, a duration of 100 milliseconds, a volume envelope that starts high and then falls, a speech rate of 150 words per minute, and a stress label marking the syllable as unstressed. These parameters are learned from training on a large amount of speech data and can reflect the prosodic features of real speech.
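To make the network input concrete, the sketch below parses a tone-numbered pinyin string into units and builds a context window around one unit. It is an illustration under assumptions: the `PinyinUnit` type, the `<pad>` placeholder, and the one-neighbour `radius` default are hypothetical choices, not details given in the application.

```python
from typing import List, NamedTuple


class PinyinUnit(NamedTuple):
    syllable: str
    tone: int


def parse_pinyin(sequence: str) -> List[PinyinUnit]:
    """Split a string such as 'jin1 tian1' into (syllable, tone) units."""
    units = []
    for token in sequence.split():
        # The trailing digit is the tone number; the rest is the syllable.
        units.append(PinyinUnit(token[:-1], int(token[-1])))
    return units


def context_window(units, index, radius=1, pad=PinyinUnit("<pad>", 0)):
    """Return the unit at `index` with `radius` neighbours on each side, padded at the edges."""
    return [
        units[i] if 0 <= i < len(units) else pad
        for i in range(index - radius, index + radius + 1)
    ]


units = parse_pinyin("jin1 tian1 tian1 qi4 hen3 hao3")
window = context_window(units, 0)
print(window)
```

Each window would then be encoded numerically before being passed to the prosody model; the encoding scheme is left open here because the application does not specify one.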

[0033] Preferably, the deep neural network can employ a recurrent neural network (RNN) or its variants, such as a long short-term memory network (LSTM) or a gated recurrent unit (GRU), because these network structures are well-suited for handling sequential data. During model construction, the input layer receives the pinyin sequence and its contextual information, the hidden layers extract and learn features through multiple layers of neurons, and the output layer generates prosodic parameters for each pinyin unit. During training, real speech data labeled with prosodic parameters is used as training samples, and the network parameters are optimized using a backpropagation algorithm. For example, mean squared error can be used as the loss function to measure the difference between the predicted and true prosodic parameters, and gradient descent can be used to adjust the network weights.

[0034] Furthermore, to further improve the robustness and accuracy of the model, regularization techniques, such as Dropout, can be introduced to prevent overfitting. Through these specific techniques, deep neural networks can effectively predict prosodic parameters, supporting high-quality speech synthesis.

[0035] In some embodiments, the pronunciation library is constructed through the following steps: Collect high-quality recordings of pinyin with different tones in a standard pronunciation environment; The recording is divided into short-time audio particles; Each audio particle is labeled with corresponding pinyin and prosodic feature range metadata, and a retrieval database is established.

[0036] It should be noted that building the pronunciation library is a crucial step in achieving high-quality speech synthesis. The pronunciation library is a collection of pre-recorded audio particles and their metadata; these audio particles are the basic units from which speech is synthesized. The purpose of building the library is to provide sufficiently rich, high-quality audio samples so that audio particles can be flexibly selected and adjusted based on prosodic parameters during speech synthesis. Specifically, building the library involves acquiring high-quality recordings of pinyin with different tones in a standard pronunciation environment, segmenting the recordings into short-time audio particles, labeling each audio particle with the corresponding pinyin, prosodic feature ranges, and other metadata, and finally establishing a retrieval database. This approach ensures that the audio particle best matching the target prosody can be retrieved quickly and accurately during speech synthesis.

[0037] Specifically, the construction of the pronunciation database involves the following key steps. First, high-quality recordings of pinyin with different tones are acquired in a standard pronunciation environment. This standard pronunciation environment refers to a low-noise, high-fidelity recording environment to ensure clear and distortion-free speech. Second, the recordings are segmented into short-time audio particles. Short-time audio particles are segments of long audio, typically containing one or more pinyin units. For example, a complete sentence can be segmented into multiple short-time audio particles, each corresponding to a pinyin unit or a portion thereof. Then, metadata such as the corresponding pinyin and prosodic feature range is labeled for each audio particle. Metadata refers to information describing the characteristics of the audio particle, such as pinyin units, pitch range, duration range, and volume range. This metadata is used to quickly retrieve the audio particle that best matches the target prosodic parameters during speech synthesis. Finally, a retrieval database is established to efficiently query and extract the required audio particles during speech synthesis.

[0038] Preferably, the construction of the pronunciation database can be further refined into the following steps. First, during recording, professional broadcasters or speech experts can be invited to record to ensure the naturalness and standardization of the speech. High-quality recording equipment should be used during the recording process, and it should be carried out in a professional recording studio to reduce the interference of environmental noise. Second, when segmenting the recording into short-duration audio particles, precise segmentation can be performed based on the boundaries of the pinyin units. For example, for a sentence containing multiple pinyin units, a speech activity detection algorithm can be used to determine the start and end time points of each pinyin unit, thereby achieving accurate audio particle segmentation. Then, when annotating metadata, an automatic annotation tool combined with manual verification can be used to ensure the accuracy and completeness of the annotation. For example, the pitch range can be obtained by analyzing the spectral characteristics of the audio particles, the duration range can be obtained by measuring the duration of the audio particles, and the volume range can be obtained by analyzing the energy characteristics of the audio particles. Finally, when establishing the retrieval database, an efficient index structure, such as an inverted index or a hash table, can be used to quickly retrieve the audio particles that best match the target prosodic parameters during speech synthesis. Through these detailed steps, a high-quality and efficient pronunciation library can be built, providing a solid foundation for speech synthesis.
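The hash-table index mentioned above can be sketched as a dictionary keyed by tone-numbered pinyin, with candidates ranked by how far the target prosody falls outside each particle's labelled ranges. Everything named here (`Particle`, `PronunciationLibrary`, `audio_id`, and the relative-gap scoring) is a hypothetical illustration; the application does not define a matching score.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Particle:
    pinyin: str          # tone-numbered pinyin label, e.g. "jin1"
    pitch_hz: tuple      # (low, high) pitch range metadata
    duration_ms: tuple   # (low, high) duration range metadata
    audio_id: str        # hypothetical handle to the stored samples


def _range_gap(value, bounds):
    """Distance from value to the closed interval bounds (0 when inside)."""
    low, high = bounds
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def _distance(particle, pitch_hz, duration_ms):
    # Relative gaps keep Hz and milliseconds on a comparable scale.
    return (_range_gap(pitch_hz, particle.pitch_hz) / pitch_hz
            + _range_gap(duration_ms, particle.duration_ms) / duration_ms)


class PronunciationLibrary:
    def __init__(self):
        self._index = defaultdict(list)  # hash table: pinyin -> particles

    def add(self, particle):
        self._index[particle.pinyin].append(particle)

    def retrieve(self, pinyin, pitch_hz, duration_ms):
        """Return the best-matching particle for the target prosody, or None."""
        candidates = self._index.get(pinyin, [])
        if not candidates:
            return None
        return min(candidates, key=lambda p: _distance(p, pitch_hz, duration_ms))


library = PronunciationLibrary()
library.add(Particle("jin1", (180, 210), (90, 110), "a"))
library.add(Particle("jin1", (220, 260), (110, 140), "b"))
print(library.retrieve("jin1", 200, 100).audio_id)
```

The lookup by pinyin is constant time on average, so the cost per syllable is dominated by the small scan over candidates that share the same label.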

[0039] In some embodiments, S5 specifically includes: Temporal scaling of audio particles is performed based on the target duration; Pitch shift is performed on audio particles based on the target pitch envelope; The amplitude of the audio particles is adjusted according to the target volume envelope.

[0040] It should be noted that the operations in S5 involve time stretching, pitch adjustment, and amplitude adjustment of the retrieved audio particles. The purpose of these operations is to adjust the audio particles based on the target prosodic parameters generated by the prosodic prediction model, enabling them to better match the prosodic features of the target speech. Time stretching refers to changing the duration of the audio particle, pitch adjustment refers to changing the pitch of the audio particle, and amplitude adjustment refers to changing the loudness of the audio particle. These adjustment operations are key steps in achieving high-quality speech synthesis, ensuring the naturalness and coherence of the synthesized speech.

[0041] Specifically, time scaling refers to altering the duration of an audio particle using digital signal processing techniques without changing its pitch. For example, if the target prosodic parameter requires the duration of an audio particle to be adjusted from 100 milliseconds to 120 milliseconds, this can be achieved through interpolation or resampling techniques. Pitch adjustment refers to adjusting the pitch of an audio particle by changing its fundamental frequency. For example, if the target pitch envelope requires the pitch of an audio particle to be adjusted from 200Hz to 220Hz, this can be achieved through a pitch shifting algorithm. Amplitude adjustment refers to adjusting the loudness of an audio particle by changing its amplitude. For example, if the target volume envelope requires the loudness of an audio particle to be adjusted from 60dB to 70dB, this can be achieved through gain adjustment. The parameter settings for these operations are based on the output of the prosodic prediction model, ensuring that the adjusted audio particles meet the requirements of the target prosodic parameter.
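The three numerical examples in this paragraph reduce to simple ratios, sketched below. The helper names are hypothetical, and the gain formula assumes the dB figures are amplitude levels (20·log10 of an amplitude ratio), which the application does not state explicitly.

```python
import math


def stretch_ratio(src_ms, dst_ms):
    """Factor by which the particle duration must be scaled."""
    return dst_ms / src_ms


def pitch_shift_semitones(src_hz, dst_hz):
    """Pitch shift in equal-tempered semitones between two fundamental frequencies."""
    return 12 * math.log2(dst_hz / src_hz)


def gain_factor(src_db, dst_db):
    """Linear amplitude multiplier for a level change given in dB (amplitude convention assumed)."""
    return 10 ** ((dst_db - src_db) / 20)


print(stretch_ratio(100, 120))          # 100 ms -> 120 ms
print(pitch_shift_semitones(200, 220))  # 200 Hz -> 220 Hz
print(gain_factor(60, 70))              # 60 dB -> 70 dB
```

Under these assumptions the examples correspond to a stretch factor of 1.2, an upward shift of roughly 1.65 semitones, and an amplitude multiplier of roughly 3.16.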

[0042] Preferably, time scaling can be implemented using a phase vocoder algorithm, which can adjust the time scale of the audio signal while maintaining pitch. Pitch adjustment can be achieved using a pitch-synchronous overlap-add (PSOLA) algorithm, which can adjust the pitch of audio particles while preserving their naturalness. Amplitude adjustment can be achieved through simple gain adjustment, i.e., changing the amplitude of the audio particles by multiplying by a scaling factor. In practical applications, the specific parameters of these operations can be obtained directly from the output of the prosody prediction model, such as target duration, target pitch, and target volume.

[0043] Furthermore, to further improve the accuracy and naturalness of the adjustments, smooth transition techniques can be introduced during the adjustment process to avoid abrupt changes between audio particles. For example, in pitch adjustment, linear or nonlinear interpolation can be used to smoothly transition pitch changes; in amplitude adjustment, progressive gain adjustment can be used to avoid abrupt changes in loudness. Through these refined processing steps, it can be ensured that the audio particles can better match the target prosodic parameters after adjustment, thereby improving the overall quality of the synthesized speech.

[0044] In some embodiments, the particle synthesis algorithm in S6 optimizes the naturalness of speech under different speech rates and emotional expressions by controlling the particle overlap rate, processing window length, and spectral smoothing parameters.

[0045] It's worth noting that the particle synthesis algorithm in S6 optimizes the naturalness of speech at different speaking speeds and emotional expressions by controlling particle overlap rate, processing window length, and spectral smoothing parameters. Particle synthesis is a technique that combines audio particles into continuous speech according to specific rules. Its core lies in achieving natural transitions and coherence in speech by adjusting the overlap rate between particles, the processing window length, and the spectral smoothing parameters. The particle overlap rate determines the degree of overlap between adjacent audio particles, the processing window length determines the time range of each audio particle during processing, and the spectral smoothing parameter is used to reduce spectral discontinuities during audio particle splicing. Optimizing these parameters ensures that the synthesized speech remains natural and fluent at different speaking speeds and emotional expressions.

[0046] Specifically, particle overlap rate refers to the proportion of time that overlaps between adjacent audio particles. For example, if an audio particle has a duration of 100 milliseconds and an overlap rate of 50%, then there will be a 50-millisecond overlap between adjacent particles. Processing window length refers to the time range during audio processing where each audio particle is processed. For example, a processing window length of 100 milliseconds means that each audio particle will be processed considering the audio information before and after it for 50 milliseconds. Spectral smoothing parameters control the spectral transition during audio particle splicing. For example, adjusting the spectral smoothing parameters can reduce spectral discontinuities, making the transition between audio particles more natural. These parameters need to be adjusted according to specific speech synthesis needs. For example, a higher particle overlap rate is needed to ensure speech coherence at fast speech rates, while a longer processing window length is needed to capture the emotional features of the speech when expressing strong emotions.

[0047] Preferably, the particle synthesis algorithm can be implemented through the following steps: First, dynamically adjust the particle overlap rate according to the target speech rate and emotional expression. For example, at a fast speech rate, the particle overlap rate can be set to 30%~40% to reduce the transition time between audio particles; at a slow speech rate, the particle overlap rate can be increased to 50%~60% to increase the smoothness of the transition. Second, adjust the processing window length according to the duration and speech rate of the audio particles. For example, for shorter audio particles, a shorter processing window length, such as 50 milliseconds, can be set, while for longer audio particles, a longer processing window length, such as 150 milliseconds, can be set. Finally, perform spectral smoothing processing on the audio particles using a spectral smoothing algorithm, such as a Hanning window or a Hamming window, to reduce spectral discontinuities at splicing points. For example, a Hanning window can be used to weight the spectrum of the audio particles, making the spectrum smoother at splicing points. Through these refined processing steps, the particle synthesis algorithm can effectively optimize the naturalness of speech and adapt to different speech rates and emotional expression needs.
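The relationship between overlap rate, window, and output can be sketched as a windowed overlap-add. Note the substitution: the paragraph describes weighting the spectrum with a Hanning window, whereas this sketch applies the window to each particle in the time domain, which is the more common form in granular synthesis. The function names and the equal-length-particle assumption are illustrative only.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by half the length sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def overlap_add(grains, overlap_rate):
    """Window each equal-length grain and sum the grains at the hop implied by overlap_rate."""
    size = len(grains[0])
    # With a 50% overlap rate the hop is half the grain length.
    hop = max(1, round(size * (1 - overlap_rate)))
    window = hann(size)
    out = [0.0] * (hop * (len(grains) - 1) + size)
    for k, grain in enumerate(grains):
        for n, sample in enumerate(grain):
            out[k * hop + n] += window[n] * sample
    return out


# Three constant grains of 8 samples at 50% overlap.
signal = overlap_add([[1.0] * 8, [1.0] * 8, [1.0] * 8], overlap_rate=0.5)
print(len(signal))
```

At a 50% overlap rate the shifted Hann windows sum to a constant, so the interior of the output carries no amplitude ripple. Other overlap rates, such as the 30% to 40% range mentioned for fast speech, do not have this property with a Hann window and would need gain normalization.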

[0048] In some embodiments, the dynamic combination in S6 includes: arranging the audio particles in order of syllables and phrases; and eliminating splicing breaks through crossfading and temporal overlap techniques.

[0049] It should be noted that the dynamic combination in S6 refers to arranging the audio particles in syllable and phrase order and eliminating splicing breaks through crossfading and temporal overlap techniques. The purpose of dynamic combination is to arrange and splice the processed audio particles according to the pronunciation order of natural language, ensuring the coherence and naturalness of the synthesized speech. Crossfading involves gradually decreasing the volume of the preceding particle while gradually increasing the volume of the following particle in the region where two audio particles overlap, thus achieving a smooth transition. Temporal overlap involves partially overlapping two audio particles in time to reduce abruptness at splicing points. These techniques effectively avoid unnatural splicing between audio particles, improving the overall quality of the synthesized speech.

[0050] Specifically, dynamic combination involves the following key concepts. Syllable order refers to arranging audio particles according to the order of syllables, based on the pronunciation rules of natural language. For example, in the pinyin sequence jin1 tian1, jin1 and tian1 each correspond to a syllable and need to be arranged according to their order in the text. Phrase order refers to further considering the structure of phrases on top of syllables. For example, "today's weather" can be divided into two phrases, "today" and "weather," and the audio particles need to be arranged according to the order of these phrases. Crossfading achieves a smooth transition by adjusting the volume of the overlapping parts of audio particles. For example, if the overlap time of two audio particles is 50 milliseconds, the volume of the first particle can be gradually decreased in the first 25 milliseconds, while the volume of the second particle can be gradually increased in the last 25 milliseconds. Temporal overlap refers to partially overlapping two audio particles in time. For example, if there is a 10-millisecond interval between the end time of one audio particle and the start time of the next audio particle, they can be made to overlap by 10 milliseconds by adjusting the time axis, thereby reducing the abruptness of the splicing point.

[0051] Preferably, the dynamic combination process can be implemented through the following steps. First, the order of the audio particles is determined based on the syllable and phrase information provided by the text analysis module. For example, the input text "Today the weather is fine" can be segmented into the phrases "today," "weather," and "very good," and the corresponding audio particles can then be arranged according to the pinyin sequence jin1 tian1 tian1 qi4 hen3 hao3. Second, appropriate crossfading parameters are set for splicing the audio particles. For example, the overlap time of crossfading can be set to 30 milliseconds, with the volume of the previous particle gradually decreasing in the first 15 milliseconds and the volume of the next particle gradually increasing in the last 15 milliseconds. Finally, the time alignment of the audio particles is adjusted using temporal overlap. For example, if a time interval is detected between two audio particles, their time axes can be adjusted using interpolation or resampling techniques so that they overlap by 10 milliseconds at the splicing point. Through these refined processing steps, the audio particles transition naturally during splicing without obvious breaks, thereby improving the overall naturalness and coherence of the synthesized speech.
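The splicing step can be illustrated with the sketch below. It is a simplification, not the claimed procedure: it uses complementary linear fade-out and fade-in ramps spanning the whole overlap region instead of the 15-millisecond/15-millisecond split described above, and it represents each particle as a plain list of samples. All function names and the 16 kHz sample rate in the usage example are assumptions.

```python
from functools import reduce


def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def crossfade_splice(first: list, second: list, overlap: int) -> list:
    """Join two particles, mixing the last `overlap` samples of `first`
    with the first `overlap` samples of `second` using linear ramps."""
    if overlap <= 0:
        return list(first) + list(second)
    if overlap > min(len(first), len(second)):
        raise ValueError("overlap longer than a particle")
    head = list(first[:-overlap])
    tail = first[-overlap:]
    mixed = []
    for i in range(overlap):
        fade_in = (i + 1) / (overlap + 1)      # rises towards 1
        fade_out = 1.0 - fade_in               # falls towards 0
        mixed.append(tail[i] * fade_out + second[i] * fade_in)
    return head + mixed + list(second[overlap:])


def splice_in_order(particles: list, overlap: int) -> list:
    """Splice particles left to right, i.e. in syllable order."""
    if not particles:
        return []
    return reduce(lambda acc, nxt: crossfade_splice(acc, nxt, overlap), particles)


# Usage: six placeholder particles standing in for jin1 tian1 tian1 qi4 hen3 hao3.
sample_rate = 16000                              # assumed sample rate
overlap = ms_to_samples(30, sample_rate)         # 30 ms crossfade from the text
particles = [[1.0] * 1000 for _ in range(6)]
speech = splice_in_order(particles, overlap)
```

Because the two ramps always sum to one, a signal of constant level passes through the splice without a dip or bump in amplitude.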

[0052] In some embodiments, the real-time correction in S7 includes: dynamically adjusting the volume and spectrum of the output speech; and using filtering and reverberation compensation algorithms to improve speech clarity.

[0053] It should be noted that the real-time correction in S7 includes dynamically adjusting the volume and spectrum of the output speech, and employing filtering and reverberation compensation algorithms to improve speech clarity. The purpose of real-time correction is to optimize the synthesized speech based on the real-time noise level and reverberation characteristics of the airborne environment, ensuring audibility and clarity in complex environments. Dynamic volume adjustment refers to changing the loudness of the speech in real time according to the intensity of ambient noise, making it clearly audible. Spectrum adjustment involves optimizing the frequency components of the speech signal to reduce interference from ambient noise. Filtering algorithms are used to remove noise components from the speech signal, while reverberation compensation algorithms are used to reduce the impact of ambient reverberation on the speech, thereby improving speech clarity and naturalness.

[0054] Specifically, dynamic volume adjustment refers to automatically adjusting the loudness of speech based on the real-time monitored ambient noise level. For example, if the ambient noise is high, the speech volume can be increased by 10-15 decibels; if the ambient noise is low, the speech volume can be decreased by 5-10 decibels. Spectrum adjustment refers to optimizing the frequency components of the speech signal to reduce interference from ambient noise. For example, if the ambient noise is mainly concentrated in the low-frequency band, the audibility of the speech can be enhanced by boosting the high-frequency components. Filtering algorithms are used to remove noise components from the speech signal; for example, a low-pass filter can be used to remove high-frequency noise, or a band-pass filter can be used to retain the main frequency components of the speech. Reverberation compensation algorithms are used to reduce the impact of ambient reverberation on speech; for example, the characteristics of ambient reverberation can be calculated, and then a corresponding anti-reverberation algorithm can be applied to cancel the reverberation effect. These parameter settings need to be adjusted according to the specific airborne environment to ensure the clarity and naturalness of the speech under different conditions.

[0055] Preferably, the real-time correction process can be achieved through the following steps. First, the noise level and reverberation characteristics of the airborne environment are acquired in real time through an environmental noise monitoring module. For example, a microphone array can be used to measure the intensity and frequency distribution of environmental noise, while reverberation characteristics are calculated using an acoustic model. Second, the speech volume is dynamically adjusted according to the environmental noise level. For example, a volume adjustment threshold can be set, and the speech volume is automatically increased when the environmental noise exceeds this threshold. Then, the speech signal is spectrally adjusted using a spectrum analysis algorithm. For example, if the environmental noise is detected to be mainly concentrated in the low-frequency band, the audibility of the speech can be enhanced by boosting the high-frequency components. Finally, filtering and reverberation compensation algorithms are applied to process the speech signal. For example, an adaptive filter can be used to remove noise components, while an anti-reverberation algorithm is applied to cancel the reverberation effect. Through these refined processing steps, it can be ensured that the synthesized speech maintains high-quality output in complex airborne environments, improving the clarity and naturalness of the speech.
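The threshold-based volume adjustment can be sketched as follows. The two noise thresholds and the exact gain values are hypothetical; they are chosen only so that the boost falls within the 10-15 decibel range and the cut within the 5-10 decibel range mentioned earlier in the description.

```python
def gain_db_for_noise(noise_db: float,
                      high_threshold_db: float = 85.0,
                      low_threshold_db: float = 60.0) -> float:
    """Map a measured ambient noise level to a speech gain in decibels.
    Thresholds and gain values are illustrative assumptions."""
    if noise_db >= high_threshold_db:
        return 12.0    # boost, within the 10-15 dB range
    if noise_db <= low_threshold_db:
        return -6.0    # cut, within the 5-10 dB range
    return 0.0         # leave the volume unchanged in between


def db_to_linear(gain_db: float) -> float:
    """Convert an amplitude gain in decibels to a linear factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples: list, gain_db: float) -> list:
    """Scale every sample by the linear equivalent of `gain_db`."""
    factor = db_to_linear(gain_db)
    return [s * factor for s in samples]
```

In a real-time system the gain would normally be smoothed over time to avoid audible jumps when the noise level crosses a threshold; that smoothing is omitted here.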

[0056] In some embodiments, the segmentation granularity of the short-time audio particles is smaller than the duration of a single pinyin syllable.

[0057] It should be noted that the segmentation granularity of the short-time audio particles is smaller than the duration of a single pinyin syllable. This requirement aims to provide greater flexibility and accuracy during speech synthesis. Short-time audio particles are the short time segments obtained by dividing the audio signal, each containing partial information from one or more pinyin units. Segmentation granularity refers to the duration of these audio particles. By setting the segmentation granularity to be smaller than the duration of a single pinyin syllable, the prosodic features of the audio particles can be finely adjusted during speech synthesis, thereby improving the naturalness and coherence of the synthesized speech.

[0058] Specifically, the segmentation granularity of short-time audio particles refers to the duration of the audio particle, typically measured in milliseconds. For example, the duration of a pinyin syllable ranges from 100 to 300 milliseconds, while the segmentation granularity of short-time audio particles can be set to 50 milliseconds or less. Such granularity allows for finer processing of audio particles during speech synthesis, such as more precise matching of target prosodic parameters in time stretching, pitch adjustment, and amplitude modulation. For instance, if the target prosodic parameter requires the duration of an audio particle to be adjusted from 100 milliseconds to 120 milliseconds, a smaller segmentation granularity provides more adjustment space, resulting in a smoother transition. Furthermore, the segmentation granularity of short-time audio particles also affects the retrieval efficiency of audio particles and the quality of synthesized speech. Smaller segmentation granularity allows for a wider selection of audio segments, thereby improving the diversity and naturalness of the synthesized speech.
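As a rough illustration of segmentation granularity, the sketch below cuts one recorded syllable into fixed-length particles. The non-overlapping, fixed-length cutting scheme and the function name are assumptions; the application does not specify how the recordings are divided.

```python
def split_into_particles(samples: list, sample_rate: int,
                         particle_ms: float = 50.0) -> list:
    """Cut a recording into consecutive particles of `particle_ms` each.
    The final particle may be shorter if the length is not a multiple."""
    size = int(round(sample_rate * particle_ms / 1000))
    if size <= 0:
        raise ValueError("particle_ms too small for this sample rate")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# A 200 ms syllable at an assumed 16 kHz sample rate yields four 50 ms particles.
syllable = [0.0] * 3200
particles = split_into_particles(syllable, 16000, 50.0)
print(len(particles), len(particles[0]))  # 4 800
```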

[0059] Preferably, the segmentation granularity of the short-time audio particles can be adjusted according to specific speech synthesis requirements. For example, when constructing a pronunciation library, the audio signal can be divided into multiple short-time audio particles, each with a duration of 30 to 50 milliseconds. During speech synthesis, the short-time audio particle that best matches the target prosody is retrieved from the pronunciation library based on the target prosodic parameters generated by the prosody prediction model. For example, if the target prosodic parameters require the pitch of a certain audio particle to be adjusted from 200Hz to 220Hz, a smaller segmentation granularity provides more audio particles to choose from, thereby achieving more precise pitch adjustment.
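One possible way to pick the best-matching particle is sketched below. The metadata fields, the cost function (pitch distance in semitones plus a log-ratio duration distance), and the sample library are all hypothetical; the application does not define the matching criterion.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Particle:
    pinyin: str          # tonal pinyin label, e.g. "jin1"
    pitch_hz: float      # average pitch of the recorded particle
    duration_ms: float   # length of the recorded particle


def semitones(from_hz: float, to_hz: float) -> float:
    """Pitch shift, in semitones, needed to move from one pitch to another."""
    return 12.0 * math.log2(to_hz / from_hz)


def retrieve(library: list, pinyin: str,
             target_pitch_hz: float, target_duration_ms: float) -> Particle:
    """Return the particle with the same pinyin that needs the least adjustment."""
    candidates = [p for p in library if p.pinyin == pinyin]
    if not candidates:
        raise LookupError(f"no particle for {pinyin!r}")

    def cost(p: Particle) -> float:
        pitch_cost = abs(semitones(p.pitch_hz, target_pitch_hz))
        duration_cost = abs(math.log(target_duration_ms / p.duration_ms))
        return pitch_cost + duration_cost

    return min(candidates, key=cost)


library = [
    Particle("jin1", 200.0, 40.0),
    Particle("jin1", 215.0, 40.0),
    Particle("tian1", 220.0, 40.0),
]
best = retrieve(library, "jin1", target_pitch_hz=220.0, target_duration_ms=40.0)
```

Shifting 200Hz to 220Hz is a change of about 1.65 semitones, whereas the 215Hz particle needs only about 0.4 semitones, so a denser library leaves less residual adjustment for the pitch-shifting step.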

[0060] Furthermore, to further improve the naturalness of synthesized speech, crossfading and temporal overlap techniques can be introduced during the concatenation of audio particles. For example, the overlap time for crossfading can be set to 10 to 20 milliseconds, and a smooth transition can be achieved by gradually adjusting the volume of the audio particles. Through these refined processing steps, it can be ensured that short-duration audio particles can better match the target prosodic parameters during speech synthesis, thereby improving the overall quality and naturalness of the synthesized speech.

[0061] In some embodiments, the method further includes the following after S6: the synthesized speech is subjected to digital filtering, noise reduction, and reverberation compensation.

[0062] It should be noted that the digital filtering, noise reduction, and reverberation compensation applied to the synthesized speech after S6 are intended to further improve the quality of the speech signal, making it more suitable for use in airborne environments. Digital filtering is a technique that uses mathematical algorithms to process signals, removing unwanted frequency components or enhancing specific frequency components. Noise reduction aims to reduce the interference of background noise with the speech signal, improving speech clarity. Reverberation compensation addresses the effects of reflected sound in the environment, adjusting the speech signal algorithmically to make it sound more natural and clear. These processing steps are crucial for ensuring that synthesized speech maintains high quality in complex environments, such as airborne environments.

[0063] Specifically, digital filtering refers to the frequency-selective processing of speech signals using filters. For example, low-pass filters remove high-frequency noise, high-pass filters remove low-frequency interference, and band-pass filters preserve signal components within a specific frequency range. Noise reduction is typically achieved through noise estimation and signal enhancement algorithms. For instance, spectral subtraction can be used to estimate the power spectral density of background noise and subtract this noise from the speech signal. Reverberation compensation involves estimating the reverberation characteristics of the environment and applying anti-reverberation algorithms to counteract the reverberation effect. For example, reverberation characteristics can be estimated by measuring the room impulse response and then compensated using an adaptive filter. The parameter settings for these processing steps need to be optimized based on specific environmental conditions and the characteristics of the speech signal to ensure the quality of the speech signal.
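The spectral subtraction idea can be sketched on per-bin power values as follows. The sketch assumes the power spectra have already been computed (for example by a short-time Fourier transform, which is not shown), and the spectral floor of 1% is an illustrative choice to avoid negative power, not a value from the application.

```python
def estimate_noise_psd(silent_frames: list) -> list:
    """Average the power spectra of frames known to contain only noise."""
    count = len(silent_frames)
    if count == 0:
        raise ValueError("need at least one silent frame")
    bins = len(silent_frames[0])
    return [sum(frame[k] for frame in silent_frames) / count
            for k in range(bins)]


def spectral_subtract(power_spectrum: list, noise_psd: list,
                      floor_ratio: float = 0.01) -> list:
    """Subtract the noise estimate from each bin, never going below a
    small fraction of the original power (the spectral floor)."""
    return [max(p - n, floor_ratio * p)
            for p, n in zip(power_spectrum, noise_psd)]


noise = estimate_noise_psd([[1.0, 2.0], [3.0, 4.0]])   # [2.0, 3.0]
cleaned = spectral_subtract([10.0, 2.5], noise)
```

In the second bin the noise estimate exceeds the observed power, so the result is clamped to the floor instead of becoming negative.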

[0064] Preferably, digital filtering, noise reduction, and reverberation compensation processing can be implemented through the following steps. First, for digital filtering, an appropriate filter type and parameters can be selected based on the frequency characteristics of the speech signal. For example, if the main frequency components of the speech signal are concentrated between 300Hz and 3400Hz, a bandpass filter with a passband range of 300Hz to 3400Hz can be designed to remove noise components outside this range. Second, for noise reduction, noise estimation methods based on short-time energy and short-time zero-crossing rate can be used, combined with spectral subtraction or Wiener filters for noise suppression. For example, the power spectral density of background noise can be estimated in the silent segments of the speech signal, and then spectral subtraction can be applied to the speech segments for noise reduction. Finally, for reverberation compensation, an anti-reverberation algorithm based on room impulse response can be used. For example, by measuring the room impulse response in the airborne environment, an adaptive filter can be used to perform anti-reverberation processing on the speech signal, thereby reducing the impact of reverberation on the speech. Through these refined processing steps, the quality of synthesized speech in the airborne environment can be effectively improved, ensuring the clarity and naturalness of the speech signal.
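The 300Hz to 3400Hz band-pass example can be approximated by the sketch below, which cascades a first-order high-pass and a first-order low-pass filter. First-order sections are a deliberately crude stand-in chosen to keep the example self-contained; a practical design would likely use a higher-order filter with steeper roll-off. The function names and the 16 kHz sample rate are assumptions.

```python
import math


def high_pass(samples: list, cutoff_hz: float, sample_rate: int) -> list:
    """First-order RC-style high-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def low_pass(samples: list, cutoff_hz: float, sample_rate: int) -> list:
    """First-order RC-style low-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    beta = dt / (rc + dt)
    out, prev_y = [], 0.0
    for x in samples:
        prev_y = prev_y + beta * (x - prev_y)
        out.append(prev_y)
    return out


def band_pass(samples: list, low_hz: float, high_hz: float,
              sample_rate: int) -> list:
    """Keep roughly the band between `low_hz` and `high_hz`."""
    return low_pass(high_pass(samples, low_hz, sample_rate), high_hz, sample_rate)


def tone(freq_hz: float, sample_rate: int, seconds: float = 1.0) -> list:
    """Generate a unit-amplitude sine wave for testing."""
    count = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(count)]


def steady_rms(samples: list) -> float:
    """RMS of the second half of the signal, skipping the start-up transient."""
    tail = samples[len(samples) // 2:]
    return math.sqrt(sum(v * v for v in tail) / len(tail))
```

Feeding a 1000Hz tone and a 50Hz tone through `band_pass(..., 300.0, 3400.0, 16000)` and comparing `steady_rms` before and after shows that the in-band tone is largely preserved while the low-frequency tone is strongly attenuated.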

[0065] The above embodiments of the present invention have the following beneficial effects: 1. By combining particle synthesis technology with deep neural network prosody prediction, the problems of audio breaks and unnatural prosody in traditional splicing synthesis are effectively solved. Dynamic parameterization adjustment and seamless splicing of syllable-level particles are realized, improving the naturalness and fluency of synthesized speech.

[0066] 2. Based on a pre-built high-quality speech library and refined audio particle processing, it overcomes the defects of sound quality distortion and insufficient robustness that are prone to occur in end-to-end neural network synthesis under complex environments, ensuring that high-fidelity speech can still be output in noisy airborne environments.

[0067] 3. Through real-time noise monitoring and dynamic correction mechanisms, the problem of speech intelligibility degradation caused by airborne environmental noise interference is solved, enabling synthesized speech to adapt to changes in background noise and maintain high intelligibility and environmental adaptability.

[0068] Furthermore, the storage medium in the embodiments of this application stores program instructions capable of implementing all the above methods. These program instructions can be stored in the storage medium in the form of a software product, including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, or terminal devices such as computers, servers, mobile phones, and tablets.

[0069] The above description is merely an explanation of some preferred embodiments of the present invention and the technical principles employed. Those skilled in the art should understand that the scope of the invention as described in the embodiments of the present invention is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, it covers technical solutions formed by interchanging the above-described features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present invention.

Claims

1. A high-quality airborne Chinese speech synthesis method based on particle synthesis, characterized in that it comprises the following steps: S1: performing word segmentation, punctuation processing, and normalization of numbers and proper nouns on the input Chinese text; S2: converting the preprocessed text into a pinyin sequence with tone information; S3: predicting the prosodic parameters of the pinyin sequence using a deep neural network, the prosodic parameters including pitch envelope, duration, volume envelope, speech rate, and stress; S4: retrieving, based on the prosodic parameters, matching pinyin audio particles from a pre-built pronunciation library; S5: performing time stretching, pitch adjustment, and amplitude adjustment on the retrieved audio particles; S6: dynamically combining the processed audio particles in syllable order using a particle synthesis algorithm, and achieving seamless splicing through crossfading and temporal overlap; and S7: correcting the synthesized speech in real time based on the real-time noise level and reverberation characteristics of the airborne environment, and outputting it.

2. The method according to claim 1, characterized in that the normalization in S1 includes: converting numbers to their Chinese pronunciation; and converting proper nouns to their standard pronunciation.

3. The method according to claim 1, characterized in that the deep neural network in S3 takes the pinyin sequence and context information as input and outputs the prosodic parameters of each pinyin unit, and the prosodic parameters are corrected or adaptively adjusted using historical data.

4. The method according to claim 1, characterized in that the pronunciation library is constructed through the following steps: collecting high-quality recordings of pinyin with different tones in a standard pronunciation environment; dividing the recordings into short-time audio particles; and labeling each audio particle with metadata on its corresponding pinyin and prosodic feature range, and establishing a retrieval database.

5. The method according to claim 1, characterized in that S5 specifically includes: time-scaling the audio particles based on the target duration; pitch-shifting the audio particles based on the target pitch envelope; and adjusting the amplitude of the audio particles according to the target volume envelope.

6. The method according to claim 1, characterized in that the particle synthesis algorithm in S6 optimizes the naturalness of speech under different speech rates and emotional expressions by controlling the particle overlap rate, processing window length, and spectral smoothing parameters.

7. The method according to claim 1, characterized in that the dynamic combination in S6 includes: arranging the audio particles in order of syllables and phrases; and eliminating splicing breaks through crossfading and temporal overlap techniques.

8. The method according to claim 1, characterized in that the real-time correction in S7 includes: dynamically adjusting the volume and spectrum of the output speech; and using filtering and reverberation compensation algorithms to improve speech clarity.

9. The method according to claim 4, characterized in that the segmentation granularity of the short-time audio particles is smaller than the duration of a single pinyin syllable.

10. The method according to claim 1, characterized in that, after S6, the method further includes: subjecting the synthesized speech to digital filtering, noise reduction, and reverberation compensation.