Speech synthesis method and system based on bone conduction signal and lip image fusion

By fusing bone conduction signals and lip images, and utilizing cross-modal attention mechanisms and generative adversarial networks, the robustness and naturalness issues of speech synthesis in complex scenarios are solved, achieving efficient and accurate speech synthesis and improving the reliability of voice interaction.

CN116343793B · Active · Publication Date: 2026-06-23 · NAT INNOVATION INST OF DEFENSE TECH PLA ACAD OF MILITARY SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NAT INNOVATION INST OF DEFENSE TECH PLA ACAD OF MILITARY SCI
Filing Date: 2022-11-21
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In complex scenarios such as high noise, strong concealment, and high mobility, traditional air-guided audio paths cannot effectively transmit speech information, and existing single-modal speech synthesis technologies suffer from one-sided speech information representation and insufficient noise resistance.

Method used

By fusing bone conduction signals and lip images, and through cross-modal attention mechanisms and generative adversarial networks, a common content of modal collaborative representation is established. This is then combined with Mel spectrograms to generate a high-resolution speech synthesis model, which is then used in conjunction with a back-end classification neural network to achieve speech synthesis.

Benefits of technology

It achieves efficient, natural, and accurate speech synthesis in complex scenarios, improves the robustness and feasibility of voice interaction, preserves the speaker's timbre and rhythm, and reduces the impact of external disturbances.


Figure CN116343793B_ABST

Abstract

The present application relates to a speech synthesis method and system based on the fusion of bone conduction signals and lip images, comprising the following steps: bone conduction signals and lip movement image signals are acquired synchronously while the user's voice input is collected; single-modal data features in the time domain and the spatial domain are determined from the bone conduction signal and the lip movement image signal; based on the determined dual-source single-modal data features in the time and spatial domains, a generative adversarial network with a cross-modal attention mechanism and a Mel spectrogram fusion method are applied to establish a speech model and obtain a modal collaborative feature representation; based on the obtained modal collaborative feature representation, a neural network model recognizes it as specific phrases and instructions for output, and speech synthesis is realized using a voice synthesis model. The above algorithm realizes modal collaborative representation of common content, compensates for the representational deficiencies of each single modality on its own, and optimizes the speech synthesis effect under high noise interference or in silent mode, thereby expanding the feasibility of voice interaction.

Description

Technical Field

[0001] This invention belongs to the field of silent speech signal processing technology, and specifically relates to a speech synthesis method and system based on the fusion of bone conduction signals and lip images.

Background Technology

[0002] Voice interaction is the most common and natural mode of interpersonal communication, and therefore a reliable and effective carrier of information.

[0003] In the implementation of voice information transmission, the most obvious and mature method is to encode, on the transmitting side, the audio signal generated by the speaker and conducted through the air, transmit it through the communication equipment, and decode it on the receiving side, as in common microphone- and headphone-based communication systems. However, in scenarios such as field operations, emergency rescue, military special operations, and medical rehabilitation, the traditional air-conducted audio path can represent the voice information only partially, or not at all, because of high noise interference, strong concealment requirements, or the health condition of the speaker. It is therefore difficult to encode the voice information on the transmitting side in these complex scenarios, and it becomes necessary to seek other information modalities released by the speaker when speaking or intending to speak, to assist or replace air-conducted audio. This type of processing of voice information based on non-air-conducted information modalities is usually called silent speech processing (or silent communication), and its derived applications include silent speech recognition, silent speech enhancement, and silent speech synthesis.

[0004] Commonly used information modalities in silent speech processing include lip reading, electroencephalography (EEG), surface electromyography (sEMG), electromagnetic articulography, permanent magnet articulography, and ultrasound imaging. Bone-conducted signals are also noteworthy. Although they rely on the vibrations of actual sound and are not strictly a silent modality, they are not traditional air-conducted audio and exhibit good robustness in high-noise environments, which makes them a commonly used information modality in silent speech processing.

[0005] Lip reading (lip imaging) and bone conduction signals are the two information modalities most closely related to practical applications in addressing such needs. Lip imaging refers to mapping speech information based on the characteristics of lip movements during phonation. Bone conduction signals share the same sound source as traditional air conduction audio, but travel through a different transmission path to the outside; typically, the original excitation signal passes through bones and tissues inside the body and is captured by sensors, thus naturally mapping speech information. In recent years, although novel mapping architectures based on deep learning have greatly empowered research in related fields, some inherent shortcomings still need to be addressed.

[0006] For lip images, speech information is represented solely through changes in lip shape. Firstly, the problem of identical lip shapes carrying different meanings (homophenes) is unavoidable, especially as the length of the target speech increases, which affects comprehensibility. The essence of this problem lies in the limitations of single-source features, and the main solutions include fusing multi-source features for cross-verification and introducing contextually derived features. Secondly, conditions such as lighting and angle greatly affect image quality, effective information content, and subsequent modeling, which impacts mapping accuracy. The essence of this problem lies in the inherent configuration of hardware and environment; from an algorithmic perspective, the main solution is again to fuse multi-source features as a supplement. Thirdly, speech synthesized solely from lip images lacks the speaker's tone, intonation, rhythm, and other acoustic features, which affects naturalness. The essence of this problem lies in the limitations of single-source features, and the main solution is to selectively fuse other modal information that carries the corresponding acoustic features.

[0007] While bone conduction signals are a direct representation of sound source excitation signals and exhibit good noise resistance, the transmission path, being essentially through human bones and tissues, is equivalent to passing through a low-pass filter. High-frequency components are severely attenuated, with only low-frequency components remaining relatively intact. Therefore, the resulting audio waveform sounds muffled, affecting naturalness and clarity, and leading to the loss of some consonant syllables, impacting intelligibility. Furthermore, in highly mobile applications, physical noise introduced by friction between the sensing device and human skin, strong winds, and chewing further affects perception quality. The essence of this problem lies in the lack of high-frequency components and the absence of real-time supplementation from multiple sources. The main solution involves fusing multi-source features to ensure consistency while supplementing with compatible high-frequency signal components.

[0008] In summary, the main problems can be categorized as follows:

[0009] 1. From a global perspective, there is a lack of mature voice transmission methods for complex scenarios such as high noise, strong concealment, and high mobility, mainly referring to voice synthesis technology on the transmitting side.

[0010] 2. Specifically, to achieve the speech synthesis technology at the speech information generation end, the core lies in considering the needs and limitations of the application scenario, processing each information modality, and building a model based on deep learning methods. Common single-modal modeling methods all have their own shortcomings as mentioned above.

Summary of the Invention

[0011] To address the aforementioned issues, this application proposes a speech synthesis method and system based on the fusion of bone conduction signals and lip images. This method enables modal collaborative representation of common content, compensates for the incomplete representation of independent single modalities, establishes a silent speech synthesis solution under a deep modal information fusion mechanism, and optimizes the speech synthesis effect in high noise interference or silent mode. This enhances the robustness of speech synthesis in complex scenarios and expands the feasibility of voice interaction.

[0012] The specific technical solution is as follows: A speech synthesis method based on the fusion of bone conduction signals and lip images, comprising:

[0013] Bone conduction signals and lip image signals are acquired synchronously during user voice input;

[0014] Based on the bone conduction signal and lip image signal, the unimodal data features in the time domain and spatial domain are determined;

[0015] Based on the determined single-modal data features in the time and spatial domains, a generative adversarial network incorporating a cross-modal attention mechanism and a Mel spectrogram fusion method are applied to obtain modal collaborative feature representations.

[0016] Based on the obtained modal collaborative feature representation, on the one hand, the back-end classification neural network model processes the data and outputs specific phrases and instructions to achieve speech recognition; on the other hand, a human voice synthesis model is applied to obtain audio waveforms to achieve speech synthesis.

[0017] Furthermore, the determination of the single-modal data features in the time and spatial domains based on the bone conduction signal and the lip image signal includes:

[0018] Process bone conduction signals to obtain Mel-BC spectrograms based on bone conduction signals;

[0019] Acquire a sequence of frame images of the lips, input the sequence of frame images into the front-end neural network model, and extract the lip image features F_v.

[0020] Furthermore, the modal collaborative feature representation obtained by applying a generative adversarial network incorporating a cross-modal attention mechanism and a Mel spectrogram fusion method based on the determined single-modal data features in the time and spatial domains includes:

[0021] Preliminary blind enhancement is performed on the bone conduction signal, and certain high-frequency components are restored using traditional signal processing techniques;

[0022] The bone conduction signal and lip image signal, which have undergone preliminary blind enhancement, are collaboratively represented based on a cross-modal attention mechanism and then input into a trained generative adversarial network. After multiple iterations, a Mel-Vba spectrogram based on modal fusion is generated.

[0023] The Mel-Vba spectrogram based on modal fusion is fused with the Mel-BC spectrogram based on the original bone conduction signal to obtain the final Mel-Ult spectrogram, which is the modal collaborative feature representation mentioned above.

[0024] Furthermore, the application of the human voice synthesis model to obtain audio waveforms and achieve speech synthesis includes:

[0025] Based on the fused final-state Mel-Ult spectrogram, it is transformed into a linear spectrogram through a post-processing network. The obtained linear spectrogram is then input into a vocoder to be converted into a speech waveform.

[0026] Furthermore, the blindly enhanced bone conduction signal and lip image signal are collaboratively represented based on a cross-modal attention mechanism and input into a trained generative adversarial network. After multiple iterations, a Mel-Vba spectrogram based on modal fusion is generated, including:

[0027] Step S321: The feature representation extracted from the lip image signal, denoted F_v, is used as the initial input I0;

[0028] Step S322: The initial input I0 and the feature representation F_b extracted from the bone conduction signal undergo one co-encoding pass based on the cross-modal attention mechanism (Visual-Bone Conducted Attention);

[0029] Step S323: The weighted feature F_a1 obtained from this co-encoding pass is concatenated with the original input I0 to form the fused feature F_c1, which is fed into the current-order (currently the first order) generator GE1 to obtain the generated feature F_m1; at the same time, the Mel spectrogram IM1 of the current order (currently the first order) is generated;

[0030] Step S324: The Mel spectrogram of the current order (currently the first order) is input into the classifier D1 of the current order (currently the first order). If the unconditional judgment result is true and the conditional judgment result contains a similarity K_h with a certain statement that exceeds a certain threshold, the Mel spectrogram generated in the current order is determined to be a usable Mel-Vba spectrogram (i.e., it can be used as the subsequent Mel-Vba spectrogram).

[0031] Step S325: If the judgment condition described in step S324 is not met, the generated feature representation F_m1 of the current order (currently the first order) is used as the initial input I0 in step S321, and steps S322, S323, S324, and S325 are iterated, with the subscripts of all involved F_a1, F_c1, F_m1, GE1, IM1, and D1 incremented by 1;

[0032] Step S326: The iteration ends when the judgment condition in S324 is satisfied, and the usable Mel spectrogram generated in the current order is output, thereby obtaining the Mel spectrogram Mel-Vba based on the co-encoding of the generative adversarial network and the cross-modal attention mechanism.
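The control flow of steps S321 to S326 can be sketched as a plain loop. This is only an illustration under stated assumptions: the function name `generate_mel_vba`, the callables `attend` and `concat`, the `stages` list of (generator, classifier) pairs, and the single threshold `k_threshold` are hypothetical stand-ins for trained models that the text does not specify, and returning `None` when all orders are exhausted is an assumption, since the text does not say what happens in that case.

```python
def generate_mel_vba(f_v, f_b, attend, concat, stages, k_threshold):
    """Iterate generator/classifier orders until a usable Mel spectrogram appears.

    stages is a list of (generator, classifier) pairs, one pair per order.
    generator(f_c) returns (f_m, mel); classifier(mel) returns (is_real, k_h).
    """
    current = f_v                                   # S321: I0 = F_v
    for order, (generator, classifier) in enumerate(stages, start=1):
        f_a = attend(current, f_b)                  # S322: cross-modal co-encoding
        f_c = concat(f_a, current)                  # S323: fuse with the current input
        f_m, mel = generator(f_c)                   # S323: generated feature and Mel spectrogram
        is_real, k_h = classifier(mel)              # S324: unconditional and conditional results
        if is_real and k_h > k_threshold:
            return mel, order                       # S326: usable spectrogram found
        current = f_m                               # S325: next order starts from F_m
    return None, len(stages)                        # assumption: no usable result
```

With toy numeric stand-ins for the models, the loop stops at the first order whose classifier accepts the generated spectrogram.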

[0033] Furthermore, the Mel-Vba spectrogram based on modal fusion is fused with the Mel-BC spectrogram based on the original bone conduction signal to obtain the final Mel-Ult spectrogram. The fusion of the Mel spectrogram is divided into two cases: silent scene and high noise scene.

[0034] In silent scenarios, the Mel-BC spectrogram based on bone conduction signals is ignored, and Mel-Vba is directly used as the final-state Mel-Ult spectrogram across the entire frequency band, i.e.:

[0035] Mel-Ult = 1 * Mel-Vba + 0 * Mel-BC

[0036] In high-noise scenarios, a time-frequency partitioning image fusion method is adopted for Mel speech spectrograms. That is, the Mel speech spectrograms are divided into local regions from two dimensions: time domain and frequency domain. The optimal representation is selected in the corresponding local regions of Mel-BC and Mel-Vba. The optimal representations are combined to obtain the final state Mel speech spectrogram Mel-Ult.

[0037] Furthermore, the Mel-Ult calculation in the high-noise scenario is as follows:

[0038] Let the amplitude at a certain moment and at a certain frequency be denoted as:

[0039] A(t,h)=P

[0040] Partitioning and merging based on frequency dimension: First, given a cutoff frequency z, the Mel-BC and Mel-Vba spectra are divided according to the cutoff frequency z. The area above the cutoff frequency z is divided into high-frequency partition H0, and the area at and below the cutoff frequency is divided into mid-low frequency partition L.

[0041] Remove the portion of Mel-BC in the high-frequency partition H0 and fill it with the high-frequency partition H0 of Mel-Vba. That is, the distribution of the high-frequency and mid-low-frequency partitions of Mel-Ult can be represented as the result of Mel-Vba and Mel-BC being spliced together after being trimmed according to the frequency threshold.

[0042] Therefore, the high-frequency partitions of the final-state Mel speech spectrogram in a high-noise scene can be determined as follows:

[0043] Mel-Ult(H0)=1*Mel-Vba(H0)+0*Mel-BC(H0)
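The frequency-dimension splice can be illustrated with a short NumPy sketch. It assumes both spectrograms are arrays of shape (bands, frames) on the same grid; the function name `splice_by_cutoff` and the `band_freqs_hz` array (one center frequency per band) are hypothetical, and the mid-low rows are simply copied from Mel-BC at this stage, before any time-domain fusion.

```python
import numpy as np

def splice_by_cutoff(mel_bc, mel_vba, band_freqs_hz, cutoff_hz):
    """Take rows above the cutoff from Mel-Vba and the remaining rows from Mel-BC."""
    mel_bc = np.asarray(mel_bc, dtype=float)
    mel_vba = np.asarray(mel_vba, dtype=float)
    # H0 is strictly above the cutoff; the cutoff itself belongs to partition L.
    high = np.asarray(band_freqs_hz, dtype=float) > cutoff_hz
    fused = mel_bc.copy()
    fused[high, :] = mel_vba[high, :]   # Mel-Ult(H0) = 1 * Mel-Vba(H0) + 0 * Mel-BC(H0)
    return fused, high
```

The returned boolean mask can be reused to select the mid-low partition L (its complement) for the time-domain steps.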

[0044] From the time domain dimension, the bone conduction signal is partitioned and fused: In the mid and low frequency band, the bone conduction signal has a relatively accurate time-frequency distribution, but the amplitude is slightly attenuated and consonant syllables are lost. Therefore, the fusion method is divided into two cases: amplitude attenuation and consonant syllable loss.

[0045] To address the amplitude attenuation issue and reasonably enhance the amplitude in the corresponding region, the amplitude distribution of Mel-BC(L) in the low-frequency region of Mel-BC is first calculated:

[0046] In Mel-BC(L), if at a certain moment, the amplitude at a certain frequency is greater than a certain threshold x, but less than a certain threshold y, then this point is determined to be the accurate time-frequency distribution point of the audio information, and its amplitude is attenuated to a certain extent, and is enhanced by the amplitude of the same time-frequency point in Mel-Vba(L); otherwise, no enhancement is performed, and thus the amplitude distribution of the mid-low frequency region of the final state Mel spectrogram is determined as follows:

[0047] A(t,h)|Mel-Ult(L) = 1 * A(t,h)|Mel-Vba(L), if x < A(t,h)|Mel-BC < y

[0048] A(t,h)|Mel-Ult(L) = 1 * A(t,h)|Mel-BC(L), otherwise
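The two-case rule above maps directly onto an element-wise selection. A minimal sketch, assuming the mid-low partitions of both spectrograms are arrays of identical shape; the function name `enhance_attenuated` is hypothetical, and the thresholds x and y are left as parameters because the text does not give values.

```python
import numpy as np

def enhance_attenuated(bc_low, vba_low, x, y):
    """Replace attenuated Mel-BC(L) points with the same points of Mel-Vba(L)."""
    bc_low = np.asarray(bc_low, dtype=float)
    vba_low = np.asarray(vba_low, dtype=float)
    # A point counts as attenuated when its Mel-BC amplitude lies strictly between x and y.
    attenuated = (bc_low > x) & (bc_low < y)
    return np.where(attenuated, vba_low, bc_low)
```

Points at or below x are treated as carrying no speech energy and points at or above y as not attenuated, so both keep their Mel-BC value.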

[0049] To address the issue of missing consonant syllables, and in order to restore the amplitude of the corresponding region, it is necessary to further divide the mid-to-low frequency region into n smaller regions of equal duration based on time resolution.

[0050]

[0051]

[0052] When examining any small region, if the small regions before and after it all have a certain amplitude distribution, then this region should also have a certain amplitude distribution; if the region has no amplitude distribution, then it is considered that there is a consonant syllable missing in this region, and it needs to be supplemented with the corresponding region of Mel-Vba(L).

[0053] The specific process is as follows: First, the amplitude is integrated over each small region of Mel-BC(L) to obtain the sum of amplitudes A(L_i) within each small region Mel-BC(L_i). The integration limits are T_i0 to T_ie in the time dimension and 0 to 2 kHz in the frequency dimension;

[0054]

[0055] When A(L_{i-1}) and A(L_{i+1}) are both greater than a certain threshold p (for the first region, only A(L_{i+1}) is considered; for the last region, only A(L_{i-1}) is considered), while A(L_i) is less than a certain threshold q, the corresponding small region is filled using Mel-Vba(L):

[0056] Mel-Ult(L_i) = 1 * Mel-Vba(L_i), if A(L_i) < q && A(L_{i-1}) > p && A(L_{i+1}) > p

[0057] Mel-Ult(L_i) = 1 * Mel-BC(L_i), otherwise
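The neighbor test for missing consonant regions can be sketched as follows. Several things here are assumptions: the integral is approximated by a discrete sum over spectrogram cells, amplitudes are taken to be non-negative linear values (not log-scaled), the n regions are cut at near-equal frame counts, and the function name `fill_missing_segments` is hypothetical.

```python
import numpy as np

def fill_missing_segments(bc_low, vba_low, n, p, q):
    """Fill low-energy Mel-BC(L) regions whose neighbors carry energy with Mel-Vba(L)."""
    bc_low = np.asarray(bc_low, dtype=float)
    vba_low = np.asarray(vba_low, dtype=float)
    n_frames = bc_low.shape[1]
    # Frame boundaries of the n regions L_1 ... L_n along the time axis.
    bounds = np.linspace(0, n_frames, n + 1).astype(int)
    # Discrete stand-in for A(L_i): amplitude sum over each region.
    sums = [bc_low[:, bounds[i]:bounds[i + 1]].sum() for i in range(n)]
    fused = bc_low.copy()
    for i in range(n):
        prev_ok = i == 0 or sums[i - 1] > p        # first region: only the next neighbor counts
        next_ok = i == n - 1 or sums[i + 1] > p    # last region: only the previous neighbor counts
        if sums[i] < q and prev_ok and next_ok:
            start, end = bounds[i], bounds[i + 1]
            fused[:, start:end] = vba_low[:, start:end]
    return fused
```

Only the region sums of the original Mel-BC(L) are consulted, so filling one region does not change the decision for its neighbors.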

[0058] Therefore, the low-to-mid frequency partitions of the final state Mel speech spectrogram in high-noise scenarios can be determined as follows:

[0059]

[0060] Therefore, the final state Mel speech spectrogram under high noise conditions can be determined as follows:

[0061] Mel-Ult=Mel-Ult(H0)+Mel-Ult(L).

[0062] The present invention also provides a speech synthesis system based on the fusion of bone conduction signals and lip images, characterized in that it includes: a data acquisition module, a feature extraction module, an encoding module, and a speech recognition and synthesis module;

[0063] The data acquisition module collects bone conduction signals and lip image signals synchronously acquired during user voice input and sends them to the feature extraction module;

[0064] The feature extraction module preprocesses and extracts features from the received bone conduction signal and lip image signal data respectively, determines the single-modal data features in the time domain and spatial domain, and sends them to the encoding module.

[0065] The encoding module obtains modal collaborative feature representation based on the received single-modal data features in the time and spatial domains, and sends it to the speech recognition and synthesis module;

[0066] The speech recognition and synthesis module, based on the received modal collaborative feature representation, on the one hand, applies a backend classification neural network model for processing, outputting specific phrases and instructions to achieve speech recognition; on the other hand, it uses a human voice synthesis model to obtain speech waveforms to achieve speech synthesis.

[0067] Furthermore, the speech synthesis system also includes an interaction module, which evaluates the quality of the speech results synthesized by the speech recognition and synthesis modules and transmits them subsequently through existing communication channels.

[0068] Furthermore, the quality evaluation includes:

[0069] Objective evaluation metrics are used to calculate the intelligibility and speech quality score of the generated speech waveform. Speech scores below a certain threshold are considered unusable.

[0070] Subjective evaluation metrics are used to provide feedback on the voice results to the user and obtain instructions from the user regarding whether to use the audio. Selection buttons can be set in the application scenario, allowing users to choose whether to use the audio.

[0071] Compared with the prior art, the present invention has the following beneficial effects:

[0072] The modal fusion algorithm based on bone conduction signals and lip image information provided by this invention establishes cross-modal collaborative feature representation through deep modal fusion. This can compensate for the one-sidedness of single-modal representation of speech information and reduce instability under different external perturbations, while retaining the advantages of each single modality, such as the interpretability of lip image signals in silent mode, the excellent noise resistance of bone conduction signals, and the preservation of the speaker's timbre characteristics. Through the synergy and iteration of generative adversarial networks and cross-modal attention mechanisms, effective intra-modal and inter-modal representations can be deeply mined under the constraints of a reliable local closed-loop control chain. High-resolution Mel spectrograms are gradually generated, and further fusion at the time-frequency dual-dimensional image level is performed to obtain the final collaborative coding form. Then, a pre-trained back-end classification neural network model can be applied to identify specific short sentences, instructions, and logical characters. Furthermore, it can achieve efficient, natural, accurate, and robust speech synthesis in silent mode or under high noise interference.

[0073] The architecture design of the speech synthesis system for complex scenarios, which integrates bone conduction and lip reading, can support the aforementioned algorithm model to achieve speech synthesis. By preserving the low-frequency components of the bone conduction signal path and integrating the features contained in the bone conduction signal into the generative adversarial network path through a cross-modal attention mechanism, the final collaborative feature representation fully utilizes the speaker's vocal characteristics, intonation, rhythm, timbre, etc., contained in the bone conduction signal. This results in synthesized speech that sounds less mechanical, is more natural, and has high fidelity to the speaker's real voice. Furthermore, the introduction of quality evaluation can further ensure the effectiveness of the output audio and improve the human-computer interaction effect.

Attached Figure Description

[0074] Figure 1 is a schematic diagram of the region division in the Mel spectrogram;

[0075] Figure 2 is a schematic diagram of the secondary partitioning of the low-frequency region in the Mel spectrogram;

[0076] Figure 3 is a flowchart of the speech synthesis method based on the fusion of bone conduction signals and lip images;

[0077] Figure 4 is a block diagram of the speech synthesis system based on the fusion of bone conduction signals and lip images.

Detailed Implementation

[0078] To more clearly illustrate the technical solutions of the embodiments of the present invention, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the following description only relates to some embodiments of the present invention and is not intended to limit the present invention.

[0079] A speech synthesis method based on the fusion of bone conduction signals and lip images, comprising:

[0080] Step S1: Collect bone conduction signals and lip image signals synchronously acquired during user voice input;

[0081] Step S2: Determine the single-modal data features in the time and spatial domains based on the bone conduction signal and lip image signal;

[0082] Step S3: Based on the determined bi-source monomodal data features in the time and space domains, a speech model is established by applying a generative adversarial network incorporating cross-modal attention mechanisms and a Mel spectrogram fusion method to obtain modal collaborative feature representations.

[0083] Step S4: Based on the obtained modal collaborative feature expression, on the one hand, the back-end classification neural network model can identify it as a specific phrase and instruction output; on the other hand, the human voice synthesis model is applied to obtain the audio waveform.

[0084] The following is a detailed explanation of each step S2, S3, and S4 above:

[0085] Step S2 specifically includes:

[0086] Step S21: After acquiring the bone conduction signal, process it (feature extraction) to obtain the Mel-BC spectrogram based on the bone conduction signal;

[0087] The bone conduction signal is processed using existing techniques, namely, pre-emphasis, windowing, framing, short-time Fourier transform (STFT), power spectrum acquisition, spectral subtraction for noise reduction, Mel filter bank, Mel-spectrum acquisition, and Mel frequency cepstrum coefficient (MFCC) extraction. The obtained Mel-BC will serve as one of the input pathways in the spectrogram-based synthesis described later.
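The front half of this chain (pre-emphasis, framing, windowing, STFT, power spectrum, Mel filter bank) can be sketched in NumPy. This is a minimal illustration rather than the patent's implementation: the function names, the 16 kHz sample rate, the frame length, hop size, number of Mel bands, pre-emphasis coefficient, and the common 2595 * log10(1 + f/700) Mel formula are all assumptions, and spectral subtraction and MFCC extraction are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40, pre_emph=0.97):
    """Log-Mel spectrogram, shape (n_mels, n_frames); needs len(signal) >= n_fft."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis: first-order high-pass filter.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing and Hann windowing.
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hanning(n_fft)
    # STFT and power spectrum per frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Mel filter bank, then log compression.
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T
    return np.log(mel + 1e-10)
```

For example, `mel_spectrogram(x)` on one second of 16 kHz audio with these defaults yields an array with 40 Mel bands along the first axis and one column per 10 ms hop.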

[0088] Step S22: Obtain a sequence of lip images, perform motion detection, input the lip image data stream sequence into the front-end neural network model, and extract the lip image features F_v. The front-end neural network model used here can be a convolutional neural network such as VGG19 or ResNet-50.

[0089] This yielded two aspects of single-modal data characteristics.

[0090] Step S3 specifically includes:

[0091] Step S31: Perform blind enhancement on the bone conduction signal and use traditional signal processing techniques to recover certain high-frequency components; specifically:

[0092] First, blind enhancement based on spectral envelope transformation is performed on the bone conduction signal to perform preliminary coarse-grained augmentation (i.e., preliminary expansion) of its high-frequency components. Blind enhancement relies solely on the remaining mid- and low-frequency signals to recover the high-frequency components. Blind enhancement based on spectral envelope transformation utilizes a pre-trained model to map the spectral envelope features of the bone conduction signal to the spectral envelope features of the air conduction signal, and then convolves this with the original excitation signal to obtain a complete signal with expanded estimated high-frequency components. Based on this signal, the method for extracting bone conduction signal data features described above is repeated to obtain the corresponding Mel-frequency cepstral coefficients, which serve as the feature representation F_b of the bone conduction signal in this step;

[0093] Step S32: The blindly enhanced bone conduction signal and lip image signal are collaboratively represented based on a cross-modal attention mechanism and input into a trained generative adversarial network. After multiple iterations, a Mel-Vba spectrogram based on modal fusion is generated; specifically:

[0094] First, based on Generative Adversarial Networks (GANs), a speech model is trained using commonly used Chinese short phrases, text commands, and logical commands to construct a mapping relationship between short phrases / commands and their corresponding Mel spectrograms. The training of the GANs can be viewed as training the generator and the discriminator separately. The generator generates simulated images based on features to be processed. In this design, the features to be processed specifically refer to the collaborative features based on lip images and bone conduction signals, generating realistic and reliable Mel spectrograms for subsequent fusion with the Mel spectrograms from the bone conduction signal pathway to synthesize speech. The discriminator, on the other hand, provides a consistency evaluation between the simulated images output by the generator and the prior real images. For example, a Mel speech spectrogram generated based on collaborative features can be classified into two types of results by a classifier: one is an unconditional result, i.e. whether the Mel speech spectrogram is real or not; the other is a conditional result, such as the similarity to the prior real Mel speech spectrogram of statement A is K1, the similarity to the prior real Mel speech spectrogram of statement B is K2, and the similarity to the prior real Mel speech spectrogram of statement C is K3, etc.

[0095] Meanwhile, to obtain highly realistic and accurate generated Mel speech spectrograms, this scheme requires that the features input into the generative adversarial network (i.e., the generator's input) contain as much effective information as possible from both modalities. Therefore, this embodiment introduces a cross-modal attention mechanism to collaboratively represent the input features. Effective information is selected based on weights between bone conduction signal features and lip image signal features, while secondary information is discarded. Modality refers to the form in which data exists; for example, different file formats such as text, audio, images, and video are different modalities. Cross-modal tasks can effectively integrate and process information from two modalities by studying the correlation and relationship between data from different modalities.

[0096] The collaborative representation method applying cross-modal attention mechanism is as follows:

[0097] The mechanism uses a query vector, a key vector, and a value vector. The original features of the bone conduction signal are processed by a flattening operator that combines the spectral and channel dimensions of the speech representation, and then multiplied by the query vector to obtain the query. The original features of the lip image are multiplied by the key vector and the value vector to obtain the key and the value, respectively. The weights are then calculated using the following formula:

[0098]

[0099]

[0100]

[0101] Then the obtained F is concatenated (Concat) with the original lip image feature F_v to obtain the desired feature representation.
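The collaborative representation described above can be illustrated with a minimal NumPy sketch. Because the weight formulas themselves are not reproduced in this text, the sketch assumes standard scaled dot-product attention; the projection matrices `w_q`, `w_k`, `w_v`, the feature shapes, and the assumption that both modalities share the same number of time steps are all hypothetical choices made for illustration, not the patented formulas.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(f_b, f_v, w_q, w_k, w_v):
    """Attend from bone conduction features (queries) to lip features (keys/values).

    f_b: (T, C, S) bone conduction features (channel and spectral dims, assumed layout)
    f_v: (T, Dv)   lip image features, assumed time-aligned with f_b
    """
    t = f_b.shape[0]
    f_b_flat = f_b.reshape(t, -1)            # flatten spectral and channel dimensions
    q = f_b_flat @ w_q                       # queries from the bone conduction path
    k = f_v @ w_k                            # keys from the lip image path
    v = f_v @ w_v                            # values from the lip image path
    # Assumed weight formula: scaled dot-product attention.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    attended = weights @ v                   # weighted lip information
    fused = np.concatenate([attended, f_v], axis=-1)  # Concat with original F_v
    return fused, weights

rng = np.random.default_rng(0)
T, C, S, Dv, d = 5, 2, 8, 6, 4
f_b = rng.normal(size=(T, C, S))
f_v = rng.normal(size=(T, Dv))
w_q = rng.normal(size=(C * S, d))
w_k = rng.normal(size=(Dv, d))
w_v = rng.normal(size=(Dv, d))
fused, weights = cross_modal_attention(f_b, f_v, w_q, w_k, w_v)
```

Each row of `weights` sums to 1, so every bone conduction time step receives a convex combination of lip feature values before concatenation.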

[0102] On the other hand, it cannot be assumed that a Mel spectrogram obtained through a single generator will perform satisfactorily in both unconditional and conditional results; its realism and accuracy can always be improved. Therefore, this embodiment introduces an iterative mechanism, setting up generators and classifiers of multiple orders. The evaluation results of the classifier guide the iteration process, and a cross-modal attention mechanism continuously updates the input features of each order of generator, gradually improving feature purity. This enhances the realism of the generated Mel spectrogram and ensures accurate mapping, making the subsequent conversion of the Mel spectrogram into a speech waveform more feasible.

[0103] The following section will elaborate on the generative adversarial network model Vba-GAN (Visual-Bone Conducted Attentional GAN) based on the bone-guided lip-reading attention mechanism designed in this invention:

[0104] Step S321: The feature representation extracted from the lip image signal, denoted F_v, is taken as the initial input I0;

[0105] Step S322: The initial input I0 and the feature representation F_b extracted from the bone conduction signal undergo one round of co-encoding based on the cross-modal attention mechanism (Visual-Bone Conducted Attention);

[0106] Step S323: The weighted feature F_a1 obtained after one round of cross-modal attention co-encoding is concatenated with the original input I0 to form the fused feature F_c1, which is fed into the current-order generator GE1 to obtain the generated feature F_m1 and, at the same time, the Mel spectrogram IM1 of the current order;

[0107] Step S324: The Mel spectrogram of the current order is input into the current-order classifier D1. If the unconditional result is true, and in the conditional result the similarity K_h to a certain statement is above a certain threshold (set to 90% in this embodiment) while the highest similarity K_s to the remaining statements is below a certain threshold (set to 3% in this embodiment), the Mel spectrogram generated at the current order is considered a usable Mel spectrogram.

[0108] Step S325: If the judgment condition described in step S324 is not met, the generated current-order feature F_m1 is taken as the initial input I0 of step S321, and steps S322, S323, S324, and S325 are iterated, with the subscripts of all involved F_a1, F_c1, F_m1, GE1, IM1, and D1 incremented by 1;

[0109] Step S326: The iteration ends when the judgment condition in S324 is satisfied, and the usable Mel spectrogram generated at the current order is output, thereby obtaining the Mel spectrogram Mel-Vba based on the co-encoding of the generative adversarial network and the cross-modal attention mechanism.
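The control flow of steps S321 to S326 can be sketched as follows. This is only an outline of the iteration logic: the `attend`, `generate`, and `classify` callables are hypothetical toy stand-ins operating on plain numbers rather than trained networks, and the tuple used for the fused feature stands in for real concatenation. Only the 90% and 3% thresholds are taken from the embodiment.

```python
def is_usable(real, similarities, k_h_min=0.90, k_s_max=0.03):
    """S324 acceptance test: real, best match above k_h_min, runner-up below k_s_max."""
    if not real or not similarities:
        return False
    ranked = sorted(similarities.values(), reverse=True)
    k_h = ranked[0]                                  # similarity to the best statement
    k_s = ranked[1] if len(ranked) > 1 else 0.0      # highest similarity among the rest
    return k_h > k_h_min and k_s < k_s_max

def iterate_orders(f_v, f_b, attend, generators, classifiers):
    """Run the multi-order generator/classifier loop; return (mel, order) or (None, n)."""
    current = f_v                                    # S321: initial input I0
    for order, (generate, classify) in enumerate(zip(generators, classifiers), start=1):
        f_a = attend(current, f_b)                   # S322: cross-modal co-encoding
        f_c = (f_a, current)                         # S323: stand-in for concatenation
        f_m, mel = generate(f_c)                     # S323: generated feature and Mel image
        real, sims = classify(mel)                   # S324: unconditional + conditional
        if is_usable(real, sims):
            return mel, order                        # S326: usable spectrogram found
        current = f_m                                # S325: feed back as the next I0
    return None, len(generators)                     # no order met the condition

# Toy stand-ins (hypothetical): features are plain floats.
attend = lambda i0, f_b: i0 + f_b
generate = lambda f_c: (f_c[0], f_c[0])
classify = lambda mel: (True, {"A": min(mel / 10.0, 1.0), "B": 0.01})

mel, order = iterate_orders(1.0, 4.0, attend, [generate] * 5, [classify] * 5)
```

Reusing one generator and one classifier for every order, as in the toy call, corresponds to the first of the two configuration approaches; passing lists of distinct per-order models would correspond to the second.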

[0110] It should be noted that, because an iterative mechanism is required, generators and classifiers of multiple orders must be configured. There are two approaches: one is to use the same generator and classifier for each order; the other is to train the generators and classifiers of each order independently during the pre-training phase, based on different fine-grained levels. Although the second approach increases computational cost, it can further improve the accuracy of generation and classification at each order. When designing the model, the specific approach should be chosen based on actual needs.

[0111] The above steps yielded the Mel-BC spectrogram based on bone conduction signals and the Mel-Vba spectrogram based on co-coding using generative adversarial networks and cross-modal attention mechanisms. Since the Mel spectrograms of the two pathways each have their own advantages in representing speech information, step S33 is needed to further fuse the Mel spectrograms of the two pathways to obtain the final Mel-Ult spectrogram input to the human voice synthesis model, resulting in a comprehensive and realistic description of the original speech.

[0112] Step S33: Fuse Mel-Vba, the Mel speech spectrogram based on modal fusion, with Mel-BC, the Mel speech spectrogram based on the primary bone conduction signal obtained in step S21, to obtain the final Mel speech spectrogram Mel-Ult.

[0113] The fusion of Mel spectrograms is divided into two cases: silent scenarios and high-noise scenarios, which are described in detail below;

[0114] In silent scenarios, the situation is relatively simple. The person does not generate significant auditory stimulation, and the signal richness perceived by the bone conduction sensor is low. Therefore, the Mel-BC spectrogram based on bone conduction signals is ignored, and Mel-Vba is directly used as the final-state Mel-Ult spectrogram across the entire frequency range.

[0115] Mel-Ult = 1 * Mel-Vba + 0 * Mel-BC

[0116] In high-noise scenarios, where the speaker does generate significant auditory stimulation but the air-guided speech is rendered ineffective by noise interference, this embodiment employs a time-frequency partitioning image fusion method for the Mel speech spectrogram. The core concept is to divide the Mel speech spectrogram into local regions based on its essential characteristics, considering both time and frequency domains. This ensures that each local region of the final-state Mel-Ult spectrogram retains the optimal representation from the corresponding local regions of Mel-BC and Mel-Vba. See Figure 1 for a schematic diagram of the region partitioning of the Mel speech spectrogram.

[0117] In a Mel spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the pixel value represents the intensity (amplitude) of the corresponding frequency component at a specific moment. For ease of calculation, the spectrogram is converted into a grayscale image, so that its amplitude can be characterized by a one-dimensional linearized value P ranging from 0 (black) to 255 (white). Since white has the highest energy and black the lowest, numerical values correlate directly with energy levels, and the amplitude at a specific moment t and frequency h is denoted as:

[0118] A(t,h)=P
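As an illustration of the grayscale representation, the sketch below maps a spectrogram to integer levels P in [0, 255]. The linear min-max rescaling and the function name `to_gray_levels` are assumptions made for this example; the text only states that the amplitude is linearized to the range 0 to 255.

```python
import numpy as np

def to_gray_levels(mel):
    """Linearly rescale a spectrogram to integer gray levels P in [0, 255]."""
    lo, hi = float(mel.min()), float(mel.max())
    if hi == lo:                       # flat input: no contrast to preserve
        return np.zeros(mel.shape, dtype=np.uint8)
    scaled = (mel - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# Rows are frequency bins, columns are time frames (assumed layout).
mel_db = np.array([[-80.0, -40.0],
                   [-20.0,   0.0]])
A = to_gray_levels(mel_db)             # A[h, t] plays the role of A(t,h) = P
```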

[0119] First, the spectrum is partitioned and fused based on frequency to determine the high-frequency band of the final Mel speech spectrogram. Given a cutoff frequency z, the Mel speech spectrogram is divided into a high-frequency band H0 and a mid-to-low-frequency band L. This is because bone conduction signals exhibit severe attenuation and disappearance of high-frequency components, typically considered to be the complete disappearance of components above the cutoff frequency. Therefore, the high-frequency bands above the cutoff frequency need to be filled using a Mel speech spectrogram, Mel-Vba, based on generative adversarial networks and cross-modal attention mechanisms. The cutoff frequency is strongly correlated with the human body medium through which the bone conduction signal travels; therefore, once the placement of the bone conduction sensor is determined, the cutoff frequency can also be determined accordingly. In this embodiment, the bone conduction sensor is placed on the cheek, with a cutoff frequency of approximately 3 kHz, i.e., z = 3.

[0120] Mel(H0) = Mel(Clip | FRQ > 3 kHz)

[0121] Mel(L) = Mel(Clip | 0 < FRQ <= 3 kHz)

[0122] The portion of Mel-BC in high-frequency partition H0 is removed and filled with the high-frequency partition H0 of the Mel-Vba spectrogram; thus, the high-frequency partitions of the final state Mel spectrogram can be determined as follows:

[0123] Mel-Ult(H0)=1*Mel-Vba(H0)+0*Mel-BC(H0)
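The frequency-based partitioning above can be sketched with a boolean mask over the frequency bins. The array layout (rows as frequency bins, columns as time frames), the `freqs_hz` vector of bin center frequencies, and the function name are assumptions for illustration; the 3 kHz cutoff is the value given for the cheek-mounted sensor in this embodiment.

```python
import numpy as np

def fuse_high_band(mel_bc, mel_vba, freqs_hz, cutoff_hz=3000.0):
    """Replace the band above the cutoff in Mel-BC with the same band of Mel-Vba."""
    high = freqs_hz > cutoff_hz        # partition H0; bins at or below the cutoff form L
    fused = mel_bc.copy()
    fused[high, :] = mel_vba[high, :]  # Mel-Ult(H0) = 1*Mel-Vba(H0) + 0*Mel-BC(H0)
    return fused, high

freqs_hz = np.array([500.0, 2000.0, 3000.0, 4500.0])   # hypothetical bin centers
mel_bc = np.full((4, 3), 10.0)
mel_vba = np.full((4, 3), 99.0)
fused, high = fuse_high_band(mel_bc, mel_vba, freqs_hz)
```

The low band of `fused` is left as Mel-BC at this stage; it is refined separately by the time-domain steps that follow.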

[0124] Second, partitioning and fusion are performed from a time-domain perspective to determine the mid-to-low frequency band of the final-state Mel speech spectrogram. In the mid-to-low frequency band, bone conduction signals have a relatively accurate time-frequency distribution, but the amplitude is slightly attenuated. In addition, because the original sound excitation does not pass through the oral cavity, the space between the lips, or the nasal cavity, consonants that rely on friction and plosion in these areas are lost. Therefore, it is necessary to reasonably enhance the amplitude of the corresponding areas based on the time-frequency distribution of the bone conduction signals in the mid-to-low frequency band and, in particular, to restore the amplitude of the areas corresponding to consonant syllables.

[0125] To address amplitude attenuation, it is necessary to appropriately enhance the amplitude in the corresponding region. First, the amplitude distribution in the low-frequency region of Mel-BC is calculated, assuming that:

[0126] In Mel-BC(L), if the amplitude at a certain time and frequency is greater than a certain threshold x, but less than a certain threshold y (i.e., within a certain interval), then it is determined that this is an accurate time-frequency distribution point of the audio information, but its amplitude has a certain attenuation. Therefore, the amplitude at the same time-frequency point in Mel-Vba(L) is used to reasonably enhance it; otherwise, no enhancement is performed. In this embodiment, x = 30, y = 80, thus the amplitude distribution of the mid-low frequency region of the final Mel spectrogram can be determined as follows:

[0127] A(t,h)|Mel-Ult(L) = 1*A(t,h)|Mel-Vba(L), if 30 < A(t,h)|Mel-BC(L) < 80

[0128] A(t,h)|Mel-Ult(L) = 1*A(t,h)|Mel-BC(L), otherwise
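The amplitude-enhancement rule can be written as an element-wise selection. In this sketch the thresholds x = 30 and y = 80 come from the embodiment, while the function name and the sample arrays are hypothetical.

```python
import numpy as np

def enhance_attenuated(mel_bc_low, mel_vba_low, x=30, y=80):
    """Use Mel-Vba(L) wherever the Mel-BC(L) gray level lies strictly between x and y."""
    mask = (mel_bc_low > x) & (mel_bc_low < y)   # accurate but attenuated points
    return np.where(mask, mel_vba_low, mel_bc_low), mask

mel_bc_low = np.array([[10, 30, 50],
                       [79, 80, 200]])
mel_vba_low = np.full((2, 3), 120)
mel_ult_low, mask = enhance_attenuated(mel_bc_low, mel_vba_low)
```

Points at or outside the interval bounds keep their Mel-BC value, matching the strict inequalities in the formulas above.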

[0129] Based on this, to address the issue of consonant syllable loss, it is necessary to further divide the mid-to-low frequency region into n smaller regions of equal duration according to time resolution, as shown in Figure 2.

[0130]

[0131]

[0132] Since each phonation is a short instruction, the Mel-BC(L) spectrogram of interest should have a continuous amplitude distribution in the time domain. Therefore, when examining any small region, if its preceding and following neighboring regions both have a certain amplitude distribution, that region should also have a certain amplitude distribution. If the region has no amplitude distribution, it is determined that a consonant syllable is missing, and it needs to be supplemented using the corresponding region of Mel-Vba(L). The specific process is as follows: First, the amplitude of each small region of Mel-BC(L) is integrated to calculate the sum of amplitudes A(L_i) within the small region Mel-BC(L_i). Its limits in the time dimension are T_i0 to T_ie, and its limits in the frequency dimension are 0 to the cutoff frequency z, i.e., 0 to 3 kHz:

[0133] A(L_i) = \int_{T_{i0}}^{T_{ie}} \int_{0}^{z} A(t,h) \, dh \, dt

[0134] When A(L_{i-1}) and A(L_{i+1}) are both greater than a certain threshold p (for the first region, only A(L_{i+1}) is considered; for the tail region, only A(L_{i-1}) is considered), while A(L_i) is less than a certain threshold q, the corresponding small region is filled using Mel-Vba(L):

[0135] Mel-Ult(L_i) = 1*Mel-Vba(L_i), if A(L_i) < q && A(L_{i-1}) > p && A(L_{i+1}) > p

[0136] Mel-Ult(L_i) = 1*Mel-BC(L_i), otherwise
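The consonant-restoration rule can be sketched as follows, with a discrete sum over gray levels standing in for the integral. The threshold values for p and q are not given in the text, so the values used in the example call, the function name, and the array layout (rows as frequency bins, columns as time frames) are hypothetical.

```python
import numpy as np

def fill_missing_regions(mel_bc_low, mel_vba_low, n, p, q):
    """Fill silent time regions of Mel-BC(L) that sit between energetic neighbors."""
    n_frames = mel_bc_low.shape[1]
    edges = np.linspace(0, n_frames, n + 1).astype(int)   # n regions of (near) equal duration
    sums = [mel_bc_low[:, edges[i]:edges[i + 1]].sum() for i in range(n)]  # A(L_i)
    fused = mel_bc_low.copy()
    filled = []
    for i in range(n):
        prev_ok = True if i == 0 else sums[i - 1] > p      # first region: only the next one
        next_ok = True if i == n - 1 else sums[i + 1] > p  # tail region: only the previous one
        if sums[i] < q and prev_ok and next_ok:
            fused[:, edges[i]:edges[i + 1]] = mel_vba_low[:, edges[i]:edges[i + 1]]
            filled.append(i)
    return fused, filled

# Two frequency bins, six frames, three regions; the middle region is empty.
mel_bc_low = np.array([[50, 50, 0, 0, 50, 50],
                       [50, 50, 0, 0, 50, 50]])
mel_vba_low = np.full((2, 6), 40)
mel_ult_low, filled = fill_missing_regions(mel_bc_low, mel_vba_low, n=3, p=100, q=10)
```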

[0137] Therefore, the low-to-mid frequency partitions of the final state Mel speech spectrogram in high-noise scenarios can be determined as follows:

[0138]

[0139] Therefore, the final state Mel speech spectrogram under high noise conditions can be determined as follows:

[0140] Mel-Ult=Mel-Ult(H0)+Mel-Ult(L).

[0141] This yields the final-state Mel-Ult spectrogram based on Mel spectrogram fusion for both silent and high-noise scenarios.

[0142] The above describes the time-frequency partitioned image fusion method for Mel speech spectrograms provided by this invention. The overall process is shown in Figure 3: the Mel speech spectrogram is divided into regions in both the time and frequency domains, and, based on the time-frequency distribution characteristics of the modalities, image fusion is performed on each region separately to obtain a richer and more comprehensive encoding of the original modal information in the Mel speech spectrogram presentation format. In this invention, a single-modal Mel speech spectrogram based on bone conduction signals and a Mel speech spectrogram based on the fusion of two modalities (bone conduction signals and lip image signals) are spliced and fused at the image level. In other application examples, the above method can also be extended to process Mel speech spectrograms from other information modalities.

[0143] Step S4 mainly includes:

[0144] Based on the obtained modal collaborative feature representation, namely Mel-Ult, a back-end classification neural network model is applied to perform recognition according to the pre-training targets. In this embodiment, the pre-training targets are specific short sentences, text instructions, logical instructions, etc.

[0145] As shown in Figure 3, the final Mel-Ult spectrogram after image fusion is transformed into a linear spectrogram through a post-processing network. The post-processing network employs the mature 1-D Convolution Bank + Highway Network + Bidirectional GRU architecture.

[0146] The obtained linear spectrogram is input into a mature vocoder based on the Griffin-Lim algorithm and converted into a speech waveform for speech synthesis.
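For readers unfamiliar with the Griffin-Lim algorithm, the following NumPy-only sketch shows its core idea: starting from a random phase, it alternates between an inverse and a forward short-time Fourier transform while keeping the given linear magnitude fixed. The frame length, hop size, Hann window, iteration count, and all function names are assumptions chosen for brevity; a production vocoder would normally rely on an established audio library.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform; returns (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT by windowed overlap-add with squared-window normalization."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[1]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=0)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[:, i] * window
        norm[start:start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-3)   # floor avoids dividing by ~0 at the edges

def griffin_lim(magnitude, n_iter=32, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random initial phase
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        phase = np.exp(1j * np.angle(rebuilt))                # keep phase, discard magnitude
    return istft(magnitude * phase, n_fft, hop)

# Toy input: magnitude spectrogram of a 440 Hz tone sampled at 8 kHz.
t = np.arange(256 + 64 * 15) / 8000.0
mag = np.abs(stft(np.sin(2 * np.pi * 440.0 * t)))
waveform = griffin_lim(mag, n_iter=8)
```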

[0147] The speech synthesis of this invention incorporates bone conduction signals, which contain the speaker's vocal characteristics (tone, rhythm, timbre, etc.), resulting in a high degree of fidelity to the original speaker. Furthermore, by calling a pre-trained model of a registered user during synthesis, the synthesized speech can further preserve the original speaker's vocal quality.

[0148] Optionally, in another embodiment, after obtaining the single-modal data features in the time and spatial domains based on bone conduction signals and lip image signals, the Transformer model is used for encoding and decoding, followed by intra-modal and inter-modal feature fusion and weight allocation based on multi-head attention to construct multi-modal information based on feature fusion. Based on a language classification model, the multi-modal information is mapped to text information, and then synthesized into audio information using a TTS (Text to Speech) conversion model such as Tacotron. Since the text result loses the original audio characteristics, if it is necessary to further preserve the speaker's vocal characteristics, tone, rhythm, timbre, etc., in the synthesized speech, it is necessary to incorporate the feature representation of bone conduction signals into the existing TTS model.

[0149] Optionally, in another embodiment, after determining the single-modal data features in the time and spatial domains based on bone conduction signals and lip image signals, these features are independently input into a deep neural network and subsequent model to complete the text mapping and classification. The output two-channel text results are then filtered and fused by a decision layer. The final text information filtered by the decision layer is then processed by a TTS (Text to Speech) conversion model such as Tacotron to synthesize audio information. Since the text results lose the original audio characteristics, if it is necessary to further preserve the speaker's vocal characteristics, tone, rhythm, timbre, etc. in the synthesized speech, it is necessary to incorporate the feature representation of bone conduction signals into the existing TTS model.

[0150] Based on the aforementioned speech synthesis algorithm, this invention also provides a speech synthesis system based on the fusion of bone conduction signals and lip images. As shown in Figure 4, it includes a data acquisition module, a feature extraction module, an encoding module, and a speech recognition and synthesis module;

[0151] The data acquisition module collects bone conduction signals and lip image signals synchronously acquired during user voice input and sends them to the feature extraction module.

[0152] The feature extraction module preprocesses and extracts features from the received bone conduction signal and lip image signal data respectively, determines the single-modal data features in the time domain and spatial domain, and sends them to the encoding module.

[0153] The encoding module, based on the received two-source single-modal data features in the time and spatial domains, performs encoding using a generative adversarial network incorporating a cross-modal attention mechanism and a Mel spectrogram fusion method to obtain modal collaborative feature representations, which are then sent to the speech recognition and synthesis module.

[0154] The speech recognition and synthesis module, based on the received modal collaborative feature expression, on the one hand, applies a backend classification neural network model for processing, and outputs specific phrases and instructions, including short sentences, instructions, logical characters, etc.; on the other hand, it uses a human voice synthesis model to obtain speech waveforms and realize speech synthesis.

[0155] In another embodiment, in addition to the four modules mentioned above, the speech synthesis system also includes an interaction module. This interaction module evaluates the quality of the speech results synthesized by the speech recognition and synthesis module and then transmits them via an existing communication channel. The quality evaluation includes objective index evaluation, which calculates the Extended Short-Time Objective Intelligibility (ESTOI) and Perceptual Evaluation of Speech Quality (PESQ) scores for the generated speech waveform; speech whose scores fall below a certain threshold is considered unusable.

[0156] Subjective evaluation involves feeding back the voice results to the speaker for their perception and confirmation, assessing the accuracy of the information and the clarity of the audio. Selection buttons are set up for different application scenarios, allowing users to choose whether to use the audio.
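The two-stage quality check of the interaction module can be sketched as a simple gate. The sketch takes precomputed ESTOI and PESQ scores as inputs rather than computing them; the threshold values, the requirement that both objective scores pass, and all names are hypothetical, since the text only says that scores below a certain threshold mark the speech as unusable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityDecision:
    objective_ok: bool     # ESTOI and PESQ both at or above their thresholds
    user_accepted: bool    # the speaker confirmed the audio via the selection button

    @property
    def usable(self) -> bool:
        """Speech is transmitted only if both evaluations pass."""
        return self.objective_ok and self.user_accepted

def evaluate_quality(estoi, pesq, user_accepted, estoi_min=0.6, pesq_min=2.0):
    """Combine objective scores (assumed thresholds) with the user's confirmation."""
    objective_ok = estoi >= estoi_min and pesq >= pesq_min
    return QualityDecision(objective_ok, user_accepted)

decision = evaluate_quality(estoi=0.8, pesq=3.1, user_accepted=True)
```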

[0157] Although the present disclosure has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to the embodiments of the present disclosure, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present disclosure are within the scope of protection claimed by the present disclosure.

Claims

1. A speech synthesis method based on the fusion of bone conduction signals and lip images, characterized in that it comprises: synchronously acquiring bone conduction signals and lip image signals during user voice input; determining, based on the bone conduction signal and lip image signal, the single-modal data features in the time domain and spatial domain; based on the determined single-modal data features in the time and spatial domains, applying a generative adversarial network incorporating a cross-modal attention mechanism and a Mel spectrogram fusion method to obtain modal collaborative feature representations; wherein the fusion of Mel spectrograms is divided into two cases: silent scenarios and high-noise scenarios; in silent scenarios, the Mel-BC spectrogram based on bone conduction signals is ignored, and the Mel-Vba spectrogram based on modal fusion is directly used as the final Mel-Ult spectrogram across the entire frequency band, i.e.: Mel-Ult = 1*Mel-Vba + 0*Mel-BC; in high-noise scenarios, the Mel spectrogram is divided into local regions from both time and frequency domains, and the optimal representation is selected from the corresponding local regions of Mel-BC and Mel-Vba and combined to obtain the final Mel spectrogram Mel-Ult; based on the obtained modal collaborative feature representation, processing the data with a back-end classification neural network model to output specific phrases and instructions; and applying a human voice synthesis model to obtain an audio waveform to achieve speech synthesis.

2. The speech synthesis method based on the fusion of bone conduction signals and lip images according to claim 1, characterized in that the determination of the single-modal data features in the time and spatial domains based on the bone conduction signal and lip image signal includes: processing the bone conduction signals to obtain Mel-BC spectrograms based on bone conduction signals; and acquiring a sequence of frame images of the lips, inputting the sequence of frame images into the front-end neural network model, and extracting the lip image features F_v.

3. The speech synthesis method based on the fusion of bone conduction signals and lip images according to claim 2, characterized in that, Based on the determined single-modal data features in the time and spatial domains, a speech model is established by applying a generative adversarial network incorporating a cross-modal attention mechanism and a Mel spectrogram fusion method to obtain modal collaborative feature representations, including: Blind enhancement of bone conduction signals was performed to restore high-frequency components; The blindly enhanced bone conduction signal and lip image signal are collaboratively represented based on a cross-modal attention mechanism and input into a trained generative adversarial network. After multiple iterations, a Mel-Vba spectrogram based on modal fusion is generated. The Mel-Vba spectrogram based on modal fusion is fused with the Mel-BC spectrogram based on the primary bone conduction signal to obtain the final Mel-Ult spectrogram.

4. The speech synthesis method based on the fusion of bone conduction signals and lip images according to claim 2, characterized in that, The application of a human voice synthesis model to obtain audio waveforms and achieve speech synthesis includes: Based on the final Mel-Ult spectrogram of the image fusion, it is transformed into a linear spectrogram through a post-processing network. The obtained linear spectrogram is then input into a vocoder to be converted into a speech waveform.

5. The speech synthesis method based on the fusion of bone conduction signals and lip images according to claim 3, characterized in that performing collaborative representation of the blindly enhanced bone conduction signal and lip image signal based on a cross-modal attention mechanism, inputting this representation into a trained generative adversarial network, and generating, after multiple iterations, a Mel-Vba spectrogram based on modal fusion includes: Step S321: The feature representation extracted from the lip image signal, denoted F_v, is taken as the initial input I0; Step S322: The initial input I0 and the feature representation F_b extracted from the bone conduction signal undergo one round of collaborative encoding based on a cross-modal attention mechanism; Step S323: The weighted feature F_a1 obtained after one round of cross-modal attention co-encoding is concatenated with the original input I0 to form the fused feature F_c1, which is fed into the current-order generator GE1 to obtain the generated feature F_m1 and, at the same time, the Mel spectrogram IM1 of the current order; Step S324: The Mel spectrogram of the current order is input into the current-order classifier D1. If the unconditional result is true, and in the conditional result the similarity K_h to a certain statement is above a certain threshold while the highest similarity K_s to the remaining statements is below a certain threshold, the Mel spectrogram generated at the current order is determined to be a usable Mel spectrogram. Step S325: If the judgment condition described in step S324 is not met, the generated current-order feature F_m1 is taken as the initial input I0 of step S321, and steps S322, S323, S324, and S325 are iterated, with the subscripts of all involved F_a1, F_c1, F_m1, GE1, IM1, and D1 incremented by 1; Step S326: The iteration ends when the judgment condition in S324 is satisfied, and the usable Mel spectrogram generated at the current order is output to obtain the Mel spectrogram Mel-Vba based on the co-encoding of the generative adversarial network and the cross-modal attention mechanism.

6. The speech synthesis method based on the fusion of bone conduction signals and lip images according to claim 1, characterized in that the Mel-Ult calculation in the high-noise scenario is as follows: Let A(t,h) denote the amplitude at a certain moment and frequency. Partitioning and fusion based on the frequency dimension: First, given a cutoff frequency z, the Mel spectrogram is divided according to the cutoff frequency z. The area above the cutoff frequency z forms the high-frequency partition H0, and the area at and below the cutoff frequency z forms the mid-low frequency partition L. The portion of Mel-BC in the high-frequency partition H0 is removed and filled with the high-frequency partition H0 of the Mel-Vba spectrogram. Therefore, the high-frequency partition of the final-state Mel speech spectrogram in high-noise scenarios is determined as follows: Mel-Ult(H0) = 1*Mel-Vba(H0) + 0*Mel-BC(H0). The time-domain partitioning and fusion are divided into two cases: amplitude attenuation and consonant syllable loss. Regarding amplitude attenuation, in Mel-BC(L), if at a certain moment the amplitude at a certain frequency is greater than a certain threshold x but less than a certain threshold y, then this point is determined to be an accurate time-frequency distribution point of the audio information whose amplitude is attenuated to a certain extent. The amplitude at the same time-frequency point in Mel-Vba(L) is then used to enhance it; otherwise, no enhancement is performed. Thus, the amplitude distribution in the mid-to-low frequency region of the final Mel spectrogram is determined as follows: A(t,h)|Mel-Ult(L) = 1*A(t,h)|Mel-Vba(L), if x < A(t,h)|Mel-BC(L) < y; A(t,h)|Mel-Ult(L) = 1*A(t,h)|Mel-BC(L), otherwise. To address the issue of consonant syllable loss, the mid-to-low frequency region is further divided into n smaller regions of equal duration based on time resolution. First, the amplitude of each small region of Mel-BC(L) is integrated to calculate the sum of amplitudes A(L_i) within the small region Mel-BC(L_i). Its limits in the time dimension are T_i0 to T_ie, and its limits in the frequency dimension are 0 to 2 kHz; when A(L_{i-1}) and A(L_{i+1}) are both greater than a certain threshold p, while A(L_i) is less than a certain threshold q, the corresponding small region is filled using Mel-Vba(L): Mel-Ult(L_i) = 1*Mel-Vba(L_i), if A(L_i) < q && A(L_{i-1}) > p && A(L_{i+1}) > p; Mel-Ult(L_i) = 1*Mel-BC(L_i), otherwise. Therefore, the low-to-mid frequency partitions of the final state Mel speech spectrogram in high-noise scenarios can be determined as follows: Therefore, the final state Mel speech spectrogram under high noise conditions can be determined as follows: Mel-Ult = Mel-Ult(H0) + Mel-Ult(L).

7. A speech synthesis system based on the fusion of bone conduction signals and lip images, characterized in that it comprises: a data acquisition module, a feature extraction module, an encoding module, and a speech recognition and synthesis module; The data acquisition module collects bone conduction signals and lip image signals synchronously acquired during user voice input and sends them to the feature extraction module. The feature extraction module preprocesses and extracts features from the received bone conduction signal and lip image signal data respectively, determines the single-modal data features in the time domain and spatial domain, and sends them to the encoding module. The encoding module, based on the received single-modal data features in the time and spatial domains, applies a generative adversarial network incorporating cross-modal attention mechanisms and a Mel spectrogram fusion method to obtain modal collaborative feature representations. The Mel spectrogram fusion is divided into two cases: a silent scenario and a high-noise scenario. In the silent scenario, the Mel spectrogram Mel-BC based on bone conduction signals is ignored, and the Mel spectrogram Mel-Vba based on modal fusion is directly used as the final Mel spectrogram Mel-Ult across the entire frequency band, i.e., Mel-Ult = 1*Mel-Vba + 0*Mel-BC; In high-noise scenarios, the Mel spectrogram is divided into local regions from both time and frequency domains. The optimal representation is selected from the corresponding local regions of Mel-BC and Mel-Vba and combined to obtain the final Mel spectrogram Mel-Ult. The modal collaborative feature expression is sent to the speech recognition and synthesis module.
The speech recognition and synthesis module processes the received modal collaborative feature representation using a backend classification neural network model, outputting specific phrases and commands to achieve speech recognition; it also uses a human voice synthesis model to obtain speech waveforms to achieve speech synthesis.

8. The speech synthesis system based on the fusion of bone conduction signals and lip images according to claim 7, characterized in that, It also includes an interaction module, which evaluates the quality of the speech results synthesized by the speech recognition and synthesis modules and transmits them subsequently through the existing communication channel.

9. A speech synthesis system based on the fusion of bone conduction signals and lip images according to claim 8, characterized in that, The quality evaluation includes: Objective evaluation metrics are used to calculate the intelligibility and speech quality score of the generated speech waveform. Speech scores below a certain threshold are considered unusable. Subjective evaluation metrics are used to provide feedback on the voice results to the user, obtaining instructions from the user regarding whether or not to use the generated voice.