A method for generating a multimodal dataset of speech and electroglottograms

Generating a multimodal dataset of speech and electroglottograms addresses the insufficient diversity of existing speech datasets and improves model performance and accuracy in speech recognition and emotion recognition tasks.

CN119091884B · Active · Publication Date: 2026-06-30 · BEIHANG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIHANG UNIV
Filing Date
2024-09-12
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing speech datasets lack diversity and electroglottic information synchronized with speech, which limits the generalization ability and robustness of speech recognition systems in scenarios with extremely low signal-to-noise ratios and multiple speakers.

Method used

Generate a speech and electroglottic multimodal dataset by acquiring the original multimodal data read aloud by preset personnel, performing mixing operations, generating label information, and storing the data and labels as a multimodal dataset using a preset file structure.

Benefits of technology

It improves the performance and accuracy of the model in speech recognition and emotion recognition tasks, and enhances the model's generalization ability, robustness and reliability.


Abstract

This invention belongs to the field of information technology and discloses a method for generating a speech and electroglottic multimodal dataset. The method includes: acquiring raw multimodal data read aloud by several preset personnel based on preset text, wherein the raw multimodal data includes speech data and electroglottic data; performing a mixing operation on the raw multimodal data to obtain mixed multimodal data under different signal-to-noise ratio conditions; generating corresponding label information for each raw multimodal data and each mixed multimodal data; and storing each multimodal data and its corresponding label information based on a preset file storage structure to obtain a multimodal dataset. The technical solution of this invention, combining speech and electroglottic data, can obtain more comprehensive and multi-dimensional information, thereby improving the generalization ability, accuracy, robustness, and reliability of the model.

Description

Technical Field

[0001] This invention belongs to the field of information technology, and in particular relates to a method for generating a multimodal dataset of speech and electroglottic diagrams.

Background Technology

[0002] Speech contains rich information about human speech, including content, pitch, rate of speech, and identity information, and can be used for tasks such as speech recognition, semantic understanding, voiceprint recognition, and speech emotion recognition. An electroglottogram (EGG) is a graphical representation of vocal cord vibration. Compared to speech, it has the following advantages: first, EGG data provides a visual representation of vocal cord vibration, intuitively showing the characteristics and waveforms of pitch; second, EGG data is unaffected by speech quality and environmental noise, exhibiting high stability and reliability; finally, EGG data has lower storage and transmission costs and is easier to process and analyze. Speech datasets have wide applications in speech recognition, speech synthesis, and emotion recognition, and can be used in scenarios such as intelligent assistants, intelligent customer service, and speech translation. EGG datasets can be used not only for speech recognition and emotion recognition but also for speaker recognition and speech quality assessment.

[0003] Existing speech datasets often lack diversity, which limits the generalization ability and robustness of speech recognition systems in scenarios with extremely low signal-to-noise ratios and multiple speakers. In addition, most datasets lack electroglottic information synchronized with speech, which limits the possibility of a deeper understanding of the speech generation mechanism.

Summary of the Invention

[0004] The purpose of this invention is to provide a method for generating a multimodal dataset of speech and electroglottic diagrams to solve the problems existing in the prior art.

[0005] To achieve the above objectives, the present invention provides a method for generating a speech and electroglottic multimodal dataset, comprising:

[0006] Acquire raw multimodal data read aloud by several preset personnel based on preset text, wherein the raw multimodal data includes speech data and electroglottic diagram data;

[0007] The original multimodal data are mixed to obtain mixed multimodal data under different signal-to-noise ratio conditions;

[0008] Generate corresponding label information for each of the original multimodal data and each of the mixed multimodal data, and store each multimodal data and its corresponding label information based on a preset file storage structure to obtain a multimodal dataset.

[0009] Optionally, acquire raw multimodal data read aloud by several preset personnel based on preset text, specifically including:

[0010] Based on the preset division rules, each preset person and each preset text are numbered and grouped to obtain preset person groups and preset text groups with the same number of groups;

[0011] According to the grouping order, each preset personnel group and each preset text group are matched one by one. Each preset personnel group reads the text aloud according to the preset pause interval, resulting in several dual-channel audio files. Each dual-channel audio file carries the speech audio in the left channel and the electroglottic diagram in the right channel.

[0012] Based on a preset pause interval, each of the dual-channel audio files is split at the pauses into one audio segment per text, giving several segmented texts. Each segmented text is then named to obtain the original multimodal data.

[0013] Optionally, during the process of having each group of preset personnel read the text aloud at preset pause intervals, each dual-channel audio file is acquired based on a preset audio acquisition device, which includes a microphone and an electroglottogram instrument.

[0014] Optionally, before performing text segmentation on each of the dual-channel audio files based on a preset pause interval, the method further includes performing noise reduction processing on each of the dual-channel audio files based on an empirical mode decomposition method to obtain a noise-reduced dual-channel audio file, and performing text segmentation based on the noise-reduced dual-channel audio file.

[0015] Optionally, the dual-channel audio files are segmented into text based on a preset pause interval, specifically including:

[0016] Audio segments whose volume is below a preset decibel threshold and whose duration exceeds a preset pause interval are designated as silent segments. The audio segments after removing silent segments from each dual-channel audio file are extracted as the segmented text.

[0017] Optionally, a mixing operation is performed on the original multimodal data, specifically including:

[0018] The original multimodal data of two preset personnel are randomly selected. The selected original multimodal data are processed to unify the audio length and mixed according to the set signal-to-noise ratio. The data is then named to obtain the mixed multimodal data.

[0019] Optionally, it also includes storing the audio information of each of the hybrid multimodal data and the corresponding original multimodal data, specifically including:

[0020] The mixed multimodal data are input into the target speech extraction model for speech extraction to obtain the original multimodal data corresponding to each mixed multimodal data. The audio information of each mixed multimodal data and the corresponding original multimodal data is stored in CSV file format. The target speech extraction model is constructed based on recurrent neural networks and gated recurrent units.

[0021] Optionally, it also includes verifying the availability of the multimodal dataset, specifically including:

[0022] Each multimodal data in the multimodal dataset is input into the speech recognition model for classification and recognition to obtain the predicted label information corresponding to each multimodal data. The predicted label information is compared with the label information of the corresponding multimodal data to obtain the usability verification result. The speech recognition model is built based on a neural network model.

[0023] The technical effects of this invention are as follows:

[0024] This invention combines speech and electroglottic data to obtain more comprehensive and multi-dimensional information, thereby improving the performance and accuracy of the model in speech recognition and emotion recognition tasks. In training and evaluating speech processing systems, such as speech recognition and speech synthesis, this database can better train and evaluate the model, and improve the model's generalization ability, accuracy, robustness and reliability.

Attached Figure Description

[0025] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:

[0027] Figure 1 is a flowchart of the speech electroglottic dual-modal data acquisition and processing method in an embodiment of the present invention;

[0028] Figure 2 is a flowchart illustrating the audio content recording process in an embodiment of the present invention, wherein Figure 2(a) illustrates the continuous recording process within a group and Figure 2(b) illustrates the inter-group rest and recording process;

[0029] Figure 3 is a flowchart of the bimodal dataset segmentation process in an embodiment of the present invention;

[0030] Figure 4 is a flowchart of the target speech extraction model in an embodiment of the present invention;

[0031] Figure 5 is a flowchart of the speech recognition model in an embodiment of the present invention;

[0032] Figure 6 is a graph showing the change in word error rate during the testing process in an embodiment of the present invention;

[0033] Figure 7 is a flowchart illustrating the multimodal dataset generation process in an embodiment of the present invention.

Detailed Implementation

[0034] Various exemplary embodiments of the present invention will now be described in detail. This detailed description should not be considered as a limitation of the present invention, but rather as a more detailed description of certain aspects, features, and embodiments of the present invention.

[0035] It should be understood that the terminology used in this invention is merely for describing particular embodiments and is not intended to limit the invention. Furthermore, with respect to numerical ranges in this invention, it should be understood that each intermediate value between the upper and lower limits of the range is also specifically disclosed. Every smaller range between any stated value or intermediate value within a stated range, and any other stated value or intermediate value within said range, is also included in this invention. The upper and lower limits of these smaller ranges may be independently included or excluded from the range.

[0036] Unless otherwise stated, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. While only preferred methods have been described herein, any methods similar or equivalent to those described herein may be used in the implementation or testing of this invention. All references to this specification are incorporated by way of citation to disclose and describe the methods associated with those references. In the event of any conflict with any incorporated reference, the content of this specification shall prevail.

[0037] Various modifications and variations can be made to the specific embodiments described in this specification without departing from the scope or spirit of the invention, as will be apparent to those skilled in the art. Other embodiments derived from this specification will also be obvious to those skilled in the art. This application specification and embodiments are merely exemplary.

[0038] The terms “include,” “including,” “have,” “contain,” etc., used herein are all open-ended terms, meaning "including but not limited to".

[0039] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0040] Example 1

[0041] As shown in Figures 1 to 7, this embodiment provides a method for generating a speech and electroglottic diagram multimodal dataset, including: acquiring raw multimodal data read aloud by several preset personnel based on preset text, wherein the raw multimodal data includes speech data and electroglottic diagram data; performing a mixing operation on each of the raw multimodal data to obtain mixed multimodal data under different signal-to-noise ratio conditions; generating corresponding label information for each of the raw multimodal data and each of the mixed multimodal data; and storing each multimodal data and the corresponding label information based on a preset file storage structure to obtain a multimodal dataset.

[0042] This embodiment combines speech and electroglottic data to obtain more comprehensive and multi-dimensional information, thereby improving the model's performance and accuracy in speech recognition and emotion recognition tasks. The emergence of multimodal datasets has made significant contributions to today's technological field, providing more intelligent and user-friendly solutions for areas such as intelligent voice assistants, smart healthcare, and smart homes.

[0043] In the fields of speech recognition and emotion recognition, the quality and quantity of datasets are crucial to model performance. By providing larger-scale, more organized, and more diverse multimodal datasets, richer and more reliable data support can be provided for research and applications in the field of artificial intelligence, thereby promoting the development and application of AI technologies.

[0044] This embodiment involves using specialized equipment to simultaneously acquire speech and electroglottic signals from one hundred different individuals, and then segment and integrate them into a dual-channel dataset. It can be applied to fields such as speech recognition, speaker recognition, voiceprint recognition, noise suppression, and speech enhancement. It can also be used as a training set and test set for model training, providing important support and assistance for the development and application of related technologies in multiple fields.

[0045] The method specifications for creating, collecting, and establishing a bimodal Chinese speech electroglottic dataset are as follows:

[0046] Step 1: Use hardware devices to measure the speech and electroglottic features of the experimenters;

[0047] Step 1.1: Plan the dataset content. Select one thousand sentences as the recording content for this dataset, sort these one thousand sentences, number them from 000 to 999, and divide them into ten groups in order, with one hundred sentences in each group. It is stipulated that ten people read one group from beginning to end, and a total of 10,000 recordings will be obtained.

[0048] Step 1.2: Select 100 test subjects, aged 18 to 40, all in good health, and whose occupations are students and teachers, including 40 women and 60 men;

[0049] Step 1.3: Control the ambient noise to below 20 decibels. Testers must wear the device in the correct posture, read fluently and clearly, and pause at least 3 seconds between each sentence to facilitate subsequent segmentation.

[0050] Step 1.4: While the tester is reading, use [the software] to collect the electroglottic diagram and speech signals separately, and ensure that the recording is free of interference such as noise caused by friction of the receiving device. Use Goldwave software to record and store the waveform and data of the signal. After recording is completed, record the tester's gender, age and other information.
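The recording plan in Steps 1.1 and 1.2 can be sketched as a small enumeration. This is only an illustration: the function name `build_recording_plan` is hypothetical, and it assumes that testers with consecutive IDs share a group (IDs 00 to 09 read sentences 000 to 099, and so on), which the text implies but does not state outright.

```python
def build_recording_plan(num_speakers=100, num_sentences=1000, num_groups=10):
    """Return one (speaker_id, sentence_id) pair per planned recording."""
    if num_speakers % num_groups or num_sentences % num_groups:
        raise ValueError("speakers and sentences must divide evenly into groups")
    speakers_per_group = num_speakers // num_groups
    sentences_per_group = num_sentences // num_groups
    plan = []
    for speaker_id in range(num_speakers):
        # Assumption: consecutive tester IDs belong to the same group.
        group = speaker_id // speakers_per_group
        first = group * sentences_per_group
        for sentence_id in range(first, first + sentences_per_group):
            plan.append((speaker_id, sentence_id))
    return plan


plan = build_recording_plan()
print(len(plan))          # 10000
print(plan[0], plan[-1])  # (0, 0) (99, 999)
```

With the default sizes this reproduces the 10,000 recordings stated in Step 1.1: ten groups, each with ten testers reading one hundred sentences.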

[0051] Every step in the entire process of acquiring and creating the bimodal dataset should adhere to specific standards, including tester standards, data storage standards, naming standards, and experimental record standards. The specific standards and their meanings are shown in Table 1.

[0052] Table 1

[0053]

[0054] Tester specifications: Select 100 testers, aged 18 to 40, whose occupations are students and teachers, including 40 women and 60 men;

[0055] The experiment selected these 1,000 reading passages for the following reasons: 1) to ensure coverage of most common phonemes, syllables, tones, and grammatical structures in Chinese, so that the dataset can reflect the language features of Chinese as comprehensively as possible; 2) to include different types of vocabulary and sentence structures; and 3) to ensure that the selected sentences are as balanced and evenly distributed as possible.

[0056] To facilitate the organization and retrieval of audio data, a naming convention for bimodal data files was designed. Both audio files and tag files must be named according to this convention, as shown in Table 2 below. The filename consists of 11 characters, divided into four parts.

[0057] Table 2

[0058]

| File name part | Tester ID | Tester gender | Tester age | Test statement sequence number |
|---|---|---|---|---|
| Character length | 2 | 1 | 2 | 3 |
| Character range | 00~99 | F, M | 18~40 | 000~999 |

[0059] 1) Tester ID: Record and number the tester information. Based on the current number of testers (100 people), the tester ID occupies 2 digits, ranging from 00 to 99. Every ten people read the same set of text.

[0060] 2) Tester Gender: The speaker's voice signal varies in pitch, formants, intensity, and timbre depending on the speaker's gender; simultaneously, the electroglottic signal will also differ in glottal period, vocal cord contact ratio, and amplitude. Gender is indicated by one character: F: female; M: male.

[0061] 3) Age of the tester: The speaker's age affects the fundamental frequency, timbre, and formants of their speech, and also influences the speaker's electroglottic diagram in terms of vocal cord contact pattern, amplitude variation, and sound stability. The age indicator consists of two digits, ranging from 18 to 40.

[0062] 4) Test Statement Numbers: There are a total of 1000 test statements, divided into groups of 100. Each tester must read 100 statements. Statements with the same number are considered the same statement. The statement identifier occupies 3 digits, ranging from 000 to 999.
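A minimal sketch of the naming convention, for illustration only. The helper names `build_name` and `parse_name` are hypothetical; the underscore separators follow the "Tester ID_Gender_Age_Statement ID" format of Step 2.4, which also accounts for the 11-character total (8 field characters plus 3 separators).

```python
import re

# Four fields joined by underscores: ID (2 digits), gender, age (2 digits), sentence (3 digits).
NAME_PATTERN = re.compile(r"^(\d{2})_([FM])_(\d{2})_(\d{3})$")


def build_name(tester_id, gender, age, sentence_id):
    """Build an 11-character file name stem from the four fields."""
    if not 0 <= tester_id <= 99:
        raise ValueError("tester_id must be in 00..99")
    if gender not in ("F", "M"):
        raise ValueError("gender must be 'F' or 'M'")
    if not 18 <= age <= 40:
        raise ValueError("age must be in 18..40")
    if not 0 <= sentence_id <= 999:
        raise ValueError("sentence_id must be in 000..999")
    return f"{tester_id:02d}_{gender}_{age:02d}_{sentence_id:03d}"


def parse_name(name):
    """Split a file name stem back into its four fields."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a valid name: {name!r}")
    tester_id, gender, age, sentence_id = match.groups()
    return {
        "tester_id": int(tester_id),
        "gender": gender,
        "age": int(age),
        "sentence_id": int(sentence_id),
    }


name = build_name(7, "F", 23, 42)
print(name, len(name))   # 07_F_23_042 11
print(parse_name(name))
```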

[0063] The steps for preprocessing the dataset are as follows:

[0064] Step 2: Perform noise reduction, segmentation, and naming operations on the collected dataset;

[0065] Step 2.1: After recording, you will get 100 dual-channel .wav files, each 30 to 40 minutes long. The left channel is the speech and the right channel is the electroglottic diagram. Back up this source data.

[0066] Step 2.2: Denoise the audio file by using Empirical Mode Decomposition (EMD) and process each of the two channels separately.

[0067] Step 2.3: Segment the audio based on the pauses in the left channel. Define segments with a sound level below 20 decibels and lasting at least 3 seconds as silent segments. Record the time points of each silent segment and audio segment in the entire .wav file. Combine the two channels based on the recorded time points and segment the audio. Remove the silent segments and extract and save the audio segments before and after each silent segment. Each audio file contains 100 texts, resulting in 10,000 .wav files of 7-13 seconds each, with one text in each file.

[0068] Step 2.4: Name the files in the format of Tester ID (00-99)_Gender (F / M)_Age (18-40)_Statement ID (000-999). Save the 100 .wav files obtained by the same tester into one folder, and name the folder as Tester ID_Gender_Age.
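The pause-based segmentation of Step 2.3 can be sketched as follows. This is a simplified illustration rather than the procedure actually used: the function names are hypothetical, levels are computed per frame as RMS in dB relative to full scale (the 20-decibel figure in the text refers to a measured sound level, so the digital threshold here is an assumed stand-in), and any trailing partial frame is ignored.

```python
import math


def frame_levels_db(samples, frame_len):
    """RMS level of each non-overlapping frame, in dB relative to full scale (1.0)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20 * math.log10(max(rms, 1e-10)))
    return levels


def find_audio_segments(samples, sample_rate, threshold_db, min_silence_s, frame_len):
    """Return (start, end) sample indices of audio kept between long silent runs."""
    levels = frame_levels_db(samples, frame_len)
    min_frames = -(-round(min_silence_s * sample_rate) // frame_len)  # ceiling division
    quiet = [level < threshold_db for level in levels]
    silent = [False] * len(quiet)
    i = 0
    while i < len(quiet):
        if not quiet[i]:
            i += 1
            continue
        j = i
        while j < len(quiet) and quiet[j]:
            j += 1
        if j - i >= min_frames:  # only long pauses count as silent segments
            silent[i:j] = [True] * (j - i)
        i = j
    segments, start = [], None
    for idx, is_silent in enumerate(silent):
        if not is_silent and start is None:
            start = idx
        elif is_silent and start is not None:
            segments.append((start * frame_len, idx * frame_len))
            start = None
    if start is not None:
        segments.append((start * frame_len, len(silent) * frame_len))
    return segments


rate = 100  # toy rate to keep the demo small
tone = [0.5 * math.sin(2 * math.pi * 5 * n / rate) for n in range(rate)]
left = tone + [0.0] * (3 * rate) + tone  # speech channel: 1 s, 3 s pause, 1 s
right = [0.8 * x for x in left]          # stand-in for the EGG channel
segments = find_audio_segments(left, rate, threshold_db=-40.0,
                               min_silence_s=3.0, frame_len=10)
clips = [(left[a:b], right[a:b]) for a, b in segments]
print(segments)  # [(0, 100), (400, 500)]
```

The boundaries are detected on the left (speech) channel only and then applied to both channels, which keeps each speech clip aligned with its electroglottic counterpart.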

[0069] Empirical Mode Decomposition (EMD) is a method for processing nonlinear and non-stationary signals. It helps analyze the essential characteristics of a signal by decomposing it into several Intrinsic Mode Functions (IMFs) and a residual. EMD is often used for noise reduction because it can effectively separate noise from useful information in a signal.

[0070] The basic principle of signal denoising using EMD is as follows: First, the original signal is decomposed into several IMFs and a residual using the EMD algorithm. These IMFs have different frequency components, which gradually decrease from high frequency to low frequency. Each IMF is a single-component function of local feature scale. Second, since some IMFs mainly contain noise while others contain useful signal components, the frequency characteristics and energy distribution of these IMFs are analyzed to select the IMFs containing the main signal components. High-frequency IMFs usually contain more noise, while low-frequency IMFs contain more signal components. A threshold is set to filter out the IMFs that mainly contain signal components. Finally, the selected IMFs and the residual are recombined to obtain the denoised signal. The denoised signal needs to be verified to check whether it retains the main features of the original signal and effectively removes noise.
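The selection and recombination stage can be illustrated with a short sketch. It assumes the IMFs and the residual have already been produced by some EMD implementation (the decomposition itself is not shown), and it uses the zero-crossing rate as a rough proxy for an IMF's frequency content. The text does not specify the selection criterion or the threshold, so both the criterion and the `max_zcr` parameter are assumptions.

```python
import math


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ (a rough frequency proxy)."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(signal) - 1, 1)


def reconstruct_from_imfs(imfs, residual, max_zcr):
    """Sum the residual and every IMF whose zero-crossing rate is at most max_zcr."""
    kept = [imf for imf in imfs if zero_crossing_rate(imf) <= max_zcr]
    denoised = list(residual)
    for imf in kept:
        for n, value in enumerate(imf):
            denoised[n] += value
    return denoised, len(kept)


# Toy stand-ins for precomputed IMFs: one fast oscillation and one slow wave.
fast = [0.1 * (-1) ** n for n in range(200)]
slow = [math.sin(2 * math.pi * n / 100) for n in range(200)]
residual = [0.5] * 200
denoised, kept_count = reconstruct_from_imfs([fast, slow], residual, max_zcr=0.2)
print(kept_count)  # 1
```

In this toy case only the slow component survives, so the output equals the slow wave plus the residual. On real recordings the threshold would need to be tuned and the result checked against the original signal, as the paragraph above notes.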

[0071] The advantages of EMD denoising include its ability to handle nonlinear and non-stationary signals, its adaptive nature, the absence of pre-defined basis functions, and its ability to intuitively separate different frequency components of a signal. However, it has limitations: first, its computational complexity is high, making it slow for long signals; second, it is sensitive to noise and may need to be combined with other methods (such as thresholding) for more effective denoising; and finally, its performance depends strongly on the IMF selection strategy, since different strategies change the final denoising effect, which makes the IMF selection threshold particularly important.

[0072] In summary, EMD is a powerful signal processing tool, particularly suitable for handling complex nonlinear and nonstationary signals, and has wide applications in noise reduction, feature extraction, and other fields. Therefore, in this embodiment, EMD is chosen to perform noise reduction processing on electroglottic diagrams and speech signals. In particular, for electroglottic diagram signals, due to their nonlinear and nonstationary characteristics, traditional linear noise reduction methods are not applicable, while EMD can adaptively decompose the signal and effectively process such complex signals.

[0073] The creation of the mixed speech dataset and the modeling of the target speech extraction task are as follows:

[0074] Step 3: Perform a mixing operation on the original speech and electroglottic multimodal dataset described in Step 2 to establish a mixed speech and electroglottic multimodal dataset under different signal-to-noise ratio conditions, and test the target speech extraction model;

[0075] Step 3.1: Randomly select the recordings of two speakers from 100 subjects and mix the two audio segments at a signal-to-noise ratio of 0dB.

[0076] Step 3.2: Since the original audio files have different lengths, the audio files were padded and trimmed so that all audio files have the same length of 16 seconds and a sampling rate of 16000 Hz for easier recognition and processing.

[0077] Step 3.3: A total of 10,000 mixed audio recordings were created. These audio recordings were named by combining the names of their original audio recordings, i.e., "target speaker's sentence audio file name - interfering speaker's sentence file name". Based on gender, the dataset was divided into four types of mixed speech data: male-male, female-female, male-female, and female-male. At the same time, the clean speech and electroglottic diagram data of the target speaker were also created.

[0078] Step 3.4: Divide the mixed speech and electroglottogram datasets into training, validation, and testing sets in an 8:1:1 ratio. To facilitate management and subsequent data extraction, record all detailed information about the mixed audio in a structured CSV file. This will allow for the training and testing of a target speaker speech extraction model based on a convolutional neural network, using the corresponding clean speech as the target.

[0079] To train the target speaker speech extraction and recognition algorithm, the raw clean speech data needs to be mixed to simulate the multi-speaker environment commonly encountered in real-world applications. Recordings of two speakers are randomly selected from the dual-mode dataset, and these two audio segments are mixed with a signal-to-noise ratio of 0dB. This means that the sound intensities of both speakers are adjusted to the same level, making it impossible to distinguish between the main and background speakers solely by volume. This step aims to increase the difficulty of speech recognition and better simulate sound conditions under complex circumstances. During dataset creation, the raw clean audio was preprocessed. Since the original audio files varied in length, they were padded and trimmed to ensure a uniform length of 16 seconds and a sampling rate of 16000Hz for subsequent recognition processing. This standardization not only ensured data consistency but also improved the efficiency and accuracy of the model processing.
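A minimal sketch of the length unification and mixing described above. The function names are hypothetical, and one detail is an assumption: signal power is measured after padding or trimming, which the text does not specify. At 0 dB the interfering signal is scaled so that its mean power equals that of the target.

```python
import math


def fit_length(samples, target_len):
    """Trim or zero-pad a signal so it has exactly target_len samples."""
    return list(samples[:target_len]) + [0.0] * max(target_len - len(samples), 0)


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(target, interferer, snr_db, sample_rate=16000, duration_s=16):
    """Mix two signals so the target-to-interferer power ratio equals snr_db."""
    n = sample_rate * duration_s
    target = fit_length(target, n)
    interferer = fit_length(interferer, n)
    p_target, p_interferer = mean_power(target), mean_power(interferer)
    if p_target == 0 or p_interferer == 0:
        raise ValueError("both signals must contain non-zero samples")
    # Assumption: power is measured on the padded or trimmed signals.
    gain = math.sqrt(p_target / (p_interferer * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, target, scaled


# Toy demo at a low sample rate; real data would use 16000 Hz and 16 s.
speaker_a = [0.5 * math.sin(2 * math.pi * 3 * n / 100) for n in range(150)]
speaker_b = [0.1 * math.sin(2 * math.pi * 7 * n / 100) for n in range(250)]
mixture, clean_a, scaled_b = mix_at_snr(speaker_a, speaker_b, snr_db=0.0,
                                        sample_rate=100, duration_s=2)
ratio_db = 10 * math.log10(mean_power(clean_a) / mean_power(scaled_b))
print(len(mixture), abs(ratio_db) < 1e-9)  # 200 True
```

Returning the padded clean target alongside the mixture keeps the training target available for the extraction model.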

[0080] A total of 10,000 mixed audio tracks were created during the mixing process. These tracks were named by combining the names of their original audio files, in the format of "target speaker's sentence audio filename - interfering speaker's sentence filename," ensuring traceability for subsequent processing and analysis. All mixed audio tracks were divided into training, validation, and test sets in an 8:1:1 ratio for algorithm training, tuning, and final performance evaluation, respectively. To facilitate management and subsequent data extraction, detailed information about the mixed audio tracks (i.e., the original audio filenames of the target and interfering sources involved in the mixing, and the identifiers of the mixed audio tracks in the new dataset) was systematically recorded in a structured CSV file. This data management approach not only improved the transparency of the experiment but also facilitated reproduction and further research.
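The split and the CSV record can be sketched as below. The column names, the fixed random seed, and the helper names are illustrative assumptions; only the 8:1:1 ratio and the "target - interferer" naming come from the text.

```python
import csv
import io
import random


def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train, val, and test subsets at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return {
        "train": items[:n_train],
        "val": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


def write_manifest(splits, stream):
    """Write one CSV row per mixture: its name, both source names, and its subset."""
    writer = csv.writer(stream)
    writer.writerow(["mixture", "target", "interferer", "subset"])  # hypothetical columns
    for subset, pairs in splits.items():
        for target, interferer in pairs:
            writer.writerow([f"{target}-{interferer}", target, interferer, subset])


# Toy pairs: tester 00 as target, tester 11 as interfering speaker.
pairs = [(f"00_F_23_{i:03d}", f"11_M_30_{100 + i:03d}") for i in range(100)]
splits = split_8_1_1(pairs)
buffer = io.StringIO()
write_manifest(splits, buffer)
rows = buffer.getvalue().splitlines()
print({name: len(part) for name, part in splits.items()})  # {'train': 80, 'val': 10, 'test': 10}
print(rows[0])  # mixture,target,interferer,subset
```

Writing to an in-memory buffer keeps the sketch self-contained; a real pipeline would pass an open file instead.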

[0081] The purpose of this embodiment is to extract the target speaker's audio from overlapping mixed speech using a neural network. Unlike traditional speech separation tasks, this task does not attempt to separate all speech signals in the mixture, but focuses on identifying and extracting the specific target speaker's voice. Therefore, this embodiment adopts the SpeakerBeam method, which makes the model speaker-independent by using additional speaker information obtained from the target speaker's adaptive utterances: given adaptive utterances from a new speaker not encountered during the training phase, the model can still extract that speaker's audio. This additional information allows the neural network to focus on the target speaker, treating all other sounds as interference, thus avoiding the label permutation problem of traditional separation methods.

[0082] This embodiment uses a recurrent neural network (RNN) and a gated recurrent unit (GRU) to train and test the target speech extraction model.

[0083] Recurrent Neural Networks (RNNs) are a type of neural network used to process sequential data and are widely applied in fields such as speech recognition and natural language processing. Unlike traditional feedforward neural networks, RNNs have recurrent connections, enabling them to capture temporal dependencies in sequential data. The basic unit of an RNN includes an input layer, hidden layers, and an output layer. The hidden state at each time step depends not only on the current input data but also on the hidden state of the previous time step.

[0084] GRU is an improved variant of RNN, specifically designed to solve the vanishing gradient problem. GRU uses gating mechanisms to control the flow of information, making it better suited for handling long sequences of data than standard RNNs. The basic structure of GRU includes update gates and reset gates. These gates determine how the current hidden state is updated by controlling the flow of information.
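To make the role of the two gates concrete, here is a single GRU time step in plain Python. It is a generic textbook formulation, not the model used in this embodiment; the parameter names are illustrative, and it follows the convention in which the new state is (1 - z) times the old state plus z times the candidate (some references swap the roles of z and 1 - z).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gru_step(x, h, params):
    """One GRU time step; params holds W_*, U_*, b_* for z (update), r (reset), n (candidate)."""
    def affine(name, x_in, h_in):
        wx = matvec(params["W_" + name], x_in)
        uh = matvec(params["U_" + name], h_in)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + name])]

    z = [sigmoid(v) for v in affine("z", x, h)]   # update gate: how much to overwrite
    r = [sigmoid(v) for v in affine("r", x, h)]   # reset gate: how much history to use
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in affine("n", x, gated_h)]  # candidate state
    return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]


# With all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the step simply halves the previous hidden state.
zeros2x2 = [[0.0, 0.0], [0.0, 0.0]]
params = {f"{kind}_{gate}": ([0.0, 0.0] if kind == "b" else zeros2x2)
          for kind in ("W", "U", "b") for gate in ("z", "r", "n")}
h_next = gru_step([1.0, -2.0], [0.4, -0.8], params)
print(h_next)  # [0.2, -0.4]
```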

[0085] The specific structure of the model is as follows: The speech input signal is first processed by a one-dimensional convolutional layer (encoder layer) to convert the original temporal signal into a set of feature representations. These feature representations are then output to the Bottleneck layer. The main function of the Bottleneck layer is to compress the high-dimensional feature representations to a lower dimension through a linear transformation, reducing computational complexity. Then, in the Temporal Convolutional Network (TCN), the data is processed through multiple convolutional blocks and residual connections. These convolutional blocks further process these encoded features, capturing more complex information relevant to the target speaker.

[0086] The auxiliary network is responsible for extracting features from the target speaker's adaptive speech as an embedding vector. In the experiment, the adaptive speech is the electroglottic signal corresponding to the target speaker's speech. The input electroglottic signal is converted by the encoder and then undergoes dimensionality transformation through a lambda layer to conform to the input of the temporal convolutional network. After feature extraction by the convolutional block, the average value is taken on the last dimension to generate a fixed-length embedding vector. Finally, a lambda layer is used to adjust the dimensions of the final speaker embedding vector, which is then output to the adaptive layer of the masking network.
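The pooling and adaptation steps can be illustrated in a few lines. The averaging follows the description above, taking the last dimension to be time. How the adaptive layer combines the embedding with the main network's features is not stated in the text, so the element-wise scaling shown here is an assumption modeled on the multiplicative adaptation used in SpeakerBeam-style models; both function names are hypothetical.

```python
def time_average(features):
    """Average a (channels x frames) feature map over frames: one value per channel."""
    return [sum(channel) / len(channel) for channel in features]


def adapt(main_features, embedding):
    """Scale each channel of the main features by the matching embedding weight.

    Assumption: multiplicative adaptation; the source does not specify the operation.
    """
    return [[embedding[c] * value for value in channel]
            for c, channel in enumerate(main_features)]


egg_features = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]   # toy auxiliary-network output
embedding = time_average(egg_features)
adapted = adapt([[1.0, 1.0], [0.5, 2.0]], embedding)
print(embedding)  # [2.0, 4.0]
print(adapted)    # [[2.0, 2.0], [2.0, 8.0]]
```

Because the average removes the time axis, the embedding has the same length regardless of how long the electroglottic input is.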

[0087] The application of the dataset and the modeling of the speech recognition task are as follows:

[0088] Step 4: Create labels and test models such as speech recognition using the dataset:

[0089] Step 4.1: Create a label for each WAV file: a TXT file containing the text read in the audio and its phonemes. This yields a total of 10,000 TXT label files that correspond one-to-one with the WAV files;

[0090] Step 4.2: Construct a txt file that provides a list of training and testing data and label information to facilitate batch processing of large amounts of data by the model;

[0091] Step 4.3: To test the practicality and reliability of the dataset, a speech recognition model is trained and evaluated on it. The model is built using machine learning and neural network components such as CNN and CTC, and implemented using the Keras framework based on TensorFlow.

[0092] Following the aforementioned steps, a dual-mode dataset containing 10,000 audio files with speech and electroglottic information, along with 10,000 corresponding label files, can be obtained. To verify the applicability of this dataset, speech recognition detection can be established using machine learning or neural network models based on the aforementioned data results.

[0093] This embodiment proposes to use a Convolutional Neural Network (CNN) model and a Connectionist Temporal Classification (CTC) loss function for speech feature extraction and model training. Through a large dataset and a multi-layered network structure, it can effectively recognize various speech inputs.

[0094] Unlike RNNs, which process data sequentially over time, CNNs offer significant advantages in parallel processing. CNN convolutional operations can be performed in parallel across the entire input because each convolutional kernel can scan different parts of the input simultaneously. This makes CNNs faster during training and inference, especially when processing images or other structured data. Secondly, CNNs can utilize local receptive fields to capture local features. By using small convolutional kernels (e.g., 3×3) and multiple stacks, CNNs can progressively expand their receptive fields to capture higher-level features. This allows CNNs to handle spatial local correlations more effectively, and the size of the receptive field can be flexibly controlled by adjusting the kernel size, number of layers, and pooling operations to adapt to different task requirements. Thirdly, the convolutional kernels in CNNs share parameters across the entire input. This means that the same kernel is used to scan different regions of the input, significantly reducing the number of parameters and lowering model complexity and computational cost. Finally, the hierarchical structure of CNNs enables them to extract multi-level features. For example, initial layers extract simple edge and texture features, while as the number of layers increases, the extracted features become more complex and abstract, allowing CNNs to better capture different levels of information in the data.
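The growth of the receptive field with stacked small kernels, and the small parameter count that results from weight sharing, can be checked with a short calculation. The sketch below uses the standard recurrence r_out = r_in + (k - 1) * j, j_out = j * s (k: kernel size, s: stride, j: spacing between adjacent outputs measured in input samples). The helper names and the example layer stacks are illustrative assumptions, not the configuration of the embodiment.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field along one axis of a stack of (kernel, stride) layers."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump   # each layer widens the view by (k-1) jumps
        jump *= stride                 # strides and pooling spread the outputs apart
    return field


def conv2d_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a 2-D convolution; independent of the input size."""
    return kernel * kernel * in_channels * out_channels + out_channels


one_conv = receptive_field([(3, 1)])                     # a single 3x3 convolution
two_convs = receptive_field([(3, 1), (3, 1)])            # two stacked 3x3 convolutions
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])  # conv, 2x pooling, conv
```

Two stacked 3×3 convolutions cover a 5-sample span, and inserting a pooling layer of window 2 between them extends the span to 8, while the parameter count of each convolution depends only on the kernel size and channel counts.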

[0095] CTC (Connectionist Temporal Classification) is a loss function used for sequence labeling tasks. Unlike most other loss functions, it does not require the labels to be aligned with the predicted outputs in advance. First, in speech recognition tasks the input and output sequences typically differ in length and are not aligned, and CTC can handle label sequences and model output sequences of different lengths. Secondly, its goal is to learn a mapping from the model's output sequence to the label sequence, rather than an exact alignment. This allows the model to learn important features and patterns in the sequence regardless of specific time steps. Thirdly, CTC allows blank and repeated labels in the output, making the model more flexible and adaptable to different types of sequence data. Fourthly, it can be applied directly to the model's output, enabling end-to-end training. This means the entire model can be optimized through backpropagation without additional alignment steps or post-processing. CTC is widely used in sequence labeling tasks such as speech recognition, handwriting recognition, and text recognition, and achieves good results in these tasks. By minimizing the CTC loss function, the model progressively optimizes its parameters, thereby improving the accuracy of recognizing input speech features and generating the corresponding text labels.
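The role of blank and repeated labels can be made concrete with the standard CTC collapsing rule: consecutive repeats are merged first, then blanks are removed. The sketch below also shows greedy (best-path) decoding, which picks the highest-scoring label in each frame before collapsing. Using index 0 for the blank and the function names `ctc_collapse` and `ctc_greedy_decode` are assumptions for illustration; the embodiment does not state its decoding procedure.

```python
from typing import List, Sequence


def ctc_collapse(path: Sequence[int], blank: int = 0) -> List[int]:
    """Map a frame-level path to a label sequence: merge repeats, drop blanks."""
    labels: List[int] = []
    previous = None
    for label in path:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            labels.append(label)
        previous = label
    return labels


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      blank: int = 0) -> List[int]:
    """Best-path decoding: per-frame argmax followed by CTC collapsing."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    return ctc_collapse(best_path, blank)
```

For example, the six-frame path `[1, 1, 0, 1, 2, 2]` collapses to `[1, 1, 2]`: the blank between the two runs of label 1 is what allows a genuinely repeated label to survive the merge.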

[0096] The specific structure of the model is as follows: the input layer is a sequence of 200-dimensional feature values, and the maximum length of a speech sample is specified as 1600 (approximately 16 seconds); the model contains multiple convolutional and pooling layers to extract high-level features from the input data. The first hidden layer is a convolutional-pooling layer with a kernel size of 3×3 and a pooling window of 2. A batch normalization layer is added after each convolutional layer to accelerate training and improve model stability; the second hidden layer is a fully connected layer; the output layer is a fully connected layer using softmax as the activation function; the CTC layer uses the CTC loss as the loss function to achieve connectionist temporal multi-output.
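The input dimensions above imply some simple shape arithmetic, sketched below. Several values are assumptions rather than disclosed facts: a 10 ms frame shift (inferred only from 1600 frames corresponding to roughly 16 seconds), padding that preserves the size in each convolution, pooling applied to both the time and feature axes, and three pooling layers. The helper names are hypothetical.

```python
from typing import Tuple


def shape_after_pooling(shape: Tuple[int, int], n_pool_layers: int,
                        pool: int = 2) -> Tuple[int, int]:
    """(time, feature) size after n pooling layers, assuming size-preserving convs."""
    time_steps, features = shape
    for _ in range(n_pool_layers):
        time_steps //= pool
        features //= pool
    return time_steps, features


def duration_seconds(n_frames: int, hop_ms: int = 10) -> float:
    """Approximate duration covered by n_frames at an assumed frame shift."""
    return n_frames * hop_ms / 1000


def ctc_length_ok(output_steps: int, label_length: int) -> bool:
    """Necessary condition for CTC: at least one output step per label."""
    return output_steps >= label_length


output_shape = shape_after_pooling((1600, 200), n_pool_layers=3)
```

Under these assumptions a maximum-length input of 1600 × 200 is reduced to 200 × 25 before the fully connected layers, so a transcript of more than 200 labels could not be aligned by CTC; this is why the amount of temporal pooling has to be chosen with the longest label sequence in mind.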

[0097] The training process of the model used is as follows:

[0098] 1. Set the training batch size and form the batches, then run the CNN model for forward propagation to learn and extract representative features from the speech data;

[0099] 2. Perform backpropagation from output to input, calculate the gradient of the loss function with respect to each weight and bias, and update the weights and biases using the gradients calculated through backpropagation to reduce the value of the loss function;

[0100] 3. The CTC loss function is used to learn the mapping relationship from input speech features to output text labels, so as to guide the optimization and updating of model parameters;

[0101] 4. Repeat the above training and model parameter update process until the number of iterations reaches 50 epochs.
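The four training steps above follow the usual mini-batch gradient descent loop, whose structure can be sketched on a toy problem. To keep the sketch self-contained, a one-weight linear model and a mean squared error loss stand in for the CNN and the CTC loss; the data, learning rate, batch size, and the function name `train` are all illustrative assumptions. Only the loop structure and the 50-epoch stopping rule are taken from the text.

```python
from typing import List, Tuple


def train(data: List[Tuple[float, float]], batch_size: int = 4,
          epochs: int = 50, lr: float = 0.1) -> Tuple[float, float, List[float]]:
    """Fit y = w * x + b by mini-batch gradient descent; returns (w, b, loss history)."""
    w, b = 0.0, 0.0
    history: List[float] = []
    for _ in range(epochs):                              # step 4: repeat for 50 epochs
        epoch_loss = 0.0
        for start in range(0, len(data), batch_size):   # step 1: iterate over batches
            batch = data[start:start + batch_size]
            errors = [(w * x + b) - y for x, y in batch]            # forward pass
            epoch_loss += sum(e * e for e in errors)                # step 3: loss value
            # Step 2: gradients of the mean squared error w.r.t. w and b.
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            w -= lr * grad_w                                        # parameter update
            b -= lr * grad_b
        history.append(epoch_loss / len(data))
    return w, b, history


# Toy data drawn from y = 2x + 1 (purely illustrative).
toy_data = [(i / 8, 2 * (i / 8) + 1) for i in range(8)]
weight, bias, losses = train(toy_data)
```

In the actual model the forward pass, loss, and gradients would be handled by the deep learning framework; the sketch only makes explicit the order of forward propagation, loss evaluation, backpropagation, and update within each epoch.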

[0102] This embodiment establishes a 50-hour database containing speech and electroglottic information of 100 subjects (60 men and 40 women). In training and evaluating speech processing systems, such as speech recognition and speech synthesis, this database can better train and evaluate models, and improve the generalization ability, accuracy, robustness and reliability of the models.

[0103] This embodiment provides a dual-modal (speech and electroglottic) dataset resource, which can be used for speaker recognition and voiceprint recognition tasks. By analyzing the features of the two modalities, the recognition task can be performed more accurately.

[0104] In this embodiment, the dual-mode dataset can improve speech generation and conversion effects. By fusing and analyzing electroglottic features and speech features, the emotional state of the speech can be further altered, making the conversion effect more natural and fluent.

[0105] The above description is merely a preferred embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for generating a multimodal dataset of speech and electroglottic maps, characterized in that it includes:
acquiring raw multimodal data read aloud by several preset personnel based on preset text, wherein the raw multimodal data includes speech data and electroglottic diagram data;
wherein acquiring the raw multimodal data read aloud by several preset personnel based on preset text specifically includes: based on preset division rules, numbering and grouping each preset person and each preset text to obtain preset personnel groups and preset text groups with the same number of groups; matching each preset personnel group with each preset text group one by one according to the grouping order, each preset personnel group reading the text aloud according to a preset pause interval, to obtain several dual-channel audio files, the dual-channel audio files including a left-channel audio file and a right-channel electroglottic diagram file; segmenting each of the dual-channel audio files into text segments based on the preset pause interval to obtain several segmented texts; and naming each segmented text to obtain the raw multimodal data;
mixing the raw multimodal data to obtain mixed multimodal data under different signal-to-noise ratio conditions;
generating corresponding label information for each raw multimodal data and each mixed multimodal data, and storing each multimodal data and its corresponding label information based on a preset file storage structure to obtain a multimodal dataset;
storing the audio information of each mixed multimodal data and the corresponding raw multimodal data, specifically including: inputting each mixed multimodal data into a target speech extraction model for speech extraction to obtain the raw multimodal data corresponding to each mixed multimodal data; and storing the audio information of each mixed multimodal data and the corresponding raw multimodal data in CSV file format, wherein the target speech extraction model is constructed based on recurrent neural networks and gated recurrent networks;
verifying the usability of the multimodal dataset, specifically including: inputting each multimodal data in the multimodal dataset into a speech recognition model for classification and recognition to obtain predicted label information corresponding to each multimodal data; and comparing the predicted label information with the label information of the corresponding multimodal data to obtain a usability verification result;
wherein the speech recognition model is built based on a neural network model and includes an input layer, multiple convolutional layers, pooling layers, a first hidden layer, a second hidden layer, an output layer, and a CTC layer; the input layer is a multi-dimensional sequence of feature values, with a maximum length of 16 seconds for each speech sample; the multiple convolutional and pooling layers are used to extract high-level features from the input data; the first hidden layer is a convolutional-pooling layer, with a batch normalization layer added after each convolutional layer to accelerate training and improve model stability; the second hidden layer is a fully connected layer; the output layer uses softmax as the activation function; and the CTC layer uses the CTC loss as the loss function to achieve connectionist temporal multi-output.

2. The method for generating a speech and electroglottic multimodal dataset according to claim 1, characterized in that, while each preset personnel group reads the text aloud according to the preset pause interval, each dual-channel audio file is acquired by a preset audio acquisition device, which includes a microphone and an electroglottograph.

3. The method for generating a speech and electroglottic multimodal dataset according to claim 1, characterized in that, before text segmentation is performed on each of the dual-channel audio files based on the preset pause interval, the method further includes denoising each of the dual-channel audio files based on an empirical mode decomposition method to obtain denoised dual-channel audio files, and then performing text segmentation based on the denoised dual-channel audio files.

4. The method for generating a speech and electroglottic multimodal dataset according to claim 1, characterized in that segmenting each of the dual-channel audio files into text segments based on the preset pause interval specifically includes: designating audio segments whose volume is below a preset decibel threshold and whose duration exceeds the preset pause interval as silent segments; and extracting the audio segments that remain after removing the silent segments from each dual-channel audio file as the segmented texts.

5. The method for generating a speech and electroglottic multimodal dataset according to claim 1, characterized in that the mixing operation on the raw multimodal data includes: randomly selecting the raw multimodal data of two preset personnel; processing the selected raw multimodal data to unify the audio length and mixing them according to a set signal-to-noise ratio; and naming the resulting data to obtain the mixed multimodal data.