Multimodal speech separation method, training method, and related apparatuses
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- IFLYTEK CO LTD
- Filing Date
- 2022-09-15
- Publication Date
- 2026-06-26
AI Technical Summary
In multimodal speech separation, factors such as time deviation between modalities and occlusion of the speaker's lips reduce separation accuracy.
The training samples of the multimodal speech separation network are divided into multiple subsets according to their first loss, and the network is retrained using a different training method for each subset.
This improves the training speed of the multimodal speech separation network and the accuracy of speech separation from audio and video data.
Abstract
Description
Technical Field
[0001] This application relates to the field of speech recognition technology, and in particular to a multimodal speech separation method, a training method, and related apparatuses.
Background Technology
[0002] Human-computer interaction has developed from traditional touch interaction to voice interaction and now to multimodal interaction, and users have come to expect the efficiency, convenience, comfort, and security that these methods bring. Multimodal speech separation, one of the most important technologies in multimodal front-end systems, has become an active research topic in related fields. It distinguishes the speaker's voice from interfering speech by extracting the speaker's lip movements. However, factors such as time deviations between the multimodal signals and occlusion of the speaker's lips can easily degrade the accuracy of the separation results.
Summary of the Invention
[0003] The main technical problem addressed by this application is to provide a multimodal speech separation method, training method, and related apparatus that can improve the accuracy of multimodal speech separation.
[0004] To address the aforementioned technical problems, this application provides a multimodal speech separation method, comprising: obtaining audio and video data containing a target object; wherein the audio and video data includes lip video data of the target object; inputting the audio and video data into a trained multimodal speech separation network to obtain audio data related to the lip video data of the target object; wherein multiple training samples used to train the multimodal speech separation network are divided into multiple subsets based on a first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least a portion of the subsets.
[0005] To address the aforementioned technical problems, another technical solution adopted in this application is: providing a multimodal speech separation network training method, comprising: training a multimodal speech separation network using multiple first training samples, wherein each first training sample has a ground truth label; in response to the difference between the first losses of the multiple first training samples during training of the multimodal speech separation network being greater than or equal to a first threshold, dividing the multiple first training samples into multiple subsets based on the multiple first losses, wherein the difference between the first losses of the training samples within the same subset is less than the first threshold; and retraining the multimodal speech separation network based on at least a portion of the subsets.
[0006] To address the aforementioned technical problems, another technical solution adopted in this application is: providing a multimodal speech separation device, comprising: a first obtaining module for obtaining audio and video data containing a target object; wherein the audio and video data includes lip video data of the target object; a second obtaining module for inputting the audio and video data into a trained multimodal speech separation network to obtain audio data related to the lip video data of the target object; wherein multiple training samples for training the multimodal speech separation network are divided into multiple subsets based on a first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least a portion of the subsets.
[0007] To solve the above-mentioned technical problems, another technical solution adopted in this application is: to provide an electronic device, including a memory and a processor coupled to each other, wherein the memory stores program instructions, and the processor is used to execute the program instructions to implement the multimodal speech separation method or the multimodal speech separation network training method in the above-mentioned technical solution.
[0008] To solve the above-mentioned technical problems, another technical solution adopted in this application is: to provide a computer-readable storage medium storing program instructions that can be executed by a processor, wherein the program instructions are used to implement the multimodal speech separation method or the multimodal speech separation network training method in the above-mentioned technical solution.
[0009] The beneficial effects of this application are as follows: unlike the prior art, this application divides multiple training samples into multiple subsets based on the first losses of the training samples. A different training method is then used to train the multimodal speech separation network on each subset, thereby improving the training speed of the multimodal speech separation network. Furthermore, the trained multimodal speech separation network achieves high accuracy when separating speech from audio and video data containing the target object.
Description of the Drawings
[0010] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:
[0011] Figure 1 is a flowchart of one embodiment of the multimodal speech separation method of this application;
[0012] Figure 2 is a schematic diagram of the network structure of one embodiment of the multimodal speech separation network of Figure 1;
[0013] Figure 3 is a flowchart of one embodiment of the multimodal speech separation network training method of this application;
[0014] Figure 4 is a schematic structural diagram of one embodiment of the multimodal speech separation device of this application;
[0015] Figure 5 is a schematic structural diagram of one embodiment of the electronic device of this application;
[0016] Figure 6 is a schematic diagram of one embodiment of the computer-readable storage medium of this application.
Detailed Implementation
[0017] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0018] Please refer to Figure 1, which is a flowchart of one embodiment of the multimodal speech separation method of this application. The method includes:
[0019] S101: Obtain audio and video data containing the target object. The audio and video data includes lip video data of the target object.
[0020] Specifically, in this embodiment, the implementation process of step S101 may include: obtaining video data containing the target object and mixed audio data containing the target object's speech. The video data can be obtained by a camera capturing the target object, and simultaneously, the speech of the target object can be captured by a microphone to obtain the mixed audio data. However, since the environment in which the target object is located may contain people other than the target object, the mixed audio data may contain the target object's audio data, noise data, and audio data of non-target objects.
[0021] S102: Input the audio and video data into the trained multimodal speech separation network to obtain audio data related to the lip video data of the target object. Multiple training samples used to train the multimodal speech separation network are divided into multiple subsets based on the first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least some subsets.
[0022] Specifically, in this embodiment, the specific implementation process of step S102 can be as follows:
[0023] A: Obtain the first speech presence probability of the target object based on the video data, the mixed audio data, and the trained multimodal speech separation network.
[0024] Please refer to Figure 2, which is a schematic diagram of the network structure of one embodiment of the multimodal speech separation network of Figure 1. Specifically, the audio and video data of the target object are input into the trained multimodal speech separation network 10. The multimodal speech separation network 10 extracts lip features from the video data of the target object to obtain the lip video features of the target object. It also performs an FFT (Fast Fourier Transform) on the mixed audio data to obtain the amplitude spectrum and phase spectrum of the mixed audio data, and uses the amplitude spectrum and phase spectrum as the audio features corresponding to the mixed audio data. The lip video features and audio features are then input together into the Unet network 20, which includes a feature extraction branch and an upsampling branch. The feature extraction branch performs feature extraction multiple times on the video features and audio features, and fuses the video features and audio features after each extraction to obtain multiple fused features; the depth of the extracted features increases with each successive extraction. In this embodiment, after the multiple feature extractions, the final fused feature is input into a bottleneck layer to reduce the subsequent computational load of the model. The bottleneck layer can be an LSTM (Long Short-Term Memory) network. The fused feature output by the LSTM network is then upsampled in combination with the multiple fused features obtained from the feature extraction branch, and passed through a 1×1 convolution to obtain the first speech presence probability of the target object.
[0025] In one embodiment, lip features can be extracted from the video data of the target object using a lip feature extraction network. Specifically, the lip feature extraction network can be constructed using a residual neural network and trained on multiple labeled training datasets. The specific training process is not described in detail in this application.
[0026] B: Obtain speech data related to the lip video data of the target object from the mixed audio data based on the first speech presence probability. Specifically, multiply the first speech presence probability by the mixed audio data and perform an inverse short-time Fourier transform to obtain the separated speech data related to the target object.
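Step B can be illustrated with a minimal sketch. For brevity it processes a single frame with a naive DFT rather than a full short-time Fourier transform, scales each frequency bin's amplitude by its speech presence probability, and reuses the phase of the mixture. The function names (`dft`, `idft`, `separate_frame`) and the per-bin layout of the probability are assumptions made for illustration and are not taken from the application.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; keeps only the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def separate_frame(mixed_frame, presence_prob):
    """Scale each bin's amplitude by its presence probability, keep the mixture phase."""
    spectrum = dft(mixed_frame)
    masked = [
        cmath.rect(p * abs(c), cmath.phase(c))  # amplitude * probability, original phase
        for p, c in zip(presence_prob, spectrum)
    ]
    return idft(masked)
```

With a probability of 1 in every bin the mixture is returned unchanged, and with 0 in every bin the output is silent, which matches the intended role of the probability as a soft mask.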
[0027] This application divides multiple training samples into multiple subsets based on a first loss of multiple training samples. Different training methods are then used to train the multimodal speech separation network for each subset, thereby improving the training speed of the multimodal speech separation network. Furthermore, the trained multimodal speech separation network exhibits high accuracy in speech separation of audio and video data containing target objects.
[0028] Steps S101-S102 above mainly describe the application stage; the training process is described below. Please refer to Figure 3, which is a flowchart of one embodiment of the multimodal speech separation network training method of this application. The training process mainly includes:
[0029] S201: Train a multimodal speech separation network using multiple first training samples. Each first training sample has a ground truth label.
[0030] Specifically, the implementation process of step S201 includes: acquiring multiple first training samples with ground truth labels, where each ground truth label corresponds to the audio data of the target object, i.e., the speech of the target object.
[0031] In one embodiment, the first training sample includes video data and mixed audio data. The mixed audio data includes speech of the target object, speech of non-target objects, and noise. The process of obtaining the ground truth label for the first training sample includes: obtaining the sum of the energy of the target object's speech, the energy of the non-target object's speech, and the energy of the noise speech in the first training sample, and using the ratio of the energy of the target object's speech to the sum as the ground truth label. Specifically, the formula for calculating the ground truth label is as follows:
[0032] $$\mathrm{label} = \frac{s_1^2}{s_1^2 + s_2^2 + n^2}$$
[0033] Here, label denotes the ground truth label, $s_1^2$ denotes the energy of the speech of the sample object (the target object), $s_2^2$ denotes the energy of the speech of non-sample objects, and $n^2$ denotes the energy of the noise.
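The energy ratio described in [0031] can be sketched as follows. Computing energy as a sum of squared samples and returning 0 when all three energies are zero are assumptions made for this illustration; the helper names are hypothetical.

```python
def energy(samples):
    """Energy of a signal segment, taken here as the sum of squared samples (assumption)."""
    return sum(x * x for x in samples)


def ground_truth_label(target_energy, interferer_energy, noise_energy):
    """Ratio of target speech energy to total energy of target, interferer, and noise."""
    total = target_energy + interferer_energy + noise_energy
    if total == 0.0:
        return 0.0  # assumption: a silent segment contains no target speech
    return target_energy / total
```

For example, if the target speech carries as much energy as the interfering speech and noise combined, the label is 0.5.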
[0034] Furthermore, the multimodal speech separation network shown in Figure 2 is constructed and initially trained using the multiple first training samples. The training process includes: inputting the multiple first training samples into the constructed multimodal speech separation network to obtain the second speech presence probability corresponding to each first training sample; obtaining the first loss of each first training sample using a first loss function; and adjusting the parameters of the multimodal speech separation network using the first loss. Specifically, the first loss function, Loss, can be a mean squared error loss function, as shown below:
[0035] $$\mathrm{Loss} = \sum (\mathrm{mask} - \mathrm{label})^2$$
[0036] Here, mask denotes the second speech presence probability, and label denotes the ground truth label of the first training sample.
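A minimal sketch of the first loss for one training sample is given here. It assumes that the predicted probability and the label are sequences of equal length and that the sum runs over their elements; the application does not state the range of the summation.

```python
def first_loss(mask, label):
    """Sum of squared differences between predicted presence probability and label."""
    if len(mask) != len(label):
        raise ValueError("mask and label must have the same length")
    return sum((m - l) ** 2 for m, l in zip(mask, label))
```

A prediction that matches the label exactly yields a loss of zero, and the loss grows quadratically with the deviation in each element.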
[0037] S202: In response to the fact that the difference between the first losses of multiple first training samples when training the multimodal speech separation network is greater than or equal to a first threshold, the multiple first training samples are divided into multiple subsets based on the multiple first losses. Among them, the difference between the first losses of multiple training samples within the same subset is less than the first threshold.
[0038] Specifically, step S202 includes the following: in response to the difference between the first losses of the multiple first training samples in step S201 being greater than or equal to the first threshold, the multiple first training samples are divided into multiple subsets based on the multiple first losses. Specifically, the difference between every pair of first losses is computed. If any difference is greater than or equal to the first threshold, the first losses obtained by passing the multiple first training samples and their corresponding ground truth labels through the multimodal speech separation network are considered to differ widely; that is, some first training samples have large first losses while others have small first losses. Training the multimodal speech separation network directly on first training samples with large first losses and their corresponding ground truth labels may easily impair the reliability of the network. Therefore, the multiple first training samples need to be divided into multiple subsets such that the difference between the first losses of the first training samples within the same subset is less than the first threshold. The first threshold can be obtained through repeated experiments or estimated by relevant researchers.
[0039] In other embodiments, to avoid data randomness, if the number of differences between all first losses that are greater than or equal to a first threshold exceeds a preset number, then the multiple first training samples are divided into multiple subsets, and the difference between the first losses of multiple first training samples within the same subset is less than the first threshold.
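The two trigger conditions above (any pairwise difference reaching the first threshold, or more than a preset number of such differences) can be sketched with one function. The name `needs_partition` and the parameter `preset_number` are hypothetical; setting `preset_number` to 0 reproduces the "any difference" rule.

```python
from itertools import combinations


def needs_partition(losses, first_threshold, preset_number=0):
    """Return True when more than `preset_number` pairwise loss differences
    are greater than or equal to `first_threshold`."""
    large_differences = sum(
        1 for a, b in combinations(losses, 2) if abs(a - b) >= first_threshold
    )
    return large_differences > preset_number
```

Counting pairs costs quadratic time in the number of samples; for the "any difference" rule alone, comparing the largest and smallest loss would give the same answer in linear time.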
[0040] In one embodiment, the step of dividing multiple first training samples into multiple subsets based on multiple first losses includes: dividing the multiple first training samples into a first subset and a second subset based on multiple first losses.
[0041] Optionally, the first loss corresponding to each first training sample is compared with a third threshold to divide the multiple first training samples into a first subset and a second subset. Wherein, the first loss of multiple first training samples in the first subset is less than the third threshold, and the first loss of multiple first training samples in the second subset is greater than or equal to the third threshold. In this embodiment, the aforementioned third threshold can be obtained through back-calculation from multiple trials, or it can be estimated by those skilled in the art.
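The threshold-based split can be sketched as follows, with samples whose first loss is below the third threshold going to the first subset and all others to the second subset. The function name and the representation of samples as arbitrary Python objects are assumptions.

```python
def split_by_threshold(samples, losses, third_threshold):
    """Partition samples into (first_subset, second_subset) by their first loss."""
    first_subset, second_subset = [], []
    for sample, loss in zip(samples, losses):
        if loss < third_threshold:
            first_subset.append(sample)   # small loss: reused directly for supervised training
        else:
            second_subset.append(sample)  # large loss: used to construct second training samples
    return first_subset, second_subset
```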
[0042] In another embodiment, multiple first training samples can be divided into a first subset and a second subset by constructing a Gaussian Mixture Model (GMM). Specifically, the first loss corresponding to all first training samples is input into the constructed GMM to classify the multiple first losses into multiple categories and obtain a Gaussian distribution histogram for each category. Further, for each category's Gaussian distribution histogram, all first training samples corresponding to first losses less than a threshold value in that category are designated as the first subset, and all first training samples corresponding to first losses greater than a threshold value in that category are designated as the second subset. That is, the first losses corresponding to multiple first training samples in the first subset are smaller, and the first losses corresponding to multiple first training samples in the second subset are larger. In this embodiment, the threshold value can be determined based on the mean and variance of the first losses corresponding to multiple first training samples. Dividing multiple first training samples into multiple first subsets and multiple second subsets facilitates the determination of different training methods for training the multimodal speech separation network based on different datasets, thereby improving the separation performance of the multimodal speech separation network. In addition, after obtaining multiple first subsets and multiple second subsets, the multiple first subsets can be merged into one first subset, and the multiple second subsets can be merged into one second subset.
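A simplified sketch of the GMM-based split follows. It fits a two-component one-dimensional Gaussian mixture to the first losses with plain expectation-maximization and assigns each sample to the component that explains its loss better; samples assigned to the lower-mean component form the first subset. This is a simplification made for illustration: the application describes a per-category threshold derived from the mean and variance, whereas this sketch uses the component assignment directly. All names and the choice of two components are assumptions.

```python
import math


def weighted_density(x, weight, mean, variance):
    """Mixture weight times the Gaussian density at x."""
    return weight * math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def fit_two_gaussians(losses, iterations=50):
    """Fit a 2-component 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]  # initialise the components at the extremes
    spread = max((hi - lo) ** 2 / 4.0, 1e-6)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each loss value
        resp = []
        for x in losses:
            dens = [weighted_density(x, w, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


def split_by_gmm(samples, losses):
    """Assign samples to the low-loss (first) or high-loss (second) subset."""
    weights, means, variances = fit_two_gaussians(losses)
    low = 0 if means[0] <= means[1] else 1
    high = 1 - low
    first_subset, second_subset = [], []
    for sample, x in zip(samples, losses):
        d_low = weighted_density(x, weights[low], means[low], variances[low])
        d_high = weighted_density(x, weights[high], means[high], variances[high])
        (first_subset if d_low >= d_high else second_subset).append(sample)
    return first_subset, second_subset
```

Compared with a fixed third threshold, this approach adapts the split point to the observed loss distribution, at the cost of an iterative fit.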
[0043] S203: Retrain the multimodal speech separation network based on at least a portion of the subsets.
[0044] Specifically, the implementation process of step S203 includes the following steps:
[0045] A: Re-input at least a portion of the subsets into the multimodal speech separation network to retrain it. Specifically, for subsets in which the first losses corresponding to all first training samples are small, the first training samples in these subsets and their corresponding ground truth labels are re-input into the multimodal speech separation network for supervised training.
[0046] In one embodiment, in response to dividing the plurality of first training samples into a first subset and a second subset in step S202, at least a portion of the first training samples in the first subset and their corresponding ground truth labels are input into a multimodal speech separation network. Lip video features are obtained based on the video data in the first training samples, and audio features are obtained based on the mixed audio data in the first training samples. Further, the lip video features and audio features are input into a Unet network to obtain the second speech presence probability of the target object in the first training samples. A first loss corresponding to the first training samples is obtained using the second speech presence probability and the ground truth labels corresponding to the first training samples, and the parameters in the multimodal speech separation network are adjusted based on the first loss.
[0047] Optionally, the first training samples in the first subset can be augmented, and the augmented first training samples and their corresponding ground truth labels can be input into the multimodal speech separation network. The ground truth labels of the augmented first training samples are the same as those of the original first training samples. Furthermore, the step of augmenting the first training samples may include: processing the video data corresponding to the first training samples by inverting, shifting, adding noise, and adjusting brightness; and processing the mixed audio data corresponding to the first training samples by inserting noise and adjusting speed.
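The augmentation described above can be sketched on toy data. Here a video is a list of frames, each frame a list of pixel rows with values in 0 to 255, and the audio is a list of floats. Interpreting "inverting" as a horizontal flip, the brightness offset of 10, and the noise scale of 0.01 are all assumptions chosen for illustration; the ground truth label is passed through unchanged, as the text requires.

```python
import random


def flip_frame(frame):
    """Mirror one video frame (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in frame]


def adjust_brightness(frame, delta):
    """Add `delta` to every pixel and clip the result to the range 0..255."""
    return [[min(255, max(0, pixel + delta)) for pixel in row] for row in frame]


def add_noise(audio, scale, rng):
    """Insert zero-mean Gaussian noise with standard deviation `scale`."""
    return [x + rng.gauss(0.0, scale) for x in audio]


def augment(sample, label, rng):
    """Return an augmented (video, audio) sample together with its unchanged label."""
    video, audio = sample
    new_video = [adjust_brightness(flip_frame(frame), 10) for frame in video]
    new_audio = add_noise(audio, 0.01, rng)
    return (new_video, new_audio), label
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible between training runs.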
[0048] B: Construct multiple second training samples using the first training samples in at least a portion of the subsets. Specifically, for subsets in which the first losses corresponding to all first training samples are greater than the third threshold, any two first training samples and their corresponding ground truth labels are taken from these subsets and fused to obtain a second training sample and its corresponding ground truth label. The ground truth label of a second training sample is determined by the ground truth labels of the first training samples used to construct it.
[0049] In one embodiment, in response to step S202 dividing multiple first training samples into a first subset and a second subset, multiple second training samples can be constructed using the multiple first training samples in the second subset. Specifically, data augmentation can be performed on the multiple first training samples in the second subset, and second training samples and corresponding ground truth labels can be obtained based on the data-augmented first training samples. Optionally, any two first training samples and their corresponding ground truth labels can be fused from the second subset to obtain fused training samples and fused labels, and the fused training samples can be used as second training samples, and the fused labels can be used as ground truth labels for the second training samples. Before mixing any two first training samples and their corresponding ground truth labels, data augmentation is performed on the selected first training samples. For example, the video data corresponding to the first training samples can be processed by inversion, translation, noise addition, brightness adjustment, etc., and the mixed audio data corresponding to the first training samples can be processed by noise insertion, speed adjustment, etc. Of course, in other embodiments, other numbers of first training samples and ground truth labels can also be selected for fusion to obtain second training samples and their corresponding ground truth labels. In this embodiment, multiple first training samples can be processed using the MixMatch algorithm to obtain second training samples and their corresponding ground truth labels. The specific process will not be described in detail here.
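The application does not specify the fusion rule. One plausible reading, sketched here, is a mixup-style convex combination of two samples and of their labels, which is one ingredient of the MixMatch algorithm. The mixing coefficient `lam`, the flat list representation of samples and labels, and the function name are assumptions.

```python
def fuse(sample_a, label_a, sample_b, label_b, lam=0.5):
    """Blend two samples and their labels with weight `lam` on the first one."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    fused_sample = [lam * a + (1.0 - lam) * b for a, b in zip(sample_a, sample_b)]
    fused_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return fused_sample, fused_label
```

Because the same coefficient is applied to samples and labels, the fused label remains determined by the labels of the two source samples, consistent with the requirement stated for second training samples.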
[0050] Alternatively, the MixMatch algorithm can be used to predict the first training sample in the second subset based on the first training sample in the first subset and the corresponding ground truth label, so as to obtain the predicted label, and use the predicted label as the ground truth label of the first training sample in the second subset.
[0051] Furthermore, after the multiple second training samples are obtained, they and their corresponding ground truth labels are input into the multimodal speech separation network to retrain the network. Specifically, the third speech presence probability corresponding to each input second training sample is obtained, a third loss is computed from the third speech presence probability and the corresponding ground truth label, and the parameters of the multimodal speech separation network are adjusted based on the third loss.
[0052] It should be noted that step S203 may include steps A and B, that is, step A is executed first and then step B is executed, or step S203 may only include step A or step B.
[0053] In one embodiment, after the multiple first training samples are divided into multiple subsets and the multimodal speech separation network is retrained based on at least a portion of the subsets, multiple new first training samples are acquired, and the multimodal speech separation network is trained again using the newly acquired first training samples. Each newly acquired first training sample also has a corresponding ground truth label. Further, in response to the difference between the first losses of the newly acquired first training samples during training being greater than or equal to the first threshold, the newly acquired first training samples are divided into multiple subsets based on their first losses, and the multimodal speech separation network is retrained based on at least a portion of the subsets. For the specific process, see steps S202-S203.
[0054] In another embodiment, in response to the difference between the first losses of the newly acquired first training samples during training being less than the first threshold, the first training samples are subjected to offset processing to obtain corresponding third training samples. Each third training sample has the same ground truth label as the first training sample from which it is derived. Specifically, at least a portion of the first training samples and their corresponding ground truth labels are selected from the original first training samples and the newly acquired first training samples. The video data and audio data of each selected first training sample are shifted relative to each other by a preset time, and the shifted video and audio data are used as a third training sample.
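The offset processing can be sketched by shifting the audio track relative to the video while keeping its length fixed. Expressing the preset time as a whole number of audio samples, padding with zeros, and the function names are assumptions made for this illustration.

```python
def shift_audio(audio, offset):
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples,
    zero-padding so that the length is unchanged."""
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def make_third_samples(video, audio, label, offsets):
    """Build one third training sample per preset offset; the label is reused as-is."""
    return [((video, shift_audio(audio, t)), label, t) for t in offsets]
```

Each third training sample keeps the label of its source sample, so the network is pushed to produce the same output even when the two modalities are slightly misaligned.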
[0055] Optionally, when using the GMM model to partition multiple first training samples, the parameters of the GMM model iterate continuously during the partitioning process of datasets composed of different first losses. This leads to the GMM model ultimately being unable to distinguish between the first losses, meaning it cannot partition multiple first training samples into multiple subsets based on the input first losses. Therefore, multiple newly acquired first training samples are input into the multimodal speech separation network to obtain the first losses corresponding to the newly acquired first training samples. Furthermore, after inputting multiple first losses into the GMM model, all first losses are less than a threshold value, making it impossible to partition the multiple first losses using the threshold value. Therefore, this implementation method performs offset processing on the multiple first training samples to obtain corresponding third training samples.
[0056] Further, the multimodal speech separation network is trained using the multiple third training samples. Specifically, the third training samples are input into the multimodal speech separation network to obtain the fourth speech presence probability corresponding to each third training sample, and a second loss function is used to obtain the second loss between the fourth speech presence probability and the corresponding ground truth label. In response to the difference between the second losses of the multiple third training samples during training being greater than or equal to a second threshold, the multiple third training samples are divided into multiple sets based on the multiple second losses, wherein the difference between the second losses of the third training samples within the same set is less than the second threshold. Further, the multimodal speech separation network is trained again based on at least a portion of the sets. For the process of training the multimodal speech separation network using the multiple third training samples, see steps S201-S203.
[0057] Furthermore, the second loss function Loss′ mentioned above is as follows:
[0058] $$\mathrm{Loss}' = \sum_i \alpha_i \left(\mathrm{mask}(t_i) - \mathrm{label}\right)^2$$
[0059] Here, $t_i$ denotes the preset time by which the video and audio data are offset relative to each other, $\mathrm{mask}(t_i)$ denotes the fourth speech presence probability of the third training sample, and $\alpha_i$ denotes the weighting coefficient, which can be obtained through repeated experiments or estimated by relevant researchers.
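The weighted second loss can be sketched as a sum over preset offsets, where each offset contributes its squared error scaled by its weighting coefficient. Treating each mask and the label as sequences and summing the squared error over their elements is an assumption, as are the function and parameter names.

```python
def second_loss(masks_by_offset, label, alphas):
    """Weighted sum over offsets of the squared error between mask(t_i) and the label."""
    if len(masks_by_offset) != len(alphas):
        raise ValueError("one weighting coefficient is needed per offset")
    return sum(
        alpha * sum((m - l) ** 2 for m, l in zip(mask, label))
        for alpha, mask in zip(alphas, masks_by_offset)
    )
```

Choosing smaller coefficients for larger offsets would make the network less sensitive to errors on heavily misaligned samples, although the application leaves the choice of coefficients to experiment.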
[0060] Alternatively, in other embodiments, the process of training the multimodal speech separation network using the multiple third training samples can also be as follows: the multiple third training samples are input into the multimodal speech separation network to obtain the fourth speech presence probability corresponding to each third training sample, the second loss corresponding to each third training sample is obtained using the aforementioned second loss function, and the parameters of the multimodal speech separation network are adjusted using the second loss.
[0061] Furthermore, in response to the convergence of the loss obtained during the training of the multimodal speech separation network using the third training sample, or the completion of the preset number of training rounds, the training of the multimodal speech separation network is stopped. The multimodal speech separation network training method proposed in this application improves the training speed of the multimodal speech separation network and makes the trained multimodal speech separation network more robust by dividing multiple training samples into different training sample subsets and formulating corresponding training methods for different training sample subsets.
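The stopping rule (loss convergence or a preset number of training rounds) can be sketched as a small driver loop. Defining convergence as a change in loss smaller than `tolerance` between consecutive rounds is an assumption, and `step_fn` is a hypothetical callable that performs one training round and returns its loss.

```python
def train_until_done(step_fn, max_rounds, tolerance=1e-4):
    """Run training rounds until the loss stops changing or `max_rounds` is reached.
    Returns (rounds_run, last_loss)."""
    previous = None
    for round_index in range(1, max_rounds + 1):
        loss = step_fn()
        if previous is not None and abs(previous - loss) < tolerance:
            return round_index, loss  # converged
        previous = loss
    return max_rounds, previous  # preset number of rounds completed
```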
[0062] Please refer to Figure 4, which is a schematic structural diagram of one embodiment of the multimodal speech separation device of this application. The multimodal speech separation device includes a first acquisition module 40 and a second acquisition module 50.
[0063] Specifically, the first acquisition module 40 is used to obtain audio and video data containing the target object. The audio and video data includes lip video data of the target object.
[0064] The second acquisition module 50 is used to input audio and video data into the trained multimodal speech separation network to obtain audio data related to the lip video data of the target object. The training samples used to train the multimodal speech separation network are divided into multiple subsets based on the first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least some subsets.
[0065] In another implementation, still referring to Figure 4, the multimodal speech separation device may further include a training module 60, which is connected to the second acquisition module 50. The training module 60 includes a first training submodule, a partitioning module, and a second training submodule. The first training submodule is used to train the multimodal speech separation network using multiple first training samples. Each first training sample has a ground truth label.
[0066] The partitioning module is used to divide the multiple first training samples into multiple subsets based on the multiple first losses when the difference between the first losses of multiple first training samples during the training of the multimodal speech separation network is greater than or equal to a first threshold. Specifically, the difference between the first losses of multiple first training samples within the same subset is less than the first threshold.
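As a rough illustration of what the partitioning module does, the sketch below groups sample indices by their first loss. It rests on two assumptions that the description does not spell out: the "difference between the first losses" is read as the spread between the largest and smallest loss, and the grouping is done greedily over the sorted losses. The names `loss_spread` and `partition_by_loss` are hypothetical.

```python
from typing import List, Sequence


def loss_spread(losses: Sequence[float]) -> float:
    """Largest pairwise difference between per-sample losses (max - min)."""
    return max(losses) - min(losses)


def partition_by_loss(losses: Sequence[float], threshold: float) -> List[List[int]]:
    """Group sample indices so that losses within a group differ by < threshold.

    Greedy sketch: visit samples in order of increasing loss and open a new
    group whenever the current loss is at least `threshold` above the
    smallest loss already in the current group.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    groups: List[List[int]] = []
    group_min = 0.0
    for i in order:
        if not groups or losses[i] - group_min >= threshold:
            groups.append([i])
            group_min = losses[i]
        else:
            groups[-1].append(i)
    return groups


# Illustrative values only.
first_losses = [0.10, 0.12, 0.55, 0.60, 1.40]
first_threshold = 0.3
if loss_spread(first_losses) >= first_threshold:
    subsets = partition_by_loss(first_losses, first_threshold)
else:
    subsets = [list(range(len(first_losses)))]  # no partitioning needed
```

Because each group is anchored to its smallest loss, every pair of samples in the same group differs by less than the threshold, which matches the within-subset condition stated above.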
[0067] The second training submodule is used to retrain the multimodal speech separation network based on at least some of the subsets.
[0068] In one embodiment, the sum of the energy of the sample object speech, the energy of the non-sample object speech, and the energy of the noisy speech in the first training sample is obtained, and the ratio of the energy of the sample object speech to the above sum is used as the true value label.
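The label computation in this embodiment can be sketched as follows. The sketch assumes that the three component signals are available separately as lists of samples and that energy means the sum of squared sample values; the description does not state whether the ratio is computed per sample or per frame, and the handling of an all-silent input is also an assumption. The function names are hypothetical.

```python
from typing import Sequence


def energy(signal: Sequence[float]) -> float:
    """Signal energy, taken here as the sum of squared sample values."""
    return sum(x * x for x in signal)


def ground_truth_label(target: Sequence[float],
                       interference: Sequence[float],
                       noise: Sequence[float]) -> float:
    """Ratio of sample-object speech energy to the total energy."""
    e_target = energy(target)
    total = e_target + energy(interference) + energy(noise)
    if total == 0.0:
        return 0.0  # assumption: an all-silent sample is labelled 0
    return e_target / total


# Illustrative values only: energies are 2, 1 and 1, so the label is 2 / 4.
label = ground_truth_label([1.0, -1.0], [1.0, 0.0], [0.0, 1.0])
```

The label always lies between 0 and 1, and it approaches 1 as the sample object's speech dominates the mixture.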
[0069] In one embodiment, at least some of the subsets are re-input into the multimodal speech separation network to retrain the multimodal speech separation network; and/or multiple second training samples are constructed using multiple first training samples from at least some of the subsets, and the multiple second training samples are input into the multimodal speech separation network to retrain the multimodal speech separation network; wherein the ground truth labels of the second training samples are determined by the ground truth labels of the first training samples used to construct the second training samples.
[0070] In one application scenario, the multiple first training samples are divided into a first subset and a second subset based on the multiple first losses, where the first losses of the first training samples in the first subset are less than a threshold value, and the first losses of the first training samples in the second subset are greater than or equal to the threshold value. The step of re-inputting at least some of the subsets into the multimodal speech separation network to retrain the network includes: inputting the first subset into the multimodal speech separation network, obtaining the first loss corresponding to each first training sample in the first subset, and adjusting the parameters of the multimodal speech separation network according to the first losses. The step of constructing multiple second training samples using multiple first training samples from at least some of the subsets includes: obtaining a second training sample and its corresponding ground truth label based on any two first training samples in the second subset and their corresponding ground truth labels.
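The two-subset scenario can be sketched as below. The threshold split follows the text directly. The rule for combining two first training samples is not given in this passage, so it is passed in as a caller-supplied function; `average_pair` is purely a placeholder assumption, as are the `Sample` layout and the other names.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample layout: (waveform, ground truth label).
Sample = Tuple[List[float], float]


def split_by_threshold(losses: Sequence[float],
                       threshold: float) -> Tuple[List[int], List[int]]:
    """First subset: loss < threshold. Second subset: loss >= threshold."""
    first = [i for i, loss in enumerate(losses) if loss < threshold]
    second = [i for i, loss in enumerate(losses) if loss >= threshold]
    return first, second


def build_second_samples(samples: Sequence[Sample], second_subset: Sequence[int],
                         combine: Callable[[Sample, Sample], Sample]) -> List[Sample]:
    """Build one second training sample from every pair in the second subset."""
    return [combine(samples[a], samples[b])
            for a, b in combinations(second_subset, 2)]


def average_pair(a: Sample, b: Sample) -> Sample:
    """Placeholder combination rule (an assumption, not taken from the source)."""
    wave = [(x + y) / 2 for x, y in zip(a[0], b[0])]
    return wave, (a[1] + b[1]) / 2


# Illustrative values only.
samples = [([0.0, 0.25], 1.0), ([0.5, 0.0], 0.5), ([0.0, 1.0], 0.25)]
losses = [0.1, 0.7, 0.9]
first_subset, second_subset = split_by_threshold(losses, 0.5)
second_samples = build_second_samples(samples, second_subset, average_pair)
```

Under this sketch the first subset would be fed back to the network unchanged, while the second subset contributes newly constructed samples whose labels derive from the labels of the pair used to build them.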
[0071] In one embodiment, in response to the difference between the multiple first losses being less than the first threshold, the multiple first training samples are shifted to obtain corresponding third training samples. Each third training sample has the same ground truth label as its corresponding first training sample. The multimodal speech separation network is trained using these multiple third training samples.
[0072] Furthermore, in response to the difference between the second losses of the multiple third training samples during the training of the multimodal speech separation network being greater than or equal to a second threshold, the multiple third training samples are divided into multiple sets based on the multiple second losses. The difference between the second losses of the third training samples within the same set is less than the second threshold. The multimodal speech separation network is then retrained based on at least a portion of these sets.
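The construction of a third training sample can be sketched as follows. This passage does not say what is shifted or by how much; the sketch assumes a time shift of the audio track relative to the lip video, with zero padding, and a dictionary layout with the hypothetical keys `audio`, `video`, and `label`.

```python
from typing import Dict, List, Sequence


def shift_audio(audio: Sequence[float], offset: int) -> List[float]:
    """Shift audio by `offset` samples, zero-padding to keep the length.

    A positive offset delays the audio; a negative offset advances it.
    """
    n = len(audio)
    if offset == 0:
        return list(audio)
    if abs(offset) >= n:
        return [0.0] * n
    if offset > 0:
        return [0.0] * offset + list(audio[:n - offset])
    return list(audio[-offset:]) + [0.0] * (-offset)


def make_third_sample(first_sample: Dict, offset: int) -> Dict:
    """Shifted copy of a first training sample; the label is carried over."""
    return {
        "audio": shift_audio(first_sample["audio"], offset),
        "video": first_sample["video"],   # assumption: video left unchanged
        "label": first_sample["label"],   # same ground truth label
    }


# Illustrative values only.
first = {"audio": [1.0, 2.0, 3.0, 4.0], "video": ["f0", "f1"], "label": 0.5}
third = make_third_sample(first, 1)
```

The only property taken from the description is that the third sample keeps the ground truth label of the first sample it was derived from.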
[0073] Please refer to Figure 5, which is a schematic diagram of the structure of an embodiment of the electronic device of this application. The electronic device includes a memory 70 and a processor 80 coupled to each other. The memory 70 stores program instructions, and the processor 80 is used to execute the program instructions to implement the steps of the multimodal speech separation method and the multimodal speech separation network training method described in the above embodiments. Specifically, the electronic device includes, but is not limited to, desktop computers, laptops, tablets, and servers. In addition, the processor 80 can also be called a CPU (Central Processing Unit). The processor 80 may be an integrated circuit chip with signal processing capabilities. The processor 80 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The general-purpose processor can be a microprocessor or any conventional processor. In addition, the processor 80 can be implemented by integrated circuit chips.
[0074] Please refer to Figure 6, which is a schematic diagram of an embodiment of the computer-readable storage medium proposed in this application. The computer-readable storage medium 90 stores program instructions 95 that can be executed by a processor. The program instructions 95 are used to implement the multimodal speech separation method and the training method of the multimodal speech separation network in any of the above embodiments.
[0075] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A multimodal speech separation method, characterized in that it comprises: obtaining audio and video data containing a target object, wherein the audio and video data includes lip video data of the target object; and inputting the audio and video data into a trained multimodal speech separation network to obtain audio data related to the lip video data of the target object; wherein multiple training samples for training the multimodal speech separation network are divided into multiple subsets based on the first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least a portion of the subsets; and wherein the training process of the multimodal speech separation network includes: training the multimodal speech separation network using multiple first training samples, wherein each first training sample has a ground truth label; in response to the difference between the first losses of the multiple first training samples during training of the multimodal speech separation network being greater than or equal to a first threshold, dividing the multiple first training samples into multiple subsets based on the multiple first losses, wherein the difference between the first losses of the multiple first training samples within the same subset is less than the first threshold; and retraining the multimodal speech separation network based on at least a portion of the subsets.
2. The method according to claim 1, characterized in that the step of retraining the multimodal speech separation network based on at least a portion of the subsets includes: re-inputting at least a portion of the subsets into the multimodal speech separation network to retrain the multimodal speech separation network; and/or constructing multiple second training samples using the first training samples in at least a portion of the subsets, and inputting the multiple second training samples into the multimodal speech separation network to retrain the multimodal speech separation network; wherein the ground truth labels of the second training samples are determined by the ground truth labels of the first training samples used to construct the second training samples.
3. The method according to claim 2, characterized in that the step of dividing the multiple first training samples into multiple subsets based on the multiple first losses includes: dividing the multiple first training samples into a first subset and a second subset based on the multiple first losses, wherein the first losses of the first training samples in the first subset are less than a threshold value, and the first losses of the first training samples in the second subset are greater than or equal to the threshold value; the step of re-inputting at least a portion of the subsets into the multimodal speech separation network to retrain the multimodal speech separation network includes: inputting the first subset into the multimodal speech separation network, obtaining a first loss corresponding to each first training sample in the first subset, and adjusting the parameters of the multimodal speech separation network according to the first loss; and the step of constructing multiple second training samples using the first training samples in at least a portion of the subsets includes: obtaining a second training sample and its corresponding ground truth label based on any two first training samples in the second subset and their corresponding ground truth labels.
4. The method according to claim 1, characterized in that, following the step of retraining the multimodal speech separation network based on at least a portion of the subsets, the method further includes: in response to the difference between the multiple first losses being less than the first threshold, shifting the multiple first training samples to obtain corresponding third training samples, wherein each third training sample has the same ground truth label as its corresponding first training sample; and training the multimodal speech separation network using the multiple third training samples.
5. The method according to claim 4, characterized in that, after the step of training the multimodal speech separation network using the multiple third training samples, the method includes: in response to the difference between the second losses of the multiple third training samples during training of the multimodal speech separation network being greater than or equal to a second threshold, dividing the multiple third training samples into multiple sets based on the multiple second losses, wherein the difference between the second losses of the third training samples within the same set is less than the second threshold; and retraining the multimodal speech separation network based on at least a portion of the sets.
6. The method according to claim 1, characterized in that it further comprises: obtaining the sum of the energy of the sample object speech, the energy of the non-sample object speech, and the energy of the noise speech in the first training sample, and using the ratio of the energy of the sample object speech to the sum as the ground truth label.
7. A method for training a multimodal speech separation network, characterized in that it comprises: training a multimodal speech separation network using multiple first training samples, wherein each first training sample has a ground truth label; in response to the difference between the first losses of the multiple first training samples during training of the multimodal speech separation network being greater than or equal to a first threshold, dividing the multiple first training samples into multiple subsets based on the multiple first losses, wherein the difference between the first losses of the first training samples within the same subset is less than the first threshold; and retraining the multimodal speech separation network based on at least a portion of the subsets.
8. A multimodal speech separation device, characterized in that it comprises: a first acquisition module, used to obtain audio and video data containing a target object, wherein the audio and video data includes lip video data of the target object; and a second acquisition module, used to input the audio and video data into a trained multimodal speech separation network to obtain audio data related to the lip video data of the target object; wherein multiple training samples for training the multimodal speech separation network are divided into multiple subsets based on the first loss obtained after passing through the multimodal speech separation network, and the multimodal speech separation network is retrained based on at least a portion of the subsets; and wherein the training process of the multimodal speech separation network includes: training the multimodal speech separation network using multiple first training samples, wherein each first training sample has a ground truth label; in response to the difference between the first losses of the multiple first training samples during training of the multimodal speech separation network being greater than or equal to a first threshold, dividing the multiple first training samples into multiple subsets based on the multiple first losses, wherein the difference between the first losses of the multiple first training samples within the same subset is less than the first threshold; and retraining the multimodal speech separation network based on at least a portion of the subsets.
9. An electronic device, characterized in that it comprises a memory and a processor coupled to each other, wherein the memory stores program instructions and the processor executes the program instructions to implement the multimodal speech separation method according to any one of claims 1-6 or the multimodal speech separation network training method according to claim 7.
10. A computer-readable storage medium, characterized in that it stores program instructions that can be executed by a processor, the program instructions being used to implement the multimodal speech separation method according to any one of claims 1-6 or the multimodal speech separation network training method according to claim 7.