Packet loss compensation model training method, packet loss compensation method and device
By optimizing the loss function of the packet loss compensation model and combining time-domain and frequency-domain signal features, the problem of envelope distortion caused by error accumulation in speech packet loss compensation is solved, thereby improving speech quality and the accuracy of automatic speech recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN UNIV
- Filing Date
- 2023-02-01
- Publication Date
- 2026-06-23
AI Technical Summary
Existing speech packet loss compensation technologies are prone to error accumulation under high packet loss rates or continuous packet loss. This distorts the envelope of the reconstructed signal, causes abnormalities such as muting and unusual noises, and reduces speech quality and the accuracy of automatic speech recognition.
The loss function of the packet loss compensation model is optimized by combining signal features in the time domain and the frequency domain to compute the error between the target signal and the reconstructed signal. The model parameters are then optimized by backpropagation to strengthen envelope reconstruction and reduce noise interference and error accumulation.
It improves the quality and intelligibility of the voice compensation signal, enhances the accuracy of automatic speech recognition, avoids abnormal reconstruction signal envelope, and improves user experience and system performance.
Figure CN116312571B_ABST
Description
Technical Field
[0001] This invention relates to the field of voice communication, and specifically to a method for training a packet loss compensation model, a packet loss compensation method, and a device.
Background Technology
[0002] The purpose of speech packet loss compensation is to reconstruct lost speech at the decoding end, thereby improving speech quality and intelligibility. Speech packet loss compensation technology is a key technology for maintaining the robustness of various voice communication systems under complex communication networks. Most of the latest speech packet loss compensation technologies employ neural networks to reconstruct lost speech, with one of the most typical algorithms being a reconstruction method based on a Convolutional Recurrent Network (CRN). In the speech packet loss compensation algorithm, given a CRN architecture, the time-domain signal is input in frames. By constructing a suitable training loss function, the CRN is guided to extract optimal features and reconstruct the time-domain signal of the lost frames.
[0003] In related technologies, under conditions of increased packet loss rate or continuous packet loss, frequent packet loss compensation can lead to error accumulation, potentially distorting the envelope of the reconstructed signal and causing abnormalities such as muting, unusual noises, or mechanical sounds. Furthermore, envelope distortion caused by the algorithm may reduce the recognition accuracy of automatic speech recognition systems, impacting functions such as conference transcription and real-time captioning.
Summary of the Invention
[0004] Therefore, the technical problem to be solved by the present invention is to overcome the defects of large packet loss compensation error and low recognition accuracy in the prior art, thereby providing a packet loss compensation model training method, packet loss compensation method and device.
[0005] In conjunction with the first aspect, the present invention provides a method for training a packet loss compensation model, the method comprising:
[0006] Acquire the input signal and the initial packet loss compensation model;
[0007] Based on the input signal, construct a target signal corresponding to the input signal;
[0008] Based on the initial packet loss compensation model and the input signal, the reconstructed signal is obtained;
[0009] Based on the reconstructed signal and the target signal, the initial packet loss compensation model is trained to obtain the target packet loss compensation model.
[0010] In this approach, the loss function used in the speech packet loss compensation algorithm is further optimized by training the packet loss compensation model. Optimizing the loss function reduces noise interference and prevents the packet loss compensation model from being affected by error accumulation, thus avoiding abnormalities in the reconstructed signal envelope. The optimized loss function constrains the reconstruction of the signal envelope and its fine structure, thereby enhancing the model's ability to reconstruct the envelope while reducing envelope distortion. This further improves the speech quality, intelligibility, and accuracy of automatic speech recognition of the compensated signal.
[0011] In conjunction with the first aspect, in a first embodiment of the first aspect, constructing a target signal corresponding to the input signal based on the input signal includes:
[0012] The input signal is arranged in frames to obtain an input signal divided into several frames;
[0013] The frame following the current frame in the input signal is taken as the current frame of the target signal, and the corresponding number of speech frames is extracted to obtain the target signal corresponding to the input signal.
[0014] In conjunction with the first aspect, in the second embodiment of the first aspect, training the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain the target packet loss compensation model includes:
[0015] Based on the aforementioned initial packet loss compensation model, determine the initial loss function;
[0016] Based on the initial loss function, the error between the reconstructed signal and the target signal is calculated;
[0017] Based on the error, the initial packet loss compensation model is trained to obtain the target packet loss compensation model.
[0018] In conjunction with the second embodiment of the first aspect, in the third embodiment of the first aspect, the step of calculating the error between the reconstructed signal and the target signal based on the initial loss function includes:
[0019] Based on the target signal and the reconstructed signal, the target envelope signal and the reconstructed envelope signal are calculated;
[0020] Calculate the first error between the target envelope signal and the reconstructed envelope signal;
[0021] Based on the target signal and the reconstructed signal, the target frequency domain signal and the reconstructed frequency domain signal are calculated;
[0022] Calculate the second error between the target frequency domain signal and the reconstructed frequency domain signal;
[0023] The error between the reconstructed signal and the target signal is calculated based on the first error and the second error.
[0024] In a second aspect, the present invention also provides a packet loss compensation method, the method comprising:
[0025] Acquire voice data;
[0026] Based on the voice data, determine whether the current frame voice signal is a blank frame;
[0027] When the current frame speech signal is a blank frame, a reconstructed signal corresponding to the current frame speech signal is obtained based on the packet loss compensation model, and the current frame speech signal is replaced with the reconstructed signal. The packet loss compensation model is trained using the packet loss compensation model training method of the first aspect and any of its optional embodiments.
[0028] In conjunction with the second aspect, in the first embodiment of the second aspect, when the current frame audio signal is not a blank frame, the current frame audio signal is output to the device.
[0029] In a third aspect, the present invention also provides a packet loss compensation model training apparatus, the apparatus comprising:
[0030] The first acquisition unit is used to acquire the input signal and the initial packet loss compensation model;
[0031] A construction unit is used to construct a target signal corresponding to the input signal based on the input signal;
[0032] The reconstruction unit is used to obtain the reconstructed signal based on the initial packet loss compensation model and the input signal;
[0033] The training unit is used to train the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain the target packet loss compensation model.
[0034] In conjunction with the third aspect, in the first embodiment of the third aspect, the constructing unit includes:
[0035] A framing unit is used to divide the input signal into frames to obtain an input signal divided into several frames.
[0036] The interception unit is used to take the next frame of the current frame in the input signal as the current frame of the target signal, and intercept the speech signal of the corresponding frame number to obtain the target signal corresponding to the input signal.
[0037] In conjunction with the third aspect, in the second embodiment of the third aspect, the training unit includes:
[0038] The determining unit is used to determine the initial loss function based on the initial packet loss compensation model;
[0039] An error unit is used to calculate the error between the reconstructed signal and the target signal based on the initial loss function.
[0040] The training subunit is used to train the initial packet loss compensation model based on the error to obtain the target packet loss compensation model.
[0041] In conjunction with the second embodiment of the third aspect, in the third embodiment of the third aspect, the error unit includes:
[0042] An envelope unit is used to calculate a target envelope signal and a reconstructed envelope signal based on the target signal and the reconstructed signal.
[0043] The first error unit is used to calculate the first error between the target envelope signal and the reconstructed envelope signal;
[0044] A frequency domain unit is used to calculate the target frequency domain signal and the reconstructed frequency domain signal based on the target signal and the reconstructed signal;
[0045] The second error unit is used to calculate the second error between the target frequency domain signal and the reconstructed frequency domain signal;
[0046] The third error unit is used to calculate the error between the reconstructed signal and the target signal based on the first error and the second error.
[0047] In a fourth aspect, the present invention also provides a packet loss compensation device, the device comprising:
[0048] The second acquisition unit is used to acquire voice data;
[0049] The judgment unit is used to determine whether the current frame of voice signal is a blank frame based on the voice data;
[0050] The replacement unit is configured to, when the current frame speech signal is a blank frame, obtain the reconstructed signal corresponding to the current frame speech signal based on the packet loss compensation model, and replace the current frame speech signal with the reconstructed signal, wherein the packet loss compensation model is trained using the packet loss compensation model training method of the first aspect and any of its optional embodiments.
[0051] In conjunction with the fourth aspect, in the first embodiment of the fourth aspect, the apparatus further includes:
[0052] The output unit is used to output the current frame audio signal to the device when the current frame audio signal is not a blank frame.
[0053] According to a fifth aspect, the present invention also provides a computer device, including a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform a packet loss compensation model training method of any one of the first aspect and its optional embodiments or to perform a packet loss compensation method of any one of the second aspect and its optional embodiments.
[0054] According to a sixth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer instructions for causing the computer to perform a packet loss compensation model training method of any one of the first aspect and its optional embodiments, or to perform a packet loss compensation method of any one of the second aspect and its optional embodiments.
Attached Figure Description
[0055] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0056] Figure 1 is a flowchart of a packet loss compensation model training method according to an exemplary embodiment.
[0057] Figure 2 is a schematic diagram of signal pair construction in the training phase according to an exemplary embodiment.
[0058] Figure 3 is a flowchart of a packet loss compensation model training method according to an exemplary embodiment.
[0059] Figure 4 is a flowchart of the construction of an analytic signal according to an exemplary embodiment.
[0060] Figure 5 is a flowchart of training a packet loss compensation model according to an exemplary embodiment.
[0061] Figure 6 is a flowchart of a packet loss compensation method according to an exemplary embodiment.
[0062] Figure 7 is a schematic diagram of the basic framework of a speech packet loss compensation algorithm according to an exemplary embodiment.
[0063] Figure 8 is a structural block diagram of a packet loss compensation model training device according to an exemplary embodiment.
[0064] Figure 9 is a structural block diagram of a packet loss compensation device according to an exemplary embodiment.
[0065] Figure 10 is a schematic diagram of the hardware structure of a computer device according to an exemplary embodiment.
Detailed Implementation
[0066] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0067] In related technologies, the common approach takes the time-domain signal of the original speech as the network output target and uses the Mean Absolute Error (MAE) as the loss function to optimize the network, so that the network model can reconstruct the lost speech signal from the received speech signal. This type of algorithm is relatively simple to implement and compensates for speech packet loss well. However, it has two problems:
[0068] 1. When the packet loss rate increases or there are consecutive packet losses, frequent packet loss compensation can cause error accumulation, potentially distorting the envelope of the reconstructed signal and leading to abnormalities such as muting, unusual noises, or mechanical sounds. These abnormalities significantly impair voice quality, thereby degrading the user's call experience.
[0069] 2. Envelope distortion caused by the algorithm may reduce the recognition accuracy of the automatic speech recognition system, which will affect functions such as conference transcription and real-time captioning.
[0070] To address the aforementioned problems, this invention provides a packet loss compensation model training method for use in a computer device. It should be noted that the executing entity can be a packet loss compensation model training device, which can be implemented as part or all of the computer device through software, hardware, or a combination of both. The computer device can be a terminal, client, or server. The server can be a single server or a server cluster composed of multiple servers. In this embodiment, the terminal can be a smartphone, personal computer, tablet computer, wearable device, or other intelligent hardware device such as a smart robot. The following method embodiments all use a computer device as the executing entity for illustration.
[0071] The computer device in this embodiment is suitable for use scenarios where voice communication results in lost speech. Through the packet loss compensation model training method provided by this invention, the loss function used in the speech packet loss compensation algorithm is further optimized by training the packet loss compensation model. Optimizing the loss function reduces noise interference and avoids the impact of error accumulation on the packet loss compensation model, thereby preventing abnormalities in the reconstructed signal envelope. The optimized loss function constrains the reconstruction of the signal envelope and the fine structure of the signal, thus enhancing the model's ability to reconstruct the envelope while reducing envelope distortion, further improving the speech quality, intelligibility, and accuracy of automatic speech recognition of the compensated signal.
[0072] Figure 1 is a flowchart of a packet loss compensation model training method according to an exemplary embodiment. As shown in Figure 1, the packet loss compensation model training method includes the following steps S101 to S104.
[0073] In step S101, the input signal and the initial packet loss compensation model are acquired.
[0074] In step S102, a target signal corresponding to the input signal is constructed based on the input signal.
[0075] In this embodiment of the invention, after the input signal is received, a signal pair consisting of the input signal and a target signal is constructed so that the reconstructed signal can be driven closer to the real signal; the packet loss compensation model is then trained using the target signal and the reconstructed signal. Specifically, constructing the target signal corresponding to the input signal based on the input signal includes: arranging the input signal into frames to obtain an input signal divided into several frames; using the frame following the current frame in the input signal as the current frame of the target signal, and extracting the corresponding number of speech frames to obtain the target signal corresponding to the input signal.
[0076] In one example, Figure 2 is a schematic diagram of signal pair construction in the training phase according to an exemplary embodiment. As shown in Figure 2, constructing a signal pair may include: arranging a speech signal into frames (with no overlap between frames), selecting frames 0 to N-1 as the input signal, and correspondingly selecting frames 1 to N as the target signal, so that the two signals of the pair are offset in time by one frame.
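The signal pair construction described above can be sketched as follows. This is a minimal illustration only: the function name `build_signal_pair`, the frame length, and the integer stand-in for a speech waveform are assumptions and do not come from the patent text.

```python
def build_signal_pair(samples, frame_len):
    """Split samples into non-overlapping frames and pair frame k with frame k+1."""
    n_frames = len(samples) // frame_len  # any trailing partial frame is dropped
    frames = [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
    inputs = frames[:-1]   # frames 0 .. N-1
    targets = frames[1:]   # frames 1 .. N
    return inputs, targets


speech = list(range(12))  # stand-in for a speech waveform (assumption)
inputs, targets = build_signal_pair(speech, frame_len=3)
```

Here `targets[k]` is always the frame that follows `inputs[k]`, which is the one-frame offset between the two signals of the pair.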
[0077] In step S103, the reconstructed signal is obtained based on the initial packet loss compensation model and the input signal.
[0078] In this embodiment of the invention, to facilitate the training of the initial packet loss compensation model, the input signal is calculated using the initial packet loss compensation model to obtain the reconstructed signal, which provides data support for the next step of training the initial packet loss compensation model based on the target signal and the reconstructed signal.
[0079] In step S104, the initial packet loss compensation model is trained based on the reconstructed signal and the target signal to obtain the target packet loss compensation model.
[0080] In this embodiment of the invention, the loss of the reconstructed signal relative to the target signal is determined by acquiring the error between the reconstructed signal and the target signal. To make the reconstructed signal closer to the target signal, the initial packet loss compensation model is trained based on the error between the reconstructed signal and the target signal, so as to minimize the distortion between the reconstructed signal obtained by the target packet loss compensation model and the target signal, thus better meeting the user's needs.
[0081] Through the above embodiments, by training the packet loss compensation model, the loss function used in the speech packet loss compensation algorithm is further optimized. Optimizing the loss function reduces noise interference and prevents the packet loss compensation model from being affected by error accumulation, thus avoiding abnormalities in the reconstructed signal envelope. The optimized loss function constrains the reconstruction of the signal envelope and its fine structure, thereby reducing envelope distortion while enhancing the model's ability to reconstruct the envelope, further improving the speech quality, intelligibility, and accuracy of automatic speech recognition of the compensated signal.
[0082] The following examples will illustrate the process of training the initial packet loss compensation model.
[0083] Figure 3 is a flowchart of a packet loss compensation model training method according to an exemplary embodiment. As shown in Figure 3, the training method for the packet loss compensation model includes the following steps.
[0084] In step S301, the initial loss function is determined based on the initial packet loss compensation model.
[0085] In this embodiment of the invention, the initial packet loss compensation model includes an initial loss function, which is used to calculate the error between the reconstructed signal and the target signal, thereby providing data support for the training of the initial packet loss compensation model.
[0086] In one example, the initial loss function may include: the loss function used by the CRN-based speech packet loss compensation algorithm:
[0087] $$Loss_1 = \frac{1}{T}\sum_{t=1}^{T}\left|x(t)-\hat{x}(t)\right| \tag{1}$$
[0088] In the formula, Loss1 is the loss function, x(t) is the target signal, $\hat{x}(t)$ is the reconstructed signal, and T is the signal length. This loss function achieves good training results by directly constraining the time-domain waveform. However, since x(t) is susceptible to noise interference, the history buffer of the initial packet loss compensation model and the hidden layer state of the model will be affected by error accumulation. Furthermore, Loss1 does not directly constrain the envelope signal, which leads to abnormal envelopes in the reconstructed signal.
[0089] In step S302, the error between the reconstructed signal and the target signal is calculated based on the initial loss function.
[0090] In this embodiment of the invention, since the initial loss function will cause a certain envelope distortion, in order to reduce the envelope distortion and ensure the degree of restoration of the reconstructed signal, it is necessary to optimize the error loss function between the reconstructed signal and the target signal.
[0091] In one implementation scenario, when the packet loss rate is high or continuous packet loss occurs, in order to reduce the impact of envelope distortion, the error between the reconstructed signal and the target signal can be calculated as follows: based on the target signal and the reconstructed signal, the target envelope signal and the reconstructed envelope signal are calculated; and the first error between the target envelope signal and the reconstructed envelope signal is calculated.
[0092] In one example, the temporal loss function constructed for envelope distortion is:
[0093]
[0094] Where z(t) is the target analytic signal, $\hat{z}(t)$ is the reconstructed analytic signal, |z(t)| is the target envelope signal, and $|\hat{z}(t)|$ is the reconstructed envelope signal.
[0095] Figure 4 is a flowchart of the construction of an analytic signal according to an exemplary embodiment. Specifically, the construction methods of the target analytic signal and the reconstructed analytic signal may include:
[0096] $$z(t) = z_r(t) + j\,z_i(t) \tag{3}$$
[0097] $$z_r(t) = x(t) \tag{4}$$
[0098] $$z_i(t) = \mathrm{HT}[x(t)] \tag{5}$$
[0099] Where HT[·] is the Hilbert transform, $z_r(t)$ is the real part of the analytic signal, namely the original real signal, $z_i(t)$ is the imaginary part of the analytic signal, namely the Hilbert transform of the original real signal, and j is the imaginary unit. The time-domain envelope of the signal can be represented by its analytic signal as:
[0100] $$|z(t)| = \sqrt{z_r^2(t) + z_i^2(t)} \tag{6}$$
[0101] Time-domain signals can be decomposed into a slowly varying envelope structure and a rapidly varying fine structure using Hilbert transform. The envelope structure primarily affects semantics and speech quality, while the fine structure mainly affects pitch and timbre perception. When the packet loss rate is high or consecutive packet losses occur, the effective information available to the decoder is significantly reduced, making it difficult to guarantee the fidelity of the fine structure. In such cases, priority should be given to ensuring the integrity of the signal envelope to guarantee the transmission of semantic information and avoid abnormal compensation. By decoupling the time-domain envelope from the time-domain signal using Hilbert transform, the time-domain envelope error between the reconstructed signal and the target signal can be directly calculated. This allows for targeted training of the model's ability to reconstruct the signal envelope, avoiding interference from rapidly varying signal components and reducing the impact of noise and accumulated errors on the model.
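The envelope extraction described above can be sketched with NumPy as follows. The analytic signal is built with the standard FFT method (negative frequencies zeroed, positive frequencies doubled). The text does not show the exact form of the envelope loss, so `envelope_loss` assumes a mean absolute error between the two envelopes; the function names and the test signal are likewise illustrative assumptions.

```python
import numpy as np


def analytic_signal(x):
    """Return z(t) = x(t) + j*HT[x(t)] using the FFT method."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies, zero negatives.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)


def envelope(x):
    """Time-domain envelope |z(t)| = sqrt(z_r(t)^2 + z_i(t)^2)."""
    return np.abs(analytic_signal(x))


def envelope_loss(target, reconstructed):
    """Assumed form: mean absolute error between the two envelopes."""
    return float(np.mean(np.abs(envelope(target) - envelope(reconstructed))))


# A 1 kHz carrier with a slow 10 Hz amplitude modulation, sampled at 8 kHz.
t = np.arange(800) / 8000.0
slow = 1.0 + 0.5 * np.cos(2 * np.pi * 10 * t)   # slowly varying envelope
x = slow * np.cos(2 * np.pi * 1000 * t)         # rapidly varying fine structure
recovered = envelope(x)
```

For this test signal `recovered` matches `slow`, so the loss compares only the slowly varying component and ignores the carrier.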
[0102] In another implementation scenario, in order to constrain the fine structure reconstruction of the signal, the error between the reconstructed signal and the target signal can be calculated as follows: based on the target signal and the reconstructed signal, the target frequency domain signal and the reconstructed frequency domain signal are calculated; and a second error between the target frequency domain signal and the reconstructed frequency domain signal is calculated.
[0103] In one example, the frequency domain loss function constructed for signal reproducibility is:
[0104]
[0105] In the formula, X(t,f) is the short-time Fourier transform of the target signal, $\hat{X}(t,f)$ is the short-time Fourier transform of the reconstructed signal, |X(t,f)| is the amplitude spectrum of the target signal, $|\hat{X}(t,f)|$ is the amplitude spectrum of the reconstructed signal, and α is the weight control factor, which can be changed according to user needs. The suggested value obtained from testing under the target CRN framework is 0.1. Formula (7) calculates the amplitude spectrum distance and the spectral distance between the reconstructed signal and the target signal, and uses the weight control factor to adjust the proportion of amplitude spectrum information and phase spectrum information. Since the human ear's perception of sound is related to frequency, the amplitude spectrum represents the harmonic structure of the signal more directly and accurately than the time-domain signal, and the spectral distance indirectly constrains the phase. Therefore, formula (7) can better optimize the training of the model. Since the time-domain envelope belongs to the low-frequency component of the signal, the low-frequency component of the amplitude spectrum contains the time-domain envelope information. However, because the frequency-domain resolution is limited by the time-domain window length in actual calculation, the low-frequency component is not sufficient to represent the envelope structure accurately.
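A frequency-domain loss of this kind can be sketched as below. Formula (7) itself is not reproduced in the text, so several points here are assumptions: both distances are taken as mean absolute distances, α is assumed to weight the complex-spectrum term and (1 - α) the amplitude term, and the window type, frame length, and hop size are arbitrary illustrative choices.

```python
import numpy as np


def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)


def frequency_loss(target, reconstructed, alpha=0.1):
    """Weighted sum of a complex-spectrum distance and an amplitude-spectrum distance.

    Assumption: alpha weights the complex-spectrum term and (1 - alpha) the
    amplitude term. The split is illustrative, not taken from formula (7).
    """
    X, X_hat = stft(target), stft(reconstructed)
    spectral = np.mean(np.abs(X - X_hat))                   # sensitive to phase
    amplitude = np.mean(np.abs(np.abs(X) - np.abs(X_hat)))  # ignores phase
    return float(alpha * spectral + (1.0 - alpha) * amplitude)


rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
same = frequency_loss(x, x)        # identical signals
flipped = frequency_loss(x, -x)    # same amplitude spectrum, inverted phase
```

The polarity-inverted signal has the same amplitude spectrum as the original, so only the complex-spectrum term penalizes it; this is the sense in which the spectral distance constrains phase indirectly.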
[0106] In this embodiment of the invention, in order to take into account both the ability of the training model to reconstruct the signal envelope and the ability to restore the fine structure of the signal, the error between the reconstructed signal and the target signal is calculated based on the first error and the second error.
[0107] In one example, due to the limitation of the time-domain window length on the frequency domain resolution in actual calculations, the low-frequency components are insufficient to accurately represent the envelope structure. Equation (2) provides an accurate method for calculating the envelope loss, which can specifically enhance the model's ability to reconstruct the envelope. Combining Equations (2) and (7), the loss function for reducing envelope distortion is expressed as:
[0108] $$Loss_4 = \beta \cdot Loss_2 + (1-\beta) \cdot Loss_3 \tag{8}$$
[0109] Wherein, β is the envelope control factor, which can be changed according to user needs; the suggested value given after testing under the target CRN framework is 0.25. Loss2 serves as a constraint term to enhance the model's ability to reconstruct the envelope signal, while Loss3 serves as a constraint term to enhance the model's ability to reconstruct the signal (especially the fine structure part). Therefore, the loss function Loss4 proposed in this patent, which reduces envelope distortion, can significantly improve the speech quality after algorithm compensation and avoid abnormal compensation caused by envelope distortion.
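As a worked example of formula (8), the sketch below combines an envelope loss and a frequency-domain loss with the suggested β = 0.25. The function name and the two input loss values are made up for illustration.

```python
def combined_loss(loss2, loss3, beta=0.25):
    """Loss4 = beta * Loss2 + (1 - beta) * Loss3, as in formula (8)."""
    return beta * loss2 + (1.0 - beta) * loss3


# Hypothetical loss values: envelope loss 0.8, frequency-domain loss 0.4.
loss4 = combined_loss(loss2=0.8, loss3=0.4)  # 0.25 * 0.8 + 0.75 * 0.4
```

With β = 0.25 the envelope term contributes 0.2 and the frequency-domain term 0.3, so the frequency-domain constraint still dominates while the envelope is constrained explicitly.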
[0110] In step S303, the initial packet loss compensation model is trained based on the error to obtain the target packet loss compensation model.
[0111] Figure 5 is a flowchart of training a packet loss compensation model according to an exemplary embodiment. As shown in Figure 5, in this embodiment of the invention, the packet loss compensation model outputs a reconstructed signal based on the input signal, the error between the reconstructed signal and the target signal is calculated with the loss function, and the model parameters are optimized through the backpropagation algorithm to reduce that error. The loss function therefore directly constrains the way the reconstructed signal approximates the target signal.
[0112] Through the above embodiments, in order to make the reconstructed signal closer to the target signal, the loss function of the packet loss compensation model is determined, the error between the reconstructed signal and the target signal is calculated based on the loss function, and the parameters of the initial packet loss compensation model are optimized by backpropagation, so that the reconstructed signal obtained by the target packet loss compensation model is closer to the target signal.
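The overall flow of steps S101 to S104 can be illustrated with a deliberately simplified sketch. It is not the method of the embodiment: a single linear layer stands in for the CRN, a plain mean squared error stands in for Loss4, a sine wave stands in for speech, and the gradient is written out by hand instead of being obtained by backpropagation through a network. All names and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len = 8

# S101: input signal (a sine wave as a stand-in for speech) and an initial model.
signal = np.sin(2 * np.pi * 0.05 * np.arange(frame_len * 65))
frames = signal.reshape(-1, frame_len)
weights = 0.01 * rng.standard_normal((frame_len, frame_len))  # initial model

# S102: target frames are the input frames shifted by one frame.
inputs, targets = frames[:-1], frames[1:]


def loss_and_grad(w):
    """Mean squared error of the one-frame-ahead prediction and its gradient."""
    reconstructed = inputs @ w               # S103: reconstructed signal
    error = reconstructed - targets
    loss = float(np.mean(error ** 2))
    grad = 2.0 * inputs.T @ error / error.size
    return loss, grad


initial_loss, _ = loss_and_grad(weights)
for _ in range(500):                         # S104: iterative parameter update
    _, grad = loss_and_grad(weights)
    weights -= 0.5 * grad                    # gradient descent step
final_loss, _ = loss_and_grad(weights)
```

The error between the reconstructed frames and the target frames drives every parameter update, so the choice of loss function determines which aspects of the target signal the model learns to reproduce.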
[0113] Figure 6 is a flowchart of a packet loss compensation method according to an exemplary embodiment. As shown in Figure 6, the packet loss compensation method includes the following steps.
[0114] In step S601, voice data is acquired.
[0115] In step S602, based on the voice data, it is determined whether the current frame voice signal is a blank frame.
[0116] In step S603, when the current frame speech signal is a blank frame, the reconstructed signal corresponding to the current frame speech signal is obtained based on the packet loss compensation model, and the current frame speech signal is replaced with the reconstructed signal. The packet loss compensation model is trained using the packet loss compensation model training method of any of the above embodiments.
[0117] In this embodiment of the invention, when the current frame audio signal is not a blank frame, the current frame audio signal is output to the device.
[0118] In one example, Figure 7 is a schematic diagram of the basic framework of a speech packet loss compensation algorithm according to an exemplary embodiment. As shown in Figure 7, the packet loss compensation method may include: first, detecting whether the receiver has received the speech signal of the current frame. If the frame signal $x_t$ is successfully received, $x_t$ is output to the device as normal, stored in the history buffer, and input into the packet loss compensation model (CRN structure) to update the hidden layer state of the model. If the current frame of the speech signal is not received, the packet loss compensation model reconstructs a frame of speech signal $\hat{x}_t$ from the history buffer data and the hidden layer state, outputs it in place of the current blank frame, and updates the history buffer and the hidden layer state of the model again.
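The receive-or-reconstruct loop described above can be sketched as follows. The class `RepeatLastFrameModel` is a hypothetical stand-in for the trained packet loss compensation model: it only repeats the last buffered frame, whereas a CRN would also maintain a hidden layer state. Representing a blank frame as `None` is likewise an assumption.

```python
class RepeatLastFrameModel:
    """Hypothetical stand-in for the trained model: repeats the last buffered frame."""

    def __init__(self, frame_len):
        self.history = [0.0] * frame_len   # history buffer

    def update(self, frame):
        """Store a frame in the history buffer (a CRN would also update its hidden state)."""
        self.history = list(frame)

    def reconstruct(self):
        """Predict the missing frame from the history buffer."""
        return list(self.history)


def conceal(frames, model):
    """Replace blank frames (None) with reconstructed frames; pass received frames through."""
    output = []
    for frame in frames:
        if frame is None:            # blank frame: the packet was lost
            frame = model.reconstruct()
        model.update(frame)          # buffer is updated with received or reconstructed data
        output.append(frame)
    return output


received = [[1.0, 2.0], None, None, [3.0, 4.0]]
played = conceal(received, RepeatLastFrameModel(frame_len=2))
```

Because the buffer is also updated with reconstructed frames, consecutive losses are compensated from earlier reconstructions, which is where the error accumulation discussed in this document originates.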
[0119] Based on the same inventive concept, the present invention also provides a packet loss compensation model training device.
[0120] Figure 8 is a structural block diagram of a packet loss compensation model training device according to an exemplary embodiment.
[0121] As shown in Figure 8, the packet loss compensation model training device includes a first acquisition unit 801, a construction unit 802, a reconstruction unit 803, and a training unit 804.
[0122] The first acquisition unit 801 is used to acquire the input signal and the initial packet loss compensation model.
[0123] Construction unit 802 is used to construct a target signal corresponding to the input signal based on the input signal.
[0124] The reconstruction unit 803 is used to obtain the reconstructed signal based on the initial packet loss compensation model and the input signal.
[0125] Training unit 804 is used to train the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain the target packet loss compensation model.
[0126] In one embodiment, the construction unit 802 includes: a framing unit for arranging the input signal into frames to obtain an input signal divided into several frames; and a truncation unit for taking the next frame after the current frame in the input signal as the current frame of the target signal and truncating the speech signal to the corresponding number of frames to obtain the target signal corresponding to the input signal.
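The framing and one-frame shift described above can be illustrated with the following minimal sketch. It assumes non-overlapping frames, discards trailing samples that do not fill a whole frame, and drops the last input frame so that the input and target have the same number of frames. These are illustrative assumptions about how the truncation is done, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def split_into_frames(signal: Sequence[float], frame_len: int) -> List[Frame]:
    """Arrange a signal into non-overlapping frames (assumed framing scheme)."""
    n_frames = len(signal) // frame_len  # trailing partial frame is discarded
    return [list(signal[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def build_training_pair(signal: Sequence[float],
                        frame_len: int) -> Tuple[List[Frame], List[Frame]]:
    """Return (input frames, target frames) where target[t] is input frame t + 1."""
    frames = split_into_frames(signal, frame_len)
    inputs = frames[:-1]   # frames 0 .. N-2
    targets = frames[1:]   # frames 1 .. N-1, i.e. the next frame of each input frame
    return inputs, targets
```

With this pairing, the model learns to predict the frame that follows the one it is given, which is the frame it must supply when a packet is lost.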
[0127] In another embodiment, the training unit 804 includes: a determination unit for determining an initial loss function based on an initial packet loss compensation model; an error unit for calculating the error between the reconstructed signal and the target signal based on the initial loss function; and a training subunit for training the initial packet loss compensation model based on the error to obtain a target packet loss compensation model.
[0128] In another embodiment, the error unit includes: an envelope unit for calculating a target envelope signal and a reconstructed envelope signal based on the target signal and the reconstructed signal; a first error unit for calculating a first error between the target envelope signal and the reconstructed envelope signal; a frequency domain unit for calculating a target frequency domain signal and a reconstructed frequency domain signal based on the target signal and the reconstructed signal; a second error unit for calculating a second error between the target frequency domain signal and the reconstructed frequency domain signal; and a third error unit for calculating the error between the reconstructed signal and the target signal based on the first error and the second error.
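The two-part error described above can be sketched as follows. This is an assumption-laden illustration: the envelope is approximated by a moving average of the rectified signal, the frequency domain signal by the magnitude of a direct discrete Fourier transform, each error by a mean squared error, and the combination by a weighted sum. The embodiments may define each of these differently; the function names, window, and weights are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def envelope(signal: Sequence[float], window: int = 5) -> List[float]:
    """Assumed envelope estimate: moving average of the absolute value."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        segment = signal[lo:hi]
        out.append(sum(abs(v) for v in segment) / len(segment))
    return out


def dft_magnitude(signal: Sequence[float]) -> List[float]:
    """Magnitude spectrum via a direct O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_error(target: Sequence[float], reconstructed: Sequence[float],
                   envelope_weight: float = 1.0,
                   spectrum_weight: float = 1.0) -> float:
    """Combine the envelope (first) error and the frequency domain (second) error."""
    first_error = mse(envelope(target), envelope(reconstructed))
    second_error = mse(dft_magnitude(target), dft_magnitude(reconstructed))
    # Weighted sum is an assumed way of combining the two errors.
    return envelope_weight * first_error + spectrum_weight * second_error
```

The error is zero when the reconstructed signal equals the target and grows as either the envelope or the spectrum deviates.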
[0129] The specific limitations and beneficial effects of the aforementioned packet loss compensation model training device can be found in the limitations of the packet loss compensation model training method described above, and will not be repeated here. Each of the above modules can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the corresponding operations of each module.
[0130] Based on the same inventive concept, the present invention also provides a packet loss compensation device.
[0131] Figure 9 is a structural block diagram of a packet loss compensation device according to an exemplary embodiment.
[0132] As shown in Figure 9, the packet loss compensation device includes a second acquisition unit 901, a judgment unit 902, and a replacement unit 903.
[0133] The second acquisition unit 901 is used to acquire voice data.
[0134] The judgment unit 902 is used to determine whether the current frame of speech signal is a blank frame based on the speech data.
[0135] The replacement unit 903 is used to obtain the reconstructed signal corresponding to the current frame speech signal based on the packet loss compensation model when the current frame speech signal is a blank frame, and replace the current frame speech signal with the reconstructed signal. The packet loss compensation model is trained using any of the above packet loss compensation model training methods.
[0136] In one embodiment, the packet loss compensation device provided by the present invention further includes: an output unit, used to output the current frame audio signal to the device when the current frame audio signal is not a blank frame.
[0137] The specific limitations and beneficial effects of the aforementioned packet loss compensation device can be found in the limitations of the packet loss compensation method described above, and will not be repeated here. Each of the above modules can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the corresponding operations of each module.
[0138] Figure 10 is a schematic diagram of the hardware structure of a computer device according to an exemplary embodiment. As shown in Figure 10, the device includes one or more processors 1010 and a memory 1020, the memory 1020 including persistent memory, volatile memory, and a hard disk. Figure 10 takes one processor 1010 as an example. The device may also include an input device 1030 and an output device 1040.
[0139] The processor 1010, memory 1020, input device 1030, and output device 1040 can be connected via a bus or other means. Figure 10 takes connection via a bus as an example.
[0140] Processor 1010 can be a Central Processing Unit (CPU). Processor 1010 can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof. The general-purpose processor can be a microprocessor or any conventional processor.
[0141] The memory 1020, as a non-transitory computer-readable storage medium, includes persistent memory, volatile memory, and a hard disk. It can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the packet loss compensation model training method and the corresponding program instructions / modules in the embodiments of this application. The processor 1010 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 1020, thereby implementing any of the above-mentioned packet loss compensation model training methods and packet loss compensation methods.
[0142] The memory 1020 may include a program storage area and a data storage area. The program storage area may store the operating system and the applications required for at least one function; the data storage area may store data as required. Furthermore, the memory 1020 may include high-speed random access memory and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 1020 may optionally include memory remotely located relative to the processor 1010, and these remote memories can be connected to the data processing device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0143] Input device 1030 can receive input numeric or character information and generate key signal inputs related to user settings and function control. Output device 1040 may include display devices such as a display screen.
[0144] One or more modules are stored in memory 1020, and when executed by one or more processors 1010, they perform the methods shown in Figures 1-7.
[0145] The above-described product can execute the method provided in the embodiments of the present invention, and has the corresponding functional modules and beneficial effects for executing the method. Technical details not described in detail in this embodiment can be found in the relevant descriptions of the embodiments shown in Figures 1-7.
[0146] This invention also provides a non-transitory computer storage medium storing computer-executable instructions that can execute the packet loss compensation model training method or the packet loss compensation method in any of the above method embodiments. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk drive (HDD), or solid-state drive (SSD), etc.; the storage medium may also include combinations of the above types of memory.
[0147] Obviously, the above embodiments are merely illustrative examples for clear explanation and are not intended to limit the implementation. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is neither necessary nor possible to exhaustively list all possible implementations here. However, obvious variations or modifications derived therefrom are still within the scope of protection of this invention.
Claims
1. A method for training a packet loss compensation model, characterized in that the method includes: acquiring an input signal and an initial packet loss compensation model; constructing, based on the input signal, a target signal corresponding to the input signal; obtaining a reconstructed signal based on the initial packet loss compensation model and the input signal; and training the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain a target packet loss compensation model; wherein the step of constructing a target signal corresponding to the input signal based on the input signal includes: arranging the input signal in frames to obtain an input signal divided into several frames; and taking the next frame after the current frame in the input signal as the current frame of the target signal, and extracting the speech signal of the corresponding number of frames to obtain the target signal corresponding to the input signal; and wherein the step of training the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain the target packet loss compensation model includes: determining an initial loss function based on the initial packet loss compensation model; calculating the error between the reconstructed signal and the target signal based on the initial loss function; and training the initial packet loss compensation model based on the error to obtain the target packet loss compensation model.
2. The method according to claim 1, characterized in that the step of calculating the error between the reconstructed signal and the target signal based on the initial loss function includes: calculating a target envelope signal and a reconstructed envelope signal based on the target signal and the reconstructed signal; calculating a first error between the target envelope signal and the reconstructed envelope signal; calculating a target frequency domain signal and a reconstructed frequency domain signal based on the target signal and the reconstructed signal; calculating a second error between the target frequency domain signal and the reconstructed frequency domain signal; and calculating the error between the reconstructed signal and the target signal based on the first error and the second error.
3. A packet loss compensation method, characterized in that the method includes: acquiring voice data; determining, based on the voice data, whether the current frame voice signal is a blank frame; and when the current frame speech signal is a blank frame, obtaining a reconstructed signal corresponding to the current frame speech signal based on a packet loss compensation model, and replacing the current frame speech signal with the reconstructed signal, wherein the packet loss compensation model is trained using the packet loss compensation model training method as described in any one of claims 1-2.
4. The method according to claim 3, characterized in that, when the current frame audio signal is not a blank frame, the current frame audio signal is output to the device.
5. A packet loss compensation model training device, characterized in that the device includes: a first acquisition unit, used to acquire an input signal and an initial packet loss compensation model; a construction unit, used to construct a target signal corresponding to the input signal based on the input signal; a reconstruction unit, used to obtain a reconstructed signal based on the initial packet loss compensation model and the input signal; and a training unit, used to train the initial packet loss compensation model based on the reconstructed signal and the target signal to obtain a target packet loss compensation model; wherein the construction unit is specifically used to: arrange the input signal into frames to obtain an input signal divided into several frames; and take the next frame after the current frame in the input signal as the current frame of the target signal, and extract the speech signal of the corresponding number of frames to obtain the target signal corresponding to the input signal; and wherein the training unit is specifically used to: determine an initial loss function based on the initial packet loss compensation model; calculate the error between the reconstructed signal and the target signal based on the initial loss function; and train the initial packet loss compensation model based on the error to obtain the target packet loss compensation model.
6. A packet loss compensation device, characterized in that the device includes: a second acquisition unit, used to acquire voice data; a judgment unit, used to determine whether the current frame of voice signal is a blank frame based on the voice data; and a replacement unit, configured to, when the current frame speech signal is a blank frame, obtain the reconstructed signal corresponding to the current frame speech signal based on a packet loss compensation model, and replace the current frame speech signal with the reconstructed signal, wherein the packet loss compensation model is trained using the packet loss compensation model training method as described in any one of claims 1-2.
7. An electronic device, characterized in that it includes a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the packet loss compensation model training method of any one of claims 1-2 or the packet loss compensation method of any one of claims 3-4.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, which are used to cause a computer to execute the packet loss compensation model training method of any one of claims 1-2 or the packet loss compensation method of any one of claims 3-4.