A generative neural network model for processing audio samples in a filterbank domain

By using a generative model that operates in the filter bank domain, the problem of integrating generative models with frequency domain tools in existing technologies is solved, enabling efficient processing and phase reconstruction of audio signals and improving the parallel processing capability of the synthesis process.

CN116391191B · Active · Publication Date: 2026-06-23 · DOLBY INTERNATIONAL AB

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DOLBY INTERNATIONAL AB
Filing Date
2021-10-15
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing generative models are difficult to integrate with other signal processing tools with frequency domain interfaces when operating in the time domain, and models that operate on the spectrum cannot effectively reconstruct the phase of audio signals, which complicates the synthesis process.

Method used

Employing a generative model operating in the filter bank domain, a hierarchical neural network system is used to autoregressively generate filter bank representations of audio signals, learn how to eliminate aliasing and suppress frequency bands, directly process the amplitude and phase of audio signals, and provide increased parallel processing capabilities.

Benefits of technology

It achieves easy integration of the generative model with the frequency domain interface tools, effectively processes general audio, especially music, reduces the complexity of phase reconstruction, and improves the parallel processing capability of the synthesis process.

✦ Generated by Eureka AI based on patent content.

Abstract

A neural network system is provided that implements a generative model for autoregressively generating a distribution of a plurality of current filterbank samples of an audio signal, where the current samples correspond to a current time bin, and each current sample corresponds to a channel of the filterbank. The system includes a hierarchy of a plurality of neural network processing layers ordered from a top layer to a bottom layer, each layer trained to generate conditioning information based on previous filterbank samples, and for at least each layer other than the top layer, also based on conditioning information from a higher layer in the hierarchy; and an output stage trained to generate a probability distribution based on previous samples of one or more previous time bins and conditioning information from the lowest processing layer.

Description

[0001] Cross-reference to related applications

[0002] This application claims priority to the following prior applications: U.S. Provisional Application 63/092,754 (reference number: D20037USP1), filed October 16, 2020, and European Application 20207272.4 (reference number: D20037EP), filed November 12, 2020.

Technical Field

[0003] This disclosure relates to the intersection of machine learning and audio signal processing. Specifically, this disclosure relates to a generative neural network model for processing samples in a filter bank domain.

Background Technology

[0004] Generative neural network models can be trained to at least approximately learn the true distribution of a training dataset, such that the model can then generate new data by sampling from this learned distribution. Therefore, generative neural network models have proven useful in various signal synthesis schemes, including speech and audio synthesis, audio coding, and audio enhancement. These generative models are known to operate in the time domain, or on the amplitude spectrum of the signal's frequency representation (i.e., on a spectrogram).

[0005] However, generative models operating in the time domain (such as WaveNet and sampleRNN) do not always lend themselves to integration with other signal processing tools that have frequency domain interfaces (such as tools for equalization), and they often use recurrent networks that may have limited potential for parallelization. Furthermore, state-of-the-art generative models operating on the spectrogram (such as MelNet) do not reconstruct the phase of the audio signal during synthesis, but instead rely on phase reconstruction algorithms (such as Griffin-Lim) as post-processing in order to adequately reconstruct the audio.

[0006] In view of the above, there is a need for improved generative models for audio signal processing.

Summary of the Invention

[0007] This disclosure aims to at least partially satisfy the needs identified above.

[0008] According to a first aspect of this disclosure, a neural network system (hereinafter referred to as the "system") is provided for autoregressively generating a probability distribution of a plurality of current samples of a filter bank representation of an audio signal. The system may be, for example, a computer-implemented system.

[0009] For the purposes of this disclosure, the current sample corresponds to the current time slot, and each current sample corresponds to a corresponding channel of the filter bank.

[0010] The system comprises a hierarchy of a plurality of neural network processing layers (hereinafter referred to as “tiers”) ordered from a top tier to a bottom tier, wherein each tier has been trained to generate conditioning information based on previous samples of the filter bank representation and, for at least each tier other than the top tier, also based on conditioning information generated by a higher tier in the hierarchy (e.g., the tier directly above in the hierarchy).

[0011] The system further includes an output stage that has been trained to generate a probability distribution based on previous samples corresponding to one or more previous time slots represented by the filter bank and conditioning information generated from the lowest processing layer.

[0012] According to a second aspect of this disclosure, a method is provided for autoregressively generating a probability distribution of multiple current samples in a filter bank representation of an audio signal. Each current sample corresponds to a current time slot, and each current sample corresponds to a corresponding channel of the filter bank. This method can be implemented, for example, using a computer-implemented system according to the first aspect.

[0013] According to a third aspect of this disclosure, a non-transitory computer-readable medium (hereinafter referred to as the "medium") is provided. The medium stores instructions that, when executed by at least one computer processor belonging to computer hardware, are operable to use the computer hardware to implement the system of the first aspect and/or perform the method of the second aspect.

[0014] This disclosure improves upon the prior art in several ways. By operating directly in the filter bank domain, the generative model according to this disclosure (e.g., as used in the system of the first aspect, in the method of the second aspect, and/or as implemented or executed using the medium of the third aspect) is easier to integrate with other signal processing tools (e.g., tools for equalization) that have a frequency domain interface. The model can learn how to eliminate, for example, the aliasing inherent in real-valued filter banks. Because the audio signal is separated into dedicated frequency bands, the model can also learn to suppress, for example, quiet or empty frequency bands, and can handle general audio (e.g., music) more satisfactorily than models operating in the time domain. From another perspective, operating on a filter bank representation is equivalent to inherently processing both the amplitude and the phase of the audio signal, so the synthesis process does not require, for example, spectrogram inversion methods (e.g., the Griffin-Lim method) to approximately recover the phase information. As will be described in more detail later herein, in some embodiments the model can also provide increased parallel processing capabilities during audio generation, generating up to an entire filter bank time slot in each step.

[0015] Other objects and advantages of this disclosure will become apparent from the following description, drawings, and claims. Within the scope of this disclosure, it is contemplated that all features and advantages of the generative model described with reference to, for example, the system of the first aspect are relevant to, and can be used in conjunction with, the method of the second aspect and/or the medium of the third aspect, and vice versa.

Brief Description of the Drawings

[0016] Exemplary embodiments will be described below with reference to the accompanying drawings, in which:

[0017] Figure 1 schematically illustrates a general filter bank;

[0018] Figure 2 schematically illustrates the use of a generative model according to one or more embodiments of the present disclosure in a signal processing scheme;

[0019] Figures 3a and 3b schematically illustrate two or more embodiments of a system for implementing a generative model according to the present disclosure;

[0020] Figures 4a and 4b schematically illustrate two or more embodiments of a system for implementing a generative model according to the present disclosure; and

[0021] Figure 5 schematically illustrates a flowchart of one or more embodiments of the method according to this disclosure.

[0022] In the accompanying drawings, unless otherwise stated, the same reference numerals will be used for the same elements. Unless explicitly stated otherwise, the drawings only show those elements necessary for the illustrated exemplary embodiments, while other elements may be omitted or merely suggested for clarity.

Detailed Description

[0023] A vector random variable of dimension $K$ can be represented by the symbol $X$ and is assumed to have a probability density function $q_X(x)$. In this disclosure, a realization of such a random variable is denoted by $x$, and may, for example, represent a vector of consecutive samples of an audio signal. It is conceivable that the dimension $K$ can be arbitrarily large, and unless stated otherwise, it does not need to be explicitly specified in the following.

[0024] The distribution $q_X(x)$ is, in principle, unknown and is assumed to be described only by training data. The generative model (implemented by a system as described herein) represents a probability density function $p_X(x)$, and the generative model is trained to maximize the distribution match between $q_X(x)$ and $p_X(x)$. Several distribution matching metrics can be used for this purpose. For example, the model can be trained to minimize the Kullback-Leibler (KL) divergence between the (unknown) function $q_X(x)$ and the (trainable) function $p_X(x)$, according to, for example,

[0025] $$D_{KL}(q \,\|\, p) = \int q_X(x) \log q_X(x)\,dx - \int q_X(x) \log p_X(x)\,dx. \quad (1)$$

[0026] Since only the second term in equation (1) is affected by training the model, $D_{KL}$ can be minimized by, for example, minimizing the negative log-likelihood (NLL) defined by

[0027] $$\mathcal{L}_{NLL}(q \,\|\, p) = -\int q_X(x) \log p_X(x)\,dx. \quad (2)$$
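The relationship between the KL divergence and the NLL can be checked numerically on a small discrete example. The following is a minimal sketch, assuming discrete distributions in place of the continuous densities of the disclosure; the function names and the example probabilities are illustrative assumptions, not taken from the patent.

```python
import math

def entropy(q):
    """-sum q log q: the part of the KL divergence that does not depend on p."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0.0)

def cross_entropy(q, p):
    """-sum q log p: the discrete counterpart of the NLL term."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0.0)

def kl_divergence(q, p):
    """sum q log(q / p), computed directly."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

q = [0.5, 0.3, 0.2]         # stands in for the unknown data distribution
p_far = [0.2, 0.2, 0.6]     # a poorly matched model (hypothetical values)
p_near = [0.45, 0.35, 0.2]  # a better matched model (hypothetical values)

for p in (p_far, p_near):
    # Both columns agree: KL = cross-entropy - entropy.
    print(kl_divergence(q, p), cross_entropy(q, p) - entropy(q))
```

Because the entropy term is fixed by the data, ranking candidate models by cross-entropy (the NLL) gives the same ordering as ranking them by KL divergence.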

[0028] However, since $q_X(x)$ is unknown, the expectation of $\log p_X(x)$ often cannot be computed analytically, which can lead to practical problems. To address this, data-driven approximations can be used. For example, it can be assumed that a set of $N$ realizations of the random variable $X$ with probability density $q_X(x)$ (i.e., a set of $N$ vectors $x$) can be obtained from the training data, and this set is denoted by $Q$. The following approximation can then be used:

[0029] $$\mathcal{L}_{NLL}(q \,\|\, p) \approx -\frac{1}{N} \sum_{x \in Q} \log p_X(x). \quad (3)$$

[0030] If N is large enough (thus resembling the form of Monte Carlo sampling), then it is assumed that the approximation is accurate. In practice, the set Q only constitutes a small fraction of the training data and can be referred to as a "minibatch".
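The minibatch approximation of the NLL described above can be sketched in a few lines. This is an illustration under stated assumptions: the "training data" are drawn from a known scalar Gaussian purely so that the result can be checked, and the names `gaussian_log_pdf` and `minibatch_nll` are hypothetical.

```python
import math
import random

def gaussian_log_pdf(x, mu, sigma):
    """Log density of a scalar Gaussian with mean mu and std sigma at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def minibatch_nll(batch, log_p):
    """Monte Carlo estimate of the NLL: -(1/N) * sum of log p(x) over the batch."""
    return -sum(log_p(x) for x in batch) / len(batch)

rng = random.Random(0)
# Hypothetical minibatch Q: draws from an "unknown" q, here N(1.0, 0.5^2).
minibatch = [rng.gauss(1.0, 0.5) for _ in range(2000)]

nll_good = minibatch_nll(minibatch, lambda x: gaussian_log_pdf(x, 1.0, 0.5))
nll_bad = minibatch_nll(minibatch, lambda x: gaussian_log_pdf(x, 0.0, 2.0))
print(nll_good, nll_bad)  # the better matched model gives the lower estimate
```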

[0031] The main characteristic of a trained generative model is that it allows a signal to be reconstructed by random sampling from, for example, the trained (or learned) distribution function $p_X$. In practice, the function $p_X$ is parameterized by a (trainable) neural network model. That is, instead of trying to directly assign a large set of output values of the function $p_X$ to a large set of input values, the network instead attempts to find a set of parameters, such as the mean, the standard deviation, and/or additional moments, that fully describe, for example, a Gaussian distribution or a similar distribution.

[0032] When processing media signals, such as audio signals, $p_X$ can be expected to be complicated, since it needs to capture the statistical dependencies frequently found in such signals. The neural network used to learn the function $p_X$ would therefore need to be very large. To reduce the required neural network size, a recursive model can be used. As a first step in implementing such a recursive model, the signal samples are divided into frames. Here, the notation $x_n$ denotes all samples of the vector $x$ that belong to the $n$-th such frame. Typically, in previous state-of-the-art models, $x_n$ is a scalar (comprising a single sample of the audio signal). As a next step, the function $p_X$ is approximated recursively as

[0033] $$p_X(x) = \prod_{n=0}^{T-1} p(x_n \mid x_{<n}), \quad (4)$$

[0034] where $T$ is the total number of frames, and $x_{<n}$ denotes all frames before frame $n$, i.e., $x_{<n}$ is shorthand for $x_0, x_1, \ldots, x_{n-1}$.

[0035] The above formula allows a model of the conditional probability density $p$ to be constructed instead of a model of the unconditional probability density $p_X$. Compared to unconditional models, conditional models allow for the use of a relatively small number of model parameters (i.e., smaller neural networks). During the training of such models, previously available samples can be used for conditioning. During generation, the model can generate a single frame at a time, conditioned on previously generated samples.
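The frame-by-frame generation described above can be sketched as a loop in which each new frame is sampled from a conditional distribution that depends on the frames generated so far. In this sketch, `toy_conditional` is a hypothetical stand-in for a trained network (a fixed AR(1) rule with scalar frames), chosen only so the example runs; it is not the model of the disclosure.

```python
import math
import random

def toy_conditional(previous):
    """Stand-in for a trained network: maps previous frames to (mu, sigma).

    Hypothetical AR(1) rule, used only for illustration.
    """
    last = previous[-1] if previous else 0.0
    return 0.9 * last, 0.1

def log_gaussian(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def generate(num_frames, rng):
    """Generate one frame at a time, conditioned on the frames generated so far."""
    frames = []
    for _ in range(num_frames):
        mu, sigma = toy_conditional(frames)
        frames.append(rng.gauss(mu, sigma))
    return frames

def joint_log_likelihood(frames):
    """log p(x) as the sum over n of log p(x_n | x_<n)."""
    total = 0.0
    for n, x_n in enumerate(frames):
        mu, sigma = toy_conditional(frames[:n])
        total += log_gaussian(x_n, mu, sigma)
    return total

frames = generate(8, random.Random(0))
print(len(frames), joint_log_likelihood(frames))
```

The same conditional is used for both generation and likelihood evaluation, which is what makes the product form of the density usable for training as well as for sampling.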

[0036] The conditioning is typically extended with additional side information, denoted by $\Theta$, which modifies equation (4) to read

[0037] $$p_X(x) = \prod_{n=0}^{T-1} p(x_n \mid x_{<n}, \Theta). \quad (5)$$

[0038] Depending on the task for which the generative model is used, the additional auxiliary information Θ can represent alternative information relevant to the task. For example, in an encoding task, Θ may include, for instance, quantization parameters (sent in the bitstream) corresponding to the frame to be reconstructed in the current recursive step of the model (i.e., for frame n, depending on one or more previous frames < n). In another example, in a signal enhancement task, Θ may include, for instance, samples of the distorted signal, or features extracted, for example, from samples of the distorted signal.

[0039] For simplicity, $\Theta$ will be omitted from the following discussion. However, it should be understood that once the generative model is applied to a specific problem, $\Theta$ (i.e., additional auxiliary information) can be added to the conditioning.

[0040] To make the model trainable, it is assumed that p has an analytic form. This can be achieved by choosing a prototype distribution for p. For example, simple parametric distributions can be used, including, for example, the logistic distribution, the Laplace distribution, the Gaussian distribution, or similar distributions. As an example, the case of the Gaussian distribution will be discussed below.

[0041] It can be assumed that

[0042] $$p(x_n \mid x_{<n}) = \mathcal{N}(x_n; \mu, \sigma), \quad (6)$$

[0043] where $\mathcal{N}$ denotes a normal (Gaussian) distribution whose parameters, including the mean $\mu$ and the standard deviation $\sigma$, are provided by a neural network and updated on a per-frame basis. To achieve this, the neural network can be trained using, for example, backpropagation and an NLL loss function.
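As a worked step, for a scalar frame $x_n$ and a Gaussian prototype distribution with mean $\mu$ and standard deviation $\sigma$, the contribution of one frame to an NLL loss takes the standard closed form below. This is a textbook identity written out for reference; it is not quoted from the disclosure.

```latex
-\log \mathcal{N}(x_n; \mu, \sigma)
  = \log \sigma + \tfrac{1}{2}\log(2\pi) + \frac{(x_n - \mu)^2}{2\sigma^2}
```

The last term penalizes a predicted mean that lies far from the observed frame, measured in units of $\sigma$, while the first term penalizes an unnecessarily large $\sigma$, so minimizing the loss balances the two.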

[0044] In practice, however, the modeling capability can be improved by using mixture models. In this case, when the prototype distribution is Gaussian, it can instead be assumed that

[0045] $$p(x_n \mid x_{<n}) = \sum_{j=0}^{J-1} w_j \, \mathcal{N}(x_n; \mu_j, \sigma_j), \quad (7)$$

[0046] where $J$ is the number of components in the mixture model, and where $w_j$ are the weights of the mixture model (which are also provided by the neural network). By using several components, the neural network can therefore estimate a more complex probability distribution than a single Gaussian.
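A scalar Gaussian mixture of the kind described above can be evaluated and sampled as follows. This is a minimal sketch: the weights, means and standard deviations are hypothetical values standing in for per-frame network outputs, and the function names are illustrative.

```python
import math
import random

def mixture_pdf(x, weights, means, sigmas):
    """Density of a scalar Gaussian mixture: sum over j of w_j * N(x; mu_j, sigma_j)."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, mu, s in zip(weights, means, sigmas)
    )

def sample_mixture(weights, means, sigmas, rng):
    """Pick a component according to its weight, then sample from that Gaussian."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[j], sigmas[j])

# Hypothetical network outputs for one frame, J = 2 components.
weights, means, sigmas = [0.3, 0.7], [-1.0, 2.0], [0.5, 0.2]

rng = random.Random(0)
print(mixture_pdf(2.0, weights, means, sigmas))
print([round(sample_mixture(weights, means, sigmas, rng), 3) for _ in range(5)])
```

With two well-separated components, the density is bimodal, which a single Gaussian cannot represent.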

[0047] In the case of scalars, for example, it is conceivable to use other prototype distributions to create mixture models, such as logistic, Laplace, etc. In the case of vectors (M dimensions), mixture models can be created by using M scalar distributions and an MxM linear transformation to introduce correlations among the M dimensions.
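The role of the linear transformation in the vector case can be illustrated by drawing M independent scalar samples and mixing them with an M×M matrix. The sketch below covers only this correlation step (a single component, unit Gaussians, M = 2); the matrix `A` and the helper names are assumptions made for illustration.

```python
import math
import random

def matvec(a, z):
    """Multiply the matrix a (list of rows) by the vector z."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in a]

def sample_correlated(a, rng):
    """Draw M independent unit Gaussians, then mix them with the M x M matrix a."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(a))]
    return matvec(a, z)

def correlation(samples):
    """Sample correlation coefficient between dimensions 0 and 1."""
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    c01 = sum((s[0] - m0) * (s[1] - m1) for s in samples)
    v0 = sum((s[0] - m0) ** 2 for s in samples)
    v1 = sum((s[1] - m1) ** 2 for s in samples)
    return c01 / math.sqrt(v0 * v1)

# Hypothetical 2 x 2 transform; A times its transpose has off-diagonal entries 0.8.
A = [[1.0, 0.0], [0.8, 0.6]]
rng = random.Random(0)
samples = [sample_correlated(A, rng) for _ in range(20000)]
print(correlation(samples))  # expected to land near 0.8 for this A
```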

[0048] As discussed earlier herein, previously known generative models for, for example, audio operate either in the time domain or on the spectrum (in the latter case in a lossy manner, due to the inherent need for approximate phase reconstruction), which can complicate integration with other audio signal processing components that only provide a frequency domain interface. To overcome this problem, this disclosure provides a generative model that operates on a filter bank representation of the signal. In what follows, $x_n$ will therefore be a multidimensional time slot of signal samples in the filter bank domain.

[0049] For illustrative purposes, a general filter bank will now be described with reference to Figure 1.

[0050] Figure 1 schematically illustrates an example of a general filter bank 100. In filter bank 100, signal samples $x[n]$ (where $n$ denotes a specific time step) are passed through analysis stage 110, where each sample is provided to multiple channels, each channel including a corresponding analysis filter $H_0(z), H_1(z), \ldots, H_{M-1}(z)$, where $M$ is the total number of such analysis filters and channels. Each analysis filter may, for example, correspond to a specific frequency band. In the smallest filter bank, consisting of only two channels, $H_0(z)$ may, for example, be a low-pass filter and $H_1(z)$ a high-pass filter. If more than two channels are used, the filters between the first and last filters may, for example, be suitably tuned band-pass filters. The output from each analysis filter can then be downsampled by a factor $M$, and the output from analysis stage 110 is a plurality of filter bank samples $x_0[m], x_1[m], \ldots, x_{M-1}[m]$, all of which correspond to the current filter bank time slot $m$. Herein, the samples $x_0[m], x_1[m], \ldots, x_{M-1}[m]$ are referred to as being in the "filter bank domain", or as constituting a "filter bank representation" of the input signal $x[n]$.

[0051] Before the samples $x_j[m]$ are provided to synthesis stage 120 of filter bank 100, various operations can be performed on them (such as additional filtering, extraction of correlation features between different channels, energy estimation within each frequency band/channel, etc.). In synthesis stage 120, the samples may, for example, be upsampled by a factor $M$ before being passed through the corresponding synthesis filters $F_0(z), F_1(z), \ldots, F_{M-1}(z)$ in each channel of filter bank 100. The outputs from synthesis stage 120 may then be summed, for example, to generate an output sample $x'[n]$, which may, for example, represent a time-delayed version of the input sample $x[n]$. Depending on the exact construction of the various analysis filters and synthesis filters, and on any processing performed between analysis stage 110 and synthesis stage 120, the output signal $x'[n]$ may or may not be a perfect reconstruction of the input signal $x[n]$. In many cases, such as in the encoding/decoding of audio signals, the analysis part of the filter bank can be used on the encoder side to extract the various samples in the filter bank domain, and various processing can be applied to them to extract, for example, features that can be used to reduce the number of bits required to adequately reconstruct the signal in the synthesis stage located on the decoder side. For example, the information extracted from the various samples in the filter bank domain may be provided as additional auxiliary information, and the samples in the filter bank domain themselves may be quantized and/or otherwise compressed before being transmitted to the decoder side along with the additional auxiliary information. In another example, the filter bank samples themselves may be omitted, and only the additional auxiliary information is transmitted to the decoder side. The decoder can then reconstruct the signal $x'[n]$ based on the compressed/quantized filter bank samples (if available) and the additional auxiliary information provided, such that the signal satisfactorily resembles the original input signal $x[n]$. The filter bank 100 can be, for example, a quadrature mirror filter (QMF) filter bank, but other suitable types of filter banks are also conceivable. The filter bank can be, for example, a critically sampled filter bank, but other variations are also conceivable. The filter bank can be, for example, a filter bank using real-valued arithmetic, such as a cosine-modulated filter bank, but other variations, such as a complex-exponential-modulated filter bank, are also conceivable.
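The analysis and synthesis stages can be illustrated with the smallest case mentioned above, M = 2. The following is a minimal sketch, assuming Haar filters written in block (polyphase) form, which is an illustrative choice and not the filter bank of the disclosure. It returns one vector of channel samples per time slot and shows that, with no processing between the stages, the input is reconstructed.

```python
import math

SQRT2 = math.sqrt(2.0)

def analyze(x):
    """Two-channel Haar analysis with downsampling by M = 2.

    Returns one vector [x_0[m], x_1[m]] per time slot m
    (low band first, then high band).
    """
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of samples")
    return [
        [(x[2 * m] + x[2 * m + 1]) / SQRT2, (x[2 * m] - x[2 * m + 1]) / SQRT2]
        for m in range(len(x) // 2)
    ]

def synthesize(slots):
    """Two-channel Haar synthesis: block-form equivalent of upsampling by 2,
    filtering each channel and summing the channel outputs."""
    out = []
    for low, high in slots:
        out.append((low + high) / SQRT2)
        out.append((low - high) / SQRT2)
    return out

signal = [0.0, 1.0, 0.5, -0.5, 2.0, 2.0]
slots = analyze(signal)           # three time slots, two channels each
reconstructed = synthesize(slots)
print(slots)
print(reconstructed)
```

A generative model in the filter bank domain would produce the `slots` vectors directly, after which a synthesis step of this kind returns a time-domain signal.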

[0052] With reference to Figure 2, it will now be described in more detail how a model according to this disclosure can be used in a signal processing scheme.

[0053] Figure 2 schematically illustrates a processing scheme 200. In the preprocessing stage 210, it is assumed that a time-domain dataset 211 provides multiple (preferably a large number of) time samples of, for example, audio. For example, the time-domain dataset 211 may include various recordings of various sounds sampled at, for example, a specific sampling rate, such that vectors 212 of time-domain samples of one or more audio signals can be extracted from the dataset 211. These vectors 212 can be considered to include samples commonly referred to as "ground truth". For example, each such sample may represent the amplitude of the audio signal in the time domain at a specific sampling time. The time-domain dataset 211 may also include various features (or additional auxiliary information) 213 associated with the time-domain samples 212, such features including, for example, quantized waveforms in the time domain (e.g., decoded by a conventional codec), quantized spectral data transformed from the time domain (e.g., reconstructed by a decoder of a conventional codec), spectral envelope data, parametric descriptions of the signal, or other information describing the frame. Such features 213 do not necessarily need to be updated for each sample 212, but may instead be updated once for each frame containing multiple time-domain samples 212.

[0054] Time-domain samples 212 are then provided at least to the analysis stage 214 of a filter bank, where (as described above with reference to Figure 1) the signal represented by the time-domain samples is divided into multiple filter bank bands/channels. The resulting samples can be grouped together, for example, for the same time slot $m$, such that multiple filter bank samples, each corresponding to a different filter bank channel, constitute a vector $x_m = [x_0[m], x_1[m], \ldots, x_{M-1}[m]]$, where $M$ is the total number of filter bank channels as described earlier herein. It is conceivable that additional auxiliary information 213' can also be extracted using the filter bank and provided together with (or as a supplement to) the additional auxiliary information 213.

[0055] Filter bank samples 215 provided by filter bank analysis stage 214, together with additional auxiliary information 213 and/or 213', are then provided to filter bank dataset 221. Filter bank dataset 221 defines both a training set of data (from which the model is to learn) and an inference set of data (on which the model can make predictions based on what it has learned from the training set). Typically, the data is split such that the inference set does not include audio signals that are exactly the same as those in the training set, thereby forcing the model to extract and learn more general features of audio rather than simply learning how to reproduce audio signals it has already encountered. Filter bank samples 215 may be referred to as "filter bank ground truth" (real data) samples.

[0056] During training stage 220, real filter bank data samples 222 belonging to the training dataset are provided to system 224 according to this disclosure, which may include, for example, computer hardware for implementing a generative model. Additional auxiliary information 223 may also be provided to system 224. Based on the provided samples 222 (and possibly also based on the provided additional auxiliary information 223), system 224 is iteratively trained to predict filter bank samples for the current time slot m using one or more previously generated filter bank samples from previous time slots < m. During training stage 220, such “previously generated filter bank samples” may also be, for example, previous real data samples. In the most general embodiment, the system learns how to estimate the probability distribution of filter bank samples belonging to the current time slot, and can then obtain actual samples by sampling from such distribution.

[0057] For each current (filter bank) time slot $m$, the model of system 224 sequentially learns how to estimate $p(x_m \mid x_{<m})$, and thereby $p_X(x)$. As described earlier herein, this can be achieved by using backpropagation in an attempt to minimize a loss function (e.g., the loss function $\mathcal{L}_{NLL}$ described above with reference to one or more of equations (2)-(7)).
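Training on ground-truth previous samples can be illustrated with a deliberately tiny model. In this sketch, the "network" is a single coefficient `a` predicting the mean of the current sample from the previous one, the standard deviation is fixed, and a hand-derived gradient stands in for backpropagation. All of these choices, and the synthetic AR(1) data, are assumptions made for illustration only.

```python
import random

def make_sequence(num_slots, rng, a_true=0.8, noise=0.1):
    """Hypothetical ground-truth sequence from a scalar AR(1) process."""
    x = [0.0]
    for _ in range(num_slots - 1):
        x.append(a_true * x[-1] + rng.gauss(0.0, noise))
    return x

def nll_gradient(a, x, sigma):
    """Gradient w.r.t. a of the mean Gaussian NLL with mean a * x[m-1]."""
    pairs = list(zip(x[:-1], x[1:]))
    return -sum((cur - a * prev) * prev for prev, cur in pairs) / (sigma**2 * len(pairs))

def train(x, sigma=0.1, learning_rate=0.1, steps=300):
    """Gradient descent on the NLL, always conditioning on ground-truth x[m-1]."""
    a = 0.0
    for _ in range(steps):
        a -= learning_rate * nll_gradient(a, x, sigma)
    return a

ground_truth = make_sequence(2000, random.Random(0))
a_hat = train(ground_truth)
print(a_hat)  # expected to land near the generating coefficient 0.8
```

During inference, the same predictor would be conditioned on its own previously generated samples instead of on ground-truth samples.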

[0058] After successful training, the model of system 224 is defined by a plurality of optimized model parameters 225 (including, for example, the various weights and biases of the system). After training stage 220 ends, processing scheme 200 can proceed to inference stage 220'. In the inference stage, the trained model of system 224 can generalize and operate on unseen data. In inference stage 220', the model of system 224 can use the optimized model parameters 225 and does not need access to any filter bank real data samples. In some cases, it is conceivable that the model of system 224 is at least allowed access to additional auxiliary information 223', which may correspond to, for example, features of an audio signal that system 224 is to reconstruct by iteratively predicting the probability distribution of the filter bank samples for each time slot. Since the model is expected to generalize and, once deployed, operates in inference stage 220', the additional auxiliary information 223' is different from the additional auxiliary information 223 provided to the model of system 224 during training stage 220. Because the model of system 224 generalizes, it is able to generate audio samples not seen during training by using the additional auxiliary information 223'.

[0059] In post-processing stage 230, filter bank samples 226 reconstructed by sampling from the probability distribution generated by system 224 (model) can be passed, for example, at least through filter bank synthesis stage 231, so that output signal 232 can be generated (e.g., in the time domain).

[0060] In the following, unless explicitly stated otherwise, no distinction will be made between the “system” and the “model of the system”. In other words, phrases such as “a system trained to…” or “a system that learns…” should be interpreted as meaning that the model of the system is trained or learns, the model being implemented using, for example, computer hardware included in the system.

[0061] As can be seen from Figure 2, once trained, the system 224 according to this disclosure can be used in, for example, encoding/decoding schemes. For instance, as described earlier herein, the system 224 can form part of the decoder side and be tasked with predicting current filter bank samples based solely on its own previously generated samples and on additional auxiliary information provided, for example, from the encoder. It is therefore conceivable that a lower bit rate is sufficient for streaming the necessary information over the channel between the encoder and the decoder, as the system 224 can learn on its own how to “fill in the blanks” in the information given to it in order to adequately reconstruct, for example, the audio signal on the decoder side. Once trained, the system 224 can also be used for other tasks, such as signal enhancement, as described earlier herein.

[0062] With reference to Figure 3a and Figure 3b, two or more embodiments of a system according to this disclosure (e.g., the system 224 described with reference to Figure 2) will now be described.

[0063] Figure 3a schematically illustrates a system 300, which is envisaged to be implemented or implementable on one or more computers. System 300 includes a hierarchy 310 of (neural network processing) layers $T_{N-1}, T_{N-2}, \ldots, T_0$. In total, the hierarchy 310 comprises $N$ layers. Although Figure 3a indicates that there are at least three such layers, it is also conceivable that there are fewer than three layers, such as only two layers $T_1$ and $T_0$.

[0064] These layers are ordered hierarchically from top to bottom. In the configuration shown in Figure 3a, the top layer is layer $T_{N-1}$ and the bottom layer is layer $T_0$. As will be described later herein, each layer $T_j$ (where $j$ is an integer between 0 and $N-1$) has been trained to generate conditioning information $c_j$ that is passed down to the next layer in the hierarchy. For example, the conditioning information $c_{N-1}$ generated by layer $T_{N-1}$ is passed to the layer $T_{N-2}$ below, and so on. Preferably, each layer provides conditioning information only to the next lower layer in the hierarchy, but it is also conceivable that, if possible, one or more layers provide conditioning information to layers even further down in the hierarchy.

[0065] Each layer $T_j$ has been trained to generate its conditioning information $c_j$ based on previous filter bank samples $\{x_{<m}\}_j$ generated by system 300 during previous time slots $< m$. As indicated in the figure, the sets of previous filter bank samples provided to the different layers are not necessarily equal. In some embodiments, each layer, or at least some layers, may receive a different number of previous samples (e.g., for different sets of filter bank channels and/or for different sets of previous time slots). In some embodiments, each layer may receive the same set of previously generated filter bank samples.
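The flow of conditioning information through the hierarchy can be sketched as a loop from the top layer to the bottom layer. In this sketch, each layer is a hypothetical fixed rule (an average over a layer-specific number of previous time slots, plus a scaled copy of the conditioning from above) standing in for a trained network; only the data flow is meant to reflect the description.

```python
def make_tier(span):
    """Hypothetical layer: summarizes the last `span` time slots per channel.

    A trained layer would be a neural network; this fixed rule only
    illustrates how conditioning information is combined and passed on.
    """
    def tier(history, c_above):
        recent = history[-span:]
        num_channels = len(recent[0])
        summary = [sum(slot[k] for slot in recent) / len(recent) for k in range(num_channels)]
        if c_above is None:  # top layer: no conditioning from above
            return summary
        return [s + 0.5 * c for s, c in zip(summary, c_above)]
    return tier

def run_hierarchy(tiers_top_to_bottom, history):
    """Pass conditioning information down from T_{N-1} to T_0 and return c_0."""
    c = None
    for tier in tiers_top_to_bottom:
        c = tier(history, c)
    return c

# Three layers T_2, T_1, T_0 that look at 4, 2 and 1 previous slots (M = 2 channels).
tiers = [make_tier(4), make_tier(2), make_tier(1)]
history = [[1.0, 0.0], [1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
c0 = run_hierarchy(tiers, history)
print(c0)  # [7.625, 5.875]
```

Giving each layer a different `span` mirrors the statement that the sets of previous samples provided to the layers need not be equal.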

[0066] In some embodiments, each layer $T_j$ has been trained to generate its conditioning information $c_j$ also based on additional auxiliary information $\{a\}_j$. As indicated in the figure, the content of this additional auxiliary information is not necessarily the same for all layers. In some embodiments, the additional auxiliary information may be different for each layer, or at least different for some layers, while in other embodiments the additional auxiliary information provided to each layer may be equal. As indicated, for example, in Figure 1 by the additional auxiliary information not having the index $m$, the additional auxiliary information does not necessarily change for each time slot. In some embodiments, the additional auxiliary information may be constant for two or more consecutive time slots, while in other embodiments it may change between each time slot. Herein, "different" additional auxiliary information may mean, for example, that the auxiliary information belongs to the same category for each layer but is updated more frequently in one layer than in another. Similarly, "different" may mean, for example, that the auxiliary information provided to one layer does not belong to the same category as the auxiliary information provided to another layer. Here, "category" may include, for example, data associated with the quantized waveform, spectral envelope (energy) data, quantized filter bank coefficients, parametric signal descriptions (e.g., vocoder parameters), and/or other such additional auxiliary information as described herein.

[0067] It should be noted that the generative model can still produce useful results even without conditioning on additional auxiliary information. Examples include the generation of various noises, wind sounds, background sounds, unknown musical fragments, etc., i.e., cases where the sounds lack "meaning" (they do not include speech, lyrics, known melodies, etc.). For example, the generative model can be exposed to various wind noise recordings during training and then learn how to reproduce (during inference) similar "wind-sounding" noise on its own, without requiring additional auxiliary information. This noise can be constructed in a non-repeating manner by randomly sampling from the generated probability distribution.

[0068] Below the bottom layer T0, system 300 includes an additional neural network 380, which may be, for example, a multilayer perceptron (MLP) network. In some embodiments, the MLP may be fully connected or configured in any other manner suited to its purpose. The neural network 380 receives conditioning information c0 from the bottom layer T0. In some embodiments, the neural network 380 may also receive a set {x_<m} of previously generated filter bank samples (which may or may not be equal to one or more of the other such sample sets provided to the layers). In some embodiments, the network 380 may also receive additional auxiliary information {a}_*, which may or may not be equal to one or more of the other such additional auxiliary information provided to the layers. The neural network 380 forms part of the output stage 320 of system 300, from which an estimated probability distribution p(x_m | x_<m) is generated. As illustrated in Figure 3a, in some embodiments the output stage 320 may further include, for example, part or all of the bottom layer T0, while in other embodiments the output stage does not include any layer. Further example embodiments of system 300, in which the output stage includes one or more layers, will be given below.

[0069] As described herein, in order to generate/obtain a probability distribution, the output stage can (e.g., by using a multilayer perceptron network to estimate the corresponding parameters) use a model that includes a single prototype distribution, or a mixture model that includes several such prototype distributions (which may be of different types).
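As a non-limiting illustration of such a mixture model, the following Python sketch turns a vector of unconstrained numbers (standing in for the output of the multilayer perceptron) into the weights, locations, and scales of a mixture of scalar logistic prototype distributions and evaluates its density. The function names, the softmax and softplus mappings, and the choice of the logistic prototype are assumptions made for illustration only and are not prescribed by this disclosure.

```python
import math
from typing import List, Tuple

def softmax(v: List[float]) -> List[float]:
    """Map raw values to positive weights that sum to one."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    total = sum(e)
    return [x / total for x in e]

def softplus(x: float) -> float:
    """Map any real number to a strictly positive scale."""
    return math.log1p(math.exp(x))

def logistic_pdf(x: float, mu: float, s: float) -> float:
    """Density of a scalar logistic distribution (symmetric, overflow-safe form)."""
    z = math.exp(-abs(x - mu) / s)
    return z / (s * (1.0 + z) ** 2)

def mixture_params(raw: List[float], J: int) -> List[Tuple[float, float, float]]:
    """Split 3*J raw values into J (weight, location, scale) triples."""
    weights = softmax(raw[:J])
    locations = raw[J:2 * J]
    scales = [softplus(r) for r in raw[2 * J:3 * J]]
    return list(zip(weights, locations, scales))

def mixture_pdf(x: float, params: List[Tuple[float, float, float]]) -> float:
    """Weighted sum of the prototype densities."""
    return sum(w * logistic_pdf(x, mu, s) for w, mu, s in params)

# Hypothetical MLP output for a mixture with J = 2 components.
raw_output = [0.2, -0.1, 0.0, 1.5, 0.3, -0.4]
params = mixture_params(raw_output, J=2)
density = mixture_pdf(0.5, params)
nll = -math.log(density)  # negative log-likelihood of the value 0.5
```

A mixture with prototype distributions of different types could be sketched in the same way by swapping `logistic_pdf` for a per-component density.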

[0070] In some embodiments, system 300 may further include means for generating a plurality of current samples of the filter bank representation (i.e., current filter bank samples) by sampling from the generated probability distribution.

[0071] In some embodiments, each layer T_j may include one or more convolutional networks or modules configured to receive previously generated filter bank samples. Each such convolutional network/module can learn to extract features from the filter bank samples. In some embodiments, such a convolutional network is configured to use a kernel whose size decreases with the layer order in the hierarchy (i.e., from the top to the bottom). For example, the kernel size may decrease in the temporal dimension for lower layers, and thus allow for an increase in temporal resolution. In some embodiments, the kernel size does not change in the frequency dimension, but such variations are also conceivable. An example implementation in which the kernel size decreases for lower layers will be described later with reference to, for example, Figure 4a.

[0072] In some embodiments, the various layers T_j are configured to operate recurrently, which is achieved by including one or more recurrent neural networks in each layer. For example, each layer may include at least one recurrent unit (or module) that can be configured to receive the sum of the outputs from the convolutional networks as its input. At least each layer except the lowest/bottom layer may also include at least one (learned) upsampling module that is configured to take the output from the at least one recurrent unit as its input and to generate the conditioning information c_j as its output. In some embodiments, the lowest layer may also include at least one such (learned) upsampling module. If, for example, the lowest layer does not include an upsampling module, the output c0 from the lowest layer may, for example, be the output from the at least one recurrent unit in the lowest layer T0.

[0073] In such recurrent networks, the internal (latent) states of the network are remembered (to a greater or lesser extent), allowing new latent states to be computed based on one or more previous states. This use of "memory" can be beneficial when processing sequential data (e.g., filter bank (audio) samples arriving one after another in a sequence of time slots).

[0074] Figure 3b schematically illustrates another embodiment of system 300, wherein the output stage 320 includes the bottom layer, and wherein the output stage is divided into several sublayers 390-0, ..., 390-(L-1) (where L is the total number of sublayers). In the output stage 320, each sublayer 390-j includes a sublayer T_{0,j} of layer T0. The sublayers can be executed sequentially, and each sublayer can be trained to generate a probability distribution for one or more current samples corresponding to a proper subset (i.e., at least one but not all) of the channels of the filter bank. For example, the proper subset may be different for each sublayer. The proper subsets can be overlapping (i.e., at least one subset includes channels that are also included in another subset) or non-overlapping (each channel is included in only one of the subsets). For at least all sublayers except the first executed sublayer, each sublayer can be trained to generate its probability distribution also based on the current samples generated by one or more previously executed sublayers. Each sublayer T_{0,j} is provided with a set {x_≤m}_{0,j} of previously generated samples. This set may also (as will be described below) include one or more currently generated filter bank samples for channels processed by the previous sublayers 390-<j. The same applies to neural network 380-j, which may also be provided with such a set of previously generated filter bank samples. The set may include filter bank samples generated during the current step, but for channels that are lower (in frequency) than those being processed by the sublayer in question. The current samples for lower channels can be obtained, for example, by using masked kernels in the convolutional layers of the various sublayers, as will be described later herein with reference to Figure 4a. As described above, it can also be envisioned that each sublayer T_{0,j} and each neural network 380-j also receives a corresponding set of additional auxiliary information (denoted {a}_{0,j} for sublayer T_{0,j}). This additional auxiliary information may be the same as or different from the corresponding additional auxiliary information provided to higher layers in the hierarchy 310.

[0075] In some embodiments, the first executed sublayer 390-0 can generate one or more current samples corresponding to at least the lowest channel of the filter bank, and the last executed sublayer 390-(L-1) can generate one or more current samples corresponding to at least the highest channel of the filter bank. Within each sublayer 390-j, the corresponding sublayer T_{0,j} provides conditioning information c_{0,j} to the subsequent neural network 380-j (e.g., an MLP).

[0076] A more detailed example of how the output stage 320 can be subdivided into multiple sublayers will now be given with reference to Figure 4a and Figure 4b.

[0077] Figure 4a schematically illustrates a system 400 including a hierarchy 410 of layers. In the example system 400, the hierarchy 410 includes three layers: T2, T1, and T0. In other examples, it is conceivable that the hierarchy may include fewer or more than three layers.

[0078] In each layer, the previously generated filter bank samples {x_<m} are received together with additional auxiliary information components a_m and b_m. In the specific example given in Figure 4a, auxiliary information component a_m includes envelope energy, and auxiliary information component b_m includes a quantized representation of the samples (e.g., obtained from a reconstruction by a conventional codec). Here, the additional auxiliary information is updated for each time slot (hence the index "m") and can, for example, be the same for all layers. In some embodiments, it is conceivable that system 400 and the other systems described herein also use a "look-ahead" with respect to the additional auxiliary information (if provided), meaning that additional auxiliary information for one or more "future" time slots > m is also used and provided to the various layers. In some embodiments, it is even conceivable that additional auxiliary information is provided to the layers only for one or more future frames.

[0079] For illustrative purposes, it can be assumed that the filter bank samples correspond to 16 channels. For example, a 16-channel QMF filter bank may have been used to provide the filter bank samples of the training dataset used for training system 400. It is therefore assumed that each filter bank sample vector x_m includes 16 components, each corresponding to one of the 16 QMF filter bank channels.

[0080] In each layer, the received set of previous samples {x_<m} can include multiple recently generated sample vectors. These can, for example, be the Z most recently generated sample vectors, so that {x_<m} = {x_{m-Z}, x_{m-Z+1}, ..., x_{m-1}}. In general, if each sample vector represents 16 filter bank channels, the set of previous samples {x_<m} comprises 16*Z channel elements. It is conceivable that, in this example, each layer receives the same set of previous samples. It is also conceivable that each layer receives a different set of previous samples. For example, since the temporal resolution increases as the hierarchy 410 of layers is traversed downwards, some embodiments of system 400 may provide fewer previous samples to lower layers. Here, "fewer previous samples" may mean, for example, only the last Z' samples (where Z' < Z), while higher layers (e.g., layer T2) may receive all Z available previous samples. It should also be noted that, for example, lower layers may run/execute more frequently than higher layers.
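As a non-limiting illustration, the assembly of the set of previous samples for a higher and a lower layer can be sketched in Python as follows. The helper `previous_samples`, the toy `history`, and the values Z = 8 and Z' = 4 are hypothetical and chosen only to make the example concrete.

```python
from typing import List

M = 16  # number of filter bank channels (the 16-channel example above)

def previous_samples(history: List[List[float]], m: int, Z: int) -> List[List[float]]:
    """Return {x_(m-Z), ..., x_(m-1)}, truncated at the start of the signal."""
    start = max(0, m - Z)
    return history[start:m]

# Hypothetical history: 10 time slots, each a vector with M channel values.
history = [[float(slot)] * M for slot in range(10)]

context_top = previous_samples(history, m=10, Z=8)  # higher layer: all Z slots
context_low = previous_samples(history, m=10, Z=4)  # lower layer: last Z' < Z slots

# Each vector holds 16 channels, so the higher layer sees 16 * Z channel elements.
num_elements = sum(len(x) for x in context_top)
```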

[0081] The top layer T2 includes a convolutional network 432, which takes the set of previous samples {x_<m} as its input. Convolutional network 432 may, for example, include 16 channels and use a kernel size of 15. Layer T2 further includes another convolutional network 442, which takes the envelope energy a_m as its input. Convolutional network 442 may, for example, include 19 channels. Layer T2 further includes convolutional network 452, which takes the quantized samples b_m as its input. Convolutional network 452 may, for example, include 16 channels. The kernel sizes and strides of convolutional networks 442 and 452 may be adapted, for example, to the temporal resolutions of the provided additional auxiliary information components a_m and b_m. In general, the exact kernel sizes and strides of the various convolutional networks 432, 442, and 452 (and of the corresponding convolutional networks in lower layers) can be adjusted based on several factors, including, for example, the frame size of the provided previous samples {x_<m}, the temporal resolutions of the additional auxiliary information a_m and b_m, etc. If the number of output samples in the temporal (sequence) direction differs between the convolutional networks of the same layer T_j (given the choice of kernel sizes and strides for networks 432, 442, and 452), it is conceivable that one or more upsampling units are provided so that the outputs from the various convolutional networks can be summed as desired. For a specific combination of quantized samples and envelope conditioning, the quantized samples may be localized owing to the separate initial processing of a_m and b_m in layer T2. When, for example, the quantized samples are also provided to lower layers, this localization can be further improved down the hierarchy 410 of layers as the kernel sizes become shorter.

[0082] It should be noted that the kernel sizes given here are for illustrative purposes only. Suitable values can be found through limited experimentation and may depend on, for example, the type of audio to be processed (e.g., speech, general audio, music, single instrument, etc.). The various convolutional networks can, for example, use a nominal stride equal to the frame size of each layer, i.e., depending on the exact number of previous sample vectors provided to each layer. The number of output channels of the various convolutional networks can, for example, correspond to the number of hidden dimensions used in the model, and suitable numbers can likewise be found based on, for example, the type of audio to be processed.

[0083] The outputs from all convolutional networks 432, 442, and 452 are then summed together and fed as input to a recurrent neural network (RNN) 462. The RNN 462 can be implemented, for example, using one or more stateful network units (e.g., gated recurrent units (GRUs), long short-term memory units (LSTMs), quasi-recurrent neural networks, Elman networks, etc.). A key property of this RNN is its ability to remember (at least to some extent) its hidden, latent state from one time slot to the next.
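As a non-limiting illustration of the "sum, then recur" structure, the following Python sketch sums three feature vectors (stand-ins for the outputs of the three convolutional networks) and feeds the sum to a minimal Elman-type recurrent unit whose latent state is carried from one time slot to the next. The hidden size, the random weights, and all function names are hypothetical; a GRU or LSTM could take the place of the Elman unit.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]

def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def elman_step(u: Vector, h_prev: Vector, W_in: Matrix, W_rec: Matrix, bias: Vector) -> Vector:
    """One Elman update: h = tanh(W_in u + W_rec h_prev + bias)."""
    a = matvec(W_in, u)
    r = matvec(W_rec, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, bias)]

def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
H = 4  # hypothetical hidden dimension
W_in, W_rec = random_matrix(H, H, rng), random_matrix(H, H, rng)
bias = [0.0] * H

h = [0.0] * H  # latent state remembered between time slots
states = []
for slot in range(3):
    # Stand-ins for the outputs of the three convolutional networks.
    feat_x = [rng.uniform(-1.0, 1.0) for _ in range(H)]  # from previous samples
    feat_a = [rng.uniform(-1.0, 1.0) for _ in range(H)]  # from envelope energy
    feat_b = [rng.uniform(-1.0, 1.0) for _ in range(H)]  # from quantized samples
    summed = [x + a + b for x, a, b in zip(feat_x, feat_a, feat_b)]
    h = elman_step(summed, h, W_in, W_rec, bias)
    states.append(h)
```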

[0084] In general, it is envisioned that at least some of the convolutional networks (e.g., 432) can use the maximum possible number of groups in order to, for example, keep all filter bank channels separate until a summation stage at the end of the convolutional network. In other words, each channel can be convolved with its own set of filters. This can provide the system disclosed herein with an improved ability to model/learn inter-channel correlations. Although not explicitly illustrated in Figure 4a, it can be envisioned that, for example, convolutional network 432 includes a convolutional component followed by a summation component.
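As a non-limiting illustration of a convolution that uses the maximum number of groups, the following Python sketch convolves each channel along time with its own kernel and only afterwards sums across channels. The function names, the toy input, and the kernel values are hypothetical; one filter per channel is used here for brevity, whereas a set of filters per channel is equally conceivable.

```python
from typing import List

def depthwise_conv(frames: List[List[float]], kernels: List[List[float]]) -> List[List[float]]:
    """Per-channel 'valid' convolution along time.

    frames[t][c] is the sample of channel c in time slot t, and
    kernels[c][k] is tap k of the filter that belongs to channel c only.
    """
    T, C, K = len(frames), len(kernels), len(kernels[0])
    return [
        [sum(kernels[c][k] * frames[t + k][c] for k in range(K)) for c in range(C)]
        for t in range(T - K + 1)
    ]

def summation_stage(features: List[List[float]]) -> List[float]:
    """Channels are kept separate until this final sum."""
    return [sum(row) for row in features]

# Toy input: 5 time slots, 3 channels, with frames[t][c] = 10 * c + t.
frames = [[10.0 * c + t for c in range(3)] for t in range(5)]
kernels = [
    [1.0, 0.0, 0.0],  # channel 0: picks the oldest slot in the window
    [0.0, 0.0, 1.0],  # channel 1: picks the newest slot in the window
    [1.0, 1.0, 1.0],  # channel 2: sums the window
]
features = depthwise_conv(frames, kernels)
summed = summation_stage(features)
```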

[0085] The output from RNN 462 is provided as input to upsampling stage 472, which can be implemented, for example, using a transposed convolutional network. It is conceivable that the network itself can learn precisely how to perform this upsampling, i.e., the upsampling provided by stage 472 can be “learned upsampling.”
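As a non-limiting illustration of upsampling by a transposed convolution, the following Python sketch spreads each input value over the output with a stride, weighted by a kernel. The function name and the kernel values are hypothetical; in a learned upsampling stage the kernel would be a trained parameter rather than a fixed one.

```python
from typing import List

def transposed_conv1d(x: List[float], kernel: List[float], stride: int) -> List[float]:
    """1-D transposed convolution: every input value adds a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out

coarse = [1.0, 2.0, 3.0]  # e.g., recurrent unit outputs at a low time resolution

# With this particular kernel the result equals nearest-neighbour upsampling by 2.
repeated = transposed_conv1d(coarse, kernel=[1.0, 1.0], stride=2)

# A different kernel gives a different interpolation; training would pick the values.
shaped = transposed_conv1d(coarse, kernel=[1.0, 0.5], stride=2)
```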

[0086] The output from upsampling stage 472 is provided to the next layer T1 as conditioning information c2. Unless otherwise stated below, layer T1 is envisioned to include components of the same type as the top layer T2, and it is conceivable that everything described above with reference to layer T2 also applies to layer T1 and to the lowest layers T_{0,0}, ..., T_{0,L-1}.

[0087] One difference between the layers is that at least some of the convolutional networks operate with a smaller kernel size than the corresponding convolutional networks in the layer above. For example, convolutional network 431 may still include 16 channels, but use a kernel size of, for example, 5. Similarly, convolutional network 451 may still include 16 channels, but use a kernel size of, for example, 15. In some embodiments, convolutional network 441 may be envisioned to differ from its corresponding component in the top layer T2, depending on, for example, the exact additional auxiliary information provided.

[0088] After summing the outputs of convolutional networks 431, 441, and 451, and after further processing via RNN 461 and a learned upsampling stage 471, layer T1 outputs conditioning information c1, which is passed down through the hierarchy to the next layer(s).

[0089] In system 400, the lowest layer is further divided into multiple sequentially executed sublayers 490-j (where j = 0, ..., (L-1), and L is the total number of such sublayers). Each sublayer 490-j includes a corresponding sublayer T_{0,j} and a sub-output stage 480-j. In the embodiment of Figure 4a, it is assumed that the conditioning information c1 is the same for all sublayers 490-j. In other embodiments, it is envisioned that the conditioning information from layer T1 may be different for some or all sublayers, and is then denoted, for example, c_{1,j} for sublayer 490-j. The same possibility applies to the system described later herein with reference to Figure 4b.

[0090] Here, "executed sequentially" means that processing first occurs in sublayer 490-0, then in the next sublayer, and so on, until processing occurs in the last sublayer 490-(L-1). In sublayer T_{0,j}, it is envisioned that the layer can also access the samples that have been computed (or generated) so far for time slot m, with the kernel of the corresponding convolutional network 430-j "masked" accordingly, so that each sublayer computes one or more of the total number of channels. The first sublayer 490-0 estimates one or more probability distributions for the current samples associated with the lowest channel or subset of channels, and so on for each following sublayer, up to the last sublayer 490-(L-1), which is responsible for estimating one or more probability distributions for the current samples associated with the highest channel or subset of channels.

[0091] For example, sublayers 490-j can be configured such that each sublayer processes the same number of channels. If there are, for example, a total of 16 channels and, for example, L = 4 sublayers, then the first sublayer can be responsible for channels 0-3, the next sublayer for channels 4-7, and so on, with the last sublayer responsible for channels 12-15. Other divisions of responsibility for the various channels among sublayers 490-j are, of course, also possible. By using a masked kernel in convolutional network 430-j, the convolutional network can be arranged such that, even if placeholders (with values close to the "true" values, or at least not zero) are provided for the current samples of channels that have not yet been computed, the convolution performed by convolutional network 430-0 of the first sublayer T_{0,0} does not take the placeholder values into account. Therefore, the computation of, for example, the filter bank samples of the first four channels 0-3 depends only on previously generated samples from one or more previous time slots. The corresponding convolutional network 430-1 (not shown in Figure 4a) of the next sublayer T_{0,1} (not shown in Figure 4a) has a kernel that allows it to take into account the samples of the channels generated from the probability distribution produced by the first sublayer 490-0, and so on, up to the last sublayer, in which convolutional network 430-(L-1) of layer T_{0,L-1} is allowed to consider all previously generated samples of all channels plus the currently generated samples of all channels below the channels to be processed by sublayer 490-(L-1).
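As a non-limiting illustration of the channel assignment and masking described above, the following Python sketch computes, for the 16-channel, four-sublayer example, which channels each sublayer is responsible for and which current-slot channels it is allowed to see. The function names and the placeholder value are hypothetical, and an equal split of channels among sublayers is assumed.

```python
from typing import List

M = 16  # total number of filter bank channels
L = 4   # number of sequentially executed sublayers

def sublayer_channels(j: int, M: int, L: int) -> range:
    """Channels handled by sublayer j under an equal, non-overlapping split."""
    size = M // L
    return range(j * size, (j + 1) * size)

def current_slot_mask(j: int, M: int, L: int) -> List[int]:
    """1 for current-slot channels sublayer j may see (those below its own), else 0."""
    first_own = j * (M // L)
    return [1 if c < first_own else 0 for c in range(M)]

def masked_current_slot(x_m: List[float], j: int) -> List[float]:
    """Zero out placeholders and not-yet-generated channels before the convolution."""
    mask = current_slot_mask(j, len(x_m), L)
    return [value * keep for value, keep in zip(x_m, mask)]

# Current slot holding non-zero placeholders in every channel.
placeholder_slot = [0.7] * M
seen_by_first = masked_current_slot(placeholder_slot, j=0)   # sees nothing of slot m
seen_by_second = masked_current_slot(placeholder_slot, j=1)  # sees channels 0-3 only
```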

[0092] It should be noted that, where a convolutional network is said herein to have a "masked kernel," this feature may only be important during training, when the entire system has access to real data samples across the entire channel range, but during which, for example, a sublayer should not "see" such samples except for its specific proper subset of channels. During inference (i.e., after the generative model has been trained), however, such "samples belonging to frequency bands outside the band associated with a certain sublayer" will not exist (owing to the sequential execution of the sublayers), or can at least be assumed to be zero. Therefore, a masked kernel may not be necessary during inference.

[0093] As an example, using the above configuration of samples associated with a total of 16 channels, where there are four sub-layers, and where each sub-layer processes 4 channels, the filter bank samples for the current time slot m can be computed as follows:

[0094] The set {x_<m} of previously generated samples (including samples from all channels) is provided to convolutional network 430-j of the first sublayer (j = 0) together with placeholders for the current samples x_m to be computed in the current step. This convolutional network has a masked kernel that ignores all placeholders for the current samples but takes into account all provided previously generated samples across all channels. As a result, sublayer T_{0,0} outputs conditioning data c_{0,0}, which is provided to the sub-output stage 480-0 (which may be a multilayer perceptron, as described earlier herein). The sub-output stage 480-0 generates a probability distribution p(x_{c,m} | x_{:,<m}), where "x_{c,m}" denotes a vector of samples for time slot "m" and the set of channels "c", and where "x_{:,<m}" denotes a vector of samples from one or more previous time slots < m across all channels (the colon ":" indicating all channels). For the first sublayer 490-0, c = 0, 1, 2, 3. The probability distribution is conditioned on previously generated samples from previous time slots < m and all 16 channels. By sampling from this distribution, the samples x_{0,m}, x_{1,m}, x_{2,m}, and x_{3,m} of the first 4 channels of the current time slot m can be generated. The placeholders for these samples can then be replaced with the actual generated values.

[0095] In the next sublayer 490-1 (not shown), the kernel of convolutional network 430-1 (not shown) is masked such that it allows access to all provided previous samples from previous time slots < m but, for the current time slot m, only to the first four channels just generated by the first sublayer 490-0. Conditioning information c_{0,1} is generated and provided to the sub-output stage (not shown) of this sublayer, which then generates the probability distribution p(x_{c,m} | x_{<c,m}; x_{:,<m}), where c = 4, 5, 6, 7 and <c = 0, 1, 2, 3. In other words, this probability distribution is valid for the current samples belonging to channels 4-7, but is conditioned on previously generated samples from previous time slots < m, and also on the samples generated for the channels just processed by the first sublayer 490-0. The current samples for channels 4-7 are then obtained by sampling from this distribution and inserted to replace their corresponding placeholders, and so on. Each sublayer is executed sequentially in the same manner until the last sublayer has generated a probability distribution for the last 4 channels (i.e., such that c = 12, 13, 14, 15), which is conditioned on previously generated samples from previous time slots < m, and also on all previously generated samples for the lower channels <c = 0, ..., 11 in the current time slot m. After all sublayers 490-j have been executed sequentially, the probability distributions for all current samples of all channels have been obtained, and all such samples can be, or have been, generated by sampling from the corresponding generated probability distributions.
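As a non-limiting illustration of this sequential procedure, the following Python sketch generates one time slot group by group, replacing placeholders with sampled values as it goes. The function `toy_sublayer` is a hypothetical stand-in for a trained sublayer and sub-output stage and simply returns Gaussian parameters from an arbitrary rule; only the control flow is meant to mirror the description.

```python
import random
from typing import List, Optional, Tuple

M, L = 16, 4
GROUP = M // L  # channels handled per sublayer

def toy_sublayer(prev_slots: List[List[float]],
                 visible: List[float],
                 channels: range) -> List[Tuple[float, float]]:
    """Hypothetical stand-in: (mean, std) per channel from the visible context."""
    past = sum(prev_slots[-1]) / len(prev_slots[-1])
    lower = sum(visible) / len(visible) if visible else 0.0
    return [(0.5 * past + 0.5 * lower, 0.1) for _ in channels]

def generate_slot(prev_slots: List[List[float]], rng: random.Random) -> List[float]:
    x_m: List[Optional[float]] = [None] * M  # placeholders for time slot m
    for j in range(L):
        channels = range(j * GROUP, (j + 1) * GROUP)
        # Only channels already generated in this slot are visible to sublayer j.
        visible = [v for v in x_m[: j * GROUP] if v is not None]
        params = toy_sublayer(prev_slots, visible, channels)
        for c, (mean, std) in zip(channels, params):
            x_m[c] = rng.gauss(mean, std)  # sample, then replace the placeholder
    return [float(v) for v in x_m]

previous = [[0.0] * M, [1.0] * M]  # two hypothetical previous time slots
x_m = generate_slot(previous, random.Random(0))
```

Appending `x_m` to `previous` and calling `generate_slot` again would continue the autoregressive generation for the next time slot.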

[0096] In the various sublayers 490-j of the output stage 420, the components of sublayer T_{0,j} are essentially the same as those of the layers above. For example, convolutional network 430-j is similar to the corresponding convolutional network in layer T1, except for the masked kernel of 430-j. Similarly, the kernel size used in convolutional networks 430-j and 450-j is smaller than the kernel size in the layer above (layer T1). As mentioned earlier, convolutional network 440-j can be the same as or different from the corresponding convolutional networks in higher layers. The same applies to the RNN 460-j and the (learned) upsampling stage 470-j, which outputs the conditioning information c_{0,j}. In some embodiments, it may be assumed, for example, that sublayer T_{0,j} and the corresponding sub-output stage 480-j operate at the same (e.g., the highest possible) time resolution. In this case, it is conceivable that, for example, the corresponding upsampling stage 470-j is not required.

[0097] As illustrated in, for example, Figure 4a, the various sub-output stages 480-j can also receive corresponding sets of previously generated samples. In other embodiments, it is conceivable that such a set of previously generated samples is not required for the various sub-output stages 480-j.

[0098] Figure 4b illustrates another embodiment of the system according to this disclosure, wherein an additional recurrent unit 464 is provided. The recurrent unit 464 may be, for example, a GRU or LSTM, or any other type already described herein, and is common to all sublayers 490-j. The recurrent unit 464 can further assist in predicting samples of higher filter bank channels from lower filter bank channels. In contrast to the various recurrent units 460-j (which may be assumed to operate in the "time direction"), the recurrent unit 464 operates in the "layer direction". For each sublayer 490-j, the sum of the outputs from the various convolutional networks 430-j, 440-j, 450-j and the conditioning information c1 from the layer above is split into two parts. One part is mixed with the output of recurrent unit 460-j and provided as input d_j to recurrent unit 464. The other part is, as described earlier herein, provided directly as input to recurrent unit 460-j. It can be assumed that recurrent unit 464 receives such an input d_j from each sublayer 490-j and updates its internal state for each such input d_j. The output d_* from recurrent unit 464 (i.e., its current state) is then fed as additional auxiliary information into each sub-output stage 480-j. In some embodiments, the output d_* from recurrent unit 464 can be used to replace the conditioning information output c_{0,j}.

[0099] As described at the outset, state-of-the-art models typically operate on scalar samples, where only a single value (e.g., a sample of a mono audio signal in the time domain) is computed for each time slot. This allows simple scalar prototype distributions (e.g., Gaussian, logistic, Laplace, etc.) to be used to create mixture models, as described above with reference to equation (7). However, this disclosure proposes operating the generative model in the filter bank domain, where the vector for each time slot m is multidimensional, and where the dimension is governed by the number of filter bank channels. For example, as described above, in this disclosure each vector x_m comprises multiple components, each corresponding to one of the filter bank channels. In other words, this disclosure can rely on multi-dimensional time slots, where each time slot comprises multiple frequency bands.

[0100] Therefore, the generative model of this disclosure can be configured to output multiple samples at once (from the same time slot). For example, in the embodiments described with reference to Figure 3b and Figures 4a/4b, if there are as many sublayers as channels, it is conceivable that each sublayer involves only a single channel, and reconstruction occurs sequentially from 490-0 to 490-(L-1). However, the more common scenario is that multiple bands need to be processed in a single step (e.g., when there are fewer sublayers than channels, or when there is only a single layer/MLP in the output stage, as described with reference to Figure 3a). To reconstruct inter-band/inter-channel correlations in this context, the generative model may need to rely on a multivariate prototype distribution. This allows the reconstruction of these bands to be performed in a single step. This can provide some computational advantages because the MLP sublayers of the model can then be executed in parallel (or, alternatively, parallel sublayers can be combined into a single sublayer, so that only a single sublayer needs to be executed). For example, the model can be configured to reconstruct all bands in a single step, eliminating the need for sequential execution of MLP sublayers. The model can also be configured to output fewer bands at a time than the total number of channels, which would then require more than one MLP sublayer operating in sequence.

[0101] While other examples may be considered, the multivariate Gaussian case can be considered first. It can be assumed that, for a single time slot m, the output of the generative model corresponds to the parameters of an M-dimensional frame, where M is the number of filter bank channels. In some cases, M may include all available filter bank channels. In other cases, however, M, as used below, can be considered to include not all but at least two such channels. The Gaussian mixture model in this case may include J components and can be written as:

[0102] $$p(x_m \mid x_{<m}) = \sum_{j=1}^{J} w_j \, \mathcal{N}(x_m;\, \mu_j, \Sigma_j)$$

[0103] For the j-th component, w_j is a scalar weight, μ_j is the M-dimensional mean, and Σ_j is an M x M covariance matrix. Note that Σ_j needs to be positive semidefinite. To impose this constraint, instead of providing Σ_j directly, the generative model implemented by the system of this disclosure provides the parameters of its Cholesky decomposition U_j, such that:

[0104] $$\Sigma_j = U_j U_j^{T}$$

[0105] where U_j is a lower triangular matrix with a non-zero main diagonal. This is sufficient to guarantee that Σ_j is invertible, thus allowing the optimization of, for example, an NLL loss function.
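As a non-limiting illustration of why a triangular factor with a non-zero main diagonal is convenient, the following Python sketch evaluates the negative log-likelihood of a single multivariate Gaussian component directly from a lower triangular factor U, assuming a covariance of the form U U^T. No matrix inverse is formed: a forward substitution and the logarithms of the diagonal entries suffice. The function names and the numerical values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]

def forward_substitution(U: Matrix, b: List[float]) -> List[float]:
    """Solve U z = b for lower triangular U with a non-zero main diagonal."""
    z: List[float] = []
    for i in range(len(b)):
        partial = sum(U[i][k] * z[k] for k in range(i))
        z.append((b[i] - partial) / U[i][i])
    return z

def gaussian_nll(x: List[float], mu: List[float], U: Matrix) -> float:
    """Negative log-likelihood of x under N(mu, U U^T)."""
    M = len(x)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = forward_substitution(U, diff)  # z^T z = diff^T (U U^T)^-1 diff
    log_det = 2.0 * sum(math.log(abs(U[i][i])) for i in range(M))
    return 0.5 * (M * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

# Two-channel toy example with a correlated covariance [[1, 1], [1, 2]].
U = [[1.0, 0.0],
     [1.0, 1.0]]
nll = gaussian_nll([1.0, 3.0], [0.0, 0.0], U)
```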

[0106] However, a potential drawback of this method may be the large number of model parameters, which grows like ... Such a growth rate is likely undesirable. To address this potential problem, this disclosure proposes using commonly used filter banks (such as QMF, MDCT, DCT, etc.) that have decorrelating properties, meaning that as M increases, the various dimensions of the frame (time slot) become increasingly decorrelated (owing, for example, to the energy compaction that occurs in these filter banks). This allows structure to be imposed on Σ_j, and therefore also on U_j.

[0107] For illustrative purposes, the 16-dimensional case (e.g., corresponding to a filter bank with 16 channels) will be used again. For this case, this disclosure proposes assuming that Σ_j has at least some of its diagonals equal to zero, i.e., such that

[0108]

[0109] And therefore

[0110]

[0111] where c_{1,2}, ..., c_{15,16} are scalar parameters provided by the network to parameterize Σ_j. In general, a U_j having a small number of diagonals, greater than 1 but less than M, may be preferred.
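As a non-limiting illustration of the saving obtained by such a banded structure, the following Python sketch builds a lower triangular factor U with only a main diagonal and one sub-diagonal for the 16-dimensional case, forms U U^T, and confirms that the result has no entries outside the three central diagonals. The numerical values and the names `diag` and `sub` are hypothetical placeholders for parameters that the network would provide; the exact matrices of this disclosure are not reproduced here.

```python
from typing import List

Matrix = List[List[float]]

def bidiagonal_factor(diag: List[float], sub: List[float]) -> Matrix:
    """Lower triangular U with main diagonal `diag` and first sub-diagonal `sub`."""
    M = len(diag)
    U = [[0.0] * M for _ in range(M)]
    for i in range(M):
        U[i][i] = diag[i]
        if i > 0:
            U[i][i - 1] = sub[i - 1]
    return U

def gram(U: Matrix) -> Matrix:
    """Return U U^T."""
    M = len(U)
    return [[sum(U[i][k] * U[j][k] for k in range(M)) for j in range(M)]
            for i in range(M)]

M = 16
diag = [1.0 + 0.1 * i for i in range(M)]  # hypothetical non-zero main diagonal
sub = [0.5] * (M - 1)                     # hypothetical neighbour couplings
U = bidiagonal_factor(diag, sub)
Sigma = gram(U)

# Largest magnitude outside the main diagonal and its two neighbours.
outside_band = max(abs(Sigma[i][j])
                   for i in range(M) for j in range(M) if abs(i - j) > 1)

full_params = M * (M + 1) // 2   # unrestricted lower triangular factor
banded_params = M + (M - 1)      # main diagonal plus one sub-diagonal
```

For M = 16 the unrestricted factor has 136 free entries per mixture component, whereas the two-diagonal factor has 31.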

[0112] The parameter U_j can be further decomposed into

[0113]

[0114] where

[0115]

[0116] and

[0117]

[0118] This improves numerical stability when finding the inverse of Σ_j. In this case, d_{1,2}, ..., d_{15,16} and σ_1, ..., σ_16 are scalar parameters provided by the network for parameterizing Σ_j.
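As a non-limiting illustration of how a decomposition of this kind can simplify inversion, the following Python sketch assumes one possible form, U = A · diag(σ), where A is a lower triangular matrix with a unit main diagonal and a single sub-diagonal d. This particular form, the function names, and the numerical values are assumptions made for illustration; the factorization actually used in this disclosure may differ. Under this assumption, solving with U requires only a unit-diagonal substitution followed by element-wise divisions, and the log-determinant of U U^T depends on σ alone.

```python
import math
from typing import List

def compose_factor(d: List[float], sigma: List[float]) -> List[List[float]]:
    """Build U = A * diag(sigma), with A unit lower bidiagonal (sub-diagonal d)."""
    M = len(sigma)
    U = [[0.0] * M for _ in range(M)]
    for i in range(M):
        U[i][i] = sigma[i]
        if i > 0:
            U[i][i - 1] = d[i - 1] * sigma[i - 1]
    return U

def solve_factored(d: List[float], sigma: List[float], b: List[float]) -> List[float]:
    """Solve (A * diag(sigma)) z = b without forming or inverting U."""
    y: List[float] = []
    for i, bi in enumerate(b):
        y.append(bi - (d[i - 1] * y[i - 1] if i > 0 else 0.0))  # A y = b
    return [yi / si for yi, si in zip(y, sigma)]                 # diag(sigma) z = y

def log_det_covariance(sigma: List[float]) -> float:
    """log det(U U^T) = 2 * sum(log sigma_b), assuming positive sigma_b."""
    return 2.0 * sum(math.log(s) for s in sigma)

d = [0.5]            # hypothetical coupling parameter
sigma = [2.0, 4.0]   # hypothetical positive scale parameters
U = compose_factor(d, sigma)
z = solve_factored(d, sigma, [2.0, 5.0])
```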

[0119] In the scalar case, owing to the nature of the associated training process, it is often found that using, for example, a Laplace or logistic distribution provides better results than using a Gaussian distribution. This disclosure proposes a way to generalize the above approach to distributions other than Gaussian, which is also effective for multidimensional time slots.

[0120] In a first step, it is proposed to define, using scalar parameters μ_b and s_b provided by the system, M scalar distributions F_b(μ_b, s_b) for mixture model component j. Next, it is proposed to define a linear transformation L_j in the form of a triangular matrix with a unit main diagonal and a small number of non-zero diagonals. In general, the matrix can be either lower triangular or upper triangular. For clarity, it will be assumed that the matrix is lower triangular, but it is understood that the upper triangular case can also be considered. Here, "a small number of non-zero diagonals" refers to the number of superdiagonals (for upper triangular matrices) or the number of subdiagonals (for lower triangular matrices). Furthermore, as an example, in the case of a Gaussian distribution, L_j would be equal to ... It is important to note that L_j is always invertible. Assuming that the dimensions are independent after such a transformation, the loss l_NLL can be calculated by using, in the scalar prototype distributions of the l_NLL formula, the samples transformed by the inverse L_j^{-1} instead of the original samples. This assumption is reasonable because the purpose of L_j is to introduce inter-band/inter-channel correlation, whereas L_j^{-1} removes this correlation, and the goal of training is to arrive at a model that conforms to this specification.
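As a non-limiting illustration of this loss computation, the following Python sketch evaluates the negative log-likelihood of one frame for a single mixture component, using scalar Laplace prototypes and a lower triangular matrix with a unit main diagonal. It reads the description above as follows: the frame is first mapped through the inverse of the triangular matrix, and the result is then scored by independent scalar distributions. That reading, the function names, and the numerical values are assumptions. Because the main diagonal is one, the determinant of the matrix is one and the change of variables adds no extra term to the loss.

```python
import math
from typing import List

Matrix = List[List[float]]

def apply_inverse_unit_lower(L: Matrix, x: List[float]) -> List[float]:
    """Solve L y = x by forward substitution; L has ones on its main diagonal."""
    y: List[float] = []
    for i in range(len(x)):
        y.append(x[i] - sum(L[i][k] * y[k] for k in range(i)))
    return y

def laplace_log_pdf(y: float, mu: float, s: float) -> float:
    """Log-density of a scalar Laplace distribution with location mu and scale s."""
    return -math.log(2.0 * s) - abs(y - mu) / s

def nll_component(x: List[float], L: Matrix,
                  mus: List[float], scales: List[float]) -> float:
    """NLL of frame x: decorrelate with the inverse of L, then sum scalar terms."""
    y = apply_inverse_unit_lower(L, x)
    return -sum(laplace_log_pdf(yb, mb, sb) for yb, mb, sb in zip(y, mus, scales))

# Three-channel toy example with one non-zero sub-diagonal.
L_mat = [[1.0, 0.0, 0.0],
         [0.5, 1.0, 0.0],
         [0.0, 0.25, 1.0]]
x = [1.0, 1.5, 0.25]
decorrelated = apply_inverse_unit_lower(L_mat, x)
loss = nll_component(x, L_mat, mus=[1.0, 1.0, 0.0], scales=[1.0, 1.0, 1.0])
```

For a mixture with several components, the per-component likelihoods would be combined with the weights before taking the logarithm.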

[0121] For example, the transformation described above can be applied in a system such as the system 300 described with reference to Figure 3a, whose output stage 320 is not further divided into multiple sequentially executed sub-layers. Instead, it generates the probability distribution of the current samples across all channels in a single step, using a single bottom layer T0 and a single output-stage neural network 380 (e.g., an MLP). By (to a large extent) eliminating the intra-frame / intra-slot recursion exhibited by the examples described with reference to Figures 3b and 4a / 4b, the system of Figure 3a can provide an alternative that is better suited to parallelization on appropriate hardware. In the output stage 320 and the neural network 380, updates of the linear transformation L_j and of the parameters of F_b can be provided for the corresponding mixture model component, where, as described above, L_j is a lower triangular matrix with ones on its main diagonal and b non-zero diagonals, where 1 < b < M. In some embodiments, in order to, for example, reconstruct a signal, or to generate filter bank samples for the current time slot m, (random) sampling can be performed, wherein the sampling procedure includes a transformation using L_j.
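A sampling procedure that includes the transformation L_j might look like the following sketch. It is illustrative only and assumes that a mixture component j is first selected according to the mixture weights, that independent scalar logistic samples e_b are then drawn from F_b(μ_b, s_b), and that the slot is obtained as x = L_j e. The function names, the argument layout, and the choice of the logistic distribution are hypothetical.

```python
import math
import random

def sample_logistic(mu, s, rng):
    """Draw one scalar logistic sample by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() returns values in [0, 1); avoid log(0)
        u = rng.random()
    return mu + s * math.log(u / (1.0 - u))

def apply_unit_lower_banded(sub, e):
    """Compute x = L e for a unit lower triangular banded matrix L.

    sub[k][i] holds L[i + k + 1][i] (hypothetical storage layout).
    """
    x = list(e)  # the unit main diagonal contributes e itself
    for k, diag in enumerate(sub):
        for col, coeff in enumerate(diag):
            x[col + k + 1] += coeff * e[col]
    return x

def sample_time_slot(weights, mus, scales, subs, rng):
    """Sample all M channels of one time slot in a single step."""
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    e = [sample_logistic(m, s, rng) for m, s in zip(mus[j], scales[j])]
    return j, e, apply_unit_lower_banded(subs[j], e)
```

Since all M independent draws are available before the matrix product is formed, nothing in this sketch requires channel-by-channel recursion within the slot.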

[0122] This disclosure also envisions a method for autoregressively generating a probability distribution of a plurality of current samples of a filter bank representation of an audio signal, wherein the current samples correspond to a current time slot, and wherein each current sample corresponds to a respective channel of the filter bank. Such a method is of course envisioned to generate the probability distribution using the generative model of this disclosure, as implemented in any of the systems described herein. The method will now be briefly described with reference to Figure 5.

[0123] Figure 5 schematically illustrates a flowchart of a method 500 according to one or more embodiments of the present disclosure. Step S501 includes generating conditioning information 510 (e.g., c0 as described above) using a hierarchy of a plurality of neural network processing layers ordered from a top layer to a bottom layer, wherein each processing layer has been trained to generate conditioning information based on previous samples of the filter bank representation and, for at least each processing layer other than the top layer, also based on conditioning information generated by a higher processing layer in the hierarchy. In step S501, “generating conditioning information” means that the conditioning information 510 is the conditioning information generated by the bottom processing layer.

[0124] In step S502, the conditioning information 510 generated in step S501 is provided to the output stage, which has been trained to generate a probability distribution 520 (e.g., p(x_m | x_{<m})) based on previous samples corresponding to one or more previous time slots of the filter bank representation and on the conditioning information 510 generated in step S501.

[0125] In some embodiments of method 500, a step S503 includes generating a plurality of current samples of the filter bank representation by sampling from the generated probability distribution 520. The generated samples 530 can then be provided, for example, as previously generated samples to one or both of steps S501 and S502.

[0126] In method 500, steps S501 and S502 can of course be combined into a single step (not shown), which simply corresponds to generating probability distribution 520 using a system as disclosed herein.
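The flow of steps S501 to S503 can be sketched as a simple autoregressive loop. The three functions below are hypothetical stand-ins for the trained hierarchy, the trained output stage, and the sampling step; their arithmetic is arbitrary and is chosen only so that the example runs. The point of the sketch is the data flow, in which the samples generated for one time slot are fed back as previous samples for the next.

```python
import random

M = 4  # number of filter bank channels (illustrative value)

def conditioning_s501(history):
    """S501 stand-in: derive conditioning information c0 from previous slots."""
    if not history:
        return [0.0] * M
    recent = history[-2:]
    return [sum(slot[b] for slot in recent) / len(recent) for b in range(M)]

def distribution_s502(history, c0):
    """S502 stand-in: per-channel (mean, scale) for the current slot."""
    last = history[-1] if history else [0.0] * M
    return [(0.5 * c + 0.4 * p, 0.1) for c, p in zip(c0, last)]

def sample_s503(dist, rng):
    """S503: draw the current samples from the generated distribution."""
    return [rng.gauss(mu, sigma) for mu, sigma in dist]

def generate(num_slots, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(num_slots):
        c0 = conditioning_s501(history)
        dist = distribution_s502(history, c0)
        history.append(sample_s503(dist, rng))  # fed back as previous samples
    return history
```

Calling generate(5) returns five time slots of M samples each. Merging conditioning_s501 and distribution_s502 into one function would correspond to combining steps S501 and S502 into a single step.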

[0127] It is envisioned that method 500 can be modified in accordance with what has been described and / or discussed for any embodiment of the system disclosed herein. For example, the system (and thus steps S501 and S502) may use additional auxiliary information, the layers may be configured as described above, the steps may include the use of recurrent units as described above, the output stage used in step S502 may be configured as described above, and so on. In other words, the flow of method 500 can implement the generative model described herein by using, for example, any embodiment of the system also described herein.

[0128] This disclosure also envisions providing a non-transitory computer-readable medium storing instructions operable, when executed by at least one computer processor belonging to computer hardware, to implement a generative model using the computer hardware (i.e., by implementing a system as described herein, and / or by performing the methods described above).

[0129] It is envisioned that the generative model, as implemented in the systems described herein or as executed by the methods described herein, can be used, for example, in a coding scheme, preferably in a decoder. Instead of sending the complete audio signal to the decoder, the generative model can learn how to generate current samples based on previously generated samples (i.e., "fill in the gaps"), and, by being provided with additional auxiliary information (which could be, for example, quantized filter bank samples or other coded data), the generative model can learn how to generate filter bank samples such that a signal sufficiently similar to the original signal can be reconstructed in a later synthesis stage. Similarly, as mentioned earlier, the generative model can also be applied to other tasks, such as signal enhancement. The generative model can, for example, receive a noisy signal as additional auxiliary information and learn how to remove the noise by adjusting the generated probability distribution, and thereby the generated samples, accordingly.

[0130] As described in the example embodiments above, the neural network system of this disclosure can be implemented, for example, using a computer, using computer hardware including, for example, a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application-specific integrated circuits (ASICs), one or more radio frequency integrated circuits (RFICs), or any combination thereof) and memory coupled to the processor. As described above, the processor can be adapted to perform some or all of the steps of the methods also described throughout this disclosure.

[0131] Computer hardware can be, for example, a server computer, client computer, personal computer (PC), tablet PC, set-top box (STB), personal digital assistant (PDA), cellular phone, smartphone, web device, network router, switch or bridge, or any machine capable of (sequentially or otherwise) executing instructions specifying actions to be taken by said computer hardware. Furthermore, this disclosure will relate to any collection of computer hardware that individually or in combination executes instructions to perform any one or more concepts discussed herein.

[0132] As used herein, the term "computer-readable medium" includes, but is not limited to, data repositories in the form of, for example, solid-state memory, optical media, and magnetic media.

[0133] Unless specifically stated otherwise, as is evident from the following discussion, it should be understood that, throughout this disclosure, terms such as “processing,” “computing,” “calculating,” “determining,” and “analyzing” refer to the actions and / or processes of computer hardware, a computing system, or a similar electronic computing device that manipulates and / or transforms data represented as physical (e.g., electronic) quantities into other data similarly represented as physical quantities.

[0134] In a similar manner, the term "computer processor" can refer to any device or part of a device that processes electronic data, for example, from registers and / or memory, to transform said electronic data into other electronic data, for example, that can be stored in registers and / or memory. "Computer," "computing machine," "computing platform," or "computer hardware" can include one or more processors.

[0135] In one or more example embodiments, the concepts described herein can be executed by one or more processors that accept computer-readable (also known as machine-readable) code containing a set of instructions that, when executed by one or more processors, perform at least one of the methods described herein. This includes any processor capable of executing a set of instructions (sequential or otherwise) specifying an action to be taken. Thus, one example is a typical processing system (i.e., computer hardware) including one or more processors. Each processor may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit. The processing system may further include a memory subsystem comprising main RAM and / or static RAM and / or ROM. A bus subsystem may be included for communication between components. The processing system may further be a distributed processing system in which processors are coupled together via a network. If the processing system requires a display, it may include such a display, for example, a liquid crystal display (LCD) or a cathode ray tube (CRT) display. If manual data input is required, the processing system also includes one or more input devices, such as alphanumeric input units (e.g., a keyboard), pointing control devices (e.g., a mouse), etc. The processing system may also encompass storage systems such as disk drive units. In some configurations, the processing system may include sound output devices and network interface devices. The memory subsystem therefore includes a computer-readable carrier medium carrying computer-readable code (e.g., software) comprising a set of instructions that, when executed by one or more processors, cause one or more of the methods described herein to be performed. It should be noted that when the methods comprise several elements (e.g., several steps), no particular order of these elements is implied unless specifically stated otherwise. 
During the execution of software by a computer system, the software may reside on a hard disk, or it may reside wholly or at least partially in RAM and / or a processor. Therefore, the memory and processor also constitute a computer-readable carrier medium carrying computer-readable code. Furthermore, the computer-readable carrier medium may be formed or included in a computer program product.

[0136] In alternative example embodiments, one or more processors may operate as standalone devices or may be connected to (e.g., networked to) other processors in a networked deployment. These processors may operate as server or user machines in a server-user network environment, or as peer-to-peer machines in a peer-to-peer or distributed network environment. The one or more processors may form a personal computer (PC), tablet PC, personal digital assistant (PDA), cellular phone, web facility, network router, switch, or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) specifying the actions to be taken by the machine.

[0137] It should be noted that the term "machine" should also be considered to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein.

[0138] Therefore, an example embodiment of each method described herein is in the form of a computer-readable carrier medium carrying a set of instructions, such as a computer program for execution on one or more processors (e.g., one or more processors as part of a web server arrangement). Thus, as those skilled in the art will recognize, example embodiments of this disclosure can be embodied as methods, apparatus such as dedicated devices, apparatus such as data processing systems, or computer-readable carrier media (e.g., computer program products). A computer-readable carrier medium carries computer-readable code comprising a set of instructions that, when executed on one or more processors, cause one or more processors to implement the method. Therefore, aspects of this disclosure can take the form of methods, entirely hardware example embodiments, entirely software example embodiments, or example embodiments combining software and hardware aspects. Furthermore, this disclosure can take the form of a carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium.

[0139] Software can be further sent or received over a network via a network interface device. While the carrier medium is a single medium in the example embodiment, the term "carrier medium" should be considered to include a single medium or multiple media (e.g., a centralized or distributed database and / or associated caches and servers) that store one or more sets of instructions. The term "carrier medium" should also be considered to include any medium capable of storing, encoding, or carrying a set of instructions for execution by one or more processors and causing one or more processors to perform any one or more of the methods disclosed herein. The carrier medium can take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical discs, magnetic disks, and magneto-optical discs. Volatile media include dynamic memory, such as main memory. Transmission media include coaxial cables, copper wires, and optical fibers, including conductors containing a bus subsystem. Transmission media can also take the form of acoustic or optical waves, such as acoustic or optical waves generated during radio wave and infrared data communication. For example, the term "carrier medium" should therefore be considered to include, but is not limited to, solid-state storage; computer products embodied in optical and magnetic media; media carrying propagation signals that can be detected by at least one or more processors and represent a set of instructions, which, when executed, implement a method; and transmission media in a network that carry propagation signals that can be detected by at least one of one or more processors and represent the set of instructions.

[0140] It will be understood that, in one example embodiment, the steps of the method in question are performed by a suitable processor (or processors) in a processing (e.g., computer) system / hardware that executes instructions (computer-readable code) stored in a storage device. It will also be understood that this disclosure is not limited to any particular implementation or programming technique, and that this disclosure can be implemented using any suitable technique for implementing the functionality described herein. This disclosure is not limited to any particular programming language or operating system.

[0141] Throughout this disclosure, references such as "one example embodiment," "some example embodiments," or "example embodiment" mean that a particular feature, structure, or characteristic described in connection with an example embodiment is included in at least one example embodiment of this disclosure. Therefore, phrases appearing throughout this disclosure, such as "in one example embodiment," "in some example embodiments," or "in an example embodiment," do not necessarily refer to the same example embodiment. Furthermore, in one or more example embodiments, particular features, structures, or characteristics may be combined in any suitable manner, as will be apparent to those skilled in the art based on this disclosure.

[0142] As used herein, unless otherwise specified, ordinal adjectives such as “first,” “second,” “third,” etc., are used to describe common objects only to indicate different instances of similar objects and are not intended to imply that the objects described must be in a given order in time, space, hierarchy, or any other way.

[0143] In the claims below and in the description herein, the terms *comprising*, *comprised of*, or *which comprises* are open-ended terms meaning that at least the following element / feature is included, but not excluding other elements / features. Therefore, when the term *comprising* is used in a claim, it should not be construed as limited to the means, elements, or steps listed thereafter. For example, the expression of a device including A and B should not be limited to a device that includes only elements A and B. As used herein, the terms *including*, *which includes*, or *that includes* are also open-ended terms meaning that at least the element / feature following the term is included, but not excluding other elements / features. Therefore, *including* is synonymous with *comprising* and means *comprising*.

[0144] It should be recognized that in the foregoing description of exemplary embodiments of this disclosure, various features of this disclosure are sometimes grouped together in a single exemplary embodiment, drawing, or description thereof in order to streamline the disclosure and aid in understanding one or more of the inventive aspects. However, this manner of disclosure should not be construed as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in fewer than all features of a single foregoing disclosed exemplary embodiment. Therefore, the claims following this description are hereby expressly incorporated into this description, with each claim standing on its own as a separate exemplary embodiment of this disclosure.

[0145] Furthermore, while some of the exemplary embodiments described herein include some features included in other exemplary embodiments but not others included in other exemplary embodiments, as those skilled in the art will understand, combinations of features from different exemplary embodiments are intended to be within the scope of this disclosure and to form different exemplary embodiments. For example, any exemplary embodiment of the claimed exemplary embodiments in the following claims may be used in any combination.

[0146] Numerous specific details are set forth in the description provided herein. However, it should be understood that exemplary embodiments of this disclosure may be practiced without these specific details. In other instances, well-known methods, structures, and techniques have not been shown in detail to avoid obscuring the understanding of this specification.

[0147] Therefore, although the mode considered to be the best mode of this disclosure has been described, those skilled in the art will recognize that other and further modifications can be made thereto without departing from the spirit of this disclosure, and all such changes and modifications falling within the scope of this disclosure are intended to be claimed. For example, any formulas given above merely represent processes that can be used. Functions can be added or removed from the block diagram, and operations can be interchanged between functional blocks. Steps can be added or removed from the methods described within the scope of this disclosure.

[0148] Various aspects of the invention can be understood from the following enumerated example embodiments (EEE):

[0149] EEE 1. A neural network system (300) for autoregressively generating a probability distribution of a plurality of current samples (x_m) of a filter bank representation of an audio signal, wherein the current samples correspond to a current time slot (m), and wherein each current sample corresponds to a respective channel of the filter bank, the neural network system comprising:

[0150] a hierarchy (310) of a plurality of neural network processing layers (T_{N-1}, T_{N-2}, ..., T_0) ordered from a top processing layer (T_{N-1}) to a bottom processing layer (T_0), wherein each processing layer (T_j) has been trained to generate conditioning information (c_j) based on previous samples (x_{<m}) of the filter bank representation and, for at least each processing layer other than the top layer, also based on conditioning information (c_{j+1}) generated by a higher processing layer (T_{j+1}) in the hierarchy; and

[0151] an output stage (320) that has been trained to generate the probability distribution based on previous samples (x_{<m}) corresponding to one or more previous time slots (<m) of the filter bank representation and on the conditioning information generated by the lowest processing layer.

[0152] EEE 2. The neural network system as described in EEE 1, wherein each processing layer has been trained to generate the conditioning information based on additional auxiliary information (a) provided for the current time slot.

[0153] EEE 3. A neural network system as described in EEE 1 or 2, further comprising means configured to generate a plurality of current samples represented by the filter bank by sampling from the generated probability distribution.

[0154] EEE 4. A neural network system as described in any one of EEE 1 to 3, wherein each processing layer includes a convolutional module configured to receive previous samples represented by the filter bank, wherein the number of input channels of each convolutional module is the same as the number of channels of the filter bank, and wherein the kernel size of the convolutional module decreases from the top processing layer to the bottom processing layer in the hierarchy.

[0155] EEE 5. A neural network system as described in EEE 4, wherein each processing layer includes at least one recurrent unit configured to receive the sum of the outputs from the convolutional modules as its input, and at least each processing layer other than the lowest processing layer includes at least one learned upsampling module configured to receive the output from the at least one recurrent unit as its input and to generate the conditioning information as its output.

[0156] EEE 6. The neural network system as described in any one of the preceding EEEs, wherein the output stage includes the bottom processing layer, and wherein the bottom processing layer is subdivided into a plurality of sequentially executed sublayers, wherein each sublayer has been trained to generate a probability distribution of one or more current samples corresponding to a proper subset of the channels of the filter bank, and for at least all sublayers except the first executed sublayer, each sublayer has been trained to also generate the probability distribution based on the current samples generated by one or more previously executed sublayers.

[0157] EEE 7. A neural network system as described in EEE 6, wherein the first executed sublayer generates one or more current samples corresponding to at least the lowest channel of the filter bank, and wherein the last executed sublayer generates one or more current samples corresponding to at least the highest channel of the filter bank.

[0158] EEE 8. The neural network system as described in any one of the preceding EEE, wherein the probability distribution of the current sample is obtained using a mixture model.

[0159] EEE 9. A neural network system as described in EEE 8, wherein generating the probability distribution includes providing an update of a linear transformation (L_j) for a mixture coefficient (j) of the mixture model, wherein the linear transformation is defined by a triangular matrix having ones on the main diagonal, and wherein the number of non-zero diagonals of the triangular matrix is greater than one and less than the number of channels of the filter bank.

[0160] EEE 10. A neural network system as described in EEE 9 when dependent on EEE 3, wherein the sampling includes a transformation using the linear transformation.

[0161] EEE 11. The neural network system as described in EEE 6 when dependent on EEE 5, further comprising an additional recurrent unit (464) that is common to all sub-layers of the bottom processing layer and is configured to receive as its input i) the sum of the outputs from the convolutional modules and ii) a mix of the outputs of the at least one recurrent unit (460-j), and to generate, based thereon, additional auxiliary information (d_*) for the corresponding sub-output stage (480-j) of each sub-layer (490-j).

[0162] EEE 12. A method for autoregressively generating a probability distribution of a plurality of current samples of a filter bank representation of an audio signal, wherein the current samples correspond to a current time slot, and wherein each current sample corresponds to a respective channel of the filter bank, the method comprising generating the probability distribution by using a neural network system as described in any of the preceding EEEs.

[0163] EEE 13. A non-transitory computer-readable medium storing instructions that, when executed by at least one computer processor belonging to computer hardware, are operable to use the computer hardware to implement a neural network system according to any one of EEE 1 to 11 and / or perform the method as described in EEE 12.
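The sequential sub-layer arrangement of EEE 6 and EEE 7 can be illustrated by the following sketch, which is not taken from this disclosure. It assumes that the channels are split into contiguous groups ordered from the lowest to the highest channel, which is only one possible choice of proper subsets, and it replaces each trained sub-layer by a hypothetical stand-in with arbitrary arithmetic. What it shows is that every sub-layer after the first also receives the current samples already generated by earlier sub-layers.

```python
import random

def split_channels(num_channels, num_sublayers):
    """Partition channel indices into contiguous groups, lowest channel first."""
    if not 2 <= num_sublayers <= num_channels:
        raise ValueError("need at least 2 sublayers and at most one per channel")
    base, extra = divmod(num_channels, num_sublayers)
    groups, start = [], 0
    for i in range(num_sublayers):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def sublayer_distribution(channels, conditioning, generated):
    """Stand-in for one trained sub-layer: (mean, scale) per channel."""
    context = sum(generated) / len(generated) if generated else 0.0
    return [(conditioning[ch] + 0.5 * context, 0.1) for ch in channels]

def generate_slot(conditioning, num_sublayers, rng):
    """Generate one time slot, sub-layer by sub-layer."""
    current = []
    for channels in split_channels(len(conditioning), num_sublayers):
        # Later sub-layers see the samples produced by earlier ones.
        dist = sublayer_distribution(channels, conditioning, current)
        current.extend(rng.gauss(mu, sigma) for mu, sigma in dist)
    return current
```

For a 16-channel filter bank and four sub-layers, split_channels(16, 4) yields the groups 0 to 3, 4 to 7, 8 to 11, and 12 to 15, so the first executed sub-layer covers the lowest channel and the last covers the highest.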

Claims

1. A computer-implemented neural network system for autoregressively generating a filter bank representation of an audio signal from a plurality of current filter bank samples, wherein the plurality of current filter bank samples correspond to a current time slot, and each current filter bank sample corresponds to a respective channel of the filter bank, the neural network system comprising: a hierarchy of a plurality of neural network processing layers ordered from a top processing layer to a bottom processing layer, wherein each of the plurality of neural network processing layers has been trained to generate conditioning information based on previous filter bank samples of the filter bank representation and, for at least each processing layer other than the top processing layer, also based on conditioning information generated by a higher processing layer in the hierarchy; and an output stage that has been trained to generate a probability distribution of the plurality of current filter bank samples based on previous filter bank samples corresponding to one or more previous time slots of the filter bank representation and on the conditioning information generated by the lowest processing layer, and that is configured to sample the probability distribution to obtain the current filter bank samples; wherein the output stage includes the bottom processing layer, which is further subdivided into a plurality of sequentially executed sublayers, each sublayer having been trained to generate the probability distribution of one or more current filter bank samples corresponding to a proper subset of the channels of the filter bank, and, for at least all sublayers except the first executed sublayer, each sublayer having been trained to generate the probability distribution also based on the current filter bank samples generated by one or more previously executed sublayers.

2. The system as claimed in claim 1, wherein each processing layer has been trained to generate the conditioning information based on additional auxiliary information provided for the current time slot.

3. The system of claim 1 or 2, further comprising means configured to generate the plurality of current filter bank samples representing the filter bank by sampling from the generated probability distribution.

4. The system as claimed in claim 1 or 2, wherein each processing layer includes a convolutional module configured to receive the previous filter bank samples of the filter bank representation, wherein the number of input channels of each convolutional module is the same as the number of channels of the filter bank, and wherein the kernel size of the convolutional module decreases from the top processing layer to the bottom processing layer in the hierarchy.

5. The system as claimed in claim 4, wherein each processing layer includes at least one recurrent unit configured to receive the sum of the outputs from the convolutional modules as its input, and at least each processing layer other than the lowest processing layer includes at least one learned upsampling module configured to receive the output from the at least one recurrent unit as its input and to generate the conditioning information as its output.

6. The system of claim 5, further comprising an additional recurrent unit that is common to all sub-layers of the bottom processing layer and is configured to receive as its input i) the sum of the outputs from the convolutional modules and ii) a mix of the outputs from the at least one recurrent unit, and to generate, based thereon, additional auxiliary information for the corresponding sub-output stage of each sub-layer.

7. The system as claimed in claim 1 or 2, wherein the first executed sublayer generates one or more current filter bank samples corresponding to at least the lowest channel of the filter bank, and the last executed sublayer generates one or more current filter bank samples corresponding to at least the highest channel of the filter bank.

8. The system as claimed in claim 1 or 2, wherein the probability distribution of the plurality of current filter bank samples is obtained using a mixture model.

9. The system of claim 8, wherein generating the probability distribution includes updating the mixture coefficients of the mixture model with a linear transformation, wherein the linear transformation is defined by a triangular matrix having ones on the main diagonal, and wherein the number of non-zero diagonals of the triangular matrix is greater than one and less than the number of channels of the filter bank.

10. The system of claim 9, wherein the sampling includes a transformation using the linear transformation.

11. A computer-implemented neural network system for autoregressively generating a filter bank representation of an audio signal from a plurality of current filter bank samples, wherein the plurality of current filter bank samples correspond to a current time slot, and each current filter bank sample corresponds to a respective channel of the filter bank, the neural network system comprising: a hierarchy of a plurality of neural network processing layers ordered from a top processing layer to a bottom processing layer, wherein each of the plurality of neural network processing layers has been trained to generate conditioning information based on previous filter bank samples of the filter bank representation and, for at least each processing layer other than the top processing layer, also based on conditioning information generated by a higher processing layer in the hierarchy; and an output stage that has been trained to generate a probability distribution of the plurality of current filter bank samples based on previous filter bank samples corresponding to one or more previous time slots of the filter bank representation and on the conditioning information generated by the lowest processing layer, and that is configured to sample the probability distribution to obtain the current filter bank samples; wherein each processing layer includes a convolutional module configured to receive the previous filter bank samples of the filter bank representation, wherein the number of input channels of each convolutional module is the same as the number of channels of the filter bank, and wherein the kernel size of the convolutional module decreases from the top processing layer to the bottom processing layer in the hierarchy.

12. A method for autoregressively generating a filter bank representation of an audio signal from a plurality of current filter bank samples, wherein the plurality of current filter bank samples correspond to a current time slot, and each current filter bank sample corresponds to a respective channel of the filter bank, the method comprising generating and sampling a probability distribution using a system as claimed in any of the preceding claims.

13. The method of claim 12, comprising the following steps: generating conditioning information using the plurality of neural network processing layers, wherein the conditioning information is the conditioning information generated using the bottom processing layer; and generating the probability distribution using the output stage, based on previous filter bank samples corresponding to one or more previous time slots of the filter bank representation and on the conditioning information generated using the bottom processing layer.

14. A non-transitory computer-readable medium storing instructions that, when executed by at least one computer processor belonging to computer hardware, are operable to use the computer hardware to implement the system according to any one of claims 1 to 11 and / or perform the method according to claims 12 and 13.