Generating audio using generative neural networks
By combining a hierarchy of diffusion neural networks with a language model, the system addresses the problem of generating high-quality audio over long durations, enabling efficient generation of high-quality musical works and soundscapes while preserving the temporal continuity and detail of the audio.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2024-11-20
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies struggle to efficiently generate high-quality audio over long periods, especially musical works, and direct generation may require significant computational resources and result in poor output coherence.
A hierarchy of diffusion neural networks first generates a low-resolution spectrogram and then produces longer audio through iterative extension. A language model neural network processes the user input, so that high-quality musical works or soundscapes can be generated.
It enables efficient generation of high-quality audio of various lengths, preserving temporal continuity and reflecting fine-grained details of the user input, while avoiding the heavy computational cost and quality degradation of generating long audio directly.
Figure CN122228545A_ABST
Description
[0001] Cross-reference to related applications
[0002] This application claims priority to U.S. Provisional Application No. 63/601,183, filed November 20, 2023. The disclosure of the earlier application is considered part of the disclosure of this application and is incorporated herein by reference.
Background Technology
[0003] This specification relates to the use of neural networks to generate audio conditioned on conditional input.
[0004] A neural network is a machine learning model that uses one or more non-linear units to predict the output in response to a received input. In addition to the output layer, some neural networks also include one or more hidden layers. The output of each hidden layer serves as the input to one or more other layers in the network (i.e., one or more other hidden layers, the output layer, or both). Each layer of the network generates an output from the received input based on the current values of its corresponding set of parameters.
Summary of the Invention
[0005] This specification describes a system implemented as a computer program on one or more computers in one or more locations, which uses a generative neural network to generate audio representing an audio signal conditioned on a conditional input.
[0006] For example, the system can generate a spectrogram representing an audio signal and then convert that spectrogram into an audio waveform.
[0007] As a specific example, the system can generate musical works conditioned on user input that characterizes the desired properties of the musical work, such as songs with lyrics or instrumental pieces without lyrics.
[0008] As another specific example, the system can generate a soundscape representing a particular audio environment, conditioned on user input that characterizes the desired audio properties of that environment.
[0009] In some cases, the system can use a hierarchy of diffusion models to generate the musical work.
[0010] For example, the system can use a first set of diffusion neural networks to first generate a spectrogram representing a musical work of a first length.
[0011] Optionally, the system can then generate an expanded spectrogram representing a longer musical work by using a second set of diffusion neural networks, the expanded spectrogram being conditioned on the user input and the spectrogram of the first length.
[0012] As used in this specification, a "spectrogram" is a visual representation of the frequency spectrum of an audio signal over time. Frequency in a spectrogram can be represented in any suitable manner. For example, a spectrogram can be a log-Mel amplitude spectrogram with frequency expressed on a log-Mel scale, or a different type of spectrogram with frequency expressed on a different scale.
[0013] In some contexts, a "spectrogram" refers to a "stereo" spectrogram, which is a combination of the corresponding spectrograms of two or more distinct, time-aligned audio signals. For example, these audio signals can be played simultaneously from two or more different sound sources to create a multi-dimensional perspective. In these cases, a stereo spectrogram can be, for example, a concatenation of multiple distinct spectrograms along the depth dimension.
[0014] In addition, in these cases, the system can generate stereo audio signals from stereo spectrograms, that is, it can generate two or more separate waveforms designed to be played simultaneously from the corresponding sound sources.
[0015] Optionally, the system can also generate images (“album covers”) that visually represent certain properties of the generated musical work.
[0016] In some cases, the system can use a language model neural network to transform user input (e.g., natural language user input) into an input prompt for a generative neural network system that generates the musical work, and optionally into another input prompt for a generative neural network system that generates the album cover image.
[0017] Specific embodiments of the subject matter described in this specification may be implemented in order to achieve one or more of the following advantages.
[0018] Using the described technique, the system can receive user input and, in response, efficiently generate high-quality music that accurately reflects the musical properties characterized by the user input, for example, by generating music described by natural language text or audio input from the user.
[0019] Specifically, the system can effectively utilize diffusion neural networks by using one diffusion neural network to first generate a low-resolution spectrogram of the desired music, and then using another diffusion neural network to upsample that low-resolution spectrogram. By using this hierarchical structure of diffusion neural networks, the system can ensure that the resulting audio is temporally coherent while still exhibiting fine-grained details that reflect the user input.
[0020] In some cases, longer musical pieces may be required. In these cases, the system can use another set of diffusion neural networks to expand the initially generated "blocks" of music. By iteratively expanding the initially generated blocks, the system can generate significantly longer audio without requiring excessive memory and without degrading the quality or coherence of the audio as the length increases. That is, directly generating long (e.g., several minutes long) audio may require a large amount of computational resources (e.g., memory) and may result in outputs with poor coherence. Iteratively generating new blocks that expand on the already generated blocks addresses these problems, thereby producing high-quality long audio.
[0021] Therefore, the system can generate high-quality music of any length in a computationally efficient manner. That is, the system can generate high-quality music even if the desired or target length of the music exceeds the modeling capability of any given diffusion neural network employed by the system.
[0022] Furthermore, by utilizing diffusion models, the system can efficiently generate multiple high-quality, plausible musical compositions in response to a given user input. These compositions represent multiple different plausible interpretations of the user input. In other words, since the diffusion model starts with a noisy representation of the output, by sampling different noises to initialize the representation for any given diffusion model used by the system, the system can generate different musical compositions that represent high-quality, plausible interpretations of the user input.
[0023] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of this subject matter will become apparent from the description and drawings.
Description of the Drawings
[0024] Figure 1 is a diagram of an example audio generation system.
[0025] Figure 2 shows an example of the operation of an audio generative neural network system.
[0026] Figure 3 shows another example of the operation of an audio generative neural network system.
[0027] Figure 4 is a flowchart of an example process for generating a high-resolution spectrogram with a first length.
[0028] Figure 5 is a flowchart of an example process for generating a high-resolution spectrogram with an extended length.
[0029] Figure 6 is a flowchart of an example process for generating audio and image inputs from user input.
[0030] Figures 7A to 7C show an example of the operation of the system.
[0031] In the various figures, the same reference numerals and names indicate the same elements.
Detailed Description
[0032] This specification describes a system implemented as a computer program on one or more computers in one or more locations, which generates audio conditioned on conditional input.
[0033] Figure 1 is a diagram of an example audio generation system 100. Audio generation system 100 is an example of a system implemented as a computer program on one or more computers at one or more locations, in which the systems, components, and techniques described below may be implemented.
[0034] System 100 uses conditional input 110 as a condition and employs an audio generative neural network system 130 to generate audio 120 representing an audio signal. For example, audio 120 may be a waveform representing the audio signal, wherein the waveform includes the corresponding amplitude value of the audio signal at each of a plurality of time points.
[0035] Typically, system 100 can generate audio 120 with a target length, i.e., a time window spanning the target length. In some implementations, the target length is fixed, i.e., such that each audio signal generated by system 100 has the same length. In some other implementations, the target length is variable, and different audio signals generated by system 100 can have different lengths.
[0036] For example, system 100 can generate a spectrogram representing an audio signal and then convert the spectrogram into an audio waveform.
[0037] As used in this specification, a "spectrogram" is a visual representation of the frequency spectrum of an audio signal over time. Frequency in a spectrogram can be represented in any suitable manner. For example, a spectrogram can be a log-Mel amplitude spectrogram with frequency expressed on a log-Mel scale, or a different type of spectrogram with frequency expressed on a different scale.
[0038] In some contexts, a "spectrogram" refers to a "stereo" spectrogram, which is a combination of the corresponding spectrograms of two or more distinct, time-aligned audio signals. For example, these audio signals can be played simultaneously from two or more different sound sources to create a multi-dimensional perspective. In these cases, a stereo spectrogram can be, for example, a concatenation of multiple distinct spectrograms along the depth dimension.
[0039] In addition, in these cases, system 100 can generate stereo audio signals from stereo spectrograms, that is, it can generate two or more separate waveforms intended to be played simultaneously from corresponding sound sources.
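As a minimal sketch of the stereo spectrogram described above, the following Python snippet combines two time-aligned spectrograms along a depth dimension. The nested-list layout `[depth][frequency_bin][time_frame]` and the function name `stack_stereo` are assumptions made for illustration; the specification does not prescribe a data layout.

```python
from typing import List

Spectrogram = List[List[float]]  # assumed layout: [frequency_bin][time_frame]


def stack_stereo(left: Spectrogram, right: Spectrogram) -> List[Spectrogram]:
    """Combine two time-aligned spectrograms along a leading depth axis."""
    same_bins = len(left) == len(right)
    same_frames = all(len(a) == len(b) for a, b in zip(left, right))
    if not (same_bins and same_frames):
        raise ValueError("both channels must share the same frequency/time shape")
    # Result is indexed as [depth][frequency_bin][time_frame]; depth 0 is left.
    return [[row[:] for row in left], [row[:] for row in right]]


left = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
right = [[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]]
stereo = stack_stereo(left, right)
```

Each depth slice can later be converted to its own waveform, one per sound source.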
[0040] System 100 can then play audio to the user or save the audio waveform in memory for later playback.
[0041] As a specific example, system 100 can generate musical works conditioned on user input representing desired properties of a musical work (e.g., songs with lyrics or instrumental works without lyrics), or generate soundscapes conditioned on user input representing desired properties of a soundscape. A "soundscape" is audio representing a specific environment such as a home, office, hospital, shopping mall or shopping center, beach, forest, etc.
[0042] In other words, audio 120 can be a musical work or a soundscape, and conditional input 110 can be user input that represents the desired properties of the musical work or soundscape.
[0043] For example, user input can be natural language text describing the desired content of a musical work or soundscape.
[0044] As another example, system 100 may present a user interface that allows a user to provide structured text containing values for one or more properties of a musical work or soundscape. Therefore, in these cases, the user input is structured text containing the corresponding value for each of the one or more properties of the musical work or soundscape.
[0045] In some implementations, system 100 may receive audio input 132 from a user as user input or generate audio input 132 from user input. This audio input may include (i) a positive input specifying the desired characteristics of a musical work or soundscape, and optionally (ii) lyrics to be sung in the musical work. In some implementations, the audio input may also include (iii) a negative input specifying characteristics that the musical work or soundscape should not possess.
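For illustration only, the three parts of the audio input described above could be held in a small record such as the following. The class name `AudioInput` and its field names are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioInput:
    """Hypothetical container for the parts of an audio input."""

    positive: str                   # (i) desired characteristics
    lyrics: Optional[str] = None    # (ii) optional lyrics to be sung
    negative: Optional[str] = None  # (iii) optional characteristics to avoid

    def has_negative(self) -> bool:
        # True only when a non-empty negative input was supplied.
        return bool(self.negative)


prompt = AudioInput(positive="upbeat jazz trio, brushed drums", negative="vocals")
```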
[0046] Then, system 100 uses audio generative neural network system 130 to process audio input 132 to generate audio 120.
[0047] Optionally, the system 100 may also generate an image 140 (“album cover”) that visually represents certain properties of the generated musical work.
[0048] Specifically, system 100 can generate image input 142 that characterizes the desired properties of image 140, and then use image generative neural network system 150 to generate image 140 conditioned on image input 142.
[0049] For example, system 100 may receive image input 142 from a user as user input or generate image input 142 from user input, the image input including (i) a positive image input specifying desired properties of image 140. In some of these implementations, image input 142 may also include (ii) a negative image input specifying properties that image 140 should not have.
[0050] The image generative neural network system 150 can use any suitable generative neural network to generate the image 140.
[0051] For example, system 150 can use a diffusion neural network to generate image 140. An example of such a neural network is a latent diffusion model, such as Stable Diffusion. Another example of such a neural network is a diffusion model that uses a text-to-image diffusion model to generate a first image, and then applies one or more super-resolution diffusion models to that first image to generate the final image 140. An example of such a model is Imagen.
[0052] As another example, system 150 can use an autoregressive generative model to generate image 140. An example of such a model is Parti (Yu et al., “Scaling Autoregressive Models for Content-Rich Text-to-Image Generation”, arXiv:2206.10789v1, 2022).
[0053] As yet another example, system 150 can use a masked token generative model that progressively unmasks visual tokens during generation. An example of such a generative model is Muse (Chang et al., “Muse: Text-To-Image Generation via Masked Generative Transformers”, arXiv:2301.00704v1, 2023).
[0054] In some implementations, system 100 may use a language model neural network to generate both audio input 132 and image input 142 from user input.
[0055] For example, system 100 can use a language model neural network to map natural language user input describing a musical work or soundscape to a structured text sequence that includes at least a positive audio input. Optionally, the sequence may also include a negative audio input. Further optionally, the structured text sequence may also include lyrics generated by the language model neural network. System 100 can then generate, from the structured text sequence, at least a positive image input that is provided to an image generation model to generate the album cover.
[0056] A language model neural network can have any suitable neural network architecture that allows the neural network to map an input sequence of tokens from a vocabulary to an output sequence of tokens from the vocabulary.
[0057] A vocabulary of tokens can include any of a variety of tokens that represent text symbols or other symbols. For example, a vocabulary of text tokens can include one or more of characters, subwords, words, punctuation marks, numbers, or other symbols that appear in a corpus of natural language text and/or computer programming languages.
[0058] For example, a language model neural network can be a Transformer-based language model neural network or a recurrent neural network-based language model. As a specific example, a language model neural network can be an autoregressive Transformer-based neural network with, for example, an encoder-only Transformer architecture, an encoder-decoder Transformer architecture, or a decoder-only Transformer architecture.
[0059] Examples of such architectures include those described in the following: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683, 2019; Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le, Towards a human-like open-domain chatbot, CoRR, abs/2001.09977, 2020; Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165, 2020; Aakanksha Chowdhery, et al., PaLM: Scaling Language Modeling with Pathways, arXiv preprint arXiv:2204.02311; and Rohan Anil, et al., PaLM 2 Technical Report, arXiv preprint arXiv:2305.10403, 2023.
[0060] Figure 2 shows an example audio generative neural network system 130 that generates a musical work or soundscape 220 based on audio input 132.
[0061] System 130 can use a hierarchy of diffusion models to generate the musical work or soundscape 220, the hierarchy including a base diffusion neural network 260 and an upsampling diffusion neural network 270.
[0062] Specifically, system 130 can use the base diffusion neural network 260 to first generate a low-resolution spectrogram 240 that represents the musical work or soundscape at a first resolution.
[0063] Typically, the low-resolution spectrogram 240 is referred to as a low-resolution spectrogram because it spans the same time window as the musical work or soundscape 220, but has a first resolution lower than the target resolution of the musical work or soundscape 220.
[0064] For example, if the target resolution is x by y, then the first resolution could be x/4 by y/4, x/8 by y/8, or x/16 by y/16.
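The arithmetic in the preceding example can be sketched as follows. The target resolution of 128 by 1024 is an assumed value chosen only so that the divisions are exact; the specification does not state concrete resolutions.

```python
from typing import Tuple


def first_resolution(target: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Return the reduced resolution (x/factor, y/factor) for a target (x, y)."""
    x, y = target
    if factor <= 0 or x % factor or y % factor:
        raise ValueError("factor must be positive and divide both dimensions")
    return x // factor, y // factor


target = (128, 1024)  # hypothetical target resolution (x, y)
options = {factor: first_resolution(target, factor) for factor in (4, 8, 16)}
```

With these assumed values, `options` maps 4 to (32, 256), 8 to (16, 128), and 16 to (8, 64).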
[0065] The base diffusion neural network 260 is a diffusion neural network configured to receive and process a "base" diffusion input to generate a "base" denoised output, the "base" diffusion input including a representation of the audio input 210 and a current representation of the low-resolution spectrogram 240. For example, the representation of the audio input 210 may be an embedding of the audio input 210 generated by a text embedding neural network.
[0066] System 130 uses the base diffusion neural network 260 to generate the low-resolution spectrogram 240 across multiple sampling steps.
[0067] Before performing the multiple sampling steps, system 130 initializes the representation of spectrogram 240 by sampling the corresponding noisy value of each value of low-resolution spectrogram 240 from a noise distribution (e.g., a Gaussian distribution). Thus, the representation of spectrogram 240 initially has only noisy values.
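The initialization described above can be sketched in a few lines. A standard normal distribution is assumed here as the Gaussian noise distribution, and the function name and the `seed` parameter are illustrative. Sampling with different seeds yields different starting representations, which is one way different outputs can arise from the same input.

```python
import random
from typing import List, Optional


def init_noisy_representation(num_bins: int, num_frames: int,
                              seed: Optional[int] = None) -> List[List[float]]:
    """Sample one Gaussian noise value per spectrogram value."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_frames)]
            for _ in range(num_bins)]


rep_a = init_noisy_representation(4, 8, seed=0)
rep_b = init_noisy_representation(4, 8, seed=1)
```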
[0068] At each sampling step, system 130 uses the base diffusion neural network 260 to process the base diffusion input, which includes the representation of audio input 210 and the current representation of spectrogram 240, to generate a base denoised output for that sampling step.
[0069] When the audio input 210 includes a positive audio input, a negative audio input, and lyrics, the system 130 can process two separate base diffusion inputs at each sampling step.
[0070] Specifically, system 130 can use the base diffusion neural network 260 to process the positive base diffusion input, which includes representations of the positive audio input and the lyrics as well as the current representation of the spectrogram 240, to generate a positive base denoised output for that sampling step.
[0071] System 130 can also use the base diffusion neural network 260 to process the negative base diffusion input, which includes the representation of the negative audio input and the current representation of the spectrogram 240, to generate a negative base denoised output for that sampling step.
[0072] Then, the system 130 can combine the negative base denoised output and the positive base denoised output according to a guidance weight for the sampling step to generate the final base denoised output for the sampling step.
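One possible way to combine the two outputs is sketched below. The linear rule used here, which moves from the negative output toward the positive output in proportion to the weight, is an assumption modeled on common classifier-free guidance practice; the specification does not state the combination formula.

```python
from typing import List

Matrix = List[List[float]]


def combine_guided(positive: Matrix, negative: Matrix, weight: float) -> Matrix:
    """Assumed rule: negative + weight * (positive - negative), per value."""
    return [[n + weight * (p - n) for p, n in zip(p_row, n_row)]
            for p_row, n_row in zip(positive, negative)]


positive_out = [[1.0, 2.0]]
negative_out = [[0.0, 1.0]]
final_out = combine_guided(positive_out, negative_out, weight=3.0)
```

Under this rule a weight of 1 returns the positive output unchanged, and larger weights push the result further away from the negative output.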
[0073] System 130 then uses the base denoised output to update the representation.
[0074] For example, system 130 can calculate an estimate of spectrogram 240 from the current representation and the base denoised output, and then use the estimate to update the current representation.
[0075] For each sampling step except the last sampling step, system 130 may apply a diffusion sampler to the estimate to generate an updated representation. The system may use any suitable diffusion sampler, such as a DDPM (Denoising Diffusion Probabilistic Model) sampler, a DDIM (Denoising Diffusion Implicit Model) sampler, or another suitable sampler.
[0076] For the final sampling step, system 130 can use this estimate as an updated representation.
[0077] Then, system 130 uses the updated representation after the final sampling step as the final spectrogram 240. In other words, system 130 uses the base diffusion neural network 260 to progressively "denoise" the initial noisy representation of spectrogram 240 to generate the low-resolution spectrogram 240 from audio input 210.
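The sampling procedure described above can be sketched end to end as follows. Several choices are assumptions made only for illustration: the cosine noise schedule, the interpretation of the denoised output as a "v-prediction", the deterministic DDIM-style update, and the toy denoiser, which stands in for a trained network by already knowing the clean target.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]  # a flattened spectrogram representation


def schedule(t: float) -> Tuple[float, float]:
    """Hypothetical cosine schedule: (signal scale, noise scale) at time t."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def sample(denoiser: Callable[[Vector, float], Vector], size: int,
           num_steps: int = 10, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    # Initialize the representation with noisy values only.
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for step in range(num_steps):
        t = 1.0 - step / num_steps
        alpha, sigma = schedule(t)
        v = denoiser(x, t)  # denoised output (assumed v-prediction)
        # Estimates of the clean spectrogram and of the noise.
        estimate = [alpha * xi - sigma * vi for xi, vi in zip(x, v)]
        noise = [sigma * xi + alpha * vi for xi, vi in zip(x, v)]
        if step == num_steps - 1:
            x = estimate  # final step: the estimate is the updated representation
        else:
            alpha_n, sigma_n = schedule(1.0 - (step + 1) / num_steps)
            # Deterministic DDIM-style update to the next noise level.
            x = [alpha_n * e + sigma_n * n for e, n in zip(estimate, noise)]
    return x


def make_toy_denoiser(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a trained network: it already knows the clean target."""
    def denoiser(x: Vector, t: float) -> Vector:
        alpha, sigma = schedule(t)
        eps = [(xi - alpha * ti) / sigma for xi, ti in zip(x, target)]
        return [alpha * e - sigma * ti for e, ti in zip(eps, target)]
    return denoiser


target = [0.5, -1.0, 2.0, 0.0]
result = sample(make_toy_denoiser(target), size=len(target))
```

Because the toy denoiser is exact, `result` recovers `target` up to floating-point error; a trained network would only approximate it.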
[0078] Then, system 130 can condition the upsampling diffusion neural network 270 on the low-resolution spectrogram 240 of the first-length musical work or soundscape to generate a second, higher-resolution spectrogram 280.
[0079] Optionally, the upsampling diffusion neural network 270 may also be conditioned on the audio input, for example, audio input generated from a user request by a language model neural network.
[0080] System 130 can condition the upsampling diffusion neural network 270 on the low-resolution spectrogram 240 in any of a variety of ways.
[0081] For example, system 130 may upsample the low-resolution spectrogram to the higher resolution, and then, at each sampling step during the generation of the higher-resolution spectrogram 280, include the upsampled low-resolution spectrogram in the diffusion input for the upsampling diffusion neural network 270. For example, the diffusion input may include a concatenation of the upsampled low-resolution spectrogram and the current representation of the higher-resolution spectrogram 280.
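The conditioning scheme in this example can be sketched as follows. Nearest-neighbor upsampling and concatenation along a leading channel axis are assumptions; the specification leaves the interpolation method and the concatenation axis open, and the function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: [frequency_bin][time_frame]


def upsample_nearest(spec: Matrix, factor: int) -> Matrix:
    """Repeat every row and every value `factor` times."""
    return [[v for v in row for _ in range(factor)]
            for row in spec for _ in range(factor)]


def build_diffusion_input(low_res: Matrix, current: Matrix,
                          factor: int) -> List[Matrix]:
    """Concatenate the upsampled low-res spectrogram with the current representation."""
    upsampled = upsample_nearest(low_res, factor)
    if len(upsampled) != len(current) or len(upsampled[0]) != len(current[0]):
        raise ValueError("upsampled conditioning must match the current shape")
    return [upsampled, current]  # indexed as [channel][frequency_bin][time_frame]


low_res = [[1.0, 2.0]]
current = [[0.0] * 4 for _ in range(2)]
diffusion_input = build_diffusion_input(low_res, current, factor=2)
```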
[0082] System 130 can use the upsampling diffusion neural network 270, conditioned on the low-resolution spectrogram and optionally on the audio input, to generate the higher-resolution spectrogram 280 across multiple sampling steps, for example, in the same manner as described above for generating the low-resolution spectrogram.
[0083] In some implementations, system 130 then uses vocoder 290 to generate musical works or soundscapes 220 from spectrogram 280.
[0084] The vocoder 290 can be any suitable software that maps a spectrogram to a waveform.
[0085] For example, vocoder 290 could be a vocoder that applies a phase reconstruction method to a spectrogram to generate a waveform, such as a Griffin-Lim vocoder.
[0086] As another example, vocoder 290 can be a trained neural network-based vocoder, such as a diffusion neural network or a WaveNet-based neural network.
[0087] In some other implementations, spectrogram 280 may represent a musical work or soundscape having a length (in the time dimension) shorter than the target length (in the time dimension) of the final work to be generated by system 130.
[0088] In other words, spectrogram 280 can span a time window shorter than the target time window to be spanned by the final musical work or soundscape. As a specific example, spectrogram 280 can be 30 seconds long, while the target length of the final musical work or soundscape can be 3 minutes.
[0089] For example, this difference in length could be due to the target length being longer than the length of the audio that can be accurately modeled directly from the audio input by the diffusion model hierarchy.
[0090] To address this issue, in some implementations, system 130 may use spectrogram 280 to iteratively generate the final musical composition or soundscape.
[0091] In other words, at each of multiple iterations, system 130 can use the spectrograms generated up to that iteration to generate a new spectrogram spanning the time window that immediately follows the most recently generated spectrogram in the final musical work or soundscape. That is, system 130 can iteratively extend the length of the generated musical work or soundscape until the target length is reached.
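The iteration count implied by this scheme can be sketched as follows, using the 30-second and 3-minute figures from the example above. The assumption that every new spectrogram spans a fixed block length equal to the initial length is made for illustration only.

```python
from typing import List, Tuple


def plan_extension_windows(initial_len_s: int, target_len_s: int,
                           block_len_s: int) -> List[Tuple[int, int]]:
    """List the (start, end) time windows, in seconds, of each new spectrogram."""
    if block_len_s <= 0:
        raise ValueError("block length must be positive")
    windows = []
    start = initial_len_s
    while start < target_len_s:
        # Each window begins where the previous spectrogram ended.
        end = min(start + block_len_s, target_len_s)
        windows.append((start, end))
        start = end
    return windows


windows = plan_extension_windows(initial_len_s=30, target_len_s=180, block_len_s=30)
```

Under these assumptions, five extension iterations are needed after the initial 30-second spectrogram.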
[0092] Figure 3 shows an example 300 of the operation of the audio generative neural network system 130 as it iteratively extends the generated audio.
[0093] Specifically, as described above, system 130 first generates a spectrogram 280, which represents the initial time window of the final, longer time window. More specifically, the initial time window has a first length, while the final time window has a second, longer length. For example, the system can generate 15-second, 30-second, or 45-second "blocks" of longer 3-minute, 4-minute, or 5-minute songs.
[0094] Before iteratively expanding the spectrogram 280, the system 130 uses the spectrogram 280 to generate an initial expanded spectrogram 310 that spans the entire final time window but has a lower resolution than the target resolution of the musical work or soundscape (and therefore a lower resolution than the spectrogram 280).
[0095] System 130 can use an initial low-resolution extended diffusion neural network 320 to generate an initial extended spectrogram 310.
[0096] The diffusion neural network 320 is a diffusion neural network configured (trained) to receive an initial low-resolution extended diffusion input and process that diffusion input to generate an initial low-resolution extended denoising output. The initial low-resolution extended diffusion input includes a representation of audio input 210, a representation of spectrogram 280, and a current representation of spectrogram 310.
[0097] System 130 uses an initial low-resolution extended diffusion neural network 320 to generate a spectrogram 310 across multiple sampling steps.
[0098] Before performing the multiple sampling steps, system 130 initializes the representation of spectrogram 310.
[0099] Specifically, system 130 initializes the representation by sampling the corresponding noisy value of each value in the spectrogram 310 from a noise distribution (e.g., a Gaussian distribution). Thus, the representation of spectrogram 310 initially has only noisy values.
[0100] At each sampling step, system 130 uses the initial low-resolution extended diffusion neural network 320 to process the initial low-resolution extended diffusion input, which includes the current representation of spectrogram 310 and the other data described above, to generate an initial low-resolution extended denoising output for that sampling step.
[0101] Then, system 130 uses the initial low-resolution extended denoising output to update the representation.
[0102] For example, the system can calculate an estimate of the spectrogram 310 from the current representation and the low-resolution extended denoised output, and then use the estimate to update the current representation.
[0103] For each sampling step except the last sampling step, system 130 can apply a diffusion sampler to the estimate to generate an updated representation.
[0104] For the final sampling step, system 130 can use this estimate as an updated representation.
[0105] After generating spectrogram 310, system 130 then iteratively expands spectrogram 280 at each of multiple iterations.
[0106] Specifically, at each iteration, system 130 generates a new high-resolution spectrogram 330 conditioned on (i) a low-resolution spectrogram used to generate the most recently generated high-resolution spectrogram, which will be described in more detail below, and (ii) a corresponding portion of the initial expanded spectrogram 310.
[0107] At the first iteration, the most recently generated high-resolution spectrogram is spectrogram 280. At each subsequent iteration, the most recently generated high-resolution spectrogram is the high-resolution spectrogram generated at the previous iteration.
[0108] The corresponding portion of the initial expanded spectrogram 310 is the portion that spans the same time window as the new high-resolution spectrogram 330. Optionally, the corresponding portion may also include the portion preceding the time window spanned by the new high-resolution spectrogram 330 as additional context.
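Selecting the corresponding portion can be sketched as a slice along the time axis. The frame rate, the nested-list layout, and the `context_s` parameter name are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: [frequency_bin][time_frame]


def corresponding_portion(expanded: Matrix, window_start_s: float,
                          window_end_s: float, frames_per_second: float,
                          context_s: float = 0.0) -> Matrix:
    """Slice the frames covering a time window, plus optional preceding context."""
    first = max(0, int(round((window_start_s - context_s) * frames_per_second)))
    last = int(round(window_end_s * frames_per_second))
    return [row[first:last] for row in expanded]


# One frequency bin and 12 frames at a hypothetical 2 frames per second.
expanded = [[float(i) for i in range(12)]]
portion = corresponding_portion(expanded, 2.0, 4.0, frames_per_second=2.0)
with_context = corresponding_portion(expanded, 2.0, 4.0, frames_per_second=2.0,
                                     context_s=1.0)
```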
[0109] System 130 can use high-resolution extended system 340 to generate new high-resolution spectrograms 330.
[0110] System 340 may include (i) a low-resolution extended diffusion neural network and (ii) an upsampling diffusion neural network.
[0111] In some implementations, the low-resolution extended diffusion neural network is the same as the base diffusion neural network 260, and the upsampling diffusion neural network is the same as the upsampling diffusion neural network 270.
[0112] In some other implementations, the parameter values of one or both of the low-resolution extended diffusion neural network and the upsampling diffusion neural network have been learned separately from the parameter values of neural networks 260 and 270.
[0113] The low-resolution extended diffusion neural network is a diffusion neural network configured to receive a low-resolution extended diffusion input and process that diffusion input to generate a low-resolution extended denoising output. The low-resolution extended diffusion input includes a representation of audio input 210, a representation of the corresponding portion of the initial expanded spectrogram 310, a representation of the most recently generated low-resolution spectrogram, and a current representation of spectrogram 330. When audio input 210 includes lyrics, the representation of audio input 210 may include only the portion of the lyrics that approximately corresponds to the time window spanned by spectrogram 330.
[0114] System 130 uses the low-resolution extended diffusion neural network of system 340 to generate a low-resolution spectrogram across multiple sampling steps, for example, a spectrogram that has the first resolution but spans the same time window as spectrogram 330.
[0115] Before performing the multiple sampling steps, system 130 initializes the representation of the low-resolution spectrogram.
[0116] In some implementations, system 130 initializes the representation by sampling the corresponding noisy value of each value in the low-resolution spectrogram from a noise distribution (e.g., a Gaussian distribution). Thus, the representation initially has only noisy values.
[0117] In some other implementations, system 130 may upsample a corresponding portion of the initial expanded spectrogram 310 to generate an upsampled spectrogram with a first resolution. The system can then use the upsampled spectrogram as an initialized representation.
[0118] At each sampling step, system 130 uses the low-resolution extended diffusion neural network 340 to process the low-resolution extended diffusion input, which includes the current representation of the spectrogram and the other data described above, to generate a low-resolution extended denoising output for that sampling step.
[0119] Then, system 130 uses the low-resolution extended denoising output to update the representation.
[0120] For example, system 130 can compute an estimate of the spectrogram from the current representation and the low-resolution extended denoising output, and then use the estimate to update the current representation. For each sampling step except the last sampling step, system 130 can apply a diffusion sampler to the estimate to generate the updated representation. For the last sampling step, system 130 can use the estimate as the updated representation.
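The sampling steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation described in this specification: the names `sample_spectrogram`, `denoise_fn`, `alphas`, and `sigmas` are hypothetical, the stand-in network is assumed to predict the noise component, and a deterministic DDIM-style update is assumed as the diffusion sampler.

```python
import numpy as np

def sample_spectrogram(denoise_fn, shape, alphas, sigmas, rng, init=None):
    """Run a multi-step sampling loop and return the final spectrogram estimate.

    denoise_fn(z, step) is a hypothetical stand-in for the diffusion network;
    it is assumed to return an estimate of the noise component of z.
    alphas[i] and sigmas[i] are assumed signal and noise scales at step i,
    ordered from the noisiest step (i = 0) to the least noisy step.
    """
    # Initialize from Gaussian noise unless an initial representation
    # (e.g., an upsampled spectrogram) is supplied.
    z = rng.standard_normal(shape) if init is None else np.array(init, dtype=float)
    num_steps = len(alphas)
    for i in range(num_steps):
        eps_hat = denoise_fn(z, i)
        # Estimate of the clean spectrogram from the current representation.
        x_hat = (z - sigmas[i] * eps_hat) / alphas[i]
        if i == num_steps - 1:
            z = x_hat  # last step: use the estimate as the updated representation
        else:
            # Assumed DDIM-style update: re-noise the estimate to the next level.
            z = alphas[i + 1] * x_hat + sigmas[i + 1] * eps_hat
    return z

# Usage with an "oracle" denoiser that knows the clean spectrogram x.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
t = np.linspace(0.95, 0.05, 10)
alphas, sigmas = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)
oracle = lambda z, i: (z - alphas[i] * x) / sigmas[i]
result = sample_spectrogram(oracle, x.shape, alphas, sigmas, rng)
```

With a trained network in place of the oracle, the loop would return an approximation rather than the exact spectrogram.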
[0121] Then, system 130 can use an upsampling diffusion neural network to upsample the low-resolution spectrogram to the target resolution, for example, as described above with reference to Figure 2.
[0122] The diffusion neural networks described in this specification—namely, diffusion neural networks 260, 270, 320, and 340—can have any suitable architecture. For example, some or all of the diffusion neural networks can be convolutional neural networks with the same or different numbers of channels, such as U-Net, or other architectures that map an input of a given dimension to an output of the same dimension. More generally, since diffusion neural networks generate spectrograms as outputs, and spectrograms can be represented as images, diffusion neural networks can have any suitable text-to-image diffusion neural network architecture.
[0123] In any of the examples above, the data item generated using the diffusion neural network can be an output in the output space, i.e., such that the values in the data item are the values of the spectrogram at the appropriate resolution, or an output in the latent space, i.e., such that the values in the data item are the values of a latent representation of the spectrogram.
[0124] When generating output data items in the latent space, the system can generate the final spectrogram in the output space by processing the output data items in the latent space using a decoder neural network (e.g., a decoder neural network already pre-trained in an autoencoder framework). During training, the system can encode the target spectrogram in the output space using an encoder neural network (e.g., an encoder neural network already pre-trained in conjunction with the decoder in an autoencoder framework) to generate the target output for the diffusion neural network in the latent space.
[0125] More specifically, as described above, for any given diffusion neural network, the system uses the diffusion neural network to perform a reverse diffusion process across multiple update iterations to generate output data items.
[0126] Any given diffusion neural network described above can be any suitable diffusion neural network that has been trained, for example, by the system or another training system to process the diffusion input for any given update iteration, including the current data item (up to that update iteration), to generate a denoised output for that update iteration.
[0127] In some implementations, the denoised output is an estimate of the noise component of the current data item—that is, the noise that needs to be combined with the final data item (i.e., added to or subtracted from the final data item) to generate the current data item.
[0128] In some other implementations, the denoised output is an estimate of the final data item given the current data item, that is, an estimate of the data item that would be generated by removing the noise component of the current data item.
[0129] In some other implementations, the denoised output is a prediction of the v-parameterization of the noise components and the final data item (Salimans and Ho, arXiv: 2202.00512, 2022, Section 4; Appendix D).
[0130] For example, the system or another training system may have already trained the diffusion neural network on a set of training data items using a denoising score matching objective to generate the denoised output.
[0131] The denoising score matching objective can measure an error (e.g., mean squared error, L1 error, L2 error, or a different type of error) between (i) the denoised output, which is generated by processing an input that includes a noisy data item generated by adding sampled noise to a training data item, and (ii) a target denoised output, which is generated from the training data item, from the sampled noise, or from both.
[0132] For example, when the denoised output is an estimate of the noise component of the current data item, the target denoised output can be the sampled noise.
[0133] As another example, when the denoised output is an estimate of the target data item, the target denoised output can be the target data item.
[0134] As another example, when the denoised output is a prediction of the v-parameterization of the noise component and the final data item, the target denoised output can be the true v-parameterization of the sampled noise and the target data item.
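As an illustration of the three target choices above, the following sketch builds a noisy data item and the corresponding targets. The function names and the scales `alpha` and `sigma` are hypothetical; the v-parameterization target follows the definition v = alpha * noise - sigma * data given by Salimans and Ho, and mean squared error is used as one example of the error measure.

```python
import numpy as np

def make_training_example(x, alpha, sigma, rng):
    """Return a noisy data item and the target for each parameterization."""
    eps = rng.standard_normal(x.shape)   # sampled noise
    z = alpha * x + sigma * eps          # noisy data item fed to the network
    targets = {
        "noise": eps,                    # target when the network predicts the noise
        "data": x,                       # target when the network predicts the data item
        "v": alpha * eps - sigma * x,    # target under the v-parameterization
    }
    return z, targets

def mse(pred, target):
    """Mean squared error between a denoised output and its target."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # stand-in for a training spectrogram
alpha, sigma = 0.8, 0.6                  # assumed scales with alpha**2 + sigma**2 == 1
z, targets = make_training_example(x, alpha, sigma, rng)
```

During training, the network output for `z` would be compared against the entry of `targets` that matches the chosen parameterization, for example with `mse`.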
[0135] In other words, to train a given diffusion neural network, the system can obtain a training dataset comprising spectrograms with the corresponding resolution and a corresponding conditional input for each spectrogram. For example, the system can obtain spectrograms representing synthesized or real-world musical works or soundscapes, and can then downsample each spectrogram as needed to generate spectrograms with the corresponding resolution. The system can also obtain, for use as the conditional inputs, structured data (e.g., structured text as described elsewhere in this specification) or natural language that describes the properties of the musical work or soundscape. Structured data, such as structured text, can be obtained by manually annotating the spectrograms in the training dataset or can be derived from the audio represented by the spectrograms in the training dataset. The system can then use the spectrograms and conditional inputs to train the diffusion neural network on the objective described above.
[0136] As mentioned above, a diffusion neural network can have any suitable architecture that allows the neural network to map a diffusion input, which includes an input data item with the same dimension as the output data item, to a denoised output that also has the same dimension as the output data item.
[0137] Furthermore, as mentioned above, the diffusion neural network can be conditioned on the conditional input in any of a variety of ways.
[0138] As an example, the system can use an encoder neural network to generate one or more embeddings representing the conditional input, and the diffusion neural network can include one or more cross-attention layers, each of which cross-attends to the one or more embeddings.
[0139] As used in this specification, an embedding is an ordered collection of numerical values, such as a vector of floating-point values or other types of values.
[0140] For example, when the conditional input is text, the system can use a text encoder neural network (e.g., a Transformer neural network) to generate a fixed or variable number of text embeddings representing the conditional input.
[0141] When the conditional input is an image (e.g., a spectrogram), the system can use an image encoder neural network (e.g., a convolutional neural network or a vision Transformer neural network) to generate a set of embeddings representing the image.
[0142] When the conditional input is audio, the system can use, for example, an audio encoder neural network (e.g., an audio encoder neural network that has been jointly trained with a decoder neural network as part of a neural audio codec (e.g., SoundStream, Zeghidour et al., arXiv: 2107.03312v1, 2021)) to generate one or more embeddings that encode the audio.
[0143] When the conditional input is a scalar value, the system can use, for example, an embedding matrix to map the scalar value or its one-hot representation to the embedding.
[0144] In some cases, the conditional input includes multiple different types of input, such as two or more of text, images, audio, or scalar values.
[0145] In some of these cases, the system can generate one or more initial embeddings for each type of input, that is, using an appropriate encoder neural network as described above, and then use a Transformer encoder neural network to process the initial embeddings for all of the different types of inputs to update each of the initial embeddings, thereby generating a final set of embeddings. The one or more cross-attention layers within the diffusion neural network can then cross-attend to this final set of embeddings.
[0146] In some of these cases, different cross-attention layers within the diffusion neural network can cross-attend to the embeddings of different types of conditional inputs.
[0147] In some of these cases, the system can concatenate the initial embeddings of the different types of inputs along the sequence dimension to form a final set of embeddings, and the one or more cross-attention layers can then cross-attend to the concatenated final set of embeddings.
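The concatenation variant can be sketched with a single-head cross-attention layer. This is a simplified illustration only: the weight matrices are random stand-ins for learned parameters, the embedding counts are arbitrary, and all names are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, cond, w_q, w_k, w_v):
    """Single-head cross-attention: queries from features, keys/values from cond."""
    q = features @ w_q                                  # [n, d]
    k = cond @ w_k                                      # [m, d]
    v = cond @ w_v                                      # [m, d]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # [n, m]
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16
text_emb = rng.standard_normal((5, d))    # e.g., five text embeddings
audio_emb = rng.standard_normal((3, d))   # e.g., three audio embeddings
# Concatenate the embeddings of the different input types along the sequence dimension.
cond = np.concatenate([text_emb, audio_emb], axis=0)
features = rng.standard_normal((10, d))   # stand-in for flattened spectrogram features
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, weights = cross_attention(features, cond, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one spectrogram position attends to each of the eight conditioning embeddings.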
[0148] As another example, a diffusion neural network can include one or more other types of neural network layers conditioned on the one or more embeddings. Examples of such layers include feature-wise linear modulation (FiLM) layers, layers with conditionally gated activation functions, and so on.
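A FiLM layer scales and shifts each feature channel using values computed from the conditioning embedding. The sketch below assumes a [channels, frequency, time] feature layout and plain linear projections; the function name and weight matrices are hypothetical.

```python
import numpy as np

def film(features, embedding, w_gamma, w_beta):
    """Apply feature-wise linear modulation.

    features: [channels, freq, time] feature map (assumed layout).
    embedding: [d] conditioning embedding.
    w_gamma, w_beta: [d, channels] projections (stand-ins for learned weights).
    """
    gamma = embedding @ w_gamma   # per-channel scale
    beta = embedding @ w_beta     # per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

features = np.ones((2, 2, 2))
embedding = np.array([1.0, 0.0])
w_gamma = np.array([[2.0, 3.0], [0.0, 0.0]])
w_beta = np.array([[1.0, -1.0], [0.0, 0.0]])
modulated = film(features, embedding, w_gamma, w_beta)
```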
[0149] The diffusion input at any given update iteration may also include data defining the noise level for that iteration. Typically, each update iteration has a corresponding time step t, and the noise level for the iteration depends on the time step. For example, the noise level can be a decreasing function of the time step t. Examples of such functions include linear functions, cosine functions, and sigmoid functions. In these cases, data identifying the noise level, the time step, or both can be embedded using an appropriate neural network (e.g., a multilayer perceptron (MLP)) and used to condition the diffusion neural network 110, as described above for the conditional input.
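The following sketch gives one possible form for each of the three function families and for the time-step embedding. The specific formulas, the normalization of t to the interval [0, 1], the sinusoidal features fed to the MLP, and all names are assumptions made for illustration.

```python
import numpy as np

def linear_level(t):
    """Linear noise level, decreasing from 1 to 0 as t goes from 0 to 1."""
    return 1.0 - t

def cosine_level(t):
    """Cosine noise level, decreasing from 1 to 0 as t goes from 0 to 1."""
    return np.cos(0.5 * np.pi * t)

def sigmoid_level(t, steepness=10.0):
    """Sigmoid noise level centered at t = 0.5, decreasing in t."""
    return 1.0 / (1.0 + np.exp(steepness * (t - 0.5)))

def embed_time_step(t, w1, b1, w2, b2):
    """Embed a scalar time step: sinusoidal features followed by a two-layer MLP."""
    freqs = 2.0 ** np.arange(w1.shape[0] // 2)
    feats = np.concatenate([np.sin(np.pi * t * freqs), np.cos(np.pi * t * freqs)])
    hidden = np.maximum(feats @ w1 + b1, 0.0)   # ReLU activation
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((8, 32)), np.zeros(32)   # random stand-ins for learned weights
w2, b2 = rng.standard_normal((32, 16)), np.zeros(16)
time_embedding = embed_time_step(0.25, w1, b1, w2, b2)
```

The resulting embedding could then be supplied to the network in the same way as any other conditioning embedding.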
[0150] Furthermore, as mentioned above, at each update iteration, the system uses the denoised output generated by the diffusion neural network to update the current data item up to that update iteration.
[0151] For example, the system can use the denoised output to determine an initial estimate of the final data item, and then apply an appropriate diffusion sampler to that initial estimate to update the current data item.
[0152] As another example, the system can use classifier-free guidance or negation guidance to adjust the denoised output, use the adjusted denoised output to determine an initial estimate of the final data item, and then apply an appropriate diffusion sampler to that initial estimate to update the current data item. Classifier-free guidance is described, for example, in arXiv:2207.12598 by Ho and Salimans.
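Classifier-free guidance combines two denoised outputs by extrapolating from a baseline output toward the conditioned output. The sketch below assumes that the baseline is either the output computed without the conditional input or, as one possible reading of guidance with a negative input, the output computed with the negative audio input; the function name and the guidance scale are hypothetical.

```python
import numpy as np

def guided_output(cond_out, base_out, guidance_scale):
    """Extrapolate from the baseline output toward the conditioned output.

    base_out is the denoised output without the conditional input, or
    (as an assumption here) the output conditioned on the negative input.
    """
    return base_out + guidance_scale * (cond_out - base_out)

cond_out = np.array([1.0, 2.0])
base_out = np.array([0.0, 1.0])
adjusted = guided_output(cond_out, base_out, guidance_scale=2.0)
```

A scale of 1 reproduces the conditioned output, and larger scales push the result further away from the baseline.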
[0153] The system can use any suitable diffusion sampler (e.g., a DDPM (Denoising Diffusion Probabilistic Model) sampler, a DDIM (Denoising Diffusion Implicit Model) sampler, or another suitable sampler) to update the current data item using this estimate, thereby generating the updated current data item. DDPM is discussed, for example, in Ho et al., arXiv:2006.11239.
[0154] When the denoised output is a prediction of a data item, the system can directly use the denoised output (or the adjusted denoised output) as an initial estimate.
[0155] When the denoised output is a prediction of the noise component, the system can determine an initial estimate from the current data item, the denoised output, and the noise level for the current update iteration, for example, by combining the current data item and the denoised output according to the noise level for the current update iteration.
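The conversion from a denoised output to an initial estimate can be sketched for each parameterization as follows. The sketch assumes that the current data item has the form alpha * data + sigma * noise, and the v-parameterization branch additionally assumes alpha**2 + sigma**2 == 1; the function name and arguments are hypothetical.

```python
import numpy as np

def initial_estimate(z, denoised, alpha, sigma, kind):
    """Map a denoised output to an estimate of the final data item.

    z is the current data item, assumed to equal alpha * data + sigma * noise.
    """
    if kind == "data":
        return denoised                        # the output is already the estimate
    if kind == "noise":
        return (z - sigma * denoised) / alpha  # remove the predicted noise
    if kind == "v":
        return alpha * z - sigma * denoised    # valid when alpha**2 + sigma**2 == 1
    raise ValueError(f"unknown parameterization: {kind}")

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
eps = rng.standard_normal((4, 8))
alpha, sigma = 0.8, 0.6
z = alpha * x + sigma * eps
v = alpha * eps - sigma * x
from_noise = initial_estimate(z, eps, alpha, sigma, "noise")
from_v = initial_estimate(z, v, alpha, sigma, "v")
```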
[0156] Optionally, at the final update iteration, the system can skip the diffusion sampler and instead use the initial estimate as the updated current data item.
[0157] After the final update iteration, the system outputs the current data item as the final output data item. As mentioned above, when the data item is a spectrogram in the output space, the system can use the final data item as the spectrogram generated using the diffusion neural network. When the data item is in the latent space, the system can map the final data item to a spectrogram by processing the final data item using a decoder neural network.
[0158] Figure 4 is a flowchart of an example process 400 for generating an initial musical work or soundscape. For convenience, process 400 will be described as being executed by a system of one or more computers located in one or more locations. For example, an audio generation system appropriately programmed according to this specification (e.g., the audio generation system 100 of Figure 1) can execute process 400.
[0159] The system receives audio input (step 402).
[0160] Typically, audio input is text that characterizes the desired properties of a musical work or soundscape. For example, audio input can be unstructured natural language text or structured text that specifies corresponding values for one or more attributes of the work or soundscape.
[0161] As described above, audio input may include (i) positive audio input, (ii) negative audio input, and optionally (iii) lyrics of a musical work.
[0162] In some implementations, the system receives audio input directly from the user, while in others, the system uses, for example, a language model neural network to generate audio input from the raw user input.
[0163] The system generates a low-resolution spectrogram that is conditioned on the audio input and that spans a first time window (step 404). For example, the system can generate the low-resolution spectrogram using the basic diffusion neural network described above with reference to Figure 2.
[0164] The system generates a high-resolution spectrogram that is conditioned on the low-resolution spectrogram and the audio input and that also spans the first time window (step 406). For example, the system can generate the high-resolution spectrogram using the upsampling diffusion neural network described above with reference to Figure 2.
[0165] Optionally, the system can then generate a waveform from the high-resolution spectrogram (step 408). That is, in some cases, the first time window spanned by the high-resolution spectrogram matches the target length of the musical work or soundscape. In these cases, the system can use a vocoder to generate the waveform and provide that waveform as the system's final output.
[0166] In some other cases, the first time window is shorter than the target length. In these cases, the system does not need to generate a waveform, and can instead use the high-resolution spectrogram to iteratively extend the length of the musical work or soundscape.
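The control flow of process 400 can be summarized in a short sketch. The callables `base_fn`, `upsample_fn`, and `vocoder_fn` are hypothetical stand-ins for the basic diffusion neural network, the upsampling diffusion neural network, and the vocoder; the stubs used below only illustrate shapes and are not trained models.

```python
import numpy as np

def generate_initial(audio_input, base_fn, upsample_fn, vocoder_fn, window_s, target_s):
    """Run steps 404 to 408 and report what kind of output was produced."""
    low = base_fn(audio_input)                # step 404: low-resolution spectrogram
    high = upsample_fn(low, audio_input)      # step 406: high-resolution spectrogram
    if window_s >= target_s:
        # The first time window already covers the target length (step 408).
        return "waveform", vocoder_fn(high)
    # Otherwise hand the spectrogram to the iterative extension process.
    return "spectrogram", high

# Stub components, for illustration only.
base = lambda prompt: np.zeros((4, 8))
upsample = lambda low, prompt: np.repeat(low, 2, axis=0)
vocoder = lambda spec: spec.mean(axis=0)

kind_short, out_short = generate_initial("prompt", base, upsample, vocoder, 30, 30)
kind_long, out_long = generate_initial("prompt", base, upsample, vocoder, 30, 120)
```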
[0167] Figure 5 is a flowchart of an example process 500 for generating longer musical works or soundscapes. For convenience, process 500 will be described as being executed by a system of one or more computers located in one or more locations. For example, an audio generation system appropriately programmed according to this specification (e.g., the audio generation system 100 of Figure 1) can execute process 500.
[0168] The system obtains the audio input and an initial spectrogram spanning a first time window (step 502), which is shorter than the target time window spanned by the final musical work or soundscape.
[0169] The system generates an initial expanded spectrogram spanning the target time window from the audio input and the initial spectrogram (step 504). In some implementations, the initial expanded spectrogram has a resolution lower than both the first resolution and the target resolution.
[0170] The system iteratively expands the initial spectrogram conditioned on the audio input and the initially expanded spectrogram (step 506). That is, the system can generate a new portion of the final spectrogram at each of multiple iterations until the final spectrogram reaches the target length, i.e., until the final spectrogram spans the target time window.
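The iterative expansion of step 506 can be sketched as a loop that appends new frames until the target length is reached. The names `extend_to_target`, `extend_fn`, `hop`, and `ratio` are hypothetical, and the stub used for `extend_fn` simply repeats coarse frames; in the described system that role would be played by the extension diffusion neural networks.

```python
import numpy as np

def extend_to_target(initial, coarse, extend_fn, hop, ratio):
    """Grow `initial` ([freq, frames]) until it covers coarse.shape[1] * ratio frames.

    coarse is a lower-resolution spectrogram spanning the whole target window;
    ratio is the assumed number of fine frames per coarse frame.
    extend_fn(spec, coarse_window) returns the next block of frames.
    """
    target_frames = coarse.shape[1] * ratio
    spec = initial
    while spec.shape[1] < target_frames:
        start = spec.shape[1]
        # Portion of the coarse spectrogram that corresponds to the new frames.
        coarse_window = coarse[:, start // ratio:(start + hop) // ratio]
        new = extend_fn(spec, coarse_window)
        if new.shape[1] == 0:
            raise ValueError("extend_fn must return at least one new frame")
        spec = np.concatenate([spec, new], axis=1)
    return spec[:, :target_frames]

initial = np.ones((2, 4))
coarse = np.arange(12.0).reshape(2, 6)
stub_extend = lambda spec, window: np.repeat(window, 2, axis=1)
final = extend_to_target(initial, coarse, stub_extend, hop=4, ratio=2)
```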
[0171] Figure 6 is a flowchart of an example process 600 for generating audio and image inputs. For convenience, process 600 will be described as being executed by a system of one or more computers located in one or more locations. For example, an audio generation system appropriately programmed according to this specification (e.g., the audio generation system 100 of Figure 1) can execute process 600.
[0172] The system receives user input (step 602). For example, user input may be natural language input describing a musical work or soundscape. As a specific example, user input may be free-form natural language text that has been submitted through a user interface presented by the system on the user's device.
[0173] The system generates an input sequence from the user input (step 604). In some implementations, the input sequence includes only text tokens from the user input, while in others, it includes additional text in addition to the user input. For example, the input sequence may include natural language instructions or few-shot prompts.
[0174] The system uses a language model neural network to process the input sequence to generate the output sequence (step 606).
[0175] The output sequence typically includes one or more subsequences.
[0176] Specifically, one or more of the subsequences define the audio input.
[0177] In other words, the language model neural network has been configured, for example, by fine-tuning, by including natural language instructions or few-shot prompts in the input sequence, or both, to generate output sequences comprising different parts that define the structured input to the audio generation neural network.
[0178] The input is called "structured" because it follows a specific format or structure, rather than being free-form natural language text. For example, the structured input can be structured to include a positive prompt and lyrics, and optionally a negative prompt.
[0179] Positive prompts can be structured to include corresponding values for each of several properties that the generated audio should possess. Examples of properties include genre, audio quality, style, rhythm, cadence, pitch, timbre, dynamics, melody, instruments used, the year (or other unit of time) in which the audio was produced, and so on. Examples of properties are described below with reference to Figures 7A to 7C.
[0180] Negative prompts can be structured to include the corresponding value for each of several properties that the audio should not possess. The properties specified in a negative prompt can be the same as or different from those specified in a positive prompt.
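One way to represent the structured audio input in code is a small record type with a parser for the language model's output sequence. This specification does not fix a concrete text format, so the tagged layout assumed here ("POSITIVE:", "NEGATIVE:", "LYRICS:") and the names `AudioInput` and `parse_output_sequence` are purely hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioInput:
    positive: str                   # properties the audio should possess
    lyrics: str                     # lyrics of the musical work
    negative: Optional[str] = None  # properties the audio should not possess

TAGS = ("POSITIVE", "NEGATIVE", "LYRICS")

def parse_output_sequence(text):
    """Split a tagged output sequence into its subsequences (hypothetical format)."""
    sections = {}
    current = None
    for line in text.splitlines():
        tag, sep, rest = line.partition(":")
        if sep and tag.strip().upper() in TAGS:
            # A new tagged subsequence starts on this line.
            current = tag.strip().lower()
            sections[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            # Continuation line of the current subsequence.
            sections[current].append(line)
    negative = "\n".join(sections["negative"]) if "negative" in sections else None
    return AudioInput(
        positive="\n".join(sections.get("positive", [])),
        lyrics="\n".join(sections.get("lyrics", [])),
        negative=negative,
    )

example = (
    "POSITIVE: britpop, pop, 1998, HQ\n"
    "NEGATIVE: speech, audiobook, podcast, low quality\n"
    "LYRICS:\n"
    "The rain is falling and the sky is grey\n"
    "And I'm wishing that I was back at home"
)
parsed = parse_output_sequence(example)
```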
[0181] The system uses an audio generative neural network system to process the audio input to generate audio described by the user input (step 608).
[0182] Optionally, the system can also generate image input from the output sequence, for example, by combining only the positive audio input with a predetermined prompt or by combining both the positive and negative audio inputs with a predetermined prompt.
[0183] The system uses an image-generative neural network system to process the image input to generate an image describing the generated audio (step 610). Specifically, when the generated audio is a musical work as described above, the generated image can be an "album cover" that visually depicts certain characteristics of the generated musical work.
[0184] Figures 7A to 7C show an example 700 of the operation of the system. Specifically, Figures 7A to 7C show an example of a user interface provided by the system, which allows users to submit input that results in the generation of music.
[0185] As shown in Figure 7A, the user submits an initial query in free-form natural language (here, "Song about the weather in London") to the query field ("Describe the music you want to hear").
[0186] As shown in Figure 7B, the user selects "Generate prompt." In response, the system uses a language model neural network to generate a positive prompt ("List elements to include"), a negative prompt ("List elements to exclude (optional)"), and lyrics ("Lyrics (optional)"). These fields are collectively referred to as the "detailed prompts."
[0187] In this example, the positive prompt is "London Weather, Blur, britpop, electricguitars, pop, 1998, HQ, pristine quality, Remastered 2023", the negative prompt is "LondonCalling, The Clash, speech, audiobook, podcast, low quality", and the lyrics are "There's a cold wind blowing in the streets of London\nAnd I don't know if I'll ever get warm again\nThe rain is falling and the sky is grey\nAnd I'm wishing that I was back at home".
[0188] It should be noted that users can also directly input positive prompts, negative prompts, and lyrics through the user interface, or edit the output of the language model.
[0189] As shown in Figure 7C, the user selects "Generate". In response, the system generates a musical work or soundscape from the detailed prompts using the audio generation model described above, and can generate corresponding album art using the image generation model described above.
[0190] It should be noted that if the advanced fields are hidden (e.g., by clicking "Advanced"), the user can generate the detailed prompts, images, and audio using a single input by directly selecting "Generate". That is, the system can use a language model neural network to generate the detailed prompts and then generate music or soundscapes and images from the detailed prompts without displaying them to the user.
[0191] Furthermore, by using a diffusion model as part of both the image-generative neural network and the audio-generative neural network, the system can generate multiple different plausible musical works or soundscapes, and optionally multiple different images, from the same detailed prompts by leveraging the randomness of the generation process (e.g., by sampling different noise). The system can then display all of these different album covers and works or soundscapes to the user, allowing the user to view and listen to multiple plausible, high-quality interpretations of their query.
[0192] The described system has applications that extend beyond simply generating musical works. For example, as described above, system 130 can iteratively extend the length of the generated musical work or soundscape until a target length is reached, or in principle, extend the length of the generated musical work or soundscape indefinitely. This facilitates the use of the generated musical work or soundscape in therapeutic applications. As a specific example, the generated musical work or soundscape can be used to mask tinnitus, for example, by continuously providing the musical work or soundscape as background noise to the patient through headphones or earbuds. As another example, the generated musical work or soundscape can be used to create privacy through sound masking; that is, the generated musical work or soundscape can be played in public or other environments (such as offices or hospitals) to mask ongoing private conversations.
[0193] In some implementations, the described system is used to evaluate, calibrate, test, or modify audio electronics or audio communication systems (such as mobile phones, smart speakers, video conferencing devices or systems) or audio signal transmission systems. As an example, the described system can be used to generate music or soundscapes captured, transmitted, or played by the audio device or audio communication or signal transmission system, and the output of the audio device or audio communication or signal transmission system can then be compared with the generated music or soundscape to evaluate the fidelity of the output. Optionally, the audio device or audio communication or signal transmission system can then be modified, for example, calibrated or trained, to increase fidelity. As another example, the generated soundscape can be combined with speech, and this combination is captured, transmitted, or played by the audio device or audio communication or signal transmission system configured to filter out background soundscapes, such as noise from offices, shopping malls, or other locations. The output of the audio device or audio communication or signal transmission system can then be compared with the input speech and / or soundscape to evaluate the effectiveness of the filtering. Optionally, the audio device or audio communication or signal transmission system can then be modified, for example, calibrated or trained, to increase the effectiveness of filtering (e.g., by measuring the attenuation of unwanted components of the signal, i.e., soundscapes).
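As a simple illustration of measuring the attenuation of the unwanted soundscape component, the sketch below compares the power of the soundscape fed into a device with the power of the soundscape residue left in its output. How the residue is isolated (for example, by subtracting the known clean speech from the output) is an assumption, as are the function and variable names.

```python
import numpy as np

def attenuation_db(reference, residual):
    """Attenuation in decibels of `residual` relative to `reference` (signal power ratio)."""
    p_ref = np.mean(reference ** 2)
    p_res = np.mean(residual ** 2)
    return 10.0 * np.log10(p_ref / p_res)

rng = np.random.default_rng(0)
soundscape_in = rng.standard_normal(16000)   # stand-in for the generated soundscape
soundscape_out = 0.1 * soundscape_in         # stand-in for the residue after filtering
measured = attenuation_db(soundscape_in, soundscape_out)
```

Reducing the amplitude by a factor of ten corresponds to an attenuation of 20 dB.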
[0194] This specification uses the term "configured" in conjunction with system and computer program components. For a system of one or more computers to be configured to perform a specific operation or action, this means that software, firmware, hardware, or a combination thereof have been installed on the system to cause the system to perform those operations or actions in operation. For one or more computer programs configured to perform a specific operation or action, this means that the one or more programs include instructions that, when executed by a data processing device, cause the device to perform that operation or action.
[0195] Embodiments of the subject matter and functional operation described in this specification may be implemented in digital electronic circuit systems, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this specification and their equivalents), or in one or more combinations thereof. Embodiments of the subject matter described in this specification may be implemented as one or more computer programs, for example, one or more modules of computer program instructions encoded on a tangible, non-transitory storage medium for execution by a data processing device or for controlling the operation of a data processing device. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination thereof. Alternatively or additionally, program instructions may be encoded on artificially generated propagated signals (e.g., machine-generated electrical, optical, or electromagnetic signals) generated to encode information for transmission to a suitable receiver device for execution by the data processing device.
[0196] The term "data processing device" refers to data processing hardware and includes all kinds of devices, apparatuses, and machines for processing data, such as programmable processors, computers, or multiple processors or computers. The device may also be or further include special-purpose logic circuit systems, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the device may optionally include code that creates an execution environment for computer programs, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.
[0197] A computer program (which may also be referred to or described as a program, software, software application, app, module, software module, script, or code) can be written in any form of programming language, including compiled or interpreted languages or declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but does not necessarily, correspond to a file in a file system. A program may be stored as a part of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinating files, such as a file storing one or more modules, subroutines, or code portions. A computer program may be deployed to execute on a single computer or on multiple computers located at a site or distributed across multiple sites and interconnected via a data communication network.
[0198] In this specification, the term "database" is used broadly to refer to any collection of data: data that does not need to be structured in any particular way, or does not need to be structured at all, and can be stored on storage devices in one or more locations. Thus, for example, an indexed database may include multiple collections of data, each of which can be organized and accessed differently.
[0199] Similarly, in this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process programmed to perform one or more specific functions. Typically, an engine will be implemented as one or more software modules or components installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines may be installed and run on the same one or more computers.
[0200] The processes and logic flows described in this specification can be executed by one or more programmable computers, which execute one or more computer programs to perform functions by manipulating input data and generating output. The processes and logic flows can also be executed by a dedicated logic circuit system, such as an FPGA or ASIC, or by a combination of a dedicated logic circuit system and one or more programmed computers.
[0201] A computer suitable for executing computer programs can be based on a general-purpose microprocessor, a special-purpose microprocessor, or both, or any other type of central processing unit. Typically, the central processing unit receives instructions and data from read-only memory or random access memory, or both. The basic components of a computer are the central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and memory may be supplemented by or incorporated into a special-purpose logic circuit system. Typically, a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or be operatively coupled to receive data from or transfer data to such one or more mass storage devices, or both. However, a computer does not necessarily need to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
[0202] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and CD-ROMs and DVD-ROMs.
[0203] To provide interaction with the user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, voice, or tactile input. Additionally, the computer can interact with the user by sending documents to and receiving documents from the device used by the user; for example, by sending a webpage to a web browser on the user's device in response to a request received from a web browser. Furthermore, the computer can interact with the user by sending text messages or other forms of messages to a personal device (e.g., a smartphone running a messaging application) and receiving responsive messages from the user in response.
[0204] Data processing devices used to implement machine learning models may also include, for example, dedicated hardware accelerator units for processing the common and computationally intensive parts of machine learning training or production workloads (e.g., inference).
[0205] Machine learning models can be implemented and deployed using a machine learning framework, such as TensorFlow or JAX.
[0206] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes backend components (e.g., as a data server), or middleware components (e.g., an application server), or frontend components (e.g., a client computer having a graphical user interface, web browser, or app that a user can interact with through an implementation of the subject matter described in this specification), or any combination of one or more such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (LANs) and wide area networks (WANs), such as the Internet.
[0207] A computing system may include clients and servers. Clients and servers are typically geographically separated and interact via a communication network. The client-server relationship is established by computer programs running on respective computers and having a client-server relationship with each other. In some embodiments, the server transmits data (e.g., HTML pages) to a user device, for example, for the purpose of displaying data to a user interacting with the device acting as a client and receiving user input from that user. Data generated at the user device, such as the result of user interaction, may be received at the server from the device.
[0208] While this specification contains numerous details of specific implementations, these details should not be construed as limiting the scope of any invention or the scope that may be claimed, but rather as descriptions of features that may be characteristic of particular embodiments of a particular invention. Certain features described in this specification in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as operating in certain combinations and even initially claimed in this way, in some cases one or more features from the claimed combination may be removed from the combination, and the claimed combination may involve sub-combinations or variations thereof.
[0209] Similarly, although operations are depicted in the accompanying drawings and described in a specific order in the claims, this should not be construed as requiring such operations to be performed in the specific order shown or in sequential order, or requiring all shown operations to be performed to achieve the desired result. In some contexts, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0210] Specific embodiments of this subject matter have been described. Other embodiments are within the scope of the appended claims. For example, the actions recited in the claims can be performed in a different order and still achieve the desired result. As an example, the processes depicted in the figures do not necessarily require the specific order or sequential order shown to achieve the desired result. In some cases, multitasking and parallel processing can be advantageous.
Claims
1. A method executed by one or more computers, the method comprising: obtaining an audio input representing a musical work or soundscape having a target resolution; generating, from the audio input and using a base diffusion neural network, a first spectrogram having a first resolution lower than the target resolution; and generating, from the first spectrogram and using a first upsampling diffusion neural network, a second spectrogram having the target resolution.
2. The method of claim 1, further comprising: generating an audio waveform from the second spectrogram; and providing the audio waveform for playback.
3. The method of claim 1, wherein the second spectrogram spans a first time window, the method further comprising: iteratively expanding the second spectrogram having the target resolution, using one or more additional diffusion neural networks, to generate a final spectrogram spanning a second, longer time window.
4. The method of claim 3, further comprising: generating an audio waveform from the final spectrogram; and providing the audio waveform for playback.
5. The method of claim 3 or claim 4, wherein using one or more additional diffusion neural networks to iteratively expand the second spectrogram having the target resolution to generate a final spectrogram spanning a second, longer time window comprises: generating, from the audio input and the second spectrogram, an initial expanded spectrogram that spans the second, longer time window but has a lower resolution than the target resolution; and iteratively expanding the second spectrogram using the initial expanded spectrogram.
6. The method of claim 5, wherein generating, from the audio input and the second spectrogram, the initial expanded spectrogram that spans the second, longer time window but has a lower resolution than the target resolution comprises: generating the initial expanded spectrogram using an initial low-resolution extended diffusion neural network.
7. The method of claim 5 or claim 6, wherein iteratively expanding the second spectrogram using the initial expanded spectrogram comprises, at each of a plurality of iterations: generating a new high-resolution spectrogram conditioned on (i) the low-resolution spectrogram that has been used to generate the most recently generated high-resolution spectrogram and (ii) a corresponding portion of the initial expanded spectrogram.
8. The method of claim 7, wherein at the first iteration, the low-resolution spectrogram used to generate the most recently generated high-resolution spectrogram is the second spectrogram.
9. The method of claim 7 or claim 8, wherein at each subsequent iteration after the first iteration, the most recently generated high-resolution spectrogram is the new high-resolution spectrogram generated at the previous iteration.
10. The method of any one of claims 7 to 9, wherein the corresponding portion of the initial expanded spectrogram includes a portion of the initial expanded spectrogram that spans the same time window as the new high-resolution spectrogram.
11. The method of any one of claims 7 to 10, wherein generating a new high-resolution spectrogram conditioned on (i) the low-resolution spectrogram that has been used to generate the most recently generated high-resolution spectrogram and (ii) a corresponding portion of the initial expanded spectrogram comprises: generating the new high-resolution spectrogram using a low-resolution extended diffusion neural network and a second upsampling diffusion neural network.
12. The method of claim 11, wherein the low-resolution extended diffusion neural network is the base diffusion neural network, and the second upsampling diffusion neural network is the first upsampling diffusion neural network.
13. The method of any preceding claim, wherein obtaining an audio input representing a musical work or soundscape having a target resolution comprises: receiving a user input representing the musical work or soundscape; and processing the user input using a language model neural network to generate the audio input.
14. The method of claim 13, wherein the audio input includes one or more of a positive prompt, a negative prompt, or lyrics of the musical work.
15. The method of any one of claims 1 to 14, wherein generating a second spectrogram having the target resolution from the first spectrogram using a first upsampling diffusion neural network comprises: upsampling the first spectrogram to generate an upsampled spectrogram having the target resolution; and conditioning the first upsampling diffusion neural network on the upsampled spectrogram.
16. The method of any preceding claim, wherein generating a second spectrogram having the target resolution from the first spectrogram using a first upsampling diffusion neural network comprises: conditioning the first upsampling diffusion neural network on the audio input.
17. The method of any of the preceding claims, further comprising: processing an image input representing the musical work or soundscape using an image generation neural network to generate an image representing the musical work or soundscape.
18. A method executed by one or more computers, the method comprising: obtaining an initial spectrogram spanning a first time window; and iteratively expanding the initial spectrogram using one or more diffusion neural networks to generate a final spectrogram spanning a longer time window.
19. The method of claim 18, further comprising: generating an audio waveform from the final spectrogram; and providing the audio waveform for playback.
20. The method of claim 18 or claim 19, wherein the initial spectrogram has been generated from an audio input and has a target resolution, the final spectrogram has the target resolution, and using one or more diffusion neural networks to iteratively expand the initial spectrogram having the target resolution to generate a final spectrogram spanning a longer time window comprises: generating, from the audio input and the initial spectrogram, an initial expanded spectrogram that spans the longer time window but has a lower resolution than the target resolution; and iteratively expanding the initial spectrogram using the initial expanded spectrogram.
21. The method of claim 20, wherein generating, from the audio input and the initial spectrogram, the initial expanded spectrogram that spans the longer time window but has a lower resolution than the target resolution comprises: generating the initial expanded spectrogram using an initial low-resolution extended diffusion neural network.
22. The method of claim 19 or claim 20, wherein iteratively expanding the initial spectrogram using the initial expanded spectrogram comprises, at each of a plurality of iterations: generating a new high-resolution spectrogram conditioned on (i) the low-resolution spectrogram that has been used to generate the most recently generated high-resolution spectrogram and (ii) a corresponding portion of the initial expanded spectrogram.
23. The method of claim 22, wherein at the first iteration, the low-resolution spectrogram that has been used to generate the most recently generated high-resolution spectrogram is the initial spectrogram.
24. The method of claim 22 or claim 23, wherein at each subsequent iteration after the first iteration, the most recently generated high-resolution spectrogram is the new high-resolution spectrogram generated at the previous iteration.
25. The method of any one of claims 22 to 24, wherein the corresponding portion of the initial expanded spectrogram includes a portion of the initial expanded spectrogram that spans the same time window as the new high-resolution spectrogram.
26. The method of any one of claims 22 to 25, wherein generating a new high-resolution spectrogram conditioned on (i) the low-resolution spectrogram that has been used to generate the most recently generated high-resolution spectrogram and (ii) a corresponding portion of the initial expanded spectrogram comprises: generating the new high-resolution spectrogram using a low-resolution extended diffusion neural network and an upsampling diffusion neural network.
27. A method executed by one or more computers, the method comprising: processing a user input using a language model neural network to generate a structured audio input for an audio generation neural network; and processing the structured audio input using the audio generation neural network to generate an output that defines a musical work or soundscape.
28. The method of any of the preceding claims when dependent on claim 5, wherein the initial expanded spectrogram has a resolution lower than both the first resolution and the target resolution.
29. A system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1 to 28.
30. One or more computer-readable storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1 to 28.
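As a non-limiting illustration, the data flow recited in claims 1, 3, 5, 7 and 11 can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the claimed implementation: every function name and parameter below is hypothetical, each "network" is replaced by simple deterministic arithmetic rather than a trained diffusion neural network, and "resolution" is modelled as the number of frequency bins per frame purely for illustration.

```python
import random

Spectrogram = list[list[float]]  # time frames x frequency bins (assumed layout)


def base_diffusion(audio_input: str, frames: int, bins: int) -> Spectrogram:
    """Hypothetical stand-in for the base diffusion network (first spectrogram)."""
    rng = random.Random(audio_input)  # the conditioning text only seeds the toy output
    return [[rng.random() for _ in range(bins)] for _ in range(frames)]


def upsample_diffusion(low: Spectrogram, factor: int) -> Spectrogram:
    """Hypothetical stand-in for an upsampling diffusion network.

    Here it only repeats every bin `factor` times to reach the target resolution.
    """
    return [[v for v in frame for _ in range(factor)] for frame in low]


def plan_extension(audio_input: str, second: Spectrogram,
                   total_frames: int, plan_bins: int) -> Spectrogram:
    """Hypothetical stand-in for the initial low-resolution extension network.

    Returns a coarse spectrogram spanning the whole longer time window,
    conditioned (trivially) on the audio input and the second spectrogram.
    """
    rng = random.Random(audio_input + "/plan")
    mean = sum(sum(f) for f in second) / (len(second) * len(second[0]))
    return [[0.5 * mean + 0.5 * rng.random() for _ in range(plan_bins)]
            for _ in range(total_frames)]


def extend_low_res(prev_low: Spectrogram, plan_part: Spectrogram,
                   bins: int) -> Spectrogram:
    """Hypothetical stand-in for the low-resolution extension network.

    Produces the next low-resolution chunk from (i) the previous low-resolution
    chunk and (ii) the matching portion of the coarse plan.
    """
    last = prev_low[-1]
    chunk = []
    for plan_frame in plan_part:
        # Stretch the coarse plan frame to `bins` values, then blend with history.
        coarse = [plan_frame[b * len(plan_frame) // bins] for b in range(bins)]
        last = [0.5 * p + 0.5 * c for p, c in zip(last, coarse)]
        chunk.append(last)
    return chunk


def generate(audio_input: str, chunk_frames: int = 4, low_bins: int = 8,
             factor: int = 4, num_chunks: int = 3, plan_bins: int = 2) -> Spectrogram:
    first = base_diffusion(audio_input, chunk_frames, low_bins)   # low resolution
    second = upsample_diffusion(first, factor)                    # target resolution
    plan = plan_extension(audio_input, second, chunk_frames * num_chunks, plan_bins)
    final = list(second)
    prev_low = first
    for i in range(1, num_chunks):  # one iteration per additional time window
        part = plan[i * chunk_frames:(i + 1) * chunk_frames]
        prev_low = extend_low_res(prev_low, part, low_bins)
        final.extend(upsample_diffusion(prev_low, factor))
    return final


final_spectrogram = generate("calm piano with rain")
```

With the default arguments, the final spectrogram spans three windows of four frames each (12 frames) at 32 bins per frame, while the coarse plan uses only 2 bins per frame, mirroring the arrangement in which the initial expanded spectrogram has a lower resolution than both the first resolution and the target resolution. Converting the result to an audio waveform is outside the scope of this sketch.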