Removing loudspeaker acoustic leak in captured audio data
A DNN-based method processes captured audio to reduce loudspeaker bleed, improving audio quality by minimizing loudspeaker audibility in recorded performances.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- RT SIXTY LTD
- Filing Date
- 2025-11-04
- Publication Date
- 2026-06-18
AI Technical Summary
Bleed audio, such as background audio or click tracks played through loudspeakers, is often captured during audio recordings, degrading the quality of the recorded audio, especially when loud.
A method and device use a deep neural network (DNN) to process captured audio data, reducing the audibility of the reproduced reference audio and generating processed audio data with reduced bleed audio components. The processing may run on a smartphone or tablet in near real time.
Effectively reduces the audibility of loudspeaker bleed audio in recorded performances, enhancing audio quality by minimizing the perception of loudspeaker reproductions while preserving the performance audio.
Abstract
Description
[0001] PROCESSING AUDIO DATA
[0002] Field
[0003] The present disclosure relates to processing audio data. Embodiments have been developed to allow real time music extraction (or near real time music extraction), and several examples are described with reference to that application. Other embodiments, examples, and applications are also described.
[0004] When recording an audio performance, so-called bleed audio may be captured along with the performance. For example, background audio or a click track being played through headphones or a loudspeaker may be picked up by a microphone that is recording the performance. Bleed audio may affect the quality of the audio that is recorded. The problem may be exacerbated as bleed audio gets louder.
[0005] Aspects and embodiments are set out in the accompanying claims.
[0006] Brief Description of the Drawings
[0007] Various embodiments will now be described, by way of example only, with reference to the accompanying drawings in which:
[0008] Figure 1 is a flowchart of an example method of processing audio data;
[0009] Figure 2 is a schematic of an example device for implementing a method of processing audio data;
[0010] Figure 3 is a schematic view of an example scenario involving the device of Figure 2;
[0011] Figure 4 is a landing screen of an example app that implements a method of processing audio data;
[0012] Figures 5 and 6 are further screens of the app of Figure 4;
[0013] Figure 7 is a flowchart of an example of the processing step of a method of processing audio data;
Figure 8 is a flowchart of an example of a time domain to frequency domain transform;
[0014] Figure 9 is a schematic view of an example U-Net deep neural network (DNN) architecture for use in implementing a method of processing audio data;
[0015] Figure 10 is a schematic view of an example DNN arrangement for implementing the processing step of a method of processing audio data;
[0016] Figure 11 is a schematic view of a further example DNN arrangement for implementing the processing step of a method of processing audio data;
[0017] Figure 12 is a schematic view of a further example DNN arrangement for implementing the processing step of a method of processing audio data;
[0018] Figure 13 is a flowchart showing an example of a method of processing audio data, including examples of pre-processing and post-processing;
[0019] Figure 14 is a flowchart showing a method for determining a latency offset that may be used in synchronising captured audio data with reference audio data;
[0020] Figure 15 is a schematic view of a further example DNN arrangement for implementing the processing step of a method of processing audio data;
[0021] Figure 16 is a schematic view of a smartphone for implementing a method of processing audio data;
[0022] Figure 17 is a schematic view of a system for implementing a method of processing audio data;
[0023] Figure 18 is a flowchart of a further method of processing audio data; and
[0024] Figure 19 is a flowchart of a further method of processing audio data.
[0025] Detailed Description
[0026] In general terms, examples herein describe methods and devices for processing audio data. Such processing may include obtaining reference audio data that represents reference audio and obtaining captured audio data. The captured audio may have been captured via a microphone and may represent a combination of a reproduction of the reference audio by a speaker driver (or loudspeaker) and an audible performance accompanying the reproduction. The captured audio data may be processed using the reference audio data and an audibility reduction technique to generate processed audio data. The processed audio data may represent the performance. Any representation of the reproduction in the processed audio data may have a lower audibility in the processed audio data than in the captured audio data.
[0027] The term “speaker driver” may refer to an electroacoustic transducer that converts an electrical audio signal into a corresponding sound. By way of non-limiting examples, speaker drivers may include electrostatic drivers, piezoelectric drivers, planar magnetic drivers, Heil air motion drivers, and ionic drivers.
[0028] The term “loudspeaker” may refer to one or more speaker drivers, optionally including electrical and / or acoustic filtering components (including, for example, electrical and / or acoustic crossovers, and tuned ports), configured to convert an electrical audio signal into a corresponding sound.
[0029] A loudspeaker may be a free-air loudspeaker, intended to reproduce sound and project it through the air to a listener over some distance. Free-air loudspeakers may be designed for listening from a distance of more than about 10 centimetres (approximately 4 inches), for example. Examples of free-air loudspeakers include smart speakers and monitors.
[0030] A loudspeaker may be a personal-listening loudspeaker, intended to reproduce sound and project it directly into a listener’s ear. Personal-listening loudspeakers may be designed for listening from a distance of less than about 10 centimetres (approximately 4 inches), for example. Examples of personal-listening loudspeakers include headphones and earbuds.
[0031] In examples, a loudspeaker may be operable as both a free-air loudspeaker and a personal-listening loudspeaker. For example, in a first mode, a smartphone loudspeaker may operate as a free-air loudspeaker when a user holds the smartphone many centimetres or inches from their face (e.g., when watching a video). In a second mode, the smartphone loudspeaker may operate as a personal-listening loudspeaker when the user holds the smartphone close to the user’s ear (e.g., during a private phone call). In examples, the first and second modes may be distinguished by differences in sound pressure level (SPL) produced at a maximum volume setting. In examples, the first and second modes may be further distinguished by different equalisations applied to the audio reproduced by the loudspeaker, to account for differences in sound transmission of different frequencies over different distances.
Figure 1 shows an example of a method 100 for processing audio data. Figure 2 shows a diagram of a device 200. The device 200 takes the form of a smartphone. The device 200 includes a processor 202 and a memory 204 operatively coupled with the processor 202. The memory 204 stores instructions 206 and data 208, as described in more detail below. When executed by the processor 202, the instructions 206 cause the device 200 to implement the method 100 using the data 208, as described in more detail below.
[0032] In this example, the data 208 includes reference audio data 210 and captured audio data 212. The reference audio data 210 and the captured audio data 212 are described in more detail below.
[0033] The device 200 includes audio hardware 214 operatively coupled to the processor 202. The audio hardware 214 includes a digital-to-analogue converter (DAC) 216. The DAC 216 is configured to receive, from the processor 202, digital audio data 218 in a digital format. The digital audio data 218 may be provided by the processor 202 to the DAC 216 in any suitable digital format. For example, the digital audio data 218 may be provided as a stream of 16-bit pulse code modulated (PCM) samples at 48 kHz. Different resolutions, modulation types, sample rates, and / or encoding schemes may be used in other examples.
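By way of non-limiting illustration, the following sketch shows what a stream of 16-bit PCM samples at 48 kHz might look like in software. The function names `float_to_pcm16` and `pcm16_to_float`, the 440 Hz test tone, and the scaling by 32767 are assumptions made for this illustration and are not taken from the disclosure.

```python
import numpy as np

SAMPLE_RATE_HZ = 48_000  # example sample rate mentioned in the text


def float_to_pcm16(signal):
    """Quantise a float waveform in [-1.0, 1.0] to 16-bit PCM samples."""
    clipped = np.clip(np.asarray(signal, dtype=np.float64), -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)


def pcm16_to_float(pcm):
    """Map 16-bit PCM samples back to floats in [-1.0, 1.0]."""
    return np.asarray(pcm, dtype=np.float64) / 32767.0


# 10 ms of a hypothetical 440 Hz tone at half of full scale.
t = np.arange(SAMPLE_RATE_HZ // 100) / SAMPLE_RATE_HZ
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
pcm = float_to_pcm16(tone)
```

In this sketch, 10 ms of audio corresponds to 480 samples, and the quantisation error per sample is at most half of one PCM step.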
[0034] The processor 202 may be configured to generate the digital audio data 218 based on the reference audio data 210. The processor 202 may generate the digital audio data 218 by processing the reference audio data 210.
[0035] Processing the reference audio data 210 may include, for example, filtering the reference audio data 210 in any suitable manner. For example, the processor 202 may apply multi-band and / or parametric equalisation to the reference audio data 210 as part of generating the digital audio data 218. Such equalisation may at least partly account for an inherent frequency response of the device 200 and its hardware components. Such equalisation may also take into account user preferences, such as a user-selected equalisation preset.
[0036] Processing the reference audio data 210 may include, for example, transcoding the reference audio data 210 into a digital format that is compatible with the DAC 216.
[0037] Other forms of processing may be applied to the reference audio data 210 to generate the digital audio data 218. The DAC 216 converts the digital audio data 218 into an analogue audio signal 220 in a conventional manner.
[0038] The audio hardware 214 includes a speaker driver in the form of a loudspeaker 222. The loudspeaker 222 is coupled to the output of the DAC 216 to receive the analogue audio signal 220. Although omitted for clarity in Figure 2, an amplifier is conventionally provided between the output of the DAC 216 and the loudspeaker 222 to amplify the analogue audio signal 220 output by the DAC 216. The loudspeaker 222 outputs a reproduction of the reference audio represented by the reference audio data 210.
[0039] The audio hardware 214 includes a microphone 224. The microphone 224 converts sound waves into an analogue audio signal 226.
[0040] The audio hardware 214 includes an analogue-to-digital converter (ADC) 228. The ADC 228 is coupled to the microphone 224 to receive the analogue audio signal 226. Although omitted for clarity in Figure 2, a microphone pre-amplifier is conventionally provided between the output of the microphone 224 and the ADC 228 to amplify the analogue audio signal 226 output by the microphone 224. The ADC 228 converts the analogue audio signal 226 into a digital representation 254 of the analogue audio signal 226.
[0041] As described in more detail below, the microphone 224 may capture audio representing a combination of the reproduction of the reference audio by a speaker driver, such as the loudspeaker 222, and an audible performance accompanying the reproduction. The performance may be a human performance, involving singing or the playing of musical instrument(s), for example. The performance may also include playback of other audio, including via an audio reproduction system, a synthesiser, a sampler, or any other means of producing audio, whether under the control of a human or a computer, or stochastically generated or controlled.
[0042] The device 200 includes a display 230 for displaying information to a user. In examples, the instructions 206 may include routines for displaying, to a user, information about operation of the method 100. The display 230 includes a touch-responsive overlay (not shown) configured to allow a user to provide user input, such as selecting options, adjusting settings, and inputting information relevant to the method and related routines.
[0043] In other examples, the instructions 206 may include routines for outputting information regarding operation of the method 100 in other formats, such as by outputting instructions and other information by way of the loudspeaker 222. In other examples, the instructions 206 may include routines for accepting input, such as user input, regarding operation of the method 100 in other formats, such as by accepting verbal or other instructions by way of the microphone 224.
[0044] Figure 3 shows an example scenario of the device 200 in operation. For clarity, only the microphone 224 and loudspeaker 222 of the device 200 are indicated by reference numerals in Figure 3. The device 200 is positioned in front of a user 232.
[0045] The user 232 is recording a track as part of a multi-track recording. The track is a vocal performance by the user 232. The vocal performance will accompany reference audio. The reference audio may include, for example, a musical and / or percussive backing track, a metronome, or a click track (i.e., a series of audio cues used in a manner similar to a metronome). The reference audio may take any other suitable form. In the example shown in Figure 2, the reference audio is stored as reference audio data 210 in the memory 204 of the device 200.
[0046] In the example of Figure 3, the instructions 206, when initially executed, cause the device 200 to open an app. Figure 4 shows an example of a landing screen 234 of the app. The landing screen 234 includes a button 236. In other examples, the button 236 may be provided on a screen other than an app landing screen. For example, the button 236 may be presented on a page that is accessible via a menu or other user interface element.
[0047] Pressing the button 236 opens a list of available reference audio tracks 238 that may be used as accompaniment, as shown in Figure 5. In examples, one or more of the reference audio tracks 238 may be stored locally in the memory. In examples, one or more of the reference audio tracks 238 may be stored remotely, such as on a streaming server, a file server, or other storage accessible by way of a wired or wireless connection, such as a local area network (including a wired and / or wireless local area network (LAN)), a mobile telecommunications network, and / or the Internet.
[0048] Selecting one of the available reference audio tracks opens a recording session page. In the illustrated example, the user 232 selects “Track 3” from the list 238, causing a recording session page 240 to open, as shown in Figure 6. The recording session page 240 includes a first waveform window 242, a second waveform window 244, a third waveform window 268, and a “start” button 270. The first waveform window 242 is initially blank, but displays a waveform corresponding to the reference audio during playback, as described below. The second waveform window 244 is also initially blank, but displays a waveform corresponding to the audio captured by the microphone 224 during recording, as described below. The third waveform window 268 is also initially blank, but displays a waveform corresponding to processed audio as the processed audio is generated, as described below.
[0049] When ready to record, the user 232 presses the “start” button 270. In response, the processor 202 retrieves, from the memory 204, the reference audio data 210 corresponding to “Track 3” as previously selected by the user 232. The processor 202 streams the “Track 3” reference audio data 210 to the DAC 216. The DAC 216 converts the reference audio data 210 into the analogue audio signal 220 and outputs the analogue audio signal 220 to the loudspeaker 222. The loudspeaker 222 reproduces the reference audio, as indicated by arrows 246 in Figure 3.
[0050] Optionally, a click track, countdown, or other introductory segment may be played before the track is played. This may reduce the perception of any delay associated with retrieving and commencing replay of the requested track, which may be particularly useful if the reference audio data 210 corresponding to the track is stored remotely from the device 200.
[0051] The user 232 can hear the reproduction of the reference audio from the loudspeaker 222. In the example of Figure 6, the reference audio is a musical backing track. The user 232 sings along with the backing track, producing sound as indicated by arrow 248. The user’s singing is an example of an audible performance accompanying the reproduction.
[0052] While the loudspeaker 222 reproduces the reference audio, the microphone 224 receives audio from the environment around the device 200, as indicated by arrow 250 in Figure 3. The audio received by the microphone 224 includes the sound 248 of the user singing and the reference audio 246 reproduced by the loudspeaker 222.
[0053] The microphone 224 converts the received audio into the analogue audio signal 226 representing captured audio, and outputs the analogue audio signal 226 to the ADC 228. The ADC 228 converts the analogue audio signal 226 into the digital representation 254 of the analogue audio signal 226. The digital representation 254 represents captured audio data. The captured audio data represents a combination of the reproduction 246 of the reference audio by the loudspeaker 222 and the audible performance (i.e., the sound 248 of the user 232 singing) accompanying the reproduction 246. The processor 202 stores the digital representation 254 in the memory 204.
[0054] While the reference audio is being played, the first waveform window 242 displays a waveform 256 corresponding to a time domain representation of the reference audio. The horizontal axis of the first waveform window 242 represents time. The vertical axis of the first waveform window 242 represents amplitude. As the recording proceeds, the waveform 256 moves to the left of the first waveform window 242.
[0055] Similarly, while the reference audio is being played, the second waveform window 244 displays a waveform 262 corresponding to a time domain representation of the captured audio. The horizontal axis of the second waveform window 244 represents time. The vertical axis of the second waveform window 244 represents amplitude. As the recording proceeds, the waveform 262 moves to the left of the second waveform window 244.
[0056] Similarly, while the reference audio is being played, the third waveform window 268 displays a waveform 272 corresponding to a time domain representation of the processed audio (the processed audio is described in more detail below). The horizontal axis of the third waveform window 268 represents time. The vertical axis of the third waveform window 268 represents amplitude. As the recording proceeds, the waveform 272 moves to the left of the third waveform window 268.
[0057] Referring to Figure 1, the method 100 includes obtaining 102 the reference audio data 210 representing the reference audio. Obtaining 102 the reference audio data 210 may include, for example, retrieving the reference audio data 210 from the memory 204, as described above.
[0058] The method 100 includes obtaining 104 the captured audio data 212. As described above, the captured audio was captured via a microphone and represents a combination of: the reproduction 246 of the reference audio by the loudspeaker 222; and the audible performance (i.e., the user’s singing 248) accompanying the reproduction 246.
[0059] Obtaining 102 the reference audio data 210 and obtaining 104 the captured audio data 212 may be performed in either order, or may be performed simultaneously. Alternatively, portions of each of the reference audio data 210 and the captured audio data 212 may be retrieved simultaneously or alternately from the memory 204. The portions may be of any suitable size. For example, each portion may include one or more frames of the reference audio data 210 and the captured audio data 212.
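By way of non-limiting illustration, retrieving the two data streams in frame-sized portions might be sketched as follows. The frame length of 1024 samples, the zero-padding of the final frame, and the names `split_into_frames` and `paired_portions` are assumptions made for this illustration only.

```python
import numpy as np

FRAME_LEN = 1024  # hypothetical frame length in samples


def split_into_frames(samples, frame_len=FRAME_LEN):
    """Split a 1-D sample array into consecutive, non-overlapping frames.

    The final frame is zero-padded so that every frame has the same length.
    """
    samples = np.asarray(samples)
    n_frames = -(-len(samples) // frame_len)  # ceiling division
    padded = np.zeros(n_frames * frame_len, dtype=samples.dtype)
    padded[: len(samples)] = samples
    return padded.reshape(n_frames, frame_len)


def paired_portions(reference, captured, frame_len=FRAME_LEN):
    """Yield (n, reference_frame_n, captured_frame_n), one frame pair at a time."""
    ref_frames = split_into_frames(reference, frame_len)
    cap_frames = split_into_frames(captured, frame_len)
    for n in range(min(len(ref_frames), len(cap_frames))):
        yield n, ref_frames[n], cap_frames[n]


# Synthetic stand-ins for the reference and captured audio data.
reference = np.arange(2500, dtype=np.int16)
captured = reference[::-1].copy()
portions = list(paired_portions(reference, captured))
```

With 2500 input samples, this sketch yields three frame pairs, the last of which is padded with zeros after its first 452 samples.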
[0060] The method 100 includes processing 106 the captured audio data 212 using the reference audio data 210 and an audibility reduction technique to generate processed audio data 108. The processed audio data 108 represents the performance. In this context, “represents” means “includes at least some audible component of”. That is, a typical human listener will be able to hear at least audible components of the performance when listening to the processed audio data 108 at ordinary listening levels.
[0061] Any representation of the reproduction in the processed audio data 108 has a lower audibility in the processed audio data 108 than in the captured audio data 212. In examples, the processed audio data 108 contains little or no audible component of the reproduction. That is, a typical human listener will be unable to hear any audible trace of the reproduction when listening to the processed audio data 108 at ordinary listening levels. In other examples, the processed audio data 108 contains an audible component of the reproduction, but at a substantially lower perceived volume level relative to its perceived volume level in the captured audio data 212.
[0062] In other examples, the typical human listener will be able to hear a version of the reproduction of the reference audio data 210 in a reproduction of the processed audio data 108. However, the reproduction of the reference audio data 210 will be less audible than in a reproduction of the captured audio data 212 prior to the processing. In examples, the reproduction of the reference audio data 210 may be less audible because its volume relative to the performance is lower in the processed audio data 108 than in the captured audio data 212. In other examples, the audibility of the reproduction of the reference audio data 210 may be reduced as a result of reducing amplitude or power at frequencies to which the human ear is more sensitive.
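By way of non-limiting illustration, a change in relative volume might be quantified as a level ratio in decibels. The sketch below assumes, purely for illustration, that the bleed and performance components are available separately (as they would be in a synthetic test, but not in a real recording) and uses a root-mean-square level as a crude proxy for audibility. The function names and the signals are hypothetical.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def bleed_to_performance_db(bleed, performance):
    """Level of the bleed relative to the performance, in decibels."""
    return 20.0 * np.log10(rms(bleed) / rms(performance))


# Synthetic example: 0.1 s at 48 kHz.
t = np.arange(4800) / 48_000
performance = np.sin(2 * np.pi * 220.0 * t)        # stand-in for the singing
bleed_captured = 0.5 * np.sin(2 * np.pi * 1000.0 * t)  # bleed before processing
bleed_processed = 0.1 * bleed_captured              # bleed after a hypothetical tenfold attenuation

before_db = bleed_to_performance_db(bleed_captured, performance)
after_db = bleed_to_performance_db(bleed_processed, performance)
reduction_db = before_db - after_db
```

In this synthetic case the tenfold amplitude attenuation corresponds to a 20 dB reduction in the relative level of the bleed. A level ratio of this kind does not capture frequency-dependent sensitivity of the human ear, which the paragraph above identifies as a separate route to reduced audibility.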
[0063] Figure 7 shows one example of the processing 106. The reference audio data 210 and the captured audio data 212 are provided as inputs to a deep neural network, DNN, arrangement 278. The DNN arrangement 278 comprises one or more DNNs 280. The processed audio data 108 is based on an output of the DNN arrangement. A DNN may be a neural network having at least one hidden layer. The DNN arrangement 278 may include at least one convolutional DNN.
[0064] The number, type, and configuration of DNNs 280 within the DNN arrangement 278 may be selected to suit the hardware capabilities of the device 200 that hosts the DNN arrangement 278. For example, when the device 200 is a smartphone, such a DNN arrangement may be capable of being run on a modern smartphone or tablet computing device, such that the processing can be done in near real-time.
[0065] Where the device 200 has more capable hardware than a modern smartphone, then a more complex DNN arrangement 278 may be employed. A more complex DNN arrangement 278 may offer improved performance and / or quality, typically at the cost of requiring more expensive and / or less efficient hardware. Similarly, where the device 200 has less capable hardware than a modern smartphone, then a less complex DNN arrangement 278 may be employed. A less complex DNN arrangement 278 may allow the use of cheaper and / or more efficient hardware, typically at the expense of reduced performance and / or quality.
[0066] The format in which the reference audio data 210 and the captured audio data 212 are provided to the DNN arrangement 278 will typically match the format of the audio data upon which the DNN arrangement 278 was trained. For example, if the DNN arrangement 278 was trained on frames of frequency domain data, then the inputs during operation should also be frames of frequency domain data. Alternatively, if the DNN arrangement 278 was trained on frames of time domain data, then the inputs during operation should also be frames of time domain data.
[0067] In the example of Figure 7, the reference audio data 210 comprises a first sequence of audio frames 282. The captured audio data 212 comprises a second sequence of audio frames 284. When the DNN arrangement 278 has been trained on frames of samples corresponding to time domain representations of the corresponding audio data, then each audio frame 282 / 284 contains time domain data. When the DNN has been trained on frequency domain representations of the corresponding audio data, then each audio frame 282 / 284 contains frequency domain data.
[0068] Figure 8 shows an example of a method 300 by which time domain audio data may be converted to frequency domain audio data. The method 300 may be used when, for example, the reference audio data 210 and / or the captured audio data 212 are in a time domain format. The audio data may be in, for example, a time domain format such as 16-bit pulse code modulated (PCM) data at a 48 kHz sampling rate.
[0069] The time domain audio data may be provided as a sequence of frames. Figure 8 shows one such frame 286 of audio data. The frame 286 may represent, for example, one of the audio data frames 282 and 284 of Figure 7. The frame 286 includes a sequence of 1024 PCM values 288. The PCM values 288 represent a sampled waveform of the audio represented by the frames 282 / 284.
[0070] The method 300 includes Fourier transforming 302 the frame 286 to produce a frequency domain representation of the audio data in the frame 286. The Fourier transform is a short-time Fourier transform. The frequency domain representation comprises a three-dimensional array of values output by the Fourier transform. The dimensions include time, frequency, and amplitude. The array therefore represents the changing amplitudes of the frequencies within an audio signal over time.
[0071] In the example of Figure 8, the frequency domain representation of the audio data in the frame 286 is shown as a spectrogram 290. Spectrograms visually represent the three dimensions of time, frequency, and amplitude. In the spectrogram 290, the horizontal axis 292 represents time and the vertical axis 294 represents frequency. The amplitude of the audio signal at each frequency / time point is visually represented by the brightness at that point. The spectrogram 290 is shown for illustrative purposes. The method 300 may generate the frequency domain representation as an array of values, without generating the spectrogram 290.
[0072] The frequency domain representation output by the Fourier transform (as visually represented by the spectrogram 290) may be considered a frequency domain frame corresponding to the time domain frame 286.
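By way of non-limiting illustration, a short-time Fourier transform of a single 1024-sample frame might be sketched as follows. The window length of 256 samples, the hop of 128 samples, the Hann window, and the function name `stft_frame` are assumptions made for this illustration; the disclosure does not specify these parameters.

```python
import numpy as np

SAMPLE_RATE_HZ = 48_000


def stft_frame(pcm_frame, window_len=256, hop=128):
    """Short-time Fourier transform of one frame of PCM samples.

    Returns a complex array of shape (n_windows, window_len // 2 + 1).
    Rows index time, columns index frequency, and the absolute value of
    each entry is the amplitude at that time / frequency point.
    """
    x = np.asarray(pcm_frame, dtype=np.float64)
    window = np.hanning(window_len)
    n_windows = 1 + (len(x) - window_len) // hop
    segments = np.stack(
        [x[i * hop : i * hop + window_len] * window for i in range(n_windows)]
    )
    return np.fft.rfft(segments, axis=1)


# A 1024-sample frame holding a hypothetical 3 kHz tone as 16-bit PCM values.
n = np.arange(1024)
frame = np.round(10_000 * np.sin(2 * np.pi * 3_000 * n / SAMPLE_RATE_HZ)).astype(np.int16)

spec = stft_frame(frame)
magnitude = np.abs(spec)  # the values a spectrogram would show as brightness
peak_bin = int(np.argmax(magnitude[0]))
peak_hz = peak_bin * SAMPLE_RATE_HZ / 256
```

With these assumed parameters, the frame yields seven time slices of 129 frequency bins each, and the strongest bin of the first slice falls at 3 kHz. The magnitude array is what a spectrogram such as the spectrogram 290 would render visually.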
[0073] Other techniques may be used to convert a time domain representation of audio data into a frequency domain representation. For example, a wavelet transform may be used.
[0074] Returning to Figure 7, the sequence of frames 282 representing the reference audio data 210 and the sequence of frames 284 representing the captured audio data 212 are simultaneously provided as inputs to the DNN arrangement 278. In the example of Figure 7, each of the frames 282 / 284 is a frequency domain representation that has been generated using a short-time Fourier transform, for example in accordance with the method 300 shown in Figure 8.
[0075] The DNN arrangement 278 processes the corresponding reference audio data frame and the captured audio data frame. For example, frame (n) from the sequence of frames 282 representing the reference audio data 210 and frame (n) from the sequence of frames 284 representing the captured audio data 212 are provided as simultaneous inputs to the DNN arrangement.
[0076] The DNN arrangement 278 may take any suitable form. For example, Figure 9 shows a DNN arrangement 278 having a U-Net architecture 400. In general, U-Net architectures have multiple convolutional and deconvolutional layers.
[0077] In examples, the DNN arrangement may employ one or more other forms of convolutional DNN or other artificial neural network (ANN).
[0078] The DNN arrangement 278 may be configured to generate and output the processed audio data 108 in any suitable manner. The processed audio data 108 represents the performance. In the example of Figure 3, for example, the processed audio data 108 represents the user’s singing. The general intention is to extract the user’s singing from the captured audio data 212. This is achieved by reducing the audibility of the reproduction relative to the audibility of the user’s singing. Using the reference audio data 210 as an input allows the DNN arrangement 278 to more effectively reduce the audibility of the reproduction in the captured audio.
[0079] While it may be desirable to render the reproduction completely inaudible while leaving the performance unaffected, this may not always be practical. Depending upon factors such as the design and / or training of the DNN arrangement, the capability of the hardware of the device 200 (including, for example, speed of the processor 202), the complexity of the reference audio data 210 and / or the captured audio data 212, the relative volumes of the reference audio and the performance within the captured audio data 212, room effects such as resonance, echo, and reverberation, and additional background noise not forming part of the reproduction or performance, some audible vestiges of the reproduction may remain in the processed audio data 108. Such factors may also result in some distortion of the performance.
[0080] Any audible vestiges of the reproduction or distortion of the performance may be an acceptable compromise given the advantages, such as those discussed below. The DNN arrangement 278 may generate the processed audio data in any suitable way. In examples, the one or more DNNs 280 of the DNN arrangement 278 may be configured to directly transform the input frames into the processed audio data 108. In examples, the one or more DNNs 280 of the DNN arrangement 278 may be configured to produce (spectral) mask data that may be applied to the captured audio data 212 to reduce the audibility of the reproduction. In examples, the one or more DNNs 280 of the DNN arrangement 278 may be configured to produce offset data that may be added to (or subtracted from) the captured audio data 212 to reduce the audibility of the reproduction. Adding or subtracting the offset data may be considered antiphase application of the offset data to the captured audio data as part of generating the processed data.
[0081] When using mask data and / or offset data, the phase components of the spectrogram frames do not need to be manipulated or considered. Instead, the mask data or offset data may be applied directly to the recording spectrogram magnitudes, i.e. retaining the phase information of the reference audio data 210 and the captured audio data 212 in the processed audio data 108.
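By way of non-limiting illustration, applying mask data to spectrogram magnitudes while leaving the phase untouched might be sketched as follows. The function name `apply_magnitude_mask` is hypothetical, and the random mask is merely a stand-in for the output of a trained DNN arrangement.

```python
import numpy as np


def apply_magnitude_mask(captured_spec, mask):
    """Scale spectrogram magnitudes by a real-valued mask, keeping the phase."""
    magnitude = np.abs(captured_spec)
    phase = np.angle(captured_spec)
    return (magnitude * mask) * np.exp(1j * phase)


rng = np.random.default_rng(0)

# Stand-in complex spectrogram of the captured audio: 7 time slices, 129 bins.
captured_spec = rng.standard_normal((7, 129)) + 1j * rng.standard_normal((7, 129))

# Stand-in for DNN output: one gain between 0 and 1 per time / frequency point.
mask = rng.uniform(0.0, 1.0, size=captured_spec.shape)

processed_spec = apply_magnitude_mask(captured_spec, mask)
```

Because the mask is real and non-negative, only the magnitude of each time / frequency point changes; its phase is carried over from the captured audio unchanged.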
[0082] Figure 10 shows an example in which the DNN arrangement 278 is configured to directly transform the captured audio data 212 into the processed audio. For example, the DNN arrangement 278 may have a U-Net architecture, such as the U-Net architecture 400 shown in Figure 9, configured to operate as a transformer DNN.
[0083] In Figure 10, the reference audio data 210 and the captured audio data 212 are input to the DNN arrangement 278. The DNN arrangement 278 transforms the captured audio data 212 with reference to the reference audio data 210, and outputs the processed audio data 108.
[0084] Figure 11 shows an example in which the DNN arrangement 278 is configured to output mask data. The reference audio data 210 and the captured audio data 212 are input to the DNN arrangement 278. The DNN arrangement 278 generates mask data 500. The mask data 500 and the captured audio data 212 are multiplied together by a multiplier 502, the output of the multiplier 502 being the processed audio data 108. Any representation of the reproduction in the processed audio data 108 has a lower audibility in the processed audio data than in the captured audio data. A similar result may be achieved by configuring the DNN arrangement 278 to output mask data in a format such that the captured audio data 212 is divided by the mask data to generate the processed audio data 108.
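By way of non-limiting illustration, the multiplier 502 and the alternative division form might be sketched as follows. The mask here is computed from known synthetic components purely as a stand-in for the mask data 500 that a trained DNN arrangement would output, and the sketch assumes, as a simplification, that the magnitudes of the performance and the bleed add. All names are hypothetical.

```python
import numpy as np


def apply_mask_multiply(captured_mag, mask):
    """Multiplier form: processed = captured * mask."""
    return captured_mag * mask


def apply_mask_divide(captured_mag, divisor_mask):
    """Division form: processed = captured / divisor_mask."""
    return captured_mag / divisor_mask


# Synthetic magnitudes for a tiny 2 x 2 spectrogram.
performance_mag = np.array([[4.0, 0.5], [2.0, 1.0]])
bleed_mag = np.array([[1.0, 1.5], [2.0, 0.0]])
captured_mag = performance_mag + bleed_mag  # simplifying assumption

# Stand-in for the mask data: fraction of each point attributable to the performance.
mask = performance_mag / captured_mag

processed_mul = apply_mask_multiply(captured_mag, mask)
processed_div = apply_mask_divide(captured_mag, 1.0 / mask)
```

The two forms give the same result when the divisor mask is the reciprocal of the multiplicative mask, which is one way the "similar result" described above could arise.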
[0085] Figure 12 shows an example in which the DNN arrangement 278 is configured to output offset data. The reference audio data 210 and the captured audio data 212 are input to the DNN arrangement 278. The DNN arrangement 278 generates offset data 504. The offset data 504 and the captured audio data 212 are added together by an addition block 506, the output of the addition block 506 being the processed audio data 108, in which any representation of the reproduction in the processed audio data 108 has a lower audibility in the processed audio data than in the captured audio data.
[0086] A similar result may be achieved by configuring the DNN arrangement 278 to output offset data in a format such that the offset data is subtracted from the captured audio data 212 to generate the processed audio data 108.
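By way of illustration only, the additive and subtractive offset formats may be sketched as follows, assuming NumPy and spectrogram magnitudes. The function name `apply_offset` and the clamping of the result at zero are assumptions made for this sketch.

```python
import numpy as np

def apply_offset(captured_mag, offset):
    """Add offset data to captured magnitudes.

    Assumption: the result is clamped at zero so magnitudes stay non-negative.
    """
    return np.maximum(captured_mag + offset, 0.0)

captured_mag = np.array([1.0, 0.5, 0.2])
offset = np.array([-0.4, -0.1, -0.3])  # stand-in for offset data produced by a DNN
processed_mag = apply_offset(captured_mag, offset)

# The subtractive format gives the same result with the sign of the offset flipped.
subtractive_offset = -offset
same_result = np.maximum(captured_mag - subtractive_offset, 0.0)
```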
[0087] In the examples of Figures 10 to 12, the reference audio data 210 and the captured audio data 212 are frequency domain audio data. In other examples, the reference audio data 210 and / or the captured audio data 212 may be time domain audio data. In each case, the DNN arrangement 278 is trained to accept the reference audio data 210 and the captured audio data 212 in the time and / or frequency domain format on which the DNN arrangement 278 has been trained.
[0088] In the examples of Figures 10 to 12, the output of the DNN arrangement 278 (i.e., the processed audio data 108 in Figure 10, the mask data 500 in Figure 11, and the offset data 504 in Figure 12) takes the form of frequency domain data.
[0089] Similarly, the processed audio data 108 in Figures 11 and 12 takes the form of frequency domain data. In other examples, the output of the DNN arrangement 278, and / or the processed audio data 108, may take the form of time domain data.
[0090] When the processed audio data 108 is in the frequency domain, it may be converted back into the time domain, ready to be played back or used as a track in a multi-track recording, for example. Converting the processed audio data 108 from the frequency domain into the time domain may be achieved by using a short-time inverse Fourier transform.
[0091] Optionally, one or more pre-processing steps may be applied to the reference audio data 210 and / or the captured audio data 212, prior to providing them as inputs to the DNN arrangement 278. Such preprocessing steps may be applied to either or both of the reference audio data 210 and / or the captured audio data 212. Similarly, one or more post-processing steps may be applied to the output of the DNN arrangement 278 and / or to the processed audio data 108.
[0092] Figure 13 shows a schematic view of an example of such pre-processing and post-processing steps, in the context of a method of processing audio data, such as method 100.
[0093] At 508, the reference audio data 210 and the captured audio data 212 are synchronised. To improve performance, it is desirable to reduce or minimise any timing differences between the reference audio data 210 and the captured audio data 212. Accurate synchronisation (i.e., time-alignment) of the reference audio data 210 and the captured audio data 212 may significantly improve results. This is also true of the training reference audio data and captured audio data used to train the DNN.
[0094] Because the recording process is not directly coupled to the reproduction process, it cannot be assumed that the original frames of reference audio data 210 and the original (i.e., as assembled during the capture process) frames of the captured audio data 212 are synchronised. As such, synchronisation may involve: time-aligning the samples of the reference audio data 210 and the captured audio data 212; and generating respective sequences of time-aligned frames of the reference audio data 210 and the captured audio data 212 based on the time-aligned samples.
[0095] When generating the respective sequences of frames of the reference audio data 210 and the captured audio data 212, the frames of one or the other of the respective sequences of frames of the reference audio data 210 and the captured audio data 212 may be used as a timing template for generating the other sequence of frames. For example, the original frames of the reference audio data 210 may be used as a timing template for generating the synchronised frames of the captured audio data 212 from the time-aligned samples.
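By way of illustration only, time-aligning the samples and then generating frames using the reference frames as the timing template may be sketched as follows. The function name `synchronised_frames`, the non-negative integer latency offset expressed in samples, and the discarding of any incomplete final frame are assumptions made for this sketch.

```python
import numpy as np

def synchronised_frames(reference, captured, latency_offset, frame_len):
    """Drop the first latency_offset captured samples, then cut both signals
    into frames on the reference frame boundaries."""
    aligned = captured[latency_offset:]
    n_frames = min(len(reference), len(aligned)) // frame_len
    usable = n_frames * frame_len  # assumption: an incomplete final frame is discarded
    ref_frames = reference[:usable].reshape(n_frames, frame_len)
    cap_frames = aligned[:usable].reshape(n_frames, frame_len)
    return ref_frames, cap_frames

reference = np.arange(16, dtype=float)
captured = np.concatenate([np.zeros(3), reference])  # captured lags by 3 samples
ref_frames, cap_frames = synchronised_frames(reference, captured, 3, 4)
```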
[0096] Audio played back on a particular model of smartphone may have a specific total playback and recording latency of, say, several hundred samples. However, the total latency may be higher or lower for a different model of smartphone or other device such as a tablet computing device or a personal computer.
[0097] It is therefore desirable to train the DNN on playback, recording and target data that is time-aligned (i.e., without latency). The measured or otherwise known latency of any implementation device can then be used in the generation of a latency offset that can be used to synchronise the reference audio data 210 and the captured audio data 212. Many audio devices (such as audio interfaces and other ADC / DAC circuitry) report their latency to the host CPU on start-up, allowing the latency offset to be calculated. Latency values of common recording systems (for example, specific models of smartphone and / or tablet computing device) can also be stored with the instructions, for use in determining an appropriate latency offset to use.
[0098] Playback and recording latency involves delays caused by the audio reproduction and capture paths within a device, such as device 200. A first delay is caused by the time taken to send the reference audio data 210 to the DAC 216, convert it to the analogue signal 220 within the DAC 216, amplify the analogue signal, and output it to the loudspeaker 222. A second delay is caused by the time taken for sound to travel through the air from the loudspeaker 222 to the microphone 224. A third delay is caused by the time taken to amplify the analogue audio signal 226 from the microphone 224 and send it to the ADC 228, and convert it to a digital representation of the captured audio data 212.
[0099] The playback and recording latency is at least partly based on the sum of the first, second, and third delays. The reference audio data 210 and the captured audio data 212 may be synchronised based at least in part on the playback and recording latency.
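By way of illustration only, summing the three delays and expressing the total in samples may be sketched as follows. The function name `total_latency_samples`, the approximate speed of sound, and all of the numeric values used (5 ms playback path, 14 cm acoustic path, 3 ms capture path, 48 kHz sample rate) are assumptions chosen for this sketch rather than measured figures.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature

def total_latency_samples(playback_delay_s, distance_m, capture_delay_s, sample_rate_hz):
    """Sum the playback, acoustic, and capture delays and convert to samples."""
    acoustic_delay_s = distance_m / SPEED_OF_SOUND_M_S  # second delay: travel through air
    total_s = playback_delay_s + acoustic_delay_s + capture_delay_s
    return round(total_s * sample_rate_hz)

# Illustrative values only.
latency = total_latency_samples(0.005, 0.14, 0.003, 48_000)  # 404 samples
```

With these illustrative values the acoustic path contributes roughly 20 samples, so the total is dominated by the playback and capture paths.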
[0100] Some devices, such as particular models of smartphones and tablet computing devices, may store values that are indicative of playback and recording latency for that model of device. In that case, executing the instructions 206 may cause the processor 202 to look up those values for use in improving synchronisation of the reference audio data 210 and the captured audio data 212.
[0101] Some audio devices, such as audio interfaces and other ADC / DAC circuitry, report their latency to a host CPU on start-up. In that case, executing the instructions 206 may cause the processor 202 to look up those values for use in improving synchronisation of the reference audio data 210 and the captured audio data 212. Playback and recording latency may be determined in other ways. Figure 14 shows an example method 600 performed by the device 200 under the control of the processor 202, for determining a latency offset that may be used in synchronising the reference audio data 210 and the captured audio data 212. The processor 202 causes a sound to be played 602 via the loudspeaker 222. The sound may be any suitable sound. For example, the sound may be optimised for use in a latency offset determination process. This may include, for example, the use of sounds with hard transients spaced by lower-volume (or silent) periods. Alternatively, the sound may be music. In examples, the sound may be the reference audio.
[0102] If short enough, the sound may be played immediately before playing the reference audio, allowing the latency offset to be determined without the need for a latency determination process separate from the recording process.
[0103] The played sound is recorded 604 via the microphone 224. That is, the sound is captured via the recording path, including the microphone 224, an amplifier if present, and the ADC 228.
[0104] A time delay associated with the playing and the recording of the sound is measured 606. The time delay may be measured in any suitable way.
[0105] The measured time delay is used 608 to determine the latency offset. In some examples, the latency offset is the time delay. In other examples, the device 200 may be configured to perform tests to determine the latency offset based on the measured time delay. For example, the device 200 may be configured to apply several adjustments to the measured time delay and test which gives the best result. The best adjusted version of the measured time delay is then used as the latency offset.
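By way of illustration only, one possible way of measuring the time delay is cross-correlation of the played sound with the recorded sound, as sketched below using NumPy. The use of cross-correlation, the function name `measure_delay`, and the synthetic click signal are assumptions made for this sketch; as noted above, the time delay may be measured in any suitable way.

```python
import numpy as np

def measure_delay(played, recorded):
    """Return the lag (in samples) at which recorded best matches played."""
    corr = np.correlate(recorded, played, mode="full")
    # Index i of the full correlation corresponds to lag i - (len(played) - 1).
    return int(np.argmax(corr)) - (len(played) - 1)

# Test sound: hard transients (clicks) separated by silence.
played = np.zeros(200)
played[[10, 70, 150]] = 1.0

# Simulated recording: an attenuated copy of the played sound, delayed by 37 samples.
true_delay = 37
recorded = np.zeros(300)
recorded[true_delay:true_delay + len(played)] = 0.5 * played

latency_offset = measure_delay(played, recorded)  # 37
```

Unevenly spaced transients are used here so that the correlation has a single dominant peak.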
[0106] The latency offset may be determined once and stored for future reference. Optionally, the latency offset may be re-determined periodically. Alternatively, the latency offset may be determined at the start of each session of using the device 200, or at the start of each recording.
[0107] Returning to Figure 13, after synchronisation 508, the reference audio data 210 and the captured audio data 212 are individually transformed 510 from the time domain to the frequency domain. This may involve applying a short-time Fourier transform 302 as described above with reference to Figure 8. Next, the frequency domain versions of the reference audio data 210 and the captured audio data 212 are normalised 512.
[0108] Once normalised, the frequency domain versions of the reference audio data 210 and the captured audio data 212 are provided as inputs for processing 514, for example by the DNN arrangement 278. The DNN arrangement 278 may apply any suitable processing, including the processing described above.
[0109] If the DNN arrangement 278 involves direct transformation, then its output may be provided as the output of the processing 514 directly, without further processing.
[0110] If the DNN arrangement 278 is configured to output mask data, such as the mask data 500, then the processing 514 may also include applying the mask data output by the DNN arrangement 278 to the captured audio data 212, for example as described above, to produce the processed audio data 108.
[0111] If the DNN arrangement 278 is configured to output offset data, such as offset data 504, then the processing 514 may also include applying the offset data output by the DNN arrangement 278 to the captured audio data 212, for example as described above, to produce the processed audio data 108.
[0112] The processed audio data 108 is reverse-normalised 516, which involves the inverse of normalisation 512.
[0113] The reverse-normalised audio data is transformed 518 from the frequency domain to the time domain. This may involve applying an inverse short-time Fourier transform.
[0114] The time domain data 520 output by the inverse Fourier transform may then be stored in the memory 204 for subsequent replay and / or use in a multitrack recording, for example.
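By way of illustration only, the sequence of steps 508 to 518 of Figure 13 may be sketched as follows using NumPy. The frame length, hop size, square-root Hann window, peak-based normalisation, and the placeholder `process` function (which returns the captured spectrogram unchanged in place of a trained DNN arrangement) are all assumptions made for this sketch.

```python
import numpy as np

FRAME = 256
HOP = 128
# Periodic square-root Hann window: analysis x synthesis windows sum to 1 at 50% overlap.
WINDOW = np.sqrt(np.hanning(FRAME + 1)[:-1])

def stft(x):
    """Short-time Fourier transform: one row of frequency bins per frame."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Inverse short-time Fourier transform by windowed overlap-add."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1) * WINDOW
    out = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + FRAME] += frame
    return out

def process(ref_spec, cap_spec):
    """Placeholder for the DNN arrangement: returns the captured data unchanged."""
    return cap_spec

def run_pipeline(reference, captured, latency_offset):
    captured = captured[latency_offset:]                      # 508: synchronise
    n = min(len(reference), len(captured))
    reference, captured = reference[:n], captured[:n]
    ref_spec, cap_spec = stft(reference), stft(captured)      # 510: to frequency domain
    scale = max(np.abs(ref_spec).max(), np.abs(cap_spec).max(), 1e-12)
    ref_norm, cap_norm = ref_spec / scale, cap_spec / scale   # 512: normalise
    processed = process(ref_norm, cap_norm)                   # 514: process
    processed = processed * scale                             # 516: reverse-normalise
    return istft(processed, n)                                # 518: to time domain

rng = np.random.default_rng(1)
reference = rng.normal(size=2048)
captured = np.concatenate([np.zeros(5), reference])  # captured lags by 5 samples
output = run_pipeline(reference, captured, 5)
```

Because the placeholder leaves the captured data unchanged, the output reproduces the time-aligned captured signal, apart from the first and last half-frame where only one window contributes.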
[0115] As explained above, the reference audio data 210 input to the DNN arrangement 278 may comprise a first sequence 282 of reference audio frames. The captured audio data 212 input to the DNN arrangement 278 may comprise a second sequence 284 of captured audio frames. Each audio frame represents, for example, a particular number of samples, each sample having a particular bit depth.
[0116] In the example of Figure 7, the DNN arrangement 278 simultaneously accepts as an input a whole audio frame of each of the first sequence 282 and the second sequence 284. In other examples, the audio frames may be reconfigured prior to input to the DNN. For example, the DNN arrangement 278 may have been trained on blocks having a certain format, including frequency resolution, time resolution, and amplitude resolution. If the format(s) of the frames of the first sequence 282 and / or the second sequence 284 do not match the format upon which the DNN arrangement was trained, then the frames may be reconfigured prior to providing them as inputs to the DNN arrangement 278, such that the frame format matches the frame format that the DNN arrangement 278 was trained on.
[0117] Where the frames include the wrong number of samples, reconfiguring the frames may involve repackaging of the samples into frames of the correct length.
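By way of illustration only, repackaging samples into frames of a different length may be sketched as follows using NumPy. The function name `repackage` and the zero-padding of the final frame are assumptions made for this sketch.

```python
import numpy as np

def repackage(frames, target_len):
    """Flatten frames into one sample stream and re-cut it into frames of
    target_len samples, zero-padding the final frame if needed."""
    samples = np.concatenate(frames)
    pad = (-len(samples)) % target_len
    samples = np.pad(samples, (0, pad))
    return samples.reshape(-1, target_len)

# Three 6-sample frames repackaged into 4-sample frames.
source_frames = [np.arange(0, 6), np.arange(6, 12), np.arange(12, 18)]
model_frames = repackage(source_frames, 4)
```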
[0118] Where the sample resolution is incorrect, upsampling or downsampling may be employed to achieve the correct resolution.
[0119] Where a DNN arrangement is used, it may be desirable to train the DNN arrangement using a frame format that is a common format for storing and / or streaming audio data. This may reduce the amount of processing needed to format frames for input to the DNN arrangement.
[0120] In other examples, more than one frame of the reference audio data 210 may be provided as an input to the DNN arrangement 278 for each frame of the captured audio data 212. For example, Figure 15 shows an arrangement in which a first reference audio frame 640 and a second reference audio frame 642 are simultaneously provided as inputs to the DNN arrangement 278 with a first captured audio frame 644. The first reference audio frame 640 corresponds with the first captured audio frame 644. That is, the first reference audio frame 640 and the first captured audio frame 644 are corresponding frames in the first sequence 282 and the second sequence 284. The second reference audio frame 642 is the frame immediately preceding the first reference audio frame 640 in the first sequence 282.
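By way of illustration only, assembling the inputs of Figure 15 (the preceding reference frame, the corresponding reference frame, and the captured frame) may be sketched as follows using NumPy. The function name `build_inputs`, the stacking order, and the use of a frame of zeros as the "preceding" frame for the very first frame are assumptions made for this sketch.

```python
import numpy as np

def build_inputs(ref_frames, cap_frames):
    """For each frame index, stack [preceding reference frame,
    corresponding reference frame, captured frame]."""
    # Assumption: a frame of zeros stands in for the frame before the first one.
    previous = np.concatenate([np.zeros_like(ref_frames[:1]), ref_frames[:-1]])
    return np.stack([previous, ref_frames, cap_frames], axis=1)

ref_frames = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames of 4 samples
cap_frames = ref_frames + 100.0
inputs = build_inputs(ref_frames, cap_frames)  # shape: (frames, 3, frame_len)
```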
[0121] Frames that are close together in time will often have commonalities. In addition to recorded sounds overlapping frame boundaries (for example, a single note may extend for several frames), audio and room effects such as reverb and echo may also extend the effect of sounds across frames. As such, providing two or more adjacent frames as inputs to the DNN arrangement may improve performance, at the cost of additional processing. Processing of the captured audio data using the reference audio data and the audibility reduction technique may commence at any suitable time. For example, the processing may commence once recording of the captured audio is complete. The processing may start automatically, or under the control of a user.
[0122] The use of frame-based processing with a DNN arrangement may bring additional advantages in some examples. For example, providing reference audio data and captured audio data as inputs to a DNN arrangement in the form of corresponding frames may allow the DNN arrangement to process the audio data in substantially real time, on a frame-by-frame basis.
[0123] For example, while frames of the reference audio are being streamed for replay by the loudspeaker 222, frames of the captured audio data may be generated based on audio captured by the microphone 224. Once each captured audio data frame is available, it is input to the DNN arrangement with the corresponding reference audio frame. The corresponding processed audio frame is stored in memory once generated. Since the processing is performed on a frame-by-frame basis, the processing may take place while the audio is still being captured. Once the audio capture is finished (e.g., when the user 232 stops recording, or the reference audio completes playback), the processed audio 108 will be available for playback as soon as the last frame or frames are processed.
[0124] Accordingly, instead of waiting for the audio capture to be completed before commencing processing, the processing may take place in parallel with the audio capture. This may reduce waiting time for the processed audio data to be available.
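By way of illustration only, processing frames as they become available, rather than after capture is complete, may be sketched as follows. The generator names, the simulated capture source, and the placeholder `process_frame` (a simple subtraction standing in for a trained DNN arrangement) are assumptions made for this sketch.

```python
import numpy as np

def process_frame(ref_frame, cap_frame):
    """Placeholder for the DNN arrangement; here it subtracts the reference frame."""
    return cap_frame - ref_frame

def stream_process(frame_pairs):
    """Consume (reference, captured) frame pairs as they arrive and
    yield each processed frame as soon as it is ready."""
    for ref_frame, cap_frame in frame_pairs:
        yield process_frame(ref_frame, cap_frame)

def simulated_capture(n_frames, frame_len):
    """Stand-in for live capture: each captured frame is the reference
    frame plus a constant 'performance' of 1.0."""
    for i in range(n_frames):
        ref = np.full(frame_len, float(i))
        yield ref, ref + 1.0

# Each processed frame is stored as it is produced, while capture continues.
stored = [frame for frame in stream_process(simulated_capture(5, 4))]
```

Because the work is done per frame, only the final frame or frames remain to be processed when capture ends, regardless of how long the recording was.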
[0125] In addition, the delay between completion of audio capture and availability of the processed audio for replay may be largely or wholly independent of the duration of the captured data. For example, the delay may be of the order of hundreds of milliseconds, a second, or a single digit number of seconds, irrespective of whether the captured audio has a duration of 30 seconds, 10 minutes, an hour, or any other duration.
[0126] Where the loudspeaker (which, in examples, may include a speaker driver) and the microphone are part of the same device, the microphone may be no more than 30 centimetres (approximately 12 inches) from the speaker driver. The distance between the microphone and the speaker driver may be measured along an outside surface of a casing or housing within which the microphone and speaker driver are mounted. There may be more than one speaker driver. In examples, there may be a multiway loudspeaker comprising multiple speaker drivers, each speaker driver being optimised for a subset of the audio spectrum. In examples, there may be a stereo pair of speaker drivers (or loudspeakers, at least some of which comprise multiple speaker drivers). In examples, there may be three or more speaker drivers (or loudspeakers, at least some of which comprise multiple speaker drivers), each speaker driver (or loudspeaker) being configured to reproduce a channel of multi-channel audio.
[0127] For example, Figure 16 shows the device 200, in the form of a smartphone. In other examples, the device 200 may be a tablet computing device or any other device in which the physical relationship between the loudspeaker 222 and the microphone 224 is fixed. In the example of Figure 16, the device 200 has a casing 460. The display 230 effectively forms part of the casing 460, since it encloses the internal components of the device 200.
[0128] The loudspeaker 222 (dotted line) and the microphone 224 (dotted line) are mounted within the casing 460. An outlet aperture 462 allows sound from the loudspeaker 222 to exit the casing 460. Similarly, an inlet aperture 464 allows sound to enter the casing 460 to reach the microphone 224.
[0129] The device 200 in Figure 16 is approximately 15 cm (approximately 6 inches) tall and 7.5 cm (approximately 3 inches) wide. Measured across the outside surface of casing 460, as shown by dotted arrow 468, the distance between the closest points of the outlet aperture 462 and the inlet aperture 464 is approximately 14 cm (approximately 5.5 inches).
[0130] Due to the fixed physical relationship between the loudspeaker 222 and the microphone 224, the delay caused by transmission of sound from the loudspeaker 222 to the microphone 224 is fixed. As such, the second delay mentioned above may be determined in advance for common models of smartphone, tablet computing device, and the like, and stored along with the instructions in the memory 204. The first and third delays may be looked up from the device’s operating system or may also be stored along with the instructions in the memory 204.
[0131] Either way, the latency offset may be determined based on obtaining the first, second, and third delays, which obviates the need to take measurements. However, such measurements may be taken as well, in order to confirm the obtained values. The obtained values may be useful as a starting point in measuring the latency upon which the latency offset is based, which may reduce the amount of time and processing power required to measure the latency. Alternatively, or in addition, the latency may be measured as described above, without relying on stored values.
[0132] The microphone may be no more than 50 centimetres (approximately 20 inches) from the loudspeaker, as measured along an outside surface of the casing. Alternatively, the microphone may be no more than 30 centimetres (approximately 12 inches), 20 centimetres (approximately 8 inches), 15 centimetres (approximately 6 inches), or 12 centimetres (approximately 5 inches) from the loudspeaker, as measured along an outside surface of the casing.
[0133] Where there are multiple speaker drivers (or loudspeakers), and only a single microphone, the microphone may be no more than 50 centimetres (approximately 20 inches) from the furthest or closest loudspeaker, as measured along an outside surface of the casing. Alternatively, the microphone may be no more than 30 centimetres (approximately 12 inches), 20 centimetres (approximately 8 inches), 15 centimetres (approximately 6 inches), or 12 centimetres (approximately 5 inches) from the furthest or closest loudspeaker, as measured along an outside surface of the casing.
[0134] Where there are multiple microphones and only a single speaker driver (or loudspeaker), the speaker driver (or loudspeaker) may be no more than 50 centimetres (approximately 20 inches) from the furthest or closest microphone, as measured along an outside surface of the casing. Alternatively, the speaker driver (or loudspeaker) may be no more than 30 centimetres (approximately 12 inches), 20 centimetres (approximately 8 inches), 15 centimetres (approximately 6 inches), or 12 centimetres (approximately 5 inches) from the furthest or closest microphone, as measured along an outside surface of the casing.
[0135] Where there are multiple microphones and multiple speaker drivers (or loudspeakers), no speaker driver (or loudspeaker) may be more than 50 centimetres (approximately 20 inches) from the furthest or closest microphone, as measured along an outside surface of the casing. Alternatively, no speaker driver (or loudspeaker) may be more than 30 centimetres (approximately 12 inches), 20 centimetres (approximately 8 inches), 15 centimetres (approximately 6 inches), or 12 centimetres (approximately 5 inches) from the furthest or closest microphone, as measured along an outside surface of the casing. One or more speaker driver(s) and microphone(s) being disposed within the same case results in some unique challenges and advantages.
[0136] Advantages include the fact that the characteristics of the loudspeaker and the microphone, and the physical relationship between the loudspeaker, the microphone, and the casing, are fixed. Using an external microphone and / or loudspeaker that may have widely varying characteristics (e.g., frequency and phase response), as well as different positions and spacings depending upon how the user has connected and positioned the loudspeaker and / or the microphone, may result in markedly different transfer functions. This may result in poorer performance, in terms of reducing the audibility of the reference audio in the processed audio relative to the performance while minimising any impact on the fidelity of the performance. The loudspeaker and the microphone being mounted within the case means that the method may be optimised for that specific combination of loudspeaker and microphone, including their spacing, and the physical characteristics of the case including resonances (designed or otherwise).
[0137] The relative closeness of the loudspeaker and the microphone within the same case means that the reproduction of the reference audio produced by the loudspeaker will typically be picked up by the microphone. Where the performance is of relatively low volume relative to the reproduction, this may bring challenges in reducing the audibility of the reference audio in the processed audio relative to the performance while minimising any impact on the fidelity of the performance. However, the fixed nature of the loudspeaker and the microphone may help mitigate this due to the predictability of the transfer function between them.
[0138] In other examples, the reference audio data may be reproduced via a loudspeaker that is not part of the device that implements the method of processing audio data as disclosed herein. For example, the loudspeaker may form part of a physically separate audio reproduction system, such as a stereo system, smart speaker, headphones, earbuds, or the like. In other examples, the loudspeaker may be a physically separate component, with all other functionality (including amplification, for example) being performed by the device. In this context, “physically separate” means not being mounted in the same casing as the device that performs the method. There may still be a wired and / or wireless connection for providing or receiving power, data, and / or control signals, for example.
[0139] In examples, the device may output the reference audio data to the audio reproduction system for reproduction. For example, the device may stream the reference audio data to the audio reproduction system, the audio reproduction system reproducing the reference audio as it is received. Alternatively, the device may output a line level or amplified analogue version of the reference audio for reproduction by the audio reproduction system and / or the loudspeaker.
[0140] In examples, the device may send the reference audio data to the audio reproduction system for storage and subsequent reproduction. Any subsequent reproduction may be under the control of the audio reproduction system, optionally under the manual control of the user 232. Alternatively, any subsequent reproduction may be under the control of the device, which may be configured to send a control signal to the audio reproduction system, instructing the audio reproduction system to begin reproduction of the reference audio.
[0141] In examples, the reproduction and the performance may be captured via a microphone that is not part of the device that implements a method of processing audio data as disclosed herein. For example, the microphone may form part of a physically separate audio capture system. Optionally, such an audio capture system may form part of an audio reproduction system, such as a stereo system, smart speaker, headphones, earbuds, or the like. Alternatively, the audio capture system may be capable of capturing audio but not reproducing it. In examples, the microphone may be a physically separate component, with all other functionality (including pre-amplification, for example) being performed by the device. In this context, “physically separate” means not being mounted in the same casing as the device that performs the method. There may still be a wired and / or wireless connection for providing or receiving power, data, and / or control signals, for example.
[0142] In examples, the device may receive the captured audio from the audio capture system. For example, the audio capture system may stream the captured audio data to the device as the captured audio data is captured. Alternatively, the device may receive a line level or pre-amplified analogue version of the captured audio for conversion into the captured audio data by the device. Alternatively, the audio capture system may capture the reproduction and the performance, and store it as the captured audio data for subsequent transmission to the device. Any such transmission may be under the control of the audio capture system, optionally under the manual control of the user 232. Alternatively, any such transmission may be under the control of the device, which may be configured to send a control signal to the audio capture system, instructing the audio capture system to send the captured audio data to the device.
[0143] Figure 17 shows a system 700. The system includes a device 702. The device 702 may take the form of a smartphone or a tablet computing device, such as those described above for example. In other examples, the device 702 may take the form of a desktop computer, laptop computer, or other personal computer. In yet other examples, the device 702 may take the form of a digital audio workstation (DAW) or other specialist music or sound production device. As with the device 200, the device 702 includes a processor 202 and a memory 204 operatively coupled with the processor 202. The memory 204 stores instructions 206 and data 208, as described above, for example. When executed by the processor 202, the instructions 206 cause the device 702 to implement the method 100. In this example, the data 208 includes the reference audio data 210 and the captured audio data 212.
[0144] The device 702 includes an input / output (I / O) interface 704 for communicating with external devices and components, as described in more detail below. The I / O interface 704 may be configured for wireless and / or wired communication with such external devices and components.
[0145] The system 700 includes an audio reproduction system 706. The audio reproduction system 706 includes a receiver 708 for receiving data and control signals from the device 702, as described in more detail below. The receiver 708 may be configured to receive the data and control signals wirelessly from the device 702 via the I / O interface 704. In other examples, the receiver 708 is configured to receive the data and control signals via a wired connection with the I / O interface 704.
[0146] The audio reproduction system 706 includes a processor and memory (not shown) configured to store reference audio data received from the device 702. The audio reproduction system 706 also includes a DAC (not shown) for converting the reference audio data into an analogue signal, and an amplifier (not shown) for amplifying the analogue signal output by the DAC.
[0147] The audio reproduction system 706 includes a loudspeaker 710 for reproducing the amplified analogue signal output by the amplifier.
[0148] The system 700 includes an audio capture system 712. The audio capture system 712 includes a transmitter 714 for transmitting data to the device 702, as described in more detail below. The transmitter 714 may be configured to transmit the data wirelessly to the device 702 via the I / O interface 704. In other examples, the transmitter 714 is configured to send the data via a wired connection with the I / O interface 704.
[0149] The audio capture system 712 includes a microphone 716 for capturing the reproduction of the reference audio and the performance. The audio capture system 712 also includes a pre-amplifier (not shown) for pre-amplifying the analogue signal output by the microphone 716, and an ADC (not shown) for converting the pre-amplified analogue signal into captured audio data.
[0150] The audio capture system 712 also includes a processor and memory (not shown) configured to store the captured audio data.
[0151] The audio reproduction system 706 and the audio capture system 712 may be configured to operate independently of each other and the device 702. Alternatively, or in addition, either or both of the audio reproduction system 706 and the audio capture system 712 may be configured to operate under the control of the device 702.
[0152] In use, the audio reproduction system 706 and the audio capture system 712 are connected to the device 702. As explained above, the connection may be wireless and / or wired. The device 702 may provide an interface to the user 232, such as that shown in Figures 4 to 6. When the user presses the “start” button 242, the device 702 begins streaming the reference audio data to the audio reproduction system 706. Based on the received reference audio data, the audio reproduction system 706 reproduces the reference audio via the loudspeaker 710, as indicated by arrow 718.
[0153] The user 232 (or one or more other performers) performs while the reference audio is being reproduced. For example, the user 232 can sing, speak, play one or more musical instruments, or otherwise undertake an audible performance accompanied by the reproduction of the reference audio. The performance is indicated by arrow 720. The microphone 716 of the audio capture system 712 receives the reproduced reference audio 718 and the performance 720 and outputs an analogue signal representative of the reproduced reference audio 718 and the performance 720. The analogue signal is pre-amplified by the pre-amplifier, converted into captured audio data by the ADC, and streamed by the audio capture system 712 to the device 702 via the I / O interface 704.
[0154] Based on the captured audio data received from the audio capture system 712 and the reference audio data, the device 702 performs the method 100 to generate the processed audio data 108, as described above for example.
[0155] In examples, the audio reproduction system 706 and the audio capture system 712 may take the form of separate devices. Alternatively, the audio reproduction system 706 and the audio capture system 712 may be contained within the same physical device. For example, the audio reproduction system 706 and the audio capture system 712 may take the form of a stereo system, smart speaker, headphones, earbuds, or the like, having both audio reproduction and audio capture capabilities. The dotted line 724 in Figure 17 represents such a single device.
[0156] In other examples, the audio reproduction system 706 may form part of the device 702, with the audio capture system 712 forming part of a physically separate device.
[0157] In yet other examples, the audio capture system 712 may form part of the device 702, with the audio reproduction system 706 forming part of a physically separate device.
[0158] In examples, the processor 702 may process the reference audio data 210 prior to transmitting it for reproduction by the audio reproduction system 706. In other examples, some or all of such processing may take place within the audio reproduction system 706. Examples of such processing are described above with reference to Figure 2.
[0159] In examples, the processor 702 may perform time domain to frequency domain conversion of the reference audio data and / or the captured audio data. In other examples, conversion of the reference audio data and / or the captured audio data may be performed by the audio capture system 712, and / or another device or system not forming part of the device 702, audio capture system 712, or audio reproduction system 706. The device 700 may include a display (not shown) for displaying information to a user, as described above in relation to device 200, for example.
[0160] In examples, audibility of the representation of the reproduction in the processed audio data 108 may be adjustable. In examples, there may be a compromise between reducing audibility of the reference audio in the processed audio data and increasing fidelity of the performance in the processed audio data. In general, maximising the audibility reduction may increase the chance of undesirable distortion of the performance in the processed audio data. It may therefore be desirable in examples to modulate the amount of audibility reduction.
[0161] Where a DNN arrangement is used to produce mask data, the mask data may be scaled and / or offset by one or more scaling and / or offset factors in order to increase or decrease the effect of the mask data.
[0162] When a DNN arrangement is used to produce offset data, the offset data may be scaled and / or offset by one or more scaling and / or offset factors in order to increase or decrease the effect of the offset data.
[0163] A scaling factor may be applied by multiplying or dividing one or more amplitude values in the frequency domain representation by the scaling factor. An offset value may be applied by adding the offset value to, or subtracting the offset value from, one or more amplitude values in the frequency domain representation.
[0164] In examples, a single scaling and / or offset factor may be applied to all amplitude values within the frequency domain representation. In other examples, different scaling factors may be applied to different frequencies, time, or amplitude values within the frequency domain representation.
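By way of illustration only, the scaling and offsetting of mask data may be sketched as follows. The sketch assumes that the mask data takes the form of per-bin gains between 0 and 1 that are multiplied with the amplitude values of the captured audio data. The helper name `adjust_mask`, the clipping step, and all numeric values are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def adjust_mask(mask, scale=1.0, offset=0.0):
    """Scale and offset a mask, then clip it to [0, 1] (hypothetical helper)."""
    mask = np.asarray(mask, dtype=float)
    return np.clip(mask * scale + offset, 0.0, 1.0)

# Captured amplitude values for four frequency bins (made-up numbers).
captured = np.array([1.0, 0.8, 0.5, 0.2])
# Mask as it might be produced by a DNN arrangement (made-up numbers).
mask = np.array([0.9, 0.1, 0.5, 0.0])

# A single offset applied to all values: raising the mask weakens the reduction.
gentler = adjust_mask(mask, offset=0.2)
# Different scaling factors applied to different frequency bins.
per_bin = adjust_mask(mask, scale=np.array([1.0, 1.0, 2.0, 2.0]))

# Applying the adjusted mask to the captured amplitude values.
processed = captured * gentler
```

Under these assumptions, a mask value closer to 1 leaves more of the captured audio in place, so a positive offset trades audibility reduction for fidelity of the performance.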
[0165] In examples, the method and device may allow the captured audio data and the processed audio data to be combined. In examples, a mixer or blender function may be provided to allow a user to blend, sum, or mix the captured audio data and the processed audio data. The mixer or blender function may allow, for example, different amounts of the captured audio data to be mixed with the processed audio data. Such mixing may reduce the audibility of processing artifacts, such as digital artifacts, that may be present in the processed audio data.
[0166] The mixer or blender function may also offer the option of filtering or otherwise processing either or both of the captured audio data and processed audio data before mixing. For example, adjusting the amplitude of certain frequencies in either or both of the captured audio data and processed audio data may provide a more sonically appealing result in certain circumstances.
[0167] The amount of captured audio data mixed with the processed audio data, and optionally any filtering or processing applied to the captured audio data and / or the processed audio data, may be controlled by a user to improve the sound of the mixed version.
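A minimal sketch of such a mixer or blender function is given below. It assumes a simple linear mix of two time-aligned signals of equal length; the function name `blend` and the parameter name `wet` are hypothetical, and no filtering stage is shown.

```python
import numpy as np

def blend(captured, processed, wet=0.8):
    """Linear mix: wet=1.0 is fully processed, wet=0.0 is the raw capture."""
    if not 0.0 <= wet <= 1.0:
        raise ValueError("wet must be between 0 and 1")
    captured = np.asarray(captured, dtype=float)
    processed = np.asarray(processed, dtype=float)
    if captured.shape != processed.shape:
        raise ValueError("signals must be the same length and time-aligned")
    return wet * processed + (1.0 - wet) * captured

# Mix 75 % processed audio with 25 % captured audio (made-up sample values).
mixed = blend([1.0, 0.0], [0.0, 1.0], wet=0.75)
```

A user-facing control could map directly onto `wet`, so that reducing it reintroduces some captured audio to cover processing artifacts.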
[0168] Although audibility reduction techniques involving the use of various DNN arrangements have been described, it will be appreciated that non-DNN-based techniques may be employed in other examples. In examples, non-DNN-based artificial neural network techniques may be used. In other examples, the techniques do not involve artificial neural networks. For example, frequency-based masking or filtering may be used.
[0169] In examples, the audibility reduction technique may comprise a bleed reduction technique. Bleed audio is unwanted sound captured in a recording process, the bleed audio arising from the reproduction of audio, such as a backing track or a click track. Bleed audio may come from a loudspeaker through which backing audio is being reproduced. Bleed audio may also arise as a result of audible leakage of the backing audio from private audio reproduction systems such as headphones and earbuds.
[0170] An advantage of at least some examples is that the audibility reduction technique may be applied independent of an audio type of the reference audio data and / or the captured audio data. In this context, “audio type” may include characteristics related to musical genre, for example. An alternative approach to applying an audibility reduction technique would be to train an artificial neural network on a particular audio type (e.g., a particular musical genre, such as rock, jazz, or orchestral music, and / or a musical instrument type, such as guitar, violin or piano). An artificial neural network trained in this way may extract a performance from captured audio, based on the audio type upon which the artificial neural network was trained. For example, a singer’s vocals may be extracted from a live recording that also includes the sounds generated by other members of an accompanying band. For best results, such an approach requires the use of a different model for each audio type to which the audibility reduction technique is to be applied. In contrast, by using the reference audio data and the captured audio data as inputs to the audibility reduction technique, the audibility reduction technique need not be audio-type specific. For example, if the audibility reduction technique involves an artificial neural network, such as a DNN arrangement, then it is not necessary to train the artificial neural network on a specific audio type.
[0171] Although the various methods and techniques above are shown for a single channel of audio, the methods and techniques may be provided for multi-channel audio, such as stereo audio.
[0172] Figure 18 shows a method 800 of performing real-time music extraction. In this context, “real-time” means that music extraction from a performance is performed while the performance is still being captured.
[0173] The method 800 includes providing 802 playback and recording frames as inputs to a deep neural network, DNN. The DNN may comprise, for example, a DNN arrangement as described above. The playback frames may, for example, take the form of the reference audio described above. The recording frames may, for example, take the form of the captured audio described above.
[0174] The method 800 includes using the DNN to remove 804 playback bleed from the recording frames. Playback bleed may comprise unwanted or undesirable sound associated with the playback frames that is part of the recording frames.
[0175] The playback bleed removal may be performed while recording is taking place.
[0176] The playback bleed removal may be completed shortly after the recording stops. In examples, the playback bleed removal may be completed within 500 milliseconds of the recording stopping. In examples, the playback bleed removal may be completed within 1 second, 2 seconds, 5 seconds, or 10 seconds of the recording stopping.
[0177] In examples, the amount of time between the recording stopping and the completion of the playback bleed removal is independent of a duration of the captured audio data. That is, the time taken for the bleed removal to finish is substantially constant, and does not, for example, increase with duration of the captured audio data upon which the recording frames are based.
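The duration independence may be illustrated with a simple timing model. The model below is an assumption made for illustration: frames are processed one after another as soon as they become available, and each frame takes a fixed, hypothetical processing time (`proc_ms`) that is shorter than the frame duration.

```python
def tail_delay_ms(num_frames, frame_ms, proc_ms):
    """Delay between the recording stopping and the last frame being processed.

    Frame k (0-based) becomes available at (k + 1) * frame_ms. Processing is
    sequential and takes proc_ms per frame.
    """
    finished = 0.0
    for k in range(num_frames):
        available = (k + 1) * frame_ms
        finished = max(finished, available) + proc_ms
    stop_time = num_frames * frame_ms
    return finished - stop_time

frame_ms = 1024 / 48000 * 1000  # about 21.3 ms per 1024-sample frame at 48 kHz
short_take = tail_delay_ms(100, frame_ms, proc_ms=5.0)     # roughly 2 seconds of audio
long_take = tail_delay_ms(10_000, frame_ms, proc_ms=5.0)   # roughly 3.5 minutes of audio
```

In this model both takes finish one frame's processing time after the recording stops. If the per-frame processing time exceeded the frame duration, a backlog would build up and the delay would instead grow with the duration of the recording.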
[0178] Figure 19 shows a method 900 of reducing audibility of reproduced reference music in a recording of a live performance accompanying the reproduced reference music. The reproduced reference music may, for example, be based on reference audio data, such as the reference audio data 210 described above. The recording may, for example, comprise the captured audio data 212 described above.
[0179] The method 900 includes reproducing 902 the reference music, via a loudspeaker of a smartphone or a tablet computing device, based on reference music data. The loudspeaker may, for example, be the loudspeaker 222 or loudspeaker 710 as described above. The smartphone or tablet computing device may, for example, be the device 200 or the device 700.
[0180] The method 900 includes generating 904 captured audio data by capturing the reproduced reference music and a live performance accompanying the reproduced reference music via a microphone of the smartphone or the tablet computing device. The captured audio may, for example, be the captured audio data 212 described above.
[0181] The method 900 includes processing 906 the captured audio data and reference audio data in a deep neural network, DNN, to produce processed audio data, such that audibility of the reference music relative to the live performance is reduced in the processed audio data relative to in the captured audio data.
[0182] Examples of the method and device may work in ‘real-time’ on frames (sometimes called packets, slices or buffers) of audio. A frame refers to a number of consecutive audio samples. For example, a frame may contain 1024 audio samples, where the audio sampling frequency is 48,000 samples per second (i.e., 48 kHz). When a frame of recorded audio is captured, it is input together with the corresponding frame of playback audio, that is, the playback audio that the performer was playing along with. The two input frames are then processed as follows:
[0183] 1. Align / synchronise the frames to correct for any time difference or latency in the playback and recording system.
[0184] 2. Generate a short-time Fourier transform of each of the playback and recorded frames (for example, to produce a three dimensional array containing time, frequency, and amplitude or power variables).
[0185] 3. Normalise the frequency domain frames in preparation for extraction (for example, based on signal strength analysis or prior calibration).
[0186] 4. Use a trained deep-learning neural network (DNN) to generate a clean-estimate recording frame from the playback and recorded frames. Generating the clean-estimate recording frame may involve direct transformation. Generating the clean-estimate recording frame may involve the generation of a masking frame that is applied to the recorded frame. Generating the clean-estimate recording frame may involve the generation of an offset frame that is applied to the recorded frame.
[0187] 5. Reverse normalise the clean-estimate recording frame.
[0188] 6. Inverse short-time Fourier transform the reverse-normalised clean-estimate recording frame, to produce a cleaned audio frame of the recording.
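The six steps above may be sketched, for a single pair of frames, as follows. This is an illustrative sketch under stated assumptions rather than the described implementation: the transform size, the 50 % overlap, the Hann window, the peak-based normalisation, and the zero-filled alignment are all assumed, and `placeholder_mask` is a crude hypothetical stand-in for the trained DNN, not the DNN itself.

```python
import numpy as np

FRAME = 1024      # samples per frame, as in the example above
N_FFT = 256       # assumed transform size (not specified in the text)
HOP = N_FFT // 2  # assumed 50 % overlap
# Periodic Hann window: copies shifted by HOP sum to 1, so overlap-add is exact.
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N_FFT) / N_FFT)

def stft(x):
    """Short-time Fourier transform of one frame -> (time, frequency) array."""
    padded = np.pad(x, HOP)  # zero-pad so the frame edges are fully covered
    starts = range(0, len(padded) - N_FFT + 1, HOP)
    return np.array([np.fft.rfft(WINDOW * padded[s:s + N_FFT]) for s in starts])

def istft(spec):
    """Inverse of stft() by overlap-add, with the padding removed."""
    out = np.zeros(HOP * (len(spec) + 1))
    for i, column in enumerate(spec):
        out[i * HOP:i * HOP + N_FFT] += np.fft.irfft(column, n=N_FFT)
    return out[HOP:-HOP]

def placeholder_mask(playback_spec, recorded_spec):
    """Stand-in for the trained DNN: a crude magnitude-ratio mask."""
    rec = np.abs(recorded_spec)
    residual = np.maximum(rec - np.abs(playback_spec), 0.0)
    return residual / np.maximum(rec, 1e-12)

def process_frame(playback, recorded, latency=0):
    playback = np.asarray(playback, dtype=float)
    recorded = np.asarray(recorded, dtype=float)
    # 1. Align: the recording is assumed to lag the playback by `latency`
    #    samples. Samples shifted in from outside the frame are zero-filled.
    aligned = np.zeros_like(playback)
    aligned[latency:] = playback[:len(playback) - latency]
    # 2. Transform both frames to the frequency domain.
    B, R = stft(aligned), stft(recorded)
    # 3. Normalise by the recorded frame's peak magnitude (an assumed scheme).
    scale = max(np.abs(R).max(), 1e-12)
    Bn, Rn = B / scale, R / scale
    # 4. Clean estimate: a masking frame applied to the recorded frame.
    clean_n = placeholder_mask(Bn, Rn) * Rn
    # 5. Reverse the normalisation.
    clean = clean_n * scale
    # 6. Return to the time domain.
    return istft(clean)
```

With silent playback the placeholder mask is 1 everywhere and the recorded frame is returned unchanged; with a recording that consists only of the playback, the output is silence. A trained DNN would replace `placeholder_mask` and would be expected to handle the intermediate cases far better than this ratio.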
[0189] The described frame-by-frame processing may take place repeatedly until the recording is completed. The recording may be completed automatically. For example, the recording may be completed when the playback of the recording is complete. The recording may be completed when a particular duration (for example, set by the user) has been reached. The recording may be completed when the user stops the recording, for example by pressing a “stop” button.
[0190] After the recording is stopped, the final frame(s) complete processing. The cleaned audio file is then available for use, playback and review.
[0191] Frame overlap may be used to incorporate a number of frames as inputs to the DNN. In an example, in processing backing track frame B(n) and recording frame R(n) (where n is the frame number), frames B(n-1) and R(n-1) may be used as inputs to the DNN. This may improve the quality of the results and / or reduce the impact of what are known as “edge conditions”. Frames are concatenated together before input to the DNN arrangement 278.
[0192] Figure 15 shows an example in which frames R(n), R(n-1), and B(n) are used as inputs to the DNN arrangement 278.
[0193] In further examples, frames B(n-1), B(n+1), R(n-1), and R(n+1) may be used as inputs to the DNN.
[0194] Any other combination of previous and subsequent frames, of the reference audio data 210 and / or the captured audio data 212, may be used as inputs.
[0195] The use of a greater number of frames may further improve the quality of the results and / or reduce the impact of edge conditions.
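The concatenation of previous and subsequent frames may be sketched as follows. The function name `dnn_input`, the ordering of the concatenated frames, and the use of zero-filled frames at the start and end of a sequence are assumptions made for illustration.

```python
import numpy as np

def dnn_input(playback_frames, recording_frames, n, past=1, future=0):
    """Concatenate B(n-past..n+future) and R(n-past..n+future) into one vector.

    Frames that fall outside the sequence are replaced with zeros
    (an assumed policy for the first and last frames).
    """
    frame_len = len(recording_frames[0])

    def pick(frames, k):
        if 0 <= k < len(frames):
            return np.asarray(frames[k], dtype=float)
        return np.zeros(frame_len)

    indices = range(n - past, n + future + 1)
    parts = [pick(playback_frames, k) for k in indices]
    parts += [pick(recording_frames, k) for k in indices]
    return np.concatenate(parts)

# Three short made-up frames per sequence, four samples each.
B = [np.full(4, 1.0 + i) for i in range(3)]   # backing track frames
R = [np.full(4, 10.0 + i) for i in range(3)]  # recording frames

current_and_previous = dnn_input(B, R, n=1)            # B(0), B(1), R(0), R(1)
with_lookahead = dnn_input(B, R, n=1, past=1, future=1)
```

Setting `future` above zero means the current frame cannot be processed until later frames have been captured.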
[0196] In further examples, adjacent frames may include samples in common. That is, the content of adjacent frames may partly overlap. This may further improve the quality of the results and / or reduce the impact of edge conditions. The use of multiple frames from each sequence of frames may introduce additional delay in production of the processed audio data 108. For example, where only one frame of the playback audio and recorded audio is used as an input, a single frame of latency is introduced. Where multiple frames of the playback audio and / or recorded audio are used as inputs, multiple frames of latency are introduced. An improvement in quality may make the additional delay an acceptable compromise.
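As a worked example of the latency involved, the delay expressed in frames may be converted to milliseconds. The sketch assumes the 1024-sample frame and 48 kHz sampling frequency given in the example above; the function name `latency_ms` is hypothetical.

```python
def latency_ms(frames_of_latency, frame_len=1024, sample_rate=48000):
    """Convert a latency expressed in frames to milliseconds."""
    return frames_of_latency * frame_len / sample_rate * 1000.0

one_frame = latency_ms(1)     # a single frame of latency
three_frames = latency_ms(3)  # e.g. three frames of latency
```

Under these assumptions a single frame corresponds to roughly 21.3 ms and three frames to 64 ms.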
[0197] In relation to all methods and devices described herein, the reference audio data and the captured audio data may represent music. This is in distinction to, for example, spoken word applications, such as hands-free telephony and conference calls.
[0198] In relation to all methods and devices described herein, the performance may represent an overlay track for a multi-track recording. For example, the reference audio may be generated based on at least one other track associated with the multi-track recording. The at least one other track may include all tracks of a multi-track recording, or a subset of those tracks. For example, if a multi-track recording currently includes drum tracks, guitar tracks, bass tracks, and vocal tracks, the reference audio may include just the drum tracks and the guitar tracks. Reducing the number of tracks forming part of the reference audio may reduce complexity of the audio in the reference audio, which may improve the amount by which audibility of the reference audio on the captured audio may be reduced, and / or reduce the amount of distortion in the remaining performance audio.
[0199] Although smartphones, tablet computing devices, desktop computers, laptop computers, other types of personal computer, digital audio workstations (DAWs), and other specialist music or sound production devices have been described, other examples and embodiments may be implemented in, or take the form of, any other suitable device or connected devices.
[0200] Although various devices are shown as having a single processor, in other examples, processing may be performed across multiple processors, including one or more processors in different devices.
[0201] Although various devices are shown as having a single memory, in other examples, multiple types of memory (such as volatile and non-volatile memory, and memory types having different speeds) may be employed. Although various devices are shown as having an on-board processor and memory, in other examples, processing and / or storage may be performed by one or more other devices. Such one or more other devices may be physically remote from the device performing the method. For example, such one or more other devices may be in contact with the device performing the method by way of a wired or wireless connection, such as a local area network (including a wired and / or wireless local area network (LAN)), a mobile telecommunications network, and / or the Internet.
[0202] In examples, the DNN arrangement 278 may use a hybrid design. For example, the DNN arrangement may comprise a two-stage neural network arrangement. The first neural network is configured to estimate ‘pure’ bleed. A second neural network is configured to create a bleed removal mask (or offset). This approach can have advantages for training and testing purposes, given that the DNN can be trained and verified in a modular form, enabling accuracies and improvement opportunities to be identified at specific stages in the conversion process.
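The modular structure of such a two-stage arrangement may be sketched as follows. Both stages are simple hypothetical stand-ins operating on magnitude values, not trained neural networks; the sketch is intended only to show that the intermediate bleed estimate is exposed and can be checked separately from the mask.

```python
import numpy as np

def estimate_bleed(reference_mag, captured_mag):
    """Stage 1 stand-in: the bleed cannot exceed what was captured."""
    return np.minimum(reference_mag, captured_mag)

def bleed_removal_mask(bleed_mag, captured_mag):
    """Stage 2 stand-in: fraction of each bin not attributed to bleed."""
    return 1.0 - bleed_mag / np.maximum(captured_mag, 1e-12)

def two_stage(reference_mag, captured_mag):
    """Run both stages and return the intermediate and final outputs."""
    bleed = estimate_bleed(reference_mag, captured_mag)
    mask = bleed_removal_mask(bleed, captured_mag)
    return bleed, mask

# Made-up magnitudes for three frequency bins.
reference = np.array([0.5, 0.0, 2.0])
captured = np.array([1.0, 1.0, 1.0])
bleed, mask = two_stage(reference, captured)
```

Because `bleed` is returned alongside `mask`, the first stage can be compared against a known bleed signal during testing without involving the second stage.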
[0203] When recording a new layer (i.e., an ‘overlay’ or ‘overdub’) of music to a backing track (e.g., recording voice in time with an instrumental backing track, or recording guitar to a metronome click), headphones may be used. If such headphones are not sufficiently acoustically isolated, leakage of the backing track is also recorded into the microphone, causing an inadequately isolated overlay recording to be captured. The unwanted backing track recording is often referred to as ‘bleed’. In many cases, there is a latency in the recording system, so the recorded bleed is not in time or synchronised with the source audio, so a degraded or otherwise distorted result is achieved.
[0204] The use of headphones when recording may be undesirable in some circumstances. Non-technical musicians may find it challenging to correctly set up a headphone system for recording. For example, an audio interface including a preamplifier, a DAC and various cables and adaptors may be required, depending on what recording hardware is being used. Some musicians may find headphones uncomfortable or restrictive. Some musicians may find it difficult to perform well when isolated from the sound of their voice or instrument due to the headphone isolation.
[0205] It may be cumbersome to carry headphones and any associated equipment. This is of particular importance where it is desired to record using a portable device such as a smartphone, mobile phone, or tablet computing device, as the headphones and any associated equipment may take up considerable additional space and weight. Such portable devices may not have a jack socket for a headphone connection. Moreover, wireless headphones may not be a viable option for reducing bleed issues, due to the relatively large latency they may introduce.
[0206] In a studio or other recording environment, it may be valuable for the performer to be in the same room as the recording engineer and / or music producer, to aid dialogue and communication. Allowing all participants to hear playback at the same time over loudspeakers may be desirable in some circumstances.
[0207] Examples of the method and device may allow users to avoid the complication of using headphones when recording music that accompanies a musical backing track, metronome click track, or other ‘playback’ audio data. Examples of the method and device may allow overlay recordings without the use of headphones, by reducing or removing musical bleed components in a recording.
[0208] Examples of the method and device may allow reduction or removal of unwanted bleed music in real-time, as the recording is being made.
[0209] Examples of the method and device may allow results to be available almost immediately after the recording completes.
[0210] Examples of the method and device may be particularly useful with portable screen-based devices such as mobile smartphones, mobile phones, and tablet computing devices. Such devices incorporate a loudspeaker and microphone in close proximity. The proximity may cause significant bleed issues if headphones are not used for overlay recording.
[0211] Examples of the method and device operate differently to extraction tools that allow the isolation of a single voice or instrument from a recording. For example, examples of the method and device allow the recording of a new overlay track for use in a multitrack recording, whereas extraction tools only allow for extraction of existing voices or instruments from a combined recording.
[0212] Using the reference audio as an additional input to an audibility reduction technique may enable improved audio bleed reduction.
[0213] Using the reference audio as an additional input to an audibility reduction technique may improve the fidelity of a recorded performance after the audibility reduction technique has been applied. Examples of the method and device may allow the reduction or removal of unwanted bleed audio from any audio recording scenario where audio bleed may occur. Examples of the method and device may be of particular use when recording an overlay track in a multi-track recording scenario.
[0214] Examples of the method and device may be of use in any system involving audio reproduction and recording. Examples of the method and device may be of particular use when using a portable device, such as a smartphone, mobile phone, or tablet computing device. In such devices, loudspeaker(s) (for reproducing reference audio) and microphone(s) (for capturing the reproduced reference audio and an accompanying performance) are provided in a single casing.
[0215] Examples of the method and device may operate locally (i.e., within a single device). Operating locally may avoid the need for uploading or otherwise transferring the reference audio data and captured audio data to a remote server (e.g., a cloud server) for processing. Uploading or otherwise transferring the reference audio data and captured audio data to a remote server may be time-consuming, costly, and / or unreliable. The use of a remote server may also introduce privacy and / or security risks.
[0216] Particularly when operating locally, examples of the method and device may process captured data while the capture is ongoing. This may allow for completion of processing shortly after completion of the capture, without the need for a separate action, after the capture is complete, to upload the captured data and / or otherwise initiate processing of the captured audio data.
[0217] Examples of the method and device may be used after capture is complete. In examples, the method and device may be operable to remove bleed audio from previously captured audio data. In examples, the method and device may be operable to remove audio bleed that was accidentally recorded.
[0218] Examples of the method and device may operate without the need for metadata relating to the reference audio or the performance audio. In examples, the method and device do not need to know the musical instrument or voice type. In examples, the method and device do not need to know the type or genre of music.
[0219] Although specific examples and embodiments have been described, it will be appreciated that the invention may be embodied in many other forms.
Claims
1. A computer-implemented method of processing audio data, the method comprising: obtaining reference audio data representing reference audio; obtaining captured audio data, the captured audio having been captured via a microphone and representing a combination of: a reproduction of the reference audio by a speaker driver; and an audible performance accompanying the reproduction; and processing the captured audio data using the reference audio data and an audibility reduction technique to generate processed audio data, wherein the processed audio data represents the performance, and wherein any representation of the reproduction in the processed audio data has a lower audibility in the processed audio data than in the captured audio data.
2. A computer-implemented method according to claim 1 , wherein the processing of the captured audio data comprises: providing the reference audio data and the captured audio data as inputs to a deep neural network, DNN, arrangement, wherein the DNN arrangement comprises one or more DNNs, and wherein the processed audio data is based on an output of the DNN arrangement.
3. A computer-implemented method according to claim 2, wherein the DNN arrangement uses a U-Net architecture.
4. A computer-implemented method according to claim 2 or 3, wherein the one or more DNNs comprise a transformer DNN.
5. A computer-implemented method according to any of claims 2 to 4, wherein the DNN arrangement is configured to output masking data for application to the captured audio data as part of generating the processed data.
6. A computer-implemented method according to any of claims 2 to 5, wherein the DNN arrangement is configured to output offset data for antiphase application to the captured audio data as part of generating the processed data.
7. A computer-implemented method according to any preceding claim, wherein: the reference audio data comprises a first sequence of audio frames, each audio frame comprising audio data values; and the captured audio data comprises a second sequence of audio frames, each audio frame comprising audio data values, wherein the processing of the captured audio data using the reference audio data and the audibility reduction technique comprises simultaneously processing audio data values from the first sequence of audio frames and the second sequence of audio frames.
8. A computer-implemented method according to claim 7, wherein the processing of the captured audio data using the reference audio data and the audibility reduction technique comprises simultaneously processing audio data values from one or more corresponding frames of the first sequence of audio frames and the second sequence of audio frames.
9. A computer-implemented method according to any preceding claim, wherein the processing of the captured audio data using the reference audio data and the audibility reduction technique starts during the reproduction and performance.
10. A computer-implemented method according to any preceding claim, wherein the processing of the captured audio data using the reference audio data and the audibility reduction technique completes within an amount of time that is independent of a duration of the captured audio data.
11. A computer-implemented method according to any preceding claim, wherein the processing of the captured audio data using the reference audio data and an audibility reduction technique comprises: using a short-time Fourier-type transform to: transform the reference audio data into a frequency domain representation of the reference audio; and transform the captured audio data into a frequency domain representation of the reproduction of the reference audio and the performance accompanying the reproduction; and performing the audibility reduction technique based on the frequency domain representations.
12. A computer-implemented method according to any preceding claim, the method being performed by a device comprising a casing, the casing comprising the microphone and the speaker driver.
13. A computer-implemented method according to claim 12, wherein the microphone is no more than 30 centimetres from the speaker driver, as measured along an outside surface of the casing.
14. A computer-implemented method according to claim 12 or 13, the method comprising: obtaining a latency offset indicative of a playback and recording latency of the device; and synchronising the captured audio data and the reference audio data based at least in part on the latency offset.
15. A computer-implemented method according to claim 14, wherein the obtaining of the latency offset comprises: playing a sound via the speaker driver; recording the played sound via the microphone; measuring a time delay associated with the playing and the recording of the sound; and using the measured time delay to determine the latency offset.
16. A computer-implemented method according to any preceding claim, comprising: outputting the reference audio data via the speaker driver.
17. A computer-implemented method according to any preceding claim, wherein the audibility reduction technique comprises a bleed reduction technique, and wherein the reproduction of the reference audio comprises bleed audio.
18. A computer-implemented method according to any preceding claim, wherein the audibility reduction technique is applied independent of an audio type of the reference audio data and / or the captured audio data.
19. A computer-implemented method according to any preceding claim, comprising controlling the audibility of the representation of the reproduction in the processed audio data.
20. A method of performing real-time music extraction, the method comprising: providing playback and recording frames as inputs to a deep neural network, DNN; and using the DNN to remove playback bleed from the recording frames, wherein playback bleed removal is performed while recording is taking place, and wherein playback bleed removal is completed within an amount of time that is independent of a duration of the captured audio data.
21. A method of reducing audibility of reproduced reference music in a recording of a live performance accompanying the reproduced reference music, the method comprising: reproducing the reference music, via a speaker driver of a smartphone or a tablet computing device, based on reference music data; generating captured audio data by capturing the reproduced reference music and a live performance accompanying the reproduced reference music via a microphone of the smartphone or the tablet computing device; and processing the captured audio data and reference audio data in a deep neural network to produce processed audio data, such that audibility of the reference music relative to the live performance is reduced in the processed audio data relative to in the captured audio data.
22. A computer-implemented method according to any preceding claim, wherein the reference audio data and the captured audio data represent music.
23. A computer-implemented method according to any preceding claim, wherein the performance represents an overlay track for a multi-track recording.
24. A device configured to perform the method of any preceding claim.
25. The device of claim 24, wherein the device is a smartphone or a tablet computing device.
26. A computer program comprising instructions which, when executed by a computing device, cause the computing device to perform a method according to any one of claims 1 to 23.