Audio codec bit rate using directional loudness

By taking into account the directionality and frequency of the audio signal and dynamically adjusting the gain allocation bandwidth, the artifact problem caused by the failure to effectively utilize directionality in existing technologies is solved, achieving more efficient audio coding and quality improvement.

CN122228546A · Pending · Publication Date: 2026-06-16 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SAMSUNG ELECTRONICS CO LTD
Filing Date: 2024-12-06
Publication Date: 2026-06-16

IPC: G10L19/008; G10L19/002; G10L19/02

CPC: G10L19/008; G10L19/02; G10L19/002

AI Tagging

Application Domain

Speech analysis

AI Technical Summary

⚠Technical Problem

Existing audio codecs fail to effectively consider the directionality of audio signals when allocating bit rates, leading to artifacts during compression and limiting creative expression and audio quality.

⚗Method used

By determining the relative power and directivity of the audio signal relative to a reference signal, the gain is dynamically adjusted to allocate bandwidth based on frequency and directivity, achieving more precise bit rate allocation.

🎯Benefits of technology

It improves the quality and efficiency of audio coding, enabling the transmission of more objects or channels at a given bit rate, or reducing the bit rate requirement for a specific number of objects or channels, and reducing the occurrence of artifacts.

✦ Generated by Eureka AI based on patent content.

Abstract

In one embodiment, a method for encoding audio comprises accessing a window of audio comprising a plurality of audio signals. The method further comprises determining, for each of the plurality of audio signals, a power of the audio signal relative to a reference audio signal; determining, for each of the plurality of audio signals, and based on the determined relative power of the audio signal, a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encoding the audio by assigning a bandwidth amount to each of the plurality of audio signals based on the determined gain that is frequency-dependent and directionality-dependent, such that the bandwidth is assigned based on directionality relative to a listener.

Description

Technical Field

[0001] This application generally relates to using directional loudness to determine the bit rate of an audio codec.

Background Technology

[0002] Audio codecs are used to encode and decode audio information according to an encoding scheme. Encoding compresses the audio stream, thereby reducing the bit rate used to represent the audio stream. Decoding decompresses the compressed audio stream. Compression can introduce artifacts into the audio stream, including temporal artifacts and spatial artifacts (for example, a sound that should be played from the user's left side may instead be played from the user's center).

[0003] With the streaming of media, including audio, audio compression has become increasingly common. While audio on fixed storage media (such as Blu-ray discs) typically doesn't require compression, many streaming applications benefit from some form of compression of audio information. For example, streaming may be data-limited, and compression can improve the audio quality heard by compressing audio during transmission to meet data rate requirements while preserving important information in the audio stream that affects listening experience. As another example, in mixed media streaming, bandwidth is often increasingly allocated to video, so reducing the amount of data allocated to audio can improve the overall streaming experience, provided that the audio isn't excessively degraded during compression.

[0004] Entertainment systems typically involve multiple speakers that play audio. For example, an entertainment system may include a pair of left and right stereo speakers, a subwoofer, a center speaker, a pair of left and right surround speakers, and / or a pair of left and right rear surround speakers. The number of speakers in a system is typically represented using the x.y-channel convention (e.g., 5.1), where x is the number of speakers used in the system and y is the number of subwoofers used in the system. Encoding and compression can be performed on a channel-based audio representation or on object-based audio, which assigns audio information to one or more objects that may or may not move over time.

Summary of the Invention

[0005] [Technical Solution]

[0006] One embodiment provides a method for encoding audio, comprising: accessing a window of audio including a plurality of audio signals; determining, for each of the plurality of audio signals, a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determining a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encoding the audio by allocating a bandwidth amount to each of the plurality of audio signals based on the frequency-dependent and directionality-dependent determined gain, such that the bandwidth is allocated based on the directionality relative to the listener.

[0007] One embodiment provides a computer-readable storage medium storing instructions operable to, when executed: access a window of audio comprising a plurality of audio signals; for each of the plurality of audio signals, determine a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determine a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encode the audio by allocating a bandwidth amount to each of the plurality of audio signals based on the frequency-dependent and directionality-dependent determined gain, such that the bandwidth is allocated based on directionality relative to a listener.

[0008] One embodiment provides a system for encoding audio, comprising: a computer-readable storage medium storing instructions; and one or more processors coupled to the computer-readable storage medium and operable to execute the instructions to: access a window of audio comprising a plurality of audio signals; determine, for each of the plurality of audio signals, a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determine a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encode the audio by allocating an amount of bandwidth to each of the plurality of audio signals based on the frequency-dependent and directionality-dependent determined gain, such that the bandwidth is allocated based on directionality relative to a listener.

[0009] These and other aspects and advantages of one or more embodiments will become apparent from the following detailed description, which, when taken in conjunction with the accompanying drawings, illustrates the principles of one or more embodiments by way of example.

Attached Figure Description

[0010] Figure 1 illustrates an example audibility curve showing how the degree to which a listener perceives a particular tone varies with frequency and sound pressure level (SPL).

[0011] Figure 2 illustrates an example in which the listener's perception of sound loudness is plotted as a function of directionality and signal frequency.

[0012] Figure 3 illustrates an example of audibility based on relative directional changes.

[0013] Figure 4 illustrates an example method for determining the bit rate of an audio codec using directional loudness.

[0014] Figure 5 illustrates an example implementation of the method of Figure 4.

[0015] Figure 6 illustrates example interpolations of gain curves across frequency at two different SPLs.

[0016] Figure 7 illustrates the SPL interpolation of the gain curve.

[0017] Figure 8 illustrates an example gain curve as a function of SPL for a specific frequency and directionality, relative to a reference signal directly in front of the listener.

[0018] Figure 9 illustrates an example computing system.

Detailed Implementation

[0019] Audio codecs are used to compress the bit rate of speech, music, and film content, typically for streaming content over the internet. Audio content can be channel-based, such as stereo, 5.1 (surround sound), or 7.1.4 (immersive with height speakers); however, audio content can also (or alternatively) be represented using channel-independent objects. In either case, audio channels, or audio assigned to specific objects, are encoded and transmitted over the internet for decoding by decoders on consumer devices such as smartphones, televisions, etc. Channel-based audio signals convey a wealth of information, such as music, ambiance, and dialogue, while objects may be relatively sparse throughout the content. However, objects are crucial because they can convey important perceptual cues surrounding the viewer, such as when reconstructing the sound of a helicopter flying by.

[0020] Audio codecs rely on the principle of psychoacoustic masking, where louder sounds essentially mask quieter sounds, and use this principle to encode audio signals. Existing codecs allocate an equal bit rate to each channel or object in the content. This quickly becomes very expensive in terms of bit rate if the number of objects is not limited during content creation. This bit-rate constraint (limiting the number of channels / objects) can hinder the creative expression of cinematic content due to distribution requirements (e.g., low audio data rates). Furthermore, lossy compression techniques will introduce audible artifacts if a high object count must be kept within the data rate requirements of the audio stream (e.g., 256 kbps).

[0021] Audibility, the degree to which a sound can be heard by a human listener, varies with frequency. For example, a frequency of about 3 kHz can be heard at a much lower decibel level than a frequency of about 100 Hz or about 15 kHz. In other words, a 15 kHz signal would require a much higher sound pressure level to be perceived by a human listener as the same volume as a 3 kHz signal.

[0022] The audibility of a particular tone depends not only on its frequency but also on the presence of other tones in the audio signal. Specifically, the audibility of a tone can depend on the frequency and volume of another tone present in the signal. Figure 1 (from Herre, J. et al., “Psychoacoustic models for perceptual audio coding: A tutorial review,” Applied Sciences, Vol. 9, 2019) illustrates an example in which audibility curve 110 shows how the degree to which a listener perceives a particular tone varies as a function of frequency and sound pressure level (SPL). Curve 110 represents the SPL (as a function of frequency) at which a particular sound must be played for the tone to be audible (i.e., the area below curve 110 is inaudible to the listener). Curve 120 illustrates how the presence of the example masking tone 130 alters the degree to which the listener can hear the tone; that is, the tone now needs to be played at the SPL indicated by curve 110, adjusted by curve 120. As Figure 1 illustrates, audibility is frequency-dependent because the required adjustment to curve 110 in the presence of a masking tone varies with the frequency of both the target tone and the masking tone. Furthermore, audibility depends on the SPL of the masking tone.

[0023] As explained more fully below, existing codec compression techniques allocate bits to an audio signal based on its relative perceived loudness (a function of frequency). However, human perception of a signal depends not only on its frequency, its SPL (sound pressure level), and the frequencies and SPLs of any additional signals, but also on the signal's directionality relative to the listener. For example, the perceived loudness of a tone depends on the location of its source relative to the listener, and this directional dependence is also a function of the signal's frequency and SPL, as discussed more thoroughly below with reference to Figure 2. Furthermore, the spatial separation between the target tone and the masking tone affects the audibility of the target tone: a co-located masking tone has a greater impact on the target tone than a masking tone that is spatially and directionally separated from the target tone (i.e., when the target tone source and the masking tone source lie in different directions from the user). For example, in the example of Figure 3, the target tone 310 is heard differently by the listener 305 when accompanied by a masking tone 315 than when accompanied by a masking tone 320, because the masking tone 315 and the masking tone 320 have different directions relative to the listener 305.

[0024] The techniques described herein include a codec compression method that takes directionality into account when allocating bandwidth for compression. In other words, because directionality affects the audibility of audio (and therefore its perception), the compression techniques described herein account for directionality when allocating bandwidth to different parts of the audio.

[0025] Figure 4 illustrates an example method for determining the bit rate of an audio codec using directional loudness. Step 410 of the example method of Figure 4 includes accessing a window of audio containing a plurality of audio signals. When the audio is channelized, the different audio signals can be different channels. These channels can be referred to as frames, such that the audio window can be represented as a vector x = {x_1, ..., x_n}, where x_n refers to the audio in a specific channel (e.g., in a 5.1 system, the index n ranges from 1 to 6). According to embodiments of this disclosure, the audio signals can instead be object-based audio signals, such as audio signals associated with different objects, and these audio signals can move spatially as their corresponding objects move. According to embodiments of this disclosure, the audio window can consist of a set of samples of an analog audio input.

[0026] In certain embodiments, it may be determined whether the pairwise correlation between two individual audio signals exceeds a threshold correlation. Figure 5 shows an example implementation of the method of Figure 4, wherein the input window of audio x is channelized across k frames. In block 508, the pairwise correlation ρ(i, j) between two channels i and j is calculated. Pairwise correlations can be calculated using time analysis or time-frequency analysis. In the example implementation of Figure 5, block 510 compares the pairwise correlation between the two channels with a threshold T. For example, according to an embodiment of this disclosure, T can be 90%. If the pairwise correlation is greater than the threshold, the two input audio signals i and j are treated as fully correlated, and bit allocation is performed in a direction-based manner, as described more fully below. This preprocessing analysis in the sidechain (i.e., calculating the correlation coefficient) is not strictly required; however, because the subjective test results for directional loudness in the example of Figure 2 were obtained using the same stimulus in all directions, the preprocessing step is appropriate. Other embodiments may not determine the pairwise correlation between signals.
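
As an illustration of blocks 508 and 510, the following is a minimal time-domain sketch in Python. It assumes that ρ(i, j) is the Pearson correlation coefficient of the two sample sequences; the text does not specify the correlation measure, and the function names and the default threshold of 0.9 (the 90% example above) are hypothetical choices made for this sketch.

```python
import math


def pairwise_correlation(x_i, x_j):
    """Pearson correlation between two equal-length sample sequences (assumed measure)."""
    n = len(x_i)
    if n == 0 or n != len(x_j):
        raise ValueError("signals must be non-empty and equally long")
    mean_i = sum(x_i) / n
    mean_j = sum(x_j) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(x_i, x_j))
    var_i = sum((a - mean_i) ** 2 for a in x_i)
    var_j = sum((b - mean_j) ** 2 for b in x_j)
    if var_i == 0.0 or var_j == 0.0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_i * var_j)


def use_directional_allocation(x_i, x_j, threshold=0.9):
    """Mirror of block 510: take the direction-based path only above threshold T."""
    return pairwise_correlation(x_i, x_j) > threshold


# Usage: a tone and a scaled copy are fully correlated; a quadrature tone is not.
tone = [math.sin(2 * math.pi * 5 * k / 64) for k in range(64)]
scaled = [0.5 * v for v in tone]
quadrature = [math.cos(2 * math.pi * 5 * k / 64) for k in range(64)]
```

A time-frequency variant would apply the same comparison per frequency band rather than to the raw samples.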

[0027] If the pairwise correlation is not greater than the threshold, existing techniques can be used to perform bit-rate allocation on the input audio, examples of which are described at a high level below. Generally, audio coding can use transform coding or waveform coding, and the techniques of the example of Figure 4 can be applied to any use case. In existing compression methods, the input audio is passed to an MDCT (Modified Discrete Cosine Transform) block 502, which performs time-frequency analysis by transforming the input audio. The input audio (which may include speech, music, and / or other sounds) can be channelized or object-based; for example, in a channelized representation, the input audio can be represented as x_t(k), where the index t refers to the frame (or channel) number and k is the sample index in frame t. For example, if an audio window contains 2N samples in a frame (e.g., N = 1024), then k is an index from 1 to 2048. The value of x_t(k) represents the amplitude of the audio signal at the corresponding sample index in the specified frame.

[0028] The output of MDCT block 502 is the set of MDCT coefficients X_t(m) for a specified channel t and the samples of the input audio. The MDCT formula is:

[0029]

[0030]

[0031] For example, a window is:

[0032]

[0033]

[0034] The MDCT provides a compact representation of the input signal, mapping the 2N input samples in a given frame to N coefficients, which can then be converted to binary format for bit allocation. Continuing with the example of Figure 5, the prior art uses a quantization / encoding block 526 to encode the input MDCT coefficients into a b-bit binary output for each frame. A bit allocation block 532 iteratively determines how many bits to allocate to each frame of a given input audio based on a psychoacoustic (auditory) model 506 and a distortion measurement block 530. The quantization / encoding block 526 can, for example, use a PCM waveform to digitize an analog audio signal, and differential PCM can then be used to create a 2-bit digitization of the analog audio signal. Quantization noise error e exists because the amplitude values are coarsely quantized. Encoding techniques aim to encode the signal using an acceptable number of bits, such that the quantization noise is imperceptible relative to a psychoacoustic hearing threshold (i.e., the sensitivity of human hearing in a quiet environment): too high a bit rate unnecessarily consumes bandwidth, while too low a bit rate introduces audible artifacts, for example because the quantization noise is not well masked.
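
To make the 2N-to-N mapping concrete, the sketch below computes an MDCT for one frame in Python. The formula and window used in the source are not reproduced in this text, so the sketch assumes the common textbook MDCT definition and a sine window; it also uses 0-based sample indices rather than the 1-based index k described above. The function names are hypothetical, and the direct O(N^2) summation is for illustration only.

```python
import math


def sine_window(two_n):
    """Sine window of length 2N (an assumed, commonly used MDCT window)."""
    return [math.sin(math.pi * (k + 0.5) / two_n) for k in range(two_n)]


def mdct(frame):
    """Map 2N windowed time samples to N MDCT coefficients (textbook definition)."""
    two_n = len(frame)
    if two_n == 0 or two_n % 2 != 0:
        raise ValueError("frame length must be a positive even number (2N)")
    n = two_n // 2
    window = sine_window(two_n)
    coeffs = []
    for m in range(n):
        acc = 0.0
        for k in range(two_n):
            phase = (math.pi / n) * (k + 0.5 + n / 2.0) * (m + 0.5)
            acc += window[k] * frame[k] * math.cos(phase)
        coeffs.append(acc)
    return coeffs


# Usage: a 16-sample frame (N = 8) yields 8 coefficients.
frame = [math.sin(2 * math.pi * 3 * k / 16) for k in range(16)]
coefficients = mdct(frame)
```

A production encoder would use an FFT-based MDCT and overlap consecutive frames by N samples so that the time-domain aliasing cancels on reconstruction.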

[0035] In the prior art, the psychoacoustic model 506 outputs a noise-to-mask ratio (NMR), for example, in the frequency band from 20 Hz to 20 kHz. However, the NMR in existing compression techniques only considers a masking threshold based on loudness, such as shown in Figure 1, and does not consider the directionality of the audio signal or how that directionality affects the listener's perception of the audio. In contrast, the techniques described herein can obtain a directionality-based NMR, for example, NMR(θ_i, φ_i) for the i-th audio signal at azimuth angle θ_i and elevation angle φ_i.

[0036] Returning to the prior art, the distortion measurement block 530 calculates the distortion D as given by the following formula:

[0037]

[0038] where x_i and x̂_i represent the i-th unquantized and quantized transform coefficients, respectively, and E[.] is the expectation operator. Let n_i be the number of bits assigned to coefficient x_i for quantization, such that:

[0039]

[0040] Here, N_f represents the total number of transform (e.g., MDCT) coefficients. Bit allocation block 532 can then allocate bits to the audio channels; for example, if x_i is uniformly distributed, then a uniform allocation such as:

[0041]

[0042] can be used. Other allocation schemes can also be used (e.g., Gaussian allocation schemes); Table 3.1 in A. Spanias and T. Painter, Audio Signal Processing and Coding, J. Wiley & Sons, 2007, identifies examples of uniform and Gaussian distribution coefficients for a given input vector. Returning to Figure 5, once the quantization / encoding block 526 encodes each channel (in this example) using the number of bits determined by the bit allocation block 532, the entropy encoding block 528 compresses the bitstream by replacing specific codes in the bitstream with unique codewords before the bitstream is packed and transmitted.
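
The trade-off that blocks 526 and 530 manage, more bits giving less quantization distortion, can be sketched as follows. Because the distortion formula is not reproduced in this text, the sketch assumes that D is the mean squared error between unquantized and quantized coefficients, estimated by a sample average in place of the expectation E[.]. The mid-rise uniform quantizer, the `full_scale` parameter, and the function names are hypothetical choices for illustration, not the quantizer of the described codec.

```python
import math


def quantize_uniform(value, n_bits, full_scale=1.0):
    """Quantize to the centre of one of 2**n_bits equal cells in [-full_scale, full_scale]."""
    levels = 2 ** n_bits
    step = 2.0 * full_scale / levels
    clipped = min(max(value, -full_scale), full_scale)
    index = min(levels - 1, int(math.floor((clipped + full_scale) / step)))
    return -full_scale + (index + 0.5) * step


def distortion(coeffs, quantized):
    """Sample estimate of the mean squared quantization error (assumed form of D)."""
    return sum((a - b) ** 2 for a, b in zip(coeffs, quantized)) / len(coeffs)


# Usage: the same coefficients quantized coarsely (2 bits) and finely (6 bits).
coeffs = [-0.9, -0.3, 0.1, 0.45, 0.8]
coarse = [quantize_uniform(c, 2) for c in coeffs]
fine = [quantize_uniform(c, 6) for c in coeffs]
d_coarse = distortion(coeffs, coarse)
d_fine = distortion(coeffs, fine)
```

An iterative allocator in the style of block 532 would raise the bit count for the coefficients whose quantization noise is least well masked until the total bit budget is reached.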

[0043] Returning to Figure 4, step 420 of the example method includes determining the relative power of each of the individual audio signals relative to a reference audio signal. For example, in the example of Figure 5, block 512 shows the calculation of the power P_i(θ, φ) of signal i, and block 514 shows the calculation of the power P_j(θ, φ) of signal j. For each of the audio signals i and j, the power is determined relative to a reference audio signal. For example, in a channelized representation, the reference audio signal can be the center channel, and the power of channel n can be expressed as:

[0044]

[0045] where x_c is the audio signal frame of the center channel, for example, at (θ, φ) = (0, 0). According to embodiments of this disclosure, other channels or objects can be used as the reference audio signal.
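
The relative-power expression itself is not reproduced in this text. As a hedged illustration of step 420, the Python sketch below assumes that relative power is the ratio of a frame's energy to the energy of the reference (center-channel) frame, expressed in dB. The function name and the small `eps` guard against silent frames are hypothetical.

```python
import math


def relative_power_db(x_n, x_c, eps=1e-12):
    """Energy of frame x_n relative to reference frame x_c, in dB (assumed definition)."""
    energy_n = sum(v * v for v in x_n)
    energy_c = sum(v * v for v in x_c)
    return 10.0 * math.log10((energy_n + eps) / (energy_c + eps))


# Usage: a channel at half the amplitude of the center channel is about 6 dB lower.
center = [1.0, -1.0] * 8
half_amplitude = [0.5 * v for v in center]
level = relative_power_db(half_amplitude, center)
```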

[0046] Step 430 of the example method of Figure 4 includes determining, for each of the audio signals and based on the determined relative power of the signal, a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to the listener. For example, for signal i, the gain determined for the signal is G_i(θ, φ) (block 516), and for signal j, the gain is G_j(θ, φ) (block 518). Figure 2 illustrates examples of how a listener's perception of sound loudness varies as a function of directionality and signal frequency. Graph 210 shows the perception of sound played at 65 dB SPL, and graph 220 shows the perception of sound played at 45 dB SPL. The graphs are plotted against (θ, φ) to illustrate the effect of directionality, where θ represents rotation to the listener's left, ranging from 0 degrees to 180 degrees, and φ represents elevation, from 0 degrees (no elevation) to 90 degrees (directly above the listener). As a result, (0, 0) represents the direction directly in front of the listener, and (180, 0) represents the direction directly behind the listener.

[0047] Figure 2 illustrates the listener's frequency-dependent and direction-dependent sensitivity to sound. The graphs indicate the SPL (in dB) at which certain frequencies must be played, as a function of sound directionality, for human listeners to perceive the sounds as having the same volume. For example, a 5 kHz sound to the left of the listener must be played at a higher SPL than a 5 kHz sound played directly in front of the listener so that each sound is perceived as equally loud. Similarly, a 0.4 kHz sound behind the listener must be played at a higher SPL than a 5 kHz sound played behind the listener so that each sound is perceived as being played at the same volume.

[0048] Figure 2 illustrates the gains that must be applied at 65 dB and 45 dB to signals at 0.4 kHz, 1 kHz, and 5 kHz (based on averaging responses from many listeners), but audio signals may occur at other frequencies and other SPLs. While a more comprehensive repository of listener data could be built, collecting data at every frequency band and every SPL would be resource-intensive. Therefore, according to one embodiment of this disclosure, various aspects of the listener data may be interpolated in order to compress the actual audio signals encountered within a specific audio window. For example, Figure 6 illustrates example interpolations of gain curves across frequency (i.e., the gain at which the same loudness is perceived relative to a reference direction, e.g., (0, 0)) at 65 dB SPL (curve 610) and 45 dB SPL (curve 620). These example gain curves are for a particular direction relative to the user. Figure 6 therefore illustrates gain curves interpolated across a frequency range at a specific SPL and a specific directionality (e.g., at many more frequencies than those identified in the data collected in Figure 2).

[0049] Figure 7 illustrates interpolation of the gain curve across SPL, also for a particular direction relative to the listener. This interpolation can be used for SPLs that differ from those represented in the listener data (e.g., SPLs other than those identified in Figure 2).
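
The two interpolation steps, across frequency (Figure 6) and across SPL (Figure 7), can be sketched as a small lookup in Python. The measured frequencies (0.4, 1, and 5 kHz) and SPLs (45 and 65 dB) come from the description of Figure 2, but every gain value in the table below is a made-up placeholder for a single direction, not data from the figures. The interpolation scheme (linear in log-frequency, then linear in SPL, clamped at the edges of the measured range) and all names are assumptions of this sketch; the source does not state which interpolation is used.

```python
import math

# Frequencies and SPLs at which listener data exists (per the description of Figure 2).
MEASURED_FREQS_HZ = [400.0, 1000.0, 5000.0]
# PLACEHOLDER gains in dB for ONE direction; illustrative numbers only, not measured data.
MEASURED_GAIN_DB = {
    45.0: [1.0, 2.0, 4.0],
    65.0: [0.5, 1.5, 3.0],
}


def _lerp(x, x0, x1, y0, y1):
    """Linear interpolation between (x0, y0) and (x1, y1)."""
    if x1 == x0:
        return y0
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


def gain_vs_frequency(freq_hz, gains_db):
    """Interpolate one SPL's gain curve on a log-frequency axis, clamping at the ends."""
    f = min(max(freq_hz, MEASURED_FREQS_HZ[0]), MEASURED_FREQS_HZ[-1])
    for idx in range(len(MEASURED_FREQS_HZ) - 1):
        f0, f1 = MEASURED_FREQS_HZ[idx], MEASURED_FREQS_HZ[idx + 1]
        if f0 <= f <= f1:
            return _lerp(math.log10(f), math.log10(f0), math.log10(f1),
                         gains_db[idx], gains_db[idx + 1])
    return gains_db[-1]


def interpolated_gain_db(freq_hz, spl_db):
    """Interpolate across frequency first, then across SPL between measured curves."""
    spls = sorted(MEASURED_GAIN_DB)
    s = min(max(spl_db, spls[0]), spls[-1])
    lower = max(v for v in spls if v <= s)
    upper = min(v for v in spls if v >= s)
    g_lower = gain_vs_frequency(freq_hz, MEASURED_GAIN_DB[lower])
    g_upper = gain_vs_frequency(freq_hz, MEASURED_GAIN_DB[upper])
    return _lerp(s, lower, upper, g_lower, g_upper)


# Usage: gain for a 1 kHz component at 55 dB SPL, halfway between the measured curves.
example_gain = interpolated_gain_db(1000.0, 55.0)
```

A full model would hold one such table per direction (θ, φ) and could interpolate across direction in the same way.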

[0050] Figure 8 illustrates an example of how the psychoacoustic directional loudness model 522 determines gain for a specific frequency, directionality, and SPL relative to a reference signal directly in front of the listener. Curves 810 and 820 illustrate the gain required for signals at that location, including a 45 dB signal. As illustrated in Figure 5, the frequency-dependent signal can be calculated using the Fast Fourier Transform (FFT) with a smoothing block 520, for comparison with the frequency-dependent directional loudness curves of Figure 7. Optionally, 1 / N (e.g., N = 3) frequency-domain smoothing can be applied to the FFT output to obtain a smoother response in the frequency-domain representation.
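
A minimal Python sketch of an FFT / smoothing stage like block 520 is shown below. It assumes that the 1 / N smoothing refers to fractional-octave smoothing (1/3 octave for N = 3), implemented here as a plain average over the bins within half a fractional octave on either side of each bin; the source does not spell out the smoothing method. The naive DFT keeps the example dependency-free, and the function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..len(frame)//2 (O(n^2); use an FFT in practice)."""
    n = len(frame)
    mags = []
    for b in range(n // 2 + 1):
        acc = sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        mags.append(abs(acc))
    return mags


def fractional_octave_smooth(mags, fraction=3):
    """Average each bin over a window spanning 1/fraction octave centred on that bin."""
    half_width = 2.0 ** (1.0 / (2.0 * fraction))  # ratio from centre to window edge
    smoothed = []
    for b in range(len(mags)):
        lo = int(math.floor(b / half_width))
        hi = min(len(mags) - 1, int(math.ceil(b * half_width)))
        window = mags[lo:hi + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Usage: a pure tone at bin 4 of a 32-sample frame; smoothing spreads the peak.
frame = [math.cos(2 * math.pi * 4 * k / 32) for k in range(32)]
mags = magnitude_spectrum(frame)
smoothed = fractional_octave_smooth(mags)
```

Because the window width grows with the bin index, high-frequency bins are averaged over more neighbours than low-frequency bins, which roughly follows the ear's frequency resolution.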

[0051] Once appropriate frequency-dependent, SPL-dependent, and direction-dependent gains have been determined for signals i and j (correlated signals in this example) in step 430, step 440 includes encoding the audio by allocating an amount of bandwidth to each audio signal based on its determined frequency-dependent and directionality-dependent gain, such that the bandwidth is allocated based on directionality relative to the listener. For example, with reference to Figure 5, bit allocation block 524 allocates bits based on both the psychoacoustic model 506 described above and directionality. For example, bit allocation block 524 may obtain the converged bit rate of B kbps from bit allocation block 532, and for a reference audio signal (e.g., the center channel at (0, 0) relative to the user), the center-channel bit rate b_c can be set to B / N (where N is the number of audio channels). For other audio signals (e.g., channels), bit allocation block 524 can refine the bit rate using the output of the directional loudness block 522, such that ...

[0052] For example, the output F_i(ω) for channel i from the FFT / smoothing block can be normalized by the RMS level of the signal for the given direction (θ_i, φ_i) of channel i and then compared with the interpolated gain curve of Figure 7 for a given signal level (e.g., 65 dB). MDCT coefficients associated with a frequency band of a given channel i whose values exceed the corresponding curve value are encoded at a higher bit rate than MDCT coefficients whose values do not exceed the curve value. Furthermore, to allocate bits between channels for a given MDCT coefficient, block 524 includes a comparison process that first determines the difference between the gain curve and the FFT / smoothing output of each channel, and then gradually reduces the allocated bits as the difference decreases. For example, the FFT output of the left channel at (30, 0) may have a larger positive difference than that of the channel at (135, 0), so more bits are allocated to the MDCT coefficients in the left channel. Certain embodiments may also use statistical redundancy techniques to reduce the bit rate based on inter-channel analysis. Although Figure 5 illustrates an example in which bit allocation block 524 modifies the output of bit allocation block 532, according to embodiments of this disclosure, the output of directional loudness block 522 may instead be passed directly to bit allocation block 532, which then determines the bit rate based on both the psychoacoustic model 506 and the directional loudness model 522.
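
The inter-channel comparison described for block 524 can be sketched as a simple proportional split in Python. The source only states that allocated bits decrease as the difference between a channel's smoothed spectrum and the gain curve decreases; the specific rule below (a one-bit floor per channel, the remainder split in proportion to each channel's positive difference, rounding leftovers handed to the largest differences first) is an assumption of this sketch, as are the function name and the example numbers.

```python
def allocate_bits(total_bits, differences_db, floor_bits=1):
    """Split total_bits across channels so that a larger difference never gets fewer bits."""
    n = len(differences_db)
    if n == 0 or total_bits < floor_bits * n:
        raise ValueError("need at least floor_bits per channel")
    weights = [max(d, 0.0) for d in differences_db]  # negative differences earn no extra
    spare = total_bits - floor_bits * n
    weight_sum = sum(weights)
    if weight_sum == 0.0:
        shares = [spare / n] * n  # nothing exceeds the curve: split evenly
    else:
        shares = [spare * w / weight_sum for w in weights]
    bits = [floor_bits + int(s) for s in shares]
    # Hand out bits lost to rounding down, largest difference first.
    leftover = total_bits - sum(bits)
    order = sorted(range(n), key=lambda c: differences_db[c], reverse=True)
    for c in order[:leftover]:
        bits[c] += 1
    return bits


# Usage with illustrative (made-up) differences in dB for three channels:
# left at (30, 0), rear at (135, 0), and one channel that does not exceed the curve.
allocation = allocate_bits(24, [6.0, 2.0, 0.0])
```

In this example the left channel, which has the largest positive difference, receives the most bits, matching the (30, 0) versus (135, 0) example in the text.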

[0053] While the above discussion uses the audio channel as an example of audio input, object-based audio signals can be used. For example, the location of the audio object (e.g., determined based on metadata about the object or based on inferred location, such as by analyzing the corresponding video) can be used to identify the directionality of the corresponding audio relative to the user, and that directionality can then be used to determine the appropriate compression for that audio, as described above (e.g., typically, the direction with lower NMR receives relatively more bits).

[0054] Although the above description discusses specific frequencies, the techniques described herein can in practice be applied to a frequency band, where each frequency in the band is treated similarly and the band is generally aligned with human perception of frequency, so that frequencies within the band are perceived similarly.

[0055] The techniques described herein dynamically control the bit rate based on the directional loudness of the audio signal during audio encoding. This results in improved bit allocation based on directional loudness sensitivity, enabling the compression and transmission of more objects or channels at a given total bit rate (e.g., 256 kbps), and / or reducing the bit rate required to compress and transmit a specific number of objects or channels.

[0056] According to embodiments of this disclosure, the characteristics of the playback space can affect a listener's perception of audio content. For example, large rooms and small rooms have different audio characteristics, and rooms with many echoes have different audio characteristics than dry, anechoic rooms. According to embodiments of this disclosure, these characteristics can affect the encoding used to compress audio; that is, compression can be based on the expected (or actual) characteristics of the playback space. These characteristics can be reflected in psychoacoustic models (e.g., model 506). For example, reverberation or reflections can reduce the amount of spatial demasking, because reflections arriving from different directions make it more difficult for the auditory system to separate the channels or objects from one another. For headphone reproduction, content is typically rendered using a simulated head-related transfer function (HRTF), which also reduces the spatial demasking effect.

[0057] The demasking estimate in encoding material intended for use in either or both scenarios can be relaxed based on the estimated energy leakage of reflections (i.e., the masking can be reduced). This estimate can be based on a coarse, general heuristic (e.g., a maximum 3dB of spatial demasking in a typical room) or on a more sophisticated statistical room acoustics / absorption modeling of reflections and reverberation. In the case of real-time streaming, instead of a statistical average estimate, room attributes can be sent upstream to the encoder and used to optimize the audio codec bit allocation specifically for that listening scenario. According to embodiments of this disclosure, categories of room characteristics can be used to define reproduction characteristics, such as "living room" versus "headphones" versus "cinema".

[0058] Regarding extended reality (XR, including augmented reality, virtual reality, and mixed reality) applications, in completely unrestricted XR applications, if the user's viewpoint moves very close to a location, each channel or object of the audio content can potentially be heard in isolation. In such cases, encoding all audio elements individually without attempting to utilize psychoacoustic masking may be advantageous. However, most virtual environments involve some degree of leakage from other audio elements, room effects, and reflections.

[0059] The reproduced scene can be estimated during the encoding stage. If the application is not entirely unrestricted (e.g., the user's range of movement within the environment is limited relative to the source locations), a maximum spatial demasking amount can be set based on this analysis. Such analysis can utilize, for example, estimated properties of the virtual acoustic environment and sound attenuation based on the average distance information between audio elements and the user.
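
One way to picture a distance-based bound is sketched below. It assumes free-field inverse-distance attenuation (about 6 dB per doubling of distance) and treats the level gained when the user moves from the average distance to the closest permitted distance as the maximum spatial demasking amount. That mapping, the function names, and the optional ceiling are assumptions for illustration and are not taken from this disclosure.

```python
import math


def distance_attenuation_db(distance_m, reference_m=1.0):
    """Free-field level drop in dB at distance_m relative to reference_m."""
    if distance_m <= 0 or reference_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_m)


def max_spatial_demasking_db(average_m, closest_m, ceiling_db=None):
    """Assumed cap: level gained moving from average_m to closest_m.

    average_m: average distance between an audio element and the user.
    closest_m: closest distance the user's limited movement allows.
    ceiling_db: optional hypothetical upper limit (e.g., from room effects).
    """
    if closest_m > average_m:
        raise ValueError("closest distance cannot exceed the average distance")
    gain_db = (distance_attenuation_db(average_m)
               - distance_attenuation_db(closest_m))
    if ceiling_db is not None:
        gain_db = min(gain_db, ceiling_db)
    return gain_db


if __name__ == "__main__":
    # Halving the distance from 4 m to 2 m yields roughly 6 dB.
    print(round(max_spatial_demasking_db(4.0, 2.0), 2))
```

A user who cannot approach a source at all (closest distance equal to the average distance) gets a cap of 0 dB under this sketch, which corresponds to no additional isolation of that source.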

[0060] In an embodiment, perceptual importance analysis, taking into account the source orientation attribute or spatial demasking attribute of multimedia content items, can be performed during audio coding. This can be applied to the embodiments described with reference to Figures 4 to 8, and redundant descriptions are omitted below. Bits per channel or per object can be assigned to multimedia content items to maximize audio quality based on the perceptual importance analysis. Listening scenarios including one or more defined room reverberations or reflections can be identified.

[0061] Furthermore, in embodiments, during the refinement process for multimedia reproduction, at least one of the directional loudness or spatial demasking of a multimedia content item can be relaxed based on identifying a listening scenario including one or more defined room reverberations or reflections. During the refinement process for extended reality (XR), the expected amount of at least one of the directional loudness or spatial demasking can be estimated based on one or more XR reproduction environment attributes and one or more user constraints.

[0062] Furthermore, in embodiments, source direction attributes or spatial demasking attributes can be used by a psychoacoustic analysis configured to consider the timing, frequency content, or direction of multimedia content items. When a listening scenario is identified, at least one of the directional loudness or spatial demasking of the multimedia content items can be relaxed. The one or more defined room reverberations or reflections can be predetermined.

[0063] Figure 9 illustrates an example computer system 900. According to one embodiment of this disclosure, one or more computer systems 900 perform one or more steps of one or more methods described or illustrated herein. According to one embodiment of this disclosure, one or more computer systems 900 provide the functionality described or illustrated herein. According to one embodiment of this disclosure, software running on one or more computer systems 900 performs one or more steps of one or more methods described or illustrated herein, or provides the functionality described or illustrated herein. Specific embodiments include one or more portions of one or more computer systems 900. Throughout this document, references to computer systems may include computing devices and vice versa, where appropriate. Furthermore, references to computer systems may include one or more computer systems, where appropriate.

[0064] This disclosure contemplates any suitable number of computer systems 900. This disclosure contemplates computer systems 900 in any suitable physical form. By way of example, and not limitation, computer system 900 may be an embedded computer system, a system-on-a-chip (SOC), a single-board computer system (SBC) (e.g., a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a computer system grid, a mobile phone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 900 may include one or more computer systems 900; may be single or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 900 may perform one or more steps of one or more methods described or illustrated herein without substantial spatial or temporal limitations. By way of example, and not limitation, one or more computer systems 900 may perform one or more steps of one or more methods described or illustrated herein in real time or in batch mode. Where appropriate, one or more computer systems 900 may perform one or more steps of the methods described or illustrated herein at different times or in different locations.

[0065] According to one embodiment of this disclosure, computer system 900 includes processor 902, memory 904, storage device 906, input/output (I/O) interface 908, communication interface 910, and bus 912. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.

[0066] According to one embodiment of this disclosure, processor 902 includes hardware for executing instructions (e.g., instructions constituting a computer program). By way of example and not limitation, in order to execute instructions, processor 902 may retrieve (or fetch) instructions from internal registers, internal caches, memory 904, or storage device 906; decode and execute the instructions; and then write one or more results to internal registers, internal caches, memory 904, or storage device 906. According to one embodiment of this disclosure, processor 902 may include one or more internal caches for data, instructions, or addresses. Where appropriate, this disclosure contemplates processor 902 including any suitable number of suitable internal caches. By way of example and not limitation, processor 902 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction cache may be copies of instructions in memory 904 or storage device 906, and the instruction cache may accelerate the retrieval of these instructions by processor 902. The data in the data cache may be a copy of data in memory 904 or storage device 906 for operation by instructions executed at processor 902; the results of previous instructions executed at processor 902, held for access by subsequent instructions executed at processor 902 or for writing to memory 904 or storage device 906; or other suitable data. The data cache can accelerate read or write operations performed by processor 902. The TLB can accelerate virtual address translation by processor 902. According to one embodiment of this disclosure, processor 902 may include one or more internal registers for data, instructions, or addresses. Where appropriate, this disclosure contemplates processor 902 including any suitable number of suitable internal registers. Where appropriate, processor 902 may include one or more arithmetic logic units (ALUs); may be a multi-core processor; or may include one or more processors 902. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.

[0067] According to one embodiment of this disclosure, memory 904 includes main memory for storing instructions to be executed by processor 902 or data to be operated on by processor 902. By way of example and not limitation, computer system 900 may load instructions from storage device 906 or another source (e.g., another computer system 900) into memory 904. Processor 902 may then load instructions from memory 904 into internal registers or internal cache. To execute instructions, processor 902 may retrieve instructions from internal registers or internal cache and decode them. During or after instruction execution, processor 902 may write one or more results (which may be intermediate or final results) to internal registers or internal cache. Processor 902 may then write one or more of these results to memory 904. According to one embodiment of this disclosure, processor 902 executes only the instructions in one or more internal registers or internal caches or memory 904 (and not storage device 906 or elsewhere), and operates only on the data in one or more internal registers or internal caches or memory 904 (and not storage device 906 or elsewhere). One or more memory buses (each of which may include an address bus and a data bus) may couple processor 902 to memory 904. Bus 912 may include one or more memory buses as described below. According to one embodiment of this disclosure, one or more memory management units (MMUs) reside between processor 902 and memory 904 and facilitate access to memory 904 requested by processor 902. According to one embodiment of this disclosure, memory 904 includes random access memory (RAM). Where appropriate, the RAM may be volatile memory. Where appropriate, the RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Furthermore, where appropriate, the RAM may be single-port or multi-port RAM. This disclosure contemplates any suitable RAM. Where appropriate, memory 904 may include one or more memories 904. Although this disclosure describes and illustrates specific memories, this disclosure contemplates any suitable memory.

[0068] According to one embodiment of this disclosure, storage device 906 includes a mass storage device for data or instructions. By way of example and not limitation, storage device 906 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk drive, a magneto-optical disk drive, magnetic tape, a universal serial bus (USB) drive, or a combination of two or more of these. Where appropriate, storage device 906 may include removable or non-removable (or fixed) media. Where appropriate, storage device 906 may be internal or external to computer system 900. According to one embodiment of this disclosure, storage device 906 is a non-volatile solid-state memory. According to one embodiment of this disclosure, storage device 906 includes read-only memory (ROM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these. This disclosure contemplates mass storage device 906 in any suitable physical form. Where appropriate, storage device 906 may include one or more storage control units that facilitate communication between processor 902 and storage device 906. Where appropriate, storage device 906 may include one or more storage devices 906. Although this disclosure describes and illustrates particular storage devices, this disclosure contemplates any suitable storage device.

[0069] According to one embodiment of this disclosure, I/O interface 908 includes hardware, software, or both, providing one or more interfaces for communication between computer system 900 and one or more I/O devices. Where appropriate, computer system 900 may include one or more of these I/O devices. One or more of these I/O devices can enable communication between a person and computer system 900. By way of example and not limitation, I/O devices may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet computer, touchscreen, trackball, video camera, another suitable I/O device, or a combination of two or more of these. I/O devices may include one or more sensors. This disclosure contemplates any suitable I/O device and any suitable I/O interface 908 for them. Where appropriate, I/O interface 908 may include one or more device or software drivers that enable processor 902 to drive one or more of these I/O devices. Where appropriate, I/O interface 908 may include one or more I/O interfaces 908. Although this disclosure describes and illustrates specific I/O interfaces, this disclosure contemplates any suitable I/O interface.

[0070] According to one embodiment of this disclosure, communication interface 910 includes hardware, software, or both, providing one or more interfaces for communication (e.g., packet-based communication) between computer system 900 and one or more other computer systems 900 or one or more networks. By way of example and not limitation, communication interface 910 may include a network interface controller (NIC) or network adapter for communicating with Ethernet or other wired networks, or a wireless NIC (WNIC) or wireless adapter for communicating with wireless networks (e.g., Wi-Fi networks). This disclosure contemplates any suitable network and any suitable communication interface 910 used therefor. By way of example and not limitation, computer system 900 may communicate with one or more portions of an ad hoc network, personal area network (PAN), local area network (LAN), wide area network (WAN), metropolitan area network (MAN), or the Internet, or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 900 may communicate with a wireless PAN (WPAN) (e.g., Bluetooth WPAN), a Wi-Fi network, a WiMAX network, a cellular telephone network (e.g., a Global System for Mobile Communications (GSM) network), or other suitable wireless networks, or a combination of two or more of these. Where appropriate, computer system 900 may include any suitable communication interface 910 for any of these networks. Where appropriate, communication interface 910 may include one or more communication interfaces 910. Although this disclosure describes and illustrates specific communication interfaces, this disclosure contemplates any suitable communication interface.

[0071] According to one embodiment of this disclosure, bus 912 includes hardware, software, or both, coupling components of computer system 900 to each other. By way of example and not limitation, bus 912 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 912 may include one or more buses 912. Although this disclosure describes and illustrates specific buses, this disclosure contemplates any suitable bus or interconnect.

[0072] In this document, computer-readable non-transitory storage media may include one or more semiconductor-based or other integrated circuits (ICs) (e.g., field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs)), hard disk drives (HDDs), hybrid hard disk drives (HHDs), optical disks, optical disk drives (ODDs), magneto-optical disks, magneto-optical drives, floppy disks, floppy disk drives (FDDs), magnetic tape, solid-state drives (SSDs), RAM drives, secure digital cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these. Where appropriate, computer-readable non-transitory storage media may be volatile, non-volatile, or a combination of volatile and non-volatile.

[0073] In this document, unless otherwise expressly indicated or the context otherwise indicates, “or” is inclusive rather than exclusive. Therefore, in this document, unless otherwise expressly indicated or the context otherwise indicates, “A or B” means “A, B, or both”. Furthermore, unless otherwise expressly indicated or the context otherwise indicates, “and” is both joint and several. Therefore, in this document, unless otherwise expressly indicated or the context otherwise indicates, “A and B” means “A and B, jointly or severally”.

[0074] The scope of this disclosure covers all changes, substitutions, variations, alterations, and modifications to the exemplary embodiments described or illustrated herein that will be understood by those skilled in the art. The scope of this disclosure is not limited to the exemplary embodiments described or illustrated herein. Furthermore, although various embodiments are described or illustrated herein as including specific components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or arrangement of any components, elements, features, functions, operations, or steps described or illustrated anywhere herein that will be understood by those skilled in the art.

Claims

1. A method for encoding audio, comprising: accessing (410) a window containing a plurality of audio signals; for each of the plurality of audio signals, determining (420) a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determining (430) a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encoding (440) the audio by allocating bandwidth to each of the plurality of audio signals based on the determined frequency-dependent and directionality-dependent gain, such that the bandwidth is allocated based on the directionality relative to the listener.

2. The method of claim 1, further comprising: before determining (420) the relative power of each of two audio signals of the plurality of audio signals, determining whether a pairwise correlation between the two audio signals exceeds a threshold correlation.

3. The method of claim 1 or 2, wherein each of the plurality of audio signals is associated with a specific object.

4. The method of any one of claims 1 to 3, wherein the plurality of audio signals comprises a plurality of audio frames, each audio frame corresponding to a separate audio channel.

5. The method of any one of claims 1 to 4, wherein the reference audio signal comprises a center channel audio signal.

6. The method of any one of claims 1 to 5, wherein the gain is determined at least in part based on interpolation of gain curves for one or more of (1) frequency or (2) sound pressure level.

7. The method of any one of claims 1 to 6, wherein encoding the audio such that bandwidth is allocated based on directionality relative to the listener comprises adjusting a bit allocation determined by a psychoacoustic model based on loudness.

8. The method of any one of claims 1 to 7, further comprising: adjusting the bandwidth of one or more of the plurality of audio signals based on one or more characteristics of a reproduction space used to play the audio signals.

9. The method of any one of claims 1 to 8, further comprising: adjusting the bandwidth of one or more of the plurality of audio signals based on a limited range of movement of the listener in an extended reality environment.

10. A computer-readable storage medium storing instructions operable, when executed, to: access a window containing a plurality of audio signals; for each of the plurality of audio signals, determine a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determine a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encode the audio by allocating bandwidth to each of the plurality of audio signals based on the determined frequency-dependent and directionality-dependent gain, such that the bandwidth is allocated based on the directionality relative to the listener.

11. The medium of claim 10, wherein the instructions are further operable, when executed, to determine whether a pairwise correlation between two audio signals of the plurality of audio signals exceeds a threshold correlation before determining the relative power of each of the two audio signals.

12. The medium of claim 10 or 11, wherein each of the plurality of audio signals is associated with a specific object.

13. The medium of any one of claims 10 to 12, wherein the gain is determined at least in part based on interpolation of gain curves for one or more of (1) frequency or (2) sound pressure level.

14. The medium of any one of claims 10 to 13, wherein encoding the audio such that bandwidth is allocated based on directionality relative to the listener comprises adjusting a bit allocation determined by a psychoacoustic model based on loudness.

15. A system (900) for encoding audio, comprising: a computer-readable storage medium storing instructions; and one or more processors (902) coupled to the computer-readable storage medium and operable to execute the instructions to: access a window containing a plurality of audio signals; for each of the plurality of audio signals, determine a relative power of the audio signal relative to a reference audio signal; for each of the plurality of audio signals, and based on the determined relative power of the audio signal, determine a gain to be applied to the audio signal relative to the reference audio signal, wherein the gain depends on frequency and on directionality relative to a listener; and encode the audio by allocating bandwidth to each of the plurality of audio signals based on the determined frequency-dependent and directionality-dependent gain, such that the bandwidth is allocated based on the directionality relative to the listener.