Spatial audio for interactive audio environments
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MAGIC LEAP INC
- Filing Date
- 2026-01-26
- Publication Date
- 2026-07-02
Abstract
Description
Technical Field
[0001] (Cross-reference to related applications) This application claims the benefit of priority to U.S. Provisional Application No. 62/686,655, filed on June 18, 2018, the content of which is hereby incorporated by reference in its entirety. This application further claims the benefit of priority to U.S. Provisional Application No. 62/686,665, filed on June 18, 2018, the content of which is hereby incorporated by reference in its entirety.
[0002] The present disclosure generally relates to spatial audio rendering, and more particularly to spatial audio rendering for virtual sound sources within a virtual acoustic environment.
Background Art
[0003] Virtual environments are ubiquitous in computing environments and find use in video games (where the virtual environment can represent a game world), maps (where the virtual environment can represent terrain to be navigated), simulations (where the virtual environment can simulate a real-world environment), digital storytelling (where virtual characters can interact with each other within the virtual environment), and many other applications. Modern computer users are generally comfortable perceiving and interacting with virtual environments. However, the user experience with virtual environments can be limited by the technology used to present the virtual environment. For example, conventional displays (e.g., 2D display screens) and audio systems (e.g., fixed speakers) may not be able to realize a virtual environment in a way that creates a compelling, realistic, and immersive experience.
[0004] Virtual reality ("VR"), augmented reality ("AR"), mixed reality ("MR"), and related technologies (collectively, "XR") share the ability to present users of XR systems with sensory information corresponding to a virtual environment represented by data within a computer system. Such systems can provide a uniquely enhanced sense of immersion and presence by combining virtual visual and audio cues with real-world sights and sounds. It may therefore be desirable to present digital sounds to users of XR systems so that the sounds appear to occur naturally and consistently with what the user expects to hear in the user's real environment. Generally speaking, users expect virtual sounds to have the acoustic properties of the real environment in which they are heard. For example, a user of an XR system in a large concert hall might expect the virtual sounds of the XR system to have large, cavernous sonic qualities, while a user in a small apartment might expect the sounds to be more dampened, close, and immediate.
[0005] Digital, or artificial, reverberators can be used in audio and music signal processing to simulate the perceived effects of diffuse acoustic reflections in a room. In XR environments, it is desirable to use digital reverberators to realistically simulate the acoustic properties of a room within the XR environment. A convincing simulation of such acoustic properties can lend the XR environment believability and a sense of immersion.
Summary of the Invention
Means for Solving the Problem
[0006] Systems and methods for presenting an output audio signal to a listener located at a first location within a virtual environment are disclosed. According to one embodiment of a method, an input audio signal is received. For each sound source of a plurality of sound sources in the virtual environment, a respective first intermediate audio signal corresponding to the input audio signal is determined based on the location of the respective sound source in the virtual environment, and the respective first intermediate audio signal is associated with a first bus. For each sound source of the plurality of sound sources in the virtual environment, a respective second intermediate audio signal is determined. The respective second intermediate audio signal corresponds to a reflection of the input audio signal on a surface of the virtual environment. The respective second intermediate audio signal is determined based on the location of the respective sound source and further based on the acoustic properties of the virtual environment. The respective second intermediate audio signal is associated with a second bus. The output audio signal is presented to the listener via the first bus and the second bus.
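The two-bus flow summarized above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the inverse-distance gain, and the single scalar `reflection_gain` standing in for "the acoustic properties of the virtual environment" are all assumptions made for illustration only.

```python
import math

def direct_gain(distance, reference=1.0):
    """Hypothetical inverse-distance gain for the direct path (an assumption)."""
    return reference / max(distance, reference)

def render_two_buses(input_signal, source_positions, listener_position,
                     reflection_gain):
    """Accumulate per-source direct and reflection signals onto two buses."""
    n = len(input_signal)
    first_bus = [0.0] * n    # receives the per-source direct-path signals
    second_bus = [0.0] * n   # receives the per-source reflection signals
    for position in source_positions:
        distance = math.dist(position, listener_position)
        g_direct = direct_gain(distance)
        # Stand-in for a reflection model driven by room acoustic properties.
        g_reflect = reflection_gain * g_direct
        for i, sample in enumerate(input_signal):
            first_bus[i] += g_direct * sample
            second_bus[i] += g_reflect * sample
    return first_bus, second_bus

def present(first_bus, second_bus):
    """Combine both buses into the output signal presented to the listener."""
    return [a + b for a, b in zip(first_bus, second_bus)]
```

In this sketch each source contributes once to each bus, so any processing placed after a bus (for example, a shared reflection or reverberation stage) would run once per bus rather than once per source.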
This specification also provides, for example, the following items:
(Item 1) A method of presenting an output audio signal to a listener located at a first location within a virtual environment, the method comprising: receiving an input audio signal; for each sound source of a plurality of sound sources in the virtual environment: determining, based on the location of the respective sound source within the virtual environment, a respective first intermediate audio signal corresponding to the input audio signal; associating the respective first intermediate audio signal with a first bus; determining a respective second intermediate audio signal based on the location of the respective sound source and further based on the acoustic properties of the virtual environment, wherein the respective second intermediate audio signal corresponds to a reflection of the input audio signal on a surface of the virtual environment; and associating the respective second intermediate audio signal with a second bus; and presenting the output audio signal to the listener via the first bus and the second bus.
(Item 2) The method according to item 1, wherein the acoustic properties of the virtual environment are determined via one or more sensors associated with the listener.
(Item 3) The method according to item 2, wherein the one or more sensors comprise one or more microphones.
(Item 4) The method according to item 2, wherein the one or more sensors are associated with a wearable head unit configured to be worn by the listener, and the output audio signal is presented to the listener via one or more speakers associated with the wearable head unit.
(Item 5) The method according to item 4, wherein the wearable head unit comprises a display configured to present a view of the virtual environment to the listener concurrently with the presentation of the output audio signal.
(Item 6) The method according to item 4, further comprising retrieving the acoustic properties from a database, wherein the acoustic properties include acoustic properties determined via one or more sensors of the wearable head unit.
(Item 7) The method according to item 6, wherein retrieving the acoustic properties comprises: determining the location of the listener based on the output of the one or more sensors; and identifying the acoustic properties based on the location of the listener.
(Item 8) A wearable device comprising: a display configured to present a view of a virtual environment; one or more sensors; one or more speakers; and one or more processors configured to perform a method comprising: receiving an input audio signal; for each sound source of a plurality of sound sources in the virtual environment: determining, based on the location of the respective sound source within the virtual environment, a respective first intermediate audio signal corresponding to the input audio signal; associating the respective first intermediate audio signal with a first bus; determining a respective second intermediate audio signal based on the location of the respective sound source and further based on the acoustic properties of the virtual environment, wherein the respective second intermediate audio signal corresponds to a reflection of the input audio signal on a surface within the virtual environment; and associating the respective second intermediate audio signal with a second bus; and presenting the output audio signal to a listener via the one or more speakers and via the first bus and the second bus.
(Item 9) The wearable device according to item 8, wherein the acoustic properties of the virtual environment are determined via the one or more sensors.
(Item 10) The wearable device according to item 8, wherein the one or more sensors comprise one or more microphones.
(Item 11) The wearable device according to item 8, wherein the method further comprises presenting a view of the virtual environment via the display concurrently with the presentation of the output audio signal.
(Item 12) The wearable device according to item 8, wherein the method further comprises retrieving the acoustic properties from a database, wherein the acoustic properties include acoustic properties determined via the one or more sensors.
(Item 13) The wearable device according to item 12, wherein retrieving the acoustic properties comprises: determining the location of the listener based on the output of the one or more sensors; and identifying the acoustic properties based on the location of the listener.
(Item 14) The method according to item 1, further comprising, for each sound source of the plurality of sound sources in the virtual environment: determining a respective third intermediate audio signal based on the location of the respective sound source and further based on a second acoustic property of the virtual environment, wherein the respective third intermediate audio signal corresponds to a reverberation of the input audio signal in the virtual environment; and associating the respective third intermediate audio signal with the second bus, wherein: the second bus comprises a reflection bus and a reverberation bus; associating the respective second intermediate audio signal with the second bus includes associating the respective second intermediate audio signal with the reflection bus; and associating the respective third intermediate audio signal with the second bus includes associating the respective third intermediate audio signal with the reverberation bus.
(Item 15) The wearable device according to item 8, wherein the method further comprises, for each sound source of the plurality of sound sources in the virtual environment: determining a respective third intermediate audio signal based on the location of the respective sound source and further based on a second acoustic property of the virtual environment, wherein the respective third intermediate audio signal corresponds to a reverberation of the input audio signal in the virtual environment; and associating the respective third intermediate audio signal with the second bus, wherein: the second bus comprises a reflection bus and a reverberation bus; associating the respective second intermediate audio signal with the second bus includes associating the respective second intermediate audio signal with the reflection bus; and associating the respective third intermediate audio signal with the second bus includes associating the respective third intermediate audio signal with the reverberation bus.
(Item 16) The method according to item 1, wherein determining the respective first intermediate audio signal comprises applying a first respective filter to the input audio signal, the first respective filter comprising one or more of a sound source directivity model, a distance model, and an orientation model.
(Item 17) The method according to item 16, wherein determining the respective first intermediate audio signal further comprises applying one or more of a respective gain and a respective panning process to the input audio signal.
(Item 18) The method according to item 17, wherein the respective panning process includes panning the input audio signal based on the geometry of a loudspeaker array.
(Item 19) The method according to item 1, wherein determining the respective second intermediate audio signal comprises applying a second respective filter to the input audio signal, the second respective filter comprising a sound source directivity model.
(Item 20) The method according to item 19, wherein determining the respective second intermediate audio signal further comprises applying one or more of a respective delay, a respective gain, and a respective panning process to the input audio signal.
(Item 21) The method according to item 20, wherein the respective panning process comprises encoding the input audio signal into an ambisonic signal having three channels.
(Item 22) The method according to item 20, wherein the respective panning process includes panning the reflection of the input audio signal based on one or more of an azimuth angle and a spatial focus parameter.
(Item 23) The wearable device according to item 8, wherein determining the respective first intermediate audio signal comprises applying a first respective filter to the input audio signal, the first respective filter comprising one or more of a sound source directivity model, a distance model, and an orientation model.
(Item 24) The wearable device according to item 23, wherein determining the respective first intermediate audio signal further comprises applying one or more of a respective gain and a respective panning process to the input audio signal.
(Item 25) The wearable device according to item 24, wherein the respective panning process includes panning the input audio signal based on the geometry of a loudspeaker array.
(Item 26) The wearable device according to item 8, wherein determining the respective second intermediate audio signal comprises applying a second respective filter to the input audio signal, the second respective filter comprising a sound source directivity model.
(Item 27) The wearable device according to item 26, wherein determining the respective second intermediate audio signal further includes applying one or more of a respective delay, a respective gain, and a respective panning process to the input audio signal.
(Item 28) The wearable device according to item 27, wherein the respective panning process comprises encoding the input audio signal into an ambisonic signal having three channels.
(Item 29) The wearable device according to item 27, wherein the respective panning process includes panning the reflection of the input audio signal based on one or more of an azimuth angle and a spatial focus parameter.
Brief Description of the Drawings
[0007] [Figure 1] Figure 1 illustrates an exemplary wearable system, according to some embodiments.
[0008] [Figure 2] Figure 2 illustrates an exemplary handheld controller that can be used in conjunction with an exemplary wearable system, according to some embodiments.
[0009] [Figure 3] Figure 3 illustrates an exemplary auxiliary unit that can be used in conjunction with an exemplary wearable system, according to some embodiments.
[0010] [Figure 4] Figure 4 illustrates an exemplary functional block diagram of an exemplary wearable system, according to some embodiments.
[0011] [Figure 5] Figure 5 illustrates an exemplary geometric room representation, according to some embodiments.
[0012] [Figure 6] Figure 6 illustrates an exemplary model of a room response measured from a source in a room to the listener, according to some embodiments.
[0013] [Figure 7] Figure 7 illustrates exemplary factors that influence the user's perception of direct sound, reflections, and reverberation, according to some embodiments.
[0014] [Figure 8] Figure 8 illustrates an exemplary audio mixing architecture for rendering multiple virtual sound sources within a virtual room, according to some embodiments.
[0015] [Figure 9] Figure 9 illustrates an exemplary audio mixing architecture for rendering multiple virtual sound sources within a virtual room, according to some embodiments.
[0016] [Figure 10] Figure 10 illustrates an exemplary per-source processing module, according to some embodiments.
[0017] [Figure 11] Figure 11 illustrates an exemplary per-source reflections pan module, according to some embodiments.
[0018] [Figure 12] Figure 12 illustrates an exemplary room processing algorithm, according to some embodiments.
[0019] [Figure 13] Figure 13 illustrates an exemplary reflections module, according to some embodiments.
[0020] [Figure 14] Figure 14 illustrates an exemplary spatial distribution of the apparent directions of arrival of reflections, according to some embodiments.
[0021] [Figure 15] Figure 15 illustrates examples of direct gain, reflection gain, and reverberation gain as functions of distance, according to some embodiments.
[0022] [Figure 16] Figure 16 illustrates an exemplary relationship between distance and spatial focus, according to some embodiments.
[0023] [Figure 17] Figure 17 illustrates an exemplary relationship between time and signal amplitude, according to some embodiments.
[0024] [Figure 18] Figure 18 illustrates an exemplary system for processing spatial audio, according to some embodiments.
Modes for Carrying Out the Invention
[0025] In the following description of the embodiments, reference is made to the accompanying drawings, which form part of this specification and which show, by way of illustration, specific embodiments that can be practiced. It should be understood that other embodiments may also be used, and structural modifications may be made, without departing from the scope of the disclosed embodiments.
[0026] Exemplary wearable systems
[0027] Figure 1 illustrates an exemplary wearable head device 100 configured to be worn on the user's head. The wearable head device 100 may be part of a broader wearable system comprising one or more components such as a head device (e.g., the wearable head device 100), a handheld controller (e.g., the handheld controller 200 described below), and/or an auxiliary unit (e.g., the auxiliary unit 300 described below). In some embodiments, the wearable head device 100 may be used for virtual reality, augmented reality, or mixed reality systems or applications. The wearable head device 100 includes one or more displays, such as displays 110A and 110B (which may comprise left and right transmissive displays and associated components for coupling light from the displays to the user's eyes, such as orthogonal pupil expansion (OPE) grating sets 112A/112B and exit pupil expansion (EPE) grating sets 114A/114B), and left and right acoustic structures, such as speakers 120A and 120B (which may be mounted on the temple arms 122A and 122B and positioned adjacent to the user's left and right ears, respectively). The wearable head device 100 may include one or more sensors such as infrared sensors, accelerometers, GPS units, inertial measurement units (IMUs) (e.g., IMU 126), and acoustic sensors (e.g., microphone 150); an orthogonal coil electromagnetic receiver (e.g., receiver 127, shown mounted on the left temple arm 122A); left and right cameras oriented away from the user (e.g., depth (time-of-flight) cameras 130A and 130B); and left and right eye cameras oriented towards the user (e.g., for detecting the user's eye movements) (e.g., eye cameras 128 and 128B). 
However, the wearable head device 100 may incorporate any suitable display technology and any suitable number, type, or combination of sensors or other components without departing from the scope of the present invention. In some embodiments, the wearable head device 100 may incorporate one or more microphones 150 configured to detect audio signals generated by the user's voice, and such microphones may be positioned within the wearable head device adjacent to the user's mouth. In some embodiments, the wearable head device 100 may incorporate networking features (e.g., Wi-Fi capability) to communicate with other devices and systems, including other wearable systems. The wearable head device 100 may further include components such as a battery, processor, memory, storage unit, or various input devices (e.g., buttons, touchpads), or may be coupled to a handheld controller (e.g., handheld controller 200) or auxiliary unit (e.g., auxiliary unit 300) having one or more such components. In some embodiments, the sensors may be configured to output a set of coordinates of the head-mounted unit relative to the user's environment, providing input to a processor that may perform simultaneous localization and mapping (SLAM) procedures and/or visual odometry algorithms. In some embodiments, the wearable head device 100 may be coupled to a handheld controller 200 and/or an auxiliary unit 300, as further described below.
[0028] Figure 2 illustrates an exemplary mobile handheld controller component 200 of an exemplary wearable system. In some embodiments, the handheld controller 200 may be connected, by wire or wirelessly, to a wearable head device 100 and/or an auxiliary unit 300 as described below. In some embodiments, the handheld controller 200 includes a handle portion 220 to be held by the user and one or more buttons 240 arranged along the top surface 210. In some embodiments, the handheld controller 200 may be configured for use as an optical tracking target; for example, sensors of the wearable head device 100 (e.g., a camera or other optical sensor) may be configured to detect the position and/or orientation of the handheld controller 200, which in turn may indicate the position and/or orientation of the user's hand holding the handheld controller 200. In some embodiments, the handheld controller 200 may include a processor, memory, storage unit, display, or one or more input devices such as those described above. In some embodiments, the handheld controller 200 includes one or more sensors (e.g., any of the sensors or tracking components described above with respect to the wearable head device 100). In some embodiments, the sensors can detect the position or orientation of the handheld controller 200 relative to the wearable head device 100 or to another component of the wearable system. In some embodiments, the sensors may be positioned within the handle portion 220 of the handheld controller 200 and/or mechanically coupled to the handheld controller. The handheld controller 200 may be configured to provide one or more output signals corresponding, for example, to the pressed state of a button 240, or to the position, orientation, and/or movement of the handheld controller 200 (e.g., via an IMU). 
Such output signals may be used as inputs to the processor of the wearable head device 100, to an auxiliary unit 300, or to another component of the wearable system. In some embodiments, the handheld controller 200 may include one or more microphones to detect sounds (e.g., user speech, ambient sounds) and, in some cases, to provide a signal corresponding to the detected sound to a processor (e.g., the processor of the wearable head device 100).
[0029] Figure 3 illustrates an exemplary auxiliary unit 300 of an exemplary wearable system. In some embodiments, the auxiliary unit 300 may be connected, by wire or wirelessly, to the wearable head device 100 and/or handheld controller 200. The auxiliary unit 300 may include a battery to provide energy to operate one or more components of the wearable system, such as the wearable head device 100 and/or handheld controller 200 (including a display, sensors, acoustic structures, a processor, a microphone, and/or other components of the wearable head device 100 or handheld controller 200). In some embodiments, the auxiliary unit 300 may include a processor, memory, a storage unit, a display, one or more input devices, and/or one or more sensors, such as those described above. In some embodiments, the auxiliary unit 300 includes a clip 310 for attaching the auxiliary unit to the user (e.g., to a belt worn by the user). An advantage of using the auxiliary unit 300 to house one or more components of a wearable system is that doing so may allow large or heavy components to be carried on the user's waist, chest, or back, which are relatively well suited to supporting large and heavy objects, rather than being mounted on the user's head (for example, when housed in a wearable head device 100) or carried by the user's hands (for example, when housed in a handheld controller 200). This may be particularly advantageous for relatively heavy or bulky components such as batteries.
[0030] Figure 4 shows an exemplary functional block diagram that may correspond to an exemplary wearable system 400, which may include the exemplary wearable head device 100, handheld controller 200, and auxiliary unit 300 described above. In some embodiments, the wearable system 400 may be used for virtual reality, augmented reality, or mixed reality applications. As shown in Figure 4, the wearable system 400 may include an exemplary handheld controller 400B, referred to herein as a “totem” (which may correspond to the handheld controller 200 described above), and the handheld controller 400B may include a totem-to-headgear six-degree-of-freedom (6DOF) totem subsystem 404A. The wearable system 400 may also include an exemplary wearable head device 400A (which may correspond to the wearable head device 100 described above), and the wearable head device 400A may include a totem-to-headgear 6DOF headgear subsystem 404B. In some embodiments, the 6DOF totem subsystem 404A and the 6DOF headgear subsystem 404B cooperate to determine six coordinates (e.g., offsets in three translation directions and rotations about three axes) of the handheld controller 400B relative to the wearable head device 400A. The six degrees of freedom may be expressed relative to the coordinate system of the wearable head device 400A. The three translation offsets may be expressed as X, Y, and Z offsets, a translation matrix, or some other representation within such a coordinate system. The rotational degrees of freedom may be expressed as a sequence of yaw, pitch, and roll rotations, a vector, a rotation matrix, a quaternion, or some other representation. 
In some embodiments, one or more depth cameras 444 (and/or one or more non-depth cameras) and/or one or more optical targets (e.g., buttons 240 on the handheld controller 200 or dedicated optical targets included within the handheld controller, as described above) can be used for 6DOF tracking. In some embodiments, the handheld controller 400B may include a camera as described above, and the wearable head device 400A may include an optical target for optical tracking in conjunction with the camera. In some embodiments, the wearable head device 400A and the handheld controller 400B each include a set of three orthogonally oriented solenoids, which are used to wirelessly transmit and receive three distinguishable signals. The 6DOF of the handheld controller 400B relative to the wearable head device 400A may be determined by measuring the relative magnitudes of the three distinguishable signals received in each of the coils used for receiving. In some embodiments, the 6DOF totem subsystem 404A may include an inertial measurement unit (IMU), which is useful for providing improved accuracy and/or more timely information regarding rapid movements of the handheld controller 400B.
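As an illustration of the alternative rotation representations mentioned above, the following sketch stores a 6DOF pose as three offsets plus yaw, pitch, and roll, and converts the rotational part to a rotation matrix and to a quaternion. The `Pose6DOF` type, the function names, and the Z-Y-X (yaw, then pitch, then roll) rotation order are assumptions chosen for the example; the source does not specify a convention.

```python
import math
from dataclasses import dataclass

@dataclass
class Pose6DOF:
    """Hypothetical container: three translation offsets and three angles (radians)."""
    x: float
    y: float
    z: float
    yaw: float    # rotation about Z (assumed convention)
    pitch: float  # rotation about Y
    roll: float   # rotation about X

def rotation_matrix(pose):
    """Return the 3x3 matrix R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
    cy, sy = math.cos(pose.yaw), math.sin(pose.yaw)
    cp, sp = math.cos(pose.pitch), math.sin(pose.pitch)
    cr, sr = math.cos(pose.roll), math.sin(pose.roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def quaternion(pose):
    """Return the unit quaternion (w, x, y, z) for the same rotation."""
    cy, sy = math.cos(pose.yaw / 2), math.sin(pose.yaw / 2)
    cp, sp = math.cos(pose.pitch / 2), math.sin(pose.pitch / 2)
    cr, sr = math.cos(pose.roll / 2), math.sin(pose.roll / 2)
    return (
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    )
```

Both functions describe the same rotation; which form is preferable depends on the consumer (matrices compose directly with translations, while quaternions interpolate smoothly).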
[0031] In some embodiments involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to the wearable head device 400A) to an inertial coordinate space or an environmental coordinate space. For example, such a transformation may be necessary so that the display of the wearable head device 400A presents virtual objects in their expected position and orientation relative to the real environment (e.g., a virtual person seated in a real chair facing forward, regardless of the position and orientation of the wearable head device 400A), rather than in a fixed position and orientation on the display (e.g., at the same position on the display of the wearable head device 400A). This can preserve the illusion that virtual objects exist in the real environment (and, for example, do not appear unnaturally positioned in the real environment as the wearable head device 400A shifts and rotates). In some embodiments, compensatory transformations between coordinate spaces can be determined by processing images from a depth camera 444 (e.g., using simultaneous localization and mapping (SLAM) and/or visual odometry procedures) to determine the transformation of the wearable head device 400A relative to an inertial or environmental coordinate system. In the embodiment shown in Figure 4, the depth camera 444 can be coupled to a SLAM/visual odometry block 406 and can provide images to the block 406. An implementation of the SLAM/visual odometry block 406 may include a processor configured to process these images and determine the position and orientation of the user's head, which can then be used to identify the transformation between head coordinate space and real coordinate space. Similarly, in some embodiments, an additional source of information regarding the user's head pose and location is obtained from the IMU 409 of the wearable head device 400A. 
The information from the IMU 409 can be integrated with the information from the SLAM/visual odometry block 406 to provide improved accuracy and/or more timely information regarding rapid adjustments of the user's head pose and position.
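A minimal sketch of the compensatory transformation described above follows. To keep it short it assumes the head pose consists of a position and a yaw angle only (rotation about the vertical Z axis); a real head pose would use a full 3D rotation. The function names and the axis convention are hypothetical.

```python
import math

def head_to_world(point_head, head_position, head_yaw):
    """Map a point from head-fixed coordinates to environment coordinates."""
    c, s = math.cos(head_yaw), math.sin(head_yaw)
    x, y, z = point_head
    return (
        c * x - s * y + head_position[0],
        s * x + c * y + head_position[1],
        z + head_position[2],
    )

def world_to_head(point_world, head_position, head_yaw):
    """Inverse mapping: where a world-fixed point appears relative to the head."""
    c, s = math.cos(head_yaw), math.sin(head_yaw)
    dx = point_world[0] - head_position[0]
    dy = point_world[1] - head_position[1]
    dz = point_world[2] - head_position[2]
    # Apply the transpose (inverse) of the yaw rotation.
    return (c * dx + s * dy, -s * dx + c * dy, dz)
```

For example, a virtual object fixed at world position (2, 0, 0) appears at (2, 0, 0) in head coordinates when the head is at the origin with zero yaw, and at approximately (0, -2, 0) after the head turns by 90 degrees: the head-relative position changes precisely so that the object stays put in the environment.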
[0032] In some embodiments, the depth camera 444 can supply 3D images to a hand gesture tracker 411, which may be implemented within the processor of the wearable head device 400A. The hand gesture tracker 411 can identify the user's hand gestures, for example, by matching the 3D images received from the depth camera 444 to stored patterns representing hand gestures. Other suitable techniques for identifying the user's hand gestures will also be apparent.
[0033] In some embodiments, one or more processors 416 may be configured to receive data from a headgear subsystem 404B, an IMU 409, a SLAM/visual odometry block 406, a depth camera 444, a microphone (not shown), and/or a hand gesture tracker 411. The processor 416 may also send control signals to, and receive control signals from, the 6DOF totem subsystem 404A. The processor 416 may be coupled wirelessly to the 6DOF totem subsystem 404A, for example, in embodiments where the handheld controller 400B is not tethered. The processor 416 may further communicate with additional components such as an audio/visual content memory 418, a graphical processing unit (GPU) 420, and/or a digital signal processor (DSP) audio spatializer 422. The DSP audio spatializer 422 may be coupled to a head-related transfer function (HRTF) memory 425. The GPU 420 may include a left channel output coupled to a left source of imagewise modulated light 424 and a right channel output coupled to a right source of imagewise modulated light 426. The GPU 420 can output stereoscopic image data to the sources of imagewise modulated light 424 and 426. The DSP audio spatializer 422 can output audio to the left speaker 412 and/or the right speaker 414. The DSP audio spatializer 422 may receive an input from the processor 416 indicating a direction vector from the user to a virtual sound source (which, for example, can be moved by the user via the handheld controller 400B). Based on the direction vector, the DSP audio spatializer 422 can determine the corresponding HRTF (e.g., by retrieving an HRTF or by interpolating multiple HRTFs). 
The DSP audio spatializer 422 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object. This can improve the believability and realism of virtual sounds by incorporating the user's position and orientation relative to the virtual sounds within a mixed reality environment; that is, by presenting virtual sounds that match the user's expectations of what they would hear if they were real sounds in a real environment.
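The HRTF lookup, interpolation, and application steps described above can be sketched as follows. Everything specific here is an assumption for illustration: the two-entry `HRIR_TABLE` holds made-up impulse responses, the axis convention (x forward, y toward the right ear) is arbitrary, and linear interpolation of time-domain impulse responses is only one simple stand-in for "interpolating multiple HRTFs".

```python
import math

# Hypothetical toy table: azimuth in degrees -> (left HRIR, right HRIR).
HRIR_TABLE = {
    0.0: ([1.0, 0.0], [1.0, 0.0]),
    90.0: ([0.3, 0.2], [1.0, 0.1]),
}

def azimuth_deg(direction):
    """Azimuth of a direction vector, assuming x is forward and y is to the right."""
    x, y, _ = direction
    return math.degrees(math.atan2(y, x))

def interpolate_hrir(azimuth):
    """Linearly blend the two stored HRIR pairs that bracket the azimuth."""
    keys = sorted(HRIR_TABLE)
    azimuth = min(max(azimuth, keys[0]), keys[-1])  # clamp to the stored range
    for lo, hi in zip(keys, keys[1:]):
        if lo <= azimuth <= hi:
            t = (azimuth - lo) / (hi - lo)
            return tuple(
                [(1.0 - t) * a + t * b for a, b in zip(ir_lo, ir_hi)]
                for ir_lo, ir_hi in zip(HRIR_TABLE[lo], HRIR_TABLE[hi])
            )
    return HRIR_TABLE[keys[0]]

def convolve(signal, kernel):
    """Direct-form convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def spatialize(signal, direction):
    """Return (left, right) signals for a mono input and a direction vector."""
    left_ir, right_ir = interpolate_hrir(azimuth_deg(direction))
    return convolve(signal, left_ir), convolve(signal, right_ir)
```

Because the toy table only covers azimuths from 0 to 90 degrees, directions outside that range are clamped; a practical table would cover the full sphere and typically be indexed by both azimuth and elevation.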
[0034] In some embodiments, such as that shown in Figure 4, one or more of the processor 416, GPU 420, DSP audio spatializer 422, HRTF memory 425, and audio/visual content memory 418 may be contained within an auxiliary unit 400C (which may correspond to the auxiliary unit 300 described above). The auxiliary unit 400C may include a battery 427 that powers its components and/or supplies power to the wearable head device 400A and/or handheld controller 400B. Including such components within an auxiliary unit that can be mounted on the user's waist can limit the size and weight of the wearable head device 400A, which in turn can reduce fatigue in the user's head and neck.
[0035] Figure 4 presents elements corresponding to various components of the exemplary wearable system 400, although various other suitable arrangements of these components will be apparent to those skilled in the art. For example, an element presented in Figure 4 as being associated with the auxiliary unit 400C may instead be associated with the wearable head device 400A or the handheld controller 400B. Furthermore, some wearable systems may omit the handheld controller 400B or the auxiliary unit 400C entirely. Such changes and modifications are understood to be within the scope of the disclosed embodiments.
[0036] Mixed reality environment
[0037] Like all people, users of a mixed reality system exist within a real environment, that is, a three-dimensional portion of the "real world," together with all of its contents, that is perceptible to the user. For example, the user perceives the real environment using ordinary human senses (sight, hearing, touch, taste, and smell) and interacts with the real environment by moving their own body within it. A location within the real environment can be described as a coordinate in a coordinate space; for example, a coordinate can include latitude, longitude, and altitude relative to sea level, distances in three orthogonal dimensions from a reference point, or other suitable values. Similarly, a vector can describe a quantity that has a direction and a magnitude in the coordinate space.
[0038] A computing device can maintain a representation of a virtual environment, for example, in memory associated with the device. As used herein, a virtual environment is a computer representation of a three-dimensional space. A virtual environment can include representations of any object, action, signal, parameter, coordinate, vector, or other property associated with that space. In some embodiments, the circuitry of the computing device (e.g., a processor) can maintain and update the state of the virtual environment; that is, the processor can determine the state of the virtual environment at a second time based on data associated with the virtual environment and / or inputs provided by the user at a first time. For example, if an object in the virtual environment is located at a first coordinate at the first time, has some programmed physical parameters (e.g., mass, coefficient of friction), and an input received from the user indicates that a force should be applied to the object along a certain direction vector, the processor can apply the laws of kinematics, using basic mechanics, to determine the object's location at the second time. The processor can use any suitable known information about the virtual environment and / or any suitable inputs to determine the state of the virtual environment at a given time.
When maintaining and updating the state of a virtual environment, the processor may run any suitable software, including software related to creating and deleting virtual objects within the virtual environment, software for defining the behavior of virtual objects or characters within the virtual environment (e.g., scripts), software for defining the behavior of signals within the virtual environment (e.g., audio signals), software for creating and updating parameters associated with the virtual environment, software for generating audio signals within the virtual environment, software for handling inputs and outputs, software for implementing network operations, software for applying asset data (e.g., animation data for moving virtual objects over time), or many other possibilities.
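As an illustration of the state update described above, the following sketch advances one virtual object from a first time to a second time under an applied force. The `Body` type, the `step` function, and the choice of semi-implicit Euler integration are assumptions made for the example; friction and collisions are omitted.

```python
from dataclasses import dataclass


@dataclass
class Body:
    position: tuple  # (x, y, z) in metres
    velocity: tuple  # (vx, vy, vz) in metres per second
    mass: float      # kilograms


def step(body, force, dt):
    """Return the state of `body` at time t + dt given a constant force over dt."""
    accel = tuple(f / body.mass for f in force)                      # a = F / m
    velocity = tuple(v + a * dt for v, a in zip(body.velocity, accel))
    # Semi-implicit Euler: the position update uses the new velocity.
    position = tuple(p + v * dt for p, v in zip(body.position, velocity))
    return Body(position, velocity, body.mass)
```

For example, `step(Body((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 2.0), (4.0, 0.0, 0.0), 0.5)` moves a 2 kg object at rest under a 4 N force along x for half a second.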
[0039] Output devices such as displays or speakers can present any or all aspects of the virtual environment to the user. For example, the virtual environment may include virtual objects that can be presented to the user (including representations of inanimate objects, people, animals, light, etc.). The processor can determine a view of the virtual environment (e.g., corresponding to a "camera" with origin coordinates, a view axis, and a frustum) and render to the display a visible scene of the virtual environment corresponding to that view. Any suitable rendering technique may be used for this purpose. In some embodiments, the visible scene may include only some virtual objects in the virtual environment and exclude others. Similarly, the virtual environment may include audio aspects that can be presented to the user as one or more audio signals. For example, a virtual object in the virtual environment may generate a sound originating from the object's location coordinates (e.g., a virtual character may speak or trigger a sound effect), or the virtual environment may be associated with musical cues or ambient sounds that may or may not be associated with a specific location. The processor can determine an audio signal corresponding to "listener" coordinates, for example, an audio signal corresponding to a composite of the sounds in the virtual environment, mixed and processed to simulate the audio signal that would be heard by a listener at the listener coordinates, and can present that audio signal to the user through one or more speakers.
[0040] Because a virtual environment exists only as a computer structure, users cannot directly perceive it using their normal senses. Instead, users can only perceive the virtual environment indirectly, such as through displays, speakers, haptic output devices, etc. Similarly, while users cannot directly touch, manipulate, or otherwise interact with the virtual environment, they can provide input data via input devices or sensors to a processor that can use the device or sensor data to update the virtual environment. For example, a camera sensor may provide optical data indicating that the user is attempting to move an object in the virtual environment, and the processor can use that data to cause the object to respond accordingly within the virtual environment.
[0041] Reflections and reverberation
[0042] Aspects of the listener's audio experience within a virtual environment (e.g., a room) include the listener's perception of the direct sound, of the reflections of the direct sound off the room's surfaces, and of the reverberation ("reverb") of the direct sound within the room. Figure 5 illustrates a geometric room representation 500 according to several embodiments. The geometric room representation 500 shows exemplary propagation paths for direct sound (502), reflections (504), and reverberation (506). These paths represent the paths an audio signal can take from the source to the listener within the room. The room shown in Figure 5 may be any suitable type of environment associated with one or more acoustic properties. For example, room 500 may be a concert hall and may include a stage with a pianist and an audience seating area with an audience member. As shown, direct sound is sound that originates at the source (e.g., the pianist) and travels directly toward the listener (e.g., the audience member). A reflection is sound that originates at the source, reflects off a surface (e.g., a wall of the room), and travels to the listener. Reverberation is sound that includes a decaying signal consisting of many reflections arriving close together in time.
[0043] Figure 6 illustrates an exemplary model 600 of a room response measured from a source in a room to a listener, according to several embodiments. The room response model shows the amplitudes of the direct sound (610), the reflection of the direct sound (620), and the reverberation of the direct sound (630) from the listener's perspective at a certain distance from the direct sound source. As illustrated in Figure 6, the direct sound generally arrives at the listener before the reflection (with a reflection delay (622) in the figure, indicating the difference in time between the direct sound and the reflection), which in turn arrives before the reverberation (with a reverberation delay (632) in the figure, indicating the difference in time between the direct sound and the reverberation). The reflection and reverberation may be perceptually different to the listener. Reflections can be modeled separately from reverberations, for example, to better control the time, attenuation, spectral shape, and direction of arrival of individual reflections. Reflections may be modeled using a reflection model, and reverberations may be modeled using a reverberation model, which may differ from the reflection model.
[0044] The reverberation properties (e.g., reverberation decay) of the same sound source can differ between two different acoustic environments (e.g., rooms), and it is desirable to reproduce the sound source realistically according to the properties of the current room in the listener's virtual environment. That is, when a virtual sound source is presented in a mixed reality system, the reflection and reverberation properties of the listener's real environment should be accurately reproduced. L. Savioja, J. Huopaniemi, T. Lokki, and R. Vaananen, "Creating Interactive Virtual Acoustic Environments," J. Audio Eng. Soc. 47(9): 675-705 (1999), describe methods for reproducing direct paths, individual reflections, and acoustic reverberation in real-time virtual 3D audio reproduction systems for video games, simulations, or AR / VR. In the methods disclosed by Savioja et al., the direction of arrival, delay, amplitude, and spectral equalization of each individual reflection are derived from geometric and physical models of the room (e.g., a real room, a virtual room, or some combination thereof), which may require a complex rendering system. These methods are computationally complex and may be prohibitively expensive for mobile applications where computing resources are limited.
[0045] In some room acoustic simulation algorithms, reverberation can be implemented by downmixing all sound sources into a mono signal and sending the mono signal to a reverberation simulation module. The gains used for downmixing and sending may depend on dynamic parameters, such as source distance, and on manually set parameters, such as reverberation gain.
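The downmix described above can be sketched as follows. The function name `reverb_send`, the inverse-distance gain law, and the default parameter values are illustrative assumptions; the text only states that the gains may depend on source distance and on a reverberation gain.

```python
def reverb_send(sources, reverb_gain=0.5, min_distance=1.0):
    """Downmix all sources into one mono signal for a shared reverberation module.

    `sources` is a list of (samples, distance) pairs. Each source is scaled by
    a manually set reverb_gain and by an assumed 1/distance law that is
    clamped so the gain never exceeds reverb_gain.
    """
    length = max(len(samples) for samples, _ in sources)
    bus = [0.0] * length
    for samples, distance in sources:
        gain = reverb_gain * min_distance / max(distance, min_distance)
        for n, x in enumerate(samples):
            bus[n] += gain * x
    return bus
```

The returned list would then be the single input to the reverberation simulation module, so the reverberation cost does not grow with the number of sources.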
[0046] Source directivity, or radiation pattern, refers to a measure of the amount of energy a sound source emits in different directions. Source directivity affects all parts of the room impulse response (e.g., the direct sound, reflections, and reverberation). Different sound sources can exhibit different directivity; for example, human speech may have a different directivity pattern than a trumpet. Room simulation models may account for source directivity when generating an accurate simulation of an acoustic signal. For example, a model incorporating source directivity may include a function of the direction of the line from the sound source to the listener relative to the front direction of the sound source (its primary acoustic axis). The directivity pattern may be axisymmetric about the primary acoustic axis of the sound source. In some embodiments, a parametric gain model may be defined using a frequency-dependent filter. In some embodiments, the diffuse power average of the sound source may be calculated (e.g., by integrating over a sphere centered on the acoustic center of the sound source) to determine how much audio from a given sound source should be sent to the reverberation bus.
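To make the diffuse power average concrete, the sketch below integrates a squared directivity gain over a sphere. It assumes an axisymmetric, frequency-independent cardioid pattern purely as an example; the function names and the midpoint-rule integration are illustrative choices, and a frequency-dependent model would repeat the calculation per frequency band.

```python
import math


def cardioid_gain(theta):
    """Amplitude gain at angle theta (radians) off the primary acoustic axis."""
    return 0.5 * (1.0 + math.cos(theta))


def diffuse_power_average(gain, steps=10000):
    """Mean of gain(theta)**2 over the sphere for an axisymmetric pattern.

    For an axisymmetric pattern the surface integral divided by 4*pi reduces
    to 0.5 * integral from 0 to pi of gain(theta)**2 * sin(theta) d(theta),
    evaluated here with the midpoint rule.
    """
    d_theta = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta
        total += gain(theta) ** 2 * math.sin(theta)
    return 0.5 * total * d_theta
```

For the cardioid the result is close to 1/3, meaning such a source would feed roughly one third of the power into the reverberation bus that an omnidirectional source with the same on-axis level would.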
[0047] Two-way audio engines and sound design tools can make assumptions about the acoustic system being modeled. For example, some two-way audio engines model source directivity as a function that is independent of frequency, which can have two potential drawbacks. First, it can ignore frequency-dependent attenuation of the direct sound propagating from the source to the listener. Second, it can ignore frequency-dependent attenuation of the signals transmitted to reflection and reverberation processing. These effects can be important from a psychoacoustic perspective, and failing to reproduce them can lead to room simulations that sound unnatural and differ from what the listener is accustomed to experiencing in a real acoustic environment.
[0048] In some cases, a room simulation system or two-way audio engine may not completely isolate the sound source, listener, and acoustic environment parameters such as reflections and reverberations. Instead, the room simulation system may be tuned as a whole for a specific virtual environment and may not be suitable for different playback scenarios. For example, reverberations in the simulated environment may not match the environment that the user / listener physically exists in when listening to the rendered content.
[0049] In augmented or mixed reality applications, computer-generated audio objects may be rendered via an acoustically transparent playback system so as to blend with the physical environment as naturally heard by the user / listener. This may require binaural artificial reverberation processing that matches the local ambient acoustics, so that the synthesized audio objects are indistinguishable from naturally occurring sounds or from sounds reproduced over loudspeakers. Approaches that involve measuring or calculating room impulse responses, for example based on estimating the geometry of the environment, may be limited in consumer environments by practical obstacles and complexity. In addition, physical models may not necessarily provide the most engaging listening experience, because they may not account for psychoacoustic principles or may not provide an audio scene parameterization suitable for sound designers to fine-tune the listening experience.
[0050] Matching certain specific physical properties of a target acoustic environment may not provide a simulation that perceptually closely matches the listener's environment or the application designer's intentions. A perceptually relevant model of the target acoustic environment, which can be characterized using a practical audio environment description interface, may be desired.
[0051] For example, a rendering model that separates the contributions of the source, listener, and room properties may be desired. A rendering model that separates contributions may allow components to be adapted or swapped at runtime according to the local environment and the nature of the end user. For example, a listener may be in a physical room with different acoustic characteristics than the virtual environment in which the content was originally created. Modifying the simulation's early reflections and / or reverberations to match the listening environment may lead to a more compelling listening experience. Matching the listening environment may be particularly important in mixed reality applications where the desired effect may be that the listener cannot distinguish between the simulated ambient sounds and the sounds present in the actual surrounding environment.
[0052] It may be desirable to create compelling effects without requiring detailed knowledge of the geometric shape of the actual surrounding environment and / or the acoustic properties of the surrounding surfaces. Detailed knowledge of the properties of the actual surrounding environment may not be available, or they may be complex to estimate, especially on portable devices. Instead, models based on perceptual and psychoacoustic principles may be a far more practical tool for characterizing acoustic environments.
[0053] Figure 7 illustrates Table 700, which lists several objective acoustic and geometric parameters characterizing each segment in a binaural room impulse model that separates the properties of the source, listener, and room, according to several embodiments. Some source properties, including free-field and diffuse-field transfer functions, may be independent of how and where the content will be rendered, while other properties, including position and orientation, may need to be dynamically updated during playback. Similarly, some listener properties, including free-field and diffuse-field head-related transfer functions or diffuse-field interaural coherence (IACC), may be independent of where the content will be rendered, while other properties, including position and orientation, may be dynamically updated during playback. Some room properties, in particular those contributing to late reverberation, may be entirely environment-dependent. Representations of the reverberation decay rate and the room cubic volume may be used to adapt the spatial audio rendering system to the listener's playback environment.
[0054] The source and the listener's ear may be modeled as emitting and receiving transducers, respectively, characterized by a set of directionally dependent free-field transfer functions, including the listener's head-related transfer function (HRTF).
[0055] Figure 8 illustrates an exemplary audio mixing system 800 for rendering multiple virtual sound sources in a virtual room, such as within an XR environment, according to several embodiments. For example, the audio mixing architecture may include a rendering engine for room acoustic simulation of multiple virtual sound sources 810 (i.e., objects 1-N). System 800 includes a room transmit bus 830 that feeds a room module 850 (e.g., a shared reverberation and reflection module) that renders reflections and reverberation. Aspects of this general process are described, for example, in the IA-SIG 3D Audio Rendering Guidelines (Level 2), www.iasig.net (1999). The room transmit bus combines contributions from all sources, each processed by a corresponding module 820, to derive an input signal for the room module. The room transmit bus may comprise a mono room transmit bus. The format of the main mixing bus 840 may be a two-channel or multi-channel format that matches the final output rendering method, which may include, for example, a binaural renderer for headphone playback, an ambisonic decoder, and / or a multi-channel loudspeaker system. The main mixing bus combines contributions from all sources with the room module output to derive the output rendering signal 860.
[0056] Referring to the exemplary system 800, each of the N objects may represent a virtual sound source signal and may be assigned an apparent location in the environment, such as by a panning algorithm. For example, each object may be assigned an angular position on a sphere centered on the position of a virtual listener. The panning algorithm may calculate the contribution of each object to each channel of the main mix. This general process is described, for example, in J.-M. Jot, V. Larcher, and J.-M. Pernaux, "A comparative study of 3-D audio encoding and rendering techniques," Proc. AES 16th International Conference on Spatial Sound Reproduction (1999). Each object may also be input to a pan and gain module 820, which can implement a panning algorithm and perform additional signal processing, such as adjusting the gain level for each object.
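As one simple instance of a panning algorithm that computes each object's contribution to each channel of the main mix, the sketch below uses a constant-power stereo pan law. The two-channel format, the `pan` parameter range, and the function name are assumptions for illustration; the text does not prescribe a particular panning algorithm or channel format.

```python
import math


def constant_power_pan(sample, pan):
    """Split one sample into (left, right) contributions.

    `pan` runs from -1.0 (full left) to 1.0 (full right). The sine/cosine law
    keeps left**2 + right**2 equal to sample**2 for every pan position.
    """
    pan = max(-1.0, min(1.0, pan))
    angle = (pan + 1.0) * math.pi / 4.0   # maps pan to 0 .. pi/2
    return sample * math.cos(angle), sample * math.sin(angle)
```

A pan and gain module such as module 820 could apply a per-object gain to `sample` before this step and add the two results into the corresponding channels of the main mixing bus.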
[0057] In some embodiments, system 800 may assign to each virtual sound source an apparent distance relative to the location of a virtual listener, from which the rendering engine can derive a source-specific direct gain and a source-specific room gain for each object. The direct and room gains may affect the audio signal power contributed by the virtual sound source to the main mixing bus 840 and the room transmit bus 830, respectively. A minimum distance parameter may be assigned to each virtual sound source, and the direct and room gains may roll off at different rates as the distance increases beyond this minimum distance.
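The distance-dependent gains described above can be sketched as a pair of power-law curves that share a minimum distance but use different rolloff exponents. The power-law form, the exponent values, and the function name are assumptions for illustration; the text only requires that the two gains roll off at different rates beyond the minimum distance.

```python
def distance_gains(distance, min_distance=1.0, direct_rolloff=1.0, room_rolloff=0.5):
    """Return (direct_gain, room_gain) as linear amplitude factors.

    Both gains stay at 1.0 up to min_distance and then fall off as
    (min_distance / distance) ** rolloff, each with its own exponent.
    """
    d = max(distance, min_distance)
    ratio = min_distance / d
    return ratio ** direct_rolloff, ratio ** room_rolloff
```

With these assumed defaults, a source at four times the minimum distance contributes to the main mixing bus at one quarter amplitude and to the room transmit bus at one half amplitude, so the room contribution becomes relatively more prominent as the source recedes.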
[0058] In some embodiments, the system 800 in Figure 8 may be used for producing audio recordings and for two-way audio applications targeting conventional two-channel frontal stereo loudspeaker playback systems. However, when applied in a binaural or immersive 3D audio system that enables a spatially diffuse distribution of simulated reverberation and reflections, the system 800 may not provide sufficiently convincing sound localization cues when rendering virtual sound sources, particularly those far from the listener. This can be addressed by incorporating a clustered reflection rendering module shared among the virtual sound sources 810, while supporting per-source control of the spatial distribution of reflections. It is desirable that such a module incorporate a per-source early reflection processing algorithm and dynamic control of early reflection parameters based on the positions of the virtual sound source and the listener.
[0059] In some embodiments, it may be desirable to have spatial audio processing models / systems and methods that can accurately reproduce position-dependent room acoustic cues without computationally complex rendering of individual early reflections for each virtual sound source and without detailed descriptions of the geometry and physical properties of acoustic reflectors.
[0060] The reflection processing model can dynamically consider the positions of listeners and virtual sound sources within a real or virtual room / environment, without associated physical and geometric descriptions. Perceptual models for controlling clustered reflection panning and early reflection processing parameters per source may be efficiently implemented.
[0061] Figure 9 illustrates an audio mixing system 900 for rendering multiple virtual sound sources within a virtual room, according to several embodiments. For example, system 900 may include a rendering engine for room acoustic simulation of multiple virtual sound sources 910 (e.g., objects 1-N). Compared to system 800 described above, system 900 may include separate control of the reverberation and reflection transmit channels for each virtual sound source. Each object may be input to a separate per-source processing module 920, and a room transmit bus 930 may feed a room processing module 950.
[0062] Figure 10 illustrates a source-specific processing module 1020 according to several embodiments. Module 1020 may correspond to one or more of the modules 920 shown in Figure 9 and the exemplary system 900. The source-specific processing module 1020 can perform processing specific to an individual source (e.g., 1010, which may correspond to one of the sources 910) of the overall system (e.g., system 900). The source-specific processing module may include direct processing paths (e.g., 1030A) and / or room processing paths (e.g., 1030B).
[0063] In some embodiments, individual direct and room filters may be applied separately to each sound source. Applying filters separately can allow for more refined and precise control over how each source radiates sound toward the listener and into the surrounding environment. In contrast to broadband gain, the use of filters can allow for matching a desired sound radiation pattern as a function of frequency. This is beneficial because radiation properties can vary across sound source types and may be frequency-dependent. The angle between the primary acoustic axis of the sound source and the listener's position can affect the sound pressure level perceived by the listener. Furthermore, source radiation characteristics can affect the average of the source's diffuse power.
[0064] In some embodiments, the frequency-dependent filter may be implemented using a double-shelving approach disclosed in U.S. Patent Application No. 62 / 678259, titled "INDEX SCHEMING FOR FILTER PARAMETERS" (the contents of which are incorporated as a whole by reference). In some embodiments, the frequency-dependent filter may be applied in the frequency domain and / or using a finite impulse response filter.
[0065] As shown in the embodiments, the direct processing path may include a direct transmit filter 1040, followed by a direct pan module 1044. The direct transmit filter 1040 may model one or more acoustic effects, such as source directivity, distance, and / or orientation. The direct pan module 1044 can spatialize the audio signal to correspond to an apparent location in the environment (e.g., a 3D location in a virtual environment such as an XR environment). The direct pan module 1044 may be amplitude-based and / or intensity-based and may depend on the geometry of the loudspeaker array. In some embodiments, the direct processing path may include a direct transmit gain 1042 together with the direct transmit filter and the direct pan module. The direct pan module 1044 can output to a main mixing bus 1090, which may correspond to the main mixing bus 940 described above with respect to the exemplary system 900.
[0066] In some embodiments, the room processing path comprises a room delay 1050 and a room transmit filter 1052, followed by a reflection path (e.g., 1060A) and a reverberation path (e.g., 1060B). The room transmit filter may be used to model the effect of source directivity on the signal proceeding through the reflection and reverberation paths. The reflection path may comprise a reflection transmit gain 1070 and may transmit the signal to a reflection transmit bus 1074 via a reflection pan module 1072. The reflection pan module 1072 may be analogous to the direct pan module 1044 in that it can spatialize audio signals, but it operates on reflections instead of direct signals. The reverberation path 1060B may comprise a reverberation gain 1080 and may transmit the signal to a reverberation transmit bus 1084. The reflection transmit bus 1074 and the reverberation transmit bus 1084 may be grouped into a room transmit bus 1092, which may correspond to the room transmit bus 930 described above with respect to the exemplary system 900.
[0067] Figure 11 illustrates an embodiment of a per-source reflection pan module 1100 that may correspond to the reflection pan module 1072 described above, according to several embodiments. As shown in the figure, the input signal may be encoded into a 3-channel ambisonic B-format signal, for example, as described in J.-M. Jot, V. Larcher, and J.-M. Pernaux, "A comparative study of 3-D audio encoding and rendering techniques," Proc. AES 16th International Conference on Spatial Sound Reproduction (1999). The encoding coefficients 1110 can be calculated according to Equations 1-3. [Equations 1-3: not reproduced in this text]
[0068] In Equations 1-3, k can be calculated as [expression not reproduced in this text], where F is a spatial focus parameter with a value in the range [0, 2 / 3] and Az is an angle in degrees in the range [0, 360]. The encoder may encode the input signal into a 3-channel ambisonic B-format signal.
[0069] Az can be an azimuth angle defined by the projection of the principal direction of arrival of the reflection onto the head-relative horizontal plane (e.g., a plane perpendicular to the "up" vector of the listener's head and containing the listener's ear). The spatial focus parameter F can indicate the spatial concentration of reflected signal energy arriving at the listener. When F is zero, the spatial distribution of arriving reflected energy may be uniform around the listener. As F increases, the spatial distribution may become increasingly concentrated around the principal direction determined by the azimuth angle Az. The maximum theoretical value of F is 1.0, which may indicate that all energy arrives from the principal direction determined by the azimuth angle Az.
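Because the encoding equations themselves do not appear in this text, the following is only a sketch of one plausible form of a focus-controlled horizontal B-format encoder. The expressions for the gains and for k are assumptions, chosen so that total power is preserved for any Az and so that F = 0 yields a W-only (uniform) distribution as described above; they should not be read as Equations 1-3.

```python
import math


def encode_reflection_pan(azimuth_deg, focus):
    """Return assumed B-format encoding gains (g_w, g_x, g_y) for one source.

    `focus` is the spatial focus parameter F, expected in [0, 2/3];
    `azimuth_deg` is the principal direction Az in degrees.
    """
    if not 0.0 <= focus <= 2.0 / 3.0:
        raise ValueError("focus F is expected in [0, 2/3]")
    k = 0.5 * math.sqrt(3.0 * focus)      # assumed mapping from F to k
    az = math.radians(azimuth_deg)
    g_w = math.sqrt(1.0 - k * k)          # omnidirectional component
    g_x = k * math.cos(az)                # front/back component
    g_y = k * math.sin(az)                # left/right component
    return g_w, g_x, g_y
```

Under these assumptions the squared gains always sum to one, so changing F redistributes reflected energy between the omnidirectional and directional channels without changing its total.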
[0070] In one embodiment of the present invention, the spatial focus parameter F may be defined as the magnitude of the Gerzon energy vector, as described, for example, in J.-M. Jot, V. Larcher, and J.-M. Pernaux, "A comparative study of 3-D audio encoding and rendering techniques," Proc. AES 16th International Conference on Spatial Sound Reproduction (1999).
[0071] The output of the reflection pan module 1100 can be provided to a reflection transmit bus 1174, which may correspond to the reflection transmit bus 1074 described above with respect to Figure 10 and the exemplary processing module 1020.
[0072] Figure 12 illustrates an exemplary room processing module 1200 according to several embodiments. Room processing module 1200 can correspond to room processing module 950 described above with respect to Figure 9 and exemplary system 900. As shown in Figure 9, room processing module 1200 may include a reflection processing path 1210A and / or a reverberation processing path 1210B.
[0073] The reflection processing path 1210A may receive a signal from the reflection transmit bus 1202 (which may correspond to the reflection transmit bus 1074 described above) and output a signal into the main mixing bus 1290 (which may correspond to the main mixing bus 940 described above). The reflection processing path 1210A may include a reflection global gain 1220, a reflection global delay 1222, and / or a reflection module 1224 capable of simulating / rendering reflections.
[0074] The reverberation processing path 1210B may receive signals from the reverberation transmit bus 1204 (which may correspond to the reverberation transmit bus 1084 described above) and output signals into the main mixing bus 1290. The reverberation processing path 1210B may include a reverberation global gain 1230, a reverberation global delay 1232, and / or a reverberation module 1234.
[0075] Figure 13 illustrates an exemplary reflection module 1300 according to several embodiments. The input 1310 of the reflection module can be output by a reflection pan module 1100, such as the one described above, and presented to the reflection module 1300 via a reflection transmit bus 1174. The reflection transmit bus may carry a 3-channel ambisonic B-format signal combining contributions from all virtual sound sources (e.g., sound sources 910 (objects 1-N) described above with respect to Figure 9). In the embodiment shown, the three channels, denoted (W, X, Y), are fed to an ambisonic decoder 1320. In this embodiment, the ambisonic decoder generates six output signals, each of which is fed to one of six mono-input / mono-output elementary reflection modules 1330 (R1-R6) to generate a set of six reflection output signals 1340 (s1-s6). (The embodiment shows six signals and six reflection modules, but any suitable number may be used.) The reflection output signals 1340 are presented to the main mixing bus 1350, which may correspond to the main mixing bus 940 described above.
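The decode-then-reflect structure described above can be sketched as follows. The six evenly spaced virtual loudspeaker azimuths, the basic projection decoder, and the single delayed tap used as a stand-in for each elementary reflection module are all assumptions; the text does not specify the decoder design or the internals of modules R1-R6.

```python
import math

SPEAKER_AZIMUTHS_DEG = [0, 60, 120, 180, 240, 300]  # assumed virtual layout


def decode_wxy(w, x, y, azimuths_deg=SPEAKER_AZIMUTHS_DEG):
    """Project one (W, X, Y) sample onto each virtual loudspeaker direction."""
    n = len(azimuths_deg)
    return [
        (w + x * math.cos(math.radians(a)) + y * math.sin(math.radians(a))) / n
        for a in azimuths_deg
    ]


def elementary_reflection(signal, delay_samples, gain):
    """Mono-in / mono-out stand-in for one of R1-R6: a delayed, scaled copy."""
    return [0.0] * delay_samples + [gain * s for s in signal]
```

Giving each of the six decoded signals its own delay and gain yields six reflection output signals whose relative levels follow the encoded direction and focus, while the cost stays fixed regardless of how many sources feed the reflection transmit bus.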
[0076] Figure 14 illustrates the spatial distribution 1400 of the apparent direction of arrival of reflections, as detected by a listener 1402, in several embodiments. For example, the reflections shown may be generated by the reflection module 1300 described above with respect to a sound source to which specific values of the reflection pan parameters Az and F, as described above with respect to Figure 11, are assigned.
[0077] As illustrated in Figure 14, the effect of the reflection module 1300 in combination with the reflection pan module 1100 is to generate a series of reflections, each of which may arrive at a different time (e.g., as illustrated in model 600) and from one of the virtual loudspeaker directions 1410 (e.g., 1411-1416, which may correspond to the reflection output signals s1-s6 described above). The effect of the reflection pan module 1100 in combination with the ambisonic decoder 1320 is to adjust the relative magnitudes of the reflection output signals 1340 so that the listener perceives the reflections as emanating from the principal direction angle Az, with a spatial distribution (more or less concentrated around that principal direction) determined by the setting of the spatial focus parameter F.
[0078] In some embodiments, the principal reflection angle Az for each source coincides with the apparent arrival direction of the direct path, which can be controlled per source by the direct pan module 1020. The simulated reflections can enhance the listener's perception of the directional position of the virtual sound source.
[0079] In some embodiments, the main mixing bus 940 and the direct pan module 1020 may enable three-dimensional reproduction of the sound direction. In these embodiments, the principal reflection direction angle Az may coincide with the projection of the apparent direction onto the plane on which the principal reflection direction angle Az is measured.
[0080] Figure 15 illustrates an exemplary Model 1500 of direct gain, reflection gain, and reverberation gain as a function of distance (e.g., to the listener) according to some embodiments. Model 1500 illustrates an example of how the direct, reflection, and reverberation transmission gains (for example, those shown in Figure 10) vary with source distance. As shown in the figure, the direct sound, its reflections, and its reverberation may have significantly different falloff curves with respect to distance. In some cases, per-source processing such as that described above may enable a faster distance-based rolloff for reflections than for reverberation. Psychoacoustically, this may enable robust directional and distance perception, particularly for distant sources.
[0081] Figure 16 illustrates an exemplary model 1600 of spatial focus versus source distance for the direct and reflection components according to some embodiments. In this embodiment, the direct pan module 1020 is configured to produce the maximum spatial concentration of the direct path component in the direction of the sound source, regardless of its distance. The reflection spatial focus parameter F, on the other hand, may be set to an exemplary value of 2 / 3 to reinforce the perception of directionality in a realistic manner for all distances greater than a limiting distance (e.g., minimum reflection distance 1610). As illustrated by exemplary model 1600, the reflection spatial focus parameter value decreases toward zero as the source approaches the listener.
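The distance dependence of the reflection spatial focus can be sketched as a simple function. The following Python example is only a minimal sketch: the linear ramp below the minimum reflection distance is an assumption (the text states only that F decreases toward zero), and the function and parameter names are hypothetical.

```python
def reflection_focus(distance, min_reflection_distance, max_focus=2.0 / 3.0):
    """Reflection spatial focus F as a function of source distance.

    At or beyond min_reflection_distance, F stays at max_focus (2/3 in the
    example above). Closer than that, F is ramped linearly toward zero;
    the linear shape is an assumption made for this sketch.
    """
    if min_reflection_distance <= 0.0:
        raise ValueError("min_reflection_distance must be positive")
    if distance >= min_reflection_distance:
        return max_focus
    return max_focus * max(distance, 0.0) / min_reflection_distance
```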
[0082] Figure 17 shows an exemplary model 1700 of the amplitude of an audio signal as a function of time. As described above, a reflection processing path (e.g., 1210A) may receive a signal from the reflection transmit bus and output a signal onto the main mixing bus. The reflection processing path may include a reflection global gain (e.g., 1220), a reflection global delay (e.g., 1222) for controlling the parameter Der as shown in model 1700, and / or a reflection module (e.g., 1224), such as those described above.
[0083] As described above, the reverberation processing path (e.g., 1210B) may receive a signal from the reverberation transmit bus and output a signal onto the main mixing bus. The reverberation processing path 1210B may include a reverberation global gain (e.g., 1230) for controlling the parameter Lgo as shown in Model 1700, a reverberation global delay (e.g., 1232) for controlling the parameter Drev as shown in Model 1700, and / or a reverberation module (e.g., 1234). The processing blocks within the reverberation processing path may be implemented in any suitable order. Examples of the reverberation module are described in U.S. Patent Application No. 62 / 685235, titled “REVERBERATION GAIN NORMALIZATION” and U.S. Patent Application No. 62 / 684086, titled “LOW-FREQUENCY INTERCHANNEL COHERENCE CONTROL” (the contents of which are incorporated herein by reference in their entirety).
[0084] Model 1700 in Figure 17 illustrates how, in some embodiments, per-source parameters including distance and reverberation delay may be taken into account to dynamically adjust the reverberation delay and level. In the figure, Dtof represents the delay due to the time of flight for a given object, i.e., Dtof = ObjDist / c, where ObjDist is the distance from the center of the listener's head to the object and c is the speed of sound in air. Drm represents the per-object room delay. Dobj represents the total per-object delay, i.e., Dobj = Dtof + Drm. Der represents the global early reflection delay. Drev represents the global reverberation delay. Dtotal represents the total delay for a given object, i.e., Dtotal = Dobj + Dglobal.
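The delay relations above can be written out as a short Python sketch. The speed of sound value (343 m/s, roughly dry air at 20 °C) and the function and argument names are assumptions for illustration; the global delay is passed in as a plain number because this passage does not state how Dglobal is formed.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C

def object_delays(obj_dist_m, room_delay_s, global_delay_s):
    """Return Dtof, Dobj and Dtotal (in seconds) for one object."""
    d_tof = obj_dist_m / SPEED_OF_SOUND_M_S   # Dtof = ObjDist / c
    d_obj = d_tof + room_delay_s              # Dobj = Dtof + Drm
    d_total = d_obj + global_delay_s          # Dtotal = Dobj + Dglobal
    return {"Dtof": d_tof, "Dobj": d_obj, "Dtotal": d_total}
```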
[0085] Lref represents the level of the reverberation for Dtotal = 0. Lgo represents the global level offset due to the global delay, which can be calculated according to Equation 10, where T60 is the reverberation time of the reverberation algorithm. Loo represents the per-object level offset due to the global delay, which can be calculated according to Equation 11. Lto represents the total level offset for a given object, which can be calculated according to Equation 12 (assuming dB values).
$$L_{go} = 60 \cdot \frac{D_{global}}{T_{60}} \ \text{(dB)} \qquad \text{(Equation 10)}$$
$$L_{oo} = 60 \cdot \frac{D_{obj}}{T_{60}} \ \text{(dB)} \qquad \text{(Equation 11)}$$
$$L_{to} = L_{go} + L_{oo} \qquad \text{(Equation 12)}$$
[0086] In some embodiments, the reverberation level is calibrated independently of object position, reverberation time, and other user-controllable parameters. Thus, Lrev may be the extrapolated level of the decaying reverberation at the initial time of sound emission. Lrev may be the same quantity as the reverberation initial power (RIP) as defined in U.S. Patent Application No. 62 / 685235, titled "REVERBERATION GAIN NORMALIZATION" (the contents of which are incorporated herein by reference in their entirety). Lrev may be calculated according to Equation 13.
$$L_{rev} = L_{ref} + L_{to} \qquad \text{(Equation 13)}$$
[0087] In some embodiments, T60 may be a function of frequency. Thus, Lgo, Loo, and consequently Lto are frequency-dependent.
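To make the frequency dependence concrete, the following Python sketch computes the level offsets per frequency band. It assumes that each offset equals the decay accumulated over the corresponding delay at a rate of 60 dB per T60, that Lto is the sum of Lgo and Loo, and that Lrev is Lref plus Lto; these forms, the band labels, and the function names are assumptions for illustration.

```python
def level_offset_db(delay_s, t60_s):
    """Decay in dB accumulated over delay_s when the level falls 60 dB per T60."""
    if t60_s <= 0.0:
        raise ValueError("T60 must be positive")
    return 60.0 * delay_s / t60_s

def reverb_levels_db(l_ref_db, d_global_s, d_obj_s, t60_by_band_s):
    """Per-band Lgo, Loo, Lto and Lrev for one object (all in dB)."""
    result = {}
    for band, t60_s in t60_by_band_s.items():
        l_go = level_offset_db(d_global_s, t60_s)  # global offset
        l_oo = level_offset_db(d_obj_s, t60_s)     # per-object offset
        l_to = l_go + l_oo                         # total offset
        result[band] = {"Lgo": l_go, "Loo": l_oo, "Lto": l_to,
                        "Lrev": l_ref_db + l_to}
    return result
```

Because T60 typically differs between bands, the same delays produce a larger offset in bands where the reverberation decays faster.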
[0088] Figure 18 illustrates an exemplary system 1800 for determining spatial audio properties based on an acoustic environment. The exemplary system 1800 can be used to determine spatial audio properties relating to reflections and / or reverberation, such as those described above. In some embodiments, such properties may include the volume of the room, reverberation time as a function of frequency, the listener's position relative to the room, the presence of objects in the room (e.g., sound-attenuating objects), surface materials, or other suitable properties. In some embodiments, these spatial audio properties may be obtained locally by capturing a single impulse response using a microphone and a loudspeaker freely positioned within the local environment, or they may be derived adaptively by continuously monitoring and analyzing sounds captured by a mobile device microphone. In some embodiments, such as when the acoustic environment can be sensed via sensors of an XR system (e.g., an augmented reality system including one or more of the wearable head unit 100, handheld controller 200, and auxiliary unit 300 described above), the user's location can be used to present audio reflections and reverberation corresponding to the environment presented to the user (e.g., via a display).
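One common way to derive a reverberation time from a captured impulse response is Schroeder backward integration followed by a line fit to the resulting decay curve. The disclosure does not state which estimator is used, so the following Python sketch is only an illustration; the function name, the fit range of -5 dB to -25 dB, and the extrapolation of the fitted slope to 60 dB are assumptions.

```python
import math

def estimate_t60(impulse_response, sample_rate_hz,
                 fit_start_db=-5.0, fit_end_db=-25.0):
    """Estimate T60 in seconds from a broadband impulse response."""
    # Schroeder backward integration: energy remaining from each sample on.
    energy = [s * s for s in impulse_response]
    edc = [0.0] * len(energy)
    remaining = 0.0
    for i in range(len(energy) - 1, -1, -1):
        remaining += energy[i]
        edc[i] = remaining
    if not edc or edc[0] <= 0.0:
        raise ValueError("impulse response has no energy")
    total = edc[0]

    # Collect (time, level in dB) points that fall inside the fit range.
    points = []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        level_db = 10.0 * math.log10(e / total)
        if fit_end_db <= level_db <= fit_start_db:
            points.append((i / sample_rate_hz, level_db))
    if len(points) < 2:
        raise ValueError("decay too short for the requested fit range")

    # Least-squares line fit; the slope is the decay rate in dB per second.
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    num = sum((t - mean_t) * (l - mean_l) for t, l in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    slope_db_per_s = num / den
    return -60.0 / slope_db_per_s
```

A frequency-dependent reverberation time could be obtained by band-pass filtering the impulse response and running the same estimate per band; that filtering step is omitted here.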
[0089] In the exemplary system 1800, the acoustic environment sensing module 1810 identifies the spatial audio properties of the acoustic environment, such as those described above. In some embodiments, the acoustic environment sensing module 1810 can capture data corresponding to the acoustic environment (step 1812). For example, the data captured in step 1812 may include audio data from one or more microphones, camera data from cameras such as RGB cameras or depth cameras, LIDAR data, sonar data, radar data, GPS data, or other suitable data that can convey information about the acoustic environment. In some instances, the data captured in step 1812 may include user-related data such as the user's position or orientation relative to the acoustic environment. The data captured in step 1812 may be captured via one or more sensors of a wearable device, such as the wearable head unit 100 described above.
[0090] In some embodiments, the local environment in which the head-mounted display device resides may include one or more microphones. In some embodiments, more than one microphone may be employed; the microphones may be mounted on a mobile device, positioned in the environment, or both. The benefits of such an arrangement may include collecting directional information about the room reverberation, or mitigating poor signal quality from any one of the microphones. Poor signal quality on a given microphone may be due to, for example, occlusion, overload, wind noise, or transducer damage.
[0091] In step 1814 of module 1810, features can be extracted from data captured in step 1812. For example, room dimensions can be determined from sensor data such as camera data, LiDAR data, and sonar data. The features extracted in step 1814 can be used to determine one or more acoustic properties of the room, such as frequency-dependent reverberation time, and these properties can be stored in step 1816 and associated with the current acoustic environment.
[0092] In some embodiments, module 1810 can communicate with database 1840 to store and retrieve acoustic properties relating to an acoustic environment. In some embodiments, the database may be stored locally in the device's memory. In some embodiments, the database may be stored online as a cloud-based service. The database may associate room properties with geographic locations so that they can be easily accessed at a later time based on the listener's location. In some embodiments, the database may contain additional information for identifying the listener's location and / or for identifying reverberation properties in the database that closely approximate the properties of the listener's environment. For example, room properties may be classified by room type, so that a set of parameters can be used as soon as it is identified that the listener is in a room of a known type (e.g., a bedroom or living room), even if the absolute geographic location cannot be determined.
[0093] The storage of reverberation properties within the database may relate to U.S. Patent Application No. 62 / 573448, titled "PERSISTENT WORLD MODEL SUPPORTING AUGMENTED REALITY AND INCLUDING AUDIO COMPONENT" (the contents of which are incorporated herein by reference in their entirety).
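The lookup behavior described above (match by geographic location when available, otherwise fall back to room type) can be sketched as follows. This Python example is a minimal sketch under stated assumptions: the record layout, the 10 m matching radius, the equirectangular distance approximation, and all names are hypothetical and do not describe the actual structure of database 1840.

```python
import math
from dataclasses import dataclass

@dataclass
class RoomRecord:
    latitude: float
    longitude: float
    room_type: str
    t60_by_band_s: dict  # e.g. {"1kHz": 0.4}

def _distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation, adequate over short distances."""
    earth_radius_m = 6371000.0
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2.0))
    y = math.radians(lat2 - lat1)
    return earth_radius_m * math.hypot(x, y)

def lookup_room(records, location=None, room_type=None, max_distance_m=10.0):
    """Return the best matching record, or None if nothing matches."""
    if location is not None:
        lat, lon = location
        nearest = min(
            records,
            key=lambda r: _distance_m(lat, lon, r.latitude, r.longitude),
            default=None,
        )
        if nearest is not None and _distance_m(
                lat, lon, nearest.latitude, nearest.longitude) <= max_distance_m:
            return nearest
    # Fall back to a room of the same known type when location fails.
    if room_type is not None:
        for record in records:
            if record.room_type == room_type:
                return record
    return None
```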
[0094] In some embodiments, the system 1800 may include a reflection-adaptation module 1820 for reading acoustic properties of a room and applying those properties to audio reflections (e.g., audio reflections presented to the user of the wearable head unit 100 via headphones or speakers). In step 1822, the user's current acoustic environment can be determined. For example, GPS data may indicate the user's location in GPS coordinates, which in turn can indicate the user's current acoustic environment (e.g., the room located at those GPS coordinates). In another embodiment, camera data combined with optical recognition software may be used to identify the user's current environment. The reflection-adaptation module 1820 can then communicate with the database 1840 to read acoustic properties associated with the determined environment, which are used in step 1824 to update the audio rendering accordingly. That is, acoustic properties associated with reflections (e.g., directional patterns or falloff curves, such as those described above) can be applied to the reflected audio signal presented to the user so that the presented reflected audio signal incorporates those acoustic properties.
[0095] Similarly, in some embodiments, the system 1800 may include a reverberation-adaptation module 1830 for reading acoustic properties relating to a room and applying those properties to audio reverberation (e.g., audio reverberation presented to the user of the wearable head unit 100 via headphones or speakers). The acoustic properties of interest for reverberation may differ from those of interest for reflections, such as those described above (e.g., in Table 700 relating to Figure 7). In step 1832, the user's current acoustic environment can be determined as described above. For example, GPS data may indicate the user's location in GPS coordinates, which in turn can indicate the user's current acoustic environment (e.g., the room located at those GPS coordinates). In another embodiment, camera data combined with optical recognition software may be used to identify the user's current environment. The reverberation-adaptation module 1830 can then communicate with the database 1840 to read acoustic properties associated with the determined environment, which are used in step 1834 to update the audio rendering accordingly. In other words, acoustic properties related to reverberation (for example, the reverberation decay time described above) can be applied to the reverberation audio signal presented to the user so that the presented reverberation audio signal incorporates those acoustic properties.
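The two adaptation modules follow the same pattern: determine the environment, read its stored properties, and update only the part of the rendering they are responsible for. The following Python sketch illustrates that pattern with a plain dictionary standing in for the database; the property keys, the environment identifier, and the function name are hypothetical.

```python
# Hypothetical property names; the disclosure does not enumerate keys.
REFLECTION_KEYS = ("reflection_falloff", "reflection_pattern")
REVERB_KEYS = ("t60_by_band_s",)

def adapt_rendering(render_state, database, environment_id, keys):
    """Return a copy of render_state updated with the stored properties.

    Only the given keys are touched, so a reflection module and a
    reverberation module can share one database without overwriting
    each other's parameters.
    """
    properties = database.get(environment_id)
    if properties is None:
        return render_state  # unknown environment: leave rendering as-is
    updated = dict(render_state)
    for key in keys:
        if key in properties:
            updated[key] = properties[key]
    return updated
```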
[0096] With respect to the systems and methods described above, the elements of the systems and methods may be implemented by one or more computer processors (e.g., CPUs or DSPs) as appropriate. This disclosure is not limited to any particular configuration of computer hardware, including computer processors, used to implement these elements. In some cases, multiple computer systems may be employed to implement the systems and methods described above. For example, a first computer processor (e.g., a processor in a wearable device coupled to a microphone) may be used to receive input microphone signals and perform initial processing of those signals (e.g., signal conditioning and / or segmentation, such as described above). A second (possibly more computationally powerful) processor may then be used to perform more computationally intensive processing, such as determining probability values associated with speech segments of those signals. Another computer device, such as a cloud server, may host a speech recognition engine to which the input signals are ultimately supplied. Other suitable configurations will be apparent and are within the scope of this disclosure.
[0097] While the disclosed embodiments have been fully described with reference to the accompanying drawings, it should be noted that various changes and modifications will be obvious to those skilled in the art. For example, elements of one or more implementations may be combined, removed, modified, or complemented to form further implementations. Such changes and modifications are understood to fall within the scope of the disclosed embodiments as defined by the appended claims.
Claims
1. A method comprising: determining an intermediate audio signal based on a location of a sound source and further based on acoustic properties of a virtual environment, wherein the intermediate audio signal corresponds to a reflection of an input audio signal by a surface of the virtual environment, wherein determining the intermediate audio signal includes encoding the input audio signal based on the location of the sound source, and wherein the intermediate audio signal is associated with a first plurality of channels; associating the intermediate audio signal with a bus, the bus being associated with the first plurality of channels; and presenting an output audio signal to a listener via the bus, wherein the output audio signal is determined by decoding the intermediate audio signal to generate a decoded intermediate audio signal, the decoded intermediate audio signal being associated with a second plurality of channels different from the first plurality of channels.
2. The method according to claim 1, wherein the acoustic properties of the virtual environment are determined via one or more sensors.
3. The method according to claim 2, wherein the one or more sensors include one or more microphones.
4. The method according to claim 2, wherein the one or more sensors include one or more cameras.
5. The method according to claim 2, wherein the one or more sensors are associated with a wearable head device configured to be worn by the listener, and wherein the output audio signal is presented to the listener via one or more speakers associated with the wearable head device.
6. The method according to claim 1, further comprising displaying a view of the virtual environment to the listener in parallel with the presentation of the output audio signal.
7. The method according to claim 1, further comprising reading the acoustic properties from a database, wherein the acoustic properties are determined via one or more sensors.
8. The method according to claim 7, wherein reading the acoustic properties includes: determining a location of the listener based on an output of the one or more sensors; and identifying the acoustic properties based on the location of the listener.
9. The method according to claim 1, wherein the intermediate audio signal is decoded via an ambisonic decoder.
10. The method according to claim 1, wherein the acoustic properties are determined via a first device, and the intermediate audio signal is determined via a second device different from the first device.
11. The method according to claim 1, wherein the virtual environment is part of a mixed reality environment, the surface of the virtual environment includes a surface of the mixed reality environment, and the location of the sound source includes a location in the mixed reality environment.
12. A system comprising: one or more speakers; and one or more processors configured to perform a method, the method comprising: determining an intermediate audio signal based on a location of a sound source and further based on acoustic properties of a virtual environment, wherein the intermediate audio signal corresponds to a reflection of an input audio signal by a surface of the virtual environment, wherein determining the intermediate audio signal includes encoding the input audio signal based on the location of the sound source, and wherein the intermediate audio signal is associated with a first plurality of channels; associating the intermediate audio signal with a bus, the bus being associated with the first plurality of channels; and presenting an output audio signal to a listener via the bus and the one or more speakers, wherein the output audio signal is determined by decoding the intermediate audio signal to generate a decoded intermediate audio signal, the decoded intermediate audio signal being associated with a second plurality of channels different from the first plurality of channels.
13. The system according to claim 12, further comprising one or more sensors, wherein the acoustic properties of the virtual environment are determined via the one or more sensors.
14. The system according to claim 13, wherein the one or more sensors include one or more microphones.
15. The system according to claim 13, wherein the one or more sensors include one or more cameras.
16. The system according to claim 13, wherein the one or more sensors and the one or more speakers are associated with a wearable head device configured to be worn by the listener.
17. The system according to claim 12, further comprising a display, wherein the method further includes displaying a view of the virtual environment to the listener via the display in parallel with the presentation of the output audio signal.
18. The system according to claim 12, wherein the acoustic properties are determined via a first device, and the intermediate audio signal is determined via a second device different from the first device.
19. The system according to claim 12, wherein the virtual environment is part of a mixed reality environment, the surface of the virtual environment includes a surface of the mixed reality environment, and the location of the sound source includes a location in the mixed reality environment.
20. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method, the method comprising: determining an intermediate audio signal based on a location of a sound source and further based on acoustic properties of a virtual environment, wherein the intermediate audio signal corresponds to a reflection of an input audio signal by a surface of the virtual environment, wherein determining the intermediate audio signal includes encoding the input audio signal based on the location of the sound source, and wherein the intermediate audio signal is associated with a first plurality of channels; associating the intermediate audio signal with a bus, the bus being associated with the first plurality of channels; and presenting an output audio signal to a listener via the bus, wherein the output audio signal is determined by decoding the intermediate audio signal to generate a decoded intermediate audio signal, the decoded intermediate audio signal being associated with a second plurality of channels different from the first plurality of channels.