Binaural stereo playback method, apparatus, medium, and device
By classifying sound sources and employing different processing strategies, the problem of excessive computational load in multi-sound-source scenarios was solved, achieving efficient binaural stereo playback and improving system performance and auditory experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2024-12-25
- Publication Date
- 2026-06-26
AI Technical Summary
Existing binaural spatial sound reproduction technology has an excessive computational load in multi-sound source scenarios, leading to problems such as stuttering and audio-visual asynchrony.
Sound sources are classified into different types and strategies with different processing complexities are adopted, including HRTF and BP algorithms, and targeted processing is carried out based on the auditory sensitivity attributes of the sound sources.
While achieving binaural spatial sound reproduction, it reduces computational overhead and improves the system's real-time performance and auditory experience.
Figure CN122294064A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of audio processing technology, and in particular to a binaural stereo playback method, a binaural stereo playback device, a computer-readable storage medium, and an electronic device.
Background Technology
[0002] Binaural spatial sound reproduction technology utilizes the binaural effect to reproduce a realistic sound experience. Based on how the human ear receives and processes sound, it simulates the human ear's sound reception process to achieve a more realistic auditory experience. Binaural spatial sound reproduction technology is widely used in cinemas, concerts, games, and other fields to provide a more realistic and immersive auditory experience.
[0003] The binaural spatial sound reproduction solution provided by related technologies employs head-related transfer function (HRTF) technology. However, this technology suffers from high computational overhead. For example, in real-time multi-sound-source applications such as video games and virtual reality scenarios, the computational load for spatial sound reproduction is high if the aforementioned technology is used, which can easily lead to problems such as stuttering, audio-visual asynchrony, and unresponsive operation.
Summary of the Invention
[0004] This application provides a binaural stereo playback method, a binaural stereo playback device, a computer-readable storage medium, and an electronic device. It classifies various existing sound sources and adopts different processing strategies with different processing complexities for different sound sources, thereby reducing computational overhead while realizing binaural spatial sound playback.
[0005] In a first aspect, embodiments of this application provide a method for reproducing binaural stereo sound. The method includes: determining N sound sources currently existing in the environment where a target object is located, where N is a positive integer; classifying the N sound sources according to their respective auditory sensitivity attributes, wherein the auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, where i is a positive integer not greater than N; performing processing strategies with different processing complexities for different types of sound sources to determine the binaural stereo sound corresponding to each of the N sound sources; and mixing the binaural stereo sound corresponding to each of the N sound sources to obtain the current binaural stereo sound of the target object.
[0006] In a second aspect, embodiments of this application provide a binaural stereo playback device, which includes: a first determining module, a classification module, a first processing module, and a second processing module.
[0007] The first determining module is used to determine N sound sources currently existing in the environment where the target object is located, where N is a positive integer; the classification module is used to classify the N sound sources according to the auditory sensitivity attributes corresponding to the N sound sources, where the auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, where i is a positive integer not greater than N; the first processing module is used to execute processing strategies with different processing complexities for different types of sound sources to determine the binaural stereo corresponding to the N sound sources; and the second processing module is used to perform mixing processing on the binaural stereo corresponding to the N sound sources to obtain the current binaural stereo of the target object.
[0008] In an exemplary embodiment, based on the above scheme, the apparatus further includes: an acquisition module and a second determination module;
[0009] The acquisition module is used to: before the classification module classifies the N sound sources according to their respective auditory sensitivity attributes, acquire the corresponding sound source features and audio features for the i-th sound source, wherein the sound source features include: azimuth angle movement speed, and the audio features include: relative loudness, where i is a positive integer not greater than N; the second determination module is used to determine the i-th relative loudness based on the distance and loudness between the i-th sound source and the target object; wherein the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness, or the i-th relative loudness and the azimuth angle movement speed of the i-th sound source.
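The paragraph above states that the i-th relative loudness is determined from the source's loudness and its distance to the target object, but it does not give a formula. A minimal sketch follows, assuming an inverse-distance law on a linear amplitude scale; the function name, the reference distance, and the clamping behavior are all hypothetical and are not taken from this application.

```python
def relative_loudness(source_loudness: float, distance: float,
                      ref_distance: float = 1.0) -> float:
    """Estimate how loud a source is at the listener (linear scale).

    Assumption: loudness falls off as ref_distance / distance. Distances
    shorter than ref_distance are clamped so the result never exceeds the
    source's own loudness.
    """
    if source_loudness < 0.0 or ref_distance <= 0.0:
        raise ValueError("loudness must be >= 0 and ref_distance > 0")
    effective_distance = max(distance, ref_distance)
    return source_loudness * ref_distance / effective_distance


print(relative_loudness(1.0, 2.0))  # a source twice as far is half as loud
```

Any other monotonically decreasing distance model could be substituted without changing how the later classification step uses the value.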
[0010] In an exemplary embodiment, based on the above scheme, the classification module is specifically used to: determine that the i-th sound source belongs to a first type when the auditory sensitivity attribute of the i-th sound source meets the first preset condition; or, determine that the i-th sound source belongs to a second type when the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition; wherein, the human ear perception sensitivity of the target object to the sound source of the first type is greater than the human ear perception sensitivity to the sound source of the second type.
[0011] In an exemplary embodiment, based on the above scheme, the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness; the above device further includes: a third determining module;
[0012] The third determining module is configured to: determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; if the i-th relative loudness is greater than the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition; and if the i-th relative loudness is less than the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not satisfy the first preset condition.
[0013] In an exemplary embodiment, based on the above scheme, the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness and the azimuth angle movement speed of the i-th sound source; the above device further includes: a fourth determining module;
[0014] The fourth determining module is used to: determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; if the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the i-th relative loudness is greater than the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition; and if the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the i-th relative loudness is greater than the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition.
[0015] In an exemplary embodiment, based on the above scheme, the fourth determining module is further configured to: if the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the i-th relative loudness is less than or equal to the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition; if the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the i-th relative loudness is less than or equal to the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0016] In an exemplary embodiment, based on the above scheme, when the azimuth angle movement speed of the i-th sound source is greater than the first threshold, the corresponding influence coefficient is a first positive value less than 1; when the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, the corresponding influence coefficient is 1.
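The threshold rule in the preceding paragraphs can be sketched as follows. This is an illustrative reading, not the applicant's implementation: the class name, the function names, and the default coefficient of 0.5 are hypothetical; the application only requires the coefficient to be a positive value less than 1 for fast-moving sources and 1 otherwise.

```python
from dataclasses import dataclass


@dataclass
class Source:
    relative_loudness: float   # linear scale, already distance-adjusted
    azimuth_speed: float       # degrees per second


def speed_coefficient(azimuth_speed: float, first_threshold: float,
                      fast_coeff: float = 0.5) -> float:
    """Influence coefficient: < 1 for fast-moving sources, otherwise 1."""
    return fast_coeff if azimuth_speed > first_threshold else 1.0


def classify(src: Source, original_threshold: float,
             first_threshold: float, fast_coeff: float = 0.5) -> int:
    """Return 1 (first type, more perceptible) or 2 (second type)."""
    weighted = original_threshold * speed_coefficient(
        src.azimuth_speed, first_threshold, fast_coeff)
    # Strictly greater than the weighted threshold satisfies the first
    # preset condition; less than or equal does not.
    return 1 if src.relative_loudness > weighted else 2


print(classify(Source(0.4, 100.0), original_threshold=0.6, first_threshold=30.0))
print(classify(Source(0.4, 10.0), original_threshold=0.6, first_threshold=30.0))
```

Because the coefficient lowers the threshold for fast-moving sources, the same relative loudness can place a moving source in the first type while a stationary one falls into the second type.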
[0017] In an exemplary embodiment, based on the above scheme, the classification module is specifically used to: if N is less than the second threshold, classify the N sound sources according to their respective auditory sensitivity attributes; or, if N is greater than or equal to the second threshold, classify the sound sources whose loudness meets the second preset condition according to their corresponding auditory sensitivity attributes, and determine the sound sources whose loudness does not meet the second preset condition as belonging to the second type.
[0018] In an exemplary embodiment, based on the above scheme, the second preset condition is that the loudness ranks among the top M from largest to smallest; or, the ranking belongs to the top x%; where M is a positive integer less than N, and x is a positive integer.
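A minimal sketch of the pre-selection described above, using the top-M form of the second preset condition. The function name and return shape are hypothetical; the sketch only decides which sources go on to attribute-based classification and which are assigned to the second type directly.

```python
def preselect(loudness: list[float], second_threshold: int,
              top_m: int) -> tuple[list[int], list[int]]:
    """Split source indices into (to_classify, forced_second_type).

    If there are fewer sources than second_threshold, every source is
    classified by its auditory sensitivity attribute. Otherwise only the
    top_m loudest sources are, and the rest go straight to the second type.
    """
    n = len(loudness)
    if n < second_threshold:
        return list(range(n)), []
    ranked = sorted(range(n), key=lambda i: loudness[i], reverse=True)
    return sorted(ranked[:top_m]), sorted(ranked[top_m:])


print(preselect([0.1, 0.9, 0.5, 0.7], second_threshold=3, top_m=2))
```

The top-x% variant would differ only in how `top_m` is derived from N.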
[0019] In an exemplary embodiment, based on the above scheme, the first processing module includes a first processing unit and a second processing unit; wherein, the first processing unit is used to: for the sound sources of the first type among the N sound sources, execute a first binaural stereo playback strategy to determine the corresponding first type of binaural stereo; the second processing unit is used to: for the sound sources of the second type among the N sound sources, execute a second binaural stereo playback strategy to determine the corresponding second type of binaural stereo; wherein, the processing complexity of the first binaural stereo playback strategy is greater than the processing complexity of the second binaural stereo playback strategy.
[0020] In an exemplary embodiment, based on the above scheme, the first processing unit is specifically configured to: for the j-th sound source in the first type of sound source, determine the j-th distance and j-th spatial angle of the j-th sound source relative to the target object, where j is a positive integer not greater than the number Q of the first type of sound sources, and Q is a positive integer not greater than N; attenuate the mono audio signal of the j-th sound source according to the j-th distance; determine the left ear head correlation impulse response and right ear head correlation impulse response corresponding to the j-th sound source according to the j-th spatial angle; convolve the distance-attenuated mono audio signal with the left ear head correlation impulse response to generate the left ear audio signal of the j-th sound source; and convolve the distance-attenuated mono audio signal with the right ear head correlation impulse response to generate the right ear audio signal of the j-th sound source.
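The first-type processing described above (distance attenuation followed by one convolution per ear) can be sketched in plain Python. Several details are assumptions made for illustration: the inverse-distance attenuation, the function names, and the fact that the left and right impulse responses are passed in directly rather than looked up from a measured database by spatial angle.

```python
def convolve(signal: list[float], impulse: list[float]) -> list[float]:
    """Direct time-domain (full) convolution."""
    if not signal or not impulse:
        return []
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse):
            out[n + k] += s * h
    return out


def render_first_type(mono: list[float], distance: float,
                      hrir_left: list[float], hrir_right: list[float],
                      ref_distance: float = 1.0
                      ) -> tuple[list[float], list[float]]:
    """Attenuate by distance, then convolve with each ear's impulse response."""
    # Assumed attenuation model: gain = ref_distance / distance, capped at 1.
    gain = ref_distance / max(distance, ref_distance)
    attenuated = [gain * s for s in mono]
    return convolve(attenuated, hrir_left), convolve(attenuated, hrir_right)


left, right = render_first_type([1.0, 0.0], 2.0, [1.0, 0.5], [0.5])
print(left, right)
```

The nested loop makes the cost visible: each input sample is multiplied by every tap of both impulse responses, which is why this strategy is reserved for the sources the listener is most sensitive to.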
[0021] In an exemplary embodiment, based on the above scheme, the second processing unit is specifically configured to: for the k-th sound source in the second type of sound source, determine the k-th azimuth angle and the k-th distance of the k-th sound source relative to the target object, where k is a positive integer not greater than the number N-Q of the second type of sound sources, and Q is a positive integer not greater than N; calculate the first amplitude gain corresponding to the left ear and the second amplitude gain corresponding to the right ear based on the k-th azimuth angle; calculate the time difference between the arrival of the k-th sound source signal at the left and right ears based on the k-th distance; adjust the amplitude of the audio signal of the k-th sound source based on the first amplitude gain to obtain the left ear audio signal; and, when the right ear is far from the k-th sound source, adjust the amplitude of the audio signal of the k-th sound source based on the second amplitude gain, and perform phase modulation on the amplitude-adjusted audio signal based on the time difference to obtain the right ear audio signal.
[0022] In an exemplary embodiment, based on the above scheme, the second processing module is configured to, for the j-th sound source of the first type and the k-th sound source of the second type, perform a mixing process on the left ear audio signal of the j-th sound source and the left ear audio signal of the k-th sound source to obtain the left ear audio signal in the binaural stereo; and perform a mixing process on the right ear audio signal of the j-th sound source and the right ear audio signal of the k-th sound source to obtain the right ear audio signal in the binaural stereo; wherein j takes values 1, 2, ..., Q, where Q is the number of sound sources of the first type, and k takes values 1, 2, ..., N-Q, where N-Q is the number of sound sources of the second type.
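The lower-complexity second-type processing can be sketched as a per-ear gain plus a delay on the far ear. The following is one possible reading and makes several assumptions that are not in this application: constant-power panning is used for the two amplitude gains, the "phase modulation" is approximated by a whole-sample delay, the interaural time difference is passed in as a parameter rather than derived from the distance, and azimuths beyond ±90 degrees are clamped.

```python
import math


def pan_gains(azimuth_deg: float) -> tuple[float, float]:
    """Constant-power gains: -90 deg is fully left, +90 deg is fully right."""
    az = max(-90.0, min(90.0, azimuth_deg))       # clamp rear angles
    theta = (az + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)


def render_second_type(mono: list[float], azimuth_deg: float,
                       time_difference_s: float, sample_rate: int = 48_000
                       ) -> tuple[list[float], list[float]]:
    """Scale each ear by its gain and delay the ear farther from the source."""
    g_left, g_right = pan_gains(azimuth_deg)
    left = [g_left * s for s in mono]
    right = [g_right * s for s in mono]
    delay = round(abs(time_difference_s) * sample_rate)
    pad = [0.0] * delay
    if azimuth_deg < 0.0:        # source on the left: right ear is far
        left, right = left + pad, pad + right
    elif azimuth_deg > 0.0:      # source on the right: left ear is far
        left, right = pad + left, right + pad
    return left, right


left, right = render_second_type([1.0, 1.0], -90.0, 2 / 48_000)
print(left, right)
```

Compared with convolution, this costs one multiplication per sample per ear, independent of any impulse response length.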
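The mixing step amounts to summing all left-ear signals into one left channel and all right-ear signals into one right channel. A minimal sketch, assuming plain sample-wise addition with zero padding for shorter signals (the function names are hypothetical, and no normalization or clipping protection is applied):

```python
def mix(signals: list[list[float]]) -> list[float]:
    """Sample-wise sum; shorter signals are treated as zero-padded."""
    length = max((len(s) for s in signals), default=0)
    out = [0.0] * length
    for signal in signals:
        for n, value in enumerate(signal):
            out[n] += value
    return out


def mix_binaural(rendered: list[tuple[list[float], list[float]]]
                 ) -> tuple[list[float], list[float]]:
    """Mix per-source (left, right) pairs into one binaural stereo pair."""
    lefts = [left for left, _ in rendered]
    rights = [right for _, right in rendered]
    return mix(lefts), mix(rights)


print(mix_binaural([([1.0, 2.0], [0.5]), ([1.0], [0.5, 0.5])]))
```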
[0023] In an exemplary embodiment, based on the above scheme, when the relative orientation information between the i-th sound source and the target object does not meet the third preset condition, the corresponding influence coefficient is a second positive value less than 1; when the relative orientation between the i-th sound source and the target object meets the third preset condition, the corresponding influence coefficient is 1; wherein, the third preset condition is: the sound source and the head of the target object are located within a preset range directly in front of each other on the same horizontal plane.
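The orientation-dependent coefficient above can be sketched as a simple range check. The angular limits and the default coefficient of 0.8 below are hypothetical placeholders; the application only specifies a preset range directly in front on the same horizontal plane and a positive value less than 1 outside it.

```python
def orientation_coefficient(azimuth_deg: float, elevation_deg: float,
                            front_half_width_deg: float = 15.0,
                            elevation_tolerance_deg: float = 5.0,
                            off_axis_coeff: float = 0.8) -> float:
    """Return 1 for sources directly in front, otherwise a value below 1."""
    directly_in_front = (abs(azimuth_deg) <= front_half_width_deg
                         and abs(elevation_deg) <= elevation_tolerance_deg)
    return 1.0 if directly_in_front else off_axis_coeff


print(orientation_coefficient(0.0, 0.0), orientation_coefficient(60.0, 0.0))
```

Multiplying the original loudness threshold by this coefficient lowers the threshold for off-axis sources, so they qualify for the first type at a lower relative loudness.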
[0024] In a third aspect, embodiments of this application provide an electronic device, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to perform the binaural stereo playback method provided in the first aspect.
[0025] In a fourth aspect, embodiments of this application provide a chip for implementing the binaural stereo playback method provided in the first or second aspect. Specifically, the chip includes a processor for retrieving and running a computer program from a memory, causing a device equipped with the chip to execute the binaural stereo playback method provided in the first aspect.
[0026] In a fifth aspect, embodiments of this application provide a computer-readable storage medium for storing a computer program that causes a computer to execute the binaural stereo playback method provided in the first aspect.
[0027] In a sixth aspect, embodiments of this application provide a computer program product, including computer program instructions that cause a computer to execute the binaural stereo playback method provided in the first aspect.
[0028] In a seventh aspect, embodiments of this application provide a computer program that, when run on a computer, causes the computer to execute the binaural stereo playback method provided in the first aspect.
[0029] In summary, the binaural stereo playback scheme provided in this application identifies multiple sound sources (denoted N, where N is a positive integer) currently existing in the environment where the target object is located. These N sound sources are then classified according to their respective auditory sensitivity attributes. Furthermore, different processing strategies with varying complexity are applied to different types of sound sources to determine the binaural stereo sound corresponding to each of the N sound sources. The binaural stereo sound corresponding to each of the N sound sources is then mixed to obtain the current binaural stereo sound for the target object. In this application, the classification of multiple sound sources based on their auditory sensitivity attributes allows for the targeted application of different processing strategies with varying complexity to sound sources with different auditory sensitivity attributes. This, in turn, helps reduce computational overhead while achieving binaural spatial sound playback.
Brief Description of the Drawings
[0030] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0031] Figure 1 is a schematic diagram of the system architecture for the application environment of a binaural stereo playback scheme provided in an embodiment of this application;
[0032] Figure 2 is a flowchart of a binaural stereo playback method provided in an embodiment of this application;
[0033] Figure 3A and Figure 3B are schematic scene diagrams of the binaural stereo playback method provided in the embodiments of this application;
[0034] Figure 4 is a schematic diagram of multiple sound sources in a current environment provided in an embodiment of this application;
[0035] Figure 5 is a flowchart of a method for classifying multiple sound sources in the current environment provided in an embodiment of this application;
[0036] Figure 6 is a schematic diagram of binaural stereo processing of a sound source belonging to the first type provided in an embodiment of this application;
[0037] Figure 7 is a schematic diagram of binaural stereo processing of a sound source belonging to the second type provided in an embodiment of this application;
[0038] Figure 8 is a schematic diagram of a process for binaural stereo processing of multiple sound sources in the current environment provided in an embodiment of this application;
[0039] Figure 9 is a schematic block diagram of a binaural stereo playback device provided in an embodiment of this application;
[0040] Figure 10 is a schematic block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0041] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0042] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in sequences other than those illustrated or described herein. In embodiments of this application, "B corresponding to A" means that B is associated with A. In one implementation, B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and / or other information. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or devices. In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0043] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0044] Binaural spatial sound reproduction technology utilizes the binaural effect to recreate a realistic sound experience. Based on how the human ear receives and processes sound, it simulates the human ear's sound reception process to achieve a more realistic auditory experience. The binaural effect refers to the ability of people to determine the location of sound by relying on differences in volume, time, and timbre between their two ears.
[0045] Specifically, when a sound comes from directly in front of the listener, since the distance from the sound source to the left and right ears is equal, the time difference and timbre difference of the sound waves arriving at the left and right ears are both zero, and the sound is perceived as coming from directly in front of the listener. Theoretically, when the sound comes from directly behind the listener, the distance from the sound source to the left and right ears should still be equal, so the time difference should also be zero. However, in reality, due to the presence of the head, torso, and auricles, sound waves are affected by reflection, diffraction, and obstruction during propagation. These factors lead to subtle time differences and spectral variations. When the sound comes from other directions, because there is a certain distance between the left and right ears, the sound arrives at the two ears at different times, creating a time difference. This time difference can be used to determine the direction of the sound.
[0046] The realization of binaural spatial sound reproduction technology relies on a deep understanding and application of the binaural effect. It records and reproduces binaural signals to accurately simulate how the human ear receives sound, thus providing a more realistic and immersive auditory experience. This technology not only considers the physical characteristics of sound, such as volume, time difference, and timbre difference, but also involves knowledge of psychoacoustics and auditory physiology to ensure that the reproduced sound approximates the spatial characteristics of the original sound as closely as possible.
[0047] In practical applications, binaural spatial sound reproduction technology is widely used in cinemas, concerts, games, and other fields to provide a more realistic auditory experience. By accurately simulating how sound propagates in space, this technology allows viewers to experience a more realistic environment and a more immersive sound effect, thereby enhancing the overall movie-watching or gaming experience.
[0048] The following describes the methods provided by the related technologies in the embodiments of this application and the problems therein.
[0049] Existing binaural spatial sound reproduction schemes, in order to achieve a spatial sound effect closer to the real world, are mainly based on the Head-Related Transfer Function (HRTF). However, this scheme is computationally expensive. Specifically, it requires two convolution processes for each sampling point of each sound source (e.g., 48,000 samples per second at a 48kHz sampling rate). The computational cost increases linearly with the number of sound sources. If the application scenario involves multiple sound sources (such as large-scale virtual reality games with dozens or hundreds of sound sources), the computational load from processing each source with HRTF becomes excessive, potentially impacting the real-time performance of other components (such as screen display), thus hindering the overall real-time operation of the application. For example, excessive computational overhead can lead to problems such as audio-visual desynchronization, stuttering, and unresponsiveness.
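As a rough worked example of the linear scaling described above, the sketch below counts multiply-accumulate operations for direct time-domain convolution, where each output sample of each ear needs one operation per impulse-response tap. The 256-tap impulse response length and the function name are assumptions chosen for illustration and do not come from this application.

```python
def hrtf_macs_per_second(num_sources: int, sample_rate: int = 48_000,
                         hrir_taps: int = 256) -> int:
    """Multiply-accumulates per second for direct two-ear HRIR convolution."""
    ears = 2  # one convolution for the left ear, one for the right
    return num_sources * ears * sample_rate * hrir_taps


for count in (1, 10, 100):
    print(count, hrtf_macs_per_second(count))
```

Under these assumptions, 100 sources would need about 2.46 billion multiply-accumulates per second, all of which must fit inside the same real-time budget as rendering and game logic.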
[0050] To address the aforementioned issues, the binaural stereo playback scheme provided in this application identifies multiple sound sources (e.g., N, where N is a positive integer) currently existing in the environment where the target object is located. These N sound sources are then classified according to their respective auditory sensitivity attributes. Furthermore, different processing strategies with varying complexity are applied to different types of sound sources to determine the binaural stereo sound corresponding to each of the N sound sources. The binaural stereo sound corresponding to each of the N sound sources is then mixed to obtain the current binaural stereo sound for the target object. In the embodiments of this application, classifying the multiple sound sources based on their auditory sensitivity attributes allows for the application of different processing strategies with varying complexity to sound sources with different auditory sensitivity attributes. For example, only some of the N sound sources are rendered as binaural stereo sound based on the head-related transfer function (HRTF), while the remaining sound sources are rendered with a lower-complexity strategy. Therefore, the scheme provided in the embodiments of this application not only enables binaural spatial sound playback but also helps reduce computational overhead.
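The overall flow of the scheme (classify, dispatch each source to a renderer of matching complexity, then mix) can be sketched as below. The function and parameter names are hypothetical, and the classifier and both renderers are passed in as callables so that the sketch stays independent of how each step is actually realized.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")
Stereo = Tuple[List[float], List[float]]


def _accumulate(total: List[float], signal: List[float]) -> None:
    """Add signal into total in place, growing total if needed."""
    if len(signal) > len(total):
        total.extend([0.0] * (len(signal) - len(total)))
    for n, value in enumerate(signal):
        total[n] += value


def render_scene(sources: Iterable[S],
                 is_first_type: Callable[[S], bool],
                 render_high: Callable[[S], Stereo],
                 render_low: Callable[[S], Stereo]) -> Stereo:
    """Render each source with the strategy for its type and mix the results."""
    left_mix: List[float] = []
    right_mix: List[float] = []
    for src in sources:
        render = render_high if is_first_type(src) else render_low
        left, right = render(src)
        _accumulate(left_mix, left)
        _accumulate(right_mix, right)
    return left_mix, right_mix


# Toy usage: a source is just a loudness value; the renderers are stand-ins.
result = render_scene(
    [1.0, 0.25],
    is_first_type=lambda s: s > 0.5,
    render_high=lambda s: ([s, s], [s, s]),
    render_low=lambda s: ([s], [s]),
)
print(result)
```

The saving comes entirely from how many sources take the `render_low` branch, so the classifier's thresholds act as the tuning knob between quality and cost.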
[0051] The binaural stereo playback method of this application will be described in detail below through some embodiments. These embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0052] For example, Figure 1 is a schematic diagram of the system architecture for an application environment of a binaural stereo playback scheme provided in an embodiment of this application. As shown in Figure 1, the system architecture of the implementation environment of this embodiment may include a terminal 102 and a server 104 that executes the binaural stereo playback scheme. The connection 11 between the terminal 102 and the server 104 can be a wired communication link or a wireless communication link, for example through a local area network (LAN) or a wide area network (WAN).
[0053] For example, server 104 identifies multiple sound sources currently existing in the environment where the target object is located. Server 104 classifies the multiple sound sources according to the auditory sensitivity attributes corresponding to each sound source. Further, server 104 executes processing strategies with different processing complexities for different types of sound sources to determine the binaural stereo sound corresponding to each sound source. Server 104 performs mixing processing on the binaural stereo sound corresponding to the multiple sound sources to obtain the current binaural stereo sound of the target object.
[0054] For example, server 104 can be a standalone physical server, a server cluster consisting of multiple physical servers, or a distributed system. Terminal 102 is used to output binaural stereo sound. For example, terminal 102 can be a computer, smartphone, tablet, smart voice interaction device, smart home appliance, vehicle terminal, aircraft, wearable smart device, medical device, etc. Terminal 102 may also be configured with a monitor, display screen, touch screen, touch panel, etc., but is not limited to these.
[0055] Figure 2 is a flowchart of a binaural stereo playback method P200 provided in an embodiment of this application. The execution entity of method P200 is a computing device such as a server or terminal.
[0056] In step S210, N sound sources currently existing in the environment where the target object is located are determined, where N is a positive integer.
[0057] The embodiments of this application can be applied to sound sources in virtual environments such as video games. For example, the embodiments of this application can be applied to video game applications (as shown in Figure 3A) and virtual space applications (as shown in Figure 3B), which contain various sound sources. These sound sources can be generated and constructed to meet the design needs of specific characters or virtual scenes in video games, and they possess real-time sound signals and location information. In virtual environments, such as video games, virtual reality applications, or augmented reality applications, various fictional sound sources are set up. These sound sources are pre-recorded by the application or generated programmatically to enhance immersion. Examples include sound sources from characters during dialogue, simulated vehicle sounds (driving or accelerating), simulated weather effects in natural environments (wind, rain), simulated background music or performance sounds in specific situations, and so on.
[0058] This application can also be applied to the real voices of actual participants. For example, in some applications, such as multiplayer online games, video conferencing, or remote collaboration tools, different users can be located in different geographical locations. The sound sources of each user and their surroundings can be collected through corresponding device terminals (such as mobile phones, computers, tablets, etc.). The collected sound signals are compressed into data formats suitable for network transmission (such as MP3, AAC, etc.) and sent to a server or target device terminal via the Internet or other communication networks. The voices of multiple users can be transmitted simultaneously to the same server or multiple receiving ends, forming a multi-source audio stream.
[0059] For example, referring to Figure 4, the sound sources in the current environment are identified as 11 sound sources, from "sound source 1" to "sound source 11". After the above N sound sources in the current environment reach the computing device currently executing the binaural stereo playback method, the processing described in the following embodiments is performed to achieve a three-dimensional spatial audio effect and provide a more realistic auditory experience.
[0060] Continuing to refer to Figure 2, in step S220, the N sound sources are classified according to their respective auditory sensitivity attributes. The auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, where i is a positive integer not greater than N.
[0061] In this embodiment, all sound sources are classified based on their auditory sensitivity attributes. Furthermore, different classifications of sound sources are processed using processing strategies of varying complexity. For example, for the sound source to which the target object is most auditoryly sensitive, the highest level of complexity is applied to ensure accurate (angle, auditory perception) binaural sound reproduction and playback. To save computational resources while maintaining auditory quality, the second-highest level of complexity is applied to the sound source to which the target object is less auditoryly sensitive. Similarly, the lowest level of complexity is applied to the sound source to which the target object is least auditoryly sensitive. The method of classifying and processing multiple sound sources provided in this embodiment not only achieves binaural stereo playback with high location perception accuracy but also reduces the resources required for sound source processing as needed. Furthermore, applying different processing strategies to different types of sound sources improves control flexibility.
[0062] In an exemplary embodiment, taking the i-th sound source as an example, an embodiment for obtaining its corresponding auditory sensitivity attribute is described.
[0063] In step S21, the sound source features and audio features of the i-th sound source are obtained.
[0064] In this embodiment, the speed of the sound source's movement and the distance between the sound source and the target object both affect the target object's hearing. Therefore, the sound source characteristics of the i-th sound source include: the azimuth angle movement speed of the i-th sound source and the distance between the i-th sound source and the target object. Here, azimuth angle refers to the angle of the sound source measured clockwise or counterclockwise from directly in front of the target object; for example, 0 degrees is directly in front of the target object, -90 degrees is to the left, +90 degrees is to the right, and ±180 degrees is directly behind. Azimuth angle displacement refers to the change in angle of the sound source in the target object's horizontal viewing angle as the sound source moves; for example, if a sound source moves from the left side (-90 degrees) to directly in front of the target object (0 degrees), its azimuth angle displacement is 90 degrees. If the sound source produces azimuth angle displacement, the resulting change in sound pressure difference and time difference between the left and right ears will more easily attract the target object's attention. Therefore, the azimuth angle movement speed of the i-th sound source is used as the sound source characteristic of that sound source.
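As a non-limiting illustration, the azimuth angle displacement and the azimuth angle movement speed described above might be computed as in the following Python sketch. The function names, the sampling interval `dt_seconds`, and the wrap-around handling at ±180 degrees (taking the shorter way around) are assumptions made for the example and are not specified by this embodiment.

```python
def wrap_degrees(angle: float) -> float:
    """Wrap an angle into the half-open interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def azimuth_displacement(prev_azimuth: float, curr_azimuth: float) -> float:
    """Smallest absolute change in azimuth (degrees) between two samples."""
    return abs(wrap_degrees(curr_azimuth - prev_azimuth))


def azimuth_speed(prev_azimuth: float, curr_azimuth: float, dt_seconds: float) -> float:
    """Azimuth angle movement speed in degrees per second."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    return azimuth_displacement(prev_azimuth, curr_azimuth) / dt_seconds
```

With these definitions, a sound source moving from the left side (-90 degrees) to directly in front (0 degrees) has a displacement of 90 degrees, matching the example in the text; if that movement takes 2 seconds, the movement speed is 45 degrees per second.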
[0065] In this embodiment, the audio characteristics of the i-th sound source include: the loudness of the i-th sound source.
[0066] Loudness is the subjective perception of sound intensity by the human ear. Sound pressure level (SPL) is an objective physical quantity used to measure sound intensity; it directly reflects changes in the pressure of sound waves. By measuring and calculating SPL, an objective and quantifiable index can be obtained to describe the energy level of sound. Therefore, this embodiment calculates the sound pressure level of the audio of each sound source.
[0067] Loudness is affected by various factors such as signal amplitude, frequency, and time. In the field of audio processing, the sensitivity of the human ear to sounds of different frequencies is taken into account. Therefore, the time-domain signal of the collected sound can be converted into a frequency-domain signal by Fourier transform, and the absolute power spectrum X(f) at different frequency points in the frequency domain can be calculated.
[0068] Furthermore, because the human ear has varying sensitivities to different frequencies of sound—for example, it is more sensitive to mid-frequency sounds and less sensitive to low and high frequencies—the following processing can more accurately reflect the actual sound intensity perceived by the human ear. To make the calculated Sound Pressure Level (SPL) more consistent with human perception, a perceptual weighting coefficient, cof(f), needs to be applied. This step adjusts the sound at different frequencies according to human ear sensitivity. Thus, the final SPL value more accurately reflects the actual perception of the human auditory system.
[0069] As shown in formula (1), the perceived weighted power spectrum Xc(f) is obtained by multiplying the perceived weighting coefficient cof(f) with the absolute power spectrum X(f).
[0070] Xc(f)= X(f) * cof(f) (1)
[0071] The perceptual weighting coefficient cof(f) can be calculated based on psychoacoustic equal-loudness curve data and is used to adjust the weights of different frequencies.
[0072] Next, in order to avoid the volume decision being affected by the short-term fluctuation characteristics of the audio signal, the perceptual weighted power spectrum Xc(f) is smoothed, as shown in formula (2).
[0073] Xcsm(i,f) =α*Xcsm(i-1,f) +(1-α)*Xc(i,f) (2)
[0074] Where i represents the time index, which is an integer greater than 1, f represents the frequency point, α is the smoothing coefficient (e.g., α is 0.95), and Xcsm(i,f) represents the perceptual weighted power spectrum after smoothing at time point i. This process helps eliminate transient noise or short-term fluctuations, making the results more stable.
[0075] Finally, the weighted smoothed power spectrum of each frequency point is summarized and converted into a sound pressure level description, as shown in formula (3).
[0076] $\mathrm{SPL}(i) = 10\log_{10}\left(\dfrac{\sum_{f} \mathrm{Xcsm}(i,f)}{A_{\mathrm{ref}}^{2}}\right)$ (3)
[0077] Here, A_ref represents the reference sound pressure, usually taken as 20 μPa, which is approximately the minimum sound pressure that the human ear can hear. SPL (sound pressure level) is a physical quantity describing sound energy, usually expressed in decibels (dB).
[0078] Step S21 extracts the sound pressure level (SPL) after perceptual weighting and smoothing from the acquired time-domain audio signal, thus providing a more accurate and stable loudness assessment metric. This method not only considers the physical characteristics of the audio signal but also incorporates the auditory characteristics of the human ear, making the loudness assessment closer to real-world perception.
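As a non-limiting illustration, the perceptual weighting of formula (1), the smoothing of formula (2), and the conversion to a sound pressure level might be sketched in Python as follows. The power spectrum is taken as a plain list of per-bin values. The conversion in `spl_db` (ten times the base-10 logarithm of the summed smoothed power divided by the squared reference pressure) is an assumed form chosen to be consistent with the surrounding definitions, and all function names are hypothetical.

```python
import math

A_REF = 20e-6  # reference sound pressure in pascals (20 μPa)


def weighted_spectrum(power, weights):
    """Formula (1): Xc(f) = X(f) * cof(f), per frequency bin."""
    if len(power) != len(weights):
        raise ValueError("power and weights must have the same length")
    return [x * c for x, c in zip(power, weights)]


def smooth_spectrum(prev_smoothed, current, alpha=0.95):
    """Formula (2): Xcsm(i,f) = alpha * Xcsm(i-1,f) + (1 - alpha) * Xc(i,f)."""
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_smoothed, current)]


def spl_db(smoothed, a_ref=A_REF):
    """Assumed conversion: 10 * log10(sum_f Xcsm(i,f) / a_ref**2)."""
    total = sum(smoothed)
    if total <= 0:
        return float("-inf")  # silence has no finite level
    return 10.0 * math.log10(total / (a_ref ** 2))
```

In use, the smoothed spectrum would be carried from frame to frame. How the first frame is initialized is not stated in this embodiment; seeding it with the first weighted spectrum is one possible choice. Under the assumed conversion, a summed power of 1 Pa² corresponds to roughly 94 dB.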
[0079] In step S22, the relative loudness of the i-th sound source is determined based on the distance between the i-th sound source and the target object and its loudness.
[0080] The concept of relative loudness (SPLR) is introduced to account for the impact of the distance between the sound source and the target object on sound perception. Specifically, due to factors such as air absorption and diffusion, sound attenuates as the propagation distance increases. Therefore, sound sources with the same sound pressure level (SPL) are perceived as differently loud at different distances.
[0081] Therefore, a standard reference distance r0 can be introduced as a reference, and the relative sound pressure level of each sound source mapped to the reference distance r0 can be used as the quantization value of the relative loudness (volume) of this invention. For example, if the distance between the sound source and the target object is r, and the sound pressure level of the sound source is SPL (in decibels dB), the relative sound pressure level SPLR mapped to the standard reference distance r0 can be obtained using the following formula (4).
[0082] $\mathrm{SPLR} = \mathrm{SPL} - 20\log_{10}\left(\dfrac{r}{r_0}\right)$ (4)
[0083] Where SPL represents the summative sound pressure level of the sound source; SPLR represents the relative summative sound pressure level mapped to the standard reference distance r0; r represents the distance between the sound source and the target object; and r0 represents the standard reference distance.
[0084] The relative loudness SPLR is closer to the actual perceived volume of the target object. Through mapping as shown in formula (4), the relative loudness of the sound source at different distances can be evaluated more accurately, thus providing users with a consistent audio experience.
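As a non-limiting illustration, the mapping to the standard reference distance might be sketched in Python as follows. The sketch assumes free-field spherical spreading, under which the level falls by 20·log10(r / r0) decibels relative to the reference distance; this specific form, the default r0 of 1 meter, and the function name are assumptions made for the example.

```python
import math


def relative_loudness_db(spl_db: float, r: float, r0: float = 1.0) -> float:
    """Assumed mapping: SPLR = SPL - 20 * log10(r / r0)."""
    if r <= 0 or r0 <= 0:
        raise ValueError("distances must be positive")
    return spl_db - 20.0 * math.log10(r / r0)
```

Under this assumption, a 60 dB sound source at 10 meters with r0 = 1 meter maps to a relative loudness of 40 dB, while the same source at the reference distance keeps its 60 dB.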
[0085] Since sound volume is the first response of the human ear to sound, in one embodiment of this application, the auditory sensitivity attribute corresponding to the i-th sound source includes its relative loudness, denoted as the i-th relative loudness (referred to as case A). Further, the classification is determined based on the relative loudness of the current sound source.
[0086] In one embodiment of this application, because a moving sound source (referring to one that has caused azimuthal displacement) produces a changing sound pressure difference and time difference on the left and right ears of a human, it is more likely to attract the attention of the target object. Therefore, in this embodiment, the auditory sensitivity attribute corresponding to the i-th sound source includes not only the i-th relative loudness but also the azimuthal movement speed of the i-th sound source (denoted as case B). In a further embodiment, the classification of the sound source will be comprehensively determined by combining the relative loudness of the current sound source and its azimuthal movement speed.
[0087] It should be noted that the human ear's ability to distinguish multiple sound sources is limited. Typically, when more than a certain number of sound sources (e.g., four or more) emit sound simultaneously, the human ear can focus only on the content of the individual sound sources it is interested in and ignores the others. This is especially true for directional information; that is, the human ear's ability to distinguish the locations of sound sources decreases significantly when there are more than a certain number of them. Therefore, in this embodiment, before the classification is determined from the auditory sensitivity data of the sound sources themselves, it can first be determined whether the number of sound sources in the current environment reaches a threshold (i.e., a second threshold). Specifically, Figure 5 is a flowchart of a method P500 for classifying multiple sound sources in the current environment, as provided in an embodiment of this application.
[0088] In step S510, it is determined whether the number N of sound sources currently existing in the environment where the target object is located is less than the second threshold.
[0089] If N is less than the second threshold, it indicates that there are not many sound sources in the current environment, which may be within the range of human ability to distinguish multiple sound sources. Therefore, these sound sources can be classified according to their respective auditory sensitivity attributes. Specifically, step S540 can be executed.
[0090] In step S540, it is determined whether the auditory sensitivity attribute of the sound source meets the first preset condition.
[0091] In case A above, the classification is determined based on the relative loudness of the current sound source. Specifically, the following steps can be performed.
[0092] In step SA-1, the i-th threshold X_i corresponding to the i-th relative loudness SPLR_i is determined.
[0093] In one embodiment, the i-th threshold X_i is a preset i-th original threshold X_{i,original}, for example, 60 dB.
[0094] In another embodiment, the i-th threshold X_i is the i-th weighted threshold X_{i,weighted}, obtained by multiplying the i-th original threshold X_{i,original} by the i-th influence coefficient β_i. The i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source.
[0095] Specifically, two types of binaural spatial sound reproduction algorithms are used: high-precision algorithms (such as the HRTF algorithm) and low-precision algorithms (such as the binaural panning (BP) algorithm based on a planar acoustic model). The relative performance of the two types depends on the spatial orientation of the sound source. When the sound source is on the same horizontal plane as the target object and directly in front of it, both algorithms reproduce the direction of the sound relatively accurately, so the difference between them is not significant. For example, this applies when the horizontal angle (azimuth) is close to 0 degrees (e.g., within ±30 degrees) and the vertical angle (elevation) is also close to 0 degrees (e.g., within ±30 degrees). In this embodiment, sound sources within this region are considered to meet the third preset condition. The ±30 degree range is only an example; smaller ranges are also acceptable. It should be noted that for sound sources that meet the third preset condition, the classification process of method P500 can be omitted, and a binaural spatial sound reproduction algorithm with lower azimuth-perception precision can be used directly to improve processing efficiency and save computational overhead.
[0096] However, when the sound source is not within the preset range directly in front of the target object's head on the same horizontal plane, i.e., when the third preset condition is not met, such as when the sound source is above or below the target object (specifically, when its angle with the horizontal plane exceeds 30 degrees) or behind the target object, the localization ability of the low-precision binaural spatial sound reproduction algorithm is significantly weakened. This is because such algorithms usually simplify the influence of factors such as head and ear shape and cannot accurately simulate complex sound wave propagation paths.
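As a non-limiting illustration, the check for the third preset condition might be sketched in Python as follows. The function name, the inclusive boundary at ±30 degrees, and the convention that the horizontal angle lies between -180 and +180 degrees are assumptions made for the example.

```python
def meets_third_condition(horizontal_deg: float, vertical_deg: float,
                          limit_deg: float = 30.0) -> bool:
    """True when the source lies within +/-limit_deg of straight ahead
    in both the horizontal and the vertical angle."""
    return abs(horizontal_deg) <= limit_deg and abs(vertical_deg) <= limit_deg
```

A sound source behind the target object (horizontal angle 180 degrees) or well above it (vertical angle 40 degrees) fails the check, so it would not be routed directly to the low-precision algorithm.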
[0097] Therefore, when the relative orientation between the i-th sound source and the target object does not satisfy the third preset condition (i.e., the sound source lies outside the preset range directly in front of the target object's head on the same horizontal plane), more sound sources can be assigned to the first category so that stereo reproduction accuracy is not degraded by processing too many of them with algorithms such as BP. These sound sources are then processed with a high-precision binaural spatial sound reproduction algorithm, which ensures the accuracy of the final binaural stereo simulation. Specifically, the orientation angle A_i of the i-th sound source relative to the target object meets the first condition (i.e., the third preset condition is not met) when the horizontal angle exceeds ±30 degrees, or when the horizontal angle is within ±30 degrees but the vertical angle exceeds ±30 degrees. In that case, the volume threshold can be lowered by setting the i-th influence coefficient β_{i1} to a second positive value less than 1, such as 0.5.
[0098] As mentioned earlier, moving sound sources (especially those that undergo azimuth displacement) create varying sound pressure differences and time differences between the left and right ears, making them more likely to attract the attention of the target object. Therefore, more such sound sources can be assigned to the first category, so that a high-precision binaural spatial sound reproduction algorithm is employed to ensure the accuracy of the final binaural stereo simulation. Specifically, when the azimuth angle movement speed V_i of the i-th sound source meets the second condition, namely that it is greater than the first threshold (e.g., 15 degrees/second), the i-th influence coefficient β_{i2} can be set to a first positive value less than 1, such as 0.9.
[0099] Therefore, if the orientation angle A_i of the i-th sound source relative to the target object meets the first condition and its azimuth angle movement speed V_i meets the second condition, the i-th weighted threshold can be expressed as formula (5).
[0100] X_{i,weighted} = X_{i,original} * β_{i1} * β_{i2} (5)
[0101] If the orientation angle A_i of the i-th sound source relative to the target object meets the first condition and its azimuth angle movement speed V_i does not meet the second condition, the i-th weighted threshold can be expressed as formula (6).
[0102] X_{i,weighted} = X_{i,original} * β_{i1} (6)
[0103] If the orientation angle A_i of the i-th sound source relative to the target object does not meet the first condition and its azimuth angle movement speed V_i meets the second condition, the i-th weighted threshold can be expressed as formula (7).
[0104] X_{i,weighted} = X_{i,original} * β_{i2} (7)
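As a non-limiting illustration, formulas (5) to (7) can be folded into a single Python function, since each condition that holds simply contributes its own influence coefficient. The function and parameter names are hypothetical, the defaults 0.5 and 0.9 are the example values given in the text, and returning the original threshold when neither condition holds is an assumption made for the example.

```python
def weighted_threshold(original_db: float,
                       first_condition: bool,
                       second_condition: bool,
                       beta_i1: float = 0.5,
                       beta_i2: float = 0.9) -> float:
    """Apply formulas (5)-(7); return the original threshold if neither condition holds."""
    threshold = original_db
    if first_condition:      # orientation outside the frontal region
        threshold *= beta_i1
    if second_condition:     # azimuth speed above the first threshold
        threshold *= beta_i2
    return threshold
```

With an original threshold of 60 dB, the function yields about 27 dB when both conditions hold (formula (5)), 30 dB when only the first holds (formula (6)), and about 54 dB when only the second holds (formula (7)).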
[0105] In step SA-2, if the i-th relative loudness SPLR_i is greater than the i-th threshold X_i, the volume of the current sound source is relatively high and easily attracts the attention of the target object, so it can be determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition.
[0106] In step SA-3, if the i-th relative loudness SPLR_i is less than the i-th threshold X_i, the volume of the current sound source is too low to attract the attention of the target object, so it can be determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0107] The i-th threshold X_i in steps SA-2 and SA-3 may be the i-th original threshold X_{i,original}. Alternatively, depending on whether the current sound source meets the first and second conditions, the i-th weighted threshold X_{i,weighted} shown in formulas (5) to (7) may be used. For example, if the orientation angle A_i of the sound source meets the first condition and its azimuth angle movement speed V_i does not meet the second condition, either X_{i,weighted} as in formula (6) or X_{i,original} may be used.
[0108] In case B above, the classification is determined by comprehensively considering the relative loudness of the current sound source and its azimuth angle movement speed. Specifically, the following steps can be performed.
[0109] In step SB-1, the i-th threshold X_i corresponding to the i-th relative loudness SPLR_i is determined, including the i-th weighted threshold X_{i,weighted} and the i-th original threshold X_{i,original}.
[0110] The i-th threshold X_i is determined as described in the embodiment corresponding to step SA-1, which is not repeated here.
[0111] In step SB-2, if the azimuth angle movement speed V_i of the i-th sound source is greater than the first threshold Y1 and the i-th relative loudness SPLR_i is greater than the i-th weighted threshold X_{i,weighted}, it is determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition.
[0112] Because V_i > Y1, and a sound source with a larger V_i is more likely to attract the attention of the target object, the i-th weighted threshold X_{i,weighted} of formula (5) or formula (7) can be used, according to the actual situation, to measure whether the relative loudness of the current sound source meets the condition. The sound source classification is thus determined in a targeted manner, which improves the accuracy of the classification result. In this embodiment, V_i > Y1 and SPLR_i > X_{i,weighted} indicate that the azimuth angle of the sound source changes quickly and its volume exceeds the threshold for the current conditions, so it is likely to attract the attention of the target object. Therefore, it can be determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition.
[0113] In step SB-3, if the azimuth angle movement speed V_i of the i-th sound source is less than or equal to the first threshold Y1 and the i-th relative loudness SPLR_i is greater than the i-th original threshold X_{i,original}, then although the azimuth angle movement speed has not reached the threshold Y1 (e.g., 15 degrees/second), the volume is relatively high and easily attracts the attention of the target object. Therefore, it can be determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition.
[0114] Because V_i ≤ Y1, and a sound source with a smaller V_i is less likely to attract the attention of the target object, the i-th original threshold X_{i,original}, or X_{i,weighted} as in formula (6), can be used to measure whether the relative loudness of the current sound source meets the condition. In this embodiment, V_i ≤ Y1 and SPLR_i > X_i indicate that the volume of the sound source exceeds the threshold for the current situation and is likely to attract the attention of the target object. Therefore, it can be determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition.
[0115] In step SB-4, if the azimuth angle movement speed V_i of the i-th sound source is greater than the first threshold Y1 and the i-th relative loudness SPLR_i is less than or equal to the i-th threshold X_i, it is determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0116] Because V_i > Y1, and a sound source with a larger V_i is more likely to attract the attention of the target object, the i-th weighted threshold X_{i,weighted} of formula (5) or formula (7) can be used, according to the actual situation, to measure whether the relative loudness of the current sound source meets the condition. In this embodiment, V_i > Y1 and SPLR_i ≤ X_{i,weighted} indicate that the azimuth angle of the sound source changes quickly but its volume does not exceed the threshold for the current conditions, so it is unlikely to attract the attention of the target object. Therefore, it can be determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0117] In step SB-5, if the azimuth angle movement speed V_i of the i-th sound source is less than or equal to the first threshold Y1 and the i-th relative loudness SPLR_i is less than or equal to the i-th threshold X_i, it is determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0118] Because V_i ≤ Y1, and a sound source with a smaller V_i is less likely to attract the attention of the target object, the i-th original threshold X_{i,original}, or X_{i,weighted} as in formula (6), can be used to measure whether the relative loudness of the current sound source meets the condition. In this embodiment, V_i ≤ Y1 and SPLR_i ≤ X_i indicate that the volume of the sound source does not exceed the threshold for the current situation, so it is unlikely to attract the attention of the target object. Therefore, it can be determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
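As a non-limiting illustration, steps SB-2 to SB-5 might be condensed into the following Python sketch. It selects the weighted threshold for fast-moving sound sources and the original threshold otherwise, which is only one of the options the text allows for slow-moving sources. The function and parameter names are hypothetical, and the default of 15 degrees per second for Y1 is the example value from the text.

```python
def meets_first_preset_condition_case_b(splr_db: float,
                                        azimuth_speed: float,
                                        original_db: float,
                                        weighted_db: float,
                                        speed_threshold_y1: float = 15.0) -> bool:
    """Steps SB-2 to SB-5: pick the threshold by speed, then compare loudness."""
    if azimuth_speed > speed_threshold_y1:
        threshold = weighted_db      # SB-2 / SB-4: fast-moving source
    else:
        threshold = original_db      # SB-3 / SB-5: slow or static source
    return splr_db > threshold
```

For example, with an original threshold of 60 dB and a weighted threshold of 54 dB, a 58 dB sound source moving at 20 degrees per second meets the first preset condition, whereas the same source moving at 10 degrees per second does not.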
[0119] The above embodiments provide specific implementations for determining whether the auditory sensitivity attribute of a sound source satisfies the first preset condition in cases A and B, respectively. Referring to Figure 5, after it is determined that the auditory sensitivity attribute of the i-th sound source meets the first preset condition, step S550 is executed: the i-th sound source is determined to belong to the first type. After it is determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition, step S530 is executed: the i-th sound source is determined to belong to the second type. The target object's auditory sensitivity to sound sources of the first type is greater than its auditory sensitivity to sound sources of the second type. That is, if the auditory sensitivity data of the current sound source meets the first preset condition, the target object has a high auditory sensitivity to the current sound source; if it does not, the target object has a relatively low auditory sensitivity to the current sound source.
[0120] Continuing to refer to Figure 5, the above embodiment describes the case where, in step S510, it is determined that the number N of sound sources currently existing in the environment where the target object is located is less than the second threshold. The other case is described next: the number N of sound sources currently existing in the environment where the target object is located is greater than or equal to the second threshold.
[0121] Referring to Figure 5, in step S520, it is determined whether the loudness of the sound source meets the second preset condition.
[0122] For example, the second preset condition is that the loudness ranks among the top M in descending order, where M is a positive integer less than N. Alternatively, the second preset condition is that the ranking belongs to the top x%, where x is a positive integer.
[0123] When there is a large number of sound sources in the current environment (e.g., exceeding the second threshold), they can be sorted according to their loudness (measured by sound pressure level, SPL). For example, if the current environment contains 6 sound sources, ranked from highest to lowest SPL, they are: source 1, source 2, source 5, source 6, source 4, and source 3. In one scenario, the top M (e.g., 4) sound sources can be identified as meeting the second preset condition, i.e., sound sources meeting the second preset condition are: source 1, source 2, source 5, and source 6. In another scenario, the sound sources ranking in the top x% (e.g., 50%) can be identified as meeting the second preset condition, i.e., sound sources meeting the second preset condition are: source 1, source 2, and source 5.
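As a non-limiting illustration, the two forms of the second preset condition might be sketched in Python as follows. The function names are hypothetical, and rounding the top x% down to a whole number of sound sources is an assumption made for the example.

```python
import math


def top_m_by_loudness(spl_by_source: dict, m: int) -> list:
    """Return the names of the m loudest sources, loudest first."""
    ranked = sorted(spl_by_source, key=spl_by_source.get, reverse=True)
    return ranked[:m]


def top_percent_by_loudness(spl_by_source: dict, x_percent: float) -> list:
    """Return the sources ranked in the top x percent by loudness."""
    m = math.floor(len(spl_by_source) * x_percent / 100.0)
    return top_m_by_loudness(spl_by_source, m)
```

With six sound sources whose hypothetical SPL values rank them as source 1, source 2, source 5, source 6, source 4, and source 3, selecting the top 4 returns sources 1, 2, 5, and 6, and selecting the top 50% returns sources 1, 2, and 5, in line with the example above.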
[0124] Continuing to refer to Figure 5, for sound sources that meet the second preset condition, step S540 is executed, which classifies them according to their own auditory sensitivity attributes, as described in the embodiment corresponding to step S540 above. For sound sources that do not meet the second preset condition, step S530 is executed: they are determined to be sound sources of the second type. That is, because sound sources whose loudness does not meet the second preset condition are less likely to attract the attention of the target object, they can be directly determined to be sound sources of the second type in order to save computing resources.
[0125] The above embodiments enable the classification of multiple sound sources in the current environment. Specifically, sound sources in the current environment can be divided into a first type and a second type based on the number of sound sources, the loudness ranking of the sound sources, and the auditory sensitivity data of the sound sources themselves.
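As a non-limiting illustration, the overall flow of method P500 might be sketched in Python as follows. The sensitivity check of step S540 is passed in as a callable so that either case A or case B can be plugged in. The function name, the parameter names, and the use of the top-M form of the second preset condition are assumptions made for the example.

```python
from typing import Callable, Dict, List


def classify_sources(loudness: Dict[str, float],
                     is_sensitive: Callable[[str], bool],
                     count_threshold: int,
                     top_m: int) -> Dict[str, int]:
    """Return {source: 1 or 2} following steps S510-S550 of method P500."""
    names: List[str] = list(loudness)
    if len(names) < count_threshold:                 # S510: few sources
        candidates = set(names)
    else:                                            # S520: keep the top-M loudest
        ranked = sorted(names, key=loudness.get, reverse=True)
        candidates = set(ranked[:top_m])
    # S540/S550 for candidates; every other source goes to S530 (second type).
    return {n: 1 if n in candidates and is_sensitive(n) else 2 for n in names}
```

Sound sources that fall outside the candidate set are assigned to the second type without evaluating `is_sensitive`, which reflects the computational saving described for step S530.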
[0126] Continuing to refer to Figure 2, in step S230, processing strategies with different processing complexities are executed for the different types of sound sources to determine the binaural stereo corresponding to each of the N sound sources.
[0127] In step S230-1, for the sound sources of the first type among the N sound sources, the first binaural stereo playback strategy is executed to determine the corresponding first type of binaural stereo.
[0128] In step S230-2, for the sound sources of the second type among the N sound sources, a second binaural stereo playback strategy is executed to determine the corresponding second type of binaural stereo; wherein, the processing complexity of the first binaural stereo playback strategy is greater than the processing complexity of the second binaural stereo playback strategy.
[0129] If the current sound source belongs to the first type, meaning that the target object's ear is highly sensitive to it, a more complex binaural stereo playback algorithm, such as the HRTF algorithm, is used. If the current sound source belongs to the second type, meaning that the target object's ear is relatively insensitive to it, a less complex binaural stereo playback algorithm, such as the BP algorithm, is used. On the one hand, for sound sources to which the human ear is highly sensitive, a more complex binaural stereo playback algorithm achieves more accurate sound source reproduction. On the other hand, for sound sources to which the human ear is less sensitive, a less complex binaural stereo playback algorithm saves computational overhead. It should also be noted that, even for sound sources to which the human ear is less sensitive, this embodiment processes each sound source separately with a binaural stereo playback algorithm, preserving the binaural stereo sound corresponding to each sound source. Compared with directly discarding or merging multiple sound sources of low sensitivity, this embodiment better reproduces the binaural stereo sound of all sound sources in the current environment.
[0130] The following embodiments describe an implementation of a first binaural stereo playback strategy for any sound source of the first type in the current environment, such as the j-th sound source, to perform binaural stereo playback processing.
[0131] The HRTF processing flow includes the following steps:
[0132] Step S31: Determine the distance and spatial angle (including horizontal and vertical angles) between each sound source of the first type and the target object.
[0133] For example, this includes both distance calculation and spatial angle calculation. For the distance calculation, the straight-line distance between a sound source, such as the j-th sound source of the first type, and the target object can be calculated from the three-dimensional coordinates (x_j, y_j, z_j) of the sound source and the three-dimensional coordinates (x_o, y_o, z_o) of the target object using the Euclidean distance formula, as shown in formula (8).
[0134] $d = \sqrt{(x_j - x_o)^2 + (y_j - y_o)^2 + (z_j - z_o)^2}$ (8)
[0135] For spatial angle calculations, this includes horizontal and vertical angles. The horizontal angle is the angle measured clockwise from the front of the target object (e.g., defined as 0 degrees) to the horizontal projection direction of the sound source. The vertical angle is the angle measured vertically from the horizontal plane to the sound source; upward is positive, and downward is negative.
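As a non-limiting illustration, step S31 might be sketched in Python as follows. The coordinate convention is an assumption made for the example: the target object faces the +y direction, +x is to its right, +z is up, and head rotation is ignored. The function name is hypothetical.

```python
import math


def distance_and_angles(source, listener):
    """Return (distance, horizontal_deg, vertical_deg) of source relative to listener.

    Assumes the listener faces +y with +x to the right and +z up.
    """
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    dz = source[2] - listener[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)            # Euclidean distance
    horizontal = math.degrees(math.atan2(dx, dy))                # clockwise from front
    vertical = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # up is positive
    return distance, horizontal, vertical
```

Under this convention, a sound source one meter to the right of the target object yields a horizontal angle of about +90 degrees and a vertical angle of 0 degrees, consistent with the angle definitions used in this application.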
[0136] Step S32: Based on the principle that sound volume attenuates with distance, perform mono audio signal attenuation processing according to the actual distance between the sound source and the target object to simulate the natural attenuation of sound as it travels over distance.
[0137] Based on the physical characteristics of sound propagation in air, the original mono audio signal u(n) is attenuated using the distance attenuation formula to obtain the attenuated signal u′(n). For example, the inverse square law attenuation model applicable to point sound sources is shown in formula (9).
[0138]
[0139] Where d0 represents the reference distance, such as 1 meter; γ represents the attenuation index, such as 2.
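A minimal Python sketch of the distance attenuation in step S32. It assumes that the attenuation gain takes the form (d0 / d) raised to the power γ, which matches the definitions of d0 and γ above but is an assumption about the exact form of formula (9). The function name `attenuate` is hypothetical.

```python
def attenuate(u, d, d0=1.0, gamma=2.0):
    """Scale mono samples u by (d0 / d) ** gamma.

    The gain form is an assumed reading of formula (9); d0 is the
    reference distance and gamma the attenuation index.
    """
    if d <= 0:
        raise ValueError("distance must be positive")
    gain = (d0 / d) ** gamma
    return [sample * gain for sample in u]
```

With the default values, a source 2 meters away is scaled by 0.25, and a source at the reference distance is left unchanged.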
[0140] Step S33: Based on the spatial location of the sound source, use convolution processing to give the sound the correct sense of direction.
[0141] Specifically, based on the calculated horizontal and vertical angles, the closest impulse response data is looked up in the Head-Related Impulse Response (HRIR) database. An HRIR is a binaural impulse response measured or simulated in advance for different azimuth angles. Further, referring to Figure 6, the distance-attenuated mono audio signal $u'(n)$ is convolved with the HRIRs $h_L(n)$ and $h_R(n)$ corresponding to the left and right ears, respectively, to generate the stereo signals $y_L(n)$ and $y_R(n)$ with a sense of direction, as in formulas (10) and (11).
[0142] $$y_L(n) = u'(n) \otimes h_L(n) \tag{10}$$
[0143] $$y_R(n) = u'(n) \otimes h_R(n) \tag{11}$$
[0144] where $\otimes$ denotes the convolution operation.
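The lookup and convolution of step S33 can be illustrated with the following Python sketch. The layout of the HRIR database (a dictionary keyed by horizontal and vertical angle in degrees) and the nearest-neighbor metric are assumptions made for illustration, and the names `nearest_hrir`, `convolve`, and `render_hrtf` are hypothetical.

```python
def nearest_hrir(database, horizontal, vertical):
    """Pick the HRIR pair whose measured direction is closest to the request.

    database maps (horizontal_deg, vertical_deg) -> (h_left, h_right).
    Closeness is a squared angular difference with wrap-around on the
    horizontal angle; this metric is an assumption.
    """
    def cost(key):
        dh = (key[0] - horizontal + 180.0) % 360.0 - 180.0
        dv = key[1] - vertical
        return dh * dh + dv * dv
    return database[min(database, key=cost)]

def convolve(signal, impulse):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out

def render_hrtf(u_att, database, horizontal, vertical):
    """Return (y_left, y_right) per formulas (10) and (11)."""
    h_left, h_right = nearest_hrir(database, horizontal, vertical)
    return convolve(u_att, h_left), convolve(u_att, h_right)
```

The direct-form convolution costs on the order of the signal length times the HRIR length per ear and per source, which illustrates why this strategy is the more expensive of the two.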
[0145] Through the steps described above, HRTF processing can accurately simulate the position of a sound source in three-dimensional space, allowing the target object to experience a more realistic sense of sound direction. This process considers both the physical propagation characteristics of sound waves (such as distance attenuation) and the human ear's perception of sound arriving from different directions (as captured by the HRTF), providing a highly immersive auditory experience.
[0146] The following embodiments, with reference to Figure 7, describe how the second binaural stereo playback strategy is executed to perform binaural stereo playback processing for any second-type sound source in the current environment, such as the k-th sound source.
[0147] Step S41: Determine the azimuth angle of the kth sound source in the second type of sound source relative to the target object.
[0148] For example, referring to Figure 7, the positions of the sound source and the target object are represented in a Cartesian coordinate system (x, y, z). Assume the target object is located at the origin (0, 0, 0) and the sound source is located at $(x_k, y_k, z_k)$. The horizontal angle is given by formula (12), and the vertical angle by formula (13).
[0149] $$\theta = \operatorname{atan2}(y_k, x_k) \tag{12}$$
[0150] The function atan2 returns values in the range of [-π, π].
[0151]
[0152] Step S42: Calculate the amplitude gain of the left and right ear signals based on the azimuth angle of the sound source.
[0153] Map the azimuth angle θ to a value Panvalue between -1 (left) and 1 (right), as in formula (14).
[0154] $$\mathrm{Panvalue} = \frac{\theta}{\pi} \tag{14}$$
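[0155] The amplitude gains of the left and right channels are calculated from Panvalue, as shown in formulas (15) and (16).
[0156] $$\mathrm{Gain}_L = \frac{1 - \mathrm{Panvalue}}{2} \tag{15}$$
[0157] $$\mathrm{Gain}_R = \frac{1 + \mathrm{Panvalue}}{2} \tag{16}$$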
[0158] For example: azimuth angle is -90° (left side), Panvalue = -0.5, $\mathrm{Gain}_L$ = 0.75, $\mathrm{Gain}_R$ = 0.25;
[0159] For example: azimuth angle is 0° (center), Panvalue = 0, $\mathrm{Gain}_L$ = 0.5, $\mathrm{Gain}_R$ = 0.5;
[0160] For example: azimuth angle is +90° (right side), Panvalue = 0.5, $\mathrm{Gain}_L$ = 0.25, $\mathrm{Gain}_R$ = 0.75.
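The three worked examples above are consistent with a linear mapping from azimuth to gain. The Python sketch below assumes that mapping (Panvalue equal to θ divided by π, and gains equal to (1 ∓ Panvalue) / 2); these forms are inferred from the examples rather than quoted from the formulas, and the function name `pan_gains` is hypothetical.

```python
import math

def pan_gains(theta):
    """Return (pan, gain_left, gain_right) for an azimuth theta in radians.

    theta is expected in [-pi, pi], as returned by atan2. The linear
    forms below are inferred from the worked examples in the text.
    """
    pan = theta / math.pi            # -1 (left) ... 1 (right)
    gain_left = (1.0 - pan) / 2.0
    gain_right = (1.0 + pan) / 2.0
    return pan, gain_left, gain_right
```

Under this mapping the two gains always sum to 1, so the panning redistributes amplitude between the ears without changing the total.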
[0161] Step S43: Calculate the time difference between the arrival of the sound source signal at the left ear and at the right ear, so as to produce a phase difference effect.
[0162] Assume the sound source is L meters away from the center of the target object's head, the distance between the left and right ears of the target object is w, and the pitch angle of the sound source is φ. Calculate the distances from the k-th sound source in the second type of sound source to the left ear B and the right ear C, as shown in formulas (17) and (18).
[0163]
[0164]
[0165] Calculate the distance difference dL between the left and right ears, as shown in formula (19).
[0166] $$dL = L_C - L_B \tag{19}$$
[0167] Calculate the time difference dt between the arrival of the sound at the left and right ears, as shown in formula (20).
[0168] $$dt = \frac{dL}{c} \tag{20}$$
[0169] Where c represents the speed of sound, which is 340 m/s.
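Step S43 can be sketched in Python as follows. Because the expressions for the ear distances are not reproduced here, the sketch computes them directly from Cartesian coordinates under an assumed geometry: head center at the origin, left ear B at (-w/2, 0, 0), and right ear C at (+w/2, 0, 0). The function name `interaural_time_difference` is hypothetical.

```python
import math

def interaural_time_difference(source, w, c=340.0):
    """Return (d_l, d_t): path-length difference L_C - L_B and its travel time.

    Assumed geometry (not taken from the text): head centre at the origin,
    left ear B at (-w/2, 0, 0), right ear C at (+w/2, 0, 0).
    """
    x, y, z = source
    dist_left = math.sqrt((x + w / 2) ** 2 + y ** 2 + z ** 2)    # L_B
    dist_right = math.sqrt((x - w / 2) ** 2 + y ** 2 + z ** 2)   # L_C
    d_l = dist_right - dist_left    # formula (19)
    d_t = d_l / c                   # time difference, distance over speed of sound
    return d_l, d_t
```

A positive result means the sound reaches the left ear first; a source on the median plane gives a time difference of zero.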
[0170] Step S44: Adjust the amplitude and phase of the sound from the kth sound source in the second type of sound source, and adjust the original audio signal of the kth sound source by applying the calculated amplitude ratio and time difference.
[0171] Specifically, the amplitude of the original audio signal of the k-th sound source is multiplied by the left-ear and right-ear amplitude gains $\mathrm{Gain}_L$ and $\mathrm{Gain}_R$, respectively, to adjust the amplitude. When the right ear is farther from the sound source, the phase difference effect is achieved by delaying the audio signal of the right channel by dt × sample rate samples.
[0172] Step S45: Output the binaural stereo signal corresponding to the k-th sound source, including the left ear audio signal and the right ear audio signal after the above processing.
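Steps S44 and S45 can be sketched in Python as follows. The sketch rounds the delay to whole samples and delays the left channel instead when the time difference is negative; both choices are assumptions, and the function name `render_bp` is hypothetical.

```python
def render_bp(signal, gain_left, gain_right, d_t, sample_rate):
    """Return (left, right) sample lists for one second-type source.

    d_t > 0 means the right ear is farther, so the right channel is delayed.
    d_t < 0 delays the left channel instead (symmetric handling is an
    assumption). The delay is rounded to a whole number of samples.
    """
    delay = round(abs(d_t) * sample_rate)
    left = [s * gain_left for s in signal]
    right = [s * gain_right for s in signal]
    pad = [0.0] * delay
    if d_t > 0:
        # Pad the left channel at the end so both channels keep equal length.
        left, right = left + pad, pad + right
    elif d_t < 0:
        left, right = pad + left, right + pad
    return left, right
```

The per-sample cost is two multiplications and a shift, with no convolution, which is why this strategy is much cheaper than the HRIR-based one.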
[0173] The BP algorithm described above is used for the second type of sound source. Its computational overhead is extremely low: it only needs the relative spatial angle and distance between the sound source and the target object to quickly calculate the amplitude and phase relationship of the binaural signal and output the binaural stereo signal.
[0174] As can be seen, in this embodiment, sound sources in the current environment are classified. For first-type sound sources, to which the human ear is more sensitive, a first binaural stereo playback strategy with higher computational complexity, such as the HRTF algorithm, is used to accurately reproduce their sound signals. For second-type sound sources, to which the human ear is less sensitive, a second binaural stereo playback strategy with lower computational complexity, such as the BP algorithm, is used to save computational overhead. Therefore, this embodiment saves computational resources while preserving the accuracy of binaural stereo reproduction.
[0175] It should be noted that, for the first type of sound source described above, in addition to the HRTF algorithm, other processing algorithms with higher restoration accuracy can also be used, and this application embodiment does not limit this. For the second type of sound source described above, in addition to the BP algorithm described above, other processing algorithms with lower restoration computational overhead can also be used, and this application embodiment does not limit this.
[0176] Continuing to refer to Figure 2, in step S240, the binaural stereo sound corresponding to each of the N sound sources is mixed to obtain the current binaural stereo sound of the target object.
[0177] For example, referring to Figure 8, for each sound source of the first type, the first binaural stereo playback strategy is executed to determine the corresponding binaural stereo sound. The specific implementation process is as described in the embodiments corresponding to steps S31-S33 and is not repeated here. For each sound source of the second type, the second binaural stereo playback strategy is executed to determine the corresponding binaural stereo sound. The specific implementation process is as described in the embodiments corresponding to steps S41-S45 and is not repeated here.
[0178] Furthermore, a binaural virtual stereo signal output is constructed using spatial sound reproduction generation technology. Specifically, the left-ear stereo signal from each sound source is mixed to obtain the corresponding virtual stereo signal 80, and the right-ear stereo signal from each sound source is mixed to obtain the corresponding virtual stereo signal 82. This can be played through headphones or dual speakers, providing a realistic auditory experience with precise sound reproduction.
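The mixing step can be sketched in Python as a per-channel sum over all sources. Zero-padding shorter signals and omitting any normalization or clipping protection are assumptions of this sketch, and the function name `mix` is hypothetical.

```python
def mix(stereo_signals):
    """Sum per-source (left, right) sample lists into one (left, right) pair.

    Shorter signals are treated as zero-padded. No normalisation or
    clipping protection is applied (both simplifications are assumptions).
    """
    length = max((max(len(l), len(r)) for l, r in stereo_signals), default=0)
    mixed_left = [0.0] * length
    mixed_right = [0.0] * length
    for left, right in stereo_signals:
        for n, sample in enumerate(left):
            mixed_left[n] += sample
        for n, sample in enumerate(right):
            mixed_right[n] += sample
    return mixed_left, mixed_right
```

The input list can hold the outputs of both strategies, since each produces one left-ear signal and one right-ear signal per source.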
[0179] In the binaural stereo playback scheme provided in this application embodiment, the N sound sources (N being a positive integer) currently existing in the environment where the target object is located are determined and then classified according to their respective auditory sensitivity attributes. Starting from the characteristics of human ear perception, this embodiment classifies each sound source as either a first-type sound source, to which human hearing is more sensitive, or a second-type sound source, to which human hearing is less sensitive. Binaural spatial sound playback algorithms of different complexities are then used for sound sources of different sensitivities to determine the binaural stereo sound corresponding to each of the N sound sources. This solves the problem in related technologies where the conventional HRTF-based binaural spatial sound playback scheme is computationally too expensive, is unsuitable for application scenarios with a large number of sound sources, and can lead to audio-visual asynchrony, stuttering, and unresponsive application operations. Finally, the binaural stereo sound corresponding to each of the N sound sources is mixed to obtain the current binaural stereo sound of the target object. Because sound sources with different auditory sensitivity attributes receive processing strategies of different complexity, this approach reduces computational overhead while achieving binaural spatial sound reproduction, and makes binaural stereo generation more feasible for scenes that simulate the large number of sound sources found in the real world.
[0180] The method embodiments of this application have been described above with reference to Figures 1 to 8. The embodiments of the binaural stereo playback device are described below with reference to Figure 9.
[0181] Figure 9 is a schematic block diagram of a binaural stereo playback device 900 provided in an embodiment of this application.
[0182] Referring to Figure 9, the binaural stereo playback device 900 provided in this application embodiment includes: a first determining module 910, a classification module 920, a first processing module 930, and a second processing module 940.
[0183] The first determining module 910 is used to determine N sound sources currently existing in the environment where the target object is located, where N is a positive integer; the classification module 920 is used to classify the N sound sources according to the auditory sensitivity attributes corresponding to the N sound sources respectively, wherein the auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, where i is a positive integer not greater than N; the first processing module 930 is used to execute processing strategies with different processing complexities for different types of sound sources to determine the binaural stereo corresponding to the N sound sources respectively; and the second processing module 940 is used to perform mixing processing on the binaural stereo corresponding to the N sound sources respectively to obtain the current binaural stereo of the target object.
[0184] In an exemplary embodiment, based on the above scheme, the apparatus further includes: an acquisition module and a second determination module;
[0185] The acquisition module is used to: before the classification module 920 classifies the N sound sources according to their respective auditory sensitivity attributes, acquire the corresponding sound source features and audio features for the i-th sound source, wherein the sound source features include: azimuth angle movement speed, and the audio features include: relative loudness, where i is a positive integer not greater than N; the second determination module is used to determine the i-th relative loudness based on the distance and loudness between the i-th sound source and the target object; wherein the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness, or the i-th relative loudness and the azimuth angle movement speed of the i-th sound source.
[0186] In an exemplary embodiment, based on the above scheme, the classification module 920 is specifically used to: determine that the i-th sound source belongs to a first type when the auditory sensitivity attribute of the i-th sound source meets the first preset condition; or, determine that the i-th sound source belongs to a second type when the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition; wherein, the target object's human ear perception sensitivity to the first type of sound source is greater than its human ear perception sensitivity to the second type of sound source.
[0187] In an exemplary embodiment, based on the above scheme, the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness; the above device further includes: a third determining module;
[0188] The third determining module is configured to: determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; if the i-th relative loudness is greater than the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition; and if the i-th relative loudness is less than the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not satisfy the first preset condition.
[0189] In an exemplary embodiment, based on the above scheme, the auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness and the azimuth angle movement speed of the i-th sound source; the above device further includes: a fourth determining module;
[0190] The fourth determining module is used to: determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; if the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the i-th relative loudness is greater than the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition; and if the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the i-th relative loudness is greater than the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition.
[0191] In an exemplary embodiment, based on the above scheme, the fourth determining module is further configured to: if the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the i-th relative loudness is less than or equal to the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition; if the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the i-th relative loudness is less than or equal to the i-th threshold, then determine that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
[0192] In an exemplary embodiment, based on the above scheme, when the azimuth angle movement speed of the i-th sound source is greater than the first threshold, the corresponding influence coefficient is a first positive value less than 1; when the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, the corresponding influence coefficient is 1.
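The threshold logic of the fourth determining module can be sketched in Python as follows. The value 0.5 used for the "first positive value less than 1" is an arbitrary placeholder, and the function name `classify_source` is hypothetical. Only the speed-dependent influence coefficient is modeled here.

```python
def classify_source(relative_loudness, original_threshold,
                    angular_speed, speed_threshold, fast_coefficient=0.5):
    """Return 1 for a first-type source, 2 for a second-type source.

    fast_coefficient stands in for the "first positive value less than 1";
    the default 0.5 is a placeholder, not a value from the text.
    """
    if not 0.0 < fast_coefficient < 1.0:
        raise ValueError("fast_coefficient must lie strictly between 0 and 1")
    # Fast-moving sources get a lowered (weighted) loudness threshold.
    coefficient = fast_coefficient if angular_speed > speed_threshold else 1.0
    weighted_threshold = original_threshold * coefficient
    # First preset condition: relative loudness exceeds the threshold.
    return 1 if relative_loudness > weighted_threshold else 2
```

Because the coefficient is below 1 for fast-moving sources, a moving source needs less relative loudness than a stationary one to be assigned the more accurate first strategy.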
[0193] In an exemplary embodiment, based on the above scheme, the classification module 920 is specifically used to: if N is less than the second threshold, classify the N sound sources according to their respective auditory sensitivity attributes; or, if N is greater than or equal to the second threshold, classify the sound sources whose loudness meets the second preset condition according to their corresponding auditory sensitivity attributes, and determine the sound sources whose loudness does not meet the second preset condition as belonging to the second type.
[0194] In an exemplary embodiment, based on the above scheme, the second preset condition is that the loudness ranks among the top M from largest to smallest; or, the ranking belongs to the top x%; where M is a positive integer less than N, and x is a positive integer.
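The pre-filtering described in the two preceding paragraphs can be sketched in Python as follows, using the top-M form of the second preset condition. The function name `split_by_loudness_rank` and its parameter names are hypothetical.

```python
def split_by_loudness_rank(loudness, count_threshold, top_m):
    """Return (candidates, forced_second_type) as lists of source indices.

    If there are fewer sources than count_threshold (the second threshold),
    every source is a candidate for attribute-based classification.
    Otherwise only the top_m loudest sources are candidates and the rest
    are assigned to the second type directly.
    """
    indices = list(range(len(loudness)))
    if len(indices) < count_threshold:
        return indices, []
    ranked = sorted(indices, key=lambda i: loudness[i], reverse=True)
    return sorted(ranked[:top_m]), sorted(ranked[top_m:])
```

This bounds the number of sources that can reach the more expensive first strategy at M, regardless of how many sources the scene contains.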
[0195] In an exemplary embodiment, based on the above scheme, the first processing module 930 includes a first processing unit and a second processing unit; wherein, the first processing unit is used to: for the sound sources of the first type among the N sound sources, execute a first binaural stereo playback strategy to determine the corresponding first type of binaural stereo; the second processing unit is used to: for the sound sources of the second type among the N sound sources, execute a second binaural stereo playback strategy to determine the corresponding second type of binaural stereo; wherein, the processing complexity of the first binaural stereo playback strategy is greater than the processing complexity of the second binaural stereo playback strategy.
[0196] In an exemplary embodiment, based on the above scheme, the first processing unit is specifically configured to: for the j-th sound source in the first type of sound source, determine the j-th distance and j-th spatial angle of the j-th sound source relative to the target object, where j is a positive integer not greater than the number Q of the first type of sound sources, and Q is a positive integer not greater than N; attenuate the mono audio signal of the j-th sound source according to the j-th distance; determine the left-ear head-related impulse response and the right-ear head-related impulse response corresponding to the j-th sound source according to the j-th spatial angle; convolve the distance-attenuated mono audio signal with the left-ear head-related impulse response to generate the left ear audio signal of the j-th sound source; and convolve the distance-attenuated mono audio signal with the right-ear head-related impulse response to generate the right ear audio signal of the j-th sound source.
[0197] In an exemplary embodiment, based on the above scheme, the second processing unit is specifically configured to: for the k-th sound source in the second type of sound source, determine the k-th azimuth angle and the k-th distance of the k-th sound source relative to the target object, where k is a positive integer not greater than the number N - Q of the second type of sound sources, and Q is a positive integer not greater than N; calculate the first amplitude gain corresponding to the left ear and the second amplitude gain corresponding to the right ear based on the k-th azimuth angle; calculate the time difference between the arrival of the k-th sound source signal at the left and right ears based on the k-th distance; adjust the amplitude of the audio signal of the k-th sound source based on the first amplitude gain to obtain the left ear audio signal; and, when the right ear is farther from the k-th sound source, adjust the amplitude of the audio signal of the k-th sound source based on the second amplitude gain, and perform phase modulation on the amplitude-adjusted audio signal based on the time difference to obtain the right ear audio signal.
[0198] In an exemplary embodiment, based on the above scheme, the second processing module 940 is specifically used to: for the j-th sound source of the first type and the k-th sound source of the second type, perform mixing processing on the left ear audio signal of the j-th sound source and the left ear audio signal of the k-th sound source to obtain the audio signal corresponding to the left ear in the binaural stereo; and perform mixing processing on the right ear audio signal of the j-th sound source and the right ear audio signal of the k-th sound source to obtain the audio signal corresponding to the right ear in the binaural stereo; wherein j takes the values 1, 2, ..., Q, where Q is the number of sound sources of the first type, and k takes the values 1, 2, ..., N - Q, where N - Q is the number of sound sources of the second type.
[0199] In an exemplary embodiment, based on the above scheme, when the relative orientation information between the i-th sound source and the target object does not meet the third preset condition, the corresponding influence coefficient is a second positive value less than 1; when the relative orientation between the i-th sound source and the target object meets the third preset condition, the corresponding influence coefficient is 1; wherein, the third preset condition is: the sound source and the head of the target object are located within a preset range directly in front of each other on the same horizontal plane.
[0200] It should be understood that the embodiment of the binaural stereo playback device shown in Figure 9 corresponds to the binaural stereo playback method embodiments described above, and similar descriptions can be found in the method embodiments. Specifically, through the information interaction between its modules, the binaural stereo playback device shown in Figure 9 can execute the embodiments of the binaural stereo playback method described above. For the sake of brevity, the method embodiments corresponding to the aforementioned and other operations and/or functions of each module in the device are not described again here.
[0201] The above description, in conjunction with the accompanying drawings, describes the binaural stereo playback device of the embodiments of this application from the perspective of functional modules. It should be understood that these functional modules can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and/or by software instructions. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software module can reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps in the above method embodiments.
[0202] This application also provides an electronic device.
[0203] Figure 10 is a schematic block diagram of an electronic device 1000 provided in an embodiment of this application. As described above, the binaural stereo playback device can be deployed in, for example, the electronic device shown in Figure 10, which can therefore be used to perform the above-described binaural stereo playback method.
[0204] As shown in Figure 10, the electronic device 1000 may include:
[0205] A memory 1010 and a processor 1020. The memory 1010 stores a computer program 1030 and transfers the program code to the processor 1020. In other words, the processor 1020 can call and run the computer program 1030 from the memory 1010 to implement the methods in the embodiments of this application.
[0206] For example, the processor 1020 can be used to execute the steps in the above-described binaural stereo playback method according to the instructions in the computer program 1030.
[0207] In some embodiments of this application, the processor 1020 may include, but is not limited to:
[0208] General-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0209] In some embodiments of this application, the memory 1010 includes, but is not limited to:
[0210] Volatile memory and / or non-volatile memory. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
[0211] In some embodiments of this application, the computer program 1030 may be divided into one or more modules, which are stored in the memory 1010 and executed by the processor 1020 to complete the binaural stereo playback method provided in this application, or to complete the steps in the aforementioned binaural stereo playback method. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 1030 in the electronic device.
[0212] As shown in Figure 10, the electronic device 1000 may further include:
[0213] A transceiver 1040, which can be connected to the processor 1020 or the memory 1010.
[0214] The processor 1020 can control the transceiver 1040 to communicate with other devices; specifically, it can send information or data to other devices or receive information or data sent by other devices. The transceiver 1040 may include a transmitter and a receiver. The transceiver 1040 may further include antennas, and the number of antennas may be one or more.
[0215] It should be understood that the various components in the electronic device 1000 are connected through a bus system, which includes a data bus, a power bus, a control bus, and a status signal bus.
[0216] According to one aspect of this application, a computer storage medium is provided that stores a computer program thereon, which, when executed by a computer, enables the computer to perform the methods of the above-described method embodiments. Alternatively, embodiments of this application also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the methods of the above-described method embodiments.
[0217] According to another aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the method described in the above-described method embodiments.
[0218] In other words, when implemented using software, it can be implemented wholly or partially in the form of a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disc (DVD)), or a semiconductor medium (e.g., Solid State Disk (SSD)).
[0219] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0220] In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or other forms.
[0221] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. For example, the functional modules in the various embodiments of this application may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.
[0222] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A binaural stereo playback method, characterized in that the method includes: identifying the N sound sources currently existing in the environment where the target object is located, where N is a positive integer; classifying the N sound sources according to their respective auditory sensitivity attributes, wherein the auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, and i is a positive integer not greater than N; executing processing strategies with different processing complexities for different types of sound sources, to determine the binaural stereo sound corresponding to each of the N sound sources; and mixing the binaural stereo sound corresponding to each of the N sound sources to obtain the current binaural stereo sound of the target object.
2. The method of claim 1, wherein, Before classifying the N sound sources according to their respective auditory sensitivity attributes, the method further includes: For the i-th sound source, obtain its corresponding sound source features and audio features, wherein the sound source features include: azimuth angle movement speed, and the audio features include: relative loudness, and i takes the value of a positive integer not greater than N; The auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness, or the i-th relative loudness and the azimuth angular movement speed of the i-th sound source.
3. The method of claim 2, wherein, The step of classifying the N sound sources according to their respective auditory sensitivity attributes includes: If the auditory sensitivity attribute of the i-th sound source meets the first preset condition, then the i-th sound source is determined to belong to the first type; or, If the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition, the i-th sound source is determined to belong to the second type. Wherein, the target object's human ear sensitivity to the first type of sound source is greater than its human ear sensitivity to the second type of sound source.
4. The method of claim 3, wherein, The auditory sensitivity attribute corresponding to the i-th sound source includes: the i-th relative loudness; the method further includes: Determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; If the i-th relative loudness is greater than the i-th threshold, then the auditory sensitivity attribute of the i-th sound source is determined to meet the first preset condition; If the relative loudness of the i-th source is less than the i-th threshold, then the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
5. The method of claim 3, wherein, The auditory sensitivity attributes corresponding to the i-th sound source include: the i-th relative loudness and the azimuth angular movement speed of the i-th sound source; the method further includes: Determine the i-th threshold corresponding to the i-th relative loudness, wherein the i-th threshold is a preset i-th original threshold, or the i-th threshold is an i-th weighted threshold obtained by multiplying the i-th original threshold by the i-th influence coefficient; the i-th influence coefficient is related to at least one of the following: the relative orientation information between the i-th sound source and the target object, and the azimuth angle movement speed of the i-th sound source; If the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the relative loudness of the i-th sound source is greater than the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source satisfies the first preset condition. If the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the relative loudness of the i-th sound source is greater than the i-th threshold, then the auditory sensitivity attribute of the i-th sound source is determined to meet the first preset condition.
6. The method of claim 5, wherein, The method further includes: If the azimuth angle movement speed of the i-th sound source is greater than the first threshold, and the relative loudness of the i-th sound source is less than or equal to the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition. If the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, and the relative loudness of the i-th sound source is less than or equal to the i-th threshold, then it is determined that the auditory sensitivity attribute of the i-th sound source does not meet the first preset condition.
7. The method according to any one of claims 4 to 6, characterized in that, when the azimuth angle movement speed of the i-th sound source is greater than the first threshold, the corresponding influence coefficient takes a first positive value less than 1; and when the azimuth angle movement speed of the i-th sound source is less than or equal to the first threshold, the corresponding influence coefficient is 1.
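As an illustration only, the classification rule of claims 4 to 7 can be sketched in Python as follows. This is a minimal sketch, not the applicant's implementation: the names `Source`, `influence_coefficient`, and `is_first_type`, the coefficient value 0.8, and the numeric thresholds are all hypothetical. Only the speed-dependent influence coefficient is modeled; the orientation-dependent coefficient is omitted.

```python
from dataclasses import dataclass


@dataclass
class Source:
    relative_loudness: float  # loudness relative to the other sources (unit assumed)
    azimuth_speed: float      # azimuth angle movement speed (unit assumed, e.g. deg/s)


def influence_coefficient(speed, speed_threshold, fast_coeff=0.8):
    """Return a positive value below 1 for fast-moving sources, otherwise 1.

    fast_coeff=0.8 is a hypothetical value; the claims only require it to be
    positive and less than 1.
    """
    return fast_coeff if speed > speed_threshold else 1.0


def is_first_type(src, original_threshold, speed_threshold, fast_coeff=0.8):
    """True when the source meets the first preset condition.

    The weighted threshold is the original threshold multiplied by the
    influence coefficient, so a fast-moving source needs less loudness
    to be treated as a first-type (more audible) source.
    """
    weighted = original_threshold * influence_coefficient(
        src.azimuth_speed, speed_threshold, fast_coeff
    )
    return src.relative_loudness > weighted


# Two sources with the same loudness but different movement speeds.
fast = Source(relative_loudness=0.45, azimuth_speed=30.0)
slow = Source(relative_loudness=0.45, azimuth_speed=10.0)
print(is_first_type(fast, original_threshold=0.5, speed_threshold=20.0))
print(is_first_type(slow, original_threshold=0.5, speed_threshold=20.0))
```

In this sketch the fast source is compared against a lowered threshold (0.5 × 0.8) and the slow source against the original threshold, which is why equal loudness can lead to different types.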
8. The method according to any one of claims 3 to 6, characterized in that, the step of classifying the N sound sources according to their respective auditory sensitivity attributes includes: if N is less than the second threshold, classifying the N sound sources according to their respective auditory sensitivity attributes; or, if N is greater than or equal to the second threshold, classifying the sound sources among the N sound sources whose loudness meets the second preset condition according to their corresponding auditory sensitivity attributes, and determining that the sound sources among the N sound sources whose loudness does not meet the second preset condition belong to the second type.
9. The method according to claim 8, characterized in that, the second preset condition is that, when the sound sources are ranked by loudness from largest to smallest, the sound source ranks among the top M, or alternatively ranks within the top x percent; where M is a positive integer less than N, and x is a positive integer.
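The pre-filtering described in claims 8 and 9 can be sketched as below, using the top-M variant of the second preset condition. The function name `split_by_loudness_rank` and its parameters are hypothetical, and the sketch assumes loudness is available as one number per source.

```python
def split_by_loudness_rank(loudness, count_threshold, top_m):
    """Split source indices into (candidates, forced_second_type).

    When the number of sources is below count_threshold (the "second
    threshold"), every source remains a candidate for attribute-based
    classification. Otherwise only the top_m loudest sources remain
    candidates and the rest are assigned to the second type directly.
    """
    n = len(loudness)
    if n < count_threshold:
        return list(range(n)), []
    # Rank source indices by loudness, largest first.
    ranked = sorted(range(n), key=lambda i: loudness[i], reverse=True)
    return sorted(ranked[:top_m]), sorted(ranked[top_m:])


candidates, forced = split_by_loudness_rank(
    [0.2, 0.9, 0.5, 0.7], count_threshold=3, top_m=2
)
print(candidates, forced)  # [1, 3] [0, 2]
```

The candidates would then go through the attribute-based classification, so the costlier strategy is only ever considered for at most M sources.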
10. The method according to claim 8, characterized in that, the step of executing processing strategies with different processing complexities for different types of sound sources, to determine the binaural stereo sound corresponding to each of the N sound sources, includes: for the sound sources among the N sound sources that belong to the first type, executing the first binaural stereo playback strategy to determine the corresponding first type of binaural stereo; and for the sound sources among the N sound sources that belong to the second type, executing the second binaural stereo playback strategy to determine the corresponding second type of binaural stereo; wherein the processing complexity of the first binaural stereo playback strategy is greater than that of the second binaural stereo playback strategy.
11. The method according to claim 10, characterized in that, For the N sound sources belonging to the first type, the first binaural stereo playback strategy is executed to determine the corresponding first type of binaural stereo, including: For the j-th sound source in the first type of sound source, determine the j-th distance and j-th spatial angle of the j-th sound source relative to the target object, where j is a positive integer not greater than the number Q of the first type of sound sources, and Q is a positive integer not greater than N; Based on the j-th distance, the mono audio signal of the j-th sound source is attenuated. Based on the j-th spatial angle, determine the left ear head related impulse response and the right ear head related impulse response corresponding to the j-th sound source; The mono audio signal after distance attenuation is convolved with the left ear head related impulse response to generate the left ear audio signal of the j-th sound source. The mono audio signal, after distance attenuation, is convolved with the right ear head-related impulse response to generate the right ear audio signal of the j-th sound source.
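A minimal sketch of the first playback strategy in claim 11 is given below. Several points are assumptions: the inverse-distance attenuation law and `ref_distance` are hypothetical (the claim only states that attenuation is based on distance), the left and right head-related impulse responses are passed in directly rather than looked up from the spatial angle, and a plain time-domain convolution is used for clarity rather than speed.

```python
def convolve(signal, kernel):
    """Direct (time-domain) linear convolution of two sample lists."""
    if not signal or not kernel:
        return []
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out


def render_first_type(mono, distance, hrir_left, hrir_right, ref_distance=1.0):
    """Return (left, right) ear signals for one first-type source.

    Assumption: gain falls off as ref_distance / distance and is capped at 1
    for sources closer than ref_distance.
    """
    gain = ref_distance / max(distance, ref_distance)
    attenuated = [gain * x for x in mono]
    # Convolve the attenuated mono signal with each ear's impulse response.
    return convolve(attenuated, hrir_left), convolve(attenuated, hrir_right)


# Toy impulse responses; real ones would come from a measured HRIR set.
left, right = render_first_type(
    mono=[1.0, 0.0], distance=2.0, hrir_left=[1.0, 0.5], hrir_right=[0.5]
)
print(left, right)  # [0.5, 0.25, 0.0] [0.25, 0.0]
```

The two convolutions per source are what make this strategy the more expensive one, since the cost grows with the length of the impulse responses.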
12. The method according to claim 10, characterized in that, for the N sound sources belonging to the second type, a second binaural stereo playback strategy is executed to determine the corresponding second type of binaural stereo, including: for the k-th sound source in the second type of sound source, determine the k-th azimuth angle and the k-th distance of the k-th sound source relative to the target object, where k is a positive integer not greater than the number N − Q of the second type of sound sources, and Q is a positive integer not greater than N; based on the k-th azimuth angle, calculate the first amplitude gain corresponding to the left ear and the second amplitude gain corresponding to the right ear; based on the k-th distance, calculate the time difference between the arrival of the k-th sound source signal at the left and right ears; based on the first amplitude gain, adjust the amplitude of the audio signal of the k-th sound source to obtain the left ear audio signal; and when the right ear is far from the k-th sound source, adjust the amplitude of the audio signal of the k-th sound source according to the second amplitude gain, and modulate the phase of the amplitude-adjusted audio signal according to the time difference to obtain the right ear audio signal.
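The lighter second strategy in claim 12 might be sketched as follows. This is an illustrative sketch under stated assumptions: the constant-power panning law, the azimuth convention (−90 degrees fully left, +90 degrees fully right), and the use of a whole-sample delay to stand in for the phase modulation are not taken from the claim. The time difference is passed in as a precomputed value because the claim does not give its formula, and the symmetric handling of a source on the right side is an extension of the single case the claim describes.

```python
import math


def pan_gains(azimuth_deg):
    """Constant-power (left, right) gains for azimuth in [-90, 90] degrees."""
    angle = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(angle), math.sin(angle)


def delay(signal, samples):
    """Delay a signal by whole samples, keeping its original length."""
    if samples <= 0:
        return list(signal)
    return ([0.0] * samples + list(signal))[: len(signal)]


def render_second_type(mono, azimuth_deg, time_difference_s, sample_rate):
    """Return (left, right) ear signals using only gains and a delay."""
    g_left, g_right = pan_gains(azimuth_deg)
    left = [g_left * x for x in mono]
    right = [g_right * x for x in mono]
    shift = round(time_difference_s * sample_rate)
    if azimuth_deg < 0:
        # Source on the left: the right ear is the far ear and hears it later.
        right = delay(right, shift)
    elif azimuth_deg > 0:
        # Source on the right: the left ear is the far ear (assumed symmetry).
        left = delay(left, shift)
    return left, right


left, right = render_second_type(
    [1.0, 0.0, 0.0, 0.0], azimuth_deg=-30.0, time_difference_s=0.0005, sample_rate=4000
)
print(left)
print(right)
```

Each output sample costs one multiplication per ear here, compared with one multiplication per impulse-response tap per ear for a convolution-based strategy.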
13. The method according to claim 11 or 12, characterized in that, the step of mixing the binaural stereo sound corresponding to each of the N sound sources to obtain the current binaural stereo sound of the target object includes: for the j-th sound source of the first type and the k-th sound source of the second type, mixing the left ear audio signal of the j-th sound source with the left ear audio signal of the k-th sound source to obtain the left ear audio signal in the binaural stereo; and mixing the right ear audio signal of the j-th sound source with the right ear audio signal of the k-th sound source to obtain the right ear audio signal in the binaural stereo; where j takes the values 1, 2, ..., Q, Q being the number of sound sources of the first type, and k takes the values 1, 2, ..., N − Q, N − Q being the number of sound sources of the second type.
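The mixing step of claim 13 can be illustrated with a simple per-ear sum. This sketch assumes mixing means sample-wise addition with shorter signals treated as zero-padded; gain normalization and clipping protection, which a real mixer would likely need, are left out. The names `mix` and `mix_binaural` are hypothetical.

```python
def mix(channels):
    """Sum several signals sample by sample, zero-padding shorter ones."""
    length = max((len(c) for c in channels), default=0)
    out = [0.0] * length
    for channel in channels:
        for i, x in enumerate(channel):
            out[i] += x
    return out


def mix_binaural(rendered):
    """Mix per-source (left, right) pairs into one (left, right) pair.

    `rendered` holds the output of every source, regardless of which
    playback strategy produced it.
    """
    lefts = [left for left, _ in rendered]
    rights = [right for _, right in rendered]
    return mix(lefts), mix(rights)


first_type_output = ([0.5, 0.25], [0.25])   # e.g. from the costlier strategy
second_type_output = ([0.5], [0.25, 0.5])   # e.g. from the lighter strategy
print(mix_binaural([first_type_output, second_type_output]))
# ([1.0, 0.25], [0.5, 0.5])
```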
14. The method according to claim 12, characterized in that, when the relative orientation information between the i-th sound source and the target object does not meet the third preset condition, the corresponding influence coefficient takes a second positive value less than 1; when the relative orientation information between the i-th sound source and the target object meets the third preset condition, the corresponding influence coefficient is 1; and the third preset condition is that the sound source is located on the same horizontal plane as the head of the target object and within a preset range directly in front of it.
15. A binaural stereo playback device, characterized in that, The device includes: The first determining module is used to determine the N sound sources currently existing in the environment where the target object is located, where N is a positive integer; The classification module is used to classify the N sound sources according to the auditory sensitivity attributes corresponding to the N sound sources respectively, wherein the auditory sensitivity attribute of the i-th sound source is used to characterize the auditory sensitivity of the target object to the i-th sound source, and i is a positive integer not greater than N; The first processing module is used to execute processing strategies with different processing complexities for different types of sound sources in order to determine the binaural stereo sound corresponding to each of the N sound sources. The second processing module is used to perform mixing processing on the binaural stereo corresponding to the N sound sources respectively, so as to obtain the current binaural stereo of the target object.
16. A computer-readable storage medium, characterized in that it is used to store a computer program, and the computer program causes a computer to perform the binaural stereo playback method as described in any one of claims 1 to 14.
17. An electronic device, characterized in that it includes a processor and a memory; the memory is used to store a computer program; and the processor is configured to execute the computer program to implement the binaural stereo playback method as described in any one of claims 1 to 14.