Method, device and storage medium for audio processing based on multi-loudspeaker scene
By calculating audio gain using a dual-balanced mapping function, the problem of poor spatial audio effect caused by inconsistent speaker positions in multi-speaker scenarios is solved, achieving spatial audio effect in any multi-speaker scenario, and improving perceptual clarity and computational efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
In multi-speaker scenarios, existing technologies result in poor spatial audio performance when the number or placement of speakers does not conform to a fixed speaker placement format. Furthermore, existing solutions struggle to accurately reflect the height information of the real three-dimensional space.
Using a dual-balanced mapping function, the audio gain is calculated based on the relative positions of at least four speakers and the virtual sound source, and then mapped through a small number of speakers to achieve spatial audio effects.
Achieve spatial audio effects in any multi-speaker scenario, improve the clarity of the location and distance perception of virtual sound sources, reduce rendering computational complexity, and enhance adaptability and user freedom.
Abstract
Description
Technical Field
[0001] This application relates to the field of audio processing, and in particular to an audio processing method, device and storage medium for multi-speaker scenarios.
Background Technology
[0002] Homes, cars, and cinemas are typical multi-speaker scenarios in which multiple speakers are deployed. In related technologies, an audio mapping scheme is designed in advance for each of several fixed speaker placement formats. In actual use, the closest speaker placement format is selected based on the actual placement of the speakers in the multi-speaker scenario, and the original audio is mapped to the speakers according to the selected format, so that a spatial audio effect is produced when the speakers render the audio. A spatial audio effect is the effect whereby users perceive the original audio as coming from a single sound source within the scene. However, when the number or placement of speakers in a multi-speaker scenario is inconsistent with the fixed speaker placement formats, the rendering results of the speakers deviate significantly from the standard rendering results, resulting in a poor spatial audio experience for users.
Summary of the Invention
[0003] This application provides an audio processing method, device, and storage medium for multi-speaker scenarios, which can achieve spatial audio effects in any multi-speaker scenario. The technical solution includes the following:
[0004] According to one aspect of this application, an audio processing method based on a multi-speaker scenario is provided, the method comprising: selecting at least four speakers from the multi-speaker scenario; obtaining the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; calculating at least four audio gains using a double-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and mapping the original audio using the at least four audio gains to obtain at least four target audios, which are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0005] According to another aspect of this application, an audio processing method based on a multi-speaker scenario is provided, the method comprising: selecting n speaker layers from the L speaker layers contained in the multi-speaker scenario, each speaker layer aggregating speakers with similar heights, where n and L are both positive integers, n is less than L and greater than one; selecting at least four speakers in each of the n speaker layers; obtaining the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; calculating at least four audio gains using a double-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and mapping the original audio based on the at least four audio gains combined with the layer weight corresponding to the current speaker layer to obtain at least four target audios, where all the target audios of the n speaker layers are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0006] According to another aspect of this application, an audio processing apparatus for a multi-speaker scenario is provided, the apparatus comprising: an acquisition module, used to select at least four speakers from the multi-speaker scenario, and further configured to acquire the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; a calculation module, used to calculate at least four audio gains using a double-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and a mapping module, used to map the original audio using the at least four audio gains to obtain at least four target audios, which are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0007] According to another aspect of this application, an audio processing apparatus for a multi-speaker scenario is provided, the apparatus comprising: an acquisition module, used to select n speaker layers from the L speaker layers included in the multi-speaker scenario, each speaker layer aggregating speakers with similar heights, where n and L are both positive integers, n is less than L and greater than one; a selection module, used to select at least four speakers in each of the n speaker layers; the acquisition module being further configured to acquire the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; a calculation module, used to calculate at least four audio gains using a double-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and a mapping module, used to map the original audio based on the at least four audio gains and the layer weight corresponding to the current speaker layer to obtain at least four target audios, where all the target audios of the n speaker layers are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0008] According to one aspect of this application, a computer device is provided, comprising: a processor and a memory, the memory storing a computer program, the computer program being loaded and executed by the processor to implement the audio processing method based on a multi-speaker scenario as described above.
[0009] According to another aspect of this application, a computer-readable storage medium is provided, which stores a computer program that is loaded and executed by a processor to implement the above-described audio processing method based on a multi-speaker scenario.
[0010] According to another aspect of this application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the aforementioned audio processing method based on a multi-speaker scenario.
[0011] The beneficial effects of the technical solutions provided in this application include at least the following: For achieving spatial audio effects in two dimensions, this application embodiment will select at least four speakers from a multi-speaker scenario. Then, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source, at least four audio gains will be calculated using a double-balanced mapping function. The original audio will then be mapped using these at least four audio gains. In other words, in this application embodiment, for multi-speaker scenarios containing any number and location of speakers, only at least four speakers will be selected to calculate the audio gains using a double-balanced mapping function. This application is applicable to multi-speaker scenarios under any circumstances and can achieve spatial audio effects in any multi-speaker scenario. It significantly improves the adaptability and user freedom in complex multi-speaker scenarios such as homes and vehicles.
[0012] In addition, this application uses only a small number of speakers for mapping, which improves the clarity and stability of the virtual sound source's orientation and distance perception on the one hand, and reduces the computational complexity of a single rendering on the other.
Attached Figure Description
[0013] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0014] Figure 1 is a flowchart of an audio processing method based on a multi-speaker scenario provided in one embodiment of this application.
[0015] Figure 2 is a schematic diagram of a multi-speaker scenario provided in one embodiment of this application.
[0016] Figure 3 is a flowchart of an audio processing method based on a multi-speaker scenario provided in another embodiment of this application.
[0017] Figure 4 is a schematic diagram of a method for selecting four speakers provided in one embodiment of this application.
[0018] Figure 5 is a flowchart of an audio processing method based on a multi-speaker scenario provided in one embodiment of this application.
[0019] Figure 6 is a structural block diagram of an audio processing device based on a multi-speaker scenario provided in one embodiment of this application.
[0020] Figure 7 is a structural block diagram of an audio processing device based on a multi-speaker scenario provided in one embodiment of this application.
[0021] Figure 8 is a structural block diagram of a computer device provided in one embodiment of this application.
[0022] Figure 9 is a structural block diagram of a computer device provided in another embodiment of this application.
Detailed Implementation
[0023] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0024] It should be understood that "several" in this article refers to one or more, and "multiple" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0025] It should be noted that the information (including but not limited to device information, personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the subject or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0026] Spatial audio effects: In a multi-speaker scenario, when audio is played through multiple speakers, users can realistically experience the effect of a stereo sound field. Specifically, spatial audio effects refer to the ability to perceive the direction and distance of virtual sound sources within a scene when audio is played through multiple speakers.
[0027] In related technologies, the following two typical solutions are provided to achieve spatial audio effects:
1. Corresponding audio mapping schemes are designed in advance for several fixed speaker placement formats. In actual use, the closest speaker placement format is selected based on the actual placement positions of the speakers in the multi-speaker scenario, and the original audio is mapped to the speakers according to the selected format, so that the speakers produce a spatial audio effect when rendering their respective mapped audio.
[0028] 2. Based on the distance from the virtual sound source to each speaker in the multi-speaker scene, calculate the corresponding gain for all speakers, and then each speaker will render its own gained audio to produce a spatial audio effect.
[0029] However, the first type of typical solution mentioned above only supports calculations based on preset standard speaker placements and does not support irregular or non-standard speaker placements. When the actual number or placement of speakers in the scene differs from the standard layout, the rendering results deviate significantly from the standard content, limiting the applicability of the technology in real-world scenarios. The second type of typical solution mentioned above calculates gain based solely on the distance from the virtual sound source to each speaker, often resulting in all speakers participating in the imaging of a certain virtual sound source, leading to excessive energy diffusion. This makes the spatial positioning of the virtual sound source unclear to the user, and blurs the sound image boundaries.
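To illustrate why the second solution spreads energy across every speaker, the following minimal sketch assumes an inverse-distance gain law normalized to unit energy. The text does not state the exact gain law of the related technology, so the function name `distance_gains`, the inverse-distance law, and the example layout are all illustrative assumptions.

```python
import math

def distance_gains(source, speakers, eps=1e-9):
    """Hypothetical distance-only gain law (an assumption, not from the text).

    Each speaker's weight is 1 / distance to the virtual source; the weights
    are then scaled so that the squared gains sum to one.
    """
    weights = [1.0 / (math.dist(source, s) + eps) for s in speakers]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# Six speakers at arbitrary positions and one virtual source (example values).
speakers = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0), (6.0, 1.5)]
gains = distance_gains((1.0, 1.0), speakers)
```

Under this assumed law every speaker receives a strictly positive gain, however far it is from the virtual sound source, which corresponds to the energy diffusion described above.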
[0030] Furthermore, the aforementioned technologies typically divide the loudspeakers into two layers, an ear-level layer and an overhead layer, and perform 2D mapping calculations within each layer to ultimately synthesize a so-called 3D effect. This results in a "pseudo-3D" effect that struggles to accurately reflect the height information of real three-dimensional space.
[0031] This application provides an audio processing method for multi-speaker scenarios. In one embodiment, the method is applied to a server. Optionally, the server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms.
[0032] For example, in the case of implementing spatial audio effects in two dimensions, the server can use the method of this application to map the original audio to target audio for four speakers, send the four target audio to the audio client, and the audio client controls the four speakers to render their respective target audio. For example, in the case of achieving spatial audio effects in three dimensions, the server can use the method of this application to map the original audio to target audio for 4n speakers, send the 4n target audio to the audio client, and the audio client controls the 4n speakers to render their respective target audio. In one embodiment, the method is applied to a terminal device that has an audio client installed. Optionally, the terminal device includes at least one of a smartphone, smartwatch, in-vehicle terminal, wearable device, smart TV, tablet computer, e-book reader, MP3 player, MP4 player, laptop computer, and desktop computer.
[0033] For example, in the case of implementing spatial audio effects in two dimensions, the audio client obtains the original audio sent by the server, uses the method of this application to map the original audio to the target audio of four speakers, and controls the four speakers to render their respective target audio. For example, in the case of achieving spatial audio effects in three dimensions, the audio client obtains the original audio sent by the server, uses the method of this application to map the original audio to target audio for 4n speakers, and controls the 4n speakers to render their respective target audio.
[0034] Figure 1 shows a flowchart of an audio processing method for a multi-speaker scenario provided by an exemplary embodiment of this application, applicable to achieving spatial audio effects in two dimensions. The method is described, by way of example, as being performed by a computer device, which may be a server or a terminal device. The method includes:
Step 120: Select at least four speakers from the multi-speaker scenario.
A multi-speaker scenario refers to a scenario with multiple speakers. Optionally, multi-speaker scenarios include home scenarios, in-vehicle scenarios, and cinema scenarios. Optionally, the speakers set in a multi-speaker scenario can include at least one of physical speakers and virtual speakers. Physical speakers are speaker devices, which can only play audio from their physical location. Virtual speakers are speakers created using software programs, which can play audio from the playback positions designed by the software program, thereby simulating a multi-channel surround sound effect.
[0035] In this application embodiment, at least four speakers are always selected for any multi-speaker scenario. This application is applicable to situations where the number of speakers in a multi-speaker scenario is greater than four.
[0036] Step 140: Obtain the speaker positions of at least four speakers and the sound source positions of the virtual sound source, which are represented by two-dimensional coordinates. Virtual sound sources are used to represent a point sound source in a multi-speaker environment. For example, in a car setting, the virtual sound source could be the driver's seat, the center of the vehicle, or the roof. In a movie theater setting, the virtual sound source could be the screen or the center of the theater. In a home setting, the virtual sound source could be the living room or bedroom.
[0037] For example, part (A) of Figure 2 illustrates the case where the multi-speaker scene is a virtual speaker scene, which contains multiple virtual speakers and a virtual sound source. Part (B) of Figure 2 illustrates the case where the multi-speaker scene is a virtual-real hybrid scene, which includes multiple virtual speakers, multiple physical speakers, and a virtual sound source.
[0038] In one embodiment, the virtual sound source is a system-configured sound source; for example, the initial position of the virtual sound source is written into the installation package file of the audio client or the server program. In another embodiment, the virtual sound source is a user-defined sound source. For example, a user can select a virtual sound source from multiple candidate virtual sound sources provided by the system. When audio is played in a multi-speaker scenario, the user will perceive that the audio is coming from the virtual sound source they selected. The virtual sound source selected by the user will correspond to a preset initial position.
[0039] For example, the speaker positions of at least four speakers are defined by at least four first two-dimensional coordinates; the sound source positions of the virtual sound source are defined by second two-dimensional coordinates.
[0040] In one embodiment, the multi-speaker scene constitutes a scene coordinate system, where the speaker positions of at least four speakers and the sound source position of the virtual sound source are coordinates within this scene coordinate system. Optionally, the speaker positions of the at least four speakers are obtained by the audio client through automatic detection. The audio client can detect the multi-speaker scene once each time audio is played to obtain the speaker positions of the at least four speakers, or it can reuse historical detection results, using the speaker positions of the at least four speakers obtained in a previous detection as the speaker positions of the at least four speakers used this time. Optionally, the sound source position of the virtual sound source is the position of the virtual sound source after its initial position is aligned to the scene coordinate system.
[0041] Step 160: Based on the relative positional relationship between the speaker positions of at least four speakers and the sound source position of the virtual sound source, at least four audio gains are calculated using a double-balanced mapping function. In one embodiment, a position space is formed based on the speaker positions of at least four speakers, and the two-dimensional coordinates corresponding to the sound source position of the virtual sound source in the position space are calculated. These two-dimensional coordinates represent the relative positional relationship between the sound source position of the virtual sound source and the speaker positions of the at least four speakers.
[0042] Optionally, the position space is a virtual space based on the speaker positions of at least four speakers. It can be understood that the position space is formed based on the relative positional relationships between the speaker positions of at least four speakers. The position space is a space different from the scene space of a multi-speaker scene.
[0043] The double-balance mapping function is used to equalize the mapping in the horizontal (x-axis) and vertical (y-axis) dimensions. The double-balance mapping function aims to map the energy of a virtual sound source onto a speaker based on the relative position of the speaker to the virtual sound source in the horizontal and vertical dimensions, ensuring that the sum of the mapped energy to each speaker is equivalent to the total audio energy emitted by the virtual sound source.
[0044] For example, a double-balanced mapping function includes a pair of mapping functions $f(\cdot)$ and $h(\cdot)$ that satisfy:

$$f^2(t) + h^2(t) = 1$$

The audio gains of the four speakers are:

$$g_1 = f(x)\,f(y), \quad g_2 = h(x)\,f(y), \quad g_3 = h(x)\,h(y), \quad g_4 = f(x)\,h(y)$$

where $x$ and $y$ are the horizontal and vertical components of the two-dimensional coordinates corresponding to the sound source position of the virtual sound source in the coordinate space formed by the speaker positions of the at least four speakers, and $g_1$, $g_2$, $g_3$, $g_4$ are the audio gains of the four speakers. The sum of the squares of the four audio gains is one, which satisfies the law of energy conservation.
[0045] Optionally, the double-balanced mapping function includes trigonometric mapping functions, piecewise linear functions, exponential functions, or other monotonic mapping functions; that is, the pair of mapping functions described above can be trigonometric functions, piecewise linear functions, exponential functions, or other functions that satisfy the constraint. Optionally, a suitable double-balanced mapping function can be selected based on the user's subjective listening needs and / or hardware limitations.
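The idea of swapping in different mapping function pairs can be sketched as follows. This is only an illustration: it assumes the pair must satisfy the energy-preserving constraint that the trigonometric pair satisfies (squares summing to one), and that each gain is the product of one horizontal and one vertical function value. The names `trig_pair`, `sqrt_linear_pair`, and `double_balanced_gains` are hypothetical, and the square-root pair is an example chosen here, not one named in the application.

```python
import math

def trig_pair(t):
    """Trigonometric pair for t in [0, 1]: cosine and sine of t * pi / 2."""
    a = t * math.pi / 2
    return math.cos(a), math.sin(a)

def sqrt_linear_pair(t):
    """Illustrative alternative pair: square roots of a linear crossfade."""
    return math.sqrt(1.0 - t), math.sqrt(t)

def double_balanced_gains(x, y, pair):
    """Four gains from normalized coordinates x, y in [0, 1].

    Assumption: each gain is a product of one horizontal and one vertical
    function value, so the squared gains sum to one for any valid pair.
    """
    fx, hx = pair(x)
    fy, hy = pair(y)
    return [fx * fy, hx * fy, hx * hy, fx * hy]

gains_trig = double_balanced_gains(0.3, 0.8, trig_pair)
gains_sqrt = double_balanced_gains(0.3, 0.8, sqrt_linear_pair)
```

Both pairs keep the total energy constant while distributing it differently between the speakers, which is the kind of trade-off a listening preference or hardware limit might decide.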
[0046] Step 180: Map the original audio through at least four audio gains to obtain at least four target audios. The at least four target audios are used to simulate the effect of playing the original audio through a virtual sound source during playback.
[0047] For example, at least four audio gains are multiplied by the audio signal of the original audio to obtain at least four target audios, whereby the audio gains are used to change the signal amplitude of the original audio.
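As a minimal sketch of this mapping step, the snippet below scales a mono signal by each gain to produce one target signal per speaker. The function name `map_audio` and the representation of audio as a list of float samples are assumptions made for illustration.

```python
def map_audio(samples, gains):
    """Multiply the original samples by each gain.

    Returns one list of samples (one target audio) per gain; only the
    amplitude changes, the sample order and count are preserved.
    """
    return [[g * s for s in samples] for g in gains]

original = [1.0, -0.5, 0.25]          # example mono samples
gains = [0.5, 0.5, 0.5, 0.5]          # example gains for four speakers
targets = map_audio(original, gains)  # four target audios
```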
[0048] For example, by playing the target audio through at least four speakers, the effect of playing the original audio through a virtual sound source can be simulated, and users in the scene will perceive that the original audio is coming from the virtual sound source.
[0049] In summary, for achieving spatial audio effects in two dimensions, this application embodiment will select at least four speakers from a multi-speaker scenario. Then, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source, at least four audio gains will be calculated using a double-balanced mapping function. The original audio will then be mapped using these at least four audio gains. In other words, in this application embodiment, for multi-speaker scenarios containing any number and location of speakers, only at least four speakers will be selected to calculate the audio gains using a double-balanced mapping function. This application is applicable to multi-speaker scenarios under any circumstances and can achieve spatial audio effects in any multi-speaker scenario, significantly improving adaptability and user freedom in complex multi-speaker scenarios such as homes and vehicles.
[0050] In addition, this application uses only a small number of speakers for mapping, which improves the clarity and stability of the virtual sound source's orientation and distance perception on the one hand, and reduces the computational complexity of a single rendering on the other.
[0051] In an optional embodiment based on Figure 1, the speaker positions of the at least four speakers include at least four first two-dimensional coordinates, and the sound source position of the virtual sound source is a second two-dimensional coordinate. Step 160 includes steps 320 and 340 shown in Figure 3. The method is described, by way of example, as being performed by a terminal device or a server, and includes:
Step 320: Calculate the first horizontal component and the first vertical component of the second two-dimensional coordinate in the coordinate space composed of the at least four first two-dimensional coordinates.
The first horizontal component and the first vertical component each represent an angle. Optionally, the first horizontal component characterizes the balance angle in the x-axis direction, and the first vertical component characterizes the balance angle in the y-axis direction. Optionally, the first horizontal component and the first vertical component fall within the interval from 0 to π / 2.
[0052] In one embodiment, based on the mapping relationship between at least four first two-dimensional coordinates and the vertex coordinates of the first square, the horizontal component of the second two-dimensional coordinates is mapped to a first horizontal component, and the vertical component of the second two-dimensional coordinates is mapped to a first vertical component, and the side length of the first square is π / 2.
[0053] Taking four speakers as an example, the first square is the square [0, π / 2] × [0, π / 2]. For instance, the four speakers are mapped to the first square through an affine transformation. Based on the mapping relationship between the four first two-dimensional coordinates and the coordinates of the four vertices of the first square, the horizontal component of the second two-dimensional coordinates can be mapped to a horizontal component falling within [0, π / 2], and the vertical component of the second two-dimensional coordinates can be mapped to a vertical component falling within [0, π / 2]. When the virtual sound source moves within the scene, its two-dimensional coordinates within the first square change continuously within [0, π / 2] × [0, π / 2].
[0054] Optionally, based on the mapping relationship between the at least four first two-dimensional coordinates and the vertex coordinates of the unit square, the horizontal component of the second two-dimensional coordinates is mapped to a second horizontal component, and the vertical component of the second two-dimensional coordinates is mapped to a second vertical component, with both falling in the interval from zero to one; the second horizontal component and the second vertical component are then each multiplied by π / 2 to obtain the first horizontal component and the first vertical component, respectively.
[0055] Taking four speakers as an example, with the side length of the unit square being 1, the four speakers are mapped, for instance, to the unit square through an affine transformation. Based on the mapping relationship between the four first two-dimensional coordinates and the coordinates of the four vertices of the unit square, the horizontal component of the second two-dimensional coordinates can be mapped to the second horizontal component falling within [0, 1], which is then multiplied by π / 2 to obtain the first horizontal component. Similarly, the vertical component of the second two-dimensional coordinates can be mapped to the second vertical component falling within [0, 1], which is then multiplied by π / 2 to obtain the first vertical component.
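The two-stage mapping described above (scene coordinates to the unit square, then scaling by π / 2) can be sketched as follows. This is a simplified illustration that assumes the affine transformation is a normalization against the axis-aligned bounding rectangle of the speakers, and that a source outside that rectangle is clamped to its edge; both choices, and the function names, are assumptions rather than details stated in the text.

```python
import math

def normalize_to_unit_square(source, speakers):
    """Map the source position into [0, 1] x [0, 1].

    Assumption: the speakers' axis-aligned bounding rectangle is mapped onto
    the unit square, and out-of-range values are clamped.
    """
    xs = [p[0] for p in speakers]
    ys = [p[1] for p in speakers]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("speakers must span a non-degenerate rectangle")
    u = (source[0] - x_min) / (x_max - x_min)  # second horizontal component
    v = (source[1] - y_min) / (y_max - y_min)  # second vertical component
    return min(max(u, 0.0), 1.0), min(max(v, 0.0), 1.0)

def balance_angles(source, speakers):
    """Scale the unit-square components by pi / 2 to get the two angles."""
    u, v = normalize_to_unit_square(source, speakers)
    return u * math.pi / 2, v * math.pi / 2

speakers = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
theta_x, theta_y = balance_angles((1.0, 1.0), speakers)
```

For speakers that do not sit on a rectangle, a different transformation would be needed; the bounding-rectangle normalization is simply the easiest case to show.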
[0056] Step 340: Based on the first horizontal component and the first vertical component, four audio gains are calculated using a trigonometric double-balanced function.
[0057] A trigonometric double-balanced function includes a first cosine function and a first sine function, where the sum of the square of the first sine function and the square of the first cosine function is one. For example, the pair of mapping functions of a trigonometric double-balanced function consists of a cosine function and a sine function. In one embodiment, the first function value is obtained by applying the first cosine function to the first horizontal component; the second function value is obtained by applying the first cosine function to the first vertical component; the third function value is obtained by applying the first sine function to the first horizontal component; and the fourth function value is obtained by applying the first sine function to the first vertical component.
[0058] Furthermore, based on the calculation results of the first function value and the second function value, the first audio gain among at least four audio gains is obtained; Based on the calculation results of the third function value and the second function value, the second audio gain among at least four audio gains is obtained; Based on the calculation results of the third and fourth function values, the third audio gain among at least four audio gains is obtained; Based on the calculation results of the first and fourth function values, the fourth audio gain among at least four audio gains is obtained.
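[0059] For example, in the case of four audio gains, the gains are expressed as follows: $g_1 = \cos(\theta_x)\cos(\theta_y)$; $g_2 = \sin(\theta_x)\cos(\theta_y)$; $g_3 = \sin(\theta_x)\sin(\theta_y)$; $g_4 = \cos(\theta_x)\sin(\theta_y)$; where $\theta_x$ is the first horizontal component and $\theta_y$ is the first vertical component; $\cos(\theta_x)$, $\cos(\theta_y)$, $\sin(\theta_x)$ and $\sin(\theta_y)$ are the first function value, the second function value, the third function value and the fourth function value, respectively.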
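The gain calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming that each gain is the product of the two function values paired in the preceding paragraph; the function name `double_balanced_gains` and the sample angles are hypothetical and do not come from the application.

```python
import math


def double_balanced_gains(theta_x: float, theta_y: float):
    """Return four gains for angles theta_x, theta_y in [0, pi/2].

    Assumption: each gain is the product of the two function values
    that the text pairs together.
    """
    cos_x = math.cos(theta_x)  # first function value
    cos_y = math.cos(theta_y)  # second function value
    sin_x = math.sin(theta_x)  # third function value
    sin_y = math.sin(theta_y)  # fourth function value
    g1 = cos_x * cos_y  # first audio gain
    g2 = sin_x * cos_y  # second audio gain
    g3 = sin_x * sin_y  # third audio gain
    g4 = cos_x * sin_y  # fourth audio gain
    return g1, g2, g3, g4


# Hypothetical sample angles inside the first square.
gains = double_balanced_gains(math.pi / 6, math.pi / 3)
total_energy = sum(g * g for g in gains)  # sum of squared gains
corner_gains = double_balanced_gains(0.0, 0.0)
```

Under this assumption the squared gains always sum to one, because cos² + sin² = 1 holds separately for each of the two angles, so the total energy stays constant as the virtual sound source moves.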
[0060] In one embodiment, the first audio gain is the audio gain corresponding to the speaker on the first horizontal side of the minimum enclosing rectangle of the at least four speakers; the second audio gain is the audio gain corresponding to the speaker on the first vertical side of the minimum enclosing rectangle; the third audio gain is the audio gain corresponding to the speaker on the second horizontal side of the minimum enclosing rectangle; and the fourth audio gain is the audio gain corresponding to the speaker on the second vertical side of the minimum enclosing rectangle. Here, the vertical coordinate (i.e., the y-axis coordinate) of the first horizontal side is less than the vertical coordinate of the second horizontal side, and the horizontal coordinate (i.e., the x-axis coordinate) of the first vertical side is greater than the horizontal coordinate of the second vertical side.
[0061] Referring to Figure 4, which illustrates a case with four speakers, the first audio gain is the audio gain corresponding to speaker A, the second audio gain is the audio gain corresponding to speaker B, the third audio gain is the audio gain corresponding to speaker C, and the fourth audio gain is the audio gain corresponding to speaker D.
[0062] The above embodiments provide an audio mapping scheme that uses a trigonometric-type double-balanced function. With this mapping, the audio energy at the virtual sound source can be evenly distributed to the four speakers, and the sound image is continuous without changing the phase of the audio. The target audio obtained by mapping is more in line with the hearing characteristics of the human ear and is more suitable for music scenarios.
[0063] In an optional embodiment based on Figure 1, step 120 includes step S1.
[0064] S1: Based on the first selection condition, select a first tuple from multiple speakers in the multi-speaker scenario, the first tuple including at least four speakers. The first selection condition includes that the convex polygon formed by the first tuple contains the virtual sound source.
[0065] All interior angles of a convex polygon are less than 180°, and when any side is extended infinitely in either direction, the entire shape lies on the same side of that side. Understandably, the tuples selected based on the first selection condition can be mapped to the first square through an affine transformation, thus ensuring that the virtual sound source is located within the first square.
[0066] For example, Figure 4 illustrates the case where the first tuple is a first quadruple. The first quadruple includes speaker A, speaker B, speaker C and speaker D, and the virtual sound source P falls within the convex quadrilateral ABCD.
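As an illustration of the first selection condition, the following sketch tests whether a point lies inside a convex polygon by checking that it is on the same side of every edge. The function name `contains_point` and the sample coordinates for A, B, C, D and P are assumptions made for this example; the application does not prescribe a particular containment test.

```python
def contains_point(polygon, point):
    """Return True if `point` is inside or on the boundary of a convex polygon.

    `polygon` is a list of (x, y) vertices given in order around the polygon
    (clockwise or counter-clockwise). The polygon is assumed to be convex.
    """
    px, py = point
    sign = 0
    count = len(polygon)
    for i in range(count):
        ax, ay = polygon[i]
        bx, by = polygon[(i + 1) % count]
        # Cross product tells on which side of edge A->B the point lies.
        cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        if cross == 0:
            continue  # point lies on the line through this edge
        current = 1 if cross > 0 else -1
        if sign == 0:
            sign = current
        elif current != sign:
            return False  # point is on different sides of two edges
    return True


# Hypothetical coordinates for speakers A, B, C, D and virtual source P.
quad_abcd = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
inside = contains_point(quad_abcd, (1.0, 1.0))
outside = contains_point(quad_abcd, (5.0, 1.0))
```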
[0067] Optionally, based on the first selection condition, k candidate tuples are selected from multiple speakers in a multi-speaker scenario, each candidate tuple including at least four candidate speakers, where k is a positive integer greater than one; based on the second selection condition, a first tuple is selected from the k candidate tuples; wherein, the second selection condition includes minimizing the sum of the distance values from the virtual sound source to each speaker in the candidate tuple.
[0068] Optionally, the candidate tuple is a candidate quadruple, and each candidate quadruple includes four candidate speakers.
[0069] After k candidate tuples are selected from all speakers in the multi-speaker scenario based on the first selection condition, the first tuple is selected from these k candidate tuples. The first tuple is the candidate tuple with the minimum sum of distances between its speakers and the virtual sound source. Selecting the tuple with the minimum total distance to the virtual sound source maximizes the ability of the speakers to "anchor" the virtual sound source, while minimizing the mapping error of the target audio of the at least four speakers and the degree of auditory distortion perceived by the user.
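The two selection conditions can be combined in a brute-force sketch such as the one below. It assumes candidate quadruples, enumerates every four-speaker combination, keeps those whose convex quadrilateral contains the virtual sound source, and returns the one with the smallest distance sum. All function names and coordinates are hypothetical, and exhaustive enumeration is only one possible search strategy.

```python
import itertools
import math


def _cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _order_counter_clockwise(points):
    """Sort points by angle around their centroid (counter-clockwise)."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


def _is_convex_and_contains(quad, source):
    count = len(quad)
    # First selection condition, part 1: strictly convex quadrilateral.
    for i in range(count):
        if _cross(quad[i], quad[(i + 1) % count], quad[(i + 2) % count]) <= 0:
            return False
    # First selection condition, part 2: the source is inside or on an edge.
    return all(
        _cross(quad[i], quad[(i + 1) % count], source) >= 0
        for i in range(count)
    )


def select_first_tuple(speakers, source):
    """Return the qualifying quadruple with the minimum distance sum, or None."""
    best, best_cost = None, math.inf
    for combo in itertools.combinations(speakers, 4):
        quad = _order_counter_clockwise(list(combo))
        if not _is_convex_and_contains(quad, source):
            continue
        # Second selection condition: minimize the summed distance.
        cost = sum(math.dist(source, speaker) for speaker in quad)
        if cost < best_cost:
            best, best_cost = quad, cost
    return best


# Hypothetical layout: an inner and an outer square of speakers.
inner = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
outer = [(3.0, 3.0), (-3.0, 3.0), (-3.0, -3.0), (3.0, -3.0)]
chosen = select_first_tuple(inner + outer, (0.0, 0.0))
```

For large speaker counts the number of combinations grows quickly, so a practical implementation might first restrict the search to the speakers nearest the virtual sound source.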
[0070] Figure 5 is a flowchart of an audio processing method for a multi-speaker scene provided by an exemplary embodiment of this application, applicable to achieving spatial audio effects in three dimensions. The method is described, by way of example, as being performed by a computer device, which may be a server or a terminal device. The method includes: Step 510: Select n speaker layers from the L speaker layers contained in the multi-speaker scene, where each speaker layer aggregates speakers with similar heights. Both n and L are positive integers, where n is less than L and greater than one.
[0071] In one embodiment, all speakers in a multi-speaker scenario are aggregated into L speaker layers, with each speaker layer containing speakers of similar height. Optionally, at least one of the following methods can be used to cluster all speakers in the multi-speaker scenario to obtain L speaker layers, with each speaker layer being a cluster.
[0072] Optionally, the height of each of the L speaker layers does not exceed a height threshold. Optionally, the height difference between any two adjacent speaker layers in the L speaker layers is greater than a height difference threshold.
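The text does not name a specific clustering method, so the following is only one possible sketch: speakers are sorted by height and a new layer is started whenever the gap to the previous speaker exceeds a threshold. The function name `cluster_into_layers`, the `gap_threshold` parameter and the sample heights are all assumptions made for illustration.

```python
def cluster_into_layers(heights, gap_threshold):
    """Group speaker indices into layers of similar height.

    Assumed rule (not from the application): after sorting by height,
    a speaker joins the current layer if it is at most `gap_threshold`
    above the previous speaker; otherwise it starts a new layer.
    """
    order = sorted(range(len(heights)), key=lambda i: heights[i])
    layers = []
    for index in order:
        if layers and heights[index] - heights[layers[-1][-1]] <= gap_threshold:
            layers[-1].append(index)
        else:
            layers.append([index])
    return layers


# Hypothetical speaker heights in metres.
speaker_heights = [0.0, 0.1, 1.2, 1.25, 2.4]
layers = cluster_into_layers(speaker_heights, gap_threshold=0.5)
```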
[0073] Optionally, n is a fixed value.
[0074] Optionally, n speaker layers are randomly selected from the L speaker layers. Optionally, the n speaker layers closest to the virtual sound source are selected from the L speaker layers. Optionally, n can be two, and the first speaker layer above the virtual sound source and the first speaker layer below the virtual sound source can be selected from the L speaker layers as n speaker layers.
[0075] Optionally, n can be four, selecting the first and second speaker layers above the virtual sound source and the first and second speaker layers below the virtual sound source from the L speaker layers, as n speaker layers.
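For the case where n is two, picking the first layer below and the first layer above the virtual sound source can be sketched with a binary search over the sorted layer heights. The function name `bracket_layers` and the handling of boundary cases (a layer exactly at the source height counts as the lower layer, and `None` is returned when no bracketing pair exists) are assumptions made for this example.

```python
import bisect


def bracket_layers(layer_heights, source_height):
    """Return (lower_index, upper_index) of the layers around the source.

    `layer_heights` must be sorted in ascending order. Returns None when
    the source is not between two layers (an assumed convention).
    """
    upper = bisect.bisect_right(layer_heights, source_height)
    lower = upper - 1
    if lower < 0 or upper >= len(layer_heights):
        return None
    return lower, upper


# Hypothetical layer heights in metres.
layer_heights = [0.0, 1.2, 2.4]
pair = bracket_layers(layer_heights, 1.5)
```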
[0076] Step 520: Select at least four speakers in each of the n speaker layers. Optionally, step 120 above, or step S1 derived from step 120, may be performed in each of the n speaker layers.
[0077] Step 530: Obtain the speaker positions of the at least four speakers and the sound source position of the virtual sound source; the speaker positions and the sound source position are represented by two-dimensional coordinates. Optionally, the process shown in step 140 above can be performed for each speaker layer.
[0078] Step 540: Based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source, at least four audio gains are calculated using a double-balanced mapping function. Optionally, for the speaker positions of the at least four speakers obtained in each speaker layer, step 160 above, or steps 320 and 340 derived from step 160, is performed.
[0079] Step 550: Based on at least four audio gains, the original audio is mapped in combination with the layer weights corresponding to the current speaker layer to obtain at least four target audios; all target audios from the n speaker layers are used to simulate the effect of playing the original audio through a virtual sound source during playback.
[0080] Each of the n speaker layers has its own layer weight, with a value between 0 and 1. The layer weight represents the proportion of audio energy received by the speaker layer relative to the total energy emitted by the virtual sound source. Optionally, the layer weight is negatively correlated with the height difference between the speaker layer and the virtual sound source; the larger the height difference, the smaller the layer weight. Optionally, the layer weight is negatively correlated with a first ratio, which is the ratio of the height difference between the current speaker layer and the virtual sound source to the height difference between the highest and lowest speaker layers.
[0081] The sum of the weights of the n layers corresponding to the n speaker layers is one.
[0082] For example, a layer weight calculation function is constructed. The input of the layer weight calculation function is the height difference between each of the n speaker layers and the virtual sound source, and the output is the layer weight corresponding to each of the n speaker layers. The layer weight calculation function satisfies that the sum of the n output layer weights is one.
[0083] For example, for at least four speakers in a speaker layer, the product of the audio gain corresponding to each speaker and the layer weight of the speaker layer is calculated to obtain the updated audio gain of each speaker. The product of the updated audio gain of each speaker and the audio signal of the original audio is calculated to obtain the target audio corresponding to each speaker.
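The two products described above can be sketched as follows. This is a minimal illustration that treats the original audio as a list of samples; the function name `render_layer` and the sample values are hypothetical.

```python
def render_layer(samples, gains, layer_weight):
    """Return one target audio signal per speaker in a speaker layer.

    Each speaker's updated gain is its audio gain times the layer weight,
    and its target audio is the updated gain times every original sample.
    """
    updated_gains = [gain * layer_weight for gain in gains]
    return [[updated * sample for sample in samples] for updated in updated_gains]


# Hypothetical values: two samples, four equal gains, layer weight 0.5.
original_audio = [1.0, -0.5]
targets = render_layer(original_audio, [0.5, 0.5, 0.5, 0.5], 0.5)
```

Because only a real-valued scale factor is applied to each sample, the mapping changes amplitude and leaves the phase of the original audio untouched.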
[0084] For example, by playing the corresponding target audio through at least four speakers in each of the n speaker layers, i.e., at least 4n speakers, the effect of playing the original audio through a virtual sound source can be simulated, and the user in the scene will perceive that the original audio is coming from the virtual sound source.
[0085] In summary, for achieving spatial audio effects in three dimensions, this application embodiment selects n speaker layers from a multi-speaker scene, then selects at least four speakers for each speaker layer, and then calculates at least four audio gains using a double-balanced mapping function based on the relative coordinate relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source. The original audio is then mapped using these at least four audio gains. In other words, in this application embodiment, for a multi-speaker scene containing any number and location of speakers, only n speaker layers are selected, and the audio gains of at least four speakers in each speaker layer are calculated using a double-balanced mapping function. This application can achieve spatial audio effects in any multi-speaker scene without needing to match a preset multi-speaker placement format.
[0086] Furthermore, by introducing a height layering and energy distribution mechanism, sound source mapping in real three-dimensional space is achieved, and only n speaker layers are extracted for mapping, which keeps the amount of computation under control.
[0087] Combined with the method embodiment shown in Figure 1, this application employs the same double-balanced mapping framework to support spatial audio effects in both two dimensions and three dimensions. For different situations, a suitable double-balanced calculation method can be selected to obtain a rendering result that is closer to the intention of the production end.
[0088] In an optional embodiment based on Figure 5, step 510 includes: selecting the P-th speaker layer and the Q-th speaker layer from the L speaker layers included in the multi-speaker scenario, where the P-th speaker layer is the first speaker layer located below the virtual sound source, and the Q-th speaker layer is the first speaker layer located above the virtual sound source.
[0089] For the P-th speaker layer, step 530 includes: obtaining at least four P-th two-dimensional coordinates and a second two-dimensional coordinate, wherein the at least four P-th two-dimensional coordinates are the two-dimensional coordinates of at least four P-th speakers in the P-th speaker layer, and the second two-dimensional coordinate is the two-dimensional coordinate of the virtual sound source. For this step, refer to step 140 above.
[0090] Step 540 includes: calculating at least four P-th audio gains using a double-balanced mapping function based on the relative coordinate relationships between the at least four P-th two-dimensional coordinates and the second two-dimensional coordinate. For this step, refer to step 160 above and its derivative steps 320 and 340.
[0091] Step 550 includes: mapping the original audio based on at least four P-th audio gains and combining them with P-th layer weights to obtain at least four P-th target audios, where the P-th layer weights are the weights corresponding to the P-th speaker layer.
[0092] In one embodiment, the first ratio is mapped to the P-th layer weight using a cosine function; the P-th layer weight is the layer weight of the P-th speaker layer. Here, the first ratio is the ratio of the first difference to the second difference, where the first difference is the difference between the height of the virtual sound source and the height of the P-th speaker layer, and the second difference is the difference between the height of the Q-th speaker layer and the height of the P-th speaker layer. For example, the formula for calculating the P-th layer weight is as follows: $w_P = \cos\left(\frac{\pi}{2}\cdot\frac{h - h_P}{h_Q - h_P}\right)$; where $h$ indicates the height of the virtual sound source, $h_P$ indicates the height of the P-th speaker layer, and $h_Q$ indicates the height of the Q-th speaker layer. Because the P-th layer weight is obtained through a cosine function, the mapping changes only the amplitude of the original audio without altering its phase. The resulting target audio is more in line with the auditory characteristics of the human ear and is more suitable for music scenarios.
[0093] Optionally, the height of the Pth speaker layer is the height of a representative speaker within the Pth speaker layer, or the height of the Pth speaker layer is the average height of all speakers in the Pth speaker layer.
[0094] For example, the product of the Pth audio gain corresponding to each Pth speaker in the Pth speaker layer and the layer weight of the Pth speaker layer is calculated to obtain the updated Pth audio gain of each Pth speaker; the product of the updated Pth audio gain of each Pth speaker and the audio signal of the original audio is calculated to obtain the Pth target audio corresponding to each Pth speaker.
[0095] For example, the at least four updated P-th audio gains are given by the following equations: $g'_{P,1} = w_P\cos(\theta_x)\cos(\theta_y)$; $g'_{P,2} = w_P\sin(\theta_x)\cos(\theta_y)$; $g'_{P,3} = w_P\sin(\theta_x)\sin(\theta_y)$; $g'_{P,4} = w_P\cos(\theta_x)\sin(\theta_y)$; where $w_P$ is the P-th layer weight, and $\theta_x$ and $\theta_y$ are the first horizontal component and the first vertical component calculated for the P-th speaker layer.
For the Q-th speaker layer, step 530 includes: obtaining at least four Q-th two-dimensional coordinates and a second two-dimensional coordinate, wherein the at least four Q-th two-dimensional coordinates are the two-dimensional coordinates of at least four Q-th speakers in the Q-th speaker layer, and the second two-dimensional coordinate is the two-dimensional coordinate of the virtual sound source. For this step, refer to step 140 above.
[0096] Step 540 includes: calculating at least four Q-th audio gains using a double-balanced mapping function based on the relative coordinate relationships between the at least four Q-th two-dimensional coordinates and the second two-dimensional coordinate. For this step, refer to step 160 above and its derivative steps 320 and 340.
[0097] Step 550 includes: mapping the original audio based on at least four Q-th audio gains and combining them with Q-th layer weights to obtain at least four Q-th target audios, where the Q-th layer weights are the weights corresponding to the Q-th speaker layer.
[0098] In one embodiment, the first ratio is mapped to the Q-th layer weight using a sine function; the Q-th layer weight is the layer weight of the Q-th speaker layer. Here, the first ratio is the ratio of the first difference to the second difference, where the first difference is the difference between the height of the virtual sound source and the height of the P-th speaker layer, and the second difference is the difference between the height of the Q-th speaker layer and the height of the P-th speaker layer. For example, the formula for calculating the Q-th layer weight is as follows: $w_Q = \sin\left(\frac{\pi}{2}\cdot\frac{h - h_P}{h_Q - h_P}\right)$; where $h$ indicates the height of the virtual sound source, $h_Q$ indicates the height of the Q-th speaker layer, and $h_P$ indicates the height of the P-th speaker layer. Because the Q-th layer weight is obtained through a sine function, the mapping changes only the amplitude of the original audio without altering its phase. The target audio obtained by mapping is more in line with the auditory characteristics of the human ear and is more suitable for music scenarios.
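The cosine and sine layer weights can be sketched together as below. The sketch assumes that the first ratio is scaled by π/2 before the trigonometric functions are applied, so that each weight moves between zero and one as the virtual sound source travels from one layer to the other; the function name `layer_weights` and the sample heights are hypothetical.

```python
import math


def layer_weights(source_height, lower_height, upper_height):
    """Return (weight of lower layer P, weight of upper layer Q).

    Assumption: the first ratio is multiplied by pi/2 to form an angle,
    which is then passed to cosine (layer P) and sine (layer Q).
    """
    ratio = (source_height - lower_height) / (upper_height - lower_height)
    angle = ratio * math.pi / 2
    return math.cos(angle), math.sin(angle)


# Hypothetical heights: the source sits halfway between the two layers.
weight_p, weight_q = layer_weights(1.5, 1.0, 2.0)
at_lower_layer = layer_weights(1.0, 1.0, 2.0)
```

With this pair of mappings the squares of the two weights sum to one, so the energy split between the two layers is constant at every source height.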
[0099] Optionally, the height of the Qth speaker layer is the height value of a representative speaker in the Qth speaker layer, or the height of the Qth speaker layer is the average height of all speakers in the Qth speaker layer.
[0100] For example, the product of the Q-th audio gain corresponding to each Q-th speaker in the Q-th speaker layer and the layer weight of the Q-th speaker layer is calculated to obtain the updated Q-th audio gain of each Q-th speaker; the product of the updated Q-th audio gain of each Q-th speaker and the audio signal of the original audio is calculated to obtain the Q-th target audio corresponding to each Q-th speaker.
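[0101] For example, the at least four updated Q-th audio gains are given by the following equations: $g'_{Q,1} = w_Q\cos(\theta_x)\cos(\theta_y)$; $g'_{Q,2} = w_Q\sin(\theta_x)\cos(\theta_y)$; $g'_{Q,3} = w_Q\sin(\theta_x)\sin(\theta_y)$; $g'_{Q,4} = w_Q\cos(\theta_x)\sin(\theta_y)$; where $w_Q$ is the Q-th layer weight, and $\theta_x$ and $\theta_y$ are the first horizontal component and the first vertical component calculated for the Q-th speaker layer.
In the above embodiments, a layered, trigonometric-type double-balanced algorithm achieves continuous and smooth sound image migration as the position of the virtual sound source changes in three-dimensional space, while keeping the computational complexity under control.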
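As a worked check of the energy behaviour across the two layers, the following derivation is offered as a sketch under two assumptions that are not spelled out in the text: the four gains in each layer are products of cosine and sine values (so their squares sum to one within a layer), and the two layer weights are the cosine and sine of the same angle $\phi$. The symbols $g_{P,i}$, $g_{Q,i}$, $w_P$, $w_Q$ and $\phi$ are notation chosen for this illustration.

```latex
\[
\sum_{i=1}^{4} \left(w_P\, g_{P,i}\right)^2 + \sum_{i=1}^{4} \left(w_Q\, g_{Q,i}\right)^2
= w_P^2 \sum_{i=1}^{4} g_{P,i}^2 + w_Q^2 \sum_{i=1}^{4} g_{Q,i}^2
= \cos^2\phi + \sin^2\phi
= 1
\]
```

Under these assumptions the total energy delivered to the eight speakers is independent of where the virtual sound source sits between the two layers.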
[0102] In the above embodiments, the dual-balance mapping scheme proposed in this application supports any number and placement of speakers. It can complete the precise mapping from virtual sound source to speaker channel without following a standardized layout. This allows different users to obtain a realistic spatial audio experience close to the intention of music producers in different multi-speaker scenarios, thereby reducing the threshold for system deployment and user use, and significantly improving the consistency and immersion of the listening experience.
[0103] Furthermore, this application selects only a small number of speakers to participate in the rendering of the virtual sound source mapping. On the one hand, it improves the clarity and stability of the virtual sound source's orientation and distance perception, reduces sound image blurring and "pulling" phenomena, and on the other hand, it effectively controls the computational overhead of the rendering end, making it easy to run in real time on the terminal device.
[0104] For in-vehicle scenarios, this application provides a more convenient and unified mapping solution, making the listening experience more similar for different car models and different speaker layouts, and significantly reducing the time and manpower costs for car manufacturers to perform individual sound field tuning for each model.
[0105] Figure 6 illustrates an audio processing apparatus for a multi-speaker scenario according to an exemplary embodiment of this application, the apparatus comprising: a selection module 601, used to select at least four speakers from a multi-speaker scenario; an acquisition module 602, used to acquire the speaker positions of the at least four speakers and the sound source position of the virtual sound source, where the speaker positions and the sound source position are represented by two-dimensional coordinates; a calculation module 603, used to calculate at least four audio gains through a double-balanced mapping function based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and a mapping module 604, used to map the original audio through the at least four audio gains to obtain at least four target audios, which are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0106] In an optional embodiment, the speaker positions of the at least four speakers are defined by at least four first two-dimensional coordinates, and the sound source position of the virtual sound source is defined by a second two-dimensional coordinate. The calculation module 603 is used to calculate the first horizontal component and the first vertical component of the second two-dimensional coordinate in the coordinate space formed by the at least four first two-dimensional coordinates, where the first horizontal component and the first vertical component each represent an angle, and to calculate at least four audio gains from the first horizontal component and the first vertical component using a trigonometric-type double-balanced function. Here, the trigonometric-type double-balanced function includes a first cosine function and a first sine function, and the sum of the square of the function value of the first sine function and the square of the function value of the first cosine function is one.
[0107] In an optional embodiment, the calculation module 603 is further configured to obtain a first audio gain among at least four audio gains based on the calculation results of the first function value and the second function value; Based on the calculation results of the third function value and the second function value, the second audio gain among at least four audio gains is obtained; Based on the calculation results of the third and fourth function values, the third audio gain among at least four audio gains is obtained; Based on the calculation results of the first function value and the fourth function value, the fourth audio gain among at least four audio gains is obtained; Wherein, the first function value is the output result of the first cosine function responding to the first horizontal component; the second function value is the output result of the first cosine function responding to the first vertical component; the third function value is the output result of the first sine function responding to the first horizontal component; and the fourth function value is the output result of the first sine function responding to the first vertical component.
[0108] In an optional embodiment, the first audio gain is the audio gain corresponding to the speakers on the first horizontal side of the smallest enclosing rectangle of at least four speakers; The second audio gain is the audio gain corresponding to the speakers on the first vertical side of the minimum enclosing rectangle of at least four speakers; The third audio gain is the audio gain corresponding to the speakers on the second horizontal side of the smallest enclosing rectangle of at least four speakers; The fourth audio gain is the audio gain corresponding to the speaker on the second vertical side of the smallest enclosing rectangle of at least four speakers; The vertical coordinate of the first horizontal side is less than the vertical coordinate of the second horizontal side; the horizontal coordinate of the first vertical side is greater than the horizontal coordinate of the second vertical side.
[0109] In an optional embodiment, the calculation module 603 is further configured to map the horizontal component of the second two-dimensional coordinates to a first horizontal component based on the mapping relationship between at least four first two-dimensional coordinates and the vertex coordinates of the first square, and to map the vertical component of the second two-dimensional coordinates to a first vertical component, wherein the side length of the first square is π / 2.
[0110] In an optional embodiment, the calculation module 603 is further configured to map the horizontal component of the second two-dimensional coordinate to a second horizontal component and the vertical component of the second two-dimensional coordinate to a second vertical component, based on the mapping relationship between the at least four first two-dimensional coordinates and the vertex coordinates of the unit square, wherein the second horizontal component and the second vertical component fall in the range of zero to one; and to multiply the second horizontal component and the second vertical component each by π / 2 to obtain the first horizontal component and the first vertical component, respectively.
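The two-stage mapping (normalize to the unit square, then scale by π / 2) can be sketched as follows. For simplicity this example assumes that the mapping relationship is given by the axis-aligned minimum enclosing rectangle of the speakers; the application may use a more general mapping for non-rectangular quadrilaterals. The function name `to_angles` and the coordinates are hypothetical.

```python
import math


def to_angles(speaker_positions, source_position):
    """Map the source position to (first horizontal, first vertical) components.

    Assumption: the unit-square mapping is a per-axis normalization against
    the axis-aligned bounding rectangle of the speaker positions.
    """
    xs = [x for x, _ in speaker_positions]
    ys = [y for _, y in speaker_positions]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Second horizontal and vertical components, each in [0, 1].
    u = (source_position[0] - x_min) / (x_max - x_min)
    v = (source_position[1] - y_min) / (y_max - y_min)
    # First horizontal and vertical components, each in [0, pi/2].
    return u * math.pi / 2, v * math.pi / 2


# Hypothetical 4 m by 2 m speaker rectangle with the source at (1, 1).
speaker_positions = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
theta_x, theta_y = to_angles(speaker_positions, (1.0, 1.0))
```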
[0111] In an optional embodiment, the selection module 601 is configured to select a first tuple from a plurality of speakers in a multi-speaker scenario based on a first selection condition, the first tuple including at least four speakers. The first selection condition includes that the convex quadrilateral formed by the first tuple contains the virtual sound source.
[0112] In an optional embodiment, the selection module 601 is used to select k candidate tuples from multiple speakers in a multi-speaker scenario based on a first selection condition, wherein each candidate tuple includes at least four candidate speakers, and k is a positive integer greater than one. Based on the second selection condition, the first tuple is selected from k candidate tuples; The second selection criterion includes minimizing the sum of the distance values from the virtual sound source to each speaker in the candidate tuple.
[0113] In summary, for achieving spatial audio effects in two dimensions, this application embodiment will select at least four speakers from a multi-speaker scenario. Then, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source, at least four audio gains will be calculated using a double-balanced mapping function. The original audio will then be mapped using these at least four audio gains. In other words, in this application embodiment, for multi-speaker scenarios containing any number and location of speakers, only at least four speakers will be selected to calculate the audio gains using a double-balanced mapping function. This application is applicable to multi-speaker scenarios under any circumstances and can achieve spatial audio effects in any multi-speaker scenario, significantly improving adaptability and user freedom in complex multi-speaker scenarios such as homes and vehicles.
[0114] In addition, this application uses only a small number of speakers for mapping, which improves the clarity and stability of the virtual sound source's orientation and distance perception on the one hand, and reduces the computational complexity of a single rendering on the other.
[0115] Figure 7 illustrates an audio processing apparatus for a multi-speaker scenario according to an exemplary embodiment of this application, the apparatus comprising: a selection module 701, used to select n speaker layers from the L speaker layers contained in a multi-speaker scene, where each speaker layer aggregates speakers with similar heights, n and L are both positive integers, and n is less than L and greater than one; the selection module 701 is also used to select at least four speakers in each of the n speaker layers; an acquisition module 702, used to acquire the speaker positions of the at least four speakers and the sound source position of the virtual sound source, where the speaker positions and the sound source position are represented by two-dimensional coordinates; a calculation module 703, used to calculate at least four audio gains through a double-balanced mapping function based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and a mapping module 704, used to map the original audio based on the at least four audio gains and the layer weight corresponding to the current speaker layer to obtain at least four target audios, where all the target audios of the n speaker layers are used to simulate, during playback, the effect of playing the original audio through the virtual sound source.
[0116] In an optional embodiment, the selection module 701 is further configured to select the Pth speaker layer and the Qth speaker layer from the L speaker layers included in the multi-speaker scenario. Among them, the P-th speaker layer is the first speaker layer located below the virtual sound source, and the Q-th speaker layer is the first speaker layer located above the virtual sound source.
[0117] In an optional embodiment, for the P-th speaker layer, the acquisition module 702 is used to acquire at least four P-th two-dimensional coordinates and a second two-dimensional coordinate, wherein the at least four P-th two-dimensional coordinates are the two-dimensional coordinates of at least four P-th speakers in the P-th speaker layer, and the second two-dimensional coordinates are the two-dimensional coordinates of the virtual sound source. The calculation module 703 is used to calculate at least four P-th audio gains based on the relative coordinate relationships between at least four P-th two-dimensional coordinates and the second two-dimensional coordinates through a double-balanced mapping function. The mapping module 704 is used to map the original audio based on at least four P-th audio gains and combined with P-th layer weights to obtain at least four P-th target audios, where the P-th layer weights are the weights corresponding to the P-th speaker layer.
[0118] In an optional embodiment, for the Q-th speaker layer, the acquisition module 702 is used to acquire at least four Q-th two-dimensional coordinates and a second two-dimensional coordinate, wherein the at least four Q-th two-dimensional coordinates are the two-dimensional coordinates of at least four Q-th speakers in the Q-th speaker layer; and the second two-dimensional coordinates are the two-dimensional coordinates of the virtual sound source. Calculation module 703 is used to calculate at least four Q-th audio gains based on the relative coordinate relationships between at least four Q-th two-dimensional coordinates and the second two-dimensional coordinates through a double-balanced mapping function; The mapping module 704 is used to map the original audio based on at least four Q-th audio gains and combined with Q-th layer weights to obtain at least four Q-th target audios, where the Q-th layer weights are the weights corresponding to the Q-th speaker layer.
[0119] In an optional embodiment, the mapping module 704 is further configured to map the first ratio to the layer weight of the P-th speaker layer using a cosine function. Wherein, the first ratio is the ratio of the first difference to the second difference, the first difference is the difference between the height of the virtual sound source and the height of the Pth speaker layer, and the second difference is the difference between the height of the Qth speaker layer and the Pth speaker layer.
[0120] In an optional embodiment, the mapping module 704 is further configured to map the first ratio to the layer weight of the Q-th speaker layer using a sine function. Wherein, the first ratio is the ratio of the first difference to the second difference, the first difference is the difference between the height of the virtual sound source and the height of the Pth speaker layer, and the second difference is the difference between the height of the Qth speaker layer and the Pth speaker layer.
[0121] In summary, for achieving spatial audio effects in three dimensions, this application embodiment selects n speaker layers from a multi-speaker scene, then selects at least four speakers for each speaker layer, and then calculates at least four audio gains using a double-balanced mapping function based on the relative coordinate relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source. The original audio is then mapped using the at least four audio gains. In other words, in this application embodiment, for a multi-speaker scene containing any number and location of speakers, only n speaker layers are selected, and the audio gains of at least four speakers in each speaker layer are calculated using a double-balanced mapping function. This application can achieve spatial audio effects in any multi-speaker scene without needing to match a preset multi-speaker placement format.
[0122] Furthermore, by introducing a height-layering and energy distribution mechanism, sound source mapping in true three-dimensional space is achieved, and because only n speaker layers are extracted for mapping, the amount of computation remains controllable.
[0123] Figure 8 is a schematic diagram illustrating the structure of a computer device according to an exemplary embodiment. The computer device 800 includes a Central Processing Unit (CPU) 801, a system memory 804 including Random Access Memory (RAM) 802 and Read-Only Memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the CPU 801. The computer device 800 also includes a basic input / output system (I / O system) 806 that facilitates information transfer between various devices within the computer device, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
[0124] The basic input / output system 806 includes a display 808 for displaying information and an input device 809 for user input, such as a mouse or keyboard. Both the display 808 and the input device 809 are connected to the central processing unit 801 via an input / output controller 810 connected to the system bus 805. The basic input / output system 806 may also include the input / output controller 810 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input / output controller 810 also provides output to a display screen, printer, or other types of output devices.
[0125] The mass storage device 807 is connected to the central processing unit 801 via a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer device-readable media provide non-volatile storage for the computer device 800. That is, the mass storage device 807 may include computer device-readable media (not shown), such as a hard disk or a compact disc read-only memory (CD-ROM) drive.
[0126] Without loss of generality, the computer device readable medium may include computer device storage media and communication media. Computer device storage media include volatile and non-volatile, removable and non-removable media implemented using any method or technology for storing information such as computer device readable instructions, data structures, program modules, or other data. Computer device storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM, digital video disc (DVD), or other optical storage, magnetic tape cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer device storage media are not limited to the above-mentioned types. The system memory 804 and mass storage device 807 described above can be collectively referred to as memory.
[0127] According to various embodiments of this disclosure, the computer device 800 can also be connected to a remote computer device on a network, such as the Internet. That is, the computer device 800 can be connected to a network 811 via a network interface unit 812 connected to the system bus 805, or it can use the network interface unit 812 to connect to other types of networks or remote computer device systems (not shown).
[0128] The memory also includes one or more programs stored in the memory. The central processing unit 801 executes the one or more programs to implement all or part of the steps of the above-mentioned audio processing method applied to a multi-speaker scenario.
[0129] Figure 9 shows a structural block diagram of a computer device 900 provided in an exemplary embodiment of this application. The computer device 900 may be a portable mobile terminal, such as a smartphone, tablet computer, MP3 player (Moving Picture Experts Group Audio Layer III), MP4 player (Moving Picture Experts Group Audio Layer IV), laptop computer, or desktop computer. The computer device 900 may also be referred to as a user device, portable terminal, laptop terminal, desktop terminal, or other names.
[0130] Typically, computer device 900 includes a processor 901 and a memory 902.
[0131] Processor 901 may include one or more processing cores, such as a quad-core processor, a nine-core processor, etc. Processor 901 may be implemented using at least one hardware form selected from DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, processor 901 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the screen. In some embodiments, processor 901 may also include an AI (Artificial Intelligence) processor, which is used to handle computational operations related to machine learning.
[0132] The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory 902 is used to store at least one instruction, which is executed by the processor 901 to implement the audio processing method based on a multi-speaker scenario provided in the method embodiments of this application.
[0133] In some embodiments, the computer device 900 may optionally include a peripheral device interface 903 and at least one peripheral device. The processor 901, memory 902, and peripheral device interface 903 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral device interface 903 via a bus, signal line, or circuit board. For example, the peripheral device may include at least one of the following: a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, and a power supply 908.
[0134] Peripheral device interface 903 can be used to connect at least one I / O (Input / Output) related peripheral device to processor 901 and memory 902. In some embodiments, processor 901, memory 902 and peripheral device interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 901, memory 902 and peripheral device interface 903 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.
[0135] The radio frequency (RF) circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The RF circuit 904 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 904 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. Optionally, the RF circuit 904 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, etc. The RF circuit 904 can communicate with other terminals through at least one wireless communication protocol. This wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and / or WiFi (Wireless Fidelity) networks. In some embodiments, the RF circuit 904 may also include circuitry related to NFC (Near Field Communication), which is not limited in this application.
[0136] Display screen 905 is used to display a UI (User Interface). This UI may include graphics, text, icons, videos, and any combination thereof. When display screen 905 is a touch display screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 901 for processing. In this case, display screen 905 can also be used to provide virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of computer device 900; in other embodiments, there may be at least two display screens 905, disposed on different surfaces of computer device 900 or in a folded design; in other embodiments, display screen 905 may be a flexible display screen, disposed on a curved or folded surface of computer device 900. Furthermore, display screen 905 may be configured as a non-rectangular irregular shape, i.e., a non-rectangular screen. Display screen 905 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
[0137] The camera assembly 906 is used to acquire images or videos. Optionally, the camera assembly 906 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal. In some embodiments, there are at least two rear-facing cameras, which are any one of a main camera, a depth-sensing camera, a wide-angle camera, and a telephoto camera, to achieve background blurring by fusion of the main camera and the depth-sensing camera, panoramic shooting by fusion of the main camera and the wide-angle camera, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash can be a single-color temperature flash or a dual-color temperature flash. A dual-color temperature flash refers to a combination of a warm-light flash and a cool-light flash, which can be used for light compensation at different color temperatures.
[0138] The audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, converting the sound waves into electrical signals that are input to the processor 901 for processing, or input to the radio frequency circuit 904 for voice communication. For stereo sound acquisition or noise reduction purposes, multiple microphones may be used, each located in a different part of the computer device 900. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into audible sound waves but also into inaudible sound waves for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.
[0139] Power supply 908 is used to supply power to the various components in computer device 900. Power supply 908 can be AC power, DC power, a disposable battery, or a rechargeable battery. When power supply 908 includes a rechargeable battery, the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery that is charged via a wired line, while a wireless rechargeable battery is a battery that is charged via a wireless coil. The rechargeable battery can also be used to support fast charging technology.
[0140] In some embodiments, the computer device 900 further includes one or more sensors 909. The one or more sensors 909 include, but are not limited to, an accelerometer 910, a gyroscope 911, a pressure sensor 912, an optical sensor 913, and a proximity sensor 914.
[0141] Accelerometer 910 can detect the magnitude of acceleration along the three coordinate axes of a coordinate system established by computer device 900. For example, accelerometer 910 can be used to detect the components of gravitational acceleration along the three coordinate axes. Processor 901 can control display screen 905 to display the user interface in either a landscape or portrait view based on the gravitational acceleration signal acquired by accelerometer 910. Accelerometer 910 can also be used for games or for acquiring user motion data.
[0142] The gyroscope 911 can detect the orientation and rotation angle of the computer device 900. The gyroscope 911, in conjunction with the accelerometer 910, can capture the user's 3D motions on the computer device 900. Based on the data collected by the gyroscope 911, the processor 901 can perform the following functions: motion sensing (e.g., changing the UI based on the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
[0143] The pressure sensor 912 can be disposed on the side bezel of the computer device 900 and / or on the lower layer of the display screen 905. When the pressure sensor 912 is disposed on the side bezel of the computer device 900, it can detect the user's grip signal on the computer device 900, and the processor 901 can perform left / right hand recognition or quick operation based on the grip signal collected by the pressure sensor 912. When the pressure sensor 912 is disposed on the lower layer of the display screen 905, the processor 901 can control the operable controls on the UI interface based on the user's pressure operation on the display screen 905. The operable controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.
[0144] An optical sensor 913 is used to collect ambient light intensity. In one embodiment, the processor 901 can control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 913. For example, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is decreased. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 based on the ambient light intensity collected by the optical sensor 913.
[0145] A proximity sensor 914, also known as a distance sensor, is typically located on the front panel of a computer device 900. The proximity sensor 914 is used to detect the distance between the user and the front of the computer device 900. In one embodiment, when the proximity sensor 914 detects that the distance between the user and the front of the computer device 900 is gradually decreasing, the processor 901 controls the display screen 905 to switch from a screen-on state to a screen-off state; when the proximity sensor 914 detects that the distance between the user and the front of the computer device 900 is gradually increasing, the processor 901 controls the display screen 905 to switch from a screen-off state to a screen-on state.
[0146] Those skilled in the art will understand that the structure shown in Figure 9 does not constitute a limitation on the computer device 900, which may include more or fewer components than shown, combine certain components, or use different component arrangements.
[0147] This application also provides a computer-readable storage medium storing at least one instruction, at least one program, code set, or instruction set, wherein the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the audio processing method based on a multi-speaker scenario provided in the above method embodiments.
[0148] This application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the audio processing method for a multi-speaker scenario provided in the above-described method embodiments.
[0149] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0150] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0151] The above description is merely an optional embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. An audio processing method for multi-speaker scenarios, characterized in that the method includes: selecting at least four speakers from the multi-speaker scenario; obtaining the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; calculating at least four audio gains using a dual-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and mapping the original audio using the at least four audio gains to obtain at least four target audios, which are used to simulate, during playback, the effect of the original audio being played by the virtual sound source.
2. The method according to claim 1, characterized in that the speaker positions of the at least four speakers include at least four first two-dimensional coordinates, and the sound source position of the virtual sound source includes a second two-dimensional coordinate; and calculating the at least four audio gains using the dual-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source, includes: calculating the first horizontal component and the first vertical component of the second two-dimensional coordinate in the coordinate space formed by the at least four first two-dimensional coordinates, wherein the first horizontal component and the first vertical component are each used to characterize an angle; and calculating the at least four audio gains from the first horizontal component and the first vertical component using a trigonometric dual-balanced function; wherein the trigonometric dual-balanced function includes a first cosine function and a first sine function, and the sum of the square of the function value of the first sine function and the square of the function value of the first cosine function is one.
3. The method according to claim 2, characterized in that calculating the at least four audio gains from the first horizontal component and the first vertical component using the trigonometric dual-balanced function includes: obtaining the first audio gain among the at least four audio gains based on the calculation result of the first function value and the second function value; obtaining the second audio gain among the at least four audio gains based on the calculation result of the third function value and the second function value; obtaining the third audio gain among the at least four audio gains based on the calculation result of the third function value and the fourth function value; and obtaining the fourth audio gain among the at least four audio gains based on the calculation result of the first function value and the fourth function value; wherein the first function value is the output of the first cosine function in response to the first horizontal component; the second function value is the output of the first cosine function in response to the first vertical component; the third function value is the output of the first sine function in response to the first horizontal component; and the fourth function value is the output of the first sine function in response to the first vertical component.
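As a non-limiting illustration of the four gains recited in claim 3, a minimal Python sketch follows. It assumes that the "calculation result" of two function values is their product; the claim does not name the operation, and the function name `dual_balanced_gains` is hypothetical. The inputs `h` and `v` stand for the first horizontal component and the first vertical component, treated as angles between zero and π / 2.

```python
import math


def dual_balanced_gains(h, v):
    """Return (gain_1, gain_2, gain_3, gain_4) for angle components h and v.

    Assumption: each gain is the product of the two function values named in claim 3.
    """
    f1 = math.cos(h)  # first function value: cosine of the horizontal component
    f2 = math.cos(v)  # second function value: cosine of the vertical component
    f3 = math.sin(h)  # third function value: sine of the horizontal component
    f4 = math.sin(v)  # fourth function value: sine of the vertical component
    return (f1 * f2, f3 * f2, f3 * f4, f1 * f4)
```

With the product assumption, the squared gains sum to (cos²h + sin²h)(cos²v + sin²v) = 1, so the total energy delivered to the four speakers is independent of the position of the virtual sound source.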
4. The method according to claim 3, characterized in that the first audio gain is the audio gain corresponding to the speaker on the first horizontal side of the smallest enclosing rectangle of the at least four speakers; the second audio gain is the audio gain corresponding to the speaker on the first vertical side of the smallest enclosing rectangle of the at least four speakers; the third audio gain is the audio gain corresponding to the speaker on the second horizontal side of the smallest enclosing rectangle of the at least four speakers; and the fourth audio gain is the audio gain corresponding to the speaker on the second vertical side of the smallest enclosing rectangle of the at least four speakers; wherein the vertical coordinate of the first horizontal side is less than the vertical coordinate of the second horizontal side, and the horizontal coordinate of the first vertical side is greater than the horizontal coordinate of the second vertical side.
5. The method according to claim 2, characterized in that calculating the first horizontal component and the first vertical component of the second two-dimensional coordinate in the coordinate space formed by the at least four first two-dimensional coordinates includes: based on the mapping relationship between the at least four first two-dimensional coordinates and the vertex coordinates of a first square, mapping the horizontal component of the second two-dimensional coordinate to the first horizontal component, and mapping the vertical component of the second two-dimensional coordinate to the first vertical component, wherein the side length of the first square is π / 2.
6. The method according to claim 5, characterized in that mapping the horizontal component of the second two-dimensional coordinate to the first horizontal component and mapping the vertical component of the second two-dimensional coordinate to the first vertical component, based on the mapping relationship between the at least four first two-dimensional coordinates and the vertex coordinates of the first square, includes: based on the mapping relationship between the at least four first two-dimensional coordinates and the vertex coordinates of a unit square, mapping the horizontal component of the second two-dimensional coordinate to a second horizontal component and mapping the vertical component of the second two-dimensional coordinate to a second vertical component, wherein the second horizontal component and the second vertical component each fall within the range of zero to one; and multiplying the second horizontal component and the second vertical component, respectively, by π / 2 to obtain the first horizontal component and the first vertical component.
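As a non-limiting illustration of the two-step mapping recited in claims 5 and 6, a minimal Python sketch follows. The claims do not specify the form of the mapping relationship to the unit square; this sketch assumes the simplest case, normalizing against the axis-aligned smallest enclosing rectangle of the speakers, and clamps the result to the range zero to one. The function name `to_angle_components` is hypothetical.

```python
import math


def to_angle_components(speakers, source):
    """Map a source (x, y) to (first horizontal, first vertical) components.

    Assumption: the speakers' axis-aligned enclosing rectangle is mapped linearly
    onto the unit square; other mappings (e.g. bilinear) are equally possible.
    """
    xs = [x for x, _ in speakers]
    ys = [y for _, y in speakers]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("speakers must span a non-degenerate rectangle")
    # Second horizontal / vertical components, each within [0, 1].
    u = min(1.0, max(0.0, (source[0] - x_min) / (x_max - x_min)))
    w = min(1.0, max(0.0, (source[1] - y_min) / (y_max - y_min)))
    # First components: the unit-square values scaled by pi/2.
    return u * math.pi / 2, w * math.pi / 2
```

The returned components lie in the range zero to π / 2, which is the interval over which both the cosine and the sine are non-negative, so gains derived from them are non-negative as well.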
7. The method according to any one of claims 1 to 6, characterized in that selecting at least four speakers from the multi-speaker scenario includes: selecting, based on a first selection condition, a first tuple from the multiple speakers in the multi-speaker scenario, the first tuple including the at least four speakers; wherein the first selection condition includes that the convex polygon formed by the speakers of the tuple contains the virtual sound source.
8. The method according to claim 7, characterized in that selecting the first tuple from the multiple speakers in the multi-speaker scenario based on the first selection condition includes: selecting, based on the first selection condition, k candidate tuples from the multiple speakers in the multi-speaker scenario, each candidate tuple including at least four candidate speakers, where k is a positive integer greater than one; and selecting, based on a second selection condition, the first tuple from the k candidate tuples; wherein the second selection condition includes minimizing the sum of the distances from the virtual sound source to each speaker in the candidate tuple.
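As a non-limiting illustration of the two selection conditions recited in claims 7 and 8, a minimal Python sketch follows. It assumes tuples of exactly four speakers, enumerates them exhaustively, and treats a source lying on a polygon edge as contained; the function names are hypothetical, and a practical implementation would likely prune the search.

```python
import math
from itertools import combinations


def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _ccw_order(points):
    """Sort points counter-clockwise around their centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


def _convex_and_contains(polygon, source):
    """First selection condition: strictly convex polygon that contains the source."""
    n = len(polygon)
    for i in range(n):
        a, b, c = polygon[i], polygon[(i + 1) % n], polygon[(i + 2) % n]
        if _cross(a, b, c) <= 0:      # not a strict left turn: not convex
            return False
        if _cross(a, b, source) < 0:  # source lies outside edge a -> b
            return False
    return True


def select_speakers(speakers, source, size=4):
    """Return the qualifying tuple with the smallest summed distance, or None."""
    best, best_cost = None, math.inf
    for combo in combinations(speakers, size):
        polygon = _ccw_order(combo)
        if not _convex_and_contains(polygon, source):
            continue
        # Second selection condition: minimize the sum of source-to-speaker distances.
        cost = sum(math.dist(source, p) for p in polygon)
        if cost < best_cost:
            best, best_cost = polygon, cost
    return best
```

The sketch merges the two conditions into a single pass: every tuple that passes the containment test is a candidate tuple, and the running minimum plays the role of the second selection condition.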
9. An audio processing method for multi-speaker scenarios, characterized in that the method includes: selecting n speaker layers from the L speaker layers contained in the multi-speaker scenario, each speaker layer grouping speakers of similar height, where n and L are both positive integers and n is less than L and greater than one; selecting at least four speakers in each of the n speaker layers; obtaining the speaker positions of the at least four speakers and the sound source position of a virtual sound source, the speaker positions and the sound source position being represented by two-dimensional coordinates; calculating at least four audio gains using a dual-balanced mapping function, based on the relative positional relationship between the speaker positions of the at least four speakers and the sound source position of the virtual sound source; and mapping the original audio based on the at least four audio gains, combined with the layer weight corresponding to the current speaker layer, to obtain at least four target audios; wherein all the target audios of the n speaker layers are used to simulate, during playback, the effect of the original audio being played by the virtual sound source.
10. The method according to claim 9, characterized in that selecting n speaker layers from the L speaker layers contained in the multi-speaker scenario includes: selecting the P-th speaker layer and the Q-th speaker layer from the L speaker layers contained in the multi-speaker scenario; wherein the P-th speaker layer is the first speaker layer located below the virtual sound source, and the Q-th speaker layer is the first speaker layer located above the virtual sound source.
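As a non-limiting illustration of the layer selection recited in claim 10, a minimal Python sketch follows. It assumes that each speaker layer is summarized by a single representative height and that a layer exactly at the height of the virtual sound source counts as the layer below; the function name `select_adjacent_layers` is hypothetical.

```python
def select_adjacent_layers(layer_heights, source_height):
    """Return (p, q): indices of the nearest layer below and above the source.

    Assumption: a layer at exactly the source height is treated as the lower layer.
    """
    below = [(h, i) for i, h in enumerate(layer_heights) if h <= source_height]
    above = [(h, i) for i, h in enumerate(layer_heights) if h > source_height]
    if not below or not above:
        raise ValueError("the virtual sound source must lie between two speaker layers")
    p = max(below)[1]  # highest layer that is not above the source
    q = min(above)[1]  # lowest layer that is above the source
    return p, q
```

For example, with layer heights of 0.0, 1.2 and 2.4 and a virtual sound source at height 1.5, the sketch returns the layers at 1.2 and 2.4 as the P-th and Q-th speaker layers.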
11. The method according to claim 10, characterized in that the method further includes: mapping a first ratio to the layer weight of the P-th speaker layer using a cosine function; wherein the first ratio is the ratio of a first difference to a second difference, the first difference is the difference between the height of the virtual sound source and the height of the P-th speaker layer, and the second difference is the difference between the height of the Q-th speaker layer and the height of the P-th speaker layer.
12. The method according to claim 10, characterized in that the method further includes: mapping a first ratio to the layer weight of the Q-th speaker layer using a sine function; wherein the first ratio is the ratio of a first difference to a second difference, the first difference is the difference between the height of the virtual sound source and the height of the P-th speaker layer, and the second difference is the difference between the height of the Q-th speaker layer and the height of the P-th speaker layer.
13. A computer device, characterized in that the computer device includes a processor and a memory, the memory storing a computer program, the computer program being loaded and executed by the processor to implement the audio processing method based on a multi-speaker scenario as described in any one of claims 1 to 8, or the audio processing method based on a multi-speaker scenario as described in any one of claims 9 to 12.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is loaded and executed by a processor to implement the audio processing method based on a multi-speaker scenario as described in any one of claims 1 to 8, or the audio processing method based on a multi-speaker scenario as described in any one of claims 9 to 12.