Sound signal processing method and apparatus, virtual reality device, and storage medium
By acquiring user gaze trajectory and sound source information, a personalized sound processing solution was designed, which solved the problem of low environmental sound matching in virtual reality systems and improved the realism and immersion of auditory effects.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOERTEK INC
- Filing Date
- 2023-01-16
- Publication Date
- 2026-06-23
AI Technical Summary
In existing virtual reality systems, the matching degree between ambient sound and user's subjective attention is low, resulting in insufficient realism of auditory effects and affecting the auditory immersion.
By acquiring the user's gaze trajectory and the location and acoustic classification information of the sound-emitting object within their field of vision, and processing the sound signal emitted by the sound-emitting object based on the attention information, a personalized sound processing solution is designed.
It improves the realism of the auditory effects of virtual reality devices, enabling users to access the ambient sounds they are interested in and enhancing the auditory immersion.
Figure CN116052652B_ABST
Description
Technical Field
[0001] This application relates to the field of virtual reality technology, and in particular to a sound signal processing method, apparatus, virtual reality device, and storage medium.
Background Technology
[0002] With the rapid development of high-performance computing and signal processing technologies, people have increasingly higher demands for voice and audio experiences, and immersive audio can meet these needs. For example, 4G / 5G voice communication, audio services, and the application of Virtual Reality (VR) have received increasing attention. An immersive virtual reality system not only needs stunning visual effects but also realistic auditory effects; the fusion of audiovisual elements can greatly enhance the virtual reality experience.
[0003] In complex sound environments, existing virtual reality systems typically perform active noise reduction or voice enhancement on certain frequency bands to mask or receive key information. This method merely hard-filters or enhances certain frequency bands, resulting in a poor match between the ambient sound captured by the virtual reality system and the user's subjective focus. Consequently, the auditory effect of the virtual reality device lacks realism, which impairs auditory immersion.
Summary of the Invention
[0004] Based on the aforementioned problems in the prior art, embodiments of this application provide a sound signal processing method, apparatus, virtual reality device, and storage medium to solve the problem that the sound effect is not realistic enough and affects the sense of auditory immersion due to the low matching degree between the ambient sound picked up by the prior art and the sound that the user is mainly concerned with.
[0005] The embodiments of this application adopt the following technical solutions:
[0006] In a first aspect, embodiments of this application provide a sound signal processing method, the method comprising:
[0007] Acquire the user's gaze trajectory and information on sound-emitting entities within the user's field of vision, wherein the sound-emitting entity information includes the position information and acoustic classification information of each sound-emitting entity;
[0008] Based on the gaze trajectory and the position information of each sound source, obtain the user's attention level information for each sound source;
[0009] The sound signals emitted by each sound source are processed based on the user's attention level to each sound source and the acoustic classification information of each sound source.
[0010] In a second aspect, embodiments of this application also provide a sound signal processing apparatus, the apparatus comprising:
[0011] An information acquisition unit is used to acquire the user's gaze trajectory and the information of the sound-emitting objects within the user's field of vision. The information of the sound-emitting objects includes the position information and acoustic classification information of each sound-emitting object.
[0012] The weighting calculation unit is used to obtain the user's attention information to each sound source based on the gaze trajectory and the position information of each sound source.
[0013] The sound processing unit is used to process the sound signals emitted by each sound source based on the user's attention information to each sound source and the acoustic classification information of each sound source.
[0014] In a third aspect, embodiments of this application also provide a virtual reality device, including:
[0015] A processor; and
[0016] A memory configured to store computer-executable instructions, which, when executed, cause the processor to perform any of the aforementioned sound signal processing methods.
[0017] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing one or more programs, which, when executed by a virtual reality device including multiple applications, cause the virtual reality device to perform any of the aforementioned sound signal processing methods.
[0018] The above-mentioned at least one technical solution adopted in the embodiments of this application can achieve the following beneficial effects: The embodiments of this application first obtain the user's gaze trajectory and the position information and acoustic classification information of the sound-emitting body within the user's field of vision. Since the gaze trajectory corresponds to the user's single observation behavior, the movement of the user's gaze point can be obtained through the gaze trajectory. The movement of the gaze point is related to the user's attention to the sound-emitting body. Therefore, the embodiments of this application can obtain the user's attention information to each sound-emitting body based on the gaze trajectory and the position information of each sound-emitting body. Then, the sound signal emitted by each sound-emitting body can be processed accordingly based on the user's attention information to each sound-emitting body and the acoustic classification information of each sound-emitting body.
[0019] The embodiments of this application can design the sound processing scheme of the sound-emitting body according to the user's attention to the sound-emitting body, so that the processed environmental sound meets the user's real needs, the user can obtain the environmental sound they are interested in, and the auditory effect provided by the virtual reality device has a high degree of realism.
Attached Figure Description
[0020] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0021] Figure 1 is a flowchart illustrating a sound signal processing method according to an embodiment of this application;
[0022] Figure 2 is a schematic diagram of the distribution of sound-emitting bodies within a user's field of vision in an embodiment of this application;
[0023] Figure 3 is a focal plane exploded view of a sound-emitting body in an embodiment of this application;
[0024] Figure 4 is a schematic diagram illustrating a change in the gaze target in an embodiment of this application;
[0025] Figure 5 is a schematic diagram illustrating the relative distance change trend and relative distance change rate between the gaze point trajectory and the sound-emitting body in an embodiment of this application;
[0026] Figure 6 is a schematic diagram of the audio segment sorting in the embodiments of this application;
[0027] Figure 7 is a schematic diagram of the structure of a sound signal processing device according to an embodiment of this application;
[0028] Figure 8 is a schematic diagram of the structure of a virtual reality device according to an embodiment of this application.
Detailed Implementation
[0029] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0030] The execution subject of the sound signal processing method provided in this application embodiment can be a virtual reality device, a server, or the cloud; the execution subject of the sound signal processing method in this application embodiment can be software or hardware.
[0031] Please refer to Figure 1. Taking a virtual reality device as the executing entity as an example, Figure 1 illustrates the sound signal processing method provided by this application. The virtual reality device in this application can be a head-mounted virtual reality device, such as a virtual reality helmet or virtual reality glasses. As shown in Figure 1, an embodiment of this application provides a sound signal processing method that may include the following steps S110 to S130:
[0032] Step S110: Obtain the user's gaze trajectory and the information of the sound-emitting objects within the user's field of vision. The information of the sound-emitting objects includes the position information and acoustic classification information of each sound-emitting object.
[0033] In a real environment, when a user is interested in the sound of a certain sound source, the user pays attention to that sound source to a certain extent. This application embodiment determines the user's attention level to the sound source by using the user's gaze data, and designs a sound processing scheme for the sound source based on the user's attention level to make the processed environmental sound meet the user's real needs.
[0034] Gaze data refers to the points of gaze in the environment during the user's observation of the environment, as well as the location of each gaze point, the gaze start time, the gaze duration, and other data. In this application embodiment, gaze data can be obtained through eye gaze sensors, eyelid sensors, pupil sensors, or eye muscle sensors, or through a dedicated eye-tracking camera.
[0035] A gaze trajectory is an ordered set of points formed, in temporal sequence, by the gaze points located between two target gaze points. The target gaze points can be determined from the gaze data based on the minimum dwell time required for the human eye to observe a certain position. For example, gaze points with a gaze duration of 40-200 milliseconds can be considered target gaze points. It is understood that the gaze data in this embodiment may include one or more gaze trajectories. For example, when the gaze data includes N gaze points, and the gaze duration of three of them is greater than the gaze duration threshold T, the gaze data includes two gaze trajectories: the first and second target gaze points form the start and end points of the first gaze trajectory, and the second and third target gaze points form the start and end points of the second gaze trajectory. Optionally, the gaze duration threshold T = 200 milliseconds. Of course, the gaze duration threshold T can also take other values, which can be set by those skilled in the art based on experience.
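The segmentation described in paragraph [0035] can be sketched as follows. This is a minimal illustration only, not the implementation of this application: the `GazePoint` record, the function name `split_gaze_trajectories`, and the assumption that the gaze points arrive already sorted by time are all hypothetical. The default threshold of 200 milliseconds is the optional value of T mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GazePoint:
    # Hypothetical record: position in a shared coordinate system plus timing.
    x: float
    y: float
    z: float
    start_ms: float
    duration_ms: float


def split_gaze_trajectories(points: List[GazePoint],
                            threshold_ms: float = 200.0) -> List[List[GazePoint]]:
    """Split time-ordered gaze points into gaze trajectories.

    A point whose gaze duration exceeds threshold_ms is treated as a target
    gaze point. Each pair of consecutive target gaze points bounds one
    trajectory, so k target gaze points yield k - 1 trajectories.
    """
    target_idx = [i for i, p in enumerate(points) if p.duration_ms > threshold_ms]
    return [points[a:b + 1] for a, b in zip(target_idx, target_idx[1:])]


# Usage: three target gaze points (durations above 200 ms) give two trajectories.
durations = [250, 50, 300, 60, 220]
points = [GazePoint(float(i), 0.0, 1.0, 0.0, float(d)) for i, d in enumerate(durations)]
trajectories = split_gaze_trajectories(points)
```

Note that the middle target gaze point is shared: it ends the first trajectory and starts the second, matching the example in the paragraph above.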
[0036] Furthermore, embodiments of this application can utilize the image acquisition device of a virtual reality device to acquire information about sound-emitting entities within the user's field of vision. For example, environmental images can be acquired using the binocular camera of a virtual reality device. By performing target detection on the environmental images, various targets within the field of vision, as well as the target type and target location of each target, can be detected. Based on the target type, the sound-emitting entities within the field of vision can be determined.
[0037] Step S120: Obtain the user's attention information for each sound source based on the gaze trajectory and the position information of each sound source.
[0038] The gaze trajectory in this embodiment corresponds to a single observation behavior of the user. For example, as shown in Figure 4, a user's observation behavior of switching from observing a male speaker to observing a pet dog generates one gaze trajectory. This gaze trajectory reveals the movement of the gaze point, such as the change in the relative distance between the gaze point and any sound-emitting object. The movement of the gaze point is correlated with the user's level of attention to the sound-emitting object. Specifically, the closer the gaze point moves toward a sound-emitting object along the direction of the trajectory, the higher the user's level of attention to that sound-emitting object during that observation; conversely, the farther the gaze point moves from a sound-emitting object along the direction of the trajectory, the lower the user's level of attention to that sound-emitting object during that observation. This premise underlies the sound signal processing in the embodiments of this application.
[0039] Therefore, based on this premise, the embodiments of this application can determine the user's attention level information for each sound-emitting entity according to the user's gaze trajectory and the position information of each sound-emitting entity. Optionally, the attention level information includes attention level weights, which quantify the degree of user attention to each sound-emitting entity.
[0040] Step S130: Process the sound signal emitted by each sound source based on the user's attention information to each sound source and the acoustic classification information of each sound source.
[0041] As mentioned above, when identifying a sound source within the user's field of vision, the type of each sound source can be obtained. The acoustic classification information in this application embodiment includes audio frequency segment information, noise or musical sound, etc. This application embodiment can obtain the audio frequency segment corresponding to each sound source based on prior knowledge. For example, based on prior knowledge, the audio frequency segment of a male voice, the audio frequency segment of a printer, the audio frequency segment of a pet dog, etc. can be obtained.
[0042] As can be seen from the sound signal processing method shown in Figure 1, this embodiment first obtains the user's gaze trajectory, the position information of the sound-emitting body within the user's field of vision, and the acoustic classification information of the sound-emitting body. Since the gaze trajectory corresponds to the user's single observation behavior, the movement of the user's gaze point can be obtained through the gaze trajectory. The movement of the gaze point is related to the user's attention to the sound-emitting body. Therefore, this embodiment can obtain the user's attention information to each sound-emitting body based on the gaze trajectory and the position information of each sound-emitting body. Then, the sound signal emitted by each sound-emitting body can be processed accordingly based on the user's attention information to each sound-emitting body and the acoustic classification information of each sound-emitting body.
[0043] This embodiment can design the sound processing scheme of the sound-emitting object according to the user's attention to the sound-emitting object, so that the processed environmental sound meets the user's real needs, the user can obtain the environmental sound they are interested in, and the auditory effect provided by the virtual reality device has a high degree of realism.
[0044] In some embodiments of this application, video data of the user's two eyes can be collected, and eye-tracking technology can be used to process the video frames in the video data to obtain gaze data. Based on the gaze data, the gaze trajectory corresponding to a single observation behavior of the user can be obtained. The position of each point in the gaze trajectory can be represented by (x, y, z) coordinates in the world coordinate system. The x and y coordinates can correspond to latitude and longitude information. The depth of the gaze point is converted into the focal length f_focus of the RGB camera, and the focal length f_focus is used as the z-coordinate of the gaze point, where the RGB camera is the camera used to acquire images of the sound environment.
[0045] Then, the RGB camera is used to focus on and capture images of the sound-emitting objects within the field of view. Based on the captured images, the focal length f_other of each sound-emitting object can be obtained, and the focal length f_other of the RGB camera can then be used as the z-coordinate of the sound-emitting body. The x and y coordinates of the sound-emitting body can be obtained from the calibration parameters of the RGB camera.
[0046] As shown in Figure 2, assume there are four sound-emitting objects within the user's field of vision, namely the printer, the male speaker, the hair dryer, and the pet dog in Figure 2. Environmental images are captured using the image acquisition equipment of the virtual reality device, such as a binocular or multi-view RGB camera. Based on target detection algorithms, the four sound-emitting entities are detected, and the pixel position of each entity is obtained. According to the calibration parameters, the x and y coordinates of each entity in the world coordinate system are obtained. Then, the RGB camera is controlled to focus on and capture images of each sound-emitting entity within the field of view sequentially. As shown in Figure 3, by performing focal plane decomposition on the captured images, the focal length f_other corresponding to each sound-emitting object can be obtained. The focal length f_other corresponding to a sound-emitting body serves as the z-coordinate of that sound-emitting body, thus providing the complete coordinate position of each sound-emitting body in the world coordinate system.
[0047] Furthermore, the target detection algorithm in this embodiment can also detect the type of each sound source, query the acoustic database according to the type of the sound source, obtain the acoustic characteristics of each sound source, and obtain the acoustic classification information of the sound source based on the acoustic characteristics.
[0048] It should be noted that in this embodiment, the coordinates of the gaze point and each sound-emitting object in the z-direction are obtained based on the camera's focal length. In other embodiments, the coordinates in the z-direction can also be obtained based on the distance measurement module of the virtual reality device, or based on binocular vision technology. This application embodiment does not limit the calculation method for the position of each point on the gaze trajectory and the position of the sound-emitting object. For ease of calculation, the position of the gaze point and the position of the sound-emitting object need to be represented using the same coordinate system.
[0049] In some embodiments of this application, user attention information for each sound source is obtained based on the gaze trajectory and the position information of each sound source, including:
[0050] Based on the gaze trajectory and the position information of each sound source, determine the trend of the relative distance between the gaze point and each sound source, as well as the relative distance between the gaze point and each sound source at the end of the gaze trajectory.
[0051] Based on the trend of the relative distance between the gaze point and each sound source, and the relative distance between the gaze point and each sound source at the end of the gaze trajectory, the attention weight of each sound source is obtained.
[0052] In most real-world scenarios, the final gaze point of a user's single observation is located at the location of the sound source the user is focusing on. In this case, the relative distance between the endpoint of the gaze trajectory and each sound source can distinguish between those the user is focusing on and those they are not. However, in other real-world scenarios, the final gaze point of a user's single observation may not be the location of any single sound source. For example, in some situations, it may be inconvenient for the user to directly observe the sound source. In this case, the user's gaze point may remain near the target sound source. In such cases, it is also necessary to distinguish between those the user is focusing on and those they are not focusing on based on the changing trend of the relative distance between the gaze point and the sound source.
[0053] In some possible implementations of this embodiment, the attention weight of each sound source can be obtained based on the preset relationship between relative distance and attention weight, as well as the preset relationship between the trend of relative distance change and attention weight.
[0054] The preset relationship between relative distance and attention weight in this embodiment includes: a relative distance reference value and its corresponding initial weight, an update value for the attention weight, and a relative distance change step size. If the relative distance between the gaze point and the sound source at the end of the gaze trajectory is less than the relative distance reference value, the attention weight is increased from the initial weight by one update value for each step by which the relative distance falls below the reference value. If the relative distance between the gaze point and the sound source at the end of the gaze trajectory is greater than the relative distance reference value, the attention weight is decreased from the initial weight by one update value for each step by which the relative distance exceeds the reference value. For example, in Figure 5, the relative distance between sound source B and L2 is the smallest, so the attention weight of sound source B is the largest. The relative distance between sound source C and L2 is the largest, so the attention weight of sound source C is the smallest.
[0055] The relative distance reference value indicates the critical value of the relative distance between the user's gaze point and the sound source when the user focuses on the sound source at the end of the gaze trajectory. When the relative distance is greater than the critical value, the user does not focus on the sound source, and when the relative distance is less than the critical value, the user focuses on the sound source. Optionally, the initial weight is 0.
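The preset relationship between relative distance and attention weight described in paragraphs [0054] and [0055] can be sketched as a simple step function. This is an illustrative sketch under assumptions: the function name `distance_weight`, the default reference value (1.0 m), update value (0.1), and step size (0.25 m) are hypothetical, and counting only whole steps and clipping the result to [-1, 1] are choices made here, not stated in this application. The initial weight of 0 is the optional value mentioned above.

```python
import math


def distance_weight(d_end: float,
                    d_ref: float = 1.0,    # hypothetical reference distance (m)
                    w0: float = 0.0,       # initial weight (optionally 0)
                    delta: float = 0.1,    # hypothetical update value per step
                    step: float = 0.25) -> float:  # hypothetical step size (m)
    """Attention weight from the gaze-to-source distance at the trajectory end.

    Each whole step below the reference distance adds delta to the initial
    weight; each whole step above it subtracts delta.
    """
    steps = math.floor(abs(d_ref - d_end) / step)
    sign = 1.0 if d_end < d_ref else -1.0
    weight = w0 + sign * delta * steps
    # Assumption: keep the weight inside [-1, 1].
    return max(-1.0, min(1.0, weight))


# Usage: a source 0.5 m from the final gaze point versus one 1.5 m away.
near = distance_weight(0.5)   # two steps inside the reference distance
far = distance_weight(1.5)    # two steps outside the reference distance
```

With these assumed parameters, a source inside the reference distance gets a positive weight and one outside gets a negative weight, which matches the role of the reference value as the critical distance between focused and unfocused sources.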
[0056] The relative distance change trend in this embodiment includes both decreasing and increasing relative distance. When the relative distance decreases, the gaze point is moving closer to the sound-emitting body, and the sound-emitting body receives higher attention, as with sound-emitting bodies A and B in Figure 5. When the relative distance increases, the gaze point is moving farther from the sound-emitting body, and the sound-emitting body receives lower attention, as with sound-emitting body C in Figure 5. However, the gaze point approaches sound-emitting bodies A and B in Figure 5 to different degrees, so the degree of user attention to sound-emitting bodies A and B should be further differentiated.
[0057] Based on this, the pre-defined relationship between the relative distance change trend and the attention weight can include: the greater the degree to which the relative distance decreases, the higher the attention weight; the smaller the degree to which the relative distance decreases, the lower the attention weight; the greater the degree to which the relative distance increases, the lower the attention weight; the smaller the degree to which the relative distance increases, the higher the attention weight.
[0058] For example, a step size for the degree of distance change and an update value for the attention weight can be set. The attention weight of a sound source that the user is focusing on ranges over, for example, (0, 1], and the attention weight of a sound source that the user is not focusing on ranges over, for example, [-1, 0]. Assume that the attention weight is 0.1 when the relative distance decreases by 0.5 meters, and that the attention weight increases by 0.05 for every additional 0.5 meters of decrease; then the attention weight is 0.25 when the relative distance decreases by 2 meters. That is, in Figure 5, the attention weight of sound source A should be greater than that of sound source B. Of course, other methods can also be used in the embodiments of this application to set the preset relationship between the relative distance change trend and the attention weight.
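The numerical example in paragraph [0058] can be reproduced with a short sketch. The function name `trend_weight` is hypothetical. Two behaviors are assumptions made here rather than statements of this application: an increase in relative distance is mapped to the mirrored negative weight, and a change smaller than one step is mapped to 0.

```python
import math


def trend_weight(change_m: float,
                 base: float = 0.1,    # weight at the first 0.5 m step (from the example)
                 delta: float = 0.05,  # added per further 0.5 m step (from the example)
                 step: float = 0.5) -> float:
    """Attention weight from the change in gaze-to-source distance.

    change_m is (distance at trajectory start) - (distance at trajectory end),
    so a positive value means the gaze point moved toward the source.
    """
    # Small epsilon guards against floating-point error at exact multiples.
    steps = math.floor(abs(change_m) / step + 1e-9)
    if steps == 0:
        return 0.0  # assumption: changes below one step carry no trend weight
    weight = min(1.0, base + delta * (steps - 1))
    # Assumption: moving away mirrors moving closer with a negative sign.
    return weight if change_m > 0 else -weight


# Usage: the values from the worked example in the text.
w_half_meter = trend_weight(0.5)   # first step
w_two_meters = trend_weight(2.0)   # three further steps beyond the first
```

The arithmetic matches the text: a 2 meter decrease spans four 0.5 meter steps, giving 0.1 + 3 x 0.05 = 0.25.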
[0059] Thus, based on the above-mentioned preset relationship, each sound source corresponds to two attention weights. The current attention weight of each sound source is obtained by fusing the two attention weights. Here, a weighted fusion strategy can be used for attention fusion, or other fusion strategies can be used. This application embodiment does not limit this.
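One possible weighted fusion of the two attention weights is a convex combination. This is only a sketch: the function name `fuse_weights` and the mixing coefficient `alpha` (default 0.5) are hypothetical, and this application leaves the fusion strategy open.

```python
def fuse_weights(w_distance: float, w_trend: float, alpha: float = 0.5) -> float:
    """Fuse the distance-based and trend-based attention weights.

    alpha is a hypothetical mixing coefficient in [0, 1]; alpha = 1 uses only
    the distance-based weight and alpha = 0 uses only the trend-based weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * w_distance + (1.0 - alpha) * w_trend


# Usage: equal emphasis on both weights.
fused = fuse_weights(0.2, 0.25)
```

Because both inputs lie in [-1, 1] and the coefficients sum to 1, the fused weight stays in [-1, 1] as well.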
[0060] It should be noted that this embodiment can determine the trend of the relative distance between the gaze point and each sound source by observing the change in the relative distance between the gaze point and the sound source at the beginning and end of the gaze trajectory. For example, as shown in Figure 5, the difference between the relative distance between L1 and the sound source and the relative distance between L2 and the sound source is used as the change in the relative distance between the gaze point and each sound source. In other embodiments, the trend of the relative distance between the gaze point and each sound source can also be determined by the change in the relative distance between the gaze point and the sound source at some or all positions in the gaze trajectory, that is, the cumulative value of the differences corresponding to adjacent positions in the gaze trajectory is used as the change in the relative distance between the gaze point and each sound source.
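Both ways of measuring the relative distance change in paragraph [0060] can be sketched with Euclidean distances in a shared coordinate system. The function names and the use of Euclidean distance are assumptions for illustration; this application does not fix the distance measure.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def endpoint_change(trajectory: Sequence[Point], source: Point) -> float:
    """Distance at the trajectory start (L1) minus distance at its end (L2).

    A positive result means the gaze point ended closer to the source.
    """
    return math.dist(trajectory[0], source) - math.dist(trajectory[-1], source)


def cumulative_change(trajectory: Sequence[Point], source: Point) -> float:
    """Sum of the distance differences between adjacent trajectory positions."""
    distances = [math.dist(p, source) for p in trajectory]
    return sum(a - b for a, b in zip(distances, distances[1:]))


# Usage: a gaze point moving along the x-axis toward a source at x = 3.
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
source = (3.0, 0.0, 0.0)
by_endpoints = endpoint_change(trajectory, source)
by_accumulation = cumulative_change(trajectory, source)
```

When every position in the trajectory is used, the adjacent differences telescope, so the cumulative value equals the endpoint difference up to rounding. The two measures differ only when the selected positions exclude the start or the end of the trajectory.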
[0061] In some embodiments of this application, the attention information includes attention weights, and the acoustic classification information includes audio frequency bands. Step S130 above processes the sound signal emitted by each sound source based on the user's attention information for each sound source and the acoustic classification information of each sound source, including:
[0062] Obtain the audio frequency segment corresponding to each sound source;
[0063] Based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source, the final attention weight corresponding to each audio frequency segment is obtained.
[0064] The audio signals of each audio segment are processed according to the final attention weight corresponding to each audio segment.
[0065] As mentioned earlier, the sound frequency band of each sound source can be obtained from the acoustic database according to the type of sound source. Then, it is determined whether there are overlapping frequency bands for all sound sources. If there are overlapping frequency bands, the attention weight of the overlapping frequency band is set as the first attention weight, which is the maximum value among the attention weights corresponding to the overlapping area.
[0066] As shown in Figure 6, the sound frequency bands of all sound-emitting bodies are sorted. In Figure 6, the height of each frequency band frame indicates the attention weight of the corresponding sound source. It can be seen from Figure 6 that the sound frequency bands of sound emitters B and C have a partially overlapping area, and the sound frequency band of sound emitter E lies within the sound frequency band of sound emitter D. The attention weight corresponding to the overlapping area of sound emitters B and C is the attention weight of sound emitter C, the attention weight corresponding to the non-overlapping area of sound emitter B is the attention weight of sound emitter B, and the attention weight corresponding to the sound frequency band of sound emitter E is the attention weight of sound emitter D.
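The overlap rule in paragraphs [0065] and [0066] (an overlapping region takes the maximum attention weight among the sources covering it) can be sketched by cutting the frequency axis at every band edge. The function name `final_band_weights` and all frequency and weight values below are hypothetical; they only mimic the layout of emitters B, C, D, and E described for Figure 6.

```python
from typing import List, Tuple

Band = Tuple[float, float, float]  # (low_hz, high_hz, attention_weight)


def final_band_weights(bands: List[Band]) -> List[Band]:
    """Resolve overlapping source bands into non-overlapping segments.

    The frequency axis is cut at every band edge; each resulting segment
    takes the maximum weight among the source bands that cover it.
    Segments covered by no source are omitted.
    """
    edges = sorted({edge for low, high, _ in bands for edge in (low, high)})
    segments = []
    for low, high in zip(edges, edges[1:]):
        covering = [w for b_low, b_high, w in bands if b_low <= low and high <= b_high]
        if covering:
            segments.append((low, high, max(covering)))
    return segments


# Usage with hypothetical values: B and C partially overlap, E lies inside D.
bands = [
    (100.0, 400.0, 0.3),    # emitter B
    (300.0, 600.0, 0.6),    # emitter C (higher weight than B)
    (1000.0, 4000.0, 0.5),  # emitter D
    (2000.0, 3000.0, 0.2),  # emitter E (lower weight than D)
]
segments = final_band_weights(bands)
```

In this example the B/C overlap (300 to 400 Hz) takes C's weight, the rest of B keeps B's weight, and E's band takes D's weight, as in the description of Figure 6.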
[0067] In some possible implementations of this embodiment, the audio signal of the corresponding frequency band is processed according to the final attention weight corresponding to each audio frequency band, including:
[0068] The gain coefficient of each audio frequency segment is obtained based on the final attention weight corresponding to each audio frequency segment;
[0069] Filter coefficients are generated based on the gain coefficient of each audio band;
[0070] The sound signal of each sound source is subjected to frequency band gain processing using a filter with the aforementioned filter coefficients.
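The three steps above can be sketched as follows. This application does not specify how a weight maps to a gain or what filter structure is used, so everything here is an assumption: the decibel mapping with a hypothetical `max_gain_db` of 12 dB, the representation of the filter as one gain per FFT bin, and the sample rate and FFT size.

```python
from typing import List, Tuple

Band = Tuple[float, float, float]  # (low_hz, high_hz, final_attention_weight)


def weight_to_gain(weight: float, max_gain_db: float = 12.0) -> float:
    """Map a weight in [-1, 1] to a linear gain (assumed dB-linear mapping).

    weight = 0 gives unity gain, positive weights boost, negative weights cut.
    """
    return 10.0 ** (weight * max_gain_db / 20.0)


def bin_gains(band_weights: List[Band],
              sample_rate: int = 16000,
              n_fft: int = 512,
              max_gain_db: float = 12.0) -> List[float]:
    """Build one gain per real-FFT bin, used here as the filter coefficients.

    Bins outside every band keep unity gain.
    """
    n_bins = n_fft // 2 + 1
    gains = [1.0] * n_bins
    for k in range(n_bins):
        freq = k * sample_rate / n_fft
        for low, high, weight in band_weights:
            if low <= freq < high:
                gains[k] = weight_to_gain(weight, max_gain_db)
                break
    return gains


# Usage: boost a hypothetical 1-2 kHz band whose final weight is 0.5.
gains = bin_gains([(1000.0, 2000.0, 0.5)])
```

Multiplying the spectrum of each frame by these per-bin gains would apply the band gain; a time-domain filter bank or equalizer designed from the same gains is an equally plausible reading of the text.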
[0071] In this implementation scheme, after generating a filter based on the gain coefficient of each audio segment, the gain difference between the gain coefficient of each audio segment at the current time and the gain coefficient at the previous time can be obtained. Based on the gain difference, it is determined whether the filter coefficient update condition and fade-in / fade-out condition are met. For example, when the gain difference of one or more audio segments is greater than a first threshold, it is determined that the filter coefficient update condition is met. When the gain difference of one or more audio segments is greater than a second threshold, it is determined that the fade-in / fade-out condition is met. Here, the first threshold is less than the second threshold.
[0072] If the filter coefficient update condition and fade-in / fade-out condition are met, the filter coefficients are updated based on the gain coefficient of each audio frequency band at the current moment, resulting in updated filter coefficients. The updated filter coefficients are then used to perform frequency band gain processing on the audio signal of each sound source, and fade-in / fade-out processing is performed on the audio frequency bands that meet the fade-in / fade-out conditions. For example, the fade-in / fade-out processing changes the gain coefficient within a set time period.
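The two-threshold check and the fade-in / fade-out processing in paragraphs [0071] and [0072] can be sketched as below. The threshold values, the function names, and the linear ramp are assumptions; the text only requires that the first threshold be less than the second and that the gain change over a set time period.

```python
from typing import List, Tuple


def check_conditions(prev_gains: List[float],
                     curr_gains: List[float],
                     update_threshold: float = 0.1,  # hypothetical first threshold
                     fade_threshold: float = 0.5     # hypothetical second threshold
                     ) -> Tuple[bool, List[int]]:
    """Return (update_needed, indices of bands that need a fade)."""
    assert update_threshold < fade_threshold, "first threshold must be smaller"
    diffs = [abs(c - p) for p, c in zip(prev_gains, curr_gains)]
    update_needed = any(d > update_threshold for d in diffs)
    fade_bands = [i for i, d in enumerate(diffs) if d > fade_threshold]
    return update_needed, fade_bands


def fade_gain(prev: float, curr: float, n_steps: int) -> List[float]:
    """Linear ramp from prev toward curr over n_steps updates (assumed shape)."""
    return [prev + (curr - prev) * (i + 1) / n_steps for i in range(n_steps)]


# Usage: band 0 barely changes, band 1 triggers an update, band 2 also needs a fade.
update_needed, fade_bands = check_conditions([1.0, 1.0, 1.0], [1.05, 1.2, 2.0])
ramp = fade_gain(1.0, 2.0, 4)
```

Because the first threshold is smaller, every band that needs a fade also triggers a coefficient update, while small changes are ignored entirely.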
[0073] In some other possible implementations of this embodiment, the historical attention weight of each sound source is also obtained. In this case, the final attention weight corresponding to each audio frequency segment is obtained based on the audio frequency segment corresponding to each sound source, the attention weight of each sound source, and the historical attention weight.
[0074] The historical attention weight refers to the attention weight obtained through historical gaze trajectories, which are gaze trajectories located a certain time before the current gaze trajectory. Research has found that within a certain period of time (e.g., 2 minutes or more), the more times a user pays attention to a certain sound source and the longer the attention time, the higher the user's attention level to that sound source.
[0075] Therefore, at the current moment, if the final attention weight of a certain audio segment is low (for example, less than 0), while the historical attention weight in the previous period was high (for example, greater than 0.5), and the number of times the historical attention weight exceeded 0.5 is greater than a set count (for example, 3 times or more), then the final attention weight of the audio segment at the current moment can be set according to the historical attention weights greater than 0.5.
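The historical correction in paragraph [0075] can be sketched as follows. The thresholds (0, 0.5, and 3 occurrences) come from the examples in the text, but the function name `apply_history`, treating "3 times or more" as a count of at least 3, and replacing the current weight with the mean of the high historical weights are assumptions; the text only says the weight is set "according to" those historical weights.

```python
from typing import List


def apply_history(current: float,
                  history: List[float],
                  low: float = 0.0,
                  high: float = 0.5,
                  min_count: int = 3) -> float:
    """Raise a low current weight when recent history shows strong attention.

    history holds the attention weights of the same audio segment obtained
    from earlier gaze trajectories within the look-back period.
    """
    strong = [w for w in history if w > high]
    if current < low and len(strong) >= min_count:
        # Assumption: use the mean of the high historical weights.
        return sum(strong) / len(strong)
    return current


# Usage: a segment currently at -0.2 that was strongly attended three times.
corrected = apply_history(-0.2, [0.6, 0.7, 0.8, 0.1])
```

If fewer than three historical weights exceed 0.5, or the current weight is not below 0, the current weight is returned unchanged.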
[0076] Therefore, the sound signal processing method provided in the above embodiments of this application offers a method for predicting sound sources of interest based on gaze trajectory, achieving a sound processing method that matches user attention and overcoming the shortcomings of traditional acoustic algorithms. The sound signal processing method provided in the embodiments of this application is particularly effective in complex sound environments, enabling users to achieve a better auditory immersion and improving the user experience of virtual reality devices.
[0077] This application also provides a sound signal processing device 700. Figure 7 shows a schematic representation of a sound signal processing device according to an embodiment of this application. The device 700 includes: an information acquisition unit 710, a weight calculation unit 720, and a sound processing unit 730, wherein:
[0078] The information acquisition unit 710 is used to acquire the user's gaze trajectory and the sound-emitting object information within the user's field of vision. The sound-emitting object information includes the position information and acoustic classification information of each sound-emitting object.
[0079] The weight calculation unit 720 is used to obtain the user's attention information to each sound source based on the gaze trajectory and the position information of each sound source.
[0080] The sound processing unit 730 is used to process the sound signals emitted by each sound source based on the user's attention information to each sound source and the acoustic classification information of each sound source.
[0081] In one embodiment of this application, the weight calculation unit 720 is used to determine the relative distance change trend between the gaze point and each sound source, and the relative distance between the gaze point and each sound source at the end of the gaze trajectory, based on the gaze trajectory and the position information of each sound source; and to obtain the attention weight of each sound source based on the relative distance change trend between the gaze point and each sound source, and the relative distance between the gaze point and each sound source at the end of the gaze trajectory.
[0082] In one embodiment of this application, the weight calculation unit 720 is specifically used to obtain the attention weight of each sound source based on the preset relationship between relative distance and attention weight, and the preset relationship between the relative distance change trend and attention weight.
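As a rough illustration of how the weight calculation unit 720 might combine the two preset relationships, the Python sketch below derives an attention weight from the relative distance at the end of the gaze trajectory and from the trend of that distance. The concrete relationships used here (a linear fall-off with distance and a fixed bonus or penalty for an approaching or receding gaze point), the function name `attention_weight`, and the parameters `max_dist` and `trend_bonus` are hypothetical; this section of the application does not give the actual preset relationships.

```python
import math


def attention_weight(trajectory, source_pos, max_dist=1.0, trend_bonus=0.2):
    """Attention weight of one sound source in the range [-1, 1].

    trajectory: gaze points in time order, each a coordinate tuple.
    source_pos: position of the sound source in the same coordinates.
    """
    distances = [math.dist(point, source_pos) for point in trajectory]
    end_distance = distances[-1]
    trend = distances[-1] - distances[0]  # negative: gaze is approaching

    # Assumed distance relationship: closer end point gives a larger weight.
    base = max(0.0, 1.0 - end_distance / max_dist)
    # Assumed trend relationship: approaching adds, receding subtracts.
    if trend < 0:
        adjustment = trend_bonus
    elif trend > 0:
        adjustment = -trend_bonus
    else:
        adjustment = 0.0
    return max(-1.0, min(1.0, base + adjustment))


approaching = attention_weight([(0.8, 0.0), (0.4, 0.0), (0.1, 0.0)], (0.0, 0.0))
receding = attention_weight([(0.1, 0.0), (0.4, 0.0), (0.8, 0.0)], (0.0, 0.0))
```

With these assumed relationships, a gaze point that moves toward a sound source and ends near it receives a larger weight than one that moves away from it.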
[0083] In one embodiment of this application, the attention information includes attention weights, the acoustic classification information includes audio frequency segments, and the sound processing unit 730 is used to obtain the audio frequency segment corresponding to each sound source; obtain the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source; and process the sound signal of the corresponding audio frequency segment based on the final attention weight corresponding to each audio frequency segment.
[0084] In one embodiment of this application, the sound processing unit 730 is further configured to obtain the historical attention weight of each sound source; and to obtain the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source, the attention weight of each sound source, and the historical attention weight.
[0085] In one embodiment of this application, the sound processing unit 730 is specifically used to determine whether there are overlapping frequency bands in the sound frequency bands corresponding to all sound sources; if there are overlapping frequency bands, the attention weight of the overlapping frequency band is set as a first attention weight, and the first attention weight is the maximum value among the attention weights corresponding to the overlapping region.
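The overlap rule can be sketched in Python as follows: the frequency axis is split at every band edge, and each resulting segment takes the maximum attention weight among the sound sources whose bands cover it. The function name `merge_band_weights` and the tuple layout `(low_hz, high_hz, weight)` are assumptions chosen for illustration.

```python
def merge_band_weights(sources):
    """Resolve overlapping frequency bands by taking the maximum weight.

    sources: list of (low_hz, high_hz, weight), one entry per sound source.
    Returns non-overlapping (low_hz, high_hz, weight) segments in ascending order.
    """
    edges = sorted({edge for low, high, _ in sources for edge in (low, high)})
    merged = []
    for low, high in zip(edges, edges[1:]):
        covering = [w for s_low, s_high, w in sources
                    if s_low <= low and high <= s_high]
        if covering:  # skip gaps that no sound source covers
            merged.append((low, high, max(covering)))
    return merged


# Two sources overlap between 500 Hz and 1000 Hz; the overlap takes weight 0.8.
segments = merge_band_weights([(100, 1000, 0.3), (500, 4000, 0.8)])
```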
[0086] In one embodiment of this application, the sound processing unit 730 is specifically configured to obtain the gain coefficient of each sound frequency band according to the final attention weight corresponding to each sound frequency band; generate filter coefficients according to the gain coefficients of each sound frequency band; and perform frequency band gain processing on the sound signal of each sound source using the filter with the filter coefficients.
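One possible way to go from final attention weights to filter coefficients is sketched below in Python. Both stages are assumptions rather than the method of this application: `weight_to_gain` maps a weight in [-1, 1] linearly to a gain in decibels (with a hypothetical `max_db` limit), and `design_fir` uses a standard frequency-sampling design to produce linear-phase FIR taps. The application does not state the filter type or the weight-to-gain relationship.

```python
import math


def weight_to_gain(weight, max_db=6.0):
    """Map a final attention weight to a linear gain (assumed dB mapping)."""
    clipped = max(-1.0, min(1.0, weight))
    return 10.0 ** (max_db * clipped / 20.0)


def design_fir(band_gains, sample_rate, num_taps=63):
    """Frequency-sampling design of a linear-phase FIR filter.

    band_gains: list of ((low_hz, high_hz), linear_gain).
    DFT bins outside every band keep unity gain.
    """
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    half = (num_taps - 1) // 2
    # Desired magnitude at DFT bins 0..half (bin k sits at k * fs / N Hz).
    magnitudes = []
    for k in range(half + 1):
        freq = k * sample_rate / num_taps
        gain = 1.0
        for (low, high), band_gain in band_gains:
            if low <= freq < high:
                gain = band_gain
        magnitudes.append(gain)
    # Inverse DFT of the symmetric response with a delay of `half` samples.
    taps = []
    for n in range(num_taps):
        acc = magnitudes[0]
        for k in range(1, half + 1):
            acc += 2.0 * magnitudes[k] * math.cos(
                2.0 * math.pi * k * (n - half) / num_taps)
        taps.append(acc / num_taps)
    return taps


# Boost the 1 kHz to 4 kHz segment for a source with final attention weight 1.
taps = design_fir([((1000.0, 4000.0), weight_to_gain(1.0))], 48000.0)
```

Convolving a sound signal with `taps` then performs the frequency band gain processing; a frequency-sampling design matches the requested gain exactly only at the sampled bins and can ripple between them.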
[0087] In one embodiment of this application, the sound processing unit 730 is specifically configured to obtain the gain difference between the gain coefficient of each audio frequency band at the current moment and the gain coefficient at the previous moment; determine whether the filter coefficient update condition and fade-in / fade-out condition are met based on the gain difference; if the filter coefficient update condition and fade-in / fade-out condition are met, update the filter coefficients based on the gain coefficient of each audio frequency band at the current moment to obtain the updated filter coefficients; use the updated filter coefficients to perform frequency band gain processing on the sound signal of each sound source, and perform sound fade-in / fade-out processing on the sound frequency bands that meet the fade-in / fade-out condition.
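The decision step based on the gain difference might look like the following Python sketch. The threshold values, the use of decibels, the function name `plan_update`, and the rule that a filter update is triggered when any band changes by at least `update_db` are all hypothetical; this section of the application only states that both conditions are evaluated from the gain difference.

```python
def plan_update(prev_gains_db, curr_gains_db, update_db=1.0, fade_db=3.0):
    """Decide whether to update the filter and which bands need a fade.

    prev_gains_db / curr_gains_db: per-band gain coefficients at the previous
    and current moments, in the same band order.
    Returns (update_needed, indices_of_bands_to_fade).
    """
    diffs = [curr - prev for prev, curr in zip(prev_gains_db, curr_gains_db)]
    # Assumed update condition: at least one band changed noticeably.
    update_needed = any(abs(d) >= update_db for d in diffs)
    # Assumed fade condition: a large jump in a band would be audible as a step.
    fade_bands = [i for i, d in enumerate(diffs) if abs(d) >= fade_db]
    return update_needed, fade_bands


update_needed, fade_bands = plan_update([0.0, 0.0, 0.0], [0.5, 2.0, -4.0])
```

Small gain differences leave the existing filter coefficients in place, which avoids recomputing the filter when attention has barely changed.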
[0088] It is understood that the above-described sound signal processing device can implement each step of the sound signal processing method provided in the foregoing embodiments. The relevant explanations of the sound signal processing method are applicable to the sound signal processing device and will not be repeated here.
[0089] Figure 8 is a schematic diagram of the structure of a virtual reality device according to an embodiment of this application. Referring to Figure 8, at the hardware level, the virtual reality device includes a processor, and optionally also an internal bus, a network interface, and memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk drive. Of course, the virtual reality device may also include hardware required for other services.
[0090] The processor, network interface, and memory can be interconnected via an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. This bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 8 as a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.
[0091] Memory is used to store programs. Specifically, programs may include program code, which includes computer operation instructions. Memory may include main memory and non-volatile memory, and provides instructions and data to the processor.
[0092] The processor reads the corresponding computer program from non-volatile memory into main memory and then runs it, forming a sound signal processing device at the logical level. The processor executes the program stored in memory and specifically performs the following operations:
[0093] Acquire the user's gaze trajectory and information on sound-emitting entities within the user's field of vision, wherein the sound-emitting entity information includes the position information and acoustic classification information of each sound-emitting entity;
[0094] Based on the gaze trajectory and the position information of each sound source, obtain the user's attention level information for each sound source;
[0095] The sound signals emitted by each sound source are processed based on the user's attention level to each sound source and the acoustic classification information of each sound source.
[0096] The method executed by the sound signal processing device disclosed in the embodiment shown in Figure 1 of this application can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by instructions in software form. The processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a storage medium that is mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in the memory, and the processor reads the information from the memory and, in conjunction with its hardware, completes the steps of the aforementioned sound signal processing method.
[0097] The virtual reality device can also perform the method executed by the sound signal processing device in Figure 1 and implement the functions of the sound signal processing device in the embodiment shown in Figure 1, which are not described in detail here.
[0098] This application also proposes a computer-readable storage medium that stores one or more programs, the programs including instructions that, when executed by a virtual reality device including multiple applications, enable the virtual reality device to perform the method executed by the sound signal processing device in the embodiment shown in Figure 1, and specifically to perform:
[0099] Acquire the user's gaze trajectory and information on sound-emitting entities within the user's field of vision, wherein the sound-emitting entity information includes the position information and acoustic classification information of each sound-emitting entity;
[0100] Based on the gaze trajectory and the position information of each sound source, obtain the user's attention level information for each sound source;
[0101] The sound signals emitted by each sound source are processed based on the user's attention level to each sound source and the acoustic classification information of each sound source.
[0102] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0103] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0104] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0105] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0106] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0107] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0108] Computer-readable media includes both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0109] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0110] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0111] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.
Claims
1. A sound signal processing method, characterized in that the method includes: acquiring the user's gaze trajectory and information on sound-emitting entities within the user's field of vision, wherein the sound-emitting entity information includes the position information and acoustic classification information of each sound-emitting entity; obtaining the user's attention information for each sound source based on the gaze trajectory and the position information of each sound source; and processing the sound signal emitted by each sound source based on the user's attention information for each sound source and the acoustic classification information of each sound source; wherein the attention information includes attention weights, the acoustic classification information includes audio frequency segments, and the step of processing the sound signal emitted by each sound source based on the user's attention information for each sound source and the acoustic classification information of each sound source includes: obtaining the audio frequency segment corresponding to each sound source; obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source; and processing the sound signal of the corresponding audio frequency segment according to the final attention weight corresponding to each audio frequency segment; the method further includes: obtaining the historical attention weight of each sound source; and the step of obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source includes: obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source, the attention weight of each sound source, and the historical attention weight.
2. The sound signal processing method as described in claim 1, characterized in that the step of obtaining the user's attention information for each sound source based on the gaze trajectory and the position information of each sound source includes: determining, based on the gaze trajectory and the position information of each sound source, the trend of change of the relative distance between the gaze point and each sound source, as well as the relative distance between the gaze point and each sound source at the end of the gaze trajectory; and obtaining the attention weight of each sound source based on the trend of change of the relative distance between the gaze point and each sound source and the relative distance between the gaze point and each sound source at the end of the gaze trajectory.
3. The sound signal processing method as described in claim 2, characterized in that the step of obtaining the attention weight of each sound source based on the trend of change of the relative distance between the gaze point and each sound source and the relative distance between the gaze point and each sound source at the end of the gaze trajectory includes: obtaining the attention weight of each sound source based on the preset relationship between relative distance and attention weight and the preset relationship between the trend of change of relative distance and attention weight.
4. The sound signal processing method as described in claim 1, characterized in that the step of obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source includes: determining whether there are overlapping frequency bands among the audio frequency segments corresponding to all sound sources; and, if there are overlapping frequency bands, setting the attention weight of the overlapping frequency band to a first attention weight, the first attention weight being the maximum value among the attention weights corresponding to the overlapping region.
5. The sound signal processing method as described in claim 1, characterized in that the step of processing the sound signal of the corresponding audio frequency segment according to the final attention weight corresponding to each audio frequency segment includes: obtaining the gain coefficient of each audio frequency segment based on the final attention weight corresponding to each audio frequency segment; generating filter coefficients based on the gain coefficient of each audio frequency segment; and performing frequency band gain processing on the sound signal of each sound source using a filter with the filter coefficients.
6. The sound signal processing method as described in claim 5, characterized in that, after generating the filter coefficients based on the gain coefficient of each audio frequency segment, the method further includes: obtaining the gain difference between the gain coefficient of each audio frequency segment at the current moment and the gain coefficient at the previous moment; determining, based on the gain difference, whether the filter coefficient update condition and the fade-in / fade-out condition are met; if the filter coefficient update condition and the fade-in / fade-out condition are met, updating the filter coefficients according to the gain coefficient of each audio frequency segment at the current moment to obtain updated filter coefficients; and using the filter with the updated filter coefficients to perform frequency band gain processing on the sound signal of each sound source, and performing sound fade-in / fade-out processing on the audio frequency segments that meet the fade-in / fade-out condition.
7. A sound signal processing device, characterized in that the device includes: an information acquisition unit used to acquire the user's gaze trajectory and information on sound-emitting objects within the user's field of vision, wherein the sound-emitting object information includes the position information and acoustic classification information of each sound-emitting object; a weight calculation unit used to obtain the user's attention information for each sound source based on the gaze trajectory and the position information of each sound source; and a sound processing unit used to process the sound signal emitted by each sound source based on the user's attention information for each sound source and the acoustic classification information of each sound source; wherein the attention information includes attention weights, and the sound processing unit is specifically used for: obtaining the audio frequency segment corresponding to each sound source; obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source; and processing the sound signal of the corresponding audio frequency segment according to the final attention weight corresponding to each audio frequency segment; the sound processing unit is also used for: obtaining the historical attention weight of each sound source; and obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source and the attention weight of each sound source includes: obtaining the final attention weight corresponding to each audio frequency segment based on the audio frequency segment corresponding to each sound source, the attention weight of each sound source, and the historical attention weight.
8. A virtual reality device, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to perform the sound signal processing method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing one or more programs, which, when executed by a virtual reality device including a plurality of applications, cause the virtual reality device to perform the sound signal processing method according to any one of claims 1 to 6.