Sound image control device and sound image control method, and program
The real-time virtual sound image generation system using speaker arrays and motion capture technology addresses the limitations of existing technologies by dynamically controlling sound images in response to user movements, enhancing localization and immersion.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- UNIVERSITY OF ELECTRO-COMMUNICATIONS
- Filing Date
- 2024-12-20
- Publication Date
- 2026-07-02
Abstract
Description
Technical Field
[0001] The present disclosure relates to a sound image control device, a sound image control method, and a program, and particularly to a sound image control device, a sound image control method, and a program that can provide a better user experience.
Background Art
[0002] Conventionally, a sound image with a sense of depth, such as one located behind or in front of the speakers, can be generated in space by using wave field synthesis (WFS) or the spectral division method with a linear speaker array, in which a plurality of speaker units are arranged in a line, or by using NFC-HOA (Near-Field Compensated Higher Order Ambisonics) with a spherical speaker array, in which a plurality of speaker units are arranged on a sphere. By controlling the generation of the sound image in this way, a user who is a listener can be made to experience the sensation that the sound image is localized at a predetermined position (hereinafter referred to as the sense of localization).
[0003] It is known that a person's sense of localization improves when the listener actively moves. For example, Non-Patent Document 1 discloses a technique for improving the sense of localization during binaural reproduction over headphones by tracking the movement of the listener's head (head tracking) and switching the head-related transfer function according to that movement.
[0004] Furthermore, applying the reciprocity theorem, which states that the transfer function representing the transmission of sound does not change when the sound source and the receiving point are interchanged, to the technique disclosed in Non-Patent Document 1, it can be interpreted that the sense of localization improves when the sound source position changes even if the listener's ear position is fixed. For example, Non-Patent Document 1 discloses that the sense of localization improves even when the listener moves the sound image by operating a handle while keeping the head stationary.
Prior Art Documents
[0005] [Non-Patent Document 1] Daisuke Yoshizaki, Tatsuya Hirahara, "Horizontal Plane Sound Image Localization Using Dynamic Binaural Sound Recorded with a Dummy Head Rotationally Controlled by Handle Operation," NII-Electronic Library Service, 2012, TVRSJ Vol.17 No.4, pp.327-331.
Summary of the Invention
Problems to Be Solved by the Invention
[0006] By combining focused-source methods such as the aforementioned wave field synthesis, spectral division method, and NFC-HOA with the motion capture used in the field of virtual reality, it should be possible to interactively control the spatial position of the sound image in accordance with the user's movements. This is expected to improve the sense of localization over conventional methods and thereby provide a better user experience.
[0007] The present disclosure has been made in light of these circumstances and aims to provide a better user experience.
Means for Solving the Problem
[0008] A sound image control device according to one aspect of the present disclosure includes a virtual sound image generation filter that generates a virtual sound image in space using a speaker array composed of a plurality of speaker elements, and a control unit that controls playback so that the sound reproduced at the sound image changes with fluctuation in accordance with the movement of a body part acquired by motion capture.
[0009] A sound image control method or program according to one aspect of the present disclosure includes generating a virtual sound image in space using a speaker array composed of a plurality of speaker elements, and controlling playback so that the sound reproduced at the sound image changes with fluctuation in accordance with the movement of a body part acquired by motion capture.
[0010] In one aspect of this disclosure, a virtual sound image is generated in space using a speaker array composed of multiple speaker elements, and playback is controlled so that the sound reproduced at the sound image changes with fluctuation in accordance with the movement of a body part acquired by motion capture.
Effects of the Invention
[0011] According to one aspect of this disclosure, a better user experience can be provided.
[0012] The effects are not necessarily limited to the one described here and may be any of the effects described in this disclosure.
Brief Description of the Drawings
[0013] [Figure 1] A block diagram showing an example configuration of a real-time virtual sound image generation system using a linear speaker array.
[Figure 2] A block diagram showing an example configuration of a real-time virtual sound image generation system using a spherical speaker array.
[Figure 3] A diagram illustrating the primary sound field, which is the sound pressure generated by a point source, and the secondary sound field, which is the sound pressure reproduced inside a spherical array.
[Figure 4] A diagram illustrating the interpolation process for determining the interpolated position.
[Figure 5] A diagram illustrating interpolation processing using linear interpolation and interpolation processing using PID control.
[Figure 6] A diagram illustrating PID control.
[Figure 7] A block diagram showing an example configuration of a control unit that performs movement control processing for the virtual sound image position.
[Figure 8] A flowchart explaining the movement control processing for the virtual sound image position.
[Figure 9] A block diagram showing an example configuration of a control unit that performs content control processing.
[Figure 10] A flowchart explaining the content control processing.
[Figure 11] A block diagram showing an example configuration of one embodiment of a computer to which this technology is applied.
Modes for Carrying Out the Invention
[0014] The following describes in detail a specific embodiment of this technology, with reference to the drawings.
[0015] <Example configuration of a real-time virtual sound image generation system using a linear speaker array>
Referring to Figure 1, a real-time virtual sound image generation system that controls, in real time, a virtual sound image (focused source) generated in space using a linear speaker array will be described.
[0016] The real-time virtual sound image generation system 11 shown in Figure 1 comprises a sound source 21, a virtual sound image generation filter 22, a linear speaker array 23, and a motion capture device 24.
[0017] The real-time virtual sound image generation system 11 generates a virtual sound image in space by outputting the sound time signal U(t) output from the sound source 21 through the virtual sound image generation filter 22 to the linear speaker array 23, which is composed of multiple speaker elements 34 (s speaker elements 34-1 to 34-s in the illustrated example) arranged in a straight line. Here, t in the sound time signal U(t) is discrete time. The real-time virtual sound image generation system 11 can interactively control the planar movement of the position of the virtual sound image generated using the linear speaker array 23 (hereinafter referred to as the virtual sound image position r_PS) in accordance with the position of the user's hand obtained in real time using the motion capture device 24.
[0018] For example, if the normal derivative of the sound pressure on the straight boundary along the linear speaker array 23 and the transfer function are known, the sound field p(r) beyond that boundary can be controlled based on the Rayleigh integral of the first kind, as shown in the following equation (1).
[0019]
Equation (1)
[0020] In equation (1), r_S is the position of each of the speaker elements 34-1 to 34-s, and r_PS is the position of the virtual sound image generated by the real-time virtual sound image generation system 11.
[0021] Assuming a sink-type focused source that absorbs sound at the virtual sound image (which is physically impossible) and emits sound beyond the virtual sound image, the filter W(r_S, r_PS, ω) constituting the virtual sound image generation filter 22, which generates the sound output from the speaker elements 34-1 to 34-s, can be derived as shown in the following equation (2).
[0022]
Equation (2)
[0023] In equation (2), ω is the angular frequency of the sound output from the sound source 21, H_1^(1) is the Hankel function of the first kind, and g_s is a coefficient that accounts for the attenuation of the sound at a predetermined reference listening position r_ref, given by g_s = √(2π|r_ref - r_S|).
[0024] As equation (2) shows, the filter W(r_S, r_PS, ω) depends on both the angular frequency ω and the virtual sound image position r_PS. A numerical inverse Fourier transform is therefore required, which makes it difficult to compute the time-domain filter and convolve with it in real time.
[0025] Therefore, to control the virtual sound image position r_PS in real time according to the position of the user's hand, the real-time virtual sound image generation system 11 needs to convert the filter to the time domain analytically. When the Hankel function is approximated by an exponential function, equation (2) above can be expressed as equation (3).
[0026]
Equation (3)
[0027] Furthermore, when the whole of equation (3) is inverse Fourier transformed, the filter W(r_S, r_PS, ω) can be expressed, as shown in the following equation (4), by a filter q(t) (= IFFT(√(ω / c / 2πj))), a scalar gain g(r_S, r_PS), and a non-integer delay D(r_S, r_PS, Z). In equation (3), c is the speed of sound (approximately 340 m/s), and T is the sampling period (i.e., the reciprocal of the sampling frequency). The scalar gain g(r_S, r_PS) and the non-integer delay D(r_S, r_PS, Z) depend on the speaker position r_S and the virtual sound image position r_PS.
[0028]
Equation (4)
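The decomposition described above (a fixed filter followed by a scalar gain and a non-integer delay) can be illustrated with a short Python sketch of the gain and delay stage. It assumes linear interpolation between neighbouring samples as the fractional-delay method, which the text does not specify, and the function name `apply_gain_and_delay` is hypothetical.

```python
import math


def apply_gain_and_delay(signal, gain, delay_samples):
    """Scale `signal` by `gain` and delay it by a non-integer number of samples.

    Assumption: the fractional part of the delay is realised by linear
    interpolation between adjacent samples; samples before the start of
    the signal are treated as zero.
    """
    if delay_samples < 0:
        raise ValueError("delay must be non-negative")
    whole = math.floor(delay_samples)
    frac = delay_samples - whole
    out = []
    for n in range(len(signal)):
        i = n - whole
        newer = signal[i] if 0 <= i < len(signal) else 0.0
        older = signal[i - 1] if 0 <= i - 1 < len(signal) else 0.0
        # Blend the two samples that straddle the requested delay.
        out.append(gain * ((1.0 - frac) * newer + frac * older))
    return out
```

For a unit impulse, a delay of 1.5 samples spreads the impulse equally over output samples 1 and 2. Higher-quality fractional-delay filters exist; linear interpolation is used here only because it is short.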
[0029] Therefore, the virtual sound image generation filter 22 can be configured with an FIR filter 31, delays 32-1 to 32-s, and amplifiers 33-1 to 33-s, as shown in Figure 1. The FIR filter 31 is a system-dependent fixed filter. The delay amount D of the delays 32-1 to 32-s and the gain g of the amplifiers 33-1 to 33-s change from moment to moment in accordance with the position of the user's hand acquired by the motion capture device 24.
[0030] The real-time virtual sound image generation system 11, configured as described above, can interactively control the planar movement of the virtual sound image position r_PS in real time in accordance with the position of the user's hand acquired using the motion capture device 24.
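As a rough illustration of how the per-speaker delay D and gain g might be recomputed whenever the hand position changes, the following Python sketch derives them from the geometry alone. The delay and gain laws used here are assumptions made for illustration (speakers farther from the focus fire earlier, and the gain is g_s divided by the square root of the distance to the focus); they are not the document's equation (4). The function name and the 48 kHz default are likewise hypothetical.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the approximate value given in the text


def focus_delays_and_gains(speakers, focus, ref, fs=48000.0):
    """Return one (delay_in_samples, gain) pair per speaker.

    speakers, focus and ref are (x, y) positions in metres.
    Assumed laws (illustrative only): wavefronts from all speakers should
    meet at the focus, so far speakers get small delays, and the gain is
    g_s / sqrt(distance to the focus) with g_s = sqrt(2*pi*|r_ref - r_S|).
    """
    dists = [max(math.dist(s, focus), 1e-6) for s in speakers]
    longest = max(dists)
    result = []
    for s, d in zip(speakers, dists):
        delay = (longest - d) / SPEED_OF_SOUND * fs  # non-integer samples
        g_s = math.sqrt(2.0 * math.pi * math.dist(ref, s))
        result.append((delay, g_s / math.sqrt(d)))
    return result
```

Only these two scalars per speaker depend on the tracked position, which is why the structure in Figure 1 can keep the FIR filter fixed.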
[0031] <Example configuration of a real-time virtual sound image generation system using a spherical speaker array>
Referring to Figure 2, a real-time virtual sound image generation system that controls, in real time, a virtual sound image (focused source) generated in space using a spherical speaker array will be described.
[0032] The real-time virtual sound image generation system 41 shown in Figure 2 comprises a sound source 51, a virtual sound image generation filter 52, a spherical speaker array 53, and a motion capture device 54.
[0033] The real-time virtual sound image generation system 41 generates a virtual sound image in space (reproduces three-dimensional sound) by outputting the sound time signal U(t) output from the sound source 51 through the virtual sound image generation filter 52 to the spherical speaker array 53, which is composed of multiple speaker elements arranged on a sphere (in Figure 2, each black circle represents a speaker element). The real-time virtual sound image generation system 41 can interactively control the three-dimensional movement of the virtual sound image position r_PS, which is the position of the virtual sound image generated using the spherical speaker array 53, in accordance with the position of the user's hand acquired in real time using the motion capture device 54. The real-time virtual sound image generation system 41 uses a spherical coordinate system (r, Ω) = (r, θ, φ) with the center of the spherical speaker array 53 as the origin, where θ is the zenith angle and φ is the azimuth angle, and the virtual sound image position r_PS = (r_PS, θ_PS, φ_PS) is represented by the virtual sound image distance r_PS and the virtual sound image angle Ω_PS = (θ_PS, φ_PS).
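The coordinate convention above can be made concrete with a small Python sketch that converts a Cartesian hand position into (r, θ, φ). The axis orientation (zenith measured from +z, azimuth from +x) and the function name are assumptions; the text only states that θ is the zenith angle and φ is the azimuth angle.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Convert a Cartesian position to (r, theta, phi).

    theta is the zenith angle measured from the +z axis and phi is the
    azimuth angle measured from the +x axis in the x-y plane.
    The axis orientation is an assumption made for this sketch.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin
    # Clamp to guard against rounding slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi
```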
[0034] For example, the real-time virtual sound image generation system 41 can use focus source methods such as VBAP (Vector Based Amplitude Panning) or HOA (Higher Order Ambisonics) when distance perception is not controlled, and can use focus source methods such as NFC-HOA + Realtime processing when distance perception is controlled.
[0035] Here, HOA is a method that uses a spherical or circular speaker array to decompose, analyze, and synthesize sound using a spatial Fourier transform on the angular directional pattern. When distance is not controlled, the sound field p(r,θ,φ) due to incoming sound from an external source is expressed by the following equation (5).
[0036]
Equation (5)
[0037] In equation (5), j_n(·) is the spherical Bessel function, Y_n^m(·) is the spherical harmonic function, and P_n^m(·) is the associated Legendre function.
[0038] Furthermore, the spherical harmonic expansion is a spatial Fourier transform with respect to angle: with the spherical harmonics Y_n^m(θ,φ) of the angles θ and φ as the basis, an arbitrary characteristic f(θ,φ) on the sphere is decomposed into components with expansion coefficients f_nm. The transform and its inverse are given by the following equation (6).
[0039]
Equation (6)
[0040] In equation (6), n is the order, N is the maximum order determined by the number of speaker elements, and S^2 is the unit sphere.
[0041] For example, when HOA is used to reproduce sound arriving from a certain direction, placing a single sound source in the direction given by the angles θ and φ makes the characteristic f(θ,φ) take the form of the following equation (7), and the expansion coefficients f_nm are given by the following equation (8).
[0042]
Equation (7)
Equation (8)
[0043] In equation (7), δ(·) is the Dirac delta function, and in equation (8), (θ_t, φ_t) is the direction from which the sound arrives.
[0044] Then, in HOA, the gain G_l of the l-th speaker drive signal that synthesizes these components can be expressed by the following equation (9).
[0045]
Equation (9)
[0046] In equation (9), l is the speaker index.
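To give a feel for how per-speaker gains follow from the expansion coefficients, the Python sketch below computes panning gains for a source direction (θ_t, φ_t). It assumes a simple sampling-type decoder on a uniform layout, G_l = (1/L) Σ_n (2n+1) P_n(cos γ_l), which follows from the spherical harmonic addition theorem, where γ_l is the angle between the source and speaker l. This is a common textbook form and is not claimed to be identical to equation (9); all function names are hypothetical.

```python
import math


def unit_vector(theta, phi):
    """Direction for zenith angle theta and azimuth angle phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def legendre(n_max, x):
    """Legendre polynomials P_0(x)..P_n_max(x) via the standard recurrence."""
    p = [1.0, x]
    for n in range(1, n_max):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[:n_max + 1]


def hoa_panning_gains(source_dir, speaker_dirs, order):
    """Gain per speaker for a source arriving from `source_dir`.

    Directions are (theta, phi) pairs. Assumes a sampling-type decoder
    on a uniform speaker layout (an assumption of this sketch).
    """
    src = unit_vector(*source_dir)
    gains = []
    for spk in speaker_dirs:
        cos_gamma = sum(a * b for a, b in zip(src, unit_vector(*spk)))
        p = legendre(order, cos_gamma)
        total = sum((2 * n + 1) * p[n] for n in range(order + 1))
        gains.append(total / len(speaker_dirs))
    return gains
```

With six speakers on the coordinate axes and a first-order expansion, the speaker in the source direction receives the largest gain and the gains sum to one.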
[0047] Next, we derive an NFC-HOA filter to implement distance perception control.
[0048] For example, the primary sound field p(α, Ω_α), which is the sound pressure generated by a point source as shown in Figure 3A, is expressed by the following equation (10), and the secondary sound field p̂(α, Ω_α), which is the sound pressure reproduced inside the spherical array as shown in Figure 3B, is expressed by the following equation (11).
[0049]
Equation (10)
Equation (11)
[0050] The filter W_s(ω) that makes equations (10) and (11) agree can be obtained by expanding both in spherical harmonics and matching the coefficients of the same order. For example, the NFC-HOA filter W_s(ω, r_PS) obtained by mode matching is expressed by a radial filter and an angular gain, as shown in the following equation (12).
[0051]
Equation (12)
[0052] In equation (12), r_PS is the virtual sound source distance (radius), r is the speaker distance (radius), N is the expansion order, L is the number of speakers, h_n^(2)(kr) is the spherical Hankel function, Y_n^m(Ω_s) is the spherical harmonic function, and k = ω / c is the wavenumber. The radial filter and the angular gain depend on the positions of the speaker and the virtual sound image. In addition, the radial filter is frequency-dependent, so an inverse Fourier transform from the frequency-domain filter to an FIR filter would be required.
[0053] Here, the z-domain (time-domain) representation of the radial filter is obtained analytically by converting the frequency-domain filter into a z-domain filter via the Laplace domain. For example, the Laplace-domain representation of a third-order radial filter is expressed as shown in the following equation (13).
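The frequency dependence of the radial filter can be explored numerically with the Python sketch below. It assumes the commonly used NFC-HOA radial term, the ratio h_n^(2)(k r_PS) / h_n^(2)(k r), which may differ in detail from equation (12). The spherical Hankel function is evaluated with its closed-form finite sum, and both function names are hypothetical.

```python
import cmath
import math


def spherical_hankel2(n, x):
    """Spherical Hankel function of the second kind h_n^(2)(x), real x > 0.

    Evaluates the closed-form finite sum for h_n^(1) and takes the complex
    conjugate, which equals h_n^(2) for real arguments.
    """
    total = 0j
    for m in range(n + 1):
        coeff = math.factorial(n + m) / (math.factorial(m) * math.factorial(n - m))
        total += coeff * (1j / (2.0 * x)) ** m
    h1 = (-1j) ** (n + 1) * cmath.exp(1j * x) / x * total
    return h1.conjugate()


def nfc_radial_filter(n, freq_hz, r_source, r_speaker, c=340.0):
    """Assumed radial term of order n at one frequency (illustrative form)."""
    k = 2.0 * math.pi * freq_hz / c  # wavenumber k = omega / c
    return spherical_hankel2(n, k * r_source) / spherical_hankel2(n, k * r_speaker)
```

For order 0 the magnitude reduces to the distance ratio r / r_PS at every frequency, while higher orders are boosted at low frequencies when the virtual source is closer than the speakers, which is the frequency dependence the text refers to.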
[0054]
Equation (13)
[0055] Then, by applying the matched z-transform to equation (13), the filter can be expressed, as shown in the following equation (14), by one gain, one non-integer delay, and an IIR filter of order n = 3.
[0056]
Equation (14)
[0057] As shown in equation (14), the gain corresponds to the magnitude for the distance to the virtual sound image, the non-integer delay corresponds to the arrival delay for the distance to the virtual sound image, and the IIR filter of order n = 3 corresponds to the ratio in which the speaker drive gains of each order are combined at each frequency. In addition, the denominator of the IIR filter of order n = 3 is fixed, with α_l^r determined by pre-calculation, so only the terms that depend on r_PS change.
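The matched z-transform mentioned above maps each Laplace-domain pole or zero s_i to z_i = exp(s_i T), where T is the sampling period. A minimal Python sketch of that mapping follows; the function names are hypothetical, and the roots used in the usage note are placeholders rather than the roots of equation (13).

```python
import cmath


def matched_z_roots(s_roots, sample_period):
    """Map s-plane poles or zeros to the z-plane with z = exp(s * T)."""
    return [cmath.exp(s * sample_period) for s in s_roots]


def first_order_section(z_root):
    """Coefficients (1, -z_root) of the factor 1 - z_root * z^-1."""
    return (1.0, -z_root)
```

For example, `matched_z_roots([-48000.0], 1 / 48000)` yields a single root at exp(-1), and any root with a negative real part lands inside the unit circle, so stability is preserved by the mapping.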
[0058] Therefore, the virtual sound image generation filter 52 can be composed of an amplifier 61, a non-integer delay unit 62, and IIR filters 63-0 to 63-N, as shown in Figure 2, and the IIR filters 63-0 to 63-N are connected to each speaker element constituting the spherical speaker array 53 via a gain matrix 64 of (N+1)×L.
[0059] The real-time virtual sound image generation system 41, configured as described above, can interactively control the three-dimensional movement of the virtual sound image position r_PS in real time in accordance with the position of the user's hand acquired using the motion capture device 54.
[0060] Furthermore, the virtual sound image generation filter 52 is equipped with a control unit 65 that performs virtual sound image position movement control processing, as described later with reference to Figures 4 to 8, or content control processing, as described later with reference to Figures 9 and 10, in order to provide a good user experience in accordance with the user's hand movements (movement or behavior).
[0061] <Processing to control the movement of the virtual sound image position>
The process for controlling the movement of the virtual sound image position will be explained with reference to Figures 4 to 8.
[0062] To move the virtual sound image position in real time, without delay, in response to the user's hand movements, the real-time virtual sound image generation system 41 needs to set a short frame size (buffer size) for the sound playback frames that reproduce the sound output from the sound source 51. For this reason, the real-time virtual sound image generation system 41 is configured to reproduce sound with a buffer size of, for example, 256 or 512 samples at a sampling frequency of 48,000 Hz.
[0063] On the other hand, the motion capture device 24 is generally configured to perform motion capture at a frame rate of about 30 frames per second. Therefore, the tracking period, in which the motion capture device 24 acquires a tracking position by tracking the position of the user's hand, is longer than the frame period of the sound playback frame.
[0064] Therefore, in order to reproduce sound sample by sample, the real-time virtual sound image generation system 41 performs interpolation processing that interpolates the tracking positions acquired at each tracking period of motion capture according to the frame period of the sound playback frame, as shown in Figure 4, to obtain interpolated positions X_i. A movement control process that controls the movement of the virtual sound image position is then executed for each sound playback frame so that the sound is localized at each interpolated position X_i.
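The mismatch between the two periods can be checked with a few lines of arithmetic. The Python sketch below uses the figures given in the text (48,000 Hz, buffer sizes of 256 or 512 samples, about 30 tracking frames per second); the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 48000  # sampling frequency given in the text
TRACKING_FPS = 30       # approximate motion capture frame rate given in the text


def frame_period_ms(buffer_size, sample_rate=SAMPLE_RATE_HZ):
    """Duration of one sound playback frame in milliseconds."""
    return 1000.0 * buffer_size / sample_rate


def frames_per_tracking_period(buffer_size,
                               sample_rate=SAMPLE_RATE_HZ,
                               tracking_fps=TRACKING_FPS):
    """How many sound playback frames fit into one tracking period."""
    return (sample_rate / tracking_fps) / buffer_size
```

A 256-sample buffer lasts about 5.33 ms and a 512-sample buffer about 10.67 ms, whereas one tracking period is about 33.3 ms, so roughly 6.25 or 3.125 playback frames elapse between tracking updates.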
[0065] The upper part of Figure 4 shows a motion capture thread in which the motion capture device 54 performs motion capture according to the tracking cycle, and the lower part of Figure 4 shows a sound playback thread in which the virtual sound image generation filter 52 plays sound according to the frame cycle of the sound playback frame. The virtual sound image generation filter 52 checks for the end of motion capture at each frame cycle of the sound playback frame, and the motion capture device 54 supplies tracking position information (i.e., the position of the user's hand that was motion captured) to the virtual sound image generation filter 52 at the timing when the tracking cycle has ended.
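The hand-off between the two threads described above could be sketched in Python as a small shared holder that the motion capture thread writes to at the end of each tracking cycle and the sound playback thread polls once per frame. The class and method names are hypothetical, and the use of a lock is an implementation assumption rather than something stated in the text.

```python
import threading


class LatestTrackingPosition:
    """Hand-off of the most recent tracking position between two threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._position = None
        self._fresh = False

    def publish(self, position):
        """Called by the motion capture thread when a tracking cycle ends."""
        with self._lock:
            self._position = position
            self._fresh = True

    def poll(self):
        """Called by the sound playback thread once per playback frame.

        Returns the new position if one arrived since the last poll,
        otherwise None.
        """
        with self._lock:
            if not self._fresh:
                return None
            self._fresh = False
            return self._position
```

The playback thread never blocks waiting for motion capture: a `None` result simply means the target position is unchanged for that frame.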
[0066] For example, as shown in the diagram, suppose tracking positions A, B, and C are acquired in three tracking cycles. At this time, tracking position B is acquired when sound is played with tracking position A as the virtual sound image position, and sound is played in each sound playback frame so that the virtual sound image position moves from tracking position A to tracking position B.
[0067] Here, if the frame period of the sound playback frame is 1/4 of the tracking period, the interpolation processing yields three interpolated positions X1 to X3, and sound is played sequentially in each sound playback frame with each interpolated position X_i as the virtual sound image position. For example, if linear interpolation, which interpolates between tracking positions at equal intervals, is used for the interpolation processing, the points 1/4, 2/4, and 3/4 of the way from tracking position A to tracking position B are obtained as interpolated positions X1, X2, and X3, respectively.
[0068] In this case, after sound is played at tracking position A, sound is played at successive sound playback frame timings with interpolated position X1 (1/4 of the way to B), then interpolated position X2 (2/4 of the way), and then interpolated position X3 (3/4 of the way) as the virtual sound image position. Tracking position C is then acquired at the timing when sound is played with tracking position B as the virtual sound image position. Similarly, interpolated positions X_i that interpolate between tracking positions B and C are obtained so that the virtual sound image position moves from tracking position B to tracking position C, and sound is played sequentially in each sound playback frame with each interpolated position X_i as the virtual sound image position.
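The equal-interval case described above can be written as a short Python sketch. The function name is hypothetical, and positions are represented as tuples of coordinates for illustration.

```python
def linear_interpolated_positions(start, end, frames_per_tracking_period):
    """Equally spaced positions strictly between `start` and `end`.

    With 4 playback frames per tracking period this returns the three
    points 1/4, 2/4 and 3/4 of the way from `start` to `end`.
    """
    n = frames_per_tracking_period
    return [tuple(a + (b - a) * i / n for a, b in zip(start, end))
            for i in range(1, n)]
```

For tracking positions A = (0, 0) and B = (4, 8) with four frames per tracking period, the result is X1 = (1, 2), X2 = (2, 4), and X3 = (3, 6).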
[0069] Thus, there will be a time delay (though not noticeable) between the timing at which the tracking position is acquired and the timing at which the sound is reproduced using that tracking position as a virtual sound image position.
[0070] In general, between the acquisition of one tracking position and the acquisition of the next, the user's hand does not necessarily move at a constant speed along a straight trajectory. In other words, the user does not move the hand at a constant speed along a straight line as linear interpolation assumes; the speed and trajectory of the hand movement fluctuate to some extent. Therefore, even when sound is played at each frame period of the sound playback frame with the interpolated positions X_i obtained by the equal-interval linear interpolation described above as the virtual sound image position, it was sometimes not possible to obtain a sense of localization in which the virtual sound image position follows the movement of the user's hand.
[0071] On the other hand, there are research reports suggesting that sounds are easier to localize when their position is fluctuating (for example, it is easier to recognize that there is a sound source there).
[0072] Therefore, in the real-time virtual sound image generation system 41, the control unit 65 uses the position of the virtual sound source where sound was played at a certain timing as the current position (the tracking position of the source of movement or the previous interpolation position), and the tracking position of the destination as the target position. The control unit 65 then executes a movement control process that controls the movement of the virtual sound image position by using interpolation processing to find the interpolation position that will be the virtual sound source position where sound will be played at the next timing, so that the virtual sound image position moves from the current position towards the target position while fluctuating. As a result, the real-time virtual sound image generation system 41 can improve the sense of localization so that the virtual sound image position follows the user's hand movements, and can provide a better user experience.
[0073] For example, referring to Figure 4, when playing sound at tracking position A (the tracking position of the movement origin), the control unit 65 takes tracking position A as the current position and tracking position B (the tracking position of the movement destination) as the target position, and obtains interpolated position X1 such that the virtual sound image position moves from the current position toward the target position while fluctuating. Likewise, when playing sound at interpolated position X_i, the control unit 65 takes interpolated position X_i as the current position and tracking position B (the tracking position of the movement destination) as the target position, and obtains interpolated position X_i+1 such that the virtual sound image position moves from the current position toward the target position while fluctuating.
[0074] The control unit 65 determines the interpolated position using the tracking position of the destination as the target position. However, depending on the speed at which the user moves their hand and the coefficients set to determine the interpolated position, the final interpolated position may be before or after the target position (for example, it may not reach the target position or may go beyond it). In other words, the control unit 65 moves the virtual sound image position not to the target position itself, but towards the area around the target position.
[0075] Furthermore, in the real-time virtual sound image generation system 41, PID (Proportional-Integral-Derivative) control can be used as the interpolation processing that realizes such movement control of the virtual sound image position.
[0076] PID control is a classical control method that computes the control amount from the deviation from the target value, its derivative, and its integral, and thereby continuously and sequentially reduces the difference between the controlled value and the target value. Although PID control is not originally a method for interpolation, using it for interpolation makes it possible to determine the interpolated position so that the virtual sound image position moves from the current position toward the target position (or the area around the target position) while fluctuating.
[0077] Furthermore, when applying interpolation using PID control to virtual sound image position movement control, unlike when applying interpolation using linear interpolation to virtual sound image position movement control, it is not necessary to consider the timestamp of the target position which is updated asynchronously. In addition, when applying interpolation using PID control to virtual sound image position movement control, instantaneous deviations in tracking can be absorbed by adjusting the coefficients of the PID control.
[0078] Referring to Figure 5, we will explain the difference between the interpolated position obtained using linear interpolation and the interpolated position obtained using PID control.
[0079] As mentioned above, from the moment tracking position A is acquired until the moment sound is played at the virtual sound image position A corresponding to tracking position A, a time delay (though not a noticeable one) corresponding to at least the tracking period occurs.
[0080] Figure 5A shows an example of an interpolated position obtained by interpolation using linear interpolation.
[0081] As illustrated, even when the user's hand moves along a curve, interpolation processing using linear interpolation obtains interpolated positions X_i that linearly interpolate between virtual sound image position A and virtual sound image position B at equal intervals, according to the timing of the sound playback frames, and likewise obtains interpolated positions X_i that linearly interpolate between virtual sound image position B and virtual sound image position C at equal intervals.
[0082] Figure 5B shows an example of an interpolated position obtained by interpolation processing using PID control.
[0083] In interpolation processing using PID control, interpolated positions X_i are obtained that move by a large amount toward the target position when far from it and by a small amount when close to it. Furthermore, depending on the speed at which the user moves the hand and the coefficients set for determining the interpolated position, interpolated positions X_i that fall short of the target position or go beyond it may be obtained. For example, if the user's hand moves quickly, the amount of movement becomes large, and if the user's hand moves quickly along a curve, interpolated positions X_i that swing around the outside of that curve can be obtained.
[0084] As illustrated, when the user's hand moves along a curve, interpolated positions X_i are obtained that change along a curve (not necessarily the curve of the hand movement), with the interval decreasing as the virtual sound image position approaches virtual sound image position B from virtual sound image position A at the timings of the sound playback frames, and likewise with the interval decreasing as it approaches virtual sound image position C from virtual sound image position B.
[0085] Figure 6 is a diagram for explaining PID control.
[0086] PID control is a control that, for each sound reproduction frame, reduces the deviation e of the current position (the virtual sound image position r_PS corresponding to the tracking position of the movement origin or the previous interpolated position) with respect to the target position (the virtual sound image position r̃_PS corresponding to the tracking position of the movement destination) according to the deviation and its differential and integral values. For example, in actual processing, discrete approximation is performed according to the frame period of the DA converter provided in the real-time virtual sound image generation system 41. With such PID control, it is possible to obtain interpolated positions that change so that the amount of movement of the virtual sound image position gradually decreases from the current position toward the target position.
[0087] For example, in PID control, the increment Δr_ps corresponding to the interpolation position obtained from the current virtual sound image position r_ps toward the target virtual sound image position r̃_ps is controlled. Then, taking the deviation of the target position with respect to the current position in the Cartesian coordinate system as e_ps, and taking the deviation used when obtaining the previous sound reproduction frame as e_ps0, the control amount Δr_ps with respect to the control period τ is obtained as shown in the following equation (15).
[0088]
Equation
[0089] In equation (15), K_P, K_I, and K_D are coefficients that adjust the magnitude of the influence of the proportional, integral, and differential terms, respectively. For example, the larger the coefficient K_P, the faster the convergence to the target position. The integral term helps the steady-state error converge more quickly, but the phase lag of the integral worsens the responsiveness. The differential term helps to suppress abrupt changes and oscillations caused by noise.
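The body of equation (15) does not appear in this text, so the following is only a sketch of a commonly used discrete-time PID form written with the symbols defined in the surrounding paragraphs. The summation index k, the notation e_ps0 for the previous frame's deviation, and the exact discretization are assumptions and may differ from the equation in the original publication.

```latex
% Assumed discrete PID increment per sound reproduction frame (control period \tau)
e_{ps} = \tilde{r}_{ps} - r_{ps}, \qquad
\Delta r_{ps} = K_P\, e_{ps}
              + K_I\, \tau \sum_{k} e_{ps}^{(k)}
              + K_D\, \frac{e_{ps} - e_{ps0}}{\tau}, \qquad
r_{ps} \leftarrow r_{ps} + \Delta r_{ps}
```

Under this form, with K_I = K_D = 0 and 0 < K_P < 1, each frame covers a fixed fraction of the remaining distance, which is consistent with the described behavior of the movement amount shrinking as the target is approached.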
[0090] By using such PID control for interpolation, the interpolated position can be determined based on how close the current position is to the target position. Therefore, even if the motion capture and sound playback frames are asynchronous and their sizes differ, there is no need to consider the difference. Furthermore, PID control offers a certain degree of flexibility in how the control variables are handled, and processing can be done by directly adding to the current position for greater ease of use, making it simpler to implement than asynchronous processing. In addition, PID control has the flexibility to accommodate tracking discrepancies depending on the coefficient settings.
[0091] Figure 7 is a block diagram showing an example configuration of the control unit 65 that performs movement control processing for the virtual sound image position.
[0092] As shown in Figure 7, the control unit 65 is configured to include a tracking position acquisition unit 71 and an interpolation position calculation unit 72.
[0093] The tracking position acquisition unit 71 acquires the tracking position of the destination based on the tracking position information supplied from the motion capture device 24 when the tracking cycle is completed, and notifies the interpolation position calculation unit 72.
[0094] When the interpolation position calculation unit 72 receives notification of a new tracking position of the destination from the tracking position acquisition unit 71, it sets the destination tracking position as the target position and the tracking position that was previously the destination as the source position. The interpolation position calculation unit 72 then calculates the interpolation position so that the virtual sound image position moves from the current position (the source tracking position or the previous interpolation position) towards the target position with fluctuations for each frame period of the sound playback frame. In other words, the interpolation position calculation unit 72 can calculate the interpolation position by performing interpolation processing using the PID control described above. The interpolation position calculation unit 72 then supplies the interpolation position information indicating the interpolation position to the amplifier 61 and the non-integer delay unit 62.
[0095] As a result, the virtual sound image generation filter 52 uses the amplifier 61 and the non-integer delay unit 62 to set the virtual sound image position r_ps to the interpolated position indicated by the interpolated position information supplied from the interpolation position calculation unit 72. It can therefore play sound that follows the user's hand movements, gradually and smoothly moving from the current position towards the target position.
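The role of the interpolation position calculation unit 72 can be illustrated with a minimal Python sketch. The class name `PidInterpolator`, the helper `interpolate_towards`, the default coefficients, and the choice to treat the differential term as zero on the first frame are all assumptions made for illustration; they are not taken from this publication.

```python
import numpy as np


class PidInterpolator:
    """Hypothetical per-frame PID step that moves a 3-D position toward a target."""

    def __init__(self, kp=0.5, ki=0.0, kd=0.0, tau=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.tau = tau              # control period (one sound playback frame)
        self.integral = np.zeros(3)
        self.prev_error = None      # deviation from the previous frame

    def step(self, current, target):
        current = np.asarray(current, dtype=float)
        error = np.asarray(target, dtype=float) - current
        self.integral = self.integral + error * self.tau
        if self.prev_error is None:
            derivative = np.zeros(3)    # assumed: no differential term on the first frame
        else:
            derivative = (error - self.prev_error) / self.tau
        self.prev_error = error
        delta = self.kp * error + self.ki * self.integral + self.kd * derivative
        return current + delta          # increment is added directly to the current position


def interpolate_towards(current, target, n_frames, pid):
    """Return the interpolated positions for n_frames consecutive playback frames."""
    position = np.asarray(current, dtype=float)
    path = []
    for _ in range(n_frames):
        position = pid.step(position, target)
        path.append(position)
    return path
```

For instance, `interpolate_towards([0, 0, 0], [1, 0, 0], 4, PidInterpolator(kp=0.5))` yields positions whose step size halves every frame. Because each step depends only on the current deviation, the number of playback frames per tracking cycle does not need to be known in advance.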
[0096] Referring to the flowchart shown in Figure 8, the virtual sound image position movement control process performed by the control unit 65 will be described.
[0097] In step S11, the tracking position acquisition unit 71 acquires the tracking position of the destination based on the tracking position information supplied from the motion capture device 24 at the timing when the tracking cycle has ended, and notifies the interpolation position calculation unit 72.
[0098] In step S12, the interpolation position calculation unit 72 uses the tracking position of the destination notified in step S11 as the target position and calculates the interpolation position by performing interpolation processing using the PID control described above at a timing according to the frame period of the sound playback frame. The interpolation position calculation unit 72 then supplies interpolation position information indicating the interpolation position to the amplifier 61 and the non-integer delay unit 62.
[0099] In step S13, the tracking position acquisition unit 71 determines whether the motion capture device 24 has finished capturing the next tracking cycle.
[0100] In step S13, if the tracking position acquisition unit 71 determines that the motion capture device 24 has not yet completed capturing for the next tracking cycle, the process returns to step S12. In this case, the interpolation position calculation unit 72 calculates the next interpolation position at a timing according to the frame cycle of the next sound playback frame, and the same process is repeated thereafter.
[0101] On the other hand, if in step S13 the tracking position acquisition unit 71 determines that the motion capture device 24 has finished capturing for the next tracking cycle, the process returns to step S11. In this case, the same process is repeated based on the tracking position information supplied from the motion capture device 24.
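The flow of steps S11 to S13 can be sketched as two nested loops. This is a simplified illustration, not the disclosed implementation: it assumes a fixed number of sound playback frames per tracking cycle (the text allows the two to be asynchronous), and it uses a plain proportional step as a stand-in for the interpolation in step S12. The function name `run_movement_control` and its parameters are hypothetical.

```python
def run_movement_control(tracking_positions, frames_per_tracking_cycle, gain=0.5):
    """Yield one interpolated (x, y, z) position per sound playback frame."""
    current = None
    for target in tracking_positions:
        # S11: a tracking cycle has ended; take its position as the new target.
        if current is None:
            current = tuple(float(v) for v in target)  # start at the first tracked position
        # S13: until the next tracking cycle completes, keep repeating S12.
        for _ in range(frames_per_tracking_cycle):
            # S12: move part of the remaining distance towards the target.
            current = tuple(c + gain * (t - c) for c, t in zip(current, target))
            yield current


if __name__ == "__main__":
    hand_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    for frame, position in enumerate(run_movement_control(hand_positions, 3)):
        print(frame, position)
```

In a real-time system the inner loop would be driven by the audio callback and the outer loop by tracker notifications; the generator form here only makes the ordering of the steps explicit.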
[0102] As described above, the real-time virtual sound image generation system 41 can improve the sense of localization by having the control unit 65 apply interpolation processing using PID control to the movement control processing of the virtual sound image position, so that the virtual sound image position follows the movement of the user's hand. As a result, the real-time virtual sound image generation system 41 can provide a better user experience.
[0103] Of course, PID control is just one control method, and the control unit 65 may use other control methods for controlling the movement of the virtual sound image position that can calculate an interpolated position so as to improve the sense of localization from the current position to the target position.
[0104] Furthermore, the control unit 65 may not only use the tracking position of the destination as the target position, but may also use positions around the tracking position of the destination as the target position and determine an interpolated position that approaches those surrounding positions while fluctuating. For example, the control unit 65 can acquire not only the position of the user's hand, but also the movements of other body parts around the user's hand (e.g., arms and elbows other than the hand) through motion capture, and by considering the movements of those parts, it can determine an interpolated position that approaches the positions around the tracking position of the destination while fluctuating.
[0105] <Content control processing> The content control process will be explained with reference to Figures 9 and 10.
[0106] The real-time virtual sound image generation system 41 can play sounds of desired content (for example, music, nature sounds, animal sounds, and various other content) when playing sounds using the position of the user's hand acquired by the motion capture device 54 as the virtual sound image position. However, if the virtual sound image position moves in accordance with the user's hand movement, and the content is simply played at a constant volume (for example, the same content is played at the same volume), the user may not be able to obtain a sense of ownership over the sound of that content (for example, the feeling that the sound is coming from the user's own hand).
[0107] Therefore, in the real-time virtual sound image generation system 41, the control unit 65 performs content control processing to adjust the output of the sound of the content, which is reproduced with the user's hand position as the virtual sound image position, according to the user's hand behavior (such as actions and movements), for example by introducing fluctuations in volume or frequency characteristics. As a result, the real-time virtual sound image generation system 41 can improve the user's sense of ownership over the sound of the content, which is reproduced with the user's hand position as the virtual sound image position, and provide a better user experience.
[0108] For example, suppose a user moves their hand so that the gap between the thumb and the other fingers opens and closes (for example, using the hand to imitate the opening and closing of an animal's mouth). In this case, the real-time virtual sound image generation system 41 performs content control processing so that the volume of the content's sound output increases or decreases according to the user's hand movements (for example, an animal's cry becomes louder when the gap between the thumb and the other fingers widens, and quieter when the gap narrows). For example, if the content is a dog barking, the sound output may be controlled so that the dog barks "woof" each time the gap between the thumb and the other fingers widens or closes. Similarly, when the real-time virtual sound image generation system 41 generates sounds such as conversations by characters or avatars other than animals, it may perform content control processing so that the volume of those sounds is adjusted according to the user's hand movements.
[0109] As a result, the real-time virtual sound image generation system 41 can give the user the sensation that their hand has become the mouth of an animal or other object, and can enhance the user's sense of ownership over the sound reproduced with the position of the user's hand as the virtual sound image position.
[0110] Furthermore, the real-time virtual sound image generation system 41 may reproduce sound in a way that fluctuates according to the spacing between the user's fingers, for example, by increasing the volume when the spacing between the user's fingers widens and decreasing the volume when the spacing between the user's fingers narrows.
[0111] Figure 9 is a block diagram showing an example configuration of the control unit 65 that performs content control processing.
[0112] As shown in Figure 9, the control unit 65 is configured to include a behavior recognition unit 81 and a content control unit 82.
[0113] The behavior recognition unit 81 recognizes the user's hand movements based on motion capture information indicating the user's finger movements supplied from the motion capture device 54. When the behavior recognition unit 81 recognizes that a predetermined action has been performed by the user's hand (for example, the movement of opening and closing the gap between the thumb and the other fingers as described above), it notifies the content control unit 82 that the action has been performed.
[0114] The content control unit 82 supplies content control information to the sound source 51 to control the sound of the content in accordance with the user's hand movements recognized by the behavior recognition unit 81 (for example, adjusting the volume when outputting animal sounds as described above).
[0115] As a result, in the real-time virtual sound image generation system 41, the sound source 51 can supply a sound time signal, whose sound has been adjusted according to the content control information, to the virtual sound image generation filter 52. Therefore, the virtual sound image generation filter 52 can move the virtual sound image position in accordance with the user's hand movements, while adjusting the output of the sound played at the virtual sound image position in accordance with the user's hand movements.
[0116] Referring to the flowchart shown in Figure 10, the content control process performed by the control unit 65 will be explained.
[0117] In step S21, the behavior recognition unit 81 recognizes the user's hand movements based on motion capture information indicating the user's finger movements supplied from the motion capture device 54, and determines whether a predetermined behavior has occurred. The behavior recognition unit 81 waits until it determines that a predetermined behavior has occurred, and if it determines that a predetermined behavior has occurred, it notifies the content control unit 82 that the behavior has occurred, and the process proceeds to step S22.
[0118] In step S22, the content control unit 82 supplies content control information to the sound source 51 to control the sound of the content in accordance with the user's hand movements recognized in step S21. After the processing in step S22, the process returns to step S21, and the same process is repeated thereafter.
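Steps S21 and S22 can be illustrated with a small Python sketch of the thumb-and-finger example. All names (`thumb_finger_gap`, `gap_to_gain`, `OpenCloseRecognizer`) and all threshold values in metres are assumptions chosen for illustration; the publication does not specify how the behavior is recognized or how the volume is mapped.

```python
import math


def thumb_finger_gap(thumb_tip, finger_tip):
    """Distance between two tracked fingertip positions (assumed to be in metres)."""
    return math.dist(thumb_tip, finger_tip)


def gap_to_gain(gap, closed_gap=0.01, open_gap=0.08):
    """Map the gap linearly to a playback gain in [0, 1]: wider gap, louder sound."""
    x = (gap - closed_gap) / (open_gap - closed_gap)
    return min(1.0, max(0.0, x))


class OpenCloseRecognizer:
    """Hypothetical stand-in for step S21: detects open/close events with hysteresis."""

    def __init__(self, open_threshold=0.06, close_threshold=0.03):
        self.open_threshold = open_threshold
        self.close_threshold = close_threshold
        self.is_open = False

    def update(self, gap):
        """Return 'opened' or 'closed' when the state changes, otherwise None."""
        if not self.is_open and gap > self.open_threshold:
            self.is_open = True
            return "opened"
        if self.is_open and gap < self.close_threshold:
            self.is_open = False
            return "closed"
        return None
```

In this sketch, an `"opened"` event could trigger one playback of the content (for example, a single bark), while `gap_to_gain` provides the continuous volume adjustment that would be passed to the sound source as content control information in step S22. The two thresholds differ so that small tracking jitter near one threshold does not retrigger the event.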
[0119] As described above, the real-time virtual sound image generation system 41 can enhance the user's sense of ownership over the sound of the content by having the control unit 65 perform content control processing that adjusts the output of the content's sound according to the user's hand movements. As a result, the real-time virtual sound image generation system 41 can provide a better user experience.
[0120] In other words, the real-time virtual sound image generation system 41 can give meaning to the user's hand movements through content control processing, for example, giving the user a stronger feeling that they themselves are moving the sound, or that a part of their own body has a mouth.
[0121] Furthermore, the real-time virtual sound image generation system 41 may adjust the output of the sound played at the virtual sound image location in accordance with the movements of other body parts, such as the feet or head, rather than being limited to the movements of the user's hands. For example, the real-time virtual sound image generation system 41 can produce a buzzing sound that simulates insects, such as mosquitoes, randomly chasing the user's head in accordance with the user's head movements, or, if the user makes a gesture as if swatting at the insect with their hand, it can produce a buzzing sound that simulates insects, such as mosquitoes, fleeing in the opposite direction from the user's head.
[0122] Furthermore, the real-time virtual sound image generation system 41 may, in the virtual sound image position movement control processing and content control processing described above, constantly change the sound source information, including the frequency characteristics and temporal characteristics of the sound source, in order to introduce fluctuations into the sound reproduced at the virtual sound image position.
[0123] The real-time virtual sound image generation system 41, configured as described above, can, for example, in the entertainment field, present a new sensation as if the user's body parts and the sound image were integrated. Furthermore, the real-time virtual sound image generation system 41 can improve localization accuracy in the field of high-presence audio playback, thus improving the three-dimensionality of sound.
[0124] <Example of computer configuration> Next, the series of processes described above (sound image control method) can be performed by hardware or by software. When the series of processes are performed by software, the programs that make up the software are installed on a general-purpose computer or the like.
[0125] Figure 11 is a block diagram showing an example configuration of one embodiment of a computer on which the program that performs the series of processes described above is installed.
[0126] The program can be pre-recorded on the hard disk 105 or ROM 103, which are recording media built into the computer.
[0127] Alternatively, the program can be stored (recorded) on a removable recording medium 111 driven by drive 109. Such a removable recording medium 111 can be provided as so-called packaged software. Examples of removable recording media 111 include flexible disks, CD-ROMs (Compact Disc Read Only Memory), MO (Magneto Optical) disks, DVDs (Digital Versatile Discs), magnetic disks, semiconductor memory, etc.
[0128] In addition to installing the program from the removable storage medium 111 as described above, the program can also be downloaded to the computer via a communication network or broadcasting network and installed on the built-in hard disk 105. That is, the program can be transferred wirelessly to the computer from a download site via a satellite for digital satellite broadcasting, or transferred via a wired connection to the computer via a network such as a LAN (Local Area Network) or the Internet.
[0129] The computer has a built-in CPU (Central Processing Unit) 102, and an input / output interface 110 is connected to the CPU 102 via a bus 101.
[0130] When the CPU 102 receives a command from the user via the input / output interface 110, such as by operating the input unit 107, it executes a program stored in the ROM (Read Only Memory) 103 accordingly. Alternatively, the CPU 102 loads a program stored in the hard disk 105 into the RAM (Random Access Memory) 104 and executes it.
[0131] As a result, the CPU 102 performs processing according to the flowchart described above, or processing according to the configuration of the block diagram described above. The CPU 102 then outputs the processing results as needed, for example, via the input / output interface 110, from the output unit 106, or transmitted from the communication unit 108, or recorded on the hard disk 105.
[0132] The input section 107 consists of a keyboard, mouse, microphone, etc. The output section 106 consists of an LCD (Liquid Crystal Display), speakers, etc.
[0133] In this specification, the processes performed by a computer according to a program do not necessarily have to be performed chronologically in the order described in the flowchart. That is, the processes performed by a computer according to a program include processes that are executed in parallel or individually (e.g., parallel processing or object-based processing).
[0134] Furthermore, the program may be processed by a single computer (processor), or it may be processed in a distributed manner by multiple computers. Moreover, the program may be transferred to a remote computer for execution.
[0135] Furthermore, in this specification, a system means a collection of multiple components (devices, modules (parts), etc.), regardless of whether all components are located in the same enclosure or not. Therefore, multiple devices housed in separate enclosures and connected via a network, and a single device in which multiple modules are housed in one enclosure, are both considered systems.
[0136] Furthermore, for example, the configuration described as a single device (or processing unit) may be divided and configured as multiple devices (or processing units). Conversely, the configurations described above as multiple devices (or processing units) may be combined and configured as a single device (or processing unit). It is also possible to add configurations other than those described above to the configuration of each device (or each processing unit). Moreover, if the overall system configuration and operation are substantially the same, a part of the configuration of one device (or processing unit) may be included in the configuration of another device (or other processing unit).
[0137] Furthermore, for example, this technology can be configured as cloud computing, where a single function is shared and processed collaboratively by multiple devices via a network.
[0138] Furthermore, for example, the program described above can be executed on any device. In that case, the device should have the necessary functions (such as functional blocks) and be able to obtain the necessary information.
[0139] Furthermore, each step described in the flowchart above can be executed by a single device or shared among multiple devices. Additionally, if a single step includes multiple processes, these processes can be executed by a single device or shared among multiple devices. In other words, multiple processes within a single step can be executed as multiple steps. Conversely, processes described as multiple steps can be combined and executed as a single step.
[0140] Furthermore, the program executed by the computer may be executed in a chronological order according to the sequence of steps described herein, or it may be executed in parallel or individually at necessary times, such as when a call is made. In other words, as long as no inconsistencies arise, the processing of each step may be executed in an order different from the sequence described above. Moreover, the processing of the steps of this program may be executed in parallel with the processing of other programs, or it may be executed in combination with the processing of other programs.
[0141] Furthermore, the technologies described in this specification can be implemented independently, as long as they do not create a contradiction. Of course, any multiple technologies can also be implemented in combination. For example, some or all of the technologies described in one embodiment can be combined with some or all of the technologies described in another embodiment. In addition, some or all of the above-mentioned technologies can be implemented in combination with other technologies not mentioned above.
[0142] It should be noted that this embodiment is not limited to the embodiment described above, and various modifications are possible without departing from the spirit of this disclosure. Furthermore, the effects described herein are merely illustrative and not limiting, and other effects may also exist.
Explanation of Symbols
[0143] 11 Real-time virtual sound image generation system, 21 Sound source, 22 Virtual sound image generation filter, 23 Linear speaker array, 24 Motion capture device, 31 FIR filter, 32-1~32-s delay unit, 33-1~33-s amplifier, 34-1~34-s speaker element, 41 Real-time virtual sound image generation system, 51 Sound source, 52 Virtual sound image generation filter, 53 Spherical speaker array, 54 Motion capture device, 61 Amplifier, 62 Non-integer delay unit, 63-0~63-N IIR filter, 64 Gain matrix, 65 Control unit, 71 Tracking position acquisition unit, 72 Interpolation position calculation unit, 81 Behavior recognition unit, 82 Content control unit
Claims
1. A sound image control device comprising: a virtual sound image generation filter that generates a virtual sound image in space using a speaker array composed of multiple speaker elements; and a control unit that controls playback so that the sound reproduced in the sound image changes with fluctuations in accordance with the movement of a body part acquired by motion capture.
2. The sound image control device according to claim 1, wherein the control unit performs interpolation processing that interpolates, according to the frame period of the sound playback frame in which the sound is reproduced, a tracking position obtained by tracking the position of the part at a predetermined tracking period, and performs movement control processing that moves the position of the sound image in accordance with the movement of the part.
3. The sound image control device according to claim 2, wherein the control unit applies the interpolation processing using PID (Proportional Integral Differential) control to the movement control processing.
4. The sound image control device according to claim 3, wherein the control unit includes: a tracking position acquisition unit that acquires the tracking position of the destination based on the tracking position information notified at the time when the tracking period ends; and an interpolation position calculation unit that, using the tracking position of the destination acquired by the tracking position acquisition unit as the target position and the tracking position of the source or the immediately preceding interpolation position as the current position, calculates an interpolation position such that the sound image moves from the current position towards the target position with fluctuations for each frame period.
5. The sound image control device according to claim 1, wherein the control unit performs content control processing that adjusts the sound output of the content reproduced in the sound image according to the behavior of the part.
6. The sound image control device according to claim 5, wherein the control unit includes: a behavior recognition unit that recognizes the behavior of the part based on motion capture information indicating the movement of the part; and a content control unit that controls the sound of the content so that the output is adjusted in accordance with the behavior of the part recognized by the behavior recognition unit.
7. A sound image control method comprising: generating a virtual sound image in space using a speaker array composed of multiple speaker elements; and controlling playback so that the sound reproduced in the sound image changes with fluctuations in accordance with the movement of a body part acquired by motion capture.
8. A program that causes a computer of a sound image control device to execute processing comprising: generating a virtual sound image in space using a speaker array composed of multiple speaker elements; and controlling playback so that the sound reproduced in the sound image changes with fluctuations in accordance with the movement of a body part acquired by motion capture.