Realistic sound field reproduction device and realistic sound field reproduction method
The system uses headset and ambient microphones with location tracking to enhance the sense of presence in satellite venues by reproducing sound fields with high sensitivity, addressing the limitations of existing technologies in recreating immersive audio experiences.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- PANASONIC INTELLECTUAL PROPERTY MANAGEMENT CO LTD
- Filing Date
- 2022-09-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing scene-based spatial audio reproduction technologies struggle to recreate a highly sensitive sense of presence from various sound sources in satellite venues, as they do not adequately account for the relationship between the atmosphere of the sound field captured by microphones and the atmosphere experienced by listeners.
A system that includes headset microphones worn by performers, ambient microphones around the activity area, and location tracking, combined with signal processing to reproduce the sound field in a satellite venue using multiple speakers, incorporating peripheral area sound signals to enhance the sense of presence.
The system effectively reproduces the sense of presence, including sound images from various sound sources, with high sensitivity in the sound field playback space, providing a realistic immersive experience.
Abstract
Description
[Technical Field]
[0001] This disclosure relates to an immersive sound field reproduction device and an immersive sound field reproduction method.
[Background technology]
[0002] Recently, scene-based spatial audio reproduction technology has been attracting attention for its ability to reproduce (play back) sound fields in real time. Scene-based spatial audio reproduction technology records (captures) multi-channel signals with an ambisonic microphone, in which multiple directional microphone elements are arranged on a rigid sphere or a hollow spherical surface. Signal processing is applied to these signals, and the result is played through speakers arranged to surround the listening environment (space), so that a three-dimensional sound field is reproduced (played back) in real time as if the listener were actually present at the location where the ambisonic microphone was placed.
[0003] Prior art relating to sound field reproduction is known, for example, Patent Document 1. Patent Document 1 discloses an audio recording device that receives sound pickup signals from a wireless microphone attached to a subject and generates a multi-channel audio signal based on each audio signal picked up by multiple microphones. This audio recording device assigns the sound pickup signals from the wireless microphone to one or more arbitrary channels of the multi-channel audio signal, combines them at an arbitrary mixing ratio, and records them together with the captured image signal on a recording medium.
[Prior art documents]
[Patent Documents]
[0004] [Patent Document 1] Japanese Patent Publication No. 2006-314078
[Summary of the invention]
[Problems that the invention aims to solve]
[0005] Here, we envision using the scene-based spatial audio reproduction technology described above to reproduce a sound field created by various sound sources (e.g., performers on stage, sound effects) recorded in a large concert hall or other live venue, in one or more satellite venues different from the live venue. Patent Document 1 does not disclose in detail the relationship between the atmosphere of the sound field captured by a microphone and the atmosphere of the sound field captured by a wireless microphone, and it is considered difficult to apply the technology of Patent Document 1 to realize the above-mentioned scenario. Furthermore, when the sense of presence of the live venue is to be reproduced (played back) with high sensitivity in the satellite venue, the sound (in other words, the sound image) produced when performers on stage at the live venue speak may be insufficient from the standpoint of sound field reproduction for listeners in the satellite venue. However, Patent Document 1 does not present a solution for reproducing a highly sensitive sense of presence using the various sound sources described above.
[0006] This disclosure was devised in view of the conventional circumstances described above, and aims to provide a realistic sound field reproduction device and a realistic sound field reproduction method that reproduce the sense of presence, including sound images from various sound sources in a sound field recording space, with high sensitivity in a sound field reproduction space.
[Means for solving the problem]
[0007] This disclosure provides a realistic sound field reproduction device including: an acquisition unit that acquires at least a speech voice signal recorded by a person microphone worn by at least one person who can move within an activity area in a sound field recording space, ambient sound signals recorded by a plurality of ambient microphones arranged around the activity area, and location information of the person; a determination unit that determines, based on the location information of the person, a main ambient microphone, which is one of the plurality of ambient microphones and whose recording area covers the location of the person within the activity area; and a sound field reproduction unit that performs sound field reproduction processing to reproduce the sound field of the sound field recording space using a plurality of speakers arranged in a sound field reproduction space different from the sound field recording space, based at least on the speech voice signal from the person microphone, a first ambient sound signal from the main ambient microphone, and a second ambient sound signal from the ambient microphones other than the main ambient microphone. The acquisition unit further acquires peripheral area sound signals recorded by a sound recording device located in a peripheral area different from the activity area within the sound field recording space. The sound field reproduction unit performs signal processing to reproduce a sound field using the peripheral area sound signals in an area of the sound field reproduction space corresponding to the peripheral area of the sound field recording space; as this signal processing, it uses the speech voice signal, the first ambient sound signal, and the second ambient sound signal as reference signals and performs an erasure process that erases the components of the reference signals contained in the peripheral area sound signals.
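The erasure process is not specified further in this section. One common way to remove known reference components from a microphone signal is a normalized least-mean-squares (NLMS) adaptive filter. The following is a minimal sketch under that assumption, not the method of this disclosure: the function name `nlms_erase`, the tap count, and the step size are hypothetical, and only a single reference channel is handled, whereas the text names three kinds of reference signal.

```python
def nlms_erase(peripheral, reference, taps=8, mu=0.5, eps=1e-8):
    """Remove the part of `peripheral` that is linearly predictable from `reference`.

    Assumption: one reference channel and an NLMS adaptive FIR filter.
    Returns the residual (peripheral minus the estimated reference leakage).
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, x in zip(peripheral, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = d - estimate  # what remains after erasing the reference component
        power = sum(h * h for h in history) + eps
        # Normalized update: step size scaled by the reference power in the window.
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

With several reference signals, one such filter per reference could be run and their estimates subtracted together; that extension is likewise an assumption.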
[0009] These comprehensive or specific embodiments may be implemented as systems, devices, methods, integrated circuits, computer programs, or recording media, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.
[Effects of the Invention]
[0010] According to this disclosure, the sense of presence, including sound images from various sound sources within the sound field recording space, can be reproduced with high sensitivity within the sound field playback space.
[Brief explanation of the drawing]
[0011]
[Figure 1] Block diagram showing an example of the system configuration of the real-time sound field reproduction system according to Embodiment 1.
[Figure 2] Diagram showing an example of the placement of microphones in the stage edge zone and ceiling-mounted zone, viewed from the ceiling of a live music venue.
[Figure 3] Diagram showing examples of recording ranges for microphones in the stage edge zone and ceiling-mounted zone.
[Figure 4] Diagram showing an example of the directivity of a ceiling-mounted zone microphone.
[Figure 5] Diagram showing an example of the placement of stage edge zone microphones and ceiling-mounted zone microphones, viewed from the left (right) edge of the stage to the right (left) edge of the stage in a live music venue.
[Figure 6] Transition diagram showing an example of periodic updates to a matching table.
[Figure 7] Diagram schematically illustrating an example of the operation of selecting zone sound signals based on the performer's position information.
[Figure 8] Diagram showing an example of satellite speaker placement within a satellite venue.
[Figure 9] Sequence diagram showing an example of the operation procedure of the real-time sound field reproduction system according to Embodiment 1 in chronological order.
[Figure 10] Diagram schematically showing the concept from sound field recording to sound field reproduction in scene-based stereophonic reproduction technology using an ambisonic microphone.
[Figure 11] Diagram showing an example of the basis of ambisonic components based on spherical harmonic function expansion for degree n and order m.
[Figure 12] Diagram schematically showing an example of the operation outline of a sound field presence reproduction system.
[Figure 13] Block diagram showing an example of the system configuration of a sound field presence reproduction system according to Embodiment 2.
[Figure 14] Diagram showing an example of the operation outline from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system of FIG. 13.
[Figure 15] Flowchart showing an example of the operation procedure of sound field presence reproduction by a sound field presence reproduction device according to Embodiment 2 in time series.
[Figure 16] Block diagram showing an example of the system configuration of a sound field presence reproduction system according to a modification of Embodiment 2.
[Figure 17] Diagram showing an example of the operation outline from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system of FIG. 16.
[Figure 18] Flowchart showing an example of the operation procedure of sound field presence reproduction by a sound field presence reproduction device according to a modification of Embodiment 2 in time series.
[Figure 19] Block diagram showing an example of the system configuration of a sound field presence reproduction system according to Embodiment 3.
[Figure 20] Diagram showing an example of the operation outline from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system of FIG. 19.
[Figure 21] Flowchart showing an example of the operation procedure of sound field presence reproduction by a sound field presence reproduction device according to Embodiment 3 in time series.
[Figure 22] Block diagram showing an example of the system configuration of a sound field presence reproduction system according to a modification of Embodiment 3.
[Figure 23] Diagram showing an example of the operation outline from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system of FIG. 22.
[Figure 24] Flowchart showing an example of the operation procedure of sound field presence reproduction by a sound field presence reproduction device according to a modification of Embodiment 3 in time series.
[Modes for carrying out the invention]
[0012] Hereinafter, with appropriate reference to the drawings, embodiments specifically disclosing the presence-sensing sound field reproduction device and presence-sensing sound field reproduction method according to this disclosure will be described in detail. However, unnecessarily detailed explanations may be omitted. For example, detailed explanations of already well-known matters and redundant explanations of substantially identical configurations may be omitted. This is to avoid the following explanation becoming unnecessarily verbose and to facilitate understanding by those skilled in the art. The accompanying drawings and the following explanation are provided to enable those skilled in the art to fully understand this disclosure and are not intended to limit the subject matter of the claims.
[0013] (Embodiment 1) Embodiment 1 describes an example in which, while a theatrical performance or the like is being staged in a sound field recording space (for example, a large concert hall or other live venue), sound source signals (sound signals) are recorded from various sound sources, such as the speech of actors or other performers (an example of a person), sounds near the performers' feet, and sounds from the audience away from the stage, such as applause, cheers, and murmurs. The sense of presence, including the sound field, created by these sound source signals is then reproduced in a sound field playback space different from the sound field recording space (for example, a satellite venue such as a movie theater).
[0014] First, with reference to Figure 1, an example of the system configuration of the immersive sound field reproduction system according to Embodiment 1 will be described. Figure 1 is a block diagram showing an example of the system configuration of the immersive sound field reproduction system 1000 according to Embodiment 1.
[0015] The immersive sound field reproduction system 1000 includes various devices located on the live venue LV1 side and various devices located on the satellite venue STL1 side. The live venue side communication unit 6, which is one of the devices on the live venue LV1 side, and the satellite venue side communication unit 11, which is one of the devices on the satellite venue STL1 side, are connected to each other via network NW1 so that data communication can be established. Network NW1 may be a wired network or a wireless network. The wired network may include at least one of the following: wired LAN (Local Area Network), wired WAN (Wide Area Network), or power line communication (PLC), and may also be other network configurations capable of wired communication. On the other hand, the wireless network may include at least one of the following: wireless LAN such as Wi-Fi (registered trademark), wireless WAN, short-range wireless communication such as Bluetooth (registered trademark), or mobile cellular communication network such as 4G or 5G, and may also be other network configurations capable of wireless communication.
[0016] The various devices positioned on the live venue LV1 side include, for example, headset microphones HM1, ..., HM4, zone microphones ZM1, ..., ZM7, ambisonic microphone AMB1, mixer 1, performer tracking system 2, face database 3, encoder 4, matching table database 5, and live venue side communication unit 6. In Figure 1, for the sake of simplicity, four headset microphones and seven zone microphones are shown as examples, but the number of headset microphones is not limited to four, and the number of zone microphones is not limited to seven.
[0017] Each of the headset microphones HM1 to HM4 is an example of a person microphone, and is a wireless microphone worn by at least one person (e.g., an actor or performer) who can move around the stage STG1 (see Figure 2) of the live venue LV1, and is capable of wirelessly transmitting various control signals or data signals to and from mixer 1. Each microphone element of headset microphones HM1 to HM4 is configured using, for example, an omnidirectional microphone or a directional microphone. Each of the headset microphones HM1 to HM4 uses its microphone element to pick up (record) the speech of the person wearing it. Each of the headset microphones HM1 to HM4 may also have a speaker and output the sound signal input to that speaker. Each of the headset microphones HM1 to HM4 is worn so as to face the mouth of the person wearing it. For example, headset microphone HM1 is worn by performer A, headset microphone HM2 is worn by performer B, headset microphone HM3 is worn by performer C, and headset microphone HM4 is worn by performer D. The speech signals captured (recorded) by each of the headset microphones HM1 to HM4 are wirelessly transmitted to mixer 1 as performer sound signals, along with the identification information of the corresponding headset microphone (for example, the sound source ID, which is the identification number of the performer wearing it (see below)).
[0018] Each of the zone microphones ZM1 to ZM7 is an example of a peripheral microphone, positioned around the stage STG1 (see Figure 2) of the live venue LV1, and capable of wired or wireless communication of various control signals or data signals with Mixer 1. Each of the zone microphones ZM1 to ZM7 is constructed using, for example, an omnidirectional or directional microphone element. Each of the zone microphones ZM1 to ZM7 uses its microphone element to pick up (record) sounds occurring near the feet or above the head of the person wearing the headset microphone. The sound signals picked up (recorded) by each of the zone microphones ZM1 to ZM7 are wirelessly transmitted to Mixer 1 as zone sound signals. Details of the placement examples of zone microphones ZM1 to ZM7 will be described later with reference to Figures 2 to 5.
[0019] The Ambisonics Microphone AMB1 is an example of a sound recording device. It is positioned in a surrounding area away from the stage STG1 (see Figure 2) of the live venue LV1 (for example, near the seats on the audience side) and is capable of wired or wireless communication of various control signals or data signals with Mixer 1. The Ambisonics Microphone AMB1 is equipped with four microphone elements Mc1, Mc2, Mc3, and Mc4 (see Figure 10). Microphone element Mc1 records sound from the front upper left direction of the Ambisonics Microphone AMB1 housing (see Figure 10), microphone element Mc2 records sound from the front lower right direction of the housing (see Figure 10), microphone element Mc3 records sound from the rear lower left direction of the housing (see Figure 10), and microphone element Mc4 records sound from the rear upper right direction of the housing (see Figure 10). The Ambisonics microphone AMB1 records sounds from an area different from the stage STG1 (see Figure 2) within the live venue LV1 (specifically, sounds from the audience such as applause, cheers, and murmurs (in other words, the sense of presence from the audience side)). The Ambisonics microphone AMB1 may have more unidirectional microphone elements than the four hollow-arranged microphone elements Mc1, Mc2, Mc3, and Mc4, or it may have an omnidirectional microphone element arranged on a rigid sphere. By using an Ambisonics microphone with a large number of microphone elements, it becomes possible to synthesize an Ambisonics signal of the second order or higher in the audience presence parameter calculation unit 16, which will be described later. The audio signal from the audience side picked up (recorded) by the Ambisonics microphone AMB1 is wirelessly transmitted to the mixer 1 as an audience presence sound signal.
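As an illustration of how the four capsule signals might be combined, the sketch below applies the widely used first-order A-format to B-format sum-and-difference conversion to the capsule directions listed above. It is an assumption that the system uses this conversion; the function name `a_to_b_format` is hypothetical, and the scaling and equalization filters applied by practical converters are omitted.

```python
def a_to_b_format(mc1, mc2, mc3, mc4):
    """Convert one sample from each tetrahedral capsule to first-order B-format.

    Capsule order follows the text: Mc1 front-upper-left, Mc2 front-lower-right,
    Mc3 rear-lower-left, Mc4 rear-upper-right.
    Returns (W, X, Y, Z) without gain normalization (assumption).
    """
    w = mc1 + mc2 + mc3 + mc4  # omnidirectional component
    x = mc1 + mc2 - mc3 - mc4  # front minus rear
    y = mc1 - mc2 + mc3 - mc4  # left minus right
    z = mc1 - mc2 - mc3 + mc4  # up minus down
    return w, x, y, z
```

For example, a sound picked up equally by the two front capsules only yields a positive X component and zero Y and Z components.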
[0020] Mixer 1 is configured using a processor such as a CPU (Central Processing Unit) provided by computer device P1. Mixer 1 receives performer sound signals with identification information (see above) attached from each of the headset microphones HM1 to HM4, zone sound signals from each of the zone microphones ZM1 to ZM7, and audience presence sound signals from the ambisonic microphone AMB1, mixes them as a live venue sound signal, and outputs it to encoder 4. Alternatively, mixer 1 may mix the zone sound signals from each of the zone microphones ZM1 to ZM7 and the audience presence sound signals from the ambisonic microphone AMB1 and output them to encoder 4, while also outputting the performer sound signals from each of the headset microphones HM1 to HM4 individually to encoder 4. Furthermore, mixer 1, encoder 4, and live venue communication unit 6 may be configured by one or more computer devices P1 (for example, a personal computer or server computer). For example, computer device P1 may be called a sound field recording device.
[0021] The performer tracking system 2 is capable of wired or wireless communication of data signals with the face database 3, encoder 4, and matching table database 5. While a theatrical performance is taking place, the performer tracking system 2 tracks performers moving or standing still on the stage STG1 (see Figure 2) of the live venue LV1 by referring to the face database 3, thereby recognizing the performers and constantly issuing or reconfirming their identification information, as well as acquiring their location information. As a result, while a theatrical performance is taking place, the performer tracking system 2 generates matching table data for each performer, which contains records with the performer's identification information (sometimes referred to as "sound source ID") and the performer's location information, and stores it in the matching table database 5. The performer tracking system 2 may also be equipped with an AI camera, for example, which is equipped with artificial intelligence (AI). The performer tracking system 2 acquires the performer's identification information and location information by comparing the image of the performer captured by this AI camera with the performer's face photograph pre-registered in the face database 3. Furthermore, the performer tracking system 2 is not limited to acquiring performer identification and location information using the AI camera and face database 3; it may also acquire this information based on information detected by other devices (for example, sensors individually worn by the performers). The performer tracking system 2 sends the data signals of each performer's identification and location information, along with the captured video footage of that performer, to the encoder 4.
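The matching table can be pictured as a mapping from sound source ID to the most recent position of that performer. The sketch below is illustrative only: the record fields, the two-dimensional coordinates, and the names `MatchingRecord` and `update_matching_table` are assumptions, since the actual table layout is described with Figure 6.

```python
from dataclasses import dataclass


@dataclass
class MatchingRecord:
    sound_source_id: int  # identification information of the performer
    x: float              # hypothetical stage coordinates
    y: float


def update_matching_table(table, detections):
    """Insert or overwrite one record per detected performer and return the table.

    `detections` maps sound source ID to an (x, y) position from the tracker.
    """
    for sound_source_id, (x, y) in detections.items():
        table[sound_source_id] = MatchingRecord(sound_source_id, x, y)
    return table
```

Calling `update_matching_table` once per tracking cycle keeps one current record per performer, which corresponds to the periodic update described in the text.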
[0022] The face database 3 is configured, for example, using flash memory or a hard disk drive, and registers facial photographs of each performer who performs plays or other performances on the stage STG1 (see Figure 2) of the live venue LV1, associating them with the performer's identification information. The face database 3 is connected to the performer tracking system 2 so that it can be accessed, and the performer tracking system 2 refers to it when recognizing performers and acquiring their location information. In Figure 1, the database is conveniently referred to as "DB".
[0023] Encoder 4 is configured using a processor such as a CPU provided in computer device P1. Encoder 4 takes data signals of the live venue sound signal from mixer 1 and data of recognized performer identification information, location information, and captured video from performer tracking system 2 as input. Encoder 4 may also obtain data from performer tracking system 2 (e.g., performer identification information and location information) that matches the performer identification information (e.g., sound source ID) attached to the performer sound signal data of the live venue sound signal by referring to matching table database 5. Encoder 4 generates IP packets for transmission and reception via network NW1 by associating (e.g., IP (Internet Protocol) packetization) the performer sound signal data of performers with common identification information with the data of identification information, location information, and captured video. Encoder 4 also generates IP packets for transmission and reception via network NW1 using zone sound signals from zone microphones ZM1 to ZM7 and audience-side presence sound signals from ambisonic microphone AMB1.
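To make the association step concrete, the sketch below packs one performer's sound source ID, position, and a block of audio bytes into a single payload and unpacks it again, as an encoder and decoder pair might. The byte layout, the two-dimensional position, and the names `HEADER`, `build_payload`, and `parse_payload` are hypothetical; the source does not define a packet format, and the captured video is left out.

```python
import struct

# Hypothetical header: ID (uint16), x and y (float32), payload length (uint32),
# all in network byte order.
HEADER = struct.Struct("!HffI")


def build_payload(sound_source_id, position, pcm):
    """Prefix the audio bytes with the performer's ID and position."""
    x, y = position
    return HEADER.pack(sound_source_id, x, y, len(pcm)) + pcm


def parse_payload(data):
    """Recover (sound_source_id, (x, y), pcm) from a payload built above."""
    sound_source_id, x, y, length = HEADER.unpack_from(data)
    pcm = data[HEADER.size:HEADER.size + length]
    return sound_source_id, (x, y), pcm
```

Keeping the ID and position in the same payload as the audio means the receiving side does not need a separate lookup to pair a performer sound signal with its location.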
[0024] The matching table database 5 is configured, for example, using flash memory or a hard disk drive, and stores data for matching tables MaTL1, MaTL2, ... (see Figure 6) that are generated or updated by the performer tracking system 2. Details of the data for matching tables MaTL1, MaTL2, ... will be described later with reference to Figure 6.
[0025] The live venue-side communication unit 6 is configured using a communication circuit capable of enabling communication using, for example, wired or wireless connections, and performs wired or wireless communication of various control signals or data signals with the satellite venue-side communication unit 11 via the network NW1. The live venue-side communication unit 6 transmits various IP packets generated by the encoder 4 to the satellite venue-side communication unit 11 via the network NW1.
[0026] The various devices located on the satellite venue STL1 side include, for example, a satellite venue side communication unit 11, a decoder 12, a memory unit 13, a video output unit 14, a sound image localization zone determination unit 15, an audience side presence parameter calculation unit 16, a mixing / level balance adjustment unit 17, a sound field reproduction processing unit 18, and satellite speakers SPk1, ..., SPkp. p indicates the number of satellite speakers and is a constant integer of 2 or more.
[0027] The satellite venue side communication unit 11 is configured using a communication circuit capable of achieving communication using, for example, wired or wireless connections, and performs wired or wireless communication of various control signals or data signals with the live venue side communication unit 6 via the network NW1. The satellite venue side communication unit 11 receives various IP packets sent from the live venue side communication unit 6 via the network NW1 and outputs them to the decoder 12. Furthermore, the satellite venue side communication unit 11, decoder 12, memory unit 13, video output unit 14, sound image localization zone determination unit 15, audience side presence parameter calculation unit 16, mixing / level balance adjustment unit 17, and sound field reproduction processing unit 18 may be configured by one or more computer devices P2 (for example, a personal computer or server computer). For example, computer device P2 may be called a sound field reproduction device.
[0028] The decoder 12 is configured using a processor such as a CPU provided in the computer device P2. The decoder 12 decodes the IP packets from the satellite venue side communication unit 11, separating them into performer sound signals, identification information, location information, and captured video data for each performer, which are picked up (recorded) by each of the headset microphones HM1 to HM4, zone sound signals from each of the zone microphones ZM1 to ZM7, and audience-side presence sound signals from the ambisonic microphone AMB1. The decoder 12 sends the captured video data for each performer to the video output unit 14. The decoder 12 outputs the performer sound signals, identification information, location information, and captured video data for each performer, which are picked up (recorded) by each of the headset microphones HM1 to HM4, and the zone sound signals from each of the zone microphones ZM1 to ZM7 to the sound image localization zone determination unit 15. The decoder 12 outputs the audience-side presence sound signal from the ambisonic microphone AMB1 to the audience-side presence parameter calculation unit 16.
[0029] The memory unit 13 is configured using, for example, flash memory or a hard disk drive, and stores the pre-calculated sound image localization parameters for each of the satellite speakers SPk1 to SPkp, or various data acquired by the decoder 12. The method for calculating the sound image localization parameters for each satellite speaker is disclosed, for example, in Japanese Patent Application Publication No. 09-149500, so the explanation is omitted here. In other words, the sound image localization parameters for each satellite speaker are the delay of another satellite speaker (volume assist speaker) that assists the volume from that satellite speaker (localization speaker), and the attenuation level of that volume assist speaker. These sound image localization parameters are used when the sound field reproduction processing unit 18 performs sound image localization processing. Furthermore, the various data acquired by the decoder 12 include, for example, performer sound signals, identification information, location information, and captured video data for each performer, picked up (recorded) by each of the headset microphones HM1 to HM4, as well as zone sound signals from each of the zone microphones ZM1 to ZM7 and audience presence sound signals from the ambisonic microphone AMB1. The memory unit 13 also stores the location information (for example, coordinate information in 3D space) of each of the satellite speakers SPk1 to SPkp within the satellite venue STL1.
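As a rough illustration of how a delay and an attenuation level could be applied to the feed of a volume assist speaker, consider the sketch below. The function name `assist_speaker_feed`, the delay expressed in whole samples, and the attenuation expressed in decibels are assumptions; how the parameters themselves are calculated is left to the cited publication.

```python
def assist_speaker_feed(signal, delay_samples, attenuation_db):
    """Return `signal` delayed by `delay_samples` and attenuated by `attenuation_db`.

    The output has the same length as the input; samples shifted past the end
    are dropped.
    """
    gain = 10.0 ** (-attenuation_db / 20.0)  # dB attenuation to linear gain
    delayed = [0.0] * delay_samples + list(signal)
    return [gain * s for s in delayed[:len(signal)]]
```

The localization speaker would play the undelayed signal, while each assist speaker would play a feed produced with its own stored delay and attenuation.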
[0030] The video output unit 14 is configured, for example, using a projector, and projects the captured video data of each performer received from the decoder 12 onto the screen SCR1 (see Figure 8) located in the satellite venue STL1. As a result, multiple audience members seated in the satellite venue STL1 can view, via the screen SCR1, the video captured with a focus on each performer at the live venue LV1.
[0031] The sound image localization zone determination unit 15 is configured using a processor such as a CPU provided in the computer device P2. The sound image localization zone determination unit 15 is an example of a determination unit. Based on data from the decoder 12 (i.e., the identification information and position information for each performer), it determines a sound image localization zone for reproducing (playing back), within the satellite venue STL1, the sound field (atmosphere) formed by the voices of performers speaking on the stage STG1 (see Figure 2) of the live venue LV1 and by the sounds of their feet when they stand still or move. The sound image localization zone referred to here is one of the sound pickup zones ZN1 to ZN7 shown in Figure 3. Details of an example of this determination will be described later with reference to Figure 7.
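The actual determination is described with Figure 7, which lies outside this section. As a minimal sketch of one plausible rule, the code below picks the zone whose center is closest to the performer's position. The nearest-center rule, the zone coordinates, and the function name `select_zone` are all assumptions for illustration.

```python
import math


def select_zone(position, zone_centers):
    """Return the name of the zone whose center is closest to `position`.

    `zone_centers` maps a zone name such as "ZN1" to hypothetical (x, y)
    coordinates on the stage.
    """
    return min(zone_centers, key=lambda name: math.dist(position, zone_centers[name]))
```

For example, with centers `{"ZN1": (-2.0, 1.0), "ZN2": (0.0, 1.0), "ZN3": (2.0, 1.0)}`, a performer at `(0.4, 0.8)` would be assigned to `"ZN2"`.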
[0032] The audience-side presence parameter calculation unit 16 is configured using a processor such as a CPU provided in the computer device P2. The audience-side presence parameter calculation unit 16 is an example of a sound field reproduction unit, and calculates three-dimensional reproduction parameters for three-dimensionally reproducing (playing) the sound field of audience presence in the live venue LV1 in the satellite venue STL1, based on data from the decoder 12 (i.e., the audience-side presence sound signal from the ambisonic microphone AMB1) and the position information of each satellite speaker SPk1 to SPkp stored in the memory unit 13. The audience-side presence parameter calculation unit 16 uses these three-dimensional reproduction parameters to generate speaker drive signals for each satellite speaker SPk1 to SPkp and outputs them to the mixing / level balance adjustment unit 17. Details of the calculation examples of three-dimensional reproduction parameters and the generation examples of speaker drive signals will be described in detail with reference to Embodiment 2.
[0033] The mixing / level balance adjustment unit 17 is configured using a processor such as a CPU provided in the computer device P2. The mixing / level balance adjustment unit 17 is an example of a sound field reproduction unit and adjusts the balance of the signal levels of each performer's sound signal picked up (recorded) by each of the headset microphones HM1 to HM4, the zone sound signals from each of the zone microphones ZM1 to ZM7, and the speaker drive signals for each satellite speaker. The mixing / level balance adjustment unit 17 mixes (i.e., blends) each signal whose signal level balance has been adjusted. Details of the adjustment of the balance of each signal level will be described later with reference to Figure 7.
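The level balancing and mixing step can be sketched as a weighted sum of equal-length signals. The per-signal gains in decibels and the function name `mix_with_levels` are assumptions; how the balance is actually chosen is described with Figure 7.

```python
def mix_with_levels(signals, gains_db):
    """Scale each equal-length signal by its gain in dB and sum sample by sample."""
    linear = [10.0 ** (g / 20.0) for g in gains_db]
    return [
        sum(a * s[n] for a, s in zip(linear, signals))
        for n in range(len(signals[0]))
    ]
```

In this picture, a performer sound signal might be passed at 0 dB while a zone sound signal is lowered by some amount before the two are summed.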
[0034] The sound field reproduction processing unit 18 is configured using a processor such as a CPU provided in the computer device P2. The sound field reproduction processing unit 18 is an example of a sound field reproduction unit, and uses the sound image localization parameters for each satellite speaker read from the memory unit 13 and the signals output from the mixing / level balance adjustment unit 17 to perform sound field reproduction processing to reproduce (play back) the sound field (atmosphere) formed by the voice of the performers speaking in the live venue LV1 and the sound of their feet when they stand or move, as well as the sound field (atmosphere) formed by the sense of presence on the audience side of the live venue LV1, in the satellite venue STL1. Specifically, in order to reproduce the sound field (atmosphere) formed by the voice of the performers speaking in the live venue LV1 and the sound of their feet when they stand or move, the sound field reproduction processing unit 18 uses the signals output from the mixing / level balance adjustment unit 17 to perform processing to localize the sound image of the performers' voices and foot sounds through each satellite speaker. Furthermore, the sound field reproduction processing unit 18 outputs signals (especially speaker drive signals for each satellite speaker) output from the mixing / level balance adjustment unit 17 via the corresponding satellite speakers in order to reproduce the sound field (atmosphere) formed by the sense of presence on the audience side in the live venue LV1 within the satellite venue STL1.
[0035] The satellite speakers SPk1, ..., SPkp are positioned at intervals within the satellite venue STL1, for example, and each reproduces (outputs) a sound field based on the speaker drive signal from the sound field reproduction processing unit 18. The number of satellite speakers may be varied depending on the sound field to be reproduced. It is also possible to reproduce (play back) a sound field using fewer than p satellite speakers, either by not reproducing the sound field for a specific direction or by combining the speakers with a commonly known virtual sound image generation method such as a transaural system or the VBAP (Vector Based Amplitude Panning) method. Conversely, it is also possible to reproduce (play back) a sound field using more than p speakers. Furthermore, the placement of the satellite speakers is not limited as long as they are positioned to surround the reference position in the satellite venue STL1 (for example, the central position within the satellite venue STL1 where the listener PS1 is seated).
[0036] Next, we will explain the details of the zone microphones in Figure 1 with reference to Figures 2 to 5. Figure 2 shows an example of zone microphone placement as viewed from the ceiling of the stage in the live venue. Figure 3 shows an example of the recording range of the zone microphones. Figure 4 shows an example of the directivity of the zone microphones. Figure 5 shows an example of zone microphone placement as viewed from the left (right) edge of the stage toward the right (left) edge of the stage in the live venue. In Figures 2 to 5, the Z axis is perpendicular to both the X and Y axes and indicates the height direction parallel to the direction of gravity, the X axis is perpendicular to both the Z and Y axes and indicates the width direction of the stage STG1, and the Y axis is perpendicular to both the Z and X axes and indicates the depth direction of the stage STG1. The audience seating for the live venue LV1 is located in the positive direction of the Y axis, and the backstage area of the stage STG1 is located in the negative direction of the Y axis.
[0037] As shown in Figure 2, four zone microphones, ZM1, ZM2, ZM3, and ZM4, are positioned at the edge of the stage STG1 in the live venue LV1. Meanwhile, three zone microphones, ZM5, ZM6, and ZM7, are suspended from above the stage STG1 (for example, from the ceiling).
[0038] As shown in Figure 3, each of the zone microphones ZM1, ZM2, ZM3, and ZM4 captures (records) sounds (including speech) that occur within, for example, approximately elliptical (including circular) sound-collecting zones ZN1, ZN2, ZN3, and ZN4 (recording area or sound image localization zone). The longitudinal length of sound-collecting zones ZN1 to ZN4 is 4 [m]. Therefore, when an actor or other performer stops and steps or dances in front of any of the zone microphones ZM1 to ZM4, the sound of their feet will be mainly captured (recorded) by the zone microphone corresponding to the sound-collecting zone that includes the performer's position. However, this does not exclude the possibility that the sound of their feet may also be captured (recorded) by the zone microphone corresponding to the sound-collecting zone adjacent to the sound-collecting zone that includes the performer's position.
[0039] The depth of stage STG1 is 12m, and zone microphones ZM1, ZM2, ZM3, and ZM4 are positioned at the audience-side end of stage STG1, while zone microphones ZM5, ZM6, and ZM7 are suspended from the ceiling at a height of 4m, approximately 6m away from the audience-side end of stage STG1. These dimensions are merely examples and not limiting.
[0040] As shown in Figure 3, each of the zone microphones ZM5, ZM6, and ZM7 captures (records) sounds (including voices) that occur within, for example, the roughly circular sound-collecting zones ZN5, ZN6, and ZN7 (recording areas). The diameter of the sound-collecting zones ZN5 to ZN7 is 4 [m]. Therefore, when an actor or other performer stops directly below or near any of the zone microphones ZM5 to ZM7 and makes a movement that produces sound near their head, the sound will be mainly captured (recorded) by the zone microphone corresponding to the sound-collecting zone that includes the position above the performer's head. However, this does not exclude the possibility that the sound will also be captured (recorded) by the zone microphone corresponding to the sound-collecting zone adjacent to the sound-collecting zone that includes the position above the performer's head.
[0041] As shown in Figure 4, each of the zone microphones ZM5 to ZM7 can capture (record) sound occurring in pickup zones ZN5 to ZN7 (i.e., roughly circular areas with a diameter of 4m as shown in Figure 3) up to 2.5m away within an 80-degree beam range (directivity) from the center of the zone microphone housing. It is assumed that the area from above the head to near the mouth of performer ACT1, such as an actor, who can move on the stage STG1, is located 2.5m away from the housings of each of the zone microphones ZM5 to ZM7.
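As a rough consistency check on these figures, the footprint of an 80-degree beam at a distance of 2.5 m can be computed directly. The sketch below assumes that the 80 degrees is the full opening angle of a cone pointing straight down from the housing (the description does not state whether it is a full or half angle); the function name is illustrative and not part of the disclosure.

```python
import math

def beam_footprint_diameter(distance_m: float, full_angle_deg: float) -> float:
    """Diameter of the circle covered by a conical beam at the given distance.

    Assumes the beam axis is perpendicular to the covered plane and that
    full_angle_deg is the full opening angle of the cone.
    """
    half_angle = math.radians(full_angle_deg / 2.0)
    return 2.0 * distance_m * math.tan(half_angle)

# Values taken from the description: 80-degree beam, 2.5 m from the housing.
diameter = beam_footprint_diameter(2.5, 80.0)
print(f"footprint diameter: {diameter:.2f} m")
```

Under this assumption the footprint is about 4.2 m across, which agrees with the roughly circular 4 m zones ZN5 to ZN7. The 2.5 m distance itself matches a microphone suspended at 4 m above a performer drawn 1.5 m tall.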
[0042] Therefore, as shown in Figure 5, when performer ACT1, wearing a headset microphone HM1, speaks lines or other words on stage STG1, not only the performer's sound signal picked up by the headset microphone HM1, but also the zone sound signal picked up by at least one zone microphone that has the position of performer ACT1 as its pickup zone, is recorded and input to mixer 1. Note that in Figure 5, the height of performer ACT1 is shown as 1.5 [m] for illustrative purposes, but the height is not limited to 1.5 [m].
[0043] Next, with reference to Figure 6, we will describe the matching table generated or updated by the performer tracking system 2. Figure 6 is a transition diagram showing an example of periodic updates to the matching table.
[0044] Matching tables MaTL1, MaTL2, ... are data that defines the identification information (sound source ID) and location information (e.g., 2D coordinates (X, Y)) of each performer (e.g., performer A, performer B, performer C, performer D, ...) when they are recognized by the performer tracking system 2 on stage STG1 (see Figure 2). The performer's location information is shown as 2D coordinates with a specific location on stage STG1 (see Figure 2) as the origin. These matching tables are updated periodically, for example, at 1-second intervals. As shown in Figure 6, matching table MaTL1 is generated at time t=t1, and matching table MaTL2 is generated 1 second later at t=t2. The update interval is not limited to 1 second.
[0045] For example, in the matching table MaTL1 in Figure 6, at time t=t1, performer A, performer B, and performer C are recognized by the performer tracking system 2. In other words, performer A's sound source ID (1) and location information (4, 1), performer B's sound source ID (2) and location information (-3, 2), and performer C's sound source ID (3) and location information (1, 0) are at least associated in the matching table MaTL1.
[0046] On the other hand, in the matching table MaTL2, at time t=t2, performers A and B are recognized by the performer tracking system 2 in the same way as at time t=t1, but performer C is not recognized, and performer D is newly recognized. In other words, performer A's sound source ID (1) and location information (1, 5), performer B's sound source ID (2) and location information (-3, 0), and performer D's sound source ID (4) and location information (-2, 3) are at least associated in the matching table MaTL2. For example, it is suggested that performer C disappeared from stage STG1 (see Figure 2) at time t=t2, and performer D newly appeared on stage STG1 (see Figure 2) at time t=t2.
[0047] Thus, even if performers move around on the stage STG1 (see Figure 2) during a play or other performance, the performer tracking system 2 recognizes and tracks each performer's identification and location information, and as a result, the data in the matching table showing the performers' locations is updated in real time.
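The transition from MaTL1 to MaTL2 can be illustrated with a small data-structure sketch. The dictionary layout and the function name diff_tables are assumptions made for illustration; only the sound source IDs and coordinates are taken from the example in Figure 6 as described above.

```python
# Matching tables keyed by sound source ID; each value is (performer label, (X, Y)).
# The entries reproduce the example values for MaTL1 (t = t1) and MaTL2 (t = t2).
MaTL1 = {1: ("A", (4, 1)), 2: ("B", (-3, 2)), 3: ("C", (1, 0))}
MaTL2 = {1: ("A", (1, 5)), 2: ("B", (-3, 0)), 4: ("D", (-2, 3))}

def diff_tables(old, new):
    """Classify sound source IDs as appeared, disappeared, or moved between two updates."""
    appeared = sorted(new.keys() - old.keys())
    disappeared = sorted(old.keys() - new.keys())
    moved = sorted(
        sid for sid in old.keys() & new.keys() if old[sid][1] != new[sid][1]
    )
    return {"appeared": appeared, "disappeared": disappeared, "moved": moved}

changes = diff_tables(MaTL1, MaTL2)
print(changes)
```

With the example values, sound source ID 4 (performer D) appears, ID 3 (performer C) disappears, and IDs 1 and 2 (performers A and B) change position.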
[0048] Next, we will explain the selection of zone sound signals according to the performer's position information, referring to Figure 7. Figure 7 is a schematic diagram illustrating an overview of an example operation for selecting zone sound signals according to the performer's position information. To simplify the explanation of Figure 7, it is assumed that a total of four zone microphones are placed in the live venue LV1, and a total of four satellite speakers are placed in the satellite venue STL1. Note that p, which indicates the number of satellite speakers, is not limited to 4.
[0049] The performer tracking system 2 acquires the position information of performer A at a certain time, and it is assumed that the position indicated by that position information is within the sound pickup zone ZN2 of zone microphone ZM2. In the explanation of Figure 7, this "certain time" may be referred to as a "specific time". In this case, the sound image localization zone determination unit 15 determines that zone microphone ZM2 is the main peripheral microphone that mainly picks up (records) sound near performer A's head or feet, based on the position information of performer A at that "specific time". In addition, as the audio signal of performer A, not only the performer sound signal picked up (recorded) by performer A's headset microphone and the zone sound signal from zone microphone ZM2, but also the zone sound signals picked up (recorded) by the other zone microphones ZM1, ZM3, and ZM4 are input to the mixing / level balance adjustment unit 17.
[0050] Furthermore, the performer tracking system 2 acquires the position information of performer B at the same "specific time" (see above), and it is assumed that the position indicated by that position information is within the sound pickup zone ZN3 of zone microphone ZM3. In this case, the sound image localization zone determination unit 15 determines that zone microphone ZM3 is the main peripheral microphone that primarily picks up (records) sound near performer B's head or feet, based on the position information of performer B at that "specific time". In addition, as the audio signal of performer B, not only the performer sound signal picked up (recorded) by performer B's headset microphone and the zone sound signal from zone microphone ZM3, but also the zone sound signals picked up (recorded) by the other zone microphones ZM1, ZM2, and ZM4 are input to the mixing / level balance adjustment unit 17.
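The determination of the main zone microphone from a performer's position can be sketched as a point-in-zone lookup. This is a minimal illustration under stated assumptions: the zone centers, the 1 m semi-minor axis, and the performer coordinates are hypothetical (the description gives only the 4 m longitudinal length of the zones), and the names Zone and determine_main_mic are not from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass(frozen=True)
class Zone:
    mic: str        # zone microphone that covers this zone
    cx: float       # zone center, X [m] (hypothetical)
    cy: float       # zone center, Y [m] (hypothetical)
    semi_x: float   # semi-axis along X [m]; 2.0 gives the 4 m longitudinal length
    semi_y: float   # semi-axis along Y [m] (hypothetical)

    def normalized_distance(self, x: float, y: float) -> float:
        """Value <= 1.0 means the point lies inside the elliptical zone."""
        return ((x - self.cx) / self.semi_x) ** 2 + ((y - self.cy) / self.semi_y) ** 2

# Hypothetical layout of the four zones used in the simplified Figure 7 example.
ZONES = [
    Zone("ZM1", -6.0, 0.0, 2.0, 1.0),
    Zone("ZM2", -2.0, 0.0, 2.0, 1.0),
    Zone("ZM3", 2.0, 0.0, 2.0, 1.0),
    Zone("ZM4", 6.0, 0.0, 2.0, 1.0),
]

def determine_main_mic(position: Tuple[float, float], zones: Sequence[Zone]) -> Optional[str]:
    """Return the microphone whose zone contains the position, or None if no zone does."""
    x, y = position
    best = min(zones, key=lambda z: z.normalized_distance(x, y))
    return best.mic if best.normalized_distance(x, y) <= 1.0 else None

main_mic_a = determine_main_mic((-2.5, 0.3), ZONES)  # hypothetical position of performer A
main_mic_b = determine_main_mic((1.5, -0.2), ZONES)  # hypothetical position of performer B
print(main_mic_a, main_mic_b)
```

Choosing the zone with the smallest normalized distance also gives a deterministic result if neighboring zones overlap, which the description explicitly allows.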
[0051] The mixing / level balance adjustment unit 17 adjusts the levels of the performer sound signals from performers A and B, as well as the zone sound signals from zone microphones ZM2 and ZM3, to be relatively higher than the other performer sound signals and zone sound signals in order to reproduce the atmosphere of the sound field of performers A and B (for example, spoken voice, sounds generated due to the movements of performers A and B near their heads or feet) in the satellite venue STL1, and then mixes each signal. This mixing process is performed for each of the satellite speakers SPk1 to SPk4.
[0052] In other words, in the example shown in Figure 7, the mixing / level balance adjustment unit 17 mixes the same six signals for each of the satellite speakers SPk1, SPk2, SPk3, and SPk4: the performer sound signal from performer A, the performer sound signal from performer B, the zone sound signal from zone microphone ZM1, the zone sound signal from zone microphone ZM2, the zone sound signal from zone microphone ZM3, and the zone sound signal from zone microphone ZM4. The mixed signals for each satellite speaker are input to the sound field reproduction processing unit 18.
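The level balance adjustment and mixing for the Figure 7 example can be sketched as a weighted sum. The gain values (1.0 for emphasized signals, 0.5 for the others), the two-sample test signals, and the function name are assumptions for illustration; the description states only that the emphasized signals are made relatively higher in level.

```python
from typing import Dict, List, Set

def level_balance_and_mix(
    signals: Dict[str, List[float]],
    emphasized: Set[str],
    high_gain: float = 1.0,   # hypothetical gain for emphasized signals
    low_gain: float = 0.5,    # hypothetical gain for all other signals
) -> List[float]:
    """Scale each signal by its gain and sum them sample by sample."""
    length = len(next(iter(signals.values())))
    mixed = [0.0] * length
    for name, samples in signals.items():
        gain = high_gain if name in emphasized else low_gain
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed

# Two-sample toy signals standing in for the six inputs of the Figure 7 example.
signals = {
    "performer_A": [1.0, 0.0],
    "performer_B": [0.0, 1.0],
    "ZM1": [0.2, 0.2],
    "ZM2": [0.4, 0.4],
    "ZM3": [0.4, 0.4],
    "ZM4": [0.2, 0.2],
}
emphasized = {"performer_A", "performer_B", "ZM2", "ZM3"}

# The same six inputs are mixed for every satellite speaker.
speaker_inputs = {
    spk: level_balance_and_mix(signals, emphasized)
    for spk in ("SPk1", "SPk2", "SPk3", "SPk4")
}
print(speaker_inputs["SPk1"])
```

In this sketch the four mixes are identical; any per-speaker differences would come from the sound image localization parameters applied afterwards by the sound field reproduction processing unit 18.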
[0053] Using the input signals for each satellite speaker, the sound field reproduction processing unit 18 obtains from the memory unit 13 the sound image localization parameters that localize performer A, within the satellite venue STL1, at the position corresponding to the sound pickup zone ZN2 of the live venue LV1, and performs sound image localization (reproduction) processing. Likewise, the sound field reproduction processing unit 18 obtains from the memory unit 13 the sound image localization parameters that localize performer B, within the satellite venue STL1, at the position corresponding to the sound pickup zone ZN3 of the live venue LV1, and performs sound image localization (reproduction) processing.
[0054] Next, we will explain the arrangement of satellite speakers within a satellite venue, referring to Figure 8. Figure 8 is a diagram showing an example of satellite speaker arrangement within a satellite venue. To make the explanation of Figure 8 easier to understand, let's assume that a total of 12 satellite speakers are placed in satellite venue STL1. Note that p, which indicates the number of satellite speakers, is not limited to 12.
[0055] As shown in Figure 8, the satellite venue STL1 has many seats in the center where multiple audience members can sit, and a screen SCR1 is provided in front of each seat. The video output unit 14 is, for example, a projector, which projects the data signal of the performer's captured video sent from the live venue side communication unit 6 of the live venue LV1 onto the screen SCR1 and outputs it. As a result, visitors to the satellite venue STL1 can view the captured video projected onto the screen SCR1 and experience the theatrical performance or other performance taking place on the stage STG1 (see Figure 2) of the live venue LV1.
[0056] On the back side of the screen SCR1, for example, four satellite speakers SPk1, SPk2, SPk3, and SPk4 are arranged at regular intervals. Since each of the satellite speakers SPk1 to SPk4 is located on the back side of the screen SCR1, it is preferable that the sound field reproduction processing unit 18 mainly uses them for sound image localization of the sound field caused by the speech of performers on the stage STG1 (see Figure 2) of the live venue LV1, and sounds generated near their heads or feet. In addition, the sound field reproduction processing unit 18 can also localize the sound field (sound image Sim1) of the performers in the live venue LV1 to a position midway between the satellite speakers SPk1 and SPk2 by using sound image localization parameters for the satellite speakers SPk1 and SPk2 from the memory unit 13. Similarly, the sound field reproduction processing unit 18 can also localize the sound field of the performers in the live venue LV1 at an intermediate position between satellite speakers SPk2 and SPk3 by using sound image localization parameters for satellite speakers SPk2 and SPk3 from the memory unit 13. Similarly, the sound field reproduction processing unit 18 can also localize the sound field of the performers in the live venue LV1 at an intermediate position between satellite speakers SPk3 and SPk4 by using sound image localization parameters for satellite speakers SPk3 and SPk4 from the memory unit 13.
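One common way to place a sound image between two adjacent speakers is amplitude panning. The sketch below uses constant-power panning as a stand-in; it is not the method of this disclosure, whose sound image localization parameters are read from the memory unit 13 and are not specified here. The function name and the fraction parameter are illustrative.

```python
import math
from typing import Tuple

def constant_power_pan(fraction: float) -> Tuple[float, float]:
    """Gains for two adjacent speakers.

    fraction = 0.0 places the image at the first speaker, 1.0 at the second,
    and 0.5 midway. The squared gains always sum to 1 (constant power).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    theta = fraction * math.pi / 2.0
    return math.cos(theta), math.sin(theta)

# Image midway between two speakers, e.g. SPk1 and SPk2.
g1, g2 = constant_power_pan(0.5)
print(f"gain 1: {g1:.4f}, gain 2: {g2:.4f}")
```

At the midpoint both speakers receive the same gain, so the perceived image sits between them while the total radiated power stays constant.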
[0057] Here, it is preferable that adjacent satellite speakers SPk1 and SPk2 are arranged such that the angle between the vector from the listener PS1 seated in the audience to satellite speaker SPk1 and the vector to satellite speaker SPk2 is within 20 degrees. This enables high-precision sound image localization processing by the sound field reproduction processing unit 18. This principle applies not only to the angle between satellite speakers SPk1 and SPk2 and the listener PS1, but also to adjacent satellite speakers positioned on the back side of the screen SCR1. In other words, the same applies to the angle between satellite speakers SPk2 and SPk3 and the listener PS1, and the angle between satellite speakers SPk3 and SPk4 and the listener PS1.
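The 20-degree condition can be checked for a candidate layout by computing the angle between the two listener-to-speaker vectors. The coordinates below (listener at the origin, four speakers 6 m ahead and 2 m apart) are hypothetical and chosen only to illustrate the calculation.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

def subtended_angle_deg(listener: Point, spk_a: Point, spk_b: Point) -> float:
    """Angle in degrees between the vectors from the listener to two speakers."""
    ax, ay = spk_a[0] - listener[0], spk_a[1] - listener[1]
    bx, by = spk_b[0] - listener[0], spk_b[1] - listener[1]
    dot = ax * bx + ay * by
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))

# Hypothetical plan-view layout in meters.
listener = (0.0, 0.0)
speakers = {"SPk1": (-3.0, 6.0), "SPk2": (-1.0, 6.0), "SPk3": (1.0, 6.0), "SPk4": (3.0, 6.0)}

names = list(speakers)
angles = {
    (a, b): subtended_angle_deg(listener, speakers[a], speakers[b])
    for a, b in zip(names, names[1:])
}
for pair, angle in angles.items():
    print(pair, f"{angle:.1f} deg", "OK" if angle <= 20.0 else "too wide")
```

In this layout every adjacent pair subtends less than 20 degrees at the listener; moving the listener closer to the screen or spacing the speakers further apart widens the angles.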
[0058] Furthermore, satellite speakers SPk5, SPk6, SPk7, SPk8, SPk9, SPk10, SPk11, and SPk12 are arranged at a certain distance apart to cover the sides of the extensive seating area within the satellite venue STL1. Since each of the satellite speakers SPk5 to SPk12 is positioned to cover the sides of the seating area within the satellite venue STL1, it is preferable that the sound field reproduction processing unit 18 primarily uses them to output the sense of presence from the audience side, such as applause, cheers, and murmurs from the audience side of the live venue LV1. Details of the output of the sense of presence from the audience side using each of these satellite speakers SPk5 to SPk12 (in other words, the reproduction of speaker drive signals for each of the satellite speakers SPk5 to SPk12) will be described in detail with reference to Embodiment 2.
[0059] Next, with reference to Figure 9, the operation procedure of the immersive sound field reproduction system 1000 according to Embodiment 1 will be described. Figure 9 is a sequence diagram showing an example of the operation procedure of the immersive sound field reproduction system 1000 according to Embodiment 1 in chronological order. The series of processes from step St1 to step St7 are executed within the live venue LV1, and this series of processes is repeated as a processing unit while a theatrical performance or the like is being staged. For the sake of simplicity in explaining Figure 9, the explanation will be based on the example of one performer A (see Figure 6), such as an actor, being on the stage STG1 (see Figure 2), but the number of performers can be two or more.
[0060] In Figure 9, the performer tracking system 2 recognizes performer A on the stage STG1 (see Figure 2) of the live venue LV1, and sends the data signals of performer A's identification information (sound source ID), location information, and captured video to the encoder 4 via the mixer 1 (step St1). Also, upon recognizing performer A, the performer tracking system 2 generates or updates data in a matching table (see Figure 6) containing performer A's identification information (sound source ID) and location information, and stores it in the matching table database 5 (step St2).
[0061] The headset microphone HM1 worn by performer A captures (records) the voice of performer A on stage STG1 (see Figure 2), and sends performer A's performer sound signal to encoder 4 via mixer 1 (step St3). The zone microphones ZM1 to ZM7, each positioned around stage STG1 (see Figure 2), capture (record) sounds from near performer A's head or feet, and send the resulting zone sound signals to encoder 4 via mixer 1 (step St4). The ambisonic microphone AMB1, positioned in a surrounding area away from stage STG1 (see Figure 2) (for example, near the space between seats on the audience side), captures (records) the audience-side presence sound signal, which is the sound of the audience-side presence (specifically, applause, cheers, murmuring, etc. from the audience) away from stage STG1 (see Figure 2), and sends it to encoder 4 via mixer 1 (step St5).
[0062] Encoder 4 obtains the data from performer tracking system 2 (e.g., the performer's identification information and location information) that matches the identification information of performer A (e.g., sound source ID) attached to the performer sound signal data sent in step St3, by referring to the data sent in step St1 or to the matching table database 5 (step St6). Encoder 4 generates IP packets for transmission and reception via network NW1 by associating (e.g., by IP packetization) the performer sound signal data of performer A with the identification information, location information, and captured video data (see step St1 or step St6) that share the same identification information (step St6). Encoder 4 also generates IP packets for transmission and reception via network NW1 using the zone sound signals from each of the zone microphones ZM1 to ZM7 and the audience-side presence sound signal from the ambisonic microphone AMB1 (step St6). Encoder 4 sends each IP packet generated in step St6 to the satellite venue side communication unit 11 via the live venue side communication unit 6 and the network NW1 (step St7).
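The association performed in step St6 amounts to a lookup by sound source ID followed by bundling the matched data with the audio. The sketch below illustrates that idea only; the JSON layout, the field names, and the function build_payload are hypothetical, and the actual packet format used by encoder 4 is not specified in this description.

```python
import json
from typing import Dict, List

def build_payload(source_id: int, audio_frame: List[float], matching_table: Dict[int, dict]) -> bytes:
    """Bundle one audio frame with the tracking data that shares its sound source ID."""
    entry = matching_table.get(source_id)
    if entry is None:
        raise KeyError(f"sound source ID {source_id} is not in the matching table")
    payload = {
        "sound_source_id": source_id,
        "performer": entry["performer"],
        "location": list(entry["location"]),
        "audio": audio_frame,
    }
    return json.dumps(payload).encode("utf-8")

# Matching table entry for performer A at t = t1, using the example values of MaTL1.
matching_table = {1: {"performer": "A", "location": (4, 1)}}

packet_body = build_payload(1, [0.0, 0.25, -0.25], matching_table)
decoded = json.loads(packet_body.decode("utf-8"))
print(decoded["sound_source_id"], decoded["location"])
```

Keeping the sound source ID in every payload is what allows the satellite venue side to pair each performer sound signal with the correct position after decoding.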
[0063] The decoder 12 receives each IP packet sent in step St7 via the satellite venue communication unit 11, performs decoding processing, and extracts various data signals (step St8).
[0064] The sound image localization zone determination unit 15 determines, based on the data from the decoder 12 (i.e., the identification information and position information data of performer A), a sound image localization zone for reproducing (playing back) the sound field (atmosphere) formed by the voice of performer A speaking on the stage STG1 (see Figure 2) of the live venue LV1, as well as the sound of their feet when they stop or move (step St9). The sound image localization zone here is one of the sound pickup zones ZN1 to ZN7 shown in Figure 3. The audience-side presence parameter calculation unit 16 calculates three-dimensional reproduction parameters for reproducing (playing back) the sound field of the audience-side presence in the live venue LV1 three-dimensionally in the satellite venue STL1, based on the data from the decoder 12 (i.e., the audience-side presence sound signal from the ambisonic microphone AMB1) and the position information of the satellite speakers SPk1 to SPkp stored in the memory unit 13 (step St10). The audience-side presence parameter calculation unit 16 uses these three-dimensional reproduction parameters to generate speaker drive signals (an example of three-dimensionally reproduced sound) for each of the satellite speakers SPk1 to SPkp and outputs them to the mixing / level balance adjustment unit 17 (step St11).
[0065] The sound image localization zone determination unit 15 reads and acquires the sound image localization parameters for each of the satellite speakers SPk1 to SPkp from the memory unit 13 (step St12). The sound image localization zone determination unit 15 sends the sound image localization zone information determined in step St9 and the sound image localization parameter information for each of the satellite speakers SPk1 to SPkp acquired in step St12 to the mixing / level balance adjustment unit 17 (step St13).
[0066] The mixing / level balance adjustment unit 17 uses the information sent in step St13 to adjust the balance of the signal levels of each performer's sound signal picked up (recorded) by each of the headset microphones HM1 to HM4, the zone sound signals from each of the zone microphones ZM1 to ZM7, and the speaker drive signals for each satellite speaker (step St14). The mixing / level balance adjustment unit 17 mixes (i.e., blends) each signal whose signal level balance has been adjusted (step St14), and sends the adjusted and mixed signal to the sound field reproduction processing unit 18 (step St15).
[0067] The sound field reproduction processing unit 18 uses the sound image localization parameters for each satellite speaker and the signals output from the mixing / level balance adjustment unit 17 to perform sound field reproduction processing to reproduce (play back) the sound field (atmosphere) formed by the voice of the performers speaking and the sound of their feet when they stand or move in the live venue LV1, as well as the sound field (atmosphere) formed by the sense of presence on the audience side of the live venue LV1, in the satellite venue STL1 (step St16). Specifically, in order to reproduce the sound field (atmosphere) formed by the voice of the performers speaking and the sound of their feet when they stand or move in the live venue LV1, the sound field reproduction processing unit 18 uses the signals output from the mixing / level balance adjustment unit 17 to perform processing to localize the sound image of the performers' voices and sound of their feet through each satellite speaker (step St16). Furthermore, the sound field reproduction processing unit 18 outputs the signals output from the mixing / level balance adjustment unit 17 (especially the speaker drive signals for each satellite speaker) via the corresponding satellite speakers in order to reproduce the sound field (atmosphere) formed by the sense of presence on the audience side in the live venue LV1 within the satellite venue STL1 (step St16).
[0068] As described above, in the immersive sound field reproduction system 1000 according to Embodiment 1, the immersive sound field reproduction device includes: an acquisition unit (satellite venue side communication unit 11) that acquires at least a speech voice signal (performer sound signal) recorded by a person microphone (headset microphone HM1, etc.) worn by at least one person (performer ACT1) who is able to move within the activity area (stage STG1) in the sound field recording space (live venue LV1), ambient sound signals recorded by a plurality of ambient microphones (zone microphones ZM1 to ZM7) arranged around the activity area, and position information of the person; a determination unit (sound image localization zone determination unit 15) that determines, based on the position information of the person, a main ambient microphone (for example, zone microphone ZM2) which is one of the plurality of ambient microphones and whose recording area includes the location of the person within the activity area; and a sound field reproduction unit (audience-side presence parameter calculation unit 16, mixing / level balance adjustment unit 17, sound field reproduction processing unit 18) that performs sound field reproduction processing to reproduce the sound field in the sound field recording space using a plurality of speakers (satellite speakers SPk1 to SPkp) placed in a sound field reproduction space (satellite venue STL1) different from the sound field recording space, based on at least the speech voice signal from the person microphone, the first ambient sound signal from the main ambient microphone (for example, the zone sound signal from zone microphone ZM2), and the second ambient sound signals from the ambient microphones other than the main ambient microphone (for example, the respective zone sound signals from zone microphones ZM1, ZM3 to ZM7).
As a result, the immersive sound field reproduction device can reproduce with high sensitivity the sense of presence, including sound images from various sound sources (for example, at least one performer) within the sound field recording space (live venue LV1), in a sound field reproduction space (satellite venue STL1) that is different from the sound field recording space.
[0069] Furthermore, the sound field reproduction unit (sound field reproduction processing unit 18) performs sound field reproduction processing to reproduce the sound field in the sound field recording space (live venue LV1) by emphasizing the first ambient sound signal (zone sound signal from zone microphone ZM2) more than the second ambient sound signal (for example, the respective zone sound signals from zone microphones ZM1, ZM3 to ZM7) within the sound field reproduction space (satellite venue STL1). As a result, the immersive sound field reproduction device can relatively emphasize the speech of performers who are on the activity area (stage STG1) and speaking lines, as well as sounds near their heads or feet, thereby reproducing the sound field in a way that makes the performers' speech and sounds near their heads or feet stand out even within the satellite venue STL1, and improving the reproducibility of sound field reproduction.
[0070] Furthermore, the acquisition unit (satellite venue side communication unit 11) acquires peripheral area sound signals (audience-side presence sound signals) recorded by a sound recording device (ambisonic microphone AMB1) located in a peripheral area (audience side) different from the activity area (stage STG1) within the sound field recording space (live venue LV1). The sound field reproduction unit (sound field reproduction processing unit 18) performs signal processing to reproduce a sound field using the peripheral area sound signals in an area within the sound field reproduction space that corresponds to the peripheral area within the sound field recording space. As a result, the immersive sound field reproduction device can secondarily reproduce (play back) sounds such as applause, cheers, and murmurs (audience-side presence) from the audience side within the live venue LV1, which are picked up (recorded) by the sound recording device, in an area corresponding to the audience side within the satellite venue STL1.
[0071] Furthermore, the sound field reproduction unit (mixing / level balance adjustment unit 17) mixes the spoken audio signal (performer's voice signal), the first ambient sound signal (zone sound signal from zone microphone ZM2), and the second ambient sound signal (for example, the respective zone sound signals from zone microphones ZM1, ZM3 to ZM7), and adjusts the balance of the signal levels, outputting them through multiple speakers (satellite speakers SPk1 to SPkp). As a result, the immersive sound field reproduction device can accurately reproduce (play back) the atmosphere of the sound field created by the performer, including not only the spoken audio such as the performer's lines occurring in the activity area (stage STG1) within the live venue LV1, but also sounds near the performer's head or feet, even within the satellite venue STL1.
[0072] Furthermore, the sound field reproduction unit (mixing / level balance adjustment unit 17) mixes the spoken audio signal (performer's voice signal), the first ambient sound signal (zone sound signal from zone microphone ZM2), the second ambient sound signal (for example, the respective zone sound signals from zone microphones ZM1, ZM3 to ZM7), and the peripheral area sound signal (audience-side presence sound signal), and adjusts the balance of the signal levels, outputting them through multiple speakers (satellite speakers SPk1 to SPkp). As a result, the immersive sound field reproduction device can reproduce (play back) with high precision not only the spoken audio such as the performer's lines and sounds near the performer's head or feet that occur in the activity area (stage STG1) within the live venue LV1, but also the atmosphere of the audience-side presence sound within the live venue LV1, even in the satellite venue STL1.
[0073] Furthermore, the acquisition unit (satellite venue side communication unit 11) repeatedly acquires the position information of a person, which may be updated periodically. The determination unit (sound image localization zone determination unit 15) determines the main ambient microphone each time the position information of the person is updated. As a result, the immersive sound field reproduction device can acquire the zone sound signal from the main ambient microphone that matches the position information of the person (performer) each time the person moves on the stage STG1 in the live venue LV1, so that even if the performer's position information changes, the sound field can be reproduced adaptively within the satellite venue STL1.
[0074] Furthermore, if there are multiple people (performers such as actors), the determination unit (sound image localization zone determination unit 15) determines the main ambient microphone for each person based on the position information of that person. As a result, the immersive sound field reproduction device can reproduce the sound image localization of each performer with high precision, even when there are multiple performers on the stage STG1 within the live venue LV1.
[0075] Furthermore, the sound field reproduction unit (for example, the sound field reproduction processing unit 18) uses the spoken audio signal (performer's voice signal), the first ambient sound signal (zone sound signal from zone microphone ZM2), and the second ambient sound signal (for example, the respective zone sound signals from zone microphones ZM1, ZM3 to ZM7) as reference signals and performs an erasure process as signal processing to erase the reference signal components contained in the ambient area sound signal (audience-side presence sound signal). As a result, the immersive sound field reproduction device can effectively erase the reference signal components that are thought to be leaking into the ambient area sound signal, and can generate a high-quality ambient area sound signal.
[0076] Furthermore, the sound field reproduction unit (for example, the sound field reproduction processing unit 18) encodes the signal after the erasure process and, based on the encoded signal, generates speaker drive signals for each of the multiple speakers (satellite speakers SPk1 to SPkp) to reproduce the sense of presence in the sound field recording space within the sound field reproduction space. As a result, the immersive sound field reproduction device can generate speaker drive signals that can reproduce the sense of presence on the audience side within the live venue LV1 by taking into account the positional information of each of the multiple satellite speakers SPk1 to SPkp arranged in the space of the satellite venue STL1.
[0077] Furthermore, the sound field reproduction unit (for example, the sound field reproduction processing unit 18) outputs the speaker drive signal generated for each speaker (satellite speakers SPk1 to SPkp) from the corresponding speaker (satellite speaker). For example, the sound field reproduction processing unit 18 outputs the speaker drive signal generated for satellite speaker SPk1 from satellite speaker SPk1. Similarly, the speaker drive signals generated for the other satellite speakers are output from the corresponding satellite speakers. As a result, the immersive sound field reproduction device can reproduce the sense of presence on the audience side in the live venue LV1 with high precision within the satellite venue STL1 by outputting speaker drive signals for each of the multiple satellite speakers SPk1 to SPkp arranged in the space of the satellite venue STL1.
[0078] (Background leading to Embodiment 2 and beyond) Recently, scene-based spatial audio reproduction technology has been attracting attention for its ability to reproduce (play back) sound fields in real time. Scene-based spatial audio reproduction technology is a method that uses ambisonic microphones, in which multiple omnidirectional microphone elements are arranged on a rigid sphere or multiple directional microphones are arranged on a hollow spherical surface, to capture multi-channel signals. By applying signal processing to these signals, speakers are arranged to surround the listening environment (space) to reproduce (play back) a three-dimensional sound field in real time, as if the listener were actually present at the location where the ambisonic microphones were placed.
[0079] Prior art relating to sound field reproduction is known, for example, in Reference Patent Document 1. Reference Patent Document 1 discloses an audio recording device that receives sound pickup signals from a wireless microphone attached to a subject and generates a multi-channel audio signal based on each audio signal picked up by multiple microphones. This audio recording device assigns the sound pickup signals from the wireless microphone to one or more arbitrary channels of the multi-channel audio signal, combines them at an arbitrary mixing ratio, and records them together with the captured image signal on a recording medium.
[0080] (Reference Patent Document 1) Japanese Unexamined Patent Publication No. 2006-314078
[0081] Here, consider using the scene-based spatial sound reproduction technology described above by placing an ambisonic microphone (see above) on the audience side of a large concert hall or other live venue to capture the sense of presence on the audience side (hereinafter sometimes referred to as "audience presence"), such as applause, murmurs, and cheers, during a theatrical performance or other show taking place on the main stage, and then reproducing that audience presence in one or more satellite venues different from the live venue. Reference Patent Document 1 does not disclose in detail the relationship between the atmosphere of the sound field captured by a microphone and the atmosphere of the sound field captured by a wireless microphone, and it is considered difficult to apply the technology of Reference Patent Document 1 to realize this scenario. Furthermore, even if the ambisonic microphone is placed on the audience side, it is highly likely to capture not only the audience presence but also sound signals that have propagated through the space within the live venue, such as the spoken words of actors on stage, sound effects, background music, and original sound sources. In this case, sound components other than the audience presence in the live venue are mixed in, making it difficult to accurately reproduce the sound field of the audience presence for listeners in satellite venues. Reference Patent Document 1 does not present a solution for reproducing, with high sensitivity in a satellite venue, the sound field of the audience presence captured in the live venue.
[0082] Therefore, in the following embodiments 2 and beyond, we will describe examples of sound field presence reproduction devices and sound field presence reproduction methods that accurately reproduce the atmosphere of audience presence in a sound recording space captured using an ambisonic microphone within at least one satellite venue. It should be noted that "presence sound field reproduction" in Embodiment 1 and "sound field presence reproduction" in each of the following embodiments are synonymous (in other words, the terms may be substituted for each other).
[0083] In the following embodiment, a scene-based spatial audio reproduction technology using an ambisonic microphone as a sound source device for capturing sound source signals such as sounds, music, and human voices within a sound-collecting space (e.g., a live music venue) will be illustrated and explained. In this scene-based spatial audio reproduction technology, the signal (collected signal) or point source that can be represented as a monaural signal, collected by multiple microphone elements constituting the ambisonic microphone, is represented (encoded) as an intermediate representation ITMR1 (see Figure 1) using spherical harmonics or as a B-format signal, thereby uniformly handling sound fields arriving from all directions within the ambisonic signal domain (see below). Furthermore, a speaker drive signal is generated by decoding this intermediate representation, thereby realizing the reproduction of the desired sound field within the reproduction space (e.g., a satellite venue).
[0084] Hereinafter, "sound field" is defined as the space (including location) in which sound spreads. Sound propagating within a sound field includes sound from one or more sound sources propagating within the target space. Here, sound sources include not only sound sources of various performances (e.g., band performances, musical plays) taking place on the main stage of a sound-collecting space such as Live Venue LV1, but also sounds that create a sense of presence, such as cheers, murmurs, roars, and applause, that occur in the audience area away from the main stage within Live Venue LV1.
[0085] (Embodiment 2) First, the concept of scene-based spatial audio reproduction technology will be explained with reference to Figure 10. Figure 10 is a schematic diagram illustrating the concept from sound field capture to sound field reproduction in scene-based spatial audio reproduction technology using the ambisonic microphone AMB1. The ambisonic microphone AMB1 is placed at a predetermined position on the audience side within the sound capture space (e.g., live venue LV1). In live venue LV1, sound signals propagating within that space are captured by the ambisonic microphone AMB1. For example, if a band with multiple members is performing on the main stage of live venue LV1, sound signals from various sound sources such as vocals, bass, guitar, and drums are captured. Likewise, if a musical is being performed, speech signals from one or more actors (sound sources) are captured. In addition, sound signals that convey the sense of presence on the audience side, such as cheers, murmurs, roars, and applause, are also captured by the ambisonic microphone AMB1.
[0086] As an example of a sound pickup device, the ambisonic microphone AMB1 is equipped with four microphone elements Mc1, Mc2, Mc3, and Mc4. The microphone elements Mc1 to Mc4 are arranged in a hollow configuration so that they face four vertices of the cube CB1 in Figure 10 from its center, with direction Dr1 being the front direction, and each has a unidirectional pattern in the direction of its vertex. Microphone element Mc1 faces the front-left-up (FLU) direction of the ambisonic microphone AMB1 and mainly picks up sound from that direction. Microphone element Mc2 faces the front-right-down (FRD) direction of the ambisonic microphone AMB1 and mainly picks up sound from that direction. Microphone element Mc3 faces the back-left-down (BLD) direction of the ambisonic microphone AMB1 and mainly picks up sound from that direction. Microphone element Mc4 faces the back-right-up (BRU) direction of the ambisonic microphone AMB1 and mainly picks up sound from that direction.
[0087] The sound pickup signals from these four directions (i.e., FLU, FRD, BLD, BRU) are called A-format signals. A-format signals are not used directly; they are converted to B-format signals, which are intermediate representations (ITMR1) with directional characteristics. B-format signals include, for example, the B-format signal W for omnidirectional sound, the B-format signal X for front-back sound, the B-format signal Y for left-right sound, and the B-format signal Z for up-down sound. A-format signals are converted to B-format signals using the following conversion formulas.
[0088]
W = FLU + FRD + BLD + BRU
X = FLU + FRD - BLD - BRU
Y = FLU - FRD + BLD - BRU
Z = FLU - FRD - BLD + BRU
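As an illustration only, the conversion formulas above can be sketched in Python as follows. The function names `a_to_b_format` and `convert_block` are hypothetical and do not appear in this disclosure; the arithmetic simply restates the four sums and differences given above, with no gain normalization applied.

```python
def a_to_b_format(flu, frd, bld, bru):
    """Convert one A-format sample (four capsule signals) to B-format (W, X, Y, Z)."""
    w = flu + frd + bld + bru  # omnidirectional component
    x = flu + frd - bld - bru  # front-back component
    y = flu - frd + bld - bru  # left-right component
    z = flu - frd - bld + bru  # up-down component
    return w, x, y, z


def convert_block(a_samples):
    """Apply the conversion to a sequence of (FLU, FRD, BLD, BRU) tuples."""
    return [a_to_b_format(*sample) for sample in a_samples]
```

A signal present only on the FLU element contributes with a positive sign to all four B-format signals, which is consistent with FLU being a front, left, and upward direction.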
[0089] By combining B-format signals W, X, Y, and Z, sound signals for all directions (front, back, left, right, up, and down) can be obtained. Furthermore, by changing the signal levels of each of the B-format signals W, X, Y, and Z and combining them, it is possible to generate sound signals with any directional characteristics among the front, back, left, right, up, and down directions. For example, as shown in Figure 10, a total of eight satellite speakers SPk1, SPk2, SPk3, SPk4, SPk5, SPk6, SPk7, and SPk8 are placed, one at each vertex of a reproduction space (e.g., satellite venue STL1) modeled as a cube, and a three-dimensional coordinate system similar to that of the sound collection space (e.g., live venue LV1) is considered (i.e., the front, back, left, right, up, and down directions are parallel to or the same as those of the sound collection space). Note that the number of satellite speakers is given here as eight for the sake of clarity, but it goes without saying that the number is not limited to eight.
[0090] The positions of satellite speakers SPk1 to SPk8 can be identified by a predetermined distance and angles (azimuth angle θ_i and elevation angle φ_i) from the reference position (e.g., center position LSP1) in the reproduction space (e.g., satellite venue STL1). Here, i is a variable that indicates a satellite speaker located in the reproduction space (e.g., satellite venue STL1), and in the example of Figure 10, it takes any integer from 1 to 8.
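The following is a minimal sketch of how speaker directions given as an azimuth angle and an elevation angle might be turned into unit vectors. The axis convention (x toward the front, y toward the left, z upward, azimuth measured counterclockwise from the front) and the ordering of the eight vertices are assumptions made for illustration; the disclosure does not specify them, and the function names are hypothetical.

```python
import math


def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector (x front, y left, z up) for an azimuth/elevation pair in degrees."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def cube_vertex_angles():
    """Azimuth/elevation pairs of the eight vertices of a cube seen from its center."""
    # Elevation of a cube vertex above the horizontal plane, about 35.26 degrees.
    el = math.degrees(math.atan(1.0 / math.sqrt(2.0)))
    return [(az, sign * el)
            for sign in (1, -1)
            for az in (45.0, 135.0, -135.0, -45.0)]
```

Because only the direction is computed here, the predetermined distance from the reference position would be applied separately as a scale factor.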
[0091] Assume that the user, the listener, is located at the center position LSP1 of the reproduction space (e.g., satellite venue STL1), facing forward. Under these circumstances, the sound field within the sound pickup space (e.g., live venue LV1) can be freely reproduced within the reproduction space (e.g., satellite venue STL1) based on the B-format signal W, X, Y, Z data obtained by encoding the A-format signal picked up in the sound pickup space (e.g., live venue LV1), and the respective directions of the satellite speakers SPk1 to SPk8 within the reproduction space (e.g., satellite venue STL1). In other words, when the user, the listener, is present in the reproduction space (e.g., satellite venue STL1), the listener's forward direction is used as the reference direction, and it becomes possible to reproduce and output sound in any three-dimensional direction from that reference direction.
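One way to picture the reproduction of sound in an arbitrary direction is a first-order "virtual microphone" formed by weighting and summing W, X, Y, and Z. The sketch below is an illustration under stated assumptions, not the decoding process of this disclosure: it assumes four ideal cardioid elements in the FLU, FRD, BLD, BRU layout, B-format signals formed by plain sums and differences without normalization, and axes with x to the front, y to the left, and z upward. The function name and the `alpha` parameter are hypothetical.

```python
import math


def virtual_microphone(w, x, y, z, azimuth_deg, elevation_deg, alpha=0.5):
    """Combine B-format samples into one first-order directional signal.

    alpha = 1.0 gives an omnidirectional pattern, 0.5 a cardioid, and 0.0 a
    figure of eight. The scale factors 1/2 and sqrt(3)/2 follow from the
    assumed ideal cardioid capsules and unnormalized sums and differences.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Unit vector of the look direction (x front, y left, z up).
    vx = math.cos(el) * math.cos(az)
    vy = math.cos(el) * math.sin(az)
    vz = math.sin(el)
    omni = w / 2.0
    directional = (math.sqrt(3.0) / 2.0) * (x * vx + y * vy + z * vz)
    return alpha * omni + (1.0 - alpha) * directional
```

Under these assumptions, a cardioid pointed at a plane-wave source returns the source amplitude, and one pointed directly away from it returns zero. Evaluating the function once per satellite speaker direction would give one simple set of speaker feeds.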
[0092] Next, with reference to Figure 11, the bases of the ambisonic components based on spherical harmonic expansion for order n and degree m will be described. Figure 11 shows an example of the bases of ambisonic components based on spherical harmonic expansion for order n and degree m.
[0093] In Figure 11, the horizontal axis (m) represents the degree, and the vertical axis (n) represents the order. The degree m takes values from -n to +n. Up to order n = N, there are a total of (N+1)^2 spherical harmonic bases. For example, when n=N=0, one base is obtained (i.e., an omnidirectional B-format signal W). Also, for example, when n=N=1, four bases are obtained (i.e., an omnidirectional B-format signal W corresponding to (n,m)=(0,0), a forward-backward B-format signal X corresponding to (n,m)=(1,-1), a vertical B-format signal Z corresponding to (n,m)=(1,0), and a left-right B-format signal Y corresponding to (n,m)=(1,1)). The same applies to n=N=2 and beyond, so the explanation is omitted.
[0094] Spherical harmonics are known to exhibit increasing spatial periodicity as n and m increase. Therefore, different combinations of n and m can be used to represent B-format signals with different directional patterns (directional characteristics). If the index for order n and degree m is defined as K = n(n+1) + m based on Ambisonics Channel Numbering (ACN), then the spherical harmonics can be expressed in vector form as shown in Equation (1). In Equation (1), the superscript T indicates the transpose.
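[0095] [Equation (1): formula not recoverable from the extracted text]
[0096] [Content not recoverable from the extracted text]
[0097] [Formula not recoverable from the extracted text]
[0098] [Formula not recoverable from the extracted text]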
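The ACN index K = n(n+1) + m and the count of (N+1)^2 bases described above can be checked with a short sketch. The function names are hypothetical and are introduced only for illustration.

```python
import math


def acn_index(n, m):
    """ACN index K = n(n+1) + m for order n and degree m, with -n <= m <= n."""
    if n < 0 or abs(m) > n:
        raise ValueError("require n >= 0 and -n <= m <= n")
    return n * (n + 1) + m


def acn_to_order_degree(k):
    """Inverse mapping: recover (n, m) from the ACN index K."""
    n = math.isqrt(k)
    m = k - n * (n + 1)
    return n, m


def num_components(max_order):
    """Number of spherical harmonic bases up to and including order N."""
    return (max_order + 1) ** 2
```

For a first-order representation (N = 1) this gives four components with indices 0 to 3, matching the four B-format signals W, X, Y, and Z.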
[0099] Next, with reference to Figure 12, an example of the operation overview of the sound field presence reproduction system will be described. Figure 12 is a schematic diagram showing an example of the operation overview of the sound field presence reproduction system. In Figure 12, the sound pickup space in which the ambisonic microphone AMB1 is placed is explained using, for example, a live venue LV1 where a band is performing with various sound sources such as vocals, bass, guitar, and drums. However, as mentioned above, the live venue LV1, which is the sound pickup space, is not limited to band performances, but may also include a musical performance with one or more actors, a concert with multiple instruments, or an orchestral performance, and the same applies hereafter.
[0100] As shown in Figure 12, the live venue LV1 is equipped with a main stage STG1, on which the band performs. During the band performance, the vocal sound signal SS2 from the vocals (an example of a sound source), the bass sound signal SS1 from the bass (an example of a sound source), and the guitar sound signal SS3 from the guitar (an example of a sound source) propagate widely within the space of the live venue LV1 and reach the audience. These signals may reach the audience directly from their respective sound source locations through the space, or they may be reproduced through amplification devices such as speakers installed in the live venue LV1 and then reach the audience. The ambisonic microphone AMB1 is positioned at a predetermined location on the audience side of the live venue LV1 (for example, in the center of the audience area) with the aim of primarily capturing sounds that convey the sense of presence on the audience side. Therefore, the ambisonic microphone AMB1 mainly captures sounds that convey the audience-side sense of presence during the band performance, such as cheers, murmurs, commotion, and applause.
[0101] However, as mentioned above, the bass sound signal SS1, the vocal sound signal SS2, and the guitar sound signal SS3 during a band performance propagate within the space of the live venue LV1. As a result, the diffuse sound components DS1, DS2, and DS3 of sound signals SS1, SS2, and SS3 (including reverberation; the same applies hereafter) are picked up as sound signals by the ambisonic microphone AMB1. Consequently, because the ambisonic microphone AMB1 picks up the diffuse sound components DS1, DS2, and DS3, which are not intended to be picked up, it was difficult for conventional scene-based spatial sound reproduction technology to accurately reproduce the sense of presence from the audience side of the live venue LV1 in the satellite venue STL1.
[0102] Therefore, the following embodiment describes an example of a sound field presence reproduction system that accurately reproduces the atmosphere of audience presence in a sound recording space captured using an ambisonic microphone within at least one satellite venue.
[0103] Next, with reference to Figures 13 and 14, the system configuration and operation overview of the sound field presence reproduction system 100 according to Embodiment 2 will be described. Figure 13 is a block diagram showing an example of the system configuration of the sound field presence reproduction system 100 according to Embodiment 2. Figure 14 is a diagram showing an example of the operation overview from sound field presence sound acquisition to sound field presence reproduction in the sound field presence reproduction system 100 of Figure 13.
[0104] The sound field presence reproduction system 100 includes a sound field presence sound recording device 10 and a sound field presence reproduction device 20. The sound field presence sound recording device 10 and the sound field presence reproduction device 20 are connected to each other via a network NW1, enabling data communication. The network NW1 may be a wired network or a wireless network. The wired network may include at least one of the following: wired LAN (Local Area Network), wired WAN (Wide Area Network), or power line communication (PLC), or other network configurations capable of wired communication. On the other hand, the wireless network may include at least one of the following: wireless LAN such as Wi-Fi (registered trademark), wireless WAN, short-range wireless communication such as Bluetooth (registered trademark), or mobile cellular communication network such as 4G or 5G, or other network configurations capable of wireless communication.
[0105] The sound field presence sound recording device 10 is placed in a sound recording space (e.g., a live venue LV1) and includes an ambisonic microphone AMB1, an A/D conversion unit 7, and individual sound recording microphones M1, ..., Mn. Here, n represents the number of individual sound sources in the live venue LV1 (e.g., independent sound sources such as vocals, bass, and guitar in the case of a band performance), and is specifically an integer of 2 or more. Note that the sound field presence sound recording device 10 only needs to have the ambisonic microphone AMB1, and the A/D conversion unit 7 may be provided in the sound field presence reproduction device 20.
[0106] The ambisonic microphone AMB1 is equipped with four microphone elements Mc1, Mc2, Mc3, and Mc4. Microphone element Mc1 picks up sound from the front upper left direction (see Figure 10), microphone element Mc2 picks up sound from the front lower right direction (see Figure 10), microphone element Mc3 picks up sound from the rear lower left direction (see Figure 10), and microphone element Mc4 picks up sound from the rear upper right direction (see Figure 10). The ambisonic microphone AMB1 may also be equipped with more unidirectional microphone elements than the four hollow-arranged microphone elements Mc1, Mc2, Mc3, and Mc4, or it may be equipped with omnidirectional microphone elements arranged on a rigid sphere. By using an ambisonic microphone equipped with a large number of microphone elements, it becomes possible to synthesize an ambisonic signal of the second order or higher in the encoding unit 22 of the sound field presence reproduction device 20. The signals (sound pickup signals) picked up by each microphone element constituting the ambisonic microphone AMB1 are input to the A/D conversion unit 7.
[0107] At least the A/D conversion unit 7 is composed of a semiconductor chip on which at least one electronic device such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), or an FPGA (Field Programmable Gate Array) is mounted, or of dedicated hardware.
[0108] The A/D conversion unit 7 converts the analog sound pickup signals from each microphone element constituting the ambisonic microphone AMB1 into digital sound pickup signals. These converted sound pickup signals are transmitted to the sound field presence reproduction device 20 via the communication interface (not shown) provided in the sound field presence sound recording device 10 and the network NW1.
[0109] The individual sound recording microphone M1 captures the individual sound source (first sound source) generated by a unique sound source (e.g., the vocalist of a band performance or a performer in a musical play) during events such as band performances or musical plays on the main stage STG1 (see Figure 12) of the live venue LV1. The individual sound recording microphone M1 may be, for example, a headset microphone worn by the vocalist of a band performance or a performer in a musical play. The individual sound source signal captured by the individual sound recording microphone M1 is transmitted to the sound field presence reproduction device 20 via the communication interface (not shown) provided in the sound field presence sound recording device 10 and the network NW1.
[0110] Similarly, the individual sound recording microphone Mn captures the individual sound source (the nth sound source) generated by a unique sound source (for example, a guitar in a band performance, or sound effects or background music in a musical performance) during events such as band performances or musical plays on the main stage STG1 (see Figure 12) of the live venue LV1. The individual sound recording microphone Mn may be, for example, a headset microphone worn by a guitarist in a band performance, or a microphone capable of capturing sound effects or background music in a musical performance. The individual sound source signal captured by the individual sound recording microphone Mn is transmitted to the sound field presence reproduction device 20 via the communication interface (not shown) provided in the sound field presence sound recording device 10 and the network NW1.
[0111] The sound field presence reproduction device 20 is placed in a reproduction space (for example, a satellite venue STL1) and includes echo cancellation units 21, ..., 2n, an encoding unit 22, a microphone element direction designation unit 23, a speaker direction designation unit 24, a decoding unit 25, a sound field reproduction unit 26, and satellite speakers SPk1, ..., SPkp. Here, p indicates the number of satellite speakers placed in the satellite venue STL1, and is specifically an integer of 2 or more. Also, the n that indicates the number of echo cancellation units 21 to 2n is the same as the n that indicates the number of individual sound recording microphones M1 to Mn. In other words, the sound field presence reproduction device 20 is provided with the same number of echo cancellation units as the number of types of sound sources picked up by the individual sound recording microphones. Note that the configurations of the echo cancellation units 21 to 2n, the encoding unit 22, the microphone element direction designation unit 23, the speaker direction designation unit 24, the decoding unit 25, and the sound field reproduction unit 26 in Figure 13 correspond to the configurations of the audience-side presence parameter calculation unit 16 and the sound field reproduction processing unit 18 according to Embodiment 1 (see Figure 1).
[0112] The echo cancellation unit 21 receives the sound pickup signals from each microphone element of the ambisonic microphone AMB1 sent from the sound field presence sound recording device 10 (A/D conversion unit 7 side), and further receives the individual sound source signal of the first sound source (see above) sent from the sound field presence sound recording device 10 (individual sound recording microphone M1 side) as the first reference signal M1S. The echo cancellation unit 21 performs an erasure process (e.g., echo cancellation process) to erase the component of the first reference signal M1S (i.e., the individual sound source signal picked up by the individual sound recording microphone M1) included in the sound pickup signals from each microphone element of the ambisonic microphone AMB1. The echo cancellation unit 21 outputs the signal after the erasure process (first sound pickup signal) to the encoding unit 22.
[0113] Similarly, the echo cancellation unit 2n receives the sound pickup signals from each microphone element of the ambisonic microphone AMB1 sent from the sound field presence sound recording device 10 (A/D conversion unit 7 side), and further receives the individual sound source signal of the nth sound source (see above) sent from the sound field presence sound recording device 10 (individual sound recording microphone Mn side) as the nth reference signal MnS. The echo cancellation unit 2n performs an erasure process (e.g., echo cancellation process) to erase the component of the nth reference signal MnS (i.e., the individual sound source signal picked up by the individual sound recording microphone Mn) included in the sound pickup signals from each microphone element of the ambisonic microphone AMB1. The echo cancellation unit 2n outputs the signal after the erasure process (the nth sound pickup signal) to the encoding unit 22.
[0114] Here, each of the echo cancellation units 21 to 2n may be configured as an echo canceller using an adaptive filter that operates in the time domain, for example. This echo canceller can be configured as, for example, a Single Channel EchoCanceller. Therefore, as shown in Figure 14, each of the echo cancellation units 21 to 2n can be configured using the same number of Single Channel EchoCancellers as the number of microphone elements of the ambisonic microphone AMB1 (for example, 4). This Single Channel EchoCanceller may be configured as disclosed in, for example, the reference non-patent literature. With this configuration, especially when there is no correlation between the reference signals (in other words, when the crosstalk component is below a predetermined threshold, so that the signals can be considered substantially free of crosstalk), the component of the reference signal (i.e., the individual sound source signal picked up by the individual sound recording microphone) included in the sound picked up by each microphone element of the ambisonic microphone AMB1 can be erased (suppressed) with high accuracy. Furthermore, each of the echo cancellation units 21 to 2n may perform the echo cancellation process using adaptive filters in the frequency domain or subband domain after the time-domain signal has been forward-transformed using a DFT (Discrete Fourier Transform), and the canceled signal may be inverse-transformed back to the time domain using an IFFT (Inverse Fast Fourier Transform) before subsequent processing.
[0115] <Reference Non-Patent Literature> Chapter 5 Acoustic Echo Canceller, "Example of Adaptive Filter Configuration" (see Figure 5.2), p. 4/(17), Institute of Electronics, Information and Communication Engineers, 2012, [Retrieved September 2, 2022], Internet <URL: https://www.ieice-hbkb.org/files/02/02gun_06hen_05.pdf>
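To make the idea of a time-domain adaptive-filter echo canceller concrete, the following is a minimal sketch of a single-channel canceller using the normalized LMS (NLMS) update. NLMS is chosen here only as a common example of an adaptive filter; the disclosure does not state which adaptation algorithm is used, and the function name, the tap count, the step size `mu`, and the regularization `eps` are all assumptions.

```python
def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Remove the component of `ref` that leaks into `mic` with an NLMS adaptive filter.

    mic: samples picked up by one microphone element (contains leaked reference).
    ref: samples of the reference signal (e.g., one individual sound source).
    Returns the residual signal, i.e. mic minus the estimated leakage.
    """
    weights = [0.0] * taps   # adaptive estimate of the leakage path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, r in zip(mic, ref):
        history = [r] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = d - estimate                              # signal after cancellation
        norm = sum(h * h for h in history) + eps      # reference power for normalization
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual
```

In a configuration like the one described above, one such canceller would run for each microphone element of the ambisonic microphone AMB1 and for each reference signal, so that four instances would serve one reference signal in the four-element example.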
[0116] Each of the echo cancellation units 21 to 2n is provided for the purpose of eliminating (suppressing) individual sound sources that have propagated within the sound pickup space (e.g., live venue LV1). For this reason, each of the echo cancellation units 21 to 2n may be provided together with the encoding unit 22 on the sound pickup space (e.g., live venue LV1) side, or it may be provided together with the encoding unit 22 on the reproduction space (e.g., satellite venue STL1) side. In this case, if it is provided on the sound pickup space (e.g., live venue LV1) side, only the component of the first-order ambisonic signal (i.e., audience presence) which is the output of the encoding unit 22 will be sent to the sound field presence reproduction device 20. On the other hand, if it is provided on the reproduction space (e.g., satellite venue STL1) side, both the component of the first-order ambisonic signal (i.e., audience presence) which is the output of the encoding unit 22 and the individual sound source signals will be sent to the sound field presence reproduction device 20. Alternatively, the echo cancellation units 21 to 2n may be placed only in the sound pickup space (e.g., live venue LV1), while the encoding unit 22 is placed in the reproduction space (e.g., satellite venue STL1). In this case, only the output signal component of the echo cancellation unit 2n will be sent to the sound field presence reproduction device 20.
[0117] Furthermore, in the reproduction space (for example, satellite venue STL1), the sound field presence reproduction device 20 may output the individual sound source signals picked up by the individual sound recording microphones M1 to Mn of the sound field presence sound recording device 10 from each of the satellite speakers SPk1 to SPkp, or from other satellite speakers (not shown) provided for the individual sound source signals, for the purpose of reproducing the sound field presence.
[0118] [Content not recoverable from the extracted text]
[0119] [Formula not recoverable from the extracted text]
[0120] Here, the details of the encoding process by the encoding unit 22 will be described.
[0121] Generally, for any angle (θ, φ) on the spherical surface, the sound pressure p observed (acquired) at a position of radius r is known to be expanded as in Equation (4), with the spherical harmonic function of Equation (2) as the basis, as the solution of the interior problem of the wave equation in the spherical harmonic domain for wave number k. In Equation (4), A_n^m is the expansion coefficient, and R_n(kr) is the radial function term. The infinite sum over the order n is approximated by truncating it at a finite order N, and the accuracy of sound field reproduction changes according to this truncation order. Hereinafter, the truncation order will be expressed as N.
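[0122] [Formula not recoverable from the extracted text]
[0123] [Formula not recoverable from the extracted text]
[0124] [Formula not recoverable from the extracted text]
[0125] [Formula not recoverable from the extracted text]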
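For orientation, a truncated spherical harmonic expansion of the kind described in paragraph [0121] is commonly written in the form below. This is a sketch of the general form only, not a reproduction of Equation (4): the symbol Y_n^m for the spherical harmonic function and the argument list of p are notational assumptions, while A_n^m, R_n(kr), and the truncation order N are as described in the text.

```latex
p(r, \theta, \varphi, k) \;\approx\; \sum_{n=0}^{N} \sum_{m=-n}^{n} A_n^m \, R_n(kr) \, Y_n^m(\theta, \varphi)
```

The double sum contains (N+1)^2 terms, which agrees with the number of bases stated for Figure 11.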
[0126] In Equation (6), i is the imaginary unit, j_n(kr) is the spherical Bessel function of order n, and j'_n(kr) is its derivative. In this disclosure, the expansion coefficient vector γ_n^m for this plane wave is treated as a B-format signal (intermediate representation), which is the output of the encoding process by the encoding unit 22. Hereinafter, this expansion coefficient vector may be referred to as an ambisonics domain signal or simply an ambisonic signal, which lies in an ambisonics domain different from the time domain.
[0127] More specifically, in the encoding process by the encoding unit 22, the acquired sound signal, which is a time-domain signal after the removal of the reference signal components output from each of the echo cancellation units 21 to 2n, is converted into an ambisonic signal (e.g., a first-order ambisonic signal). This ambisonic signal (e.g., a first-order ambisonic signal) is then decoded by the decoding unit 25 and converted into a speaker drive signal.
[0128] [ka]
[0129]
number
[0130]
number
[0131]
number
[0132] [ka]
[0133] [ka]
[0134] [ka]
[0135] The sound field reproduction unit 26 converts the digital speaker drive signals for each satellite speaker output from the decoding unit 25 into analog speaker drive signals, amplifies the signals, and outputs (plays back) them from the corresponding satellite speakers.
[0136] Satellite speakers SPk1, ..., SPkp are placed at each vertex (see Figure 10) of the reproduction space (e.g., satellite venue STL1) modeled as a cube, and reproduce (recreate) the sound field based on speaker drive signals from the sound field reproduction unit 26. The number of speakers may be varied depending on the sound field to be reproduced. It is also possible to reproduce the sound field using fewer than p satellite speakers (for example, 8 in the example in Figure 10) by not reproducing sound for a specific direction, or by combining it with a commonly known virtual sound image generation method such as a transaural system or the VBAP (Vector Based Amplitude Panning) method. Conversely, it is also possible to reproduce the sound field using more than p satellite speakers (for example, 8 in the example in Figure 10). Furthermore, the speakers do not have to be placed at each vertex of the reproduction space (e.g., satellite venue STL1) as long as they are placed so as to surround the reference position (e.g., center position LSP1) of the satellite venue STL1. The sound field reproduction unit 26 may output a signal to a playback device for both ears, such as headphones or earphones, worn by the listener (user), instead of to the satellite speakers. Furthermore, when supplying a signal to the playback device for both ears of the listener (user) (for example, the headphones or earphones mentioned above), the sound field reproduction unit 26 may generate a playback signal corresponding to an azimuth angle of ±90° through a decoding process described later.
Alternatively, it may generate virtual sound images for multiple directions surrounding the head, and generate a playback signal by multiplying the virtual sound image in the corresponding direction in the frequency domain or convolving it in the time domain with a transfer characteristic that allows the user to perceive a three-dimensional sound image, such as an HRTF (Head Related Transfer Function), corresponding to these multiple angles. This allows for sound field reproduction not only from each of the satellite speakers SPk1 to SPkp located in the satellite venue STL1, but also to a playback device (for example, the headphones or earphones mentioned above) worn by the listener (user) located in the satellite venue STL1.
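The time-domain variant of this headphone rendering can be sketched as follows: each virtual sound image is convolved with a left and right head-related impulse response and the results are summed per ear. The function names, the dictionary layout, and the one-tap impulse responses in the usage note are hypothetical; measured HRTF data would be needed in practice.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(virtual_signals, hrirs):
    """Mix virtual sound images into a (left, right) pair of ear signals.

    virtual_signals: {direction_label: samples}
    hrirs:           {direction_label: (left_ir, right_ir)}  (hypothetical data)
    """
    length = max(
        len(sig) + max(len(hrirs[d][0]), len(hrirs[d][1])) - 1
        for d, sig in virtual_signals.items()
    )
    left = [0.0] * length
    right = [0.0] * length
    for direction, sig in virtual_signals.items():
        left_ir, right_ir = hrirs[direction]
        for k, v in enumerate(convolve(sig, left_ir)):
            left[k] += v
        for k, v in enumerate(convolve(sig, right_ir)):
            right[k] += v
    return left, right
```

For example, calling `render_binaural({"+90": sig_a, "-90": sig_b}, hrirs)` with impulse responses for the two ±90° directions returns the two ear signals. Multiplication in the frequency domain is the equivalent alternative mentioned above.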
[0137] Here, we will explain the details of the processing performed by the decoding unit 25.
[0138] [ka]
[0139]
number
[0140] Next, with reference to Figure 15, the operation procedure for sound field presence reproduction by the sound field presence reproduction device 20 will be described. Figure 15 is a flowchart showing an example of the operation procedure for sound field presence reproduction by the sound field presence reproduction device 20 according to Embodiment 2 in chronological order.
[0141] In Figure 15, the ambisonics microphone AMB1 of the sound field presence sound recording device 10 records sounds occurring around a predetermined position on the audience side within the sound recording space (e.g., live venue LV1) (e.g., sounds that give a sense of presence to the audience side) (step St21). The recorded signals from each microphone element of the ambisonics microphone AMB1 recorded in step St21 are transmitted to the sound field presence reproduction device 20. However, as mentioned above, the sounds recorded in step St21 include not only sounds that give a sense of presence to the audience side, but also sounds from one or more sound sources such as performances or plays on the main stage STG1 (see Figure 12) in the sound recording space (e.g., live venue LV1). Furthermore, the individual sound source signals (in other words, the first reference signal M1S to the nth reference signal MnS) captured by the individual microphones M1 to Mn from one or more sound sources such as performances or theatrical productions on the main stage STG1 (see Figure 12) of the sound recording space (for example, live venue LV1) are also transmitted to the sound field presence reproduction device 20 (step St21).
[0142] The sound field presence reproduction device 20 repeatedly performs echo cancellation processing (see above) on the time axis for each reference signal in each of the echo cancellation units 21 to 2n, using the sound pickup signal for each microphone element of the ambisonic microphone AMB1 as the main signal and the sound pickup signals of each individual sound pickup microphone M1 to Mn as reference signals (step St22). More specifically, the echo cancellation unit 21 of the sound field presence reproduction device 20 receives various signals sent in step St21 (specifically, the sound pickup signals for each microphone element of the ambisonic microphone AMB1 and the corresponding first reference signal M1S), and performs erasure processing (e.g., echo cancellation processing) on the time axis to erase the component of the first reference signal M1S (i.e., the individual sound source sound signal picked up by the individual sound pickup microphone M1) contained in the sound pickup signal for each microphone element of the ambisonic microphone AMB1 (step St22). Similarly, the echo cancellation unit 2n of the sound field presence reproduction device 20 receives various signals sent in step St21 (specifically, the sound pickup signals for each microphone element of the ambisonic microphone AMB1 and the corresponding nth reference signal MnS), and performs an erasure process (e.g., echo cancellation process) on the time axis to erase the nth reference signal MnS (i.e., the individual sound source sound signal picked up by the individual sound pickup microphone Mn) contained in the sound pickup signals for each microphone element of the ambisonic microphone AMB1 (step St22).
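The time-axis erasure of step St22 can be sketched with a normalized LMS (NLMS) adaptive filter, a common building block for echo cancellers. NLMS itself, the function name `nlms_cancel`, and the parameters `taps`, `mu`, and `eps` are assumptions for illustration; this passage does not fix the adaptation algorithm.

```python
def nlms_cancel(main, reference, taps=4, mu=0.5, eps=1e-8):
    """Erase the component of `reference` contained in `main` (time domain).

    main:      samples of one microphone element of the ambisonic microphone
    reference: samples of one individual sound pickup microphone
    Returns (residual, weights); the residual is the signal after erasure.
    """
    weights = [0.0] * taps   # estimated path from the source to the element
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, x in zip(main, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate                      # signal after erasure
        energy = sum(h * h for h in history) + eps
        step = mu * error / energy                # normalized step size
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights
```

In this sketch, one call handles one pair of main signal and reference signal; the same routine would be applied to each microphone element in turn.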
[0143] [ka]
[0144] [ka]
[0145] As described above, the sound field presence reproduction device 20 according to Embodiment 2 comprises: an acquisition unit (echo cancellation units 21 to 2n) that acquires sound pickup signals (diffuse sound components DS1 to DS3) picked up by a sound pickup device (ambisonic microphone AMB1) placed in a sound pickup space (live venue LV1) and sound source signals (sound signal SS1, voice signal SS2, sound signal SS3) of one or more sound sources in the sound pickup space; an erasure unit (echo cancellation units 21 to 2n) that uses the sound source signals as reference signals and performs an erasure process on the time axis to erase the reference signal components included in the sound pickup signals; an encoding unit 22 that encodes the signal after the erasure process; a generation unit (decoding unit 25) that, based on the encoded signal, generates speaker drive signals for each of the multiple speakers (satellite speakers SPk1 to SPkp) placed in a reproduction space (satellite venue STL1) different from the sound pickup space, in order to reproduce the sound field presence of the sound pickup space in the reproduction space; and a sound field reproduction unit 26 that outputs the speaker drive signals for each of the multiple speakers. As a result, by erasing on the time axis the components of one or more individual sound sources (reference signals) in live venue LV1 that were picked up by the ambisonic microphone AMB1, the sound field presence reproduction device 20 according to Embodiment 2 can reproduce with high accuracy, in at least one satellite venue, the atmosphere of audience-side presence in the sound pickup space (live venue LV1) that is mainly picked up by the ambisonic microphone AMB1.
[0146] Furthermore, the sound field presence reproduction device 20 includes a microphone element direction specification unit 23 that specifies the direction information of the multiple microphone elements Mc1 to Mc4 provided in the sound pickup device (ambisonic microphone AMB1). The encoding unit 22 performs the encoding process using the direction information of each of the multiple microphone elements Mc1 to Mc4 and the signal after the erasure process. As a result, the sound field presence reproduction device 20 can generate an ambisonic signal having resolution in multiple directions (see the B-format signal in Figure 10) by taking into account the direction information of each of the microphone elements Mc1 to Mc4 provided in the ambisonic microphone AMB1.
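As a rough illustration of how element directions enter the encoding, the sketch below projects one sample from each of four elements onto first-order components W, X, Y, Z. It assumes a regular tetrahedral layout (for which the projection is an exact least-squares fit), uses direction cosines as the first-order basis, and omits the radial term; the names `TETRA_DIRECTIONS` and `encode_first_order` are hypothetical, and the disclosure's own encoding equations are not reproduced here.

```python
import math

_S = 1.0 / math.sqrt(3.0)
# Assumed unit direction vectors of microphone elements Mc1..Mc4 (regular tetrahedron).
TETRA_DIRECTIONS = [(_S, _S, _S), (_S, -_S, -_S), (-_S, _S, -_S), (-_S, -_S, _S)]


def first_order_basis(direction):
    """First-order basis values [W, X, Y, Z] for a unit direction vector."""
    x, y, z = direction
    return [1.0, x, y, z]


def encode_first_order(element_directions, element_samples):
    """Project one sample per element onto W, X, Y, Z.

    Each component is a correlation with its basis column divided by the
    column energy. This equals the least-squares solution only when the
    basis columns are mutually orthogonal, as for the tetrahedron above.
    """
    basis = [first_order_basis(d) for d in element_directions]
    components = []
    for k in range(4):
        column_energy = sum(row[k] ** 2 for row in basis)
        projection = sum(row[k] * s for row, s in zip(basis, element_samples))
        components.append(projection / column_energy)
    return components
```

With ideal cardioid elements, the ratios X/W, Y/W, and Z/W of the result recover the direction cosines of an incoming plane wave, which is the directional resolution the paragraph refers to.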
[0147] Furthermore, the sound field presence reproduction device 20 includes a speaker direction specification unit 24 that specifies the direction information of the multiple speakers (satellite speakers SPk1 to SPkp) within the reproduction space (satellite venue STL1). The generation unit (decoding unit 25) uses the direction information of each of the multiple speakers and the encoded signal to generate speaker drive signals for each of the multiple speakers in the ambisonics domain. As a result, the sound field presence reproduction device 20 can generate speaker drive signals that reproduce the sense of presence on the audience side within the live venue LV1 by taking into account the direction information of each of the multiple satellite speakers SPk1 to SPkp arranged in the space of the satellite venue STL1, as seen from the reference position (for example, the center position LSP1 corresponding to the listener's position).
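A minimal sketch of a decoder that uses speaker directions is given below. It implements a basic sampling (projection) decoder for a first-order signal and eight speakers at the vertices of a cube. The decoder type, the function names, and the convention that X, Y, Z equal the source signal times its direction cosines are assumptions; the decoding equations of the decoding unit 25 are not reproduced here.

```python
import itertools
import math


def cube_vertex_directions():
    """Unit vectors from the cube center to its eight vertices."""
    s = 1.0 / math.sqrt(3.0)
    return [(s * a, s * b, s * c) for a, b, c in itertools.product((1, -1), repeat=3)]


def decode_first_order(b_format, speaker_directions):
    """Sampling decoder: one drive value per speaker for one B-format sample.

    b_format: (W, X, Y, Z), with X, Y, Z assumed to be scaled by direction
    cosines. The factor 3 is the first-order (2n + 1) weight for that
    convention.
    """
    w, x, y, z = b_format
    count = len(speaker_directions)
    return [
        (w + 3.0 * (x * ux + y * uy + z * uz)) / count
        for ux, uy, uz in speaker_directions
    ]
```

A source encoded from the direction of one vertex produces the largest drive value at that vertex's speaker. Other decoder designs (for example mode matching with a pseudo-inverse) would follow the same interface.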
[0148] Furthermore, the erasure unit (echo cancellation units 21 to 2n) of the sound field presence reproduction device 20 is composed of a number of single echo cancellers (see Figure 14) determined based on the number of microphone elements Mc1 to Mc4 in the sound pickup device (ambisonic microphone AMB1) and the number of sound sources. Each single echo canceller receives the sound source signal of the corresponding sound source (for example, a sound signal or voice signal from an individual sound pickup microphone) and performs the erasure process (echo cancellation process) on the time axis. This makes it possible to erase (suppress) with high precision the component of the reference signal (i.e., the individual sound source signal picked up by the individual sound pickup microphone Mn) included in the sound pickup signal of each microphone element of the ambisonic microphone AMB1, especially when there is no correlation between the reference signals (in other words, when the correlation is below a predetermined threshold, such that crosstalk components can be considered substantially absent). The absence of crosstalk components means, for example, that the voice of a vocalist singing on the main stage STG1 (see Figure 12) is not picked up by the other individual sound pickup microphones, or, if it is picked up, that its sound pressure level is below the predetermined threshold mentioned above.
[0149] (Modified example of Embodiment 2) Embodiment 2 describes an example in which each of the echo cancellation units 21 to 2n in a sound field presence reproduction device is configured as a Single EchoCanceller. A modified example of Embodiment 2 describes an example in which the echo cancellation units 21 to 2n in the sound field presence reproduction device are replaced with a multi-channel echo canceller that handles multiple audio channels. In the modified example of Embodiment 2, components and contents that overlap with Embodiment 2 are given corresponding common reference numerals to simplify or omit the explanation, while different contents are explained.
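One plausible arrangement of such a bank, sketched under stated assumptions, cascades one single-reference canceller per sound source for each microphone element, giving 4 × n cancellers for four elements and n sources. The cascade order, the NLMS update, and the names `single_canceller` and `cancel_bank` are assumptions and are not taken from Figure 14.

```python
def single_canceller(main, reference, taps=4, mu=0.05, eps=1e-8):
    """One single echo canceller (NLMS sketch): returns main minus the
    adaptively estimated reference component."""
    weights = [0.0] * taps
    history = [0.0] * taps
    residual = []
    for d, x in zip(main, reference):
        history = [x] + history[:-1]
        error = d - sum(w * h for w, h in zip(weights, history))
        step = mu * error / (sum(h * h for h in history) + eps)
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def cancel_bank(element_signals, reference_signals):
    """Apply one canceller per (microphone element, sound source) pair.

    Returns the cleaned element signals and the number of cancellers used,
    which is len(element_signals) * len(reference_signals).
    """
    cleaned = []
    used = 0
    for main in element_signals:
        residual = main
        for reference in reference_signals:   # cascade over sound sources
            residual = single_canceller(residual, reference)
            used += 1
        cleaned.append(residual)
    return cleaned, used
```

Because each canceller sees the other sources as uncorrelated disturbance, this structure relies on the low-crosstalk condition described above; a small step size `mu` keeps the resulting misadjustment low.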
[0150] First, with reference to Figures 16 and 17, the system configuration and operation overview of the sound field presence reproduction system 100A according to a modified example of Embodiment 2 will be described. Figure 16 is a block diagram showing an example of the system configuration of the sound field presence reproduction system 100A according to a modified example of Embodiment 2. Figure 17 is a diagram showing an example of the operation overview from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system 100A of Figure 16.
[0151] The sound field presence reproduction system 100A includes a sound field presence sound recording device 10 and a sound field presence reproduction device 20A. The sound field presence sound recording device 10 and the sound field presence reproduction device 20A are connected to each other via a network NW1, enabling data communication.
[0152] The sound field presence reproduction device 20A is placed in the reproduction space (for example, the satellite venue STL1) and includes a multi-channel echo cancellation unit 21A, an encoding unit 22, a microphone element direction specification unit 23, a speaker direction specification unit 24, a decoding unit 25, a sound field playback unit 26, and satellite speakers SPk1, ..., SPkp. In other words, in the modified embodiment of Embodiment 2, instead of the echo cancellation units 21~2n of Embodiment 2, a multi-channel echo cancellation unit 21A is provided in the sound field presence reproduction device 20A that inputs the sound source signals of the sound sources picked up by each of the n individual sound-collecting microphones. Note that the configuration of the multi-channel echo cancellation unit 21A, encoding unit 22, microphone element direction specification unit 23, speaker direction specification unit 24, decoding unit 25, and sound field playback unit 26 in Figure 16 corresponds to the configuration of the audience-side presence parameter calculation unit 16 and sound field playback processing unit 18 according to Embodiment 1 (see Figure 1).
[0153] The multi-channel echo cancellation unit 21A receives the sound pickup signals of each microphone element of the ambisonic microphone AMB1 sent from the sound field presence sound recording device 10 (A/D conversion unit 7 side), and further receives the individual sound source signals of the first sound source (see above) to the nth sound source (see above) sent from the sound field presence sound recording device 10 (individual sound recording microphones M1 to Mn side) as the first reference signal M1S to the nth reference signal MnS. The multi-channel echo cancellation unit 21A performs an erasure process (e.g., multi-channel echo cancellation process) in the time domain to erase, from the sound pickup signal of the ambisonic microphone AMB1, each of the components of the first reference signal M1S (i.e., the individual sound source signal picked up by the individual sound recording microphone M1) to the nth reference signal MnS (i.e., the individual sound source signal picked up by the individual sound recording microphone Mn). The multi-channel echo cancellation unit 21A outputs the signals after the erasure process (from the first to the nth acquired sound signal) to the encoding unit 22.
[0154] Here, the multi-channel echo cancellation unit 21A may be configured, for example, as an echo canceller using an adaptive filter that operates in the time domain. This echo canceller can be configured as, for example, a Multi Channel EchoCanceller. The configuration of this Multi Channel EchoCanceller may be based on the stereo echo canceller shown in the reference non-patent literature, which is an example in which two reference signals are input to the Multi Channel EchoCanceller. Therefore, as shown in Figure 17, the multi-channel echo cancellation unit 21A can be configured using the same number of Multi Channel EchoCancellers or stereo echo cancellers as the number of microphone elements (e.g., 4) of the ambisonic microphone AMB1. This Multi Channel EchoCanceller or stereo echo canceller may have, for example, the configuration disclosed in the reference non-patent literature or a configuration derived from it. With this configuration, even if there is a correlation between the reference signals (in other words, even when the correlation exceeds the predetermined threshold below which crosstalk components can be considered absent), the components of the reference signal (i.e., the individual sound source signals picked up by the individual sound pickup microphones Mn) included in the sound pickup signal of the ambisonic microphone AMB1 can be erased (suppressed) with high precision. Furthermore, the multi-channel echo cancellation unit 21A may perform the echo cancellation process using adaptive filters in the frequency domain or subband domain after the time-domain signal has been forward-transformed using a DFT or the like; in that case, the signal after cancellation may be inversely transformed back to the time domain using an IFFT or the like before subsequent processing.
[0155] <Reference Non-Patent Literature> Chapter 5 Acoustic Echo Canceller, "Example of Stereo Echo Canceller Configuration" (see Figures 5 and 8), Institute of Electronics, Information and Communication Engineers, 2012, [Retrieved September 2, 2022], Internet <URL: https://www.ieice-hbkb.org/files/02/02gun_06hen_05.pdf>
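The difference from a bank of single cancellers can be sketched as a joint NLMS update: one error signal per microphone element drives the filters for all reference signals at once, which is what allows correlated references to be handled. The NLMS update and the name `multichannel_nlms` are assumptions for illustration and are not the configuration of the cited reference.

```python
def multichannel_nlms(main, references, taps=3, mu=0.5, eps=1e-8):
    """Jointly erase several reference signals from one main signal.

    main:       samples of one microphone element
    references: list of sample lists, one per individual sound source
    Returns (residual, weights) with one weight list per reference.
    """
    weights = [[0.0] * taps for _ in references]
    histories = [[0.0] * taps for _ in references]
    residual = []
    for n, d in enumerate(main):
        for c, ref in enumerate(references):
            histories[c] = [ref[n]] + histories[c][:-1]
        estimate = sum(
            w * h
            for wc, hc in zip(weights, histories)
            for w, h in zip(wc, hc)
        )
        error = d - estimate                        # shared error signal
        energy = sum(h * h for hc in histories for h in hc) + eps
        step = mu * error / energy
        for c in range(len(references)):
            weights[c] = [w + step * h for w, h in zip(weights[c], histories[c])]
        residual.append(error)
    return residual, weights
```

For four microphone elements, four such cancellers would run in parallel, matching the count stated above. Strongly coherent references slow convergence, which is a known limitation of multi-channel echo cancellation in general.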
[0156] Next, with reference to Figure 18, the operation procedure for sound field presence reproduction by the sound field presence reproduction device 20A will be described. Figure 18 is a flowchart showing in chronological order an example of the operation procedure for sound field presence reproduction by the sound field presence reproduction device according to a modified example of Embodiment 2. In the explanation of Figure 18, the same step numbers are assigned to content that overlaps with the explanation of Figure 15, and the explanation is simplified or omitted, while different content is explained.
[0157] In Figure 18, the sound field presence reproduction device 20A performs multi-channel echo cancellation processing (see above) in the time domain in the multi-channel echo cancellation unit 21A, using the sound pickup signal from the ambisonic microphone AMB1 as the main signal and the sound pickup signals from the individual sound pickup microphones M1 to Mn as reference signals (step St22A). More specifically, the multi-channel echo cancellation unit 21A of the sound field presence reproduction device 20A receives various signals sent in step St21 (specifically, the sound pickup signals for each microphone element of the ambisonic microphone AMB1, and the first reference signal M1S to the nth reference signal MnS), and performs an erasure process (for example, a multi-channel echo cancellation process) on the time axis to erase the components of the first reference signal M1S (i.e., the individual sound source sound signal picked up by the individual sound pickup microphone M1) to the nth reference signal MnS (i.e., the individual sound source sound signal picked up by the individual sound pickup microphone Mn) contained in the sound pickup signals for each microphone element of the ambisonic microphone AMB1 (step St22A). The processing from step St22A onward overlaps with Figure 15, so the explanation is omitted.
[0158] As described above, in the sound field presence reproduction system 100A according to the modified example of Embodiment 2, the erasure unit (multi-channel echo cancellation unit 21A) of the sound field presence reproduction device 20A is composed of a number of multi-channel echo cancellers determined based on the number of microphone elements Mc1 to Mc4 provided in the sound pickup device (ambisonic microphone AMB1) (see Figure 17). Each multi-channel echo canceller receives the sound source signals corresponding to each of the multiple sound sources and performs the erasure process (multi-channel echo cancellation process). This makes it possible to erase (suppress) with high precision the components of the reference signals (i.e., the individual sound source signals picked up by the individual sound pickup microphones) included in the sound pickup signal of each microphone element of the ambisonic microphone AMB1, even if there is a correlation between the reference signals (in other words, even when the correlation is at or above the predetermined threshold below which crosstalk components can be considered absent).
[0159] (Embodiment 3) Embodiment 2 describes an example in which, in a sound field presence reproduction device, echo cancellation processing is performed in the time domain using the sound pickup signal for each microphone element of the ambisonic microphone AMB1 and the sound source sound signal (reference signal) for each sound source in the sound pickup space (e.g., live venue LV1) before performing the encoding processing to generate a first-order ambisonic signal. Embodiment 3 describes an example in which echo cancellation processing is performed in the ambisonic domain using a first-order ambisonic signal and sound source sound signals with directional specifications for each sound source in the sound pickup space (e.g., live venue LV1). In Embodiment 3, configurations and contents that overlap with Embodiment 2 are given corresponding common reference numerals to simplify or omit the explanation, while different contents are explained.
[0160] First, with reference to Figures 19 and 20, the system configuration and operation overview of the sound field presence reproduction system 100B according to Embodiment 3 will be described. Figure 19 is a block diagram showing an example of the system configuration of the sound field presence reproduction system 100B according to Embodiment 3. Figure 20 is a diagram showing an example of the operation overview from sound field presence recording to sound field presence reproduction in the sound field presence reproduction system 100B of Figure 19.
[0161] The sound field presence reproduction system 100B includes a sound field presence sound recording device 10B and a sound field presence reproduction device 20B. The sound field presence sound recording device 10B and the sound field presence reproduction device 20B are connected to each other via a network NW1, enabling data communication.
[0162] The sound field presence sound recording device 10B is placed in a sound recording space (e.g., a live venue LV1) and includes an ambisonic microphone AMB1, an A/D conversion unit 7, an encoding unit 8, a microphone element direction specification unit 9, and individual sound recording microphones M1, ..., Mn. The sound field presence sound recording device 10B only needs to have at least the ambisonic microphone AMB1, and the A/D conversion unit 7, encoding unit 8, and microphone element direction specification unit 9 may be provided in the sound field presence reproduction device 20B.
[0163] [ka]
[0164] [ka]
[0165] The sound field presence reproduction device 20B is placed in a reproduction space (for example, a satellite venue STL1) and includes echo cancellation units 21B, ..., 2nB, encoding units 31, ..., 3n, sound source position designation units 41, ..., 4n, a speaker direction designation unit 24, a decoding unit 25B, a sound field reproduction unit 26, and satellite speakers SPk1, ..., SPkp. The n representing the number of echo cancellation units 21B~2nB, the n representing the number of encoding units 31~3n, the n representing the number of sound source position designation units 41~4n, and the n representing the number of individual sound pickup microphones M1~Mn are the same. In other words, the sound field presence reproduction device 20B is provided with the same number of echo cancellation units, encoding units, and sound source position designation units as the number of types of sound sources picked up by the individual sound pickup microphones. Note that the configurations of the echo cancellation units 21B~2nB, encoding units 31~3n, sound source position designation units 41~4n, speaker direction designation unit 24, decoding unit 25B, and sound field reproduction unit 26 in Figure 19 correspond to the configurations of the audience-side presence parameter calculation unit 16 and sound field reproduction processing unit 18 according to Embodiment 1 (see Figure 1).
[0166] [ka]
[0167] [ka]
[0168] [ka]
[0169] [ka]
[0170] [ka]
[0171] [ka]
[0172] Here, each of the echo cancellation units 21B to 2nB may be configured, for example, as an echo canceller using an adaptive filter that operates in the ambisonics domain. This echo canceller can be configured as a Single Channel EchoCanceller, for example. Therefore, as shown in Figure 20, each of the echo cancellation units 21B to 2nB can be configured using the same number of Single Channel EchoCancellers as the number of microphone elements (e.g., 4) of the ambisonics microphone AMB1. This Single Channel EchoCanceller may be configured as disclosed in, for example, the reference non-patent literature. With this configuration, in particular when there is no correlation between the reference signals (in other words, when the correlation is below a predetermined threshold, such that crosstalk components can be considered substantially absent), it is possible to erase (suppress) with high accuracy the component of the reference signal (i.e., the individual sound source signal picked up by the individual sound pickup microphones) contained in the signal component based on the sound pickup signal of each microphone element of the ambisonics microphone AMB1 (for example, the signal having resolution in each direction of W, X, Y, and Z as shown in Figure 10).
[0173] <Reference Non-Patent Literature> Chapter 5 Acoustic Echo Canceller, "Example of Adaptive Filter Configuration" (see Figure 5.2), p4 / (17), Institute of Electronics, Information and Communication Engineers, 2012, [Retrieved September 2, 2022], Internet <URL: https://www.ieice-hbkb.org/files/02/02gun_06hen_05.pdf>
[0174] [ka]
[0175] Next, with reference to Figure 21, the operating procedure for sound field presence reproduction by the sound field presence reproduction device 20B will be described. Figure 21 is a flowchart showing a time-series example of the operating procedure for sound field presence reproduction by the sound field presence reproduction device 20B according to Embodiment 3. In the explanation of Figure 21, the same step numbers are assigned to content that overlaps with the explanation of Figure 15, and the explanation is simplified or omitted, while different content is explained.
[0176] [ka]
[0177] [ka]
[0178] [ka]
[0179] [ka]
[0180] As described above, in the sound field presence reproduction system 100B according to Embodiment 3, the sound field presence reproduction device 20B includes: an acquisition unit (echo cancellation units 21B to 2nB) that acquires at least the sound pickup signal (diffuse sound components DS1 to DS3) picked up by a sound pickup device (ambisonic microphone AMB1) placed in the sound pickup space (live venue LV1); encoding units 31 to 3n that encode the sound source signals (sound signal SS1, voice signal SS2, sound signal SS3) of one or more sound sources in the sound pickup space; an erasure unit (echo cancellation units 21B to 2nB) that uses the encoded sound source signals as reference signals and performs an erasure process to erase the reference signal components contained in the sound pickup signal; a generation unit (decoding unit 25B) that, based on the signal after the erasure process, generates speaker drive signals for each of the multiple speakers (satellite speakers SPk1 to SPkp) placed in a reproduction space (satellite venue STL1) different from the sound pickup space, in order to reproduce the sound field presence of the sound pickup space in the reproduction space; and a sound field reproduction unit 26 that outputs the speaker drive signals for each of the multiple speakers. As a result, by erasing, in the ambisonic domain rather than in the time domain, the components of one or more individual sound sources (reference signals) within the live venue LV1 that were picked up by the ambisonic microphone AMB1, the sound field presence reproduction device 20B according to Embodiment 3 can reproduce with high accuracy, in at least one satellite venue, the atmosphere of audience-side presence within the sound pickup space (live venue LV1) that is mainly picked up by the ambisonic microphone AMB1.
[0181] Furthermore, the sound pickup signals acquired by the acquisition unit (echo cancellation unit 21B to 2nB) are signals encoded using the directional information of each of the multiple microphone elements Mc1 to Mc4 provided by the sound pickup device (ambisonic microphone AMB1). As a result, the sound field presence reproduction device 20B can acquire a first-order ambisonic signal with high directional resolution as the input signal to be processed for erasure in the erasure unit (each of the echo cancellation unit 21B to 2nB).
[0182] Furthermore, the sound field presence reproduction device 20B includes a speaker direction specification unit 24 that specifies the direction information of the multiple speakers (satellite speakers SPk1 to SPkp) within the reproduction space (satellite venue STL1). The generation unit (decoding unit 25B) generates speaker drive signals for each of the multiple speakers using the direction information of each of the multiple speakers and the signal after the erasure process. As a result, by performing a decoding process using the signal after the erasure process in the ambisonics domain, and by taking into account the direction information of each of the multiple satellite speakers SPk1 to SPkp as seen from the reference position (for example, the center position LSP1 corresponding to the listener's position), the sound field presence reproduction device 20B can accurately generate speaker drive signals with high directional resolution that reproduce the sense of presence on the audience side within the live venue LV1.
[0183] Furthermore, the sound field presence reproduction device 20B is further equipped with sound source position specification units 41 to 4n that specify the position information of one or more sound sources within the sound collection space (live venue LV1). Each of the encoding units 31 to 3n performs encoding processing using the sound source signal and position information of the corresponding sound source. As a result, the sound field presence reproduction device 20B can generate a highly accurate reference signal necessary for the erasure processing of the erasure unit (echo cancellation units 21B to 2nB) by taking into account the direction in which individual sound sources exist within the live venue LV1.
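How a sound source signal and its position might be combined into a first-order reference signal can be sketched as follows. The sketch assumes a plane-wave source, ignores distance attenuation and propagation delay, and scales X, Y, Z by direction cosines; the name `encode_source` and its parameters are hypothetical, and the encoding equations of the encoding units 31 to 3n are not reproduced here.

```python
import math


def encode_source(samples, source_position, microphone_position=(0.0, 0.0, 0.0)):
    """Encode a mono source signal into first-order components W, X, Y, Z.

    The direction is taken from the microphone position toward the source
    position; only the direction (not the distance) affects the result.
    """
    dx, dy, dz = (s - m for s, m in zip(source_position, microphone_position))
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    if distance == 0.0:
        raise ValueError("source and microphone positions coincide")
    ux, uy, uz = dx / distance, dy / distance, dz / distance
    return {
        "W": [float(s) for s in samples],   # omnidirectional component
        "X": [ux * s for s in samples],
        "Y": [uy * s for s in samples],
        "Z": [uz * s for s in samples],
    }
```

Each of the four resulting channels could then serve as the reference input of the canceller for the corresponding channel of the ambisonic signal.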
[0184] Furthermore, the erasure unit (echo cancellation unit 21B~2nB) is composed of a number of single echo cancellers (e.g., 4n (=4×n)) determined based on the number of microphone elements Mc1~Mc4 (e.g., 4) and the number of sound sources (e.g., n) of the sound pickup device (ambisonics microphone AMB1). The single echo canceller receives the signal after the sound source signal of the corresponding sound source has been encoded and performs the erasure process. As a result, the sound field presence reproduction device 20B can erase (suppress) the components of the reference signal (i.e., the first-order ambisonics signal based on the individual sound source sound signal and individual sound source direction picked up by the individual sound pickup microphone Mn) contained in the signal component based on the sound pickup signal of each microphone element of the ambisonics microphone AMB1 (e.g., the signal having resolution in each direction of W, X, Y, and Z as shown in Figure 10) with high precision, especially when there is no correlation between the reference signals (in other words, when it is below a predetermined threshold to the extent that the crosstalk component can be considered to be substantially absent).
[0185] (Modified example of Embodiment 3) Embodiment 3 describes an example in which each of the echo cancellation units 21B to 2nB in the sound field presence reproduction device is configured as a single echo canceller. The modified example of Embodiment 3 describes an example in which, instead of the echo cancellation units 21B to 2nB, a multi-channel echo canceller is used that handles multiple audio channels in the ambisonics domain rather than in the time domain. In the modified example of Embodiment 3, configurations and contents that overlap with Embodiments 1 and 2 are given corresponding common reference numerals to simplify or omit the explanation, while different contents are explained.
[0186] First, with reference to Figures 22 and 23, the system configuration and operation overview of the sound field presence reproduction system 100C according to a modified example of Embodiment 3 will be described. Figure 22 is a block diagram showing an example of the system configuration of the sound field presence reproduction system 100C according to a modified example of Embodiment 3. Figure 23 is a diagram showing an example of the operation overview from sound field presence sound acquisition to sound field presence reproduction in the sound field presence reproduction system 100C of Figure 22.
[0187] The sound field presence reproduction system 100C includes a sound field presence sound recording device 10B (see Figure 19) and a sound field presence reproduction device 20C. The sound field presence sound recording device 10B and the sound field presence reproduction device 20C are connected to each other via a network NW1, enabling data communication.
[0188] [ka]
[0189] [ka]
[0190] Here, the multi-channel echo cancellation unit 21C may be configured, for example, as an echo canceller that uses multiple adaptive filters operating in the ambisonics domain, that is, as a multi-channel echo canceller. The multi-channel echo canceller may be based on the stereo echo canceller shown in the reference non-patent literature, which is an example of a multi-channel echo canceller that receives two reference signals. Therefore, as shown in Figure 23, the multi-channel echo cancellation unit 21C can be configured using as many multi-channel echo cancellers or stereo echo cancellers as there are microphone elements (e.g., 4) in the ambisonics microphone AMB1. Each multi-channel echo canceller or stereo echo canceller may have the configuration disclosed in, for example, the reference non-patent literature, or a configuration derived from it. With this configuration, even if the reference signals are correlated (in other words, even when the correlation exceeds the predetermined threshold below which crosstalk components can be regarded as absent), the components of the reference signals (i.e., the individual sound source signals picked up by the individual sound pickup microphones Mn) contained in the signal components based on the sound picked up by the ambisonics microphone AMB1 (for example, the signals with resolution in each of the W, X, Y, and Z directions shown in Figure 10) can be accurately erased (suppressed).
[0191] <Reference Non-Patent Literature> Chapter 5 Acoustic Echo Canceller, "Example of Stereo Echo Canceller Configuration" (see Figures 5 and 8), Institute of Electronics, Information and Communication Engineers, 2012, [Retrieved September 2, 2022], Internet <URL: https://www.ieice-hbkb.org/files/02/02gun_06hen_05.pdf>
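The idea of a multi-channel echo canceller that takes several reference signals at once can be sketched as follows. This is a generic joint normalized-LMS filter written under assumptions: it does not reproduce the stereo echo canceller configuration of the reference non-patent literature, and the class name `MultiChannelEchoCanceller` and its parameters are hypothetical. One such canceller would be instantiated per W, X, Y, and Z signal.

```python
class MultiChannelEchoCanceller:
    """Joint adaptive filter over several reference signals (NLMS is assumed)."""

    def __init__(self, num_refs, taps=8, mu=0.5, eps=1e-8):
        self.w = [[0.0] * taps for _ in range(num_refs)]    # one filter per reference
        self.buf = [[0.0] * taps for _ in range(num_refs)]  # reference histories
        self.mu = mu
        self.eps = eps

    def process(self, reference_samples, input_sample):
        """Remove the components of all references from one input sample."""
        for k, r in enumerate(reference_samples):
            self.buf[k] = [r] + self.buf[k][:-1]
        # A single estimate and a single error are shared by all filters.
        estimate = sum(
            wi * bi
            for w, b in zip(self.w, self.buf)
            for wi, bi in zip(w, b)
        )
        error = input_sample - estimate
        # Normalize by the total energy of all reference histories.
        norm = sum(bi * bi for b in self.buf for bi in b) + self.eps
        step = self.mu * error / norm
        self.w = [
            [wi + step * bi for wi, bi in zip(w, b)]
            for w, b in zip(self.w, self.buf)
        ]
        return error
```

Because all filters are updated from one shared error, the references are handled jointly, so the filter can still converge when they are partially correlated, which is the case the modified example targets.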
[0192] [ka]
[0193] Next, with reference to Figure 24, the operating procedure for sound field presence reproduction by the sound field presence reproduction device 20C will be described. Figure 24 is a flowchart showing in chronological order an example of the operating procedure for sound field presence reproduction by the sound field presence reproduction device 20C according to a modified example of Embodiment 3. In the explanation of Figure 24, the same step numbers are assigned to any content that overlaps with the explanations of Figures 15, 18, or 21, and the explanation is simplified or omitted, while different content is explained.
[0194] [ka]
[0195] [ka]
[0196] As described above, in the sound field presence reproduction device 20C according to the modified example of Embodiment 3, the erasure unit (multi-channel echo cancellation unit 21C) is composed of a number of multi-channel echo cancellers determined based on the number of microphone elements Mc1 to Mc4 provided in the sound pickup device (ambisonics microphone AMB1). Each multi-channel echo canceller receives the signals obtained by encoding the sound source signal of each of the multiple sound sources and performs the erasure process (multi-channel echo cancellation process). As a result, the sound field presence reproduction device 20C can erase (suppress) with high precision, in the ambisonics domain, the components of the reference signals (i.e., the individual sound source signals picked up by the individual sound pickup microphones) included in the signal component based on the sound pickup signal of each microphone element of the ambisonics microphone AMB1 (for example, the signal having resolution in each direction of W, X, Y, and Z as shown in Figure 10), even if the reference signals are correlated (in other words, even when the correlation exceeds the predetermined threshold below which crosstalk components can be regarded as absent).
[0197] While embodiments have been described above with reference to the attached drawings, this disclosure is not limited to such examples. It is clear to those skilled in the art that various modifications, alterations, substitutions, additions, deletions, and equivalents can be conceived within the scope of the claims, and these are also understood to fall within the technical scope of this disclosure. Furthermore, the components of the embodiments described above can be combined in any way without departing from the spirit of the invention.
[0198] In Embodiment 1 described above, the immersive sound field reproduction device was described as using performer sound signals, zone sound signals, and audience-side immersive sound signals recorded within the live venue LV1 to reproduce sound image localization and a sense of presence in real time within the satellite venue STL1. However, the immersive sound field reproduction device may also store performer sound signals, zone sound signals, and audience-side immersive sound signals that were recorded in advance in the live venue LV1 in the memory unit 13 within the satellite venue STL1, and reproduce the atmosphere of the sound field of the live venue LV1 (sound image localization and sense of presence) on a date later than the recording date instead of in real time.
[0199] In the example system configuration shown in Figure 1, at least one of the following may be provided on the live venue LV1 side: the memory unit 13, the sound image localization zone determination unit 15, the audience-side presence parameter calculation unit 16, and the mixing / level balance adjustment unit 17. In other words, the arrangement of the system configuration of the presence sound field reproduction system 1000 shown in Figure 1 may be determined as appropriate, taking into account the performance of the computer device P1 provided on the live venue LV1 side, the performance of the computer device P2 provided on the satellite venue STL1 side, and so on.
[Industrial applicability]
[0200] This disclosure is useful as a realistic sound field reproduction device and realistic sound field reproduction method that reproduces a sense of presence, including sound images from various sound sources in a sound field recording space, with high sensitivity in a sound field reproduction space.
[Explanation of Symbols]
[0201]
1 Mixer
2 Performer tracking system
3 Face database
4 Encoder
5 Matching table database
6 Live venue communication unit
7 A/D conversion unit
8, 22, 31, 3n Encoding unit
9, 23 Microphone element direction specification unit
10, 10B Sound field presence sound recording device
11 Satellite venue communication unit
12 Decoder
13 Memory unit
14 Video output unit
15 Sound image localization zone determination unit
16 Audience-side presence parameter calculation unit
17 Mixing / level balance adjustment unit
18 Sound field reproduction processing unit
20, 20A, 20B, 20C Sound field presence reproduction device
21, 2n, 21B, 2nB Echo cancellation unit
21A, 21C Multi-channel echo cancellation unit
24 Speaker direction specification unit
25, 25B, 25C Decoding unit
26 Sound field reproduction unit
41, 4n Sound source position specification unit
100, 100A, 100B, 100C Sound field presence reproduction system
1000 Presence sound field reproduction system
AMB1 Ambisonics microphone
HM1, HM4 Headset microphone
M1, Mn Individual sound pickup microphone
SPk1, SPk2, SPk3, SPk4, SPk5, SPk6, SPk7, SPk8, SPk9, SPk10, SPk11, SPk12, SPkp Satellite speaker
ZM1, ZM2, ZM3, ZM4, ZM5, ZM6, ZM7 Zone microphone
Claims
1. A realistic sound field reproduction device comprising: an acquisition unit that acquires at least a speech voice signal recorded by a person microphone worn by at least one person who can move within an activity area of a sound field recording space, ambient sound signals recorded by a plurality of ambient microphones arranged around the activity area, and position information of the person; a determination unit that determines, based on the position information of the person, a main ambient microphone, which is one of the plurality of ambient microphones and whose recording area contains the position of the person within the activity area; and a sound field reproduction unit that performs sound field reproduction processing to reproduce the sound field of the sound field recording space using a plurality of speakers arranged in a sound field reproduction space different from the sound field recording space, based on at least the speech voice signal from the person microphone, a first ambient sound signal from the main ambient microphone, and a second ambient sound signal from the ambient microphones other than the main ambient microphone, wherein the acquisition unit further acquires a peripheral area sound signal recorded by a sound recording device located in a peripheral area different from the activity area within the sound field recording space, and the sound field reproduction unit performs signal processing to reproduce a sound field using the peripheral area sound signal in an area within the sound field reproduction space that corresponds to the peripheral area within the sound field recording space, and performs, as the signal processing, an erasure process that uses the speech voice signal, the first ambient sound signal, and the second ambient sound signal as reference signals and erases the components of the reference signals included in the peripheral area sound signal.
2. The realistic sound field reproduction device according to claim 1, wherein the sound field reproduction unit encodes the signal after the erasure process and generates, based on the encoded signal, a speaker drive signal for each of the plurality of speakers for reproducing the sense of presence of the sound field recording space within the sound field reproduction space.
3. The realistic sound field reproduction device according to claim 2, wherein the sound field reproduction unit outputs the generated speaker drive signal for each speaker from the corresponding speaker.