Sound collecting device, sound collecting method, and recording medium storing sound collecting program
By employing techniques such as Fourier transform and direction of arrival estimation, and utilizing a database to calculate the shielding coefficient for speech separation, the problem of inaccurate speech separation under the EM algorithm is solved, achieving high-precision speech separation results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MITSUBISHI ELECTRIC CORP
- Filing Date
- 2021-05-20
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This invention relates to a sound collection device, a sound collection method, and a recording medium storing a sound collection program.
Background Technology
[0002] A device has been proposed that converts the received signals from two microphones into frequency-domain signals, calculates the phase difference between the frequency-domain signals, estimates the parameters of a frequency-dependent probability distribution model, generates a mask using the probability distribution model, and uses the mask for sound source separation (i.e., speech separation). See, for example, Patent Document 1. In this device, the expectation-maximization (EM) algorithm is used to update the parameters of the probability distribution model.
[0003] Existing technical documents
[0004] Patent documents
[0005] Patent Document 1: Japanese Patent Application Publication No. 2010-187066 (e.g., claims 1 and 4, paragraphs 0026 to 0059, Figure 4)
Summary of the Invention
[0006] The problem that the invention aims to solve
[0007] However, in devices that use the EM algorithm in the parameter update of the probability distribution model, speech is sometimes not separated accurately.
[0008] The purpose of this invention is to enable high-precision speech separation.
[0009] Methods for solving problems
[0010] The sound collection device of the present invention separates a target speech signal from a first received signal output from a first microphone that receives input speech and a second received signal output from a second microphone that receives the input speech. The sound collection device is characterized by comprising: a first Fourier transform unit that performs a Fourier transform on the first received signal to output a first signal; a second Fourier transform unit that performs a Fourier transform on the second received signal to output a second signal; an arrival direction estimation unit that estimates the arrival direction of the speech; a phase calculation unit that calculates the phase of the cross spectrum of the first signal and the second signal; a shielding coefficient determination unit that determines a shielding coefficient based on (i) an arrival direction phase table, read from a pre-generated database, that represents the relationship between the phase and the arrival direction for each frequency band, (ii) the calculated phase, and (iii) the estimated arrival direction; a filter that uses the shielding coefficient to separate a signal from either the first signal or the second signal; and an inverse Fourier transform unit that performs an inverse Fourier transform on the separated signal to output the target speech signal.
[0011] The sound collection method of the present invention is performed by a sound collection device that separates a target speech signal from a first received signal output from a first microphone that receives input speech and a second received signal output from a second microphone that receives the input speech. The sound collection method is characterized by comprising the following steps: performing a Fourier transform on the first received signal to output a first signal; performing a Fourier transform on the second received signal to output a second signal; estimating the arrival direction of the speech; calculating the phase of the cross spectrum of the first signal and the second signal; determining a shielding coefficient based on (i) an arrival direction phase table, read from a pre-generated database, that represents the relationship between the phase and the arrival direction for each frequency band, (ii) the calculated phase, and (iii) the estimated arrival direction; separating a signal from the first signal or the second signal using the shielding coefficient; and performing an inverse Fourier transform on the separated signal to output the target speech signal.
[0012] Invention Effects
[0013] According to the present invention, speech separation can be performed with high precision.
Description of the Drawings
[0014] Figure 1 is a functional block diagram that schematically illustrates the structure of the sound collection device according to Embodiment 1.
[0015] Figure 2 is a diagram illustrating an example of the hardware structure of the sound collection device according to Embodiment 1.
[0016] Figure 3 is a diagram illustrating another example of the hardware structure of the sound collection device according to Embodiment 1.
[0017] Figure 4 is a diagram showing the arrival time difference of speech at the two microphones.
[0018] Figure 5(A) is a diagram showing the phase difference between signals in the frequency domain, and Figure 5(B) is a diagram showing the phase of the cross spectrum of signals in the frequency domain.
[0019] Figure 6 is a diagram showing an example of an arrival direction phase table.
[0020] Figure 7 is a diagram illustrating an example of the process for determining the shielding coefficient using the arrival direction phase table.
[0021] Figure 8 is a flowchart illustrating the calculation process for the arrival direction of speech.
[0022] Figure 9 is a flowchart illustrating the process for determining the shielding coefficient.
[0023] Figure 10 is a functional block diagram that schematically illustrates the structure of the sound collection device according to Embodiment 2.
Detailed Implementation
[0024] The sound collection device, sound collection method, and sound collection program of the embodiments will now be described with reference to the accompanying drawings. The following embodiments are merely examples, and the embodiments can be appropriately combined and modified.
[0025] Embodiment 1
[0026] <Sound Collection Device 1>
[0027] Figure 1 is a functional block diagram that schematically illustrates the structure of the sound collection device 1 according to Embodiment 1. The sound collection device 1 is also referred to as a speech separation device. The sound collection device 1 is a device capable of performing the sound collection method of Embodiment 1. The sound collection method is also referred to as a speech separation method. As shown in Figure 1, the sound collection device 1 separates the target speech signal y(t) from a first received signal x1(t) output from a first microphone 11a that receives one or more speech sounds (e.g., speech #1, speech #2) and a second received signal x2(t) output from a second microphone 11b that receives the same one or more speech sounds. Here, t represents time. In other words, when speech consisting of a mixture of speech #1 and speech #2 is input to the first microphone 11a and the second microphone 11b, the sound collection device 1 separates the speech signal of the target speaker (speech #1 or speech #2) from the received signals of the mixed speech.
[0028] The sound collecting device 1 includes a first Fourier transform unit 12a, a second Fourier transform unit 12b, an arrival direction estimation unit 17, a phase calculation unit 13, a shielding coefficient determination unit 14, a filter 18, and an inverse Fourier transform unit 19. Furthermore, the sound collecting device 1 includes a spatial aliasing calculation unit 16 and a storage device storing an arrival direction phase table 15. The spatial aliasing calculation unit 16 may also be part of an external device different from the sound collecting device 1. The arrival direction phase table 15 may also be a database stored in an external storage device different from the sound collecting device 1.
[0029] <Microphone 11a and Microphone 11b>
[0030] Speech is input to the first microphone 11a of channel 1 (Ch1) and the second microphone 11b of channel 2 (Ch2). In the example of Figure 1, the input speech is a mixture of speech #1 emitted by a first speaker as a first sound source and speech #2 emitted by a second speaker as a second sound source. The angle representing the arrival direction of speech #1 is denoted by θ1, and the angle representing the arrival direction of speech #2 is denoted by θ2. The first microphone 11a outputs the first received signal x1(t). The second microphone 11b outputs the second received signal x2(t). There may also be three or more microphones. Such a set of multiple microphones is also called a microphone array.
[0031] <First Fourier Transform Unit 12a and Second Fourier Transform Unit 12b>
[0032] The first Fourier transform unit 12a performs a Fourier transform on the first received signal x1(t) output from the first microphone 11a to output a first signal X1(ω, τ) for frame τ and angular frequency ω. The second Fourier transform unit 12b performs a Fourier transform on the second received signal x2(t) output from the second microphone 11b to output a second signal X2(ω, τ) for frame τ and angular frequency ω.
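As a rough illustration of what the Fourier transform units 12a and 12b compute, the following minimal sketch splits a sampled signal into frames and applies a discrete Fourier transform to each frame. The frame length, hop size, Hann window, and the function name stft are assumptions made for this example; the text above does not specify them.

```python
import cmath
import math

def stft(x, frame_len=16, hop=8):
    """Return frames[tau][k]: DFT bin k of frame tau (illustrative O(N^2) DFT)."""
    # Periodic Hann window (an assumed choice; the source names no window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# Example: a tone that falls exactly on bin 4 of a 16-point frame.
tone = [math.cos(2 * math.pi * 4 * n / 16) for n in range(64)]
X1 = stft(tone)
peak_bin = max(range(len(X1[0])), key=lambda k: abs(X1[0][k]))
```

Running the same function on both microphone signals would give the two frequency-domain signals that the later units operate on, one complex value per frame and frequency bin.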
[0033] <Phase Calculation Unit 13>
[0034] The phase calculation unit 13 calculates the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) based on the first signal X1(ω, τ) and the second signal X2(ω, τ). The calculation method for the cross spectrum D(ω, τ) and the phase ΦD(ω, τ) will be described later.
[0035] <Spatial Aliasing Calculation Unit 16>
[0036] The spatial aliasing calculation unit 16 calculates the angular frequency ω0 of the lower limit of spatial aliasing based on the interval (i.e., the distance between microphones) d between the first microphone 11a and the second microphone 11b using the following formula (1).
[0037] ω0 = c/(2d) (1), where c is the speed of sound.
[0038] At angular frequencies lower than angular frequency ω0, no spatial aliasing occurs.
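The limit used by the spatial aliasing calculation unit 16 can be sketched as follows. This is only an illustration that evaluates the expression c/(2d) as written in claim 4; the default speed of sound (340 m/s), the spacing of 0.15 m, and the function names are assumptions and do not come from the text.

```python
def spatial_aliasing_limit(d, c=340.0):
    """Lower limit of spatial aliasing, c / (2 d), as stated in claim 4."""
    if d <= 0:
        raise ValueError("microphone spacing d must be positive")
    return c / (2.0 * d)

def below_aliasing_limit(omega, d, c=340.0):
    """True when a component may be used for arrival direction estimation."""
    return omega < spatial_aliasing_limit(d, c)

omega0 = spatial_aliasing_limit(0.15)  # assumed spacing of 0.15 m
```

A smaller microphone spacing raises the limit, so more of the spectrum remains usable for estimating the arrival direction.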
[0039] <Arrival Direction Estimation Unit 17>
[0040] The arrival direction estimation unit 17 calculates the angle θ representing the arrival direction of the speech arriving at the first microphone 11a and the second microphone 11b. In the example of Figure 1, the arrival direction estimation unit 17 estimates the angle θ1 representing the arrival direction of speech #1 contained in the speech (i.e., mixed speech) arriving at the first microphone 11a and the second microphone 11b, and the angle θ2 representing the arrival direction of speech #2 contained in that speech. Preferably, the arrival direction estimation unit 17 calculates the arrival direction of the speech based on the first signal X1(ω, τ) and the second signal X2(ω, τ) at angular frequencies ω lower than the angular frequency ω0. This is because, if the arrival direction is calculated from components of the first signal X1(ω, τ) and the second signal X2(ω, τ) at angular frequencies higher than ω0, the arrival direction may be calculated incorrectly. The method for estimating (i.e., calculating) the arrival direction will be described later.
[0041] <Arrival Direction Phase Table 15>
[0042] The arrival direction phase table 15 is a table representing the relationship between the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) at frequency f (i.e., angular frequency ω = 2πf) and the arrival direction of speech. The arrival direction phase table 15 is generated in advance and stored as a database in a storage device. For example, the arrival direction phase table 15 is a table representing the correspondence between the phase ΦD(ω, τ) for each frequency band (i.e., each band of angular frequency with a certain width) and the angle θ representing the arrival direction. An example of the arrival direction phase table 15 is described later.
[0043] <Shielding Coefficient Determination Unit 14>
[0044] The shielding coefficient determination unit 14 generates the shielding coefficient (i.e., masking coefficient) b(ω, τ) based on the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) calculated by the phase calculation unit 13, the angle θ representing the arrival direction of the speech estimated by the arrival direction estimation unit 17 (the angle output from the arrival direction estimation unit 17 is a candidate for the angle representing the arrival direction), and the arrival direction phase table 15. The masking coefficient b(ω, τ) is, for example, a binary masking coefficient. For example, when the arrival direction phase table 15 contains an entry consisting of the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) and the angle θ representing the arrival direction of the speech (i.e., when the phase ΦD(ω, τ) satisfies the predetermined condition), the shielding coefficient determination unit 14 sets the shielding coefficient b(ω, τ) to 1. When the arrival direction phase table 15 contains no entry consisting of the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) and the angle θ representing the arrival direction of the speech, the shielding coefficient determination unit 14 sets the shielding coefficient b(ω, τ) to 0.
[0045] <Filter 18>
[0046] The filter 18 uses the masking coefficient b(ω, τ) to separate the frequency-domain signal Y(ω, τ) from either the first signal X1(ω, τ) or the second signal X2(ω, τ), both of which are frequency-domain signals. When the masking coefficient b(ω, τ) is a binary masking coefficient, the filter 18 multiplies the first signal X1(ω, τ) or the second signal X2(ω, τ) by the masking coefficient b(ω, τ), thereby generating the signal Y(ω, τ). In the example of Figure 1, the filter 18 uses the masking coefficient b(ω, τ) to separate the signal Y(ω, τ) from the first signal X1(ω, τ). When speech #1 is the target speech, the signal Y(ω, τ) is a signal in which components other than speech #1 are reduced. Other speech, such as speech #2, can also be the target speech.
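The multiplication performed by the filter 18 for a binary masking coefficient can be illustrated with a short sketch. The nested-list layout (frame index first, then frequency bin), the sample values, and the function name apply_mask are assumptions chosen for this example.

```python
def apply_mask(X, b):
    """Multiply each time-frequency bin X[tau][k] by its coefficient b[tau][k]."""
    if len(X) != len(b) or any(len(xr) != len(br) for xr, br in zip(X, b)):
        raise ValueError("X and b must have the same shape")
    return [[bk * xk for xk, bk in zip(xr, br)] for xr, br in zip(X, b)]

# Two frames with three bins each; b keeps a bin (1) or suppresses it (0).
X1 = [[1 + 1j, 2 - 1j, 0.5j], [3 + 0j, -1j, 2 + 2j]]
b = [[1, 0, 1], [0, 1, 1]]
Y = apply_mask(X1, b)
```

Bins whose coefficient is 1 pass through unchanged and bins whose coefficient is 0 are zeroed, which is how components other than the target speech are reduced.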
[0047] <Inverse Fourier Transform Unit 19>
[0048] The inverse Fourier transform unit 19 performs an inverse Fourier transform on the frequency domain signal Y(ω, τ) and outputs the speech signal y(t) in the time domain corresponding to the target speech.
[0049] <Hardware Structure>
[0050] Figure 2 is a diagram illustrating an example of the hardware structure of the sound collection device 1 according to Embodiment 1. As shown in Figure 2, the sound collection device 1 is implemented by a processing circuit 101. The processing circuit 101 implements the functions of the first Fourier transform unit 12a, the second Fourier transform unit 12b, the phase calculation unit 13, the arrival direction estimation unit 17, the shielding coefficient determination unit 14, the filter 18, and the inverse Fourier transform unit 19 shown in Figure 1. The processing circuit 101 can be dedicated hardware or a processor that executes a program. When the processing circuit is dedicated hardware, it can be, for example, a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a combination thereof.
[0051] The voice output unit 102 is, for example, a voice output circuit that outputs voice signals to a speaker or the like. The external storage device 103 is, for example, a non-volatile storage device such as a hard disk drive (HDD) or a solid-state drive (SSD).
[0052] Figure 3 is a diagram showing another example of the hardware structure of the sound collection device 1 according to Embodiment 1.
[0053] As shown in Figure 3, the sound collection device 1 has a processor 111, such as a CPU (Central Processing Unit), as a processing circuit that executes the sound collection program stored in the memory 112. The sound collection device 1 shown in Figure 3 is, for example, a computer. The sound collection program is installed from a recording medium on which it is stored or by downloading via the Internet. The processor 111 can also be any of a processing device, a computing device, a microprocessor, a microcomputer, and a DSP (Digital Signal Processor). The memory 112 is, for example, a volatile semiconductor memory such as RAM (Random Access Memory).
[0054] Alternatively, the sound collection device 1 can be partially implemented using dedicated hardware and partially implemented using software or firmware. In this way, the processing circuitry can implement the functions shown in Figure 1 using hardware, software, firmware, or any combination thereof.
[0055] <Calculation of the phase ΦD(ω, τ) and the angle θ representing the arrival direction>
[0056] Figure 4 is a diagram showing the arrival time difference δθ of the speech reaching the first microphone 11a and the second microphone 11b. In Figure 4, R is a reference line extending in the direction perpendicular to the direction in which the first microphone 11a and the second microphone 11b are arranged. In Figure 4, let θ be the angle of the arrival direction of the speech (i.e., the direction opposite to the sound source direction as seen from the first microphone 11a and the second microphone 11b) relative to the reference line R, let c be the speed of sound, and let d be the distance between the first microphone 11a and the second microphone 11b. The arrival time difference δθ is then represented by the following formula (2).
[0057] δθ = d·sinθ/c (2)
[0058] According to equation (2), the angle θ representing the direction of arrival is expressed as shown in equation (3) below.
[0059] θ = arcsin(c·δθ/d) (3)
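The relationship between the arrival time difference and the arrival angle can be checked numerically. The sketch below assumes the usual far-field relations for two microphones, δθ = d·sinθ/c and θ = arcsin(c·δθ/d), with θ measured from the reference line R; the speed of sound of 340 m/s, the spacing of 0.15 m, and the function names are assumptions.

```python
import math

def arrival_time_difference(theta_rad, d, c=340.0):
    """Delay between the microphones for arrival angle theta (far-field model)."""
    return d * math.sin(theta_rad) / c

def arrival_angle(delta, d, c=340.0):
    """Arrival angle in radians recovered from the delay delta."""
    s = c * delta / d
    if not -1.0 <= s <= 1.0:
        raise ValueError("delay is larger than d / c; no real angle exists")
    return math.asin(s)

# Round trip for speech arriving 30 degrees off the reference line.
delta = arrival_time_difference(math.radians(30.0), d=0.15)
theta = math.degrees(arrival_angle(delta, d=0.15))
```

The delay can never exceed d/c, which corresponds to speech arriving along the line joining the two microphones.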
[0060] Figure 5(A) is a diagram showing the phase difference between the first signal X1(ω, τ) and the second signal X2(ω, τ) as signals in the frequency domain. Figure 5(B) is a diagram showing the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) of the first signal X1(ω, τ) and the second signal X2(ω, τ). The cross spectrum D(ω, τ) of the first signal X1(ω, τ) and the second signal X2(ω, τ) is represented by the following equation (4).
[0061] D(ω,τ)=X1(ω,τ)X2*(ω,τ) (4), where * denotes the complex conjugate.
[0062] When the real part of the cross spectrum D(ω,τ) is denoted as K(ω,τ) and the imaginary part is denoted as Q(ω,τ), the cross spectrum D(ω,τ) is represented by the following equation (5).
[0063] D(ω,τ)=K(ω,τ)+jQ(ω,τ) (5)
[0064] In this case, the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) is represented by the following equation (6).
[0065] ΦD(ω,τ)=arctan(Q(ω,τ)/K(ω,τ)) (6)
[0066] As shown in Figure 5(A), the phase ΦD(ω, τ) obtained by equation (6) represents the phase angle between the corresponding spectral components of the received signals from the two microphones. Therefore, the value obtained by dividing the phase ΦD(ω, τ) by the angular frequency ω is the arrival time difference δ(ω, τ). This is expressed by the following equation (7).
[0067] δ(ω,τ)=ΦD(ω,τ)/ω (7)
[0068] When the frequency is f [Hz] and the angular frequency is ω = 2πf, the angle θ [rad] representing the direction of arrival is expressed by the following equation (8) when using equations (3) and (7).
[0069] θ=arcsin(c·ΦD(ω,τ)/(2πf·d)) (8)
[0070] As explained above, the arrival direction estimation unit 17 can use equation (8) to calculate the angle θ representing the arrival direction of the speech arriving at the first microphone 11a and the second microphone 11b.
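The chain from cross spectrum to arrival angle can be sketched for a single time-frequency bin as follows. This is a hedged illustration, not the patented implementation: it assumes the cross spectrum is formed as X1 multiplied by the complex conjugate of X2, uses atan2 in place of a plain arctangent so that the quadrant is preserved, and uses hypothetical function names and test values.

```python
import cmath
import math

def cross_spectrum_phase(x1, x2):
    """Phase of D = X1 * conj(X2) for one time-frequency bin (assumed convention)."""
    d_bin = x1 * x2.conjugate()
    return math.atan2(d_bin.imag, d_bin.real)

def estimate_angle(phi, f, d, c=340.0):
    """Angle in radians from phase phi at frequency f in Hz; None if out of range."""
    s = c * phi / (2.0 * math.pi * f * d)
    return math.asin(s) if -1.0 <= s <= 1.0 else None

# Synthetic bin: the second microphone lags the first by the delay for 30 degrees.
f, d, c = 500.0, 0.15, 340.0
delay = d * math.sin(math.radians(30.0)) / c
x1 = cmath.exp(0.3j)
x2 = x1 * cmath.exp(-2j * math.pi * f * delay)
theta_deg = math.degrees(estimate_angle(cross_spectrum_phase(x1, x2), f, d, c))
```

At 500 Hz the phase stays well inside the range of plus or minus π, so the angle is recovered without ambiguity; at higher frequencies the phase wraps, and that wrapping is the case the arrival direction phase table is meant to handle.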
[0071] <Example of Arrival Direction Phase Table 15>
[0072] Figure 6 is a diagram showing an example of the arrival direction phase table 15. In Figure 6, the distance difference corresponds to the arrival time difference δθ shown in Figure 4. The sample difference between microphones expresses this distance difference as a number of samples. In Figure 6, sinθ is the sine of the angle θ representing the arrival direction shown in Figure 4. In Figure 6, Θ[rad] is the angle θ representing the arrival direction in radians, and Θ[degree] is the same angle in degrees. In Figure 6, the cross-spectrum phase [degree] (f = 4kHz) is the phase ΦD(ω, τ) at f = 4 kHz, and the cross-spectrum phase [degree] (f = 6kHz) is the phase ΦD(ω, τ) at f = 6 kHz.
[0073] Figure 6 shows that when the arrival direction of high-frequency speech components such as f = 4 kHz and f = 6 kHz is determined solely from the phase, an incorrect angle θ is sometimes calculated. For example, in Figure 6, the case where the sample difference between microphones is 1 and the case where it is 5 yield the candidate arrival directions "8.1°" (i.e., 8.144301°) and "45.1°" (i.e., 45.09947°), respectively. That is, with a method that determines the arrival direction of speech solely from the phase, the arrival directions 8.1° and 45.1° cannot be distinguished in the 4 kHz frequency band. If the target speech and other speech arrive from the directions 8.1° and 45.1° respectively, they cannot be separated. Therefore, in the sound collection device 1 of Embodiment 1, the arrival direction of the speech is determined using both the phase ΦD(ω, τ) and the arrival direction estimated by the arrival direction estimation unit 17 from the low-frequency speech components.
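The ambiguity at 4 kHz can be reproduced with a few lines of arithmetic. The sampling rate (16 kHz), microphone spacing (0.15 m), and speed of sound (340 m/s) used below are assumptions that are not stated in the text; they were chosen because they reproduce the angles 8.144301° and 45.09947° quoted for sample differences of 1 and 5. The function name table_row is likewise hypothetical.

```python
import math

FS = 16000.0   # assumed sampling rate in Hz
D_MIC = 0.15   # assumed microphone spacing in m
C = 340.0      # assumed speed of sound in m/s

def table_row(sample_diff, freqs_hz=(4000.0, 6000.0)):
    """Angle (degrees) and wrapped cross-spectrum phase (degrees) per frequency."""
    sin_theta = sample_diff * C / (FS * D_MIC)
    theta_deg = math.degrees(math.asin(sin_theta))
    phases = {f: (360.0 * f * sample_diff / FS) % 360.0 for f in freqs_hz}
    return theta_deg, phases

theta_1, phase_1 = table_row(1)
theta_5, phase_5 = table_row(5)
same_phase_at_4k = phase_1[4000.0] == phase_5[4000.0]
```

Under these assumptions a delay of 1 sample and a delay of 5 samples both give a wrapped phase of 90° at 4 kHz, so the phase alone cannot tell the two arrival directions apart.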
[0074] <Action of the shielding coefficient determination unit 14>
[0075] Figure 7 is a diagram illustrating an example of the process for determining the shielding coefficient b(ω, τ) using the arrival direction phase table 15. The shielding coefficient determination unit 14 generates the shielding coefficient b(ω, τ) based on the phase ΦD(ω, τ) of the cross spectrum D(ω, τ), the angle θ representing the arrival direction of the speech, and the arrival direction phase table 15. For example, when the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) is 90° (f = 4 kHz) and the angle representing the arrival direction of the speech is 8.1°, if a matching entry 31 exists in the arrival direction phase table 15 (that is, if the predetermined condition is satisfied), the shielding coefficient determination unit 14 sets the shielding coefficient b(ω, τ) to 1. If no matching entry exists in the arrival direction phase table 15, the shielding coefficient determination unit 14 sets the shielding coefficient b(ω, τ) to 0.
[0076] As shown in the arrival direction phase table 15, the entries for which the phase ΦD(ω, τ) of the cross spectrum D(ω, τ) is 90° (f = 4 kHz) include both an entry with an arrival angle of 8.1° and an entry with an arrival angle of 45.1°. Therefore, if the shielding coefficient determination unit 14 estimated the arrival direction using only the phase ΦD(ω, τ) of the cross spectrum D(ω, τ), the arrival direction might be determined incorrectly. For this reason, in Embodiment 1, when the arrival direction phase table 15 contains data that match both the arrival direction estimated by the arrival direction estimation unit 17 and the phase ΦD(ω, τ) calculated by the phase calculation unit 13, the speech from the matching arrival direction is collected. Here, "match" does not mean that the calculated value is exactly the same as the value shown in the arrival direction phase table 15, but rather that it falls within a range (i.e., a band of a certain width) that allows a predetermined error around the value shown in the arrival direction phase table 15.
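The matching rule described above, including the tolerance around the table values, can be sketched as follows. The tuple layout of the table, the tolerance values, and the function name mask_coefficient are assumptions for illustration; only the 4 kHz entries with a phase of 90° and angles of 8.1° and 45.1° are taken from the text.

```python
def mask_coefficient(table, band_hz, phase_deg, est_angles_deg,
                     phase_tol=10.0, angle_tol=3.0):
    """Return 1 if a table entry matches the phase and an estimated angle, else 0."""
    for entry_band, entry_phase, entry_angle in table:
        if entry_band != band_hz:
            continue
        # Compare phases on the circle so that 359 and 1 degrees count as close.
        diff = abs((phase_deg - entry_phase + 180.0) % 360.0 - 180.0)
        if diff > phase_tol:
            continue
        if any(abs(entry_angle - a) <= angle_tol for a in est_angles_deg):
            return 1
    return 0

# (band in Hz, phase in degrees, arrival angle in degrees); values follow the
# 4 kHz example in the text, where 8.1 and 45.1 degrees share the phase 90.
TABLE = [(4000, 90.0, 8.1), (4000, 90.0, 45.1)]

b_target = mask_coefficient(TABLE, 4000, 92.0, [8.1])    # estimated angle in table
b_other = mask_coefficient(TABLE, 4000, 92.0, [20.0])    # no matching angle
b_far = mask_coefficient(TABLE, 4000, 150.0, [8.1])      # phase out of tolerance
```

The decision needs only table lookups and comparisons, which is consistent with the later remark that no probability calculations are required.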
[0077] Figure 8 is a flowchart illustrating an example of the calculation process for the arrival direction of speech. When the arrival direction estimation unit 17 receives the first signal X1(ω, τ) and the second signal X2(ω, τ) (step S101), it determines whether ω < ω0 (step S102). If the determination in step S102 is "yes", the arrival direction estimation unit 17 calculates the angle θ representing the arrival direction (step S103) and determines whether the number of times the angle θ has taken the same calculated value θx exceeds a predetermined number Nth (step S104). If the determination is "no" in step S102 or "no" in step S104, the arrival direction estimation unit 17 waits for the next input of the first signal X1(ω, τ) and the second signal X2(ω, τ).
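The flow of steps S101 to S104 can be sketched as a simple counting procedure. Several details below are assumptions, not statements from the text: angles are rounded to whole degrees to decide whether two calculated values are "the same", an angle whose count exceeds Nth is treated as an estimated arrival direction, and the value passed as omega0 in the example is hypothetical.

```python
import math
from collections import Counter

def estimate_arrival_directions(bins, omega0, d, c=340.0, n_th=3):
    """bins: iterable of (omega, phi) pairs, phi being the cross-spectrum phase in rad."""
    counts = Counter()
    for omega, phi in bins:
        if not 0.0 < omega < omega0:          # step S102: skip aliased components
            continue
        s = c * phi / (omega * d)             # step S103: sin(theta)
        if not -1.0 <= s <= 1.0:
            continue
        counts[round(math.degrees(math.asin(s)))] += 1
    # Step S104: keep values seen more than n_th times (assumed to become candidates).
    return sorted(angle for angle, n in counts.items() if n > n_th)

# Five low-frequency bins that agree on 30 degrees, plus one high-frequency bin.
delay = 0.15 * math.sin(math.radians(30.0)) / 340.0
low = [(2 * math.pi * f, 2 * math.pi * f * delay) for f in (200, 300, 400, 500, 600)]
high = [(2 * math.pi * 4000, 1.0)]
directions = estimate_arrival_directions(low + high, omega0=2 * math.pi * 1000, d=0.15)
```

Because each sufficiently frequent angle is reported separately, the same procedure can return several candidates such as θ1 and θ2 when more than one speaker is active.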
[0078] Figure 9 is a flowchart illustrating an example of the process for determining the shielding coefficient b(ω, τ). When the shielding coefficient determination unit 14 receives the phase ΦD(ω, τ) from the phase calculation unit 13 and the angle θ (e.g., θ1, θ2) representing the arrival direction of the speech from the arrival direction estimation unit 17, it determines whether the arrival direction phase table 15 contains data that match the phase ΦD(ω, τ) together with the angle θ1, or the phase ΦD(ω, τ) together with the angle θ2 (step S202). When the arrival direction phase table 15 contains data that match the arrival direction estimated by the arrival direction estimation unit 17 and the phase ΦD(ω, τ) calculated by the phase calculation unit 13, the shielding coefficient determination unit 14 sets the shielding coefficient to 1 (step S203). When the arrival direction phase table 15 contains no data that match the arrival direction estimated by the arrival direction estimation unit 17 and the phase ΦD(ω, τ) calculated by the phase calculation unit 13, the shielding coefficient determination unit 14 sets the shielding coefficient to 0 (step S204).
[0079] <Effects of Embodiment 1>
[0080] According to Embodiment 1, signals with an angular frequency ω lower than the angular frequency ω0, the lower limit at which spatial aliasing occurs, are used to estimate the direction of the speaker uttering the target speech, i.e., the arrival direction of the speech. The arrival direction of the speech is then determined based on the estimated arrival direction, the phase ΦD(ω, τ) of the cross spectrum D(ω, τ), and the arrival direction phase table 15. Therefore, sound sources of high-frequency speech components, which were sometimes difficult to separate accurately in the past, can be separated with high precision.
[0081] Furthermore, Embodiment 1 exploits the sparsity of speech, so the target speech can be separated with high accuracy even when the number of speakers (i.e., the number of sound sources) is unknown.
[0082] Furthermore, according to Embodiment 1, computationally intensive calculations such as probability calculations are not required, so the target speech can be separated with high precision using less computation.
[0083] Embodiment 2
[0084] Figure 10 is a functional block diagram that schematically illustrates the structure of the sound collection device 2 according to Embodiment 2. In Figure 10, structural elements that are the same as or correspond to those shown in Figure 1 are given the same reference numerals as in Figure 1. The sound collection device 2 is a device capable of performing the sound collection method of Embodiment 2. In the sound collection device 2, the arrival direction estimation unit 17a estimates the arrival direction of the speech based on images obtained by using the camera 20 to capture one or more speakers.
[0085] According to Embodiment 2, the arrival direction of the speech is determined based on the arrival direction estimated from the images, the phase ΦD(ω, τ) of the cross spectrum D(ω, τ), and the arrival direction phase table 15. Therefore, sound sources of high-frequency speech components, which were sometimes difficult to separate accurately in the past, can be separated with high precision.
[0086] Furthermore, according to Embodiment 2, computationally intensive calculations such as probability calculations are not required, so the target speech can be separated with high precision using less computation.
[0087] For matters other than those mentioned above, Embodiment 2 is the same as Embodiment 1.
[0088] Label Explanation
[0089] 1, 2: Sound collection device; 11a: First microphone; 11b: Second microphone; 12a: First Fourier transform unit; 12b: Second Fourier transform unit; 13: Phase calculation unit; 14: Shielding coefficient determination unit; 15: Arrival direction phase table; 16: Spatial aliasing calculation unit; 17, 17a: Arrival direction estimation unit; 18: Filter; 19: Inverse Fourier transform unit; 20: Camera; x1(t): First received signal; x2(t): Second received signal; X1(ω, τ): First signal; X2(ω, τ): Second signal; D(ω, τ): Cross spectrum; ΦD(ω, τ): Phase; b(ω, τ): Shielding coefficient; Y(ω, τ): Separated signal; y(t): Target speech signal.
Claims
1. A sound collection device that separates a target speech signal from a first received sound signal output from a first microphone that receives input speech and a second received sound signal output from a second microphone that receives the input speech, characterized in that the sound collection device comprises: a first Fourier transform unit that performs a Fourier transform on the first received sound signal to output a first signal; a second Fourier transform unit that performs a Fourier transform on the second received sound signal to output a second signal; an arrival direction estimation unit that estimates the arrival direction of the speech; a phase calculation unit that calculates the phase of the cross spectrum of the first signal and the second signal; a shielding coefficient determination unit that determines a shielding coefficient based on an arrival direction phase table, read from a pre-generated database, that represents the relationship between the phase and the arrival direction for each frequency band, the calculated phase, and the estimated arrival direction; a filter that uses the shielding coefficient to separate a signal from the first signal or the second signal; and an inverse Fourier transform unit that performs an inverse Fourier transform on the separated signal to output the target speech signal.
2. The sound collection device according to claim 1, characterized in that the shielding coefficient determination unit determines whether the arrival direction phase table contains a phase and an arrival direction that match the calculated phase and the estimated arrival direction; if a matching phase and arrival direction exist, the shielding coefficient determination unit sets the shielding coefficient to 1; if no matching phase and arrival direction exist, the shielding coefficient determination unit sets the shielding coefficient to 0; and the filter multiplies either the first signal or the second signal by the shielding coefficient.
3. The sound collection device according to claim 1 or 2, characterized in that the arrival direction estimation unit estimates the arrival direction based on those components of the first signal and the second signal whose angular frequency is lower than the lower limit at which spatial aliasing occurs.
4. The sound collection device according to claim 3, characterized in that, where ω0 is the angular frequency of the lower limit at which the spatial aliasing occurs, d is the interval between the first microphone and the second microphone, and c is the speed of sound, the arrival direction estimation unit estimates the arrival direction based on a signal with an angular frequency lower than ω0 = c/(2d).
5. The sound collection device according to claim 1 or 2, characterized in that the arrival direction estimation unit estimates the arrival direction based on images obtained by a camera capturing one or more speakers.
6. A sound collection method performed by a sound collection device that separates a target speech signal from a first received sound signal output from a first microphone that receives input speech and a second received sound signal output from a second microphone that receives the input speech, characterized in that the sound collection method comprises the following steps: performing a Fourier transform on the first received sound signal to output a first signal; performing a Fourier transform on the second received sound signal to output a second signal; estimating the arrival direction of the speech; calculating the phase of the cross spectrum of the first signal and the second signal; determining a shielding coefficient based on an arrival direction phase table, read from a pre-generated database, that represents the relationship between the phase and the arrival direction for each frequency band, the calculated phase, and the estimated arrival direction; separating a signal from the first signal or the second signal using the shielding coefficient; and performing an inverse Fourier transform on the separated signal to output the target speech signal.
7. A recording medium storing a sound collection program that causes a computer to perform processing to separate a target speech signal from a first received sound signal output from a first microphone that receives input speech and a second received sound signal output from a second microphone that receives the input speech, characterized in that the sound collection program causes the computer to perform the following steps: performing a Fourier transform on the first received sound signal to output a first signal; performing a Fourier transform on the second received sound signal to output a second signal; estimating the arrival direction of the speech; calculating the phase of the cross spectrum of the first signal and the second signal; determining a shielding coefficient based on an arrival direction phase table, read from a pre-generated database, that represents the relationship between the phase and the arrival direction for each frequency band, the calculated phase, and the estimated arrival direction; separating a signal from the first signal or the second signal using the shielding coefficient; and performing an inverse Fourier transform on the separated signal to output the target speech signal.