Target object positioning method and apparatus

CN116753952B · Active · Publication Date: 2026-06-19 · DINGTALK (CHINA) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: DINGTALK (CHINA) INFORMATION TECH CO LTD
Filing Date: 2023-05-11
Publication Date: 2026-06-19

Abstract

This specification provides a method and apparatus for locating a target object. The method includes: determining a target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone; determining a target audio signal of a target object collected by a plurality of the microphones; processing the target audio signal to determine a target angle of the target object relative to the target microphone array based on the processing result; determining a sub-angle of the target object relative to each subarray based on the target angle; and determining the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each subarray. By combining the target angle and multiple sub-angles, more computational data can be provided when determining the position of the target object relative to the target microphone array, enabling the location of the target object based on the sound emitted by the target object.

Description

Technical Field

[0001] The embodiments in this specification relate to the field of speech processing technology, and in particular to a method for locating target objects.

Background Technology

[0002] With the development of electronic information technology and acoustic technology, sound source localization technology based on microphone arrays has emerged. A microphone array can be understood as an array composed of several microphones arranged in a certain spatial geometric structure. For example, in video conferencing, microphone arrays can be used to collect the speaker's audio signal, thereby enabling other participants in the video conference to know the speaker's audio content.

[0003] However, in video conferencing, to ensure clear and accurate capture of the speaker's audio and video, it is often necessary to locate the speaker. Therefore, there is an urgent need for an effective technical solution to achieve speaker location.

Summary of the Invention

[0004] In view of the above, embodiments of this specification provide a method for locating a target object. One or more embodiments of this specification also relate to a target object location device, a conferencing device, a target object processing method, a target object processing apparatus, a computing device, a computer-readable storage medium, and a computer program, to address the technical deficiencies existing in the prior art.

[0005] According to a first aspect of the embodiments of this specification, a method for locating a target object is provided, comprising:

[0006] A target microphone array is determined, wherein the target microphone array comprises at least two subarrays, and each subarray comprises a microphone;

[0007] Determine the target audio signal of the target object collected by the multiple microphones;

[0008] The target audio signal is processed, and the target angle of the target object relative to the target microphone array is determined based on the processing result.

[0009] Based on the target angle, determine the sub-angles of the target object relative to each sub-array;

[0010] The position of the target object relative to the target microphone array is determined based on the target angle and the sub-angles of the target object relative to each sub-array.

[0011] According to a second aspect of the embodiments of this specification, a target object positioning device is provided, comprising:

[0012] The first determining module is configured to determine a target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone;

[0013] The second determining module is configured to determine the target audio signal of the target object collected by the plurality of microphones;

[0014] The processing module is configured to process the target audio signal and determine the target angle of the target object relative to the target microphone array based on the processing result;

[0015] The third determining module is configured to determine the sub-angle of the target object relative to each sub-array based on the target angle;

[0016] The fourth determining module is configured to determine the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array.

[0017] According to a third aspect of the embodiments of this specification, a conference device is provided, comprising:

[0018] Target microphone array, memory, and processor;

[0019] Each microphone in the target microphone array is used to collect the target audio signal of the target object;

[0020] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the above method.

[0021] According to a fourth aspect of the embodiments of this specification, a target object processing method is provided, applied to an end-side device, comprising:

[0022] Receive initial images of a set of target objects captured by a camera device, and display the initial images on the display interface;

[0023] According to the target object localization method described in the embodiments of this specification, the position of the target object relative to the target microphone array is determined, wherein the target object is any one of the target objects in the set of target objects;

[0024] The system receives a target image of the target object captured by the camera device and displays the target image on the display interface. The target image is captured by the camera device based on the position of the target object relative to the microphone array.

[0025] According to a fifth aspect of the embodiments of this specification, a target object processing apparatus is provided, applied to an end-side device, comprising:

[0026] The first display module is configured to receive an initial image of a set of target objects captured by a camera device and display the initial image on a display interface.

[0027] The determination module is configured to determine the position of a target object relative to a target microphone array according to the target object localization method described in the embodiments of this specification, wherein the target object is any one of the target objects in the set of target objects;

[0028] The second display module is configured to receive a target image of the target object captured by the camera device and display the target image on a display interface, wherein the target image is captured by the camera device based on the position of the target object relative to the microphone array.

[0029] According to a sixth aspect of the embodiments of this specification, a computing device is provided, comprising:

[0030] Memory and processor;

[0031] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the above method.

[0032] According to a seventh aspect of an embodiment of this specification, a computer-readable storage medium is provided that stores computer-executable instructions that, when executed by a processor, implement the steps of the method described above.

[0033] According to an eighth aspect of the embodiments of this specification, a computer program is provided, wherein when the computer program is executed in a computer, it causes the computer to perform the steps of the above-described method.

[0034] One embodiment of this specification provides a method for locating a target object, comprising: determining a target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone; determining target audio signals of a target object collected by a plurality of the microphones; processing the target audio signals; determining a target angle of the target object relative to the target microphone array based on the processing result; determining a sub-angle of the target object relative to each subarray based on the target angle; and determining the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each subarray.

[0035] When determining the position of the target object relative to the target microphone array, the above method considers not only the target angle of the target object relative to the target microphone array, but also the sub-angles of the target object relative to each sub-array. By combining the target angle and multiple sub-angles, more computational data can be provided when determining the position of the target object relative to the target microphone array, so as to realize the positioning of the target object based on the sound emitted by the target object, and further ensure the accuracy of the target object positioning.

Attached Figure Description

[0036] Figure 1 is a schematic diagram illustrating an application scenario of a target object localization method provided in one embodiment of this specification;

[0037] Figure 2 is a flowchart of a target object localization method provided in one embodiment of this specification;

[0038] Figure 3 is a schematic diagram of the target microphone array in a target object localization method provided in one embodiment of this specification;

[0039] Figure 4 is a schematic diagram of the reference coordinate axis of the target microphone array in a target object processing method provided in one embodiment of this specification;

[0040] Figure 5 is a schematic diagram of a scanning method in a target object localization method provided in an embodiment of this specification;

[0041] Figure 6 is a schematic diagram of a positioning result of a target object positioning method provided in one embodiment of this specification;

[0042] Figure 7 is a schematic diagram of another positioning result of a target object positioning method provided in one embodiment of this specification;

[0043] Figure 8 is a flowchart illustrating the processing steps of a target object localization method provided in one embodiment of this specification;

[0044] Figure 9 is a schematic diagram illustrating a target object localization method provided in one embodiment of this specification applied to a teaching scenario;

[0045] Figure 10 is a schematic diagram of the structure of a target object positioning device provided in one embodiment of this specification;

[0046] Figure 11 is a flowchart of a target object processing method provided in one embodiment of this specification;

[0047] Figure 12 is a schematic diagram illustrating an application scenario of a target object processing method provided in one embodiment of this specification;

[0048] Figure 13 is a schematic diagram of the structure of a target object processing device provided in one embodiment of this specification;

[0049] Figure 14 is a structural block diagram of a computing device provided in one embodiment of this specification.

Detailed Implementation

[0050] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.

[0051] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” as used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items.

[0052] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."

[0053] Furthermore, it should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in one or more embodiments of this specification are all information and data authorized by the user or fully authorized by all parties. Moreover, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0054] First, the terms and concepts used in one or more embodiments of this specification will be explained.

[0055] Cross power spectral density can be used to describe the frequency information between two random vibration processes. It can provide not only the magnitude of energy distributed by frequency, but also the relationship between the two signals.

[0056] This specification provides a target object positioning method, and also relates to a target object positioning device, a conferencing device, a target object processing method, a target object processing apparatus, a computing device, and a computer-readable storage medium, which will be described in detail in the following embodiments.

[0057] Referring to Figure 1, Figure 1 shows a schematic diagram of an application scenario of a target object localization method provided according to an embodiment of this specification.

[0058] Figure 1 includes conference equipment 102.

[0059] In practical implementation, in online live streaming scenarios, speakers (such as course instructors or conference presenters) typically utilize conference equipment 102 to capture their audio and video. When speaking, the speaker can face the conference equipment 102, facilitating the capture of their audio and video, thus enabling live streaming of the conference or online course. The conference equipment 102 is equipped with a target microphone array, which can be used to capture the speaker's audio signal. After determining the target audio signal of the speaker captured by each microphone in the target microphone array, the conference equipment 102 can determine the target angle of the speaker relative to the target microphone array based on this target audio signal. Furthermore, it can also determine the sub-angles of the speaker relative to each subarray within the target microphone array. The speaker's position relative to the target microphone array is then determined based on the target angle and the sub-angles of each subarray.

[0060] After determining the speaker's position relative to the target microphone array, the conferencing device 102 can use that position to zoom or track the speaker's video image, enabling multi-party participation in the video conference.

[0061] In addition, the conference equipment 102 can also communicate with cloud-based devices. After the target microphone array collects the speaker's voice signal, it can send the voice signal to the cloud-based device. The cloud-based device performs noise reduction on the voice signal and locates the speaker based on the voice signal. The cloud-based device then sends the location result to the conference equipment 102, which uses the location result to perform subsequent operations.

[0062] Referring to Figure 2, Figure 2 shows a flowchart of a target object localization method according to an embodiment of this specification, which specifically includes the following steps.

[0063] Step 202: Determine the target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone.

[0064] Specifically, the target object localization method provided in this specification can be applied to conference terminal equipment in audio and video conferencing systems. The target object can be, for example, a speaker in a video conference. Using this method, the conference terminal equipment can collect the speaker's audio and locate the speaker based on the audio. This allows for subsequent magnification of the speaker's partial image based on their location information, enabling other participants in the video conference to see the speaker more clearly and thus enabling guided broadcasting. Alternatively, when the speaker's position changes, the method can be used to track and guide the speaker based on their location. This target object localization method can locate one speaker or two or more speakers. Furthermore, this method can also be applied to audio interaction scenarios such as voice calls. In voice call scenarios, speaker localization is necessary to facilitate subsequent audio collection based on the speaker's location, resulting in clearer audio.

[0065] For ease of understanding, the embodiments in this specification all use the application of the target object location method in a video conferencing scenario as an example for detailed description, but this does not affect the implementation of the target object location method in other feasible scenarios.

[0066] The target microphone array can be understood as an array of microphones arranged according to a spatial geometric structure. In practical applications, the target microphone array can be set up in conferencing equipment to collect audio signals from the video conferencing room. The target microphone array can collect multiple audio signals. From the perspective of microphone type, the target microphone array can be an array of omnidirectional microphones or an array of directional microphones. From the perspective of array shape, the microphone array can be a linear array, a circular array, or an irregularly shaped array. This specification does not limit this. A subarray can be understood as a subarray obtained by dividing the microphones included in the target microphone array.

[0067] Specifically, as shown in Figure 3, which is a schematic diagram of the target microphone array in a target object localization method according to an embodiment of this specification, the target microphone array 300 includes a first subarray 310, a second subarray 320, and a third subarray 330. Each subarray includes a microphone M.

[0068] It should be understood that Figure 3 is merely a schematic diagram of one target microphone array. The number of subarrays included in the target microphone array can be determined according to actual needs. For example, the target microphone array may include 2 subarrays, 3 subarrays, 4 subarrays, etc. The embodiments in this specification are not limited here.
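As a non-limiting illustration, the division of a target microphone array into subarrays can be sketched as a small data structure. The class names, the microphone labels, the number of microphones per subarray, and the positions in metres are all hypothetical and are not taken from this specification; only the constraints (at least two subarrays, each including a microphone) come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Microphone:
    name: str
    x: float  # hypothetical position along the array axis, in metres


@dataclass
class SubArray:
    name: str
    microphones: List[Microphone] = field(default_factory=list)


@dataclass
class MicrophoneArray:
    subarrays: List[SubArray]

    def __post_init__(self):
        # Constraints stated in the text: at least two subarrays,
        # and each subarray includes a microphone.
        if len(self.subarrays) < 2:
            raise ValueError("the array needs at least two subarrays")
        if any(not s.microphones for s in self.subarrays):
            raise ValueError("each subarray must include a microphone")

    @property
    def microphones(self) -> List[Microphone]:
        # Flatten the subarrays into the full list of microphones.
        return [m for s in self.subarrays for m in s.microphones]


# Three subarrays, loosely mirroring 310, 320 and 330 (layout is assumed).
array_300 = MicrophoneArray([
    SubArray("310", [Microphone("M1", -0.10), Microphone("M2", -0.08)]),
    SubArray("320", [Microphone("M3", -0.01), Microphone("M4", 0.01)]),
    SubArray("330", [Microphone("M5", 0.08), Microphone("M6", 0.10)]),
])
```

Keeping the subarray grouping explicit makes it straightforward to process either the whole array or one subarray at a time.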

[0069] Step 204: Determine the target audio signal of the target object collected by the multiple microphones.

[0070] Specifically, after determining the target microphone array, the target audio signal of the target object can be determined by each microphone in the target microphone array.

[0071] In this context, "multiple microphones" can be understood as a subset of microphones or each microphone in a target microphone array. For example, a target microphone array may include 10 microphones. It is possible to determine the target audio signal collected by 5 of these 10 microphones, or to determine the target audio signal collected by each of the 10 microphones. The number of microphones determined can be specified according to actual needs, and this specification does not limit this.

[0072] The target object can be understood as the speaker. Therefore, the target audio signal of the target object can be understood as the audio signal corresponding to the sound emitted by the speaker when speaking.

[0073] Based on this, the audio signal corresponding to the sound emitted by the speaker when speaking can be determined from each microphone in the target microphone array.

[0074] For example, when the target microphone array includes three microphones m1, m2 and m3, the target audio signal 1 of the target object collected by microphone m1 can be determined; the target audio signal 2 of the target object collected by microphone m2 can be determined; and the target audio signal 3 of the target object collected by microphone m3 can be determined.

[0075] In practical implementation, video conferencing scenarios may introduce noise from other sound sources. Furthermore, background noise generated by the electrical system can also contribute to noise interference when using a target microphone array for audio signal acquisition. Additionally, reverberation may occur in larger meeting rooms. Therefore, to ensure more accurate subsequent positioning, it is necessary to denoise the audio signal acquired by the microphones to obtain the denoised target audio signal. The specific implementation method is as follows:

[0076] The determination of the target audio signal of the target object collected by the multiple microphones includes:

[0077] Determine the initial audio signal of the target object collected by the multiple microphones;

[0078] The initial audio signal is denoised to obtain the target audio signal of the target object collected by the plurality of microphones.

[0079] The initial audio signal can be understood as the audio signal captured by the microphone before denoising. The target audio signal can be understood as the denoised audio signal obtained after denoising the initial audio signal captured by the microphone.

[0080] Specifically, the initial audio signal of the speaker collected by each microphone in the target microphone array can be determined, and the initial audio signal can be denoised to obtain the target audio signal of the speaker collected by each microphone.

[0081] Using the previous example, for the initial audio signal of the speaker collected by microphone m1, the initial audio signal can be denoised to obtain the denoised target audio signal 1; correspondingly, for the initial audio signal of the speaker collected by microphone m2, the initial audio signal can be denoised to obtain the denoised target audio signal 2; for the initial audio signal of the speaker collected by microphone m3, the initial audio signal can be denoised to obtain the denoised target audio signal 3.

[0082] In summary, by denoising the initial audio signal collected by each microphone in the target microphone array to obtain the denoised target audio signal, the interference of reverberation and noise can be reduced when locating the speaker based on the audio signal, thereby further ensuring the accuracy of the localization result.

[0083] Specifically, when denoising the initial audio signal, a filter can be used to separate the initial audio signal from the noise signal to obtain the denoised target audio signal.

[0084] In one embodiment provided in this specification, the initial audio signal can also be denoised by noise estimation, as specifically implemented as follows:

[0085] The step of denoising the initial audio signal to obtain the target audio signal of the target object collected by the plurality of microphones includes:

[0086] Determine the signal-to-noise ratio of the first initial audio signal acquired by any one microphone, and determine the first cross power spectral density between the first initial audio signal and the initial audio signal;

[0087] Based on the signal-to-noise ratio, determine the second cross-power spectral density between the noise of the first initial audio signal and the noise of the initial audio signal;

[0088] Based on the first cross power spectral density and the second cross power spectral density, the initial audio signal is denoised to obtain the target audio signal of the target object collected by the plurality of microphones.

[0089] In this context, the cross-power spectral density includes the phase difference between the audio signals from different channels. This phase difference can be used to represent the time difference between the sound emitted by the target object reaching the microphone, and this time difference can be used for subsequent target object localization. The first initial audio signal can be understood as any initial audio signal. For example, for three initial audio signals collected by three microphones, the first initial audio signal is any one of these three initial audio signals. The signal-to-noise ratio (SNR) of the first initial audio signal can be understood as the ratio between the first initial audio signal and its noise. Based on the SNR of the first initial audio signal, the second cross-power spectral density can be determined between the noise of the initial audio signal collected by each microphone and the noise of the first initial audio signal.
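The relationship between the phase of the cross-power spectrum and the arrival-time difference can be illustrated with a minimal sketch. It assumes an idealized case that is not part of this specification: two noise-free channels where the second is a circularly shifted copy of the first. The function name and the test signal are hypothetical.

```python
import numpy as np


def delay_from_cross_spectrum(x_ref, x_other):
    """Estimate by how many samples x_other lags x_ref (circular model)."""
    X_ref = np.fft.fft(x_ref)
    X_other = np.fft.fft(x_other)
    # The phase of each bin of the cross-spectrum is the phase difference
    # between the two channels at that frequency.
    cross = X_other * np.conj(X_ref)
    # Transforming back gives the circular cross-correlation, whose peak
    # sits at the delay. Indices above len(x) / 2 correspond to negative lags.
    correlation = np.fft.ifft(cross).real
    return int(np.argmax(correlation))


rng = np.random.default_rng(0)
source = rng.standard_normal(1024)      # hypothetical sound at microphone 1
true_delay = 5                          # samples, assumed for the demo
delayed = np.roll(source, true_delay)   # same sound at microphone 2
estimated_delay = delay_from_cross_spectrum(source, delayed)
```

In a real room the channels also contain noise and reverberation, which is why the denoising described in the surrounding paragraphs is applied before localization.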

[0090] Specifically, the first cross-power spectral density between the first initial audio signal and the initial audio signal can be obtained from the vector of the initial audio signal in the frequency domain. The vector of the initial audio signal in the frequency domain can be obtained by performing a short-time Fourier transform on the initial audio signal. Correspondingly, the second cross-power spectral density can be obtained from the vector of the initial audio signal in the frequency domain.

[0091] Based on this, the initial audio signals of the target object collected by each microphone can be determined. A short-time Fourier transform is performed on the initial audio signals to convert them from the time domain to the frequency domain, obtaining the vector of each initial audio signal in the frequency domain. From the initial audio signals collected by the microphones, any one is selected as the first initial audio signal, and noise estimation is performed on this first initial audio signal to determine its signal-to-noise ratio (SNR). Based on the SNR of the first initial audio signal and the frequency-domain vectors, a second cross-power spectral density is obtained between the noise of the first initial audio signal and the noise of each initial audio signal. Based on the frequency-domain vectors, a first cross-power spectral density is obtained between the first initial audio signal and each initial audio signal. The second cross-power spectral density is subtracted from the first cross-power spectral density to obtain the denoised target audio signal. This achieves denoising of the audio signal.

[0092] In practical applications, noise estimation of the first initial audio signal can be achieved using the minimum statistical noise estimation algorithm, which can be used to estimate the power spectral density of steady-state noise. Alternatively, other noise estimation algorithms can be used, but this specification does not limit the specific implementation.
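The idea behind minimum-statistics noise estimation can be sketched as follows. This is a heavily simplified illustration and not the complete published algorithm: it only smooths the power of each frequency bin over time and tracks the minimum over a sliding window, omitting the bias compensation and adaptive smoothing of the full method. The function name, the parameter values, and the synthetic input are assumptions.

```python
import numpy as np


def track_noise_psd(power, alpha=0.7, window=8):
    """Simplified minimum tracking.

    power: array of shape (K, N), per-bin power for K bins and N frames.
    Returns an array of the same shape with the noise power estimate.
    """
    power = np.asarray(power, dtype=float)
    smoothed = np.empty_like(power)
    noise = np.empty_like(power)
    smoothed[:, 0] = power[:, 0]
    noise[:, 0] = power[:, 0]
    for n in range(1, power.shape[1]):
        # Recursive smoothing of the power in each bin.
        smoothed[:, n] = alpha * smoothed[:, n - 1] + (1 - alpha) * power[:, n]
        # Steady-state noise is taken as the minimum of the smoothed power
        # over the most recent `window` frames.
        start = max(0, n - window + 1)
        noise[:, n] = smoothed[:, start:n + 1].min(axis=1)
    return noise


# Synthetic input: a constant noise floor of 1.0 with a short loud burst.
power = np.ones((4, 60))
power[:, 20:23] = 100.0
noise_psd = track_noise_psd(power)
```

In this sketch the window should be longer than the longest expected burst plus the decay of the smoothing; otherwise the estimate rises briefly after the burst before returning to the floor.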

[0093] When the initial audio signal is converted from the time domain to the frequency domain using the short-time Fourier transform, the vector composed of the initial audio signals collected by the M microphones after conversion to the frequency domain is shown in the following formula (1).

[0094] $$y(k,n) = [y_1(k,n),\ y_2(k,n),\ \ldots,\ y_M(k,n)] \tag{1}$$

[0095] Where k is the frequency domain index, n is the time domain index, and M is the number of microphones. y1(k, n) is the frequency domain vector of the initial audio signal captured by the first microphone. y(k, n) is the frequency domain vector of the initial audio signals captured by the M microphones.
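The conversion of the M initial audio signals into the frequency-domain vector of formula (1) can be sketched with a plain short-time Fourier transform. The frame length, hop size, Hann window, and random test signals are assumptions chosen for illustration; this specification does not fix them.

```python
import numpy as np


def stft(signal, frame_len=256, hop=128):
    """Short-time Fourier transform of one channel.

    Returns an array of shape (K, N): K frequency bins, N time frames.
    Assumes len(signal) >= frame_len.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One real FFT per frame; transpose so that k indexes rows, n columns.
    return np.fft.rfft(frames, axis=1).T


def multichannel_stft(channels, frame_len=256, hop=128):
    """Stack the per-microphone transforms into shape (M, K, N)."""
    return np.stack([stft(ch, frame_len, hop) for ch in channels])


rng = np.random.default_rng(0)
initial_signals = rng.standard_normal((3, 1024))  # 3 microphones (assumed)
y = multichannel_stft(initial_signals)
```

Here `y[m - 1, k, n]` plays the role of y_m(k, n) in formula (1), with zero-based indexing in the code.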

[0096] The first cross power spectral density between the initial audio signal and the first initial audio signal is shown in the following formula (2).

[0097] $$Y(k,n) = \lambda\, Y(k,n-1) + (1-\lambda)\, y(k,n) \tag{2}$$

[0098] Where $Y(k,n) = [Y_{1,r}(k,n), \ldots, Y_{m,r}(k,n), \ldots, Y_{M,r}(k,n)]$ represents the cross-power spectral density of the initial audio signals collected by the M microphones. λ is the smoothing coefficient, ranging from 0 to 1. m is the microphone index. r is the index of the microphone that collects the first initial audio signal (i.e., the microphone used for noise estimation). $Y_{1,r}(k,n)$ represents the cross-power spectral density between the initial audio signal acquired by the first microphone and the first initial audio signal. $Y_{m,r}(k,n)$ represents the cross-power spectral density between the initial audio signal acquired by the m-th microphone and the first initial audio signal. $Y_{M,r}(k,n)$ represents the cross-power spectral density between the initial audio signal acquired by the M-th microphone and the first initial audio signal.
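The recursive smoothing of formula (2) can be sketched as follows. One point is an assumption rather than something stated in the text: the instantaneous term fed into the recursion is taken to be the product of each channel with the complex conjugate of the reference channel r, which is a common way to form a cross-spectrum. The function name, the default λ, and the zero initial state are also hypothetical.

```python
import numpy as np


def smooth_cross_spectrum(y, r=0, lam=0.9):
    """Recursively smoothed cross-spectrum against reference channel r.

    y: complex array of shape (M, K, N) (microphones, bins, frames).
    Returns Y with the same shape, where Y[m, k, n] corresponds to Y_{m,r}(k, n).
    """
    y = np.asarray(y, dtype=complex)
    # Assumption: instantaneous cross-spectrum y_m(k, n) * conj(y_r(k, n)).
    instantaneous = y * np.conj(y[r])
    Y = np.zeros_like(instantaneous)
    previous = np.zeros(instantaneous.shape[:2], dtype=complex)
    for n in range(instantaneous.shape[2]):
        # Y(k, n) = lam * Y(k, n - 1) + (1 - lam) * instantaneous term
        previous = lam * previous + (1 - lam) * instantaneous[:, :, n]
        Y[:, :, n] = previous
    return Y


# Constant input: every bin of every channel equals 1 + 1j, so the
# instantaneous term is (1 + 1j) * (1 - 1j) = 2 in every frame.
demo = np.full((2, 3, 5), 1 + 1j)
Y_demo = smooth_cross_spectrum(demo)
```

A λ close to 1 gives a smoother but slower-reacting estimate, while a smaller λ follows changes in the signal more quickly.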

[0099] The second cross power spectral density between the noise of the first initial audio signal and the noise of the initial audio signal is shown in the following formula (3).

[0100]

[0101] Where $N(k,n) = [N_{1,r}(k,n), \ldots, N_{m,r}(k,n), \ldots, N_{M,r}(k,n)]$ represents the second cross-power spectral density between the noise of the initial audio signals collected by the M microphones and the noise of the first initial audio signal, and $S_r(k,n)$ represents the signal-to-noise ratio of the first initial audio signal. $N_{1,r}(k,n)$ represents the second cross-power spectral density between the noise of the initial audio signal acquired by the first microphone and the noise of the first initial audio signal. $N_{m,r}(k,n)$ represents the second cross-power spectral density between the noise of the initial audio signal acquired by the m-th microphone and the noise of the first initial audio signal. $N_{M,r}(k,n)$ represents the second cross-power spectral density between the noise of the initial audio signal collected by the M-th microphone and the noise of the first initial audio signal.

[0102] The target audio signal obtained by subtracting the second cross power spectral density from the first cross power spectral density is shown in the following formula (4).

[0103] $$\Phi(k,n) = Y(k,n) - N(k,n) = [\Phi_1(k,n), \ldots, \Phi_m(k,n), \ldots, \Phi_M(k,n)] \tag{4}$$

[0104] Where $\Phi_m(k,n) = Y_{m,r}(k,n) - N_{m,r}(k,n)$ represents the target audio signal obtained after denoising the initial audio signal collected by the m-th microphone. $\Phi(k,n)$ represents the target audio signals obtained after denoising the initial audio signals collected by the M microphones respectively. $\Phi_M(k,n)$ represents the target audio signal obtained after denoising the initial audio signal collected by the M-th microphone.
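The subtraction in formula (4) can be illustrated with a small numeric sketch. The function name and the values of Y and N are made up for the example; in particular, N is simply supplied as an input here, and the sketch does not attempt to compute it from the signal-to-noise ratio.

```python
import numpy as np


def denoise_cross_spectrum(Y, N):
    """Formula (4): element-wise Phi = Y - N for matching shapes."""
    Y = np.asarray(Y, dtype=complex)
    N = np.asarray(N, dtype=complex)
    if Y.shape != N.shape:
        raise ValueError("Y and N must have the same shape")
    return Y - N


# Hypothetical values for M = 2 microphones and 2 frequency bins in one frame.
Y = np.array([[4 + 2j, 3 + 0j],
              [5 - 1j, 2 + 2j]])
N = np.array([[1 + 0j, 1 + 0j],
              [2 - 1j, 0.5 + 0j]])
Phi = denoise_cross_spectrum(Y, N)
```

Because the subtraction acts on complex values, the phase information carried by the cross-spectrum is retained in the result.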

[0105] Continuing with the previous example, in a target microphone array comprising three microphones m1, m2, and m3, the following frequency-domain vectors are determined: X1 for the initial audio signal m1 acquired by microphone m1, X2 for the initial audio signal m2 acquired by microphone m2, X3 for the initial audio signal m3 acquired by microphone m3, and the vector (X1, X2, X3) formed by the three initial audio signals in the frequency domain. Among these three initial audio signals, the initial audio signal m1 acquired by microphone m1 is taken as the first initial audio signal. Noise estimation is performed on the first initial audio signal m1 to determine its signal-to-noise ratio (SNR). Based on the SNR of the first initial audio signal m1 and the frequency-domain vector X1, the second cross-power spectral density mX1 between the noise of the initial audio signal m1 and the noise of the first initial audio signal m1 is determined. Based on the frequency-domain vector X1 of the initial audio signal m1, the first cross-power spectral density XX1 between the initial audio signal m1 and the first initial audio signal m1 is determined. Subtracting the second cross-power spectral density mX1 from the first cross-power spectral density XX1 yields the denoised target audio signal 1 corresponding to the initial audio signal m1. When denoising the initial audio signal m2, the second cross-power spectral density mX2 between the noise of the initial audio signal m2 and the noise of the first initial audio signal m1 is determined based on the SNR of the first initial audio signal m1 and the frequency-domain vector X2 of the initial audio signal m2. The first cross-power spectral density XX2 between the initial audio signal m2 and the first initial audio signal m1 is determined based on the frequency-domain vector X2 of the initial audio signal m2. Subtracting the second cross-power spectral density mX2 from the first cross-power spectral density XX2 yields the denoised target audio signal 2 corresponding to the initial audio signal m2. A similar operation is performed on the initial audio signal m3 to obtain the denoised target audio signal 3.

[0106] In summary, denoising the audio signals through noise estimation reduces the impact of noise and reverberation on target object localization. Furthermore, since the audio signals collected by the microphones can be regarded as the same, the cross-power spectral densities can be calculated from the signal-to-noise ratio of the audio signal collected by any single microphone, which reduces computation and further improves the efficiency of target object localization.
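
The denoising relation Φ_m(k, n) = Y_{m,r}(k, n) - N_{m,r}(k, n) can be illustrated with a minimal Python sketch. The specification's formula for deriving the second cross-power spectral density from the SNR is not reproduced in this passage, so the noise model below (the noise share 1 / (1 + SNR) of the reference microphone applied to every microphone) is an assumption made only for illustration, as are the function name `denoise_cross_spectra` and the array layout.

```python
import numpy as np

def denoise_cross_spectra(X, snr_ref, ref=0):
    """Sketch of Phi_m = Y_{m,r} - N_{m,r}.

    X       : complex spectra, shape (M, K, N) = (microphones, frequency bins, frames)
    snr_ref : linear SNR of the reference microphone, shape (K, N)
    ref     : index r of the microphone chosen for noise estimation
    """
    X = np.asarray(X, dtype=complex)
    snr_ref = np.asarray(snr_ref, dtype=float)
    # First cross-power spectral density: every signal against the reference signal.
    Y = X * np.conj(X[ref])[None, :, :]
    # Second cross-power spectral density (ASSUMED model, not taken from the
    # specification): the reference noise share 1 / (1 + SNR) is applied to all microphones.
    N = Y / (1.0 + snr_ref)[None, :, :]
    return Y - N
```

Under this assumed model a bin with SNR 0 is suppressed entirely, while a bin with a very high SNR passes almost unchanged. Only one SNR estimate is needed for all M microphones, which matches the computational saving described above.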

[0107] Step 206: Process the target audio signal and determine the target angle of the target object relative to the target microphone array based on the processing result.

[0108] Specifically, after determining the denoised target audio signal collected by each microphone, the target audio signal can be processed, and the target angle of the target object relative to the target microphone array can be determined based on the processing result.

[0109] The target angle of the target object relative to the target microphone array can be understood as the target angle of the target object under the target reference coordinate axis of the target microphone array. The target angle can be understood as the orientation angle of the target object relative to the target reference coordinate axis of the target microphone array.

[0110] The target reference coordinate axis of the target microphone array can be understood as a coordinate system with the linear array of microphones as the x-axis and the direction facing the target object as the y-axis. The target reference point of this coordinate system can be located anywhere within the target microphone array. In practical applications, to keep the camera device coordinate system and the microphone coordinate system aligned, the target reference point can be located at the camera device.

[0111] For details, see Figure 4, which shows a schematic diagram of the reference coordinate axes of a target microphone array in a target object processing method according to an embodiment of this specification.

[0112] As shown in Figure 4, the target microphone array 400 includes a first subarray 410, a second subarray 420, and a third subarray 430. Here, xO1y1 is the first reference coordinate axis of the first subarray 410, xOy is the target reference coordinate axis of the target microphone array 400 and the third subarray 430, and xO2y2 is the second reference coordinate axis of the second subarray 420. θ1 is the first sub-angle of the target object p relative to the first subarray 410, i.e., the first sub-angle of the target object p under the first reference coordinate axis. θ2 is the second sub-angle of the target object p relative to the second subarray 420, i.e., the second sub-angle of the target object p under the second reference coordinate axis. θ0 is the target angle of the target object p relative to the target microphone array 400, i.e., the target angle of the target object p under the target reference coordinate axis. θ3 is the third sub-angle of the target object p relative to the third subarray 430, i.e., the third sub-angle of the target object p under the target reference coordinate axis. The vertical distance between the target object p and the target microphone array is D. When locating the target object later, it is necessary to determine the vertical distance D and the target angle θ0.

[0113] In practice, to calculate the target angle of the target object relative to the target microphone array, a preset scanning algorithm can be used. The target audio signal is used as the input signal to scan a preset scanning range to obtain the target angle that meets the preset scanning conditions. The specific implementation method is as follows:

[0114] The step of processing the target audio signal and determining the target angle of the target object relative to the target microphone array based on the processing result includes:

[0115] The target audio signal is used as the input signal, and a preset scanning algorithm is used to scan a preset scanning range, wherein the preset scanning range is determined according to a preset scanning angle and a preset scanning distance;

[0116] The target angle of the target object relative to the target microphone array is determined based on the scanning results.

[0117] The preset scanning distance can be determined based on the actual size of the target microphone array.

[0118] In practical applications, the preset scanning algorithm can be understood as any algorithm used to estimate the location of a sound source. The preset scanning angle can range from 0 to 180 degrees and can be defined in the target reference coordinate axis of the target microphone array. See Figure 5, which shows a schematic diagram of a scanning method in a target object localization method provided according to an embodiment of this specification. As shown in Figure 5, a preset scanning angle θ_l and a preset scanning distance R can be defined in the target reference coordinate axis xOy of the target microphone array, and scanning is performed according to the preset scanning angle and the preset scanning distance. The angle that meets the scanning conditions is selected from the preset scanning angles; this angle is the target angle of the target object relative to the target microphone array. Figure 5 also includes steady-state noise sources. Because the initial audio signal is denoised as described above, steady-state noise sources within the preset scanning range do not affect the scanning results.

[0119] Based on this, the target audio signal collected by each microphone in the target microphone array can be used as the input signal. A preset scanning algorithm is used to scan the preset scanning range consisting of a preset scanning angle and a preset scanning distance. The target angle of the target object relative to the target microphone array is determined based on the scanning results.

[0120] In practice, each subarray can include at least two microphones. The target angle relative to the target microphone array can be calculated from the phase of each target audio signal and its phase matching coefficient. The phase matching coefficient is generated using a beamforming method based on the preset scanning distance and the preset scanning angle. The target angle is given by formula (5) below.

[0121]

[0122] where,

[0123]

[0124]

[0125]

[0126] Where θ_l is the preset scanning angle, m is the microphone index, U1 is the first subarray, U2 is the second subarray, U3 is the third subarray, and U is the target microphone array. For the m-th microphone, formula (5) uses the phase matching coefficient of the target audio signal acquired by that microphone and the phase of that target audio signal. S1(k, θ_l, n) is the intermediate result used to calculate the first sub-angle, S2(k, θ_l, n) is the intermediate result used to calculate the second sub-angle, and S3(k, θ_l, n) is the intermediate result used to calculate the third sub-angle.

[0127] In summary, by utilizing a preset scanning algorithm, a target angle that meets the scanning conditions is determined from a preset scanning angle. This target angle is then used as the target angle of the target object relative to the target microphone array, providing a data basis for subsequent positioning of the target object.
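
The scanning described above can be sketched in Python as a steered-response-power search. This is an illustration under stated assumptions and not formula (5) itself, which is not reproduced in this passage: the microphones are assumed to lie on the x-axis at known positions `mic_x`, the speed of sound `C` is assumed to be 343 m/s, the scanning condition is assumed to be maximum summed power, and the names `path_lengths` and `scan` are hypothetical. The per-microphone aligned terms are computed once and then summed over U, U1, U2, and U3, which mirrors the reuse of intermediate results for the sub-angles.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def path_lengths(mic_x, angle_rad, R):
    """Distance from the point at polar (R, angle) in frame xOy to each microphone on the x-axis."""
    px, py = R * np.cos(angle_rad), R * np.sin(angle_rad)
    return np.hypot(px - mic_x, py)

def scan(phi, mic_x, freqs, R, angles_deg, subarrays):
    """Scan the preset angles at the preset distance R.

    phi       : complex spectra of one frame, shape (M, K)
    mic_x     : microphone x-positions in metres, shape (M,)
    freqs     : bin centre frequencies in Hz, shape (K,)
    subarrays : dict mapping a sub-array name to its microphone indices
    Returns (target angle, {name: coordinate angle}) in degrees.
    """
    angles = np.deg2rad(angles_deg)
    d = np.stack([path_lengths(mic_x, a, R) for a in angles])  # (L, M)
    # Phase matching coefficients for every scanning angle, microphone, and bin.
    coef = np.exp(2j * np.pi * freqs[None, None, :] * d[:, :, None] / C)
    aligned = phi[None, :, :] * coef  # (L, M, K), computed once and reused

    def power(idx):
        # Coherent sum over the chosen microphones, then power summed over bins.
        return np.sum(np.abs(aligned[:, idx, :].sum(axis=1)) ** 2, axis=1)

    target = angles_deg[int(np.argmax(power(np.arange(len(mic_x)))))]
    subs = {name: angles_deg[int(np.argmax(power(np.asarray(idx))))]
            for name, idx in subarrays.items()}
    return target, subs
```

The sub-array results returned here are still expressed in the target reference coordinate axis xOy, so they correspond to coordinate angles that would need a further conversion into each sub-array's own frame.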

[0128] Step 208: Determine the sub-angle of the target object relative to each sub-array based on the target angle.

[0129] Specifically, after determining the target angle, the sub-angles of the target object relative to each subarray can be determined based on the target angle.

[0130] In practice, since locating the target object solely based on the target angle is insufficient to guarantee the accuracy of the positioning results, we can also utilize the sub-angles of the target object relative to each subarray, combining them with the target angle to ensure the accuracy of the positioning results. The specific implementation method is as follows:

[0131] The at least two subarrays include a first subarray and a second subarray, the first subarray and the second subarray being located at opposite ends of the target microphone array, respectively;

[0132] Accordingly, determining the sub-angle of the target object relative to each sub-array based on the target angle includes:

[0133] Based on the target angle, a first sub-angle of the target object relative to the first sub-array and a second sub-angle of the target object relative to the second sub-array are determined.

[0134] As can be seen from the above formula (5), when performing a full-angle scan of the target audio signals collected by each microphone in the target microphone array, since the target microphone array includes each sub-array, the scanning result obtained by performing a full-angle scan also includes the scanning result of the microphones included in each sub-array. The first sub-angle and the second sub-angle can be directly obtained by reusing the intermediate result of the full-angle scan operation in formula (5).

[0135] In summary, by obtaining the first and second sub-angles directly from the target angle scan, a separate scanning operation for each subarray is avoided, thereby saving computation.

[0136] In specific implementation, determining the first sub-angle of the target object relative to the first sub-array and the second sub-angle of the target object relative to the second sub-array based on the target angle includes:

[0137] Based on the target angle, obtain the first coordinate angle of the target object relative to the first subarray under the target reference coordinate axis of the target microphone array, and the second coordinate angle of the target object relative to the second subarray under the target reference coordinate axis;

[0138] The first coordinate angle is converted into a first sub-angle of the target object relative to the first sub-array under the first reference coordinate axis of the first sub-array.

[0139] The second coordinate angle is converted into a second sub-angle of the target object relative to the second sub-array under the second reference coordinate axis of the second sub-array.

[0140] Specifically, the first coordinate angle is shown in formula (6) below, and the second coordinate angle is shown in formula (7) below.

[0141]

[0142]

[0143] Here, formula (6) gives the first coordinate angle, and formula (7) gives the second coordinate angle.

[0144] In the aforementioned process of performing a full-angle scan of the target audio signals collected by each microphone in the target microphone array, the scan is performed within the target reference coordinate axis of the target microphone array. Therefore, the target angle is determined as the first coordinate angle of the target object relative to the first subarray under the target reference coordinate axis, and the second coordinate angle of the target object relative to the second subarray under the target reference coordinate axis. To achieve the positioning of the target object, it is also necessary to transform the first and second coordinate angles, converting the first coordinate angle into a first sub-angle under the first reference coordinate axis, and the second coordinate angle into a second sub-angle under the second reference coordinate axis.

[0145] Accordingly, the first sub-angle is shown in formula (8) below, and the second sub-angle is shown in formula (9) below.

[0146]

[0147]

[0148] Where the remaining symbols in formulas (8) and (9) are intermediate variables; c1 is the distance between the first reference point O1 of the first reference coordinate axis and the target reference point O of the target reference coordinate axis; R is the preset scanning distance; and c2 is the distance between the second reference point O2 of the second reference coordinate axis and the target reference point O of the target reference coordinate axis.

[0149] In summary, by transforming the coordinate axes, we can determine the first sub-angle of the target object relative to the first subarray and the second sub-angle of the target object relative to the second subarray, which facilitates the subsequent determination of the target object's position based on the first and second sub-angles.
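
The coordinate transformation can be sketched geometrically as follows. Formulas (8) and (9) are not reproduced in this passage, so this is only one plausible reading: the scanned point at distance R and a given coordinate angle in frame xOy is re-expressed as an angle seen from a sub-array reference point lying on the x-axis. The function name `to_subarray_angle` and the signed `origin_x` parameter are hypothetical, and the placement of O1 at -c1 and O2 at +c2 in the usage note is an assumption about the axis orientation.

```python
import math

def to_subarray_angle(coord_angle_deg, R, origin_x):
    """Angle (degrees) of the scanned point as seen from a sub-array reference point.

    coord_angle_deg : coordinate angle in the target frame xOy
    R               : preset scanning distance in metres
    origin_x        : signed x-position of the sub-array reference point (assumed on the x-axis)
    """
    a = math.radians(coord_angle_deg)
    # Scanned point in the target frame xOy.
    px, py = R * math.cos(a), R * math.sin(a)
    # Shift the origin to the sub-array reference point and measure the angle again.
    return math.degrees(math.atan2(py, px - origin_x))
```

For example, with c1 = 0.39 m and c2 = 0.175 m, the two conversions would be `to_subarray_angle(angle1, R, -c1)` and `to_subarray_angle(angle2, R, +c2)` if O1 lies on the negative x side and O2 on the positive x side; the signs flip if the axis is oriented the other way.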

[0150] Step 210: Determine the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array.

[0151] Specifically, after determining the target angle and the sub-angles of the target object relative to each subarray, the position of the target object relative to the target microphone array can be determined based on the target angle and the sub-angles of the target object relative to each subarray.

[0152] Here, both the target angle and the sub-angle can be understood as the angle (i.e., the direction of arrival angle) estimated for the direction of arrival of the sound emitted by the target object.

[0153] In specific implementation, determining the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array includes:

[0154] Calculate the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle;

[0155] The position of the target object relative to the target microphone array is determined based on the target angle and the vertical distance.

[0156] The position of the target object relative to the target microphone array can be understood as the position of the target object in the target reference coordinate axis.

[0157] Specifically, the vertical distance D of the target object relative to the target microphone array can be calculated according to the following formula (10).

[0158]

[0159] Specifically, the vertical distance and the target angle of the target object relative to the target microphone array can be used as the position of the target object relative to the target microphone array.
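
Formula (10) is not reproduced in this passage. A minimal sketch of how two sub-angles can yield a vertical distance is standard two-point triangulation, shown below in Python as an assumption rather than as the specification's formula. The reference points O1 and O2 are assumed to lie on the x-axis at signed positions `x1` and `x2`, and the function name `vertical_distance` is hypothetical. For a target at (x, D), x - x1 = D·cot θ1 and x - x2 = D·cot θ2, so D = (x2 - x1) / (cot θ1 - cot θ2).

```python
import math

def vertical_distance(theta1_deg, theta2_deg, x1, x2):
    """Triangulate the vertical distance D from two sub-angles.

    theta1_deg, theta2_deg : sub-angles in degrees, strictly between 0 and 180
    x1, x2                 : signed x-positions of reference points O1 and O2 in metres
    """
    cot1 = 1.0 / math.tan(math.radians(theta1_deg))
    cot2 = 1.0 / math.tan(math.radians(theta2_deg))
    denom = cot1 - cot2
    if abs(denom) < 1e-12:
        # Equal sub-angles mean parallel bearing lines: no finite intersection.
        raise ValueError("sub-angles are equal; the vertical distance is undefined")
    return (x2 - x1) / denom
```

With O1 and O2 on opposite sides of O, the baseline x2 - x1 equals c1 + c2. A longer baseline makes the two sub-angles differ more for the same target, which is consistent with placing the first and second subarrays at opposite ends of the array.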

[0160] In practical applications, when the target microphone array is large and its microphones are unevenly distributed, a target object close to the array can cause significant deviations in the target angle, resulting in inaccurate positioning results. Therefore, a third subarray that shares the target reference coordinate axis with the target microphone array can be used to correct the target angle, thereby improving the accuracy of the direction of arrival angle estimation. The specific implementation method is as follows:

[0161] The at least two subarrays further include a third subarray, which is located between the first subarray and the second subarray;

[0162] Accordingly, determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance includes:

[0163] Based on the target angle, determine the third sub-angle of the target object relative to the third sub-array;

[0164] If the vertical distance is determined to be less than or equal to a preset distance threshold, the third sub-angle is taken as the adjusted target angle;

[0165] The position of the target object relative to the target microphone array is determined based on the adjusted target angle and the vertical distance.

[0166] The preset distance threshold can be determined based on the actual size of the target microphone array.

[0167] Similarly, the third sub-angle of the target object relative to the third sub-array can be obtained directly from the target angle, as shown in the following formula (11).

[0168]

[0169] Here, formula (11) gives the third sub-angle.

[0170] Specifically, if the vertical distance between the target object and the target microphone array is less than or equal to a preset distance threshold, the third sub-angle of the target object relative to the third sub-array can be used as the adjusted target angle. This adjusted target angle (i.e., the third sub-angle) and the vertical distance are then used as the position of the target object relative to the target microphone array.

[0171] Accordingly, determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance further includes:

[0172] If the vertical distance is determined to be greater than the preset distance threshold, the position of the target object relative to the target microphone array is determined based on the target angle and the vertical distance.

[0173] In practical applications, the target angle can be corrected according to the following formula (12).

[0174]

[0175] Where D0 is the preset distance threshold.
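
The threshold rule stated in the text can be sketched in Python as follows. This shows only the selection described in the preceding paragraphs (use the third sub-angle when D is less than or equal to D0, otherwise keep the target angle); formula (12) itself is not reproduced in this passage and may contain further terms. The function names `corrected_angle` and `locate` are hypothetical.

```python
def corrected_angle(theta0_deg, theta3_deg, D, D0):
    """Return the adjusted target angle.

    theta0_deg : target angle from the full array
    theta3_deg : third sub-angle from the middle sub-array
    D          : estimated vertical distance in metres
    D0         : preset distance threshold in metres
    """
    # Near range: the full-array angle is unreliable, so use the middle sub-array.
    return theta3_deg if D <= D0 else theta0_deg

def locate(theta0_deg, theta3_deg, D, D0):
    """Position of the target as (angle in degrees, vertical distance in metres)."""
    return corrected_angle(theta0_deg, theta3_deg, D, D0), D
```

For example, with D0 = 2.1 m, a target at a vertical distance of 0.86 m would be reported with the third sub-angle, while a target at 3.0 m would keep the full-array target angle.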

[0176] In summary, by correcting the target angle using the third sub-angle, deviations in target angle estimation are avoided when the target object is close to the target microphone array, further ensuring the accuracy of the target object positioning results.

[0177] In practical applications, the target object localization method is illustrated using the example of audio capture of a speaker in a video conferencing scenario. The actual maximum size of the target microphone array is 0.72 meters, with a total of 19 microphones; the first subarray has 4 microphones, c1 = 0.39 meters; the second subarray has 3 microphones, c2 = 0.175 meters; and the third subarray has 4 microphones. During speaker localization, the scanning distance R = 7.0 meters, the preset distance threshold D0 = 2.1 meters, and the smoothing coefficient λ = 0.9.

[0178] See Figure 6, which shows a schematic diagram of a positioning result of a target object positioning method provided according to an embodiment of this specification. Figure 6 shows the localization results when the speaker is positioned to the left front of the target microphone array. Figure 6(a) shows the localization results for the vertical distance between the speaker and the target microphone array, and Figure 6(b) shows the localization results for the azimuth angle between the speaker and the target microphone array. The true value of the speaker's vertical distance is 0.86 meters, and the true value of the speaker's azimuth angle is 60 degrees; the true values can be understood as the actual values. The first 2 seconds of the localization results correspond to algorithm startup and can be disregarded. In the vertical distance results, the estimation error is within 10%. In the azimuth angle results, the uncorrected azimuth angle from the main array (the target microphone array) deviates by more than 10 degrees and fluctuates noticeably, while the azimuth angle from the intermediate subarray (the third subarray) is stable and deviates by only 2 degrees. Using the intermediate subarray to correct the main array's azimuth angle therefore significantly improves the accuracy of the localization results at close range.

[0179] See Figure 7, which shows a schematic diagram of another positioning result of a target object positioning method provided according to an embodiment of this specification. Figure 7 shows the localization results when the speaker is positioned directly in front of the target microphone array. Figure 7(a) shows the localization results for the vertical distance between the speaker and the target microphone array, and Figure 7(b) shows the localization results for the azimuth angle between the speaker and the target microphone array. The true value of the speaker's vertical distance is 1.0 meter, and the true value of the speaker's azimuth angle is 88 degrees. The first 2 seconds of the localization results correspond to algorithm startup and can be disregarded. The results are similar to those above: the estimation error of the vertical distance is small (within 10%), and using the intermediate subarray to correct the speaker's azimuth angle keeps the angular deviation at close range within 2 degrees, significantly improving accuracy.

[0180] In summary, the above method, when determining the position of the target object relative to the target microphone array, not only considers the target angle of the target object relative to the target microphone array, but also the sub-angles of the target object relative to each sub-array. By combining the target angle and multiple sub-angles, it can provide more computational data when determining the position of the target object relative to the target microphone array, realize the positioning of the target object based on the sound emitted by the target object, and further ensure the accuracy of the target object positioning.

[0181] The target object localization method is further explained below with reference to Figure 8, taking its application in a video conferencing scenario as an example. Figure 8 shows a flowchart of a target object localization method according to an embodiment of this specification, which specifically includes the following steps.

[0182] Step 802: Determine the target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone.

[0183] Specifically, in a video conferencing scenario, conferencing equipment can be deployed. This equipment includes a target microphone array, comprising a left subarray (first subarray), a middle subarray (third subarray), and a right subarray (second subarray). Each subarray contains at least two microphones. This target microphone array is used to capture the speaker's audio, which can then be used to locate the speaker.

[0184] Step 804: Determine the initial audio signal of the target object collected by each microphone.

[0185] In a video conferencing scenario, the target object can be the speaker in the video conference.

[0186] Specifically, the initial audio signal of the speaker collected by each microphone in the target microphone array can be determined.

[0187] Step 806: Perform noise reduction processing on the initial audio signal to obtain the target audio signal of the target object collected by each microphone.

[0188] Specifically, when denoising the initial audio signals, a short-time Fourier transform (STFT) can be performed on each initial audio signal acquired by each microphone to convert it from the time domain to the frequency domain, obtaining the vector of each initial audio signal in the frequency domain. The STFT is a method for converting time-domain signals to frequency-domain signals. Alternatively, other methods such as the Fourier transform can also be used to achieve the conversion from the time domain to the frequency domain; this specification does not limit the specific method used in this embodiment. Furthermore, among the initial audio signals acquired by the microphones, any one initial audio signal is selected for noise estimation to determine its signal-to-noise ratio (SNR). Based on the SNR and the vectors of each initial audio signal in the frequency domain, the cross-power spectral density between each initial audio signal and the initial audio signal selected for noise estimation, as well as the cross-power spectral density between the noise of each initial audio signal and the noise of the initial audio signal selected for noise estimation, can be calculated. Finally, based on the calculated cross-power spectral densities, the denoised target audio signals are determined.
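
The time-domain to frequency-domain conversion in this step can be sketched in Python with a basic short-time Fourier transform. The frame length, hop size, and Hann window below are illustrative assumptions, not values taken from this specification, and the function name `stft` is hypothetical.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time Fourier transform of a real signal.

    Returns complex spectra of shape (K, N), where K = frame_len // 2 + 1
    frequency bins and N is the number of complete frames.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT per frame, transposed to (bins, frames).
    return np.fft.rfft(frames, axis=1).T
```

Applying `stft` to the signal of each microphone and stacking the results gives an array indexed by microphone, frequency bin k, and frame n, which is the form the cross-power spectral densities are computed from.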

[0189] Step 808: Using the target audio signal as the input signal, a preset scanning algorithm is used to scan the preset scanning range, and the target angle of the target object relative to the target microphone array is determined based on the scanning result.

[0190] The preset scanning range can be determined based on the preset scanning distance and preset scanning angle.

[0191] Specifically, when using a preset scanning algorithm to scan a preset scanning range, an angle that meets the preset scanning conditions can be determined within the preset scanning range as the target angle.

[0192] Step 810: Determine the sub-angle of the target object relative to each sub-array based on the target angle.

[0193] Specifically, because the aforementioned scan covers all angles for the target audio signals collected by all microphones in the target microphone array, the first sub-angle of the target object relative to the left subarray, the second sub-angle of the target object relative to the right subarray, and the third sub-angle of the target object relative to the middle subarray can be obtained from the aforementioned target angle scanning results.

[0194] Step 812: Calculate the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle.

[0195] Step 814: Adjust the target angle according to the third sub-angle to obtain the adjusted target angle.

[0196] Specifically, if the vertical distance is less than or equal to the distance threshold (i.e., the preset distance threshold), the third sub-angle is used as the adjusted target angle. If the vertical distance is greater than the distance threshold, the target angle is not adjusted.

[0197] Step 816: Determine the position of the target object relative to the target microphone array based on the adjusted target angle and vertical distance.

[0198] When determining the position of the target object relative to the target microphone array, the above method not only considers the target angle of the target object relative to the target microphone array, but also the sub-angles of the target object relative to each sub-array. By combining the target angle and multiple sub-angles, more computational data can be provided when determining the position of the target object relative to the target microphone array, so as to realize the positioning of the target object based on the sound emitted by the target object, and further ensure the accuracy of the target object positioning.

[0199] See Figure 9, which illustrates a target object localization method provided in one embodiment of this specification applied to a teaching scenario. In practical applications, when a teacher (i.e., the target object) records a video course or conducts an online course livestream in a recording studio, a conferencing device 902 is placed in the recording studio. This conferencing device 902 includes a target microphone array and a camera device. Before the video course recording or online course livestream begins, the teacher can stand at any position in the recording studio. The display interface of the end-side device 904 in the recording studio can display a positional map of the recording studio's interior; that is, the display interface shows the position of the target microphone array in the recording studio. When the teacher speaks, the target microphone array collects the audio signal emitted by the teacher, from which the teacher's position is determined and displayed on the display interface of the end-side device 904. As the teacher moves, the teacher's position on the display interface of the end-side device 904 changes accordingly. For example, the teacher speaks at position 1 in the recording studio, moves from position 1 to position 2 and speaks there, and then moves from position 2 to position 3 and speaks there. The conferencing device 902 or the end-side device 904 can determine the teacher's position (positions 1, 2, and 3) each time the teacher speaks, based on the aforementioned target object positioning method. Furthermore, a target position is determined based on positions 1, 2, and 3 and the audio signals collected by the target microphone array each time the teacher speaks; when the teacher stands at the target position, the target microphone array receives sound more clearly. This target position is then displayed to the teacher through the display interface of the end-side device 904. The teacher can then choose their speaking position during video course recording or online course livestreaming based on the target position displayed by the end-side device 904.

[0200] Corresponding to the above method embodiments, this specification also provides embodiments of a target object positioning device. Figure 10 shows a schematic diagram of a target object positioning device according to one embodiment of this specification. As shown in Figure 10, the device includes:

[0201] The first determining module 1002 is configured to determine a target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone;

[0202] The second determining module 1004 is configured to determine the target audio signal of the target object collected by the plurality of microphones;

[0203] Processing module 1006 is configured to process the target audio signal and determine the target angle of the target object relative to the target microphone array based on the processing result;

[0204] The third determining module 1008 is configured to determine the sub-angle of the target object relative to each sub-array based on the target angle;

[0205] The fourth determining module 1010 is configured to determine the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array.

[0206] In an optional embodiment, the at least two subarrays include a first subarray and a second subarray, the first subarray and the second subarray being located at opposite ends of the target microphone array; the third determining module 1008 is further configured to:

[0207] Based on the target angle, a first sub-angle of the target object relative to the first sub-array and a second sub-angle of the target object relative to the second sub-array are determined.

[0208] In an optional embodiment, the fourth determining module 1010 is further configured to:

[0209] Calculate the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle;

[0210] The position of the target object relative to the target microphone array is determined based on the target angle and the vertical distance.

[0211] In an optional embodiment, the at least two subarrays further include a third subarray located between the first subarray and the second subarray; the fourth determining module 1010 is further configured to:

[0212] Based on the target angle, determine the third sub-angle of the target object relative to the third sub-array;

[0213] If the vertical distance is determined to be less than or equal to a preset distance threshold, the third sub-angle is taken as the adjusted target angle;

[0214] The position of the target object relative to the target microphone array is determined based on the adjusted target angle and the vertical distance.

[0215] In an optional embodiment, the fourth determining module 1010 is further configured to:

[0216] If the vertical distance is determined to be greater than the preset distance threshold, the position of the target object relative to the target microphone array is determined based on the target angle and the vertical distance.

[0217] In an optional embodiment, the third determining module 1008 is further configured to:

[0218] Based on the target angle, obtain the first coordinate angle of the target object relative to the first subarray under the target reference coordinate axis of the target microphone array, and the second coordinate angle of the target object relative to the second subarray under the target reference coordinate axis;

[0219] The first coordinate angle is converted into a first sub-angle of the target object relative to the first sub-array under the first reference coordinate axis of the first sub-array.

[0220] The second coordinate angle is converted into a second sub-angle of the target object relative to the second sub-array under the second reference coordinate axis of the second sub-array.
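
This passage does not spell out how a coordinate angle is converted between reference coordinate axes. One possible reading, shown only as a sketch, is that the reference coordinate axis of each sub-array is rotated by a known offset relative to the target reference coordinate axis of the array, so the conversion subtracts that offset. The parameter `axis_rotation` and the function name are hypothetical.

```python
import math


def to_subarray_frame(coordinate_angle: float, axis_rotation: float) -> float:
    """Re-express an angle measured against the array axis in a sub-array frame.

    Assumption: the sub-array reference axis is the array reference axis
    rotated counter-clockwise by axis_rotation (radians). The result is
    wrapped into the range [0, 2*pi).
    """
    return (coordinate_angle - axis_rotation) % (2.0 * math.pi)
```

Under this assumption, a coordinate angle of 100 degrees seen from a sub-array whose axis is rotated by 30 degrees becomes a sub-angle of 70 degrees.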

[0221] In an optional embodiment, the processing module 1006 is further configured to:

[0222] The target audio signal is used as the input signal, and a preset scanning algorithm is used to scan a preset scanning range, wherein the preset scanning range is determined according to a preset scanning angle and a preset scanning distance;

[0223] The target angle of the target object relative to the target microphone array is determined based on the scanning results.
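
The preset scanning algorithm is not named in this passage. As one possible illustration, the sketch below uses a delay-and-sum steered-power scan over a grid of candidate angles and distances, which is a substitute technique chosen for the example. It assumes microphones on the x-axis, integer-sample delays, and a speed of sound of 343 m/s; all names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def steering_delays(angle, distance, mic_x, sample_rate):
    """Integer sample delays at each microphone for a point source at (angle, distance)."""
    px, py = distance * math.cos(angle), distance * math.sin(angle)
    delays = [round(math.hypot(px - x, py) / SPEED_OF_SOUND * sample_rate) for x in mic_x]
    first = min(delays)
    return [d - first for d in delays]  # delays relative to the earliest arrival


def steered_power(signals, delays):
    """Mean power of the sum of the channels after undoing the candidate delays."""
    usable = len(signals[0]) - max(delays)
    total = 0.0
    for n in range(usable):
        s = sum(sig[n + d] for sig, d in zip(signals, delays))
        total += s * s
    return total / usable


def scan(signals, mic_x, sample_rate, angles, distances):
    """Return the (angle, distance) candidate with the highest steered power."""
    candidates = ((a, r) for a in angles for r in distances)
    return max(
        candidates,
        key=lambda c: steered_power(signals, steering_delays(c[0], c[1], mic_x, sample_rate)),
    )
```

Here the scanning range is the Cartesian product of `angles` and `distances`, mirroring a range defined by a preset scanning angle and a preset scanning distance. The angle of the winning candidate plays the role of the target angle.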

[0224] In an optional embodiment, the second determining module 1004 is further configured to:

[0225] Determine the initial audio signal of the target object collected by the multiple microphones;

[0226] The initial audio signal is denoised to obtain the target audio signal of the target object collected by the plurality of microphones.

[0227] In an optional embodiment, the second determining module 1004 is further configured to:

[0228] Determine the signal-to-noise ratio of the first initial audio signal acquired by any one microphone, and determine the first cross power spectral density between the first initial audio signal and the initial audio signal;

[0229] Based on the signal-to-noise ratio, determine the second cross-power spectral density between the noise of the first initial audio signal and the noise of the initial audio signal;

[0230] Based on the first cross power spectral density and the second cross power spectral density, the initial audio signal is denoised to obtain the target audio signal of the target object collected by the plurality of microphones.
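
One way to read the three steps above is as a subtraction on cross power spectral densities at a single frequency bin. The sketch below assumes that the noise accounts for a fraction 1/(1 + SNR) of the measured cross power, with the SNR given as a linear ratio. That noise model and the function names are assumptions for illustration; this passage does not state the formula actually used.

```python
def cross_psd(ref_frames, mic_frames):
    """First cross power spectral density: average of X_ref * conj(X_mic) over frames."""
    return sum(x * y.conjugate() for x, y in zip(ref_frames, mic_frames)) / len(ref_frames)


def noise_cross_psd(first_cpsd, snr):
    """Second (noise) cross power spectral density under the assumed model."""
    return first_cpsd / (1.0 + snr)


def denoised_cross_psd(ref_frames, mic_frames, snr):
    """Remove the estimated noise contribution from the measured cross spectrum."""
    first = cross_psd(ref_frames, mic_frames)
    second = noise_cross_psd(first, snr)
    return first - second
```

Each input is a list of complex spectral values for one frequency bin, one value per frame. Under the assumed model, a high signal-to-noise ratio leaves the cross spectrum almost unchanged, while a low ratio shrinks it toward zero.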

[0231] In summary, when determining the position of the target object relative to the target microphone array, the above-mentioned device not only considers the target angle of the target object relative to the target microphone array, but also the sub-angles of the target object relative to each sub-array. By combining the target angle and multiple sub-angles, it can provide more calculation data when determining the position of the target object relative to the target microphone array, realize the positioning of the target object based on the sound emitted by the target object, and further ensure the accuracy of the target object positioning.

[0232] The above is a schematic scheme of a target object positioning device according to this embodiment. It should be noted that the technical solution of this target object positioning device and the technical solution of the target object positioning method described above belong to the same concept. For details not described in detail in the technical solution of the target object positioning device, please refer to the description of the technical solution of the target object positioning method described above.

[0233] Corresponding to the above method embodiments, this specification also provides a target object processing method applied to an end-side device. Figure 11 shows a flowchart of a target object processing method according to an embodiment of this specification, which specifically includes the following steps.

[0234] Step 1102: Receive the initial image of the target object set captured by the camera device, and display the initial image on the display interface;

[0235] Step 1104: According to the target object localization method provided in one embodiment of this specification, determine the position of the target object relative to the target microphone array, wherein the target object is any one of the target objects in the set of target objects;

[0236] Step 1106: Receive the target image of the target object captured by the camera device and display the target image on the display interface, wherein the target image is captured by the camera device based on the position of the target object relative to the microphone array.

[0237] The initial image of the target object set captured by the camera device can be understood as the initial image of the target object set captured by the camera device based on a preset initial position. For example, if the camera device is initially facing forward, and the target object set is located directly in front of the camera device, then the initial image of the target object set is the initial image captured by the camera device facing forward at the target object set. Both the target microphone array and the camera device are set up within the conference equipment.

[0238] Specifically, the system can receive initial images of a set of target objects captured by a camera device based on a preset initial position, and display these initial images on the display interface. According to the aforementioned target object localization method, any one object in the target object set is located to determine its position relative to the target microphone array. The system receives a target image of the object captured by the camera device based on its position relative to the target microphone array, and displays this image on the display interface.
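
The flow of steps 1102 to 1106 can be sketched as follows. `FakeCamera`, `Display`, and the `locate` callback are hypothetical stand-ins for the camera device, the display interface, and the target object positioning method; they are not interfaces defined in this specification.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Position = Tuple[float, float]


@dataclass
class FakeCamera:
    """Hypothetical camera: faces its preset initial position until aimed."""
    aimed_at: Optional[Position] = None

    def aim(self, position: Position) -> None:
        self.aimed_at = position

    def capture(self) -> str:
        if self.aimed_at is None:
            return "initial image"
        return f"image at {self.aimed_at}"


@dataclass
class Display:
    """Hypothetical display interface that records what it has shown."""
    shown: List[str] = field(default_factory=list)

    def show(self, image: str) -> None:
        self.shown.append(image)


def process_target(camera: FakeCamera, display: Display,
                   locate: Callable[[], Position]) -> Position:
    display.show(camera.capture())  # step 1102: initial image of the target object set
    position = locate()             # step 1104: position relative to the microphone array
    camera.aim(position)            # camera captures based on that position
    display.show(camera.capture())  # step 1106: target image of the target object
    return position
```

Passing the positioning method in as a callback keeps the display flow independent of how the position is computed.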

[0239] For example, in a video conferencing scenario, see Figure 12, which illustrates an application scenario of a target object processing method provided in one embodiment of this specification. The scenario in Figure 12 includes a conferencing device 1202 and an end-side device 1204. Specifically, when multiple participants (i.e., a set of target objects) participate in a video conference, they can sit in front of the conferencing device 1202. The camera of the conferencing device 1202 can be used to capture images of the participants, and the target microphone array of the conferencing device 1202 can be used to capture audio signals emitted by the participants. When one of the participants, A (i.e., the target object), speaks, the conferencing device 1202 or the end-side device 1204 can determine the position of participant A relative to the target microphone array (which can be understood as relative to the conferencing device) based on the audio signal emitted by participant A while speaking, as captured by the target microphone array. Based on this position, the camera of the conferencing device 1202 is then directed towards participant A. Correspondingly, on the display interface of the end-side device 1204, before participant A speaks, images of the multiple participants captured by the camera at the initial position are displayed. After participant A speaks, the camera turns to participant A, and the image of participant A is displayed on the display interface of the end-side device 1204. Alternatively, the display interface of the end-side device 1204 can continue to display images of the multiple participants and display a light reminder at the position of participant A, so that, when there are many participants, the other participants can see who the current speaker is.

[0240] Understandably, the above methods can also be applied to virtual meeting scenarios, with a similar process, which will not be repeated here.

[0241] Corresponding to the above method embodiments, this specification also provides embodiments of a target object processing apparatus. Figure 13 shows a schematic diagram of a target object processing apparatus according to one embodiment of this specification. As shown in Figure 13, the apparatus includes:

[0242] The first display module 1302 is configured to receive an initial image of a set of target objects captured by a camera device and display the initial image on a display interface.

[0243] The determination module 1304 is configured to determine the position of a target object relative to a target microphone array according to the target object positioning method described in the embodiments of this specification, wherein the target object is any one of the target objects in the set of target objects;

[0244] The second display module 1306 is configured to receive a target image of the target object captured by the camera device and display the target image on a display interface, wherein the target image is captured by the camera device based on the position of the target object relative to the microphone array.

[0245] The above is a schematic scheme of a target object processing device according to this embodiment. It should be noted that the technical solution of this target object processing device and the technical solution of the target object processing method described above belong to the same concept. For details not described in detail in the technical solution of the target object processing device, please refer to the description of the technical solution of the target object processing method described above.

[0246] Corresponding to the above method embodiments, this specification also provides a conference device, including:

[0247] Target microphone array, memory, and processor;

[0248] Each microphone in the target microphone array is used to collect the target audio signal of the target object;

[0249] The memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the above method.

[0250] When determining the position of a target object relative to the target microphone array, this conference equipment considers not only the target angle of the target object relative to the target microphone array, but also the sub-angles of the target object relative to each sub-array. By combining the target angle and multiple sub-angles, it can provide more calculation data when determining the position of the target object relative to the target microphone array, and realize the positioning of the target object based on the sound emitted by the target object, further ensuring the accuracy of the target object positioning.

[0251] The above is an illustrative scheme of a conference device according to this embodiment. It should be noted that the technical solution of this conference device and the technical solution of the target object positioning method described above belong to the same concept. For details not described in detail in the technical solution of the conference device, please refer to the description of the technical solution of the target object positioning method described above.

[0252] Figure 14 shows a structural block diagram of a computing device 1400 according to one embodiment of this specification. The components of the computing device 1400 include, but are not limited to, a memory 1410 and a processor 1420. The processor 1420 is connected to the memory 1410 via a bus 1430, and a database 1450 is used to store data.

[0253] The computing device 1400 also includes an access device 1440, which enables the computing device 1400 to communicate via one or more networks 1460. Examples of these networks include Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or combinations of communication networks such as the Internet. The access device 1440 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Wi-MAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.

[0254] In one embodiment of this application, the aforementioned components of the computing device 1400, as well as other components not shown in Figure 14, can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 14 is for illustrative purposes only and is not intended to limit the scope of this application. Those skilled in the art can add or replace other components as needed.

[0255] The computing device 1400 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 1400 can also be a mobile or stationary server.

[0256] The processor 1420 is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the above method.

[0257] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the above method belong to the same concept, and all details not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above method.

[0258] An embodiment of this specification also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described method.

[0259] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the method described above belong to the same concept, and all details not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the method described above.

[0260] An embodiment of this specification also provides a computer program, wherein when the computer program is executed in a computer, it causes the computer to perform the steps of the above-described method.

[0261] The above is an illustrative example of a computer program according to this embodiment. It should be noted that the technical solution of this computer program and the technical solution of the method described above belong to the same concept. Details not described in detail in the technical solution of the computer program can be found in the description of the technical solution of the method described above.

[0262] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0263] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

[0264] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0265] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0266] The preferred embodiments disclosed above are merely illustrative of this specification. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and their full scope and equivalents.

Claims

1. A method for locating a target object, comprising:
determining a target microphone array, wherein the target microphone array comprises at least two subarrays, and each subarray comprises a microphone;
determining the target audio signal of the target object collected by the multiple microphones;
processing the target audio signal, and determining the target angle of the target object relative to the target microphone array based on the processing result;
based on the target angle, reusing the intermediate results of the processing to determine the sub-angles of the target object relative to each sub-array, wherein the sub-angles include the first sub-angle of the first sub-array, the second sub-angle of the second sub-array, and the third sub-angle of the third sub-array, and the third sub-array is located between the first sub-array and the second sub-array; and
determining the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array, which includes: calculating the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle; if the vertical distance is less than or equal to a distance threshold, using the third sub-angle as the adjusted target angle; and determining the position of the target object relative to the target microphone array based on the adjusted target angle and the vertical distance.

2. The method according to claim 1, wherein the at least two subarrays include a first subarray and a second subarray, the first subarray and the second subarray being located at opposite ends of the target microphone array;
accordingly, determining the sub-angle of the target object relative to each sub-array based on the target angle includes:
determining, based on the target angle, a first sub-angle of the target object relative to the first sub-array and a second sub-angle of the target object relative to the second sub-array.

3. The method according to claim 2, wherein determining the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array comprises:
calculating the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle; and
determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance.

4. The method according to claim 3, wherein the at least two subarrays further include a third subarray, the third subarray being located between the first subarray and the second subarray;
accordingly, determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance includes:
determining, based on the target angle, the third sub-angle of the target object relative to the third sub-array;
if the vertical distance is determined to be less than or equal to a preset distance threshold, taking the third sub-angle as the adjusted target angle; and
determining the position of the target object relative to the target microphone array based on the adjusted target angle and the vertical distance.

5. The method according to claim 4, wherein determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance further comprises:
if the vertical distance is determined to be greater than the preset distance threshold, determining the position of the target object relative to the target microphone array based on the target angle and the vertical distance.

6. The method according to claim 2, wherein determining a first sub-angle of the target object relative to the first subarray and a second sub-angle of the target object relative to the second subarray based on the target angle comprises:
obtaining, based on the target angle, the first coordinate angle of the target object relative to the first subarray under the target reference coordinate axis of the target microphone array, and the second coordinate angle of the target object relative to the second subarray under the target reference coordinate axis;
converting the first coordinate angle into a first sub-angle of the target object relative to the first sub-array under the first reference coordinate axis of the first sub-array; and
converting the second coordinate angle into a second sub-angle of the target object relative to the second sub-array under the second reference coordinate axis of the second sub-array.

7. The method according to claim 1, wherein processing the target audio signal and determining the target angle of the target object relative to the target microphone array based on the processing result includes:
using the target audio signal as the input signal, and using a preset scanning algorithm to scan a preset scanning range, wherein the preset scanning range is determined according to a preset scanning angle and a preset scanning distance; and
determining the target angle of the target object relative to the target microphone array based on the scanning results.

8. The method according to claim 1, wherein determining the target audio signal of the target object collected by the plurality of microphones includes:
determining the initial audio signal of the target object collected by the multiple microphones; and
denoising the initial audio signal to obtain the target audio signal of the target object collected by the plurality of microphones.

9. The method according to claim 8, wherein the step of denoising the initial audio signal to obtain the target audio signal of the target object collected by the plurality of microphones includes:
determining the signal-to-noise ratio of the first initial audio signal acquired by any one microphone, and determining the first cross power spectral density between the first initial audio signal and the initial audio signal;
determining, based on the signal-to-noise ratio, the second cross power spectral density between the noise of the first initial audio signal and the noise of the initial audio signal; and
denoising the initial audio signal based on the first cross power spectral density and the second cross power spectral density to obtain the target audio signal of the target object collected by the plurality of microphones.

10. A target object positioning device, comprising:
a first determining module configured to determine a target microphone array, wherein the target microphone array includes at least two subarrays, and each subarray includes a microphone;
a second determining module configured to determine the target audio signal of the target object collected by the plurality of microphones;
a processing module configured to process the target audio signal and determine the target angle of the target object relative to the target microphone array based on the processing result;
a third determining module configured to determine the sub-angles of the target object relative to each sub-array based on the target angle and by reusing the intermediate results of the processing, wherein the sub-angles include the first sub-angle of the first sub-array, the second sub-angle of the second sub-array, and the third sub-angle of the third sub-array, and the third sub-array is located between the first sub-array and the second sub-array; and
a fourth determining module configured to determine the position of the target object relative to the target microphone array based on the target angle and the sub-angles of the target object relative to each sub-array, wherein this determination includes: calculating the vertical distance of the target object relative to the target microphone array based on the first sub-angle and the second sub-angle; if the vertical distance is less than or equal to a distance threshold, using the third sub-angle as the adjusted target angle; and determining the position of the target object relative to the target microphone array based on the adjusted target angle and the vertical distance.

11. A conference device, comprising:
a target microphone array, a memory, and a processor;
wherein each microphone in the target microphone array is used to collect the target audio signal of the target object; and
the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 9.

12. A target object processing method, applied to an end-side device, comprising:
receiving an initial image of a set of target objects captured by a camera device, and displaying the initial image on a display interface;
determining, according to the target object positioning method of any one of claims 1 to 9, the position of the target object relative to the target microphone array, wherein the target object is any one object in the set of target objects; and
receiving a target image of the target object captured by the camera device, and displaying the target image on the display interface, wherein the target image is captured by the camera device based on the position of the target object relative to the microphone array.

13. A computing device, comprising:
a memory and a processor;
wherein the memory is used to store computer-executable instructions, and the processor is used to execute the computer-executable instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 9.

14. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.