A dual-end speech separation method and system suitable for a handheld device, a handheld device, and a storage medium

By constructing forward and backward beam signals from a dual-microphone linear array on a handheld device and combining them with a neural network for speech separation, the method addresses speech recognition difficulties on handheld devices and achieves low-cost, stable dual-end speech separation.

CN122201330A · Pending · Publication Date: 2026-06-12 · ELEVOC TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ELEVOC TECH CO LTD
Filing Date: 2026-05-15
Publication Date: 2026-06-12

Application Information

Patent Timeline

15 May 2026: Application
12 Jun 2026: Publication (CN122201330A)

IPC: G10L21/0272; G10L21/0216

AI Tagging

Application Domain

Speech analysis




AI Technical Summary

⚠Technical Problem

Existing technologies struggle to distinguish between voices arriving from opposite directions on handheld devices, which leads to speech recognition crosstalk, role confusion, and difficulty with echo suppression. Furthermore, traditional multi-microphone methods are expensive and require large hardware, while purely deep-learning-based methods are unstable on small devices.

⚗Method used

A dual-microphone linear array is used to construct forward and backward end-fire beam signals. Speech separation is performed by combining the signal with a neural network. Through preprocessing and feature construction, the speech in the forward and backward regions is output.

🎯Benefits of technology

It achieves low-cost, low-complexity two-end voice separation, improves stability in noisy and reverberant environments, is suitable for two-sided interaction scenarios such as cashiers and customers, and reduces hardware requirements.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a dual-end speech separation method, system, handheld device, and storage medium suitable for handheld devices. The method includes: acquiring a dual-channel original speech signal using a first microphone and a second microphone on the handheld device; preprocessing the dual-channel original speech signal and constructing, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction; and constructing network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel original speech signal, inputting them into a pre-trained dual-region speech separation network, and outputting forward region speech and backward region speech. The application achieves front and back dual-region speech separation on a handheld device using only a small number of microphones, with the advantages of simple structure, low cost, high robustness, and easy product deployment, and is suitable for handheld cash registers, mobile inquiry terminals, intercom terminals, and other two-sided speech interaction scenarios.


Description

Technical Field

[0001] This invention relates to the field of speech signal processing technology, and in particular to a dual-end speech separation method, system, handheld device, and storage medium suitable for handheld devices.

Background Technology

[0002] In applications such as handheld POS terminals, handheld intercom devices, mobile service terminals, and portable human-computer interaction devices, the devices typically need to face speakers from both the front and back simultaneously. For example, in a handheld POS scenario, the cashier is on one side of the device, and the customer is on the other side; they may speak alternately or even partially overlap. If the device cannot effectively distinguish between voices from both directions, it will cause speech recognition crosstalk, role confusion, difficulties in echo suppression, and a degraded interactive experience.

[0003] In existing technologies, single-microphone solutions cannot provide spatial orientation information, making it difficult to separate multiple speakers. While multi-microphone array solutions can achieve better spatial resolution, they are costly, large in size, and consume a lot of power, making them unsuitable for handheld devices with limited space. Traditional beamforming methods typically focus on enhancing the target direction, but when speech exists simultaneously in two regions, it is often difficult to directly output two independent speech streams. Furthermore, blind separation methods that rely entirely on deep learning lack stable spatial priors and are easily affected by noise, reverberation, changes in speaker position, and individual device differences on small dual-microphone devices.

[0004] Therefore, existing technologies still have shortcomings.

Summary of the Invention

[0005] To address the aforementioned deficiencies in the prior art, this invention provides a dual-end speech separation method, system, handheld device, and storage medium suitable for handheld devices. The technical solution adopted by this invention is as follows:
In a first aspect, the present invention provides a dual-end speech separation method suitable for handheld devices, the method comprising:
acquiring dual-channel raw speech signals using a first microphone and a second microphone on a handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array;
preprocessing the dual-channel raw speech signal and constructing, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone;
constructing network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signal; and
feeding the network input features into a pre-trained dual-region speech separation network, which outputs forward region speech and backward region speech.

[0006] In one implementation, the preprocessing includes any one or more of the following: framing, windowing, short-time Fourier transform, and amplitude normalization.

[0007] In one implementation, network input features are constructed based on the forward end-fired beam signal, the backward end-fired beam signal, and the dual-channel raw speech signal, including: Based on the forward end-fired beam signal and the backward end-fired beam signal, the complex spectrum of the forward beam and the complex spectrum of the backward beam are obtained. Based on the original dual-channel speech signal, a dual-channel complex spectrum is obtained; The forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum are used as the main input features of the network input features.

[0008] In one implementation, constructing network input features based on the forward end-fired beam signal, the backward end-fired beam signal, and the dual-channel raw speech signal further includes: The dual-channel phase difference corresponding to the dual-channel original speech signal, the dual-beam logarithmic amplitude spectrum obtained based on the forward end-fired beam signal and the backward end-fired beam signal, the dual-beam energy difference feature, and the dual-beam energy ratio feature are used as auxiliary input features of the network input features.

[0009] In one implementation, the dual-region speech separation network employs any one or more of the following: recurrent neural network, temporal convolutional network, Transformer network, and dual-path temporal network. The dual-region speech separation network includes a feature encoding layer, a temporal modeling layer, and a dual-branch decoding layer; The training data of the dual-region speech separation network includes: target speech located in the forward region of the handheld device, target speech located in the backward region of the handheld device, ambient noise, impulse noise, reverberation response, and device frequency response disturbance. The loss function of the dual-region speech separation network is a weighted combination of frequency domain loss and time domain loss.

[0010] In one implementation, the network input features are fed into a pre-trained dual-region speech separation network, which outputs forward and backward region speech, including: The network input features are input into the dual-region speech separation network, and the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum are jointly encoded based on the feature coding layer in the dual-region speech separation network. The forward and backward speech masks are output from the dual-branch decoding layer in the dual-region speech separation network. The forward and backward speech masks are then applied to a reference spectrum or the corresponding beam spectrum to obtain the forward and backward speech.

[0011] In one implementation, feeding the network input features into a pre-trained dual-region speech separation network to output forward region speech and backward region speech further includes: inputting the network input features into the dual-region speech separation network, which achieves speech separation by directly outputting the complex spectra or time-domain waveforms of the forward region speech and the backward region speech.

[0012] In a second aspect, embodiments of the present invention also provide a dual-end speech separation system suitable for handheld devices, the system being used to implement the steps of the dual-end speech separation method suitable for handheld devices described in any of the above solutions, the system comprising:
a microphone array acquisition module, used to acquire dual-channel raw speech signals based on a first microphone and a second microphone on a handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array;
a dual-end beamforming module, used to preprocess the dual-channel raw speech signal and construct, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone;
a feature construction module, used to construct network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signal; and
a speech separation module, used to input the network input features into a pre-trained dual-region speech separation network and output forward region speech and backward region speech.

[0013] In a third aspect, embodiments of the present invention also provide a handheld device, wherein the handheld device includes a memory, a processor, and a dual-end speech separation program for the handheld device stored in the memory and executable on the processor. When the processor executes the dual-end speech separation program, it implements the steps of the dual-end speech separation method for the handheld device according to any of the above solutions.

[0014] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a dual-end speech separation program suitable for a handheld device, wherein the program, when executed by a processor, implements the steps of the dual-end speech separation method suitable for a handheld device as described in any of the above solutions.

[0015] Beneficial Effects: Compared with existing technologies, this invention provides a dual-end speech separation method suitable for handheld devices. First, this invention acquires dual-channel raw speech signals using a first and second microphone on the handheld device, where the first and second microphones form a dual-microphone linear array. Then, the dual-channel raw speech signals are preprocessed to construct a forward end-fire beam signal pointing at 0 degrees and a backward end-fire beam signal pointing at 180 degrees in the frequency domain. The 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone. Next, network input features are constructed based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals. Finally, the network input features are input into a pre-trained dual-region speech separation network, outputting forward and backward region speech.

[0016] This invention achieves low-cost pre-partitioning of the front and rear spaces of a handheld device by constructing forward and backward end-fire beam signals using only two microphones, significantly reducing hardware complexity and structural size requirements.

[0017] This invention combines dual-end beamforming with neural network speech separation, giving the speech separation network a clear spatial direction prior, resulting in higher stability under noise, reverberation, and dual-talk conditions compared to purely blind separation schemes.

[0018] This invention can simultaneously output forward and backward speech, making it suitable for two-way interaction scenarios such as cashiers and customers, operators and service recipients, and equipment owners and external speakers.

[0019] This invention can also be combined with echo cancellation, noise suppression, automatic gain control, speaker detection, and speech recognition modules, making it easy to integrate quickly into existing handheld device products.

Description of the Drawings

[0020] Figure 1 is a flowchart of a preferred embodiment of the dual-end speech separation method for handheld devices according to an embodiment of the present invention.

[0021] Figure 2 is a schematic diagram of the structure of a dual-microphone linear array in a handheld device according to an embodiment of the present invention.

[0022] Figure 3 is a schematic diagram illustrating the technical route of the dual-end speech separation method for handheld devices according to an embodiment of the present invention.

[0023] Figure 4 is a schematic diagram of the structure of the dual-region speech separation network in the dual-end speech separation method for handheld devices according to an embodiment of the present invention.

[0024] Figure 5 is the audio spectrogram of test speech 1.

[0025] Figure 6 is the spectrum analysis diagram of the algorithm output for test speech 1.

[0026] Figure 7 is the audio spectrogram of test speech 2.

[0027] Figure 8 is the spectrum analysis diagram of the algorithm output for test speech 2.

[0028] Figure 9 is a technical framework diagram of the dual-end speech separation system for handheld devices according to an embodiment of the present invention.

[0029] Figure 10 is a schematic diagram of a handheld device provided in an embodiment of the present invention.

Detailed Implementation

[0030] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0031] The flowchart shown in the attached diagram is for illustrative purposes only and does not necessarily include all content, operations, or steps, nor does it require execution in the described order. For example, some operations or steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation.

[0032] It should be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0033] It should be understood that, in order to clearly describe the technical solutions of the embodiments of the present invention, the terms "first" and "second" are used in the embodiments of the present invention to distinguish identical or similar items with essentially the same function and effect. For example, "first control information" and "second control information" are only used to distinguish different control information and do not limit their order.

[0034] Those skilled in the art will understand that the words "first" and "second" do not limit the quantity or the order of execution, and that the words "first" and "second" do not necessarily imply that they are different.

[0035] It should also be understood that the term “and / or” as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0036] The core idea of this invention is to integrate the explicit linear array direction prior with data-driven separation capability. Specifically, at the front end, low-cost spatial pre-partitioning is established using 0-degree and 180-degree dual-end beamforming. At the back end, statistical features of speech, noise and reverberation in different regions are learned by jointly using the front and rear beam spectra and the dual-channel complex spectra corresponding to the two microphones, so as to improve the robustness of dual-region speech separation in real complex scenarios.

[0037] Specifically, the dual-end speech separation method for handheld devices in this embodiment can be applied to terminals, such as mobile phones and other intelligent handheld devices. As shown in Figure 1, the dual-end speech separation method for handheld devices includes the following steps:
Step S100: Acquire dual-channel raw speech signals based on the first and second microphones on the handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array.

[0038] In this embodiment, as shown in Figure 2, the handheld device is equipped with two omnidirectional microphones, namely a first microphone (mic1) and a second microphone (mic2). The first microphone and the second microphone are spaced apart along the main axis of the handheld device, thereby forming a dual-microphone linear array with a fixed spacing d on the main axis of the handheld device.

[0039] To clarify the dual-zone audio pickup logic, this invention defines the array axis pointing from the first microphone (mic1) to the second microphone (mic2) as the 0-degree direction, and the array axis pointing from the second microphone (mic2) to the first microphone (mic1) as the 180-degree direction. The 0-degree direction corresponds to the forward region of the handheld device, and the 180-degree direction corresponds to the backward region. In other words, the essence of dual-zone audio pickup is to distinguish and separate the forward and backward regions along the two opposite axial directions at the ends of the dual-microphone linear array. When the product structure is rotated, only the mapping between the 0-degree and 180-degree directions and the application-level forward and backward directions needs to be updated accordingly. Based on the above dual-microphone linear array, this embodiment can acquire dual-channel raw speech signals from the first and second microphones on the handheld device, i.e., the raw speech signal from the first microphone and the raw speech signal from the second microphone.

[0040] Step S200: Preprocess the dual-channel raw speech signal and construct, in the frequency domain, a forward end-fire beam signal pointing to 0 degrees and a backward end-fire beam signal pointing to 180 degrees, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone.

[0041] As can be seen from the technical route shown in Figure 3, after the hardware topology of the dual-microphone linear array is determined, this embodiment constructs the dual end-fire beams, that is, the forward end-fire beam signal and the backward end-fire beam signal. This embodiment first preprocesses the acquired dual-channel raw speech signals. Preprocessing includes any one or more of the following: framing, windowing, short-time Fourier transform, and amplitude normalization. In specific applications, this embodiment can perform framing and short-time Fourier transform on the dual-channel raw speech signals to convert them into the frequency domain, and then construct, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, thereby mapping the mixed sound field into two beam signals biased toward the front and the rear, which provide low-dimensional spatial priors.
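The preprocessing described in this step can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation from the application: the function names `stft` and `preprocess`, the Hann window, the 512-sample frame, the 256-sample hop, the 16 kHz sampling rate, and the shared peak normalization are all choices made for the example.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Frame, window, and transform a 1-D signal; returns shape (bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1).T

def preprocess(mic1, mic2, frame_len=512, hop=256, eps=1e-8):
    """Peak-normalize both channels with one shared gain, then take the STFT."""
    peak = max(np.max(np.abs(mic1)), np.max(np.abs(mic2)), eps)
    return stft(mic1 / peak, frame_len, hop), stft(mic2 / peak, frame_len, hop)

# Synthetic two-channel test signal: a 440 Hz tone with a small inter-mic delay.
fs = 16000
t = np.arange(fs) / fs
mic1 = 0.5 * np.sin(2 * np.pi * 440 * t)
mic2 = 0.5 * np.sin(2 * np.pi * 440 * (t - 1e-4))
X1, X2 = preprocess(mic1, mic2)   # complex spectra, one per microphone
```

A single gain is applied to both channels so that the inter-channel level and phase differences, which carry the spatial information, are left intact.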

[0042] This embodiment can construct the forward end-fire and backward end-fire beam signals based on the geometric spacing, array axis, and time delay relationship of the dual-microphone linear array. Specifically, for a dual-microphone linear array with array spacing $d$ and speed of sound $c$, the relative time delay of a sound wave incident along the array axis is determined by the incident angle $\theta$. When the beam points to 0 degrees and 180 degrees respectively, two directionally enhanced outputs can be formed corresponding to the forward and backward regions, and their relative time delay relationship can be expressed as follows:

[0043] $$\tau(\theta) = \frac{d\cos\theta}{c}$$

[0044] $$\tau_f = \tau(0^\circ) = \frac{d}{c}, \qquad \tau_b = \tau(180^\circ) = -\frac{d}{c}$$

[0045] where $\tau(\theta)$ is the time delay between the two channels of the original speech signal for incident angle $\theta$; $\tau_f$ is the dual-microphone reference delay for the forward region (i.e., 0 degrees); and $\tau_b$ is the dual-microphone reference delay for the backward region (i.e., 180 degrees).
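As a quick numerical check of the delay relationship, the sketch below assumes the standard far-field relation τ(θ) = d·cos(θ)/c. The 2 cm spacing, the 343 m/s speed of sound, the sign convention, and the function name `inter_mic_delay` are assumptions for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees Celsius

def inter_mic_delay(d, theta_deg, c=SPEED_OF_SOUND):
    """Far-field delay in seconds between the two microphones for angle theta."""
    return d * math.cos(math.radians(theta_deg)) / c

d = 0.02                                  # hypothetical 2 cm microphone spacing
tau_f = inter_mic_delay(d, 0.0)           # forward reference delay, +d/c
tau_b = inter_mic_delay(d, 180.0)         # backward reference delay, -d/c
tau_side = inter_mic_delay(d, 90.0)       # broadside arrival, essentially zero
```

The forward and backward reference delays are equal in magnitude and opposite in sign, which is what lets two microphones tell the two end-fire directions apart.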

[0046] After determining the above time delay relationship, this embodiment can combine the complex spectrum obtained after preprocessing the dual-channel original speech signal to complete the weighted beamforming of the dual-microphone array, and output two beam signals with front and rear offsets to initially enhance the speech in the target area and suppress cross-regional crosstalk components.

[0047] In a preferred implementation, the forward end-fire beam signal and the backward end-fire beam signal are obtained by preprocessing the complex spectra of the dual-channel original speech signals, and then performing complex weighting in the corresponding directions. Their outputs can be expressed as follows:

[0048] $$Y_f(k,l) = W_{f,1}(k)\,X_1(k,l) + W_{f,2}(k)\,X_2(k,l), \qquad Y_b(k,l) = W_{b,1}(k)\,X_1(k,l) + W_{b,2}(k)\,X_2(k,l)$$

[0049] where $k$ is the frequency index of the audio signal (i.e., the frequency-domain dimension); $l$ is the frame index of the audio signal (i.e., the sequence number after time-domain framing); $X_1(k,l)$ is the complex spectrum at the $k$-th frequency bin of the $l$-th frame of the original speech signal from the first microphone after framing and short-time Fourier transform; $X_2(k,l)$ is the complex spectrum at the $k$-th frequency bin of the $l$-th frame of the original speech signal from the second microphone after framing and short-time Fourier transform; $Y_f(k,l)$ is the forward end-fire beam signal output at the $k$-th frequency bin of the $l$-th frame; $Y_b(k,l)$ is the backward end-fire beam signal output at the $k$-th frequency bin of the $l$-th frame; $W_{f,1}(k)$ and $W_{f,2}(k)$ are the complex weighting coefficients applied at the $k$-th frequency bin to the first and second microphone signals, respectively, in the forward end-fire beam; and $W_{b,1}(k)$ and $W_{b,2}(k)$ are the complex weighting coefficients applied at the $k$-th frequency bin to the first and second microphone signals, respectively, in the backward end-fire beam.

[0050] By forming the aforementioned forward and backward end-fired beam signals, the original mixed sound field is mapped into two beam outputs with forward and backward offsets. Even with only two microphones, this embodiment can provide subsequent networks with low-dimensional spatial priors reflecting the difference between the 0-degree and 180-degree directions.
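The text does not specify how the complex weighting coefficients are chosen. The sketch below assumes one common option, a delay-and-subtract (first-order differential) design in which each beam places a null toward the opposite end-fire direction. The names `endfire_weights` and `beamform`, the geometry values, and the convention that a forward talker's sound reaches mic2 before mic1 are all assumptions.

```python
import numpy as np

def endfire_weights(n_bins, fs, frame_len, d, c=343.0):
    """Per-bin weights (W_1, W_2) for the forward and backward beams."""
    omega = 2 * np.pi * np.arange(n_bins) * fs / frame_len   # rad/s per bin
    delay = np.exp(-1j * omega * d / c)                      # phase of d/c delay
    ones = np.ones(n_bins, dtype=complex)
    w_f = np.stack([-delay, ones])   # forward beam: null toward 180 degrees
    w_b = np.stack([ones, -delay])   # backward beam: null toward 0 degrees
    return w_f, w_b

def beamform(X1, X2, w):
    """Weighted sum of the two channel spectra; inputs have shape (bins, frames)."""
    return w[0][:, None] * X1 + w[1][:, None] * X2

# Simulate an ideal far-field talker in the forward region.
fs, frame_len, d = 16000, 512, 0.02
n_bins = frame_len // 2 + 1
w_f, w_b = endfire_weights(n_bins, fs, frame_len, d)
rng = np.random.default_rng(0)
S = rng.normal(size=(n_bins, 10)) + 1j * rng.normal(size=(n_bins, 10))
delay = -w_f[0]                      # same per-bin delay term as in the weights
X2 = S                               # sound reaches mic2 first (assumed)
X1 = S * delay[:, None]              # then mic1, d/c seconds later
Y_f = beamform(X1, X2, w_f)          # forward beam keeps the talker
Y_b = beamform(X1, X2, w_b)          # backward beam cancels the talker
```

In this idealized case the backward beam output is zero while the forward beam retains energy. With real reverberation and microphone mismatch the cancellation is only partial, which is consistent with the text treating the beams as a spatial prior for the network rather than as the final separation.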

[0051] Step S300: Construct network input features based on the forward end-fired beam signal, the backward end-fired beam signal, and the dual-channel raw speech signal.

[0052] As shown in Figure 3, after determining the forward end-fire beam signal and the backward end-fire beam signal, this embodiment constructs the network input features. In specific applications, to improve the network's ability to represent complex scenes, the network input features fed into the dual-region speech separation network include not only the complex spectra of the forward end-fire beam signal and the backward end-fire beam signal, but also the dual-channel complex spectrum of the dual-channel original speech signal. Specifically, this embodiment obtains the forward beam complex spectrum and the backward beam complex spectrum based on the forward end-fire beam signal and the backward end-fire beam signal, and obtains the dual-channel complex spectrum based on the dual-channel original speech signal. Then, the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum are used as the main input features of the network input features. Furthermore, in other implementations, this embodiment can also add any one or more of the following as auxiliary input features of the network input features: the dual-channel phase difference corresponding to the dual-channel original speech signal, the dual-beam logarithmic amplitude spectrum obtained from the forward end-fire beam signal and the backward end-fire beam signal, the dual-beam energy difference feature, and the dual-beam energy ratio feature.

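[0053] The network input features can be expressed as $F(l) = \mathrm{Concat}\big[\,Y_f(k,l),\ Y_b(k,l),\ X_1(k,l),\ X_2(k,l),\ \Delta E(l),\ R(l)\,\big]$, where $F(l)$ denotes the network input features of the $l$-th frame; $\mathrm{Concat}[\cdot]$ denotes the concatenation (splicing) operation; $Y_f$ and $Y_b$ are the forward and backward beam complex spectra; $X_1$ and $X_2$ are the dual-channel complex spectra; $\Delta E(l)$ is the dual-beam energy difference feature of the $l$-th frame; and $R(l)$ is the dual-beam energy ratio feature of the $l$-th frame.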
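One way to assemble the main and auxiliary features is to stack them as real-valued channels. The sketch below is an assumption about layout, not the format used in the application: the function name `build_features`, the real/imaginary split, the channel ordering, and the `eps` floor are illustrative. The frame-level energy features are omitted here and could be appended in the same way after broadcasting across frequency.

```python
import numpy as np

def build_features(Y_f, Y_b, X1, X2, eps=1e-8):
    """Stack main and auxiliary features along a leading channel axis."""
    # Main features: real and imaginary parts of the four complex spectra.
    main = [part for z in (Y_f, Y_b, X1, X2) for part in (z.real, z.imag)]
    # Auxiliary features: inter-channel phase difference and beam log-magnitudes.
    ipd = np.angle(X1 * np.conj(X2))
    log_mag = [np.log(np.abs(Y_f) + eps), np.log(np.abs(Y_b) + eps)]
    return np.stack(main + [ipd] + log_mag).astype(np.float32)

# Random spectra of shape (bins, frames) stand in for real beam and mic spectra.
rng = np.random.default_rng(0)
shape = (5, 7)
Y_f, Y_b, X1, X2 = (rng.normal(size=shape) + 1j * rng.normal(size=shape)
                    for _ in range(4))
feats = build_features(Y_f, Y_b, X1, X2)   # shape (11, bins, frames)
```

Splitting complex spectra into real and imaginary channels keeps phase information available to a real-valued network without requiring complex arithmetic inside it.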

[0054] Furthermore, the dual-beam energy difference and dual-beam energy ratio in this embodiment can be used as auxiliary inputs for network input features, and can also be used to form forward and backward confidence indices. Their calculation relationship can be expressed as follows:

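[0055] $$\Delta E(l) = E_f(l) - E_b(l), \qquad R(l) = \frac{E_f(l)}{E_b(l) + \varepsilon}$$

[0056] where $\Delta E(l)$ is the dual-beam energy difference of the $l$-th frame; $R(l)$ is the dual-beam energy ratio of the $l$-th frame; $E_f(l)$ is the frame-level energy of the forward beam in the $l$-th frame; $E_b(l)$ is the frame-level energy of the backward beam in the $l$-th frame; and $\varepsilon$ is a very small positive number that prevents the denominator of the energy ratio from being zero and ensures numerical stability.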
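The frame-level energy features can be computed directly from the two beam spectra. In the sketch below, the frame energy is assumed to be the sum of squared magnitudes over frequency, and the normalized confidence values `conf_f` and `conf_b` are a hypothetical way of turning the energies into forward and backward confidence indices; the text does not give their exact form.

```python
import numpy as np

def beam_energy_features(Y_f, Y_b, eps=1e-8):
    """Per-frame energy difference, energy ratio, and directional confidences."""
    E_f = np.sum(np.abs(Y_f) ** 2, axis=0)    # forward beam energy per frame
    E_b = np.sum(np.abs(Y_b) ** 2, axis=0)    # backward beam energy per frame
    delta_e = E_f - E_b
    ratio = E_f / (E_b + eps)                 # eps guards against division by zero
    conf_f = E_f / (E_f + E_b + eps)          # hypothetical confidence in [0, 1)
    conf_b = E_b / (E_f + E_b + eps)
    return delta_e, ratio, conf_f, conf_b

# Toy spectra with 4 bins and 3 frames: forward beam twice the backward amplitude.
Y_f = 2.0 * np.ones((4, 3), dtype=complex)
Y_b = 1.0 * np.ones((4, 3), dtype=complex)
delta_e, ratio, conf_f, conf_b = beam_energy_features(Y_f, Y_b)
```

With these toy inputs the forward energy is 16 and the backward energy is 4 in every frame, so the difference is 12, the ratio is about 4, and the forward confidence is about 0.8.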

[0057] In some implementations, the forward and backward speech outputs of the dual-region speech separation network can be post-processed and enhanced based on the directional confidence. For example, when the forward directional confidence of a frame is significantly higher than the backward directional confidence, the forward output is enhanced and the backward residue is suppressed; conversely, the backward output is enhanced. This approach can further reduce crosstalk during dual-talk switching.
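A confidence-driven post-processing step of this kind might look like the following sketch. The threshold `margin`, the attenuation `floor`, and the function name `confidence_postprocess` are hypothetical, and this version only attenuates the non-dominant side rather than also boosting the dominant one.

```python
import numpy as np

def confidence_postprocess(S_f, S_b, conf_f, conf_b, margin=0.2, floor=0.3):
    """Attenuate the weaker side in frames where one direction clearly dominates."""
    gain_f = np.where(conf_b - conf_f > margin, floor, 1.0)   # backward dominates
    gain_b = np.where(conf_f - conf_b > margin, floor, 1.0)   # forward dominates
    return S_f * gain_f[None, :], S_b * gain_b[None, :]

# Three frames: forward-dominant, ambiguous, backward-dominant.
conf_f = np.array([0.9, 0.5, 0.1])
conf_b = 1.0 - conf_f
S_f = np.ones((2, 3), dtype=complex)
S_b = np.ones((2, 3), dtype=complex)
out_f, out_b = confidence_postprocess(S_f, S_b, conf_f, conf_b)
```

Frames in which neither direction dominates by more than the margin are passed through unchanged, so genuinely overlapping speech is not gated away.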

[0058] Step S400: Input the network input features into a pre-trained dual-region speech separation network and output forward region speech and backward region speech.

[0059] As shown in Figure 3, when the network input features are fed into a pre-trained dual-region speech separation network, it outputs forward region speech and backward region speech, thereby achieving speech separation. Specifically, the dual-region speech separation network in this embodiment employs any one or more of the following: recurrent neural networks, temporal convolutional networks, Transformer networks, and dual-path temporal networks. As shown in Figure 4, the dual-region speech separation network includes a feature encoding layer, a temporal modeling layer, and a dual-branch decoding layer. The feature encoding layer jointly encodes the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum to obtain the encoding result. The dual-branch decoding layer decodes the encoding result and outputs a forward region speech mask and a backward region speech mask, which are applied to a reference spectrum or the corresponding beam spectrum to obtain the forward region speech and the backward region speech.
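The three-stage structure (encoding, temporal modeling, dual-branch decoding) can be illustrated with a toy forward pass in NumPy. This is an untrained stand-in with random weights, meant only to show the data flow and tensor shapes. The class name `ToyDualRegionNet`, the simple recurrent update, the layer sizes, and the sigmoid mask heads are all assumptions and do not reflect the actual network in the application.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyDualRegionNet:
    """Untrained stand-in: encoder -> recurrent layer -> two mask heads."""

    def __init__(self, n_in, n_hidden, n_bins, scale=0.1):
        self.W_enc = rng.normal(0, scale, (n_hidden, n_in))
        self.W_rec = rng.normal(0, scale, (n_hidden, n_hidden))
        self.W_fwd = rng.normal(0, scale, (n_bins, n_hidden))
        self.W_bwd = rng.normal(0, scale, (n_bins, n_hidden))

    def __call__(self, feats):
        """feats: (frames, n_in) -> two masks, each of shape (n_bins, frames)."""
        h = np.zeros(self.W_rec.shape[0])
        masks_f, masks_b = [], []
        for x in feats:
            e = np.tanh(self.W_enc @ x)                # feature encoding layer
            h = np.tanh(e + self.W_rec @ h)            # temporal modeling layer
            masks_f.append(sigmoid(self.W_fwd @ h))    # forward decoding branch
            masks_b.append(sigmoid(self.W_bwd @ h))    # backward decoding branch
        return np.array(masks_f).T, np.array(masks_b).T

# 7 frames of 40-dimensional flattened features, 5 frequency bins per mask.
net = ToyDualRegionNet(n_in=40, n_hidden=16, n_bins=5)
M_f, M_b = net(rng.normal(size=(7, 40)))
```

Both branches read the same hidden state, so the two masks are estimated from a shared representation of the beam and channel features, which is the point of the shared encoder with a dual-branch decoder.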

[0060] During training, considering the noise, reverberation, device posture changes, and double-talk overlap present in real-world handheld scenarios, this invention preferably trains the dual-region speech separation network with a data generation method that combines simulation and real-world acquisition. The training data for the dual-region speech separation network includes: target speech located in the forward region of the handheld device, target speech located in the backward region of the handheld device, ambient noise, impulse noise, reverberation responses, and device frequency response disturbances. In the simulation phase, dual-channel mixed speech can be generated based on room impulse responses, and the corresponding forward and backward beam signals can be synthesized according to the dual-microphone geometric parameters. In the real-world acquisition phase, two-party dialogue data can be collected in scenarios such as handheld cash registers, mobile inquiries, and intercom interactions to improve the network's generalization to real devices. The loss function of the dual-region speech separation network in this embodiment is a weighted combination of a frequency domain loss and a time domain loss. The frequency domain loss constrains the amplitude error and the complex phase error between the predicted spectrum and the true spectrum, while the time domain loss, such as a scale-invariant signal-to-noise ratio loss, constrains the waveform consistency between the reconstructed speech and the target speech. Through this multi-objective joint optimization, separation degree, intelligibility, and sound quality fidelity can be balanced. After training, the dual-region speech separation network forms an end-to-end nonlinear mapping that is strictly aligned point by point in time and frequency. When the above-mentioned network input features are input into the dual-region speech separation network, the network outputs two time-frequency speech masks with the same resolution as the network input features.
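The weighted combination of frequency domain loss and time domain loss can be sketched as follows. This is only one plausible reading, under stated assumptions: the function names, the equal default weights `w_freq` and `w_time`, the use of mean squared magnitude error plus mean squared complex error as the frequency domain term, and the use of negative scale-invariant signal-to-noise ratio (SI-SNR) as the time domain term are illustrative choices, not values given in the text.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for two 1-D waveforms."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove any dependence on gain.
    projection = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    noise = estimate - projection
    return 10.0 * np.log10((np.sum(projection ** 2) + eps) / (np.sum(noise ** 2) + eps))


def spectral_loss(est_spec, ref_spec):
    """Magnitude error plus complex (magnitude and phase) error between two spectra."""
    magnitude_term = np.mean((np.abs(est_spec) - np.abs(ref_spec)) ** 2)
    complex_term = np.mean(np.abs(est_spec - ref_spec) ** 2)
    return magnitude_term + complex_term


def joint_loss(est_spec, ref_spec, est_wave, ref_wave, w_freq=0.5, w_time=0.5):
    """Weighted sum; SI-SNR is negated because a higher SI-SNR is better."""
    return w_freq * spectral_loss(est_spec, ref_spec) - w_time * si_snr(est_wave, ref_wave)
```

In a two-output setting, such a loss would typically be evaluated once for the forward region output and once for the backward region output, and the two values summed.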

[0061] Specifically, when the network input features are input into the dual-region speech separation network, the feature coding layer of the network jointly encodes the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum. The dual-branch decoding layer of the network then outputs the forward region speech mask and the backward region speech mask, and these masks are applied to the reference spectrum or to the corresponding beam spectrum to obtain the forward region speech and the backward region speech. For example, when a mask estimation method is used, it can be expressed as:

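[0062] $$\hat{S}_{F}(t,k) = M_{F}(t,k) \odot Y_{\mathrm{ref}}(t,k), \qquad \hat{S}_{B}(t,k) = M_{B}(t,k) \odot Y_{\mathrm{ref}}(t,k)$$

[0063] where $\hat{S}_{F}(t,k)$ is the complex spectrum of the separated forward region speech; $\hat{S}_{B}(t,k)$ is the complex spectrum of the separated backward region speech; $M_{F}(t,k)$ is the forward region speech mask at the $k$-th frequency point of the $t$-th frame; $M_{B}(t,k)$ is the backward region speech mask at the $k$-th frequency point of the $t$-th frame; $Y_{\mathrm{ref}}(t,k)$ is the reference spectrum; and $\odot$ denotes element-wise multiplication of the mask and the spectrum.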
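The mask estimation step amounts to an element-wise product between each mask and a spectrum of the same shape. A minimal sketch follows, assuming NumPy arrays of shape (frames, frequency points); the function name `apply_masks` and its optional `bwd_spectrum` argument are hypothetical and only mirror the two options named in the text (a shared reference spectrum, or the corresponding beam spectrum for each mask).

```python
import numpy as np


def apply_masks(mask_fwd, mask_bwd, fwd_spectrum, bwd_spectrum=None):
    """Apply the two region masks element-wise.

    If bwd_spectrum is None, fwd_spectrum is treated as a shared reference
    spectrum; otherwise each mask is applied to its own beam spectrum.
    """
    if bwd_spectrum is None:
        bwd_spectrum = fwd_spectrum
    if mask_fwd.shape != fwd_spectrum.shape or mask_bwd.shape != bwd_spectrum.shape:
        raise ValueError("each mask must match the resolution of its spectrum")
    return mask_fwd * fwd_spectrum, mask_bwd * bwd_spectrum


# One frame, two frequency points, shared reference spectrum.
reference = np.array([[1.0 + 1.0j, 2.0 + 0.0j]])
forward, backward = apply_masks(np.array([[1.0, 0.25]]),
                                np.array([[0.0, 0.75]]),
                                reference)
```

In this small example the two masks sum to one at every point, so the two separated spectra add back up to the reference spectrum; a trained network is not required to satisfy that property.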

[0064] In another implementation, the dual-region speech separation network may not explicitly output a mask. Instead, it can directly output the complex spectra or time-domain waveforms of the forward region speech and the backward region speech to achieve speech separation. Any speech separation targeting two opposite regions that is achieved by combining the forward and backward spatial priors formed by the 0°/180° dual-end beams with the dual-channel complex spectrum information should fall within the scope of this invention.

[0065] Furthermore, after the forward region speech and the backward region speech are separated, this embodiment can select, route, enhance, or recognize the two speech streams according to business needs. For example, the forward region speech can be provided to a first service path and the backward region speech to a second service path, or the two streams can be used by speech recognition, recording and storage, intercom transmission, and human-computer interaction decision modules, respectively.

[0066] In the online inference phase of the dual-region speech separation network in this embodiment, the handheld device acquires dual-channel audio in real time and continuously outputs forward and backward region speech. For handheld POS scenarios, forward region speech can be used as the cashier's input to the business system, and backward region speech can be used as the customer's input to the business system, thereby being used for role recognition, order confirmation, voice command parsing, or transaction recording, respectively.

[0067] For intercom or mobile service scenarios, the system can output only forward voice or only backward voice depending on the current business mode. It can also transmit the two voices to different remote ends, or perform separate recognition locally and then execute different control logic.
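The mode-dependent output described above can be sketched as a small routing function. Everything named here is hypothetical: the `OutputMode` values, the function `route_streams`, and the path names `cashier_path` and `customer_path` (which follow the handheld POS example, where forward region speech is the cashier's input and backward region speech is the customer's input) are illustrative and not part of the described system.

```python
from enum import Enum


class OutputMode(Enum):
    FORWARD_ONLY = "forward_only"
    BACKWARD_ONLY = "backward_only"
    BOTH = "both"


def route_streams(forward_speech, backward_speech, mode=OutputMode.BOTH):
    """Map each (hypothetical) service path name to the stream it receives."""
    routed = {}
    if mode in (OutputMode.FORWARD_ONLY, OutputMode.BOTH):
        routed["cashier_path"] = forward_speech
    if mode in (OutputMode.BACKWARD_ONLY, OutputMode.BOTH):
        routed["customer_path"] = backward_speech
    return routed
```

Keeping the routing outside the separation network means the same separated streams can serve different business modes without retraining.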

[0068] In extended embodiments, the present invention can also be used in combination with an echo cancellation module, a noise suppression module, an automatic gain control module, and a speaker recognition module. Preferably, echo cancellation and basic preprocessing are performed first, followed by dual-end beamforming and dual-region speech separation to further improve the processing effect under complex acoustic conditions.

[0069] In addition to the standard implementation described above, this invention allows for various modifications without altering the core principles. For example, a reference microphone, bone conduction microphone, or vibration sensor can be added to the dual-microphone setup to enrich the input features; the end-fire beam can be replaced with other bidirectional beam structures with opposite main lobe directions; and the definitions of forward and backward can be mapped to the left and right regions of the device or other relative spatial regions. Therefore, any technical solution that substantially utilizes a small number of microphones to construct two pre-partitioned areas with opposite spatial directions on a handheld device and combines them with a neural network to achieve dual-region speech separation can be considered an equivalent solution of this invention.

[0070] Furthermore, this embodiment also verifies the speech separation effect of the method of the present invention through experiments. For example, Figure 5 shows the spectrogram of test speech 1, in which the front direction is taken as 0° and the source direction changes in steps of 10°, giving a total of 36 test sentences used to test the accuracy of front and rear direction judgment. Figure 6 shows the spectrum analysis of the algorithm output for test speech 1. The first channel is the reference microphone, the second channel is the rear output channel, and the third channel is the front output channel. The spectrum shows that the second channel is active for 90°~270°, which falls precisely in the rear speech pickup region, and that the third channel is active for 0°~90° and 270°~360°, which falls in the front speech pickup region, verifying the correctness of the front and rear pickup.
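The expected region for each of the 36 test directions can be tabulated with a short helper. This is a sketch of the test layout only, not of the algorithm under test; the function name `expected_region` is hypothetical, and treating exactly 90° and 270° as a separate "boundary" case is an assumption, since the text lists both angles as endpoints of the front and the rear ranges.

```python
def expected_region(angle_deg):
    """Return the pickup region that a source at angle_deg is expected to fall in."""
    angle = angle_deg % 360
    if angle < 90 or angle > 270:
        return "front"
    if 90 < angle < 270:
        return "rear"
    return "boundary"  # exactly 90 or 270 degrees, broadside to the array axis


# 36 test directions: 0, 10, ..., 350 degrees.
test_angles = list(range(0, 360, 10))
region_counts = {}
for angle in test_angles:
    region = expected_region(angle)
    region_counts[region] = region_counts.get(region, 0) + 1
```

Under this assumption, 17 directions are expected on the front output channel, 17 on the rear output channel, and 2 lie on the boundary between the regions.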

[0071] Figure 7 shows the audio spectrogram of test speech 2, with the front speech, rear speech, and double-talk speech segments marked in white boxes, used to test the performance of differentiation and separation. Figure 8 shows the spectrum analysis of the algorithm output for test speech 2. The first channel is the first channel of test speech 2, used as the reference signal; the second channel is the rear output channel; and the third channel is the front output channel. The outputs clearly show that front and rear speech are distinguished and separated effectively.

[0072] Based on the above embodiments, the present invention also provides a dual-end voice separation system suitable for handheld devices, the system being used to implement the steps of the above method embodiments. Specifically, as shown in Figure 9, the system in this embodiment includes: a microphone array acquisition module 10, a dual-end beamforming module 20, a feature construction module 30, and a speech separation module 40. The microphone array acquisition module 10 is used to acquire dual-channel raw speech signals based on a first microphone and a second microphone on a handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array. The dual-end beamforming module 20 is used to preprocess the dual-channel raw speech signals and to construct, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone. The feature construction module 30 is used to construct network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals. The speech separation module 40 is used to input the network input features into a pre-trained dual-region speech separation network and output forward region speech and backward region speech.
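The data flow through the dual-end beamforming module 20 and the feature construction module 30 can be sketched in NumPy. This is a sketch under stated assumptions rather than the implementation of this embodiment: the beams are built with a delay-and-subtract (first-order differential) scheme chosen for illustration, the speed of sound is taken as 343 m/s, and the function names, the dictionary keys, and the `eps` constant are hypothetical. The inputs `x1` and `x2` are the complex spectra of the first and second microphone, with shape (frames, frequency points).

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def endfire_beams(x1, x2, freqs_hz, mic_distance_m):
    """Delay-and-subtract end-fire beams built in the frequency domain.

    A source at 0 degrees lies on the second microphone's side, so its sound
    reaches microphone 2 first and microphone 1 after mic_distance / c seconds.
    """
    tau = mic_distance_m / SPEED_OF_SOUND
    delay = np.exp(-2j * np.pi * freqs_hz * tau)  # per-frequency phase of a tau delay
    forward = x2 - x1 * delay   # cancels sources arriving from 180 degrees
    backward = x1 - x2 * delay  # cancels sources arriving from 0 degrees
    return forward, backward


def build_features(x1, x2, forward, backward, eps=1e-8):
    """Main features (complex spectra) and auxiliary features for the network."""
    energy_fwd = np.abs(forward) ** 2
    energy_bwd = np.abs(backward) ** 2
    return {
        "main": np.stack([forward, backward, x1, x2]),
        "phase_difference": np.angle(x1 * np.conj(x2)),
        "log_amplitude": np.log(np.abs(np.stack([forward, backward])) + eps),
        "energy_difference": energy_fwd - energy_bwd,
        "energy_ratio": energy_fwd / (energy_bwd + eps),
    }
```

With such a beam pair, a source on the array axis at 0° is cancelled in the backward beam and kept in the forward beam, which is the kind of spatial prior the separation network can exploit alongside the raw dual-channel spectra.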

[0073] The functions and principles of each module in the dual-end voice separation system of the handheld device in this embodiment are the same as those of each step in the above method embodiment, and will not be repeated here.

[0074] Based on the above embodiments, the present invention also provides a handheld device, a schematic diagram of which may be as shown in Figure 10. The handheld device may include one or more processors 100 (only one is shown in Figure 10), a memory 101, and a computer program 102 stored in the memory 101 and executable on the one or more processors 100, for example, a dual-end speech separation program suitable for a handheld device. When the one or more processors 100 execute the computer program 102, they can implement the steps in the embodiment of the dual-end speech separation method suitable for a handheld device. Alternatively, when the one or more processors 100 execute the computer program 102, they can implement the functions of the modules/units in the embodiment of the dual-end speech separation system suitable for a handheld device, which is not limited here.

[0075] In one embodiment, the processor 100 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0076] In one embodiment, memory 101 may be an internal storage unit of the handheld device, such as a hard drive or memory. Memory 101 may also be an external storage unit of the handheld device, such as a plug-in hard drive, smart media card (SM), secure digital card (SD), flash card, etc. Furthermore, memory 101 may include both internal and external storage units of the handheld device. Memory 101 is used to store computer programs and other programs and data required by the handheld device. Memory 101 can also be used to temporarily store data that has been output or will be output.

[0077] Those skilled in the art will understand that the schematic diagram shown in Figure 10 is merely a partial structural diagram related to the present invention and does not constitute a limitation on the handheld device to which the present invention is applied. A specific handheld device may include more or fewer components than shown in the figure, or combine certain components, or have different component arrangements.

[0078] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0079] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A dual-end voice separation method suitable for handheld devices, characterized in that the method includes: acquiring dual-channel raw speech signals using a first microphone and a second microphone on a handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array; preprocessing the dual-channel raw speech signals to construct, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone; constructing network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals; and inputting the network input features into a pre-trained dual-region speech separation network, which outputs forward region speech and backward region speech.

2. The dual-end voice separation method for handheld devices according to claim 1, characterized in that the preprocessing includes any one or more of the following: framing, windowing, short-time Fourier transform, and amplitude normalization.

3. The dual-end voice separation method for handheld devices according to claim 1, characterized in that constructing the network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals includes: obtaining a forward beam complex spectrum and a backward beam complex spectrum based on the forward end-fire beam signal and the backward end-fire beam signal; obtaining a dual-channel complex spectrum based on the dual-channel raw speech signals; and using the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum as the main input features of the network input features.

4. The dual-end voice separation method for handheld devices according to claim 3, characterized in that constructing the network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals further includes: using the dual-channel phase difference corresponding to the dual-channel raw speech signals, the dual-beam logarithmic amplitude spectrum obtained based on the forward end-fire beam signal and the backward end-fire beam signal, the dual-beam energy difference feature, and the dual-beam energy ratio feature as auxiliary input features of the network input features.

5. The dual-end voice separation method for handheld devices according to claim 4, characterized in that the dual-region speech separation network employs any one or more of the following: recurrent neural network, temporal convolutional network, Transformer network, and dual-path temporal network; the dual-region speech separation network includes a feature coding layer, a temporal modeling layer, and a dual-branch decoding layer; the training data of the dual-region speech separation network includes: target speech located in the forward region of the handheld device, target speech located in the backward region of the handheld device, ambient noise, impulse noise, reverberation response, and device frequency response disturbance; and the loss function of the dual-region speech separation network is a weighted combination of frequency domain loss and time domain loss.

6. The dual-end voice separation method for handheld devices according to claim 5, characterized in that inputting the network input features into the pre-trained dual-region speech separation network, which outputs forward region speech and backward region speech, includes: inputting the network input features into the dual-region speech separation network, and jointly encoding the forward beam complex spectrum, the backward beam complex spectrum, and the dual-channel complex spectrum based on the feature coding layer in the dual-region speech separation network; outputting a forward region speech mask and a backward region speech mask based on the dual-branch decoding layer in the dual-region speech separation network; and applying the forward region speech mask and the backward region speech mask to a reference spectrum or the corresponding beam spectrum to obtain the forward region speech and the backward region speech.

7. The dual-end voice separation method for handheld devices according to claim 5, characterized in that inputting the network input features into the pre-trained dual-region speech separation network, which outputs forward region speech and backward region speech, further includes: inputting the network input features into the dual-region speech separation network, and achieving speech separation by outputting, based on the dual-region speech separation network, the complex spectra or time-domain waveforms of the forward region speech and the backward region speech.

8. A dual-end voice separation system suitable for handheld devices, characterized in that the system is used to implement the steps of the dual-end voice separation method for handheld devices according to any one of claims 1-7, the system comprising: a microphone array acquisition module, used to acquire dual-channel raw speech signals based on a first microphone and a second microphone on a handheld device, wherein the first microphone and the second microphone form a dual-microphone linear array; a dual-end beamforming module, used to preprocess the dual-channel raw speech signals and construct, in the frequency domain, a forward end-fire beam signal pointing in the 0-degree direction and a backward end-fire beam signal pointing in the 180-degree direction, wherein the 0-degree direction is the array axis from the first microphone to the second microphone, and the 180-degree direction is the array axis from the second microphone to the first microphone; a feature construction module, used to construct network input features based on the forward end-fire beam signal, the backward end-fire beam signal, and the dual-channel raw speech signals; and a speech separation module, used to input the network input features into a pre-trained dual-region speech separation network and output forward region speech and backward region speech.

9. A handheld device, characterized in that the handheld device includes a memory, a processor, and a dual-end voice separation program for the handheld device stored in the memory and executable on the processor, wherein when the processor executes the dual-end voice separation program for the handheld device, it implements the steps of the dual-end voice separation method for the handheld device as described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a dual-end speech separation program suitable for handheld devices, and the dual-end speech separation program suitable for handheld devices, when executed by a processor, implements the steps of the dual-end speech separation method suitable for handheld devices as described in any one of claims 1-7.