
Multi-channel double-speaker separation method and system

A speaker separation technology, applied in speech analysis, instruments, etc., that addresses degraded speech separation performance, interference from other speakers' voices, and the inability to know the target speaker's specific location in advance.

Pending Publication Date: 2021-12-31
INST OF ACOUSTICS CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

Studies have shown that in a conference environment the proportion of overlapping speech is generally no higher than 20%, so the robustness of speech separation at different low speaker overlap ratios still needs to be improved.
On the other hand, for mixed speech audio with different speaker overlap ratios, the specific position of the target speaker within the speech cannot be known in advance, so the input for neural network training can only be the entire utterance.
In this case, if average pooling is used, frames containing the interfering speaker's speech and silent frames seriously affect the estimate of the target speaker's position information, which degrades speech separation performance.
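The effect of average pooling described above can be illustrated with a small sketch. It compares a plain mean over frame-level position estimates with a normalised weighted mean. The function name `weighted_pool`, the toy coordinates, and the hand-picked weights are assumptions for illustration only; this excerpt does not show how the network produces or combines its weights.

```python
import numpy as np

def weighted_pool(frame_coords, frame_weights, eps=1e-8):
    """Normalised weighted average of frame-level coordinates.

    frame_coords:  array of shape (T, D), one position estimate per frame.
    frame_weights: array of shape (T,), non-negative weight per frame.
    Hypothetical helper; the pooling used in the patent may differ.
    """
    coords = np.asarray(frame_coords, dtype=float)
    weights = np.asarray(frame_weights, dtype=float)
    return (weights[:, None] * coords).sum(axis=0) / (weights.sum() + eps)

# Toy data: two target-speaker frames, one interfering frame, one silent frame.
coords = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
weights = np.array([1.0, 1.0, 0.0, 0.0])  # illustrative weights, not network output

average_estimate = coords.mean(axis=0)              # pulled away from the target
weighted_estimate = weighted_pool(coords, weights)  # stays at the target position
```

With these toy values the plain mean lands a quarter of the way to the target, while the weighted mean stays at the target position, because the interfering and silent frames receive zero weight.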



Examples


Embodiment 1

[0055] Based on the above-mentioned ideal sound source position estimation network and speaker masking estimation network, an embodiment of the present application provides a multi-channel dual-speaker separation method, the method comprising:

[0056] S31. Obtain mixed speech audio containing the voices of two speakers, and apply framing, windowing, and a Fourier transform to the mixed speech audio to obtain the frequency spectrum of each frame of audio.

[0057] In one possible implementation, this includes the following steps S311-S313:

[0058] S311. Divide the mixed speech audio to be separated into frames; each frame is 25 milliseconds long and the frame shift is 6.25 milliseconds.

[0059] S312. Apply a window to each frame; the window function is a Hamming window.

[0060] S313. Perform a 512-point Fourier transform on each frame of audio to obtain the frequency spectrum of each frame of audio.
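Steps S311-S313 can be sketched for a single channel as follows. The 25 ms frame length, 6.25 ms frame shift, Hamming window, and 512-point transform come from the text; the 16 kHz sample rate and the function name `stft_frames` are assumptions, since the excerpt does not state the sample rate. At 16 kHz a frame is 400 samples, so each frame is zero-padded to 512 points before the transform.

```python
import numpy as np

def stft_frames(audio, sample_rate=16000, frame_ms=25.0, shift_ms=6.25, n_fft=512):
    """Frame, window, and transform one channel of audio (sketch of S311-S313).

    sample_rate=16000 is an assumption; the other defaults follow the text.
    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    """
    audio = np.asarray(audio, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000))    # 400 samples at 16 kHz
    frame_shift = int(round(sample_rate * shift_ms / 1000))  # 100 samples at 16 kHz
    if len(audio) < frame_len:
        raise ValueError("audio is shorter than one frame")
    n_frames = 1 + (len(audio) - frame_len) // frame_shift

    # S311: gather overlapping frames, shape (n_frames, frame_len).
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = audio[idx]

    # S312: apply a Hamming window to every frame.
    frames = frames * np.hamming(frame_len)

    # S313: 512-point FFT (rfft zero-pads each frame to n_fft samples).
    return np.fft.rfft(frames, n=n_fft, axis=-1)

# Usage: one second of a 1 kHz tone at the assumed 16 kHz sample rate.
t = np.arange(16000) / 16000.0
spectrum = stft_frames(np.sin(2 * np.pi * 1000.0 * t))
```

For multi-channel input, the same function would be applied to each microphone channel separately, giving one spectrum per channel.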

[0061] S32, input the frequency spectrum of ea...


Abstract

The invention relates to a multi-channel double-speaker separation method and system. The method comprises: processing mixed voice audio to obtain the frequency spectrum of each frame of audio; obtaining estimated frame-level Cartesian coordinates and corresponding weights from each frame of audio and a sound source position estimation network; obtaining a first logarithmic energy spectrum and a first sine-cosine inter-channel phase difference from the frequency spectrum of each frame of audio; obtaining a Cartesian coordinate estimate of the target speaker in the mixed voice audio from the estimated frame-level Cartesian coordinates and the corresponding weights; obtaining a first angle feature from the Cartesian coordinates of the target speaker; obtaining a target speaker and a first estimated speaker mask from the first logarithmic energy spectrum, the first sine-cosine inter-channel phase difference, the first angle feature, and a speaker mask estimation network; and obtaining the separated voices of the at least two speakers based on the target speaker, the first estimated speaker mask, and the mixed voice audio.
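Two of the input features named in the abstract, and the final masking step, can be sketched as follows. The definitions used here (log of the squared magnitude, cosine and sine of the phase difference between two channels, and element-wise multiplication of a mask with the mixture spectrum) are common conventions and are assumptions, as are all function names; the excerpt does not give the patent's exact formulas, and the angle feature is omitted because its definition is not shown.

```python
import numpy as np

def log_energy_spectrum(spec, eps=1e-8):
    """Log of the squared magnitude of a complex spectrum (assumed definition)."""
    return np.log(np.abs(spec) ** 2 + eps)

def cos_sin_ipd(spec_a, spec_b):
    """Cosine and sine of the phase difference between two channels.

    spec_a, spec_b: complex spectra of shape (T, F) from two microphones.
    """
    ipd = np.angle(spec_a) - np.angle(spec_b)
    return np.cos(ipd), np.sin(ipd)

def apply_masks(mixture_spec, masks):
    """Multiply each speaker mask with the mixture spectrum.

    mixture_spec: complex array (T, F); masks: real array (S, T, F).
    Returns the estimated spectra of the S speakers, shape (S, T, F).
    """
    return np.asarray(masks)[:, :, :] * mixture_spec[None, :, :]

# Toy data: 3 frames, 4 frequency bins, two channels with a 0.3 rad phase offset.
chan_a = 2.0 * np.exp(1j * 0.5) * np.ones((3, 4))
chan_b = 2.0 * np.exp(1j * 0.2) * np.ones((3, 4))

log_energy = log_energy_spectrum(chan_a)
cos_ipd, sin_ipd = cos_sin_ipd(chan_a, chan_b)

# Hand-written masks standing in for the output of a mask estimation network.
masks = np.stack([np.full((3, 4), 0.25), np.full((3, 4), 0.75)])
separated = apply_masks(chan_a, masks)
```

Each separated spectrum would then be converted back to a waveform with an inverse short-time Fourier transform and overlap-add, which is left out of this sketch.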

Description

Technical field

[0001] The embodiments of the present application relate to the field of speech separation, and in particular to a multi-channel dual-speaker separation method and system.

Background

[0002] The goal of speech separation is to separate the different speakers from mixed speech audio containing reverberation and noise, yielding clean speech for each individual speaker. As the front end of speech recognition systems, voice logs, and other technologies, speech separation is widely used in settings such as teaching and conference environments.

[0003] Deep clustering is a traditional method for speech separation. It obtains the separated speech of the target speaker by training on the ideal binary mask of the target speaker over the mixed speech audio. During training, each time-frequency unit is mapped to a vector, and time-frequency units that lie close to one another are then clustered together. However, for the influence of differe...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G10L21/0272, G10L25/27
CPC: G10L21/0272, G10L25/27, Y02T10/40
Inventor: 张鹏远, 杨弋, 陈航艇, 颜永红
Owner: INST OF ACOUSTICS CHINESE ACAD OF SCI