Multi-speaker clustering system and method based on attention mechanism

An attention-based speaker clustering technology, applied in speech analysis, computer parts, instruments, etc. It addresses three problems: insufficient generalization ability, temporal aggregation of cluster labels (the first part of the audio is assigned to one person and the latter part to another), and the model's weak ability to learn discriminative frame features. The effect is an improved clustering result and reduced performance degradation.

Active Publication Date: 2020-07-28

SOUTH CHINA UNIV OF TECH

4 Cites, 4 Cited by


Problems solved by technology

The feature representations generated by these methods can effectively highlight the differences between the features of different speakers. However, the d-vector feature vector generated by a deep neural network (DNN) does not learn the temporal relationship between frames well, and the x-vector feature vector generated by a time-delay deep neural network (TDNN) has insufficient ability to learn the global characteristics of the audio.

[0005] Among improvements to the clustering method, one approach uses an LSTM network to learn the similarity between frames and generates a similarity matrix that serves as the similarity matrix for spectral clustering. This avoids the performance degradation caused by a poor choice of the similarity matrix's hyperparameters. However, this method can only describe the local relationship between frames, and it tends to make speaker clustering show temporal aggregation: the front part of a piece of audio is assigned to one person and the back part to another. Its results on dialogue audio are therefore not ideal. Other approaches directly use an LSTM or GRU network to train a fully supervised clustering network, training the corresponding fully supervised clustering model for a specific data set. Such a model has insufficient ability to learn the discriminative characteristics of frame features, and insufficient generalization ability.

In addition, some current clustering methods combined with deep learning do not use permutation-independent loss functions and rely on relatively simple loss functions instead. As a result, the clustering effect is easily affected by the order in which cluster labels are assigned, and the model does not converge easily.
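To illustrate what permutation independence means here, the following is a minimal sketch and not the loss function of the patent, which this page does not specify. It scores a predicted labeling by the smallest frame error over all relabelings of the predicted clusters, so swapping cluster names does not change the score. The function name `permutation_invariant_error` and the brute-force search over permutations are assumptions made for illustration; the search is only practical for a small number of speakers.

```python
from itertools import permutations


def permutation_invariant_error(true_labels, pred_labels):
    """Fraction of frames mislabeled under the best relabeling of predicted clusters.

    Hypothetical helper for illustration only; not the patent's loss.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad the targets with None so every predicted cluster can be mapped,
    # even when there are more predicted clusters than true speakers.
    n = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (n - len(true_ids))
    best = len(true_labels)
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        errors = sum(1 for t, p in zip(true_labels, pred_labels) if mapping[p] != t)
        best = min(best, errors)
    return best / len(true_labels)


# The same partition with swapped cluster names scores zero error.
print(permutation_invariant_error([0, 0, 1, 1], [1, 1, 0, 0]))
```

A naive frame-by-frame comparison would report every frame in that example as wrong, which is the order sensitivity the paragraph above describes.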

Embodiment

[0064] As shown in Figure 1, this embodiment is a multi-speaker clustering system based on an attention mechanism, comprising a noise removal module, a voice activity detection module, a deep feature vector generation module, and a deep feature vector clustering module.
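The four modules form a sequential pipeline, which can be sketched as follows. This is a structural illustration only: the class name `DiarizationPipeline`, the field names, and the toy stand-in functions are all assumptions, and the real modules would be the trained components described in the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = List[float]
Segment = List[float]
Vector = List[float]


@dataclass
class DiarizationPipeline:
    """Hypothetical container wiring the four modules together in order."""
    denoise: Callable[[Audio], Audio]                 # noise removal module
    detect_speech: Callable[[Audio], List[Segment]]   # voice activity detection module
    embed: Callable[[Segment], Vector]                # deep feature vector generation module
    cluster: Callable[[Sequence[Vector]], List[int]]  # deep feature vector clustering module

    def run(self, audio: Audio) -> List[int]:
        clean = self.denoise(audio)
        segments = self.detect_speech(clean)
        vectors = [self.embed(segment) for segment in segments]
        return self.cluster(vectors)


# Toy stand-ins (assumptions) so the sketch runs end to end.
pipeline = DiarizationPipeline(
    denoise=lambda a: a,
    detect_speech=lambda a: [a[i:i + 2] for i in range(0, len(a), 2)],
    embed=lambda s: [sum(s) / len(s)],
    cluster=lambda vs: [0 if v[0] >= 0 else 1 for v in vs],
)
labels = pipeline.run([1.0, 1.0, -1.0, -1.0])
print(labels)
```

Keeping each stage behind a plain callable makes it possible to swap, for example, one noise removal method for another without touching the rest of the pipeline.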

[0065] The noise removal module removes noise from the audio. Methods for removing audio background noise include, but are not limited to, the following methods and their variants: wavelet transform, Wiener filtering, LogMMSE, and neural networks such as DNN and CNN.

[0066] The voice activity detection module detects the start and end positions of the sound and separates the speech part from the non-speech part. Methods for doing so include, but are not limited to, the following methods and their combinations o...
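The list of detection methods is truncated in this source, so the sketch below does not reproduce any of them. It assumes a simple short-time energy threshold, a common baseline, purely to show what "detect the start and end positions" produces. The function name `energy_vad`, the frame length, and the threshold value are hypothetical.

```python
def energy_vad(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample-index pairs for regions whose mean frame energy
    exceeds the threshold. Assumed baseline, not the patent's method."""
    regions = []
    start = None
    for begin in range(0, len(samples), frame_len):
        frame = samples[begin:begin + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = begin          # a speech region opens at this frame
        elif start is not None:
            regions.append((start, begin))  # the region closes at the first quiet frame
            start = None
    if start is not None:
        regions.append((start, len(samples)))  # speech ran to the end of the signal
    return regions


# Silence, then two loud frames, then silence.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] * 2 + [0.0] * 4
print(energy_vad(signal))
```

The returned index pairs are what a downstream feature extractor would use to cut the speech segments out of the audio.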


Abstract

The invention discloses a multi-speaker clustering system and method based on an attention mechanism. The system comprises a noise removal module, which removes noise from the audio; a voice activity detection module, which detects the start and end positions of the sound and separates the speech part from the non-speech part; a deep feature vector generation network based on the self-attention mechanism, which extracts deep feature vectors from the audio clips; and a fully supervised clustering network based on the bidirectional long short-term memory network (Bi-LSTM) and the self-attention mechanism, which clusters the deep feature vectors and outputs a clustering result. In the multi-speaker clustering method based on the attention mechanism, the influence of noise on the clustering result is removed, and the feature vector generation module based on the self-attention mechanism can learn the global structural features of the audio and generate discriminative feature vectors. The fully supervised clustering network based on Bi-LSTM and the self-attention mechanism can better learn the temporal order and the discriminative features, so the clustering effect is better.
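As a rough illustration of why self-attention captures global structure, the sketch below applies scaled dot-product self-attention to a short sequence of frame features and averages the result into one segment-level vector. It is not the network of the invention: the learned query, key, and value projections are replaced by identity mappings, mean pooling is an assumed choice, and the names `self_attention` and `segment_embedding` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections (Q = K = V = frames).

    Each output frame is a weighted sum over ALL input frames, which is
    what lets the representation reflect the whole clip.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in frames:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(dim)])
    return outputs


def segment_embedding(frames):
    """Mean-pool the attended frames into one vector (assumed pooling choice)."""
    attended = self_attention(frames)
    dim = len(attended[0])
    return [sum(row[j] for row in attended) / len(attended) for j in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = segment_embedding(frames)
print(embedding)
```

Because every frame attends to every other frame, distant parts of the clip influence each output directly, in contrast to a purely local frame-by-frame model.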

Description

Technical field

[0001] The invention relates to the technical field of speech processing and clustering, in particular to a multi-speaker clustering system and method based on an attention mechanism.

Background technique

[0002] With the development of science, technology, and the Internet, the volume of data in modern society has increased significantly, and humanity has entered the era of big data. The information people receive is vast and complex, and voice data occupies an important position within it. How to extract useful information from voice data is a difficult and active research topic in which breakthroughs are continually sought. Speaker diarization is an important branch of speech processing. Its main idea is to separate the parts of a piece of audio that belong to different speakers and then cluster them, answering the question of "who speaks when". Speaker clustering, however, is different from speaker identification. Speaker clustering focuses on cl...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06K9/62, G06N3/04, G10L21/0208, G10L25/30

CPC: G10L21/0208, G10L25/30, G06N3/044, G06N3/045, G06F18/23

Inventor: 林伟伟, 胡康立

Owner: SOUTH CHINA UNIV OF TECH
