Method of using spectrograms and deep convolution neural network (CNN) to identify voice emotion

A convolutional neural network and speech emotion recognition technology, applied in speech analysis, instruments, etc., which solves problems such as too few or too many convolutional layers and overfitting, and achieves the effect of improving speech emotion recognition ability.

Inactive Publication Date: 2018-02-16
BEIJING UNION UNIVERSITY
View PDF · 4 Cites · 26 Cited by

AI Technical Summary

Problems solved by technology

Compared with three convolutional layers, image feature extraction is not fine-grained enough
The fully connected layer can re...



Examples


Embodiment 1

[0040] As shown in Figure 1, step 100 is executed to generate a spectrogram from the speech signal, which serves as the input data of the deep convolutional neural network model. Generating the spectrogram specifically includes the following. A spectrogram is a visual representation of how the frequency content of a speech waveform changes over time. It is a two-dimensional graph in which the abscissa represents time and the ordinate represents frequency. The amplitude of the speech signal at a given time and frequency is represented by the intensity and color of that point: dark blue indicates low amplitude, and bright red indicates high amplitude. The time-frequency relationship, i.e. the spectrogram, is obtained by applying the FFT to the speech signal. In order to observe the frequency content of the voice signal at a given moment, the signal is divided into multiple blocks, and each b...
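The block-wise FFT procedure described above can be sketched as a short-time Fourier transform in numpy. This is a minimal illustration, not the patent's implementation; the frame length, hop size, and Hann window are illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, fs, frame_len=256, hop=128):
    """Magnitude spectrogram: split the signal into blocks, window each
    block, and apply an FFT to each block (short-time Fourier transform)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft returns the non-negative frequency bins of each block
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, n_frames)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return times, freqs, spec

# usage: a one-second 440 Hz test tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
times, freqs, spec = spectrogram(np.sin(2 * np.pi * 440 * t), fs)
peak_bin = spec.mean(axis=1).argmax()   # strongest bin lands near 440 Hz
```

Plotting `spec` (e.g. on a log scale) with time on the abscissa and frequency on the ordinate yields the two-dimensional image the text describes.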

Embodiment 2

[0042] As shown in Figure 2, the overall system architecture of the present invention includes five parts: a speech input module 200, a spectrogram generation module 210, a data preprocessing module 220, a classifier module 230, and an output module 240.

[0043] The voice input module 200 is used for receiving input voice data.

[0044] The spectrogram generating module 210 is used to divide the input speech data and generate a spectrogram. It works as follows: the signal is divided into multiple blocks, and each block is transformed by the FFT. The Fourier transform of a non-periodic continuous-time signal X(t) is defined as X(f) = ∫ X(t) e^(-j2πft) dt, which yields the continuous frequency spectrum of the signal X(t). What is obtained in practical applications, however, is the discrete sampling sequence X(nT) of the continuous signal X(t). Therefore, it is necessary to use the discrete signal X(nT) to calculate the frequency spectrum of X(t). DFT definition...
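The DFT referenced here is the standard definition X(k) = Σ_{n=0}^{N-1} X(n) e^(-j2πkn/N) (the formula images did not survive extraction, so this is the textbook form, not copied from the patent). A direct evaluation of that sum can be checked against the library FFT:

```python
import numpy as np

def dft(x):
    """Direct evaluation of the DFT definition:
    X(k) = sum_{n=0}^{N-1} x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # column of output indices
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

x = np.random.default_rng(0).standard_normal(64)
assert np.allclose(dft(x), np.fft.fft(x))     # matches the fast algorithm
```

The FFT computes exactly this sum, only in O(N log N) rather than the O(N²) of the direct form, which is why it is used for the per-block transforms.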

Embodiment 3

[0049] As shown in Figure 3, the system is further explained in terms of its two parts, training and testing. The voice signal 300 is divided to form a spectrogram 310 as follows: the signal is divided into multiple blocks, and the FFT is applied to each block. The Fourier transform of a non-periodic continuous-time signal X(t) is defined as X(f) = ∫ X(t) e^(-j2πft) dt, which yields the continuous frequency spectrum of the signal X(t). What is obtained in practical applications is the discrete sampling sequence X(nT) of the continuous signal X(t); therefore the discrete signal X(nT) is used to calculate the frequency spectrum of X(t). For a finite-length discrete signal X(n), n = 0, 1, ..., N-1, the DFT is defined as X(k) = Σ_{n=0}^{N-1} X(n) e^(-j2πkn/N), k = 0, 1, ..., N-1, where N is the number of sampling points and j denotes the imaginary unit of the complex exponential. Using the method above to generate 5,000 spectrograms, import them into the classifier 302 of the deep ...
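Assembling those spectrograms into training and testing sets might look like the sketch below. The 5,000-image count comes from the text; the 80/20 split ratio, the six emotion classes, and the spectrogram dimensions are illustrative assumptions, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, height, width = 5000, 129, 61            # image size assumed
spectrograms = rng.random((n_samples, height, width))  # stand-in for real data
labels = rng.integers(0, 6, size=n_samples)         # e.g. six emotion classes

# shuffle, then split into training and testing partitions
perm = rng.permutation(n_samples)
split = int(0.8 * n_samples)
train_idx, test_idx = perm[:split], perm[split:]
X_train, y_train = spectrograms[train_idx], labels[train_idx]
X_test, y_test = spectrograms[test_idx], labels[test_idx]
print(X_train.shape[0], X_test.shape[0])            # → 4000 1000
```

The training partition is fed to the CNN classifier for weight optimization, while the held-out testing partition measures recognition accuracy on unseen speech.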



Abstract

The present invention provides a method for identifying voice emotion using spectrograms and a deep convolutional neural network. The method comprises the following steps: generating spectrograms from the voice signals; constructing a deep convolutional neural network model; training and optimizing the model using a large number of spectrograms as input; and testing and optimizing the trained model. The invention converts voice signals into images for processing and, in combination with the CNN, effectively improves recognition capability.

Description

technical field

[0001] The invention relates to the technical field of speech signal processing and pattern recognition, and in particular to a method for speech emotion recognition using a spectrogram and a deep convolutional neural network.

Background technique

[0002] With the continuous development of information technology, social development places higher demands on affective computing. For example, in human-computer interaction, a computer with emotional capabilities can acquire, classify, recognize, and respond to human emotions, helping users feel efficient and at ease, effectively reducing people's frustration in using computers, and even helping people understand their own and other people's emotional worlds. For example, such technology can detect whether a driver is focused, the level of stress he feels, and so on, and respond accordingly. In addition, affective computing can also be applied in related industr...

Claims


Application Information

IPC(8): G10L25/63, G10L25/30
CPC: G10L25/30, G10L25/63
Inventor: 袁家政, 刘宏哲, 龚灵杰
Owner: BEIJING UNION UNIVERSITY