
Bimodal fusion emotion recognition method based on video and voice information

A voice-information and emotion-recognition technology in the field of emotion recognition, addressing problems such as low accuracy of extracted emotional features, lack of objectivity, and loss of emotional information.

Pending Publication Date: 2021-07-23
CHANGCHUN UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

However, because human emotional expression is complex and diverse, judging emotion from a single kind of expression alone yields a one-sided, non-objective result and causes much valuable emotional information to be lost.
[0004] With the deepening development of artificial-intelligence technology in the information age, affective computing has attracted growing research attention. Human emotions, however, are complex and changeable, and judging emotional characteristics from any single source of information alone gives low accuracy. The present invention is proposed to improve that accuracy.

Method used



Examples


Embodiment 1

[0055] A bimodal fusion emotion recognition method based on video and voice information, as shown in Figure 1, comprises the following steps:

[0056] 1) Acquisition of voice signals and facial images: natural voice and facial images are acquired without contact, using a microphone and a camera;

[0057] The camera is a CMOS digital camera whose output electrical signal is directly amplified and converted into a digital signal;

[0058] The microphone is a digital MEMS microphone that outputs a 1/2-cycle pulse-density-modulated (PDM) digital signal;

[0059] 2) Signal preprocessing: the video signal and the voice signal are preprocessed separately so that each meets the input requirements of its modality's model;

[0060] 3) Emotion feature extraction: feature extraction is performed separately on the preprocessed face-image signal and voice signal from step 2), obtaining the corresponding feature...
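The three steps above can be sketched as a two-branch pipeline. This is a minimal illustration only, not the patent's implementation: the function names, the pre-emphasis coefficient 0.97, and the [0, 1] image scaling are all assumptions standing in for "meet the input requirements of the corresponding models".

```python
import numpy as np

def preprocess_audio(pcm: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Voice branch of step 2): first-order pre-emphasis y[n] = x[n] - alpha*x[n-1]."""
    return np.append(pcm[0], pcm[1:] - alpha * pcm[:-1])

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Video branch of step 2): scale 8-bit pixel values to [0, 1] for a CNN input."""
    return frame.astype(np.float32) / 255.0

# Step 1) stand-ins for microphone and camera capture (synthetic data)
pcm = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)        # 1 s of audio
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # one face image

audio_in = preprocess_audio(pcm)
image_in = preprocess_frame(frame)
print(audio_in.shape, image_in.dtype)  # each branch now matches its model's input
```

Step 3) would then pass `audio_in` and `image_in` to the modality-specific feature extractors described in Embodiments 2 and 3.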

Embodiment 2

[0065] The video information processing flow, as shown in Figure 2, comprises the following steps:

[0066] 1) Obtain the video file to be processed; parse the video file into video frames; filter the video frames based on their pixel information, and use the frames retained after filtering as the facial-emotion images to be recognized;

[0067] 2) Based on the pixel information of each video frame, generate the frame's histogram and determine its definition (sharpness); cluster the video frames according to the histograms and an edge-detection operator, obtaining at least one class; within each class, filter out repeated frames and frames whose definition is below the definition threshold;

[0068] 3) Based on the filtered video frames, use a convolutional-neural-network-based method to perform face detection, alignment, rotation, and resizing on each frame to ob...
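The frame filtering in step 2) can be illustrated as follows. The patent does not specify its definition measure or duplicate test; this sketch assumes Laplacian variance as the definition score and a normalized grey-level histogram distance for near-duplicates, with the thresholds (50.0, 0.02) chosen purely for the demo.

```python
import numpy as np

def sharpness(frame):
    """Definition score: variance of a 4-neighbour discrete Laplacian (assumed measure)."""
    g = frame.astype(np.float64)
    lap = (-4.0 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def histogram(frame, bins=32):
    """Normalized grey-level histogram, used here to spot near-duplicate frames."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def filter_frames(frames, sharp_thresh=50.0, dup_thresh=0.02):
    kept, hists = [], []
    for f in frames:
        if sharpness(f) < sharp_thresh:
            continue                      # drop frames below the definition threshold
        h = histogram(f)
        if any(np.abs(h - h0).sum() < dup_thresh for h0 in hists):
            continue                      # drop repeated (near-duplicate) frames
        kept.append(f)
        hists.append(h)
    return kept

# Demo: a sharp checkerboard, its exact duplicate, and a flat (blurred) frame
xx, yy = np.meshgrid(np.arange(64), np.arange(64))
sharp = (255 * ((xx + yy) % 2)).astype(np.uint8)
flat = np.full((64, 64), 128, dtype=np.uint8)
kept = filter_frames([sharp, sharp.copy(), flat])
print(len(kept))  # -> 1: only the sharp, non-duplicate frame survives
```

The surviving frames would then go to the CNN-based face detection and alignment of step 3).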

Embodiment 3

[0072] The voice information processing flow, as shown in Figure 3, comprises the following steps:

[0073] 1) Use a digital MEMS microphone to acquire the human voice signal, pre-emphasize it with a first-order high-pass FIR digital filter, and output the pre-emphasized voice data;

[0074] 2) Apply short-term analysis to divide the pre-emphasized voice data into frames, obtaining a time series of voice feature parameters;

[0075] 3) Apply a Hamming window function to the voice feature-parameter time series, obtaining windowed voice data;

[0076] 4) Apply the double-threshold comparison method to the windowed voice data for endpoint detection, obtaining the preprocessed voice data;

[0077] 5) Apply a short-time Fourier transform to the preprocessed voice data and draw the speech spectrogram;

[0078] 6) The spectrogram is input into ...
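Steps 1)-5) above form a standard speech front end and can be sketched with numpy. This is an illustrative approximation: the frame length, hop, and pre-emphasis coefficient are typical values assumed here, and a single short-time-energy threshold stands in for the patent's double-threshold (energy plus zero-crossing-rate) endpoint detection.

```python
import numpy as np

def speech_spectrogram(x, frame_len=400, hop=160, alpha=0.97):
    # 1) pre-emphasis with a first-order high-pass FIR: y[n] = x[n] - alpha*x[n-1]
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # 2) short-term analysis: split into overlapping frames
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # 3) Hamming windowing
    frames = frames * np.hamming(frame_len)
    # 4) endpoint detection (simplified: one energy threshold replaces the
    #    double-threshold energy + zero-crossing comparison of the patent)
    energy = (frames ** 2).sum(axis=1)
    voiced = frames[energy > 0.1 * energy.max()]
    # 5) short-time Fourier transform -> log-magnitude spectrogram
    return np.log1p(np.abs(np.fft.rfft(voiced, axis=1)))

fs = 16000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 220 * t), 0.0)  # 0.5 s tone, 0.5 s silence
spec = speech_spectrogram(x)
print(spec.shape)  # (n_voiced_frames, frame_len // 2 + 1)
```

The resulting spectrogram is what step 6) feeds into the speech sub-network.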



Abstract

The invention provides a bimodal fusion emotion recognition method based on video and voice information. Face and voice information is acquired through the camera and microphone of external equipment, and face-image and voice features are extracted from it. The face-image feature vectors and voice feature vectors are normalized, and the processed features are fed into a Bi-GRU network for training; the input features of the two single-modal sub-networks are then used to compute a weight for the state information at each moment. The weighted features of the two single-modal sub-networks are fused into a multimodal joint feature vector, which serves as the input of a pre-trained deep neural network containing an emotion classifier; the classifier outputs different types of emotion evaluation information. The resulting evaluation of the user's emotion is more objective and of greater reference value, so a more accurate emotion recognition result is obtained.
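The attention-weighted fusion described in the abstract can be sketched in miniature. All shapes, the shared scoring vector `w`, and the random states are illustrative assumptions; in the patent the states would come from the two trained Bi-GRU sub-networks rather than a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bi-GRU outputs: T timesteps of d-dimensional hidden states per modality
T, d = 20, 64
h_video = rng.standard_normal((T, d))   # video sub-network state sequence
h_audio = rng.standard_normal((T, d))   # voice sub-network state sequence

def attend(h, w):
    """Weight the state at each moment: score states, softmax, weighted sum."""
    scores = h @ w                       # one score per timestep, shape (T,)
    a = np.exp(scores - scores.max())    # numerically stable softmax
    a /= a.sum()
    return a @ h                         # context vector, shape (d,)

w = rng.standard_normal(d)               # illustrative shared scoring vector
# Fuse the two attended modalities into one multimodal joint feature vector
joint = np.concatenate([attend(h_video, w), attend(h_audio, w)])
print(joint.shape)  # -> (128,): input to the pre-trained DNN emotion classifier
```

In the full method this joint vector would be passed to the deep neural network whose emotion classifier produces the evaluation information.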

Description

technical field

[0001] This application relates to the field of emotion recognition, and in particular to a bimodal fusion emotion recognition method based on video and voice information.

Background technique

[0002] In general, the way humans naturally communicate and express emotion is multimodal: we can express emotions verbally or visually. When emotion is carried mainly by tone of voice, the audio data contains the main cues for emotion recognition; when emotion is expressed mainly through the face, most of the cues needed to mine emotion exist in the face images. Exploiting multimodal information such as facial expressions, speech intonation, and language content is an interesting and challenging problem.

[0003] Affective-computing research under the traditional paradigm focuses on a single modality, such as recognition from speech emotion, video action, or face images. These traditional single-modal emotion recognition calculati...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06K9/00 G06K9/62 G06N3/08 G10L25/63
CPC: G06N3/08 G10L25/63 G06V40/176 G06F18/24 G06F18/25
Inventor: 臧景峰, 史玉欢, 王鑫磊, 刘瑞
Owner CHANGCHUN UNIV OF SCI & TECH