Bimodal fusion emotion recognition method based on video and voice information
A voice-information and emotion-recognition technology, applied in the field of emotion recognition, which addresses problems such as the low accuracy of extracted emotional features, lack of objectivity, and loss of emotional information.
Embodiment 1
[0055] A bimodal fusion emotion recognition method based on video and voice information, as shown in Figure 1, comprises the following steps:
[0056] 1) Acquisition of voice signals and facial images: natural voice and facial images are acquired in a non-contact manner using a microphone and a camera;
[0057] The camera is a CMOS digital camera whose output electrical signal is directly amplified and converted into a digital signal;
[0058] The microphone is a digital MEMS microphone that outputs a 1/2-cycle pulse-density-modulated (PDM) digital signal;
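The PDM output described above is a 1-bit oversampled stream that must be low-pass filtered and decimated before it can be used as ordinary PCM audio. The following is a minimal sketch of that conversion; the boxcar filter and the decimation factor of 64 are assumptions for illustration (real front-ends typically use CIC filters), not part of the patent.

```python
import numpy as np

def pdm_to_pcm(pdm_bits: np.ndarray, decimation: int = 64) -> np.ndarray:
    """Convert a 1-bit PDM stream (values 0/1) to PCM by low-pass
    filtering (simple moving average) and decimating."""
    # Map bits {0, 1} to bipolar samples {-1.0, +1.0}
    signal = pdm_bits.astype(np.float64) * 2.0 - 1.0
    # One output sample per `decimation` input bits (boxcar average)
    n = len(signal) // decimation
    return signal[: n * decimation].reshape(n, decimation).mean(axis=1)

# A stream that is 75% ones encodes a positive amplitude of about 0.5
bits = np.array([1, 1, 1, 0] * 16)  # 64 bits
pcm = pdm_to_pcm(bits, decimation=64)
```

A production implementation would replace the boxcar with a multi-stage CIC/FIR decimator, but the density-to-amplitude principle is the same.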
[0059] 2) Signal preprocessing: the video signal and the voice signal are preprocessed separately so that each meets the input requirements of the model for its modality;
[0060] 3) Emotion feature extraction: feature extraction is performed separately on the facial image signal and the voice signal preprocessed in step 2), obtaining the corresponding feature...
Embodiment 2
[0065] The video information processing flow, as shown in Figure 2, comprises the following steps:
[0066] 1) Obtain the video file to be processed; parse the video file into video frames; filter the video frames based on their pixel information, and use the frames remaining after filtering as the facial-emotion images to be recognized;
[0067] 2) Based on the pixel information of each video frame, generate the corresponding histogram and determine the frame's sharpness; cluster the video frames according to the histograms and an edge-detection operator to obtain at least one class; within each class, filter out duplicate frames and frames whose sharpness is below the sharpness threshold;
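The frame-filtering step above can be sketched as follows. This is an illustrative simplification, not the patent's method: sharpness is estimated with the variance of a 4-neighbour Laplacian (an edge-detection operator), duplicates are detected by the L1 distance between normalized histograms rather than full clustering, and the two thresholds are arbitrary placeholder values.

```python
import numpy as np

def sharpness(frame: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian response; low values mean blur."""
    lap = (frame[1:-1, :-2] + frame[1:-1, 2:]
           + frame[:-2, 1:-1] + frame[2:, 1:-1]
           - 4.0 * frame[1:-1, 1:-1])
    return float(lap.var())

def histogram(frame: np.ndarray, bins: int = 32) -> np.ndarray:
    """Normalized grey-level histogram of a frame."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

def filter_frames(frames, sharpness_threshold=10.0, dup_threshold=0.05):
    """Keep frames that are sharp enough and not near-duplicates
    (small histogram L1 distance) of an already-kept frame."""
    kept = []
    for f in frames:
        if sharpness(f) < sharpness_threshold:
            continue  # too blurry
        h = histogram(f)
        if any(np.abs(h - histogram(k)).sum() < dup_threshold for k in kept):
            continue  # near-duplicate of a kept frame
        kept.append(f)
    return kept

# Example: a noisy (sharp) frame, its duplicate, and a flat (blurry) frame
rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
kept = filter_frames([noisy, noisy.copy(), np.full((32, 32), 128.0)])
```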
[0068] 3) Based on the filtered video frames, a convolutional-neural-network-based method is used to perform face detection, alignment, rotation, and resizing on each frame to ob...
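The detection step itself requires a trained CNN (e.g., an MTCNN-style detector), which cannot be reproduced here, but the subsequent alignment, rotation, and resizing can be sketched given two detected eye landmarks. The eye coordinates, output size, and nearest-neighbour sampling below are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def align_face(img, left_eye, right_eye, out_size=48):
    """Rotate the image so the eyes lie on a horizontal line, then
    sample an out_size x out_size crop centred on the eye midpoint
    (nearest-neighbour warp)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.arctan2(ry - ly, rx - lx)       # tilt of the eye line
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0  # rotation centre
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # Map each output pixel back into the source image
    ys, xs = np.mgrid[0:out_size, 0:out_size].astype(np.float64)
    xs -= out_size / 2.0
    ys -= out_size / 2.0
    src_x = cos_a * xs - sin_a * ys + cx
    src_y = sin_a * xs + cos_a * ys + cy
    src_x = np.clip(np.round(src_x), 0, img.shape[1] - 1).astype(int)
    src_y = np.clip(np.round(src_y), 0, img.shape[0] - 1).astype(int)
    return img[src_y, src_x]

# Example: eyes already level, so the crop is centred without rotation
img = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
face = align_face(img, left_eye=(10, 20), right_eye=(30, 20))
```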
Embodiment 3
[0072] The voice information processing flow, as shown in Figure 3, comprises the following steps:
[0073] 1) Acquire the human voice signal with a digital MEMS microphone, pre-emphasize it with a first-order high-pass FIR digital filter, and output the pre-emphasized voice data;
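The first-order high-pass FIR pre-emphasis filter above is conventionally y[n] = x[n] - a·x[n-1]; the coefficient a = 0.97 below is a common default assumed for illustration, as the patent text does not state it.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass FIR filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to compensate for spectral tilt in speech."""
    y = np.empty(len(x), dtype=np.float64)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# A constant (DC) signal is almost entirely suppressed after the first sample
out = pre_emphasis(np.ones(3))
```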
[0074] 2) Apply short-term analysis to divide the pre-emphasized voice data into frames, obtaining a time series of voice feature parameters;
[0075] 3) Apply a Hamming window function to the voice feature-parameter time series to obtain windowed voice data;
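Steps 2) and 3) together can be sketched as overlapping framing followed by Hamming windowing. The frame length of 400 samples and hop of 160 (25 ms / 10 ms at an assumed 16 kHz sampling rate) are conventional values, not taken from the patent.

```python
import numpy as np

def frame_and_window(x: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Split the signal into overlapping short-time frames and apply a
    Hamming window to each; returns an array of shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)

frames = frame_and_window(np.arange(1000.0))
```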
[0076] 4) Perform endpoint detection on the windowed voice data using the double-threshold comparison method to obtain the preprocessed voice data;
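A minimal sketch of double-threshold endpoint detection on short-time energy: frames above a high threshold mark definite speech, and the segment is then extended outward while energy stays above a low threshold. The classical method also consults the zero-crossing rate, and the threshold ratios below are placeholder assumptions.

```python
import numpy as np

def endpoint_detect(frames, high_ratio=0.5, low_ratio=0.1):
    """Return (start_frame, end_frame) of the detected speech segment,
    or None if no frame exceeds the high energy threshold."""
    energy = (frames ** 2).sum(axis=1)        # short-time energy per frame
    high = high_ratio * energy.max()
    low = low_ratio * energy.max()
    active = np.flatnonzero(energy > high)    # frames of definite speech
    if active.size == 0:
        return None
    start, end = active[0], active[-1]
    # Extend outward while energy stays above the low threshold
    while start > 0 and energy[start - 1] > low:
        start -= 1
    while end < len(energy) - 1 and energy[end + 1] > low:
        end += 1
    return start, end

# Example: silence, a weak onset frame, four loud frames, then silence
frames = np.zeros((10, 4))
frames[3:7] = 1.0   # clear speech (energy 4.0)
frames[2] = 0.4     # weak onset (energy 0.64, above the low threshold only)
segment = endpoint_detect(frames)
```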
[0077] 5) Apply a short-time Fourier transform to the preprocessed voice data and draw the speech spectrogram;
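The spectrogram step can be sketched as an STFT: window each frame, take its FFT, and stack the log-magnitudes into a time-frequency image. Frame length, hop, and the 16 kHz test signal are assumed values consistent with the framing sketch above, not figures from the patent.

```python
import numpy as np

def spectrogram(x: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Short-time Fourier transform log-magnitude in dB.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)   # windowed frames
    mag = np.abs(np.fft.rfft(frames, axis=1)) # one-sided spectrum per frame
    return 20.0 * np.log10(mag + 1e-10)       # dB scale, small floor for log(0)

# Example: a 1 kHz tone sampled at 16 kHz peaks in bin 1000/(16000/400) = 25
tone = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000.0)
S = spectrogram(tone)
```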
[0078] 6) The spectrogram is input into ...