Audio and video multi-mode sentiment classification method and system

A multimodal sentiment classification technology applied to speech analysis, neural-network learning methods, speech recognition, and related fields. It addresses problems such as the lack of a unified method for handling raw audio and video data and the inability to extract facial features from some videos, with the effects of improving information-processing efficiency, simplifying computational overhead, and improving accuracy.

Active Publication Date: 2021-09-17
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0012] 5. In the cited invention application there is no unified approach to processing raw audio and video data, whose format and content vary widely. For example, a video may contain no human face at all, making it impossible to extract facial features by the method described in that application.



Examples


Embodiment 1

[0032] As shown in Figure 1, in the present embodiment the audio-video multimodal emotion classification method comprises the following steps:

[0033] S1. Processing and calculation of raw video data

[0034] Obtain key frames and an audio signal from the input original video clip. For each key frame, the frame picture is scaled and fed to the face detection module: if the frame contains no face, it is segmented into equal-sized blocks; if it contains a human face, the Megvii Face++ open-source API is used to extract the facial key points. Mel spectrogram and MFCC (Mel-frequency cepstral coefficient) features are computed from the audio signal; the open-source speech-to-text toolkit DeepSpeech converts the audio into text, and functions provided by Transformers (a self-attention transformation network library) convert the text into word vectors and generate sentence tokens according to the sentence structure of the text....
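A rough, non-authoritative sketch of this preprocessing step is given below, assuming OpenCV for frame sampling, librosa for the Mel spectrogram and MFCC computation, and a BERT-style model from the Transformers library for the word vectors. The function names, sampling interval, and model choice are illustrative assumptions; the patent's Face++ key-point extraction and DeepSpeech transcription calls are omitted because their exact usage is not published here.

```python
# Sketch of step S1 preprocessing. Assumptions: OpenCV, librosa, and a
# BERT-style Transformers model; the Face++ key-point extraction and
# DeepSpeech transcription named in the patent are omitted.
import cv2
import librosa
import torch
from transformers import AutoModel, AutoTokenizer

def extract_keyframes(video_path, every_n=30, size=(224, 224)):
    """Sample every n-th frame as a simple key-frame proxy and scale it."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames

def audio_features(wav_path, sr=16000):
    """Compute the Mel spectrogram and MFCC features named in step S1."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mel, mfcc

def text_vectors(transcript, model_name="bert-base-uncased"):
    """Turn a transcript into word vectors (model choice is an assumption)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tok(transcript, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state  # (1, num_tokens, hidden)
```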

Embodiment 2

[0060] Based on the same inventive concept as Embodiment 1, this embodiment provides an audio-video multimodal emotion classification system, as shown in Figure 2, comprising:

[0061] a data preprocessing module, used to implement step S1 of Embodiment 1: processing and computing the original video data to obtain video data samples, audio data samples, and text feature samples;

[0062] an emotion feature extraction module, used to implement step S2 of Embodiment 1: constructing an emotion feature extraction network and performing feature extraction on the video data samples, audio data samples, and text feature samples respectively, obtaining visual modal features, audio features, and text features;

[0063] a feature fusion and classification module, used to implement step S3 of Embodiment 1: unifying the dimensions of the extracted visual modal features, audio features, and text features through a fully connected layer, inputting them into a tensor fusion network for fusion learning, and finally performing classification to output a multimodal sentiment classification probability result, as sketched below.
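As a rough illustration of this module, the sketch below follows the common outer-product formulation of tensor fusion (appending a constant 1 to each modality vector, taking the three-way outer product, flattening, and classifying). The patent does not publish its layer sizes, so every dimension here is an assumption.

```python
# Hedged sketch of tensor-fusion-style classification in PyTorch. All
# dimensions are illustrative; the patent's actual network is not shown.
import torch
import torch.nn as nn

class TensorFusionClassifier(nn.Module):
    def __init__(self, d_v=32, d_a=32, d_t=32, n_classes=2):
        super().__init__()
        fused_dim = (d_v + 1) * (d_a + 1) * (d_t + 1)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    @staticmethod
    def _append_one(x):
        # Appending a constant 1 keeps each modality's unimodal terms in
        # the outer product alongside the bimodal and trimodal terms.
        ones = torch.ones(x.size(0), 1, device=x.device, dtype=x.dtype)
        return torch.cat([x, ones], dim=1)

    def forward(self, v, a, t):
        # v, a, t: (batch, d_v/d_a/d_t), already dimension-unified upstream.
        v, a, t = map(self._append_one, (v, a, t))
        # Batched three-way outer product, flattened into one feature vector.
        fused = torch.einsum('bi,bj,bk->bijk', v, a, t).flatten(1)
        return torch.softmax(self.classifier(fused), dim=1)
```

Even with 32-dimensional inputs the fused tensor has 33 × 33 × 33 = 35,937 entries per sample, which is one reason the features are first dimension-unified through a fully connected layer.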



Abstract

The invention relates to the field of speech and image processing and pattern recognition, in particular to an audio and video multimodal sentiment classification method and system. The method comprises the steps of: processing and computing original video data to obtain a video data sample, an audio data sample, and a text feature sample; constructing an emotion feature extraction network and performing feature extraction on the video data sample, the audio data sample, and the text feature sample to obtain a visual modal feature, an audio feature, and a text feature across multiple modalities; and unifying the dimensions of the extracted visual modal features, audio features, and text features, inputting them into a tensor fusion network for fusion learning, and finally performing classification to output a multimodal sentiment classification probability result. Cross-modal emotion information can be effectively integrated: spatio-temporal high-dimensional feature extraction is performed on the videos, audios, and texts, which are spliced into multimodal feature vectors, after which fusion learning and emotion classification are carried out.
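The "dimension unification" mentioned in the abstract can be pictured as one fully connected projection per modality onto a shared width before fusion; the sizes below are assumptions for illustration only, not taken from the patent.

```python
# Illustrative dimension unification: one fully connected layer per modality
# maps features of different widths onto a common dimension before fusion.
# All widths are assumed.
import torch.nn as nn

class DimensionUnifier(nn.Module):
    def __init__(self, d_visual=512, d_audio=128, d_text=768, d_common=32):
        super().__init__()
        self.proj_v = nn.Linear(d_visual, d_common)
        self.proj_a = nn.Linear(d_audio, d_common)
        self.proj_t = nn.Linear(d_text, d_common)

    def forward(self, v, a, t):
        # Each modality arrives with its own width and leaves with d_common.
        return self.proj_v(v), self.proj_a(a), self.proj_t(t)
```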

Description

technical field

[0001] The invention relates to the fields of speech and image processing and pattern recognition, and specifically to an audio and video multimodal emotion classification method and system based on open-source deep learning frameworks.

Background technique

[0002] With the advent of the 5G era, building on the growth of the emerging Internet entertainment industry represented by short videos, the lifting of network speed restrictions will further make short videos a new mainstream information carrier. The ensuing explosive growth of video-borne data has made "information overload" an unavoidable problem. Personalized recommendation systems based on information content are playing an increasingly important role, so the demand for tagged description and classification of videos is also increasing. Secondly, with the continuous spread of 4G and 5G networks and the increase in the number of active online ...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06K9/00, G06K9/62, G06N3/04, G06N3/08, G10L15/26, G10L25/03, G10L25/24, G10L25/30, G10L25/63
CPC: G06N3/08, G10L25/63, G10L25/30, G10L25/03, G10L25/24, G10L15/26, G06N3/044, G06N3/045, G06F18/2415
Inventor: 岑敬伦, 李志鹏, 青春美, 罗万相
Owner: SOUTH CHINA UNIV OF TECH