A Method for Sentiment Video Content Analysis Based on Multimodal Fusion
By performing multi-level encoding and multimodal fusion of video and audio, the problem of neglecting multimodal interaction in existing technologies is solved, thereby improving the accuracy and model efficiency of emotional video content analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING UNIV OF POSTS & TELECOMM
- Filing Date
- 2024-01-12
- Publication Date
- 2026-06-30
Figure CN117765449B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of pattern recognition and specifically relates to a method for analyzing emotional video content based on multimodal fusion.
Background Technology
[0002] With the rapid development of science and technology and the widespread use of mobile devices, the number of videos has increased dramatically in recent years. In this context, the automatic analysis of video content has become particularly important. Since a large number of videos both disseminate information and provide entertainment, they inevitably influence users' emotional state. Therefore, emotional video content analysis has become an important research topic in video content analysis. It has attracted increasing attention and is widely used in applications such as video content retrieval, personalized video recommendation, and video summarization.
[0003] The purpose of emotional video content analysis is to automatically identify the emotions evoked by videos. The emotional content of a video is defined as the intensity and type of emotion that people are expected to experience when watching it. Generally, there are two main types of methods for measuring emotion: discrete methods and dimensional methods. Discrete methods categorize emotions into different types, such as happiness, sadness, surprise, disgust, anger, and fear. Dimensional methods map emotions to a continuous space whose axes are arousal, valence, and dominance. Arousal measures the intensity of the emotion, valence represents the type of emotion, and dominance describes the degree of control over the emotion. Dominance is difficult to measure and is often omitted; therefore, the arousal-valence model is the commonly used dimensional method. The process of emotional video content analysis involves mapping the information contained in the video to the emotions that users experience after watching it. Generally, emotional video content analysis methods can be divided into two categories: direct methods, which predict emotional video content from audio and video features; and implicit methods, which analyze the spontaneous reactions of the audience while watching the video.
[0004] Existing research largely focuses on direct methods. The framework for direct emotional video content analysis mainly consists of two parts: extracting features from the video and then using those features for classification or regression. People typically perceive the world through multimodal information, such as visual and auditory information. Many studies have demonstrated that modeling with multimodal information improves the performance of video content analysis compared with modeling a single modality. In summary, the common pattern for emotional video content analysis is to first encode the video and audio (feature extraction) and then fuse them into a compact representation for classification or regression. The key to this pattern therefore lies in the form of the video encoding, the audio encoding, and the multimodal fusion. However, although previous research on emotional video content analysis has used multiple audio and video features, it often employs simple fusion schemes (early fusion, late fusion, etc.) to combine them, neglecting the interactions between modalities.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention proposes a method for analyzing emotional video content based on multimodal fusion. The method includes: acquiring the emotional video to be analyzed, inputting it into a trained emotional video content analysis model, and obtaining the emotional video content analysis results.
[0006] The training process of the emotion video content analysis model includes:
[0007] S1: Acquire training emotion videos and extract multimodal features from them; multimodal features include video depth features and audio features;
[0008] S2: Perform global and temporal coding on the video depth features to obtain global and local video features;
[0009] S3: Perform motion encoding on the training emotion video to obtain video embedding features;
[0010] S4: Combine the global video features, local video features, and video embedding features to obtain the video stitching features;
[0011] S5: Perform global and temporal encoding on the audio features to obtain global and local audio features; concatenate the global and local audio features to obtain concatenated audio features;
[0012] S6: Fuse the video splicing features and audio splicing features to obtain the fused features;
[0013] S7: Classify the fused features to obtain the sentiment classification results of the video;
[0014] S8: Calculate the total loss of the model and adjust the model parameters according to the total loss to obtain the trained emotional video content analysis model.
[0015] Preferably, the process of extracting multimodal features from training emotional videos includes: processing the training emotional videos using a ResNet-101 network to obtain video depth features; separating audio information from the training emotional videos and processing the audio information using a VGGish network to obtain audio features.
[0016] Preferably, the process of globally encoding video depth features includes: calculating the feature mean of the video depth features; calculating the feature standard deviation based on the feature mean and the video depth features; and combining the feature mean and the feature standard deviation as global video features.
[0017] Preferably, the process of temporally encoding video depth features includes: processing the video depth features using an LSTM network to obtain a temporal feature vector; and performing average pooling on the temporal feature vector to obtain local video features.
[0018] Preferably, step S3 specifically includes:
[0019] The training emotion video is divided into eight slices, and sixteen frames are extracted from each slice as keyframes.
[0020] The keyframes are embedded to obtain video embedding features.
[0021] Preferably, the process of fusing video splicing features and audio splicing features includes:
[0022] The video stitching features and audio stitching features are multiplied by different intramodal mapping matrices to obtain the video mapping matrix and audio mapping matrix, respectively.
[0023] The video mapping matrix and the audio mapping matrix are fused to obtain the fused features.
[0024] Furthermore, the formula for fusing the video mapping matrix and the audio mapping matrix is as follows:
[0025] $$f = \sum_{r=1}^{R} \tilde{v}_r \odot \tilde{a}_r$$
[0026] where $f$ represents the fused feature, $\tilde{v}_r$ represents the video mapping matrix obtained from the video stitching feature $v$ in the $r$-th sub-common space, $\tilde{a}_r$ represents the audio mapping matrix obtained from the audio splicing feature $a$ in the $r$-th sub-common space, $R$ represents the total number of common spaces, and $\odot$ represents the element-wise product operation.
[0027] Preferably, the total model loss is the sum of the cross-entropy loss and the mean squared error loss; the formula for calculating the cross-entropy loss is:
[0028] $$L_{CE} = -\log\frac{\exp(x_{class})}{\sum_{j}\exp(x_j)}$$
[0029] where $L_{CE}$ represents the cross-entropy loss, $x_{class}$ represents the predicted probability that the result is the label $class$, and $x_j$ represents the predicted probability of the $j$-th label in the output.
[0030] The formula for calculating the mean squared error loss is:
[0031] $$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
[0032] where $L_{MSE}$ represents the mean squared error loss, $\hat{y}$ represents the predicted result, $y$ represents the true label, and $N$ represents the number of samples.
[0033] The beneficial effects of this invention are as follows: it proposes a novel emotional video content analysis framework in which video and audio are encoded at multiple levels to obtain strong representations; compared with the prior art, it achieves high classification accuracy with fewer model parameters and has good application prospects.
Description of the Drawings
[0034] Figure 1 is a flowchart of the emotional video content analysis method based on multimodal fusion in this invention;
[0035] Figure 2 is a comparison chart of the ablation experiment results of the present invention;
[0036] Figure 3 is a comparison diagram of the results of the present invention and the comparative methods.
Detailed Implementation
[0037] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0038] This invention proposes a method for analyzing emotional video content based on multimodal fusion. As shown in Figure 1, the method includes: acquiring the emotional video to be analyzed, inputting it into a trained emotional video content analysis model, and obtaining the emotional video content analysis result.
[0039] The training process of the emotion video content analysis model includes:
[0040] S1: Acquire training emotion videos and extract multimodal features from them; multimodal features include video depth features and audio features.
[0041] Obtain training emotion videos with emotion classification labels, and extract n video frames from each training emotion video using a sparse sampling method; input the n video frames into a pre-trained ResNet-101 network to obtain the video depth features, described by the feature vectors $\{v_1, v_2, \ldots, v_n\}$, where $v_t$ represents the depth feature vector of frame $t$.
[0042] Audio information is extracted from the training emotion videos and processed by the VGGish network to obtain audio features (128 dimensions), denoted $\{a_1, a_2, \ldots, a_n\}$, where $a_n$ represents the audio feature vector of the $n$-th frame.
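The sparse sampling of n frames in S1 is not spelled out in the text. A minimal sketch, assuming a segment-based scheme in which the video is cut into n equal segments and one frame index is taken from each, might look as follows. The function name `sparse_sample` and the choice between the segment centre and a random position are illustrative assumptions, not details taken from the patent.

```python
import random

def sparse_sample(num_frames, n, rng=None):
    """Pick one frame index from each of n equal-length segments."""
    if n <= 0 or num_frames < n:
        raise ValueError("need at least n frames")
    indices = []
    for k in range(n):
        start = k * num_frames // n
        end = (k + 1) * num_frames // n  # exclusive upper bound of segment k
        if rng is None:
            # Deterministic choice: the centre of the segment (assumption).
            indices.append((start + end - 1) // 2)
        else:
            # Stochastic choice: a random frame inside the segment (assumption).
            indices.append(rng.randrange(start, end))
    return indices

centre_indices = sparse_sample(num_frames=100, n=5)
random_indices = sparse_sample(num_frames=100, n=5, rng=random.Random(0))
```

Each selected frame would then be passed to the pre-trained image network to obtain one depth feature vector per sampled frame.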
[0043] S2: Perform global and temporal encoding on the video depth features to obtain global and local video features.
[0044] Calculate the feature mean $\mu$ of the video depth features, and calculate the feature standard deviation $\sigma$ based on the feature mean and the video depth features; combine the feature mean and the feature standard deviation as the global feature $f_v^1$ of the video. Combining the feature mean with the feature standard deviation gives a more robust representation; the global feature representation of the video is obtained as follows (the square and square root are taken element-wise):
[0045] $$\mu = \frac{1}{n}\sum_{t=1}^{n} v_t$$
[0046] $$\sigma = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(v_t - \mu\right)^2}$$
[0047] $$f_v^1 = [\mu; \sigma]$$
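The global encoding described above can be illustrated with a short sketch that concatenates the per-dimension mean and standard deviation of the frame features. The function name `global_encode`, the toy input, and the use of the population standard deviation (dividing by n) are assumptions made for illustration.

```python
import math

def global_encode(frames):
    """Concatenate the per-dimension mean and standard deviation of frame features."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation (divide by n); this normalisation is an assumption.
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(dim)]
    return mean + std  # list concatenation: [mean; std]

# Three toy "frames" with two feature dimensions each.
features = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
f_global = global_encode(features)
```

The output has twice the dimension of a single frame feature, since the mean and the standard deviation each contribute one value per input dimension.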
[0048] An LSTM network is used to process the video depth features to obtain the temporal feature vector $H_v$; average pooling is then performed on the temporal feature vector to obtain the local video features.
[0049] LSTM networks can effectively model temporal information in videos. The processing of video depth features by the LSTM network can be represented as follows:
[0050] $i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi})$
[0051] $f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf})$
[0052] $g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg})$
[0053] $o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho})$
[0054] $c' = f * c + i * g$
[0055] $h' = o * \tanh(c')$
[0056] Here, $i$, $f$, and $o$ represent the input gate, the forget gate, and the output gate, respectively; $g$ is the candidate cell activation; $c$ is the cell activation vector and $h$ is the hidden state, with $c'$ and $h'$ their updated values; $x$ is the input at the current time step. $\sigma(\cdot)$ denotes the sigmoid activation function, and $*$ denotes element-wise multiplication. The hidden dimension of the LSTM is empirically set to 1024.
[0057] The video depth features are processed by the LSTM network to obtain the temporal feature vector $H_v$.
[0058] The average pooling process applied to the time series feature vector is represented as follows:
[0059]
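The temporal encoding described above (an LSTM over the frame features followed by average pooling of its hidden states) can be sketched in plain Python. This is a toy illustration only: the helper names, the parameter dictionary layout, the random initialisation, and the tiny dimensions are assumptions, and a practical implementation would use a deep learning framework with the hidden size stated in the text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Return W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h, c, p):
    """One LSTM update following the i, f, g, o, c', h' equations."""
    def gate(name, act):
        from_input = affine(p["W_i" + name], x, p["b_i" + name])
        from_hidden = affine(p["W_h" + name], h, p["b_h" + name])
        return [act(u + v) for u, v in zip(from_input, from_hidden)]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (toy initialisation)."""
    rng = random.Random(seed)
    p = {}
    for name in "ifgo":
        p["W_i" + name] = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                           for _ in range(hidden_dim)]
        p["W_h" + name] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                           for _ in range(hidden_dim)]
        p["b_i" + name] = [0.0] * hidden_dim
        p["b_h" + name] = [0.0] * hidden_dim
    return p

def temporal_encode(sequence, hidden_dim, p):
    """Run the LSTM over the sequence, then average the hidden states over time."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for x in sequence:
        h, c = lstm_step(x, h, c, p)
        states.append(h)
    pooled = [sum(s[d] for s in states) / len(states) for d in range(hidden_dim)]
    return states, pooled

params = init_params(input_dim=3, hidden_dim=4)
frame_features = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]]
H_v, local_feature = temporal_encode(frame_features, 4, params)
```

Here `H_v` holds one hidden state per time step and `local_feature` is their average, so its length equals the hidden dimension regardless of how many frames were sampled.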
[0060] S3: Perform motion encoding on the training emotion video to obtain video embedding features.
[0061] Motion information between adjacent frames in a video provides information about actions and movements, which is also important for analyzing induced emotions. Therefore, this invention performs motion encoding on training emotion videos, that is, extracts keyframes from the training emotion videos and processes the keyframes; specifically:
[0062] The training emotion video is divided into eight slices, and sixteen frames are extracted from each slice as keyframes. Preferably, an I3D network is used to read consecutive frames from the video, outputting a fixed-length feature vector every 16 frames. The keyframes are then processed by a linear layer for embedding, yielding a 2048-dimensional video embedding feature from the output.
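The slicing step of the motion encoding can be sketched as an index computation: the video is divided into eight slices and sixteen consecutive frames are taken from each. Where the sixteen frames sit inside a slice is not stated in the text, so centring them is an assumption, as is the function name `keyframe_indices`.

```python
def keyframe_indices(num_frames, num_slices=8, frames_per_slice=16):
    """Return, for each slice, the indices of consecutive frames centred in that slice."""
    if num_frames < num_slices * frames_per_slice:
        raise ValueError("video too short for the requested slicing")
    clips = []
    for s in range(num_slices):
        start = s * num_frames // num_slices
        end = (s + 1) * num_frames // num_slices  # exclusive
        # Centre a window of consecutive frames inside the slice (assumption).
        first = start + (end - start - frames_per_slice) // 2
        clips.append(list(range(first, first + frames_per_slice)))
    return clips

clips = keyframe_indices(num_frames=240)
```

Each list of sixteen indices identifies one clip that would be passed to the motion network, whose outputs are then embedded by the linear layer.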
[0063] S4: Combine the global video features, local video features, and video embedding features to obtain the video stitching feature $F_v$.
[0064] The combined global features, local features, and embedded features of the video are represented as follows:
[0065]
[0066] S5: Perform global and temporal encoding on the audio features to obtain global and local audio features; concatenate the global and local audio features to obtain the concatenated audio feature $F_a$.
[0067] The process of global and temporal encoding of audio features is similar to that of global and temporal encoding of video depth features, and will not be repeated here.
[0068] The global audio feature obtained after encoding is denoted $f_a^1$. The feature vector obtained by the LSTM network during temporal encoding is denoted $H_a$ and has size (512×n). Average pooling over $H_a$ yields the local audio feature, represented as:
[0069]
[0070] Concatenate global and local audio features:
[0071]
[0072] S6: Combine video splicing features and audio splicing features to obtain the fused features.
[0073] Before fusion, $F_a$ and $F_v$ are each passed through a dropout layer to obtain the video and audio splicing features.
[0074] The video stitching features and audio stitching features are multiplied by different intramodal mapping matrices to obtain the video mapping matrix and the audio mapping matrix, respectively:
[0075] $$\tilde{v} = W_v v \in \mathbb{R}^{d_v}$$
[0076] $$\tilde{a} = W_a a \in \mathbb{R}^{d_a}$$
[0077] where $\tilde{v}$ represents the video mapping matrix, $\tilde{a}$ represents the audio mapping matrix, $W_v$ represents the first mapping matrix, $W_a$ represents the second mapping matrix, $d_v$ represents the dimension of the video mapping matrix, and $d_a$ represents the dimension of the audio mapping matrix.
[0078] Fuse the video mapping matrix and the audio mapping matrix:
[0079] After $\tilde{v}$ and $\tilde{a}$ are mapped to the same dimension, the fused feature $f$ is obtained by an element-wise product:
[0080] $$f = \tilde{v} \odot \tilde{a}$$
[0081] To fully encode the bilinear interaction between the two modalities, this invention further imposes a rank constraint $R$ on the fusion vector $f$, representing it as the sum of $R$ rank-1 vectors through $R$ linear layers, instead of performing a single feature merging function. This allows the direct learning of $R$ distinct sub-common spaces, and the final fusion can be expressed as:
[0082] $$f = \sum_{r=1}^{R} \tilde{v}_r \odot \tilde{a}_r$$
[0083] where $f$ represents the fused feature, $\tilde{v}_r$ represents the video mapping matrix obtained from the video stitching feature $v$ in the $r$-th sub-common space, $\tilde{a}_r$ represents the audio mapping matrix obtained from the audio splicing feature $a$ in the $r$-th sub-common space, $R$ represents the total number of common spaces, and $\odot$ represents the element-wise product operation.
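The rank-constrained fusion described above can be sketched as a sum, over R sub-common spaces, of element-wise products between a projected video vector and a projected audio vector. The sketch below assumes each sub-space has its own pair of projection matrices mapping both modalities to a shared output dimension; the names `fuse`, `video_maps`, and `audio_maps`, the random weights, and the toy sizes are illustrative assumptions.

```python
import random

def matvec(W, x):
    """Multiply a list-of-rows matrix W by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def fuse(v, a, video_maps, audio_maps):
    """Sum over sub-spaces of (W_v^r v) * (W_a^r a), multiplied element-wise."""
    out_dim = len(video_maps[0])
    f = [0.0] * out_dim
    for W_v, W_a in zip(video_maps, audio_maps):
        v_r = matvec(W_v, v)  # video projected into the r-th sub-common space
        a_r = matvec(W_a, a)  # audio projected into the r-th sub-common space
        for k in range(out_dim):
            f[k] += v_r[k] * a_r[k]
    return f

rng = random.Random(0)
R, out_dim = 3, 4
v = [rng.uniform(-1, 1) for _ in range(6)]  # toy video splicing feature
a = [rng.uniform(-1, 1) for _ in range(5)]  # toy audio splicing feature
video_maps = [random_matrix(out_dim, len(v), rng) for _ in range(R)]
audio_maps = [random_matrix(out_dim, len(a), rng) for _ in range(R)]
f = fuse(v, a, video_maps, audio_maps)
```

Because every term multiplies a video projection by an audio projection, each output coordinate depends on products of video and audio components, which is the cross-modal interaction that plain concatenation does not capture.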
[0084] S7: Classify the fused features to obtain the sentiment classification results of the video.
[0085] Preferably, a classifier is used to process the fused features to obtain the probability of each category, and the category with the highest probability is taken as the sentiment classification result of the video.
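The classification step can be illustrated by turning classifier scores into probabilities and picking the most probable category. The text does not specify the classifier, so the softmax over raw scores, the function names, and the three label names below are hypothetical choices for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels):
    """Return the label with the highest probability together with all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical scores and label names, for illustration only.
label, probs = classify([0.2, 1.5, -0.3], ["negative", "positive", "neutral"])
```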
[0086] S8: Calculate the total loss of the model and adjust the model parameters according to the total loss to obtain the trained emotional video content analysis model.
[0087] The total loss of the model in this invention is the sum of the cross-entropy loss and the mean squared error loss; the formula for calculating the cross-entropy loss is:
[0088] $$L_{CE} = -\log\frac{\exp(x_{class})}{\sum_{j}\exp(x_j)}$$
[0089] where $L_{CE}$ represents the cross-entropy loss, $x_{class}$ represents the predicted probability that the result is the label $class$, and $x_j$ represents the predicted probability of the $j$-th label in the output.
[0090] The formula for calculating the mean squared error loss is:
[0091] $$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
[0092] where $L_{MSE}$ represents the mean squared error loss, $\hat{y}$ represents the predicted result, $y$ represents the true label, and $N$ represents the number of samples.
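The total loss described above, the sum of a cross-entropy term and a mean squared error term, can be sketched as follows. The sketch assumes the model produces class scores for the cross-entropy term and a separate numeric prediction for the mean squared error term; that split, the function names, and the toy inputs are assumptions, since the text does not state which outputs feed each term.

```python
import math

def cross_entropy(logits, target):
    """-log(exp(x[target]) / sum_j exp(x[j])), computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def mse(pred, true):
    """Mean of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def total_loss(logits, target, pred, true):
    """Unweighted sum of the two terms, as stated in the text."""
    return cross_entropy(logits, target) + mse(pred, true)

loss = total_loss([0.0, 0.0], 0, [0.5, 1.0], [0.0, 1.0])
```

With two equal scores the cross-entropy term equals log 2, and the toy regression pair contributes 0.125, so the example gives a quick sanity check of both terms.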
[0093] Based on the model's total loss, the backpropagation algorithm is used to update the network parameters until the required number of iterations or accuracy is reached. The model parameters are then saved, resulting in a trained sentiment video content analysis model. The sentiment video to be analyzed is then input into the trained model to obtain the sentiment video content analysis results, i.e., the sentiment classification results of the video.
[0094] Evaluation of the present invention:
[0095] This invention was evaluated on the MediaEval 2015 and MediaEval 2016 datasets. In Figure 2, ACC represents the accuracy of the classification task, while MSE and PCC represent the mean squared error and the Pearson correlation coefficient of the regression task. In Figure 3, since the experimental features and configurations of the comparative methods differ from each other, only their best results are compared. The figures show the following:
[0096] The method proposed in this invention achieves superior results and is competitive with state-of-the-art methods. This invention compares the capabilities of different coding strategies in multi-level coding. It finds that global + temporal + motion coding for video and global + temporal coding for audio exhibit the best performance. Experiments with different fusion methods verify the effectiveness of the multimodal tensor fusion network proposed in this invention. Experimental results and comparative analysis reveal the capabilities of different fusion methods and also demonstrate the superiority of the method proposed in this invention.
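The three metrics named in the evaluation (ACC, MSE, and PCC) can be computed as in the sketch below. The function names and the input values are toy illustrations and do not reproduce any result from the experiments.

```python
import math

def accuracy(pred_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred_labels, true_labels)) / len(true_labels)

def mean_squared_error(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def pearson(pred, true):
    """Pearson correlation coefficient between predictions and true values."""
    n = len(true)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in true))
    return cov / (sp * st)

# Toy values only.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
err = mean_squared_error([0.5, 1.0], [0.0, 1.0])
pcc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Higher ACC and PCC and lower MSE indicate better performance, which is how the classification and regression results are read.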
[0097] The above-described embodiments further illustrate the purpose, technical solution, and advantages of the present invention. It should be understood that the above-described embodiments are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made to the present invention within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for analyzing emotional video content based on multimodal fusion, characterized in that it includes: acquiring the emotional video to be analyzed and inputting it into a trained emotional video content analysis model to obtain the emotional video content analysis results; the training process of the emotional video content analysis model includes:
S1: Acquire training emotion videos and extract multimodal features from them; multimodal features include video depth features and audio features;
S2: Perform global and temporal encoding on the video depth features to obtain global and local video features;
S3: Perform motion encoding on the training emotion video to obtain video embedding features; specifically, this includes: dividing the training emotion video into eight slices and extracting sixteen frames from each slice as keyframes; and embedding the keyframes to obtain video embedding features;
S4: Combine the global video features, local video features, and video embedding features to obtain the video stitching features;
S5: Perform global and temporal encoding on the audio features to obtain global and local audio features; concatenate the global and local audio features to obtain concatenated audio features;
S6: Fuse the video splicing features and audio splicing features to obtain fused features; the process of fusing the video splicing features and audio splicing features includes: multiplying the video stitching features and audio stitching features by different intramodal mapping matrices to obtain the video mapping matrix and the audio mapping matrix, respectively; and fusing the video mapping matrix and the audio mapping matrix to obtain the fused features; the formula for fusing the video mapping matrix and the audio mapping matrix is: $$f = \sum_{r=1}^{R} \tilde{v}_r \odot \tilde{a}_r$$ where $f$ represents the fused feature, $\tilde{v}_r$ represents the video mapping matrix obtained in the $r$-th sub-common space, $v$ represents the video splicing features from which it is obtained, $\tilde{a}_r$ represents the audio mapping matrix obtained in the $r$-th sub-common space, $a$ represents the audio splicing features from which it is obtained, $R$ represents the total number of common spaces, and $\odot$ represents the element-wise product operation;
S7: Classify the fused features to obtain the sentiment classification results of the video;
S8: Calculate the total loss of the model and adjust the model parameters according to the total loss to obtain the trained emotional video content analysis model.
2. The emotional video content analysis method based on multimodal fusion according to claim 1, characterized in that the process of extracting multimodal features from training emotional videos includes: processing the training emotional videos using a ResNet-101 network to obtain video depth features; separating audio information from the training emotional videos and processing the audio information using a VGGish network to obtain audio features.
3. The emotional video content analysis method based on multimodal fusion according to claim 1, characterized in that the process of globally encoding video depth features includes: calculating the feature mean of the video depth features; calculating the feature standard deviation based on the feature mean and the video depth features; and combining the feature mean and the feature standard deviation as the global features of the video.
4. The emotional video content analysis method based on multimodal fusion according to claim 1, characterized in that the process of temporally encoding video depth features includes: using an LSTM network to process the video depth features to obtain a temporal feature vector; and performing average pooling on the temporal feature vector to obtain local video features.
5. The emotional video content analysis method based on multimodal fusion according to claim 1, characterized in that the total model loss is the sum of the cross-entropy loss and the mean squared error loss; the formula for calculating the cross-entropy loss is: $$L_{CE} = -\log\frac{\exp(x_{class})}{\sum_{j}\exp(x_j)}$$ where $L_{CE}$ represents the cross-entropy loss, $x_{class}$ represents the predicted probability that the result is the label $class$, and $x_j$ represents the predicted probability of the $j$-th label in the output; the formula for calculating the mean squared error loss is: $$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$ where $L_{MSE}$ represents the mean squared error loss, $\hat{y}$ represents the predicted result, $y$ represents the true label, and $N$ represents the number of samples.