The invention discloses a multimodal speech emotion recognition method based on an enhanced deep residual neural network, which relates to the technical fields of video-stream image processing, speech-signal analysis and the like, and addresses the emotion recognition problem in human-computer interaction. The method mainly comprises the following steps: extracting feature representations of the video (sequence data) and the speech, including converting the speech data into the corresponding spectrogram representation and encoding the time-series data; using a convolutional neural network to extract the emotional features of the raw data for classification. The model accepts multiple inputs whose dimensions differ. A cross-convolution layer is proposed to fuse the data features of the different modalities, and the overall network structure of the model is an enhanced deep residual neural network. After the model is initialized, the multi-class model is trained with the spectrograms, the sequential video information and the corresponding emotion labels. After training, unlabeled speech and video are fed to the model to obtain the predicted emotion probabilities, and the class with the maximum probability is selected as the emotion category of the multimodal data. The invention improves recognition accuracy on the multimodal emotion recognition problem.
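The overall pipeline (speech-to-spectrogram conversion, per-modality feature extraction, fusion of the two modalities, and argmax over predicted probabilities) can be sketched as follows. This is a minimal illustrative sketch with NumPy only: the helper names (`log_spectrogram`, `fuse_features`), the number of classes, and the untrained linear head are all assumptions, and simple mean-pooling plus concatenation stands in for the patent's learned cross-convolution layer and residual network.

```python
import numpy as np

# Hypothetical sketch of the prediction path described in the abstract:
# speech -> spectrogram, fusion with video features, softmax, argmax.
N_CLASSES = 6  # assumed number of emotion categories
rng = np.random.default_rng(0)

def log_spectrogram(signal, frame=64, hop=32):
    """Toy log-magnitude spectrogram from short-time FFT frames."""
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log1p(mags)  # shape: (num_frames, num_bins)

def fuse_features(speech_feat, video_feat):
    """Stand-in for the cross-convolution fusion: mean-pool each
    modality over time, then concatenate (the patent's layer is learned)."""
    return np.concatenate([speech_feat.mean(axis=0), video_feat.mean(axis=0)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Fake inputs standing in for one utterance and its encoded video frames.
speech = rng.standard_normal(1024)
video = rng.standard_normal((16, 32))  # 16 frames x 32 encoded dimensions

spec = log_spectrogram(speech)                            # speech -> spectrogram
fused = fuse_features(spec, video)                        # fuse both modalities
W = rng.standard_normal((N_CLASSES, fused.size)) * 0.01   # untrained classifier head
probs = softmax(W @ fused)                                # emotion probabilities
label = int(np.argmax(probs))                             # max probability -> class
print(label, probs)
```

In the actual invention the pooling/concatenation step is replaced by the learned cross-convolution layer inside the enhanced deep residual network, and `W` corresponds to a trained classification head; the final argmax step matches the abstract's rule of taking the maximum probability as the predicted emotion category.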