The invention relates to a natural interaction method for a virtual learning environment based on speech emotion recognition, belonging to the field of deep learning. The method comprises the following steps: (1) collecting speech signals of students and users through a Kinect, then resampling, framing and windowing, and removing silence to obtain short-time single-frame signals; (2) applying a fast Fourier transform to each frame signal to obtain frequency-domain data, computing its power spectrum, and applying a Mel filter bank to obtain a Mel spectrogram; (3) inputting the Mel spectrogram features into a convolutional neural network, performing convolution and pooling operations, and feeding the feature maps of the last downsampling layer into a fully connected layer to form an output feature vector; (4) compressing the output features and inputting them into a bidirectional long short-term memory (BiLSTM) neural network; (5) inputting the resulting output features into a support vector machine (SVM) for classification and outputting the classification result; (6) feeding the classification result back to the virtual learning system for interaction with the virtual learning environment. The invention prompts learners to adjust their learning state and enhances the practicality of the virtual learning environment.
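
The signal-processing part of steps 1 and 2 (framing, windowing, silence removal, FFT, power spectrum, Mel filter bank) can be illustrated with a minimal, dependency-free Python sketch. It is not the claimed implementation: all function names are hypothetical, and the Hamming window, frame length of 256, hop of 128, 20 Mel filters, energy-based silence threshold, and the common 2595·log10(1 + f/700) Mel formula are assumptions not stated in the abstract. Kinect capture and resampling are omitted, and a synthetic signal stands in for recorded speech.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    return [[signal[s + i] * window[i] for i in range(frame_len)]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def drop_silent_frames(frames, threshold):
    """Keep only frames whose mean energy exceeds the (assumed) threshold."""
    return [f for f in frames if sum(v * v for v in f) / len(f) > threshold]

def power_spectrum(frame):
    """One-sided power spectrum |X[k]|^2 / N of a single frame."""
    n = len(frame)
    return [abs(c) ** 2 / n for c in fft(frame)[: n // 2 + 1]]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    mels = [mel_max * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank

def log_mel_spectrogram(signal, sample_rate, frame_len=256, hop=128,
                        n_filters=20, silence_threshold=1e-6):
    """Return one row of log Mel-band energies per non-silent frame."""
    frames = drop_silent_frames(frame_signal(signal, frame_len, hop),
                                silence_threshold)
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    rows = []
    for frame in frames:
        power = power_spectrum(frame)
        # Small constant avoids log(0) for empty bands.
        rows.append([math.log(sum(w * p for w, p in zip(filt, power)) + 1e-10)
                     for filt in bank])
    return rows

# Synthetic stand-in for recorded speech: silence followed by a 440 Hz tone.
sample_rate = 8000
signal = [0.0] * 2000 + [math.sin(2 * math.pi * 440 * n / sample_rate)
                         for n in range(2000)]
mel = log_mel_spectrogram(signal, sample_rate)
print(len(mel), "frames x", len(mel[0]), "Mel bands")
```

The resulting frames-by-bands matrix is the kind of Mel spectrogram input that step 3 would pass to the convolutional neural network; in practice a numerical library would normally replace the hand-written FFT.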