The invention discloses an open-domain video natural-language description method based on multi-modal feature fusion. In the method, a deep convolutional neural network model is adopted to extract RGB image features and grayscale optical-flow image features; video spatio-temporal information and audio information are added to form a multi-modal feature system. When the C3D features are extracted, the overlap rate between the consecutive frame blocks input into the three-dimensional convolutional neural network model is dynamically adjusted, which overcomes the limitation imposed by the size of the training data and makes the method robust to the length of the video being processed; the audio information compensates for deficiencies of the visual modality, and finally the multi-modal features are fused. In the method, a data standardization step normalizes the feature values of each modality to a common range, which resolves the problem of differing feature scales; the dimension of each modality's features is reduced with PCA while 99% of the important information is effectively retained, which avoids the training failures caused by excessively high dimensionality. The method effectively improves the accuracy of the generated open-domain video description sentences and is highly robust to scenes, figures and events.
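The dynamic adjustment of the overlap rate between consecutive frame blocks fed to the 3D CNN could be sketched as follows. This is an illustrative sketch only: the 16-frame clip length, the fixed number of clips, and the function name are assumptions, not details taken from the invention.

```python
def sample_clip_starts(num_frames, clip_len=16, num_clips=8):
    """Choose start indices for `num_clips` clips of `clip_len` frames,
    dynamically adjusting the overlap between consecutive clips so that
    a video of any length yields a fixed-size C3D input batch."""
    if num_frames <= clip_len:
        # Short video: every clip starts at frame 0 (maximum overlap).
        return [0] * num_clips
    # Stride may be smaller than clip_len, i.e. clips overlap; longer
    # videos get a larger stride and hence less overlap.
    stride = (num_frames - clip_len) / (num_clips - 1)
    return [round(i * stride) for i in range(num_clips)]
```

For a 100-frame video this yields starts [0, 12, 24, ..., 84] (4 overlapping frames per clip pair), while a 40-frame video is covered with heavier overlap; either way the 3D CNN always receives the same number of clips.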
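The per-modality standardization and PCA reduction retaining 99% of the important information could be sketched as below. This NumPy-only version is a minimal illustration: z-score standardization and an interpretation of "99%" as the retained variance ratio are assumptions, and the function names are hypothetical.

```python
import numpy as np

def standardize(features):
    """Z-score standardization so each modality's feature values fall
    within a comparable range (mean 0, unit variance per dimension)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # avoid division by zero
    return (features - mean) / std

def pca_reduce(features, keep=0.99):
    """Project features onto the smallest number of principal
    components whose cumulative variance ratio reaches `keep`."""
    centered = features - features.mean(axis=0)
    # SVD of the data matrix gives the principal axes in `vt`.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var_ratio = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(var_ratio), keep)) + 1
    return centered @ vt[:k].T

def fuse(*modalities, keep=0.99):
    """Fuse modalities by concatenating their reduced features."""
    return np.concatenate(
        [pca_reduce(standardize(m), keep) for m in modalities], axis=1)
```

Standardizing before PCA keeps one modality's large-valued features (e.g. raw audio statistics) from dominating the fused representation, which is the scale-difference problem the abstract describes.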