The invention discloses an open-domain video natural-language description method based on multi-modal feature fusion. In the method, a deep convolutional neural network model extracts RGB image features and grayscale optical-flow image features; video spatio-temporal information and audio information are added, forming a multi-modal feature system. When the C3D features are extracted, the overlap rate between the consecutive frame blocks fed into the three-dimensional convolutional neural network model is dynamically adjusted, which overcomes the limitation imposed by the size of the training data and makes the method robust to the length of the video being processed; the audio information compensates for deficiencies of the visual modality, and the multi-modal features are finally fused. The method adopts a data standardization procedure to normalize the feature values of each modality within a fixed range, thereby resolving the problem of differing feature-value scales; the dimensionality of each modal feature is reduced with PCA, which effectively retains 99% of the important information and avoids the training failures caused by excessively high dimensionality. The accuracy of the generated open-domain video description sentences is effectively improved, and the method is highly robust to scenes, figures and events.
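The dynamically adjusted overlap between consecutive frame blocks can be sketched as follows. This is a minimal illustration, not the patented implementation: the clip length of 16 frames and the number of clips are hypothetical defaults, and the stride between clip start positions is derived from the video length, so short videos yield heavily overlapping clips while long videos yield sparse ones.

```python
def sample_c3d_clips(num_frames, clip_len=16, num_clips=8):
    """Sample fixed-size frame clips whose overlap adapts to video length.

    Hypothetical sketch: clip_len and num_clips are illustrative defaults,
    not values taken from the patent. Returns (start, end) frame-index
    pairs for each clip fed to the 3D convolutional network.
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    if num_clips == 1:
        starts = [0]
    else:
        # Spread clip starts evenly; the implied overlap grows as the
        # video gets shorter and shrinks as it gets longer.
        stride = (num_frames - clip_len) / (num_clips - 1)
        starts = [int(round(i * stride)) for i in range(num_clips)]
    return [(s, s + clip_len) for s in starts]
```

Because the stride is computed from the actual frame count, the same fixed-size C3D input can be produced for videos of any length, which is the robustness property the abstract claims.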
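The standardization-plus-PCA fusion step described above can be sketched in NumPy. This is an assumed implementation for illustration only: z-score standardization and an SVD-based PCA retaining 99% of the variance stand in for whatever specific normalization range and PCA routine the patent uses, and the function names are hypothetical.

```python
import numpy as np

def standardize(X):
    # Z-score each feature dimension so all modalities share a common
    # scale (one possible form of the patent's "data standardization").
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8  # avoid division by zero
    return (X - mu) / sigma

def pca_reduce(X, var_ratio=0.99):
    # PCA via SVD on centered data; keep the smallest number of
    # components whose cumulative explained variance reaches var_ratio.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), var_ratio)) + 1
    return Xc @ Vt[:k].T

def fuse(modalities, var_ratio=0.99):
    # Standardize and reduce each modality independently, then
    # concatenate the reduced features into one fused vector per sample.
    return np.concatenate(
        [pca_reduce(standardize(X), var_ratio) for X in modalities],
        axis=1)
```

Reducing each modality separately before concatenation keeps any single high-dimensional modality (e.g. C3D features) from dominating the fused representation or inflating it to a dimensionality that makes training fail.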