The invention provides a video image classification method based on a spatio-temporal co-occurrence two-stream network. The method mainly comprises data input, a spatio-temporal two-stream network, fusion, and an SVM classifier. Image and optical flow information are first input; the temporal network and the spatial network are combined by early fusion; the fused output serves as a feature vector; and the feature vector is input into the SVM classifier to obtain the final classification result. The invention combines temporal information and spatial information (spatio-temporal co-occurrence) in an early-fusion two-stream network and uses a video dataset of monkeys. Using more frames, that is, more spatial data, from each video yields a significant improvement in precision; the spatial information and the temporal information complement each other, and the precision reaches 65.8%. With the co-occurrence method, fewer separate clusters are formed, the clusters are generally more tightly grouped, and the temporal information is better utilized.
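The fusion and classification steps described above can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the abstract: early fusion is taken to be plain concatenation of a spatial (image) feature vector and a temporal (optical flow) feature vector; the short hand-written vectors stand in for the outputs of the two networks; the classifier is a binary linear SVM trained by subgradient descent on the hinge loss; and all names (`early_fusion`, `train_linear_svm`, `predict`) are hypothetical.

```python
def early_fusion(spatial_feat, temporal_feat):
    """Fuse the two streams by concatenation (an assumed fusion operator)."""
    return list(spatial_feat) + list(temporal_feat)


def train_linear_svm(X, y, epochs=100, lr=0.1, lam=0.01):
    """Train a binary linear SVM without bias term; labels must be +1 or -1.

    Each step shrinks the weights (L2 penalty) and, if the sample violates
    the margin, moves the weights toward classifying it correctly.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            w = [(1.0 - lr * lam) * w_j for w_j in w]
            if margin < 1.0:
                w = [w_j + lr * y_i * x_j for w_j, x_j in zip(w, x_i)]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return 1 if score >= 0 else -1


# Toy stand-ins for per-clip network outputs (illustrative values only).
spatial_feats = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
                 [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
temporal_feats = [[0.8, 0.2], [1.0, 0.1], [0.9, 0.0],
                  [0.2, 0.9], [0.1, 1.0], [0.0, 0.8]]
labels = [1, 1, 1, -1, -1, -1]

# Early fusion: one joint feature vector per clip, then a single classifier.
fused = [early_fusion(s, t) for s, t in zip(spatial_feats, temporal_feats)]
w = train_linear_svm(fused, labels)

# Classify a new clip from its (assumed) spatial and temporal features.
new_clip = early_fusion([0.85, 0.15], [0.9, 0.1])
print(predict(w, new_clip))
```

Because fusion happens before classification, the SVM sees spatial and temporal evidence jointly in one vector rather than combining two separate per-stream decisions. A multi-class setting would need a multi-class SVM scheme such as one-vs-rest, which this sketch omits.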