The invention discloses a method for detecting actions in video and relates to the technical field of computer vision recognition. The video action detection method is based on a convolutional neural network. A spatio-temporal pyramid pooling layer is added to the network structure, which removes the network's restrictions on its input, speeds up training and detection, and improves the performance of video action classification and temporal localization. The convolutional neural network comprises convolutional layers, ordinary pooling layers, a spatio-temporal pyramid pooling layer, and fully connected layers, and its output comprises a category classification output layer and a temporal localization result output layer. With the method provided by the invention, video clips of different durations need not be obtained by downsampling; instead, the whole video is input directly in a single pass, which improves efficiency. At the same time, because the network is trained on video clips sampled at the same frequency, intra-class variation is not increased, the learning burden on the network is reduced, the model converges relatively quickly, and the detection results are relatively good.
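The way a spatio-temporal pyramid pooling layer removes the input restriction can be illustrated with a minimal sketch. It is not the implementation of the invention: the function names, the use of max pooling, the single-channel feature volume, and the pyramid levels are all assumptions chosen for illustration. The sketch shows only that feature volumes of different temporal and spatial sizes are pooled into vectors of the same length, which is what allows fully connected layers to follow.

```python
def bin_edges(size, bins):
    """Split range(size) into `bins` contiguous windows (may overlap by one element).

    Uses floor for the start and ceiling for the end, so every window is
    non-empty whenever size >= 1.
    """
    return [((i * size) // bins, -(-((i + 1) * size) // bins)) for i in range(bins)]


def spatiotemporal_pyramid_pool(volume, levels):
    """Max-pool a T x H x W feature volume into a fixed-length vector.

    volume: nested lists indexed as volume[t][h][w] (one channel, assumed).
    levels: list of (t_bins, h_bins, w_bins) tuples (hypothetical pyramid levels).
    The output length is the sum of t_bins * h_bins * w_bins over all levels,
    independent of T, H, and W.
    """
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for t_bins, h_bins, w_bins in levels:
        for t0, t1 in bin_edges(t_len, t_bins):
            for h0, h1 in bin_edges(h_len, h_bins):
                for w0, w1 in bin_edges(w_len, w_bins):
                    # One output value per spatio-temporal bin.
                    pooled.append(max(
                        volume[t][h][w]
                        for t in range(t0, t1)
                        for h in range(h0, h1)
                        for w in range(w0, w1)
                    ))
    return pooled


def make_volume(t_len, h_len, w_len):
    """Build a toy feature volume whose values encode their own position."""
    return [[[float(t * 100 + h * 10 + w) for w in range(w_len)]
             for h in range(h_len)]
            for t in range(t_len)]


# Two-level pyramid (assumed): one global bin plus a 2 x 2 x 2 grid.
LEVELS = [(1, 1, 1), (2, 2, 2)]

short_clip = spatiotemporal_pyramid_pool(make_volume(5, 4, 4), LEVELS)
long_clip = spatiotemporal_pyramid_pool(make_volume(12, 6, 7), LEVELS)
print(len(short_clip), len(long_clip))
```

Both calls return a vector of 1 + 8 = 9 values even though the two toy volumes differ in length, height, and width. In a real network the same pooling would be applied per channel, and the concatenated result would feed the fully connected layers that produce the classification and temporal localization outputs.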