The invention relates to the field of computer vision and deep learning, and in particular to a video group behavior recognition method based on a cascaded Transformer. The method comprises the following steps. First, a video data set is acquired and generated, three-dimensional spatio-temporal features are extracted from the video data set through a three-dimensional backbone network, and the spatial feature map of a key frame image is selected. The key frame spatial feature map is then preprocessed and fed into a human body target detection Transformer, which outputs the human body target boxes in the key frame image. Next, the sub-feature maps corresponding to the screened human body target boxes are mapped on the key frame image feature map, the query / key / value are calculated in combination with the feature maps of the frames surrounding the key frame, and the query / key / value are input into a group behavior recognition Transformer, which outputs a group-level spatio-temporal coding feature map. Finally, group behaviors are classified through a multi-layer perceptron. The method effectively improves group behavior recognition accuracy.
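The query / key / value step can be illustrated with a small sketch of scaled dot-product attention. This is only a toy illustration under stated assumptions, not the patented implementation: it assumes the queries come from the per-person sub-features of the key frame and the keys and values come from the surrounding-frame features, it omits the learned projection matrices, and it uses mean pooling over persons to form a group-level feature. The names `cross_attention`, `mean_pool`, `person_feats`, and `context_feats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed group-level pooling)."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


# Toy data: two persons, three surrounding-frame feature vectors (assumed).
person_feats = [[1.0, 0.0], [0.0, 1.0]]
context_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = cross_attention(person_feats, context_feats, context_feats)
group_feat = mean_pool(attended)
```

In a full pipeline, `group_feat` would be passed to a multi-layer perceptron for classification. A real system would use learned projections and multi-head attention from a deep learning framework rather than plain lists.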