The invention relates to the fields of
computer vision and
deep learning, and in particular to a video group
behavior recognition method based on a
cascade Transformer, which comprises the following steps: first, acquiring and generating a video
data set, extracting three-dimensional spatial-temporal features from the video
data set through a three-dimensional
backbone network, and selecting a
key frame image spatial feature map; preprocessing the
key frame image spatial feature map, sending the preprocessed
key frame image spatial feature map into a
human body target detection
Transformer, and outputting a
human body target box in the key frame image; then, mapping the screened
human body target box onto the key frame image feature map to obtain the corresponding sub-feature map, calculating the query, key, and value in combination with the feature maps of the frames surrounding the key frame, inputting the query, key, and value into a group
behavior recognition Transformer, and outputting a
group-level spatial-temporal encoded feature map; and finally, classifying the group behavior through a multi-layer
perceptron. The method effectively improves the accuracy of
group behavior recognition.
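
The query / key / value step can be illustrated with a minimal sketch of standard scaled dot-product attention. This is not the claimed implementation: it assumes, purely for illustration, that queries come from the per-person sub-feature vectors and that keys and values come from the surrounding-frame features, and that the attended features are mean-pooled into one group-level vector. The names `attention`, `mean_pool`, `person_feats`, and `context_feats` are hypothetical, and the toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain Python lists.

    queries: list of d-dimensional vectors (assumed: one per detected person)
    keys, values: lists of vectors (assumed: surrounding-frame features)
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(vectors):
    """Average a list of vectors into one vector (assumed group-level pooling)."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


# Toy data: two "person" feature vectors and three "context" feature vectors.
person_feats = [[1.0, 0.0], [0.0, 1.0]]
context_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = attention(person_feats, context_feats, context_feats)
group_feat = mean_pool(attended)
```

In a full pipeline of the kind described, a vector such as `group_feat` would then be passed to a multi-layer perceptron for classification; that stage is omitted here.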