A group behavior analysis method based on scene features and location distance information

By combining a dual-stream feature extraction network, a graph convolutional network, and a self-attention graph pooling network, and utilizing scene features and location distance information, the problem of insufficient utilization of scene information and location distance interaction information in group behavior recognition is solved, thereby improving the accuracy of group behavior recognition.

CN117576776B · Active · Publication Date: 2026-06-30 · SICHUAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SICHUAN UNIV
Filing Date: 2022-08-08
Publication Date: 2026-06-30


Abstract

This invention proposes a group behavior analysis method based on scene features and location distance information, mainly involving the extraction and inference of contextual features through deep learning for individual and group behavior recognition. This method proposes a contextual feature extraction strategy that focuses on both local and global information, inferring interaction information through graph convolutional networks, and incorporating a self-attention graph pooling network to improve the accuracy of behavior recognition. First, a dual-stream feature extraction network is used to extract individual features. Then, a novel contextual feature extraction strategy is proposed, modeling individual location distance information to enrich individual features, and extracting global features from RGB frames as scene features. Next, the interaction information between individuals and the scene is inferred in a graph convolutional network, and group features are formed through a self-attention graph pooling network to obtain the behavior recognition result. This invention fully considers the contextual features of groups, solving the problem of insufficient utilization of contextual features in behavior recognition.

Description

Technical Field

[0001] This invention relates to the problem of individual behavior and group behavior recognition in the field of deep learning, and in particular to a group behavior analysis method based on scene features and location distance information.

Background Technology

[0002] Group behavior recognition, an important field of computer vision, has attracted increasing attention. In recent years, deep learning has achieved remarkable results in group behavior recognition. Most related research extracts individual features using convolutional neural networks (CNNs), then designs a global module to abstract group features from the individual features for individual and group behavior recognition. Unlike individual behavior recognition, group behavior recognition focuses more on the interaction relationships between individuals within a group. With the development of graph convolutional networks (GCNs), their strong ability to handle non-Euclidean data has gradually been applied to inferring individual interaction relationships in group behavior recognition, yielding rich group features and improving the accuracy of behavior recognition. Currently, video-based group behavior recognition is widely applied in fields such as intelligent video surveillance and sports video analysis, and has significant research value and broad application prospects.

[0003] Existing behavior recognition methods rarely consider scene information and location-distance interaction information together. Scene information can be regarded as the global context in which a behavior occurs, while location-distance interaction information can be regarded as local context. Therefore, this patent proposes a group behavior analysis method based on scene features and location-distance information, which fully explores local and global context information and infers the relationships between them. First, the data is preprocessed, and optical flow images of the video frames are extracted using the TV-L1 method. Then, a dual-stream feature extraction network extracts appearance features and motion features from the original RGB frames and the optical flow frames, respectively, and RoIAlign is used to obtain the appearance and motion features of individuals from their bounding boxes. Next, a new feature extraction strategy is proposed, comprising a location feature extraction strategy and a scene exploration strategy. In the location feature extraction strategy, the locations of individuals and the distances between pairs of individuals are modeled to enrich individual features. In the scene exploration strategy, global features of the original RGB frames are extracted by a convolutional neural network as scene features. On this basis, an inference module incorporating a graph convolutional network and self-attention graph pooling is designed. Individual and scene features are treated as nodes of a graph, and the similarity between features defines the edges. The constructed graph is fed into the graph convolutional network to infer pairwise interactions between individual nodes and the global scene node. Then, a self-attention graph pooling network assigns different weights to different individuals and selects those with higher weights for pooling to form higher-order group features. Finally, these group features are input into a classifier for group activity recognition.

Summary of the Invention

[0004] The purpose of this invention is to provide a group behavior recognition method based on scene features and location distance information. It comprehensively considers global scene information and local location information, where local location information is used to further enrich individual features. It uses graph convolutional networks to infer the relationship between the global scene and individuals, and assigns different weights to different individuals through self-attention graph pooling networks, thereby inferring group behavior, improving the accuracy of group behavior recognition, and solving the problem of insufficient utilization of global-local features in group behavior recognition.

[0005] For ease of explanation, the following concepts are introduced first:

[0006] Convolutional Neural Network (CNN): Inspired by the visual neural mechanism, it is a multilayer perceptron designed to recognize two-dimensional shapes. This network structure is highly invariant to translation, scaling, tilting or other forms of deformation.

[0007] RoIAlign: Uses bilinear interpolation to compute accurate feature values within each Region of Interest (RoI) and then applies pooling to obtain the output, thereby avoiding the quantization error of RoI Pooling.
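The bilinear interpolation at the core of RoIAlign can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the patented implementation: it handles a single channel, treats integer coordinates as pixel centers, and takes one sample at the center of each output bin, whereas practical RoIAlign layers usually pool several samples per bin. The function names are hypothetical.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Bilinearly interpolate a 2-D grid (list of rows) at fractional (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp the sampling point to the valid grid range.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_1ch(feature_map, box, out_size=2):
    """Simplified RoIAlign: one bilinear sample at the center of each output bin.

    box is (x1, y1, x2, y2) in feature-map coordinates; no rounding is applied,
    which is what avoids the quantization error of RoI Pooling.
    """
    x1, y1, x2, y2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [bilinear_sample(feature_map, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
         for j in range(out_size)]
        for i in range(out_size)
    ]
```

Because the box coordinates stay fractional throughout, two boxes that differ by less than one pixel still produce different outputs, which is the property the definition above refers to.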

[0008] Graph Convolutional Network (GCN): Used to extract spatial features of topological graphs. The data it processes is graph-structured data, i.e., non-Euclidean structure.

[0009] Self-Attention Graph Pooling Network (SAG-Pool): A network that pools graph structures using a self-attention mechanism, and has strong portability.

[0010] The present invention specifically adopts the following technical solution:

[0011] A method for analyzing group behavior based on scene features and location distance information, characterized in that:

[0012] a. Extract individual behavioral features using a two-stream feature extraction network, a fully connected network, and a graph convolutional network;

[0013] b. Use the location distance module to model the location and relative distance information of individuals, obtain the topological structure of the crowd in space, and further enrich individual characteristics;

[0014] c. Extract global scene features and use scene features and individual features as nodes in a graph convolutional network to deeply explore the pairwise relationship between scenes and individuals;

[0015] d. Using a self-attention graph pooling network, different levels of attention are given to different individuals, and the graph structure is pooled to obtain high-level group behavior features;

[0016] This method mainly includes the following steps:

[0017] (1) Data preprocessing: Optical flow frames are generated from the original video frames, and the consecutive original video frames and optical flow frames are sampled and then input into the network.

[0018] (2) Feature extraction: A dual-stream feature extraction network was adopted, with the Inception-v3 network as the backbone network to extract global appearance features and global motion features respectively. The appearance features and motion features were added together, and then the RoIAlign module was used to obtain the appearance and motion features of the individual. The information obtained was then integrated using a fully connected layer.

[0019] (3) Location and distance information modeling: Based on the individual bounding box information, calculate the center position of each individual and the distance between every two individuals, concatenate the individual positions and distances, map them into high-dimensional features through a fully connected layer, and concatenate the individual appearance and motion features extracted in step (2) with the location distance features to form the individual features;

[0020] (4) Graph node generation: The global appearance features generated in step (2) are transformed into scene features through adaptive average pooling, and then combined with the individual features generated in step (3) to form behavioral feature nodes;

[0021] (5) Graph construction and reasoning: Based on the graph nodes generated in step (4), the edge features of the behavioral relationship graph are constructed by calculating the similarity between nodes, and the graph is reasoned through a graph convolutional network to fully explore the paired interactions between individuals and scenes;

[0022] (6) Individual behavior classification: The features generated in step (3) are added to the features inferred in step (5) to obtain individual features, which are then sent to the classifier to identify the individual behavior category;

[0023] (7) Self-attention graph pooling: The graph structure after reasoning in step (5) generates node self-attention scores through graph convolution operation. The nodes are sorted according to the attention scores, and the top k nodes are selected and sent to the maximum pooling network to obtain the final group behavior features.

[0024] (8) Classification of group behavior: The group behavior features generated in step (7) are fed into a classifier to identify the group behavior category;

[0025] (9) Model training: Training of the model constructed in steps (2)-(8) is divided into two stages. The first stage fine-tunes the pre-trained backbone network: after the backbone extracts individual features, behavior classification is performed directly on them, and the resulting model parameters are saved and loaded into the network model of the second stage. The second stage adds scene information and individual location information. Combined with the individual features extracted by the backbone network, a graph convolutional network is constructed to infer the pairwise relationship between the scene and the individuals. Then, through self-attention graph pooling, the group behavior features are obtained and the group behavior is classified.

[0026] The beneficial effects of this invention are:

[0027] (1) Effectively model individual location information and relative location information, fully explore individual contextual information, and thus better identify group behavior.

[0028] (2) Take into account the global scene features, break through the local scope, and grasp the state of behavior from a global perspective.

[0029] (3) Graph convolutional networks are used to infer the interaction between the global scene and individuals, and the global and local features are considered to obtain richer behavioral features.

[0030] (4) A self-attention graph pooling network is used to assign different weights to different individuals to highlight important features.

Attached Figure Description

[0031] Figure 1 is a framework diagram of group behavior recognition based on scene information and location distance information.

Detailed Implementation

[0032] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be noted that the following embodiments are only used to further illustrate the present invention and should not be construed as limiting the scope of protection of the present invention. Those skilled in the art can make some non-essential improvements and adjustments to the present invention based on the above-described invention, and these improvements and adjustments should still fall within the scope of protection of the present invention.

[0033] The group behavior recognition method based on scene information and location distance information specifically includes the following steps:

[0034] (1) Data preprocessing

[0035] The original video frames are used to generate optical flow frames using the TV-L1 method. Video frames and optical flow frames are randomly selected according to the strategy of sampling three frames every ten frames and then input into the network.
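The sampling strategy can be sketched as follows. The text does not spell out how the three frames are chosen, so this sketch assumes one plausible reading: the video is split into consecutive windows of ten frames and three distinct frames are drawn at random from each window. The function name and parameters are hypothetical, and the same indices would be applied to the RGB frames and the optical flow frames.

```python
import random

def sample_frames(num_frames, window=10, per_window=3, seed=None):
    """Pick `per_window` random frame indices from every block of `window` frames."""
    rng = random.Random(seed)
    picked = []
    for start in range(0, num_frames, window):
        candidates = list(range(start, min(start + window, num_frames)))
        # A trailing partial window may hold fewer than `per_window` frames.
        count = min(per_window, len(candidates))
        picked.extend(sorted(rng.sample(candidates, count)))
    return picked
```

Sorting within each window keeps the sampled frames in temporal order, which matters when the frames feed an optical flow stream.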

[0036] (2) Feature extraction

[0037] A dual-stream feature extraction network is employed, using the Inception-v3 network as the backbone to extract global appearance features and global motion features separately. The appearance and motion features are then added together, and the RoIAlign module is used to obtain each individual's appearance and motion features. A fully connected layer then integrates the obtained information to produce the initial individual feature $f_{init}$;
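The data flow of this step can be sketched with toy arrays. Everything below is illustrative: random arrays stand in for the Inception-v3 outputs, a plain integer crop stands in for RoIAlign, and all sizes, crop corners, and variable names are assumptions chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 6, 6      # toy channel count and spatial size (assumed)
N, K, D = 3, 2, 16     # individuals, RoI output size, embedding width (assumed)

appearance_map = rng.normal(size=(C, H, W))   # stand-in for the RGB-stream output
motion_map = rng.normal(size=(C, H, W))       # stand-in for the flow-stream output
fused_map = appearance_map + motion_map       # element-wise sum of the two streams

def crop_roi(feature_map, top, left, size):
    """Integer crop, used here only as a crude stand-in for RoIAlign."""
    return feature_map[:, top:top + size, left:left + size]

corners = [(0, 0), (2, 3), (4, 1)]            # hypothetical (top, left) per individual
rois = np.stack([crop_roi(fused_map, t, l, K) for t, l in corners])  # (N, C, K, K)

# Fully connected layer mapping each flattened RoI to the initial feature f_init.
fc_weight = rng.normal(size=(C * K * K, D)) * 0.1
fc_bias = np.zeros(D)
f_init = rois.reshape(N, -1) @ fc_weight + fc_bias   # (N, D)
```

Summing the two streams before cropping means a single RoI extraction serves both appearance and motion, which is the order of operations the paragraph describes.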

[0038] (3) Location and distance information modeling

[0039] Location and distance information modeling: Based on the individual bounding box information, calculate the center position of each individual and the distance between any two individuals, concatenate the individual positions and distances, and map them into high-dimensional features through a fully connected layer. The initial individual feature $f_{init}$ extracted in step (2) and the location distance feature $f_{pos}$ are then concatenated into the individual feature $f_{actor}$, as shown in the following formula:

[0040] $f_{actor} = [f_{init}; f_{pos}]$ (1)
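A minimal NumPy sketch of this step follows. It assumes boxes in [x1, y1, x2, y2] form, Euclidean distance between centers, and a fixed number of individuals N so that the fully connected layer has input width 2 + N; none of these details are stated in the text, and the function names are hypothetical.

```python
import numpy as np

def location_distance_features(boxes, weight, bias):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; returns f_pos of shape (N, D)."""
    centers = np.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2],
        axis=1,
    )                                                  # (N, 2) box centers
    diff = centers[:, None, :] - centers[None, :, :]   # (N, N, 2) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)               # (N, N) Euclidean distances
    raw = np.concatenate([centers, dist], axis=1)      # (N, 2 + N) position + distances
    return raw @ weight + bias                         # fully connected mapping

def build_actor_features(f_init, f_pos):
    """Formula (1): concatenate initial and location distance features per individual."""
    return np.concatenate([f_init, f_pos], axis=1)
```

Each row of the distance matrix describes one individual's distances to all the others, so the concatenated input carries both where a person is and how the rest of the group is arranged around them.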

[0041] (4) Graph node generation

[0042] The global appearance features generated in step (2) are subjected to adaptive average pooling to generate the scene features $f_{scene}$, which, together with the individual features generated in step (3), form the behavioral feature nodes;
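Node generation can be sketched as below. The sketch assumes the adaptive average pooling has a 1×1 output, which reduces to a per-channel mean, and it introduces a hypothetical projection matrix `proj` so that the scene vector has the same width as the individual features; the text does not say how the widths are matched.

```python
import numpy as np

def build_nodes(global_appearance, f_actor, proj):
    """Stack one scene node on top of N individual nodes.

    global_appearance: (C, H, W) feature map from the appearance stream.
    f_actor:           (N, D) individual features.
    proj:              (C, D) hypothetical projection aligning feature widths.
    """
    scene = global_appearance.mean(axis=(1, 2))   # adaptive avg pool to 1x1 -> (C,)
    f_scene = scene @ proj                        # (D,)
    return np.vstack([f_scene[None, :], f_actor]) # (N + 1, D), row 0 is the scene
```

Placing the scene in row 0 is only a convention of this sketch; any fixed position works as long as later steps know which row is the scene node.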

[0043] (5) Graph Construction and Reasoning

[0044] Based on the graph nodes generated in step (4), edge features of the behavioral relationship graph are constructed by calculating the similarity between nodes. Then, reasoning is performed on the graph by a graph convolutional network to fully explore the pairwise interactions between individuals and the scene, and the features $f_{graph}$ are output;
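Graph construction and one round of reasoning can be sketched as follows. The text only says that edges come from the similarity between nodes; the scaled dot product with a row-wise softmax used here is one common choice and is an assumption, as are the single layer, the ReLU activation, and the function names.

```python
import numpy as np

def similarity_adjacency(nodes):
    """Row-softmax of scaled dot-product similarity between all node pairs."""
    d = nodes.shape[1]
    logits = nodes @ nodes.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)   # each row sums to 1

def gcn_layer(nodes, adjacency, weight):
    """One graph convolution: ReLU(A X W)."""
    return np.maximum(adjacency @ nodes @ weight, 0.0)
```

With a row-normalized adjacency, each output row is a weighted average of all node features, so every individual receives information from the scene node and from the other individuals in proportion to their similarity.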

[0045] (6) Classification of individual behaviors

[0046] The features generated in step (3) are added to the features inferred in step (5) to obtain individual features, which are then fed into the classifier to identify the individual behavior category, as shown in the following formula:

[0047] $a_1, a_2, \ldots, a_i = f_{classifier}([f_{scene}; f_{actor}] + f_{graph})$ (2)

[0048] where $a_i$ is the predicted behavior category of the i-th individual.
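Formula (2) amounts to a residual sum followed by a classifier. The sketch below assumes a single linear layer as the classifier and assumes that row 0 of the node matrix is the scene node, which is skipped when predicting individual actions; both choices and the function name are illustrative only.

```python
import numpy as np

def classify_individuals(nodes, f_graph, weight, bias):
    """nodes, f_graph: (N + 1, D); returns one predicted class index per individual."""
    fused = nodes + f_graph               # residual sum, as in formula (2)
    logits = fused[1:] @ weight + bias    # row 0 is assumed to be the scene node
    return logits.argmax(axis=1)
```

Adding the pre-reasoning features back to the graph output lets the classifier see both the original individual evidence and the context inferred by the graph.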

[0049] (7) Self-attention graph pooling

[0050] Node self-attention scores are generated from the graph structure $f_{graph}$ obtained in step (5) through a graph convolution operation. The nodes are then sorted by their attention scores, and the top k nodes are fed into a max-pooling network to obtain the final group behavior features. The specific method is as follows:

[0051] Suppose that the graph convolutional network outputs N nodes, denoted as $f_{graph}$. First, the self-attention score $S \in \mathbb{R}^{N \times 1}$ of each node is obtained using formula (3):

[0052] $S = \sigma(D^{-1/2} G D^{-1/2} f_{graph} W_{att})$ (3)

[0053] where $\sigma(\cdot)$ is the activation function, $G \in \mathbb{R}^{N \times N}$ is the adjacency matrix, $f_{graph}$ is the output of the last layer of the graph neural network, $D$ is the degree matrix of $G$, and $W_{att} \in \mathbb{R}^{d \times 1}$ is a learnable parameter.

[0054] Then, the $N \cdot k$ feature nodes with the highest scores are selected according to the pooling ratio $k \in (0, 1]$, completing the self-attention graph pooling operation. After that, the features are further reduced in dimensionality by max pooling to obtain the group behavior features $f_{group}$.
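The pooling step can be sketched in NumPy as follows. The sketch assumes the attention score takes the usual symmetric-normalized graph-convolution form, uses tanh as the activation, and rounds the number of kept nodes up to the next integer; these details, the parameter name `ratio` (the k of the text), and the function name are assumptions rather than the patented implementation.

```python
import math
import numpy as np

def sag_pool(f_graph, adjacency, w_att, ratio=0.5):
    """Score nodes by graph convolution, keep the top fraction, then max-pool.

    f_graph:   (N, d) node features after graph reasoning.
    adjacency: (N, N) adjacency matrix G.
    w_att:     (d, 1) learnable attention parameter.
    Returns (f_group of shape (d,), indices of the kept nodes).
    """
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalization D^{-1/2} G D^{-1/2}.
    norm_adj = adjacency * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    scores = np.tanh(norm_adj @ f_graph @ w_att).ravel()   # (N,) attention scores
    keep = max(1, math.ceil(ratio * f_graph.shape[0]))     # number of nodes to keep
    top = np.argsort(-scores)[:keep]                       # highest scores first
    f_group = f_graph[top].max(axis=0)                     # max pooling over kept nodes
    return f_group, top
```

Only the kept nodes contribute to the group feature, so individuals with low attention scores cannot influence the group prediction at all; the ratio controls how aggressive that filtering is.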

[0055] (8) Classification of group behavior

[0056] The group behavior features generated in step (7) are fed into the classifier to identify group behavior;

[0057] $A = f_{activity\text{-}classifier}(f_{actor} + f_{graph})$ (4)

[0058] Here, A represents group behavior.

[0059] (9) Model Training

[0060] Training of the model constructed in steps (2)-(8) is divided into two stages. The first stage fine-tunes the pre-trained backbone network: after the backbone extracts individual features, behavior classification is performed directly on them, and the resulting model parameters are saved and loaded into the network model of the second stage. The second stage adds scene information and individual location information. Combined with the individual features extracted by the backbone network, a graph convolutional network is constructed to infer the pairwise relationship between the scene and the individuals. Then, through self-attention graph pooling, the group behavior features are obtained and the group behavior is classified.

Claims

1. A method for analyzing group behavior based on scene features and location distance information, characterized in that: a. Extract individual behavioral features using a two-stream feature extraction network, a fully connected network, and a graph convolutional network; b. Use the location distance module to model the location and relative distance information of individuals, obtain the topological structure of the crowd in space, and further enrich individual characteristics; c. Use the scene exploration module to extract global scene features, and use scene features and individual features as nodes in the graph convolutional network to deeply explore the pairwise relationship between scenes and individuals; d. Using a self-attention graph pooling network, different levels of attention are given to different individuals, and the graph structure is pooled to obtain high-level group behavior features; This method mainly includes the following steps: (1) Data preprocessing: Optical flow frames are generated from the original video frames, and the consecutive original video frames and optical flow frames are sampled and then input into the network. (2) Feature extraction: A dual-stream feature extraction network was adopted, with the Inception-v3 network as the backbone network to extract global appearance features and global motion features respectively. The appearance features and motion features were added together, and then the RoIAlign module was used to obtain the appearance and motion features of the individual. The information obtained was integrated using a fully connected layer to obtain the initial features of the individual. (3) Location distance information modeling: Based on the individual bounding box information, calculate the center position of the individual and the distance between two individuals, concatenate the individual position and distance, and map them into high-dimensional location distance features through a fully connected layer. 
Concatenate the initial individual features extracted in step (2) with the location distance features to form individual features; (4) Graph node generation: The global appearance features generated in step (2) are transformed into scene features through adaptive average pooling, and then combined with the individual features generated in step (3) to form graph nodes; (5) Graph construction and reasoning: Based on the graph nodes generated in step (4), the edge features of the behavioral relationship graph are constructed by calculating the similarity between nodes, and the graph is reasoned through a graph convolutional network to fully explore the paired interaction information between individuals and scenes, and output the reasoned features. (6) Individual behavior classification: The features generated in step (3) are added to the features inferred in step (5) to obtain individual features, which are then sent to the classifier to identify the individual behavior category; (7) Self-attention graph pooling: The features after inference in step (5) are processed by graph convolution to generate node self-attention scores. The nodes are sorted according to the attention scores, and the top k nodes are selected and sent to the maximum pooling network to obtain the final group behavior features. (8) Classification of group behavior: The group behavior features generated in step (7) are fed into the classifier to identify the group behavior category; (9) Model training: The model training constructed by (2)-(8) is divided into two steps. The first step is to fine-tune the pre-trained model of the backbone network. After extracting individual features from the backbone network, the behavior classification is performed directly. The model parameters are saved and input into the network model in the second step. The second step adds scene information and individual location information. 
Combined with the individual features extracted by the backbone network, a graph convolutional network is constructed to infer the pairwise relationship between the scene and the individual. Then, the group behavior features are obtained and the group behavior classification is performed through self-attention graph pooling.

2. The group behavior analysis method based on scene features and location distance information as described in claim 1, characterized in that... In step (3), the center position of an individual is calculated by the individual bounding box, and the distance between two individuals is calculated. The distance between two individuals is concatenated with the individual bounding box and mapped into a high-dimensional individual position distance feature through a fully connected layer. The individual appearance motion feature extracted in step (2) is concatenated with the position distance feature to form an individual feature.

3. The group behavior analysis method based on scene features and location distance information as described in claim 1, characterized in that... In step (4), the global appearance features extracted in step (2) are used to generate scene features through an adaptive average pooling network. These scene feature nodes are then fed into a graph convolutional network along with the individual features generated in step (3) to perform relational reasoning, thereby inferring the potential relationship between scene and group behavior.

4. The group behavior analysis method based on scene features and location distance information as described in claim 1, characterized in that... In step (5), during the graph construction process, the pairwise similarity between individual features and scene features is calculated and used as the initial edge information of the graph to be fed into the graph convolutional network for inference.

5. The group behavior analysis method based on scene features and location distance information as described in claim 1, characterized in that... In step (7), the features after graph reasoning are fed into the self-attention graph pooling network, attention scores are generated through graph convolution, nodes are sorted according to attention scores, and the top k nodes are selected and fed into the max pooling network to obtain the final group behavior features.