Group behavior recognition method and device based on self-supervised global-local contrastive learning
By employing a self-supervised global-local contrastive learning method, a weighted graph and data augmentation module are constructed. Combined with high-order hypergraph inference and separable key-value memory networks, the problems of high-order interaction and outsider differentiation in sensor-based group behavior recognition are solved, achieving highly accurate and robust group behavior recognition.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: ZHEJIANG UNIV OF TECH
- Filing Date: 2026-04-21
- Publication Date: 2026-06-23
AI Technical Summary
Existing sensor-based group behavior recognition methods rely on a large amount of manually labeled data. The labeling process is complex and costly, and it is difficult to capture high-order interaction information between group members and distinguish group members from outsiders, resulting in poor robustness.
We employ a self-supervised global-local contrastive learning approach, generating high-quality augmented samples by constructing a weighted graph and a data augmentation module. We combine high-order hypergraph inference and a separable key-value memory network to capture high-order interaction information of the population, and optimize the model using global-global and global-local contrastive losses.
While reducing reliance on manually labeled data, it improves the accuracy and robustness of group behavior recognition, maintains a high recognition rate with a small amount of labeled data, and has a strong resistance to interference from external individuals, especially in complex scenarios.
Figure CN122265637A_ABST
Description
Technical Field
[0001] This invention belongs to the technical field of sensor-based group behavior recognition, and specifically relates to a group behavior recognition method and device based on self-supervised global-local contrastive learning.
Background Technology
[0002] Sensor-based group behavior recognition is an important research direction in swarm intelligence analysis, with significant application value in fields such as intelligent monitoring, health surveillance, and sports analysis. Unlike vision-based group behavior recognition methods, sensor-based methods can directly capture fine-grained movements of individuals, are unaffected by factors such as occlusion and lighting, better protect individual privacy, and exhibit higher stability and adaptability in complex environments. However, since group behavior is composed of complex interactions among multiple individuals, learning discriminative group representations from sensor data remains a key challenge.
[0003] Existing sensor-based group behavior recognition methods mostly employ supervised learning frameworks, which require large amounts of labeled data for training. However, labeling group behavior is complex and costly, which limits the application of these methods in real-world scenarios. To reduce reliance on manual labeling, self-supervised learning has been introduced into the field of behavior recognition, using pretext tasks to learn information-rich feature representations from unlabeled data. Although existing research has validated its effectiveness in individual behavior recognition, directly extending self-supervised frameworks such as contrastive learning to the group level still has the following limitations:
[0004] First, many methods rely on fixed or random augmentation (such as random node dropping), which can easily produce insufficiently discriminative samples or even damage key semantic information, resulting in misleading samples. Second, the interactions between group members are not limited to pairwise relationships but include higher-order interactions among three or more individuals, which traditional graph neural networks struggle to capture. Furthermore, existing methods mostly model at a single semantic level, treating the group as a whole for global representation learning. While this approach can capture overall dynamic features, in real-world scenarios groups are often disturbed by external individuals or noise, whose behavior is often inconsistent with the group's internal dynamics. Existing global aggregation strategies cannot distinguish between group members and outsiders, which makes it difficult to stably capture the group's core behavioral features and leads to poor robustness to external disturbances.
Summary of the Invention
[0005] The purpose of this invention is to provide a group behavior recognition method and apparatus based on self-supervised global-local contrastive learning, which can avoid semantic noise caused by augmentation, capture high-order interaction information between members, and simultaneously realize a self-supervised contrastive learning framework for overall group modeling and local individual differentiation. This significantly reduces the dependence on manually labeled data while improving the accuracy of group behavior recognition.
[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0007] In a first aspect, a group behavior recognition method based on self-supervised global-local contrastive learning is provided, including:
[0008] Based on the sensor data and location data corresponding to each individual, a weighted graph oriented towards the group is constructed, where nodes in the weighted graph are individuals and edges are the connections between individuals;
[0009] The data augmentation module introduces perturbations into the structure and nodes of the weighted graph to generate an augmented view of the group;
[0010] The weighted graph and the augmented view are each processed by individual feature extraction and high-order hypergraph inference to obtain the global representation of the group, which is then input into the projection layer to obtain the projected global representation;
[0011] By processing individual sensor data through a cross-group individual mixing strategy and a separable key-value memory network, a local representation of the group is obtained, which is then input into a projection layer for projection to obtain a projected local representation.
[0012] Based on the global and local projection representations, global-global contrastive loss and global-local contrastive loss are calculated for training, updating and optimization until training ends;
[0013] In the downstream group behavior classification task stage, the projection layer is removed and the data augmentation module and cross-group individual mixing strategy are disabled. A global representation is obtained based on the weighted graph, and a local representation is obtained based on the sensor data. The global representation and the local representation are concatenated and then input into a linear classifier to obtain the group behavior recognition result.
[0014] Several alternative methods are provided below, but they are not intended as additional limitations on the overall solution above. They are merely further additions or optimizations. Provided there are no technical or logical contradictions, each alternative method can be combined individually with respect to the overall solution above, or multiple alternative methods can be combined with each other.
[0015] Preferably, the construction of a group-oriented weighted graph based on the sensor data and location data corresponding to each individual includes:
[0016] Average pooling is performed on the sensor data and location data of each individual to obtain the behavior vector and location vector respectively;
[0017] The behavior vector and the scaled position vector are concatenated to construct an individual vector that integrates behavioral and spatial information;
[0018] Based on individual vectors, the similarity between any two individuals is calculated, thereby constructing a weighted adjacency matrix;
[0019] The weighted adjacency matrix is normalized to complete the construction of the weighted graph.
[0020] Preferably, the data augmentation module operates as follows:
[0021] The individual vectors of each pair of connected individuals in the weighted graph are concatenated and input into a multilayer perceptron and an activation function to generate an edge weight perturbation coefficient;
[0022] The edge weight perturbation coefficient is multiplied by the corresponding edge weight of the original weighted adjacency matrix to obtain the structure-augmented adjacency matrix;
[0023] The sensor data of each individual is input into the augmentation network to generate the perturbation strength, which is then multiplied element-wise with standard Gaussian noise to obtain the final perturbation term;
[0024] The perturbation term is added to the original sensor data of the individual to obtain the node-augmented sensor data. The augmented view includes the node-augmented sensor data and the structure-augmented adjacency matrix.
[0025] Preferably, the step of processing the weighted graph and the augmented view through individual feature extraction and high-order hypergraph inference respectively to obtain the global representation of the group includes:
[0026] A feature extractor consisting of a convolutional neural network and a bidirectional long short-term memory network is used to encode the sensor data of each individual to obtain individual features;
[0027] Set up several semantic anchor points, calculate the similarity between each individual feature and all semantic anchor points, and normalize the similarity to obtain the probability distribution of each node's membership to all semantic anchor points;
[0028] Based on the membership probability distribution, the preliminary aggregated representation of the hyperedge is calculated, and the attention mechanism from the node to the hyperedge is introduced to obtain the contribution weight of each node to the hyperedge. The contribution weight is used to perform a weighted summation of individual features to obtain the updated hyperedge representation.
[0029] Based on the membership probability distribution, an attention mechanism from hyperedge to node is introduced to obtain the importance weight of hyperedge to node. The updated hyperedge representation is then aggregated using the importance weight to obtain the updated node representation.
[0030] Perform average pooling on all updated node representations to obtain the global representation of the population.
[0031] As a preferred embodiment, the cross-group individual mixing strategy is as follows:
[0032] Calculate a representative score for each individual to identify core members of the group;
[0033] The individuals with the highest representative scores are retained as core members and, together with individuals drawn from other groups, form a new mixed group.
[0034] Preferably, the separable key-value memory network performs the following steps:
[0035] Initialize the key memory slots and value memory slots to obtain the key memory matrix and value memory matrix;
[0036] Map the features of each individual in the mixed population to a query vector, and calculate the correlation between the query and the key;
[0037] Select the several memory slots most relevant to the query as a subset, and normalize the correlations within the subset to obtain sparse addressing weights;
[0038] Information is read from the value memory matrix in a weighted manner by addressing weights to generate prototype features related to individual identity;
[0039] The individual features are fused with the read prototype features to obtain the individual's identity features, and the probability that the individual belongs to the original group is output;
[0040] Average pooling is performed on the identity features of individuals identified as belonging to the original group in the mixed group to obtain a local representation of the group.
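The sparse addressing step described above can be illustrated with a minimal sketch. It assumes a dot-product correlation between query and key, a softmax over the selected subset, and a subset size `top_k = 2`; none of these choices is stated in the text, and the function name `sparse_read` and the toy memory contents are hypothetical.

```python
import math

def sparse_read(query, keys, values, top_k=2):
    """Read a prototype from value memory using top-k sparse addressing.

    Assumptions (not from the source): dot-product correlation,
    softmax normalization over the selected slots, top_k = 2.
    """
    # Correlation between the query and every key memory slot.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep only the top_k most relevant slots.
    top = sorted(range(len(keys)), key=lambda s: scores[s], reverse=True)[:top_k]
    # Normalize within the subset to get sparse addressing weights.
    peak = max(scores[s] for s in top)
    exps = {s: math.exp(scores[s] - peak) for s in top}
    total = sum(exps.values())
    weights = {s: e / total for s, e in exps.items()}
    # Weighted read from the value memory matrix.
    dim = len(values[0])
    prototype = [sum(weights[s] * values[s][d] for s in top) for d in range(dim)]
    return weights, prototype

# Toy memory with three slots; the third slot is anti-correlated with the query.
keys = [[1.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
values = [[1.0, 1.0], [0.0, 0.0], [9.0, 9.0]]
weights, prototype = sparse_read([1.0, 0.0], keys, values)
print(sorted(weights), prototype)
```

Because only the selected slots receive nonzero weight, the irrelevant third slot contributes nothing to the prototype, which is the point of sparse addressing.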
[0041] Preferably, the calculation of global-global contrastive loss and global-local contrastive loss for training, updating, and optimization includes:
[0042] Based on the projected global representations of the original weighted graph and the augmented view corresponding to the same group, the InfoNCE contrastive loss is calculated as the global-global contrastive loss;
[0043] Based on the projected global representation and the projected local representation of the same group, the InfoNCE contrastive loss is calculated as the global-local contrastive loss;
[0044] The global-global contrastive loss and the global-local contrastive loss are weighted and summed to obtain the overall optimization objective, which is then used for training, updating, and optimization.
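As a rough illustration of the two losses and their weighted sum, the sketch below computes InfoNCE with cosine similarity and in-batch negatives. The temperature of 0.1, the use of other groups in the batch as negatives, and the loss weights `weight_gg` and `weight_gl` are assumptions made for illustration; the text does not specify them.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i],
    and the other rows of `positives` act as negatives (an assumption)."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n

def total_loss(proj_global, proj_global_aug, proj_local, weight_gg=1.0, weight_gl=1.0):
    """Weighted sum of the global-global and global-local contrastive losses."""
    loss_gg = info_nce(proj_global, proj_global_aug)
    loss_gl = info_nce(proj_global, proj_local)
    return weight_gg * loss_gg + weight_gl * loss_gl

# Two toy groups with 2-D projected representations.
proj_global = [[1.0, 0.0], [0.0, 1.0]]
proj_global_aug = [[0.9, 0.1], [0.1, 0.9]]
proj_local = [[0.8, 0.2], [0.2, 0.8]]
aligned = info_nce(proj_global, proj_global_aug)
mismatched = info_nce(proj_global, proj_global_aug[::-1])
print(aligned < mismatched, total_loss(proj_global, proj_global_aug, proj_local))
```

The loss is small when each group's projected global representation is closest to its own augmented or local counterpart, and large when the pairing is swapped.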
[0045] In a second aspect, a group behavior recognition device based on self-supervised global-local contrastive learning is provided, including a processor and a memory storing a number of computer instructions, wherein the computer instructions, when executed by the processor, implement the steps of the group behavior recognition method based on self-supervised global-local contrastive learning.
[0046] This invention provides a method and apparatus for group behavior recognition based on self-supervised global-local contrastive learning. In the group behavior recognition process, two contrastive learning tasks, global-global and global-local, are first constructed through self-supervised learning. In the global-global task, a learnable data augmentation mechanism is designed to generate high-quality augmented samples, and a high-order hypergraph inference module is introduced to capture complex collaborative patterns in the group that go beyond pairwise interactions. In the global-local task, a cross-group individual mixing strategy is used to simulate real-world interference scenarios, and a separable key-value memory network is used to extract local features that characterize the core identity of the group. By jointly optimizing the two tasks, the model learns a comprehensive group representation that captures overall dynamics while focusing on core members. After self-supervised pre-training on unlabeled group behavior data, the method performs downstream classification on a small portion of labeled group behavior data: the global and local representations are concatenated and input into a linear classifier, which outputs the group behavior category. This achieves high accuracy in group behavior recognition while requiring minimal manual annotation.
Attached Figure Description
[0047] Figure 1 is a flowchart of the group behavior recognition method based on self-supervised global-local contrastive learning according to the present invention;
[0048] Figure 2 is a flowchart of data augmentation in an embodiment of the present invention;
[0049] Figure 3 is a flowchart of high-order hypergraph inference in an embodiment of the present invention;
[0050] Figure 4 is a flowchart of the processing of the separable key-value memory network of the present invention.
Detailed Implementation
[0051] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0052] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention.
[0053] This invention provides a group behavior recognition method based on self-supervised global-local contrastive learning. The method comprises two complementary learning tasks, a global-global task and a global-local task, which collaboratively learn group behavior representations at different semantic levels. In the global-global task, a learnable data augmentation module is first constructed to generate challenging positive sample pairs by applying perturbations at the structural and node levels. Subsequently, a high-order hypergraph inference module is introduced to model the interaction relationships among multiple individuals. Through a two-stage attention aggregation mechanism, individuals can participate in multiple semantic hyperedges with different weights, thereby capturing high-order collaborations not limited to pairwise relationships. In the global-local task, a separable key-value memory network is proposed to distinguish original group members from external interfering individuals, extracting highly discriminative local representations from mixed group samples. By jointly optimizing these two contrastive learning objectives, the model learns comprehensive and highly representative group behavior features. After self-supervised pre-training on unlabeled group behavior data, the method performs downstream classification on a small portion of labeled group behavior data: the global and local representations are concatenated and input into a linear classifier, which outputs the group behavior category.
[0054] In one embodiment, as shown in Figure 1, a group behavior recognition method based on self-supervised global-local contrastive learning is presented. This method collaboratively learns the overall and local behavioral representations of a group through two contrastive learning tasks: global-global and global-local. It mainly includes a graph construction module, a data augmentation module, a global representation encoding module, a local representation encoding module, and a joint contrastive learning module.
[0055] Step 1: Segment the sensor data and location data corresponding to each individual into short time series and input them into the graph construction module. Taking each individual in the group as a node, integrate the individual's behavioral information and spatial information to construct a weighted graph. The sensor data includes three-axis accelerometer data and three-axis gyroscope data.
[0056] In this embodiment, the sensor data corresponding to each individual is segmented using a sliding window. For example, a sliding window with a fixed length of 200 and an overlap rate of 50% is used. The sliding window is placed at the beginning of the time series, and then the sliding window moves forward to form another short time series with a length of 200, and so on, to segment a long time series sample into multiple short time series. For a group of 5 people, the input of the model is a group of sensor data of the 5 people within the same sliding window.
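The segmentation described above can be sketched as follows, using the window length of 200 and 50% overlap from the example. The function name `sliding_windows`, the 500-sample recording, and the choice to drop trailing samples that do not fill a whole window are assumptions made for illustration.

```python
def sliding_windows(series, length=200, overlap=0.5):
    """Split a sequence of samples into fixed-length, overlapping windows.

    Trailing samples that do not fill a complete window are dropped
    (an assumption; the text does not say how the tail is handled).
    """
    step = int(length * (1 - overlap))  # 100 samples for length 200, 50% overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    return [series[start:start + length]
            for start in range(0, len(series) - length + 1, step)]

samples = list(range(500))          # hypothetical 500-sample recording of one individual
windows = sliding_windows(samples)
print(len(windows), windows[1][0])  # 4 windows; the second starts at sample 100
```

For a group of 5 people, the same window boundaries would be applied to every member so that one model input contains the 5 aligned windows.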
[0057] In group activities, interactions between individuals depend not only on the similarity of their behavioral patterns but also on their relative spatial positions. In the graph construction module, the following operations are specifically performed:
[0058] Step 101: In this embodiment, a graph $G = (V, E)$ that integrates behavioral and spatial information is constructed to represent the group structure, where the node set $V$ represents the $N$ individuals in the group and the edge set $E$ represents the connections between individuals.
[0059] Step 102: For each individual's (multi-channel) sensor data and location (sequence) data, perform average pooling in the time dimension to obtain behavior vector and location vector respectively.
[0060] For a group of $N$ individuals, within any sliding window the multi-channel sensor data of the $i$-th individual is denoted $X_i$, with $C$ sensor channels and $T$ time steps; its corresponding position sequence is denoted $P_i$, with 2 coordinates per time step representing two-dimensional spatial coordinates. To obtain a compact representation of each individual, an average pooling operation is performed over the time dimension, yielding the behavior vector $b_i \in \mathbb{R}^{C}$ and the position vector $p_i \in \mathbb{R}^{2}$ of the $i$-th individual.
[0061] Step 103: Concatenate the behavior vector and the scaled position vector to construct an individual vector that integrates behavior and spatial information.
[0062] Specifically, the behavior vector $b_i$ and the position vector $p_i$ are concatenated to obtain the individual vector $z_i = [b_i \,\|\, \lambda p_i]$ of the $i$-th individual, which integrates behavioral and spatial information, where $\lambda$ is a scaling factor used to balance the weight of the location information and takes a fixed value in this embodiment.
[0063] Step 104: Calculate the similarity between any two individuals using Gaussian radial basis functions, thereby constructing a weighted adjacency matrix.
[0064] This embodiment calculates the similarity between individuals $i$ and $j$ using the following formula and constructs the weighted adjacency matrix $A$:
[0065] $A_{ij} = \exp\left(-\dfrac{\|z_i - z_j\|_2^2}{2\sigma^2}\right) \cdot \mathbb{1}\left(\|z_i - z_j\|_2 \le \tau\right), \quad i, j \in \{1, \dots, N\}$
[0066] where $A_{ij}$ denotes the edge weight between individuals $i$ and $j$ in the weighted adjacency matrix $A$; $z_i$ and $z_j$ are the individual vectors of the two individuals; $N$ is the number of people in the group; $\|\cdot\|_2$ denotes the Euclidean distance and $\|\cdot\|_2^2$ its square; $\sigma$ is the bandwidth parameter of the Gaussian radial basis function; and $\tau$ is a distance threshold, both of which take fixed values in this embodiment. $\mathbb{1}(\cdot)$ is an indicator function that takes the value 0 when the distance between individuals exceeds $\tau$ and 1 otherwise, which ensures the sparsity of the graph structure and helps focus on local interactions.
[0067] Step 105: Perform row normalization on the weighted adjacency matrix. Specifically, to improve numerical stability, each row of the weighted adjacency matrix $A$ is normalized as $\hat{A}_{ij} = A_{ij} / \sum_{k=1}^{N} A_{ik}$, where $A_{ij}$ denotes the edge weight between individuals $i$ and $j$ in the weighted adjacency matrix. This completes the construction of the weighted graph, in which $X$ denotes the sensor data matrix containing the sensor data of all individuals in the group.
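Steps 102 to 105 can be sketched end to end as below. The values of `lam`, `sigma`, and `tau`, the factor 2 in the Gaussian denominator, and the presence of self-loops are placeholders and assumptions chosen for illustration, not the embodiment's settings; `build_weighted_graph` is a hypothetical name.

```python
import math

def mean_pool(sequence):
    """Average a list of per-time-step vectors over the time dimension."""
    steps = len(sequence)
    return [sum(step[c] for step in sequence) / steps for c in range(len(sequence[0]))]

def build_weighted_graph(sensor, position, lam=0.5, sigma=1.0, tau=2.0):
    """Return the row-normalized weighted adjacency matrix of one group.

    lam, sigma, tau are placeholder values (assumptions).
    """
    # Individual vector: behavior vector concatenated with the scaled position vector.
    z = [mean_pool(s) + [lam * v for v in mean_pool(p)] for s, p in zip(sensor, position)]
    n = len(z)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
            # Indicator: edges beyond the distance threshold are cut to zero.
            if math.sqrt(dist_sq) <= tau:
                adjacency[i][j] = math.exp(-dist_sq / (2 * sigma ** 2))
    # Row normalization; self-loops (weight 1) keep every row sum positive.
    return [[w / sum(row) for w in row] for row in adjacency]

# Three individuals, two time steps, two sensor channels; individual 2 is far away.
sensor = [[[0.0, 0.0]] * 2, [[0.1, 0.0]] * 2, [[5.0, 5.0]] * 2]
position = [[[0.0, 0.0]] * 2, [[1.0, 0.0]] * 2, [[10.0, 10.0]] * 2]
graph = build_weighted_graph(sensor, position)
for row in graph:
    print([round(w, 3) for w in row])
```

In this toy group the distant third individual keeps only its self-loop, while the first two remain connected to each other.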
[0068] Step 2: In the data augmentation module, learnable perturbations are introduced into the structure and nodes of the graph to generate an augmented view of the original population sample.
[0069] The performance of self-supervised contrastive learning largely depends on the diversity and semantic consistency of the augmented samples. Overly random augmentations may disrupt the original structural semantics; overly conservative augmentations may result in overly simplistic generated samples, thus limiting the model's generalization ability. In group behavior recognition, the overall behavioral pattern is often determined by both the group structure (interactions between members) and individual behaviors. This embodiment designs a data augmentation module, as shown in Figure 2, in which perturbations are introduced simultaneously at both the structural and node levels to simulate changes in member interactions and disturbances in individual behavior. This step specifically performs the following operations:
[0070] Step 201: Concatenate the individual vectors of each pair of individuals and input them into the multilayer perceptron and the sigmoid function to generate an edge weight perturbation coefficient.
[0071] This embodiment designs a learnable structure augmenter. For a pair of individuals $(i, j)$, their individual vectors are concatenated as $[z_i \,\|\, z_j]$ and input into a multilayer perceptron, after which the Sigmoid function generates an edge weight perturbation coefficient $m_{ij}$ in the range (0, 1). The calculation formula is as follows:
[0072] $m_{ij} = \mathrm{Sigmoid}\left(\mathrm{MLP}\left([z_i \,\|\, z_j]\right)\right)$
[0073] Step 202: Multiply the edge weight perturbation coefficients by the edge weights of the original weighted adjacency matrix to obtain the structure-augmented adjacency matrix. Specifically, the perturbation coefficient $m_{ij}$ obtained in step 201 is multiplied element-wise by the corresponding edge weight $A_{ij}$ of the original adjacency matrix to obtain the structure-augmented edge weight $\tilde{A}_{ij} = m_{ij} A_{ij}$ between individuals $i$ and $j$, where $m_{ij}$ is the edge weight perturbation coefficient between individuals $i$ and $j$. Collecting these entries yields the structure-augmented adjacency matrix $\tilde{A}$ covering all individuals.
[0074] Step 203: Input the raw sensor data of each individual into the enhancement network to generate the perturbation intensity, and multiply it element-wise with the standard Gaussian noise to obtain the final perturbation term.
[0075] Node-level augmentation enhances the ability to adapt to individual action differences and sensor noise by adding perturbations to the raw sensor data of individuals, simulating changes in individual behavior itself.
[0076] This embodiment uses an augmentation network (also a multilayer perceptron) to generate a perturbation strength from the raw sensor data $X_i$ of each individual. The perturbation strength is multiplied element-wise with standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to obtain the final perturbation term $\delta_i$, scaled by a hyperparameter that controls the base noise intensity and takes a fixed value in this embodiment. Here $I$ denotes the identity matrix, $\mathcal{N}$ denotes the normal distribution, and $\odot$ denotes the Hadamard product.
[0077] Step 204: Add the perturbation term to the original sensor data of the individual to obtain the node-augmented sensor data.
[0078] Specifically, the perturbation term $\delta_i$ obtained in step 203 is added to the raw sensor data $X_i$ to obtain the node-augmented sensor data $\tilde{X}_i = X_i + \delta_i$. Through structure-level and node-level augmentation, each original group sample generates a corresponding augmented view $(\tilde{X}, \tilde{A})$, where $\tilde{X}$ is the node-augmented sensor data matrix containing the node-augmented sensor data of all individuals in the group and $\tilde{A}$ is the structure-augmented adjacency matrix.
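A minimal sketch of steps 201 to 204 follows. To stay dependency-free it replaces both multilayer perceptrons with a single linear layer followed by a sigmoid, uses one scalar perturbation strength per individual, and sets the base noise intensity `alpha` to 0.1; all of these, along with the weight vectors `w_edge` and `w_node`, are stand-ins assumed for illustration rather than the embodiment's networks.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def augment(x, z, adjacency, w_edge, w_node, alpha=0.1, seed=0):
    """Return (node-augmented sensor data, structure-augmented adjacency).

    x: flattened sensor data per individual; z: individual vectors.
    w_edge / w_node: weights of hypothetical one-layer stand-ins for the MLPs.
    """
    rng = random.Random(seed)
    n = len(x)
    # Structure level: a coefficient in (0, 1) rescales each existing edge.
    adjacency_aug = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adjacency[i][j] > 0:
                pair = z[i] + z[j]  # concatenated individual vectors
                coeff = sigmoid(sum(w * v for w, v in zip(w_edge, pair)))
                adjacency_aug[i][j] = coeff * adjacency[i][j]
    # Node level: learned strength times standard Gaussian noise, added to the data.
    x_aug = []
    for xi in x:
        strength = sigmoid(sum(w * v for w, v in zip(w_node, xi)))
        x_aug.append([v + alpha * strength * rng.gauss(0.0, 1.0) for v in xi])
    return x_aug, adjacency_aug

x = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
z = [[0.2, 0.0], [0.1, 0.5]]
adjacency = [[0.6, 0.4], [0.0, 1.0]]
x_aug, adjacency_aug = augment(x, z, adjacency,
                               w_edge=[0.5, -0.3, 0.2, 0.1], w_node=[0.3, 0.3, 0.3])
print(adjacency_aug)
```

Because each coefficient lies strictly between 0 and 1, this form of structure augmentation can only weaken existing edges; it never creates an edge that the original graph lacked.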
[0079] Step 3: Input the group data into the global representation encoding module. Through the individual feature extractor and the higher-order hypergraph inference module, obtain the global representation of the group. Then, map this global representation to the contrastive learning space through the projection layer to obtain the projected global representation.
[0080] The individual feature extractor performs the following operations:
[0081] Step 301: Use a feature extractor composed of a convolutional neural network and a bidirectional long short-term memory network to encode the sensor data of each individual to obtain individual features (i.e., node features).
[0082] In this embodiment, the feature extractor adopts a structure that sequentially connects a convolutional neural network and a bidirectional long short-term memory network. The convolutional branch consists of two one-dimensional convolutional layers and two one-dimensional average pooling layers connected sequentially. The first convolutional layer has 6 input channels, 64 output channels, a kernel size of 3, a stride of 1, and padding of 1. The second convolutional layer has 64 input channels, 32 output channels, a kernel size of 3, a stride of 1, and padding of 1. Each convolutional layer is followed by an average pooling layer with a kernel size of 2, and a ReLU activation function is used for non-linear transformation to obtain the deep features.
[0083] Subsequently, a bidirectional long short-term memory (BiLSTM) network processes the deep features to capture long-term dependencies. The BiLSTM has one layer, an input feature dimension of 50, and 32 hidden units. Its forward and backward outputs are concatenated along the feature dimension, giving an output dimension of 64 at each time step. Average pooling is then performed along the temporal dimension to obtain the temporal representation. Finally, the two representations are concatenated along the feature dimension to obtain the individual feature representation $h_i$, and the features of all individuals form the feature matrix $H$.
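The layer sizes above can be checked with a short shape calculation. It assumes the 200-sample window from the sliding-window example and a pooling stride equal to the pooling kernel size (neither is stated for this module); `conv1d_out` is a hypothetical helper implementing the usual one-dimensional output-length formula.

```python
def conv1d_out(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution or pooling layer."""
    return (length + 2 * padding - kernel) // stride + 1

length = 200                                               # assumed window length
length = conv1d_out(length, kernel=3, stride=1, padding=1)  # conv 1: 6 -> 64 channels
after_conv1 = length
length = conv1d_out(length, kernel=2, stride=2)             # average pooling 1
after_pool1 = length
length = conv1d_out(length, kernel=3, stride=1, padding=1)  # conv 2: 64 -> 32 channels
length = conv1d_out(length, kernel=2, stride=2)             # average pooling 2
deep_feature_shape = (32, length)                           # (channels, length)

hidden_units = 32
bilstm_output_dim = 2 * hidden_units                        # forward + backward outputs

print(after_conv1, after_pool1, deep_feature_shape, bilstm_output_dim)
# 200 100 (32, 50) 64
```

Under these assumptions the deep features have 32 channels of length 50, which matches the stated BiLSTM input feature dimension of 50; one reading consistent with these numbers is that the 32 channels are treated as the sequence axis, though the text does not say so explicitly.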
[0084] In group behavior scenarios, the interactions between individuals are often not limited to pairwise relationships but may also take the form of collaboration among three or more people. Traditional graph neural networks (GNNs) mainly handle pairwise relationships and struggle to describe the high-order interaction patterns that exist in groups. Therefore, this embodiment introduces a high-order hypergraph inference module, as shown in Figure 3, which allows each individual to participate in multiple semantic hyperedges simultaneously with different weights, thereby enabling more fine-grained high-order group modeling.
[0085] The higher-order hypergraph inference module first defines a set of semantic anchors to represent different types of interaction patterns. Then, based on the membership degree of each node to these semantic anchors, the node features are weighted and aggregated to construct a semantic hyperedge that can initially represent the group structure. Furthermore, the higher-order interaction information contained in the semantic hyperedge is used to update the node representation. Specifically, the higher-order hypergraph inference module performs the following operations:
[0086] Step 302: Calculate the similarity between each node feature and all semantic anchors.
[0087] This embodiment sets $K$ semantic anchors, whose representation matrix is $E = [e_1, \dots, e_K]$, where each anchor vector $e_k$ represents an interaction mode. In this embodiment, the semantic anchors are learnable vector parameters that are randomly initialized at the start of training and updated through backpropagation during training. Each anchor vector gradually learns a prototype representation of a different interaction mode in the feature space, with different vector values corresponding to different latent interaction semantics.
[0088] The similarity $s_{ik}$ between node $i$ and anchor $k$ is calculated using the following formula:
[0089]
[0090] where the two balance coefficients are set to 0.6 and 0.4 respectively in this embodiment; $\|\cdot\|$ denotes the magnitude of a vector; $A$ is the adjacency matrix; $d_i$ denotes the degree of node $i$; and $^\top$ denotes the transpose operation. The first term in the formula calculates the similarity between the node's own features and the anchor, while the second term incorporates prior information from its neighboring nodes, thus integrating local structural information.
[0091] Step 303: Normalize the similarity using the Softmax function to obtain the membership probability distribution of each node over all anchors. Specifically, the formula $p_{ik} = \exp(s_{ik}) / \sum_{k'=1}^{K} \exp(s_{ik'})$ is used for normalization to obtain the probability $p_{ik}$ that node $i$ belongs to the $k$-th anchor, where $s_{ik}$ denotes the similarity between node $i$ and anchor $k$ and $K$ is the number of anchors.
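The Softmax normalization of step 303 can be sketched as follows. For brevity the similarity here uses only the cosine similarity between a node feature and an anchor, omitting the neighbor term and the balance coefficients of step 302; that simplification, the toy vectors, and the function name `membership` are assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def membership(features, anchors):
    """Membership probability of each node over all semantic anchors.

    Uses cosine similarity only (simplified stand-in for the full similarity).
    """
    return [softmax([cosine(h, e) for e in anchors]) for h in features]

features = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]   # three toy node features
anchors = [[1.0, 0.0], [0.0, 1.0]]                # two toy semantic anchors
probabilities = membership(features, anchors)
for row in probabilities:
    print([round(p, 3) for p in row])
```

Each row is a soft assignment, so a node such as the third one, which lies between the two anchors, contributes equally to both hyperedges instead of being forced into one.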
[0092] Furthermore, in the node representation update process, this embodiment uses the anchor point as a hyperedge and designs a two-stage attention aggregation mechanism for implementation, as shown in steps 304 and 305.
[0093] Step 304: Update the hyperedge representation through the first stage of the two-stage attention aggregation mechanism, namely, the aggregation of information from nodes to hyperedges.
[0094] This embodiment first uses the membership probability matrix to calculate the preliminary aggregated representation of each hyperedge and the degree of the hyperedge. Next, an attention mechanism from nodes to the hyperedge is introduced, and the contribution weight of each node to the hyperedge is calculated using the following formula:
[0095]
[0096] The terms in this formula are the contribution weight of node $i$ to hyperedge $k$, the node feature of node $i$, an activation function, learnable parameters, and the concatenation operation.
[0097] Finally, the node features are weighted and summed using the contribution weights to obtain the updated representation of the hyperedge:
[0098]
[0099] The terms in this formula include an activation function and a learnable aggregation weight matrix.
[0100] Step 305: Update the node representation through the second stage of the two-stage attention aggregation mechanism, namely, the information aggregation from the hyperedge to the node.
[0101] This embodiment calculates the importance weight of each hyperedge for each node:
[0102]
[0103] where the formula uses the learnable parameters of this stage and the updated representation of each hyperedge.
[0104] Next, the information of all hyperedges is aggregated using the importance weights, and the updated representation of the node is then obtained through layer normalization and a residual connection:
[0105]
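The node update of Step 305 can be sketched as follows, taking the importance weights as given. Two points are assumptions, since the formula is not reproduced in the text: the residual is added after layer normalization, and the layer normalization has no learnable scale or shift. The names `layer_norm` and `update_nodes` are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def update_nodes(features, hyperedges, importance):
    """importance[i][k] is the weight of hyperedge k for node i."""
    dim = len(features[0])
    updated = []
    for h_i, w_i in zip(features, importance):
        # Aggregate hyperedge information with the importance weights.
        agg = [sum(w * e[c] for w, e in zip(w_i, hyperedges)) for c in range(dim)]
        # Layer normalization followed by a residual connection (assumed order).
        normed = layer_norm(agg)
        updated.append([h + n for h, n in zip(h_i, normed)])
    return updated
```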
[0106] Step 306: Perform average pooling on all updated node representations to obtain the global representation of the population, and input it into the projection layer for projection to obtain the projected global representation.
[0107] This embodiment performs average pooling on all updated node representations to obtain the global representation of the group. A projection layer then projects the global representation into the contrastive learning space, yielding the projected global representation. In this embodiment, the weighted graph and the augmented view are each processed by the global representation encoding module, so that a global representation and a projected global representation are obtained for the weighted graph and, likewise, for the augmented view.
[0108] Step 4: Input the group data into the local representation encoding module. Through a cross-group individual mixing strategy and a separable key-value memory network, obtain the local representation of the group. Then, map this local representation to the contrastive learning space through a projection layer to obtain the projected local representation.
[0109] Unlike global representations, which capture the overall dynamics of a group, local representations focus on the core members that constitute the group's identity. To this end, this embodiment designs a cross-group individual mixing strategy and introduces a separable key-value memory network to identify the source identity of individuals in the mixed group. The output of this network serves as the local representation of the group.
[0110] In real-world scenarios, group behavior is often influenced by external individuals. If a model is trained only on a single group, it may overfit to a specific group structure, resulting in poor generalization in group recognition tasks. Therefore, this embodiment proposes a cross-group individual mixing strategy, which constructs semantically diverse mixed groups by introducing external individuals. The strategy performs the following operations:
[0111] Step 401: Calculate the representative score for each individual to identify core members in the group.
[0112] This embodiment defines the representative score of each individual as the negative distance between its behavior sequence and the group's average behavior sequence. A higher score indicates that the individual's behavior is closer to the group center. The calculation formula is as follows:
[0113]
[0114] where each term denotes the raw sensor data of an individual, identified by its index within the group, at a given time step.
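The representative score of Step 401 can be sketched as follows. The description fixes only that the score is the negative distance between an individual's behavior sequence and the group's average behavior sequence. The use of the Euclidean distance, summed over time steps, is an assumption, as is the function name `representative_scores`.

```python
import math

def representative_scores(group):
    """group[i][t] is the raw sensor vector of individual i at time step t."""
    n, steps, dim = len(group), len(group[0]), len(group[0][0])
    # Group-average behavior sequence: mean over individuals at each time step.
    mean_seq = [[sum(group[i][t][c] for i in range(n)) / n for c in range(dim)]
                for t in range(steps)]
    scores = []
    for seq in group:
        # Assumed distance: Euclidean distance per time step, summed over time.
        dist = sum(math.dist(x_t, m_t) for x_t, m_t in zip(seq, mean_seq))
        scores.append(-dist)  # higher score = closer to the group center
    return scores
```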
[0115] Step 402: Retain the individuals with the highest representative scores as core members, and combine them with individuals drawn from other groups to form a new mixed group.
[0116] Specifically, the highest-scoring individuals are retained as core members, with their number determined by a retention ratio, which ensures that the main characteristics of the original group are maintained after mixing. Subsequently, individuals are randomly selected from another group within the current batch and, together with the core members, form a mixed group.
[0117] In mixed-group samples, the origin (i.e., group affiliation) of different individuals is often difficult to distinguish directly. Traditional memory networks typically use a unified memory matrix for addressing, which can easily lead to confusion between different identity features. This application proposes a separable key-value memory network, as shown in Figure 4, in which the memory structure is decomposed into a key space and a value space, and a competitive sparse addressing mechanism is introduced to enhance identity discrimination and memory update stability.
[0118] For the mixed group, the individual features of each individual are first obtained through the individual feature extractor of step 301. Subsequently, a separable key-value memory network is used to identify individuals, i.e., to determine whether they belong to the original group. Specifically, the following operations are performed in the separable key-value memory network:
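The mixing operation of Step 402 can be sketched as follows. The rounding rule (ceiling), the choice to fill the mixed group back to the original size, and the membership labels returned alongside the mixed group are assumptions for illustration. The names `mix_groups` and `keep_ratio` are hypothetical.

```python
import math
import random

def mix_groups(group, scores, other_group, keep_ratio, rng=None):
    """Keep the top-scoring core members and fill up with outsiders.

    Returns the mixed group and a 0/1 label per member (1 = original group).
    other_group must contain at least as many individuals as are replaced.
    """
    rng = rng or random.Random()
    n = len(group)
    n_keep = max(1, math.ceil(keep_ratio * n))  # assumed rounding rule
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    core = [group[i] for i in order[:n_keep]]
    # Randomly draw the remaining members from another group in the batch.
    outsiders = rng.sample(other_group, n - n_keep)
    mixed = core + outsiders
    labels = [1] * n_keep + [0] * (n - n_keep)
    return mixed, labels
```

Passing a seeded `random.Random` instance makes the sampling reproducible, which is convenient when comparing training runs.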
[0119] Step 403: Initialize the key memory slot and value memory slot.
[0120] This embodiment uses the K-Means algorithm to cluster the individual features and generate a set of cluster centers, which are used to initialize the value memory slots: each centroid serves as the initial vector of the corresponding value memory slot. Each centroid, after a nonlinear mapping, is used as the initial state of the corresponding key memory slot:
[0121]
[0122] where the result is the initial vector of each key memory slot, computed with learnable parameters and an activation function. The final value memory matrix and key memory matrix are thereby obtained.
[0123] Step 404: Map the features of each individual in the mixed population to a query vector, and calculate the correlation between the query and the key.
[0124] Specifically, each individual feature is first mapped to a query vector through a mapping with learnable parameters.
[0125] The matching score is then calculated using a bilinear correlation function with a learnable bilinear weight, giving the raw correlation score between an individual's query and each memory key.
[0126] Step 405: Select the memory slots most relevant to the query and normalize them within this subset to obtain sparse addressing weights.
[0127] In this embodiment, to achieve sparse addressing and reduce interference from irrelevant memory slots, the memory slots most relevant to the query are first selected to form a subset, and Softmax normalization is performed within that subset:
[0128]
[0129] where the result is the final addressing weight between an individual's query and each memory key, computed from the raw correlation scores between that query and the memory keys. For memory slots not in the subset, the weight is set to 0.
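The sparse addressing of Step 405 can be sketched as follows: a Softmax restricted to the highest-scoring memory slots, with zero weight elsewhere. The function name `sparse_addressing` and the parameter name `top_k` are hypothetical, and the size of the subset is left to the caller because the text does not state it.

```python
import math

def sparse_addressing(scores, top_k):
    """Softmax over the top_k highest raw scores; all other slots get weight 0."""
    top = sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)[:top_k]
    peak = max(scores[j] for j in top)  # subtract the peak for numerical stability
    exps = {j: math.exp(scores[j] - peak) for j in top}
    z = sum(exps.values())
    return [exps[j] / z if j in exps else 0.0 for j in range(len(scores))]
```

For example, `sparse_addressing([2.0, -1.0, 2.0, 0.5], 2)` selects slots 0 and 2 and assigns each a weight of 0.5.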
[0130] Step 406: Read information from the value memory matrix in a weighted manner through addressing weights to generate prototype features related to individual identity.
[0131] Specifically, the addressing weights are used to read information from the value memory matrix in a weighted manner, generating the prototype feature related to the individual's identity:
[0132]
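The weighted read of Step 406 amounts to a weighted sum of the rows of the value memory matrix. A minimal sketch, with the hypothetical name `read_memory`:

```python
def read_memory(weights, value_memory):
    """Prototype feature = sum over slots m of weights[m] * value_memory[m]."""
    dim = len(value_memory[0])
    return [sum(w * v[c] for w, v in zip(weights, value_memory))
            for c in range(dim)]
```

Slots outside the selected subset carry zero weight and therefore do not contribute to the prototype feature.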
[0133] Step 407: Fuse the individual features with the read prototype features and output the probability that the individual belongs to the original group.
[0134] This embodiment fuses the individual feature with the prototype feature read from the memory network to obtain the enhanced identity feature:
[0135]
[0136] where the fusion uses learnable parameters. The probability of an individual belonging to the original group is then output using the Sigmoid function, and individuals with probabilities greater than a threshold are identified as belonging to the original group.
[0137] Step 408: Perform average pooling on the identity features identified as belonging to the original group in the mixed group to obtain the local representation of the group, and input it into the projection layer for projection to obtain the projected local representation.
[0138] This embodiment performs average pooling on the identity features of all members of the mixed group that are identified as belonging to the original group, obtaining the local representation of the group. A projection layer then projects this local representation into the contrastive learning space, yielding the projected local representation.
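The membership decision and pooling of Steps 407 and 408 can be sketched as follows. The per-individual `logits` are assumed to come from a scoring head on the fused identity features, which is not shown. The default threshold of 0.5 is also an assumption, since the text only refers to "a threshold".

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_representation(identity_features, logits, threshold=0.5):
    """Average the identity features of members judged to be from the original group."""
    kept = [f for f, z in zip(identity_features, logits) if sigmoid(z) > threshold]
    if not kept:
        return None  # no member was identified as belonging to the original group
    dim = len(kept[0])
    return [sum(f[c] for f in kept) / len(kept) for c in range(dim)]
```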
[0139] Step 5: Input the projected global representation and projected local representation into the joint contrastive learning module, and optimize the model using global-global contrastive loss and global-local contrastive loss. This step, through joint optimization objectives, enables the model to learn a comprehensive group representation. Specifically, the following operations are performed:
[0140] Step 501: Calculate the global-global contrast loss to maximize the consistency between the projected global representations of the original view and the enhanced view within the same group.
[0141] This embodiment uses the InfoNCE contrastive loss as the optimization objective. Given the projected global representation of the original group and the projected global representation of its enhanced view, the loss function is defined as follows:
[0142]
[0143] where the similarity function is the cosine similarity, the batch size is the number of groups in the batch, and the two representations of a group are its projected global representations obtained from the weighted graph and from the enhanced view, respectively.
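The InfoNCE objective of Step 501 can be sketched as follows. The temperature parameter and the choice of negatives (the enhanced-view representations of the other groups in the batch) are assumptions, since the loss formula is not reproduced in the text. The function name `info_nce` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(z_a, z_b, temperature=0.5):
    """z_a[i] and z_b[i] are two projected representations of group i.

    For each group, the matching pair is the positive; the other groups'
    entries in z_b act as negatives. Returns the mean loss over the batch.
    """
    losses = []
    for i, anchor in enumerate(z_a):
        logits = [cosine(anchor, other) / temperature for other in z_b]
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive
    return sum(losses) / len(losses)
```

Under the same assumptions, passing projected global representations as `z_a` and projected local representations as `z_b` would give a loss of the same form for the global-local term.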
[0144] Step 502: Calculate the global-local contrast loss to maximize the consistency between the projected global representation and its projected local representation of the same group.
[0145] The global representation captures the overall dynamics and high-order interactions of the group, while the local representation focuses on the individual characteristics that constitute the group's identity. This embodiment maximizes the similarity between the projected global representation and the projected local representation, enabling the model to learn a more comprehensive group representation. The loss function is defined as:
[0146]
[0147] where the additional term denotes the projected local representation of a group.
[0148] Step 503: The global-global contrast loss and the global-local contrast loss are weighted and summed to obtain the total loss as the overall optimization objective.
[0149] The overall optimization objective of the model in this embodiment is the weighted sum of the two loss functions above, where the two weights are hyperparameters that balance the importance of the two loss terms. The losses of all groups in the batch are aggregated, and backpropagation is then performed to update the model parameters.
[0150] Step 6: Input the group representation into the classifier to obtain the group behavior recognition result.
[0151] After completing the pre-training of the global-local contrastive learning, this embodiment performs a downstream classification task on labeled group behavior data to verify the effectiveness of the learned group representation. To maintain the stability of the representation, the projection layer in the pre-trained model is removed in this stage, and the data augmentation module and cross-group individual mixing strategy are disabled. Subsequently, the entire network is fine-tuned using only 10% of the labeled samples in the training set.
[0152] Specifically, the raw sensor data and location data of each group sample are first input into the graph construction module, the global representation encoding module (composed of an individual feature extractor and a higher-order hypergraph inference module), and the local representation encoding module (composed of an individual feature extractor and a separable key-value memory network) to obtain the global representation and the local representation. The global representation and the local representation are then concatenated to obtain the final group representation, which is input into a linear classifier whose final layer outputs the group behavior category.
[0153] Group behavior recognition experiments were conducted on the UT-Data-gar dataset to verify the effectiveness of the proposed method. The group size was set to 5 people, and two scenarios were considered. In scenario one, all 5 individuals belonged to the same group. In scenario two, one or two randomly selected individuals did not belong to the group, simulating non-critical individual behaviors within the group in order to evaluate the method's ability to suppress interference from such behaviors.
[0154] The overall experimental procedure consists of two steps: a self-supervised pre-training phase and a downstream group behavior classification phase. First, in the pre-training phase, the model is optimized using unlabeled data. Training employs the Adam optimizer with a learning rate of 0.001, a batch size of 32, and a total of 30 epochs.
[0155] After pre-training, a downstream classification task is performed on labeled group behavior data to validate the effectiveness of the learned group representation. In this stage, the loss function for group behavior classification uses cross-entropy loss, the optimizer is Adam, the learning rate is 0.001, training is performed for 20 epochs, and 10% of the labeled samples are randomly selected from the training set to fine-tune the entire network. To maintain the stability of the representation, the projection layer of the pre-trained model is removed, data augmentation modules and cross-group individual mixing strategies are disabled, and a linear classifier is connected.
[0156] The experiment employed five-fold cross-validation, using the mean of accuracy, precision, recall, and F1 score after 10 repetitions to measure recognition performance. Comparative methods included TJAMSD (Two-Domain Joint Attention Mechanism based on Sensor Data for Group Activity Recognition), which introduces a two-domain joint attention mechanism, and RGSA (Sensor-based Group Activity Recognition Model with Relation Gating and Spatio-temporal Attention), which further incorporates relation gating and spatio-temporal attention mechanisms. Table 1 shows the experimental results on the UT-Data-gar dataset.
[0157] Table 1
[0158]
[0159] The comparison results show that, in scenario one, the method of this invention achieves a precision of 74.62%, a recall of 73.57%, an F1 score of 74.11%, and an accuracy of 72.23%. Compared to TJAMSD, a group behavior recognition method also based on sensor data, the precision, recall, F1 score, and accuracy of this invention are improved by 6.70%, 7.16%, 6.96%, and 6.37%, respectively. Compared to the RGSA method, the precision of this invention is improved by 4.92%, the recall by 4.74%, the F1 score by 4.87%, and the accuracy by 5.17%.
[0160] In scenario two, the precision of this invention is 59.87%, recall is 58.13%, F1 score is 58.94%, and accuracy is 59.52%. Compared to the TJAMSD method, the precision, recall, F1 score, and accuracy of this invention are significantly improved by 10.04%, 9.71%, 9.87%, and 10.91%, respectively; compared to the RGSA method, the precision of this invention is improved by 9.28%, the recall by 7.09%, the F1 score by 8.11%, and the accuracy by 8.24%.
[0161] In summary, the method of this invention significantly outperforms other methods in both standard group scenarios and complex scenarios involving individual interference. This demonstrates that the method of this invention can effectively learn discriminative group representations while significantly reducing reliance on manually labeled data, thereby reducing interference from non-critical behaviors and improving the overall performance of group behavior recognition.
[0162] In another embodiment, the present invention also provides a group behavior recognition device based on self-supervised global-local contrastive learning, including a processor and a memory storing a plurality of computer instructions, wherein the computer instructions, when executed by the processor, implement the steps of the group behavior recognition method based on self-supervised global-local contrastive learning.
[0163] For specific limitations regarding the group behavior recognition device based on self-supervised global-local contrastive learning, please refer to the limitations of the group behavior recognition method based on self-supervised global-local contrastive learning mentioned above, which will not be repeated here.
[0164] The memory and processor are electrically connected directly or indirectly to enable data transmission or interaction. For example, these components can be electrically connected to each other via one or more communication buses or signal lines. The memory stores a computer program that can run on the processor, which implements the method of the present invention by running the computer program stored in the memory.
[0165] The memory may be, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc. The memory stores the program, and the processor executes the program upon receiving an execution instruction.
[0166] The processor may be an integrated circuit chip with data processing capabilities. The aforementioned processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this invention. The general-purpose processor can be a microprocessor or any conventional processor.
[0167] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0168] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the appended claims.
Claims
1. A group behavior recognition method based on self-supervised global-local contrastive learning, characterized in that it comprises: constructing a group-oriented weighted graph based on the sensor data and location data corresponding to each individual, wherein the nodes in the weighted graph are individuals and the edges are the connections between individuals; introducing, by a data augmentation module, perturbations into the structure and nodes of the weighted graph to generate an augmented view of the group; processing the weighted graph and the augmented view separately through individual feature extraction and high-order hypergraph inference to obtain global representations of the group, and inputting the global representations into a projection layer to obtain projected global representations; processing the sensor data of the individuals through a cross-group individual mixing strategy and a separable key-value memory network to obtain a local representation of the group, and inputting the local representation into a projection layer to obtain a projected local representation; calculating a global-global contrastive loss and a global-local contrastive loss based on the projected global representations and the projected local representation, and training, updating and optimizing until training ends; and, in the downstream group behavior classification stage, removing the projection layer, disabling the data augmentation module and the cross-group individual mixing strategy, obtaining a global representation based on the weighted graph and a local representation based on the sensor data, concatenating the global representation and the local representation, and inputting the result into a linear classifier to obtain the group behavior recognition result.
2. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 1, characterized in that constructing the group-oriented weighted graph based on the sensor data and location data corresponding to each individual includes: performing average pooling on the sensor data and location data of each individual to obtain a behavior vector and a location vector, respectively; concatenating the behavior vector and the scaled location vector to construct an individual vector that integrates behavioral and spatial information; calculating the similarity between any two individuals based on the individual vectors, thereby constructing a weighted adjacency matrix; and normalizing the weighted adjacency matrix to complete the construction of the weighted graph.
3. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 1, characterized in that the data augmentation module operates as follows: the individual vectors of each pair of connected individuals in the weighted graph are concatenated and input into a multilayer perceptron and an activation function to generate an edge weight perturbation coefficient; the edge weight perturbation coefficient is multiplied by the corresponding edge weight of the original weighted adjacency matrix to obtain a structure-enhanced adjacency matrix; the sensor data of each individual is input into an augmentation network to generate a perturbation strength, the perturbation strength is multiplied element-wise with standard Gaussian noise to obtain the final perturbation term, and the perturbation term is added to the original sensor data of the individual to obtain node-enhanced sensor data; and the augmented view includes the node-enhanced sensor data and the structure-enhanced adjacency matrix.
4. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 1, characterized in that processing the weighted graph and the augmented view through individual feature extraction and high-order hypergraph inference to obtain the global representation of the group includes: encoding the sensor data of each individual with a feature extractor consisting of a convolutional neural network and a bidirectional long short-term memory network to obtain individual features; setting several semantic anchor points, calculating the similarity between each individual feature and all semantic anchor points, and normalizing the similarity to obtain the membership probability distribution of each node over all semantic anchor points; calculating, based on the membership probability distribution, the preliminary aggregated representation of each hyperedge, introducing an attention mechanism from nodes to hyperedges to obtain the contribution weight of each node to the hyperedge, and performing a weighted summation of the individual features with the contribution weights to obtain the updated hyperedge representation; introducing, based on the membership probability distribution, an attention mechanism from hyperedges to nodes to obtain the importance weight of each hyperedge for each node, and aggregating the updated hyperedge representations with the importance weights to obtain the updated node representations; and performing average pooling on all updated node representations to obtain the global representation of the group.
5. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 1, characterized in that the cross-group individual mixing strategy is as follows: a representative score is calculated for each individual to identify the core members of the group; and the individuals with the highest representative scores are retained as core members and, together with individuals drawn from other groups, form a new mixed group.
6. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 5, characterized in that the separable key-value memory network performs the following steps: initializing the key memory slots and value memory slots to obtain the key memory matrix and the value memory matrix; mapping the features of each individual in the mixed group to a query vector, and calculating the correlation between the query and the keys; selecting the several memory slots most relevant to the query as a subset, and normalizing within the subset to obtain sparse addressing weights; reading information from the value memory matrix in a weighted manner using the addressing weights to generate prototype features related to individual identity; fusing the individual features with the read prototype features to obtain the individual's identity features, and outputting the probability that the individual belongs to the original group; and performing average pooling on the identity features of the individuals in the mixed group identified as belonging to the original group to obtain the local representation of the group.
7. The group behavior recognition method based on self-supervised global-local contrastive learning according to claim 1, characterized in that calculating the global-global contrastive loss and the global-local contrastive loss for training, updating and optimization includes: calculating the InfoNCE contrastive loss based on the projected global representations of the original weighted graph and of the augmented view corresponding to the same group, as the global-global contrastive loss; calculating the InfoNCE contrastive loss based on the projected global representation and the projected local representation of the same group, as the global-local contrastive loss; and performing a weighted summation of the global-global contrastive loss and the global-local contrastive loss to obtain the overall optimization objective, which is then used for training, updating and optimization.
8. A group behavior recognition device based on self-supervised global-local contrastive learning, comprising a processor and a memory storing a plurality of computer instructions, characterized in that the computer instructions, when executed by the processor, implement the steps of the method according to any one of claims 1 to 7.