Method and device for recognizing group behavior based on spatiotemporal individual interaction relationship

By integrating individual interaction features from the temporal and spatial domains in group behavior recognition, and utilizing GLM-Net, TSN, GCN, and Transformer models, the problem of low accuracy in group behavior recognition in existing technologies is solved, and accurate recognition of time-continuous group events is achieved.

CN115761601B · Active · Publication Date: 2026-06-23 · CHINA TELECOM DIGITAL INTELLIGENCE TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM DIGITAL INTELLIGENCE TECH CO LTD
Filing Date
2022-12-29
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing group behavior recognition algorithms mainly identify events by mining the interaction relationships between individuals in the spatial domain. Their accuracy needs improvement, and they struggle to determine the category of group events that are continuous in time.

Method used

By extracting features from the motion information of multiple individuals in each target video frame, fusing the individual interaction relationship features in the temporal and spatial domains, removing camera motion information using the GLM-Net method, employing the TSN network model for sparse sampling, and combining the GCN and Transformer models for graph convolution and self-attention mechanisms, the transmission and fusion of inter-frame individual interaction relationships are achieved.

Benefits of technology

It improves the accuracy of group behavior recognition and enables the deep semantic feature extraction of group behavior and the correct judgment of time-continuous events.



Abstract

Embodiments of the present application disclose a group behavior recognition method and device based on spatio-temporal individual interaction relationship. The method comprises: extracting a plurality of target video frames from a target video; performing feature extraction on motion information of a plurality of individuals in each target video frame to obtain motion features; processing initial individual features of the plurality of individuals in a current target video frame to obtain target individual features; performing fusion processing on the target individual features of the plurality of individuals in the current frame and motion features of the plurality of individuals in a next frame to obtain initial individual features of the plurality of individuals in the next frame; repeating the above process; the target individual features of each individual in the kth target video frame are used to indicate motion features of each individual in the 1st to kth frames and interaction relationship features between each individual and other individuals; and group behavior is recognized according to the target individual features of the plurality of individuals of the plurality of target video frames. Based on the method, the accuracy of group behavior recognition can be improved.

Description

Technical Field

[0001] The embodiments of the present invention relate to the field of computer technology, and in particular to a method and apparatus for identifying group behavior based on spatiotemporal individual interaction relationships.

Background Technology

[0002] Group behavior recognition is a crucial and widely applied research problem in the field of computer vision. Currently, intelligent analysis techniques based on group behavior recognition are widely used in sports video analysis. As a high-level semantic analysis technique in computer vision, group behavior recognition can analyze motion in sports videos to achieve intelligent annotation and automated highlight generation. However, current group behavior recognition algorithms primarily identify events by mining individual interaction relationships in the spatial domain, and the accuracy of group behavior recognition needs improvement.

Summary of the Invention

[0003] One object of the embodiments of the present invention is to solve at least the above-mentioned problems and / or defects, and to provide at least the advantages described below.

[0004] This invention provides a method and apparatus for identifying group behavior based on spatiotemporal individual interaction relationships, which can achieve the fusion of individual interaction relationship features in the time and space domains, thereby improving the accuracy of group behavior identification.

[0005] Firstly, a method for identifying group behavior based on spatiotemporal individual interaction relationships is provided, including:

[0006] Multiple target video frames are extracted from the target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are ordered in chronological order;

[0007] The motion information of multiple individuals in each target video frame is obtained. The motion information of each individual in each target video frame includes the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video.

[0008] Motion information of multiple individuals in each target video frame is used to extract features, thus obtaining the motion features of each individual in each target video frame;

[0009] The initial individual features of multiple individuals in the current target video frame are processed to obtain the target individual features of each individual in the current target video frame; the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame are fused to obtain the initial individual features of multiple individuals in the next target video frame; the next target video frame is used as the current target video frame, and the above process is repeated until the target individual features of each individual in the last target video frame are obtained; wherein, the target individual features of each individual in the k-th target video frame are used to indicate the motion features of each individual in the 1st to kth target video frames and the interaction relationship features between each individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame;

[0010] Based on the target individual characteristics of multiple individuals in the multiple target video frames, the behavior of the group composed of the multiple individuals is identified.

[0011] Optionally, before extracting multiple target video frames from the target video, the method includes:

[0012] Extract optical flow information of multiple individuals between every two adjacent video frames from the target video;

[0013] The GLM-Net method is used to process the optical flow information of multiple individuals between every two adjacent video frames to obtain the motion information of multiple individuals in each video frame of the target video. The motion information of each individual in each video frame includes the relative motion information of each individual in each video frame relative to each individual in the previous video frame in the target video.

[0014] Optionally, extracting multiple target video frames from the target video includes:

[0015] A sparse sampling strategy based on the TSN network model is used to extract multiple target video frames from the target video.

[0016] Optionally, the step of processing the initial individual features of multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame includes:

[0017] Based on the initial individual characteristics of each pair of individuals in the current target video frame, determine the interaction relationship weight between each pair of individuals in the current target video frame;

[0018] An initial graph of the current target video frame is established by taking each individual in the current target video frame as each node, taking the initial individual features of each individual as the node features of each node, and taking the interaction weight between each individual and other individuals as the weight of the edge between each individual and other individuals.

[0019] The initial graph is input into the GCN model for processing to obtain the target individual features of each individual in the current target video frame.

[0020] Optionally, determining the interaction relationship weight between every two individuals in the current target video frame based on the initial individual features of every two individuals in the current target video frame includes:

[0021] If the distance between any two individuals in the current target video frame meets the preset distance filtering conditions, it is determined that there is an interaction relationship between the two individuals. Based on the initial individual characteristics between the two individuals, the interaction relationship weight between the two individuals is determined.

[0022] If the distance between any two individuals in the current target video frame does not meet the preset distance filtering conditions, it is determined that there is no interaction relationship between the two individuals, and the interaction relationship weight between the two individuals is set to 0.

[0023] Optionally, determining the interaction relationship weight between two individuals based on their initial individual characteristics includes:

[0024] The interaction weights between two individuals are determined based on the distance between their initial individual features.
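The distance filtering and weighting described in [0020] to [0024] can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the filtering condition or the weighting formula, so the Euclidean threshold `max_center_dist` and the mapping `exp(-distance)` from feature distance to weight are hypothetical choices, as is the function name.

```python
import math

def interaction_weights(centers, features, max_center_dist):
    """Return an N x N matrix of interaction weights between individuals.

    centers:  (x, y) position of each individual in the frame.
    features: initial individual feature vector of each individual.
    max_center_dist: hypothetical distance filtering threshold.
    """
    n = len(centers)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Distance filter: individuals too far apart do not interact,
            # so their weight stays 0.
            if math.dist(centers[i], centers[j]) > max_center_dist:
                continue
            # Assumed mapping: closer features give a larger weight in (0, 1].
            weights[i][j] = math.exp(-math.dist(features[i], features[j]))
    return weights

# Three individuals: the first two are close together, the third is far away.
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w = interaction_weights(centers, features, max_center_dist=2.0)
```

With these inputs, only the pair formed by the first two individuals receives a nonzero weight; every pair involving the third individual is filtered out.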

[0025] Optionally, the step of fusing the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame to obtain the initial individual features of multiple individuals in the next target video frame includes:

[0026] The target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame are fused using the Transformer model to obtain the initial individual features of multiple individuals in the next target video frame.
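One way to picture the fusion in [0026] is a single cross-attention step, in which each individual's motion feature in the next frame attends to the target individual features of the current frame. The sketch below is an assumption, not the patented model: the text names a Transformer without giving its configuration, and learned projections, multiple heads, normalization, and feed-forward layers are all omitted. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(target_feats, next_motion_feats):
    """Fuse current-frame target features into next-frame motion features.

    Queries are the next-frame motion features; keys and values are the
    current-frame target individual features. The attended context is
    added to the query as a residual connection.
    """
    d = len(target_feats[0])
    fused = []
    for q in next_motion_feats:
        # Scaled dot-product attention scores against every current-frame individual.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in target_feats]
        attn = softmax(scores)
        context = [sum(w * v[c] for w, v in zip(attn, target_feats)) for c in range(d)]
        fused.append([qc + cc for qc, cc in zip(q, context)])
    return fused

fused = fuse_features([[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0]])
```

The result has one row per individual in the next frame and serves as that frame's initial individual features.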

[0027] Secondly, a group behavior recognition device based on spatiotemporal individual interaction relationships is provided, including:

[0028] The target video frame extraction module is used to extract multiple target video frames from the target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are ordered in chronological order.

[0029] The motion information acquisition module is used to acquire motion information of multiple individuals in each target video frame. The motion information of each individual in each target video frame includes the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video.

[0030] The motion feature extraction module is used to extract motion features from the motion information of multiple individuals in each target video frame, so as to obtain the motion features of each individual in each target video frame;

[0031] The target individual feature determination module is used to process the initial individual features of multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame; to fuse the target individual features of multiple individuals in the current target video frame with the motion features of multiple individuals in the next target video frame to obtain the initial individual features of multiple individuals in the next target video frame; and to repeat the above process using the next target video frame as the current target video frame until the target individual features of each individual in the last target video frame are obtained. The target individual features of each individual in the k-th target video frame are used to indicate the motion features of each individual in the 1st to kth target video frames and the interaction relationship features between each individual and other individuals. The initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame.

[0032] The group behavior recognition module is used to identify the behavior of a group composed of multiple individuals based on the target individual characteristics of multiple individuals in the multiple target video frames.

[0033] Thirdly, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the at least one processor to perform the method.

[0034] Fourthly, a storage medium is provided on which a computer program is stored, characterized in that, when the program is executed by a processor, it implements the method described above.

[0035] The embodiments of the present invention include at least the following beneficial effects:

[0036] The present invention provides a method, apparatus, electronic device, and storage medium for group behavior recognition based on spatiotemporal individual interaction relationships. The method first extracts multiple target video frames from the target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are ordered in chronological order. Then, the motion information of the multiple individuals in each target video frame is obtained. The motion information of each individual in each target video frame includes the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video. Then, feature extraction is performed on the motion information of the multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame. Next, the initial individual features of the multiple individuals in the current target video frame are processed to obtain the target individual features of each individual in the current target video frame. The target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame are fused to obtain the initial individual features of multiple individuals in the next target video frame. The next target video frame is used as the current target video frame, and the above process is repeated until the target individual features of each individual in the last target video frame are obtained. Here, the target individual features of each individual in the k-th target video frame indicate the motion features of each individual in the 1st to k-th target video frames and the interaction relationship features between each individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame. Finally, the behavior of the group composed of the multiple individuals is identified based on the target individual features of the multiple individuals in the multiple target video frames. Based on this method and apparatus, the initial individual features of multiple individuals in each target video frame are processed so that the target individual features of each individual are fused with the intra-frame (spatial domain) individual interaction relationship features. Then, by fusing the target individual features of multiple individuals in the previous target video frame with the motion features of multiple individuals in the next target video frame, the intra-frame individual interaction relationship features of the previous target video frame are transferred to the next target video frame. This allows the initial individual features of each individual in the next target video frame to be fused with the inter-frame (temporal domain) individual interaction relationship features. Through this process, the fusion of temporal and spatial domain individual interaction relationship features is achieved, enabling the extraction of deep semantic features of the contextual relationships between multiple target video frames and improving the accuracy of group behavior recognition.

[0037] Other advantages, objectives, and features of the embodiments of the present invention will be apparent in part from the following description, and in part will be understood by those skilled in the art through study and practice of the embodiments of the present invention.

Attached Figure Description

[0038] Figure 1 is a flowchart of a group behavior recognition method based on spatiotemporal individual interaction relationships provided in an embodiment of the present invention.

[0039] Figure 2 is a flowchart illustrating the process of determining the target individual features of each individual in multiple target video frames, as provided in one embodiment of the present invention.

[0040] Figure 3 is a schematic diagram of the structure of a group behavior recognition device based on spatiotemporal individual interaction relationships provided in an embodiment of the present invention.

[0041] Figure 4 is a schematic diagram of the structure of an electronic device provided in one embodiment of the present invention.

Detailed Implementation

[0042] The embodiments of the present invention will now be described in further detail with reference to the accompanying drawings, so that those skilled in the art can implement them based on the description.

[0043] Group behavior recognition is used to identify group events in a given scenario, which are formed by the interactions and influences between individuals within that scenario. More specifically, changes in the motion state of individuals within a scenario are caused by interactions between them. Therefore, existing group behavior recognition methods mostly focus on extracting the interaction relationship features between individuals at the same moment (i.e., spatial domain individual interaction relationship features). However, group events occur continuously in the temporal dimension, and it is difficult to accurately and effectively determine the category of time-continuous group events using only a single spatial domain individual interaction relationship feature. Based on this, this invention processes the initial individual features of multiple individuals in each target video frame, fusing the target individual features of each individual in each target video frame with spatial domain individual interaction relationship features. Then, by fusing the target individual features of multiple individuals in the previous target video frame with the motion features of multiple individuals in the next target video frame, the intra-frame individual interaction relationship features of the previous target video frame are transferred to the next target video frame. This allows the initial individual features of each individual in the next target video frame to be fused with temporal domain individual interaction relationship features, thereby achieving the fusion of temporal and spatial domain individual interaction relationship features and improving the accuracy of group behavior recognition.

[0044] Figure 1 is a flowchart of the group behavior recognition method based on spatiotemporal individual interaction relationships provided in the embodiments of the present invention. The method is executed by a system with processing capabilities, a server device, or a group behavior recognition device based on spatiotemporal individual interaction relationships. As shown in Figure 1, the method includes steps 110 to 150.

[0045] Step 110: Extract multiple target video frames from the target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are ordered in chronological order.

[0046] The target video can be a sports event video, such as a basketball game video or a volleyball game video, or a video containing a group event. It should be understood that each video frame in the target video contains the same multiple individuals; therefore, multiple target video frames extracted from the target video also contain the same multiple individuals, that is, each target video frame contains the same multiple individuals.

[0047] In some embodiments, before performing step 110, the method includes: extracting optical flow information of multiple individuals between every two adjacent video frames from the target video; processing the optical flow information of multiple individuals between every two adjacent video frames using the GLM-Net method to obtain motion information of multiple individuals in each video frame of the target video, wherein the motion information of each individual in each video frame includes the relative motion information of each individual in each video frame relative to each individual in the previous video frame in the target video.

[0048] Here, the optical flow information of an individual between two adjacent video frames can be understood as the trajectory of that individual's motion between the two adjacent video frames, which may include information such as the individual's instantaneous velocity, direction, and position. The motion information of an individual in a video frame includes the relative motion information of that individual in the current video frame with respect to the corresponding individual in the previous video frame, reflecting the change in the individual's motion state between two adjacent video frames, which may include changes in velocity, changes in direction, and displacement.

[0049] During the acquisition of target video, cameras often need to move to follow the movement of individuals. Therefore, the target video contains not only the motion information of individuals but also the motion information of the camera. The camera's motion information can interfere with the modeling of individual motion features. Based on this, this embodiment of the invention uses the GLM-Net method to process the optical flow information of multiple individuals between every two adjacent video frames. The GLM-Net method can remove the camera's motion information and eliminate the interference of redundant motion information, thereby extracting the true motion information of individuals from the video frames and improving the accuracy of group behavior recognition.
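The idea of separating camera motion from individual motion can be illustrated with a deliberately simple stand-in. This is not GLM-Net, whose internals are not described here: the sketch merely assumes that the camera motion between two adjacent frames can be approximated by the median optical flow of background points, and subtracts that vector from each individual's flow. The function name and the input layout are hypothetical.

```python
from statistics import median

def remove_global_motion(individual_flows, background_flows):
    """Subtract an estimated camera motion vector from each individual's flow.

    individual_flows: dict mapping an individual's id to its (dx, dy) flow.
    background_flows: list of (dx, dy) flow vectors sampled off the individuals.
    """
    # Assumed camera motion: per-component median of the background flow.
    cam_dx = median(dx for dx, _ in background_flows)
    cam_dy = median(dy for _, dy in background_flows)
    return {
        name: (dx - cam_dx, dy - cam_dy)
        for name, (dx, dy) in individual_flows.items()
    }

# The camera pans right by about 2 pixels; p1 stands still, p2 actually moves.
background = [(2.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
flows = {"p1": (2.0, 0.0), "p2": (5.0, 1.0)}
true_motion = remove_global_motion(flows, background)
```

After compensation, the stationary individual has zero motion, while the moving individual retains only its own displacement.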

[0050] In some embodiments, extracting multiple target video frames from the target video includes: using a sparse sampling strategy based on a TSN network model to extract multiple target video frames from the target video.

[0051] Group behavior is continuous over time. Motion information from multiple individuals within a single video frame is insufficient to characterize group behavior. Therefore, it is necessary to analyze the motion information of multiple individuals across multiple target video frames to identify group behavior. The number of target video frames can be selected as needed, typically three.

[0052] However, in practical applications, the change in an individual's motion state between two adjacent video frames is relatively small. Using all video frames in the target video as target video frames would result in high computational cost and low efficiency. Therefore, a subset of video frames can be extracted from the target video for analysis. In some examples, a sparse sampling strategy based on the TSN network model is used to extract multiple target video frames. This strategy divides the input target video into K segments (K can be 3) and then randomly selects a short snippet from each segment. This short snippet is the target video frame in this embodiment of the invention. Because one target video frame is drawn from every segment, the extracted target video frames can reflect the changes in an individual's motion state across the whole video.
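The segment-based sampling described above can be sketched as follows, assuming that each sampled snippet is a single frame and that the K segments have (nearly) equal length. The function name `sparse_sample_indices` and its parameters are illustrative, not taken from the source.

```python
import random

def sparse_sample_indices(num_frames, num_segments=3, rng=None):
    """Pick one frame index at random from each of `num_segments` equal spans."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers frame indices [start, end).
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices

# A 90-frame video split into K = 3 segments of 30 frames each.
idx = sparse_sample_indices(90, 3, random.Random(0))
```

Because the segments do not overlap and are visited in order, the returned indices are already in chronological order, which matches the ordering required of the target video frames.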

[0053] Step 120: Obtain motion information of multiple individuals in each target video frame. The motion information of each individual in each target video frame includes the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video.

[0054] It is important to note that two adjacent target video frames are not necessarily adjacent in the target video; there may be multiple video frames between them. Therefore, the motion information of an individual in a target video frame includes the relative motion information of that individual in the current target video frame relative to the corresponding individual in the previous video frame in the target video, rather than the relative motion information relative to the corresponding individual in the previous target video frame. For example, if target video frames A, B, and C are extracted from the target video, and video frames a, b, and c lie, in that order, between target video frames A and B, then the motion information of individual s in target video frame B is the relative motion information of individual s in target video frame B relative to individual s in video frame c.
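The distinction between the previous video frame and the previous target video frame can be made concrete with a small sketch. It assumes, for illustration only, that motion information is reduced to a 2D displacement and that per-frame positions are available; the positions and the function name are made up.

```python
def relative_motion(positions, frame_index, individual):
    """Displacement of `individual` at `frame_index` relative to frame_index - 1."""
    if frame_index < 1:
        raise ValueError("the first video frame has no previous frame")
    x1, y1 = positions[frame_index][individual]
    x0, y0 = positions[frame_index - 1][individual]
    return (x1 - x0, y1 - y0)

# Hypothetical track of individual "s" over video frames A, a, b, c, B.
positions = [
    {"s": (0.0, 0.0)},  # A (target video frame)
    {"s": (1.0, 0.0)},  # a
    {"s": (2.0, 0.0)},  # b
    {"s": (3.0, 1.0)},  # c
    {"s": (5.0, 1.0)},  # B (target video frame)
]

# Motion information of s in B is measured against c, not against A.
motion_B = relative_motion(positions, 4, "s")
```

Measured against frame c the displacement is (2.0, 0.0), whereas measuring against target frame A would give (5.0, 1.0), which is not what the method uses.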

[0055] Step 130: Extract features from the motion information of multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame.

[0056] This step extracts features from the motion information of multiple individuals in each target video frame to obtain deep features of each individual at a higher semantic level. Specifically, the Inception-v3 model can be used to extract motion features from the multiple target video frames, and the RoIAlign algorithm can then be used to map the extracted motion features to the target tracking bounding box of each individual. This embodiment of the invention does not impose specific limitations on the network model used for feature extraction.

[0057] Step 140: Process the initial individual features of multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame; fuse the target individual features of multiple individuals in the current target video frame with the motion features of multiple individuals in the next target video frame to obtain the initial individual features of multiple individuals in the next target video frame; use the next target video frame as the current target video frame and repeat the above process until the target individual features of each individual in the last target video frame are obtained; wherein, the target individual features of each individual in the kth target video frame are used to indicate the motion features of each individual in the 1st to kth target video frames and the interaction relationship features between each individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame.

[0058] In practical applications, group events occur continuously over time. Relying solely on spatial domain individual interaction features is insufficient for accurately and effectively classifying such events. Therefore, this invention processes the initial individual features of multiple individuals in the current target video frame, fusing spatial domain individual interaction features with the target individual features of each individual in the current target video frame. Then, by fusing the target individual features of multiple individuals in the current target video frame with the motion features of multiple individuals in the next target video frame, the intra-frame individual interaction features of the current target video frame are transferred to the next target video frame. This allows the initial individual features of each individual in the next target video frame to be fused with temporal domain individual interaction features, thereby achieving the fusion of temporal and spatial domain individual interaction features and improving the accuracy of group behavior recognition.

[0059] It should be understood that multiple target video frames are ordered in chronological order. Therefore, the next target video frame after the current target video frame is the first target video frame that follows the current target video frame in chronological order.
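The frame-by-frame recurrence of step 140 can be sketched as a short loop. The two callables `spatial_step` and `fuse` are placeholders for the intra-frame processing and the inter-frame fusion; the toy versions below exist only to make the data flow visible and are not the models used in the embodiments.

```python
def propagate(motion_feats_per_frame, spatial_step, fuse):
    """Return the target individual features for every target video frame."""
    # For the 1st target frame, the initial features are the motion features.
    initial = motion_feats_per_frame[0]
    targets = []
    for k in range(len(motion_feats_per_frame)):
        # Intra-frame (spatial domain) processing of the current frame.
        target = spatial_step(initial)
        targets.append(target)
        if k + 1 < len(motion_feats_per_frame):
            # Carry the result forward into the next frame's initial features.
            initial = fuse(target, motion_feats_per_frame[k + 1])
    return targets

def toy_spatial_step(feats):
    """Placeholder: pretend each individual gains 1 from its neighbors."""
    return [f + 1 for f in feats]

def toy_fuse(target, next_motion):
    """Placeholder: elementwise sum of carried-over and new features."""
    return [t + m for t, m in zip(target, next_motion)]

# Two individuals, three target frames, scalar features for readability.
motion = [[1, 2], [10, 20], [100, 200]]
targets = propagate(motion, toy_spatial_step, toy_fuse)
```

The target features of the last frame depend on the motion features of all three frames, which is the property the k-th frame's target individual features are required to have.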

[0060] Figure 2 is a flowchart illustrating the process of determining the target individual features of each individual in multiple target video frames, as provided by an embodiment of the present invention. As shown in Figure 2, step 140 may include steps 1410 to 1430.

[0061] Step 1410: Process the initial individual features of multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame. Specifically, the target individual features of each individual in the k-th target video frame are used to indicate the motion features of each individual in the 1st to kth target video frames and the interaction features between each individual and other individuals. The initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame.

[0062] In step 1410, the initial individual features of multiple individuals in the current target video frame are processed to extract the interaction relationship features between each individual and other individuals in the current target video frame. Simultaneously, the extracted interaction relationship features between each individual and other individuals in the current target video frame are fused with the initial individual features of each individual to obtain the target individual features of each individual in the current target video frame. Thus, the target individual features of each individual in the current target video frame incorporate the interaction relationship features between each individual and other individuals in the current target video frame, as well as the motion features of each individual. In other words, step 1410 can extract intra-frame individual interaction relationship features (i.e., spatial domain interaction relationship features) and further fuse these intra-frame individual interaction relationship features with the initial individual features.

[0063] For the first target video frame, since it has not undergone inter-frame interaction feature transfer, the initial individual feature of each individual is its motion feature. Therefore, the target individual feature of each individual in the first target video frame is the fusion result of the motion feature of each individual in the first target video frame and the interaction feature between each individual and other individuals. The other target video frames after the first target video frame have all undergone inter-frame interaction feature transfer. Therefore, the initial individual feature of each individual in those target video frames is the fusion result obtained after inter-frame interaction feature transfer. In other words, for the target video frames after the first, once the intra-frame interaction features have been extracted in step 1410, the target individual feature of each individual fuses not only the intra-frame interaction feature and intra-frame motion feature of the corresponding individual, but also the interaction features and motion features of the corresponding individual in all previous target video frames.

[0064] In some embodiments, step 1410 further includes: determining the interaction relationship weights between every two individuals in the current target video frame based on the initial individual features of every two individuals in the current target video frame; establishing an initial graph of the current target video frame with each individual in the current target video frame as each node, the initial individual features of each individual as the node features of each node, and the interaction relationship weights between each individual and other individuals as the weights of the edges between each individual and other individuals; and inputting the initial graph into a GCN model for processing to obtain the target individual features of each individual in the current target video frame.

[0065] Specifically, in group behavior, changes in individual motion state are caused by interactions between individuals. Therefore, the importance (or influence) of one individual's initial individual characteristics on the initial individual characteristics of another individual can be determined based on the motion characteristics of two individuals or the initial individual characteristics that incorporate motion characteristics. This allows for the determination of the interaction relationship weight between each pair of individuals in the current target video frame.

[0066] Then, using individuals as nodes, an initial graph $G = (V, E)$ is built for all individuals in the current target video frame, where $V = \{v_1, \dots, v_N\}$ is the set of nodes in the initial graph, $v_n$ is the n-th node, and $E = \{e_{ij}\}$ represents the interaction relationships between individuals. $e_{ij}$ is the importance weight of the initial individual features of the j-th individual to the initial individual features of the i-th individual in the current target video frame (also known as the interaction weight between the i-th and j-th individuals). If $e_{ij} = 0$, the initial individual features of the j-th individual are not important to the initial individual features of the i-th individual; if $e_{ij} \neq 0$, its value represents the importance of the initial individual features of the j-th individual to the initial individual features of the i-th individual. The initial graph reflects the interaction relationships between multiple individuals in the current target video frame, and can therefore be called the interaction relationship graph of multiple individuals in the current target video frame.

[0067] Furthermore, in this embodiment of the invention, GCN is used to perform graph convolution operation on the interaction relationship graph of multiple individuals in the current target video frame, and the initial individual features of each node are updated using weights representing the interaction relationship between individuals, so that the final node features are fused with the individual interaction relationship features within the frame.
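
As a rough illustration of the graph convolution described in [0067], the following minimal sketch updates each node's features with a weighted average of its neighbours' features. The self-loop, the row normalisation, the ReLU, and the names `matmul`, `graph_conv`, and `linear` are illustrative assumptions; the text does not specify the exact form of the GCN layer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(node_feats, edge_weights, linear):
    """One graph-convolution layer: ReLU(A_norm @ H @ W).

    node_feats:   N x c list of initial individual features (H)
    edge_weights: N x N list of non-negative interaction weights
    linear:       c x c_out weight matrix (W), learned in practice
    """
    n = len(node_feats)
    # Assumption: add a self-loop so each node keeps its own features.
    adj = [[edge_weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    # Assumption: row-normalise so each node takes a weighted average.
    adj = [[w / sum(row) for w in row] for row in adj]
    mixed = matmul(matmul(adj, node_feats), linear)
    return [[max(0.0, v) for v in row] for row in mixed]
```

With all interaction weights set to 0, each node simply keeps its own (linearly transformed) features; non-zero weights pull in the features of interacting individuals.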

[0068] In some examples, determining the interaction relationship weight between any two individuals in the current target video frame based on their initial individual characteristics includes: if the distance between any two individuals in the current target video frame meets a preset distance filtering condition, then it is determined that there is an interaction relationship between the two individuals, and the interaction relationship weight between the two individuals is determined based on their initial individual characteristics; if the distance between any two individuals in the current target video frame does not meet the preset distance filtering condition, then it is determined that there is no interaction relationship between the two individuals, and the interaction relationship weight between the two individuals is set to 0.

[0069] Analysis of the target video reveals that the smaller the distance between two individuals in a target video frame, the greater the likelihood of an interaction between them; conversely, the larger the distance, the less likely an interaction exists. Therefore, when calculating the interaction weight between two individuals in the current target video frame, a preset distance filtering condition is used to determine if an interaction exists. If the preset distance filtering condition is met, the interaction weight is calculated based on the initial individual characteristics of the two individuals. If the preset distance filtering condition is not met, the interaction weight is set to 0, and no further calculation is performed. This process achieves a balance between ensuring the accuracy of interaction weight calculation and reducing the overall computational cost of the group recognition method.

[0070] The preset distance filtering condition can be: the distance between two individuals is less than a preset distance threshold. The preset distance threshold can be determined based on actual conditions or experience.

[0071] Furthermore, the distance between two individuals in the current target video frame can be the Euclidean distance between the center coordinates of the target tracking boxes of the two individuals.
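
The distance filtering condition in [0068] to [0071] can be sketched as follows. This is a minimal illustration, assuming boxes are given as `(x1, y1, x2, y2)` tuples and that the threshold uses the same units as the coordinates; the function names are hypothetical.

```python
import math


def box_center(box):
    """Return the (x, y) center of a tracking box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def has_interaction(box_i, box_j, threshold):
    """True if the Euclidean distance between box centers is below the threshold."""
    (xi, yi), (xj, yj) = box_center(box_i), box_center(box_j)
    return math.hypot(xi - xj, yi - yj) < threshold
```

Pairs that fail this check receive an interaction weight of 0 without any further feature computation, which is where the computational saving comes from.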

[0072] Furthermore, determining the interaction relationship weight between two individuals based on their initial individual characteristics includes: determining the interaction relationship weight between two individuals based on the distance between their initial individual characteristics.

[0073] Specifically, in some examples, the interaction weight between the i-th individual and the j-th individual in the current target video frame is calculated as shown in formula (1):

[0074] (1)

[0075] In formula (1), a feature function models the initial individual features of the i-th and j-th individuals in the initial graph to obtain the distance between their initial individual features, which measures the importance of one individual relative to another; c is the feature dimension of an individual node. A distance function gives the Euclidean distance between the center-point coordinates of the target tracking bounding boxes of the i-th individual and the j-th individual in the current target video frame. A distance threshold θ is set: when this Euclidean distance is less than θ, there is an interaction relationship between the i-th individual and the j-th individual; otherwise, there is no interaction between them. In the target video frame, the initial individual characteristics and the spatial interaction relationships of an individual have different semantic attributes. Therefore, the two are modeled separately and then multiplied and fused, which yields the spatial domain interaction relationship model of the individuals in the current target video frame.
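
The "model separately, then multiply" idea in [0075] can be sketched as below. This is not formula (1) itself: the exponential kernel on the feature distance and the scaling by the square root of c are assumptions chosen only to make the example concrete, and all names are hypothetical.

```python
import math


def feature_distance(x_i, x_j):
    """Euclidean distance between two feature vectors, scaled by sqrt(c)."""
    c = len(x_i)  # feature dimension of an individual node
    squared = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.sqrt(squared) / math.sqrt(c)


def interaction_weight(x_i, x_j, center_i, center_j, theta):
    """Spatial mask (0 or 1) multiplied by a feature-based affinity."""
    spatial = math.hypot(center_i[0] - center_j[0], center_i[1] - center_j[1])
    mask = 1.0 if spatial < theta else 0.0
    # Assumption: closer features give a larger weight via exp(-distance).
    affinity = math.exp(-feature_distance(x_i, x_j))
    return mask * affinity
```

Because the spatial term is a 0/1 mask, any pair farther apart than θ gets a weight of exactly 0 regardless of how similar their features are.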

[0076] Step 1420: Perform fusion processing on the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame to obtain the initial individual features of multiple individuals in the next target video frame.

[0077] This step fuses the target individual features of multiple individuals in the current target video frame with the motion features of multiple individuals in the next target video frame, and transfers the intra-frame individual interaction relationship features of the current target video frame to the next target video frame, thereby fusing the initial individual features of each individual in the next target video frame with the inter-frame individual interaction relationship features.

[0078] In some embodiments, step 1420 includes: fusing the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame using a Transformer model to obtain the initial individual features of multiple individuals in the next target video frame.

[0079] Specifically, the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame can be input into the Transformer model for processing to obtain the initial individual features of multiple individuals in the next target video frame.

[0080] Furthermore, a self-attention mechanism can be introduced into the Transformer model. The Transformer's self-attention mechanism allows each input individual to leverage the features of other individuals to enhance its own features, which helps in uncovering inter-frame individual interaction features. In addition, because the interval between two adjacent target video frames is very short, there is redundant information between the individual interaction features in the previous target video frame and the individual interaction features in the next target video frame. This redundancy can interfere with group behavior recognition. The Transformer's self-attention mechanism can shield against the interference of inter-frame redundancy, focusing more on information related to group behavior discrimination, thereby improving efficiency and accuracy.
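
A minimal sketch of attention-based fusion between two adjacent target video frames is shown below. It assumes that the tokens of both frames are concatenated, that the query, key, and value projections are identities, and that the outputs at the next-frame positions are kept as the new initial individual features; none of these choices is stated in the text, and a real Transformer would add learned projections, multiple heads, and feed-forward layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        attn = softmax(scores)
        out.append([sum(w * v[t] for w, v in zip(attn, tokens)) for t in range(d)])
    return out


def fuse_frames(target_feats_cur, motion_feats_next):
    """Attend over both frames' tokens; keep outputs at next-frame positions."""
    tokens = target_feats_cur + motion_feats_next
    attended = self_attention(tokens)
    return attended[len(target_feats_cur):]
```

Each returned vector is a weighted mix of every individual's features from both frames, which is how information from the current frame reaches the next frame's initial individual features.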

[0081] Step 1430: Take the next target video frame as the current target video frame and repeat the above process until the target individual features of each individual in the last target video frame are obtained.

[0082] Taking three target video frames as an example, first, the motion features of multiple individuals in the first target video frame are processed to obtain the target individual features of each individual in the first target video frame. Next, the target individual features of multiple individuals in the first target video frame and the motion features of multiple individuals in the second target video frame are fused to obtain the initial individual features of multiple individuals in the second target video frame. Then, the initial individual features of multiple individuals in the second target video frame are processed to obtain the target individual features of each individual in the second target video frame. Next, the target individual features of multiple individuals in the second target video frame and the motion features of multiple individuals in the third target video frame are fused to obtain the initial individual features of multiple individuals in the third target video frame. Finally, the initial individual features of multiple individuals in the third target video frame are processed to obtain the target individual features of each individual in the third target video frame. This completes the extraction of the target individual features of each individual in the three target video frames.

[0083] In fact, since the transmission of individual interaction relationship features between frames is carried out in a hierarchical manner from the first frame to the next, the target individual features of an individual in any subsequent target video frame will not only be integrated with the interaction relationship features and motion features of the corresponding individual in the previous target video frame, but also with the interaction relationship features and motion features of the corresponding individuals in all previous target video frames.
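
The frame-by-frame recursion of steps 1410 to 1430 can be sketched as a simple loop. The names `propagate`, `spatial_step`, and `temporal_step` are hypothetical placeholders for the intra-frame processing (for example, a GCN) and the inter-frame fusion (for example, a Transformer).

```python
def propagate(motion_feats_per_frame, spatial_step, temporal_step):
    """Return the target individual features for every target video frame.

    motion_feats_per_frame: per-frame motion features, in chronological order
    spatial_step:  maps initial features of a frame to its target features
    temporal_step: fuses a frame's target features with the next frame's
                   motion features to give the next frame's initial features
    """
    # For the first frame, the initial features are the motion features.
    initial = motion_feats_per_frame[0]
    targets = []
    for k in range(len(motion_feats_per_frame)):
        target = spatial_step(initial)            # step 1410
        targets.append(target)
        if k + 1 < len(motion_feats_per_frame):   # step 1420
            initial = temporal_step(target, motion_feats_per_frame[k + 1])
    return targets                                # loop itself is step 1430
```

With toy scalar "features", an identity `spatial_step`, and addition as `temporal_step`, the inputs `[1, 10, 100]` yield `[1, 11, 111]`, which shows how the k-th output accumulates information from frames 1 through k.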

[0084] Step 150: Identify the behavior of the group composed of the multiple individuals based on the target individual characteristics of the multiple individuals in the multiple target video frames.

[0085] After the calculation in step 140, the target individual features of multiple individuals from multiple target video frames can be obtained. Assuming there are 3 target video frames and 10 individuals, then 30 target individual features can be obtained.

[0086] Furthermore, the target individual features of all individuals can be fused into a single group feature, and the group behavior can then be identified based on this group feature. For example, max pooling can be used to fuse the target individual features of multiple individuals from multiple target video frames into a single group feature, and a fully connected layer can then classify the group behavior, with the cross-entropy function used as the loss function.
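
The pooling and classification described in [0086] can be sketched as follows. This is a minimal illustration with hypothetical function names; in practice the weights of the fully connected layer are learned during training.

```python
import math


def max_pool(features):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*features)]


def linear(x, weight, bias):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def cross_entropy(logits, label):
    """Cross-entropy loss of softmax(logits) against an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]
```

For 3 target video frames and 10 individuals, `max_pool` would receive 30 vectors and return a single group feature of the same dimension.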

[0087] In summary, this invention provides a method for group behavior recognition based on spatiotemporal individual interaction relationships. First, multiple target video frames are extracted from the target video, where each video frame contains the same multiple individuals and the target video frames are ordered chronologically. Then, motion information of the multiple individuals in each target video frame is obtained. This motion information includes the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video. Next, feature extraction is performed on the motion information of the multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame. After that, the initial individual features of the multiple individuals in the current target video frame are processed to obtain the target individual features of each individual in the current target video frame. The target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame are fused to obtain the initial individual features of multiple individuals in the next target video frame. This process is repeated, using the next target video frame as the current target video frame, until the target individual features of each individual in the last target video frame are obtained. The target individual features of each individual in the k-th target video frame indicate the motion features of each individual in the 1st to k-th target video frames and the interaction features between each individual and other individuals. The initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame. 
Finally, based on the target individual features of multiple individuals in the multiple target video frames, the behavior of the group composed of these individuals is identified. This method can achieve the fusion of temporal and spatial domain individual interaction features, thereby enabling the extraction of deep semantic features of the contextual relationships between multiple target video frames and improving the accuracy of group behavior recognition.

[0088] The following provides a specific implementation scenario to further illustrate the group behavior recognition method based on spatiotemporal individual interaction relationships provided by the embodiments of the present invention.

[0089] Step (1) Obtain the target video.

[0090] A 1-second video clip of a basketball game was collected. This example uses the NCAA basketball game group behavior dataset for the experiment. The basketball game video clip contains 10 basketball players.

[0091] Step (2) Extract the optical flow information of all players between every two adjacent video frames from the target video.

[0092] The PWC-Net optical flow estimation algorithm is used to extract the optical flow information of all players between every two adjacent video frames.

[0093] Step (3) processes the optical flow information of all players between every two adjacent video frames to obtain the motion information of all players in each video frame.

[0094] The GLM-Net method is used to process the optical flow information of 10 players between every two adjacent video frames, calculating the motion information of each player in each video frame. This motion information is the relative motion information of each player compared to the previous video frame.

[0095] Step (4) Extract 3 target video frames from the target video.

[0096] A sparse sampling strategy based on the TSN network model randomly samples all video frames in the target video, extracting 3 target video frames. The size of the target video frame is H×W=360×490.
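
A TSN-style sparse sampling step can be sketched as below: the video is split into equal segments and one frame index is drawn at random from each. The function name and the requirement that the video has at least as many frames as segments are assumptions of this sketch.

```python
import random


def sparse_sample(num_frames, num_segments, rng=random):
    """Pick one random frame index from each of num_segments equal spans.

    Assumes num_frames >= num_segments so that every span is non-empty.
    """
    bounds = [s * num_frames // num_segments for s in range(num_segments + 1)]
    return [rng.randrange(bounds[s], bounds[s + 1]) for s in range(num_segments)]
```

Calling `sparse_sample(25, 3)` on a hypothetical 25-frame clip returns three indices in chronological order, one from each third of the clip.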

[0097] Step (5) Extract features from the motion information of all players in each target video frame to obtain the motion features of each individual in each target video frame.

[0098] Player motion information from three target video frames is input into the Inception-v3 network model. The Inception-v3 network model is then used to extract deep semantic features, or motion features, for each player's motion information in the three target video frames. The feature dimension of the motion features is c=512.

[0099] The RoIAlign algorithm is used to map the target tracking bounding box of each player in each target video frame to the motion features of each player. There are a total of 10 player motion features in each target video frame. The target tracking bounding box size is 5×5.

[0100] Step (6) Extract the individual features of each player in the first target video frame using the GCN model.

[0101] This step includes: (1) Calculating the interaction weights between every two players in the first target video frame according to formula (1). Let the distance threshold θ = 0.2. When the distance between the center points of two players' target tracking boxes is greater than or equal to 0.2 times the image width, the interaction weight between the two players is set to 0, indicating no interaction between them; when the distance is less than 0.2 times the image width, the interaction weight between the two players is determined according to formula (1). (2) Establishing the initial graph $G = (V, E)$ for all players in the first target video frame, with the players as nodes, where $V = \{v_1, \dots, v_N\}$ is the set of nodes in the initial graph, $v_n$ is the n-th node, and $E$ represents the interaction relationships between players. Each player's motion features are used as the node features of each node, with a node feature dimension c=512. The interaction weight between each player and other players is used as the weight of the edge between each player and other players, with an edge feature dimension z=256. (3) Inputting the established initial graph into the GCN model for processing to obtain the target individual features of each player in the first target video frame.

[0102] Step (7) uses the Transformer model to fuse the individual features of the 10 players in the first target video frame and the motion features of the 10 players in the second target video frame to obtain the initial individual features of the 10 players in the second target video frame.

[0103] Step (8) Extract the individual features of each player in the second target video frame using the GCN model.

[0104] This step includes: (1) Calculating the interaction weights between every two players in the second target video frame according to formula (1). Let the distance threshold θ = 0.2. When the distance between the center points of two players' target tracking boxes is greater than or equal to 0.2 times the image width, the interaction weight between the two players is set to 0, indicating no interaction between them; when the distance is less than 0.2 times the image width, the interaction weight between the two players is determined according to formula (1). (2) Establishing the initial graph $G = (V, E)$ for all players in the second target video frame, with the players as nodes, where $V = \{v_1, \dots, v_N\}$ is the set of nodes in the initial graph, $v_n$ is the n-th node, and $E$ represents the interaction relationships between players. The initial individual features of each player are used as the node features of each node, with a node feature dimension c=512. The interaction weight between each player and other players is used as the weight of the edge between each player and other players, with an edge feature dimension z=256. (3) Inputting the established initial graph into the GCN model for processing to obtain the target individual features of each player in the second target video frame.

[0105] Step (9) uses the Transformer model to fuse the individual features of the 10 players in the second target video frame and the motion features of the 10 players in the third target video frame to obtain the initial individual features of the 10 players in the third target video frame.

[0106] Step (10) Extract the individual features of each player in the third target video frame using the GCN model.

[0107] This step includes: (1) Calculating the interaction weights between every two players in the third target video frame according to formula (1). Let the distance threshold θ = 0.2. When the distance between the center points of two players' target tracking boxes is greater than or equal to 0.2 times the image width, the interaction weight between the two players is set to 0, indicating no interaction between them; when the distance is less than 0.2 times the image width, the interaction weight between the two players is determined according to formula (1). (2) Establishing the initial graph $G = (V, E)$ for all players in the third target video frame, with the players as nodes, where $V = \{v_1, \dots, v_N\}$ is the set of nodes in the initial graph, $v_n$ is the n-th node, and $E$ represents the interaction relationships between players. The initial individual features of each player are used as the node features of each node, with a node feature dimension c=512. The interaction weight between each player and other players is used as the weight of the edge between each player and other players, with an edge feature dimension z=256. (3) Inputting the established initial graph into the GCN model for processing to obtain the target individual features of each player in the third target video frame.

[0108] Step (11) fuses the individual features of the 10 players in the 3 target video frames to generate group features.

[0109] Max pooling is used to fuse the target individual features of the 10 players from the 3 target video frames into group features. The 10 players across the 3 target video frames contribute a total of 30 target individual features.

[0110] Step (12) inputs the group features into the fully connected layer for classification and outputs the classification results.

[0111] The cross-entropy function is used as the loss function. There are 6 types of group behaviors: free throw, layup, dunk, two-point shot, three-point shot, and steal. The fully connected layer output layer has 6 dimensions, with each dimension corresponding to one group behavior.

[0112] Furthermore, the Inception-v3, GCN, and Transformer network models used in the above steps are all trained network models. The GCN and Transformer models are trained with the Adam optimizer and its associated hyperparameters. Training is conducted end to end for 300 epochs, with an initial learning rate of 0.001. After every 50 batches of training, the learning rate is reduced to 0.2 times its original value.
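
The learning-rate schedule in [0112] can be sketched as a step decay. This assumes the reduction compounds at every interval (each reduction multiplies the current rate by 0.2); the function name and the `step` counter, whose unit follows the text, are hypothetical.

```python
def learning_rate(step, base_lr=0.001, decay=0.2, interval=50):
    """Step decay: multiply the rate by `decay` once every `interval` steps."""
    return base_lr * decay ** (step // interval)
```

Under this reading, the rate is 0.001 for steps 0 to 49, 0.0002 for steps 50 to 99, and 0.00004 for steps 100 to 149.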

[0113] The accuracy of this embodiment in identifying group behavior during basketball games is shown in Table 1.

[0114] Table 1. Accuracy of group behavior identification in basketball games

[0115]
Behavior category | Accuracy
Three-point shot | 0.842
Free throw | 0.939
Layup | 0.628
Two-point shot | 0.625
Dunk | 0.407
Steal | 0.870
Average | 0.719

[0116] In contrast, the existing GLM-Net method was used to identify group behavior in basketball games. Comparing the group behavior identification results obtained in this embodiment with those of the existing GLM-Net method, it was found that the average accuracy of the method provided in this embodiment is improved by 0.029.
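
As a quick arithmetic check on Table 1, the sketch below recomputes the average from the six per-category accuracies (labels follow the category list in [0111]) and derives the baseline average that the stated gain would imply. Reading the 0.029 improvement as an absolute difference is an assumption; the text does not report the baseline figure directly.

```python
from decimal import Decimal, ROUND_HALF_UP

accuracy = {
    "three-point shot": Decimal("0.842"),
    "free throw": Decimal("0.939"),
    "layup": Decimal("0.628"),
    "two-point shot": Decimal("0.625"),
    "dunk": Decimal("0.407"),
    "steal": Decimal("0.870"),
}

# Unweighted mean over the six behavior categories.
mean = sum(accuracy.values()) / len(accuracy)

# Round half-up to three decimals, matching the precision used in Table 1.
reported = mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)

# Assumption: the 0.029 gain is an absolute difference in average accuracy.
implied_baseline = reported - Decimal("0.029")
```

The unweighted mean is 0.7185, which rounds to the reported 0.719; the implied baseline average would then be 0.690.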

[0117] In summary, the group behavior recognition method based on spatiotemporal individual interaction relationships provided by this invention processes the initial individual features of multiple individuals in each target video frame, fusing the target individual features of each individual in each target video frame with intra-frame spatial domain individual interaction relationship features. Then, by fusing the target individual features of multiple individuals in the previous target video frame with the motion features of multiple individuals in the next target video frame, the intra-frame individual interaction relationship features of the previous target video frame are transferred to the next target video frame, thereby fusing the initial individual features of each individual in the next target video frame with inter-frame temporal domain individual interaction relationship features. Through the above process, the fusion of temporal and spatial domain individual interaction relationship features is achieved, enabling the extraction of deep semantic features of the contextual relationships between multiple target video frames, thus improving the accuracy of group behavior recognition.

[0118] Figure 3 shows a schematic structural diagram of the group behavior recognition device based on spatiotemporal individual interaction relationships provided in an embodiment of the present invention. As shown in Figure 3, the group behavior recognition device 300 based on spatiotemporal individual interaction relationships includes: a target video frame extraction module 310, used to extract multiple target video frames from the target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are ordered in chronological order; a motion information acquisition module 320, used to acquire motion information of multiple individuals in each target video frame, the motion information of each individual in each target video frame including the relative motion information of each individual in each target video frame relative to each individual in the previous video frame in the target video; a motion feature extraction module 330, used to extract features from the motion information of multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame; a target individual feature determination module 340, used to process the initial individual features of multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame, to fuse the target individual features of multiple individuals in the current target video frame with the motion features of multiple individuals in the next target video frame to obtain the initial individual features of multiple individuals in the next target video frame, and to take the next target video frame as the current target video frame and repeat the above process until the target individual features of each individual in the last target video frame are obtained, wherein the target individual features of each individual in the k-th target video frame indicate the motion features of each individual in the 1st to k-th target video frames and the interaction relationship features between each individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of each individual in the 1st target video frame; and a group behavior recognition module 350, used to identify the behavior of the group composed of the multiple individuals based on the target individual features of the multiple individuals in the multiple target video frames.

[0119] In some embodiments, the device includes:

[0120] The optical flow information extraction module is used to extract optical flow information of multiple individuals between every two adjacent video frames from the target video.

[0121] The motion information extraction module is used to process the optical flow information of multiple individuals between every two adjacent video frames using the GLM-Net method to obtain the motion information of multiple individuals in each video frame of the target video. The motion information of each individual in each video frame includes the relative motion information of each individual in each video frame relative to each individual in the previous video frame in the target video.

[0122] In some embodiments, the target video frame extraction module is specifically used for:

[0123] A sparse sampling strategy based on the TSN network model is used to extract multiple target video frames from the target video.

[0124] In some embodiments, the target individual feature determination module includes:

[0125] The target individual characteristic determination submodule includes:

[0126] The interaction relationship weight determination unit is used to determine the interaction relationship weight between each pair of individuals in the current target video frame based on the initial individual characteristics of each pair of individuals in the current target video frame.

[0127] The initial graph establishment unit is used to establish an initial graph of the current target video frame, with each individual in the current target video frame as each node, the initial individual features of each individual as the node features of each node, and the interaction relationship weight between each individual and other individuals as the weight of the edge between each individual and other individuals.

[0128] The target individual feature determination unit is used to input the initial graph into the GCN model for processing to obtain the target individual features of each individual in the current target video frame.

[0129] In some embodiments, the interaction relationship weight determination unit is specifically used for:

[0130] If the distance between any two individuals in the current target video frame meets the preset distance filtering conditions, it is determined that there is an interaction relationship between the two individuals. Based on the initial individual characteristics between the two individuals, the interaction relationship weight between the two individuals is determined.

[0131] If the distance between any two individuals in the current target video frame does not meet the preset distance filtering conditions, it is determined that there is no interaction relationship between the two individuals, and the interaction relationship weight between the two individuals is set to 0.

[0132] In some embodiments, the interaction relationship weight determination unit is specifically used for:

[0133] The interaction weights between two individuals are determined based on the distance between their initial individual features.

[0134] In some embodiments, the target individual feature determination module includes:

[0135] The initial individual feature determination submodule is used to fuse the target individual features of multiple individuals in the current target video frame and the motion features of multiple individuals in the next target video frame using the Transformer model to obtain the initial individual features of multiple individuals in the next target video frame.

[0136] Figure 4 shows an electronic device according to an embodiment of the present invention. As shown in Figure 4, the electronic device 400 includes: at least one processor 410, and a memory 420 communicatively connected to the at least one processor 410, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the method described above.

[0137] Specifically, the memory 420 and the processor 410 are connected via a bus 430. They may be a general-purpose memory and a general-purpose processor, without specific limitation. When the processor 410 runs the computer program stored in the memory 420, it can perform the operations and functions described in the embodiments of the invention with reference to Figures 1 to 3.

[0138] In this embodiment of the invention, the electronic device 400 may include, but is not limited to: personal computer, server computer, workstation, desktop computer, laptop computer, notebook computer, mobile computing device, smartphone, tablet computer, personal digital assistant (PDA), handheld device, messaging device, wearable computing device, etc.

[0139] This invention also provides a storage medium storing a computer program that, when executed by a processor, implements the method described above. Specific implementation details can be found in the method embodiments and will not be repeated here. Specifically, a system or apparatus may be provided with a storage medium that stores software program code implementing the functions of any of the embodiments described above, and the computer or processor of the system or apparatus reads and executes the instructions stored in the storage medium. The program code read from the storage medium can itself implement the functions of any of the embodiments described above; therefore, the machine-readable code and the storage medium storing the machine-readable code constitute a part of this invention.

[0140] Storage media include, but are not limited to, floppy disks, hard disks, magneto-optical disks, optical disks, magnetic tapes, non-volatile memory cards, and ROMs. Program code can also be downloaded from server computers or the cloud via communication networks.

[0141] It should be noted that not all steps and modules in the above processes and system structures are necessary; some steps and units can be omitted as needed. The execution order of each step is not fixed and can be determined as required. The device structure described in the above embodiments can be a physical structure or a logical structure. A module or unit may be implemented by the same physical entity, a module or unit may be implemented by multiple physical entities respectively, or a module or unit may be jointly implemented by multiple components in multiple independent devices.

[0142] Although embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the specification and embodiments. They can be applied to any field suitable for embodiments of the present invention, and other modifications can be readily implemented by those skilled in the art. Therefore, without departing from the general concept defined by the claims and their equivalents, embodiments of the present invention are not limited to the specific details and illustrations shown and described herein.

Claims

1. A method for identifying group behavior based on spatiotemporal individual interaction relationships, characterized in that the method comprises: extracting multiple target video frames from a target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are arranged in chronological order; obtaining the motion information of the multiple individuals in each target video frame, wherein the motion information of each individual in each target video frame includes the relative motion information of each individual in that target video frame relative to each individual in the previous video frame in the target video; performing feature extraction on the motion information of the multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame; processing the initial individual features of the multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame; fusing the target individual features of the multiple individuals in the current target video frame with the motion features of the multiple individuals in the next target video frame to obtain the initial individual features of the multiple individuals in the next target video frame; taking the next target video frame as the current target video frame and repeating the above process until the target individual features of each individual in the last target video frame are obtained; wherein the target individual features of each individual in the k-th target video frame indicate the motion features of that individual in the 1st to k-th target video frames and the interaction relationship features between that individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of that individual in the 1st target video frame; and identifying, based on the target individual features of the multiple individuals in the multiple target video frames, the behavior of the group composed of the multiple individuals; wherein the spatiotemporal individual interaction relationships include spatial individual interaction relationships and temporal individual interaction relationships, the spatial individual interaction relationships being the interaction relationships between each individual and other individuals in the same target video frame, and the temporal individual interaction relationships being the interaction relationships between each individual and other individuals in different target video frames.

2. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 1, characterized in that, before extracting the multiple target video frames from the target video, the method further comprises: extracting optical flow information of the multiple individuals between every two adjacent video frames of the target video; and processing the optical flow information of the multiple individuals between every two adjacent video frames using the GLM-Net method to obtain the motion information of the multiple individuals in each video frame of the target video, wherein the motion information of each individual in each video frame includes the relative motion information of each individual in that video frame relative to each individual in the previous video frame in the target video.

3. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 1, characterized in that the step of extracting multiple target video frames from the target video comprises: extracting the multiple target video frames from the target video using a sparse sampling strategy based on the TSN network model.

4. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 1, characterized in that the step of processing the initial individual features of the multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame comprises: determining, based on the initial individual features of every two individuals in the current target video frame, the interaction relationship weight between every two individuals in the current target video frame; establishing an initial graph of the current target video frame by taking each individual in the current target video frame as a node, taking the initial individual features of each individual as the node features of the corresponding node, and taking the interaction relationship weight between each individual and other individuals as the weight of the edge between that individual and the other individuals; and inputting the initial graph into the GCN model for processing to obtain the target individual features of each individual in the current target video frame.

5. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 4, characterized in that the step of determining the interaction relationship weight between every two individuals in the current target video frame based on the initial individual features of every two individuals in the current target video frame comprises: if the distance between any two individuals in the current target video frame meets the preset distance filtering conditions, determining that there is an interaction relationship between the two individuals, and determining the interaction relationship weight between the two individuals based on the initial individual features of the two individuals; and if the distance between any two individuals in the current target video frame does not meet the preset distance filtering conditions, determining that there is no interaction relationship between the two individuals, and setting the interaction relationship weight between the two individuals to 0.

6. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 5, characterized in that the step of determining the interaction relationship weight between the two individuals based on the initial individual features of the two individuals comprises: determining the interaction relationship weight between the two individuals based on the distance between their initial individual features.

7. The group behavior recognition method based on spatiotemporal individual interaction relationships as described in claim 1, characterized in that the step of fusing the target individual features of the multiple individuals in the current target video frame with the motion features of the multiple individuals in the next target video frame to obtain the initial individual features of the multiple individuals in the next target video frame comprises: fusing, using the Transformer model, the target individual features of the multiple individuals in the current target video frame with the motion features of the multiple individuals in the next target video frame to obtain the initial individual features of the multiple individuals in the next target video frame.

8. A group behavior recognition device based on spatiotemporal individual interaction relationships, characterized in that the device comprises: a target video frame extraction module, used to extract multiple target video frames from a target video, wherein each video frame in the target video contains the same multiple individuals, and the multiple target video frames are arranged in chronological order; a motion information acquisition module, used to acquire the motion information of the multiple individuals in each target video frame, wherein the motion information of each individual in each target video frame includes the relative motion information of each individual in that target video frame relative to each individual in the previous video frame in the target video; a motion feature extraction module, used to perform feature extraction on the motion information of the multiple individuals in each target video frame to obtain the motion features of each individual in each target video frame; a target individual feature determination module, used to process the initial individual features of the multiple individuals in the current target video frame to obtain the target individual features of each individual in the current target video frame, to fuse the target individual features of the multiple individuals in the current target video frame with the motion features of the multiple individuals in the next target video frame to obtain the initial individual features of the multiple individuals in the next target video frame, and to repeat the above process using the next target video frame as the current target video frame until the target individual features of each individual in the last target video frame are obtained, wherein the target individual features of each individual in the k-th target video frame indicate the motion features of that individual in the 1st to k-th target video frames and the interaction relationship features between that individual and other individuals, and the initial individual features of each individual in the 1st target video frame are the motion features of that individual in the 1st target video frame; and a group behavior recognition module, used to identify the behavior of the group composed of the multiple individuals based on the target individual features of the multiple individuals in the multiple target video frames; wherein the spatiotemporal individual interaction relationships include spatial individual interaction relationships and temporal individual interaction relationships, the spatial individual interaction relationships being the interaction relationships between each individual and other individuals in the same target video frame, and the temporal individual interaction relationships being the interaction relationships between each individual and other individuals in different target video frames.

9. An electronic device, characterized in that it comprises: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of claims 1-7.

10. A storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method of any one of claims 1-7.
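For illustration only, and not as part of the claims, the frame sampling of claim 3 and the frame-by-frame iteration of claim 1 can be sketched as follows. The sketch assumes TSN-style sampling that takes the middle frame of each equal-length segment (TSN may instead sample randomly within a segment during training), and it takes the graph step and fusion step as hypothetical callables `graph_step` and `fuse_step`.

```python
def sparse_sample_indices(num_frames, num_segments):
    """Pick the middle frame index of each of num_segments equal-length segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need 0 < num_segments <= num_frames")
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

def propagate(motion_features, graph_step, fuse_step):
    """Run the per-frame iteration over time-ordered motion features.

    motion_features: list with one entry per target video frame.
    graph_step:      maps initial individual features to target individual features.
    fuse_step:       maps (target features of frame k, motion features of frame k+1)
                     to the initial individual features of frame k+1.
    Returns the target individual features of every target video frame.
    """
    initial = motion_features[0]   # frame 1: initial features are the motion features
    targets = []
    for k in range(len(motion_features)):
        target = graph_step(initial)
        targets.append(target)
        if k + 1 < len(motion_features):
            initial = fuse_step(target, motion_features[k + 1])
    return targets
```

Because each frame's initial features are built from the previous frame's target features, the target features of the k-th frame depend on all frames from the 1st to the k-th.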