A video scene understanding method and system based on attention fusion

CN119206577B · Active · Publication Date: 2026-06-23 · HUNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2024-09-12
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing video scene understanding technologies lack effective information processing capabilities when dealing with temporal sequence information in videos. In particular, they struggle to capture complex temporal dynamics when faced with nonlinear motion and rapidly changing scenes, resulting in poor pedestrian behavior recognition performance.

Method used

A novel multi-dimensional attention fusion module is adopted, which independently calculates attention information in the three dimensions of channel, time and space, and integrates them through the feature fusion module to build a recognition network in pedestrian video scenes. This includes a channel preprocessing module, a spatial shrinkage sampling module, a temporal expansion focusing module and a multi-dimensional feature calculation module, which optimizes the processing of video data.

Benefits of technology

It enhances the ability to understand video scenes, especially in pedestrian action recognition. By effectively utilizing temporal information, it improves the ability to process video time-series information, thereby enhancing the reliability and accuracy of recognition results.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a video scene understanding method and system based on attention fusion. A multi-dimensional attention fusion module is added to a backbone network to construct a recognition network for pedestrian video scenes, such as a pedestrian action recognition network; the multi-dimensional attention fusion module independently calculates attention in the three dimensions of channel, time and space and then fuses the results. Pedestrian data of each frame of image in a video data set is then acquired and used to train the recognition network. Finally, the trained recognition network is used for video understanding of a to-be-detected video, such as outputting a pedestrian action type. The technical scheme of the application not only makes effective use of each kind of attention information by using it independently, but also integrates the attention information in the time and space dimensions to obtain more comprehensive information, which enhances the use of time sequence information and improves the video scene understanding capability.

Description

Technical Field

[0001] This invention belongs to the field of pedestrian image and video recognition technology, and specifically relates to a video scene understanding method and system based on attention fusion.

Background Technology

[0002] Video scene understanding is an important topic in the fields of computer vision and artificial intelligence. It involves extracting and interpreting visual information from videos to achieve understanding and analysis of video content, including multiple levels of technologies and tasks such as object detection, object tracking, behavior recognition, and scene segmentation.

[0003] PoseC3D is a commonly used network for video scene understanding. It employs a 3D convolutional neural network (3DCNN) architecture to process spatiotemporal information in video data, enabling video scene understanding and human pose analysis. PoseC3D is more efficient and accurate in processing dynamic human poses in videos, making it suitable for applications requiring long-duration motion recognition and complex motion capture. It first obtains a sequence of bounding boxes containing only the human body through an object detection network. Frame extraction and normalization are then performed on the video sequence. 3D convolutional layers are used to extract features from the video frames, generating keypoint locations for the human body. The network outputs these keypoints for smoothing and correction to improve the accuracy and robustness of pose estimation. Finally, based on pose estimation, changes in human pose are further analyzed to identify specific human actions or behavioral patterns.

[0004] PoseC3D uses a 3D-CNN network as its backbone for keypoint processing. While 3D-CNNs are typically designed with the assumption that the time intervals between video frames are fixed, this may not always hold true in real-world applications. Furthermore, 3D-CNNs usually treat time as another spatial dimension and are not specifically optimized to understand complex temporal changes. Therefore, they are insufficient to capture more complex temporal dynamics in non-linear motion and rapidly changing scenes. Additionally, since action recognition requires full utilization of temporal information, the lack of a dedicated temporal information processing module means that PoseC3D still has room for improvement.

[0005] Adding several specially designed temporal attention modules to the backbone network can enhance the model's ability to process time-series information in videos with only a few additional parameters, thereby optimizing the accuracy and efficiency of pose recognition. CBAM (Convolutional Block Attention Module) is an attention mechanism module for convolutional neural networks that improves the network's representational power by jointly using channel attention and spatial attention. The CBAM module first weights different channels of the input feature map using the channel attention module to highlight important feature channels, and then weights different spatial locations of the feature map using the spatial attention module to emphasize important spatial regions.

[0006] However, when processing long time-series data, the CBAM module may not fully capture global temporal dependencies, because its design focuses primarily on the channel and spatial dimensions and lacks modeling and utilization of temporal information. Furthermore, extending the convolutions in CBAM to a 3D-CBAM significantly increases the number of parameters due to the use of 3D convolutions. In addition, processing temporal information mixed with the channel or spatial dimensions using the same method leads to inefficient utilization of temporal information.

[0007] The aforementioned technical obstacles mean that video scene understanding and pedestrian behavior recognition technology still need further improvement.

Summary of the Invention

[0008] The purpose of this invention is to overcome the limitations of existing technologies in processing temporal sequence information of videos, enhance the utilization of temporal information, and thus improve video scene understanding capabilities. This invention provides a video scene understanding method and system based on attention fusion. Specifically, the technical solution of this invention proposes a novel multi-dimensional attention fusion module, which is integrated into a backbone network to construct a recognition network for pedestrian video scenes. This multi-dimensional attention fusion module independently calculates the attention of the channel, time, and space dimensions, and then fuses the channel attention, temporal attention, and spatial attention information. This approach not only utilizes attention information individually to maximize the utilization of various attention types but also integrates attention information in both the temporal and spatial dimensions, resulting in more comprehensive information. The network exhibits particularly outstanding performance in pedestrian action recognition applications.

[0009] Therefore, the present invention provides the following technical solution:

[0010] In a first aspect, the present invention provides a video scene understanding method based on attention fusion, comprising the following steps:

[0011] A multi-dimensional attention fusion module is added to the backbone network to construct a recognition network for pedestrian video scenarios. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0012] The pedestrian data of each frame in the video dataset is obtained, and then the pedestrian data of each frame in the video dataset is used to train the recognition network in the pedestrian video scene to obtain the content of video understanding.

[0013] The trained pedestrian video scene recognition network is used to perform video understanding on the video to be detected.

[0014] Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

[0015] Further, optionally, the video understanding task is pedestrian action recognition, and the corresponding recognition network in the pedestrian video scene is a pedestrian action recognition network;

[0016] The pedestrian data is a 3D volumetric heatmap integrated from pedestrian keypoint posture data. Pedestrian keypoints represent the positions of human joints, and the pedestrian keypoint posture data includes joint positions and limbs represented by joint lines. The output data of the recognition network in the pedestrian video scene is the pedestrian action type. The specific content of the pedestrian action category is manually set according to recognition requirements, such as walking, running, etc.
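As a rough illustration of the input described above, the sketch below stacks per-joint Gaussian heatmaps over time into a volume of shape (K, T, H, W). It is a minimal sketch under stated assumptions, not the patent's procedure: the Gaussian form, the `sigma` parameter, the function name `keypoints_to_heatmap_volume`, and the axis order are all hypothetical, and only joints are rendered (limbs drawn along joint lines are omitted).

```python
import numpy as np

def keypoints_to_heatmap_volume(keypoints, height, width, sigma=1.0):
    """Stack per-joint Gaussian heatmaps over time.

    keypoints: array of shape (T, K, 2) holding (x, y) pixel coordinates
               for K joints in each of T frames (assumed layout).
    Returns an array of shape (K, T, height, width).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    num_frames, num_joints, _ = keypoints.shape
    ys = np.arange(height, dtype=np.float64)[:, None]  # (H, 1)
    xs = np.arange(width, dtype=np.float64)[None, :]   # (1, W)
    volume = np.zeros((num_joints, num_frames, height, width))
    for t in range(num_frames):
        for k in range(num_joints):
            x, y = keypoints[t, k]
            # Assumed rendering: an isotropic Gaussian centred on the joint.
            dist_sq = (xs - x) ** 2 + (ys - y) ** 2
            volume[k, t] = np.exp(-dist_sq / (2.0 * sigma ** 2))
    return volume
```

Each joint then occupies one channel of the volume, so the recognition network receives the pose sequence as a dense spatiotemporal tensor rather than as a coordinate list.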

[0017] This invention processes information from the three dimensions of channel, time, and space independently, and calculates attention information for each dimension separately. Then, it utilizes the attention information uniformly through feature fusion. Specifically, the input first passes through the channel attention module, and the output after attention calculation simultaneously passes through the spatial attention module and the temporal attention module. The spatial and temporal attention modules each perform their own attention calculations. Finally, the attention information from these three modules is fused in the feature fusion module to obtain the final output of the module.

[0018] Optionally, the multi-dimensional attention fusion module includes a channel preprocessing module, a spatial shrinkage sampling module, a temporal expansion focusing module, and a multi-dimensional feature calculation module. The channel preprocessing module, spatial shrinkage sampling module, and temporal expansion focusing module are used to acquire channel attention, spatial attention, and temporal attention, respectively. The output $X_{c\_out}$ obtained by the channel preprocessing module is used as the input to both the spatial shrinkage sampling module and the temporal expansion focusing module;

[0019] The multi-dimensional attention fusion module is used to fuse channel attention, temporal attention, and spatial attention information. The processing procedure is as follows:

[0020] The output $X_{sa}$ of the spatial shrinkage sampling module and the output $X_{ta}$ of the temporal expansion focusing module are each multiplied by the output $X_{c\_out}$ to obtain the spatial attention activation output $X_{s\_out}$ and the temporal attention activation output $X_{t\_out}$; the output $X_{sa}$ is multiplied by the output $X_{ta}$ to obtain the spatiotemporal attention information $X_{sta}$, which is multiplied by the output $X_{c\_out}$ to obtain the output $X_{st\_out}$. Finally, $X_{s\_out}$, $X_{t\_out}$ and $X_{st\_out}$ are summed with weights according to a preset weight ratio to give the output of the multi-dimensional feature calculation module, which is also the output of the multi-dimensional attention fusion module.
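The fusion step described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `fuse_attention`, the default equal weights, and the attention-map shapes (chosen so that they broadcast against the feature tensor) are assumptions, since the text does not give the preset weight ratio or the exact shapes.

```python
import numpy as np

def fuse_attention(x_c_out, x_sa, x_ta, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted fusion of spatial, temporal and spatiotemporal branches.

    x_c_out: channel-preprocessed features, shape (B, C, T, H, W).
    x_sa:    spatial attention, assumed broadcastable, e.g. (B, 1, T, H, W).
    x_ta:    temporal attention, assumed broadcastable, e.g. (B, C, T, 1, 1).
    weights: hypothetical preset ratio (w_s, w_t, w_st).
    """
    x_s_out = x_sa * x_c_out    # spatial branch
    x_t_out = x_ta * x_c_out    # temporal branch
    x_sta = x_sa * x_ta         # spatiotemporal attention
    x_st_out = x_sta * x_c_out  # spatiotemporal branch
    w_s, w_t, w_st = weights
    return w_s * x_s_out + w_t * x_t_out + w_st * x_st_out
```

Because every branch is an element-wise product with $X_{c\_out}$, the fused output has the same shape as the module input, which is what allows the module to be dropped into an existing backbone.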

[0021] The technical solution of this invention uses a multi-dimensional attention fusion module to fuse channel attention with temporal attention and with spatial attention. It also fuses temporal attention with spatial attention to obtain spatiotemporal attention, and finally sums $X_{s\_out}$, $X_{t\_out}$ and $X_{st\_out}$ with weights according to a preset weight ratio. In this way, the technical solution of this invention not only uses each kind of attention information independently, making the most effective use of it, but also integrates attention information across the temporal and spatial dimensions to obtain more comprehensive information.

[0022] Optionally, the processing procedure of the temporal expansion focusing module is as follows:

[0023] The spatial information of the input $X_{c\_out}$ is extended frame by frame into the temporal channel to obtain $X_t$. A sub-module of the temporal expansion focusing module, namely the time information fusion extension module, performs this step as follows: each frame, i.e., the information in the spatial dimension, is divided into several 2×2 blocks, and the pixels at the same position in each block are spliced together to obtain four spliced subframes. The four spliced subframes are then spliced together in sequence to complete the information fusion within a single frame. Finally, the fusion results of the individual frames are spliced together in the order of the temporal dimension.
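One way to read the rearrangement above is as a space-to-depth operation whose extra samples are placed on the time axis. The sketch below assumes exactly that: an input of shape (B, C, T, H, W) with even H and W becomes (B, C, 4T, H/2, W/2), with the four subframes of each frame kept adjacent and in the order (top-left, top-right, bottom-left, bottom-right). The output layout, the subframe order, and the name `extend_space_to_time` are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def extend_space_to_time(x):
    """Move each 2x2 spatial block's four positions onto the time axis.

    x: array of shape (B, C, T, H, W) with even H and W.
    Returns an array of shape (B, C, 4 * T, H // 2, W // 2) such that
    out[b, c, 4*t + 2*dy + dx, i, j] == x[b, c, t, 2*i + dy, 2*j + dx].
    """
    b, c, t, h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even to form 2x2 blocks")
    # Split every frame into 2x2 blocks: (B, C, T, H/2, 2, W/2, 2).
    blocks = x.reshape(b, c, t, h // 2, 2, w // 2, 2)
    # Bring the in-block offsets (dy, dx) next to the time axis.
    subframes = blocks.transpose(0, 1, 2, 4, 6, 3, 5)
    # Merge (T, dy, dx) into one axis: four subframes per original frame.
    return subframes.reshape(b, c, 4 * t, h // 2, w // 2)
```

No values are discarded by this rearrangement, which matches the stated aim of concentrating spatial information into the temporal dimension without losing effective information.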

[0024] $X_{t\_avg}$ and $X_{t\_max}$ are then obtained from $X_t$ through 3D average pooling and 3D max pooling, respectively;

[0025] Next, $X_{t\_avg}$ and $X_{t\_max}$ are each passed through a shared-weight MLP layer and added element-wise, and the sum is passed through the sigmoid activation function to obtain the output $X_{ta}$, i.e., the temporal attention information; wherein the MLP layer consists of two 1D convolutions and a ReLU function.

[0026] The temporal expansion focusing module provided by this invention concentrates spatial information into the temporal dimension, avoiding the loss of effective information. Furthermore, the shared-weight MLP in the CBAM module is changed from 2D convolution to 1D convolution, which reduces the computational load while isolating information from other dimensions, so that the attention information is concentrated more strongly on the temporal dimension.

[0027] Optionally, the processing procedure of the spatial shrinkage sampling module is as follows:

[0028] $X_{s\_avg}$ and $X_{s\_max}$ are obtained from the input $X_{c\_out}$ through average pooling and max pooling, respectively;

[0029] $X_{s\_avg}$ and $X_{s\_max}$ are then spliced together to obtain $X_s$;

[0030] Finally, $X_s$ is passed through a dilated convolution and the sigmoid activation function to obtain the output $X_{sa}$, i.e., the spatial attention information.
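A minimal NumPy sketch of the three steps above follows. Several details are assumptions rather than statements from the text: pooling is taken along the channel axis (as in CBAM), the dilated convolution is applied per frame as a single-output 2D convolution with zero padding that preserves H and W, and the kernel size and dilation rate are free parameters. The helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dilated_conv2d_same(x, kernel, dilation):
    """Per-frame dilated cross-correlation with zero 'same' padding.

    x: (B, C_in, T, H, W); kernel: (C_in, k, k) with odd k.
    Returns (B, 1, T, H, W).
    """
    _, k, _ = kernel.shape
    pad = dilation * (k - 1) // 2
    h, w = x.shape[-2:]
    xp = np.pad(x, ((0, 0), (0, 0), (0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((x.shape[0], 1) + x.shape[2:])
    for i in range(k):
        for j in range(k):
            dy, dx = i * dilation, j * dilation
            window = xp[..., dy:dy + h, dx:dx + w]  # shifted view of x
            out[:, 0] += np.einsum("bcthw,c->bthw", window, kernel[:, i, j])
    return out

def spatial_attention(x_c_out, kernel, dilation=2):
    """x_c_out: (B, C, T, H, W) -> spatial attention X_sa: (B, 1, T, H, W)."""
    x_s_avg = x_c_out.mean(axis=1, keepdims=True)       # assumed channel-wise
    x_s_max = x_c_out.max(axis=1, keepdims=True)        # assumed channel-wise
    x_s = np.concatenate([x_s_avg, x_s_max], axis=1)    # (B, 2, T, H, W)
    return sigmoid(dilated_conv2d_same(x_s, kernel, dilation))
```

Dilation enlarges the receptive field of the kernel without adding weights, which is one plausible reason for preferring it over a larger dense kernel here.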

[0031] Optionally, after the input of the multi-dimensional attention fusion module enters the channel preprocessing module, the processing procedure is as follows:

[0032] First, $X_{c\_avg}, X_{c\_max} \in \mathbb{R}^{B \times C \times 1 \times 1 \times 1}$ are obtained from the input X through 3D average pooling and 3D max pooling, respectively;

[0033] $X_{c\_avg}$ and $X_{c\_max}$ are then each passed through a shared-weight MLP layer and added element-wise, and the sum is passed through the sigmoid activation function to obtain the channel attention information $X_{ca}$;

[0034] Finally, the channel attention information $X_{ca}$ is multiplied by the input X to obtain the output $X_{c\_out}$.
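The channel preprocessing steps above can be sketched as follows. The sketch assumes that the two 1D convolutions of the shared MLP slide along the channel axis with a single filter each, an odd kernel size, and zero padding; these choices, and the names `conv1d_same`, `shared_mlp`, and `channel_preprocess`, are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(v, kernel):
    """Cross-correlate each row of v (B, L) with kernel (k,), zero padded."""
    k = kernel.shape[0]
    pad = k // 2
    vp = np.pad(v, ((0, 0), (pad, pad)))
    length = v.shape[1]
    return sum(kernel[i] * vp[:, i:i + length] for i in range(k))

def shared_mlp(v, kernel1, kernel2):
    """Two 1D convolutions with a ReLU in between (weights are shared)."""
    return conv1d_same(np.maximum(conv1d_same(v, kernel1), 0.0), kernel2)

def channel_preprocess(x, kernel1, kernel2):
    """x: (B, C, T, H, W) -> (X_c_out, X_ca)."""
    b, c = x.shape[:2]
    x_c_avg = x.mean(axis=(2, 3, 4))  # 3D average pooling -> (B, C)
    x_c_max = x.max(axis=(2, 3, 4))   # 3D max pooling     -> (B, C)
    logits = (shared_mlp(x_c_avg, kernel1, kernel2)
              + shared_mlp(x_c_max, kernel1, kernel2))
    x_ca = sigmoid(logits).reshape(b, c, 1, 1, 1)
    return x_ca * x, x_ca  # broadcast over T, H, W
```

Operating on the pooled (B, C) vector with 1D kernels keeps the weight count at a handful of scalars per convolution, independent of the number of channels.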

[0035] The channel preprocessing module provided by the technical solution of this invention improves the shared weight MLP in the CBAM module from 2D convolution to 1D convolution, reducing the amount of computation while isolating information from other dimensions, making the attention information more concentrated in the channel dimension.

[0036] Furthermore, in some implementations, a video scene understanding method based on attention fusion includes the following steps:

[0037] Step 1: Create a video dataset and an image dataset of pedestrians.

[0038] Step 2: Train a pedestrian keypoint detection network using the pedestrian image dataset to obtain a pedestrian detection model, wherein the input of the pedestrian detection model is a pedestrian image and the output is the bounding box of the pedestrian;

[0039] Step 3: Use the pedestrian detection model to perform frame-by-frame recognition on the video dataset to obtain the pedestrian recognition result for each frame of the image, and identify the pose of the pedestrian key points based on the pedestrian recognition result. The pedestrian key points represent the position of human joints, including the joint position and the limbs represented by the joint lines.

[0040] That is, pedestrian detection is performed using a constructed pedestrian detection model;

[0041] Step 4: Add the multi-dimensional attention fusion module to the backbone network to construct the pedestrian action recognition network, and then use the 3D volumetric heatmap integrated from the pedestrian key point pose data of each frame of the video data to train the pedestrian action recognition network.

[0042] The pedestrian action recognition network takes as input a 3D volumetric heatmap integrated from pedestrian key point posture data and outputs the pedestrian action category, which is the content of video understanding.

[0043] Step 5: Input the video to be detected into the pedestrian detection model and perform pedestrian key point recognition. Then input the recognition results of the pedestrian key points into the pedestrian action recognition network to obtain the action category.

[0044] Further optionally, the backbone network of the pedestrian action recognition network is the SlowPath of Slow Fast-RCNN, and the multi-dimensional attention fusion module is added to each residual layer (res-layer) of ResNet-3D.

[0045] In a second aspect, the present invention provides a system based on the video scene understanding method, which includes at least:

[0046] A pedestrian video scene recognition network construction module is used to add a multi-dimensional attention fusion module to the backbone network to construct a pedestrian video scene recognition network. The multi-dimensional attention fusion module calculates the attention of the channel, time and space dimensions independently, and then fuses the channel attention, time attention and spatial attention information.

[0047] The pedestrian data acquisition module is used to acquire pedestrian data for each frame of the video dataset;

[0048] The training module is used to train the pedestrian video scene recognition network using pedestrian data from each frame of the video dataset to obtain the content of video understanding;

[0049] The detection module is used to perform video understanding on the video to be detected using the trained pedestrian video scene recognition network;

[0050] Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

[0051] In a third aspect, the present invention provides an electronic terminal, comprising:

[0052] One or more processors;

[0053] A memory that stores one or more computer programs;

[0054] The processor invokes the computer program to implement the video scene understanding method.

[0055] In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program, which is called by a processor to implement the video scene understanding method.

[0056] Beneficial effects

[0057] Compared with existing methods, the advantages of the present invention are:

[0058] This invention provides a novel video scene understanding method based on attention fusion. It proposes a novel multi-dimensional attention fusion module, which is incorporated into the backbone network to construct a new pedestrian action recognition network. This multi-dimensional attention fusion module processes channel, temporal, and spatial information independently. That is, each of the three dimensions uses its own module to calculate attention separately, with each dimension emphasizing different aspects. This separate processing makes the attention information more effective. Finally, the temporal and spatial attention information are combined into spatiotemporal attention information, avoiding an excessive proportion of single-dimensional attention information in the output. This improves the processing capability of temporal sequence information in the video, fully and effectively utilizing temporal information, and ultimately enhancing the video scene understanding capability.

[0059] Compared to the existing CBAM, the multi-dimensional attention fusion (MDAF) module proposed in this invention extends the data from 2D to 3D, enabling the processing of video data. In addition, compared to CBAM, this invention adds a temporal expansion focusing module, which is specifically used to process temporal information. CBAM processes 2D information and cannot utilize temporal information.

[0060] Furthermore, the technical solution of this invention not only performs new processing on the three dimensions, but also adopts different structures to process the information of each dimension based on the characteristics of each dimension. At the same time, it has a multi-dimensional feature calculation module, which balances the effective information between each dimension through self-learning weights, thereby ensuring the reliability of the recognition results.

Attached Figure Description

[0061] Figure 1 is a schematic diagram of the network detection process provided by the present invention;

[0062] Figure 2 is a schematic diagram of the architecture of the MDAF module provided by the present invention;

[0063] Figure 3 is a schematic diagram of the processing flow of the channel preprocessing module;

[0064] Figure 4 is a schematic diagram of the processing flow of the temporal expansion focusing module;

[0065] Figure 5 is a schematic diagram of the processing flow of the spatial shrinkage sampling module.

Detailed Implementation

[0066] The present invention will be further described below with reference to embodiments.

[0067] The technical solution of this invention provides a video scene understanding method based on attention fusion. Its core improvement lies in proposing a novel multi-dimensional attention fusion module and integrating it into the backbone network. Therefore, the technical approach of this invention is as follows:

[0068] A multi-dimensional attention fusion module is added to the backbone network to build a recognition network for pedestrian video scenarios. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0069] The pedestrian data of each frame in the video dataset is obtained, and then the pedestrian data of each frame in the video dataset is used to train a recognition network for pedestrian video scenes to obtain the content of video understanding.

[0070] A pre-trained pedestrian video scene recognition network is used to perform video understanding on the video to be detected.

[0071] Specifically, the input and output data of the recognition network for pedestrian video scenes are set according to the goals of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding. It should be understood that video understanding is often divided into pedestrian action recognition, target recognition, target tracking, scene segmentation, etc. Therefore, the selection of input and output is closely related to and corresponds one-to-one with the goals of video understanding, and can be set according to the requirements.

[0072] Because the technical solution of this invention performs outstandingly in pedestrian action recognition applications, the following description uses this application as an example. Specifically, the video understanding task is pedestrian action recognition, and the corresponding recognition network in the pedestrian video scene is a pedestrian action recognition network. The pedestrian data is a 3D volumetric heatmap integrated from pedestrian key point posture data. The pedestrian key points represent the positions of human joints, and the pedestrian key point posture data includes joint positions and limbs represented by joint lines. The output data of the recognition network in the pedestrian video scene is the pedestrian action type.

[0073] Specifically as follows:

[0074] S1: A multi-dimensional attention fusion module is added to the backbone network to construct a pedestrian action recognition network. This module independently calculates the attention for each of the three dimensions (channel, time, and space) before fusing the channel, temporal, and spatial attention information. Specifically, the multi-dimensional attention fusion module includes a channel preprocessing module, a spatial shrinkage sampling module, a temporal expansion focusing module, and a multi-dimensional feature calculation module. The output $X_{c\_out}$ obtained from the channel preprocessing module is used as the input to both the spatial shrinkage sampling module and the temporal expansion focusing module; the multi-dimensional attention fusion module is used to fuse channel attention, temporal attention, and spatial attention information, and the processing procedure is as follows:

[0075] The output $X_{sa}$ of the spatial shrinkage sampling module and the output $X_{ta}$ of the temporal expansion focusing module are each multiplied by the output $X_{c\_out}$ to obtain the spatial attention activation output $X_{s\_out}$ and the temporal attention activation output $X_{t\_out}$; the output $X_{sa}$ is multiplied by the output $X_{ta}$ to obtain the spatiotemporal attention information $X_{sta}$, which is multiplied by the output $X_{c\_out}$ to obtain the output $X_{st\_out}$. Finally, $X_{s\_out}$, $X_{t\_out}$ and $X_{st\_out}$ are summed with weights according to a preset weight ratio to give the output of the multi-dimensional feature calculation module, which is also the output of the multi-dimensional attention fusion module. As can be seen from the above, this invention utilizes the multi-dimensional attention fusion module to achieve multi-dimensional feature fusion.

[0076] S2: Obtain pedestrian keypoint pose data for each frame of the video dataset related to pedestrian behavior, where pedestrian keypoints represent the positions of human joints. The purpose of this step is to obtain pedestrian keypoint pose data. Example 1 below is a preferred feasible method of the present invention. Other feasible embodiments do not restrict the method of obtaining pedestrian keypoint pose data.

[0077] S3: Train the pedestrian action recognition network using a 3D volumetric heatmap integrated from pedestrian keypoint pose data of each frame in the video dataset. The input to the pedestrian action recognition network is the 3D volumetric heatmap integrated from pedestrian keypoint pose data, and the output is the type of pedestrian action identified.

[0078] S4: Using the trained pedestrian action recognition network, perform video understanding on the video to be detected. It should be understood that, following the method in S2, the pedestrian keypoint pose data of the video to be detected is obtained, then integrated into a 3D volumetric heatmap, which is then input into the trained pedestrian action recognition network to identify the pedestrian action type.

[0079] The following Example 1 is a preferred embodiment of the present invention, as detailed below:

[0080] Example 1

[0081] This invention provides a video scene understanding method based on attention fusion, comprising the following steps:

[0082] Step 1: Establish a video dataset and a pedestrian image dataset related to pedestrian behavior. In this embodiment, the video dataset is used to train the subsequent pedestrian action recognition network, and the pedestrian image dataset is used to train the pedestrian detection model.

[0083] In this embodiment, a portion of the pedestrian image dataset is derived from sampling a video dataset, specifically by extracting a few frames; the remaining images are derived from other acquired pedestrian images, collectively forming the pedestrian image dataset. The pedestrian image dataset established using this method can be expanded, enabling more comprehensive and complete acquisition of pedestrian images.

[0084] Step 2: Train a pedestrian keypoint detection network using the pedestrian image dataset to obtain a pedestrian detection model, wherein the input of the pedestrian detection model is the pedestrian image and the output is the pedestrian bounding box.

[0085] In this embodiment, the Faster-RCNN network is selected as the pedestrian keypoint detection network. Since the Faster-RCNN network is an existing network architecture, it will not be described in detail in this invention. In other feasible embodiments, other pedestrian detection networks that meet the technical requirements of this invention can also be used.

[0086] It should be understood that the Faster-RCNN network is existing technology. Given that its input and output are clearly defined, those skilled in the art can understand that the network is trained using a pedestrian image dataset as the sample set. Therefore, its training process and network architecture will not be described in detail.

[0087] Step 3: Use a pedestrian detection model to perform frame-by-frame recognition on the video dataset to obtain pedestrian recognition results for each frame, and then identify pedestrian keypoint poses based on the pedestrian recognition results. Pedestrian keypoints represent the positions of human joints, typically including major joints such as the head, shoulders, elbows, wrists, hips, knees, and ankles. The connections between keypoints represent the structure of the human skeleton, such as the connection from the shoulder to the elbow and then to the wrist.

[0088] It should be understood that after obtaining the pedestrian's bounding box, the pose estimation of the pedestrian's key points is performed within the bounding box based on the coordinates of the bounding box, that is, the pedestrian's joints and limbs are extracted. Among these, the identification of the pose of the pedestrian's key points is achievable with existing technology, so it will not be described in detail.

[0089] Step 4: A multi-dimensional attention fusion module is added to the backbone network to construct a pedestrian action recognition network. This network is then trained using a 3D volumetric heatmap integrated from pedestrian keypoint pose data of each frame of the video data. The input data for the pedestrian action recognition network is the 3D volumetric heatmap integrated from pedestrian keypoint pose data, and the output is the recognized pedestrian action category. Scene understanding includes behavior recognition, and the detection result of behavior recognition is the pedestrian action category.

[0090] In this embodiment, the Slow Path network of Slow Fast-RCNN is preferably used as the backbone network, and a multi-dimensional attention fusion module is used to enhance the utilization of temporal feature information. In other feasible embodiments, the Slow Fast-RCNN backbone network is not the only option. For example, using ResNet-3D as the backbone network and adding the multi-dimensional attention fusion module MDAF to the residual layer of ResNet-3D also falls within the protection scope of this invention.

[0091] As shown in Figure 1, this embodiment adds a multi-dimensional attention fusion module (MDAF) to each residual layer of the ResNet-3D in SlowFast-RCNN. Each residual layer in ResNet-3D has a different size; adding a module to each residual layer allows information of different sizes to be processed and modeled, thus addressing information of varying dimensions within the data. It should be noted that, in theory, an MDAF module can be added anywhere in the backbone network, because the input and output sizes of the MDAF module are identical.

[0092] As shown in Figure 2, the channel preprocessing module, spatial shrinkage sampling module, temporal expansion focusing module, and multi-dimensional feature calculation module in the MDAF module are implemented as follows. Channel preprocessing module: Let the input be $X \in \mathbb{R}^{B \times C \times T \times H \times W}$, where B is the batch size (the number of data samples input at one time), C is the number of channels in the input data, T is the time dimension or sequence length, and H and W are the height and width of the input data in the spatial dimension. As shown in Figure 3, the channel preprocessing module obtains $X_{c\_avg}, X_{c\_max} \in \mathbb{R}^{B \times C \times 1 \times 1 \times 1}$ from the input $X$ through 3D average pooling and 3D max pooling, respectively. The two outputs are each passed through a shared-weight MLP layer and added element-wise, and the channel attention information $X_{ca} \in \mathbb{R}^{B \times C \times 1 \times 1 \times 1}$ is obtained by applying the sigmoid activation function. Multiplying $X_{ca}$ by the input $X$ yields the output $X_{c\_out} \in \mathbb{R}^{B \times C \times T \times H \times W}$ of the channel preprocessing module. The MLP layer of the channel preprocessing module consists of two 1D convolutions and a ReLU function.
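
The channel preprocessing step can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the two 1D convolutions are assumed to have kernel size 1 over the channel axis (so each reduces to a matrix product), biases are omitted, and the hidden width of the MLP is an arbitrary choice. The names `channel_preprocess`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_preprocess(x, w1, w2):
    """x: (B, C, T, H, W); w1: (C_hidden, C); w2: (C, C_hidden)."""
    x_avg = x.mean(axis=(2, 3, 4))  # 3D average pooling -> (B, C)
    x_max = x.max(axis=(2, 3, 4))   # 3D max pooling     -> (B, C)

    def shared_mlp(v):
        # Assumption: kernel-size-1 1D convolutions, i.e. two matrix products.
        hidden = np.maximum(v @ w1.T, 0.0)  # ReLU
        return hidden @ w2.T

    # The same MLP weights are applied to both pooled vectors, then summed.
    x_ca = sigmoid(shared_mlp(x_avg) + shared_mlp(x_max))  # (B, C)
    x_ca = x_ca[:, :, None, None, None]                    # (B, C, 1, 1, 1)
    return x_ca * x, x_ca                                  # X_c_out, X_ca

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 6, 6))
w1 = rng.standard_normal((2, 8)) * 0.1  # hidden width 2 is arbitrary
w2 = rng.standard_normal((8, 2)) * 0.1
x_c_out, x_ca = channel_preprocess(x, w1, w2)
print(x_c_out.shape, x_ca.shape)
```

Because $X_{ca}$ has singleton T, H, and W axes, the final multiplication broadcasts one weight per channel over every frame and pixel, which keeps the output the same size as the input.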

[0093] In this invention, the shared weight MLP in the CBAM module is improved from 2D convolution to 1D convolution, which reduces the amount of computation and isolates information from other dimensions, making the attention information more concentrated in the channel dimension.

[0094] Spatial shrinkage sampling module: As shown in Figure 5, the output $X_{c\_out}$ of the channel preprocessing module serves as the input to the spatial shrinkage sampling module. $X_{s\_avg}, X_{s\_max} \in \mathbb{R}^{B \times 1 \times T \times H \times W}$ are first obtained through average pooling and max pooling, and the two are then concatenated to obtain $X_s \in \mathbb{R}^{B \times 2 \times T \times H \times W}$. The spatial attention information $X_{sa} \in \mathbb{R}^{B \times 1 \times 1 \times H \times W}$ is then obtained through a dilated convolution followed by the sigmoid activation function.

[0095] Going from $X_s$ to $X_{sa}$ requires the temporal dimension T of $X_s \in \mathbb{R}^{B \times 2 \times T \times H \times W}$ to become 1. Because information in each dimension is to be processed separately, an ordinary 3D convolution is not ideal, since it would draw on temporal information while calculating spatial attention information. It is therefore preferable to use a 3D convolution that is dilated only in the temporal dimension. This reduces the reliance of spatial attention information on temporal information and shrinks the temporal dimension while the spatial attention information is calculated, thus expanding the receptive field and reducing information loss.
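
One way to picture the spatial shrinkage sampling step is the NumPy sketch below. The text does not give the kernel size or dilation rate, so the choices here are assumptions: an odd spatial kernel with zero padding, and temporal taps spaced by `dilation_t` so that they span exactly T frames and the time axis collapses to 1. The function name `spatial_shrinkage_attention` and the `weight` layout are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_shrinkage_attention(x_c_out, weight, dilation_t):
    """x_c_out: (B, C, T, H, W); weight: (2, k_t, k_h, k_w), one output channel."""
    x_avg = x_c_out.mean(axis=1, keepdims=True)   # (B, 1, T, H, W)
    x_max = x_c_out.max(axis=1, keepdims=True)    # (B, 1, T, H, W)
    x_s = np.concatenate([x_avg, x_max], axis=1)  # (B, 2, T, H, W)

    _, k_t, k_h, k_w = weight.shape
    b, _, t, h, w = x_s.shape
    if (k_t - 1) * dilation_t + 1 != t:
        raise ValueError("dilated temporal taps must span exactly T frames")
    pad_h, pad_w = k_h // 2, k_w // 2  # assumes odd spatial kernel sizes
    padded = np.pad(x_s, ((0, 0), (0, 0), (0, 0), (pad_h, pad_h), (pad_w, pad_w)))

    out = np.zeros((b, 1, 1, h, w))
    for i in range(k_t):
        frame = padded[:, :, i * dilation_t]  # dilation only along time
        for dy in range(k_h):
            for dx in range(k_w):
                patch = frame[:, :, dy:dy + h, dx:dx + w]  # (B, 2, H, W)
                out[:, 0, 0] += np.einsum("bchw,c->bhw", patch, weight[:, i, dy, dx])
    return sigmoid(out)  # X_sa: (B, 1, 1, H, W)

rng = np.random.default_rng(0)
x_c_out = rng.standard_normal((2, 8, 4, 6, 6))
weight = rng.standard_normal((2, 2, 3, 3)) * 0.1
x_sa = spatial_shrinkage_attention(x_c_out, weight, dilation_t=3)
print(x_sa.shape)
```

With T = 4, two temporal taps spaced 3 frames apart cover the first and last frame, so the temporal dimension is reduced without a dense temporal kernel.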

[0096] Temporal expansion focusing module: As shown in Figure 4, the output $X_{c\_out}$ of the channel preprocessing module is fed into the temporal expansion focusing module. First, the temporal information fusion expansion module expands the spatial information of the input $X_{c\_out} \in \mathbb{R}^{B \times C \times T \times H \times W}$ frame by frame into the temporal channel to obtain $X_t \in \mathbb{R}^{B \times C \times 4T \times H/2 \times W/2}$. $X_{t\_avg}, X_{t\_max} \in \mathbb{R}^{B \times 1 \times T \times 1 \times 1}$ are then obtained through 3D average pooling and 3D max pooling, respectively. The two outputs are each passed through a shared-weight MLP layer and added element-wise, and finally the temporal attention information $X_{ta} \in \mathbb{R}^{B \times 1 \times T \times 1 \times 1}$ is obtained by applying the sigmoid activation function. The MLP layer of the temporal expansion focusing module consists of two 1D convolutions and a ReLU function.

[0097] The temporal expansion focusing module divides each frame (spatial information) into several 2x2 blocks and concatenates pixels at the same position within these blocks to obtain four concatenated subframes. These four subframes are then sequentially concatenated to complete the information fusion within a single frame. Finally, the fused results of each single frame are concatenated in temporal order to obtain the output of the temporal information fusion expansion module, thus avoiding the loss of valuable information. It should be understood that improving the shared weight MLP in the CBAM module from 2D convolution to 1D convolution reduces computation while isolating information from other dimensions, allowing attention information to be more concentrated in the temporal dimension.
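
The block rearrangement described above can be sketched with a reshape and a transpose. This is a minimal NumPy sketch; the order of the four sub-frames (top-left, top-right, bottom-left, bottom-right) and the function name `expand_space_to_time` are assumptions, since the text only says the sub-frames are concatenated sequentially.

```python
import numpy as np

def expand_space_to_time(x):
    """(B, C, T, H, W) -> (B, C, 4T, H/2, W/2) by moving 2x2 block offsets into time."""
    b, c, t, h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must be even")
    # Split H and W into (block index, offset inside the 2x2 block).
    blocks = x.reshape(b, c, t, h // 2, 2, w // 2, 2)
    # Bring both in-block offsets next to the time axis: (B, C, T, 2, 2, H/2, W/2).
    blocks = blocks.transpose(0, 1, 2, 4, 6, 3, 5)
    # Merge (T, 2, 2) into 4T: each frame yields four consecutive sub-frames.
    return blocks.reshape(b, c, 4 * t, h // 2, w // 2)

x = np.arange(2 * 4 * 4).reshape(1, 1, 2, 4, 4)
x_t = expand_space_to_time(x)
print(x_t.shape)
print(x_t[0, 0, 0])  # pixels at offset (0, 0) of every 2x2 block in frame 0
```

No pixel is dropped or averaged: the output holds exactly the same values as the input, only reordered, which is consistent with the stated goal of avoiding loss of information.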

[0098] Multi-dimensional feature calculation module: The multi-dimensional feature calculation module receives multiple inputs. The spatial attention information $X_{sa}$ is multiplied by $X_{c\_out}$ to obtain the spatial attention activation output $X_{s\_out}$; the temporal attention information $X_{ta}$ is multiplied by $X_{c\_out}$ to obtain the temporal attention activation output $X_{t\_out}$. Multiplying the spatial attention by the temporal attention yields the spatiotemporal attention information $X_{sta} \in \mathbb{R}^{B \times 1 \times T \times H \times W}$, and multiplying $X_{sta}$ by $X_{c\_out}$ gives the output $X_{st\_out}$. The three components $X_{s\_out}$, $X_{t\_out}$, and $X_{st\_out}$ are added together with different weight ratios to obtain the output $X_{out}$ of the MDAF module. The weight ratios are obtained by training the network on different data and application scenarios and are empirical values; this invention does not impose constraints on their specific values.
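
The fusion step relies on broadcasting between attention maps of different shapes, which the following NumPy sketch makes explicit. It is an illustration only: the text describes the weight ratios as obtained by training, whereas here they are fixed placeholder values, and the function name `fuse_attention` is hypothetical.

```python
import numpy as np

def fuse_attention(x_c_out, x_sa, x_ta, weights=(1.0, 1.0, 1.0)):
    """x_c_out: (B, C, T, H, W); x_sa: (B, 1, 1, H, W); x_ta: (B, 1, T, 1, 1)."""
    x_s_out = x_sa * x_c_out    # spatial attention applied to the features
    x_t_out = x_ta * x_c_out    # temporal attention applied to the features
    x_sta = x_sa * x_ta         # broadcasts to (B, 1, T, H, W)
    x_st_out = x_sta * x_c_out  # spatiotemporal attention applied to the features
    w_s, w_t, w_st = weights    # placeholders; described as trained in the text
    x_out = w_s * x_s_out + w_t * x_t_out + w_st * x_st_out
    return x_out, x_sta

rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 8, 4, 6, 6))
sa = np.full((2, 1, 1, 6, 6), 0.5)
ta = np.full((2, 1, 4, 1, 1), 0.5)
fused, sta = fuse_attention(feat, sa, ta)
print(fused.shape, sta.shape)
```

The output has the same shape as $X_{c\_out}$, which matches the statement that the MDAF module's input and output sizes are identical.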

[0099] In summary, the MDAF module provided by this invention passes the output of the channel attention calculation simultaneously through a spatial attention module and a temporal attention module. Each module performs its own attention calculation, and finally the attention information from the three modules is fused in the feature fusion module to obtain the module's final output. This process uses each kind of attention information independently, making the most effective use of it, while also integrating it across the spatial and temporal dimensions to obtain more comprehensive information.

[0100] Step 5: Input the video to be detected into the pedestrian detection model and perform pedestrian keypoint recognition. Then, input the recognition results of the pedestrian keypoints into the trained pedestrian action recognition network to obtain the action category. That is, in this embodiment, the video to be detected is input into the network: first, pedestrians are detected by the Faster-RCNN network and the poses of their keypoints are estimated; then, the keypoint data are passed through the SlowFast-RCNN network with the added spatiotemporal attention module to obtain the detection result.

[0101] To fully compare the technical solution provided in this embodiment with the prior art, the experimental results are shown in Table 1 below (comparison of MDAF accuracy metrics with other existing methods in three datasets):

[0102] Table 1

[0103]

[0104] The MDAF module proposed in this invention demonstrates excellent performance on three datasets with PoseC3D as the baseline.

[0105] Example 2

[0106] This embodiment provides a system based on a video scene understanding method, which includes at least: a recognition network construction module for pedestrian video scenes, a pedestrian data acquisition module, a training module, and a detection module.

[0107] The pedestrian video scene recognition network construction module is used to add the multi-dimensional attention fusion module to the backbone network to build a pedestrian action recognition network. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0108] The pedestrian data acquisition module is used to acquire pedestrian key point pose data of each frame in the video dataset related to pedestrian behavior and integrate it into a 3D volumetric heat map. Pedestrian key points represent the positions of human joints.

[0109] The training module is used to train a pedestrian action recognition network using a 3D volumetric heatmap that integrates pedestrian keypoint pose data from each frame of the video dataset.

[0110] The detection module is used to perform video understanding on the video to be detected using a trained pedestrian action recognition network. The input of the pedestrian action recognition network is a 3D volumetric heatmap integrated based on pedestrian key point pose data; the output is the pedestrian action category, which is the content of video understanding.

[0111] In some embodiments, the pedestrian data acquisition module can be divided into a dataset construction module, a pedestrian detection model construction module, and a key point pose recognition module. The dataset construction module is used to establish a video dataset and a pedestrian image dataset about pedestrian behavior. The pedestrian detection model construction module is used to train a pedestrian key point detection network using the pedestrian image dataset to obtain a pedestrian detection model. The input of the pedestrian detection model is a pedestrian image, and the output is the pedestrian's bounding box. The key point pose recognition module is used to perform frame-by-frame recognition on the video dataset using the pedestrian detection model to obtain the pedestrian recognition result for each frame of the image, and to recognize the pose of the pedestrian's key points based on the pedestrian recognition result.

[0112] It should be understood that the specific implementation process of each module is described in the above method. This invention will not repeat the details here. The above division of functional modules is only for illustrative purposes. In some embodiments, some functional modules can be combined and some functional modules can be separated. Each functional module can be implemented in software, hardware, or a combination of software and hardware. The software and hardware devices include, but are not limited to, general-purpose computer equipment, programmable gate arrays, digital signal processors, microprocessors and their corresponding programming or burning software.

[0113] Example 3:

[0114] This invention provides an electronic terminal, including: one or more processors, and a memory storing one or more computer programs;

[0115] The processor calls a computer program to implement a video scene understanding method.

[0116] For example, perform / implement the following steps:

[0117] A multi-dimensional attention fusion module is added to the backbone network to build a recognition network for pedestrian video scenarios. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0118] The pedestrian data of each frame in the video dataset is obtained, and then the pedestrian data of each frame in the video dataset is used to train a recognition network for pedestrian video scenes to obtain the content of video understanding.

[0119] A pre-trained pedestrian video scene recognition network is used to perform video understanding on the video to be detected.

[0120] Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

[0121] Taking pedestrian action recognition as an example, the following steps are performed:

[0122] A multi-dimensional attention fusion module is added to the backbone network to construct a pedestrian action recognition network. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0123] The pedestrian keypoint pose data of each frame in the video dataset related to pedestrian behavior is obtained and integrated into a 3D volumetric heatmap. The pedestrian keypoints represent the positions of human joints.

[0124] A pedestrian action recognition network is trained using a 3D volumetric heatmap that integrates pedestrian keypoint pose data from each frame of the video dataset.

[0125] The trained pedestrian action recognition network is used to perform video understanding on the video to be detected.

[0126] Alternatively, perform / implement the following steps:

[0127] Step 1: Create a video dataset and an image dataset of pedestrians.

[0128] Step 2: Train a pedestrian keypoint detection network using a pedestrian image dataset to obtain a pedestrian detection model. The input of the pedestrian detection model is the pedestrian image, and the output is the bounding box of the pedestrian.

[0129] Step 3: Use the pedestrian detection model to perform frame-by-frame recognition on the video dataset to obtain the pedestrian recognition results for each frame of the image, and identify the pose of pedestrian key points based on the pedestrian recognition results. Pedestrian key points represent the positions of human joints.

[0130] Step 4: Add the multi-dimensional attention fusion module to the backbone network to build a pedestrian action recognition network, and then use the 3D volumetric heatmap integrated from the pedestrian key point pose data of each frame of the video data to train the pedestrian action recognition network.

[0131] The pedestrian action recognition network takes a 3D volumetric heatmap as input based on pedestrian keypoint pose data and outputs the pedestrian action category, which is the content of video understanding.

[0132] Step 5: Input the video to be detected into the pedestrian detection model and perform pedestrian key point recognition. Then input the recognition results of the pedestrian key points into the pedestrian action recognition network to obtain the action category.

[0133] Please refer to the explanation of the method above for the specific implementation process of each step.

[0134] It should be understood that, in the embodiments of the present invention, the processor may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The memory may include read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

[0135] Example 4:

[0136] This invention provides a computer-readable storage medium storing a computer program that is called by a processor to implement a video scene understanding method.

[0137] For example, perform / implement the following steps:

[0138] A multi-dimensional attention fusion module is added to the backbone network to build a recognition network for pedestrian video scenarios. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0139] The pedestrian data of each frame in the video dataset is obtained, and then the pedestrian data of each frame in the video dataset is used to train a recognition network for pedestrian video scenes to obtain the content of video understanding.

[0140] A pre-trained pedestrian video scene recognition network is used to perform video understanding on the video to be detected.

[0141] Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

[0142] Taking pedestrian action recognition as an example, the following steps are performed:

[0143] A multi-dimensional attention fusion module is added to the backbone network to construct a pedestrian action recognition network. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, time attention and spatial attention information.

[0144] The pedestrian keypoint pose data of each frame in the video dataset related to pedestrian behavior is obtained and integrated into a 3D volumetric heatmap. The pedestrian keypoints represent the positions of human joints.

[0145] A pedestrian action recognition network is trained using a 3D volumetric heatmap that integrates pedestrian keypoint pose data from each frame of the video dataset.

[0146] The trained pedestrian action recognition network is used to perform video understanding on the video to be detected.

[0147] Alternatively, perform / implement the following steps:

[0148] Step 1: Create a video dataset and an image dataset of pedestrians.

[0149] Step 2: Train a pedestrian keypoint detection network using a pedestrian image dataset to obtain a pedestrian detection model. The input of the pedestrian detection model is the pedestrian image, and the output is the bounding box of the pedestrian.

[0150] Step 3: Use the pedestrian detection model to perform frame-by-frame recognition on the video dataset to obtain the pedestrian recognition results for each frame of the image, and identify the pose of pedestrian key points based on the pedestrian recognition results. Pedestrian key points represent the positions of human joints.

[0151] Step 4: Add the multi-dimensional attention fusion module to the backbone network to build a pedestrian action recognition network, and then use the 3D volumetric heatmap integrated from the pedestrian key point pose data of each frame of the video data to train the pedestrian action recognition network.

[0152] The pedestrian action recognition network takes a 3D volumetric heatmap as input based on pedestrian keypoint pose data and outputs the pedestrian action category, which is the content of video understanding.

[0153] Step 5: Input the video to be detected into the pedestrian detection model and perform pedestrian key point recognition. Then input the recognition results of the pedestrian key points into the pedestrian action recognition network to obtain the action category.

[0154] Please refer to the explanation of the method above for the specific implementation process of each step.

[0155] The readable storage medium is a computer-readable storage medium, which can be an internal storage unit of the hardware and software device described in any of the foregoing embodiments, such as the hard drive or memory of the controller. The readable storage medium can also be an external storage device of the controller, such as a plug-in hard drive, Smart MediaCard (SMC), Secure Digital (SD) card, or Flash Card equipped on the controller. Further, the readable storage medium can include both internal storage units and external storage devices of the controller. The readable storage medium is used to store the computer program and other programs and data required by the controller. The readable storage medium can also be used to temporarily store data that has been output or will be output.

[0156] Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned readable storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0157] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application; instructions executed by a processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable storage medium capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable apparatus, provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

[0158] It should be emphasized that the examples described in this invention are illustrative rather than limiting. Therefore, this invention is not limited to the examples described in the specific embodiments. Any other embodiments derived by those skilled in the art based on the technical solutions of this invention, without departing from the spirit and scope of this invention, whether modifications or substitutions, are also within the protection scope of this invention.

Claims

1. A video scene understanding method based on attention fusion, characterized in that it includes the following steps: A multi-dimensional attention fusion module is added to the backbone network to construct a recognition network for pedestrian video scenes. The multi-dimensional attention fusion module calculates the attention of the three dimensions of channel, time and space independently, and then fuses the channel attention, temporal attention and spatial attention information. The multi-dimensional attention fusion module includes a channel preprocessing module, a spatial shrinkage sampling module, a temporal expansion focusing module, and a multi-dimensional feature calculation module. The channel preprocessing module, spatial shrinkage sampling module, and temporal expansion focusing module are used to acquire channel attention, spatial attention, and temporal attention, respectively. The output $X_{c\_out}$ obtained by the channel preprocessing module is used as the input to both the spatial shrinkage sampling module and the temporal expansion focusing module; The multi-dimensional attention fusion module is used to fuse channel attention, temporal attention, and spatial attention information, and the processing procedure is as follows: the output $X_{sa}$ of the spatial shrinkage sampling module and the output $X_{ta}$ of the temporal expansion focusing module are each multiplied by the output $X_{c\_out}$ to obtain the spatial attention activation output $X_{s\_out}$ and the temporal attention activation output $X_{t\_out}$; the output $X_{sa}$ is multiplied by the output $X_{ta}$ to yield the spatiotemporal attention information $X_{sta}$, which is multiplied by the output $X_{c\_out}$ to obtain the output $X_{st\_out}$; finally, $X_{s\_out}$, $X_{t\_out}$ and $X_{st\_out}$ are weighted according to a preset weight ratio to calculate the output of the multi-dimensional feature calculation module, which is the output of the multi-dimensional attention fusion module. The processing procedure of the temporal expansion focusing module is as follows: the spatial information of the input $X_{c\_out}$ is expanded frame by frame into the time channel to obtain $X_t$; then $X_{t\_avg}$ and $X_{t\_max}$ are obtained through 3D average pooling and 3D max pooling, respectively; next, $X_{t\_avg}$ and $X_{t\_max}$ are passed through an MLP layer with shared weights and added element-wise, and the output $X_{ta}$, namely the temporal attention information, is obtained by applying the sigmoid activation function; wherein the MLP layer consists of two 1D convolutions and a ReLU function; The processing procedure of the spatial shrinkage sampling module is as follows: $X_{s\_avg}$ and $X_{s\_max}$ are obtained from the input $X_{c\_out}$ through average pooling and max pooling; then $X_{s\_avg}$ and $X_{s\_max}$ are concatenated to obtain $X_s$; finally, the output $X_{sa}$, namely the spatial attention information, is obtained through a dilated convolution followed by the sigmoid activation function; The pedestrian data of each frame in the video dataset is obtained, and then the pedestrian data of each frame in the video dataset is used to train the recognition network in the pedestrian video scene to obtain the content of video understanding. The trained pedestrian video scene recognition network is used to perform video understanding on the video to be detected. Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

2. The method according to claim 1, characterized in that: The video understanding is pedestrian action recognition, and the corresponding recognition network in the pedestrian video scene is a pedestrian action recognition network; The pedestrian data is a 3D volumetric heatmap integrated based on pedestrian keypoint pose data. Pedestrian keypoints represent the positions of human joints, and the pedestrian keypoint pose data includes joint positions and limbs represented by lines connecting the joints. The output data of the recognition network in the pedestrian video scene is the pedestrian action category.

3. The method according to claim 2, characterized in that: The backbone network of the pedestrian action recognition network is the Slow Path of SlowFast-RCNN, and the multi-dimensional attention fusion module is added to each res-layer residual layer of ResNet-3D.

4. The method according to claim 1, characterized in that: After the input $X$ of the multi-dimensional attention fusion module enters the channel preprocessing module, the processing procedure is as follows: First, $X_{c\_avg}$ and $X_{c\_max}$ are obtained through 3D average pooling and 3D max pooling, respectively; Then, $X_{c\_avg}$ and $X_{c\_max}$ are passed through a shared-weight MLP layer and added element-wise, and the channel attention information $X_{ca}$ is obtained by applying the sigmoid activation function; Finally, the channel attention information $X_{ca}$ is multiplied by the input $X$ to obtain the output $X_{c\_out}$.

5. A system based on the video scene understanding method according to any one of claims 1-4, characterized in that it includes at least: A pedestrian video scene recognition network construction module, used to add a multi-dimensional attention fusion module to the backbone network to construct a pedestrian video scene recognition network, wherein the multi-dimensional attention fusion module calculates the attention of the channel, time and space dimensions independently, and then fuses the channel attention, temporal attention and spatial attention information; A pedestrian data acquisition module, used to acquire pedestrian data for each frame of the video dataset; A training module, used to train the pedestrian video scene recognition network using pedestrian data from each frame of the video dataset to obtain the content of video understanding; A detection module, used to perform video understanding on the video to be detected using the trained pedestrian video scene recognition network; Specifically, the input and output data of the recognition network in the pedestrian video scene are set according to the target of video understanding. The input data corresponds to pedestrian data, and the output data is the content of video understanding.

6. An electronic terminal, characterized in that it includes: one or more processors; and a memory that stores one or more computer programs; wherein the processor calls the computer program to implement the video scene understanding method according to any one of claims 1-4.

7. A computer-readable storage medium, characterized in that: a computer program is stored thereon, and the computer program is invoked by a processor to implement the video scene understanding method according to any one of claims 1-4.