Human behavior recognition method and device, electronic equipment and readable storage medium
By extracting deep features from I-frames and residuals in compressed videos and fusing them, local and global spatiotemporal features are generated, solving the problem of low accuracy in human behavior recognition in existing technologies and achieving higher recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2022-10-20
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, the human dynamic features extracted from compressed videos cannot fully reflect the dynamic information in the videos, resulting in low accuracy in video human behavior recognition.
By extracting compressed domain information, including I-frames, residuals, and motion vectors, from compressed video data, deep feature fusion processing is performed to obtain local spatiotemporal features. Adjacent local spatiotemporal features are then fused to generate global spatiotemporal features. Finally, the human behavior recognition result is determined based on the global spatiotemporal features, motion vectors, and residuals.
It improves the accuracy of human behavior recognition. By fully complementing the information of I-frames and residuals, it obtains local spatiotemporal features with stronger expressive power. Extracting global spatiotemporal features is beneficial for behavior recognition and helps to identify target features that better reflect human behavior.
Figure CN115909479B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to a method, apparatus, electronic device, and readable storage medium for human behavior recognition.
Background Technology
[0002] With the development of computer vision technology, video human behavior recognition technology has been widely applied in various fields. As a crucial component of intelligent video analysis, video human behavior recognition technology refers to classifying the actions performed by human targets in a video and then identifying the type of human behavior. As a current hot research topic in video analysis and understanding, video human behavior recognition technology has broad application prospects in human-computer interaction, autonomous driving, intelligent monitoring, security, and motion analysis.
[0003] The core of video human behavior recognition algorithms lies in extracting the dynamic features of the human body in a video. In practical applications, to facilitate video transmission and storage, most videos on the internet are compressed and encoded. Therefore, it is necessary to decode the compressed videos to extract the dynamic features of the human body.
[0004] However, in related technologies, the human dynamic features extracted from compressed videos cannot fully reflect the dynamic information in the videos, resulting in a low accuracy rate for recognizing human behavior in videos.
Summary of the Invention
[0005] To address the problems existing in the prior art, the present invention provides a human behavior recognition method, device, electronic device, and readable storage medium.
[0006] This invention provides a method for human behavior recognition, comprising:
[0007] Extract compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals;
[0008] The deep features corresponding to each I-frame and each target residual are fused to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; two adjacent local spatiotemporal features are then fused to obtain the global spatiotemporal features corresponding to the compressed video data;
[0009] Based on the global spatiotemporal features, the motion vector, and the residual, the target features corresponding to the compressed video data are determined, and the human behavior recognition result corresponding to the compressed video data is determined based on the target features.
[0010] Optionally, extracting compression domain information from compressed video data includes:
[0011] The compressed video data is entropy decoded to obtain the motion vector in the compressed domain and the discrete cosine transform coefficients corresponding to each macroblock in the compressed domain;
[0012] The discrete cosine transform coefficients are subjected to inverse discrete cosine transform to obtain the residual corresponding to each video frame in the compressed domain.
[0013] Each target macroblock is identified, and each target macroblock is subjected to loop filtering to obtain the I-frame, wherein the target macroblock is the macroblock corresponding to the I-frame.
[0014] Optionally, after extracting compression domain information from the compressed video data, the method further includes:
[0015] The compressed domain information is preprocessed to obtain preprocessed compressed domain information.
[0016] Optionally, the step of fusing the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame includes:
[0017] For each I-frame and its corresponding target residual, the I-frame is input into a first network model to obtain a first shallow feature corresponding to the I-frame; the target residual is input into a second network model to obtain a second shallow feature corresponding to the target residual. The first network model and the second network model each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer.
[0018] The first shallow feature and the second shallow feature are fused together to obtain the fused features corresponding to the I-frame and the target residual;
[0019] The fused features are input into the third network model to obtain the first deep features corresponding to the I-frame; the second shallow features are input into the fourth network model to obtain the second deep features corresponding to the target residual. The third network model and the fourth network model include multiple convolutional layers, and each convolutional layer includes a two-dimensional convolutional layer, a pooling layer and a batch normalization layer.
[0020] The first deep feature and the second deep feature are fused together to obtain the local spatiotemporal features corresponding to the I-frame.
[0021] Optionally, the step of fusing two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data includes:
[0022] The local spatiotemporal features corresponding to each I-frame are subjected to feature enhancement processing to obtain the target local spatiotemporal enhancement features corresponding to each I-frame;
[0023] Two adjacent target local spatiotemporal enhancement features are fused to obtain the global spatiotemporal features corresponding to the compressed video data.
[0024] Optionally, the step of performing feature enhancement processing on the local spatiotemporal features corresponding to each of the I-frames to obtain the target local spatiotemporal enhancement features corresponding to each of the I-frames includes:
[0025] For each of the I-frames, the local spatiotemporal features are input into the fifth network model to obtain the local spatiotemporal enhancement features corresponding to the local spatiotemporal features. The fifth network model includes a single convolutional layer, which includes a two-dimensional convolutional layer and a batch normalization layer.
[0026] The difference between two adjacent local spatiotemporal enhancement features is obtained by subtracting the local spatiotemporal enhancement features of two adjacent I-frames.
[0027] Based on the difference in local spatiotemporal enhancement features, determine the attention weights corresponding to the local spatiotemporal enhancement features;
[0028] The attention weights and the local spatiotemporal enhancement features are fused together to obtain the target local spatiotemporal enhancement features corresponding to the I-frame.
[0029] Optionally, fusing two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data includes:
[0030] The channel features corresponding to two adjacent target local spatiotemporal enhancement features are replaced to obtain the global spatiotemporal features corresponding to the compressed video data.
[0031] Optionally, determining the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector, and the residual includes:
[0032] The motion vector is input into the first residual network model to obtain the motion features corresponding to the compressed video data, wherein the first residual network model is a ResNet18 network model;
[0033] Each of the second deep features is input into the second residual network model to obtain the residual features corresponding to the compressed video data, wherein the second residual network model is a ResNet50 network model;
[0034] The motion features, residual features, and global spatiotemporal features are fused together to obtain the target features.
[0035] The present invention also provides a human behavior recognition device, comprising:
[0036] An extraction module is used to extract compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals;
[0037] The fusion module is used to fuse the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; and to fuse two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0038] The determination module is used to determine the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector, and the residual, and to determine the human behavior recognition result corresponding to the compressed video data based on the target features.
[0039] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement any of the human behavior recognition methods described above.
[0040] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the human behavior recognition method as described above.
[0041] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the human behavior recognition method as described above.
[0042] In the human behavior recognition method, device, electronic device, and readable storage medium provided by this invention, fusing the deep features of each I-frame and each target residual in the compressed domain information allows the information of each I-frame and each target residual to fully complement each other, thereby obtaining local spatiotemporal features corresponding to the I-frame with stronger expressive power. Fusing two adjacent local spatiotemporal features extracts the global spatiotemporal features corresponding to the compressed video data, which are more conducive to behavior recognition tasks. Based on the global spatiotemporal features, motion vectors, and residuals, target features that better reflect human behavior can be determined, and human behavior can be recognized based on these target features, thereby improving the accuracy of human behavior recognition.
Brief Description of the Drawings
[0043] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0044] Figure 1 is the first flowchart of the human behavior recognition method provided by the present invention;
[0045] Figure 2 is the second flowchart of the human behavior recognition method provided by the present invention;
[0046] Figure 3 is a schematic diagram of the structure of the human behavior recognition device provided by the present invention;
[0047] Figure 4 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0048] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0049] With the development of computer vision technology, video human behavior recognition technology has been widely applied in various fields. As a crucial component of intelligent video analysis, video human behavior recognition technology refers to classifying the actions performed by human targets in a video and then identifying the type of human behavior. As a current hot research topic in video analysis and understanding, video human behavior recognition technology has broad application prospects in human-computer interaction, autonomous driving, intelligent monitoring, security, and motion analysis.
[0050] The core of video human behavior recognition algorithms lies in extracting the dynamic features of the human body in a video. In practical applications, to facilitate video transmission and storage, most videos on the internet are compressed and encoded. Therefore, it is necessary to decode the compressed videos to extract the dynamic features of the human body.
[0051] In related technologies, methods for decoding compressed video and extracting dynamic features of the human body mainly fall into two categories:
[0052] The first type involves fully decoding the compressed video to obtain an image sequence, and then extracting dynamic features from the video based on these image sequences. Examples include extracting optical flow between video frames and using 3D convolution to automatically learn dynamic features from the video.
[0053] However, fully decoding the compressed video to obtain video frames incurs unavoidable computational overhead, and extracting dynamic information from the decoded frames adds to that burden, making this approach unsuitable for production settings with limited computing resources.
[0054] The second category is behavior recognition methods based on video compression domain information. This method only performs partial decoding on the compressed video, saving some computational overhead.
[0055] However, in current behavior recognition methods based on video compression domain information, the human dynamic features extracted from the compression domain information cannot fully reflect the dynamic information in the video, resulting in low accuracy in recognizing human behavior in videos.
[0056] Therefore, to address the above technical problems, the present invention provides a human behavior recognition method that can improve the accuracy of human behavior recognition.
[0057] The human behavior recognition method provided by this invention is described in detail below with reference to Figures 1-2.
[0058] Referring to Figure 1, which is the first flowchart of the human behavior recognition method provided by the present invention, the method specifically includes steps 101 to 103.
[0059] Step 101: Extract compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals.
[0060] It should be noted that the method of this invention can be executed by any electronic device with a human behavior recognition function, such as a smartphone, smartwatch, desktop computer, or laptop computer.
[0061] In this embodiment, the compression domain information needs to be extracted from the compressed video data, which is obtained by compressing video data using the H.264 compression standard. The H.264 compressed video data includes compression domain information, which includes multiple I-frames, residuals, and motion vectors.
[0062] It should be noted that adjacent frames in video data are often quite similar. Therefore, the H.264 compression algorithm uses this characteristic to divide the video data into a series of Groups of Pictures (GOPs). Each GOP only stores the RGB image of the first frame (called the I-frame), while recording the motion vectors and residuals of subsequent frames (called P-frames) relative to the I-frame.
[0063] I-frames, motion vectors, and residuals are collectively referred to as compressed domain information. Motion vectors describe the displacement of a pixel block in a P-frame relative to the most similar pixel block in an I-frame, while residuals refer to the color difference between these two pixel blocks. Therefore, there is a correspondence between I-frames and residuals.
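As an illustration of the relationship described above, the following minimal Python sketch rebuilds one P-frame block from an I-frame, a motion vector, and a residual. The function name `reconstruct_block`, the single-channel toy frame, and the `(dy, dx)` ordering of the motion vector are assumptions made for this example; an actual H.264 decoder operates on macroblock partitions and is considerably more involved.

```python
def reconstruct_block(reference, top, left, mv, residual):
    """Rebuild one P-frame block from a reference frame.

    reference: 2D list of pixel values (the I-frame, one channel here).
    top, left: position of the block inside the P-frame.
    mv:        (dy, dx) offset pointing at the matching block in the reference
               (ordering and sign convention are assumptions of this sketch).
    residual:  2D list holding the per-pixel difference stored for the block.
    """
    dy, dx = mv
    size = len(residual)
    block = []
    for y in range(size):
        row = []
        for x in range(size):
            # Motion-compensated prediction taken from the reference frame
            predicted = reference[top + dy + y][left + dx + x]
            # Adding the residual corrects what the prediction gets wrong
            row.append(predicted + residual[y][x])
        block.append(row)
    return block


# Toy 4x4 reference frame and a 2x2 block located at (2, 2) in the P-frame
i_frame = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
    [40, 41, 42, 43],
]
motion_vector = (-1, -1)       # best match sits one pixel up and to the left
residual = [[1, 0], [0, -1]]   # remaining difference after motion compensation
p_block = reconstruct_block(i_frame, 2, 2, motion_vector, residual)
```

The sketch shows why the three kinds of compressed domain information belong together: the I-frame supplies appearance, the motion vector says where to read it from, and the residual carries whatever the shifted block fails to explain.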
[0064] Step 102: The deep features corresponding to each I-frame and each target residual are fused to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; the two adjacent local spatiotemporal features are fused to obtain the global spatiotemporal features corresponding to the compressed video data.
[0065] In this embodiment, after extracting the compression domain information from the compressed video data, it is necessary to fuse the deep features corresponding to each I-frame and the deep features corresponding to each target residual to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame.
[0066] The local spatiotemporal features corresponding to each I-frame are used to reflect the local dynamic features of the video data. After obtaining the local spatiotemporal features corresponding to each I-frame, it is necessary to fuse adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data. These global spatiotemporal features are used to reflect the global dynamic features of the video data. The global spatiotemporal features obtained by fusing the local spatiotemporal features are more conducive to human behavior recognition tasks.
[0067] Step 103: Based on the global spatiotemporal features, the motion vector, and the residual, determine the target features corresponding to the compressed video data, and determine the human behavior recognition result corresponding to the compressed video data based on the target features.
[0068] In this embodiment, after fusing two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data, it is necessary to determine the target features corresponding to the compressed video data based on the global spatiotemporal features, motion vectors, and residuals.
[0069] The target feature represents the human dynamic feature information corresponding to the compressed video data. Based on the target feature, the human behavior recognition result corresponding to the compressed video data can be better determined.
[0070] Based on the target features, the predicted probability of each behavior category in the compressed video data can be obtained. Based on the predicted probability, the human behavior recognition result corresponding to the compressed video data can be determined.
[0071] For example, based on the target features, if the probability of human behavior in compressed video data being running is 5% and the probability of human behavior being dancing is 95%, then the human behavior recognition result corresponding to the compressed video data is determined to be dancing.
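The decision rule in this example can be sketched as follows. The source only states that a predicted probability is obtained for each behavior category; the use of a softmax over per-category scores, the function name `predict_behavior`, and the score values are assumptions chosen to reproduce the 5% / 95% split.

```python
import math


def predict_behavior(scores, labels):
    """Convert per-category scores to probabilities and pick the most likely one.

    A softmax is assumed here; the source does not name the normalisation.
    """
    peak = max(scores)                           # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


labels = ["running", "dancing"]
scores = [0.0, math.log(19.0)]   # hypothetical scores giving 5% and 95%
label, probs = predict_behavior(scores, labels)
```

With these scores the category with the highest probability is "dancing", matching the example above.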
[0072] The human behavior recognition method provided by this invention fuses the deep features of each I-frame and each target residual in the compressed domain information, enabling the information of each I-frame and each target residual in the compressed domain information to be fully complementary, thereby obtaining the local spatiotemporal features corresponding to the I-frame with stronger expressive power; by fusing two adjacent local spatiotemporal features, the global spatiotemporal features corresponding to the compressed video data can be extracted, which are more conducive to behavior recognition tasks; based on the global spatiotemporal features, motion vectors and residuals, target features that better reflect human behavior can be determined, and human behavior can be recognized based on these target features, thereby improving the accuracy of human behavior recognition.
[0073] The specific implementation of the human behavior recognition method provided by the present invention will be described in detail below.
[0074] Optionally, in one possible implementation of this invention, the extraction of compression domain information from compressed video data can be achieved through the following steps, specifically including steps 1) to 3):
[0075] Step 1) Perform entropy decoding on the compressed video data to obtain the motion vector in the compressed domain and the discrete cosine transform coefficients corresponding to each macroblock in the compressed domain;
[0076] Step 2) Perform inverse discrete cosine transform on the discrete cosine transform coefficients to obtain the residual corresponding to each video frame in the compressed domain;
[0077] Step 3) Identify each target macroblock and perform loop filtering on each target macroblock to obtain the I-frame, wherein the target macroblock is the macroblock corresponding to the I-frame.
[0078] In this embodiment, entropy decoding is first performed on the H.264-compressed video data. That is, the compressed video data is only partially decoded to obtain the motion vectors in the compressed domain and the discrete cosine transform coefficients corresponding to each macroblock in the compressed domain.
[0079] Here, for the i-th frame $f_i$ in the video data, the motion vector corresponding to the macroblock in the m-th row and n-th column can be represented as $MV_i(m, n)$; the discrete cosine transform coefficients $DCT_{m,n}$ of each macroblock reflect the richness of texture in the residual.
[0080] After obtaining the $DCT_{m,n}$ corresponding to each macroblock, an inverse discrete cosine transform is performed on $DCT_{m,n}$ to obtain the residual corresponding to each video frame in the compressed domain.
[0081] Specifically, the residual corresponding to the macroblock in the m-th row and n-th column of the i-th frame can be calculated using formula (1), which is shown below:
[0082] $$RES_{i,m,n}(x, y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, DCT_{i,m,n}(u, v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N} \quad (1)$$
[0083] Here, $RES_{i,m,n} \in \mathbb{R}^{N \times N}$ represents the residual obtained by applying the inverse transform to the discrete cosine transform coefficients $DCT_{i,m,n}$; $\mathbb{R}^{N \times N}$ denotes an N×N two-dimensional matrix; $DCT_{i,m,n}(u, v)$ represents the value at position $(u, v)$ of the discrete cosine transform coefficients corresponding to the macroblock in the m-th row and n-th column of the i-th frame of the video; $x, y$ represent the indices of the x-th row and y-th column within the macroblock in the m-th row and n-th column of the i-th frame, $x, y \in \{0, 1, \ldots, 15\}$; $c(u)$ and $c(v)$ represent the coefficients in the discrete cosine transform formula, where $c(u)$ and $c(v)$ can be represented by the following formulas (2) and (3) respectively:
[0084] $$c(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases} \quad (2)$$
[0085] $$c(v) = \begin{cases} \sqrt{1/N}, & v = 0 \\ \sqrt{2/N}, & v \neq 0 \end{cases} \quad (3)$$
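The inverse transform of step 2) can be sketched in plain Python as follows. This is a minimal sketch assuming the textbook orthonormal two-dimensional inverse DCT, with $c(0) = \sqrt{1/N}$ and $c(k) = \sqrt{2/N}$ for $k \neq 0$; the function names `c`, `basis`, `idct2`, and `dct2` and the 4×4 toy block are assumptions for illustration, and the code is a floating-point reference, not a bit-exact codec routine.

```python
import math


def c(k, n):
    """Normalisation coefficient of the orthonormal DCT (assumed form)."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def basis(pos, freq, n):
    """Cosine basis value for sample index `pos` and frequency index `freq`."""
    return math.cos((2 * pos + 1) * freq * math.pi / (2 * n))


def idct2(coeffs):
    """Inverse 2D DCT: N x N coefficients -> N x N residual block."""
    n = len(coeffs)
    return [
        [
            sum(
                c(u, n) * c(v, n) * coeffs[u][v] * basis(x, u, n) * basis(y, v, n)
                for u in range(n)
                for v in range(n)
            )
            for y in range(n)
        ]
        for x in range(n)
    ]


def dct2(block):
    """Forward 2D DCT, included only to check that idct2 inverts it."""
    n = len(block)
    return [
        [
            c(u, n) * c(v, n) * sum(
                block[x][y] * basis(x, u, n) * basis(y, v, n)
                for x in range(n)
                for y in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


# Toy 4x4 residual block (the text uses 16x16 macroblocks)
residual = [
    [1.0, -2.0, 0.0, 3.0],
    [0.0, 4.0, -1.0, 2.0],
    [5.0, 0.0, 0.0, -3.0],
    [-1.0, 1.0, 2.0, 0.0],
]
recovered = idct2(dct2(residual))
max_error = max(
    abs(recovered[x][y] - residual[x][y]) for x in range(4) for y in range(4)
)
```

The round trip through `dct2` and `idct2` recovers the block up to floating-point error, which is a quick way to confirm that the normalisation coefficients are consistent.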
[0086] After obtaining the residual corresponding to each video frame, it is necessary to identify the target macroblock corresponding to each I-frame, and then perform loop filtering on each target macroblock to obtain each I-frame in the compressed domain information.
[0087] For example, we define the macroblock M(i,j) in the i-th row and j-th column as the target macroblock corresponding to a certain I-frame in the video. First, we obtain the pixel values of the macroblocks adjacent to the target macroblock M(i,j). Then, we calculate the value of each pixel in the current target macroblock M(i,j) using the intra-frame prediction method. Finally, we sum the value of each pixel with the residual corresponding to the target macroblock M(i,j) to obtain the final pixel value of the target macroblock M(i,j).
[0088] Then, the final pixel values of the target macroblock M(i,j) are subjected to loop filtering to obtain the I-frame corresponding to macroblock M(i,j). Loop filtering eliminates block artifacts in the video frame, thereby improving the quality of the I-frame.
[0089] In the above embodiments, by partially decoding the compressed video data to obtain the motion vectors and residuals in the compressed domain, the computational overhead of decoding can be saved and the decoding efficiency can be improved; at the same time, the target macroblock is subjected to loop filtering, which can eliminate the block artifacts in the video frame and thus improve the quality of the I-frame.
[0090] Optionally, in one possible implementation of the present invention, after extracting the compression domain information from the compressed video data, the following step is further included:
[0091] The compressed domain information is preprocessed to obtain preprocessed compressed domain information.
[0092] In this embodiment, after extracting the compression domain information from the compressed video data, it is also necessary to preprocess the I-frames, residuals, and motion vectors in the video data to enhance the compression domain information and obtain the preprocessed compression domain information.
[0093] In practical applications, each I-frame can be subjected to data augmentation and preprocessing such as multi-scale cropping, random horizontal flipping, and normalization. To stay aligned with the cropping position of each I-frame, the motion vectors and residuals must be cropped with the same multi-scale crop.
[0094] It is important to note that when a motion vector is randomly flipped horizontally, its horizontal component must be negated so that the motion direction it describes remains accurate; the motion vectors and residuals are then normalized.
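The horizontal flip of a motion-vector field can be sketched as below. The layout (`mv_field[row][col]` holding a `(dx, dy)` pair) and the function names are assumptions for illustration; the point is that mirroring the grid alone is not enough, since a block that moved right in the original moves left in the mirrored frame.

```python
def hflip_plane(plane):
    """Mirror a 2D plane (I-frame channel or residual) left to right."""
    return [list(reversed(row)) for row in plane]


def hflip_motion_vectors(mv_field):
    """Mirror a motion-vector field left to right.

    Each entry is a (dx, dy) pair (assumed layout). Positions are mirrored
    and the horizontal component dx changes sign; dy is unchanged.
    """
    return [[(-dx, dy) for dx, dy in reversed(row)] for row in mv_field]


mv = [
    [(1, 0), (2, -1)],
    [(0, 3), (-4, 5)],
]
flipped = hflip_motion_vectors(mv)
restored = hflip_motion_vectors(flipped)   # flipping twice returns the input
```

The same random flip decision has to be applied to the I-frame, the residual, and the motion vectors of a sample so that the three inputs stay spatially aligned.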
[0095] In the above embodiments, by preprocessing the compressed domain information, it is possible to enhance the compressed domain information. Based on the preprocessed compressed domain information, the accuracy of human behavior recognition can be improved.
[0096] Optionally, in one possible implementation of the present invention, the fusion processing of the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame is specifically implemented through the following steps, including steps (a) to (d):
[0097] Step (a): For each I-frame and its corresponding target residual, the I-frame is input into the first network model to obtain the first shallow feature corresponding to the I-frame; the target residual is input into the second network model to obtain the second shallow feature corresponding to the target residual, wherein the first network model and the second network model include a single convolutional layer, and the single convolutional layer includes a two-dimensional convolutional layer, a pooling layer and a batch normalization layer;
[0098] Step (b): The first shallow feature and the second shallow feature are fused to obtain the fused features corresponding to the I-frame and the target residual;
[0099] Step (c): Input the fused features into the third network model to obtain the first deep features corresponding to the I-frame; input the second shallow features into the fourth network model to obtain the second deep features corresponding to the target residual, wherein the third network model and the fourth network model include multiple convolutional layers, each convolutional layer including a two-dimensional convolutional layer, a pooling layer and a batch normalization layer;
[0100] Step (d): The first deep feature and the second deep feature are fused to obtain the local spatiotemporal features corresponding to the I-frame.
[0101] In this embodiment, for each I-frame and the target residual corresponding to each I-frame in the compressed video data, the I-frame needs to be input into the first network model to obtain the first shallow feature corresponding to the I-frame; and the target residual needs to be input into the second network model to obtain the second shallow feature corresponding to the target residual. Specifically, this can be expressed by the following formulas (4) and (5):
[0102] $F_I = f_1(I)$ (4)
[0103] $F_{RES} = f_2(RES)$ (5)
[0104] Here, $I$ and $RES$ represent the I-frame and the corresponding target residual in the compressed domain information, respectively; $f_1$ and $f_2$ represent the first network model and the second network model, respectively; $f_1$ and $f_2$ each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; $F_I$ and $F_{RES}$ represent the first and second shallow features obtained after processing the I-frame and the corresponding target residual through the single convolutional layer, respectively.
[0105] After obtaining the first and second shallow features, the first and second shallow features need to be fused to obtain the fused features corresponding to the I-frame and the target residual. Specifically, this can be expressed by the following formula (6):
[0106]
[0107] Here, $F_I$ represents the first shallow feature corresponding to the I-frame, $F_{RES}$ represents the second shallow feature corresponding to the target residual, and $F'_I$ represents the fused features corresponding to the I-frame and the target residual.
[0108] After obtaining the fused features corresponding to the I-frame and the target residual, the fused features need to be input into the third network model to obtain the first deep feature corresponding to the I-frame; and the second shallow feature needs to be input into the fourth network model to obtain the second deep feature corresponding to the target residual. Specifically, this can be expressed by the following formulas (7) and (8):
[0109] $F''_I = f_3(F'_I)$ (7)
[0110] $F'_{RES} = f_4(F_{RES})$ (8)
[0111] Here, $f_3$ and $f_4$ are the third and fourth network models, respectively. $f_3$ and $f_4$ each have multiple convolutional layers with shortcut connections (e.g., 9 convolutional layers), where each convolutional layer includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; $F''_I$ and $F'_{RES}$ represent the first deep feature corresponding to the I-frame and the second deep feature corresponding to the target residual, respectively.
[0112] After obtaining the first and second deep features, the first and second deep features need to be fused to obtain the local spatiotemporal features corresponding to each I-frame, which can be expressed by the following formula (9):
[0113]
[0114] Here, $F''_I$ denotes the first deep feature, $F'_{RES}$ denotes the second deep feature, and the result of formula (9) is the local spatiotemporal feature corresponding to each I-frame.
[0115] In the above implementation, by fusing the I-frames and their corresponding target residuals layer by layer, deep features corresponding to each I-frame and each target residual can be obtained. By fusing the deep features corresponding to each I-frame and each target residual, local spatiotemporal features corresponding to each I-frame with stronger expressive power can be obtained. These local spatiotemporal features can fully express the local dynamic information in the video data. Based on these local spatiotemporal features, the accuracy of human behavior recognition can be improved.
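The data flow of formulas (6) to (9) can be sketched as follows. This is only an illustration under stated assumptions: features are flattened to lists of floats, the four network models are passed in as arbitrary callables, and the fusion operator is assumed to be element-wise addition, which the text does not specify. The names `add_fuse`, `local_spatiotemporal_feature`, `identity`, and `double` are hypothetical.

```python
from typing import Callable, List

Feature = List[float]  # flattened feature map; a real model would use tensors
Model = Callable[[Feature], Feature]


def add_fuse(a: Feature, b: Feature) -> Feature:
    """Element-wise addition, used here as an ASSUMED fusion operator."""
    if len(a) != len(b):
        raise ValueError("features must have the same shape to be fused")
    return [x + y for x, y in zip(a, b)]


def local_spatiotemporal_feature(
    i_frame: Feature, residual: Feature,
    f1: Model, f2: Model, f3: Model, f4: Model,
) -> Feature:
    shallow_i = f1(i_frame)                  # first shallow feature F_I
    shallow_res = f2(residual)               # second shallow feature F_RES
    fused = add_fuse(shallow_i, shallow_res)  # fused feature F'_I, formula (6)
    deep_i = f3(fused)                       # first deep feature F''_I, formula (7)
    deep_res = f4(shallow_res)               # second deep feature F'_RES, formula (8)
    return add_fuse(deep_i, deep_res)        # local spatiotemporal feature, formula (9)


# Toy stand-ins for the network models (hypothetical, not trained networks).
def identity(x: Feature) -> Feature:
    return list(x)


def double(x: Feature) -> Feature:
    return [2.0 * v for v in x]


local = local_spatiotemporal_feature(
    [1.0, 2.0], [0.5, -1.0], identity, identity, double, identity
)
```

Note that the residual branch feeds the I-frame branch at the shallow stage and again at the deep stage, while the residual branch itself never receives I-frame information.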
[0116] Optionally, in one possible implementation of the present invention, the step of fusing two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data can be implemented in the following way, specifically including steps [a]-[b]:
[0117] Step [a]: Perform feature enhancement processing on the local spatiotemporal features corresponding to each I-frame to obtain the target local spatiotemporal enhancement features corresponding to each I-frame;
[0118] Step [b]: Fuse the two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0119] In this embodiment, after obtaining the local spatiotemporal features corresponding to each I-frame, it is also necessary to perform feature enhancement processing on the local spatiotemporal features corresponding to each I-frame. By further enhancing the local spatiotemporal features, target local spatiotemporal enhancement features that are more conducive to expressing local dynamic information in video data can be obtained.
[0120] Then, the two adjacent target local spatiotemporal enhancement features are fused to obtain the global spatiotemporal features corresponding to the compressed video data.
[0121] In the above implementation, by further enhancing the local spatiotemporal features corresponding to each I-frame and fusing the enhanced target local spatiotemporal features, a global spatiotemporal feature that can express the global dynamic information of the video is obtained. Based on this global spatiotemporal feature, the accuracy of human behavior recognition can be improved.
[0122] Optionally, in one possible implementation of the present invention, the step of performing feature enhancement processing on the local spatiotemporal features corresponding to each of the I-frames to obtain the target local spatiotemporal enhancement features corresponding to each of the I-frames can be implemented in the following way, specifically including steps [a1]-[a4]:
[0123] Step [a1]: For each of the I-frames, the local spatiotemporal features are input into the fifth network model to obtain the local spatiotemporal enhancement features corresponding to the local spatiotemporal features. The fifth network model includes a single convolutional layer, which includes a two-dimensional convolutional layer and a batch normalization layer.
[0124] Step [a2]: Perform a subtraction operation on two adjacent local spatiotemporal enhancement features to obtain the difference in local spatiotemporal enhancement features between two adjacent I-frames;
[0125] Step [a3]: Based on the difference in local spatiotemporal enhancement features, determine the attention weights corresponding to the local spatiotemporal enhancement features;
[0126] Step [a4]: The attention weights and the local spatiotemporal enhancement features are fused to obtain the target local spatiotemporal enhancement features corresponding to the I-frame.
[0127] In this embodiment, after obtaining the local spatiotemporal features corresponding to each I-frame, for each local spatiotemporal feature corresponding to the I-frame, it is first necessary to perform an enhancement operation on the local spatiotemporal features, that is, input the local spatiotemporal features into the fifth network model to obtain the local spatiotemporal enhanced features corresponding to the local spatiotemporal features, which can be specifically expressed by the following formula (10):
[0128]
[0129] Here, f5 denotes the fifth network model, which consists of a single convolutional layer comprising a two-dimensional convolutional layer and a batch normalization layer; the output of formula (10) is the local spatiotemporal enhancement feature corresponding to the local spatiotemporal features. A single convolutional layer can enhance the expressive power of the local spatiotemporal features.
[0130] After obtaining the local spatiotemporal enhancement features corresponding to the local spatiotemporal features, it is necessary to perform subtraction on two adjacent local spatiotemporal enhancement features to obtain the difference of local spatiotemporal enhancement features between two adjacent I-frames, which can be expressed by the following formula (11):
[0131]
[0132] Here, the result of formula (11) is the difference in local spatiotemporal enhancement features between two adjacent I-frames, which reflects the motion information across different I-frames in the video; the two operands of the subtraction are the local spatiotemporal enhancement feature corresponding to the t-th I-frame and the local spatiotemporal enhancement feature corresponding to the (t+1)-th I-frame.
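The subtraction between adjacent I-frames can be illustrated with a short sketch. It assumes that the difference is taken as the feature of the (t+1)-th I-frame minus that of the t-th I-frame, and that a sequence of T I-frames therefore yields T-1 differences; neither the order of the operands nor the handling of the last I-frame is stated in the text. The function name `adjacent_differences` is hypothetical.

```python
from typing import List


def adjacent_differences(enhanced: List[List[float]]) -> List[List[float]]:
    """Return one difference per pair of adjacent I-frames (T frames -> T-1 differences).

    enhanced[t] is the flattened local spatiotemporal enhancement feature
    of the t-th I-frame. The operand order (next minus current) is an assumption.
    """
    return [
        [nxt - cur for cur, nxt in zip(enhanced[t], enhanced[t + 1])]
        for t in range(len(enhanced) - 1)
    ]


frames = [[1.0, 2.0], [1.5, 2.0], [4.0, 0.0]]
diffs = adjacent_differences(frames)
```

Positions whose value does not change between I-frames produce a zero difference, so non-zero entries mark where the content changed between the two I-frames.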
[0133] After obtaining the difference in local spatiotemporal enhancement features between two adjacent I-frames, it is necessary to determine the attention weights corresponding to the local spatiotemporal enhancement features based on the difference in local spatiotemporal enhancement features.
[0134] Specifically, firstly, the difference in local spatiotemporal enhancement features between two adjacent I-frames needs to be input into the sixth network model to obtain the first output feature A1 corresponding to the difference in local spatiotemporal enhancement features between the two adjacent I-frames. Then, A1 is input into the seventh network model to obtain the second output feature A2 corresponding to the first output feature. The second output feature A2 is the enhancement feature of the first output feature. Specifically, this can be expressed by the following formulas (12) and (13):
[0135]
[0136]
[0137] Here, f6 denotes the sixth network model, which includes a pooling layer, a convolutional layer, and an upsampling layer; A1 denotes the first output feature corresponding to the difference in local spatiotemporal enhancement features between two adjacent I-frames. It should be noted that performing pooling followed by convolution on the difference in local spatiotemporal enhancement features between two adjacent I-frames makes A1 more robust to spatial shifts in that difference; f7 denotes the seventh network model, which includes a single convolutional layer that further extracts features from the difference in local spatiotemporal features to obtain the second output feature A2.
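The robustness argument for pooling can be made concrete with a one-dimensional toy example. This is a sketch only: the window size of 2, average pooling, nearest-neighbor upsampling, and the use of an identity function in place of the learned convolution are all assumptions, and the names `avg_pool_1d`, `upsample_nearest_1d`, and `f6_sketch` are hypothetical.

```python
from typing import List


def avg_pool_1d(signal: List[float], window: int = 2) -> List[float]:
    """Non-overlapping average pooling."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, window)
    ]


def upsample_nearest_1d(signal: List[float], factor: int = 2) -> List[float]:
    """Nearest-neighbor upsampling back to the input resolution."""
    return [v for v in signal for _ in range(factor)]


def f6_sketch(diff: List[float], window: int = 2) -> List[float]:
    pooled = avg_pool_1d(diff, window)
    convolved = pooled  # placeholder for the learned convolution (identity here)
    return upsample_nearest_1d(convolved, window)


# The same response, shifted by one position inside a pooling window.
a = f6_sketch([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
b = f6_sketch([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
```

Both inputs produce the same output here because the shift stays inside one pooling window; a shift that crosses a window boundary would still change the result.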
[0138] Then, A1, A2, and ... need to be fused to obtain the fused I-frame local spatiotemporal difference feature A3. A3 is then input into the eighth network model to obtain the enhanced I-frame local spatiotemporal difference feature att, as expressed by the following formulas (14) and (15):
[0139]
[0140] att=f8(A3) (15)
[0141] Where A3 represents the fused I-frame local spatiotemporal difference features; f8 represents the eighth network model, which includes a single convolutional layer, wherein the single convolutional layer includes a two-dimensional convolutional layer and a batch normalization layer; and att represents the enhanced I-frame local spatiotemporal difference features.
[0142] After obtaining the enhanced local spatiotemporal difference feature att of the I-frame, the attention weight of the local spatiotemporal enhancement feature corresponding to the I-frame needs to be calculated using the following formula (16):
[0143]
[0144] After the attention weights Att of the local spatiotemporal enhancement features corresponding to the I-frame are obtained, Att needs to be fused with the local spatiotemporal enhancement features to obtain the target local spatiotemporal enhancement features corresponding to the I-frame;
[0145] That is, Att is first combined with the local spatiotemporal enhancement features by dot multiplication, which strengthens the local spatiotemporal enhancement features in the video that contain more dynamic information; the result is then fused with the original local spatiotemporal enhancement features to obtain the final target local spatiotemporal enhancement features, as expressed by the following formula (17):
[0146]
[0147] Here, the result of formula (17) is the target local spatiotemporal enhancement feature.
[0148] In the above implementation, by further enhancing the local spatiotemporal features corresponding to each I-frame, a target local spatiotemporal enhancement feature that is more conducive to expressing local dynamic information in video data can be obtained. Based on the target local spatiotemporal enhancement feature, the accuracy of human behavior recognition can be improved.
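The chain from the feature difference to the target local spatiotemporal enhancement feature (formulas (12) to (17)) can be sketched as below. Several points are assumptions rather than statements of the source: the third operand fused into A3 is taken to be the feature difference itself, every fusion is taken to be element-wise addition, dot multiplication is taken to be element-wise, and the attention weight in formula (16) is taken to be a sigmoid of att. The network models f6, f7, and f8 are passed in as callables, and the function names are hypothetical.

```python
import math
from typing import Callable, List

Feature = List[float]
Model = Callable[[Feature], Feature]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def target_enhanced_feature(
    enhanced: Feature, diff: Feature, f6: Model, f7: Model, f8: Model
) -> Feature:
    a1 = f6(diff)                                       # first output feature A1, formula (12)
    a2 = f7(a1)                                         # second output feature A2, formula (13)
    a3 = [x + y + d for x, y, d in zip(a1, a2, diff)]   # ASSUMED additive fusion, formula (14)
    att = f8(a3)                                        # enhanced difference feature att, formula (15)
    weights = [sigmoid(v) for v in att]                 # ASSUMED sigmoid weighting, formula (16)
    # Element-wise weighting, then ASSUMED additive fusion with the original feature, formula (17).
    return [w * f + f for w, f in zip(weights, enhanced)]


def identity(x: Feature) -> Feature:
    return list(x)


static = target_enhanced_feature([2.0, 4.0], [0.0, 0.0], identity, identity, identity)
moving = target_enhanced_feature([2.0, 4.0], [0.0, 3.0], identity, identity, identity)
```

With these stand-ins, a position whose feature difference is larger receives a larger weight, so `moving[1]` exceeds `static[1]` while the first position is unchanged.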
[0149] Optionally, in one possible implementation of this invention, the step of fusing two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data can be achieved through the following step [b1]:
[0150] Step [b1]: Replace the channel features corresponding to two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0151] In this embodiment, after obtaining the target local spatiotemporal enhancement features corresponding to each I-frame, it is necessary to replace the channel features corresponding to two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0152] For example, for each target local spatiotemporal enhancement feature corresponding to I-frame, it is necessary to replace the first one-eighth channel feature of the target local spatiotemporal enhancement feature of the t-th I-frame with the first one-eighth channel feature of the target local spatiotemporal enhancement feature of the (t+1)-th I-frame, so as to realize the interaction between the target local spatiotemporal enhancement features of I-frames at different times, and thus obtain the global spatiotemporal features corresponding to the compressed video data.
[0153] In the above implementation, the network feature shifting method is used to shift some channel features corresponding to the target local spatiotemporal enhancement features of adjacent I-frames in the video along the time dimension, so that the target local spatiotemporal enhancement features of each I-frame can fully interact, thereby obtaining global spatiotemporal features that are more conducive to the behavior recognition task. Based on the global spatiotemporal features, the accuracy of human behavior recognition can be improved.
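The channel replacement in the example above can be sketched as follows. For brevity each channel is reduced to a single scalar, and the last I-frame, which has no successor, is left unchanged; the latter is an assumption, since the text does not say how the final I-frame is handled. The function name `shift_channels` and the parameter `fraction` are hypothetical.

```python
from typing import List, Sequence


def shift_channels(
    features: Sequence[Sequence[float]], fraction: int = 8
) -> List[List[float]]:
    """Replace the first 1/fraction of the channels of I-frame t with those of I-frame t+1.

    features[t][c] is channel c of the target local spatiotemporal
    enhancement feature of the t-th I-frame.
    """
    shifted = [list(frame) for frame in features]
    for t in range(len(features) - 1):
        k = len(features[t]) // fraction
        # Read from the original (unshifted) features of the next I-frame.
        shifted[t][:k] = features[t + 1][:k]
    return shifted


# Three I-frames with 8 channels each: frame t holds the values 10*t + c.
frames = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = shift_channels(frames)
```

The operation has no learnable parameters: it only moves existing channel values along the time dimension, so later layers see a mixture of the current and the next I-frame.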
[0154] Optionally, in one possible implementation of this invention, determining the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector, and the residual can be achieved in the following ways, specifically including steps (1) to (3):
[0155] Step (1): Input the motion vector into the first residual network model to obtain the motion features corresponding to the compressed video data, wherein the first residual network model is a ResNet18 network model;
[0156] Step (2): Input each of the second deep features into the second residual network model to obtain the residual features corresponding to the compressed video data, wherein the second residual network model is a ResNet50 network model;
[0157] Step (3): The motion features, residual features and global spatiotemporal features are fused to obtain the target features.
[0158] In this embodiment, after obtaining the global spatiotemporal features corresponding to the compressed video data, it is necessary to determine the target features corresponding to the compressed video data based on the global spatiotemporal features, motion vectors, and residuals.
[0159] Specifically, the motion vectors need to be input into the ResNet18 network model to obtain the motion features corresponding to the compressed video data.
[0160] Then, the second deep features corresponding to each target residual output by the fourth network model are input into the ResNet50 network model to obtain the residual features corresponding to the compressed video data.
[0161] Then, the motion features, residual features, and global spatiotemporal features are fused to obtain the target features, which can be represented by the following formula (18):
[0162]
[0163] Here, $O_{pre}$ denotes the target feature; $O_{mv}$ denotes the motion feature; $O_{res}$ denotes the residual feature; and $O_I$ denotes the global spatiotemporal feature.
[0164] After the target feature $O_{pre}$ is obtained, $O_{pre}$ is passed through a softmax layer to obtain the predicted probability of each behavior category in the video. In the above implementation, based on global spatiotemporal features, motion vectors, and residuals, target features that better reflect human behavior can be determined. Human behavior can be identified based on these target features, thereby improving the accuracy of human behavior recognition.
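The final fusion and the softmax step can be illustrated with a small sketch. It assumes that each of the three streams has already been reduced to one score per behavior category and that the fusion in formula (18) is an element-wise sum; the text does not specify the fusion operator. The function names and the example scores are hypothetical.

```python
import math
from typing import List


def fuse_streams(o_mv: List[float], o_res: List[float], o_i: List[float]) -> List[float]:
    """ASSUMED fusion for formula (18): element-wise sum of per-category scores."""
    return [a + b + c for a, b, c in zip(o_mv, o_res, o_i)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the category scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three behavior categories.
o_pre = fuse_streams([0.1, 0.2, 0.0], [0.0, 0.3, 0.1], [0.2, 1.0, 0.1])
probs = softmax(o_pre)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The recognized behavior is the category with the highest predicted probability, which is index 1 for these scores.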
[0165] Referring to Figure 2, Figure 2 is the second flowchart of the human behavior recognition method provided by the present invention, which specifically includes steps 201-211:
[0166] Step 201: Extract compression domain information from compressed video data and preprocess the compression domain information to obtain preprocessed compression domain information, wherein the preprocessed compression domain information includes multiple I-frames, residuals and motion vectors.
[0167] Step 202: For each I-frame and its corresponding target residual, input the I-frame into the first network model to obtain the first shallow feature corresponding to the I-frame; input the target residual into the second network model to obtain the second shallow feature corresponding to the target residual. The first network model and the second network model each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer.
[0168] Step 203: Fuse the first shallow feature and the second shallow feature to obtain the fused features corresponding to the I-frame and the target residual.
[0169] Step 204: Input the fused features into the third network model to obtain the first deep features corresponding to the I-frame; input the second shallow features into the fourth network model to obtain the second deep features corresponding to the target residual. The third and fourth network models include multiple convolutional layers, and each convolutional layer includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer.
[0170] Step 205: Fuse the first deep feature and the second deep feature to obtain the local spatiotemporal features corresponding to the I-frame.
[0171] Step 206: For each I-frame, input the local spatiotemporal features into the fifth network model to obtain the local spatiotemporal enhancement features corresponding to the local spatiotemporal features. The fifth network model includes a single convolutional layer, which includes a two-dimensional convolutional layer and a batch normalization layer.
[0172] Step 207: Perform a subtraction operation on two adjacent local spatiotemporal enhancement features to obtain the difference in local spatiotemporal enhancement features between two adjacent I-frames.
[0173] Step 208: Determine the attention weights corresponding to the local spatiotemporal enhancement features based on the difference in local spatiotemporal enhancement features.
[0174] Step 209: Fuse the attention weights and local spatiotemporal enhancement features to obtain the target local spatiotemporal enhancement features corresponding to the I-frame.
[0175] Step 210: Replace the channel features corresponding to two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0176] Step 211: Based on global spatiotemporal features, motion vectors and residuals, determine the target features corresponding to the compressed video data, and determine the human behavior recognition results corresponding to the compressed video data based on the target features.
[0177] The human behavior recognition method provided by this invention fuses the deep features of each I-frame and each target residual in the compressed domain information, enabling the information of each I-frame and each target residual in the compressed domain information to be fully complementary, thereby obtaining the local spatiotemporal features corresponding to the I-frame with stronger expressive power; by fusing two adjacent local spatiotemporal features, the global spatiotemporal features corresponding to the compressed video data can be extracted, which are more conducive to behavior recognition tasks; based on the global spatiotemporal features, motion vectors and residuals, target features that better reflect human behavior can be determined, and human behavior can be recognized based on these target features, thereby improving the accuracy of human behavior recognition.
[0178] The human behavior recognition device provided by the present invention is described below. The human behavior recognition device described below and the human behavior recognition method described above may be referred to in correspondence with each other. Referring to Figure 3, Figure 3 is a schematic diagram of the structure of the human behavior recognition device 300 provided by the present invention.
[0179] Extraction module 301 is used to extract compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals and motion vectors, and there is a correspondence between the I-frames and the residuals;
[0180] The fusion module 302 is used to fuse the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; and to fuse two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data.
[0181] The determination module 303 is used to determine the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector and the residual, and to determine the human behavior recognition result corresponding to the compressed video data based on the target features.
[0182] The human behavior recognition device provided by this invention, by fusing the deep features of each I-frame and each target residual in the compressed domain information, enables the information of each I-frame and each target residual in the compressed domain information to be fully complementary, thereby obtaining the local spatiotemporal features corresponding to the I-frame with stronger expressive power; by fusing two adjacent local spatiotemporal features, the global spatiotemporal features corresponding to the compressed video data can be extracted, which are more conducive to behavior recognition tasks; based on the global spatiotemporal features, motion vectors and residuals, target features that better reflect human behavior can be determined, and human behavior can be recognized based on these target features, thereby improving the accuracy of human behavior recognition.
[0183] Optionally, the extraction module 301 is further used for:
[0184] The compressed video data is entropy decoded to obtain the motion vector in the compressed domain and the discrete cosine transform coefficients corresponding to each macroblock in the compressed domain;
[0185] The discrete cosine transform coefficients are subjected to inverse discrete cosine transform to obtain the residual corresponding to each video frame in the compressed domain.
[0186] Each target macroblock is identified, and each target macroblock is subjected to loop filtering to obtain the I-frame, wherein the target macroblock is the macroblock corresponding to the I-frame.
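The inverse discrete cosine transform used to recover residuals can be illustrated with a direct implementation on a small block. This is a sketch under assumptions: it uses the orthonormal two-dimensional DCT convention, operates on a square block of already dequantized coefficients, and ignores the codec-specific details of real decoders, which typically apply dequantization first and may use an integer approximation of the transform. The function name `idct_2d` is hypothetical.

```python
import math
from typing import List


def idct_2d(coeffs: List[List[float]]) -> List[List[float]]:
    """Inverse 2D DCT (orthonormal convention) of an n x n coefficient block."""
    n = len(coeffs)

    def scale(k: int) -> float:
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    block = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0.0
            for u in range(n):
                for v in range(n):
                    total += (
                        scale(u) * scale(v) * coeffs[u][v]
                        * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                    )
            block[y][x] = total
    return block


# A 4 x 4 block that contains only a DC coefficient.
dc_only = [[0.0] * 4 for _ in range(4)]
dc_only[0][0] = 8.0
residual_block = idct_2d(dc_only)
```

A block with only a DC coefficient decodes to a constant residual, here 2.0 at every position, which is a quick sanity check for the scaling convention.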
[0187] Optionally, the extraction module 301 is further used for:
[0188] The compressed domain information is preprocessed to obtain preprocessed compressed domain information.
[0189] Optionally, the fusion module 302 is further used for:
[0190] For each I-frame and its corresponding target residual, the I-frame is input into a first network model to obtain a first shallow feature corresponding to the I-frame; the target residual is input into a second network model to obtain a second shallow feature corresponding to the target residual. The first network model and the second network model each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer.
[0191] The first shallow feature and the second shallow feature are fused together to obtain the fused features corresponding to the I-frame and the target residual;
[0192] The fused features are input into the third network model to obtain the first deep features corresponding to the I-frame; the second shallow features are input into the fourth network model to obtain the second deep features corresponding to the target residual. The third network model and the fourth network model include multiple convolutional layers, and each convolutional layer includes a two-dimensional convolutional layer, a pooling layer and a batch normalization layer.
[0193] The first deep feature and the second deep feature are fused together to obtain the local spatiotemporal features corresponding to the I-frame.
[0194] Optionally, the fusion module 302 is further used for:
[0195] The local spatiotemporal features corresponding to each I-frame are subjected to feature enhancement processing to obtain the target local spatiotemporal enhanced features corresponding to each I-frame;
[0196] The two adjacent target local spatiotemporal enhancement features are fused to obtain the global spatiotemporal features corresponding to the compressed video data.
[0197] Optionally, the fusion module 302 is further used for:
[0198] For each of the I-frames, the local spatiotemporal features are input into the fifth network model to obtain the local spatiotemporal enhancement features corresponding to the local spatiotemporal features. The fifth network model includes a single convolutional layer, which includes a two-dimensional convolutional layer and a batch normalization layer.
[0199] A subtraction operation is performed on two adjacent local spatiotemporal enhancement features to obtain the difference in local spatiotemporal enhancement features between two adjacent I-frames;
[0200] Based on the difference in local spatiotemporal enhancement features, determine the attention weights corresponding to the local spatiotemporal enhancement features;
[0201] The attention weights and the local spatiotemporal enhancement features are fused together to obtain the target local spatiotemporal enhancement features corresponding to the I-frame.
[0202] Optionally, the fusion module 302 is further used for:
[0203] The channel features corresponding to two adjacent target local spatiotemporal enhancement features are replaced to obtain the global spatiotemporal features corresponding to the compressed video data.
[0204] Optionally, the determination module 303 is further used for:
[0205] The motion vector is input into the first residual network model to obtain the motion features corresponding to the compressed video data, wherein the first residual network model is a ResNet18 network model;
[0206] Each of the second deep features is input into the second residual network model to obtain the residual features corresponding to the compressed video data, wherein the second residual network model is a ResNet50 network model;
[0207] The motion features, residual features, and global spatiotemporal features are fused together to obtain the target features.
[0208] Figure 4 illustrates a schematic diagram of the physical structure of an electronic device 400. As shown in Figure 4, the electronic device may include a processor 410, a communication interface 420, a memory 430, and a communication bus 440, wherein the processor 410, the communication interface 420, and the memory 430 communicate with each other through the communication bus 440. The processor 410 can call logical instructions in the memory 430 to execute a human behavior recognition method. The method includes: extracting compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals; fusing the deep features corresponding to each I-frame and each target residual to obtain local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; fusing two adjacent local spatiotemporal features to obtain global spatiotemporal features corresponding to the compressed video data; determining the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vectors, and the residuals, and determining the human behavior recognition result corresponding to the compressed video data based on the target features.
[0209] Furthermore, the logical instructions in the aforementioned memory 430 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0210] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can implement the various processes of the above-described human behavior recognition method embodiments. The method includes: extracting compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals; fusing the deep features corresponding to each I-frame and each target residual to obtain local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; fusing two adjacent local spatiotemporal features to obtain global spatiotemporal features corresponding to the compressed video data; determining target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vectors, and the residuals, and determining the human behavior recognition result corresponding to the compressed video data based on the target features.
[0211] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program is capable of implementing the various processes of the above-described human behavior recognition method embodiments. The method includes: extracting compression domain information from compressed video data, wherein the compression domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals; fusing the deep features corresponding to each I-frame and each target residual to obtain local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; fusing two adjacent local spatiotemporal features to obtain global spatiotemporal features corresponding to the compressed video data; determining target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vectors, and the residuals, and determining the human behavior recognition result corresponding to the compressed video data based on the target features.
[0212] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0213] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0214] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A human behavior recognition method, characterized in that it comprises: extracting compressed domain information from compressed video data, wherein the compressed domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals; fusing the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame; fusing every two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data; determining, based on the global spatiotemporal features, the motion vector, and the residual, the target features corresponding to the compressed video data, and determining the human behavior recognition result corresponding to the compressed video data based on the target features; wherein fusing the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame includes: for each I-frame and its corresponding target residual, inputting the I-frame into a first network model to obtain a first shallow feature corresponding to the I-frame, and inputting the target residual into a second network model to obtain a second shallow feature corresponding to the target residual, wherein the first network model and the second network model each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; fusing the first shallow feature and the second shallow feature to obtain the fused features corresponding to the I-frame and the target residual; inputting the fused features into a third network model to obtain a first deep feature corresponding to the I-frame, and inputting the second shallow feature into a fourth network model to obtain a second deep feature corresponding to the target residual, wherein the third network model and the fourth network model each include multiple convolutional layers, and each convolutional layer includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; and fusing the first deep feature and the second deep feature to obtain the local spatiotemporal features corresponding to the I-frame.
2. The human behavior recognition method according to claim 1, characterized in that extracting compressed domain information from compressed video data includes: performing entropy decoding on the compressed video data to obtain the motion vector in the compressed domain and the discrete cosine transform coefficients corresponding to each macroblock in the compressed domain; performing an inverse discrete cosine transform on the discrete cosine transform coefficients to obtain the residual corresponding to each video frame in the compressed domain; and identifying each target macroblock and performing loop filtering on each target macroblock to obtain the I-frame, wherein the target macroblock is a macroblock corresponding to the I-frame.
3. The human behavior recognition method according to claim 1 or 2, characterized in that, after extracting compressed domain information from the compressed video data, the method further includes: preprocessing the compressed domain information to obtain preprocessed compressed domain information.
4. The human behavior recognition method according to claim 1, characterized in that fusing every two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data includes: performing feature enhancement processing on the local spatiotemporal features corresponding to each I-frame to obtain the target local spatiotemporal enhancement features corresponding to each I-frame; and fusing every two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
5. The human behavior recognition method according to claim 4, characterized in that performing feature enhancement processing on the local spatiotemporal features corresponding to each I-frame to obtain the target local spatiotemporal enhancement features corresponding to each I-frame includes: for each I-frame, inputting the local spatiotemporal features into a fifth network model to obtain the local spatiotemporal enhancement features corresponding to the local spatiotemporal features, wherein the fifth network model includes a single convolutional layer, which includes a two-dimensional convolutional layer and a batch normalization layer; subtracting the local spatiotemporal enhancement features of two adjacent I-frames to obtain a local spatiotemporal enhancement feature difference; determining, based on the local spatiotemporal enhancement feature difference, the attention weights corresponding to the local spatiotemporal enhancement features; and fusing the attention weights and the local spatiotemporal enhancement features to obtain the target local spatiotemporal enhancement features corresponding to the I-frame.
6. The human behavior recognition method according to claim 4, characterized in that fusing every two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data includes: performing channel replacement between the channel features corresponding to two adjacent target local spatiotemporal enhancement features to obtain the global spatiotemporal features corresponding to the compressed video data.
7. The human behavior recognition method according to claim 1, characterized in that determining the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector, and the residual includes: inputting the motion vector into a first residual network model to obtain the motion features corresponding to the compressed video data, wherein the first residual network model is a ResNet18 network model; inputting each second deep feature into a second residual network model to obtain the residual features corresponding to the compressed video data, wherein the second residual network model is a ResNet50 network model; and fusing the motion features, the residual features, and the global spatiotemporal features to obtain the target features.
8. A human behavior recognition device, characterized in that it comprises: an extraction module, used to extract compressed domain information from compressed video data, wherein the compressed domain information includes multiple I-frames, residuals, and motion vectors, and there is a correspondence between the I-frames and the residuals; a fusion module, used to fuse the deep features corresponding to each I-frame and each target residual to obtain the local spatiotemporal features corresponding to each I-frame, wherein the target residual is the residual corresponding to the I-frame, and to fuse every two adjacent local spatiotemporal features to obtain the global spatiotemporal features corresponding to the compressed video data; and a determination module, used to determine the target features corresponding to the compressed video data based on the global spatiotemporal features, the motion vector, and the residual, and to determine the human behavior recognition result corresponding to the compressed video data based on the target features; wherein the fusion module is specifically used for: for each I-frame and its corresponding target residual, inputting the I-frame into a first network model to obtain a first shallow feature corresponding to the I-frame, and inputting the target residual into a second network model to obtain a second shallow feature corresponding to the target residual, wherein the first network model and the second network model each include a single convolutional layer, which includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; fusing the first shallow feature and the second shallow feature to obtain the fused features corresponding to the I-frame and the target residual; inputting the fused features into a third network model to obtain a first deep feature corresponding to the I-frame, and inputting the second shallow feature into a fourth network model to obtain a second deep feature corresponding to the target residual, wherein the third network model and the fourth network model each include multiple convolutional layers, and each convolutional layer includes a two-dimensional convolutional layer, a pooling layer, and a batch normalization layer; and fusing the first deep feature and the second deep feature to obtain the local spatiotemporal features corresponding to the I-frame.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the human behavior recognition method as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the human behavior recognition method as described in any one of claims 1 to 7.
11. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the human behavior recognition method as described in any one of claims 1 to 7.
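By way of non-limiting illustration, the two-stream data flow recited in claims 1 and 8 can be sketched as follows. The sketch only shows how the four network models are wired together. The names `fuse` and `local_spatiotemporal_feature` are hypothetical, features are reduced to flat lists of numbers, the fusion operator is assumed to be element-wise addition (the claims do not fix it), and the four models are passed in as arbitrary callables rather than real convolutional networks.

```python
from typing import Callable, List

Feature = List[float]
Model = Callable[[Feature], Feature]


def fuse(a: Feature, b: Feature) -> Feature:
    """Element-wise addition. Assumption: the claims do not name the fusion operator."""
    if len(a) != len(b):
        raise ValueError("features must have the same length")
    return [x + y for x, y in zip(a, b)]


def local_spatiotemporal_feature(
    i_frame: Feature,
    target_residual: Feature,
    first_model: Model,
    second_model: Model,
    third_model: Model,
    fourth_model: Model,
) -> Feature:
    """Wire the four models together in the order recited in claim 1."""
    shallow_i = first_model(i_frame)            # first shallow feature
    shallow_r = second_model(target_residual)   # second shallow feature
    fused = fuse(shallow_i, shallow_r)          # shallow-level fusion
    deep_i = third_model(fused)                 # first deep feature
    deep_r = fourth_model(shallow_r)            # second deep feature (residual stream only)
    return fuse(deep_i, deep_r)                 # local spatiotemporal feature


# Stand-in models (hypothetical): real models would be convolutional networks.
def double(feature: Feature) -> Feature:
    return [2.0 * v for v in feature]


def identity(feature: Feature) -> Feature:
    return list(feature)


local = local_spatiotemporal_feature(
    [1.0, 2.0], [0.5, 0.5], double, identity, double, identity
)
print(local)
```

Note that the residual stream feeds both the shallow-level fusion and its own deep branch, so residual information reaches the local spatiotemporal feature along two paths.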
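By way of non-limiting illustration, the inverse discrete cosine transform step of claim 2 can be sketched as follows. The sketch assumes an orthonormal floating-point two-dimensional DCT on a square block. Practical video codecs typically use integer transform approximations together with dequantization, which are omitted here. The function names `dct_2d` and `idct_2d` and the sample block are hypothetical, and the forward transform is included only to check the round trip.

```python
import math
from typing import List

Block = List[List[float]]


def _scale(k: int, n: int) -> float:
    """Orthonormal DCT scaling factor for frequency index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def _basis(pos: int, freq: int, n: int) -> float:
    """Cosine basis value at spatial position pos for frequency freq."""
    return math.cos((2 * pos + 1) * freq * math.pi / (2 * n))


def dct_2d(block: Block) -> Block:
    """Forward 2D DCT (type II), used here only to produce test coefficients."""
    n = len(block)
    return [[_scale(u, n) * _scale(v, n) * sum(
                block[x][y] * _basis(x, u, n) * _basis(y, v, n)
                for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


def idct_2d(coeffs: Block) -> Block:
    """Inverse 2D DCT: maps transform coefficients back to a residual block."""
    n = len(coeffs)
    return [[sum(
                _scale(u, n) * _scale(v, n) * coeffs[u][v]
                * _basis(x, u, n) * _basis(y, v, n)
                for u in range(n) for v in range(n))
             for y in range(n)] for x in range(n)]


# Hypothetical 8x8 residual block; the round trip should reproduce it.
residual_block = [[float((x - y) % 3) for y in range(8)] for x in range(8)]
restored = idct_2d(dct_2d(residual_block))
max_error = max(
    abs(restored[x][y] - residual_block[x][y]) for x in range(8) for y in range(8)
)
print(max_error < 1e-9)
```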
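By way of non-limiting illustration, the difference-based attention of claim 5 and the channel replacement of claim 6 can be sketched as follows. Several choices here are assumptions not stated in the claims: each feature is reduced to one number per channel, the attention weight is a sigmoid of the difference between a feature and its neighbor, the weights are fused by element-wise multiplication, the last feature uses its predecessor as neighbor, and channel replacement copies the first `k` channels from the following feature. The function names are hypothetical.

```python
import math
from typing import List

Feature = List[float]  # one value per channel, for brevity


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention_enhance(enhanced: List[Feature]) -> List[Feature]:
    """Reweight each enhancement feature using its difference to an adjacent one."""
    if len(enhanced) < 2:
        raise ValueError("at least two adjacent features are required")
    last = len(enhanced) - 1
    out = []
    for t, feat in enumerate(enhanced):
        # Assumption: use the next feature as neighbor, or the previous one at the end.
        neighbor = enhanced[t + 1] if t < last else enhanced[t - 1]
        weights = [sigmoid(n - c) for n, c in zip(neighbor, feat)]
        # Assumption: weights and features are fused by element-wise multiplication.
        out.append([w * c for w, c in zip(weights, feat)])
    return out


def replace_channels(features: List[Feature], k: int) -> List[Feature]:
    """Replace the first k channels of each feature with those of its successor."""
    out = [list(f) for f in features]
    for t in range(len(features) - 1):
        out[t][:k] = features[t + 1][:k]
    return out


# Identical neighbors give a zero difference, hence a weight of exactly 0.5.
weighted = attention_enhance([[2.0, 4.0], [2.0, 4.0]])
print(weighted)

mixed = replace_channels(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]], 2
)
print(mixed)
```

After replacement, each feature except the last carries channels from its successor, so information from adjacent I-frames is mixed without additional parameters.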