A behavior recognition method, device, apparatus and storage medium

By fusing the initial motion vector difference of the video stream and image data obtained in streaming media behavior detection, the problems of long recognition cycle and resource redundancy in the existing technology are solved, realizing real-time streaming behavior recognition and efficient sharing of computing resources.

CN116994333B · Active · Publication Date: 2026-06-19 · CHINA MOBILE CHENGDU INFORMATION & TELECOMM TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA MOBILE CHENGDU INFORMATION & TELECOMM TECH CO LTD
Filing Date: 2023-07-25
Publication Date: 2026-06-19

IPC: G06V40/20; G06V20/40; G06V10/82; G06V10/80

AI Tagging

Application Domain

Character and pattern recognition

Technology Topics

Computer graphics (images)
Motion vector

Technical Efficacy Phrases

reduce waste
narrow search

AI Technical Summary

Technical Problem

Existing streaming media behavior detection methods require the acquisition of keyframes before decoding and encoding, resulting in long recognition cycles and an inability to achieve real-time streaming recognition. Furthermore, the lack of resource sharing leads to redundancy and delays in the recognition process.

Method used

By acquiring the initial motion vector difference of video stream data and image data, and performing fusion processing, end-to-end behavior recognition is achieved using deconvolution decoding and convolution processing, sharing computing resources and reducing unnecessary encoding and decoding steps.

Benefits of technology

It achieves real-time processing and real-time output of behavior recognition, reducing the waste of computing resources, improving recognition efficiency, and reducing recognition time.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a behavior recognition method, apparatus, device, and storage medium. The method includes: obtaining an initial motion vector difference between a first video frame and a second video frame based on video stream data corresponding to a behavior to be recognized; the first video frame being the preceding video frame of the second video frame; if the behavior to be recognized satisfies preset conditions, obtaining first image data corresponding to the behavior to be recognized that satisfies the preset conditions; fusing the initial motion vector difference and the first image data to obtain target information; and performing behavior recognition on the target information to obtain a recognition result corresponding to the behavior to be recognized.

Description

Technical Field

[0001] This invention relates to the field of behavior recognition technology, and in particular to a behavior recognition method, apparatus, device, and storage medium.

Background Technology

[0002] Among related technologies, methods for streaming media behavior detection include streaming media video segment detection and image analysis detection. Both of these methods require first acquiring keyframes, then decoding them, and then encoding them into a video to be fed into a video recognition neural network. The entire behavior recognition cycle is long, and there is currently no effective solution to this problem.

Summary of the Invention

[0003] To address the existing technical problems, the main objective of this invention is to provide a behavior recognition method, apparatus, device, and storage medium.

[0004] To achieve the above objectives, the technical solution of this invention is implemented as follows:

[0005] In a first aspect, the present invention provides a behavior recognition method, the method comprising:

[0006] The initial motion vector difference between the first video frame and the second video frame is obtained based on the video stream data corresponding to the behavior to be identified; the first video frame is the video frame preceding the second video frame.

[0007] If the behavior to be identified meets the preset conditions, the first image data corresponding to the behavior to be identified that meets the preset conditions is obtained;

[0008] The initial motion vector difference and the first image data are fused to obtain target information;

[0009] The target information is subjected to behavior recognition to obtain the recognition result corresponding to the behavior to be recognized.

[0010] In the above scheme, obtaining the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified includes:

[0011] The video stream data corresponding to the behavior to be identified is subjected to deconvolution decoding to obtain the initial motion vector difference between the first video frame and the second video frame.

[0012] In the above scheme, the step of acquiring the first image data corresponding to the behavior to be identified, which satisfies the preset conditions, when the behavior to be identified satisfies the preset conditions, includes:

[0013] Determine whether there is a change between the first video frame and the second video frame, or whether a target object is detected in the first video frame and the second video frame;

[0014] If there is a change between the first video frame and the second video frame, or if the target object is detected in the first video frame and the second video frame, the first image data corresponding to the behavior to be identified that meets the preset conditions is obtained.

[0015] In the above scheme, before acquiring the first image data corresponding to the behavior to be identified that satisfies the preset conditions, the method further includes:

[0016] Obtain the second image data corresponding to the video stream data corresponding to the behavior to be identified;

[0017] The second image data is cropped based on the initial motion vector difference to obtain the first cropping difference map corresponding to the behavior to be identified.

[0018] The first cropping difference map and the initial motion vector difference are concatenated to obtain a second cropping difference map; the second cropping difference map is the next frame difference map corresponding to the first cropping difference map.

[0019] The classification result corresponding to the behavior to be identified is determined based on the second cropping difference map.
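
The cropping and concatenation steps above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the motion vector difference is available as one (dx, dy) pair per pixel and that the image is a single-channel grid; the function names `motion_bbox`, `crop`, and `crop_and_concat` are hypothetical and do not come from the patent. How the patent derives the next-frame difference map from the concatenated data is not specified here and is not modeled.

```python
from typing import List, Optional, Tuple

Vector = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def motion_bbox(mv: List[List[Vector]], eps: float = 0.0) -> Optional[Box]:
    """Bounding box of all cells whose motion magnitude (L1) exceeds eps."""
    rows, cols = [], []
    for r, row in enumerate(mv):
        for c, (dx, dy) in enumerate(row):
            if abs(dx) + abs(dy) > eps:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None  # no motion anywhere: nothing to crop
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def crop(grid: list, box: Box) -> list:
    """Cut the rectangular region described by box out of a 2D grid."""
    top, left, bottom, right = box
    return [row[left:right] for row in grid[top:bottom]]


def crop_and_concat(image: List[List[float]], mv: List[List[Vector]]):
    """Crop the image to the moving region, then attach (dx, dy) to each pixel."""
    box = motion_bbox(mv)
    if box is None:
        return None
    img_crop = crop(image, box)
    mv_crop = crop(mv, box)
    # Channel-wise concatenation: each cell becomes (intensity, dx, dy).
    return [
        [(p, dx, dy) for p, (dx, dy) in zip(prow, mrow)]
        for prow, mrow in zip(img_crop, mv_crop)
    ]
```

Restricting later processing to the moving region is one way to read the stated benefit of narrowing the full-image search.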

[0020] In the above scheme, the method further includes:

[0021] Based on the first image data, sampling and deconvolution processing are performed to obtain an image of at least one resolution corresponding to the first image data; each image of the resolution is used to identify the first image data.

[0022] In the above scheme, the step of fusing the initial motion vector difference and the first image data to obtain target information includes:

[0023] The initial motion vector difference and the first image data are subjected to prediction processing to obtain first target information; the first target information is used to predict the behavior to be identified.

[0024] The first image data is convolved to obtain second target information; the second target information is used to identify the behavior to be identified.

[0025] In the above scheme, the step of performing behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be recognized includes:

[0026] Determine the target video stream from the video stream to be identified;

[0027] Based on the second target information, determine the state value sequence parameters corresponding to the target video stream;

[0028] The target video stream is divided into at least two windows according to the state value sequence parameters, and the windows are slid-processed to obtain the sequence values corresponding to each window.

[0029] The identification result corresponding to the behavior to be identified is obtained based on the sequence value.

[0030] In a second aspect, the present invention also provides a behavior recognition device, the device comprising an acquisition unit, a judgment unit, a processing unit, and a recognition unit, wherein,

[0031] The first acquisition unit is used to acquire the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified; the first video frame is the video frame preceding the second video frame.

[0032] The second acquisition unit is used to acquire first image data corresponding to the behavior to be identified that satisfies the preset conditions when the behavior to be identified satisfies the preset conditions.

[0033] The processing unit is used to fuse the initial motion vector difference and the first image data to obtain target information;

[0034] The recognition unit is used to perform behavior recognition on the target information and obtain the recognition result corresponding to the behavior to be identified.

[0035] In a third aspect, embodiments of the present invention provide a storage medium storing a computer program; when the computer program is executed by a processor, it implements the steps of any of the methods described above.

[0036] In a fourth aspect, embodiments of the present invention provide a behavior recognition device, the behavior recognition device comprising: a processor and a memory for storing a computer program capable of running on the processor, wherein the processor, when running the computer program, executes the steps of any of the methods described above.

[0037] This invention provides a behavior recognition method, apparatus, device, and storage medium. The method includes: obtaining an initial motion vector difference between a first video frame and a second video frame based on video stream data corresponding to a behavior to be recognized; the first video frame being the preceding video frame of the second video frame; determining that the behavior to be recognized meets preset conditions, and obtaining first image data corresponding to the behavior that meets the preset conditions; fusing the initial motion vector difference and the first image data to obtain target information; and performing behavior recognition on the target information to obtain a recognition result corresponding to the behavior to be recognized. By employing the technical solution of this invention, behavior recognition is performed only when the behavior to be recognized meets preset conditions, which reduces the waste of computational resources. Sharing the initial motion vector difference reduces the scope of the full-image search, thereby reducing the time required for behavior recognition and improving its efficiency.

Attached Figure Description

[0038] Figure 1 is a flowchart illustrating a behavior recognition method in the related art;

[0039] Figure 2 is a flowchart illustrating another behavior recognition method provided in an embodiment of the present invention;

[0040] Figure 3 is a flowchart illustrating a decoding deconvolution module provided in an embodiment of the present invention;

[0041] Figure 4 is a schematic diagram of deconvolution-convolution coding provided in an embodiment of the present invention;

[0042] Figure 5 is a flowchart illustrating a 2D convolution module provided in an embodiment of the present invention;

[0043] Figure 6 is a flowchart illustrating a generative convolution module provided in an embodiment of the present invention;

[0044] Figure 7 is a schematic diagram of deconvolution provided in an embodiment of the present invention;

[0045] Figure 8 is a schematic diagram of a motion prediction convolution provided in an embodiment of the present invention;

[0046] Figure 9 is a schematic diagram illustrating a motion prediction implementation process provided in an embodiment of the present invention;

[0047] Figure 10 is a flowchart illustrating a sequence self-convolution module provided in an embodiment of the present invention;

[0048] Figure 11 is a schematic diagram of a multi-level sliding window provided in an embodiment of the present invention;

[0049] Figure 12 is a specific flowchart of behavior recognition provided in an embodiment of the present invention;

[0050] Figure 13 is a schematic diagram of the structure of a behavior recognition device provided in an embodiment of the present invention;

[0051] Figure 14 is a schematic diagram of the hardware structure of a behavior recognition device according to an embodiment of the present invention.

Detailed Implementation

[0052] In related technologies, video encoding and decoding technologies can be divided into two main categories: lossless compression and lossy compression. Lossless compression, also known as reversible encoding, refers to the situation where, when the compressed data is reconstructed (i.e., decompressed), the reconstructed data is completely identical to the original data. In other words, the decoded image and the original image are strictly identical; the compression is completely recoverable, without distortion. Lossless compression is used in situations where the reconstructed signal must be completely identical to the original signal, such as the compression of disk files. Lossy compression, also known as irreversible encoding, refers to the situation where, when the compressed data is reconstructed, the reconstructed data differs from the original data, but the difference does not cause the information expressed in the original data to be misunderstood. In other words, the decoded image and the original image differ; some distortion is permissible, but the visual effect is generally acceptable. Lossy compression has a wide range of applications, such as video conferencing, videophones, video broadcasting, and video surveillance.

[0053] Behavior detection technology, specifically video-based human motion recognition, is a hot and challenging research area in computer vision. Its core lies in automatically detecting, tracking, and recognizing human bodies from video sequences and understanding and describing their behavior using computer vision techniques. This research focuses on pedestrian detection and behavior recognition in videos, exploring and conducting in-depth studies on pedestrian detection, target segmentation and tracking, action recognition, and pose estimation. Due to the complexity and diversity of human motion, behavior recognition is often based on methods utilizing multi-scale information, spatiotemporal orientation information, domain features, and spatial coding features.

[0054] The mainstream solutions currently available for streaming media behavior detection are as follows:

[0055] The first method is to detect video segments in streaming media: extract video frames (e.g., keyframes or I-frames) from the video stream at regular or irregular intervals, and analyze the video frame segments uniformly for video behavior.

[0056] The second method is image analysis detection: image frames are extracted from the video stream of the streaming media at regular or irregular intervals, and images of typical behaviors are classified to identify the behaviors.

[0057] The disadvantages of the first approach are: the process of action detection and recognition is time-consuming, requiring the acquisition of keyframes, decoding, and encoding into video before feeding them into the video recognition neural network. The entire recognition cycle is long, and the encoding and decoding process is repetitive and redundant. The acquisition time affects the recognition accuracy, as different actions have different durations, making it difficult to accurately identify the starting point of the video action.

[0058] The disadvantages of the second approach are: limited judgment capability, only able to identify some relatively obvious abnormal states, unable to identify some less obvious routine action combinations, and unable to extract the temporal features unique to the work behavior; the sampling ratio of images in the video stream is not high, making it difficult to locate the time points of typical actions and behaviors, resulting in low recognition accuracy.

[0059] The above shortcomings can be summarized as follows: Existing streaming media cannot directly interface with behavior recognition neural networks; video must be encoded and decoded before recognition can be performed, making streaming recognition impossible; existing behavior recognition technologies have limited video input, and cannot recognize longer behaviors because they cannot be directly stored in memory; existing neural networks cannot directly process raw streaming media data, but can only process decoded data; existing streaming media data behavior recognition works half on the central processing unit (CPU) for encoding and decoding, and half on the graphics processing unit (GPU) for behavior detection, with the two not sharing computational resources, resulting in a long recognition process; existing streaming media data cannot be streamed end-to-end, requiring multiple redundant calculations, resulting in high forwarding latency after streaming processing.

[0060] Figure 1 is a flowchart illustrating a behavior recognition method in the related art. As shown in Figure 1, the left side mainly performs calculations on the CPU. The CPU writes the camera data stream into memory and, according to the video encoding and decoding rules, first performs entropy decoding and run-length decoding on the data stream to obtain the discrete cosine transform data of the keyframe images. It then obtains the intra-coded frame (I-frame) image data through the inverse discrete cosine transform (IDCT). Next, it derives the forward-predicted frame (P-frame) and bidirectionally predicted frame (B-frame) data from the motion vector differences stored in the data stream and the I-frame data. Finally, it converts the YCbCr color-encoded data into the three-primary-color (red, green, blue; RGB) format to form multi-frame RGB data. The right side mainly performs computations on the GPU. A complete video is decomposed into multiple frames of RGB data by the CPU, which then transfers the RGB data from memory to the GPU's video memory. A series of neural network operations, such as two-dimensional (2D) convolutions, are performed on each frame to predict the position of the human body and its skeletal keypoints. Three-dimensional (3D) convolutions are performed on multiple frames, or a series of neural network operations are applied to the keypoint set. Finally, the actions in the video are determined, and the predicted categories, bounding boxes, keypoints, and other information are drawn on the original image. The multiple frames are then encoded to generate the video for display and playback. The drawback of this method is that intermediate resource information is not shared between the CPU and GPU computation parts. As shown at the junction of the left and right sections of Figure 1, the RGB frame set transcoded by the CPU is not transmitted directly to the GPU after CPU processing; instead, it is first stored in an RGB video format. The right side of Figure 1 then re-decodes it to obtain RGB frames, so the RGB frame set is not shared. Because the video stream generated from the RGB frame set occupies too much storage, the output video must be encoded, resulting in duplicate encoding and decoding for both the input and generated videos, and redundant conversions between the YCbCr and RGB formats. Computation already done on the original video is wasted and discarded. The motion vector information between preceding and subsequent frames that is available in the encoded video stream cannot be provided to the deep learning part, which must therefore compute from scratch. If the image remains unchanged, the deep learning part has no shortcut for making that determination. The method can only process video segments and cannot process streaming data in real time, so it cannot recognize behaviors with very long durations. It also cannot output streaming data in real time; it can only process and output video segments, resulting in significant latency for real-time rendering and playback, and the output media is limited.

[0061] Based on this, embodiments of the present invention provide a behavior recognition method, apparatus, device, and storage medium that can encode video streams for output display and playback, and achieve resource sharing during the process, i.e., real-time processing and real-time output.

[0062] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the specific technical solutions of the invention will be further described in detail below with reference to the accompanying drawings of the embodiments of the present invention. The following embodiments are used to illustrate the present invention, but are not intended to limit the scope of the present invention.

[0063] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.

[0064] Figure 2 is a flowchart illustrating another behavior recognition method provided in an embodiment of the present invention. As shown in Figure 2, the method includes:

[0065] S201: Obtain the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified; the first video frame is the video frame preceding the second video frame.

[0066] It should be noted that the positions of the first video frame and the second video frame in the video stream are not limited, as long as the first video frame is the video frame preceding the second video frame.

[0067] In this embodiment, obtaining the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified can be understood as performing deconvolution decoding on the video stream data corresponding to the behavior to be identified to obtain the initial motion vector difference between the first video frame and the second video frame.

[0068] S202: If the behavior to be identified meets the preset conditions, obtain the first image data corresponding to the behavior to be identified that meets the preset conditions.

[0069] In this embodiment, the behavior to be identified meets preset conditions, which can be understood as either a change exists between the first video frame and the second video frame, or a target object is detected in the first video frame and the second video frame. The target object can be a person.

[0070] The first image data can be understood as a multi-layer feature map corresponding to the condition that is met (such as detecting the target object in a video frame).

[0071] It should be noted that the order of the initial motion vector difference obtained in S201 and the first image data obtained in S202 is not limited here. As an example, the initial motion vector difference can be obtained first, followed by the first image data; or the first image data can be obtained first, followed by the initial motion vector difference.

[0072] S203: The initial motion vector difference and the first image data are fused to obtain target information.

[0073] In this embodiment, the fusion processing of the initial motion vector difference and the first image data to obtain target information can be understood as follows: the target information includes first target information and second target information; prediction processing of the initial motion vector difference and the first image data to obtain first target information; the first target information is used to predict the behavior to be identified; and / or convolution processing of the first image data to obtain second target information; the second target information is used to identify the behavior to be identified.

[0074] S204: Perform behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be recognized.

[0075] In this embodiment, the step of performing behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be recognized can be understood as follows: determining the target video stream in the video stream corresponding to the behavior to be recognized; determining the state value sequence parameters corresponding to the target video stream based on the second target information; dividing the target video stream into at least two windows according to the state value sequence parameters; performing sliding processing on the windows to obtain the sequence value corresponding to each window; and obtaining the recognition result corresponding to the behavior to be recognized based on the sequence value.
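
The window division and sliding described in [0075] can be sketched as follows. This is a simplified illustration under stated assumptions: the state value sequence is taken to be one numeric state per frame, the per-window sequence value is taken to be the window mean, and the final decision is a threshold on that value. The names `sliding_windows`, `window_values`, and `recognize`, and the parameters `size`, `stride`, and `threshold`, are hypothetical; the patent does not specify them.

```python
from typing import List


def sliding_windows(states: List[float], size: int, stride: int) -> List[List[float]]:
    """Split a per-frame state sequence into fixed-size windows that slide by stride."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [states[i:i + size] for i in range(0, len(states) - size + 1, stride)]


def window_values(states: List[float], size: int, stride: int) -> List[float]:
    """One sequence value per window; here simply the mean state (an assumption)."""
    return [sum(w) / len(w) for w in sliding_windows(states, size, stride)]


def recognize(states: List[float], size: int, stride: int, threshold: float) -> bool:
    """Report the behavior if any window's value reaches the threshold."""
    return any(v >= threshold for v in window_values(states, size, stride))
```

For example, `recognize([0, 0, 1, 1, 1, 0], size=3, stride=1, threshold=0.9)` evaluates four windows and reports the behavior because the third window consists entirely of active frames.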

[0076] It should be noted that the behavior recognition method provided in this embodiment of the invention is implemented using a decoding deconvolution module, a 2D convolution module, a generative convolution module, a motion prediction convolution module, and a sequence self-convolution module.

[0077] The behavior recognition method provided by this invention can process end-to-end video data streams in real time, achieve parameter sharing, and save computing resources.
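
The overall flow of steps S201 to S204 can be summarized in a short skeleton. It is only a structural sketch: the five stage functions passed in (`decode`, `meets_condition`, `extract_features`, `fuse`, `classify`) are hypothetical placeholders for the modules named above, and their signatures are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple


def recognize_stream(
    packets: Iterable[Any],
    decode: Callable[[Any], Tuple[Any, Any]],
    meets_condition: Callable[[Any, Any], bool],
    extract_features: Callable[[Any], Any],
    fuse: Callable[[Any, Any], Any],
    classify: Callable[[Any], Any],
) -> Iterator[Optional[Any]]:
    """Process a stream packet by packet and yield one result (or None) per packet."""
    for packet in packets:
        # S201: decode the stream data into an image and a motion vector difference.
        image, mv_diff = decode(packet)
        # S202: skip the expensive stages when the preset condition is not met.
        if not meets_condition(image, mv_diff):
            yield None
            continue
        features = extract_features(image)
        # S203: fuse the motion vector difference with the image features.
        target = fuse(mv_diff, features)
        # S204: recognize the behavior from the fused target information.
        yield classify(target)
```

Because the function is a generator, each result is available as soon as its packet has been processed, which mirrors the real-time, streaming intent of the method.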

[0078] In an optional embodiment of the present invention, obtaining the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified includes: performing deconvolution decoding on the video stream data corresponding to the behavior to be identified to obtain the initial motion vector difference between the first video frame and the second video frame.

[0079] It should be noted that performing deconvolution decoding on the video stream data corresponding to the behavior to be identified can also yield an image matrix.

[0080] In this embodiment, the deconvolution decoding process is completed by a decoding deconvolution module. The decoding deconvolution module parses the video data; inverting its weights yields the corresponding encoding convolution, which is used to encode the data at the output end. Decoding is performed on each frame of the video stream. Because convolution and deconvolution are inverse operations of each other, as are encoding and decoding, this embodiment of the invention uses deconvolution, i.e., an upsampling method, for decoding. The weights are determined by the preset encoding format and do not need to be trained. Figure 3 is a flowchart illustrating a decoding deconvolution module provided in an embodiment of the present invention. As shown in Figure 3, the video data stream is input into the decoding deconvolution layer to obtain the image matrix $I_{RGB}$ and the motion vector difference V.

[0081] For ease of understanding, a detailed explanation is provided here. Figure 4 is a schematic diagram of deconvolution-convolution coding provided in an embodiment of the present invention. As shown in Figure 4, the mapping from the data stream to the RGB image is implemented in the following order: input data stream layer, entropy decoding layer, discrete cosine transform layer, YCbCr transform layer, and RGB transform layer.

[0082] In the entropy decoding and run-length decoding layer, entropy decoding and run-length decoding need to be performed on the data stream according to the encoding rules. Therefore, this layer is a fixed mapping layer, and its mapping function is shown in equation (1) below:

[0083] $f(x_1, x_2, x_3, \dots, x_n) = (y_1, y_2, y_3, \dots, y_n), \quad x_n \to y_n \qquad (1)$

[0084] In equation (1), x and y are mapped one-to-one, and the image matrix I and motion vector difference V after the Discrete Cosine Transform (DCT) are obtained.
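
A fixed mapping layer of this kind has no trainable parameters: the output is fully determined by the coding rule. As a simplified illustration, the following sketch implements plain run-length decoding over (count, value) pairs together with its inverse. Real video codecs use more elaborate schemes (for example, zero-run and level pairs combined with entropy coding), so the pair format and the function names `run_length_decode` and `run_length_encode` are assumptions made for illustration only.

```python
from typing import List, Tuple


def run_length_decode(pairs: List[Tuple[int, int]]) -> List[int]:
    """Expand (count, value) pairs into the flat sequence they encode."""
    out: List[int] = []
    for count, value in pairs:
        if count < 0:
            raise ValueError("run length must be non-negative")
        out.extend([value] * count)
    return out


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Inverse mapping: collapse consecutive repeats into (count, value) pairs."""
    pairs: List[Tuple[int, int]] = []
    for v in values:
        if pairs and pairs[-1][1] == v:
            # Same value as the previous run: extend that run by one.
            pairs[-1] = (pairs[-1][0] + 1, v)
        else:
            pairs.append((1, v))
    return pairs
```

Since encoding and decoding are exact inverses here, applying one after the other returns the original data, which is the property the fixed mapping in equation (1) relies on.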

[0085] The inverse discrete cosine transform layer performs an inverse discrete cosine transform on the image matrix I to obtain a matrix Im with image format YCbCr, as shown in equations (2), (3) and (4) below:

[0086]

[0087] $Y = I * A \qquad (3)$

[0088] $\mathit{Im} = A^{T} * Y \qquad (4)$

[0089] In equation (2), A is the discrete cosine transformation matrix; in equation (3), I is the image and Y is the Y layer of the YCbCr format image; in equation (4), $A^{T}$ is the transpose of A, and Im is the matrix, with image format YCbCr, obtained by performing an inverse discrete cosine transform on I.
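
For reference, the following sketch builds the textbook orthonormal DCT-II matrix and applies the standard two-dimensional transform and its inverse to a square block. This is the commonly used form and is given as an assumption: the definition of A in equation (2) is not reproduced in this text, and the convention used in equations (3) and (4) may differ from the one shown here. All function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II matrix A of size n x n (textbook definition)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n)) for j in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (p x q) and b (q x r)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def dct2(block: Matrix) -> Matrix:
    """Forward 2D DCT of a square block: A * X * A^T."""
    a = dct_matrix(len(block))
    return matmul(matmul(a, block), transpose(a))


def idct2(coeffs: Matrix) -> Matrix:
    """Inverse 2D DCT of a square coefficient block: A^T * Y * A."""
    a = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(a), coeffs), a)
```

Because A is orthonormal, the inverse transform needs only the transpose of A, so the layer's weights are fixed by the transform definition and require no training.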

[0090] Image data can be compressed by calculating a matrix in the YCbCr image format, concentrating information in the low-frequency channel and compressing the high-frequency channel.

[0091] The YCbCr transform layer performs convolution calculations on the Im matrix to obtain $I_{RGB}$. The convolution weights are W and the bias is b, as shown in equations (5), (6), (7), (8), (9) and (10):

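[0092] $I_{RGB} = W * \mathit{Im} + b \qquad (5)$

[0093] $I_{RGB} = R + G + B \qquad (6)$

[0094] $\mathit{Im} = Y + \mathit{Cb} + \mathit{Cr} \qquad (7)$

[0095] $Y = 0.299R + 0.587G + 0.114B \qquad (8)$

[0096] $\mathit{Cr} = 0.713(R - Y) \qquad (9)$

[0097] $\mathit{Cb} = 0.564(B - Y) \qquad (10)$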

[0098] In equation (5), W represents the convolution weight and b represents the bias. The weight and bias data are fixed according to the international encoding and decoding standard and can be obtained without training. In equation (6), R represents red, G represents green, and B represents blue. In equation (7), Y represents the luminance component, Cb represents the blue chrominance component, and Cr represents the red chrominance component. After this layer, it is possible to perform unique deconvolution decoding on streaming data and convolution encoding on multiple frames of RGB.
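
To illustrate why W and b need no training, the following sketch derives them directly from equations (8), (9) and (10) by solving for R, G and B, and applies them to a single pixel as a 1x1 convolution. With the offset-free form of equations (8) to (10), the bias works out to zero; codecs that add an offset to the chroma channels would use a nonzero bias. The names `W`, `BIAS`, `rgb_to_ycbcr`, and `ycbcr_to_rgb` are illustrative.

```python
from typing import Tuple

KR, KG, KB = 0.299, 0.587, 0.114   # luminance weights from equation (8)
CR_SCALE, CB_SCALE = 0.713, 0.564  # chroma scales from equations (9) and (10)


def rgb_to_ycbcr(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Forward conversion, exactly as written in equations (8) to (10)."""
    y = KR * r + KG * g + KB * b
    return y, CB_SCALE * (b - y), CR_SCALE * (r - y)


# Fixed weights obtained by solving equations (8) to (10) for R, G and B.
# Each row maps an input pixel (Y, Cb, Cr) to one output channel.
W = [
    [1.0, 0.0, 1.0 / CR_SCALE],                               # R
    [1.0, -KB / (KG * CB_SCALE), -KR / (KG * CR_SCALE)],      # G
    [1.0, 1.0 / CB_SCALE, 0.0],                               # B
]
BIAS = [0.0, 0.0, 0.0]  # zero because equations (8) to (10) contain no offset


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> Tuple[float, float, float]:
    """Apply I_RGB = W * Im + b to one pixel (a 1x1 convolution)."""
    pixel = (y, cb, cr)
    r, g, b = (sum(w * v for w, v in zip(row, pixel)) + bias
               for row, bias in zip(W, BIAS))
    return r, g, b
```

Converting a pixel to YCbCr and back with these weights recovers the original RGB values up to floating-point rounding.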

[0099] It should be noted that the above decoding is performed entirely on the GPU, which can speed up the decoding process and allow the data to remain in video memory for a longer period of time.

[0100] In an optional embodiment of the present invention, the step of obtaining the first image data corresponding to the behavior to be identified that satisfies the preset conditions when the behavior to be identified satisfies the preset conditions includes: determining whether there is a change between the first video frame and the second video frame, or whether a target object is detected in the first video frame and the second video frame; and obtaining the first image data corresponding to the behavior to be identified that satisfies the preset conditions when there is a change between the first video frame and the second video frame, or when the target object is detected in the first video frame and the second video frame.

[0101] In this embodiment, whether there is a change between the first video frame and the second video frame can be understood as comparing the content in the first video frame and the content in the second video frame to determine whether there is a change in the video content. The target object can be understood as a person.

[0102] For ease of understanding, an example is given here: it is determined whether there is a change in the content of the first video frame and the content of the second video frame, or whether a person is detected in the first video frame and the second video frame. If there is a change in the content of the first video frame and the content of the second video frame, or a person is detected in the first video frame and the second video frame, the first image data corresponding to the behavior to be identified that meets the preset conditions is obtained.

[0103] It should be noted that if the content of the first video frame and the content of the second video frame are different but not significantly different, no behavior recognition will be performed.

[0104] By employing the technical solution of this invention, classification calculations are performed before behavior recognition, which can save computing resources.
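
The preset condition described above can be sketched as a simple gate, given here only as an illustration. The mean absolute pixel difference, the threshold value, and the names `mean_abs_diff` and `should_recognize` are assumptions made for this sketch; the embodiment does not specify how the change between frames is measured.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def should_recognize(frame_a, frame_b, person_detected, change_threshold=10.0):
    """True when the frames differ significantly OR a person is detected."""
    changed = mean_abs_diff(frame_a, frame_b) > change_threshold
    return changed or person_detected


first = [[10, 10], [10, 10]]
slightly_different = [[12, 10], [10, 10]]   # small change: below threshold
very_different = [[100, 100], [100, 100]]   # large change: above threshold
```

With these inputs, a slight difference alone does not trigger recognition, whereas a large difference or a detected person does.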

[0105] In an optional embodiment of the present invention, before acquiring the first image data corresponding to the behavior to be identified that satisfies the preset conditions when the behavior to be identified satisfies the preset conditions, the method further includes: acquiring second image data corresponding to the video stream data corresponding to the behavior to be identified; performing cropping processing on the second image data according to the initial motion vector difference to obtain a first cropping difference map corresponding to the behavior to be identified; performing splicing processing on the first cropping difference map and the initial motion vector difference to obtain a second cropping difference map; the second cropping difference map is the next frame difference map corresponding to the first cropping difference map; and determining the classification result corresponding to the behavior to be identified based on the second cropping difference map.

[0106] In this embodiment, obtaining the second image data corresponding to the video stream data corresponding to the behavior to be identified can be achieved by obtaining the image matrix corresponding to the video stream data using the deconvolution module, and generating the second image data from the image matrix using the 2D convolution module; the second image data is a multi-layer feature map corresponding to the video stream data corresponding to the behavior to be identified.

[0107] It should be noted that the first image data and the second image data are acquired in the same way.

[0108] The second image data is cropped based on the initial motion vector difference to obtain a first cropping difference map corresponding to the behavior to be identified; the first cropping difference map and the initial motion vector difference are concatenated to obtain a second cropping difference map; the second cropping difference map is the difference map of the next frame corresponding to the first cropping difference map; the classification result corresponding to the behavior to be identified is determined based on the second cropping difference map. This can be understood as follows: the second image data is cropped using the initial motion vector difference, the initial motion vector difference is used to locate the vector difference map based on the difference value, and the obtained cropping difference map and the vector difference map are concatenated to obtain the difference map of the next frame. The classification result is calculated using a normalized exponential function (soft version of max, softmax).

[0109] It should be noted that if the video content remains unchanged between consecutive frames, the value of the vector difference map is 0. Computing on zeros consumes very few computing resources, and the overall calculation yields a classification result of 0. If the video content changes, only the cropping difference map is calculated, which is much less computationally intensive than calculating the entire feature map.
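
The saving described above can be illustrated with a minimal sketch that crops a feature map to the region where the motion vector difference is non-zero. It assumes, for simplicity, that the motion vector difference has already been brought to the same resolution as the feature map; the names `motion_bounding_box` and `crop_by_motion` are hypothetical.

```python
def motion_bounding_box(mv_diff):
    """Return (top, left, bottom, right) around all non-zero entries, or None."""
    rows = [i for i, row in enumerate(mv_diff) if any(v != 0 for v in row)]
    if not rows:
        return None  # content unchanged: the difference map is all zeros
    cols = [j for row in mv_diff for j, v in enumerate(row) if v != 0]
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop_by_motion(feature_map, mv_diff):
    """Keep only the part of the feature map covered by motion."""
    box = motion_bounding_box(mv_diff)
    if box is None:
        return []  # nothing to compute for a static scene
    top, left, bottom, right = box
    return [row[left:right] for row in feature_map[top:bottom]]


feature_map = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
mv_diff = [[0, 0, 0, 0], [0, 3, 0, 0], [0, 0, -1, 0], [0, 0, 0, 0]]
cropped = crop_by_motion(feature_map, mv_diff)
static = crop_by_motion(feature_map, [[0] * 4 for _ in range(4)])
```

Here only a 2×2 region out of the 4×4 feature map is passed on, and a static scene yields an empty result.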

[0110] For ease of understanding, an example is provided here. 2D convolution is a common technique used by neural networks to extract object features from a single image. By employing multiple layers of convolution, the combined feature information can be used as the basis for judging the object category in the image, thus achieving the classification of objects in the image. Object categories include: people, vehicles, common objects, common animals, etc. 2D convolution can use a residual network (ResNet50) as the backbone network for feature map extraction. The feature maps extracted by 2D convolution can be directly classified using fully connected and softmax methods. Formula (11) is as follows:

[0111] $\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$ (11)

[0112] In equation (11), $z_i$ is the output value of the i-th node, and C is the number of output nodes, i.e., the number of categories. Since video stream data consists of more than 25 frames per second, related methods need to detect every frame of the video stream and combine the data from all frames to determine the behavior. In the field of surveillance video streams, if there is no corresponding behavior for a long time, the system will consume a lot of computing resources.
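
For illustration, the normalized exponential function (softmax) over the C output nodes can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical-stability measure and an addition of this sketch, not something stated in the embodiment.

```python
import math


def softmax(z):
    """Map C raw output values to C probabilities that sum to 1."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
predicted_class = probs.index(max(probs))   # index of the most likely category
```

The category with the largest output value receives the largest probability, so the classification result is the index of the maximum.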

[0113] In this embodiment of the invention, pre-classification calculation is performed. If the video does not change significantly or no person is detected, the subsequent behavior recognition calculation will not be performed, which greatly saves computing resources. Furthermore, the classification is based on traditional convolutional classification and incorporates the calculation of the difference between motion vectors before and after the video, sharing the prior calculation data of motion vectors.

[0114] Figure 5 is a flowchart of the 2D convolution module provided in an embodiment of the present invention. As shown in Figure 5, the image matrix $I_{RGB}$ obtained using the deconvolution module is first passed through 2D convolution (ResNet50) to obtain the initial multi-layer feature map. The motion vector difference is localized through interpolation to obtain the vector difference map. The resulting cropping difference map is concatenated with the vector difference map to obtain the difference map of the next frame. Convolution and softmax classification calculations are then performed on the difference map of the next frame.

[0115] In an optional embodiment of the present invention, the method further includes: sampling and deconvolution processing based on the first image data to obtain an image of at least one resolution corresponding to the first image data; each image of the resolution is used to reconstruct the first image data.

[0116] In this embodiment, a generative convolution module is used to complete sampling and deconvolution processing. Sampling and deconvolution are performed based on the first image data to obtain an image of at least one resolution corresponding to the first image data. Each image of a given resolution is used to reconstruct the first image data. This can be understood as obtaining multiple images of different resolutions based on the first image data, with each image of a given resolution used to enhance the resolution of the first image data. It is understood that the resolution of the image obtained through sampling and deconvolution processing is greater than the resolution of the first image data. For example, if the first image data is blurry, the technical solution of this embodiment can make the blurry image data clear.

[0117] For example, the image matrix corresponding to the first image data is subjected to 2D convolution to obtain a downsampled target multi-layer feature map. The target multi-layer feature map is then converted back into an image matrix through upsampling and deconvolution. Multiple image matrices with different resolutions form a multi-resolution I-frame pyramid. For example, the generative convolution module can reconstruct an image that is as similar as possible to the image generated by the decoding deconvolution module. The resolution can be greater than that of the original image to achieve a super-resolution effect.

[0118] For ease of understanding, an example is provided below. Figure 6 is a flowchart of the generative convolution module provided in an embodiment of the present invention. As shown in Figure 6, the multi-layer feature map generated by the 2D convolution module is passed through the generative convolution module to obtain a multi-resolution I-frame. Figure 7 is a schematic diagram of deconvolution provided in an embodiment of the present invention. As shown in Figure 7, in deconvolution the multi-layer feature map obtained using 2D convolution is treated as the output, and transposed convolution is used to obtain the input. After multiple operations, image matrices with resolutions different from that of the original image can be obtained. In the illustrated example, the input is 2×2, the convolution kernel is 3×3, and the output is 4×4. The multi-resolution images of the original image can be obtained by training with the sum of squared differences of multi-layer elements as the first loss function (S). S can be expressed by formula (12) as follows:

[0119] $S = \sum_{i}\sum_{j} (x_{ij} - y_{ij})^2$ (12)

[0120] In equation (12), $x_{ij}$ are the point values of the image matrix after deconvolution and $y_{ij}$ are the point values of the original image matrix; S is used as the loss measure. The smaller the loss, the more similar the reconstructed image is to the original image, which is how the multi-resolution images are trained.
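
The 2×2 input, 3×3 kernel, and 4×4 output mentioned above correspond to a transposed convolution with stride 1 and no padding, which is the assumption made in the following minimal sketch. The sketch also computes the sum of squared differences used as the first loss function S. The function names are hypothetical, and a trained module would learn the kernel rather than use the all-ones kernel shown here.

```python
def transposed_conv2d(inp, kernel):
    """Stride-1, unpadded transposed convolution: each input value scatters
    a scaled copy of the kernel into the output."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = [[0.0] * (in_w + k_w - 1) for _ in range(in_h + k_h - 1)]
    for i in range(in_h):
        for j in range(in_w):
            for u in range(k_h):
                for v in range(k_w):
                    out[i + u][j + v] += inp[i][j] * kernel[u][v]
    return out


def sum_squared_diff(x, y):
    """Loss S: sum over all points of the squared difference."""
    return sum((a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry))


small = [[1, 2], [3, 4]]                    # 2x2 input
ones = [[1, 1, 1] for _ in range(3)]        # 3x3 kernel (illustrative values)
upsampled = transposed_conv2d(small, ones)  # 4x4 output
```

The output side length is 2 + 3 − 1 = 4, and the loss is zero only when the reconstructed matrix equals the reference matrix point by point.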

[0121] In an optional embodiment of the present invention, the step of fusing the initial motion vector difference and the first image data to obtain target information includes: performing prediction processing on the initial motion vector difference and the first image data to obtain first target information; the first target information is used to predict the behavior to be identified; performing convolution processing on the first image data to obtain second target information; the second target information is used to identify the behavior to be identified.

[0122] It should be noted that the target information includes first target information and second target information; the first target information is obtained by performing prediction processing on the initial motion vector difference and the first image data; the first target information is used to predict the behavior to be identified; the second target information is obtained by performing convolution processing on the first image data; the second target information is used to identify the behavior to be identified. This can be understood as: performing prediction processing on the initial motion vector difference and the first image data to obtain the first target information; the first target information is used to predict the behavior to be identified; and / or, performing convolution processing on the first image data to obtain the second target information; the second target information is used to identify the behavior to be identified.

[0123] In this embodiment, behavior prediction and behavior recognition are independent of each other. Behavior prediction can be performed on the behavior to be identified, behavior recognition can be performed on the behavior to be identified, or behavior prediction and behavior recognition can be performed on the behavior to be identified at the same time. No limitation is made here.

[0124] In this embodiment, a motion prediction convolution module is used for prediction processing.

[0125] For ease of understanding, an example is given here. The data stream already includes the calculation of the motion vector difference between the images of the previous and next frames, and only the motion vector difference data is retained for data transmission. In related technical solutions, this feature is completely discarded, and the prediction of moving objects is handed over entirely to the neural network. In this embodiment of the invention, this feature is combined with the multi-layer feature map generated by 2D convolution, which reduces the use of computing resources, shortens the prediction time, and improves the accuracy of detection. Figure 8 is a schematic diagram of motion prediction convolution provided in an embodiment of the present invention. As shown in Figure 8, the motion vector difference obtained by the deconvolution module and the multi-layer feature map obtained by the 2D convolution module are combined, and the target box and key points are obtained through the motion prediction convolution module. Figure 9 is a schematic diagram of a motion prediction implementation process provided in an embodiment of the present invention. As shown in Figure 9, the intersection-over-union (IOU) of the motion difference map and the target box, along with the squared difference of key point locations, are used as the loss for regression, resulting in a more accurate motion prediction. The IOU represents the ratio of the sum of the moving part of the object and the background relative to the moving part after the motion to the area of the actual moving part of the object in the image. Its value ranges from 0 to 2, which allows overall normalization: the loss function is divided by 2 to bring the range to between 0 and 1. Since the motion vector difference and multi-layer feature maps have already been obtained, this step only requires motion prediction based on the motion vector difference, reducing the computational load of full-image prediction. The loss function (Loss) can be expressed by formula (13) as follows:

[0126]

[0127] In equation (13), IOU is the intersection-over-union of the area of the motion difference map and the area of the actual motion part, $x_i$ is the x-coordinate of the real key point, $x_j$ is the x-coordinate of the predicted key point, $y_i$ is the y-coordinate of the real key point, and $y_j$ is the y-coordinate of the predicted key point.
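
The two ingredients of the loss, an overlap ratio and the squared difference of key point coordinates, can be illustrated as follows. This sketch does not reproduce formula (13): it uses the conventional intersection-over-union of two axis-aligned boxes, whose range is 0 to 1, rather than the ratio with range 0 to 2 described above, and the combination in `illustrative_loss` (including the weight `kp_weight`) is purely an assumption.

```python
def box_iou(a, b):
    """Conventional IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keypoint_squared_error(real, predicted):
    """Sum of squared x and y differences between real and predicted points."""
    return sum((xr - xp) ** 2 + (yr - yp) ** 2
               for (xr, yr), (xp, yp) in zip(real, predicted))


def illustrative_loss(iou, kp_error, kp_weight=1.0):
    """Assumed combination: poor overlap and key point error both add loss."""
    return (1.0 - iou) + kp_weight * kp_error


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
kp_error = keypoint_squared_error([(0, 0), (1, 1)], [(1, 0), (1, 3)])
```

A perfect prediction (full overlap and identical key points) gives zero loss under this assumed combination.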

[0128] It should be noted that the vector difference between the multi-resolution I-frames generated by the generative convolution module and the source video data cannot be directly merged. It is necessary to adapt the vector difference size according to the resolution of the I-frame. The vector difference obtained from the input sharing of the motion prediction convolution is combined with the generative convolution to predict the B / P frame data, that is, to generate the B / P motion vector difference, and to realize super-resolution B / P.
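
One possible way to adapt the vector difference to a higher-resolution I-frame is sketched below. The embodiment does not state how the adaptation is performed; nearest-neighbour replication with an integer scale factor, and multiplying each displacement by that factor, are assumptions of this sketch, as is the name `scale_motion_vectors`.

```python
def scale_motion_vectors(mv_field, scale):
    """Upsample a grid of (dx, dy) vectors by an integer factor.

    Each vector is replicated over a scale x scale block, and its
    displacement is multiplied by the same factor so that it still
    points to the corresponding location in the larger frame.
    """
    out = []
    for row in mv_field:
        scaled_row = []
        for dx, dy in row:
            scaled_row.extend([(dx * scale, dy * scale)] * scale)
        out.extend([list(scaled_row) for _ in range(scale)])
    return out


low_res = [[(1, -2), (0, 3)]]               # 1x2 grid of motion vectors
high_res = scale_motion_vectors(low_res, 2)  # 2x4 grid for a 2x larger frame
```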

[0129] The technical solution of this invention uses a motion prediction convolutional module that shares the motion vector difference generated by the decoding deconvolution module for motion prediction. By fusing the motion vector difference with the feature map, the image search range can be reduced, the amount of computation can be reduced, and the efficiency of behavior prediction can be improved.

[0130] In an optional embodiment of the present invention, the step of performing behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be recognized includes: determining a target video stream in the video stream corresponding to the behavior to be recognized; determining a state value sequence parameter corresponding to the target video stream based on the second target information; dividing the target video stream into at least two windows according to the state value sequence parameter, performing sliding processing on the windows to obtain a sequence value corresponding to each window; and obtaining the recognition result corresponding to the behavior to be recognized based on the sequence value.

[0131] It should be noted that the sequence self-convolution module is used for behavior recognition.

[0132] In this embodiment, determining the target video stream in the video stream to be identified can be understood as arbitrarily selecting a segment of the video stream to be identified. The selection can be random, and the length of the selected video stream is not limited.

[0133] The step of determining the state value sequence parameter corresponding to the target video stream based on the second target information can be understood as obtaining the multi-layer feature maps generated by the 2D convolution module based on the second target information, and sharing those feature maps with the sequence self-convolution module to generate a sequence of state values corresponding to the multi-layer feature maps. The state value sequence parameter can be understood as the set of values generated by sequence self-convolution over the target video stream.

[0134] The step of dividing the target video stream into at least two windows based on the state value sequence parameters and performing sliding processing on the windows to obtain the sequence value corresponding to each window can be understood as dividing the state value sequence into at least two windows, where each sequence in the state value sequence corresponds to a window, fixing one end, and sliding from the fixed end to the other end. During the sliding process, the sequence values in the previous window are merged into the next sequence, and the position and number of the sequence values corresponding to the window are recorded during the sliding process.

[0135] The process of obtaining the recognition result corresponding to the behavior to be identified based on the sequence value can be understood as follows: for a specified category, there is a specified sequence value length. The length can be calculated to obtain the value of the window, and then the behavior category can be obtained by reverse calculation of the value.

[0136] For ease of understanding, an example is given here. Related solutions only recognize and classify actions and behaviors for a limited amount of video data. To achieve behavior classification for an unlimited amount of streaming data, this embodiment uses sequence self-convolution without resetting weights. The data is convolved with itself to recognize and classify behaviors in an online learning manner, which solves the problem of behavior recognition over a very long time span and avoids the need to retrain offline when adding categories.

[0137] Figure 10 is a flowchart of the sequence self-convolution module provided in an embodiment of the present invention. As shown in Figure 10, the multi-layer feature maps obtained by the 2D convolution module are passed through the sequence self-convolution module to obtain the state space. The sequence self-convolution process involves convolving the multi-layer feature maps with the parameters of the feature maps themselves as the convolution kernel to generate the state values of the corresponding images. The state values form the state space, and a fixed sequence of state values is defined as a specified behavior. By matching the state value sequence through a sliding window, the category of the behavior can be quickly obtained. The state calculation can be expressed by formula (14) as follows:

[0138] $W_i = W_s - W_n$ (14)

[0139] In equation (14), $W_i$ is the weight that the self-convolution module is to calculate for the current sequence, $W_s$ is the current module weight, and $W_n$ is the self-convolution weight of the previous sequence.

[0140] After sequential self-convolution, the weights can be updated for each frame of the image. The model is continuously updated and optimized through online learning. By sharing the weight parameters of the sequential self-convolution module, the category of the frame object can be identified using fully connected regression, thus achieving regression classification.

[0141] Suppose a video stream sequence is {{I│B / P}}. After self-convolution, it generates a set of values, i.e., a sequence of state values U∈(a1,a2,a3,…,an), where a1,a2,a3 represent a certain action sequence. Using the sequence timestamp as the index and log2 of the number of sequence values as the merging rule, the sequence values of ultra-long video streams can be registered, thereby achieving a certain degree of recognition of ultra-long-duration actions. That is, the action is decomposed into sequence values, and the number of sequence values in the sequence is recorded with base 2. Figure 11 is a schematic diagram of a multi-level sliding window provided in an embodiment of the present invention. As shown in Figure 11, the value below a1 is 1, and the value below the 0 closest to a1 is 2. After the window slides to the left, the value below a2 in Figure 11 changes to 2. After the window continues sliding to the left, a1, 0, and a2 are merged in Figure 11: the value below the merged entry becomes 1, the value below a3 becomes 2, and the value below a4 becomes 3. Sliding can continue in sequence to perform behavior recognition on ultra-long video streams. A sliding window with multiple fixed levels can thus realize behavior recognition of streaming data. Since the two sequence values within the window are merged, only the position containing two sequence values needs to be calculated. Formula (15) is as follows:

[0142]

[0143] In equation (15), N is the number of sequence values contained at the window position, w is the number of sequence values within the window, and n is the total number of state value sequences.

[0144] The behavior category determination is based on a specified sequence value length for a given category. The length can be calculated to obtain the window value, and then the part in the window is matched. For example, if we need to identify the action of the sequence value a1, a2, a3, we can calculate that the sequence is equal to 2^0 + 2^1. We can set the sliding window to 1, first take the window value to determine if it is a3, and then take two more sequence values to determine if it is a2, a1. This is how the behavior category is determined.
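
The idea that a fixed sequence of state values defines a behavior, and that a sliding window locates it in the stream, can be illustrated with the simplified sketch below. It performs plain single-level window matching only and does not implement the base-2 merging of the multi-level window; the behavior names `"wave"` and `"fall"` and both function names are hypothetical.

```python
def find_behavior(state_values, pattern):
    """Return every start index at which the pattern occurs in the stream."""
    w = len(pattern)  # window length equals the pattern's sequence length
    return [i for i in range(len(state_values) - w + 1)
            if state_values[i:i + w] == pattern]


def classify_stream(state_values, behaviors):
    """Match each named behavior pattern and return sorted (index, name) hits."""
    hits = []
    for name, pattern in behaviors.items():
        hits.extend((i, name) for i in find_behavior(state_values, pattern))
    return sorted(hits)


stream = ["a1", 0, "a2", "a3", "a1", "a2", "a3"]
behaviors = {"wave": ["a1", "a2", "a3"], "fall": ["a2", "a3"]}
hits = classify_stream(stream, behaviors)
```

In this stream the three-value pattern is found only where a1, a2, and a3 appear consecutively, which mirrors the statement that a category is tied to a specified sequence value length.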

[0145] The technical solution of this invention can achieve behavior recognition over a very long time span through a multi-level sliding window.

[0146] In the above embodiments, parameters can be shared, saving serial computation time and outputting different processed data. These data can be combined in different ways through sharing, resulting in diversified output of streaming data for different needs, greatly reducing latency. Fusing the parameters can output super-resolution video streams, tracking region mapping and rendering video streams, target information mapping and rendering video streams, and behavior recognition mapping and rendering video streams.

[0147] Super-resolution video stream: The I-frames output by the generating convolution module and the B / P frames output by the motion vector difference shared by the motion prediction convolution are combined. Through the reverse process of the decoding deconvolution module in step 1, a super-resolution video stream can be generated, which can perform high-definition transcoding of low-resolution video in real time.

[0148] Tracking region drawing and rendering video stream: Based on the super-resolution video stream, the motion prediction convolution predicts the target box and key points, which can capture the motion part. Through the reverse process of the decoding deconvolution module in step 1, only the changed parts can be updated, realizing low-throughput motion monitoring.

[0149] Target information drawing and rendering video stream: Based on the video stream of the tracking region drawing and rendering, the classification and recognition results of the frames generated by the convolution and sequence self-convolution modules are drawn to obtain the target information drawing and rendering video stream, which can realize the real-time recognition of pedestrians, vehicles and common objects in the video.

[0150] Behavior recognition mapping and video stream rendering: Based on the target information mapping and video stream rendering, the sequence self-convolution is used to determine the behavior category, so as to achieve the purpose of outputting behavior recognition results in real time on the video stream and to perform real-time behavior discrimination during the video stream operation.

[0151] For ease of understanding, the behavior recognition method of this invention is described in detail below. Figure 12 is a specific flowchart of behavior recognition provided in an embodiment of the present invention. As shown in Figure 12, the decoding deconvolution module first decodes single-frame data from the camera data stream. A 2D convolution module extracts image features and performs simple object classification, providing a shortcut filter for the video stream data. Frames without objects are excluded from the calculation; for example, if a frame is determined to be empty, it can be skipped, reducing computational waste. The image features extracted in the 2D convolution module are shared with the next layer's generative convolution, motion prediction convolution, and sequence self-convolution modules. The generative convolution reconstructs an image as similar as possible to the output of the decoding deconvolution, with a higher resolution to achieve super-resolution. The motion prediction convolution uses motion vector data from the video stream to predict changes in the image, reducing the scope of the traditional full-image search and obtaining bounding boxes and keypoints. The sequence self-convolution module is a long-term, non-reset module used to perform cumulative convolution on each frame to identify behavioral information and determine the behavior category. In the next layer, combining I-frames and B / P-frames yields a super-resolution video stream. Combining this super-resolution video stream with bounding boxes and keypoints outputs a tracking region rendering video stream. Combining this tracking region rendering video stream with the results of classifying frame objects using fully connected regression yields a target information rendering video stream. Combining this target information rendering video stream with behavior categories results in a behavior recognition rendering video stream that outputs behavior categories in real time on the video stream. Encoding the video stream allows for output and playback. Because this solution shares computational resources, real-time processing and output are possible.
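
The sharing of one set of extracted features among several downstream modules, together with the shortcut filter for empty frames, can be sketched as follows. All module names and the stub implementations are hypothetical placeholders chosen for this sketch; they stand in for the 2D convolution, generative convolution, motion prediction convolution, and sequence self-convolution modules and do not reflect their actual computations.

```python
def process_frame(frame, mv_diff, modules):
    """Run one decoded frame through a gated, feature-sharing pipeline."""
    features = modules["conv2d"](frame)            # computed once per frame
    if not modules["gate"](features, mv_diff):     # shortcut filter
        return None                                # skip downstream work
    return {
        "super_resolution": modules["generate"](features),
        "boxes_keypoints": modules["predict_motion"](features, mv_diff),
        "state_value": modules["self_conv"](features),
    }


# Stub modules for demonstration only; a counter records feature extraction.
call_count = {"conv2d": 0}


def _stub_conv2d(frame):
    call_count["conv2d"] += 1
    return [2 * v for v in frame]


demo_modules = {
    "conv2d": _stub_conv2d,
    "gate": lambda features, mv_diff: any(v != 0 for v in mv_diff),
    "generate": lambda features: features + features,
    "predict_motion": lambda features, mv_diff: [i for i, v in enumerate(mv_diff) if v != 0],
    "self_conv": lambda features: sum(f * f for f in features),
}

moving = process_frame([1, 2, 3], [0, 5, 0], demo_modules)
static = process_frame([1, 2, 3], [0, 0, 0], demo_modules)
```

The three downstream results are all derived from a single feature extraction per frame, and a frame that fails the gate produces no downstream computation at all.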

[0152] The technical solutions of this invention can be widely applied in real life. For example, in routine monitoring areas in rural areas such as streets, squares, and shop entrances, the technical solutions of this invention can be used to identify behaviors in scenes, promptly reporting dangerous situations in public areas such as fighting, smoking, and loitering; in monitoring various dangerous areas such as riverbanks and factory areas, the technical solutions of this invention can provide timely warnings of abnormal events such as falls, flooding, and smoking; identify falls among young and elderly people in rural areas, realizing humanistic care in rural revitalization; and live-stream rural scenes using streaming technology, facilitating external communication and supervision, promoting live-stream sales, remote control, and contributing to rural prosperity.

[0153] Based on the same inventive concept as described above, Figure 13 is a schematic diagram of a behavior recognition device provided in an embodiment of the present invention. The device 1300 includes a first acquisition unit 1301, a second acquisition unit 1302, a processing unit 1303, and a recognition unit 1304, wherein...

[0154] The first acquisition unit 1301 is used to acquire the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified; the first video frame is the video frame preceding the second video frame.

[0155] The second acquisition unit 1302 is used to acquire first image data corresponding to the behavior to be identified that satisfies the preset conditions when the behavior to be identified satisfies the preset conditions.

[0156] The processing unit 1303 is used to fuse the initial motion vector difference and the first image data to obtain target information.

[0157] The recognition unit 1304 is used to perform behavior recognition on the target information and obtain the recognition result corresponding to the behavior to be identified.

[0158] In some embodiments, the first acquisition unit 1301 is further configured to perform deconvolution decoding on the video stream data corresponding to the behavior to be identified, and obtain the initial motion vector difference between the first video frame and the second video frame.

[0159] In some embodiments, the apparatus further includes a determination unit, configured to determine whether there is a change between the first video frame and the second video frame, or whether a target object is detected in the first video frame and the second video frame; and to acquire first image data corresponding to the behavior to be identified that satisfies the preset conditions if there is a change between the first video frame and the second video frame, or if the target object is detected in the first video frame and the second video frame.

[0160] In some embodiments, the apparatus 1300 further includes a determining unit, configured to acquire second image data corresponding to the video stream data corresponding to the behavior to be identified; perform cropping processing on the second image data according to the initial motion vector difference to obtain a first cropping difference map corresponding to the behavior to be identified; perform splicing processing on the first cropping difference map and the initial motion vector difference to obtain a second cropping difference map; the second cropping difference map is the next frame difference map corresponding to the first cropping difference map; and determine the classification result corresponding to the behavior to be identified based on the second cropping difference map.

[0161] In some embodiments, the apparatus further includes a sampling and deconvolution processing unit for performing sampling and deconvolution processing based on the first image data to obtain an image of at least one resolution corresponding to the first image data; each image of the first resolution is used to reconstruct the first image data.

[0162] In some embodiments, the processing unit 1303 is further configured to perform prediction processing on the initial motion vector difference and the first image data to obtain first target information; the first target information is used to predict the behavior to be identified; the first image data is convolved to obtain second target information; the second target information is used to identify the behavior to be identified.

[0163] In some embodiments, the recognition unit 1304 is further configured to: determine a target video stream in the video stream to be identified; determine a state value sequence parameter corresponding to the target video stream based on the second target information; divide the target video stream into at least two windows according to the state value sequence parameter; perform sliding processing on the windows to obtain a sequence value corresponding to each window; and obtain a recognition result corresponding to the behavior to be identified based on the sequence value.

[0164] It should be noted that the behavior recognition device provided in the embodiments of the present invention and the behavior recognition method provided in the aforementioned embodiments of the present invention belong to the same inventive concept. The meanings of the terms appearing here have been explained in detail above and will not be repeated here.

[0165] This invention also provides a storage medium storing a computer program thereon. When the computer program is executed by a processor, it implements the steps of the above-described method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0166] This invention also provides a behavior recognition device, which includes a processor and a memory for storing a computer program that can run on the processor, wherein when the processor runs the computer program, it executes the steps of the method embodiments described above stored in the memory.

[0167] Figure 14 is a schematic diagram of a hardware structure of a behavior recognition device according to an embodiment of the present invention. The behavior recognition device 1400 includes at least one processor 1401 and a memory 1402. Optionally, the behavior recognition device 1400 may further include at least one communication interface 1403. The various components in the behavior recognition device 1400 are coupled together through a bus system 1404. It can be understood that the bus system 1404 is used to realize the connection and communication between these components. In addition to a data bus, the bus system 1404 also includes a power bus, a control bus, and a status signal bus. However, for the sake of clarity, all buses are designated as the bus system 1404 in Figure 14.

[0168] It is understood that the memory 1402 can be volatile memory or non-volatile memory, or can include both. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memory 1402 described in this embodiment of the invention is intended to include, but is not limited to, these and any other suitable types of memory.

[0169] The memory 1402 in this embodiment of the invention is used to store various types of data to support the operation of the behavior recognition device 1400. Examples of such data include any computer program to be run on the behavior recognition device 1400; programs implementing the methods of this embodiment of the invention may be included in the memory 1402.

[0170] The methods disclosed in the above embodiments of the present invention can be applied to or implemented by processor 1401. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by instructions in software form. The processor may be a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The processor can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of the present invention can be directly manifested as execution by a hardware decoding processor, or execution by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium, which is located in memory. The processor reads information from the memory and, in conjunction with its hardware, completes the steps of the aforementioned method.

[0171] In an exemplary embodiment, the behavior recognition device 1400 may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers (MCUs), microprocessors, or other electronic components to perform the methods described above.

[0172] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0173] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0174] In addition, in the various embodiments of the present invention, each functional unit can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0175] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0176] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0177] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A behavior recognition method, characterized in that the method includes:
obtaining the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified, the first video frame being the video frame preceding the second video frame;
if the behavior to be identified meets the preset conditions, obtaining the first image data corresponding to the behavior to be identified that meets the preset conditions;
fusing the initial motion vector difference and the first image data to obtain target information; and
performing behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be identified;
wherein the step of performing behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be identified includes:
determining the target video stream from the video stream corresponding to the behavior to be identified;
determining, based on the second target information, the state value sequence parameters corresponding to the target video stream; specifically, based on the second target information, obtaining a multi-layer feature map generated by a two-dimensional (2D) convolution module, and sharing the multi-layer feature map generated by the 2D convolution module with a sequence self-convolution module to generate the state value sequence parameters corresponding to the multi-layer feature map, the state value sequence parameters being a set of values generated by sequence self-convolution in the target video stream;
dividing the target video stream into at least two windows according to the state value sequence parameters, and performing sliding processing on the at least two windows to obtain the sequence value corresponding to each window; and
obtaining the recognition result corresponding to the behavior to be identified based on the sequence value.

2. The method according to claim 1, characterized in that the step of obtaining the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified includes: performing deconvolution decoding on the video stream data corresponding to the behavior to be identified to obtain the initial motion vector difference between the first video frame and the second video frame.

3. The method according to claim 1, characterized in that the step of obtaining, if the behavior to be identified meets the preset conditions, the first image data corresponding to the behavior to be identified that meets the preset conditions includes: determining whether there is a change between the first video frame and the second video frame, or whether a target object is detected in the first video frame and the second video frame; and if there is a change between the first video frame and the second video frame, or if the target object is detected in the first video frame and the second video frame, obtaining the first image data corresponding to the behavior to be identified that meets the preset conditions.

4. The method according to claim 1, characterized in that, before obtaining the first image data corresponding to the behavior to be identified that meets the preset conditions, the method further includes: obtaining the second image data corresponding to the video stream data corresponding to the behavior to be identified; cropping the second image data based on the initial motion vector difference to obtain the first cropping difference map corresponding to the behavior to be identified; splicing the first cropping difference map and the initial motion vector difference to obtain a second cropping difference map, the second cropping difference map being the next frame difference map corresponding to the first cropping difference map; and determining the classification result corresponding to the behavior to be identified based on the second cropping difference map.

5. The method according to claim 1, characterized in that the method further includes: performing sampling and deconvolution processing based on the first image data to obtain an image of at least one resolution corresponding to the first image data, each image of the at least one resolution being used to reconstruct the first image data.

6. The method according to claim 1, characterized in that the step of fusing the initial motion vector difference and the first image data to obtain target information includes: performing prediction processing on the initial motion vector difference and the first image data to obtain first target information, the first target information being used to predict the behavior to be identified; and performing convolution processing on the first image data to obtain second target information, the second target information being used to perform behavior recognition on the behavior to be identified.

7. A behavior recognition device, characterized in that the device includes a first acquisition unit, a second acquisition unit, a processing unit, and an identification unit, wherein:
the first acquisition unit is used to obtain the initial motion vector difference between the first video frame and the second video frame based on the video stream data corresponding to the behavior to be identified, the first video frame being the video frame preceding the second video frame;
the second acquisition unit is used to obtain, if the behavior to be identified meets the preset conditions, the first image data corresponding to the behavior to be identified that meets the preset conditions;
the processing unit is used to fuse the initial motion vector difference and the first image data to obtain target information;
the identification unit is used to perform behavior recognition on the target information to obtain the recognition result corresponding to the behavior to be identified;
the identification unit is further configured to:
determine the target video stream from the video stream corresponding to the behavior to be identified;
determine, based on the second target information, the state value sequence parameters corresponding to the target video stream; specifically, based on the second target information, obtain a multi-layer feature map generated by a two-dimensional (2D) convolution module, and share the multi-layer feature map generated by the 2D convolution module with a sequence self-convolution module to generate the state value sequence parameters corresponding to the multi-layer feature map, the state value sequence parameters being a set of values generated by sequence self-convolution in the target video stream;
divide the target video stream into at least two windows according to the state value sequence parameters, and perform sliding processing on the at least two windows to obtain the sequence value corresponding to each window; and
obtain the recognition result corresponding to the behavior to be identified based on the sequence value.

8. A storage medium, characterized in that the storage medium stores a computer program; when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

9. A behavior recognition device, characterized in that the behavior recognition device includes a processor and a memory for storing a computer program capable of running on the processor, wherein the processor, when running the computer program, performs the steps of the method according to any one of claims 1 to 6.