Action recognition method, apparatus, computer program product, and terminal device
By dividing and recombining image blocks in video sequences, and combining deep learning and support vector machines, an action recognition classifier is constructed, which solves the problem of poor robustness of existing methods and achieves higher action recognition accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
- Filing Date
- 2024-12-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing action recognition methods are easily affected by background and noise, resulting in poor robustness.
By dividing and recombining image blocks in the video sequence to be recognized, and combining deep learning and support vector machines, an action recognition classifier is constructed, integrating spatial and temporal information to improve recognition accuracy and robustness.
It improves the robustness and accuracy of action recognition, enabling better identification of actions in videos.
Figure CN122244938A_ABST
Description
Technical Field
[0001] This application belongs to the field of action recognition technology, and in particular relates to an action recognition method, apparatus, computer program product, and terminal device.
Background Technology
[0002] Action recognition typically refers to identifying various actions and behaviors in videos, such as fighting, abnormal running, whispering, and littering. It is crucial for understanding video content and is therefore widely used in fields such as public security policing, urban governance, and general artificial intelligence video understanding. However, existing action recognition methods, such as dense trajectory algorithms, are easily affected by background and noise, resulting in poor robustness. A more robust action recognition method is therefore urgently needed.
Summary of the Invention
[0003] In view of this, embodiments of this application provide an action recognition method, apparatus, computer program product, and terminal device to solve the problem of low robustness of existing action recognition methods.
[0004] A first aspect of this application provides an action recognition method, which may include:
[0005] Obtain the video sequence to be identified; wherein, the video sequence to be identified includes each image frame to be identified;
[0006] The image frame to be identified is divided into image blocks according to each division scale to obtain each image block; wherein, each image block of the image frame to be identified corresponds to one of the division scales;
[0007] Reconstruct the image blocks corresponding to each of the image frames to be identified to obtain the reconstructed image frames;
[0008] Action recognition is performed based on the reconstructed image frames to obtain the action recognition result of the video sequence to be recognized.
[0009] A second aspect of the embodiments of this application provides an action recognition device, which may include:
[0010] An acquisition module is used to acquire a video sequence to be identified; wherein, the video sequence to be identified includes each image frame to be identified;
[0011] A division module is used to divide the image frame to be identified into image blocks according to each division scale to obtain each image block; wherein, each image block of the image frame to be identified corresponds to one of the division scales;
[0012] The reconstructing module is used to reconstruct the image blocks corresponding to each of the image frames to be identified, thereby obtaining reconstructed image frames;
[0013] The recognition module is used to perform action recognition based on the reconstructed image frames to obtain the action recognition result of the video sequence to be recognized.
[0014] A third aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the above-described action recognition methods.
[0015] A fourth aspect of this application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the terminal device implements the steps of any of the above-described action recognition methods.
[0016] A fifth aspect of this application provides a computer program product, including a computer program, which, when run, causes any of the above-described action recognition methods to be executed.
[0017] Compared with the prior art, the embodiments of this application have the following beneficial effects. A video sequence to be identified is obtained, where the video sequence to be identified includes image frames to be identified; each image frame to be identified is divided into image blocks according to each division scale, where each image block of an image frame to be identified corresponds to one division scale; the image blocks corresponding to the image frames to be identified are recombined to obtain recombined image frames; and action recognition is performed based on the recombined image frames to obtain the action recognition result of the video sequence to be identified. In this way, multiple image blocks of different scales are obtained for each image frame to be identified, and the image blocks of different scales from the image frames are recombined into a recombined image frame that integrates image blocks of multiple scales. The rich information of the video sequence in the spatial domain (i.e., the multiple scales) and in the temporal domain (i.e., the individual image frames to be identified in the video sequence) is thereby integrated into one image frame, which provides richer and more detailed information for action recognition and helps improve its robustness and accuracy.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the process of obtaining a sample video sequence;
[0020] Figure 2 is a schematic diagram of the image block division process;
[0021] Figure 3 is a first schematic diagram of the image block region reconstruction process;
[0022] Figure 4 is a first schematic diagram of the image block reconstruction process;
[0023] Figure 5 is a second schematic diagram of the image block reconstruction process;
[0024] Figure 6 is a third schematic diagram of the image block reconstruction process;
[0025] Figure 7 is a second schematic diagram of the image block region reconstruction process;
[0026] Figure 8 is a flowchart of an embodiment of an action recognition method according to this application;
[0027] Figure 9 is a structural diagram of an embodiment of an action recognition apparatus according to this application;
[0028] Figure 10 is a schematic block diagram of a terminal device in an embodiment of this application.
Detailed Implementation
[0029] To make the inventive objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the embodiments described below are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0030] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
[0031] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of the application. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0032] It should also be further understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0033] As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [the described condition or event] is detected," or "in response to detection of [the described condition or event]."
[0034] Furthermore, in the description of this application, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0035] Action recognition typically refers to recognizing various actions and behaviors in videos, such as fighting, running abnormally, whispering, and littering. This is of great significance for understanding video content and is therefore widely used in various fields such as public security policing, urban governance, and general artificial intelligence video understanding.
[0036] For example, in the field of public security and policing, action recognition can analyze surveillance video in real time and quickly identify potential criminal behavior or emergencies, thereby improving response speed and processing efficiency.
[0037] For example, in the field of urban governance, action recognition can monitor abnormal activities in public places, respond to emergencies in a timely manner, and ensure public safety.
[0038] For example, in the field of general artificial intelligence video understanding, action recognition is the foundation for realizing advanced video analysis functions; action recognition enables machines to understand video content, laying the technical foundation for applications such as content recommendation and interactive entertainment.
[0039] In action recognition, dense trajectory algorithms (DT) are commonly used. Specifically, this algorithm first uses a grid-based approach to densely sample feature points across multiple scales in an image frame. Then, it calculates the optical flow value within the neighborhood of each feature point to determine its motion direction, thus tracking its trajectory and enabling the identification of movements with abnormal amplitude. However, this method is susceptible to background noise, resulting in poor robustness in action recognition. Therefore, a more robust action recognition method is urgently needed.
[0040] It should be noted that the method in this application is executed by a terminal device, which can be a common computing device such as a desktop video surveillance system, camera equipment, computer, laptop, handheld computer, workstation, or mobile phone, or another computing device.
[0041] In this embodiment of the application, deep learning methods can be used to identify the action to be identified; wherein, the action to be identified can be any common action, including but not limited to fighting, abnormal running, falling, whispering, littering, climbing, etc., and this embodiment of the application does not limit it.
[0042] Specifically, the image frame after feature extraction can be input into the action recognition classifier of this application embodiment; wherein, the action recognition classifier of this application embodiment is a classifier used to identify the action to be identified.
[0043] Before using the action recognition classifier for action recognition, the initial classifier needs to be trained to obtain the action recognition classifier in this embodiment of the application. The training process of the action recognition classifier in this embodiment of the application will be described in detail below.
[0044] Since classifier training typically requires a significant investment of manpower for data collection and algorithm optimization, this application embodiment can use as few training samples as possible during training in order to complete classifier training efficiently. Therefore, the initial classifier in this application embodiment can be a classifier that supports training with few samples. Preferably, the initial classifier can be a Support Vector Machine (SVM).
[0045] Before training the initial classifier, a training sample set for training can be constructed first.
[0046] Specifically, a preset number of training samples can be obtained, and a training sample set can be constructed based on each training sample and its corresponding real label. The specific number of training samples in the training sample set can be set according to actual needs, and this application does not limit it.
[0047] Since the classifier training process in this embodiment uses as few training samples as possible, the amount of information contained in each training sample can be increased to make each training sample contain richer information, thereby ensuring that the trained classifier can perform accurate and robust inference.
[0048] In this embodiment, the video sequence can be divided into multi-scale image blocks to obtain spatial information of the video sequence. Then, the image blocks corresponding to each image frame can be reassembled into a single integrated image frame to obtain temporal information of the video sequence. This effectively integrates the spatial and temporal information of the video sequence to obtain an integrated image frame with higher information density. Each integrated image frame can be used as a sample to obtain individual samples. Since each sample contains rich information, a classifier with good classification performance can be trained using fewer samples.
[0049] The following will use any training sample as an example to describe in detail the process of obtaining training samples in the embodiments of this application.
[0050] In this embodiment of the application, a video sequence can be used as an analysis segment. When constructing training samples, the obtained video sequence can be recorded as a sample video sequence, which may include various sample image frames. The specific number of sample image frames included in the sample video sequence can be set according to actual needs, and this embodiment of the application does not limit this.
[0051] For example, a sample video sequence may include N sample image frames, where N can be a value from 4 to 32; here, the value of N can be 16, that is, the sample video sequence may include 16 sample image frames.
[0052] In this embodiment, N sample image frames in the sample video sequence can be extracted from the original video sequence according to the extraction interval. The extraction interval can be set according to actual needs, and this embodiment does not limit it.
[0053] For example, the extraction interval can be 0, that is, N consecutive image frames are extracted from the original video sequence without interval, and the extracted N consecutive image frames can be determined as N sample image frames in the sample video sequence.
[0054] For example, the extraction interval can be 3, that is, one image frame is extracted every 3 image frames in the original video sequence until N image frames are extracted, and the N extracted image frames can be determined as N sample image frames in the sample video sequence.
[0055] As an example, referring to Figure 1, the original video sequence can include 16 image frames. If N is 4 and the extraction interval is 3, then the 4th, 8th, 12th, and 16th image frames in the original video sequence are extracted, and these 4 extracted image frames are determined as the 4 sample image frames in the sample video sequence.
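The extraction rule in the examples above can be sketched as follows. This is a minimal illustration only, assuming that an extraction interval of k means "skip k frames, then take one"; the function name `sample_frames` and its parameters are hypothetical and do not come from the application.

```python
def sample_frames(total_frames, n, interval):
    """Return the 1-based indices of up to n extracted frames.

    Assumption: an extraction interval of k skips k frames before each
    extracted frame, so every (k + 1)-th frame is taken.
    """
    step = interval + 1
    indices = list(range(step, total_frames + 1, step))
    return indices[:n]


print(sample_frames(16, 4, 3))  # [4, 8, 12, 16]
print(sample_frames(16, 4, 0))  # [1, 2, 3, 4]
```

With 16 original frames, N = 4, and an interval of 3, the sketch reproduces the 4th, 8th, 12th, and 16th frames of the Figure 1 example.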
[0056] In the embodiments of this application, sample image frames can be scaled to a preset resolution, and image blocks can be divided and reassembled based on the scaled sample image frames. The preset resolution can be set according to actual needs, and this application does not limit it; for example, the preset resolution can be 256*256. Unless otherwise specified, the sample image frames mentioned below are all scaled sample image frames.
[0057] In this embodiment, the sample image frames in the sample video sequence can be divided into image blocks according to preset division scales to obtain various sample image blocks. Each sample image block in a sample image frame corresponds to a division scale.
[0058] The aforementioned division scales can be set according to actual circumstances, and this application does not limit this. For example, the division scales can be 4*4, 8*8, and 16*16.
[0059] Taking any sample image frame as an example, it can be divided into sample image blocks corresponding to each division scale, centered on a preset division origin. For example, referring to Figure 2, with division scales of 4*4, 8*8, and 16*16, sample image blocks of the different scales can be divided with the preset division origin (i.e., point 1 in the figure) as the center, yielding a sample image block of scale 4*4 (i.e., image block 1 in the figure), a sample image block of scale 8*8 (i.e., image block 2 in the figure), and a sample image block of scale 16*16 (i.e., image block 3 in the figure).
[0060] Here, the origin can be a point in the region of the sample image frame where motion changes may occur. The origin can be preset based on practical experience, for example, it can be the midpoint of the sample image frame; or it can be determined when annotating the sample image frame.
[0061] In one specific implementation of this application, if the range of an image block exceeds the size of the image when dividing the image block, the portion of the image block that exceeds the image range can be filled with a value of 0.
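The division step, including the zero filling described above, might look like the following sketch. It assumes that a division scale such as 4*4 denotes a block size in pixels and that a frame is a plain list of pixel rows; `crop_block` and `divide_multiscale` are hypothetical names used only for illustration.

```python
def crop_block(frame, center, size):
    """Crop a size x size block centred on `center` = (row, col).

    Pixels that fall outside the frame are filled with 0.
    """
    rows, cols = len(frame), len(frame[0])
    top = center[0] - size // 2
    left = center[1] - size // 2
    block = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(frame[r][c] if inside else 0)
        block.append(row)
    return block


def divide_multiscale(frame, center, scales=(4, 8, 16)):
    """Return one block per division scale, all sharing the same origin."""
    return {scale: crop_block(frame, center, scale) for scale in scales}


frame = [[1] * 12 for _ in range(12)]          # toy 12 x 12 frame of ones
blocks = divide_multiscale(frame, center=(6, 6))
print(len(blocks[16]), len(blocks[16][0]))     # 16 16
print(blocks[16][0][0], blocks[16][2][2])      # 0 1
```

In the toy run the 16*16 block extends past the 12 x 12 frame, so its border is zero filled while the interior keeps the original pixel values.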
[0062] After dividing the image into individual sample blocks, the individual sample image blocks corresponding to each sample image frame can be reassembled to obtain the reassembled sample image frame.
[0063] Sample image blocks at different scales of a sample image frame can contain image information of varying richness. If the scale of a sample image block is large, it indicates that the sample image block contains image information of greater richness; if the scale of a sample image block is small, it indicates that the sample image block contains image information of lower richness.
[0064] In this embodiment, the sample image patches to be reconstructed can be determined based on their scale (i.e., the richness of image information they contain). Specifically, for larger-scale image patches, the richness of image information they contain is higher. Therefore, it can be considered that larger-scale image patches contribute more to the action recognition results, and only a smaller number of image patches at that scale are needed to obtain sufficient information to capture action changes. For smaller-scale image patches, the richness of image information they contain is lower. Therefore, it can be considered that smaller-scale image patches contribute less to the action recognition results, and a larger number of image patches at that scale are needed to obtain sufficient information to capture action changes.
[0065] Based on this, embodiments of this application can consider the degree of contribution of image patches at different scales to the action recognition results and determine the reorganization strategy of image patches at different scales; thereby, image patches at various scales can be utilized more rationally to improve the accuracy of action recognition.
[0066] In this embodiment, any division scale can be denoted as the target division scale. Specifically, several specific sample image blocks under the target division scale can be recombined to obtain the sample image block region corresponding to the target division scale. Accordingly, the sample image block regions corresponding to each division scale can be obtained. Then, the sample image block regions corresponding to each division scale can be recombined to obtain the recombined sample image frame.
[0067] For example, given division scales 1, 2, and 3, several sample image blocks at division scale 1 can be recombined to obtain sample image block region 1 corresponding to division scale 1; similarly, several sample image blocks at division scale 2 can be recombined to obtain sample image block region 2 corresponding to division scale 2, and several sample image blocks at division scale 3 can be recombined to obtain sample image block region 3 corresponding to division scale 3. Then, sample image block region 1, sample image block region 2, and sample image block region 3 can be recombined to obtain a recombined sample image frame, as shown in Figure 3.
[0068] When reconstructing the sample image block region corresponding to the target division scale, the target sample image blocks used for reconstruction can be determined first. Here, a target sample image block is a sample image block, among the sample image blocks, that corresponds to the target division scale and is used for reconstruction.
[0069] Specifically, the target sampling interval corresponding to the target division scale can be determined; here, the target sampling interval is positively correlated with the target division scale. The larger the target division scale, the larger the target sampling interval; the smaller the target division scale, the smaller the target sampling interval.
[0070] For example, when the target division scale is division scale 1, the target sampling interval can be determined as sampling interval 1; when the target division scale is division scale 2, the target sampling interval can be determined as sampling interval 2; if division scale 1 is greater than division scale 2, then sampling interval 1 is also greater than sampling interval 2.
[0071] Based on the target sampling interval, each sample image frame can be sampled to obtain each target sample image frame.
[0072] For example, if the target sampling interval is 1, then one target sample image frame needs to be extracted every 1 sample image frame. If each sample image frame includes image frame 1, image frame 2, image frame 3 and image frame 4, then with a target sampling interval of 1, image frame 2 can be extracted as a target sample image frame after a 1-frame interval. After that, image frame 4 can be extracted as another target sample image frame after a 1-frame interval.
[0073] For each target sample image frame, the target sample image block corresponding to the target division scale can be determined from each sample image block corresponding to that target sample image frame.
[0074] For example, if the target division scale is 4*4 and the target sampling interval is 0, that is, each sample image frame is sampled as a target sample image frame, then the sample image blocks with a scale of 4*4 in each target sample image frame can be determined as the target sample image blocks.
[0075] For example, if the target division scale is 8*8 and the target sampling interval is 2, that is, one sample image frame is extracted every 2 sample image frames as a target sample image frame, then the sample image blocks with a scale of 8*8 in each target sample image frame can be determined as the target sample image blocks.
[0076] After identifying each target sample image block, the target sample image blocks can be reassembled to obtain the sample image block region corresponding to the target division scale.
[0077] As an example, suppose there are 16 sample image frames, namely image frame 1, image frame 2, ..., image frame 16. If the target division scale is 4*4, the sampling interval corresponding to this division scale can be determined as 0. That is, image frame 1, image frame 2, ..., image frame 16 can all be determined as target sample image frames. Then, the 4*4 sample image block of each target sample image frame can be determined as a target sample image block. Here, the 4*4 sample image block of image frame 1 can be denoted as image block 1-4, the 4*4 sample image block of image frame 2 as image block 2-4, ..., and the 4*4 sample image block of image frame 16 as image block 16-4. Recombining image block 1-4, image block 2-4, ..., image block 16-4 yields sample image block region 1 corresponding to the 4*4 scale, as shown in Figure 4.
[0078] In the example above, if the target division scale is 8*8, the sampling interval corresponding to this division scale can be determined as 1. That is, image frames 2, 4, 6, 8, 10, 12, 14, and 16 can be determined as target sample image frames. Then, the 8*8 sample image block of each target sample image frame can be determined as a target sample image block. Here, the 8*8 sample image block of image frame 2 can be denoted as image block 2-8, the 8*8 sample image block of image frame 4 as image block 4-8, ..., and the 8*8 sample image block of image frame 16 as image block 16-8. Recombining image blocks 2-8, 4-8, ..., and 16-8 yields sample image block region 2 corresponding to the 8*8 scale, as shown in Figure 5.
[0079] In the example above, if the target division scale is 16*16, the sampling interval corresponding to this division scale can be determined as 4. That is, image frame 5, image frame 10, and image frame 15 can be determined as target sample image frames. Then, the 16*16 sample image block of each target sample image frame can be determined as a target sample image block. Here, the 16*16 sample image block of image frame 5 can be denoted as image block 5-16, the 16*16 sample image block of image frame 10 as image block 10-16, and the 16*16 sample image block of image frame 15 as image block 15-16. Recombining image block 5-16, image block 10-16, and image block 15-16 yields sample image block region 3 corresponding to the 16*16 scale, as shown in Figure 6.
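The frame selection in the three examples above can be checked with a short sketch. It assumes, as in the examples, that a sampling interval of k selects every (k + 1)-th sample image frame; the helper name `target_frame_indices` and the `intervals` mapping are hypothetical.

```python
def target_frame_indices(num_frames, interval):
    """1-based indices of the target sample image frames for one scale.

    Assumption: a sampling interval of k skips k frames before each
    selected frame.
    """
    step = interval + 1
    return list(range(step, num_frames + 1, step))


# Division scale (block side length) -> sampling interval, as in the example.
intervals = {4: 0, 8: 1, 16: 4}

for scale, interval in intervals.items():
    print(scale, target_frame_indices(16, interval))
# 4 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# 8 [2, 4, 6, 8, 10, 12, 14, 16]
# 16 [5, 10, 15]
```

Larger scales use larger intervals, so fewer blocks are taken at the scales whose blocks each carry more image information.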
[0080] After obtaining each sample image block region, the sample image block regions can be reconstructed to obtain a reconstructed sample image frame.
[0081] Continuing with the example above, sample image block region 1, sample image block region 2, and sample image block region 3 can be reconstructed to obtain a reconstructed sample image frame, as shown in Figure 7.
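One possible way to assemble the regions is sketched below. The text does not fix a layout (it is shown only in the figures), so the row-major tiling, the number of columns per region, the vertical stacking, and the zero padding are all assumptions made for illustration; `tile_blocks`, `stack_regions`, and `constant_block` are hypothetical helpers.

```python
def tile_blocks(blocks, cols):
    """Tile equally sized square blocks row-major into a grid that is
    `cols` blocks wide. Unused grid cells stay 0."""
    size = len(blocks[0])
    grid_rows = -(-len(blocks) // cols)  # ceiling division
    region = [[0] * (cols * size) for _ in range(grid_rows * size)]
    for i, block in enumerate(blocks):
        r0, c0 = (i // cols) * size, (i % cols) * size
        for r in range(size):
            region[r0 + r][c0:c0 + size] = block[r]
    return region


def stack_regions(regions):
    """Stack regions vertically, right-padding narrower regions with 0."""
    width = max(len(region[0]) for region in regions)
    frame = []
    for region in regions:
        for row in region:
            frame.append(row + [0] * (width - len(row)))
    return frame


def constant_block(size, value):
    """Toy stand-in for a real image block."""
    return [[value] * size for _ in range(size)]


region_1 = tile_blocks([constant_block(4, 1) for _ in range(16)], cols=4)
region_2 = tile_blocks([constant_block(8, 2) for _ in range(8)], cols=4)
region_3 = tile_blocks([constant_block(16, 3) for _ in range(3)], cols=3)
recombined = stack_regions([region_1, region_2, region_3])
print(len(recombined), len(recombined[0]))  # 48 48
```

The block counts (16, 8, and 3) follow the running example; any layout that keeps each scale's blocks together in one region would serve the same purpose.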
[0082] After obtaining the reconstructed sample image frame, image features can be extracted from the reconstructed sample image frame to obtain the sample image features.
[0083] In the embodiments of this application, any common neural network model for image feature extraction can be used to extract image features from the reconstructed sample image frames to obtain sample image features. For example, convolutional neural networks (such as ResNet) or Transformer networks (such as Vision Transformer (ViT)) can be used to extract image features from the reconstructed sample image frames to obtain sample image features.
[0084] In one specific implementation of this application, the ViT image encoder can be used to extract image features from the reconstructed sample image frames. This implementation will be described below.
[0085] Specifically, the reconstructed sample image frame can be divided into segmented image blocks of the same size (called patches). Position information can be embedded into each patch, where the position information of a patch is its position in the reconstructed sample image frame. After the position information is embedded, a reconstructed image frame with embedded position information is obtained, which consists of the patches with embedded position information (called patch embeddings).
[0086] Then, the individual patch embeddings can be expanded into a sequence and used as input to the Transformer encoder.
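The splitting and flattening steps can be illustrated with a simplified sketch. It only pairs each flattened patch with its grid position; in an actual ViT encoder each patch is linearly projected and a learned position embedding vector is added, which this sketch does not model. The function name `to_patch_sequence` is hypothetical.

```python
def to_patch_sequence(frame, patch_size):
    """Split a frame into non-overlapping square patches and return a
    sequence of (grid position, flattened pixels) pairs in row-major order."""
    rows, cols = len(frame), len(frame[0])
    sequence = []
    for r0 in range(0, rows - patch_size + 1, patch_size):
        for c0 in range(0, cols - patch_size + 1, patch_size):
            pixels = [
                frame[r][c]
                for r in range(r0, r0 + patch_size)
                for c in range(c0, c0 + patch_size)
            ]
            position = (r0 // patch_size, c0 // patch_size)
            sequence.append((position, pixels))
    return sequence


frame = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4 x 4 frame
sequence = to_patch_sequence(frame, patch_size=2)
print(sequence[0])  # ((0, 0), [0, 1, 4, 5])
print(sequence[3])  # ((1, 1), [10, 11, 14, 15])
```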
[0087] The Transformer encoder can perform attention processing on each patch embedding. Specifically, each patch embedding can be transformed by three different linear transformations (using preset weight matrices) to generate the corresponding query (Q), key (K), and value (V). Then, the attention output Attention(Q, K, V) for each patch embedding can be calculated according to the following formula:
[0088] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
[0089] Here, $K^{T}$ is the transpose of the key, and $d_k$ is the dimension of the key.
[0090] Then, the output of the attention mechanism processing of each patch embedding can be used to perform a weighted summation of each patch embedding to obtain the sample image features.
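The attention computation can be sketched in plain Python, assuming the standard scaled dot-product form softmax(Q K^T / sqrt(d_k)) V. The weight matrices `w_q`, `w_k`, and `w_v` and the toy inputs are hypothetical; a real encoder would use trained weights and multiple attention heads.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


# Three toy patch embeddings and identity projections (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = w_k = w_v = [[1.0, 0.0], [0.0, 1.0]]
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
out = attention(q, k, v)

# A zero query attends equally to both keys, so the values are averaged.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]))
# [[1.0, 1.0]]
```

Each output row is a weighted sum of the value rows, which corresponds to the weighted summation over patch embeddings described above.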
[0091] By combining location information and attention mechanisms, it is possible to better capture changes in actions over time, which helps classifiers to recognize actions more accurately.
[0092] In this embodiment, to improve the accuracy and robustness of action recognition, semantic guidance can be provided for the classification process. Specifically, descriptive text corresponding to the action to be recognized can be obtained. This descriptive text can be uploaded by the user or generated by another artificial intelligence model (such as ChatGPT) based on the action to be recognized. After the descriptive text is obtained, semantic features can be extracted from it to obtain text semantic features. Any neural network for text feature extraction can be used for this purpose, and this application does not impose any specific limitation. For example, the BERT (Bidirectional Encoder Representations from Transformers) model can be used to extract semantic features from the descriptive text to obtain the text semantic features.
[0093] Then, a training sample can be obtained by combining the sample image features and text semantic features.
[0094] In this embodiment, the training samples can also be labeled to obtain the ground truth labels corresponding to the training samples. Specifically, the ground truth label corresponding to the training sample can indicate whether there is an action to be identified in the sample video sequence corresponding to the training sample. For example, a specific number can be used as the ground truth label to indicate that there is an action to be identified in the sample video sequence corresponding to the training sample; another specific number can be used as the ground truth label to indicate that there is no action to be identified in the sample video sequence corresponding to the training sample.
[0095] A training sample can be constructed from each sample video sequence by following the above process. By obtaining a preset number of sample video sequences, a preset number of training samples can be determined. Based on the preset number of training samples and their corresponding ground truth labels, a training sample set can be constructed.
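The construction of the training sample set can be sketched as follows. This is an illustration under stated assumptions: the image and text features are combined by concatenation, and the two "specific numbers" used as ground truth labels are taken to be 1 and -1. Neither choice is stated in this application, and the helper names are hypothetical.

```python
def make_training_sample(image_features, text_features):
    """Join the two feature vectors (concatenation is an assumption)."""
    return list(image_features) + list(text_features)

def build_training_set(image_feature_list, text_features, contains_action):
    """Return (samples, labels) for a list of sample video sequences.

    image_feature_list: one image feature vector per sample video sequence.
    text_features: text semantic features of the action to be recognized.
    contains_action: one bool per sequence; True maps to label 1,
    False to label -1 (assumed label values).
    """
    samples = [make_training_sample(f, text_features)
               for f in image_feature_list]
    labels = [1 if flag else -1 for flag in contains_action]
    return samples, labels

samples, labels = build_training_set(
    [[0.1, 0.2], [0.3, 0.4]], [0.9], [True, False])
```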
[0096] Then, the initial classifier can be trained using the training sample set. Specifically, the initial classifier can be trained using each training sample in the training sample set as input and the corresponding real label as the expected output, to obtain the action recognition classifier of this embodiment.
[0097] Specifically, the initial classifier can receive each training sample as input and use a preset kernel function to map the input training samples into a high-dimensional feature space, thereby constructing the optimal separating hyperplane in the high-dimensional feature space. Here, the kernel function can be set according to actual needs. As an example, the kernel function can be shown in the following equation:
[0098] $$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)$$
[0099] Where x and y are the features of the sample images in the training samples and the features of the example images (images containing the action to be identified), respectively; σ is the width parameter of the Gaussian kernel function, which controls the radial range of the kernel function, and its value can be set to 3 or other values as needed; k(x,y) represents the similarity between the features of the sample image and the features of the example image after being mapped to a high-dimensional space.
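A Gaussian kernel of the kind described above can be sketched as follows. The sketch assumes the common form exp(-‖x − y‖² / (2σ²)); other normalizations of the width parameter exist, and the function name `gaussian_kernel` is hypothetical. The default width of 3 follows the example value given in the text.

```python
import math

def gaussian_kernel(x, y, sigma=3.0):
    """Similarity of two feature vectors under a Gaussian (RBF) kernel.

    sigma is the width parameter; a larger sigma widens the radial range.
    The 1 / (2 * sigma**2) scaling is an assumed, commonly used form.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])   # identical vectors
apart = gaussian_kernel([0.0, 0.0], [3.0, 0.0])  # distance equal to sigma
```

Identical feature vectors yield a similarity of 1, and the value decays toward 0 as the vectors move apart.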
[0100] Next, the optimization problem for constructing the initial classifier can be solved, with the goal of maximizing the width of the classification margin; by solving this optimization problem, the support vectors can be obtained; based on the obtained support vectors, the optimal separating hyperplane can be determined.
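For reference, the margin-maximization problem described above is commonly written in the soft-margin form below. This is the textbook support vector machine formulation, not an expression quoted from this application; the symbols w, b, ξ_i, C, φ, and α_i are assumptions introduced for illustration, with y_i ∈ {+1, −1} standing for the ground truth label of training sample x_i.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,
\qquad
f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\, y_i\, k(x_i, x) + b\Bigr)
```

Minimizing ‖w‖ is equivalent to maximizing the margin width 2/‖w‖; the training samples with α_i > 0 are the support vectors that determine the separating hyperplane.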
[0101] After the optimal separating hyperplane is determined, the classifier training process can be terminated, resulting in the action recognition classifier of this embodiment. The action recognition classifier can then be applied to actual action recognition.
[0102] The following describes the process of action recognition using the action recognition classifier. Referring to Figure 8, an embodiment of an action recognition method in this application may include steps S801 to S804:
[0103] Step S801: Obtain the video sequence to be identified.
[0104] The video sequence to be identified may include N image frames to be identified; as an example, N can be a value from 4 to 32, for example, 16.
[0105] In this embodiment, the video sequence for which action recognition is required is denoted as the video sequence to be recognized; for example, the video sequence to be recognized can be a video captured by video surveillance. Each image frame to be recognized in this embodiment can be an image frame extracted from the original video sequence. For the specific extraction process, refer to the relevant description above; it is not repeated here.
[0106] After obtaining the video sequence to be identified, each image frame to be identified in the video sequence can be scaled to a preset resolution, and subsequent image block division and recombination can be performed based on the scaled image frames to be identified; unless otherwise specified, the image frames to be identified in the following text are all scaled image frames to be identified.
[0107] Step S802: Divide the image frame to be recognized into image blocks according to each division scale to obtain each image block.
[0108] In the embodiments of this application, for each image frame to be identified, the image frame to be identified can be divided into corresponding image blocks according to various division scales; wherein, each image block of the image frame to be identified can correspond to a division scale.
[0109] Specifically, image blocks corresponding to each division scale can be divided with a preset division origin as the center. The division origin here can be preset based on practical experience. For example, it can be determined based on the region where the action to be recognized is easily identified during the classifier training or inference process.
[0110] For example, the various division scales can be 4*4, 8*8, and 16*16, respectively. Based on the division principle, image blocks with a scale of 4*4, an image block with a scale of 8*8, and an image block with a scale of 16*16 can be divided from each image frame to be identified.
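Division at several scales around a preset division origin can be sketched as follows. This is a simplified illustration, assuming that one square block of side s pixels is cropped around the origin for each scale s and that the frame is at least as large as the largest scale; the exact division principle is the one described earlier in this application, and the names `crop_block` and `divide_frame` are hypothetical.

```python
def crop_block(frame, origin, size):
    """Return a size x size block of `frame` centred on `origin` (row, col).

    frame is a list of rows. The window is shifted where needed so that it
    stays inside the frame; the frame is assumed to be at least size x size.
    """
    rows, cols = len(frame), len(frame[0])
    top = min(max(origin[0] - size // 2, 0), rows - size)
    left = min(max(origin[1] - size // 2, 0), cols - size)
    return [row[left:left + size] for row in frame[top:top + size]]

def divide_frame(frame, origin, scales=(4, 8, 16)):
    """Map each division scale to the block cropped at that scale."""
    return {s: crop_block(frame, origin, s) for s in scales}

# A 32 x 32 dummy frame whose pixel value encodes its position.
frame = [[r * 32 + c for c in range(32)] for r in range(32)]
blocks = divide_frame(frame, origin=(16, 16))
```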
[0111] Step S803: Reconstruct the image blocks corresponding to each image frame to be identified to obtain the reconstructed image frame.
[0112] In this embodiment of the application, image blocks at a certain division scale can be reconstructed to obtain the image block region corresponding to that division scale; then, the various image block regions can be reconstructed to obtain a reconstructed image frame.
[0113] For any given target segmentation scale, the target image blocks can be determined first. These target image blocks are the image blocks used for reconstruction that correspond to the target segmentation scale. Here, each image frame to be identified can be sampled based on a target sampling interval to obtain each target image frame; the target sampling interval is positively correlated with the target segmentation scale. Then, the image blocks in each image block corresponding to the target image frame that correspond to the target segmentation scale can be determined as the target image blocks.
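The relationship between the target segmentation scale and the target sampling interval can be sketched as follows. The text only requires the interval to be positively correlated with the scale; the proportional rule `scale // base_scale` used here is an assumption chosen for illustration, and both function names are hypothetical.

```python
def target_sampling_interval(scale, base_scale=4):
    """Sampling interval that grows with the scale (assumed proportional rule)."""
    return max(1, scale // base_scale)

def sample_frame_indices(num_frames, scale, base_scale=4):
    """Indices of the target image frames for one target segmentation scale."""
    step = target_sampling_interval(scale, base_scale)
    return list(range(0, num_frames, step))

# With 16 frames, larger scales draw blocks from fewer frames.
indices_by_scale = {s: sample_frame_indices(16, s) for s in (4, 8, 16)}
```

Under this rule, a video sequence of 16 frames contributes 16 target image frames at scale 4, 8 at scale 8, and 4 at scale 16.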
[0114] For details of the above process, refer to the relevant description above; it is not repeated here.
[0115] Therefore, the rich information in the spatial and temporal domains of the video sequence to be identified can be integrated to obtain a reconstructed image frame with high information density; using this reconstructed image frame as input, robust and efficient action recognition can be performed.
[0116] Step S804: Perform action recognition based on the reconstructed image frames to obtain the action recognition results of the video sequence to be recognized.
[0117] In this embodiment, image features can be extracted from the reconstructed image frame to obtain the reconstructed image features. For the process of image feature extraction, refer to the relevant description above; it is not repeated here.
[0118] Then, the reconstructed image features and the text semantic features obtained in the aforementioned process can be used as input to the action recognition classifier to perform action recognition and obtain the action recognition results of the video sequence to be recognized.
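Inference with a trained kernel classifier can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel of the common form exp(-‖x − y‖² / (2σ²)), an input vector formed from the reconstructed image features and text semantic features, and labels of 1 and -1; the support vectors, coefficients, and bias below are made-up toy values, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=3.0):
    """Gaussian (RBF) kernel with an assumed 1 / (2 * sigma**2) scaling."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def classify(sample, support_vectors, coefficients, bias, sigma=3.0):
    """Return 1 if the action is judged present, otherwise -1.

    coefficients[i] is the product of the dual weight and the label of
    support vector i, as obtained from training.
    """
    score = bias + sum(
        c * gaussian_kernel(sv, sample, sigma)
        for c, sv in zip(coefficients, support_vectors))
    return 1 if score >= 0 else -1

# Toy values for illustration only; real ones come from training.
support_vectors = [[0.0, 0.0], [10.0, 10.0]]
coefficients = [1.0, -1.0]
near_positive = classify([0.5, 0.5], support_vectors, coefficients, bias=0.0)
near_negative = classify([9.0, 9.0], support_vectors, coefficients, bias=0.0)
```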
[0119] In one specific implementation of this application, in order to enable the action recognition classifier to adapt to the constantly changing environment, the action recognition classifier can also be updated periodically to ensure that the action recognition classifier can maintain robust reasoning.
[0120] In summary, this application embodiment obtains a video sequence to be identified; wherein the video sequence to be identified includes various image frames to be identified; the image frames to be identified are divided into image blocks according to various division scales to obtain various image blocks; wherein each image block of an image frame to be identified corresponds to one of the division scales; the image blocks corresponding to each image frame to be identified are recombined to obtain a recombined image frame; action recognition is performed based on the recombined image frame to obtain the action recognition result of the video sequence to be identified. In this application embodiment, for each image frame to be identified, multiple image blocks of different scales can be obtained; then, the image blocks of different scales in each image frame to be identified can be recombined to obtain a recombined image frame that integrates multiple scale image blocks; accordingly, rich information in the spatial domain (i.e., multiple scales) and temporal domain (i.e., each image frame to be identified in the video sequence to be identified) can be organically integrated into one image frame, providing richer and more detailed information for action recognition, thereby helping to improve the robustness and accuracy of action recognition.
[0121] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0122] Corresponding to the action recognition method described in the above embodiments, Figure 9 shows a structural diagram of an embodiment of an action recognition device provided in this application.
[0123] In this application embodiment, an action recognition device may include:
[0124] The acquisition module 901 is used to acquire a video sequence to be identified; wherein, the video sequence to be identified includes each image frame to be identified;
[0125] The segmentation module 902 is used to divide the image frame to be identified into image blocks according to various segmentation scales to obtain each image block; wherein, each image block of the image frame to be identified corresponds to one of the segmentation scales;
[0126] The recombination module 903 is used to reconstruct each of the image blocks corresponding to each of the image frames to be identified, to obtain a reconstructed image frame;
[0127] The recognition module 904 is used to perform action recognition based on the reconstructed image frame to obtain the action recognition result of the video sequence to be recognized.
[0128] In one specific implementation of this application embodiment, the reorganization module includes:
[0129] The image patch determination submodule is used to determine each target image patch; wherein, the target image patch is an image patch for reconstruction corresponding to a target partitioning scale, and the target partitioning scale is any one of the partitioning scales;
[0130] The first submodule is used to reassemble each of the target image blocks to obtain an image block region corresponding to the target segmentation scale.
[0131] The second submodule is used to reconstruct each of the image block regions to obtain the reconstructed image frame.
[0132] In one specific implementation of this application embodiment, the image block determination submodule includes:
[0133] An interval determination unit is used to determine a target sampling interval corresponding to the target partitioning scale; wherein the target sampling interval is positively correlated with the target partitioning scale;
[0134] A sampling unit is used to sample each of the image frames to be identified based on the target sampling interval to obtain each target image frame;
[0135] The image block determination unit is used to determine the target image block in each of the image blocks corresponding to each of the target image frames.
[0136] In one specific implementation of this application embodiment, the identification module includes:
[0137] The text acquisition submodule is used to acquire the descriptive text corresponding to the action to be recognized;
[0138] The semantic feature extraction submodule is used to extract semantic features from the descriptive text to obtain text semantic features;
[0139] The image feature extraction submodule is used to extract image features from the reconstructed image frame to obtain reconstructed image features;
[0140] The recognition submodule is used to identify the action recognition result of the video sequence to be recognized by using an action recognition classifier based on the semantic features of the text and the reconstructed image features; wherein, the action recognition classifier is a pre-trained classifier for recognizing the action to be recognized.
[0141] In one specific implementation of this application embodiment, the image feature extraction submodule includes:
[0142] An information embedding unit is used to embed location information into the reconstructed image frame to obtain the reconstructed image frame after location information embedding.
[0143] An attention mechanism processing unit is used to perform attention mechanism processing on the reconstructed image frame after the location information is embedded to obtain the reconstructed image features.
[0144] In one specific implementation of this application embodiment, the identification module further includes:
[0145] The sample acquisition submodule is used to acquire a preset number of training samples; wherein, the training sample includes a sample image feature and the text semantic feature;
[0146] The sample set construction submodule is used to construct a training sample set based on each of the training samples and their corresponding real labels;
[0147] The classifier training submodule is used to train the initial classifier by taking each training sample in the training sample set as the input of the initial classifier and the corresponding real label as the expected output, so as to obtain the action recognition classifier.
[0148] In one specific implementation of this application embodiment, the sample acquisition submodule includes:
[0149] A video sequence acquisition unit is used to acquire a preset number of sample video sequences; wherein, one sample video sequence includes sample image frames;
[0150] An image block segmentation unit is used to segment the sample image frames in the sample video sequence into image blocks according to each of the segmentation scales, thereby obtaining each sample image block; wherein, each sample image block of a sample image frame corresponds to one of the segmentation scales;
[0151] A recombination unit is used to reassemble the sample image blocks corresponding to each of the sample image frames to obtain recombined sample image frames;
[0152] An image feature extraction unit is used to extract image features from the recombined sample image frame to obtain the sample image features;
[0153] The sample determination unit is used to obtain the training samples based on the sample image features and the text semantic features.
[0154] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the devices, modules, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0155] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0156] Figure 10 shows a schematic block diagram of a terminal device provided in an embodiment of this application. For ease of explanation, only the parts related to the embodiment of this application are shown.
[0157] As shown in Figure 10, the terminal device 10 of this embodiment includes: a processor 100, a memory 101, and a computer program 102 stored in the memory 101 and executable on the processor 100. When the processor 100 executes the computer program 102, it implements the steps in the various action recognition method embodiments described above, for example, steps S801 to S804 shown in Figure 8. Alternatively, when the processor 100 executes the computer program 102, it implements the functions of each module / unit in the above-described device embodiments, for example, the functions of modules 901 to 904 shown in Figure 9.
[0158] For example, the computer program 102 may be divided into one or more modules / units, which are stored in the memory 101 and executed by the processor 100 to complete this application. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program 102 in the terminal device 10.
[0159] The terminal device 10 can be a desktop computer, laptop, handheld computer, smartphone, smart TV, or the like. Those skilled in the art will understand that Figure 10 is merely an example of the terminal device 10 and does not constitute a limitation on the terminal device 10. It may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device 10 may also include input / output devices, network access devices, buses, etc.
[0160] The processor 100 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor 100 can be the nerve center and command center of the terminal device 10. The processor 100 can generate operation control signals based on instruction opcodes and timing signals to control instruction fetching and execution.
[0161] The memory 101 can be an internal storage unit of the terminal device 10, such as a hard disk or memory of the terminal device 10. The memory 101 can also be an external storage device of the terminal device 10, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the terminal device 10. Furthermore, the memory 101 can include both internal and external storage units of the terminal device 10. The memory 101 is used to store the computer program and other programs and data required by the terminal device 10. The memory 101 can also be used to temporarily store data that has been output or will be output.
[0162] The terminal device 10 may further include a communication module, which can provide communication solutions for network devices, including Wireless Local Area Networks (WLAN) (such as Wi-Fi), Bluetooth, Zigbee, mobile communication networks, Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), and Infrared (IR) technologies. The communication module may be one or more devices integrating at least one communication processing module. The communication module may include an antenna, which may have a single element or be an antenna array with multiple elements. The communication module can receive electromagnetic waves through the antenna, frequency modulate and filter the electromagnetic wave signals, and send the processed signals to the processor. The communication module can also receive signals to be transmitted from the processor, frequency modulate and amplify them, and then convert them into electromagnetic waves for radiation via the antenna.
[0163] The terminal device 10 may further include a power management module, which can receive input from an external power source, battery, and / or charger to power the processor, the memory, and the communication module, etc.
[0164] The terminal device 10 may further include a display module, which can be used to display information input by the user or information provided to the user. The display module may include a display panel, optionally configured as a Liquid Crystal Display (LCD), Organic Light-Emitting Diode (OLED), or similar display panel. Furthermore, a touch panel may cover the display panel. When the touch panel detects a touch operation on or near it, it transmits the information to the processor to determine the type of touch event. Subsequently, the processor provides corresponding visual output on the display panel based on the type of touch event.
[0165] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0166] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0167] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0168] In the embodiments provided in this application, it should be understood that the disclosed devices / terminal equipment and methods can be implemented in other ways. For example, the device / terminal equipment embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling or direct coupling or communication connection may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0169] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0170] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0171] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps described in the various method embodiments above.
[0172] If the integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable storage medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc. It should be noted that the content included in the computer-readable storage medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer-readable storage medium does not include electrical carrier signals and telecommunication signals.
[0173] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. An action recognition method, characterized in that it comprises: obtaining the video sequence to be identified; wherein, the video sequence to be identified includes each image frame to be identified; dividing the image frame to be identified into image blocks according to each division scale to obtain each image block; wherein, each image block of the image frame to be identified corresponds to one of the division scales; reconstructing the image blocks corresponding to each of the image frames to be identified to obtain the reconstructed image frames; and performing action recognition based on the reconstructed image frames to obtain the action recognition result of the video sequence to be recognized.
2. The action recognition method according to claim 1, characterized in that, The reconstructing of the image blocks corresponding to each of the image frames to be identified to obtain the reconstructed image frames includes: Each target image block is determined; wherein, the target image block is an image block for reconstruction corresponding to a target partitioning scale, and the target partitioning scale is any one of the partitioning scales; Reassemble the target image blocks to obtain image block regions corresponding to the target segmentation scale; The image block regions are recombined to obtain the recombined image frame.
3. The action recognition method according to claim 2, characterized in that, The determination of each target image patch includes: Determine the target sampling interval corresponding to the target partitioning scale; wherein the target sampling interval is positively correlated with the target partitioning scale; Based on the target sampling interval, each of the image frames to be identified is sampled to obtain each target image frame; The target image block is determined in each of the image blocks corresponding to each of the target image frames.
4. The action recognition method according to claim 1, characterized in that, The step of performing action recognition based on the reconstructed image frames to obtain the action recognition result of the video sequence to be recognized includes: Obtain the descriptive text corresponding to the action to be identified; Semantic features are extracted from the descriptive text to obtain text semantic features; Image features are extracted from the reconstructed image frames to obtain the reconstructed image features; Using an action recognition classifier, action recognition is performed based on the text semantic features and the reconstructed image features to obtain the action recognition result of the video sequence to be recognized; wherein, the action recognition classifier is a pre-trained classifier used to recognize the action to be recognized.
5. The action recognition method according to claim 4, characterized in that, The step of extracting image features from the reconstructed image frame to obtain reconstructed image features includes: Position information is embedded into the reconstructed image frame to obtain the reconstructed image frame with embedded position information; The reconstructed image frame, after embedding location information, is processed using an attention mechanism to obtain the reconstructed image features.
6. The action recognition method according to claim 4, characterized in that, The training process of the action recognition classifier includes: Obtain a preset number of training samples; wherein, the training sample includes a sample image feature and the text semantic feature; Based on each of the training samples and their corresponding real labels, a training sample set is constructed; The initial classifier is trained using each training sample in the training sample set as input and the corresponding real label as the expected output to obtain the action recognition classifier.
7. The action recognition method according to claim 6, characterized in that, The process of obtaining a preset number of training samples includes: Obtain a preset number of sample video sequences; wherein, one sample video sequence includes sample image frames; According to each of the aforementioned division scales, the sample image frames in the sample video sequence are divided into image blocks to obtain each sample image block; wherein, each sample image block of a sample image frame corresponds to one of the aforementioned division scales; Reconstruct the sample image blocks corresponding to each of the sample image frames to obtain the reconstructed sample image frames; Image feature extraction is performed on the reconstructed sample image frames to obtain the sample image features; The training samples are obtained based on the sample image features and the text semantic features.
8. An action recognition device, characterized in that it comprises: an acquisition module, used to acquire a video sequence to be identified; wherein, the video sequence to be identified includes each image frame to be identified; a segmentation module, used to divide the image frame to be identified into image blocks according to various segmentation scales to obtain each image block; wherein, each image block of the image frame to be identified corresponds to one of the segmentation scales; a reconstructing module, used to reconstruct the image blocks corresponding to each of the image frames to be identified, thereby obtaining reconstructed image frames; and a recognition module, used to perform action recognition based on the reconstructed image frames to obtain the action recognition result of the video sequence to be recognized.
9. A computer program product, characterized in that it includes a computer program which, when run, causes the action recognition method according to any one of claims 1 to 7 to be executed.
10. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it causes the terminal device to implement the steps of the action recognition method as described in any one of claims 1 to 7.