Action recognition method, apparatus and device

By performing skeletal detection and RGB projection parameter optimization on multi-subject video images, the feature confusion problem in action recognition in multi-person scenes is solved, achieving higher accuracy in action recognition, especially in occluded scenes.

CN120954091B · Active · Publication Date: 2026-06-26 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2025-07-31
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In action recognition in multi-person scenarios, existing technologies struggle to effectively distinguish the action features of different subjects, leading to decreased recognition accuracy. In particular, the primary and secondary relationships of actions are difficult to clarify in complex multi-person interaction scenarios.

Method used

Human skeleton detection is performed on multi-subject video images to generate skeletal line maps. Learnable RGB projection parameters are then used to project the skeletal line maps onto the RGB channels of the video images. Combined with depth information and guidance regions, an action recognition network is used for action recognition. Dedicated projection parameters are used to separate the skeletal line maps of different subjects in the RGB space, avoiding feature confusion.

Benefits of technology

It effectively distinguishes the action features of different subjects, which improves the accuracy of action recognition, especially in occluded scenes, as well as computational efficiency.

✦ Generated by Eureka AI based on patent content.

Figure CN120954091B_ABST

Abstract

An action recognition method, apparatus, and device are disclosed. The method comprises: performing human skeleton detection on a video image containing multiple subjects to obtain skeletal key point information for each subject, the information comprising the two-dimensional coordinates of the key points; generating a skeletal line map for each subject from the key point information; acquiring learnable RGB projection parameters exclusive to the skeletal line map of each subject, and projecting the skeletal line maps into the corresponding RGB channels of the video image according to those parameters; and using an action recognition network to recognize the action of each subject from the video image with the skeletal projection. Because independent, exclusive projection parameters are learned for each subject, the skeletal line maps of different subjects are naturally separated in RGB space. Even in occluded scenes, the action recognition network can distinguish different subjects and extract the action features of the subject of interest, which avoids crosstalk between the action features of different subjects and improves recognition accuracy.

Description

Technical Field

[0001] This specification relates to the field of computer vision technology, and more particularly to an action recognition method, apparatus, and device.

Background Technology

[0002] Action recognition is a common task in computer vision, aiming to automatically identify the category of ongoing actions from a given video clip. With the development of artificial intelligence technology, action recognition has been widely applied in many fields, such as security monitoring, healthcare, and sports analytics. The development of action recognition technology is of great significance for improving user experience and supporting decision-making. Currently, commonly used action recognition methods are based on video and skeleton analysis, including using convolutional fusion to combine information from both video and skeleton features, or cross-modal fusion utilizing cross-attention mechanisms. However, when a video contains multiple individuals, especially in sports events, dance performances, or group activities, relying solely on video frame information or a single skeleton projection method for action recognition may lead to confusion, particularly in complex multi-person interaction scenarios where the primary and secondary relationships of actions are difficult to clearly distinguish, potentially resulting in decreased recognition accuracy.

Summary of the Invention

[0003] In view of this, one or more embodiments of this specification provide an action recognition method, apparatus, and device to improve recognition accuracy in multi-person scenarios.

[0004] To achieve the above objectives, one or more embodiments of this specification provide the following technical solutions:

[0005] According to a first aspect of one or more embodiments of this specification, an action recognition method is proposed, comprising:

[0006] Human skeleton detection is performed on video images containing multiple subjects to obtain skeletal key point information of each subject. The skeletal key point information includes the two-dimensional coordinate information of the key points.

[0007] Generate skeletal line diagrams for each subject based on the key point information;

[0008] Obtain learnable RGB projection parameters specific to the skeletal line diagrams of different subjects, and project the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB parameters;

[0009] Using an action recognition network, the actions of each subject are identified based on video images with skeletal projections;

[0010] In this process, the learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.

[0011] In some embodiments, projecting the skeletal line graph onto the corresponding RGB channels of the video image according to the learnable RGB parameters includes:

[0012] Generate a binary mask image of the skeleton lines for each subject;

[0013] The mask image of each subject is multiplied by the corresponding learnable RGB projection parameters to obtain the components in the RGB channels;

[0014] The RGB channel projection components of each subject are superimposed onto the corresponding RGB channels of the video image.

[0015] In some embodiments, the exclusive learnable RGB projection parameters for each subject are set to assign each subject to an exclusive target channel, and different subjects are assigned to different target channels, wherein the target channel is one of the RGB channels.

[0016] In some embodiments, the method further includes:

[0017] Obtain a depth map of the video image, the depth map including depth information for each pixel;

[0018] The two-dimensional coordinate information of the skeletal key points is mapped to the corresponding positions in the depth map to obtain the three-dimensional coordinate information of the skeletal key points.

[0019] In some embodiments, the use of an action recognition network to recognize the actions of various subjects based on video images with skeletal projection includes:

[0020] The first feature of a video image with skeletal projection is extracted using a video encoder.

[0021] The second feature of the three-dimensional coordinate information of the skeletal key points corresponding to each subject is extracted using a skeletal encoder.

[0022] The actions of each subject are identified based on the fusion of the first feature and the second feature.

[0023] In some embodiments, the method further includes:

[0024] Based on the type of multi-subject cooperative motion, a guiding region is determined from the skeletal diagram, and the guiding region is projected onto a channel in the video image that does not have a skeletal projection.

[0025] Action recognition of each subject is performed based on video images with skeletal projection and guide areas, as well as the 3D coordinate information of the skeletal key points corresponding to each subject.

[0026] In some embodiments, the action recognition of each subject based on the video image having a skeletal projection and a guide region, and the three-dimensional coordinate information of the skeletal key points corresponding to each subject, includes:

[0027] Extract the features of the guiding region at the first resolution;

[0028] Features of the region outside the guide region in the channel containing the guide region are extracted at a second resolution, where the first resolution is higher than the second resolution.

[0029] In some embodiments, identifying the actions of each subject based on the fusion of the first feature and the second feature includes:

[0030] The first feature and the second feature are fused to obtain an intermediate fused feature;

[0031] By routing different features in the intermediate fusion features to different expert networks and fusing the output features of each expert network in different ways, multiple target fusion features are obtained.

[0032] The multi-task classification result is obtained based on the fusion features of the multiple targets.

[0033] According to a second aspect of one or more embodiments of this specification, an action recognition device is provided, comprising:

[0034] The detection unit is used to perform human skeleton detection on video images containing multiple subjects, and obtain the skeletal key point information of each subject, wherein the skeletal key point information includes the two-dimensional coordinate information of the key points;

[0035] The generation unit is used to generate skeletal line drawings of each subject based on the key point information;

[0036] The projection unit is used to acquire learnable RGB projection parameters specific to the skeletal line diagrams of different subjects, and to project the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB parameters.

[0037] The recognition unit is used to identify the actions of various subjects based on video images with skeletal projection using an action recognition network.

[0038] In this process, the learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.

[0039] In some embodiments, the projection unit is specifically used for:

[0040] Generate a binary mask image of the skeleton lines for each subject;

[0041] The mask image of each subject is multiplied by the corresponding learnable RGB projection parameters to obtain the components in the RGB channels;

[0042] The RGB channel projection components of each subject are superimposed onto the corresponding RGB channels of the video image.

[0043] In some embodiments, the exclusive learnable RGB projection parameters for each subject are set to assign each subject to an exclusive target channel, and different subjects are assigned to different target channels, wherein the target channel is one of the RGB channels.

[0044] In some embodiments, the apparatus further includes a depth acquisition unit, configured to:

[0045] Obtain a depth map of the video image, the depth map including depth information for each pixel;

[0046] The two-dimensional coordinate information of the skeletal key points is mapped to the corresponding positions in the depth map to obtain the three-dimensional coordinate information of the skeletal key points.

[0047] In some embodiments, the recognition unit is specifically used for:

[0048] The first feature of a video image with skeletal projection is extracted using a video encoder.

[0049] The second feature of the three-dimensional coordinate information of the skeletal key points corresponding to each subject is extracted using a skeletal encoder.

[0050] The actions of each subject are identified based on the fusion of the first feature and the second feature.

[0051] In some embodiments, the apparatus further includes a guiding unit for:

[0052] Based on the type of multi-subject cooperative motion, a guiding region is determined from the skeletal diagram, and the guiding region is projected onto a channel in the video image that does not have a skeletal projection.

[0053] Action recognition of each subject is performed based on video images with skeletal projection and guide areas, as well as the 3D coordinate information of the skeletal key points corresponding to each subject.

[0054] In some embodiments, the recognition unit is specifically used for:

[0055] Extract the features of the guiding region at the first resolution;

[0056] Features of the region outside the guide region in the channel containing the guide region are extracted at a second resolution, where the first resolution is higher than the second resolution.

[0057] In some embodiments, when the recognition unit is used to identify the actions of each subject based on the first feature and the second feature, it is specifically used to:

[0058] The first feature and the second feature are fused to obtain an intermediate fused feature;

[0059] By routing different features in the intermediate fusion features to different expert networks and fusing the output features of each expert network in different ways, multiple target fusion features are obtained.

[0060] The multi-task classification result is obtained based on the fusion features of the multiple targets.

[0061] According to a third aspect of one or more embodiments of this specification, an electronic device is provided, comprising:

[0062] processor;

[0063] Memory used to store processor-executable instructions;

[0064] The processor implements the steps of the method proposed in the above embodiments by running the executable instructions.

[0065] According to a fourth aspect of one or more embodiments of this specification, a computer-readable storage medium is provided that stores computer instructions thereon, which, when executed by a processor, implement the steps of the method proposed in the above embodiments.

[0066] According to a fifth aspect of one or more embodiments of this specification, a computer program product is provided, comprising a computer program / instructions that, when executed by a processor, implement the steps of the method proposed in the above embodiments.

[0067] The action recognition method proposed in this specification performs human skeleton detection on video images containing multiple subjects to obtain skeletal keypoint information for each subject, and generates skeletal line maps for each subject based on the keypoint information. Then, it acquires learnable RGB projection parameters specific to the skeletal line maps of different subjects, and projects the skeletal line maps onto the corresponding RGB channels of the video image based on the learnable RGB parameters. Using an action recognition network, it identifies the actions of each subject based on the video image with skeletal projection. By learning independent and specific projection parameters for each subject, the skeletal line maps of different subjects are naturally separated in the RGB space. Even in occluded scenes, the action recognition network can distinguish different subjects and extract the action features of the subject of interest for action recognition, avoiding crosstalk between action features of different subjects and improving the accuracy of action recognition.

Attached Figure Description

[0068] Figure 1 is a flowchart of an action recognition method provided in an exemplary embodiment.

[0069] Figure 2 is a schematic diagram of an action recognition method provided in an exemplary embodiment.

[0070] Figure 3 is a schematic diagram of another action recognition method provided in an exemplary embodiment.

[0071] Figure 4 is a block diagram of an action recognition device provided in an exemplary embodiment.

[0072] Figure 5 is a schematic diagram of the structure of a device provided in an exemplary embodiment.

Detailed Implementation

[0073] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.

[0074] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification in other embodiments. In some other embodiments, the methods may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be broken down into multiple steps in other embodiments; and multiple steps described in this specification may be combined into a single step in other embodiments.

[0075] Figure 1 shows a flowchart of an action recognition method provided in an embodiment of this application. The method includes:

[0076] Step 101: Perform human skeleton detection on the video image containing multiple subjects to obtain the skeletal key point information of each subject. The skeletal key point information includes the two-dimensional coordinate information of the key points.

[0077] This method can be applied to sports scenarios involving multiple subjects. A multi-subject sports scenario refers to a dynamic scenario in which at least two independent individuals or teams interact in the same space and time, including but not limited to competitive sports such as basketball, football, and boxing matches, as well as cooperative sports such as dance performances.

[0078] The video image, shown as 20 in Figure 2, can be a continuous sequence of images from a video captured by a camera, or it can be a real-time video stream.

[0079] By performing human skeleton detection on the video images, the skeletal key point information of each subject can be obtained.

[0080] In some embodiments, the two-dimensional coordinates of the skeletal key points of each subject can first be detected using a pose estimation algorithm. For example, HigherHRNet can be used to detect key points through heatmaps, and the key points can be grouped into different human instances using a grouping strategy to obtain the two-dimensional coordinates of the body joints of each subject.

[0081] In step 102, skeletal line diagrams of each subject are generated based on the key point information.

[0082] By connecting the key points of each skeleton according to the human body topology, the skeletal line drawing of each subject can be obtained. In some cases, the skeletal line drawing of each subject can be generated based on the two-dimensional coordinates of the key points.
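The connection of key points into a skeletal line drawing can be illustrated with a minimal sketch. The five-joint topology `EDGES`, the function name `draw_skeleton_mask`, and the simple point-sampling rasterizer are assumptions made for illustration; the source does not specify a joint layout or a drawing routine.

```python
import numpy as np

# Hypothetical 5-joint topology (head, neck, hip, left hand, right hand).
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def draw_skeleton_mask(keypoints, edges, height, width):
    """Rasterize skeleton edges into a binary mask (1 = bone pixel, 0 = background)."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for a, b in edges:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment that no pixel is skipped.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, n)).astype(int)
        ys = np.rint(np.linspace(y0, y1, n)).astype(int)
        # Drop samples that fall outside the image.
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        mask[ys[inside], xs[inside]] = 1
    return mask

# Key points are given as (x, y) pixel coordinates.
keypoints = np.array([[4, 0], [4, 2], [4, 6], [1, 3], [7, 3]], dtype=float)
mask = draw_skeleton_mask(keypoints, EDGES, height=8, width=9)
```

One mask per subject is produced this way, which keeps the subjects separable before any projection into the RGB channels.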

[0083] In step 103, learnable RGB projection parameters specific to the skeletal line diagrams of different subjects are obtained, and the skeletal line diagrams are projected onto the corresponding RGB channels of the video image according to the learnable RGB parameters.

[0084] In this step, different subjects can have different meanings depending on the sports scenario involving multiple subjects. For example, in two-person competitive sports, different subjects can refer to the two athletes. Taking a boxing match as an example, the skeletal outline of one athlete can be projected into one channel, while the skeletal outline of the other athlete can be projected into another channel. For competitive sports involving two or more teams, different subjects can refer to different teams. Taking a football match as an example, the skeletal outlines of all the players in one team can be projected into one channel, and the skeletal outlines of all the players in the other team can be projected into another channel. For cooperative sports, different subjects can also refer to different individual athletes, that is, the skeletal outlines of different individual athletes can be projected into different RGB channels.

[0085] For a video of a motion scene containing multiple subjects, a unique set of RGB projection parameters is assigned to each subject $i$ in the video. Specifically, each subject $i$ has an R projection parameter $r_i$, a G projection parameter $g_i$, and a B projection parameter $b_i$, which can be written as a parameter vector $P_i = [r_i, g_i, b_i]$.

[0086] The learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.

[0087] The following describes the optimization framework for the RGB projection parameters and action recognition network parameters proposed in the embodiments of this disclosure.

[0088] In the forward propagation process, video images of a motion scene containing N subjects are input into the system. First, the video images are processed by a pre-defined human skeleton detection network, which outputs skeletal keypoints for the N subjects. A binary skeleton map is then generated based on these keypoints, where 1 represents a bone pixel and 0 represents the background. The input video images are sample images, labeled with the ground-truth action labels of each subject.

[0089] The unique projection parameters $P_i = [r_i, g_i, b_i]$ of each subject are obtained. The skeletal lines are then superimposed onto the video image channel by channel according to the projection parameters of each subject, resulting in the enhanced image.

[0090] The enhanced image is input into the action recognition network, which outputs action prediction results. The loss $L$, such as a cross-entropy loss, is calculated based on the action prediction results and the true action labels, before proceeding to the backpropagation process.

[0091] In the backpropagation process, the gradient of the loss $L$ with respect to the action recognition network weights $\theta$ and the gradient of the loss $L$ with respect to the learnable projection parameters $P_i$ are calculated. The network weights $\theta$ and the learnable projection parameters $P_i$ are then updated with the goal of minimizing the loss $L$. After multiple iterations, the learnable projection parameters converge to their target values.
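The forward and backward passes described above can be sketched on a toy problem. Everything specific here is an assumption for illustration: the "network" is a stand-in linear classifier on per-channel means (not the patent's action recognition network), the masks are random, the optimizer is plain gradient descent with an assumed learning rate `lr`, and the gradients are written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, n_classes = 8, 8, 3            # toy sizes
frame = rng.random((height, width, 3))
masks = (rng.random((2, height, width)) < 0.2).astype(float)  # stand-in skeleton masks
label = 1                                     # ground-truth action label of this sample

theta = rng.normal(scale=0.1, size=(n_classes, 3))  # weights of the stand-in network
P = np.full((2, 3), 0.1)                      # learnable projection parameters P_i
lr = 0.5                                      # learning rate (assumed)

def forward(theta, P):
    # Enhanced image: frame plus each mask scaled by its subject's parameters.
    enhanced = frame + (masks[..., None] * P[:, None, None, :]).sum(axis=0)
    feat = enhanced.mean(axis=(0, 1))         # per-channel mean, shape (3,)
    logits = theta @ feat
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return feat, probs, -np.log(probs[label])  # cross-entropy loss

losses = []
for _ in range(50):
    feat, probs, loss = forward(theta, P)
    losses.append(loss)
    d_logits = probs.copy()
    d_logits[label] -= 1.0                    # dL/dlogits for softmax + cross-entropy
    d_theta = np.outer(d_logits, feat)        # dL/dtheta
    d_feat = theta.T @ d_logits               # dL/dfeat
    # d feat[c] / d P[i, c] equals the mean of mask i, so by the chain rule:
    d_P = masks.mean(axis=(1, 2))[:, None] * d_feat[None, :]
    theta -= lr * d_theta                     # both parameter sets are updated
    P -= lr * d_P                             # from the same loss
```

The point of the sketch is only that the projection parameters receive gradients from the same loss as the network weights, so both are optimized together.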

[0092] Given the skeletal line maps of each subject as input, the action recognition network can assign each subject the target value of its unique learnable projection parameters and project the skeletal line map onto the corresponding RGB channel of the video image using that target value.

[0093] In some embodiments, the skeletal graph is first colored according to learnable projection parameters specific to each subject. Specifically, a binary mask image of the skeletal lines for each subject is first generated. Then, the mask image of each subject is multiplied by its corresponding learnable RGB projection parameters to obtain the components in the RGB channels, thus obtaining the colored skeletal graph. Finally, the RGB channel projection components of each subject are superimposed onto the corresponding RGB channels of the video image.
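The three steps in this paragraph (binary mask, multiplication by the subject's parameters, superposition onto the frame) map directly onto array broadcasting. The sketch below assumes float images in [0, 1] and clips the result to that range; the function name `project_skeletons` and the clipping are assumptions, not details from the source.

```python
import numpy as np

def project_skeletons(frame, masks, params):
    """Superimpose each subject's coloured skeleton onto an RGB frame.

    frame:  (H, W, 3) float array in [0, 1]
    masks:  (N, H, W) binary skeleton masks, one per subject
    params: (N, 3) projection parameters P_i = [r_i, g_i, b_i]
    """
    # Mask times parameters gives each subject's per-channel component: (N, H, W, 3).
    components = masks[..., None] * params[:, None, None, :]
    # Sum over subjects and add to the frame; background pixels are unchanged.
    # Clipping to [0, 1] is an assumption; the source does not specify it.
    return np.clip(frame + components.sum(axis=0), 0.0, 1.0)

frame = np.full((4, 4, 3), 0.2)
masks = np.zeros((2, 4, 4))
masks[0, 1, :] = 1                              # subject 1: a horizontal line
masks[1, :, 2] = 1                              # subject 2: a vertical line
params = np.array([[0.5, 0.0, 0.0],             # subject 1 projects into R
                   [0.0, 0.0, 0.5]])            # subject 2 projects into B
enhanced = project_skeletons(frame, masks, params)
```

Where the two lines cross, the R and B components are both present, so the overlapping skeletons remain distinguishable by channel.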

[0094] After projection, the skeletal region is represented by an overlay of the original image and colored skeletal lines, while the background region remains unchanged. This process achieves parametric visual embedding of skeletal information, providing enhanced features for subsequent action recognition.

[0095] In some embodiments, the exclusive learnable RGB projection parameters for each subject are set to assign each subject to an exclusive target channel, and different subjects are assigned to different target channels, wherein the target channel is one of the RGB channels.

[0096] That is, during training, end-to-end optimization forces the projection parameters of each subject to converge to a single channel. For example, the projection parameters of subject 1 converge to the R channel, while the projection parameters of subject 2 converge to a different channel.
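The source does not say how single-channel convergence is enforced. As a hedged illustration, the following hypothetical helper only checks the outcome: that each subject's converged parameters are concentrated in one channel and that no two subjects share a channel. The name `dominant_channels` and the tolerance `tol` are assumptions.

```python
import numpy as np

CHANNELS = ("R", "G", "B")

def dominant_channels(params, tol=1e-3):
    """Return each subject's target channel, or raise if the assignment is not exclusive."""
    params = np.abs(np.asarray(params, dtype=float))
    targets = params.argmax(axis=1)
    # Every non-target component must be (near) zero for a single-channel projection.
    off_target = params.sum(axis=1) - params.max(axis=1)
    if np.any(off_target > tol):
        raise ValueError("a subject's parameters span more than one channel")
    if len(set(targets.tolist())) != len(targets):
        raise ValueError("two subjects share the same target channel")
    return [CHANNELS[t] for t in targets]

print(dominant_channels([[0.8, 0.0, 0.0], [0.0, 0.0, 0.6]]))  # ['R', 'B']
```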

[0097] This method achieves complete RGB separation of the skeletal lines of different subjects in multi-person overlapping scenarios, eliminating feature confusion in multi-person interaction and reducing computational complexity.

[0098] In step 104, an action recognition network is used to identify the actions of each subject based on video images with skeletal projection.

[0099] After obtaining video images with RGB channel skeletal projection, an action recognition network can be used to parse the motion semantics of different subjects, such as attack, defense, and passing, thereby determining the subject's actions.

[0100] The action recognition method proposed in this specification performs human skeleton detection on video images containing multiple subjects to obtain skeletal keypoint information for each subject, and generates skeletal line maps for each subject based on the keypoint information. Then, it acquires learnable RGB projection parameters specific to the skeletal line maps of different subjects, and projects the skeletal line maps onto the corresponding RGB channels of the video image based on the learnable RGB parameters. Using an action recognition network, it identifies the actions of each subject based on the video image with skeletal projection. By learning independent and specific projection parameters for each subject, the skeletal line maps of different subjects are naturally separated in the RGB space. Even in occluded scenes, the action recognition network can distinguish different subjects and extract the action features of the subject of interest for action recognition, avoiding crosstalk between action features of different subjects and improving the accuracy of action recognition.

[0101] In some embodiments, depth information of each skeletal key point can also be obtained.

[0102] Specifically, firstly, a depth map of the video image is acquired, which includes depth information for each pixel. This can be done using a camera device with depth sensing capabilities, such as an RGB-D camera, or by acquiring a monocular RGB image and using a deep learning model to predict the depth map.

[0103] Then, the two-dimensional coordinates are mapped to the corresponding positions in the depth map to obtain the depth information of the skeletal key points, thus obtaining the three-dimensional coordinates of the skeletal key points. Specifically, by combining 2D points on the image plane with depth information and back-projecting them into three-dimensional space through the camera geometry model, the three-dimensional coordinates of these points can be obtained.
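A minimal sketch of this back-projection follows, assuming a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and a depth map aligned with the RGB image. The source only refers to "the camera geometry model", so the pinhole model, the nearest-pixel depth lookup, and the function name are assumptions.

```python
import numpy as np

def lift_keypoints_to_3d(keypoints_2d, depth_map, fx, fy, cx, cy):
    """Back-project (u, v) pixel keypoints to camera-frame (X, Y, Z) coordinates."""
    u = keypoints_2d[:, 0]
    v = keypoints_2d[:, 1]
    # Look up depth at the nearest pixel; the depth map is indexed [row, col] = [v, u].
    rows = np.clip(np.rint(v).astype(int), 0, depth_map.shape[0] - 1)
    cols = np.clip(np.rint(u).astype(int), 0, depth_map.shape[1] - 1)
    z = depth_map[rows, cols]
    # Invert the pinhole projection u = fx * X / Z + cx, v = fy * Y / Z + cy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((480, 640), 2.0)                       # flat scene 2 units away
kps = np.array([[320.0, 240.0], [420.0, 240.0]])       # (u, v) pixel coordinates
pts3d = lift_keypoints_to_3d(kps, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A key point at the principal point maps to the optical axis, and a 100-pixel horizontal offset at depth 2 with a focal length of 500 maps to X = 0.4.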

[0104] Having obtained the three-dimensional coordinate information of the skeletal key points, the actions of each subject can be identified in the following way; see the schematic diagram of the action recognition method shown in Figure 2.

[0105] By performing human skeleton detection on video image 20, the coordinates of the skeletal key points of each subject in the image are obtained, thus generating a skeletal map. Simultaneously, by obtaining the depth information of each skeletal key point, the three-dimensional coordinate information of the skeletal key points can be obtained. The figure shows the skeletal map 21 of the red player in video image 20, while the skeletal map of the blue player is not shown. After obtaining the skeletal maps of each subject, the action recognition network assigns a unique learnable projection parameter to each subject. In this example, the projection parameter corresponding to the skeletal map of the red player is the R channel parameter, while the projection parameter corresponding to the skeletal map of the blue player is the B channel parameter. Therefore, the skeletal map of the red player is projected into the R channel, and the skeletal map of the blue player is projected into the B channel, while the G channel component remains unchanged.

[0106] The video encoder 22 is used to extract the first feature 24 of the video image 20 with projected skeletal information; at the same time, the skeletal encoder 23 is used to extract the second feature 25 of the three-dimensional coordinate information of the skeletal key points; then, the actions of each subject are determined based on the first feature 24 and the second feature 25.

[0107] The first feature 24 extracted by the video encoder 22 retains the original visual information, such as environmental background, object interaction, and global motion patterns, while the second feature 25 extracted by the skeletal encoder 23 focuses on the fine-grained motion trajectory of human posture, such as joint angles and limb displacement. The fusion of the first feature 24 and the second feature 25 achieves complementarity between visual appearance and structured motion information, which can improve the comprehensiveness and accuracy of action recognition.

[0108] Furthermore, by combining the 3D coordinate information of skeletal key points, projection ambiguity in 2D skeleton detection can be eliminated, joint point localization accuracy can be improved, occluded joint points can be completed using depth data, the accuracy of action recognition in occluded scenes can be improved, and the ability to distinguish action force and contact distance can be enhanced.

[0109] In some embodiments, the video encoder 22 extracts the first feature of the video image by: acquiring the corresponding subject's motion features based on the skeletal lines in each channel using the video encoder. For example, joint motion trajectory features can be generated by graph convolution of the skeletal lines; simultaneously, features of the channel components of channels without projected skeletal line graphs are extracted, i.e., raw visual information is extracted. For example, an I3D network can be used to extract spatiotemporal features, capturing environmental background, motion of objects other than the subject, etc., to assist in the recognition of the subject's motion. Finally, the first feature of the video image is obtained by weighted summation of each channel using a channel attention mechanism. The weights of each channel can be set and adjusted according to the motion type and different stages in the motion process.
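The final weighted summation over channels can be sketched as follows. Softmax normalization of the scores is an assumption; the source only states that the channel weights can be set and adjusted according to the motion type and stage. The per-channel feature vectors are taken as given, and `channel_attention_fuse` is a hypothetical name.

```python
import numpy as np

def channel_attention_fuse(channel_feats, scores):
    """Weighted sum of per-channel features with softmax-normalised weights.

    channel_feats: (C, D) one feature vector per channel
    scores:        (C,) unnormalised attention scores (hypothetical input)
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ channel_feats, weights

feats = np.array([[1.0, 0.0],    # e.g. motion feature from the R-channel skeleton
                  [0.0, 1.0],    # e.g. raw visual feature from the G channel
                  [1.0, 1.0]])   # e.g. motion feature from the B-channel skeleton
fused, weights = channel_attention_fuse(feats, [0.0, 0.0, 0.0])
```

With equal scores the fusion reduces to a plain average; raising one score shifts the first feature toward that channel.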

[0110] By dynamically assigning weights to the features of each channel through a channel attention mechanism, key information can be intelligently enhanced in motion scenarios involving multiple subjects, thereby improving the accuracy of action recognition.

[0111] In some embodiments, a multi-stage information fusion mechanism can be used to achieve action recognition. Referring to Figure 2, the first feature 24 and the second feature 25 are fused to obtain an intermediate fused feature 26. Different features in the intermediate fused feature 26 are routed to different expert networks 271, 272, ..., 27N, and the output features of the expert networks are fused in different ways to obtain multiple target fused features. A multi-task classification result is then obtained based on the multiple target fused features.

[0112] In actual sports, the subject's movements are usually not of a single type. Therefore, action recognition can be designed as a multi-task classification, determining the final action by simultaneously predicting multiple sub-task labels related to the action. In this embodiment, multiple expert networks 271, 272, ..., 27N are set up, and different expert networks focus on different aspects, learn the discrimination patterns of different action subspaces, and perform feature transformations for specific sub-tasks. For example, in a boxing match, the multiple expert networks can be set to focus on basic action classification, striking intensity, striking effect, tactical intent, etc., respectively.

[0113] Router 28 analyzes the content of the intermediate fused features 26 and, based on the semantic information of different dimensions within the features, assigns different features from the intermediate fused features 26 to the most relevant expert networks. Each expert network can be a small neural network, such as a two-layer MLP (Multilayer Perceptron), used to perform task-oriented feature transformation on the input features. For example, one expert network might aim to extract feature patterns strongly correlated with action type (e.g., straight punch / hook punch); another expert network might aim to extract feature patterns strongly correlated with striking intensity. The output of each expert network is a task-adapted high-level feature.
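One conventional way to realize a router and a set of two-layer MLP experts is a softmax-gated mixture of experts. The sketch below follows that convention as an assumption; the specification does not state the gating function, layer sizes, or initialization, and the names `Expert`, `route`, and `make_layer` are hypothetical.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, n_in, n_out):
    """Random weights and zero biases (illustrative initialization only)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

class Expert:
    """Two-layer MLP performing a task-oriented feature transformation."""
    def __init__(self, rng, n_in, n_hidden, n_out):
        self.l1 = make_layer(rng, n_in, n_hidden)
        self.l2 = make_layer(rng, n_hidden, n_out)

    def __call__(self, x):
        return linear(relu(linear(x, *self.l1)), *self.l2)

def route(x, router_layer):
    """Softmax gate: one relevance weight per expert."""
    logits = linear(x, *router_layer)
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
fused = [0.2, -0.1, 0.7, 0.4]  # stand-in for the intermediate fused feature 26
experts = [Expert(rng, 4, 8, 3) for _ in range(4)]
router_layer = make_layer(rng, 4, len(experts))
gates = route(fused, router_layer)                     # relevance of each expert
expert_outputs = [expert(fused) for expert in experts]  # task-adapted features
```

The gate values could be used either to pick the most relevant experts or to weight their outputs; which of these the embodiment intends is not specified.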

[0114] The output features of different expert networks are selected and fused according to the sub-task, such as through concatenation, weighted averaging, or attention mechanisms, to obtain the target fused feature. This target fused feature is then input into the corresponding sub-task classification head 29 to obtain the corresponding prediction result.
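The selection and fusion of expert outputs per sub-task might look like the following sketch. The mapping from sub-task to experts, the fusion mode for each sub-task, and the averaging weights are all assumptions chosen for illustration.

```python
def concat(features):
    """Concatenate expert outputs end to end."""
    return [v for feature in features for v in feature]

def weighted_average(features, weights):
    """Average expert outputs with normalized weights."""
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) / total for i in range(dim)]

# Hypothetical outputs of three expert networks (two dimensions each).
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Assumed mapping from sub-task to (expert indices, fusion mode).
subtasks = {
    "action_type": ([0, 1], "concat"),
    "strike_intensity": ([1, 2], "weighted_average"),
}

def target_feature(name):
    """Build the target fused feature for one sub-task."""
    indices, mode = subtasks[name]
    selected = [expert_outputs[i] for i in indices]
    if mode == "concat":
        return concat(selected)
    return weighted_average(selected, [1.0, 3.0])  # assumed weights

action_feature = target_feature("action_type")
intensity_feature = target_feature("strike_intensity")
```

Each resulting feature would then be passed to its own sub-task classification head.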

[0115] Using the methods described above, video and skeletal information can be effectively fused in the original feature stage, intermediate feature stage, and decision-making stage, thereby significantly improving the accuracy of action recognition. For actions requiring special attention to local details, such as recognizing the striking point and effect in boxing, a region guidance mechanism can be used to enable the video encoder to focus more on information in local areas.

[0116] In some embodiments, a guide region can be determined from the skeletal diagram based on the type of multi-subject cooperative motion, and the guide region can be projected onto a channel in the video image that does not have a skeletal projection; then, the motion recognition of each subject can be performed based on the video image with a skeletal projection and a guide region, as well as the three-dimensional coordinate information of the skeletal key points corresponding to each subject.

[0117] The guiding area is determined from the skeletal diagram based on the type of multi-subject collaborative movement. For example, in a boxing match scene, the guiding area can be defined as the hand region in the skeletal diagram; in a soccer scene, the guiding area can be defined as the foot region in the skeletal diagram.

[0118] For the guide region determined from the skeletal line drawing, a binary mask of the guide region (1 inside the region and 0 outside) can be extracted and superimposed in a semi-transparent manner on the channel that has no projected skeleton, so that the original background information is preserved.
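A semi-transparent overlay of a binary mask is commonly implemented as alpha blending. The sketch below assumes a blend factor of 0.5 and a highlight value of 255; neither value appears in the specification, and `overlay_mask` is a hypothetical helper.

```python
def overlay_mask(channel, mask, alpha=0.5, value=255.0):
    """Alpha-blend a binary guide mask into one image channel.

    Pixels outside the mask are unchanged; pixels inside keep (1 - alpha)
    of their original value, so the background remains visible.
    """
    return [
        [(1.0 - alpha * m) * pixel + alpha * m * value for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(channel, mask)
    ]

green = [[100.0, 100.0], [40.0, 200.0]]  # channel with no skeleton projection
guide_mask = [[1, 0], [0, 1]]            # 1 inside the guide region, 0 outside
blended = overlay_mask(green, guide_mask)
```

Because the original pixel value still contributes inside the mask, the background is attenuated rather than overwritten.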

[0119] After projecting the skeletal outlines and guide regions of different subjects into different channels of the video image, the first feature of the video image is extracted using an image encoder and fused with the second feature of the skeletal outlines extracted by the skeletal encoder to determine the action of each subject.

[0120] When using a skeletal encoder for feature extraction, features of the guide region can be extracted at a first resolution, and features of the region outside the guide region in the channel containing the guide region can be extracted at a second resolution, where the first resolution is higher than the second resolution. Processing the guide region at a higher resolution lets the network focus on that region, such as a boxer's hand or a soccer player's foot, so that fine-grained features are captured and classification accuracy is improved.
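The two-resolution idea can be sketched by keeping the guide region at full resolution and average-pooling the channel for context. This is a simplification under stated assumptions: the pooling here covers the whole channel rather than only the area outside the guide region, the guide region is taken to be an axis-aligned rectangle, and `crop` and `average_pool` are hypothetical helpers.

```python
def crop(channel, top, left, height, width):
    """Cut out the guide region at the original (first, higher) resolution."""
    return [row[left:left + width] for row in channel[top:top + height]]

def average_pool(channel, factor):
    """Downsample by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(channel), len(channel[0])
    pooled = []
    for r in range(0, rows - rows % factor, factor):
        pooled_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [channel[r + dr][c + dc] for dr in range(factor) for dc in range(factor)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled

# A 4 x 4 channel with values 0..15; the guide region is the top-right 2 x 2 block.
channel = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
guide_patch = crop(channel, top=0, left=2, height=2, width=2)  # first resolution
context = average_pool(channel, factor=2)                      # second, lower resolution
```

Both outputs would then be fed to the feature extractor, with the guide patch contributing fine detail and the pooled channel contributing coarse context.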

[0121] Figure 3 is a schematic diagram of another action recognition method provided in an exemplary embodiment. This embodiment is an action recognition process for a boxing match scenario.

[0122] As shown in Figure 3, by performing human skeleton detection on video image 30, the skeletal outlines of the two contestants and the 3D coordinate information of the skeletal key points are obtained. The figure shows the skeletal outline 31 of the red contestant and the skeletal outline 32 of the blue contestant in video image 30. After the skeletal outlines of the two contestants are obtained, the skeletal outline 31 of the red contestant is projected into the R channel of the video image, and the skeletal outline 32 of the blue contestant is projected into the B channel. As shown in the projected video image 34, red skeletal lines are displayed on the image of the red contestant, and blue skeletal lines are displayed on the image of the blue contestant.

[0123] In a boxing match scenario, the guide area can be defined as the hand region in the skeletal diagram. By setting the interior of this guide area to 1 and the exterior to 0, a binary mask 33 for the guide area is obtained. This binary mask 33 is then overlaid in a semi-transparent manner on the channel that has no projected skeleton, preserving the original background information.

[0124] The video image after projection of the skeletal outlines and the guide area is shown as video image 34. Subject action recognition is performed based on video image 34. On the one hand, projecting the skeletons of different subjects into different channels provides visual cues about the relative positions of the opponents, which improves the ability to discriminate interactive actions and strengthens the perception of interactive relationships. On the other hand, depth data is used to complete occluded joints, improving the accuracy of action recognition in occluded scenes and enhancing the ability to discriminate action force and contact distance.

[0125] Figure 4 is a block diagram of a motion recognition device provided in an exemplary embodiment. As shown in Figure 4, the device includes:

[0126] The detection unit 401 is used to perform human skeleton detection on a video image containing multiple subjects, and obtain the skeleton key point information of each subject. The skeleton key point information includes the two-dimensional coordinate information of the key points.

[0127] The generation unit 402 is used to generate skeletal line drawings of each subject based on the key point information;

[0128] Projection unit 403 is used to acquire learnable RGB projection parameters specific to the skeletal line diagrams of different subjects, and project the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB parameters.

[0129] Recognition unit 404 is used to recognize the actions of various subjects based on video images with skeletal projection using an action recognition network.

[0130] In this process, the learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.
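Joint end-to-end optimization means that the loss gradient flows through the recognition network back to the projection parameters, so both are updated in the same step. The toy sketch below uses a single pixel, a one-weight stand-in for the network, and a squared-error loss; all of these, along with the name `train_step` and the learning rate, are assumptions made purely to show the chain rule at work.

```python
def train_step(w, p, pixel, mask, target, lr=0.01):
    """One gradient step on both the network weight w and projection parameter p."""
    projected = pixel + p * mask       # skeleton projected into the channel
    prediction = w * projected         # stand-in for the recognition network
    error = prediction - target
    loss = error * error
    grad_w = 2.0 * error * projected   # dL/dw
    grad_p = 2.0 * error * w * mask    # dL/dp, via the chain rule through the network
    return w - lr * grad_w, p - lr * grad_p, loss

w, p = 0.5, 0.1
losses = []
for _ in range(200):
    w, p, loss = train_step(w, p, pixel=1.0, mask=1.0, target=2.0)
    losses.append(loss)
```

In practice an automatic differentiation framework would compute these gradients; the hand-written derivatives here only make the dependency of the loss on the projection parameter explicit.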

[0131] In some embodiments, the projection unit is specifically used for:

[0132] Generate a binary mask image of the skeleton lines for each subject;

[0133] The mask image of each subject is multiplied by the corresponding learnable RGB projection parameters to obtain the components in the RGB channels;

[0134] The RGB channel projection components of each subject are superimposed onto the corresponding RGB channels of the video image.
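The three steps above (binary mask, multiplication by the projection parameters, superposition onto the image) can be sketched as follows. The parameter values are fixed constants here rather than learned, the clipping to 255 is an assumption, and `project_skeletons` is a hypothetical name.

```python
def project_skeletons(image, masks, params):
    """Add each subject's skeleton mask, scaled by its RGB parameters, to the image.

    image:  H x W x 3 nested lists with values in [0, 255]
    masks:  one H x W binary mask per subject
    params: one (r, g, b) triple per subject
    """
    out = [[list(pixel) for pixel in row] for row in image]  # copy, keep input intact
    for mask, rgb in zip(masks, params):
        for y, mask_row in enumerate(mask):
            for x, m in enumerate(mask_row):
                for c in range(3):
                    # mask * parameter gives the channel component; clip is assumed
                    out[y][x][c] = min(255.0, out[y][x][c] + m * rgb[c])
    return out

image = [[[10.0, 20.0, 30.0] for _ in range(2)] for _ in range(2)]
red_mask = [[1, 0], [0, 0]]    # skeleton pixels of the first subject
blue_mask = [[0, 0], [0, 1]]   # skeleton pixels of the second subject
params = [(200.0, 0.0, 0.0), (0.0, 0.0, 200.0)]  # subject 1 -> R, subject 2 -> B
projected = project_skeletons(image, [red_mask, blue_mask], params)
```

With one non-zero entry per parameter triple, each subject lands in its own target channel, which matches the exclusive-channel embodiment described next.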

[0135] In some embodiments, the exclusive learnable RGB projection parameters for each subject are set to assign each subject to an exclusive target channel, and different subjects are assigned to different target channels, wherein the target channel is one of the RGB channels.

[0136] In some embodiments, the apparatus further includes a depth acquisition unit, configured to:

[0137] Obtain a depth map of the video image, the depth map including depth information for each pixel;

[0138] The two-dimensional coordinate information of the skeletal key points is mapped to the corresponding positions in the depth map to obtain the three-dimensional coordinate information of the skeletal key points.
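Mapping 2-D key points into the depth map amounts to a per-point lookup. The sketch below assumes the depth map is pixel-aligned with the video image and clamps out-of-frame key points to the nearest border pixel; both are assumptions, and `lift_keypoints` is a hypothetical name.

```python
def lift_keypoints(keypoints_2d, depth_map):
    """Attach the depth at each 2-D key point to form (x, y, z) triples."""
    height, width = len(depth_map), len(depth_map[0])
    lifted = []
    for x, y in keypoints_2d:
        # Round to the nearest pixel and clamp to the valid image range.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        lifted.append((x, y, depth_map[row][col]))
    return lifted

depth_map = [[1.0, 1.2, 1.4], [2.0, 2.2, 2.4]]    # depth per pixel, 2 rows x 3 columns
keypoints = [(0.0, 0.0), (2.0, 1.0), (5.0, 1.0)]  # last one lies outside the frame
points_3d = lift_keypoints(keypoints, depth_map)
```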

[0139] In some embodiments, the identification unit is specifically used for:

[0140] The first feature of a video image with skeletal projection is extracted using a video encoder.

[0141] The second feature of the three-dimensional coordinate information of the skeletal key points corresponding to each subject is extracted using a skeletal encoder.

[0142] The actions of each subject are identified based on the fusion of the first feature and the second feature.

[0143] In some embodiments, the apparatus further includes a guiding unit for:

[0144] Based on the type of multi-subject cooperative motion, a guiding region is determined from the skeletal diagram, and the guiding region is projected onto a channel in the video image that does not have a skeletal projection.

[0145] Action recognition of each subject is performed based on video images with skeletal projection and guide areas, as well as the 3D coordinate information of the skeletal key points corresponding to each subject.

[0146] In some embodiments, the identification unit is specifically used for:

[0147] Extract the features of the guiding region at the first resolution;

[0148] Features of the region outside the guide region in the channel containing the guide region are extracted at a second resolution, where the first resolution is higher than the second resolution.

[0149] In some embodiments, when the identification unit is used to determine the actions of the opponents based on the first feature and the second feature, it is specifically used to:

[0150] The first feature and the second feature are fused to obtain an intermediate fused feature;

[0151] By routing different features in the intermediate fusion features to different expert networks and fusing the output features of each expert network in different ways, multiple target fusion features are obtained.

[0152] The multi-task classification result is obtained based on the fusion features of the multiple targets.

[0153] Figure 5 is a schematic structural diagram of a device provided in an exemplary embodiment. Referring to Figure 5, at the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, memory 508, and non-volatile memory 510, and may also include other hardware required for business operations. One or more embodiments of this specification can be implemented in software, such as the processor 502 reading the corresponding computer program from the non-volatile memory 510 into memory 508 and then running it. Of course, in addition to software implementation, one or more embodiments of this specification do not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the following processing flow is not limited to each logic unit, but can also be hardware or logic devices.

[0154] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, which can take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet computer, wearable device, or any combination of these devices.

[0155] In a typical configuration, a computer includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0156] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0157] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0158] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0159] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0160] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of one or more embodiments of this specification. The singular forms "a," "said," and "the" used in one or more embodiments of this specification and in the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and / or" as used herein refers to and includes any or all possible combinations of one or more associated listed items.

[0161] It should be understood that although the terms first, second, third, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information without departing from the scope of one or more embodiments of this specification, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."

[0162] The above description is merely a preferred embodiment of one or more embodiments of this specification and is not intended to limit the scope of one or more embodiments of this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the protection scope of one or more embodiments of this specification.

Claims

1. An action recognition method, characterized in that it comprises: performing human skeleton detection on a video image containing multiple subjects to obtain skeletal key point information of each subject, the skeletal key point information including two-dimensional coordinate information of the key points; generating a skeletal line diagram of each subject based on the key point information; acquiring learnable RGB projection parameters specific to the skeletal line diagrams of different subjects, and projecting the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB projection parameters, wherein the specific learnable RGB projection parameters of each subject are set to assign each subject to a specific target channel, different subjects are assigned to different target channels, and the target channel is one of the RGB channels; and using an action recognition network to identify the actions of each subject based on the video image with skeletal projections; wherein the learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.

2. The method according to claim 1, characterized in that the step of projecting the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB projection parameters includes: generating a binary mask image of the skeleton lines for each subject; multiplying the mask image of each subject by the corresponding learnable RGB projection parameters to obtain the components in the RGB channels; and superimposing the RGB channel projection components of each subject onto the corresponding RGB channels of the video image.

3. The method according to claim 1, characterized in that the method further includes: obtaining a depth map of the video image, the depth map including depth information for each pixel; and mapping the two-dimensional coordinate information of the skeletal key points to the corresponding positions in the depth map to obtain three-dimensional coordinate information of the skeletal key points.

4. The method according to claim 3, characterized in that using the action recognition network to identify the actions of each subject based on the video image with skeletal projections includes: extracting a first feature of the video image with skeletal projection using a video encoder; extracting a second feature of the three-dimensional coordinate information of the skeletal key points corresponding to each subject using a skeletal encoder; and identifying the actions of each subject based on the fusion of the first feature and the second feature.

5. The method according to claim 1, characterized in that the method further includes: determining a guiding region from the skeletal line diagram based on the type of multi-subject cooperative motion, and projecting the guiding region onto a channel in the video image that does not have a skeletal projection; and performing action recognition of each subject based on the video image with skeletal projection and the guiding region, as well as the three-dimensional coordinate information of the skeletal key points corresponding to each subject.

6. The method according to claim 5, characterized in that the step of recognizing the actions of each subject based on the video image with skeletal projection and the guiding region, as well as the three-dimensional coordinate information of the skeletal key points corresponding to each subject, includes: extracting features of the guiding region at a first resolution; and extracting features of the region outside the guiding region in the channel containing the guiding region at a second resolution, where the first resolution is higher than the second resolution.

7. The method according to claim 4, characterized in that determining the actions of both opponents based on the first feature and the second feature includes: fusing the first feature and the second feature to obtain an intermediate fused feature; routing different features in the intermediate fused feature to different expert networks, and fusing the output features of the expert networks in different ways to obtain multiple target fused features; and obtaining a multi-task classification result based on the multiple target fused features.

8. A motion recognition device, characterized in that it comprises: a detection unit, used to perform human skeleton detection on a video image containing multiple subjects and obtain skeletal key point information of each subject, wherein the skeletal key point information includes two-dimensional coordinate information of the key points; a generation unit, used to generate a skeletal line diagram of each subject based on the key point information; a projection unit, used to acquire learnable RGB projection parameters specific to the skeletal line diagrams of different subjects, and to project the skeletal line diagrams onto the corresponding RGB channels of the video image according to the learnable RGB projection parameters, wherein the specific learnable RGB projection parameters of each subject are set to assign each subject to a specific target channel, different subjects are assigned to different target channels, and the target channel is one of the RGB channels; and a recognition unit, used to identify the actions of each subject based on the video image with skeletal projection using an action recognition network; wherein the learnable RGB projection parameters are jointly optimized with the action recognition network parameters in an end-to-end manner through the backpropagation algorithm during the training phase of the action recognition network.

9. An electronic device, comprising: a processor; and a memory used to store processor-executable instructions; wherein the processor implements the method as described in any one of claims 1-7 by executing the executable instructions.