Action recognition method and device, electronic equipment and computer readable storage medium

By obtaining a 3D skeleton sequence from the video to be recognized and converting it into a 2D skeleton sequence, the problem of low recognition rate caused by perspective differences in action recognition technology is solved, and accurate action recognition under different perspectives is achieved.

CN114792441B · Active · Publication Date: 2026-06-23 · SHENZHEN LUMIUNITED TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN LUMIUNITED TECH CO LTD
Filing Date: 2021-01-25
Publication Date: 2026-06-23

IPC: G06V40/20; G06V20/64; G06V20/40; G06V10/764; G06V10/82; G06N3/0464; G06N3/0442; G06N3/084

AI Tagging

Application Domain

Biological models; Three-dimensional object recognition

Technology Topics

Computer graphics (images); Engineering

Abstract

Embodiments of the present application disclose an action recognition method and device, electronic equipment and a computer readable storage medium, and relate to the field of computer technology. The method comprises: obtaining a 3D skeleton sequence of a target object in a video to be recognized, wherein the target object is the object at which action recognition is directed, the 3D skeleton sequence comprises at least one 3D skeleton, and when the 3D skeleton sequence comprises a plurality of 3D skeletons, the plurality of 3D skeletons are arranged in the order of the time corresponding to each 3D skeleton in the video to be recognized; obtaining, according to the 3D skeleton sequence, a 2D skeleton sequence corresponding to an angle recognizable by an action recognition network, wherein the 2D skeleton sequence comprises at least one 2D skeleton; and performing action recognition according to the 2D skeleton sequence by using the action recognition network to obtain an action recognition result. Thus, the problem of a low recognition rate caused by the angle of view can be solved.

Description

Technical Field

[0001] This application relates to the field of computer technology, and more specifically, to an action recognition method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] The demand for action recognition technology is growing rapidly and has expanded into many fields, such as visual surveillance, human-computer interaction, video indexing / retrieval, video summarization, and video understanding. Currently, the common action recognition approach is as follows: first, an action recognition training set is collected; then, a skeleton-based action recognition network is trained based on the collected training set; and finally, the skeleton-based action recognition network is used for action recognition. However, the collected action recognition training set cannot cover every viewpoint of an action. Therefore, the skeleton-based action recognition network cannot correctly recognize actions from unfamiliar viewpoints. That is, when the acquisition angle of the actual application scenario data does not match the acquisition angle of the actions in the training set, the action recognition results obtained by the skeleton-based action recognition network may be inaccurate. Thus, current action recognition suffers from a low recognition rate.

Summary of the Invention

[0003] This application provides an action recognition method, device, electronic device, and computer-readable storage medium, which can solve the problem of low recognition rate caused by perspective issues.

[0004] The embodiments of this application can be implemented as follows:

[0005] In a first aspect, embodiments of this application provide an action recognition method applied to an electronic device, the method comprising:

[0006] Obtain a 3D skeleton sequence of a target object in a video to be identified, wherein the target object is the object targeted by action recognition, and the 3D skeleton sequence includes at least one 3D skeleton. When the 3D skeleton sequence includes multiple 3D skeletons, the multiple 3D skeletons are arranged in chronological order according to the time corresponding to each 3D skeleton in the video to be identified.

[0007] Based on the 3D skeleton sequence, a 2D skeleton sequence corresponding to the angle that the action recognition network can recognize is obtained, wherein the 2D skeleton sequence includes at least one 2D skeleton.

[0008] The action recognition network performs action recognition based on the 2D skeleton sequence to obtain action recognition results.

[0009] In a second aspect, embodiments of this application provide an action recognition device applied to an electronic device, the device comprising:

[0010] A 3D information acquisition module is used to acquire a 3D skeleton sequence of a target object in a video to be identified, wherein the target object is the object targeted by action recognition, the 3D skeleton sequence includes at least one 3D skeleton, and when the 3D skeleton sequence includes multiple 3D skeletons, the multiple 3D skeletons are arranged in chronological order according to the time corresponding to each 3D skeleton in the video to be identified.

[0011] The conversion module is used to obtain a 2D skeleton sequence corresponding to an angle that can be recognized by the action recognition network based on the 3D skeleton sequence, wherein the 2D skeleton sequence includes at least one 2D skeleton.

[0012] The recognition module is used to perform action recognition based on the 2D skeleton sequence through the action recognition network to obtain action recognition results.

[0013] In a third aspect, embodiments of this application provide an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor can execute the machine-executable instructions to implement the action recognition method described in any of the foregoing embodiments.

[0014] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the action recognition method as described in any of the foregoing embodiments.

[0015] With the action recognition method, apparatus, electronic device, and computer-readable storage medium provided in this application, for a video to be recognized, a 3D skeleton sequence corresponding to the action to be recognized of the target object in the video is first obtained. Then, based on the 3D skeleton sequence, a 2D skeleton sequence corresponding to the action to be recognized that can be correctly recognized by the action recognition network is obtained. Finally, the correct action recognition result is obtained through the action recognition network and the 2D skeleton sequence. The action recognition network is a network that performs action recognition based on 2D skeletons. Thus, even if the acquisition angle of the video to be recognized does not match the acquisition angle of the actions in the training set of the action recognition network, the action to be recognized of the target object in the video can still be correctly recognized, thereby solving the problem of low recognition rate caused by perspective issues.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a block diagram of the network system provided in the embodiments of this application;

[0018] Figure 2 is a flowchart illustrating the action recognition method provided in an embodiment of this application;

[0019] Figure 3 is a schematic diagram illustrating the process of obtaining action recognition results provided in an embodiment of this application;

[0020] Figure 4 is a flowchart illustrating the sub-steps included in step S120 of Figure 2;

[0021] Figure 5 is a block diagram of the action recognition device provided in the embodiments of this application;

[0022] Figure 6 is a block diagram of the electronic device provided in the embodiments of this application.

[0023] Reference numerals: 100 - Electronic device; 110 - Memory; 120 - Processor; 200 - Action recognition device; 210 - 3D information acquisition module; 220 - Conversion module; 230 - Recognition module; 300 - Control device.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.

[0025] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0026] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, the terms "first," "second," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance. It should be noted that features in the embodiments of this application can be combined with each other unless otherwise specified.

[0027] Depending on the input, action recognition can be divided into action recognition based on color video and action recognition based on three-dimensional (3D) or two-dimensional (2D) skeletons extracted from color video. Action recognition based on color video refers to directly recognizing actions from color video; action recognition based on 3D or 2D skeletons involves extracting a 3D or 2D skeleton from the color video and then performing action recognition based on the extracted skeleton. Because skeletons are a high-level representation and are robust to changes in human appearance and to surrounding interference, 3D or 2D skeleton-based action recognition has attracted considerable attention and research in recent years.

[0028] One of the main challenges of action recognition based on 3D or 2D skeletons lies in the complex viewpoint variations when capturing action data. In a real-world scenario, the camera's capture viewpoint differs across sequences, leading to significant variations in skeleton representation. That is, for the same action, videos captured from different viewpoints will yield different skeleton sequences. Furthermore, individuals can perform actions facing different directions and may change their orientation over time. Therefore, variation in the observation viewpoint makes action recognition a highly challenging problem.

[0029] Because training sets for 2D skeleton-based action recognition networks are relatively abundant and easy to annotate, such networks are currently the common choice for recognition. Specifically, a 2D skeleton sequence is extracted from the image sequence, and the action recognition result is obtained based on the 2D skeleton-based action recognition network and the obtained 2D skeleton sequence.

[0030] However, as mentioned earlier, viewpoints lead to different skeleton sequences. For example, observing the same person performing a certain action from different perspectives will result in different 2D skeleton sequences. Due to practical limitations (such as manpower and resources), the training set used to train 2D skeleton-based action recognition networks cannot cover every viewpoint of every action. Therefore, the trained action recognition network may be unable to correctly recognize actions from unseen viewpoints. In other words, when the acquisition angle of the actual application scenario data does not match the acquisition viewpoint of the actions in the training set, it is difficult to obtain correct action recognition results using a 2D skeleton-based action recognition network. Consequently, the current action recognition accuracy is low.

[0031] Therefore, the inventors propose an action recognition method in this application. Based on the 3D skeleton sequence corresponding to the action to be recognized of the target object in the video to be recognized, the method obtains a 2D skeleton sequence for that action that can be correctly recognized by the action recognition network. The correct action recognition result is then obtained through the action recognition network and the 2D skeleton sequence. This can improve the recognition rate of action recognition.

[0032] The embodiments of this application will now be described in detail with reference to the accompanying drawings.

[0033] Please refer to Figure 1, which is a block diagram of a network system provided in an embodiment of this application. The network system may include an electronic device 100 and a control device 300 that are communicatively connected. The electronic device 100 may be, but is not limited to, a server, a computer, etc. The control device 300 may be, but is not limited to, a gateway, a smart box, etc., and can be determined according to the actual application scenario.

[0034] The electronic device 100 is used to perform action recognition on the video to be recognized, obtain action recognition results, and send the action recognition results to the control device 300. The control device 300 can analyze the obtained action recognition results, generate prompt information, and so on.

[0035] Optionally, in one embodiment of this invention, the network system described above can be applied to elderly care scenarios. When the electronic device 100 identifies abnormal behavior (e.g., falls, climbing, etc.) as the action recognition result, the control device 300 can generate corresponding abnormal prompt information and send it to the terminal device of the user caring for the elderly, so as to inform the user of the information in a timely manner. The control device 300 can also control the alarm device to directly issue a voice or light alarm to alert the elderly or the user caring for them.

[0036] Optionally, in another embodiment of this invention, the network system described above can be applied to typical home application scenarios. The electronic device 100 can perform action recognition on video clips of typical home life, identify daily human behaviors (such as getting up, watching TV, and eating), obtain action recognition results, and send the obtained action recognition results to the control device 300. The control device 300 can perform statistical analysis based on the received action recognition results and display the statistical results to the user.

[0037] Optionally, in another embodiment of this invention, the network system described above can also be applied to the control scenario of smart home devices. After obtaining the action recognition result, the control device 300 can control the corresponding smart home device according to a pre-set control strategy. The control strategy may include specific actions and, for each specific action, the corresponding control of a specific home device. For example, if the control strategy maps a specific action to turning on the light, then when the action recognition result is that action, the control device 300 will issue a command to turn on the light.
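
As an illustration only, a control strategy of this kind can be sketched as a lookup table from recognized action to device command. The action labels, device names, and the `command_for` helper below are hypothetical and do not come from the application.

```python
from typing import Dict, Optional, Tuple

# Hypothetical control strategy: recognized action label -> (device, command).
# The labels and device names are illustrative assumptions.
CONTROL_STRATEGY: Dict[str, Tuple[str, str]] = {
    "raise_hand": ("living_room_light", "turn_on"),
    "wave": ("living_room_light", "turn_off"),
}


def command_for(action: str) -> Optional[Tuple[str, str]]:
    """Return the (device, command) pair for a recognized action, or None
    when the control strategy does not cover that action."""
    return CONTROL_STRATEGY.get(action)


# A recognized action that is in the strategy yields a command;
# any other action is ignored by the control device.
print(command_for("raise_hand"))
print(command_for("eating"))
```

Keeping the strategy as data rather than code means new action-to-device pairs can be added without changing the control logic.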

[0038] It is understandable that the above scenarios are merely illustrative examples, and the network system described above can be applied to any scenario that requires action recognition.

[0039] Please refer to Figure 2, which is a flowchart illustrating the action recognition method provided in an embodiment of this application. The action recognition method can be applied to an electronic device 100. The electronic device 100 may pre-store a trained action recognition network, which is a network that performs action recognition based on a 2D skeleton. This action recognition network can be pre-trained by the electronic device 100 or trained by other devices. The action recognition network can be, but is not limited to, a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, or an ST-GCN (Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition) network, etc. The specific flow of the action recognition method is explained in detail below.

[0040] Step S110: Obtain the 3D skeleton sequence of the target object in the video to be identified.

[0041] In this embodiment, the video to be identified can be determined by user specification, by specification from other devices, by specific selection rules, or by other methods. The video to be identified may include at least one object, which can be a person or an animal (e.g., a cat). The target object is the specific object for which action recognition is required. It is understood that if the video to be identified includes multiple people, a single person can be selected as the target object so that that person's actions are recognized. Optionally, for ease of recognition, the video to be identified may be a video corresponding to a single action to be identified.

[0042] After determining the video to be identified, it can be analyzed to obtain a 3D skeleton sequence corresponding to the target object. The video to be identified may include only one image frame or multiple image frames. The 3D skeleton sequence includes at least one 3D skeleton. The number of 3D skeletons included in the 3D skeleton sequence can be the same as the number of image frames included in the video to be identified. The 3D skeleton represents the 3D information of the target object; optionally, the 3D skeleton may include the three-dimensional coordinates of the target object's joints.

[0043] When the video to be identified consists of only one image frame, the 3D skeleton sequence includes only one 3D skeleton. When the video to be identified consists of multiple image frames, the 3D skeleton sequence includes multiple 3D skeletons, which are arranged in chronological order according to their corresponding times in the video. For example, if the 3D skeleton sequence of the target object in the video includes 3D skeletons 1, 2, and 3, and 3D skeleton 1 corresponds to time t1, 3D skeleton 2 to time t2, and 3D skeleton 3 to time t3, and t1 is earlier than t2 and t2 is earlier than t3, then the arrangement order of the 3D skeletons in the 3D skeleton sequence could be: 3D skeleton 1, 3D skeleton 2, 3D skeleton 3. Action is a continuous process; arranging the 3D skeletons in chronological order facilitates subsequent action recognition.
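
The chronological arrangement described above can be sketched as follows. This is a minimal illustration, assuming each 3D skeleton carries the time of its source frame; the `Skeleton3D` type and `build_skeleton_sequence` function are hypothetical names, not part of the application.

```python
from dataclasses import dataclass
from typing import List, Tuple

Joint3D = Tuple[float, float, float]


@dataclass
class Skeleton3D:
    time: float            # time of the source frame within the video
    joints: List[Joint3D]  # one (x, y, z) coordinate per joint


def build_skeleton_sequence(skeletons: List[Skeleton3D]) -> List[Skeleton3D]:
    """Return the skeletons arranged from earliest to latest frame time."""
    return sorted(skeletons, key=lambda s: s.time)


# Skeletons may be produced out of order, e.g. by parallel per-frame estimation.
unordered = [
    Skeleton3D(time=0.08, joints=[(0.0, 1.0, 2.0)]),
    Skeleton3D(time=0.00, joints=[(0.0, 1.0, 2.1)]),
    Skeleton3D(time=0.04, joints=[(0.0, 1.0, 2.05)]),
]
sequence = build_skeleton_sequence(unordered)
print([s.time for s in sequence])  # [0.0, 0.04, 0.08]
```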

[0044] Step S120: Based on the 3D skeleton sequence, obtain the 2D skeleton sequence corresponding to the angle that the action recognition network can recognize.

[0045] In this embodiment, by performing 2D projection on each 3D skeleton in the 3D skeleton sequence, a corresponding 2D skeleton can be obtained, thus obtaining a 2D skeleton sequence corresponding to the 3D skeleton sequence. The 2D skeleton sequence includes at least one 2D skeleton, and the number and arrangement of the 2D skeletons in the 2D skeleton sequence are the same as the number and arrangement of the 3D skeletons in the 3D skeleton sequence. The 2D skeleton represents the 2D information of the target object; optionally, the 2D skeleton may include the two-dimensional coordinates of the joints of the target object.

[0046] During 2D projection, the specific projection processing method can be determined according to the actual situation, as long as the action recognition network can correctly identify the action based on the final 2D skeleton sequence. When the action recognition network can correctly identify the action corresponding to the 2D skeleton sequence at a certain viewpoint, the 2D skeleton sequence at that viewpoint can be considered a 2D skeleton sequence at an angle that the action recognition network can recognize.

[0047] Step S130: Through the action recognition network, action recognition is performed based on the 2D skeleton sequence to obtain the action recognition result.

[0048] When a 2D skeleton sequence corresponding to the 3D skeleton sequence and corresponding to an angle recognizable by the action recognition network is obtained, the action recognition network can be used to perform action recognition based on the 2D skeleton sequence, thereby obtaining the action recognition result. Since the 2D skeleton sequence used for action recognition corresponds to an angle recognizable by the action recognition network, the action recognition result is accurate.

[0049] When the training set cannot fully cover all viewpoints, this embodiment of the application obtains a 2D skeleton sequence from the obtained 3D skeleton sequence, which is then used by the action recognition network to correctly identify the action. Based on this action recognition network and the 2D skeleton sequence, a correct action recognition result is obtained. This solves the problem of low accuracy caused by viewpoint issues, thereby improving the accuracy of action recognition.
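
Steps S110 to S130 can be sketched as a three-stage pipeline. This is a sketch under stated assumptions: the three stages are passed in as callables, and the `fake_*` functions are toy stand-ins used only to make the example runnable; they are not the networks described in this application.

```python
from typing import Any, Callable, List, Sequence

Skeleton3D = List[List[float]]  # one [x, y, z] per joint
Skeleton2D = List[List[float]]  # one [x, y] per joint


def recognize_action(
    frames: Sequence[Any],
    estimate_3d: Callable[[Any], Skeleton3D],
    to_2d: Callable[[List[Skeleton3D]], List[Skeleton2D]],
    classify: Callable[[List[Skeleton2D]], str],
) -> str:
    # S110: one 3D skeleton per frame, kept in frame (chronological) order.
    skeletons_3d = [estimate_3d(frame) for frame in frames]
    # S120: convert to a 2D skeleton sequence at a recognizable angle.
    skeletons_2d = to_2d(skeletons_3d)
    # S130: run the 2D skeleton-based action recognition network.
    return classify(skeletons_2d)


# Toy stand-ins (assumptions): a "frame" is a single number used as the
# x-coordinate of a one-joint skeleton.
def fake_estimate_3d(frame: float) -> Skeleton3D:
    return [[frame, 0.0, 1.0]]


def fake_to_2d(sequence: List[Skeleton3D]) -> List[Skeleton2D]:
    return [[[x, y] for x, y, _z in skeleton] for skeleton in sequence]


def fake_classify(sequence: List[Skeleton2D]) -> str:
    xs = [skeleton[0][0] for skeleton in sequence]
    return "moving" if max(xs) - min(xs) > 0.5 else "still"


result = recognize_action([0.0, 0.4, 1.0], fake_estimate_3d, fake_to_2d, fake_classify)
print(result)  # moving
```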

[0050] Please refer to Figure 3, which is a schematic diagram illustrating the process of obtaining action recognition results provided in an embodiment of this application. Optionally, in one possible implementation of this embodiment, the electronic device 100 may store a pre-trained 3D pose estimation network. The 3D pose estimation network can be used to analyze each frame of the video to be recognized, thereby quickly obtaining the 3D skeleton of the target object in each frame. When multiple 3D skeletons are obtained from the video to be recognized, they can be arranged according to the chronological order of their corresponding times in the video to be recognized, thereby obtaining a 3D skeleton sequence of the target object. Each 3D skeleton corresponds to one video frame in the video to be recognized, and the time of that video frame in the video to be recognized can be used as the time corresponding to the 3D skeleton.

[0051] The 3D pose estimation network can obtain the 3D skeleton of the target object from an RGB image or an RGBD image with depth information. Here, R represents red; G represents green; B represents blue; and D represents depth. The 3D pose estimation network can be a 3D human pose estimation network, such as SemGCN (Semantic Graph Convolutional Networks for 3D Human Pose Regression). This 3D pose estimation network can be pre-trained by other devices based on sample videos and their corresponding 3D skeletons, or it can be pre-trained by electronic device 100 based on sample videos and their corresponding 3D skeletons, or it can be obtained through other methods.

[0052] It is understood that the above method is only an example, and the 3D skeleton sequence of the target object can also be obtained from the video to be identified in other ways.

[0053] In one embodiment of this invention, the electronic device 100 may also store a pre-trained conversion network. This conversion network can quickly obtain a 2D skeleton sequence corresponding to an angle recognizable by the action recognition network. The conversion network and the action recognition network can be trained from multiple pre-obtained first training samples using an end-to-end training method or other training methods. Each first training sample includes a sample 3D skeleton sequence and a corresponding first sample action recognition result. The conversion network can also be trained solely from the sample 3D skeleton sequence and a corresponding 2D skeleton sequence that can be accurately recognized by the action recognition network. It is understood that other methods can also be used to train the conversion network, as long as it ensures that a 2D skeleton sequence that can be accurately recognized by the action recognition network can be obtained based on a 3D skeleton sequence. The conversion network can be pre-trained by the electronic device 100 or by other devices.

[0054] Optionally, the conversion network may include a viewpoint adaptive network. A 3D skeleton sequence obtained from the video to be recognized can be input into the viewpoint adaptive network to obtain a 2D skeleton sequence corresponding to the angle that the action recognition network can recognize. The viewpoint adaptive network and the action recognition network can be trained using sample 3D skeleton sequences and the first sample action recognition results corresponding to the sample 3D skeleton sequences.

[0055] Please refer to Figure 4, which is a flowchart illustrating the sub-steps included in step S120 of Figure 2. Step S120 may include sub-steps S121 and S122.

[0056] Sub-step S121: Determine the 2D projection parameters corresponding to each 3D skeleton in the 3D skeleton sequence through the viewpoint adaptive network.

[0057] Sub-step S122: Based on the 2D projection parameters corresponding to each 3D skeleton, project the 3D skeleton corresponding to each 2D projection parameter to obtain the 2D skeleton corresponding to the angle that the action recognition network can recognize.

[0058] In this embodiment, each 3D skeleton can be input into the viewpoint adaptive network to obtain the perspective transformation parameters required for 3D-to-2D skeleton conversion. These perspective transformation parameters are the 2D projection parameters. A neural network is thus used to estimate the perspective transformation parameters quickly. Optionally, the viewpoint adaptive network can employ a weak perspective projection model, in which the perspective transformation parameters are a set of camera intrinsic parameters expressed as a 6-dimensional vector that is reshaped into a 2×3 matrix in use.

[0059] Once the 2D projection parameters have been obtained through the viewpoint adaptive network, each 3D skeleton can be projected to 2D according to its corresponding 2D projection parameters to obtain a 2D skeleton at the corresponding viewpoint. Thus, a 2D skeleton sequence that corresponds to the 3D skeleton sequence and that the action recognition network can recognize correctly is obtained. The 2D projection parameters corresponding to each 3D skeleton in the 3D skeleton sequence can be the same or different, depending on the actual situation, and are not specifically limited here.

[0060] Optionally, referring again to Figure 2, the conversion network may further include a perspective projection network. After obtaining the 3D skeleton sequence, the 2D projection parameters corresponding to each 3D skeleton can be obtained through the viewpoint adaptive network. Then, based on the 3D skeleton sequence and the corresponding 2D projection parameters, the perspective projection network performs 2D projection on each 3D skeleton in the 3D skeleton sequence to obtain the 2D skeleton sequence. The process of reprojecting the 3D skeleton into a 2D skeleton using the perspective projection network can be represented as: W′ = KX, where X represents the 3D skeleton input to the perspective projection network; K represents the 2D projection parameters corresponding to the input 3D skeleton, a 6-dimensional vector applied as a 2×3 matrix; and W′ represents the reprojected 2D skeleton. This perspective projection network is a single layer that only performs reprojection and has no trainable parameters.
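
The reprojection W′ = KX can be sketched with NumPy as below. This is an illustrative sketch, not the application's implementation: the row-major reshape of the 6-dimensional vector into a 2×3 matrix, the joint layout (one row per joint), and the function names `project_skeleton` and `project_sequence` are assumptions. In the described method the parameters would come from the viewpoint adaptive network; here they are hand-picked.

```python
import numpy as np


def project_skeleton(k: np.ndarray, skeleton_3d: np.ndarray) -> np.ndarray:
    """Apply W' = K X to one skeleton.

    k:           6 projection parameters for this skeleton.
    skeleton_3d: array of shape (num_joints, 3).
    Returns an array of shape (num_joints, 2).
    """
    K = np.asarray(k, dtype=float).reshape(2, 3)  # assumption: row-major layout
    X = np.asarray(skeleton_3d, dtype=float).T    # shape (3, num_joints)
    return (K @ X).T                              # no trainable parameters


def project_sequence(params: np.ndarray, skeletons_3d: np.ndarray) -> np.ndarray:
    """Project every skeleton with its own parameters: (T, J, 3) -> (T, J, 2)."""
    return np.stack([project_skeleton(k, x) for k, x in zip(params, skeletons_3d)])


skeleton = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
front_view = np.array([1, 0, 0, 0, 1, 0])  # keeps x and y, drops z
side_view = np.array([0, 0, 1, 0, 1, 0])   # uses z as the horizontal axis

print(project_skeleton(front_view, skeleton))  # [[1. 2.] [4. 5.]]
print(project_skeleton(side_view, skeleton))   # [[3. 2.] [6. 5.]]
```

Because each skeleton takes its own parameter vector, `project_sequence` covers both the case where all frames share the same parameters and the case where they differ.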

[0061] After obtaining the 2D skeleton sequence, the 2D skeleton sequence can be input into the action recognition network, and the output of the action recognition network can be used as the action recognition result.

[0062] Optionally, in one embodiment of this invention, a recognition model can take the video to be recognized as input and output the corresponding action recognition result. The structure of the recognition model can be as shown in Figure 2: the recognition model can include a 3D pose estimation network, a viewpoint adaptation network, a perspective projection network, and an action recognition network. Each input image frame passes through the preceding networks in sequence before finally being processed by the action recognition network to obtain the action recognition result.

[0063] Optionally, as a possible implementation, the 3D pose estimation network can be a pre-trained network, and the viewpoint adaptation network and the action recognition network can be obtained by end-to-end training based on the first training samples, as briefly described below.

[0064] Multiple first training samples can be obtained in advance, including sample 3D skeleton sequences and first sample action recognition results. The sample 3D skeleton sequences are input into a first initial viewpoint adaptive network, and the 2D skeleton sequences obtained through the first initial viewpoint adaptive network are input into a first initial action recognition network to obtain first initial action recognition results. Here, the first initial viewpoint adaptive network represents an untrained viewpoint adaptive network, and the first initial action recognition network represents an untrained action recognition network; the first initial action recognition results represent the action recognition results obtained during training through the untrained viewpoint adaptive network and action recognition network.

[0065] A first loss value is obtained based on the first initial action recognition result and the first sample action recognition result. The parameters in the first initial viewpoint adaptive network and/or the first initial action recognition network are then adjusted using gradient backpropagation based on this first loss value. Specifically, the first loss value can be calculated from a preset loss function, the first initial action recognition result corresponding to the first training sample, and the first sample action recognition result.

[0066] Next, it can be determined whether the first training stop condition is met. This first training stop condition can be set according to actual needs. For example, the first training stop condition may include a first preset loss value or a first preset number of training iterations. The first training stop condition is determined to be met when the first loss value is less than the first preset loss value, or the number of training iterations is greater than the first preset number of training iterations.
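The stop condition described above can be written as a small predicate. This is a sketch under assumptions: the function name and parameter names are hypothetical, and either threshold may be left unset because the text says the condition "may include" one or the other.

```python
from typing import Optional


def first_stop_condition_met(
    loss: float,
    iteration: int,
    preset_loss: Optional[float] = None,
    preset_iterations: Optional[int] = None,
) -> bool:
    """Return True when the loss is below the preset loss value, or when the
    number of training iterations exceeds the preset number of iterations."""
    loss_reached = preset_loss is not None and loss < preset_loss
    iterations_exceeded = (
        preset_iterations is not None and iteration > preset_iterations
    )
    return loss_reached or iterations_exceeded
```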

[0067] If the first training stop condition is met, training stops, and the current first initial viewpoint adaptation network and first initial action recognition network are used as the trained viewpoint adaptation network and action recognition network. If the first training stop condition is not met, the process can return to the step of inputting the sample 3D skeleton sequence into the first initial viewpoint adaptation network and the 2D skeleton sequence obtained through the first initial viewpoint adaptation network into the first initial action recognition network to obtain the first initial action recognition result, that is, continuing to train using the first training samples.
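The repeat-until-stop structure of this training procedure can be sketched as a generic loop. The sketch is illustrative only: `train_until_stop`, `toy_step`, and the scalar parameter `w` are hypothetical stand-ins for the two networks, and the safety cap on iterations is an added assumption.

```python
from typing import Callable, Tuple


def train_until_stop(
    step: Callable[[], float],
    stop: Callable[[float, int], bool],
    max_safety_iters: int = 10_000,
) -> Tuple[float, int]:
    """Repeat one training pass until the stop condition holds."""
    loss, iteration = float("inf"), 0
    while iteration < max_safety_iters:
        loss = step()        # forward pass, loss, backpropagation update
        iteration += 1
        if stop(loss, iteration):
            break            # current parameters are kept as the trained networks
    return loss, iteration


# Toy stand-in: a single scalar parameter with loss (w - 3)^2.
state = {"w": 0.0}


def toy_step(lr: float = 0.1) -> float:
    grad = 2.0 * (state["w"] - 3.0)   # derivative of (w - 3)^2 with respect to w
    state["w"] -= lr * grad           # gradient-descent update
    return (state["w"] - 3.0) ** 2    # loss after the update


final_loss, iterations = train_until_stop(
    toy_step, lambda loss, it: loss < 1e-4 or it > 500
)
```

If the condition is not met, the loop simply returns to the same step, which mirrors the "return to the step of inputting the sample" wording above.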

[0068] Optionally, as another possible implementation, multiple second training samples can be obtained in advance, and the 3D pose estimation network, viewpoint adaptation network, and action recognition network can be trained using an end-to-end training method. The second training samples include sample videos and corresponding second sample action recognition results. Here, end-to-end refers to the process from inputting a series of RGB images (i.e., videos) to outputting action recognition results. The dataset composed of multiple second training samples can be a commonly used action recognition dataset, such as NTU-RGB+D or Kinetics-Skeleton.

[0069] During training, the sample video can be input into an initial 3D pose estimation network to obtain a second initial action recognition result through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network. Here, the initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network represent the untrained 3D pose estimation network, viewpoint adaptation network, and action recognition network, respectively; the second initial action recognition result represents the action recognition result obtained during training through the untrained 3D pose estimation network, viewpoint adaptation network, and action recognition network.

[0070] A second loss value is obtained based on the second initial action recognition result and the second sample action recognition result. Then, based on this second loss value, the parameters of at least one of the following networks—the initial 3D pose estimation network, the second initial viewpoint adaptation network, and the second initial action recognition network—are adjusted using gradient backpropagation. Specifically, when calculating the second loss value, it can be calculated based on a preset loss function, the second initial action recognition result corresponding to the second training sample, and the second sample action recognition result. Optionally, the second loss value can be calculated using a multi-class cross-entropy function, and the parameters can be updated layer by layer using the chain rule of backpropagation to obtain a trained network.
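For reference, a multi-class cross-entropy loss and its gradient with respect to the network's raw scores can be computed as below. This is a generic sketch, assuming the action recognition network outputs unnormalized scores (logits) that are converted to probabilities with softmax; the function names are hypothetical and the source does not specify this detail.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into a probability distribution."""
    shift = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Multi-class cross-entropy for a one-hot label: -log p[label]."""
    return -math.log(probs[label])


def grad_wrt_logits(probs: List[float], label: int) -> List[float]:
    """Gradient of softmax + cross-entropy with respect to the logits: p - y."""
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


probs = softmax([2.0, 0.5, -1.0])
loss = cross_entropy(probs, label=0)
grad = grad_wrt_logits(probs, label=0)   # starting point for backpropagation
```

The gradient `p - y` is what the chain rule then propagates layer by layer into the earlier networks.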

[0071] Next, it can be determined whether the second training stopping condition is met. If it is met, training stops, and the current initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network are used as the trained 3D pose estimation network, viewpoint adaptation network, and action recognition network. If it is not met, the process returns to the step of inputting the sample video into the initial 3D pose estimation network to obtain the second initial action recognition result through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network, that is, training continues using the second training samples. The second training stopping condition can be set according to actual needs. For example, the second training stopping condition includes a second preset loss value; when the second loss value is less than the second preset loss value, it is determined that the second training stopping condition is met.

[0072] To facilitate rapid convergence of the entire network when the recognition model shown in Figure 2 is trained in an end-to-end manner, the parameters of the 3D pose estimation network and the action recognition network can be fixed, and only the viewpoint adaptation network is trained.
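Fixing some networks while training another amounts to applying the gradient update only to a chosen subset of parameters. A minimal sketch, assuming parameters and gradients are stored as plain lists keyed by hypothetical network names (`pose`, `view`, `action`):

```python
from typing import Dict, List, Set


def apply_update(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float = 0.01,
) -> Dict[str, List[float]]:
    """Return new parameters, updating only the networks named in `trainable`."""
    updated: Dict[str, List[float]] = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)   # frozen network: copied unchanged
    return updated


params = {"pose": [1.0, 2.0], "view": [1.0, 2.0], "action": [1.0, 2.0]}
grads = {"pose": [10.0, 10.0], "view": [10.0, 10.0], "action": [10.0, 10.0]}

# Only the viewpoint adaptation network is trained; the other two stay fixed.
new_params = apply_update(params, grads, trainable={"view"}, lr=0.1)
```

In a deep-learning framework the same effect is usually obtained by marking the frozen parameters as not requiring gradients.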

[0073] Alternatively, the training process for the 3D pose estimation network, viewpoint adaptation network, and action recognition network can be divided into two stages. The first stage involves training the 3D pose estimation network based on samples and their corresponding 3D skeletons; that is, the first stage outputs a 3D skeleton when the input is an image. The second stage involves training the viewpoint adaptation network and action recognition network based on the sample 3D skeleton sequence and the sample action recognition results; that is, the second stage uses the 3D skeleton as input and outputs action recognition results. To accelerate training, only the second stage of training can be performed.

[0074] It is understandable that the above 3D pose estimation network, viewpoint adaptation network, and action recognition network can also be trained in other ways.

[0075] During training, the cross-entropy loss function can optionally be used to calculate the loss value, with the optimization objective being to maximize the final action recognition performance. The gradient flows back within each network, i.e., gradient backpropagation can train the entire network (where the entire network here includes the 3D pose estimation network, the viewpoint adaptation network, and the action recognition network). The process is briefly explained below.

[0076] Assume the input image is x, the processing of the 3D pose estimation network is denoted f_3D, and the processing of the action recognition network is denoted f_AR. Following the forward flow of the entire network, the final output `classification_result` (the action recognition result) for the input x is: `classification_result = f_AR(K(f_3D(x)))`. Based on this output and the actual sample action recognition results, supervised training is performed using the cross-entropy loss function to train the 3D pose estimation network, the viewpoint adaptation network, and the action recognition network. After obtaining the output `classification_result`, substituting it as the predicted value into the cross-entropy loss function yields the final loss, and backward gradient propagation is then performed based on this loss. In the cross-entropy loss function, p = [p_0, ..., p_{C-1}] is a probability distribution, where each element p_i represents the probability that the sample belongs to the i-th class; y = [y_0, ..., y_{C-1}] is the one-hot representation of the sample label (i.e., the sample action recognition result), where y_i = 1 when the sample belongs to the i-th class and y_i = 0 otherwise; and c is the sample label.
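The loss formula itself does not survive in the text above. For orientation, the standard multi-class cross-entropy written with the symbols p, y, c, and C is given below, followed by a schematic chain rule for the 3D pose estimation network. This is the textbook form rather than a formula confirmed by the source; the symbol \theta_{3D} (the parameters of f_3D) is assumed notation, and the chain rule assumes that K is differentiable.

```latex
% Standard multi-class cross-entropy with a one-hot label y (textbook form).
\[
  L(p, y) = -\sum_{i=0}^{C-1} y_i \log p_i = -\log p_c,
  \qquad p = f_{AR}\bigl(K(f_{3D}(x))\bigr)
\]
% Schematic chain rule: the gradient flows back through f_AR, K, and f_3D.
\[
  \frac{\partial L}{\partial \theta_{3D}}
  = \frac{\partial L}{\partial p}\,
    \frac{\partial f_{AR}}{\partial K}\,
    \frac{\partial K}{\partial f_{3D}}\,
    \frac{\partial f_{3D}}{\partial \theta_{3D}}
\]
```

Because y is one-hot, the sum collapses to the single term for the labelled class c.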

[0077] To perform the corresponding steps in the above embodiments and various possible methods, an implementation of the action recognition device 200 is given below. Please refer to Figure 5, which is a block diagram of the action recognition device 200 provided in this embodiment. It should be noted that the basic principle and technical effects of the action recognition device 200 provided in this embodiment are the same as those of the above embodiments. For brevity, for any parts not mentioned in this embodiment, refer to the corresponding content in the above embodiments. The action recognition device 200 can be applied to an electronic device 100, which can store a pre-trained action recognition network that performs action recognition based on 2D skeletons. The action recognition device 200 may include a 3D information acquisition module 210, a conversion module 220, and a recognition module 230.

[0078] The 3D information acquisition module 210 is used to acquire a 3D skeleton sequence of a target object in a video to be identified. The target object is the object targeted by action recognition. The 3D skeleton sequence includes at least one 3D skeleton. When the 3D skeleton sequence includes multiple 3D skeletons, the multiple 3D skeletons are arranged in chronological order according to the time corresponding to each 3D skeleton in the video to be identified.
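The ordering requirement on the 3D skeleton sequence can be illustrated with a small data structure. This is a sketch with hypothetical names (`Skeleton3D`, `build_sequence`); the source does not prescribe how timestamps or joints are stored.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Skeleton3D:
    timestamp: float                               # time of the frame in the video
    joints: Tuple[Tuple[float, float, float], ...]  # (x, y, z) per joint


def build_sequence(skeletons: Sequence[Skeleton3D]) -> List[Skeleton3D]:
    """Return the skeletons in chronological order; at least one is required."""
    if not skeletons:
        raise ValueError("a 3D skeleton sequence needs at least one 3D skeleton")
    return sorted(skeletons, key=lambda s: s.timestamp)


unordered = [
    Skeleton3D(0.08, ((0.0, 1.0, 2.0),)),
    Skeleton3D(0.00, ((0.0, 1.1, 2.0),)),
    Skeleton3D(0.04, ((0.0, 1.2, 2.0),)),
]
sequence = build_sequence(unordered)
```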

[0079] The conversion module 220 is used to obtain a 2D skeleton sequence corresponding to an angle recognizable by the action recognition network based on the 3D skeleton sequence, wherein the 2D skeleton sequence includes at least one 2D skeleton.

[0080] The recognition module 230 is used to perform action recognition based on the 2D skeleton sequence through the action recognition network to obtain action recognition results.

[0081] Optionally, in this embodiment, the conversion module 220 is specifically used to: input the 3D skeleton sequence into a viewpoint adaptive network to obtain a 2D skeleton sequence corresponding to an angle recognizable by the action recognition network. The viewpoint adaptive network and the action recognition network are trained from the sample 3D skeleton sequence and the first sample action recognition result corresponding to the sample 3D skeleton sequence.

[0082] Optionally, in this embodiment, the conversion module 220 is specifically used to: determine the 2D projection parameters corresponding to each 3D skeleton in the 3D skeleton sequence through the viewpoint adaptive network; and project the 3D skeleton corresponding to each 2D projection parameter according to the 2D projection parameters to obtain the 2D skeleton corresponding to the angle that the action recognition network can recognize.
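The projection step can be sketched as a rotation followed by a pinhole projection. The source does not state what the 2D projection parameters are, so this sketch assumes a single rotation angle `yaw` about the vertical axis as the per-skeleton parameter, with a hypothetical fixed camera distance `depth` and focal length `focal`.

```python
import math
from typing import List, Sequence, Tuple


def project_skeleton(
    joints: Sequence[Tuple[float, float, float]],
    yaw: float,
    depth: float = 5.0,
    focal: float = 1.0,
) -> List[Tuple[float, float]]:
    """Rotate a 3D skeleton about the vertical (y) axis, then project it to 2D."""
    cos_a, sin_a = math.cos(yaw), math.sin(yaw)
    projected = []
    for x, y, z in joints:
        x_rot = cos_a * x + sin_a * z      # rotation about the y axis
        z_rot = -sin_a * x + cos_a * z
        z_cam = z_rot + depth              # place the skeleton in front of the camera
        if z_cam <= 0:
            raise ValueError("joint lies behind the virtual camera")
        projected.append((focal * x_rot / z_cam, focal * y / z_cam))
    return projected


front_view = project_skeleton([(1.0, 2.0, 0.0)], yaw=0.0)
side_view = project_skeleton([(1.0, 2.0, 0.0)], yaw=math.pi / 2)
```

The projection itself contains no trainable parameters here; only the angle supplied to it would come from a learned network.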

[0083] Optionally, in this embodiment, the viewpoint adaptive network and the action recognition network are trained in the following manner: obtaining multiple first training samples, wherein the first training samples include sample 3D skeleton sequences and first sample action recognition results; inputting the sample 3D skeleton sequences into a first initial viewpoint adaptive network, and inputting the 2D skeleton sequences obtained through the first initial viewpoint adaptive network into a first initial action recognition network to obtain first initial action recognition results; obtaining a first loss value based on the obtained first initial action recognition results and sample action recognition results, and adjusting the parameters in the first initial viewpoint adaptive network and / or the first initial action recognition network according to the first loss value through backpropagation; determining whether a first training stop condition is met; if met, stopping training, and using the current first initial viewpoint adaptive network and first initial action recognition network as the trained viewpoint adaptive network and action recognition network; if not met, returning to the step of inputting the sample 3D skeleton sequences into the first initial viewpoint adaptive network, and inputting the 2D skeleton sequences obtained through the first initial viewpoint adaptive network into the first initial action recognition network to obtain first initial action recognition results.

[0084] Optionally, in this embodiment, the 3D information acquisition module 210 is specifically used to: analyze each image frame in the video to be identified through a 3D pose estimation network to obtain the 3D skeleton of the target object in each image frame.

[0085] Optionally, in this embodiment, the conversion module 220 is specifically used to: input the 3D skeleton sequence into the viewpoint adaptive network to obtain the 2D skeleton sequence corresponding to the angle that the action recognition network can recognize. The 3D pose estimation network, viewpoint adaptation network, and action recognition network are trained as follows: multiple second training samples are obtained, including sample videos and second sample action recognition results; the sample videos are input into the initial 3D pose estimation network to obtain second initial action recognition results through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network; a second loss value is obtained based on the obtained second initial action recognition results and second sample action recognition results, and the parameters of at least one of the initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network are adjusted through backpropagation based on the second loss value; it is determined whether a second training stopping condition is met; if met, training is stopped, and the current initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network are used as the trained 3D pose estimation network, viewpoint adaptation network, and action recognition network; if not met, the process returns to the step of inputting the sample videos into the initial 3D pose estimation network to obtain second initial action recognition results through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network.

[0086] Please refer to Figure 6, which is a block diagram of the electronic device 100 provided in an embodiment of this application. This application also provides an electronic device 100. As shown in Figure 6, the electronic device 100 may include one or more of the following components: a memory 110, a processor 120, and one or more application programs, wherein the one or more application programs may be stored in the memory 110 and configured to be executed by one or more processors 120, and the one or more application programs are configured to perform the action recognition method described in the foregoing method embodiments.

[0087] Processor 120 may include one or more processing cores. Processor 120 connects to various parts within the electronic device 100 using various interfaces and lines, and performs various functions and processes data of the electronic device 100 by running or executing instructions, programs, code sets, or instruction sets stored in memory 110, and by calling data stored in memory 110. Optionally, processor 120 may be implemented in at least one of the following hardware forms: Digital Signal Processor (DSP), Field-Programmable Gate Array (FPGA), or Programmable Logic Array (PLA). Processor 120 may integrate one or a combination of several of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into processor 120 and may instead be implemented separately using a communication chip.

[0088] The memory 110 may include random access memory (RAM) or read-only memory (ROM). The memory 110 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 110 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), and instructions for implementing the various method embodiments described above. Those skilled in the art will understand that the structure shown in Figure 6 is for illustrative purposes only and does not limit the structure of the electronic device 100 described above. For example, the electronic device 100 may include more or fewer components than those shown in Figure 6, or have a configuration different from that shown in Figure 6.

[0089] This application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the action recognition method described above.

[0090] In summary, this application provides an action recognition method, apparatus, electronic device, and computer-readable storage medium. For a video to be recognized, a 3D skeleton sequence corresponding to the action to be recognized of the target object in the video is first obtained. Then, based on the 3D skeleton sequence, a 2D skeleton sequence corresponding to the action to be recognized, which can be correctly recognized by the action recognition network, is obtained. Finally, the correct action recognition result is obtained through the action recognition network and the 2D skeleton sequence. The aforementioned action recognition network is a network that performs action recognition based on the 2D skeleton. Thus, even if the acquisition angle of the video to be recognized does not match the acquisition angle of the actions in the training set of the action recognition network, the action to be recognized of the target object in the video can still be correctly recognized, thereby solving the problem of low recognition rate caused by perspective issues.

[0091] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0092] In addition, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0093] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0094] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. An action recognition method, characterized in that it is applied to an electronic device, and the method includes: obtaining a 3D skeleton sequence of a target object in a video to be identified, wherein the target object is the object targeted by action recognition, the 3D skeleton sequence includes at least one 3D skeleton, and when the 3D skeleton sequence includes multiple 3D skeletons, the multiple 3D skeletons are arranged in chronological order according to the time corresponding to each 3D skeleton in the video to be identified; inputting the 3D skeleton sequence into a viewpoint adaptive network to determine the 2D projection parameters corresponding to each 3D skeleton in the 3D skeleton sequence; performing, through a perspective projection network, 2D projection on each 3D skeleton in the 3D skeleton sequence based on the 3D skeleton sequence and the 2D projection parameters corresponding to each 3D skeleton, to obtain a 2D skeleton sequence corresponding to an angle that an action recognition network can recognize, wherein the 2D skeleton sequence includes at least one 2D skeleton; and performing, through the action recognition network, action recognition based on the 2D skeleton sequence to obtain an action recognition result.

2. The method according to claim 1, characterized in that, The viewpoint adaptive network and the action recognition network are trained from the sample 3D skeleton sequence and the action recognition result of the first sample corresponding to the sample 3D skeleton sequence.

3. The method according to claim 2, characterized in that, The perspective projection network is a single layer that only performs reprojection and has no trainable parameters.

4. The method according to claim 2 or 3, characterized in that, The viewpoint adaptive network and the action recognition network are trained in the following way: Multiple first training samples are obtained, wherein the first training samples include sample 3D skeleton sequences and first sample action recognition results; The sample 3D skeleton sequence is input into the first initial viewpoint adaptive network, and the 2D skeleton sequence obtained through the first initial viewpoint adaptive network is input into the first initial action recognition network to obtain the first initial action recognition result; A first loss value is obtained based on the first initial action recognition result and the sample action recognition result, and the parameters in the first initial viewpoint adaptive network and / or the first initial action recognition network are adjusted based on the first loss value through backpropagation. Determine whether the first training stop condition is met; If the conditions are met, training is stopped, and the current first initial viewpoint adaptation network and the first initial action recognition network are used as the trained viewpoint adaptation network and the action recognition network. If the conditions are not met, the process returns to the step of inputting the sample 3D skeleton sequence into the first initial viewpoint adaptive network, and inputting the 2D skeleton sequence obtained through the first initial viewpoint adaptive network into the first initial action recognition network to obtain the first initial action recognition result.

5. The method according to claim 1, characterized in that, The step of obtaining the 3D skeleton sequence of the target object in the video to be identified includes: By using a 3D pose estimation network, each image frame in the video to be identified is analyzed to obtain the 3D skeleton of the target object in each image frame.

6. The method according to claim 5, characterized in that, The step of obtaining a 2D skeleton sequence corresponding to an angle recognizable by the action recognition network based on the 3D skeleton sequence includes: The 3D skeleton sequence is input into the viewpoint adaptive network to obtain the 2D skeleton sequence corresponding to the angle that the action recognition network can recognize. The 3D pose estimation network, viewpoint adaptation network, and action recognition network are trained in the following way: Multiple second training samples are obtained, wherein the second training samples include sample videos and second sample action recognition results; The sample video is input into the initial 3D pose estimation network to obtain the second initial action recognition result through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network; A second loss value is obtained based on the second initial action recognition result and the second sample action recognition result, and based on the second loss value, the parameters of at least one of the initial 3D pose estimation network, the second initial viewpoint adaptation network, and the second initial action recognition network are adjusted through backpropagation; Whether the second training stop condition is met is determined; If the condition is met, training is stopped, and the current initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network are used as the trained 3D pose estimation network, viewpoint adaptation network, and action recognition network; If the condition is not met, the process returns to the step of inputting the sample video into the initial 3D pose estimation network to obtain the second initial action recognition result through the connected initial 3D pose estimation network, second initial viewpoint adaptation network, and second initial action recognition network.

7. An action recognition device, characterized in that it is applied to an electronic device, and the device includes: a 3D information acquisition module, used to acquire a 3D skeleton sequence of a target object in a video to be identified, wherein the target object is the object targeted by action recognition, the 3D skeleton sequence includes at least one 3D skeleton, and when the 3D skeleton sequence includes multiple 3D skeletons, the multiple 3D skeletons are arranged in chronological order according to the time corresponding to each 3D skeleton in the video to be identified; a conversion module, used to input the 3D skeleton sequence into a viewpoint adaptive network to determine the 2D projection parameters corresponding to each 3D skeleton in the 3D skeleton sequence, and to perform, through a perspective projection network, 2D projection on each 3D skeleton in the 3D skeleton sequence based on the 3D skeleton sequence and the 2D projection parameters corresponding to each 3D skeleton, to obtain a 2D skeleton sequence corresponding to an angle that can be recognized by an action recognition network, wherein the 2D skeleton sequence includes at least one 2D skeleton; and a recognition module, used to perform action recognition based on the 2D skeleton sequence through the action recognition network to obtain an action recognition result.

8. The device according to claim 7, characterized in that, The viewpoint adaptive network and the action recognition network are trained from the sample 3D skeleton sequence and the action recognition result of the first sample corresponding to the sample 3D skeleton sequence.

9. An electronic device, characterized in that, It includes a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, the processor executing the machine-executable instructions to implement the action recognition method according to any one of claims 1-6.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the action recognition method as described in any one of claims 1-6.