Method for recognizing human action and apparatus for performing same

The diffusion transformer-based learning method aligns skeletal and textual data to improve motion recognition models' generalization, allowing accurate recognition of unseen actions.

WO2026127510A1 · PCT designated stage · Publication Date: 2026-06-18 · KOREA ADVANCED INST OF SCI & TECH

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
KOREA ADVANCED INST OF SCI & TECH
Filing Date
2025-12-04
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing motion recognition models face performance degradation due to a modality gap between skeletal data and textual data, which limits their ability to generalize for unseen actions.

Method used

A diffusion process-based learning method, built on a diffusion transformer, that aligns and fuses skeletal data with text features.

🎯Benefits of technology

The method enhances the generalization performance of motion recognition models by reducing the modality gap between skeletal and textual data, enabling accurate zero-shot recognition of unseen actions.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure KR2025020742_18062026_PF_FP_ABST
Patent Text Reader

Abstract

Disclosed are a method for recognizing a human action, and an apparatus for performing same. According to one embodiment, the method may comprise the operations of: acquiring training data including skeleton data representing a human action, a ground truth text label of the skeleton data, and a wrong text label of the skeleton data; acquiring a noisy skeleton feature on the basis of adding original noise to a skeleton feature of the skeleton data; acquiring first predicted noise on the basis of the noisy skeleton feature and a ground truth feature of the ground truth text label; acquiring second predicted noise on the basis of the noisy skeleton feature and a wrong feature of the wrong text label; and training an action recognition model on the basis of the distance between the original noise and the first predicted noise and the distance between the original noise and the second predicted noise.

Description

Method for recognizing human motion and device for performing the same

[0001] The present disclosure relates to a method for recognizing human motion and an apparatus for performing the same.

[0002] Motion recognition technology can be a technology that analyzes human movements. Since a trained neural network can only recognize specific movements included in the training data, a very large amount of training data may be required to improve the performance of the neural network.

[0003] Zero-shot learning can be a technique for training a neural network to recognize actions it has not observed during the training process.

[0004] The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing may be applied as prior art related to the present disclosure.

[0005] The present invention was developed with support from the Ministry of Science and ICT (Project No.: RS-2022-00144444, Project Name: Information and Communication / Broadcasting Technology Development Project, Research Project Name: Research on Learning and Rendering Spatial Image Representation of Static and Dynamic Scenes Based on Deep Learning, Lead Institution: Korea Advanced Institute of Science and Technology, Research Management Agency: Korea Institute of Information and Communication Technology Planning and Evaluation).

[0006] According to one embodiment, a method for improving the performance degradation of a motion recognition model caused by a modality gap between skeletal data and text features describing the skeletal data may be provided.

[0007] The technical problems to be solved in this disclosure are not limited to the above technical problems.

[0008] A method for training a motion recognition model performed by an electronic device may include the operation of acquiring training data comprising skeletal data representing human motion, correct text labels of said skeletal data, and incorrect text labels of said skeletal data. The method may include the operation of acquiring noisy skeleton features based on adding original noise to the skeletal features of said skeletal data. The method may include the operation of acquiring a first prediction noise based on the noisy skeleton features and the correct features of said correct text labels. The method may include the operation of acquiring a second prediction noise based on the noisy skeleton features and the incorrect features of said incorrect text labels. The method may include the operation of training the motion recognition model based on the distance between said original noise and the first prediction noise and the distance between said original noise and the second prediction noise.

[0009] The above motion recognition model may include a diffusion transformer.

[0010] The operation of training the motion recognition model may include training the motion recognition model using a first prediction error calculated based on the distance between the original noise and the first prediction noise, and a second prediction error calculated based on the distance between the original noise and the second prediction noise.

[0011] The operation of training the motion recognition model using the first prediction error and the second prediction error may include training the motion recognition model based on a first loss function configured such that the second prediction error exceeds the first prediction error by at least a predetermined value.

[0012] The operation of training the motion recognition model based on the first loss function may include training the motion recognition model using a third loss function defined based on the first loss function and a second loss function configured to minimize the distance between the original noise and the first predicted noise.

[0013] The above method may further include the operation of extracting the skeletal features from the skeletal data using a first encoder.

[0014] The above method may further include an operation of extracting the correct answer feature from the correct answer text prompt corresponding to the correct answer text label using the second encoder. The above method may further include an operation of extracting the incorrect answer feature from the incorrect answer text prompt corresponding to the incorrect answer text label using the second encoder.

[0015] Each of the above correct answer features and above incorrect answer features may include global features and local features.

[0016] The operation of acquiring the first prediction noise may include an operation of modulating a skeletal feature representation corresponding to the noise-laden skeletal feature using at least one of the first parameters acquired from the global feature of the correct answer feature. The operation of acquiring the first prediction noise may include an operation of modulating a first local feature representation corresponding to the local feature of the correct answer feature using at least one of the parameters. The operation of acquiring the first prediction noise may include an operation of acquiring the first prediction noise based on the modulated skeletal feature representation and the modulated local feature representation.

[0017] An electronic device for training a motion recognition model may include at least one processor and a memory for storing instructions.

[0018] When the above instructions are executed individually or collectively by the at least one processor, the device may be able to perform a plurality of operations. The plurality of operations may include an operation to acquire training data including skeletal data representing human movements, correct text labels of the skeletal data, and incorrect text labels of the skeletal data. The plurality of operations may include an operation to acquire noisy skeleton features based on adding original noise to the skeletal features of the skeletal data. The plurality of operations may include an operation to acquire a first prediction noise based on the noisy skeleton features and the correct features of the correct text labels. The plurality of operations may include an operation to acquire a second prediction noise based on the noisy skeleton features and the incorrect features of the incorrect text labels. The plurality of operations may include an operation to train the motion recognition model based on the distance between the original noise and the first prediction noise and the distance between the original noise and the second prediction noise.

[0019] The above motion recognition model may include a diffusion transformer.

[0020] The operation of training the motion recognition model may include training the motion recognition model using a first prediction error calculated based on the distance between the original noise and the first prediction noise, and a second prediction error calculated based on the distance between the original noise and the second prediction noise.

[0021] The operation of training the motion recognition model using the first prediction error and the second prediction error may include training the motion recognition model based on a first loss function configured such that the second prediction error exceeds the first prediction error by at least a predetermined value.

[0022] The operation of training the motion recognition model based on the first loss function may include training the motion recognition model using a third loss function defined based on the first loss function and a second loss function configured to minimize the distance between the original noise and the first predicted noise.

[0023] The above plurality of operations may further include an operation of extracting the skeletal features from the skeletal data using the first encoder.

[0024] The plurality of operations may further include an operation of extracting the correct answer feature from the correct answer text prompt corresponding to the correct answer text label using the second encoder. The plurality of operations may further include an operation of extracting the incorrect answer feature from the incorrect answer text prompt corresponding to the incorrect answer text label using the second encoder.

[0025] Each of the above correct answer features and above incorrect answer features may include global features and local features.

[0026] The operation of acquiring the first prediction noise may include an operation of modulating a skeletal feature representation corresponding to the noise-laden skeletal feature using at least one of the first parameters acquired from the global feature of the correct answer feature. The operation of acquiring the first prediction noise may include an operation of modulating a first local feature representation corresponding to the local feature of the correct answer feature using at least one of the parameters. The operation of acquiring the first prediction noise may include an operation of acquiring the first prediction noise based on the modulated skeletal feature representation and the modulated local feature representation.

[0027] A device for recognizing human motion may include at least one processor and a memory for storing instructions. When the instructions are executed individually or collectively by the at least one processor, the device may be able to perform a plurality of operations. The plurality of operations may include an operation of acquiring skeletal data. The plurality of operations may include an operation of estimating a human action corresponding to the skeletal data using a motion recognition model trained by the method.

[0028] According to one embodiment, a computer-readable recording medium storing one or more computer programs may include instructions for causing at least one processor to perform the method.

[0029] FIG. 1 is a diagram illustrating a direct alignment-based learning method for motion recognition.

[0030] FIG. 2 is a diagram illustrating a diffusion process-based learning method for motion recognition according to various embodiments.

[0031] FIG. 3 is a diagram illustrating a learning framework of a learning method according to various embodiments.

[0032] FIG. 4 is a drawing illustrating a block for fusing different types of features inside a diffusion transformer according to various embodiments.

[0033] FIG. 5 is a diagram illustrating an inference framework of a motion recognition model learned by a learning method according to various embodiments.

[0034] FIG. 6 is an exemplary flowchart for explaining a method for inferring the behavior of skeletal data according to various embodiments.

[0035] FIG. 7 is an exemplary block diagram of a learning device according to various embodiments.

[0036] FIG. 8 is an exemplary block diagram of a motion recognition device according to various embodiments.

[0037] Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or substitutions included in the technical concept described by the embodiments.

[0038] Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted solely for the purpose of distinguishing one component from another. For example, the first component may be named the second component, and similarly, the second component may be named the first component.

[0039] When it is stated that a component is "connected" to another component, it should be understood that it may be directly connected to or coupled with that other component, or that there may be other components in between.

[0040] Singular expressions include plural expressions unless the context clearly indicates otherwise. In this document, phrases such as “A or B,” “at least one of A and B,” “at least one of A or B,” “A, B or C,” “at least one of A, B and C,” and “at least one of A, B, or C” may each include any one of the items listed together with the corresponding phrase, or all possible combinations thereof. In this specification, terms such as “comprising” or “having” are intended to designate the existence of the described feature, number, step, action, component, part, or combination thereof, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

[0041] Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by those skilled in the art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology, and should not be interpreted in an ideal or overly formal sense unless explicitly defined in this specification.

[0042] As used herein, the term "module" may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be a component formed integrally, or a minimum unit of said component or a part thereof that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

[0043] As used in this document, the term "part" refers to a software or hardware component, such as an FPGA or ASIC, that performs certain roles. However, "part" is not limited to software or hardware. "Part" may be configured to reside in an addressable storage medium or configured to operate one or more processors. For example, "part" may include components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "parts" may be combined into a smaller number of components and "parts" or further separated into additional components and "parts." Furthermore, components and "parts" may be implemented to operate one or more CPUs within a device or secure multimedia card. Additionally, '~part' may include one or more processors.

[0044] Hereinafter, various embodiments will be described in detail with reference to the accompanying drawings. In describing the present disclosure with reference to the accompanying drawings, identical components are given the same reference numerals regardless of the drawing symbols, and redundant descriptions thereof may not be described repeatedly.

[0045]

[0046] FIG. 1 is a diagram illustrating a direct alignment-based learning method for motion recognition.

[0047] Referring to FIG. 1, a direct alignment-based learning method can be used to directly connect skeleton data and text data for zero-shot motion recognition.

[0048] Skeletal data and text used for training can each be processed through separate paths.

[0049] Skeleton data can be input into a skeleton encoder (10). From the skeleton data, the skeleton encoder (10) can output skeleton features z_s containing spatiotemporal movement information of the skeleton (e.g., movement information of a throwing motion). The skeleton features z_s can exist in a latent space called 'Skeleton Latent'.

[0050] Text data can be input into a text encoder (15). From an action label (e.g., the text 'throw') corresponding to the skeletal data, the text encoder (15) can output text features z_p containing semantic information of the action label. The text features z_p can exist in a latent space called 'Text Latent'.

[0051] In direct alignment-based learning methods, a direct connection between the skeletal features z_s and the text features z_p can be attempted through direct alignment. If the skeletal features z_s and the text features z_p are directly connected, the performance of the motion recognition model may degrade. For example, a motion recognition model trained by a direct alignment-based learning method may have limited generalization performance for actions not seen during the training process. This problem may be caused by the difference (or modality gap) between the modality of skeletal data, which is visual information, and the modality of text data, which is linguistic information.

[0052] The present disclosure can provide a learning method that improves the performance of a motion recognition model, based on the fusion of the skeletal features z_s and the text features z_p within a single integrated latent space.

[0053]

[0054] FIG. 2 is a diagram illustrating a diffusion process-based learning method for motion recognition according to various embodiments.

[0055] Referring to FIG. 2, according to various embodiments, a diffusion process-based learning method can fuse (or merge), through a diffusion process, the skeletal features z_s generated by a skeletal encoder (20) with the text features z_p generated by the text encoder (25) or with the text prompt corresponding to the text features z_p. A diffusion process-based learning method may be a learning technique that adds noise to the data (e.g., concatenation) in a forward process to make the data disordered, and restores the original data from the disordered data in a reverse process.

[0056] To fuse the skeletal features z_s with the text features z_p or with the text prompt corresponding to the text features z_p, the diffusion process-based learning method can use a reverse diffusion training scheme to obtain merged discriminative features. The diffusion process-based learning method learns to remove noise from noisy skeletal features using text prompts as conditions, thereby embedding text prompts into a single integrated latent space (skeleton-text latent space). Through this process, the diffusion process-based learning method can reduce the modality gap between skeletal data and text data and form an implicit alignment between skeletal data and text data.

[0057] To train a motion recognition model, a training dataset such as Equation 1 can be used during the training phase.

[0058] [Mathematical Formula 1]

[0059]

[0060] In Equation 1, each element of the training dataset pairs a skeleton sequence with the correct text label corresponding to that skeleton sequence. The skeleton sequence can be composed of the sequence length (e.g., number of frames), the number of joints, the number of people, and the dimension of the coordinates representing the position of each joint. The correct text label can belong to the set of seen class labels.

[0061] In the inference step, a test data set such as Equation 2 can be used.

[0062] [Mathematical Formula 2]

[0063]

[0064] In Equation 2, each element of the test dataset pairs a skeleton sequence of an unseen class (or unseen skeleton data) with the ground truth text label corresponding to that skeleton sequence. The set of labels for the seen classes and the set of labels for the unseen classes can be disjoint, as in Equation 3.

[0065] [Mathematical Formula 3]

[0066]
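As a hedged illustration only, the data split described around Equations 1 to 3 could be written as below. Every symbol here (the dataset names, $X_i$, $y_i$, the sample counts $N$ and $N_u$, and the label sets $\mathcal{Y}_s$ and $\mathcal{Y}_u$) is an assumption chosen for readability and may differ from the notation of the original equations.

```latex
\begin{align}
\mathcal{D}_{\text{train}} &= \{(X_i, y_i)\}_{i=1}^{N}, \quad y_i \in \mathcal{Y}_s \\
\mathcal{D}_{\text{test}}  &= \{(X_j^{u}, y_j^{u})\}_{j=1}^{N_u}, \quad y_j^{u} \in \mathcal{Y}_u \\
\mathcal{Y}_s \cap \mathcal{Y}_u &= \emptyset
\end{align}
```

The third line states the zero-shot condition: no label evaluated at inference appears among the labels used for training.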

[0067] The motion recognition model is trained on the training data and can thereby generalize to unseen classes. By learning a robust discriminative fusion between skeletal features and text prompts (or text descriptions), the motion recognition model can predict accurate labels for unseen skeleton sequences. The motion recognition model can be referred to as TDSM (triplet diffusion for skeleton-text matching).

[0068]

[0069] FIG. 3 is a diagram illustrating a learning framework of a learning method according to various embodiments.

[0070] Referring to FIG. 3, according to one embodiment, the learning framework may include an input encoding process, a forward process, and a reverse process. In the learning phase, the encoder (e.g., text encoder (35) and / or skeleton encoder (30)) may be frozen.

[0071] During the input encoding process, skeletal data and text prompts can each be embedded into a feature space.

[0072] During the input encoding process, the skeleton encoder (30) can generate skeleton features z_x by embedding the skeleton input (X) into a skeleton feature space (or skeleton latent space). The architecture of the skeleton encoder (30) may be based on a graph convolutional network (GCN), a spatial temporal graph convolutional network (ST-GCN), or a shift-GCN, but the scope of the present disclosure is not limited thereto. The skeleton encoder (30) may be pre-trained based on a cross-entropy loss such as Equation 4.

[0073] [Mathematical Formula 4]

[0074]

[0075] The terms of Equation 4 are the predicted class label for the skeleton input (X), the number of seen classes, and a one-hot vector of the ground truth text label. Once training is complete, the parameters of the skeleton encoder (30) are frozen, and the skeleton encoder (30) can be used to generate the skeleton features (or skeleton latent space representation) z_x. For the attention layers, the skeleton features z_x can be reshaped into a form defined by the number of skeleton tokens and the feature dimension.
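The pre-training objective described above can be sketched as a standard cross-entropy between softmax class probabilities and a one-hot ground truth vector. This is a minimal sketch, not the source's implementation: the function names, the three-class example, and the use of a softmax over raw scores are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label_index):
    """Cross-entropy between softmax(logits) and a one-hot ground truth vector."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[label_index] = 1.0
    return float(-(one_hot * np.log(probs)).sum())

# Hypothetical classifier scores for three seen classes.
logits = np.array([2.0, 0.5, -1.0])
loss_correct = cross_entropy(logits, 0)  # ground truth is the top-scoring class
loss_wrong = cross_entropy(logits, 2)    # ground truth is the lowest-scoring class
```

The loss is smaller when the ground truth class receives the highest score, which is the signal used to pre-train the encoder before its parameters are frozen.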

[0076] The text encoder (35) can capture semantic information about action labels by utilizing a text prompt. A correct text prompt d_p may be assigned to the correct text label y_p, and an incorrect text prompt d_n may be assigned to the incorrect text label y_n. The correct text label may be text (e.g., a word) that correctly describes the action of the given skeletal data, and the incorrect text label may be text (e.g., a word) that incorrectly describes the action of the given skeletal data. For example, if the skeletal data represents a 'throwing action', the correct text label may be 'throwing' and the incorrect text label may be 'picking up'. The text prompt may be text (e.g., a sentence) that describes the text label. The text encoder (35) may be a pre-trained model for encoding the prompt. For example, the text encoder (35) may include contrastive language-image pre-training (CLIP).
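One way to assemble the skeleton, correct prompt, and incorrect prompt for a training sample is sketched below. The source does not state how the incorrect label is chosen or how prompts are worded, so the random choice among the other seen labels, the prompt template, and the names `make_prompt` and `sample_triplet` are all assumptions.

```python
import random

def make_prompt(label):
    # Hypothetical prompt template; the source does not specify the wording.
    return f"a person performing the action: {label}"

def sample_triplet(skeleton, correct_label, seen_labels, rng):
    """Return (skeleton, correct prompt, incorrect prompt) for one training sample."""
    # Assumption: the incorrect label is drawn from the other seen labels.
    candidates = [label for label in seen_labels if label != correct_label]
    incorrect_label = rng.choice(candidates)
    return skeleton, make_prompt(correct_label), make_prompt(incorrect_label)

seen_labels = ["throwing", "picking up", "waving"]
rng = random.Random(0)
skeleton = "skeleton-sequence-placeholder"  # stands in for real skeletal data
_, d_p, d_n = sample_triplet(skeleton, "throwing", seen_labels, rng)
```

Here `d_p` describes the action actually shown by the skeleton and `d_n` describes a different action, matching the roles of the correct and incorrect text prompts.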

[0077] As in Equation 5, the text encoder (35) can generate global text features z_g and local text features z_l from each of the prompt d_p and the prompt d_n.

[0078] [Mathematical Formula 5]

[0079]

[0080] Equation 5 uses token-wise concatenation and defines the global text features and the local text features, where the local text features are composed of a number of text tokens.

[0081] The text encoder (35) can output the global correct text feature z_{g,p} and the local correct text feature z_{l,p} corresponding to the correct text label y_p, and the global incorrect text feature z_{g,n} and the local incorrect text feature z_{l,n} corresponding to the incorrect text label y_n. These features can guide the diffusion process by conditioning the removal of noise from the noisy skeletal features.

[0082] The learning framework may be based on a diffusion process that includes a forward process and a reverse process. Rather than generating data, the learning framework may utilize a conditional denoising diffusion process to learn a discriminative skeletal latent space by fusing skeletal features and text prompts through a reverse diffusion process.

[0083] In the forward process, the noisy skeleton feature z_{x,t} can be generated by adding random noise (e.g., random Gaussian noise) to the skeletal features z_x at a random timestep t within a total of T steps. The noisy skeleton feature z_{x,t} can be expressed as in Equation 6.

[0084] [Mathematical Formula 6]

[0085]

[0086] Equation 6 involves the Gaussian noise and a coefficient that can control the noise level at step t.
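The forward process can be illustrated with the noising rule commonly used in denoising diffusion models, where a cumulative coefficient sets the noise level at step t. This is a sketch under assumptions: the linear schedule, the square-root weighting, the tensor sizes, and the names `make_alpha_bar` and `add_noise` are not taken from the source, which only states that Gaussian noise is added with a step-dependent noise level.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (an assumption); alpha_bar[t] is the cumulative product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z_x, t, alpha_bar, rng):
    """Return the noisy feature z_{x,t} and the original noise that was added."""
    eps = rng.standard_normal(z_x.shape)
    z_xt = np.sqrt(alpha_bar[t]) * z_x + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=1000)
z_x = rng.standard_normal((8, 16))        # 8 skeleton tokens, 16 feature dims (hypothetical sizes)
t = int(rng.integers(0, len(alpha_bar)))  # random timestep index
z_xt, eps = add_noise(z_x, t, alpha_bar, rng)
```

The returned `eps` plays the role of the original noise that the diffusion transformer is later asked to predict.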

[0087] In the reverse process, the diffusion transformer (37) can predict the original noise based on the noisy skeletal features z_{x,t}, the text features corresponding to the correct text label (e.g., the global correct text feature z_{g,p} and the local correct text feature z_{l,p}), and the text features corresponding to the incorrect text label (e.g., the global incorrect text feature z_{g,n} and the local incorrect text feature z_{l,n}). The original noise can mean the noise added to the skeletal features z_x in the forward process. The noise predicted by the diffusion transformer (37) at time step t can be expressed as Equation 7.

[0088] [Mathematical Formula 7]

[0089]

[0090] Equation 7 expresses the predicted noise as the output of the operation of the diffusion transformer (37). The diffusion transformer (37) can generate a predicted noise corresponding to the positive features, namely the global correct text feature z_{g,p} and the local correct text feature z_{l,p}, and a predicted noise corresponding to the negative features, namely the global incorrect text feature z_{g,n} and the local incorrect text feature z_{l,n}. The diffusion transformer (37) that generates the predicted noise for the positive features and the diffusion transformer (37) that generates the predicted noise for the negative features can share weights. For example, the two can be the same diffusion transformer (37).

[0091] The diffusion transformer (37) can take the noisy skeletal features z_{x,t}, the global text features z_g, the local text features z_l, and the timestep t as input. The noisy skeletal features z_{x,t}, the global text features z_g, and the local text features z_l can each be embedded as a feature representation as in Equation 8.

[0092] [Mathematical Formula 8]

[0093]

[0094] Equation 8 uses positional embeddings, which are applied to the feature maps and can capture spatial location, and a timestep embedding, which maps the scalar timestep t to a higher-dimensional space.

[0095] The diffusion transformer (37) can output the predicted noise by sequentially processing the embedded features through B CrossDiT blocks (37-1) (B is a natural number), processing the output of the CrossDiT blocks (37-1) through layer normalization (37-2), and processing the output of the layer normalization (37-2) through a linear layer (37-3). The CrossDiT block (37-1) is described in more detail with reference to FIG. 4.

[0096] In the learning phase, the total loss function used for learning the diffusion transformer (37) (e.g., updating the parameters of the diffusion transformer) can be defined based on a diffusion loss and a triplet diffusion loss. For example, the total loss function can be defined as in Equation 9.

[0097] [Mathematical Formula 9]

[0098]

[0099] The terms of Equation 9 are the total loss function, the diffusion loss, the triplet diffusion loss, and a weighting factor that can adjust the contribution ratio between the diffusion loss and the triplet diffusion loss.

[0100] Diffusion loss can focus only on correct skeleton data-text data (or text prompt) pairs (e.g., a pair of skeleton data representing a 'throw action' and the text 'throw'). Diffusion loss can contribute to making the noise predicted by the model, when a text prompt corresponding to the skeleton data is provided, as close as possible to the original noise added to the skeletal features z_x in the forward process. For example, diffusion loss can contribute to minimizing the difference between the predicted noise and the original noise. Diffusion loss can be defined as in Equation 10.

[0101] [Mathematical Formula 10]

[0102]

[0103] The terms of Equation 10 are the original noise and the predicted noise for the correct text label.

[0104] Triplet diffusion loss can focus on both correct skeleton data-text data pairs and incorrect skeleton data-text data pairs (e.g., a pair of skeleton data representing a 'throwing action' and the text 'pick up'). Triplet diffusion loss can contribute to enhancing the ability of action recognition models to distinguish between correct text labels and incorrect text labels. For example, triplet diffusion loss can enable discriminative fusion of two modalities (e.g., visual information from skeleton data and linguistic information from text data) within an integrated latent space (e.g., learned skeleton-text latent space) by contributing to a smaller prediction error for correct skeleton data-text data pairs and a larger prediction error for incorrect skeleton data-text data pairs. Triplet diffusion loss can be defined as shown in Equation 11.

[0105] [Mathematical Formula 11]

[0106]

[0107] The terms of Equation 11 include the predicted noise for the incorrect text label and a margin parameter. For example, the margin parameter can be a positive real number.
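The three losses described above can be sketched as follows. This is one plausible reading rather than the source's definition: the Euclidean distance, the hinge form with a margin, the placement of the weighting factor on the triplet term, and all function and variable names are assumptions.

```python
import numpy as np

def prediction_error(eps, eps_hat):
    # Euclidean distance between the original noise and a predicted noise.
    return float(np.linalg.norm(eps - eps_hat))

def diffusion_loss(eps, eps_hat_p):
    # Pulls the prediction made under the correct text label toward the original noise.
    return prediction_error(eps, eps_hat_p)

def triplet_diffusion_loss(eps, eps_hat_p, eps_hat_n, margin):
    # Zero once the incorrect-label error exceeds the correct-label error by at least `margin`.
    gap = prediction_error(eps, eps_hat_p) - prediction_error(eps, eps_hat_n)
    return max(gap + margin, 0.0)

def total_loss(eps, eps_hat_p, eps_hat_n, margin=1.0, weight=1.0):
    # Assumption: the weighting factor scales the triplet term.
    return diffusion_loss(eps, eps_hat_p) + weight * triplet_diffusion_loss(
        eps, eps_hat_p, eps_hat_n, margin
    )

eps = np.array([1.0, 0.0, 0.0])
eps_hat_p = np.array([1.0, 0.0, 0.0])  # perfect prediction under the correct label
eps_hat_n = np.array([1.0, 3.0, 0.0])  # prediction under the incorrect label, distance 3
loss_separated = total_loss(eps, eps_hat_p, eps_hat_n, margin=1.0)
loss_confused = total_loss(eps, eps_hat_n, eps_hat_p, margin=1.0)
```

With the roles of the two predictions swapped, the loss rises sharply, which is the pressure that pushes the model to denoise well only when the conditioning text matches the skeleton.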

[0108] Among the interrelated movements of a person, a common skeleton movement may be shared, and the learning framework can associate unobserved skeleton data (or unobserved skeleton sequences) with corresponding text prompts by utilizing the common skeleton movement within the seen skeleton data used to train the motion recognition model during the learning phase.

[0109]

[0110] FIG. 4 is a drawing illustrating a block for fusing different types of features inside a diffusion transformer according to various embodiments. The structure of the CrossDiT block (37-1) shown in FIG. 3 is exemplary and the scope of the present disclosure is not limited thereto.

[0111] Referring to FIG. 4, according to one embodiment, the CrossDiT block (37-1) can facilitate the interaction between skeletal features and text features to improve fusion performance through feature modulation and multi-head self-attention.

[0112] The CrossDiT block (37-1) can be based on a diffusion transformer (DiT) architecture. The CrossDiT block (37-1) can efficiently capture dependencies between different modalities by using modulation techniques and a self-attention mechanism.

[0113] The embedded skeletal features and the embedded local text features can each be modulated, as in Equation 12, through scale-shift operations and scale operations based on modulation parameters calculated from the embedded global text feature.

[0114] [Mathematical Formula 12]

[0115]

[0116] In Equation 12, the modulated input can represent the embedded skeletal features or the embedded local text features. The modulation parameters can be generated based on the embedded global text feature (or the global text features of FIG. 3) and the time step t.
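The modulation described above can be illustrated with a DiT-style scale-shift and gating pattern. This is a sketch under stated assumptions rather than the disclosed Equation 12: the conditioning vector is assumed to be the sum of the global text feature and a time-step embedding, a single linear map (hypothetical `weight`, `bias`) is assumed to produce the shift, scale, and gate parameters, and all names and shapes are illustrative.

```python
import numpy as np

def modulation_params(global_text, t_emb, weight, bias):
    """Map the conditioning vector to (shift, scale, gate), each of width d.

    Assumption: the condition is global text feature + time-step embedding.
    """
    cond = global_text + t_emb                  # (batch, d)
    out = cond @ weight + bias                  # (batch, 3 * d)
    shift, scale, gate = np.split(out, 3, axis=-1)
    return shift, scale, gate

def modulate(x, shift, scale):
    """Scale-shift operation, broadcast over the token axis."""
    return x * (1.0 + scale[:, None, :]) + shift[:, None, :]

def gated_residual(x, branch_out, gate):
    """Scale operation applied to a branch output before the residual add."""
    return x + gate[:, None, :] * branch_out

rng = np.random.default_rng(0)
batch, tokens, d = 2, 5, 8
x = rng.standard_normal((batch, tokens, d))     # skeletal or local text tokens
g = rng.standard_normal((batch, d))             # embedded global text feature
t_emb = rng.standard_normal((batch, d))         # time-step embedding

weight = 0.1 * rng.standard_normal((d, 3 * d))
bias = np.zeros(3 * d)
shift, scale, gate = modulation_params(g, t_emb, weight, bias)
x_mod = modulate(x, shift, scale)

# With zero weights the modulation reduces to the identity.
shift0, scale0, gate0 = modulation_params(g, t_emb, np.zeros((d, 3 * d)), bias)
x_identity = modulate(x, shift0, scale0)
```

The same `modulate` call would be applied separately to the skeletal tokens and to the local text tokens, each with its own parameter set.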

[0117] The CrossDiT block (37-1) can calculate query, key, and value matrices for each of the embedded skeletal features and the embedded local text features, as in Equation 13.

[0118] [Mathematical Formula 13]

[0119]

[0120] The query, key, and value matrices can be concatenated token-wise and input into a multi-head self-attention module, after which the result can be partitioned to maintain token-specific information, as shown in Equation 14.

[0121] [Mathematical Formula 14]

[0122]
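The steps described for Equations 13 and 14 (per-modality query, key, and value projections, token-wise concatenation, multi-head self-attention, and partitioning) can be sketched as follows. The projection names in `w`, the dimensions, and the absence of an output projection are assumptions made for illustration; this is not the disclosed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(skel, text, w, num_heads):
    """skel: (n_s, d), text: (n_t, d); w: dict of six (d, d) projection matrices."""
    n_s, d = skel.shape
    n = n_s + text.shape[0]
    head_dim = d // num_heads

    # Separate Q/K/V projections per modality, then token-wise concatenation.
    q = np.concatenate([skel @ w["q_s"], text @ w["q_t"]], axis=0)
    k = np.concatenate([skel @ w["k_s"], text @ w["k_t"]], axis=0)
    v = np.concatenate([skel @ w["v_s"], text @ w["v_t"]], axis=0)

    def split_heads(m):                         # (n, d) -> (heads, n, head_dim)
        return m.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, n, n)
    out = (attn @ vh).transpose(1, 0, 2).reshape(n, d)              # merge heads

    # Partition back into skeleton tokens and text tokens.
    return out[:n_s], out[n_s:]

rng = np.random.default_rng(0)
d, heads = 16, 4
w = {name: rng.standard_normal((d, d)) / np.sqrt(d)
     for name in ("q_s", "k_s", "v_s", "q_t", "k_t", "v_t")}
skel = rng.standard_normal((6, d))              # 6 skeleton tokens
text = rng.standard_normal((4, d))              # 4 local text tokens

skel_out, text_out = joint_attention(skel, text, w, heads)
# Changing only the text tokens changes the skeleton output (cross-modal mixing).
skel_out_other, _ = joint_attention(skel, rng.standard_normal((4, d)), w, heads)
```

Because every token attends over the concatenated sequence, the skeleton outputs depend on the text tokens and vice versa, while the final slicing keeps the two token sets separate for the next layer.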

[0123] By utilizing attention derived from skeleton features, time steps, and text features, the CrossDiT block (37-1) can achieve skeleton-text fusion that makes the interaction between modalities more efficient, supports discriminative feature learning, and improves generalization performance for unseen actions.

[0124]

[0125] FIG. 5 is a diagram illustrating an inference framework of a motion recognition model learned by a learning method according to various embodiments.

[0126] Referring to FIG. 5, according to one embodiment, in the inference step, an unobserved skeletal sequence and candidate text prompts corresponding to the unobserved skeletal sequence can be input into a motion recognition model (e.g., a motion recognition model trained based on the learning framework of FIG. 3). The motion recognition model can compare the noise generated for each of the candidate text prompts with fixed ground truth noise.

[0127] The unobserved skeletal sequence can be encoded into a skeletal latent space through a skeletal encoder (50), as in Equation 15. The skeletal encoder (50) can extract skeletal features from the unobserved skeletal sequence.

[0128] [Mathematical Formula 15]

[0129]

[0130] Each candidate action label can be associated with a candidate text prompt, and the text encoder (55) can extract global text features and local text features from each candidate text prompt, as in Equation 16.

[0131] [Mathematical Formula 16]

[0132]

[0133] The motion recognition model can generate noisy skeletal features, as in Equation 17, using fixed Gaussian noise and a fixed time step in the forward process.

[0134] [Mathematical Formula 17]

[0135]
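The noising step described for Equation 17 can be illustrated with the widely used variance-preserving (DDPM-style) forward process. The source does not specify the noise schedule in this passage, so the linear beta schedule, the 1000 training steps, the time step `t_fixed = 500`, and the feature shape below are all assumptions.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z, eps, t, alpha_bar):
    """Forward process: z_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z + np.sqrt(1.0 - a) * eps

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(42)
z = rng.standard_normal((16, 8))          # skeletal feature (hypothetical shape)
eps_fixed = rng.standard_normal(z.shape)  # fixed Gaussian noise, sampled once
t_fixed = 500                             # fixed time step (assumed value)

z_t = add_noise(z, eps_fixed, t_fixed, alpha_bar)

# Sanity check: the noise is recoverable from z_t when the clean feature is known.
a = alpha_bar[t_fixed]
eps_recovered = (z_t - np.sqrt(a) * z) / np.sqrt(1.0 - a)
```

Keeping `eps_fixed` and `t_fixed` constant across all candidate prompts means that differences in prediction error come only from the text condition.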

[0136] The diffusion transformer (57) of the motion recognition model can predict noise as in Equation 18.

[0137] [Mathematical Formula 18]

[0138]

[0139] The prediction error for each candidate action label can be calculated using the L2 norm between the fixed noise and the noise predicted for that label, and the predicted label can be determined as the candidate that minimizes the prediction error, as shown in Equation 19. The predicted label can be the prediction result generated by the motion recognition model for the unobserved skeletal sequence (e.g., text representing the action of the unobserved skeletal sequence).

[0140] [Mathematical Formula 19]

[0141]
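As a hedged reading of the noise prediction and selection steps described for Equations 18 and 19, the decision rule can be written as below. The symbols are assumptions introduced for illustration: $\epsilon$ is the fixed noise, $z_t$ the noisy skeletal feature at the fixed time step $t$, $c_k$ the global and local text features of the $k$-th candidate prompt, $K$ the number of candidates, and $\epsilon_\theta$ the diffusion transformer.

```latex
\hat{\epsilon}_k = \epsilon_\theta\!\left(z_t,\, t,\, c_k\right),
\qquad
\hat{y} = \operatorname*{arg\,min}_{k \in \{1,\dots,K\}}
          \left\lVert \epsilon - \hat{\epsilon}_k \right\rVert_2
```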

[0142] The motion recognition model can select a motion label corresponding to a text prompt that best aligns with a given skeleton sequence. The motion recognition model can perform accurate zero-shot motion recognition even for unobserved skeleton sequences and / or unobserved motion labels.

[0143]

[0144] FIG. 6 is an exemplary flowchart for explaining a method for inferring the behavior of skeletal data according to various embodiments.

[0145] Referring to FIG. 6, according to one embodiment, a method for inferring the behavior of skeletal data of a behavior class not seen during the training phase using a pre-trained diffusion model may be provided.

[0146] In operation 610, an electronic device (e.g., a skeleton recognition device) can acquire input skeleton data of a motion that was not seen during the training phase. The input skeleton data may belong to a new motion class that was not used during the training phase.

[0147] In operation 620, the electronic device can generate skeletal features based on encoding input skeletal data into a latent space using a pre-trained skeletal encoder.

[0148] In operation 630, the electronic device may generate noisy test skeleton features by adding reference noise to the skeleton features (e.g., concatenation). The reference noise may be fixed noise sampled from a Gaussian distribution. As the reference noise is fixed, a fair comparison can be made for all candidate text prompts. The reference noise (or the amount of reference noise) may correspond to a predetermined time step. The time step may be set to a point where the discriminative power of the noise prediction can be maximized while minimizing the loss of skeleton information.

[0149] In operation 640, the electronic device may acquire multiple candidate text prompts corresponding to multiple candidate action classes (e.g., sit, stand, and / or walk) that are not used for training the learned diffusion model. For example, the electronic device may acquire multiple candidate text prompts using a generative model (e.g., a large language model). Each candidate text prompt may be text in the form of a sentence describing the corresponding candidate action class. The electronic device may use a text encoder to extract global text features and local text features for each candidate text prompt. Global text features and local text features may be input as conditions for the diffusion model.

[0150] In operation 650, the electronic device can use a learned diffusion model to generate multiple candidate noises corresponding respectively to multiple candidate text prompts from a noisy skeletal feature. Each candidate text prompt can be used as a conditional input (or hint) for generating noise from a noisy test skeletal feature.

[0151] In operation 660, the electronic device can calculate the reconstruction error of each of the multiple candidate text prompts based on the distance between the reference noise and each of the multiple candidate noises. The distance between the reference noise and each of the multiple candidate noises can be calculated using an L2 norm.

[0152] In operation 670, the electronic device can determine, as the action of the input skeleton data, the candidate action class corresponding to the candidate text prompt having the smallest reconstruction error among the plurality of candidate action classes.
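Operations 610 to 670 can be sketched end to end as follows. This is a toy illustration, not the disclosed system: the encoders and the noise predictor are hypothetical stand-ins (a real system would use the pre-trained skeletal encoder, text encoder, and diffusion model), the linear noise schedule and `t=500` are assumptions, and the toy predictor simply treats the text latent as its guess of the clean skeleton feature.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, num_steps))

def zero_shot_classify(skeleton, prompts, encode_skeleton, encode_text,
                       predict_noise, alpha_bar, t, seed=0):
    z = encode_skeleton(skeleton)                                    # operation 620
    eps_ref = np.random.default_rng(seed).standard_normal(z.shape)   # fixed reference noise
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z + np.sqrt(1.0 - a) * eps_ref                # operation 630
    errors = {}
    for label, prompt in prompts.items():                            # operations 640-660
        eps_hat = predict_noise(z_t, t, encode_text(prompt))
        errors[label] = float(np.linalg.norm(eps_ref - eps_hat))     # L2 distance
    return min(errors, key=errors.get), errors                       # operation 670

# --- Toy stand-ins (all hypothetical) ---------------------------------------
rng = np.random.default_rng(1)
prompts = {"sit": "a person sits down",
           "stand": "a person stands up",
           "walk": "a person walks forward"}
text_latent = {p: rng.standard_normal(32) for p in prompts.values()}
alpha_bar = linear_alpha_bar()

def toy_encode_skeleton(x):
    return x                                   # input is already a latent vector

def toy_encode_text(prompt):
    return text_latent[prompt]

def toy_predict_noise(z_t, t, cond):
    # Treats the text latent as the clean feature and solves for the noise.
    a = alpha_bar[t]
    return (z_t - np.sqrt(a) * cond) / np.sqrt(1.0 - a)

# A "walk"-like skeleton latent: the walk prompt latent plus a small perturbation.
skeleton = text_latent["a person walks forward"] + 0.1 * rng.standard_normal(32)
predicted, errors = zero_shot_classify(skeleton, prompts, toy_encode_skeleton,
                                       toy_encode_text, toy_predict_noise,
                                       alpha_bar, t=500)
print(predicted, errors)
```

Because the same reference noise and time step are reused for every prompt, the candidates are ranked only by how well each text condition explains the noisy feature; in this toy setup the "walk" prompt yields the smallest error.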

[0153]

[0154] FIG. 7 is an exemplary block diagram of a learning device according to various embodiments.

[0155] Referring to FIG. 7, according to one embodiment, a learning device (700) (e.g., a server) may include at least one processor (720) and memory (740).

[0156] The memory (740) may store instructions (or programs) executable by at least one processor (720). For example, the instructions may include instructions for executing the operation of at least one processor (720) and / or the operation of each configuration of at least one processor (720).

[0157] The memory (740) may include one or more computer-readable storage media. The memory (740) may include non-volatile storage devices (e.g., a solid-state drive (SSD), a magnetic hard disc, an optical disc, a floppy disc, a flash memory, an electrically programmable memory (EPROM), or an electrically erasable and programmable memory (EEPROM)).

[0158] The memory (740) may be a non-transitory medium. The term “non-transitory” may indicate that the storage medium is not implemented by a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted as meaning that the memory (740) is immobile.

[0159] At least one processor (720) can process data stored in memory (740). At least one processor (720) can execute computer-readable code (e.g., software) stored in memory (740) and instructions triggered by at least one processor (720).

[0160] At least one processor (720) may be a data processing device implemented in hardware having a circuit having a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program.

[0161] For example, a data processing device implemented in hardware may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an Application-Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA).

[0162] At least one processor (720) may include various processing circuits and / or multiple processors. For example, as used in this specification and claims, the term “processor” may include various processing circuits including at least one processor, and one or more of the at least one processor may be configured to perform the various functions described in this specification, either alone or together in a distributed manner. Where “processor,” “at least one processor,” or “one or more processors” are described in this specification as being configured to perform various functions, these terms may include, for example without limitation, cases where one processor performs some of the described functions and one or more other processors perform the remaining functions, as well as cases where a single processor performs all of the described functions. At least one processor may be a combination of multiple processors that perform the various functions described or disclosed in a distributed manner, etc. At least one processor may execute program instructions to achieve or perform the various functions.

[0163] For example, at least one processor (720) may include a main processor (e.g., a central processing unit or an application processor) and an auxiliary processor (e.g., a communication processor, a neural processing unit (NPU), and / or a graphic processing unit (GPU)).

[0164] At least one processor (720) can enable the learning device (700) to perform operations performed in the learning or inference step described in the present disclosure by individually or collectively executing code, instructions, and / or applications stored in memory (740).

[0165]

[0166] FIG. 8 is an exemplary block diagram of a motion recognition device according to various embodiments.

[0167] Referring to FIG. 8, according to one embodiment, a motion recognition device (800) (e.g., an electronic device such as a smartphone, tablet, PC, or laptop) may include at least one processor (820) and memory (840).

[0168] The memory (840) may store instructions (or programs) executable by at least one processor (820). For example, the instructions may include instructions for executing the operation of at least one processor (820) and / or the operation of each configuration of at least one processor (820).

[0169] The memory (840) may include one or more computer-readable storage media. The memory (840) may include non-volatile storage devices (e.g., a solid-state drive (SSD), a magnetic hard disc, an optical disc, a floppy disc, a flash memory, an electrically programmable memory (EPROM), or an electrically erasable and programmable memory (EEPROM)).

[0170] The memory (840) may be a non-transitory medium. The term “non-transitory” may indicate that the storage medium is not implemented by a carrier wave or a propagated signal. However, the term “non-transitory” should not be interpreted as meaning that the memory (840) is immobile.

[0171] At least one processor (820) can process data stored in memory (840). At least one processor (820) can execute computer-readable code (e.g., software) stored in memory (840) and instructions triggered by at least one processor (820).

[0172] At least one processor (820) may be a data processing device implemented in hardware having a circuit having a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program.

[0173] For example, a data processing device implemented in hardware may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an Application-Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA).

[0174] At least one processor (820) may include various processing circuits and / or multiple processors. For example, as used in this specification and claims, the term “processor” may include various processing circuits including at least one processor, and one or more of the at least one processor may be configured to perform the various functions described in this specification, either alone or together in a distributed manner. Where “processor,” “at least one processor,” or “one or more processors” are described in this specification as being configured to perform various functions, these terms may include, for example without limitation, cases where one processor performs some of the described functions and one or more other processors perform the remaining functions, as well as cases where a single processor performs all of the described functions. At least one processor may be a combination of multiple processors that perform the various functions described or disclosed in a distributed manner, etc. At least one processor may execute program instructions to achieve or perform the various functions.

[0175] For example, at least one processor (820) may include a main processor (e.g., a central processing unit or an application processor) and an auxiliary processor (e.g., a communication processor, a neural processing unit (NPU), and / or a graphic processing unit (GPU)).

[0176] At least one processor (820) can enable the motion recognition device (800) to perform the actions performed in the learning or inference step described in the present disclosure by individually or collectively executing code, instructions, and / or applications stored in memory (840).

[0177]

[0178] The embodiments described above may be implemented as hardware components, software components, and / or combinations of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include multiple processing elements and / or multiple types of processing elements. For example, the processing unit may include multiple processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.

[0179] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or instruct the processing unit independently or collectively. Software and / or data may be stored on any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on computer-readable recording media.

[0180] The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, etc., either individually or in combination, and the program instructions recorded on the medium may be those specifically designed and configured for the embodiment or those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.

[0181] The hardware device described above may be configured to operate as one or more software modules to perform the operation of the embodiment, and vice versa.

[0182] Although the embodiments have been described above with reference to the limited drawings, those skilled in the art can apply various technical modifications and variations based thereon. For example, suitable results may be achieved even if the described techniques are performed in a different order than described, and / or if the components of the described system, structure, device, circuit, etc. are combined or assembled in a form different from described, or replaced or substituted by other components or equivalents.

[0183] Although the present disclosure has been illustrated and described with reference to various embodiments, it will be understood by those skilled in the art that the various embodiments are intended to be illustrative and not limiting. It will be understood by those skilled in the art that various modifications to the form and details may be made without departing from the true spirit and full scope of the present disclosure, including the appended claims and their equivalents. It will also be understood by those skilled in the art that any of the embodiments described herein may be used in conjunction with other embodiments described herein. Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the claims set forth below.

[0184] The effects obtainable from the present disclosure are not limited to those mentioned above, and other unmentioned effects will be clearly understood from this document by those skilled in the art to which the present disclosure belongs.

[0185] Although the present disclosure has been described and illustrated with reference to various embodiments, it should be understood that these various embodiments are illustrative and not limiting. Furthermore, a person skilled in the art will understand that various modifications, alternatives, and / or variations of the various embodiments disclosed herein may be made without departing from the true technical spirit of the present disclosure and the overall technical scope defined by the appended claims and their equivalents. Additionally, it should be understood that one or more embodiments described herein may be used in combination with one or more other embodiments described herein.

Claims

1. A motion recognition method using a learned diffusion model, performed by an electronic device, the method comprising: acquiring input skeleton data that is the target of motion recognition; generating skeleton features by encoding the input skeleton data into a latent space; generating noisy test skeleton features by adding reference noise corresponding to a predetermined test time step to the skeleton features; obtaining respective candidate text prompts corresponding to a plurality of candidate action classes not used in the training of the learned diffusion model; calculating respective candidate noises corresponding to the respective candidate text prompts from the noisy test skeleton features using the learned diffusion model, wherein each candidate text prompt is used as a conditional input for calculating noise from the noisy test skeleton features; calculating a reconstruction error for each of the candidate text prompts based on the distance between the reference noise and each of the candidate noises; and determining the candidate action class corresponding to the candidate text prompt having the smallest reconstruction error as the action of the input skeleton data.

2. The motion recognition method of claim 1, wherein obtaining the respective candidate text prompts comprises: generating text of a sentence describing each of the plurality of candidate action classes; and extracting global text features and local text features from each candidate text prompt using a pre-trained text encoder, and wherein the learned diffusion model is configured to remove noise from the noisy test skeleton features using the global text features and the local text features.

3. The motion recognition method of claim 2, wherein the learned diffusion model includes a diffusion transformer, and the diffusion transformer includes: an adaptive normalization layer that controls the distribution of the skeleton features based on the global text features; and a cross-attention module that learns the interaction between the local text features and the skeleton features.

4. The motion recognition method of claim 1, wherein the reference noise is sampled from a Gaussian distribution.

5. The motion recognition method of claim 1, wherein the predetermined test time step comprises one or more time steps within the entire training time step interval of the learned diffusion model.

6. The motion recognition method of claim 1, wherein calculating the reconstruction error comprises calculating the reconstruction error based on the L2 norm between the reference noise and each of the candidate noises.

7. A device for performing motion recognition using a learned diffusion model, the device comprising: at least one processor; and memory that stores instructions, wherein the instructions, when executed individually or collectively by the at least one processor, cause the device to perform a plurality of operations, the plurality of operations comprising: acquiring input skeleton data that is the target of motion recognition; generating skeleton features by encoding the input skeleton data into a latent space; generating noisy test skeleton features by adding reference noise corresponding to a predetermined test time step to the skeleton features; obtaining respective candidate text prompts corresponding to a plurality of candidate action classes not used in the training of the learned diffusion model; calculating respective candidate noises corresponding to the respective candidate text prompts from the noisy test skeleton features using the learned diffusion model, wherein each candidate text prompt is used as a conditional input for calculating noise from the noisy test skeleton features; calculating a reconstruction error for each of the candidate text prompts based on the distance between the reference noise and each of the candidate noises; and determining the candidate action class corresponding to the candidate text prompt having the smallest reconstruction error as the action of the input skeleton data.

8. A method for training a motion recognition model, performed by an electronic device, the method comprising: acquiring training data including skeleton data representing a human action, a correct text label for the skeleton data, and an incorrect text label for the skeleton data; acquiring noisy skeleton features based on adding original noise to skeleton features of the skeleton data; acquiring first predicted noise based on the noisy skeleton features and a correct feature of the correct text label; acquiring second predicted noise based on the noisy skeleton features and an incorrect feature of the incorrect text label; and training the motion recognition model based on the distance between the original noise and the first predicted noise and the distance between the original noise and the second predicted noise.

9. The method of claim 8, wherein the motion recognition model includes a diffusion transformer.

10. The method of claim 8, wherein training the motion recognition model comprises training the motion recognition model using a first prediction error calculated based on the distance between the original noise and the first predicted noise, and a second prediction error calculated based on the distance between the original noise and the second predicted noise.

11. The method of claim 10, wherein training the motion recognition model using the first prediction error and the second prediction error comprises training the motion recognition model based on a first loss function configured such that the second prediction error becomes larger than the first prediction error by a predetermined value.

12. The method of claim 11, wherein training the motion recognition model based on the first loss function comprises training the motion recognition model using a third loss function defined based on the first loss function and a second loss function configured to minimize the distance between the original noise and the first predicted noise.

13. The method of claim 8, further comprising extracting the skeleton features from the skeleton data using a first encoder.

14. The method of claim 13, further comprising: extracting the correct feature from a correct text prompt corresponding to the correct text label using a second encoder; and extracting the incorrect feature from an incorrect text prompt corresponding to the incorrect text label using the second encoder.

15. The method of claim 8, wherein each of the correct feature and the incorrect feature includes a global feature and a local feature.

16. The method of claim 15, wherein acquiring the first predicted noise comprises: modulating a skeleton feature representation corresponding to the noisy skeleton features using at least one of first parameters obtained from the global feature of the correct feature; modulating a first local feature representation corresponding to the local feature of the correct feature using at least one of the first parameters; and acquiring the first predicted noise based on the modulated skeleton feature representation and the modulated first local feature representation.

17. A device for motion recognition, the device comprising: at least one processor; and memory that stores instructions, wherein the instructions, when executed individually or collectively by the at least one processor, cause the device to perform a plurality of operations, the plurality of operations comprising: acquiring skeleton data; and estimating an action corresponding to the skeleton data using a motion recognition model trained by the method of claim 8.