Training device, training method, action recognition device, and action recognition method

WO2026133432A1 · PCT designated stage · Publication Date: 2026-06-25 · KONICA MINOLTA INC


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: KONICA MINOLTA INC
Filing Date: 2024-12-17
Publication Date: 2026-06-25

Application Information

Patent Timeline

17 Dec 2024: Application
25 Jun 2026: Publication (WO2026133432A1)

IPC: G06N20/00

AI Tagging

Application Domain

Machine learning

Technology Topics

Learning unit · Identification device


Smart Images

Figure JP2024044666_25062026_PF_FP_ABST


Abstract

Disclosed is a technique for efficiently generating an action recognizer that recognizes the interaction between a human and an object. One aspect of the present disclosure relates to a training device comprising: a training data acquisition unit that acquires a training dataset consisting of a training video, training description text representing the interaction between a human and an object or between a human and another human, and an action class label of the human's action toward the object in the training video; a key point detection unit that detects key points between the human and the object from the training video; and a training unit that trains, on the basis of the key points, the training description text, and the action class label, an action recognizer that recognizes the action of a human acting on an object or on another human.
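The training dataset described in the abstract can be pictured as a record that pairs a video with a description text, an action class label, and the detected key points. The following is a minimal sketch under stated assumptions, not the applicant's implementation: the class names `KeyPoint` and `TrainingSample`, the function `detect_key_points`, the file path, and the dummy coordinates are all hypothetical. The per-frame layout (position coordinates plus object type over time) follows the time-series information mentioned in the claims.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeyPoint:
    """One detected key point in one frame (hypothetical structure)."""
    name: str         # e.g. "right_wrist" or "cup_top" (illustrative names)
    x: float          # normalized horizontal position
    y: float          # normalized vertical position
    object_type: str  # "person" for joints, otherwise the object category


@dataclass
class TrainingSample:
    """One entry of the training dataset: video, description, label, key points."""
    video_path: str    # hypothetical path to the training video
    description: str   # training description text
    action_label: str  # action class label
    # One list of key points per frame, i.e. a time series.
    keypoints: List[List[KeyPoint]] = field(default_factory=list)


def detect_key_points(num_frames: int) -> List[List[KeyPoint]]:
    """Stand-in for the key point detection unit; returns fixed dummy points.

    A real system would run detectors on the video frames instead.
    """
    frames = []
    for t in range(num_frames):
        frames.append([
            KeyPoint("right_wrist", 0.40 + 0.01 * t, 0.55, "person"),
            KeyPoint("cup_top", 0.45, 0.50, "cup"),
        ])
    return frames


sample = TrainingSample(
    video_path="videos/drink_001.mp4",  # assumed path, for illustration only
    description="A person lifts a cup to the mouth with the right hand.",
    action_label="drinking",
)
sample.keypoints = detect_key_points(num_frames=3)
print(len(sample.keypoints), len(sample.keypoints[0]))
```

Keeping the object type next to each coordinate lets a downstream model tell human joints apart from object points without a separate lookup.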


Claims

1. A learning device comprising: a training data acquisition unit that acquires a training dataset consisting of training videos, training descriptions that represent the relationship between a person and an object or between a person and another person, and action class labels of the person's actions toward the object in the training videos; a key point detection unit that detects key points between the person and the object from the training videos; and a learning unit that trains an action recognizer that recognizes the actions of a person acting on an object or another person, based on the key points, the training descriptions, and the action class labels.

2. The learning device according to claim 1, wherein the training description focuses on changes in a person's posture or state in the action to be recognized.

3. The learning device according to claim 1, wherein the training description explains, for each object, the relationship between a person and that object in the action to be recognized.

4. The learning device according to claim 1, wherein the training description explains the relationship between people in the action to be recognized.

5. The learning device according to claim 1, wherein the training description explains the relationship between a person and an object, the relationship between people, and changes in a person's posture or state in the action to be recognized.

6. The learning device according to claim 1, wherein the training description is generated by a large language model.

7. The learning device according to claim 1, wherein the key point detection unit detects human joint points as the key points.

8. The learning device according to claim 1, wherein the key point detection unit detects the position of a person's fingers as the key point.

9. The learning device according to claim 1, wherein the key point detection unit detects the endpoints of an object as the key points.

10. The learning device according to claim 1, wherein the key point detection unit detects the key points using a joint point detector and an object detector.

11. The learning device according to claim 1, wherein the key point detection unit generates time-series information of the position coordinates of the key points and the type of object.

12. The learning device according to claim 1, wherein the learning unit trains the action recognizer so as to increase the similarity between the feature vector of the training video output from the action recognizer and the feature vector of the training description.
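The similarity objective in claim 12 can be illustrated with a small numeric sketch. The claims do not specify the similarity measure or the optimizer, so cosine similarity, the loss `1 - cosine`, and the simple interpolation step below are assumptions chosen for illustration; the function names and the example vectors are hypothetical.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_loss(video_vec: List[float], text_vec: List[float]) -> float:
    """Loss that shrinks as the two feature vectors become more similar."""
    return 1.0 - cosine_similarity(video_vec, text_vec)


# Hypothetical feature vectors for one training video and its description.
video_vec = [1.0, 0.0, 0.0]
text_vec = [0.6, 0.8, 0.0]

before = similarity_loss(video_vec, text_vec)

# Stand-in for one training update: move the video feature part of the way
# toward the text feature. A real trainer would update model weights by
# backpropagation rather than editing the feature vector directly.
step = 0.5
updated_vec = [v + step * (t - v) for v, t in zip(video_vec, text_vec)]

after = similarity_loss(updated_vec, text_vec)
print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

A lower loss after the update corresponds to a higher similarity between the video and description features, which is the direction of training the claim describes.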

13. A learning method in which a computer performs the following steps: acquiring a training dataset consisting of training videos, training descriptions that represent the relationship between a person and an object or between a person and another person, and action class labels of the person's actions toward the object in the training videos; detecting key points between the person and the object from the training videos; and training an action recognizer that recognizes the actions of a person acting on an object or another person, based on the key points, the training descriptions, and the action class labels.

14. An action recognition device comprising: an acquisition unit that acquires a video to be recognized and a descriptive text that represents the relationship between a person and an object; a key point detection unit that detects key points between the person and the object from the video; and an action recognition unit that recognizes the actions of a person acting on an object or a person from the key points and the descriptive text, using an action recognizer trained with a training dataset consisting of a training video, a training descriptive text that represents the relationship between a person and an object or between a person and another person, and an action class label of the person's actions toward the object in the training video.

15. The action recognition device according to claim 14, wherein the acquisition unit acquires a descriptive text indicating an action that cannot be classified into the action class labels, and the action recognition unit uses the action recognizer to recognize the person's action toward the object from the key points and that descriptive text.

16. An action recognition method in which a computer performs the following steps: acquiring a video to be recognized and a descriptive text that represents the relationship between a person and an object or between a person and another person; detecting key points between the person and the object from the video; and recognizing the actions of the person acting on the object or person from the key points and the descriptive text, using an action recognizer trained with a training dataset consisting of a training video, a training descriptive text that represents the relationship between a person and an object or between a person and another person, and an action class label of the person's actions toward the object in the training video.
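The recognition step in claims 14 to 16, including the case in claim 15 where the descriptive text names an action outside the action class labels, might be sketched as a nearest-description lookup. This is an assumption: the claims do not state how the recognizer combines the key points with the descriptive text at recognition time. The feature vectors below are hypothetical stand-ins for the outputs of a trained recognizer, and `recognize` is an illustrative name.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recognize(video_vec: List[float], text_vecs: Dict[str, List[float]]) -> str:
    """Return the descriptive text whose feature vector best matches the video."""
    return max(text_vecs, key=lambda d: cosine_similarity(video_vec, text_vecs[d]))


# Hypothetical text features for candidate descriptive texts. The last entry
# is assumed to describe an action that has no training class label.
text_vecs = {
    "a person drinks from a cup": [0.9, 0.1, 0.0],
    "a person hands a cup to another person": [0.1, 0.9, 0.1],
    "a person drops a cup": [0.0, 0.2, 0.9],
}

# Hypothetical feature computed from the key points of the video to recognize.
video_vec = [0.1, 0.3, 0.8]

print(recognize(video_vec, text_vecs))
```

Because the candidates are plain texts rather than fixed class indices, a new description can be added at recognition time without retraining, which is one way to read the purpose of claim 15.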