Training device, training method, action recognition device, and action recognition method
Patent Information
- Authority / Receiving Office: WO · WO
- Patent Type: Applications
- Current Assignee / Owner: KONICA MINOLTA INC
- Filing Date: 2024-12-17
- Publication Date: 2026-06-25
Abstract
Claims
1. A learning device comprising: a training data acquisition unit that acquires a training dataset consisting of training videos, training descriptions that represent the relationship between a person and an object or between a person and another person, and action class labels of the person's actions toward the object in the training videos; a key point detection unit that detects key points of the person and the object from the training videos; and a learning unit that trains an action recognizer that recognizes the actions of a person acting toward an object or another person, based on the key points, the training descriptions, and the action class labels.
2. The learning device according to claim 1, wherein the training description focuses on changes in a person's posture or state in the action to be recognized.
3. The learning device according to claim 1, wherein the training description explains, for each object, the relationship between a person and that object in the action to be recognized.
4. The learning device according to claim 1, wherein the training description explains the relationship between people in the action to be recognized.
5. The learning device according to claim 1, wherein the training description explains the relationship between a person and an object, the relationship between people, and changes in a person's posture or state in the action to be recognized.
6. The learning device according to claim 1, wherein the training description is generated by a large language model.
7. The learning device according to claim 1, wherein the key point detection unit detects human joint points as the key points.
8. The learning device according to claim 1, wherein the key point detection unit detects the positions of a person's fingers as the key points.
9. The learning device according to claim 1, wherein the key point detection unit detects the endpoints of an object as the key points.
10. The learning device according to claim 1, wherein the key point detection unit detects the key points using a joint point detector and an object detector.
11. The learning device according to claim 1, wherein the key point detection unit generates time-series information of the position coordinates of the key points and the type of object.
12. The learning device according to claim 1, wherein the learning unit trains the action recognizer so as to increase the similarity between the feature vector of the training video output from the action recognizer and the feature vector of the training description.
13. A learning method in which a computer performs the following steps: obtaining a training dataset consisting of training videos, training descriptions that represent the relationship between a person and an object or between a person and another person, and action class labels of the person's actions toward the object in the training videos; detecting key points of the person and the object from the training videos; and training an action recognizer that recognizes the actions of a person acting toward an object or another person, based on the key points, the training descriptions, and the action class labels.
14. An action recognition device comprising: an acquisition unit that acquires a video to be recognized and a descriptive text that represents the relationship between a person and an object; a key point detection unit that detects key points of the person and the object from the video; and an action recognition unit that recognizes the actions of a person acting on an object or another person from the key points and the descriptive text, using an action recognizer trained with a training dataset consisting of training videos, training descriptive texts that represent the relationship between a person and an object or between people, and action class labels of the person's actions toward the object in the training videos.
15. The action recognition device according to claim 14, wherein the acquisition unit acquires a descriptive text indicating an action that cannot be classified into any of the action class labels, and the action recognition unit uses the action recognizer to recognize the person's action toward the object from the key points and that descriptive text.
16. An action recognition method in which a computer performs the following steps: obtaining a video to be recognized and a descriptive text that represents the relationship between a person and an object or between people; detecting key points of the person and the object from the video; and recognizing the actions of the person acting on the object or on another person from the key points and the descriptive text, using an action recognizer trained with a training dataset consisting of training videos, training descriptive texts that represent the relationship between a person and an object or between people, and action class labels of the person's actions toward the object in the training videos.
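Illustrative sketch (not part of the claims). Claim 12 trains the recognizer so that the feature vector of a training video becomes more similar to the feature vector of its training description, and claims 14 and 15 recognize an action from key points and a descriptive text. The claims do not name a similarity measure, a loss function, or an encoder, so everything below is an assumption made for illustration: cosine similarity is used as the measure, `alignment_loss` is a hypothetical objective (1 minus the similarity), `recognize` is a hypothetical helper, and the feature vectors are hand-written toy values standing in for encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def alignment_loss(video_vec, text_vec):
    """Hypothetical training objective: lowering this value raises the
    similarity between the video feature and the description feature."""
    return 1.0 - cosine_similarity(video_vec, text_vec)


def recognize(video_vec, description_vecs):
    """Return the description whose feature vector is most similar to the
    video feature. description_vecs maps description text -> feature vector."""
    return max(
        description_vecs,
        key=lambda text: cosine_similarity(video_vec, description_vecs[text]),
    )


if __name__ == "__main__":
    # Toy features; a real system would obtain them from the key-point
    # time series and from the descriptive texts.
    video_feature = [0.9, 0.1, 0.0]
    descriptions = {
        "a person picks up a cup": [1.0, 0.0, 0.0],
        "a person hands an object to another person": [0.0, 1.0, 0.0],
    }
    print(recognize(video_feature, descriptions))
    print(alignment_loss(video_feature, descriptions["a person picks up a cup"]))
```

Because `recognize` compares against whatever descriptions it is given, a description that matches none of the training class labels can simply be added to the dictionary at recognition time, which is one way to read the situation described in claim 15.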