Video language model training method and human interaction behavior recognition method
Integrating object location annotations and a multimodal refinement learning module into a video language model addresses the insufficient accuracy of behavior recognition from the first-person perspective and enables high-precision recognition of complex human-object interaction behaviors.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: NORTHEASTERN UNIV CHINA
- Filing Date: 2025-03-24
- Publication Date: 2026-06-16
AI Technical Summary
In first-person perspective behavior recognition, unstable video, a limited field of view, and the difficulty of extracting fine-grained features prevent existing technologies from accurately recognizing human-object interaction behaviors. Accuracy is especially insufficient in scenarios involving hand operations and object interactions.
The video language model combines a video feature extraction network, an object position feature extraction network, an L-layer multi-head self-attention block, and an L-layer multimodal refinement learning module. Integrating object position annotation information enables fine-grained alignment of visual features and text features, which improves action understanding.
The method improves the accuracy of human interaction behavior recognition, so that differences between complex actions and object interaction relationships can be identified and understood more accurately from a first-person perspective.
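The data flow described above (video features plus object position features, passed through stacked self-attention and then compared with text features) can be illustrated with a toy sketch. This is not the patented implementation: the summary does not disclose the networks, so the 4-dimensional features, identity attention projections, single attention head, mean pooling, cosine-similarity scoring, and all function names below are assumptions made for illustration, and the multimodal refinement learning module is not reproduced.

```python
import math

def box_to_feature(box, width, height):
    """Hypothetical position encoder: pixel box (x1, y1, x2, y2) -> normalized (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2 / width, (y1 + y2) / 2 / height,
            (x2 - x1) / width, (y2 - y1) / height]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.
    Assumption: identity query/key/value projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def mean_pool(tokens):
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recognize(video_tokens, boxes, frame_size, text_embeddings, num_layers=2):
    """Fuse video and box tokens with num_layers attention passes, then pick
    the label whose (assumed, precomputed) text embedding is most similar."""
    width, height = frame_size
    position_tokens = [box_to_feature(b, width, height) for b in boxes]
    tokens = video_tokens + position_tokens  # toy setup: both are 4-dimensional
    for _ in range(num_layers):
        tokens = self_attention(tokens)
    clip = mean_pool(tokens)
    scores = {label: cosine(clip, emb) for label, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Made-up toy inputs, not data from the patent.
video_tokens = [[0.9, 0.1, 0.0, 0.2], [0.8, 0.2, 0.1, 0.1]]
boxes = [(100, 50, 300, 250)]
text_embeddings = {"pick up cup": [1.0, 0.2, 0.1, 0.2],
                   "close door": [0.0, 0.1, 1.0, 0.9]}
label, scores = recognize(video_tokens, boxes, (640, 480), text_embeddings)
print(label, scores)
```

In a real system the encoders and attention projections would be learned, and the text embeddings would come from a language encoder. The sketch only shows where object position annotations enter the token sequence before visual and text features are compared.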
Figure CN120472359B_ABST