Video language model training method and human interaction behavior recognition method

By integrating object location annotation information and a multimodal refinement learning module into the video language model, the problem of insufficient accuracy in behavior recognition from the first-person perspective is solved, and high-precision recognition of complex human-object interaction behaviors is achieved.

CN120472359B · Active · Publication Date: 2026-06-16 · NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHEASTERN UNIV CHINA
Filing Date
2025-03-24
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In first-person perspective behavior recognition, poor video stability, a limited field of view, and the difficulty of extracting fine-grained features prevent existing technologies from accurately recognizing human-object interaction behaviors. Accuracy is especially insufficient in scenarios involving hand operations and object interactions.

Method used

The method designs a video language model that combines a video feature extraction network, an object position feature extraction network, an L-layer multi-head self-attention block, and an L-layer multimodal refinement learning module. The model integrates object position annotation information to achieve fine-grained alignment of visual features and text features, thereby improving action understanding capabilities.
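The data flow described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the summary does not say how video and object position features are combined (token-wise concatenation is assumed here), the attention uses identity projections instead of learned weights, and the body of `refine` is a hypothetical placeholder because the internals of the multimodal refinement learning module are not described. All function names are made up for illustration.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(tokens, num_heads):
    """Multi-head scaled dot-product self-attention.
    Assumption: identity Q/K/V projections; a real block would learn them."""
    head_dim = len(tokens[0]) // num_heads
    out = [[] for _ in tokens]
    for h in range(num_heads):
        part = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(head_dim)
                   for kj in part] for qi in part]
        mixed = matmul([softmax(r) for r in scores], part)
        for i, row in enumerate(mixed):
            out[i].extend(row)  # concatenate the heads back together
    # Residual connection around the attention output.
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, out)]

def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def refine(joint_tokens, text_tokens, prev_multimodal):
    """Hypothetical stand-in for one multimodal refinement learning module."""
    visual = mean_pool(joint_tokens)
    text = mean_pool(text_tokens)
    fused = [(v + t) / 2 for v, t in zip(visual, text)]
    if prev_multimodal is not None:
        fused = [(f + p) / 2 for f, p in zip(fused, prev_multimodal)]
    return visual, text, fused

def forward(video_feats, position_feats, text_feats, num_layers=2, num_heads=2):
    # Assumption: video and object position tokens are concatenated.
    tokens = video_feats + position_feats
    joint_per_layer = []
    for _ in range(num_layers):
        tokens = self_attention(tokens, num_heads)
        joint_per_layer.append(tokens)  # visual joint features of this layer
    visual = text = multimodal = None
    for joint in joint_per_layer:  # one refinement module per attention layer
        visual, text, multimodal = refine(joint, text_feats, multimodal)
    return joint_per_layer, (visual, text, multimodal)

random.seed(0)
DIM = 4

def rand_tokens(n):
    return [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]

video_feats, position_feats, text_feats = rand_tokens(3), rand_tokens(2), rand_tokens(2)
joint_layers, (visual, text, multimodal) = forward(video_feats, position_feats, text_feats)
print(len(joint_layers), len(joint_layers[0]), len(multimodal))  # prints: 2 5 4
```

The point of the sketch is the wiring: each of the L attention layers yields its own visual joint features, and the L refinement modules consume those features together with the text, with only the last module's three outputs being returned.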

Benefits of technology

It improves the accuracy of human interaction behavior recognition, enabling more accurate identification and understanding of complex action differences and object interaction relationships from a first-person perspective.


Figure CN120472359B_ABST

Abstract

The application provides a video language model training method and a human interaction behavior recognition method, and relates to the technical field of computer vision recognition. The training method includes: obtaining a video sample and action description text data for the human interaction behavior in the video sample; determining first video features and first object position features corresponding to the video sample; determining, based on the first video features and the first object position features, the visual joint features output by each of the L layers of multi-head self-attention blocks; determining, based on the action description text data and the visual joint features, the visual representation, text representation and multi-modal representation output by the last of the L layers of multi-modal refinement learning modules; and updating model parameters of the video language model based on the visual representation, the text representation and the multi-modal representation until a trained target video language model is obtained. The application can improve the accuracy of human interaction behavior recognition.
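The abstract does not state which loss drives the parameter update. As a hedged illustration only, one common way to align visual and text representations is a symmetric contrastive (InfoNCE-style) loss, sketched below in plain Python. The choice of loss, the `temperature` value, and all function names are assumptions, not taken from the patent. The sketch also leaves out whatever term uses the multi-modal representation, and it computes the loss only; in practice an autograd framework would supply the gradients for the update.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(visual_reps, text_reps, temperature=0.07):
    """Symmetric InfoNCE: sample i's visual and text representations are the
    positive pair; every other pairing in the batch is a negative."""
    v = [l2_normalize(x) for x in visual_reps]
    t = [l2_normalize(x) for x in text_reps]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two samples: aligned pairs give a low loss, swapped pairs a high one.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)  # prints: True
```

Under this assumed objective, training would repeat the forward pass and loss computation over batches of video samples and their action description texts until the loss stops improving.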