Video language model training method and human interaction behavior recognition method

By integrating object location annotation information and a multimodal refinement learning module into the video language model, the problem of insufficient accuracy in behavior recognition from the first-person perspective is solved, and high-precision recognition of complex human-object interaction behaviors is achieved.

CN120472359B · Active · Publication Date: 2026-06-16 · NORTHEASTERN UNIV CHINA

Cites: 2 · Cited by: 0

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NORTHEASTERN UNIV CHINA
Filing Date: 2025-03-24
Publication Date: 2026-06-16

Application Information

Patent Timeline

24 Mar 2025: Application
16 Jun 2026: Publication (CN120472359B)

IPC: G06V20/40; G06V40/20; G06V10/80; G06V10/44; G06V10/82; G06V10/776; G06N3/045

AI Tagging

Application Domain

Character and pattern recognition; Biological models

AI Technical Summary

Technical Problem

In first-person perspective behavior recognition, unstable video, a limited field of view, and the difficulty of extracting fine-grained features prevent existing technologies from accurately recognizing human-object interaction behaviors. Accuracy is especially poor in scenarios involving hand operations and object interactions.

Method used

The method designs a video language model that combines a video feature extraction network, an object position feature extraction network, L layers of multi-head self-attention blocks, and L layers of multimodal refinement learning modules. Object position annotation information is integrated into the model to achieve fine-grained alignment of visual features and text features, which improves the model's action understanding.
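The visual side of this design can be illustrated with a minimal sketch in plain Python. It assumes that video features and object position features are token lists of equal width that are joined along the token axis, and that each block is a multi-head self-attention with identity projections and a residual connection. None of these details are stated in the summary; the function names (`visual_joint_features` and its helpers) are hypothetical and stand in for learned networks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(tokens):
    """Scaled dot-product self-attention.

    Simplification: queries, keys, and values are the tokens themselves
    (identity projections); a real block would use learned weights.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def multi_head_self_attention(tokens, num_heads):
    """Split the feature axis into heads, attend per head, concatenate."""
    d = len(tokens[0])
    assert d % num_heads == 0, "feature width must divide evenly into heads"
    hd = d // num_heads
    heads = [attention_head([t[h * hd:(h + 1) * hd] for t in tokens])
             for h in range(num_heads)]
    attended = [sum((heads[h][i] for h in range(num_heads)), [])
                for i in range(len(tokens))]
    # Residual connection (an assumption of this sketch).
    return [[x + y for x, y in zip(t, a)] for t, a in zip(tokens, attended)]


def visual_joint_features(video_feats, object_pos_feats, num_layers, num_heads):
    """Run L stacked blocks and keep the output of every layer."""
    # Assumption: the two feature sets are joined along the token axis.
    tokens = video_feats + object_pos_feats
    per_layer = []
    for _ in range(num_layers):
        tokens = multi_head_self_attention(tokens, num_heads)
        per_layer.append(tokens)
    return per_layer
```

Keeping the output of every layer, rather than only the last one, mirrors the description that each of the L self-attention layers produces its own visual joint features.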

Benefits of technology

The method improves the accuracy of human interaction behavior recognition, so that differences between complex actions and object interaction relationships are identified and understood more accurately from a first-person perspective.

✦ Generated by Eureka AI based on patent content.

Figure CN120472359B_ABST

Abstract

The application provides a video language model training method and a human interaction behavior recognition method, and relates to the technical field of computer vision recognition. The training method includes: obtaining a video sample and action description text data for the human interaction behavior in the video sample; determining first video features and first object position features corresponding to the video sample; determining, based on the first video features and the first object position features, the visual joint features output by each of the L layers of multi-head self-attention blocks; determining, based on the action description text data and the visual joint features, the visual representation, text representation, and multimodal representation output by the last of the L layers of multimodal refinement learning modules; and updating the model parameters of the video language model based on the visual representation, the text representation, and the multimodal representation until a trained target video language model is obtained. The application can improve the accuracy of human interaction behavior recognition.
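The layer-wise data flow in the abstract, where the visual joint features of each self-attention layer feed the refinement module at the same depth and only the last module's outputs are used, might be sketched as follows. The internals of a refinement module are not described in the abstract, so `toy_refinement_module` is a purely hypothetical stand-in built from averaging, and the sketch assumes text and visual vectors share one width. The abstract also does not say how the three representations are combined into a training loss, so the sketch stops at producing them.

```python
def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def toy_refinement_module(visual_state, text_state, joint_tokens):
    """Hypothetical placeholder for one multimodal refinement learning module.

    It only mixes vectors by averaging so that the data flow can be traced;
    it is not the design described in the patent.
    """
    pooled = mean_pool(joint_tokens)
    visual = [0.5 * (s + p) for s, p in zip(visual_state, pooled)]
    text = [0.5 * (t + v) for t, v in zip(text_state, visual)]
    multimodal = [0.5 * (v + t) for v, t in zip(visual, text)]
    return visual, text, multimodal


def run_refinement_stack(text_features, joint_per_layer, module=toy_refinement_module):
    """Feed layer l's visual joint features into refinement module l.

    Returns the (visual, text, multimodal) representations of the last layer.
    """
    # Assumption: the initial visual state is the pooled first-layer features.
    visual_state = mean_pool(joint_per_layer[0])
    text_state = list(text_features)
    outputs = None
    for joint_tokens in joint_per_layer:
        outputs = module(visual_state, text_state, joint_tokens)
        visual_state, text_state, _ = outputs
    return outputs


# Two layers, each with two joint tokens of width 2; text features start at zero.
joint = [[[1.0, 1.0], [3.0, 3.0]], [[1.0, 1.0], [3.0, 3.0]]]
visual, text, multimodal = run_refinement_stack([0.0, 0.0], joint)
```

In a trainable model, the three returned representations would be passed to a loss function and an optimizer step; both are left out here because the abstract does not specify them.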
