A few-shot action recognition method, system, and terminal

By constructing an action knowledge base and a temporal awareness adapter, the spatiotemporal interaction trajectory of semantic entities is explicitly modeled, which solves the object-centered bias and semantic gap problems of visual language models in few-sample action recognition, and improves the accuracy and generalization ability of action recognition.

CN122244941A · Pending · Publication Date: 2026-06-19 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANGHAI JIAOTONG UNIV
Filing Date: 2026-03-03
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing visual language models suffer from object-centered bias, semantic gap, lack of interaction modeling, and weak temporal reasoning ability in few-sample action recognition, resulting in poor performance in complex action recognition.

Method used

By constructing an action knowledge base containing semantic entities and hierarchical semantic descriptions, using a large language model to parse action tags, and combining a temporal-aware adapter to extract visual features, the spatiotemporal interaction trajectory of semantic entities is explicitly modeled to generate interactive perception features. Segments are divided according to the time dimension and aligned with the hierarchical semantic descriptions across modalities to construct action prototypes.

Benefits of technology

It significantly improves the model's feature representation ability in complex action recognition, enhances the accuracy and generalization ability of action recognition in few-shot scenarios, addresses the problems of object-centered bias and missing action semantics, and strengthens the perception of temporal order.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a few-shot action recognition method, system, and terminal. The method includes: acquiring action category labels, a few-shot support set video, and a query set video to be recognized; extracting dynamic visual features from both the support set video and the query set video; constructing hybrid semantic tags based on semantic entities, performing interactive modeling with the dynamic visual features, capturing the spatiotemporal interaction trajectory of the semantic entities, and generating interactive perception features; based on the interactive perception features of the support set video, dividing it into segments according to the time dimension and aligning it cross-modally with the hierarchical semantic description to construct action prototypes; calculating the matching distance between the interactive perception features of the query set video and the action prototypes, and determining the action category to which the query set video belongs. This application addresses the object-centered bias and missing action semantics problems existing in the VLM model, significantly improving the action recognition accuracy and generalization ability in few-shot scenarios on multiple mainstream benchmark datasets.
Description

Technical Field

[0001] This application relates to the fields of computer vision and natural language processing technology, specifically to a method, system, and terminal for few-sample action recognition.

Background Technology

[0002] With the rapid development of video surveillance, human-computer interaction, autonomous driving, and smart healthcare, video action recognition technology has become one of the core research directions in computer vision. However, deep-learning-based action recognition models typically rely on large-scale labeled data (such as the Kinetics datasets) for training to obtain robust feature representations. In practical applications, acquiring massive amounts of temporally labeled video data is extremely costly, and such data is often scarce. To address this issue, Few-shot Action Recognition (FSAR) has emerged. FSAR aims to enable models to quickly generalize to and recognize previously unseen action categories using only a very small number (usually 1 or 5) of labeled samples per category.

[0003] Traditional few-shot action recognition methods primarily follow the metric-based meta-learning paradigm. These methods train a general embedding space on a large amount of base class data, so that video features of the same class lie as close together as possible while features of different classes lie as far apart as possible. Early research focused on the temporal alignment problem of videos. For example, the OTAM method uses dynamic time warping to align support and query set videos; the TRX method captures multi-scale temporal relationships by sampling frame tuples of different sizes. Subsequently, researchers introduced more complex attention mechanisms, such as TARN and MPRE, which enhance cross-class generalization through temporal attention; HyRSM uses task-specific matching algorithms to find optimal support-query pairs. Although these vision-only methods have thoroughly explored the temporal representation and alignment of action features and made progress on specific datasets, they share a core bottleneck: with limited labeled data, the models are prone to overfitting on the training set, making it difficult to capture highly discriminative action patterns. To alleviate this problem, researchers have attempted to enhance the visual representation of action videos by introducing pre-trained Vision-Language Models (VLMs). However, existing VLM-based methods still suffer from the following significant drawbacks when handling complex actions:

A severe "object-centered bias": CLIP models are primarily pre-trained on static images, which makes them excellent at recognizing static objects (such as "person" and "bench"), but they lack the ability to understand dynamic interactions between objects. When recognizing the action of "sitting down," traditional CLIP tends to focus on the static content (person, bench) that is always present in the video, resulting in extremely high similarity between the representations of different frames and a failure to perceive the temporal evolution of the action.

[0004] Semantic gap and lack of interaction modeling: A significant semantic gap exists between abstract action labels (such as "sit down") and the concrete visual execution process. Although existing methods attempt to decompose labels, they mostly rely on global representations and cannot explicitly link specific semantic entities (such as "hand" or "cup") to dynamic visual interactions. As a result, models easily overlook key spatial trajectories and contact-point interactions when faced with actions involving complex human-object interactions (such as "picking up an object" and "putting down an object"), which leads to recognition ambiguity.

[0005] Weak temporal reasoning capabilities: CLIP's native architecture is primarily designed to handle independent patches and lacks explicit mechanisms for capturing cross-frame dependencies. Although some studies have attempted to introduce lightweight adapters, they often fail to deeply couple visual motion perception with textual semantic priors, resulting in insufficient generalization of the generated action prototypes.

[0006] In summary, how to overcome the static object bias in existing visual language models and transform object-centered visual embeddings into motion-aware action representations is a key problem that urgently needs to be solved in the field of few-shot action recognition.

[0007] A search revealed Chinese patent application number 202410937750.5, which discloses a few-sample action recognition method based on semantic-guided multimodal fusion. This method utilizes a large language model to generate rich and comprehensive textual knowledge covering various action categories, ensuring the comprehensiveness of semantic information extracted for the few-sample action recognition task. Preliminary classification in the text branch is achieved by matching and measuring the extracted discriminative semantic information with the visual information of samples of unknown categories. Furthermore, a semantically guided visual interaction module is designed in the visual branch, promoting the effective integration of semantic and visual information, improving the quality of feature representation in samples, and enabling more timely understanding of new categories with only a few samples. However, it lacks spatiotemporal interactive perception and fails to capture cross-frame correlations.

Summary of the Invention

[0008] In view of the deficiencies in the prior art, the purpose of this application is to provide a method, system, and terminal for few-sample action recognition.

[0009] A first aspect of this application provides a method for few-shot action recognition, comprising:
Obtaining action category labels, few-shot support set videos, and query set videos to be identified;
Parsing the action category labels using a large language model to construct an action knowledge base containing semantic entities and hierarchical semantic descriptions;
Extracting dynamic visual features from the support set videos and the query set videos to be identified, respectively, using a visual encoder that includes a temporal-aware adapter;
Constructing hybrid semantic tags based on the semantic entities and performing interactive modeling with the dynamic visual features, to capture the spatiotemporal interaction trajectories of the semantic entities and generate interactive perception features;
Based on the interactive perception features of the support set videos, dividing segments along the time dimension and aligning them cross-modally with the hierarchical semantic descriptions to construct action prototypes;
Calculating the matching distance between the interactive perception features of the query set videos and the action prototypes, and determining the action category to which each query set video belongs.

[0010] Optionally, the step of using a large language model to parse the action category labels and constructing an action knowledge base containing semantic entities and hierarchical semantic descriptions includes:
Constructing preset prompt instructions;
Using the pre-trained large language model to parse the action category labels according to the prompt instructions, generating structured semantic information that covers perspective analysis, temporal analysis, atomic action description optimization, and action instance extraction;
Constructing the action knowledge base from the structured semantic information. The action knowledge base includes a set of semantic entities and hierarchical action semantic descriptions.

[0011] Optionally, the step of extracting dynamic visual features from the support set video and the query set video to be identified using a visual encoder including a temporal-aware adapter includes:
The T-frame sequence of the input video is divided into N patch tokens per frame, and each patch token is encoded into a preset feature dimension D to obtain the feature tokens of the input video;
The feature tokens are input into the backbone network of the visual encoder, which is composed of stacked Transformer blocks, each of which embeds a temporal-aware adapter with a bottleneck structure;
In each Transformer block, the following processing is performed:
The original visual tokens are obtained by processing the input feature tokens through the original mechanism of the frozen Transformer block;
The temporal-aware adapter processes the input feature tokens to obtain temporal dynamic information: it reduces the feature dimension from the preset feature dimension D to the bottleneck dimension d through a linear downsampling layer, exchanges motion cues along the temporal dimension using a temporal multi-head self-attention mechanism, and restores the dimension to the preset feature dimension D through a linear upsampling layer;
The temporal dynamic information is incorporated into the original visual tokens using residual connections, and the visual features of the current Transformer block are output; this is repeated until the visual features of the video are obtained after passing through all Transformer blocks.

[0012] Optionally, the hybrid semantic tags include:
Knowledge-driven prior semantic tags, constructed from the semantic entities in the action knowledge base and used to locate the semantic regions in the video that correspond to those entities;
Data-driven learnable tags, randomly initialized and updated during training, used to capture potential cues or motion patterns not covered by the hierarchical semantic description.

[0013] Optionally, the step of constructing a hybrid semantic tag based on the semantic entity, performing interactive modeling with the dynamic visual features, capturing the spatiotemporal interaction trajectory of the semantic entity, and generating interactive perception features includes:
Calculating the dot product similarity between the hybrid semantic tags and the dynamic visual features, and generating attention weights using Softmax normalization;
Processing the dynamic visual features using a dual-path hybrid interaction mechanism to obtain motion pattern features;
Performing position encoding on the motion pattern features to obtain semantically enhanced features;
Weighting and aggregating the semantically enhanced features using the attention weights to obtain region features;
Concatenating the region features with the global image features, which preserves the model's ability to perceive global contextual information, to obtain the interactive perception features.
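The attention and aggregation steps above can be illustrated with a minimal sketch. It is not the applicant's implementation: the function names, the toy vectors, and the choice to pass the raw patch features as the "semantically enhanced" features (skipping the dual-path mechanism and position encoding) are assumptions made only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def region_features(tags, visual, enhanced):
    """tags: M hybrid semantic tag vectors; visual and enhanced: N patch vectors each.

    Weights come from tag-to-visual dot products; the enhanced features are aggregated.
    Returns M region vectors.
    """
    dim = len(enhanced[0])
    regions = []
    for tag in tags:
        weights = softmax([dot(tag, v) for v in visual])
        regions.append([sum(w * e[i] for w, e in zip(weights, enhanced)) for i in range(dim)])
    return regions

def interactive_perception_feature(regions, global_feature):
    """Concatenate the flattened region features with the global image feature."""
    flat = [value for region in regions for value in region]
    return flat + list(global_feature)

# Toy data (made up): 2 tags, 3 patches, feature dimension 2.
tags = [[1.0, 0.0], [0.0, 1.0]]
patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
regions = region_features(tags, patches, patches)  # assumption: enhanced == visual here
feature = interactive_perception_feature(regions, [0.5, 0.5])
```

In this toy case each tag concentrates its attention on the patch it resembles most, so each region vector is dominated by that patch.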

[0014] Optionally, the process of using a dual-path hybrid interaction mechanism to process the dynamic visual features to obtain motion pattern features includes: Path 1: Utilize intra-frame multi-head self-attention layers to model spatial relationships between different semantic entities within the same frame; Path 2: Utilize a cross-frame multi-head self-attention layer to track the dynamic evolution of the same semantic entity over time; The outputs of the two paths are fused and motion pattern features are extracted via a feedforward network.
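The two attention paths can be sketched as follows. This is a simplified illustration under stated assumptions: attention is single-head with identity query, key, and value projections, the two paths are fused by averaging, and the feedforward network is omitted. None of these choices are specified by the source, and the helper names are hypothetical.

```python
import math

def self_attention(vectors):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[i] for e, v in zip(exps, vectors)) for i in range(dim)])
    return out

def dual_path(features):
    """features[t][e] is the feature vector of semantic entity e in frame t."""
    T, E = len(features), len(features[0])
    # Path 1: attention among the entities of the same frame (spatial relations).
    intra = [self_attention(features[t]) for t in range(T)]
    # Path 2: attention along time for each entity (temporal evolution).
    per_entity = [self_attention([features[t][e] for t in range(T)]) for e in range(E)]
    cross = [[per_entity[e][t] for e in range(E)] for t in range(T)]
    # Fusion by averaging is an assumption; the source fuses and then applies an FFN.
    return [
        [[(a + b) / 2 for a, b in zip(intra[t][e], cross[t][e])] for e in range(E)]
        for t in range(T)
    ]

# Toy data (made up): 3 frames, 2 entities, feature dimension 2.
feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
fused = dual_path(feats)
```

The only difference between the two paths is the axis along which tokens are grouped before attention: by frame for Path 1, by entity for Path 2.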

[0015] Optionally, the step of dividing segments along the time dimension based on the interactive perception features of the support set videos, aligning them cross-modally with the hierarchical semantic description, and constructing action prototypes includes:
Dividing the interactive perception features evenly along the time dimension into temporal segment features, where the number of segments equals the number of stages in the hierarchical action semantic description;
Obtaining the text embeddings corresponding to the hierarchical action semantic description and, for the support set videos, fusing the text embedding of each stage with the corresponding temporal segment features;
Using a temporal Transformer module to perform cross-modal alignment inference on the fused features, so that the visual features are aligned with the corresponding semantics;
Generating the action prototypes corresponding to the support set videos based on the aligned features.
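A minimal sketch of the segment division and fusion described above, under stated assumptions: segments are mean-pooled, text and visual features are fused by element-wise addition, the prototype is the average over support videos, and the temporal Transformer alignment step is left out. The function names and toy numbers are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def split_into_segments(frame_features, num_stages):
    """Split T per-frame vectors into num_stages equal contiguous segments and mean-pool each."""
    T = len(frame_features)
    if T % num_stages:
        raise ValueError("this sketch requires T to be divisible by the number of stages")
    size = T // num_stages
    return [mean(frame_features[s * size:(s + 1) * size]) for s in range(num_stages)]

def fuse_with_text(segments, stage_text_embeddings):
    """Fuse each segment with the text embedding of its stage (element-wise sum: an assumption)."""
    return [[v + t for v, t in zip(seg, txt)] for seg, txt in zip(segments, stage_text_embeddings)]

def class_prototype(support_videos, stage_text_embeddings):
    """Per-stage prototype averaged over the support videos of one action category."""
    num_stages = len(stage_text_embeddings)
    fused = [
        fuse_with_text(split_into_segments(video, num_stages), stage_text_embeddings)
        for video in support_videos
    ]
    return [mean([video[s] for video in fused]) for s in range(num_stages)]

# Toy data (made up): one support video with T = 6 frames, 3 stages, feature dimension 2.
video = [[0.0, 0.0], [2.0, 0.0], [4.0, 2.0], [6.0, 2.0], [8.0, 4.0], [10.0, 4.0]]
stage_text = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
prototype = class_prototype([video], stage_text)
```

With three stages, each pair of consecutive frames is pooled into one segment and then shifted by the embedding of the matching stage description.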

[0016] Optionally, calculating the matching distance between the interactive perception features of the query set video and the action prototype, and determining the action category to which the query set video belongs, includes: Calculate the visual matching score between the interactive perception features of the query set video and the action prototype, and the semantic matching score between the interactive perception features of the query set video and the text embedding of the action category; The visual matching score and the semantic matching score are weighted and fused to obtain the total matching score; The action category to which the query set video belongs is determined based on the total matching score.
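The weighted fusion of the two scores can be sketched as below. Cosine similarity as the matching score and the fusion weight `alpha` are assumptions; the source does not state the similarity measure or the weight, and all vectors are made-up toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def classify(query_feature, prototypes, text_embeddings, alpha=0.5):
    """Return (best label, per-label total scores).

    alpha is a hypothetical weight on the visual score; (1 - alpha) weights the semantic score.
    """
    scores = {}
    for label, proto in prototypes.items():
        visual = cosine(query_feature, proto)
        semantic = cosine(query_feature, text_embeddings[label])
        scores[label] = alpha * visual + (1.0 - alpha) * semantic
    return max(scores, key=scores.get), scores

prototypes = {"sit down": [1.0, 0.0], "pick": [0.0, 1.0]}
text_embeddings = {"sit down": [0.8, 0.6], "pick": [0.6, 0.8]}
label, scores = classify([0.9, 0.1], prototypes, text_embeddings)
```

The query is assigned to the category with the highest total score, so the semantic score can break ties between visually similar prototypes.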

[0017] A second aspect of this application provides a few-shot action recognition system, comprising:
The data acquisition module, which acquires action category labels, few-shot support set videos, and query set videos to be identified;
The knowledge base construction module, which uses a large language model to parse the action category labels and construct an action knowledge base containing semantic entities and hierarchical semantic descriptions;
The temporal-aware adapter module, which uses a visual encoder containing a temporal-aware adapter to extract dynamic visual features of the support set videos and the query set videos to be identified, respectively;
The hybrid semantic interaction module, which constructs hybrid semantic tags based on the semantic entities, performs interaction modeling with the dynamic visual features, captures the spatiotemporal interaction trajectories of the semantic entities, and generates interactive perception features;
The hierarchical semantic anchoring module, which, based on the interactive perception features of the support set videos, divides segments along the time dimension and aligns them cross-modally with the hierarchical semantic descriptions to construct action prototypes;
The multimodal action prototype matching module, which calculates the matching distance between the interactive perception features of the query set videos and the action prototypes, and determines the action category to which each query set video belongs.

[0018] A third aspect of this application provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it can be used to execute any of the few-sample action recognition methods described above, or to run the few-sample action recognition system described above.

[0019] The few-shot action recognition method provided in this application extracts dynamic visual features using a visual encoder with a temporal-aware adapter and constructs hybrid semantic tags based on semantic entities for interaction modeling. This captures the spatiotemporal interaction trajectories of semantic entities, generating interaction-aware features, thereby transforming static feature representations into feature representations sensitive to dynamic actions. This method effectively utilizes semantic prior knowledge from an action knowledge base constructed using a large language model, which includes semantic entities and hierarchical semantic descriptions. Furthermore, by dividing the data into segments according to the time dimension and aligning them across modalities with the hierarchical semantic descriptions to construct action prototypes, it explicitly models the spatiotemporal interactions and temporal transformations between entities, significantly improving the model's feature representation capabilities when handling complex action behaviors.

[0020] Other technical effects resulting from the additional features will be further illustrated in the corresponding embodiments.

Attached Figure Description

[0021] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is a flowchart of a few-shot action recognition method according to an exemplary embodiment;
Figure 2 is a flowchart of action label parsing and knowledge base construction according to an exemplary embodiment;
Figure 3 is a flowchart of few-shot action recognition testing according to an exemplary embodiment;
Figure 4 is a structural diagram of a few-shot action recognition system according to an exemplary embodiment;
Figure 5 is a performance comparison chart of the method of this application and other existing methods according to an exemplary embodiment;
Figure 6 shows attention visualizations on different action instances according to an exemplary embodiment (panel a is Pushup, panel b is Pick).

Detailed Implementation

[0022] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Parts not described in detail in the following embodiments can be implemented using existing technology.

[0023] Terminology Explanation:
Feature tokens (also rendered in this text as feature labels or feature tags): the feature vectors obtained after an image is input to the encoder;
Patch tokens (also rendered as patch markers or patch labels): obtained by dividing a video frame into several image patches and encoding each patch into a feature vector of the preset feature dimension through a linear projection layer;
Atomic action: a component of a given action. Taking "sitting down" as an example, its atomic actions are "the person moves closer to the chair", "the person bends their knees and moves their buttocks closer to the chair", and "the person sits completely on the chair" (in this application, to keep the method general and robust, a three-stage atomic action decomposition is adopted by default).

[0024] Existing VLM methods suffer from several drawbacks when dealing with complex actions: object-centered bias, a semantic gap combined with a lack of interaction modeling, and weak temporal reasoning ability.

[0025] Based on the above problems, this application provides a few-sample action recognition method to solve the aforementioned problems.

[0026] As shown in Figure 1, a few-shot action recognition method includes the following steps:
S1, obtain action category labels, few-shot support set videos, and query set videos to be identified;
S2, use a large language model to parse the action category labels and construct an action knowledge base containing semantic entities and hierarchical semantic descriptions;
S3, use a visual encoder containing a temporal-aware adapter to extract dynamic visual features of the support set videos and the query set videos to be identified, respectively;
S4, construct hybrid semantic tags based on the semantic entities, perform interactive modeling with the dynamic visual features, capture the spatiotemporal interaction trajectories of the semantic entities, and generate interactive perception features;
S5, based on the interactive perception features of the support set videos, divide segments along the time dimension and align them cross-modally with the hierarchical semantic descriptions to construct action prototypes;
S6, calculate the matching distance between the interactive perception features of the query set videos and the action prototypes, and determine the action category to which each query set video belongs.

[0027] In the above embodiment, S3 efficiently injects temporal reasoning capabilities without compromising the original representational capabilities of the large-scale pre-trained model. S4 explicitly models the spatiotemporal interaction trajectories between various semantic entities (such as people and objects, limb parts, etc.) in the video to extract discriminative motion cues obscured by global features. S5 utilizes higher-level semantic priors (such as different evolutionary stages of actions) to accurately calibrate visual features and construct robust and discriminative action prototypes. Therefore, the entire method can overcome the object-centric bias and semantic gaps in the VLM model, significantly improving the accuracy and generalization ability of action recognition in low-sample scenarios on multiple mainstream benchmark datasets.

[0028] In some specific embodiments of this application, S1, obtaining action category labels, few-shot support set videos, and query set videos to be identified, can be achieved by the following steps:
S11, obtain the few-shot support set videos.

[0029] Specifically, a few-shot support set video refers to a video set that provides only a small number of samples for each action category. Typically, only 1 or 5 videos are provided for each action category, i.e., in a 1-shot or 5-shot setting, to simulate the situation where labeled data is scarce in real-world scenarios.

[0030] S12, obtain the query set videos to be identified.

[0031] Specifically, the query set of videos to be identified consists of input videos for which the action category needs to be determined.

[0032] S13, obtain the action category labels.

[0033] Specifically, the action category labels are the names of the known action categories corresponding to the support set videos, which are used for subsequent construction of the action knowledge base and prototype matching.

[0034] The above embodiments provide a data foundation for subsequent processing steps.
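The three inputs gathered in S11 to S13 can be pictured as one few-shot episode. The sketch below is only an illustration: the `Episode` class, its field names, and the file names are hypothetical and do not come from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One few-shot task: labeled support videos plus query videos to classify."""
    support: dict = field(default_factory=dict)  # action category label -> support video paths
    query: list = field(default_factory=list)    # query video paths

    def way_and_shot(self):
        """Return (number of categories, videos per category); raise if categories differ in size."""
        shots = {len(videos) for videos in self.support.values()}
        if len(shots) != 1:
            raise ValueError("every category must have the same number of support videos")
        return len(self.support), shots.pop()

# A 3-way 1-shot episode with made-up file names.
episode = Episode(
    support={
        "sit down": ["sit_01.mp4"],
        "push up": ["pushup_01.mp4"],
        "pick": ["pick_01.mp4"],
    },
    query=["unknown_01.mp4", "unknown_02.mp4"],
)
labels = sorted(episode.support)
```

The category labels are simply the keys of the support set, which is why they are available for knowledge base construction before any query video is processed.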

[0035] To address the problem of insufficient video data in few-shot scenarios, which makes it difficult for models to learn the essential patterns of actions, this application uses a pre-trained large language model to parse action labels. In some specific embodiments of this application, S2, using a large language model to parse action category labels and construct an action knowledge base containing semantic entities and hierarchical semantic descriptions, can be achieved through the following steps, as shown in Figure 2:
S21, instruction design.

[0036] Through effective prompt design, the large language model is guided to generate structured knowledge for specific actions, focusing on identifying key semantic entities required to perform the action, as well as the hierarchical semantic description of the action's evolution over time.

[0037] Specifically, a large language model can be a general pre-trained language model, such as GPT-3.5, GPT-4, Llama2, Llama3, Qwen, or ChatGLM, which are models with instruction-following capabilities.

[0038] The instruction design specifically includes the following four parts:
Perspective analysis: analyze the optimal viewing angle for the corresponding action video;
Temporal analysis: analyze the visual descriptions of the action label at different times;
Atomic action description optimization: based on the representation requirements of the visual language model, optimize the generation of hierarchical semantic descriptions;
Action instance extraction: extract semantic instances related to the action.

[0039] The semantic entities include the subject performing the action (such as a person), the limbs performing the action (such as hands and legs), and the related target objects involved in the action (such as chairs and cups); the hierarchical semantic description refers to multiple fine-grained action descriptions that evolve over time, dividing the action process into multiple semantic stages such as start, execution, and end according to the temporal sequence, providing fine-grained supervision for video temporal alignment.

[0040] For example, the hierarchical semantic description of the action "Sit" includes: "a person approaches the chair", "a person bends over and bends their knees in front of the chair", and "a person sits completely on the chair".
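As an illustration of the four-part instruction design, a prompt could be assembled as in the sketch below. The wording of each instruction, the requested JSON keys, and the `build_prompt` helper are hypothetical; the source names the four parts but does not give the prompt text, and no particular language model API is assumed here.

```python
# The four part names follow the instruction design; the instruction sentences are invented.
INSTRUCTION_PARTS = [
    ("Perspective analysis", "Describe the best viewing angle for a video of this action."),
    ("Temporal analysis", "Describe what is visible at different times during the action."),
    ("Atomic action description optimization",
     "Rewrite each stage as a short caption suited to a vision-language model."),
    ("Action instance extraction",
     "List the subject, body parts, and objects involved in the action."),
]

def build_prompt(action_label, num_stages=3):
    """Assemble a plain-text prompt for one action category label."""
    lines = [f'Action label: "{action_label}"', f"Number of stages: {num_stages}", ""]
    for index, (name, instruction) in enumerate(INSTRUCTION_PARTS, start=1):
        lines.append(f"{index}. {name}: {instruction}")
    lines.append("")
    lines.append("Answer in JSON with the keys: view, stages, entities.")
    return "\n".join(lines)

prompt = build_prompt("Sit")
```

Asking for a fixed output structure is one way to make the model's answer easy to store as a knowledge base entry.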

[0041] S22, Action knowledge base generation.

[0042] The parsing results of S21 are stored as an action knowledge base, which includes a set of semantic entities and hierarchical semantic descriptions.

[0043] Semantic entity set: The corresponding subject performing the action, the body part performing the action, and the target object related to the action.

[0044] For example, the action "Sit" corresponds to the semantic entities "person", "chair", "hips", etc.

[0045] Hierarchical semantic description: fine-grained action descriptions corresponding to different stages in the execution of an action.

[0046] For example, "Phase 1: Approach the chair," "Phase 2: Bend your knees," and "Phase 3: Sit down completely" provide structured textual guidance for subsequent fine-grained visual alignment.
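An entry of the action knowledge base for "Sit" might be stored as in the following sketch. The `ActionKnowledge` class, its field names, and the `stage_prompts` helper are hypothetical; only the entity names and phase descriptions are taken from the examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionKnowledge:
    """One knowledge base entry: semantic entities plus ordered stage descriptions."""
    label: str
    entities: tuple
    stages: tuple

knowledge_base = {
    "Sit": ActionKnowledge(
        label="Sit",
        entities=("person", "chair", "hips"),
        stages=("Approach the chair", "Bend your knees", "Sit down completely"),
    )
}

def stage_prompts(kb, label):
    """Format the hierarchical semantic description as numbered phases."""
    entry = kb[label]
    return [f"Phase {i}: {text}" for i, text in enumerate(entry.stages, start=1)]

phases = stage_prompts(knowledge_base, "Sit")
```

Keeping the stages in an ordered tuple preserves the temporal order that the later segment alignment relies on.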

[0047] The text description obtained in the S2 example above is used to enhance the subsequent action video feature representation. For semantic entity descriptions, by indexing the corresponding semantic features on the image features, the intra-frame and inter-frame motion patterns of these semantic entities are modeled in the action feature representation, resulting in more efficient action representation. For hierarchical semantic descriptions, corresponding segmentation is performed in the temporal sequence of the action video, and visual and text features are aligned and fused to finally construct a robust action feature representation, achieving accurate few-shot action classification.

[0048] Without the text parsing in the S2 stage, the action feature representations in S3-S6 will only have visual feature inputs, making the model's feature representations susceptible to individual video samples in the support set and lacking robustness.

[0049] Native visual language models (such as CLIP) suffer from object-centered bias and weak temporal perception because they are pre-trained on static images. This application addresses these issues by inserting an adapter with temporal reasoning capabilities. This allows parameter-efficient fine-tuning while the original visual encoder parameters stay frozen, giving the original visual language model temporal perception while preserving its effective encoding of visual features. In some specific embodiments of this application, S3, using a visual encoder containing a temporal-aware adapter to extract dynamic visual features from the support set video and the query set video to be identified, respectively, can be achieved through the following steps:
S31, project the T-frame sequence of the input video into feature tokens with a preset feature dimension D, where each frame contains N patch tokens.

[0050] For example, T=8, D=512, N=196.

[0051] In addition, an extra global feature token ([CLS] Token) is introduced for each frame, so the total number of tokens actually processed per frame is N+1.
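The token counts implied by these example values can be checked with a few lines of arithmetic. The frame size and patch size in the last lines are assumptions (a 224 × 224 frame cut into 16 × 16 patches), chosen only because they reproduce N = 196.

```python
T, N, D = 8, 196, 512                    # frames, patch tokens per frame, feature dimension
tokens_per_frame = N + 1                  # patch tokens plus one global [CLS] token
tokens_per_video = T * tokens_per_frame   # 8 * 197
values_per_video = tokens_per_video * D   # scalar entries in the full token tensor

# Assumed geometry: a 224 x 224 frame with 16 x 16 patches gives a 14 x 14 grid.
grid = 224 // 16
patches_from_grid = grid * grid
```

Under these values a video yields 1576 tokens of dimension 512, which is the tensor that every Transformer block processes.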

[0052] S32, the feature tokens are input into the backbone network of the visual encoder. The backbone network consists of stacked Transformer blocks, each of which embeds a temporal-aware adapter with a bottleneck structure.

[0053] Specifically, the bottleneck structure refers to the network design used within the temporal-aware adapter (TA-Adapter), which first reduces the dimensionality, then processes the features, and then restores the dimensionality. Its data dimensions follow a "wide, narrow, wide" pattern, resembling the shape of a bottleneck. The dimensional changes are as follows:
Input: The feature dimension is the preset feature dimension D (consistent with the backbone network).

[0054] Intermediate layer: The dimension is compressed to the bottleneck dimension d through a linear downsampling layer (where d is less than D and both are positive integers).

[0055] Output: The dimension is restored to the preset feature dimension D through a linear upsampling layer.

[0056] Such a design allows key temporal computations (such as the temporal multi-head self-attention mechanism T-MSA) to be performed in a low-dimensional space d.

[0057] For example, based on the default CLIP model, the backbone network has 12 Transformer Blocks, each with a time-aware adapter.

[0058] S33, within each Transformer block, the following processing is performed in parallel: S331, the input features are processed through the original mechanism of the frozen Transformer block to obtain the original visual tokens; S332, the input features are processed through the temporal-aware adapter to obtain temporal dynamic information. The temporal-aware adapter reduces the feature dimension from the preset feature dimension D to the bottleneck dimension d through a linear downsampling layer, uses a temporal multi-head self-attention mechanism to exchange motion cues along the time dimension, and restores the dimension to the preset feature dimension D through a linear upsampling layer. For the temporal attention, the feature tokens are rearranged from a T × (N+1) × d layout to an (N+1) × T × d layout, where T is the number of frames, N is the number of patch tokens, and N+1 indicates that an additional feature token is introduced to represent the global feature.
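The rearrangement used for temporal attention can be illustrated with plain nested lists. This is only a sketch of the layout change: the function name `to_time_major` is hypothetical, and each token is replaced by a `(frame, position)` label so the regrouping is easy to see.

```python
def to_time_major(tokens):
    """Rearrange tokens[t][n] (T frames x N+1 tokens) into out[n][t]."""
    num_frames = len(tokens)
    tokens_per_frame = len(tokens[0])
    return [[tokens[t][n] for t in range(num_frames)]
            for n in range(tokens_per_frame)]

# Tiny example: T=3 frames, N+1=2 tokens per frame, each token labelled (t, n).
tokens = [[(t, n) for n in range(2)] for t in range(3)]
time_major = to_time_major(tokens)

# Each row now holds one spatial position across all frames, which is the
# sequence a temporal self-attention layer would attend over.
print(time_major[0])  # [(0, 0), (1, 0), (2, 0)]
```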

[0059] Due to the bottleneck structure design, the temporal multi-head self-attention mechanism is computed in a low-dimensional space, significantly reducing the number of trainable parameters (the number of parameters is proportional to the square of the dimension). The temporal attention mechanism (T-MSA) has high computational complexity. Executing it in a low-dimensional space d significantly reduces floating-point operations (FLOPs). The adapter's input and output dimensions are both D, completely consistent with the visual encoder backbone network. In other words, this ensures that temporal motion information is successfully injected into the originally static visual model without disrupting the pre-trained model structure or adding excessive computational burden.
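To make the parameter argument concrete, the sketch below counts the weights of a self-attention layer at full width D and compares them with a bottleneck adapter. The bottleneck width d = 64 and the inclusion of bias terms are assumptions; the text only requires d < D.

```python
def attention_params(dim):
    """Weights and biases of the Q, K, V and output projections of one MSA layer."""
    return 4 * (dim * dim + dim)

def adapter_params(D, d):
    """Down-projection (D -> d), temporal attention at width d, up-projection (d -> D)."""
    down = D * d + d
    up = d * D + D
    return down + attention_params(d) + up

D, d = 512, 64  # d = 64 is an assumed bottleneck width
full = attention_params(D)
bottleneck = adapter_params(D, d)
print(full, bottleneck)  # 1050624 82752
```

Under these assumptions the adapter carries roughly one twelfth of the parameters that a temporal attention layer at the full width D would need.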

[0060] S34, residual connections integrate the temporal dynamic information into the original visual tokens and output the visual features of the current Transformer block; this continues until the visual features of the video are obtained after passing through all Transformer blocks.

[0061] Specifically, S33 and S34 are expressed as two steps, backbone network processing followed by temporal-aware adapter fusion, in which: the input features of the l-th Transformer block are the output features of the (l-1)-th Transformer block; the intermediate output features of the l-th Transformer block are obtained after processing by the backbone network; and the final output features of the l-th Transformer block are obtained after fusion by the temporal-aware adapter and are used as the input of the (l+1)-th block. LN stands for Layer Normalization, MSA for Multi-head Self-Attention, and FFN for Feed-Forward Network. TA-Adapter stands for the temporal-aware adapter module, which extracts and fuses temporal dynamic information.
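One plausible way to write the two steps is sketched below. It is a reconstruction from the operators named in the text (LN, MSA, FFN, TA-Adapter), not the formulas of the filing. The symbols X_{l-1}, X'_l, \hat{X}_l, X_l, W_down and W_up are assumed notation, and a standard pre-norm Transformer block is assumed for the frozen backbone.

```latex
% Assumed notation: X_{l-1} is the input of block l, \hat{X}_l the output of the
% frozen backbone path, and X_l the output after adapter fusion.
\begin{aligned}
X'_l      &= X_{l-1} + \mathrm{MSA}\big(\mathrm{LN}(X_{l-1})\big) \\
\hat{X}_l &= X'_l + \mathrm{FFN}\big(\mathrm{LN}(X'_l)\big) \\
X_l       &= \hat{X}_l + \mathrm{TA\text{-}Adapter}(X_{l-1}) \\
\mathrm{TA\text{-}Adapter}(X) &= W_{\mathrm{up}}\,\mathrm{T\text{-}MSA}\big(W_{\mathrm{down}}\,X\big)
\end{aligned}
```

The third line is the residual fusion of S34: the adapter reads the same input as the frozen block (the parallel paths of S33) and its output is added to the backbone result.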

[0062] The above embodiments, by introducing a temporal-aware adapter module (TA-Adapter), inject lightweight temporal reasoning capability into the frozen visual language model. This achieves an effective shift from static object recognition to dynamic motion perception and addresses the tendency of existing technologies to over-rely on static appearance features while ignoring how motion evolves.

[0063] To address the semantic gap and the lack of interaction modeling, this application uses hybrid semantic tags to index and model the relevant semantic representations in the visual features, explicitly capturing fine motion trajectories that are obscured by global features. In some specific implementations, in S4, interaction modeling is performed between the hybrid semantic tags and the dynamic visual features to capture the spatiotemporal interaction trajectories of semantic entities in the video, generating interaction-aware features for both the support set video and the query set video to be identified. This can be achieved through the following steps: S41, the hybrid semantic tags (X=3, Y=3) are initialized from a semantic prior by combining the action-related entity information in the constructed action knowledge base with learnable semantic tags.

[0064] Specifically, the hybrid semantic tags include: knowledge-driven prior semantic tags, which are initialized with the text embeddings of predefined semantic entities (including people, objects, and body parts) extracted from the action knowledge base and are used to locate specific semantic regions in the video; and data-driven learnable tags, which are randomly initialized and updated as the network is trained and are used to capture potential background cues or complex motion patterns that are not covered by the text descriptions.

[0065] S42, a dual-path hybrid interaction mechanism processes the dynamic visual features to obtain motion pattern features. Specifically, Path 1 uses intra-frame multi-head self-attention layers to model the spatial relationships between different semantic entities within the same frame; Path 2 uses a cross-frame multi-head self-attention layer to track the dynamic evolution of the same semantic entity over time; the outputs of the two paths are fused, and the motion pattern features are then extracted via a feedforward network.

[0066] S43, position encoding is applied to the motion pattern features to obtain semantically enhanced features; S44, the semantically enhanced features are aggregated using attention weights to obtain region features; S45, the generated region features are concatenated with the global image features, which preserves the model's ability to perceive global contextual information, to obtain the interactive perception features.
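The dual-path mechanism of S42 can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: attention is single-head with identity projections, the two paths are fused by element-wise averaging (the text says only that they are fused), and the feedforward network is omitted. The names `self_attention` and `dual_path` are hypothetical.

```python
import math

def self_attention(seq):
    """Single-head self-attention with identity projections over a list of vectors."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(dim)])
    return out

def dual_path(entity_feats):
    """entity_feats[t][e] is the feature vector of entity e in frame t."""
    T, E = len(entity_feats), len(entity_feats[0])
    # Path 1: attention among the entities of one frame (spatial relations).
    intra = [self_attention(entity_feats[t]) for t in range(T)]
    # Path 2: attention along time for one entity (temporal evolution).
    per_entity = [self_attention([entity_feats[t][e] for t in range(T)])
                  for e in range(E)]
    cross = [[per_entity[e][t] for e in range(E)] for t in range(T)]
    # Fusion by averaging is an assumption made for this sketch.
    return [[[(a + b) / 2 for a, b in zip(intra[t][e], cross[t][e])]
             for e in range(E)] for t in range(T)]

feats = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 2.0]]]  # T=2, E=2, dim=2
fused = dual_path(feats)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 2 2 2
```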

[0067] The specific process is defined as follows: The above embodiments, by combining knowledge-driven prior semantic tags and data-driven learnable tags, can accurately identify and track the spatial displacement trajectories and interactions of key semantic entities in videos, capture fine-grained motion cues that are obscured by global representations, and greatly enhance the model's ability to discriminate complex human-object interaction actions.
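The aggregation in S44 and the concatenation in S45 can be sketched as below, using the dot-product similarity followed by Softmax that claim 5 describes for the attention weights. The position encoding of S43 is left out, the concatenation is assumed to be along the token axis, and all names and toy values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def region_features(tags, patch_feats):
    """One region feature per tag: softmax-weighted sum of the patch features."""
    dim = len(patch_feats[0])
    regions = []
    for tag in tags:
        weights = softmax([dot(tag, p) for p in patch_feats])
        regions.append([sum(w * p[i] for w, p in zip(weights, patch_feats))
                        for i in range(dim)])
    return regions

tags = [[1.0, 0.0], [0.0, 1.0]]                  # 2 hybrid semantic tags
patches = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]   # 3 patch features of one frame
global_feat = [0.5, 0.5]                         # e.g. the [CLS] feature

regions = region_features(tags, patches)
# Assumed: concatenation along the token axis keeps the global context.
interaction_aware = regions + [global_feat]
print(len(interaction_aware))  # 3
```

Each tag pulls its region feature toward the patches it is most similar to, so the first tag's region is dominated by the first patch and the second tag's region by the second.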

[0068] To counter the sparsity of action samples in few-sample scenarios, the interactive perception features are anchored to robust atomic action description features using the hierarchical action semantic descriptions in the action knowledge base constructed in S2, which further strengthens the prototype feature representation. In some specific embodiments of this application, in S5, the interactive perception features of the support set video are divided into multiple segments along the time dimension and aligned cross-modally with the hierarchical semantic descriptions in the action knowledge base to construct action prototypes. This can be achieved through the following steps:

[0069] S51, semantic anchoring. Based on the hierarchical semantic description in the action knowledge base, the interactive perception features are anchored to the semantic prior to resolve the representation ambiguity caused by sample sparsity in few-sample scenarios.

[0070] S52, temporal segmentation. The interactive perception features are divided into multiple segments along the time dimension, and the number of segments equals the number of stages in the hierarchical semantic description.

[0071] Specifically, under the default settings, for a few-sample support set video with 8 sampled frames, frames 1-4, 2-6, and 5-8 are used as the corresponding temporal feature segments.

[0072] S53, cross-modal alignment and prototype construction. The visual features of each segment are aligned across modalities with the feature representation of the corresponding stage of the hierarchical semantic description. An attention mechanism aligns the visual features with the corresponding semantic prior at each stage to generate the action prototypes.

[0073] Specifically, S51-S53 above use structured language priors to incrementally calibrate the visual features and generate the action prototype. The quantities involved are: the interactive perception features obtained from S4; the global image feature; the text features of the parsed hierarchical semantic description; and the features of each segment of a T-frame video after it has been evenly divided into three segments. The number of segments corresponds to the number of stages to ensure alignment and fusion. MSA refers to the Multi-head Self-Attention layer in the Transformer block.
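Temporal segmentation and stage-wise fusion can be sketched as follows. The sketch makes two simplifying assumptions: frames are split into contiguous, non-overlapping segments (the default example above uses overlapping frame ranges), and the attention-based alignment is replaced by adding each stage's text feature to the segment mean and then averaging over stages. The function names are hypothetical.

```python
def split_evenly(num_frames, num_segments):
    """Return (start, end) index pairs, end exclusive, covering all frames."""
    base, extra = divmod(num_frames, num_segments)
    bounds, start = [], 0
    for i in range(num_segments):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def build_prototype(frame_feats, stage_text_feats):
    """Average each segment, add its stage text feature, then average over stages."""
    bounds = split_evenly(len(frame_feats), len(stage_text_feats))
    fused = []
    for (start, end), text in zip(bounds, stage_text_feats):
        segment = mean(frame_feats[start:end])
        fused.append([v + t for v, t in zip(segment, text)])
    return mean(fused)

print(split_evenly(8, 3))  # [(0, 3), (3, 6), (6, 8)]
```

Because the number of segments is taken from the number of stage descriptions, each visual segment always has exactly one text feature to pair with, which is the correspondence the text requires.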

[0074] Ultimately, a robust action prototype is constructed that combines local interaction cues with temporal evolution logic.

[0075] The above embodiments align the hierarchical action semantic descriptions generated by the large language model with the visual features at different stages of video evolution across modalities, ensuring that the visual features maintain semantic consistency even in scenarios with very few samples. By introducing fine-grained textual prior guidance, action prototypes with strong robustness and high generalization are constructed.

[0076] In some specific embodiments of this application, in S6, the matching distance between the interactive perception features of the query set video and the action prototype is calculated, and the action category to which the query set video belongs is determined based on the matching result. As shown in Figure 3, this includes: S61, calculate the visual matching score between the interactive perception features of the query set video (also known as the query visual representation) and the action prototype of the support set, and the semantic matching score between the interactive perception features of the query set video and the text embedding (action text description) of the action category; S62, perform a weighted fusion of the visual matching score and the semantic matching score to obtain the total matching score; S63, determine the action category to which the query set video belongs based on the total matching score.

[0077] The above embodiments achieve multimodal complementarity through a weighted fusion of visual matching and semantic matching. By utilizing the visual representation capabilities of action prototypes and combining the semantic prior knowledge of text embedding, the limitations of single-modal matching are effectively alleviated, significantly improving the accuracy and robustness of action classification in scenarios with few samples.
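The weighted fusion of S61 to S63 can be sketched as below. Cosine similarity as the matching score and the fusion weight `alpha` are assumptions; the text does not specify the distance measure or the weighting, and the toy vectors are made up for illustration.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def classify(query, prototypes, text_embeds, alpha=0.5):
    """Return the class with the highest weighted sum of the two scores."""
    totals = {}
    for name in prototypes:
        visual = cosine(query, prototypes[name])      # S61: visual matching score
        semantic = cosine(query, text_embeds[name])   # S61: semantic matching score
        totals[name] = alpha * visual + (1.0 - alpha) * semantic  # S62: fusion
    return max(totals, key=totals.get), totals        # S63: pick the best class

prototypes = {"fall": [1.0, 0.1], "walk": [0.1, 1.0]}
text_embeds = {"fall": [0.9, 0.2], "walk": [0.2, 0.9]}
label, totals = classify([0.8, 0.3], prototypes, text_embeds)
print(label)  # fall
```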

[0078] Based on the same technical concept, other embodiments of this application provide a few-sample action recognition system 100, as shown in Figure 4, comprising: a data acquisition module 110, which acquires action category labels, a few-sample support set of videos, and a query set of videos to be identified; a knowledge base construction module 120, which obtains the action category labels and the few-sample support set of videos, uses a large language model to parse the action category labels, and constructs an action knowledge base containing semantic entities and hierarchical semantic descriptions; a temporal-aware adapter module 130, which uses a visual encoder containing a temporal-aware adapter to extract dynamic visual features of the support set video and the query set video to be identified, respectively; a hybrid semantic interaction module 140, which constructs hybrid semantic tags based on the semantic entities, performs interaction modeling with the dynamic visual features, captures the spatiotemporal interaction trajectory of the semantic entities, and generates interactive perception features; a hierarchical semantic anchoring module 150, which, based on the interactive perception features of the support set video, divides them into segments along the time dimension and aligns them cross-modally with the hierarchical semantic description to construct action prototypes; and a multimodal action prototype matching module 160, which calculates the matching distance between the interactive perception features of the query set video and the action prototype and determines the action category to which the query set video belongs.

[0079] The specific implementation techniques of each module / unit in the above examples of this application can be referred to the steps of the few-sample action recognition method in the above embodiments, and will not be repeated here.

[0080] The preferred features in the above embodiments can be used individually in any embodiment, or in any combination thereof, provided they do not conflict with each other. Furthermore, parts not described in detail in the embodiments can be implemented using existing technologies.

[0081] The following examples and comparative examples will be used to further illustrate this application in order to better understand the above-mentioned technical solutions. It should be understood that the following are only some examples and are not intended to limit this application.

[0082] Comparative Example 1: The performance of the method in this application was compared with that of other existing methods under the settings of multiple datasets; the specific data are shown in Figure 5.

[0083] The results in Figure 5 show that the few-shot action recognition method provided in this application achieves state-of-the-art (SOTA) performance on multiple public benchmark datasets. Specifically: as shown in Table 1, the recognition accuracy of the proposed method (S2M) is superior to existing mainstream methods (such as CLIP-FSAR and LGA) on multiple datasets, including HMDB51, Kinetics, UCF101, and SSv2. In particular, under the 1-shot setting of HMDB51, the accuracy reaches 84.5%, an improvement of about 3.7% over the second-best method.

[0084] Even in extremely low-sample (1-shot) scenarios, the proposed method still maintains a significant performance advantage, demonstrating that introducing semantic priors through an action knowledge base can effectively alleviate the overfitting problem caused by sample sparsity.

[0085] On the time-sensitive SSv2 dataset, the proposed method achieved 70.3% (1-shot) and 80.3% (5-shot) accuracy, significantly outperforming the comparison methods and validating that the time-aware adapter and spatiotemporal interaction modeling module can effectively capture the dynamic evolution features of actions.

[0086] Application Example 1: Figure 6 presents the attention visualization results of the proposed method on different action instances (such as Pushup and Pick). As shown in Figure 6, the model accurately focuses on key semantic regions (such as hands and objects), which indicates that this application effectively captures the spatiotemporal interaction patterns of fine-grained semantic entities in different actions and verifies the effectiveness of the hybrid semantic interaction module.

[0087] Application Example 2: The few-shot action recognition method provided in this application can be widely applied to daily security monitoring scenarios, such as nursing homes, hospital corridors, public areas of campuses and shopping malls, and warehouse passageways. In these scenarios, monitoring systems often need to promptly identify high-risk behaviors such as falls, theft, and vandalism while operating for long periods under complex lighting and occlusion conditions, and must minimize false alarms and missed detections. Traditional monitoring solutions typically rely on sufficient labeled data for supervised training or on engineering pipelines based on pose, trajectory, and rule thresholds. When categories are expanded or scenes change, these solutions often require re-collecting and labeling data or repeatedly adjusting rules, which is costly and makes it difficult to cover the constantly evolving long-tail behaviors of the real world.

[0088] In contrast, the method in this application transforms action recognition into a problem of rapidly generalizing to new categories from a small number of examples and explicitly introduces text parsing of action labels. By decomposing category names such as "falling" and "stealing" into interpretable semantic elements and stage cues, such as subject, key object interaction, motion patterns, and temporal structure, these textual semantics are used to guide the alignment and matching of video representations.

[0089] Specifically, the small sample identification process for falls is as follows: First, for the behavior of falling, the large language model is used to parse it into key semantic elements such as "rapid descent of the body's center of gravity" and "contact with the ground", as well as hierarchical temporal stages of "standing, losing balance and falling".
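The parsed result for the "falling" label could be stored in a record like the one below. This is a hypothetical data layout: the class name `ActionEntry` and its field names are assumptions, while the entity, element, and stage strings are taken from the description of the fall example.

```python
from dataclasses import dataclass, field

@dataclass
class ActionEntry:
    """One action knowledge base record: entities, key elements, ordered stages."""
    label: str
    entities: list = field(default_factory=list)
    elements: list = field(default_factory=list)
    stages: list = field(default_factory=list)

fall = ActionEntry(
    label="falling",
    entities=["human body", "ground"],
    elements=["rapid descent of the body's center of gravity",
              "contact with the ground"],
    stages=["standing", "losing balance", "falling"],
)

# The number of stages determines how many temporal segments the support
# video is divided into when the prototype is built.
print(len(fall.stages))  # 3
```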

[0090] Then, during feature extraction, the temporal-aware adapter captures the rapid downward movement trajectory of the human body, while the hybrid semantic interaction module guides the model to focus on changes in the spatial relationship between the human body and the ground.

[0091] Subsequently, a robust prototype of the "falling" action is built from a small set of support videos and the aforementioned text description.

[0092] During the recognition phase, the features of the video to be recognized are matched with the prototype. Even when faced with people of different body types, different lighting conditions, or occlusion, the model can make an accurate determination based on semantic consistency.

[0093] The method presented in this application uses prior textual semantic constraints to reduce the model's accidental dependence on the background or on particular people, so it remains applicable even when the few support videos cover the behavior incompletely. It also relies on semantic consistency for discrimination when the camera angle, lighting, or crowd distribution changes, achieving cross-scene generalization. Furthermore, when new behaviors need to be monitored, only a few examples and updated label text descriptions are required for the model to adapt quickly, which significantly reduces data collection and annotation costs and makes the method better suited to constantly changing long-tail risk events.

[0094] Based on the same technical concept, other embodiments of this application provide a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it can be used to perform the above-described method or run the above-described system.

[0095] Optionally, the memory is used to store programs; the memory may include volatile memory, such as random-access memory (RAM), such as static random-access memory (SRAM), double data rate synchronous dynamic random-access memory (DDR SDRAM), etc.; the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, etc., and the aforementioned computer programs and computer instructions can be partitioned and stored in one or more memories. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by the processor.

[0096] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.

[0097] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.

[0098] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.

[0099] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0100] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0101] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0102] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0103] The foregoing has described some specific embodiments of this application. It should be understood that this application is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the substantive content of this application. The above-described preferred features can be used in any combination without conflict.

Claims

1. A few-sample action recognition method, characterized in that it comprises: obtaining action category labels, few-sample support set videos, and query set videos to be identified; parsing the action category labels using a large language model to construct an action knowledge base containing semantic entities and hierarchical semantic descriptions; using a visual encoder that includes a temporal-aware adapter to extract dynamic visual features from the support set video and the query set video to be identified, respectively; constructing hybrid semantic tags based on the semantic entities and performing interaction modeling with the dynamic visual features to capture the spatiotemporal interaction trajectory of the semantic entities and generate interactive perception features; based on the interactive perception features of the support set video, dividing them into segments along the time dimension and aligning them cross-modally with the hierarchical semantic description to construct action prototypes; and calculating the matching distance between the interactive perception features of the query set video and the action prototype to determine the action category to which the query set video belongs.

2. The few-sample action recognition method according to claim 1, characterized in that parsing the action category labels using a large language model and constructing an action knowledge base containing semantic entities and hierarchical semantic descriptions comprises: building preset prompt instructions; using the pre-trained large language model to parse the action category labels according to the prompt instructions and generate structured semantic information including perspective analysis, temporal analysis, atomic action description optimization, and action instance extraction; and constructing the action knowledge base from the structured semantic information, the action knowledge base including a set of semantic entities and hierarchical action semantic descriptions.

3. The few-sample action recognition method according to claim 1, characterized in that extracting dynamic visual features from the support set video and the query set video using a visual encoder containing a temporal-aware adapter comprises: dividing each frame of the input video's T-frame sequence into N patch tokens and encoding each patch token into a preset feature dimension D to obtain the feature tokens of the input video; inputting the feature tokens into the backbone network of the visual encoder, which is composed of stacked Transformer blocks, each of which embeds a temporal-aware adapter with a bottleneck structure; in each Transformer block, performing the following processing: processing the input feature tokens through the original mechanism of the frozen Transformer block to obtain the original visual tokens; processing the input feature tokens through the temporal-aware adapter to obtain temporal dynamic information, wherein the temporal-aware adapter reduces the feature dimension from the preset feature dimension D to the bottleneck dimension d through a linear downsampling layer, exchanges motion cues along the time dimension using a temporal multi-head self-attention mechanism, and restores the dimension to the preset feature dimension D through a linear upsampling layer, D and d both being positive integers with d less than D; and incorporating the temporal dynamic information into the original visual tokens using residual connections and outputting the visual features of the current Transformer block, until the visual features of the video are obtained after passing through all Transformer blocks.

4. The few-sample action recognition method according to claim 1, characterized in that the hybrid semantic tags include: knowledge-driven prior semantic tags, constructed based on the semantic entities in the action knowledge base and used to locate the semantic regions in the video that correspond to the semantic entities; and data-driven learnable tags, randomly initialized and updated during training and used to capture potential cues or motion patterns not covered by the hierarchical semantic description.

5. The few-sample action recognition method according to claim 4, characterized in that constructing hybrid semantic tags based on the semantic entities, performing interaction modeling with the dynamic visual features, capturing the spatiotemporal interaction trajectory of the semantic entities, and generating interactive perception features comprises: calculating the dot-product similarity between the hybrid semantic tags and the dynamic visual features and generating attention weights using Softmax normalization; processing the dynamic visual features using a dual-path hybrid interaction mechanism to obtain motion pattern features; applying position encoding to the motion pattern features to obtain semantically enhanced features; weighting and aggregating the semantically enhanced features using the attention weights to obtain region features; and concatenating the generated region features with the global image features, which preserves the model's ability to perceive global contextual information, to obtain the interactive perception features.

6. The few-sample action recognition method according to claim 5, characterized in that processing the dynamic visual features using a dual-path hybrid interaction mechanism to obtain motion pattern features comprises: Path 1, using intra-frame multi-head self-attention layers to model the spatial relationships between different semantic entities within the same frame; Path 2, using a cross-frame multi-head self-attention layer to track the dynamic evolution of the same semantic entity over time; and fusing the outputs of the two paths and extracting the motion pattern features via a feedforward network.

7. The few-sample action recognition method according to claim 1, characterized in that constructing action prototypes based on the interactive perception features of the support set video, by dividing them into segments along the time dimension and aligning them cross-modally with the hierarchical semantic description, comprises: dividing the interactive perception features equally into L corresponding temporal segment features, where L equals the number of stages in the hierarchical action semantic description and L is an integer greater than or equal to 2; obtaining the text embedding corresponding to the hierarchical action semantic description and, for the support set video, fusing the text embedding of each stage with the corresponding temporal segment features; using a temporal Transformer module to perform cross-modal alignment inference on the fused features so that the visual features are aligned with the corresponding semantics; and generating, based on the aligned features, the action prototypes corresponding to the support set videos.

8. The few-sample action recognition method according to claim 1, characterized in that calculating the matching distance between the interactive perception features of the query set video and the action prototype and determining the action category to which the query set video belongs comprises: calculating the visual matching score between the interactive perception features of the query set video and the action prototype, and the semantic matching score between the interactive perception features of the query set video and the text embedding of the action category; performing a weighted fusion of the visual matching score and the semantic matching score to obtain the total matching score; and determining the action category to which the query set video belongs based on the total matching score.

9. A few-sample action recognition system, characterized in that it comprises: a data acquisition module, which acquires action category labels, a few-sample support set of videos, and a query set of videos to be identified; a knowledge base construction module, which uses a large language model to parse the action category labels and construct an action knowledge base containing semantic entities and hierarchical semantic descriptions; a temporal-aware adapter module, which uses a visual encoder containing a temporal-aware adapter to extract dynamic visual features of the support set video and the query set video to be identified, respectively; a hybrid semantic interaction module, which constructs hybrid semantic tags based on the semantic entities, performs interaction modeling with the dynamic visual features, captures the spatiotemporal interaction trajectory of the semantic entities, and generates interactive perception features; a hierarchical semantic anchoring module, which, based on the interactive perception features of the support set video, divides them into segments along the time dimension and aligns them cross-modally with the hierarchical semantic description to construct action prototypes; and a multimodal action prototype matching module, which calculates the matching distance between the interactive perception features of the query set video and the action prototype and determines the action category to which the query set video belongs.

10. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it can be used to perform the method of any one of claims 1-8 or to run the system of claim 9.
