Fine-grained multi-modal video behavior recognition method and system guided by motion saliency

By employing a motion saliency-guided fine-grained multimodal video behavior recognition method, which combines temporal segments and image patch selection from RGB and depth videos, efficient fusion of cross-modal features is achieved. This solves the problem of insufficient information fusion in existing RGB-D behavior recognition methods, thereby improving recognition accuracy and robustness.

CN121999302B · Active · Publication Date: 2026-06-26 · SHANDONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG UNIV
Filing Date
2026-04-08
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing RGB-D behavior recognition methods struggle to effectively integrate multimodal information, resulting in limited recognition accuracy. This is especially true in video-level behavior recognition tasks, where existing methods fail to fully exploit complementary information and fine-grained interaction relationships between modalities.

Method used

The patent proposes a fine-grained multimodal video behavior recognition method and system guided by motion saliency. The RGB video and the depth video are uniformly divided into time segments, and RGB-DRDI image pairs are constructed. A motion saliency-guided image patch selection mechanism selects image patches, which are input into the CLIP model for fine-grained multimodal interactive learning. Behavior recognition is then performed by combining text features.

Benefits of technology

It realizes the linkage and interaction of RGB static features and depth dynamic features at the image patch level, which enhances the discriminative ability and robustness of the video representation, reduces computational redundancy, and improves the accuracy of behavior recognition.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121999302B_ABST

Abstract

The application discloses a fine-grained multi-modal video behavior recognition method and system guided by motion saliency, and the method comprises the following steps: selecting the first K image blocks with the most motion saliency from all image blocks corresponding to the DRDI image of each time segment as dynamic image blocks; selecting the corresponding K image blocks from the RGB image as static image blocks; splicing the dynamic and static image blocks under each time segment, and inputting the spliced image blocks into a visual encoder for fine-grained multi-modal interactive learning to obtain the feature representation corresponding to the current time segment; processing the feature representations of all time segments to obtain a video global feature representation; extracting a text feature representation from a text description of a behavior category; calculating the similarity between the video global feature representation and the text feature representation, and determining a video behavior recognition result according to the similarity. The application can improve the accuracy of video behavior recognition.

Description

Technical Field

[0001] This invention relates to the field of video human behavior recognition technology, and in particular to a fine-grained multimodal video behavior recognition method and system guided by motion saliency.

Background Technology

[0002] Video human behavior recognition is a fundamental task in the field of computer vision, and it has attracted much attention due to its wide range of applications in human-computer interaction, intelligent monitoring, and service robots.

[0003] For RGB-D behavior recognition, researchers have proposed various information fusion strategies. Currently, the mainstream methods mainly employ decision-level fusion or feature-level fusion schemes.

[0004] For decision-level fusion methods, each modality's data is processed independently, and the overall prediction is obtained by averaging or weighting the recognition results of individual channels. However, such methods typically model each modality separately during training and only fuse the outputs at the final stage, failing to fully exploit the complementary information between modalities, thus limiting the overall discriminative ability.
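The decision-level scheme described above can be illustrated with a minimal sketch. The score values and the `rgb_weight` parameter are hypothetical; the two score lists stand in for the outputs of two independently trained single-modality models.

```python
def fuse_decisions(rgb_scores, depth_scores, rgb_weight=0.5):
    """Weighted late fusion of per-class scores from two separate models."""
    if len(rgb_scores) != len(depth_scores):
        raise ValueError("both modalities must score the same classes")
    depth_weight = 1.0 - rgb_weight
    return [rgb_weight * r + depth_weight * d
            for r, d in zip(rgb_scores, depth_scores)]


def predict(scores):
    """Index of the class with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-class probabilities from an RGB model and a depth model.
rgb_scores = [0.6, 0.3, 0.1]
depth_scores = [0.2, 0.7, 0.1]
fused = fuse_decisions(rgb_scores, depth_scores)  # equal weights = averaging
prediction = predict(fused)
```

The two modalities meet only in `fuse_decisions`, after each model has already produced its scores, which is the limitation the paragraph points out.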

[0005] In contrast, feature-level fusion methods attempt to jointly model features from different modalities. Some methods simply concatenate single-modal features or fuse them by weighting in the last layer, while others achieve cross-modal feature interaction through a backbone network structure during feature learning. However, most of these methods rely on relatively coarse feature aggregation mechanisms, which cannot effectively characterize fine-grained interaction relationships between different modalities and struggle to capture local cross-modal dependencies, resulting in limited fused feature representation capabilities and limited performance improvements.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, this invention provides a fine-grained multimodal video behavior recognition method and system guided by motion saliency.

[0007] On the one hand, a fine-grained multimodal video behavior recognition method guided by motion saliency is provided, including:

[0008] Obtain the RGB video and its corresponding depth video to be used for behavior recognition, divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments.

[0009] The RGB image and DRDI image in each time segment's RGB-DRDI image pair are evenly divided into several non-overlapping image blocks;

[0010] From all the image patches corresponding to the DRDI image in each time segment, select the top K image patches with the most significant motion as dynamic image patches; from the RGB image in the same time segment as the DRDI image, also select K image patches with corresponding spatial locations as static image patches;

[0011] The dynamic and static image blocks in each time segment are stitched together, and the stitched image blocks are input into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; the feature representations of all time segments are subjected to average pooling to obtain the global feature representation of the video.

[0012] The text description of the behavior category is input into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; the similarity between the global feature representation of the video and the text feature representation is calculated, and the video behavior recognition result is determined based on the text description corresponding to the maximum similarity.
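The final matching step can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the similarity measure (the text above says only "similarity"; cosine similarity is what CLIP conventionally uses). The feature vectors and category descriptions are made-up placeholders for encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recognize(video_feature, text_features):
    """Return the best-matching description and all similarities."""
    sims = {label: cosine_similarity(video_feature, feat)
            for label, feat in text_features.items()}
    best = max(sims, key=sims.get)
    return best, sims


# Placeholder vectors standing in for the video and text encoder outputs.
video_feature = [0.9, 0.1, 0.2]
text_features = {
    "a person waving a hand": [1.0, 0.0, 0.1],
    "a person sitting down": [0.0, 1.0, 0.0],
}
label, sims = recognize(video_feature, text_features)
```

The recognition result is the description with the maximum similarity, here stored in `label`.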

[0013] On the other hand, a fine-grained multimodal video behavior recognition system guided by motion saliency is provided, including:

[0014] The acquisition module is configured to: acquire the RGB video to be recognized and its corresponding depth video, divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments;

[0015] The partitioning module is configured to uniformly divide the RGB image and DRDI image in each time segment's RGB-DRDI image pair into several non-overlapping image blocks.

[0016] The selection module is configured to: select the top K most motion-significant image blocks from all image blocks corresponding to the DRDI image in each time segment as dynamic image blocks; and also select K image blocks corresponding to the spatial location from the RGB image in the same time segment as the DRDI image as static image blocks.

[0017] The extraction module is configured to: stitch together dynamic and static image blocks for each time segment, input the stitched image blocks into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; and apply average pooling to the feature representations of all time segments to obtain the global feature representation of the video.

[0018] The recognition module is configured to: input the text description of the behavior category into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; calculate the similarity between the global feature representation of the video and the text feature representation; and determine the video behavior recognition result based on the text description corresponding to the maximum similarity.

[0019] The above technical solution has the following advantages or beneficial effects:

[0020] This invention achieves patch-level linkage and interaction between RGB static features and depth video dynamic features through motion saliency-guided fine-grained multimodal interactive learning. This method enables image encoders to efficiently model cross-modal associations, enhancing the discriminative power and robustness of video representations while reducing computational redundancy. This invention constructs a Depth Residual Dynamic Image (DRDI) for capturing local spatiotemporal motion information within short temporal segments.

Attached Figure Description

[0021] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0022] Figure 1 is a flowchart of the method in Example 1.

Detailed Implementation

[0023] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0024] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0025] All data acquisition in this embodiment is carried out in accordance with laws and regulations and with user consent, and the data is used legally.

[0026] Video-based human behavior recognition is a fundamental task in computer vision, attracting significant attention due to its wide applications in human-computer interaction, intelligent monitoring, and service robots. Because RGB data is readily available and easy to use, much existing research focuses on behavior recognition based on RGB video, achieving significant progress in recent years. However, RGB data primarily reflects appearance information, and its features are easily affected by factors such as lighting changes, background clutter, and viewpoint variations, making it difficult to robustly model motion dynamics and spatial structure. With the emergence of high-precision, low-cost depth sensors such as Microsoft Kinect, RGB-D behavior recognition has gradually become a research hotspot. Depth data can provide complementary geometric and structural information, is unaffected by lighting changes, and can enhance the understanding of motion, pose, and scene layout. The complementarity of RGB and depth modalities shows great potential in improving the accuracy and robustness of behavior recognition. However, how to efficiently fuse multimodal information remains a core challenge in RGB-D behavior recognition. Especially in video-level behavior recognition tasks, it is necessary to jointly model static appearance information and dynamic temporal features, which are typically distributed across different modalities and time dimensions. How to capture and fuse these heterogeneous features in a fine-grained and efficient manner remains an unsolved problem.

[0027] In existing RGB-D action recognition tasks, the recognition accuracy is limited due to the difficulty in effectively fusing complementary RGB and depth information. Most existing methods typically encode the RGB and depth data separately and then fuse them at the feature level, which limits fine-grained information interaction across modalities.

[0028] In recent years, the CLIP (Contrastive Language–Image Pre-training) model has been proposed as a powerful vision-language foundation model, demonstrating superior generalization and transfer capabilities across a wide range of visual understanding tasks. Its rich semantic representations and pre-trained visual features make it an ideal framework for multimodal learning. However, CLIP was initially designed for aligning static images with text and lacks inherent temporal modeling capabilities. Existing research has attempted to extend CLIP to video action recognition tasks by introducing external temporal modeling modules to capture motion changes across frames; other studies have introduced learnable visual or textual prompts to better adapt CLIP's static representations to action semantic modeling, achieving alignment between static features and dynamic action semantics. Despite the increasingly widespread application of CLIP in action recognition, systematic research on RGB-D multimodal learning remains lacking. In particular, how to achieve fine-grained cross-modal interaction within the CLIP framework has not yet been explored, and this mechanism is crucial for fully utilizing the complementary characteristics of the RGB and depth modalities in spatiotemporal action understanding.

[0029] Addressing the challenges of RGB-D multimodal fusion and the current lack of CLIP-based RGB-D behavior recognition research, this invention proposes a novel approach: using CLIP as a unified feature learning backbone network to achieve time-aware, fine-grained multimodal interactive learning for RGB-D behavior recognition. To this end, this invention proposes a Motion Saliency Guided Fine-Grained Multimodal Interactive Learning (MS-FMIL) framework. This framework, through a motion saliency-driven patch-level multimodal interactive learning mechanism, achieves joint modeling of static appearance and dynamic motion cues, effectively fusing RGB and depth information to construct a spatiotemporally aware multimodal behavior representation. RGB frames provide static appearance information, while the Depth Residual Dynamic Image (DRDI) represents local short-term spatiotemporal motion features by calculating the residuals between adjacent frames within short time segments and encoding them using Rank Pooling. After dividing the video sequence into several time segments, the corresponding RGB frames and DRDI image pairs for each segment are input into the MS-FMIL framework. To improve the efficiency of multimodal interactive learning, this invention designs a motion saliency-driven image patch selection mechanism. This mechanism guides the selection of more information-rich and motion-salient regions in both modalities by calculating patch-level motion saliency scores for DRDI. The selected RGB static image patches and Depth dynamic image patches are then jointly input into the CLIP image encoder to achieve efficient fine-grained interactive learning between multimodalities. This design injects local short-term context into the spatial encoding process, enabling the model to perceive motion dynamics. 
Subsequently, temporal pooling is used to aggregate the feature sequences of the entire video to obtain a global multimodal video representation vector with time-aware capabilities. In terms of semantic alignment, this invention introduces behavior category text descriptions enhanced by GPT-3.5 and achieves matching of visual and semantic features through contrastive learning, thereby further improving visual-semantic consistency and behavior recognition accuracy.

[0030] The main innovations and contributions of this invention are as follows:

[0031] (1) This invention extends CLIP to the field of RGB-D behavior recognition for the first time, and proposes a fine-grained multimodal interactive learning framework (MS-FMIL) based on motion saliency guidance, which realizes the linkage interaction between RGB static features and Depth dynamic features at the image patch level. This design enables the CLIP image encoder to efficiently model cross-modal associations, enhance the discriminative power and robustness of video representation, and reduce computational redundancy;

[0032] (2) This invention constructs a Depth Residual Dynamic Image (DRDI) to capture local spatiotemporal motion information within short time segments. By establishing a deep interaction between static appearance features and dynamic motion features, spatiotemporal context information can be directly injected into the CLIP structure without the need for an explicit temporal modeling module;

[0033] (3) This method learns static appearance information from RGB frames and dynamic motion information from DRDI, realizing a static-dynamic joint modeling mechanism with weak temporal alignment, which can obtain a video behavior representation with time awareness without the need for precise frame-level synchronization.

[0034] (4) Extensive experimental results on four publicly available RGB-D behavior recognition datasets validate the effectiveness of the method of this invention. This method significantly improves recognition performance and multimodal fusion effect while maintaining the lightweight and flexible framework.

[0035] Example 1

[0036] This embodiment provides a fine-grained multimodal video behavior recognition method guided by motion saliency;

[0037] As shown in Figure 1, the motion saliency-guided fine-grained multimodal video action recognition method includes:

[0038] S101: Obtain the RGB video and its corresponding depth video to be used for behavior recognition, divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments.

[0039] S102: Divide the RGB image and DRDI image in each time segment's RGB-DRDI image pair into several non-overlapping image blocks;

[0040] S103: Select the top K most motion-significant image blocks from all image blocks corresponding to the DRDI image in each time segment as dynamic image blocks; also select K image blocks corresponding to the spatial location from the RGB image in the same time segment as the DRDI image as static image blocks;

[0041] S104: The dynamic and static image blocks in each time segment are stitched together, and the stitched image blocks are input into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; the feature representations of all time segments are subjected to average pooling to obtain the global feature representation of the video.

[0042] S105: Input the text description of the behavior category into the text encoder of the CLIP model to extract text features and obtain the text feature representation; calculate the similarity between the global feature representation of the video and the text feature representation, and determine the video behavior recognition result based on the text description corresponding to the maximum similarity.

[0043] In this model, the input of the visual encoder is connected to the output of the patch embedding layer, and the input of the patch embedding layer is connected to the output of the motion saliency score calculation module, which receives the DRDI images as input. The output of the visual encoder is connected to the input of the similarity calculation module. The text encoder receives the text descriptions of the behavior categories as input, and its output is connected to the input of the similarity calculation module. The similarity calculation module outputs the similarity calculation results. The visual encoder and text encoder of the CLIP model, the patch embedding layer, the similarity calculation module, and the motion saliency score calculation module together constitute a video human behavior recognition model, which is defined as MS-FMIL in this application.

[0044] Further, in S101: the RGB video and its corresponding depth video are uniformly divided into T time segments, and RGB-DRDI image pairs for all time segments are constructed, including:

[0045] S101-1: Divide the RGB video evenly into T RGB time segments, and randomly sample one RGB image frame in each RGB time segment; T is a positive integer;

[0046] S101-2: Divide the depth video into T depth time segments using the same partitioning method as the RGB video. Calculate the residual frames between adjacent depth frames within each depth time segment to obtain the residual image sequence for each depth time segment. Perform rank pooling on the residual image sequence for each depth time segment to obtain the depth residual dynamic image (DRDI) corresponding to each depth time segment.

[0047] S101-3: Take the RGB image frames and depth residual dynamic images (DRDI) in the same time segment as a set of image pairs, thereby establishing RGB-DRDI image pairs for all time segments corresponding to the RGB video and its corresponding depth video.

[0048] Further, S101-1: Dividing the RGB video evenly into T RGB time segments, and randomly sampling one RGB image frame in each RGB time segment, including:

[0049] For an RGB video containing N frames, divide it evenly into T RGB time segments, denoted as $S^{rgb} = \{S^{rgb}_1, S^{rgb}_2, \dots, S^{rgb}_T\}$, where $S^{rgb}$ represents the set of RGB time segments and $S^{rgb}_t$ denotes the $t$-th RGB time segment;

[0050] One RGB frame is randomly sampled from each RGB time segment to characterize its static appearance information;

[0051] The resulting RGB sequence is recorded as $I^{rgb} = \{I^{rgb}_1, I^{rgb}_2, \dots, I^{rgb}_T\}$. This sequence is used to capture the spatial appearance features of human behavior that change over time; $I^{rgb}_t$ denotes the RGB frame corresponding to the $t$-th time segment.
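The uniform segmentation and per-segment random sampling can be sketched as below. This is a minimal illustration: the integer-division rule used to place segment boundaries when N is not a multiple of T is an assumption, since the text does not say how leftover frames are distributed. The function names are hypothetical.

```python
import random


def segment_bounds(num_frames, num_segments):
    """Split frame indices 0..num_frames-1 into contiguous, near-equal
    half-open ranges (start, end), one per time segment."""
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    return [
        (k * num_frames // num_segments, (k + 1) * num_frames // num_segments)
        for k in range(num_segments)
    ]


def sample_rgb_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index inside each time segment."""
    rng = rng or random.Random()
    return [rng.randrange(start, end)
            for start, end in segment_bounds(num_frames, num_segments)]


bounds = segment_bounds(10, 4)      # N = 10 frames, T = 4 segments
indices = sample_rgb_indices(10, 4, random.Random(0))
```

Applying `segment_bounds` with the same arguments to the depth video gives depth segments aligned with the RGB segments, as S101-2 requires.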

[0052] Furthermore, in S101-2, dividing the depth video into T depth time segments using the same partitioning method as the RGB video includes:

[0053] For the corresponding depth video, the same temporal segmentation method is used to obtain T depth time segments, denoted as $S^{d} = \{S^{d}_1, S^{d}_2, \dots, S^{d}_T\}$, where $S^{d}$ represents the set of depth time segments and $S^{d}_t$ denotes the $t$-th depth time segment.

[0054] Further, S101-2: Calculating residual frames between adjacent depth frames within each depth time segment to obtain a residual image sequence for each depth time segment, specifically includes:

[0055] Within each segment, the residual between adjacent depth frames is calculated to capture motion changes;

[0056] $$S^{d}_t = \{D_t^{1}, D_t^{2}, \dots, D_t^{n_t}\};$$

[0057] where $D_t^{i}$ is the $i$-th depth frame in the $t$-th depth time segment, and $n_t$ is the total number of frames within the segment;

[0058] $S^{d}_t$ denotes the $t$-th depth time segment;

[0059] The $i$-th residual frame of the $t$-th segment, $R_t^{i}$, is represented as:

[0060] $$R_t^{i} = D_t^{i+1} - D_t^{i}, \quad i = 1, \dots, n_t - 1, \qquad R_t = \{R_t^{1}, R_t^{2}, \dots, R_t^{n_t - 1}\};$$

[0061] where $R_t$ denotes the residual image sequence of the $t$-th segment, and $D_t^{i}$ is the $i$-th depth frame in the $t$-th depth time segment.
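The residual computation within one depth time segment can be sketched as follows. This is a minimal illustration, assuming a signed difference taken as the later frame minus the earlier one; depth frames are represented as nested lists of pixel values, and the sample values are made up.

```python
def residual_frames(depth_frames):
    """Pixel-wise differences between consecutive depth frames
    (later frame minus earlier frame); n frames give n - 1 residuals."""
    residuals = []
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        residuals.append([
            [c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)
        ])
    return residuals


# One depth time segment with three tiny 2x2 depth frames.
segment = [
    [[0, 0], [0, 0]],
    [[0, 3], [0, 0]],
    [[0, 3], [5, 0]],
]
residuals = residual_frames(segment)
```

Pixels that do not change between adjacent frames produce zeros, so the residual sequence keeps only the regions where motion occurred.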

[0062] Further, S101-2: performing rank pooling on the residual image sequence of each depth time segment to obtain the depth residual dynamic image (DRDI) corresponding to each depth time segment, specifically including:

[0063] To represent the temporal evolution of behavior within each segment, rank pooling is applied to the residual image sequence $R_t$ to generate a single depth residual dynamic image.

[0064] Rank pooling is a temporal encoding method that preserves the temporal order of a frame sequence by learning a linear ranking function over time.

[0065] First, a time-varying mean vector operation is applied to the residual image sequence $R_t$ to obtain the smoothed feature sequence $V_t = \{v_t^{1}, v_t^{2}, \dots, v_t^{n_t - 1}\}$:

[0066] $$v_t^{i} = \frac{1}{i} \sum_{j=1}^{i} R_t^{j};$$

[0067] where $v_t^{i}$ denotes the smoothed feature vector of the $i$-th residual frame in the $t$-th time segment;

[0068] The final dynamic image is obtained by optimizing a linear ranking function:

[0069] If time step $i > j$, then it satisfies $u_t^{\top} v_t^{i} > u_t^{\top} v_t^{j}$;

[0070] where the ranking vector $u_t$ is used to represent the temporal information, and $v_t^{i}$ denotes the smoothed feature vector of the $i$-th residual frame in the $t$-th time segment;

[0071] The objective function for solving the ranking vector $u_t$ is:

[0072] $$\min_{u_t,\ \xi}\ \frac{1}{2}\lVert u_t \rVert^{2} + C \sum_{i > j} \xi_{ij} \quad \text{s.t.} \quad u_t^{\top}\left(v_t^{i} - v_t^{j}\right) \ge 1 - \xi_{ij},\ \ \xi_{ij} \ge 0;$$

[0073] where $\xi_{ij}$ are slack variables and $C$ is the regularization parameter.

[0074] The optimized vector $u_t^{*}$ then undergoes a dimensional transformation (that is, the vector $u_t^{*}$ is reshaped into an image) to obtain the depth residual dynamic image (DRDI).

[0075] DRDI provides a compact and information-rich representation method that can capture motion patterns and enhance temporal modeling capabilities for behavior recognition tasks.
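Solving the ranking objective exactly requires an SVM-style optimizer. The sketch below swaps in a well-known closed-form approximation instead: a single gradient step from zero on the pairwise ranking loss, which gives the pooled vector as the sum of $(v_i - v_j)$ over all pairs $i > j$, equivalently a weighted sum of the time-varying means with coefficients $2i - n - 1$. This is an illustrative stand-in, not the optimization the patent describes. Frames are given as flattened pixel vectors, and all names are hypothetical.

```python
def time_varying_means(frames):
    """v_i = mean of the first i frame vectors (running average)."""
    means, running = [], [0.0] * len(frames[0])
    for i, frame in enumerate(frames, start=1):
        running = [r + x for r, x in zip(running, frame)]
        means.append([r / i for r in running])
    return means


def approximate_rank_pooling(frames):
    """Weighted sum of the time-varying means with weights 2i - n - 1,
    i.e. the sum of (v_i - v_j) over all pairs i > j."""
    n = len(frames)
    pooled = [0.0] * len(frames[0])
    for i, v in enumerate(time_varying_means(frames), start=1):
        coeff = 2 * i - n - 1
        pooled = [p + coeff * x for p, x in zip(pooled, v)]
    return pooled


def to_image(vector, height, width):
    """Reshape a flat vector back into a height x width image."""
    if len(vector) != height * width:
        raise ValueError("vector length must equal height * width")
    return [vector[r * width:(r + 1) * width] for r in range(height)]


# Three flattened 2x2 residual frames with motion growing in one pixel.
residual_vectors = [[0, 0, 0, 0], [1, 0, 0, 0], [2, 0, 0, 0]]
pooled = approximate_rank_pooling(residual_vectors)
drdi = to_image(pooled, 2, 2)
```

Later frames receive positive weights and earlier frames negative ones, so the pooled image scores the smoothed frames in temporal order, which is the property the ranking constraint asks for.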

[0076] The beneficial effects of the above technical solution are as follows: For the input RGB video and corresponding depth video, this invention first uniformly divides the video sequence into several time segments to simultaneously extract static appearance information and dynamic motion information for multimodal behavior recognition. Within each time segment, static frames are extracted from the RGB sequence, and a corresponding DRDI image is constructed to capture local short-term motion dynamics. DRDI can effectively capture short-term motion changes around each RGB frame, providing rich spatiotemporal context information, thus providing sufficient dynamic cues to support subsequent multimodal feature interaction learning.

[0077] Further, S102: Divide the RGB image and DRDI image in each time segment's RGB-DRDI image pair into several non-overlapping image blocks; specifically including:

[0078] The RGB and DRDI images in each time segment's RGB-DRDI image pair are evenly divided into a number of non-overlapping image blocks.

[0079] Given the RGB-DRDI image pairs of a behavior video, the image pair of the $t$-th segment is first treated as a set of multimodal inputs. The RGB image $I^{rgb}_t$ and the corresponding depth residual dynamic image $I^{drdi}_t$ are each uniformly divided into 14×14 non-overlapping patches, with each patch having a spatial size of [missing information] pixels.
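The uniform patch division can be sketched as below. This is a minimal illustration on a single-channel image stored as a list of rows; a 2×2 grid on a 4×4 image is used for readability, whereas the method uses a 14×14 grid. The row-major patch ordering is an assumption.

```python
def split_into_patches(image, grid):
    """Split a 2-D image into grid x grid non-overlapping patches,
    returned in row-major order of their grid position."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            patches.append([
                row[gc * pw:(gc + 1) * pw]
                for row in image[gr * ph:(gr + 1) * ph]
            ])
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, grid=2)
```

Running the same function on the RGB image and on the DRDI of a segment yields two patch lists in which equal indices refer to the same spatial location.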

[0080] The segmented RGB and DRDI image patches are jointly input into the CLIP model's image encoder, enabling fine-grained multimodal interactive learning at the image patch level. To improve the efficiency and discriminative power of multimodal learning, this invention designs a motion saliency-driven image patch selection mechanism. This mechanism can identify and focus on the most discriminative key spatial regions in RGB static frames and DRDI. Unlike traditional methods that treat all image patches equally, the motion saliency-guided selection strategy of this invention can dynamically filter out irrelevant region content based on the motion information of DRDI, thereby improving learning efficiency and feature discrimination ability.

[0081] Furthermore, from all image patches corresponding to the DRDI image in each time segment, the top K image patches with the most significant motion are selected as dynamic image patches; from the RGB image in the same time segment as the DRDI image, K image patches corresponding to the spatial location are also selected as static image patches, specifically including:

[0082] Calculate the motion saliency score for each image patch in the DRDI image;

[0083] Based on the motion saliency score, select the top K most motion-saliency dynamic image patches from the DRDI images;

[0084] Based on the principle of consistent spatial location, K static image blocks are selected from the RGB image, wherein the spatial location of the static image block in the RGB image corresponds to the spatial location of the dynamic image block in the DRDI image.

[0085] Further, the calculation of the motion saliency score corresponding to each image patch of the DRDI image includes:

[0086] The DRDI image of the $t$-th time segment, $I^{drdi}_t$, is input into a 2D convolutional neural network for feature learning; the network is implemented with ResNet-18. After feature learning, a spatial feature map is obtained whose spatial locations correspond one-to-one with the 14×14 non-overlapping image patches of the RGB image and the DRDI;

[0087] After performing average pooling on the spatial feature map along the channel dimension, the saliency score map is obtained by applying the sigmoid activation function. The process is represented as:

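[0088] $$M_t = \sigma\left(\mathrm{AvgPool}\left(\mathcal{F}\left(I^{drdi}_t\right)\right)\right);$$

[0089] Each value in the saliency score map $M_t$ represents an importance estimate for the corresponding image patch; $I^{drdi}_t$ is the DRDI image of the $t$-th time segment, $\mathrm{AvgPool}(\cdot)$ denotes average pooling along the channel dimension, $\sigma(\cdot)$ denotes the sigmoid activation function, and $\mathcal{F}(\cdot)$ denotes the two-dimensional convolutional neural network.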
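The scoring step (channel-wise average pooling followed by a sigmoid) can be sketched as follows. This is a minimal illustration in which a small hand-written nested list, indexed as `feature_map[channel][row][col]`, stands in for the output of the 2D convolutional network; the values are made up.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def saliency_map(feature_map):
    """Average the feature map over channels at each spatial location,
    then squash the result to (0, 1) with a sigmoid."""
    channels = len(feature_map)
    rows, cols = len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            sigmoid(sum(feature_map[c][y][x] for c in range(channels)) / channels)
            for x in range(cols)
        ]
        for y in range(rows)
    ]


# Stand-in CNN output: 2 channels over a 2x2 spatial grid.
features = [
    [[0.0, 2.0], [0.0, 0.0]],
    [[0.0, 4.0], [0.0, -2.0]],
]
scores = saliency_map(features)
```

Each entry of `scores` is the importance estimate for the patch at the same grid position; locations with stronger average activation get scores closer to 1.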

[0090] Furthermore, the step of selecting the top K most motion-salient dynamic image patches from the DRDI image based on motion saliency scores includes:

[0091] Based on the obtained motion saliency score map $M_t$, a Top-K sampling strategy selects the top K most informative dynamic image patches from the DRDI; based on the principle of consistent spatial location correspondence, K static image patches are selected from the RGB image, wherein the spatial location of each static image patch in the RGB image corresponds to the spatial location of a dynamic image patch in the DRDI image.

[0092] Thus, K static image blocks $P^{rgb}_t$ and K dynamic image blocks $P^{drdi}_t$ are obtained.
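The Top-K selection with spatially consistent RGB patches can be sketched as below. This is a minimal illustration; the patch lists are assumed to be in row-major grid order, and the string placeholders stand in for actual patch pixel data.

```python
def top_k_indices(score_map, k):
    """Flat row-major indices of the k highest-scoring patches, best first."""
    flat = [s for row in score_map for s in row]
    if not 0 < k <= len(flat):
        raise ValueError("k must be between 1 and the number of patches")
    return sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:k]


def select_patch_pairs(rgb_patches, drdi_patches, score_map, k):
    """Pick the k most salient DRDI patches and the RGB patches that sit
    at the same grid positions."""
    idx = top_k_indices(score_map, k)
    static = [rgb_patches[i] for i in idx]
    dynamic = [drdi_patches[i] for i in idx]
    return static, dynamic, idx


score_map = [[0.1, 0.9], [0.4, 0.7]]
rgb_patches = ["rgb0", "rgb1", "rgb2", "rgb3"]
drdi_patches = ["drdi0", "drdi1", "drdi2", "drdi3"]
static, dynamic, idx = select_patch_pairs(rgb_patches, drdi_patches, score_map, k=2)
```

Because one index list drives both selections, the static and dynamic patches stay aligned in space, and the encoder later receives 2K patches instead of the full two grids.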

[0093] By leveraging motion cues from DRDI, this invention enables the model to focus on key spatial regions containing discriminative temporal cues, thereby better modeling spatiotemporal features. Furthermore, the shared motion saliency score map ensures cross-modal alignment, maintaining spatial consistency between RGB and depth inputs. This invention's motion saliency image patch selection mechanism not only significantly reduces the model's computational cost but also preserves the most relevant visual information, thus achieving efficient multimodal interactive learning and highly discriminative feature representation.

[0094] After selecting motion saliency image patches, each temporal segment of the behavioral video can be represented by a pair of information-rich multimodal inputs: an RGB static image patch and a DRDI dynamic image patch. The selected image patches contain the most discriminative spatial features and motion cues, and are jointly input into the ViT visual encoder in the CLIP framework to achieve fine-grained multimodal interactive learning.

[0095] Further, S104: The dynamic and static image patches for each time segment are concatenated. The concatenated image patch is then input into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment. The feature representations of all time segments are then subjected to average pooling to obtain the global feature representation of the video. Specifically, this includes:

[0096] The static and dynamic image blocks of each time segment are first mapped to static image block embedding sequences and dynamic image block embedding sequences respectively through a linear projection layer;

[0097] Then, the static image patch embedding sequence and the dynamic image patch embedding sequence are concatenated, and spatial location coding and modality coding are added to the concatenation result to obtain the input sequence of the CLIP visual encoder;

[0098] Next, the input sequence is fed into the visual encoder of the CLIP model to obtain the output sequence of image patch tokens;

[0099] An averaging operation is applied to the output sequence of image patch tokens to obtain the feature representation corresponding to each time segment;

[0100] Linear projection is performed on the feature representation corresponding to each time segment to obtain the projection result of each time segment;

[0101] Temporal average pooling is performed on the projection results of all time segments to obtain the global feature representation of the video.
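The steps of S104 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of this invention: the toy array sizes, the parameter names (`W_embed`, `pos_embed`, `mod_embed`, `W_proj`) and the identity function `encoder_stub` that stands in for the CLIP visual encoder are all hypothetical. Adding the encodings before or after concatenation gives the same input sequence, so the sketch adds them per modality.

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 4, 6               # time segments and selected patches per modality (toy sizes)
P, D, D_OUT = 48, 16, 8   # flattened patch length, token width, joint-space width

# Hypothetical parameters, randomly initialised for illustration only.
W_embed = rng.normal(size=(P, D)) * 0.1      # shared linear projection layer
pos_embed = rng.normal(size=(K, D)) * 0.1    # spatial code of each selected location
mod_embed = rng.normal(size=(2, D)) * 0.1    # row 0: RGB (static), row 1: DRDI (dynamic)
W_proj = rng.normal(size=(D, D_OUT)) * 0.1   # projection to the common latent space


def encoder_stub(tokens):
    """Stand-in for the CLIP visual encoder; returns the tokens unchanged."""
    return tokens


def segment_feature(rgb_patches, drdi_patches):
    """Map the K static and K dynamic patches of one segment to one vector."""
    rgb_tok = rgb_patches @ W_embed + pos_embed + mod_embed[0]
    drdi_tok = drdi_patches @ W_embed + pos_embed + mod_embed[1]
    tokens = np.concatenate([rgb_tok, drdi_tok], axis=0)   # (2K, D) input sequence
    out = encoder_stub(tokens)                             # (2K, D) output tokens
    return out.mean(axis=0) @ W_proj                       # token average, then projection


def video_feature(rgb_segments, drdi_segments):
    """Temporal average pooling over the T segment-level features."""
    feats = [segment_feature(r, d) for r, d in zip(rgb_segments, drdi_segments)]
    return np.mean(feats, axis=0)


rgb = rng.normal(size=(T, K, P))
drdi = rng.normal(size=(T, K, P))
v = video_feature(rgb, drdi)
print(v.shape)  # (8,)
```

Because both modalities share `pos_embed`, a static patch and a dynamic patch taken from the same spatial location differ only by their modality code, which is one simple way to keep the two token sets spatially aligned.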

[0102] Further, the step of mapping the static image blocks and dynamic image blocks of each time segment into static image block embedding sequences and dynamic image block embedding sequences respectively through a linear projection layer includes:

[0103] To ensure compatibility with the image encoder, DRDI is converted to 3-channel RGB format.

[0104] For the t-th time segment, $X_t^{rgb}$ and $X_t^{drdi}$ denote the selected static and dynamic image patches, respectively.

[0105] The image patches are first mapped to image patch embedding sequences by a shared linear projection layer $E$.

[0106] Furthermore, the step of concatenating the static image patch embedding sequence and the dynamic image patch embedding sequence, and adding spatial location coding and modality coding to the concatenation result to obtain the input sequence of the CLIP visual encoder, specifically includes:

[0107] Spatial location encoding $E_{pos}$ and modality encoding $E_{mod}$ are then added to each image patch token. The encoder input for the t-th segment can be represented as:

[0108] $\tilde{z}_t = [X_t^{rgb}E;\ X_t^{drdi}E], \quad z_t^{0} = \tilde{z}_t + E_{pos} + E_{mod}$;

[0109] where $\tilde{z}_t$ denotes an intermediate result.

[0110] Further, the input sequence is input into the visual encoder of the CLIP model to obtain an output sequence of image patch tokens; wherein the visual encoder of the CLIP model consists of L sequentially stacked Transformer layers, each Transformer layer containing a multi-head self-attention mechanism (MSA) and a feedforward neural network (FFN) for learning spatial dependencies and cross-modal associations. With $z_t^{0}$ denoting the encoder input of the t-th segment, the output of the l-th Transformer layer can be represented as:

[0111] $\hat{z}_t^{l} = \mathrm{MSA}(\mathrm{LN}(z_t^{l-1})) + z_t^{l-1}, \quad z_t^{l} = \mathrm{FFN}(\mathrm{LN}(\hat{z}_t^{l})) + \hat{z}_t^{l}, \quad l = 1, \dots, L$;

[0112] where $\mathrm{LN}(\cdot)$ denotes layer normalization.
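A single Transformer layer of the kind described above (MSA, FFN, layer normalization, residual connections) can be sketched in NumPy as below. Several points are assumptions made for brevity rather than details taken from this invention: layer normalization is applied before each sub-block (pre-norm), it has no learnable gain or bias, the FFN uses ReLU, and the parameter dictionary `p` with keys `Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2` is hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each token to zero mean and unit variance (no learnable scale)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def msa(x, Wq, Wk, Wv, Wo, heads):
    """Multi-head self-attention over all tokens of both modalities."""
    n, d = x.shape
    dh = d // heads

    def split(m):  # (n, d) -> (heads, n, dh)
        return m.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ Wo


def ffn(x, W1, W2):
    """Two-layer feed-forward block; ReLU is an assumption made for brevity."""
    return np.maximum(x @ W1, 0.0) @ W2


def transformer_layer(z, p, heads=4):
    """One pre-norm layer: residual MSA followed by residual FFN."""
    z = z + msa(layer_norm(z), p["Wq"], p["Wk"], p["Wv"], p["Wo"], heads)
    return z + ffn(layer_norm(z), p["W1"], p["W2"])


rng = np.random.default_rng(0)
d = 8
params = {name: rng.normal(size=(d, d)) * 0.1 for name in ("Wq", "Wk", "Wv", "Wo")}
params["W1"] = rng.normal(size=(d, 4 * d)) * 0.1
params["W2"] = rng.normal(size=(4 * d, d)) * 0.1
tokens = rng.normal(size=(6, d))        # e.g. 3 RGB tokens followed by 3 DRDI tokens
out = transformer_layer(tokens, params, heads=2)
print(out.shape)  # (6, 8)
```

Because the attention matrix spans every token in the concatenated sequence, each RGB token can attend to each DRDI token and vice versa, which is where the patch-level cross-modal interaction takes place.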

[0113] Further, the step of applying the averaging operation to the output sequence of image patch tokens to obtain the feature representation corresponding to each time segment, performing a linear projection on the feature representation corresponding to each time segment to obtain the projection result of each time segment, and performing temporal average pooling on the projection results of all time segments to obtain the global feature representation of the video, includes:

[0114] After $L$ layers of encoding, the output sequence of all image patch tokens is averaged to obtain the segment-level multimodal feature representation, which is then mapped by a linear projection matrix $W$ to a common latent space for subsequent visual and text feature alignment. The calculation formula is as follows:

[0115] $v_t = \mathrm{Avg}(z_t^{L})\,W, \quad v = \frac{1}{T}\sum_{t=1}^{T} v_t$;

[0116] where $z_t^{L}$ denotes the output of the $L$-th layer; $\mathrm{Avg}(\cdot)$ denotes the averaging operation; $v_t$ denotes the feature representation of the t-th segment; and the feature representations $v_t$ of the $T$ segments are average-pooled over the time dimension to obtain the final video-level feature representation $v$.

[0117] The present invention proposes a CLIP-based MS-FMIL framework that achieves efficient and effective spatiotemporal feature representation learning by jointly modeling RGB static image patches and DRDI dynamic image patches within a unified ViT coding structure. This method captures fine-grained spatial and motion correlations through patch-level cross-modal interaction and modality-aware coding mechanisms without introducing additional complex temporal modeling modules. Temporal information aggregation is accomplished using only a simple temporal average pooling operation, thus maintaining a lightweight model structure, high computational efficiency, and stable training.

[0118] Furthermore, the text description of the behavior category is input into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; the text description is enhanced using a large language model.

[0119] To enhance semantic supervision in behavior recognition tasks, this invention introduces GPT-enhanced text descriptions of behavior categories to replace traditional simple category names. The enhanced text characterizes the spatiotemporal features of human behavior more fully, including movement trends and interactions with related objects. By understanding these more detailed descriptions, the model can align visual features with high-level semantic information more effectively during training. This invention provides enhanced text prompts for three human behavior categories (brushing teeth, tearing paper, and walking away from each other). Each prompt describes not only "what the behavior is" but also its "dynamic process over time" and "how it interacts with the environment or objects." All enhanced text prompts for behavior categories are encoded using CLIP's frozen text encoder to obtain a set of text embedding representations $\{e_c\}_{c=1}^{C}$, where $C$ denotes the number of behavior categories. Throughout the training process, the text encoder remains frozen to preserve its pre-trained semantic structure.

[0120] Furthermore, the loss function applied during the training of the video behavior recognition model is:

[0121] $\mathcal{L} = -\log \dfrac{\exp(\mathrm{sim}(v, e_y)/\tau)}{\sum_{c=1}^{C} \exp(\mathrm{sim}(v, e_c)/\tau)}$;

[0122] where $\tau$ is a learnable temperature parameter, $v$ is the global visual feature of the video, $e_c$ is the text embedding of the c-th category, $e_y$ is the text embedding of the ground-truth category, and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity.

[0123] For an input video, the CLIP-based MS-FMIL framework produces its global multimodal visual representation. The cosine similarity between the visual feature of the video and the text embeddings of all categories is then calculated. During the training phase, the CLIP visual encoder is fine-tuned with a cross-entropy loss that maximizes the similarity between the video feature and the text representation of its corresponding category.
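The similarity computation and cross-entropy training objective described above can be illustrated with a small NumPy sketch. The function names, the toy feature values and the fixed temperature `tau=0.07` are assumptions chosen for illustration; in the text the temperature is learnable, and the features come from the CLIP encoders rather than hand-written arrays.

```python
import numpy as np


def cosine_logits(video_feats, text_feats, tau):
    """Temperature-scaled cosine similarity between videos and category texts."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    e = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return (v @ e.T) / tau                      # (batch, C)


def cross_entropy(logits, labels):
    """Mean negative log-probability of the ground-truth category."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()


text = np.eye(3)                                # 3 toy category text embeddings
video = np.array([[0.9, 0.1, 0.0],              # close to category 0
                  [0.0, 0.2, 0.8]])             # close to category 2
labels = np.array([0, 2])

logits = cosine_logits(video, text, tau=0.07)
loss = cross_entropy(logits, labels)
pred = logits.argmax(axis=1)                    # inference: category of maximum similarity
print(pred)  # [0 2]
```

At inference time only `cosine_logits` and the `argmax` are needed; the loss is used solely to update the visual encoder while the text embeddings stay fixed.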

[0124] This invention provides a detailed description of the Motion Saliency-Guided Fine-Grained Multimodal Interactive Learning (MS-FMIL) framework. This framework effectively leverages the complementary advantages of RGB appearance information and depth motion information to achieve high-precision behavior recognition. The MS-FMIL framework of this invention achieves deeper and more localized multimodal joint learning through a motion saliency-guided patch-level fine-grained fusion mechanism. First, a Depth Residual Dynamic Image (DRDI) is constructed to represent temporal motion cues in a depth video sequence. Then, a motion saliency-driven image patch selection module is designed, focusing on the most informative spatial regions in the RGB-DRDI image pairs. Subsequently, based on the selected RGB and depth image patches, they are input into the CLIP image encoder for fine-grained multimodal interactive learning. Finally, the CLIP text encoder encodes the semantically enhanced behavior category descriptions to generate text representations, and behavior recognition is achieved by maximizing the cosine similarity between video features and text features.
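To make the DRDI construction step more concrete, the sketch below builds an approximate depth residual dynamic image for one time segment. It is not the optimized ranking vector of this invention: instead of solving the ranking objective, it takes a single subgradient step from a zero vector, which gives a closed-form weighted sum of the smoothed residual frames with weights `2t - n - 1`. The function names `segment_bounds` and `drdi_approx` are hypothetical.

```python
import numpy as np


def segment_bounds(n_frames, T):
    """Start and end indices of T near-equal consecutive time segments."""
    edges = np.linspace(0, n_frames, T + 1).astype(int)
    return list(zip(edges[:-1], edges[1:]))


def drdi_approx(depth_frames):
    """Approximate depth residual dynamic image for one segment.

    depth_frames: (N, H, W) array of depth frames with N >= 3.
    """
    residual = np.diff(depth_frames.astype(np.float64), axis=0)   # adjacent-frame residuals
    n = residual.shape[0]
    counts = np.arange(1, n + 1).reshape(-1, 1, 1)
    smoothed = np.cumsum(residual, axis=0) / counts               # time-varying mean
    # Sum over all ordered pairs (q > t) of smoothed[q] - smoothed[t]:
    # frame t (1-based) appears (t - 1) times with + and (n - t) times with -.
    alpha = 2 * np.arange(1, n + 1) - n - 1
    return np.tensordot(alpha, smoothed, axes=(0, 0))             # (H, W)


rng = np.random.default_rng(0)
depth = rng.random((6, 4, 5))          # one toy segment: 6 depth frames of 4 x 5 pixels
dyn = drdi_approx(depth)
print(dyn.shape)                       # (4, 5)
print(segment_bounds(64, 16)[:2])
```

Later smoothed frames receive positive weights and earlier ones negative weights, so the resulting image encodes the direction of temporal change within the segment in a single frame.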

[0125] This invention conducted a series of comprehensive experiments to verify the effectiveness of the proposed MS-FMIL framework. The experiments were carried out on four widely used and challenging multimodal action recognition datasets: NTU RGB+D 60, NTU RGB+D 120, SBU Kinect Interaction, and UTD-MHAD. The experimental section first details the datasets and implementation settings, then compares performance with current state-of-the-art methods, and analyzes the contribution of each key component through ablation experiments. Finally, visualization results are provided to further illustrate the advantages of the proposed method.

[0126] The NTU RGB+D 60 dataset is a large-scale multimodal human behavior recognition benchmark containing 56,880 video samples covering 60 behavior categories. The behaviors were performed by 40 participants and captured from three different camera perspectives. Each video sample includes synchronized RGB, depth, infrared, and skeleton multimodal data. This dataset provides two evaluation protocols for performance comparison: Cross-Subject (CS) and Cross-View (CV).

[0127] The NTU RGB+D 120 dataset extends NTU RGB+D 60 with 60 additional action categories and contains 114,480 video samples covering 120 behavior categories. This dataset includes more complex and fine-grained human activities, making it a more challenging benchmark in the field of behavior recognition. It maintains the same multimodal settings as NTU RGB+D 60 and provides two official evaluation protocols: Cross-Subject (CSub) and Cross-Set (CSet).

[0128] The SBU Kinect Interaction and UTD-MHAD datasets were used to evaluate the generalization ability of the proposed method. The SBU Kinect Interaction dataset focuses on two-person interactions, containing 282 video sequences covering 8 action categories, and providing RGB and depth data streams. The UTD-MHAD dataset is a multimodal human behavior dataset containing 861 video action samples performed by 8 subjects across 27 action categories, each sample containing synchronized RGB and depth data.

[0129] This invention implements its method based on the ViT-B/16 architecture. For each input video, a uniform sampling strategy consistent with TSN is adopted to uniformly divide the RGB sequence and its corresponding depth sequence into T=16 time segments. One RGB frame is sampled from each time segment and the corresponding DRDI image is constructed, yielding 16 RGB frames and 16 DRDI frames for each video. For motion saliency image patch selection, this invention uses a pre-trained 2D ResNet-18 network as a lightweight feature extractor to calculate a motion importance score for each image patch. At each time step, only the top K salient patches of the RGB-DRDI image pair are retained and processed by the CLIP image encoder. In the experiments, the text encoder remains frozen, and only the CLIP visual encoder is fine-tuned to achieve fine-grained alignment of RGB and DRDI features. Data augmentation employs multi-scale cropping and random horizontal flipping. Model training uses the Adam optimizer with a cosine learning rate schedule.
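The patch selection step can be sketched as follows. The sketch assumes a 224×224 input split into 16×16 patches (196 patches in total, matching ViT-B/16), and it replaces the pre-trained ResNet-18 scorer with a simple stand-in, the mean absolute DRDI intensity of each patch. The function names `to_patches` and `select_topk` are hypothetical. The point illustrated is that one set of indices, computed from the DRDI image, is applied to both modalities.

```python
import numpy as np


def to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, row-major."""
    h, w, c = img.shape
    gh, gw = h // p, w // p
    x = img[:gh * p, :gw * p].reshape(gh, p, gw, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, p, p, c)


def select_topk(rgb, drdi, k, p=16):
    """Keep the k most motion-salient DRDI patches and the RGB patches at the same locations."""
    rgb_p, drdi_p = to_patches(rgb, p), to_patches(drdi, p)
    # Stand-in saliency score (assumption): mean absolute DRDI intensity per patch.
    scores = np.abs(drdi_p).mean(axis=(1, 2, 3))
    idx = np.sort(np.argsort(-scores, kind="stable")[:k])    # shared indices, spatial order
    return rgb_p[idx], drdi_p[idx], idx


rng = np.random.default_rng(0)
rgb = rng.random((224, 224, 3))
drdi = np.zeros((224, 224, 3))
drdi[64:160, 64:160] = 1.0              # toy motion confined to a central square

rgb_sel, drdi_sel, idx = select_topk(rgb, drdi, k=160)
print(rgb_sel.shape, drdi_sel.shape)    # (160, 16, 16, 3) (160, 16, 16, 3)
```

Sorting the kept indices preserves the original spatial order of the patches, so the location encoding of each retained patch can still be looked up by its index.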

[0130] For the NTU RGB+D 60 and NTU RGB+D 120 datasets, the learning rate is set to 8×10⁻⁶, and the network is trained for 30 and 50 epochs, respectively. To further test the cross-dataset generalization of the method, the CLIP backbone network trained on the NTU dataset is frozen as a whole, and a Vision-Language Prompt Learning mechanism is introduced on the SBU and UTD-MHAD datasets to achieve few-shot transfer learning. In the SBU and UTD-MHAD experiments, the learning rate is set to 5×10⁻⁴, and the numbers of training epochs are 20 and 30, respectively.

[0131] During the testing phase, a multi-view inference (Multi-View Inference) strategy is employed: four spatial crops and three temporal segments are used, and the video behavior recognition result is obtained by averaging the classification scores of all views.
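The score averaging used at test time can be sketched as below. The sketch assumes that every spatial crop is paired with every temporal sample, giving 4 × 3 = 12 views; the text does not state how the views are combined beyond averaging, and the function name and toy scores are hypothetical.

```python
import numpy as np


def multi_view_predict(view_scores):
    """Average per-view class scores and return (predicted class, mean scores).

    view_scores: (n_views, C) array, one row of class scores per view.
    """
    mean_scores = np.mean(view_scores, axis=0)
    return int(np.argmax(mean_scores)), mean_scores


n_crops, n_temporal, n_classes = 4, 3, 5
rng = np.random.default_rng(0)
scores = rng.random((n_crops * n_temporal, n_classes)) * 0.1   # toy background scores
scores[:, 2] += 0.5                                            # class 2 favoured in every view

pred, mean_scores = multi_view_predict(scores)
print(pred)  # 2
```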

[0132] To analyze the contribution of each component in the framework proposed in this invention, a series of ablation experiments were conducted on the NTU RGB+D 60 dataset under both CS and CV evaluation protocols. These experiments aimed to verify the impact of different multimodal fusion strategies, key model settings, and training schemes on the final recognition performance. All experiments used 16-frame RGB and DRDI input sequences.

[0133] Effect of the number K of motion-salient image patches: This experiment evaluates the impact of the number K of selected motion-salient image patches on recognition accuracy and computational efficiency. As shown in Table 1, when K increases from 100 to 160, recognition performance improves consistently, with accuracy rising from 94.01% to 96.08% under the CS protocol and from 95.89% to 98.75% under the CV protocol, while the computational cost increases monotonically with K. Performance peaks at K=160, where the computational cost is 637.6 GFLOPs, a moderate level. When K increases beyond 160, the accuracy declines slightly or saturates. For example, using all image patches (K=196) brings no improvement over the optimal setting: the CS and CV accuracies decrease by 0.88% and 0.87%, respectively, while the computational cost rises to 713.1 GFLOPs. These results show that when too few image patches are selected, the model cannot fully capture the information required for behavior recognition, whereas too many image patches introduce redundancy and noise, which harms efficiency and generalization. Since K=160 achieves a good balance between performance and efficiency, this setting is used in the experiments of this invention unless otherwise stated, so that the most discriminative regional features are preserved while the computational overhead of redundant image patches is avoided.

[0134] Table 1 Comparison of recognition accuracy for different K values on the NTU RGB+D 60 dataset
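The trade-off at K=160 can be checked with a few lines of arithmetic on the figures quoted above. The only assumption is the patch grid: 196 patches corresponds to a 14×14 grid, as obtained from a 224×224 input with 16×16 patches.

```python
patches_total = 14 * 14                       # 196 patches (assumed 224x224 input, 16x16 patches)
k = 160
kept_fraction = k / patches_total

gflops_k160, gflops_all = 637.6, 713.1        # computational costs reported in the text
saving = (gflops_all - gflops_k160) / gflops_all

cs_all = round(96.08 - 0.88, 2)               # CS accuracy implied for K=196
cv_all = round(98.75 - 0.87, 2)               # CV accuracy implied for K=196

print(f"{kept_fraction:.1%} of patches kept, {saving:.1%} fewer GFLOPs")
# 81.6% of patches kept, 10.6% fewer GFLOPs
```

In other words, dropping roughly one patch in five per modality saves about a tenth of the computation while the reported accuracy is slightly higher than with all patches.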

[0135]

[0136] Impact of multimodal visual feature fusion strategies: In the MS-FMIL framework proposed in this invention, at each time step, the top K=160 motion-saliency-guided image patches are selected from the RGB and DRDI input pair and jointly input into the CLIP visual encoder to achieve fine-grained multimodal interactive learning. This experiment evaluates how to effectively fuse, at the segment level, the multimodal interaction features output by the CLIP visual encoder. Five feature fusion strategies are designed and compared: CLS-only uses only the output of the CLS token as the final feature representation; RGB Avg averages the output embeddings of the RGB image patches; Depth Avg averages the output embeddings of the DRDI image patches; RGB+Depth Avg averages all output embeddings of both the RGB and DRDI modalities; Concat Avg concatenates the RGB and DRDI outputs at corresponding image patch locations, followed by spatial average pooling, and to keep feature dimensions consistent a linear projection layer restores the original feature dimension after concatenation. For all five strategies, temporal average pooling is subsequently applied to obtain the final video-level multimodal feature representation.

[0137] The comparison results on the NTU RGB+D 60 dataset are shown in Table 2. Among all fusion strategies, RGB+Depth Avg achieves the best performance, with a recognition accuracy of 96.08% under the CS protocol and 98.75% under the CV protocol; it is therefore selected as the default fusion strategy of the proposed framework. Compared to the CLS-only strategy, RGB+Depth Avg improves CS accuracy by 0.31% and CV accuracy by 1.12%, indicating that relying solely on a single CLS token may lose fine-grained, modality-specific information. When only single-modality tokens are used, Depth Avg outperforms RGB Avg by 0.70% on CS and 0.65% on CV, indicating that the spatiotemporal motion information provided by the DRDI input is more discriminative and valuable for behavior recognition tasks. The Concat Avg strategy performs the worst: compared to RGB+Depth Avg, its CS accuracy is 2.69% lower and its CV accuracy is 3.69% lower. In summary, directly applying average pooling to all output tokens of both modalities yields the most stable and discriminative multimodal feature representation, further demonstrating the effectiveness and complementary advantages of fusing RGB static appearance information and depth dynamic information.
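The five segment-level fusion strategies can be written side by side as in the sketch below. The token layout `[CLS, K RGB tokens, K DRDI tokens]`, the strategy keys and the projection matrix `W_concat` are assumptions introduced for illustration; for Concat Avg the sketch averages first and projects afterwards, which is equivalent to projecting first when the projection has no bias.

```python
import numpy as np


def fuse(tokens, k, strategy, W_concat=None):
    """Fuse encoder outputs of one segment into a single D-dimensional vector.

    tokens: (1 + 2k, D) array ordered as [CLS, k RGB tokens, k DRDI tokens].
    """
    cls, rgb, drdi = tokens[0], tokens[1:1 + k], tokens[1 + k:]
    if strategy == "cls_only":
        return cls
    if strategy == "rgb_avg":
        return rgb.mean(axis=0)
    if strategy == "depth_avg":
        return drdi.mean(axis=0)
    if strategy == "rgb_depth_avg":
        return np.concatenate([rgb, drdi], axis=0).mean(axis=0)
    if strategy == "concat_avg":
        paired = np.concatenate([rgb, drdi], axis=1)    # (k, 2D): same location, both modalities
        return paired.mean(axis=0) @ W_concat           # project 2D back to D
    raise ValueError(f"unknown strategy: {strategy}")


rng = np.random.default_rng(0)
k, dim = 5, 4
tokens = rng.normal(size=(1 + 2 * k, dim))
W_concat = np.vstack([np.eye(dim), np.eye(dim)]) * 0.5   # toy projection: average the two halves

for name in ("cls_only", "rgb_avg", "depth_avg", "rgb_depth_avg"):
    print(name, fuse(tokens, k, name).shape)
print("concat_avg", fuse(tokens, k, "concat_avg", W_concat).shape)
```

With the toy `W_concat` used here, Concat Avg reduces exactly to RGB+Depth Avg; a learned projection would generally differ, which is one way the two strategies can yield different accuracies.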

[0138] Table 2 Comparison of different multimodal visual feature fusion strategies on the NTU RGB+D 60 dataset

[0139]

[0140] Impact of Modal Embedding: This experiment explored the effect of introducing specific modal embeddings into the ViT input to enhance the discriminative power of multimodal features. Table 3 shows the comparison results with and without modal embeddings on the NTU RGB+D 60 dataset. It can be seen that introducing modal embeddings brings significant performance improvements, increasing by 1.18% under the CS protocol and by 1.33% under the CV protocol. This indicates that explicit modal encoding enables the Transformer to better separate RGB appearance information from DRDI motion information during feature interaction learning. Without modal embeddings, the model is prone to partial confusion with cross-modal cues, thus weakening its ability to extract complementary patterns. Therefore, modal embedding vectors play a crucial role in enhancing RGB-D cross-modal alignment and improving action recognition accuracy.

[0141] Table 3 Comparison of results with and without modality embeddings on the NTU RGB+D 60 dataset

[0142]

[0143] Impact of text-enhanced behavior prompts: Table 4 evaluates, on the NTU RGB+D 60 dataset, the impact of text-enhanced behavior prompts (GPT-generated descriptions) versus ordinary class labels under different fine-tuning strategies. As shown in Table 4, enriching class names with GPT-enhanced text descriptions consistently improves recognition accuracy over ordinary action labels under both the CS and CV protocols. Specifically, when the text encoder is frozen, GPT-enhanced descriptions bring a stable improvement of 0.31% (CS) and 0.37% (CV), indicating that richer, more descriptive text prompts provide stronger semantic supervision for aligning visual features with action categories. A similar trend is observed when both encoders are fine-tuned jointly: GPT-enhanced prompts improve CS by 0.18% and CV by 0.49% compared to ordinary labels.

[0144] Furthermore, fine-tuning only the CLIP image encoder slightly outperforms jointly fine-tuning the image and text encoders. This is attributed to the CLIP text encoder, which, after pre-training on large-scale image-text pairs, captures strong linguistic prior knowledge and a highly generalizable semantic structure. Given the standardized format and rich semantics of the GPT-enhanced prompts, further fine-tuning the text encoder may introduce unnecessary parameter updates or lead to overfitting. Meanwhile, the visual modality plays a crucial role in behavior recognition, and fine-tuning the image encoder allows the model to better adapt to spatiotemporal motion cues. Overall, GPT-enhanced behavior prompts provide richer semantic details, reduce label ambiguity, and enhance vision-language alignment, while freezing the text encoder preserves robust pre-trained linguistic representations and avoids overfitting to dataset-specific text.

[0145] Table 4 Experimental results of text-enhanced behavior prompts and different fine-tuning strategies on the NTU RGB+D 60 dataset

[0146]

[0147] Effectiveness of Multimodal Fusion: To verify the effectiveness of RGB-D multimodal fusion, this invention compared three single-modal methods with the multimodal framework (RGB+DRDI) of this invention on the NTU RGB+D60 dataset under both CS and CV protocols. "RGB" indicates that only RGB static frame sequences are used as input. "DRDI" indicates that only DRDI sequences are used as input. "DRDI-Top 160" indicates that only the top 160 image patches with the highest motion saliency in each frame are used as input. These methods differ only in the input modality of the CLIP visual encoder. All experimental results were obtained by fine-tuning the CLIP image encoder. As shown in Table 5, the "RGB" method has the weakest performance, while "DRDI" significantly improves recognition performance, reaching 91.78% under CS and 94.10% under CV. This indicates that the spatiotemporal motion information provided by DRDI is more valuable for behavior recognition. Selecting the top 160 motion-saliency image patches from DRDI brings further improvement, indicating that focusing on information-rich regions can reduce noise and facilitate recognition. The proposed multimodal method integrating RGB and DRDI achieves the best performance, exceeding "RGB" by 9.72% on CS and 6.09% on CV; it also exceeds "DRDI" by 4.30% on CS and 4.65% on CV. This confirms the complementarity of RGB and DRDI, with RGB providing static appearance cues and DRDI providing robust motion information.

[0148] Table 5 Comparison of unimodal and multimodal inputs on the NTU RGB+D 60 dataset

[0149]

[0150] To evaluate the effectiveness of the proposed MS-FMIL framework, it was compared with a range of state-of-the-art action recognition methods on the NTU RGB+D 60 and 120 datasets, covering different input modalities, including unimodal and multimodal methods. To achieve the optimal balance between efficiency and accuracy, this invention employs a Top K=160 motion saliency patch selection strategy for model training. Experiments were conducted to analyze the performance of both unimodal and multimodal methods, validating the significant advantages of fine-grained multimodal interactive learning combining RGB and DRDI inputs and cross-modal vision-language alignment.

[0151] Table 6-1 Comparison with the current method on NTU RGB+D 60 and 120 datasets

[0152] R: RGB modality, D: Depth modality, S: Skeleton modality, P: 2D pose, T: Text modality

[0153]

[0154] Table 6-2 Comparison with the current method on NTU RGB+D 60 and 120 datasets

[0155] R: RGB modality, D: Depth modality, S: Skeleton modality, P: 2D pose, T: Text modality

[0156]

[0157] 1) Comparison with unimodal methods: This invention first compares the proposed method with existing state-of-the-art unimodal methods using RGB, depth, or skeleton inputs on the NTU RGB+D 60 and 120 datasets. As shown in Table 6-2, the proposed method achieves 96.08% CS accuracy and 98.75% CV accuracy on the NTU RGB+D 60 dataset, and 91.84% CSub accuracy and 93.89% CSet accuracy on the NTU RGB+D 120 dataset, significantly outperforming most unimodal methods. Among the skeleton-based methods on the NTU RGB+D 60 dataset, the proposed MS-FMIL outperforms the existing high-performance methods WDCE-Net and LG-GCN by 3.08% and 2.98%, respectively, under the CS protocol, and by 1.55% and 2.05%, respectively, under the CV protocol. Compared to GeometryMotion-Net in the depth modality, the method of this invention achieves a 3.38% gain on CS and is only 0.15% lower on CV. This invention hypothesizes that GeometryMotion-Net's higher CV accuracy may be attributed to its geometry-aware modeling specifically designed for depth input. On the NTU RGB+D 120 dataset, the method of this invention consistently outperforms the unimodal methods: compared to LG-GCN, it achieves a 2.44% improvement on CSub and a 2.89% improvement on CSet; compared to GeometryMotion-Net, it achieves a 1.74% improvement on CSub and a 0.29% improvement on CSet. These performance gains indicate that the method of this invention benefits from complementary RGB and depth visual cues, combined with semantically driven text representation, effectively improving human behavior recognition capabilities.

[0158] 2) Comparison with Multimodal Methods: Since the proposed framework utilizes both visual modalities (RGB and depth) and linguistic modalities (text), this invention further compares it with recent multimodal behavior recognition methods. As shown in Tables 6-1 and 6-2, the MS-FMIL proposed in this invention outperforms existing state-of-the-art multimodal methods under all evaluation protocols on the NTU RGB+D 60 and 120 datasets. On the NTU RGB+D 60 dataset, the proposed method significantly outperforms TCEM, achieving a 1.78% improvement under the CS protocol and a 0.75% improvement under the CV protocol. When tested on the NTU RGB+D 120 dataset, the proposed method significantly outperforms recent multimodal baseline methods such as STAR-Transformer and CSCMFT. Specifically, compared to STAR-Transformer, it improves performance by 1.54% on CSub and 1.19% on CSet; compared to CSCMFT, it is 2.04% higher on CSub and 3.39% higher on CSet.

[0159] The above results validate the effectiveness of this framework, which fine-tunes the CLIP image encoder based on motion-salient patches extracted from RGB and DRDI inputs to capture discriminative spatiotemporal dynamic features of behavior. Compared to existing methods that rely on complex multi-stream fusion networks or manually designed modal interactions, the method of this invention introduces an efficient, fine-grained multimodal learning strategy that jointly models spatial appearance and temporal dynamics, thereby achieving superior behavior recognition performance.

[0160] This invention proposes MS-FMIL, a fine-grained multimodal interactive learning framework guided by motion saliency, aiming to extend the CLIP model to RGB-D action recognition tasks. This framework achieves joint modeling of RGB appearance representation and depth motion information by inputting each frame of RGB image and its corresponding DRDI into the CLIP image encoder, thereby supporting patch-level cross-modal interaction. This invention constructs DRDI on short time segments to capture local spatiotemporal motion information, allowing the native CLIP architecture to obtain temporally aware feature representations without relying on explicit temporal modeling modules. To improve computational efficiency and reduce redundancy, a motion-saliency image patch selection strategy is introduced. This strategy calculates the motion saliency score of each image patch based on DRDI and selects image patches with greater information content for RGB-D cross-modal interactive learning. Furthermore, the framework supports weakly aligned static-dynamic joint modeling: RGB provides static appearance cues, while DRDI provides motion dynamics from neighboring segments. This not only reduces the dependence on strict frame-level synchronization between RGB and depth streams but also generates temporally aware video representations. Extensive experiments on four RGB-D action recognition datasets validated the effectiveness and robustness of MS-FMIL. This framework achieves superior performance with a lightweight and flexible architecture, while also providing a new perspective on using CLIP for multimodal action recognition.

[0161] This invention innovatively extends the CLIP framework, proposing a fine-grained multimodal interactive learning framework (MS-FMIL) guided by motion saliency to achieve efficient RGB-D behavior recognition. Specifically, at each time step, this invention jointly inputs the RGB image and its corresponding depth residual dynamic image into the CLIP image encoder, enabling fine-grained interaction between static appearance features and dynamic motion features at the patch level, thereby improving multimodal fusion performance. Furthermore, this invention establishes a motion saliency image patch selection mechanism. By calculating the motion saliency score of each image patch using the depth residual dynamic image, key cross-modal regions are selected to improve interaction efficiency and discriminative ability while reducing redundant computation. Without requiring an explicit temporal modeling module, this invention directly injects local spatiotemporal context information from the depth residual dynamic image into the CLIP framework, achieving temporally aware feature representation learning. This method also supports a weak temporal alignment mechanism, enabling the depth dynamic image to complement the static RGB image under non-strict frame synchronization conditions, thus achieving more robust cross-modal information fusion. Experimental results on a large number of benchmark datasets show that the method described in this invention can significantly improve the performance of RGB-D behavior recognition, providing a new research approach for CLIP-based multimodal behavior recognition.

[0162] Example 2

[0163] This embodiment provides a motion saliency-guided fine-grained multimodal video behavior recognition system, including:

[0164] The acquisition module is configured to: acquire the RGB video to be recognized and its corresponding depth video, divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments;

[0165] The partitioning module is configured to uniformly divide the RGB image and DRDI image in each time segment's RGB-DRDI image pair into several non-overlapping image blocks.

[0166] The selection module is configured to: select the top K most motion-significant image blocks from all image blocks corresponding to the DRDI image in each time segment as dynamic image blocks; and also select K image blocks corresponding to the spatial location from the RGB image in the same time segment as the DRDI image as static image blocks.

[0167] The extraction module is configured to: stitch together dynamic and static image blocks for each time segment, input the stitched image blocks into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; and apply average pooling to the feature representations of all time segments to obtain the global feature representation of the video.

[0168] The recognition module is configured to: input the text description of the behavior category into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; calculate the similarity between the global feature representation of the video and the text feature representation; and determine the video behavior recognition result based on the text description corresponding to the maximum similarity.

[0169] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A fine-grained multimodal video behavior recognition method guided by motion saliency, characterized by comprising:
obtaining an RGB video and its corresponding depth video to be used for behavior recognition; uniformly dividing the RGB video and its corresponding depth video into T time segments, and constructing RGB-DRDI image pairs for all time segments, including:
the RGB video is evenly divided into T RGB time segments, and one RGB image frame is randomly sampled in each RGB time segment, where T is a positive integer;
the depth video is divided into T depth time segments using the same partitioning method as the RGB video; within each depth time segment, residual frames between adjacent depth frames are calculated to obtain a residual image sequence for each depth time segment; rank pooling is then applied to the residual image sequence of each depth time segment to obtain the depth residual dynamic image (DRDI) corresponding to each depth time segment, specifically including:
to represent the temporal evolution of behavior within each segment, rank pooling is applied to the residual image sequence $R^k$ to generate a single depth residual dynamic image; rank pooling is a temporal encoding method that preserves the temporal order of a frame sequence by learning a linear ranking function that changes over time;
first, a time-varying mean vector operation is performed on the residual image sequence $R^k$ to obtain the smoothed feature sequence:
$$v_i^k = \frac{1}{i}\sum_{j=1}^{i} r_j^k$$
where $v_i^k$ denotes the smoothed feature vector of the $i$-th residual frame in the $k$-th time segment, and $r_j^k$ denotes the $j$-th residual frame of the $k$-th segment;
the final dynamic image is obtained by optimizing a linear ranking function: if time step $i > j$, then $d^\top v_i^k > d^\top v_j^k$ is satisfied, where the ranking vector $d$ is used to represent temporal information and $v_i^k$ denotes the smoothed feature vector of the $i$-th residual frame in the $k$-th time segment;
the objective function for solving the ranking vector $d$ is:
$$\min_{d,\,\xi}\ \frac{1}{2}\lVert d\rVert^2 + C\sum_{i>j}\xi_{ij}\quad \text{s.t.}\quad d^\top\left(v_i^k - v_j^k\right) \ge 1 - \xi_{ij},\ \ \xi_{ij} \ge 0$$
where $\xi_{ij}$ are slack variables and $C$ is the regularization parameter;
a dimension transformation is applied to the optimized vector $d^*$ to obtain the depth residual dynamic image;
by taking the RGB image frame and the depth residual dynamic image (DRDI) in the same time segment as one image pair, RGB-DRDI image pairs are established for all time segments of the RGB video and its corresponding depth video;
the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment are evenly divided into several non-overlapping image patches;
from all the image patches corresponding to the DRDI image in each time segment, the top K image patches with the most salient motion are selected as dynamic image patches; from the RGB image in the same time segment as the DRDI image, K image patches at the corresponding spatial locations are likewise selected as static image patches;
the dynamic and static image patches in each time segment are concatenated, and the concatenated image patches are input into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; the feature representations of all time segments are subjected to average pooling to obtain the global feature representation of the video;
the text description of the behavior category is input into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; the similarity between the global feature representation of the video and the text feature representation is calculated, and the video behavior recognition result is determined based on the text description corresponding to the maximum similarity.
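The rank pooling step recited in claim 1 can be sketched as follows. This is an illustrative approximation rather than the claimed procedure: it replaces the slack-variable formulation with the equivalent unconstrained pairwise hinge loss and minimizes it by plain subgradient descent. The function name `rank_pool` and the parameters `lam`, `lr`, and `steps` are hypothetical.

```python
import numpy as np

def rank_pool(residuals, lam=1e-3, lr=0.1, steps=200):
    """Summarize a residual sequence of shape (n, H, W) as one dynamic image.

    Sketch only: pairwise hinge loss plus L2 penalty, solved by subgradient
    descent. lam, lr and steps are illustrative values, not from the source.
    """
    n = residuals.shape[0]
    if n < 2:
        raise ValueError("need at least two residual frames to rank")
    x = residuals.reshape(n, -1).astype(np.float64)
    # Time-varying mean: v[i] is the mean of frames 0..i (the smoothing step)
    v = np.cumsum(x, axis=0) / np.arange(1, n + 1)[:, None]
    # All ordered pairs where the first index is later in time than the second
    later, earlier = np.where(np.arange(n)[:, None] > np.arange(n)[None, :])
    diffs = v[later] - v[earlier]                 # (P, H*W)
    d = np.zeros(x.shape[1])
    for _ in range(steps):
        active = (diffs @ d) < 1.0                # pairs violating the margin
        grad = lam * d - diffs[active].sum(axis=0) / len(diffs)
        d -= lr * grad
    # Reshape the ranking vector back to image dimensions
    return d.reshape(residuals.shape[1:]), v @ d

# Toy sequence whose intensity grows over time
pattern = np.array([[1.0, 0.0], [0.0, 0.0]])
toy_residuals = np.stack([t * pattern for t in range(1, 6)])
dynamic_image, scores = rank_pool(toy_residuals)
```

The returned `scores` are the ranking function values for each time step; for a well-ordered sequence they should increase with time, which is the ordering constraint stated in the claim.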

2. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 1, characterized in that calculating, within each depth time segment, residual frames between adjacent depth frames to obtain a residual image sequence for each depth time segment specifically includes:
within each segment, the residual between adjacent depth frames is calculated to capture motion changes:
$$S_k = \{D_1^k, D_2^k, \ldots, D_N^k\}$$
where $D_i^k$ is the $i$-th depth frame in the $k$-th depth time segment, $N$ is the total number of frames within the segment, and $S_k$ denotes the $k$-th depth time segment;
the $i$-th residual frame of the $k$-th segment is expressed as:
$$r_i^k = D_{i+1}^k - D_i^k,\qquad R^k = \{r_1^k, r_2^k, \ldots, r_{N-1}^k\}$$
where $R^k$ denotes the residual image sequence of the $k$-th segment, and $D_i^k$ is the $i$-th depth frame in the $k$-th depth time segment.
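A minimal sketch of the segmentation and residual computation follows. It assumes the residual is the signed difference between consecutive depth frames and uses `np.array_split` for the uniform split; both choices, along with the function names, are illustrative assumptions.

```python
import numpy as np

def split_into_segments(frames, num_segments):
    """Split a (num_frames, H, W) depth video into roughly equal time segments."""
    return np.array_split(np.asarray(frames), num_segments, axis=0)

def residual_sequence(segment):
    """Residual frames between adjacent depth frames of one segment.

    Assumes a signed frame difference; the source does not fix the exact form.
    A segment of N frames yields N - 1 residual frames.
    """
    seg = np.asarray(segment, dtype=np.float32)
    if seg.shape[0] < 2:
        raise ValueError("a segment needs at least two frames")
    return seg[1:] - seg[:-1]

# Toy depth video: 10 frames of 2x2 pixels, split into T = 3 segments
depth_video = np.arange(10 * 2 * 2, dtype=np.float32).reshape(10, 2, 2)
segments = split_into_segments(depth_video, 3)
residuals_per_segment = [residual_sequence(s) for s in segments]
```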

3. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 1, characterized in that selecting, from all the image patches corresponding to the DRDI image in each time segment, the top K image patches with the most salient motion as dynamic image patches, and likewise selecting, from the RGB image in the same time segment as the DRDI image, K image patches at the corresponding spatial locations as static image patches, specifically includes:
calculating the motion saliency score of each image patch in the DRDI image;
based on the motion saliency scores, selecting from the DRDI image the top K dynamic image patches with the most salient motion;
based on the principle of consistent spatial location, selecting K static image patches from the RGB image, wherein the spatial location of each static image patch in the RGB image corresponds to the spatial location of a dynamic image patch in the DRDI image.
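The selection rule in claim 3 can be sketched as below: score every patch, keep the K highest-scoring DRDI patches, and take the RGB patches at the same grid positions. The helper names and the toy 2×2 grid are illustrative assumptions; the saliency map is taken as given.

```python
import numpy as np

def split_patches(image, grid):
    """Split an (H, W, C) image into grid*grid non-overlapping patches (row-major)."""
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    patches = image[:grid * ph, :grid * pw].reshape(grid, ph, grid, pw, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(grid * grid, ph, pw, c)

def select_patches(rgb, drdi, saliency, k):
    """Pick the K most salient DRDI patches and the RGB patches at the same positions."""
    grid = saliency.shape[0]
    idx = np.argsort(saliency.reshape(-1))[::-1][:k]   # indices of the top-K scores
    dynamic = split_patches(drdi, grid)[idx]
    static = split_patches(rgb, grid)[idx]             # same spatial locations
    return dynamic, static, idx

# Toy example: 4x4 single-channel images on a 2x2 patch grid
rgb = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
drdi = rgb * 10.0
saliency = np.array([[0.1, 0.9],
                     [0.5, 0.2]])
dynamic, static, idx = select_patches(rgb, drdi, saliency, k=2)
```

Because both images are indexed with the same `idx`, each static patch covers exactly the region of its dynamic counterpart.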

4. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 3, characterized in that calculating the motion saliency score of each image patch in the DRDI image includes:
the DRDI image $X_k$ of the $k$-th time segment is input into a two-dimensional convolutional neural network for feature learning, the two-dimensional convolutional neural network being implemented with ResNet-18;
after feature learning, the size of the obtained spatial feature map is... The positions of the spatial feature map correspond one-to-one with the 14×14 non-overlapping image patches of the RGB image and the DRDI image;
after average pooling is performed on the spatial feature map along the channel dimension, an activation function is applied to obtain the motion saliency score map $M_k$:
$$M_k = \sigma\left(\mathrm{AvgPool}\left(\mathcal{F}(X_k)\right)\right)$$
where each value in $M_k$ represents an importance estimate for the corresponding image patch, $\mathrm{AvgPool}(\cdot)$ denotes average pooling, $\sigma(\cdot)$ denotes the activation function, and $\mathcal{F}(\cdot)$ denotes the two-dimensional convolutional neural network.
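As a rough illustration of the scoring step in claim 4, the sketch below averages a feature map over its channel dimension and applies an activation. The sigmoid is an assumption (the claim only says "an activation function"), and a random array stands in for the CNN output, so no actual ResNet-18 is involved.

```python
import numpy as np

def saliency_score_map(feature_map):
    """Turn a (C, H, W) feature map into an (H, W) motion saliency score map.

    Channel-wise average pooling followed by a sigmoid. The sigmoid is an
    assumed choice of activation, not confirmed by the source.
    """
    pooled = feature_map.mean(axis=0)          # average over channels
    return 1.0 / (1.0 + np.exp(-pooled))       # squash scores into (0, 1)

# Placeholder for the CNN output: 8 channels on a 14x14 grid
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 14, 14))
features[:, 3, 5] += 10.0                      # make one location stand out
score_map = saliency_score_map(features)
```

Each entry of `score_map` lines up with one cell of the 14×14 patch grid, so it can be used directly to rank patches.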

5. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 1, characterized in that concatenating the dynamic and static image patches in each time segment, inputting the concatenated image patches into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment, and subjecting the feature representations of all time segments to average pooling to obtain the global feature representation of the video, specifically includes:
the static image patches and dynamic image patches of each time segment are first mapped, through a linear projection layer, to a static image patch embedding sequence and a dynamic image patch embedding sequence respectively;
the static image patch embedding sequence and the dynamic image patch embedding sequence are then concatenated, and spatial position encoding and modality encoding are added to the concatenation result to obtain the input sequence of the CLIP visual encoder;
the input sequence is fed into the visual encoder of the CLIP model to obtain the output sequence of image patch tokens;
an averaging operation is applied to the output sequence of image patch tokens to obtain the feature representation corresponding to each time segment;
linear projection is performed on the feature representation corresponding to each time segment to obtain the projection result of each time segment;
temporal average pooling is performed on the projection results of all time segments to obtain the global feature representation of the video.
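The construction of the encoder input in claim 5 can be sketched under several assumptions: one projection matrix shared by both modalities, a learned position table indexed by patch location, and one modality vector per modality. None of these details is fixed by the claim, and all names below are hypothetical.

```python
import numpy as np

def build_input_sequence(static_patches, dynamic_patches, w_embed,
                         pos_table, patch_idx, modality_emb):
    """Build the (2K, D) token sequence fed to the visual encoder.

    static_patches, dynamic_patches: (K, P) flattened patches
    w_embed:      (P, D) linear projection (shared here, which is an assumption)
    pos_table:    (G*G, D) spatial position encodings, one per grid cell
    patch_idx:    (K,) grid indices of the selected patches
    modality_emb: (2, D) row 0 for static (RGB), row 1 for dynamic (DRDI)
    """
    static_tokens = static_patches @ w_embed + pos_table[patch_idx] + modality_emb[0]
    dynamic_tokens = dynamic_patches @ w_embed + pos_table[patch_idx] + modality_emb[1]
    # Concatenate along the token axis: static tokens first, then dynamic tokens
    return np.concatenate([static_tokens, dynamic_tokens], axis=0)

K, P, D, G = 3, 12, 5, 14
rng = np.random.default_rng(0)
tokens = build_input_sequence(
    static_patches=rng.normal(size=(K, P)),
    dynamic_patches=rng.normal(size=(K, P)),
    w_embed=np.zeros((P, D)),                   # zero projection isolates the encodings
    pos_table=rng.normal(size=(G * G, D)),
    patch_idx=np.array([7, 42, 100]),
    modality_emb=np.array([np.zeros(D), np.ones(D)]),
)
```

Giving a static token and its dynamic counterpart the same position encoding but different modality encodings lets the encoder relate them as the same location seen through two modalities.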

6. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 5, characterized in that applying the averaging operation to the output sequence of image patch tokens to obtain the feature representation corresponding to each time segment, performing linear projection on the feature representation corresponding to each time segment to obtain the projection result of each time segment, and performing temporal average pooling on the projection results of all time segments to obtain the global feature representation of the video, includes:
after encoding by $L$ layers, the output sequence of all image patch tokens is averaged to obtain the segment-level multimodal feature representation, which is then mapped by a linear projection matrix $W_p$ into a common latent space for subsequent alignment of visual and text features, calculated as follows:
$$f_k = W_p\,\mathrm{Avg}\left(Z_k^{(L)}\right)$$
where $Z_k^{(L)}$ denotes the output of the $L$-th layer, $\mathrm{Avg}(\cdot)$ denotes the averaging operation, and $f_k$ denotes the feature representation corresponding to the $k$-th segment;
the feature representations $\{f_k\}_{k=1}^{T}$ of the T segments are average-pooled over the time dimension to obtain the final video-level feature representation:
$$F = \frac{1}{T}\sum_{k=1}^{T} f_k$$
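The aggregation described in claim 6 reduces to two averages with a projection in between, as the following sketch shows. Array shapes and the function name `video_feature` are illustrative assumptions, and a constant array stands in for real encoder outputs.

```python
import numpy as np

def video_feature(token_outputs, w_proj):
    """Aggregate encoder outputs into one video-level feature.

    token_outputs: (T, N, D) final-layer tokens for T segments, N tokens each
    w_proj:        (D, E) linear projection into the shared visual-text space
    """
    segment_feats = token_outputs.mean(axis=1)   # (T, D) average over tokens
    projected = segment_feats @ w_proj           # (T, E) per-segment projection
    return projected.mean(axis=0)                # (E,) temporal average pooling

# Placeholder encoder outputs: 4 segments, 6 tokens each, 8-dim tokens
token_outputs = np.ones((4, 6, 8))
w_proj = np.eye(8)[:, :3]                        # keeps the first three dimensions
global_feat = video_feature(token_outputs, w_proj)
```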

7. The motion saliency-guided fine-grained multimodal video behavior recognition method as described in claim 1, characterized in that the text description of the behavior category is input into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation, which is then enhanced using a large language model.

8. A fine-grained multimodal video behavior recognition system guided by motion saliency, characterized by comprising:
an acquisition module configured to: acquire an RGB video and its corresponding depth video to be used for behavior recognition; uniformly divide the RGB video and its corresponding depth video into T time segments; and construct RGB-DRDI image pairs for all time segments, including:
the RGB video is evenly divided into T RGB time segments, and one RGB image frame is randomly sampled in each RGB time segment, where T is a positive integer;
the depth video is divided into T depth time segments using the same partitioning method as the RGB video; within each depth time segment, residual frames between adjacent depth frames are calculated to obtain a residual image sequence for each depth time segment; rank pooling is then applied to the residual image sequence of each depth time segment to obtain the depth residual dynamic image (DRDI) corresponding to each depth time segment, specifically including:
to represent the temporal evolution of behavior within each segment, rank pooling is applied to the residual image sequence $R^k$ to generate a single depth residual dynamic image; rank pooling is a temporal encoding method that preserves the temporal order of a frame sequence by learning a linear ranking function that changes over time;
first, a time-varying mean vector operation is performed on the residual image sequence $R^k$ to obtain the smoothed feature sequence:
$$v_i^k = \frac{1}{i}\sum_{j=1}^{i} r_j^k$$
where $v_i^k$ denotes the smoothed feature vector of the $i$-th residual frame in the $k$-th time segment, and $r_j^k$ denotes the $j$-th residual frame of the $k$-th segment;
the final dynamic image is obtained by optimizing a linear ranking function: if time step $i > j$, then $d^\top v_i^k > d^\top v_j^k$ is satisfied, where the ranking vector $d$ is used to represent temporal information and $v_i^k$ denotes the smoothed feature vector of the $i$-th residual frame in the $k$-th time segment;
the objective function for solving the ranking vector $d$ is:
$$\min_{d,\,\xi}\ \frac{1}{2}\lVert d\rVert^2 + C\sum_{i>j}\xi_{ij}\quad \text{s.t.}\quad d^\top\left(v_i^k - v_j^k\right) \ge 1 - \xi_{ij},\ \ \xi_{ij} \ge 0$$
where $\xi_{ij}$ are slack variables and $C$ is the regularization parameter;
a dimension transformation is applied to the optimized vector $d^*$ to obtain the depth residual dynamic image;
by taking the RGB image frame and the depth residual dynamic image (DRDI) in the same time segment as one image pair, RGB-DRDI image pairs are established for all time segments of the RGB video and its corresponding depth video;
a partitioning module configured to: uniformly divide the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment into several non-overlapping image patches;
a selection module configured to: select, from all the image patches corresponding to the DRDI image in each time segment, the top K image patches with the most salient motion as dynamic image patches; and likewise select, from the RGB image in the same time segment as the DRDI image, K image patches at the corresponding spatial locations as static image patches;
an extraction module configured to: concatenate the dynamic and static image patches of each time segment, input the concatenated image patches into the visual encoder of the CLIP model for fine-grained multimodal interactive learning to obtain the feature representation corresponding to the current time segment; and apply average pooling to the feature representations of all time segments to obtain the global feature representation of the video;
a recognition module configured to: input the text description of the behavior category into the text encoder of the CLIP model for text feature extraction to obtain the text feature representation; calculate the similarity between the global feature representation of the video and the text feature representation; and determine the video behavior recognition result based on the text description corresponding to the maximum similarity.