Video classification method based on Mamba network adaptive adjustment of image base model

CN120388315B · Active · Publication Date: 2026-06-19 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date
2025-03-18
Publication Date
2026-06-19


Abstract

This invention relates to a video classification method based on Mamba Networks for adaptive adjustment of the image base model. By combining the state-space model of the Mamba Network with the image base model, efficient extraction and adaptive adjustment of spatiotemporal features in video data are achieved, thereby improving the accuracy and computational efficiency of the video recognition system. The main steps include: preprocessing the input video into a long sequence of video features; grouping the long sequence of video features using window partitioning and calculating autocorrelation features within each group; processing the long sequence of video features using the Mamba Network and adjusting them through a modulation function; feeding the modulated features into subsequent layers of the image base model for forward propagation; and classifying the video using a classifier. This invention provides a novel video recognition framework that can improve video recognition performance without changing the structure of the base model, through the adjustment mechanism of the Mamba Network.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and video understanding technology, and in particular to a video classification method based on Mamba networks to adaptively adjust the image base model.

Background Technology

[0002] Video understanding is a key and challenging task in computer vision. A crucial aspect of addressing this challenge lies in learning effective spatiotemporal representations from video data. However, due to the complex spatiotemporal dynamics exhibited by video data, training a video understanding model from scratch is inefficient and requires enormous amounts of data. In contrast, image understanding has made significant progress using image-based models. This has prompted exploration into applying image-based models to video understanding, as they provide robust pre-trained representations, reducing the reliance on training task-specific models from the ground up.

[0003] For example, Chinese patent application publication number CN115272941A discloses a weakly supervised video temporal action detection and classification method and system. However, this method performs video understanding by processing spatial and temporal information separately, which simplifies the complete modeling of the sequence. Therefore, image patches at different spatial and temporal locations can only interact indirectly and implicitly. This may be insufficient to capture the complex dynamics inherent in video data. To effectively capture spatiotemporal dynamics, traditional methods process the entire spatiotemporal sequence through global attention, but this leads to memory and speed efficiency issues due to excessive complexity.

[0004] Therefore, the spatiotemporal coupling characteristics of video data lead to the following problems when directly transferring image models:

[0005] (1) Spatiotemporal modeling is fragmented: Existing methods separate spatial feature extraction from temporal correlation analysis, resulting in limited spatiotemporal interaction;

[0006] (2) Explosive computational complexity: Although the global spatiotemporal attention mechanism can capture long-range dependencies, its computational complexity increases quadratically with the number of frames, making it difficult to handle long video sequences.

[0007] (3) Information loss and semantic gap: In order to reduce computational costs, most methods downsample or compress the video, resulting in the loss of high-frequency motion details and weakening the ability to capture fast movements or micro-changes.

[0008] In summary, there is currently a lack of video classification methods that can solve or partially solve the aforementioned problems.

Summary of the Invention

[0009] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a video classification method based on Mamba networks to adaptively adjust the image base model, so as to solve or partially solve the problems of high computational complexity, low efficiency and difficulty in capturing complete spatiotemporal dynamics when processing video data.

[0010] The objective of this invention can be achieved through the following technical solutions:

[0011] One aspect of the present invention provides a video classification method based on an adaptive adjustment of an image base model using a Mamba network. The method classifies target videos using the adapted image base model, wherein the process of obtaining the adapted image base model includes the following steps:

[0012] Step S1: Acquire video, and use a pre-trained image base model to encode the features of each frame in the video to obtain long sequence video features including frame time information;

[0013] Step S2: Group the long sequence video features to obtain multiple sub-sequences, and calculate the autocorrelation features of each sub-sequence based on the self-attention mechanism;

[0014] Step S3: Using the merged autocorrelation features as input to the Mamba network, an output feature sequence is obtained. The output feature sequence and the long sequence video features are then subjected to sequence modulation to obtain modulated features.

[0015] Step S4: Based on the modulated features, forward propagation is performed at each layer of the image base model, and steps S2-S4 are repeated to obtain the final features of the video.

[0016] Step S5: Freeze the parameters of the image base model, and based on the final features, use a classifier to achieve the category probability distribution. Train the Mamba network by calculating the loss function to realize the adaptive adjustment of the image base model.

[0017] As a preferred technical solution, step S1, the process of obtaining long sequence video features, includes the following steps:

[0018] Step S101: Extract frames from the acquired video to obtain a video frame sequence;

[0019] Step S102: Use a pre-trained image base model to perform image-level encoding on each video frame, and add position encoding information based on the time series.

[0020] Step S103: Merge the encoding results of all video frames to obtain long sequence video features.

[0021] As a preferred technical solution, step S2, the process of calculating the autocorrelation characteristics of each subsequence, includes the following steps:

[0022] Step S201: For the long sequence video features, use a 2D window to segment the video into multiple sub-sequences;

[0023] Step S202: For each subsequence, calculate its autocorrelation features using the self-attention layer weights pre-trained by the image base model.

[0024] Step S203: Combine the calculation results of all subsequences to obtain the combined autocorrelation features of the entire long video sequence.

[0025] As a preferred technical solution, the process of obtaining the modulated features in step S3 includes the following steps:

[0026] Step S301: Configure the state space model in the Mamba network so that it can accept the original video sequence as input and output two different output feature sequences as modulation parameters;

[0027] Step S302: Use a sequence modulation function to perform sequence modulation on the output feature sequence and the long sequence video features to obtain modulated features.

[0028] As a preferred technical solution, the modulation function is:

[0029] SeqMod(x,y1,y2) = x⊙y1 + x + y2

[0030] Where SeqMod(x,y1,y2) is the modulated feature, x is the long sequence video feature, and y1 and y2 are two different output feature sequences output by the state space model.

[0031] As a preferred technical solution, in step S4, the process of performing forward propagation at each layer of the image base model based on the modulated features, and repeating steps S2-S4 to obtain the final features of the video, includes the following steps:

[0032] Step S401: For the i-th layer of the image base model, process the current sequence using the weights of its forward propagation part to complete the adaptation and adjustment of the current model layer.

[0033] Step S402: Repeat steps S2-S401 until the adaptation adjustment for all layers of the current model is completed and the final features of the output video are obtained.

[0034] As a preferred technical solution, step S4 further includes:

[0035] Step S403: Perform temporal and spatial averaging on the final features of the video.

[0036] As a preferred technical solution, step S5 includes the following steps:

[0037] Step S501: Based on the language names of all categories to be classified, process them using a pre-trained language model to obtain an untrainable category classifier;

[0038] Step S502: Calculate the cosine similarity between the final features of the video and each category representative of the classifier, and obtain the category prediction probability distribution based on the similarity.

[0039] Step S503: Based on the comparison between the predicted probability distribution of the categories and the true labels, calculate the cross-entropy as the classification loss function;

[0040] Step S504: Based on the intermediate calculation results of the forward inference in step S4, and compared with the intermediate calculation results of the unadjusted base model, the mean square error is calculated as the distillation loss function.

[0041] Step S505: Based on the classification loss function and the distillation loss function, calculate the final loss function by weighted summation;

[0042] Step S506: Based on the final loss function, train the Mamba network using gradient descent until it converges.

[0043] As a preferred technical solution, the image base model is CLIP or SigLIP.

[0044] Another aspect of the present invention provides a video classification system based on Mamba Networks to adaptively adjust the image base model, for implementing the aforementioned video classification method based on Mamba Networks to adaptively adjust the image base model, the video classification system comprising:

[0045] The image encoding module is used to acquire video and use a pre-trained image base model to encode the features of each frame in the video to obtain long sequence video features including frame time information.

[0046] The autocorrelation calculation module is used to group the long sequence video features to obtain multiple sub-sequences, and calculate the autocorrelation features of each sub-sequence based on the self-attention mechanism.

[0047] The feature modulation module is used to take the merged autocorrelation features as input to the Mamba network to obtain an output feature sequence, and to perform sequence modulation on the output feature sequence and the long sequence video features to obtain modulated features.

[0048] The forward propagation module is used to perform forward propagation at each layer of the image base model based on the modulated features, repeating this process multiple times to obtain the final features of the video.

[0049] The training module is used to freeze the parameters of the image base model, and based on the final features, uses a classifier to achieve the category probability distribution. The Mamba network is trained by calculating the loss function to realize the adaptive adjustment of the image base model.

[0050] Compared with the prior art, the present invention has at least one of the following beneficial effects:

[0051] (1) Improved video recognition training accuracy: The Mamba network-based image base model adaptive video recognition method provided in this invention trains only the Mamba network while freezing the parameters of the pre-trained image base model. This allows for effective adjustment of video features by introducing only a small number of Mamba network modules without altering the original image base model structure. This method maximizes the utilization of the image pre-trained model weights, better captures the spatiotemporal dynamics in the video, and thus significantly improves the accuracy of video recognition.

[0052] (2) Improved Feature Representation Capability: This invention obtains multiple sub-sequences (i.e., segmentation) by grouping the long sequence video features. Based on the modulated features, forward propagation (i.e., modulation) is performed at each layer of the image base model, thereby providing a video recognition framework that includes a "segmentation and modulation" stage. This framework first applies window-based spatial local attention at each layer of the image base model, and then injects complete spatiotemporal information through a modulation function. This design not only improves the feature representation capability but also enhances the network's generalization ability, making it more adaptable to variations in different video data.

[0053] (3) Low computational complexity: By introducing the Mamba network and the State Space Model (SSM), this invention can achieve efficient processing of long sequence video features, reduce computational complexity, and significantly improve processing speed and memory utilization.

[0054] (4) Good training effect: By using a weighted training method of distillation loss function and classification loss function, combined with the adjustment of the base model and Mamba network, this invention can simultaneously optimize the video feature extraction capability and model learning efficiency, further improving the performance of video recognition tasks.

Attached Figure Description

[0055] Figure 1 is a flowchart of the video classification method based on a Mamba network to adaptively adjust the image base model in the embodiment.

[0056] Figure 2 is a schematic diagram illustrating the principle of adaptive adjustment of the image base model based on the Mamba network in the embodiment.

[0057] Figure 3 is a schematic diagram of the video classification system that adaptively adjusts the image base model based on the Mamba network in the embodiment.

[0058] Figure 4 is a schematic diagram of the electronic device in the embodiment.

Detailed Implementation

[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0060] Example 1

[0061] To address the aforementioned problems in existing technologies, this embodiment provides a video classification method based on Mamba networks to adaptively adjust the image base model. It aims to solve the problems of high computational complexity, low efficiency, and difficulty in capturing complete spatiotemporal dynamics that existing methods face when processing video data. Referring to Figure 1 and Figure 2, the method includes the following steps:

[0062] Step S1: Preprocess the video and encode it into long sequence video features using an image base model.

[0063] Specifically, step S1 may include steps S101-S103:

[0064] Step S101: Extract frames from the video to obtain a video frame sequence.

[0065] After video data is input into the system, it needs to be preprocessed to extract keyframes and perform feature extraction on each video frame. Specific methods include frame decomposition of the video data, extracting each frame into independent image units, and standardizing these images to ensure consistency and efficiency of the model input.

[0066] Step S102: Use a pre-trained image foundation model to perform image-level encoding on each video frame, and add temporal position embedding information to each frame based on its time series.

[0067] Image-level feature encoding is performed on each frame of the image using a pre-trained image base model (such as CLIP, SigLIP, etc.). The image base model is a deep network model pre-trained on a large-scale dataset, possessing powerful feature representation capabilities. Each frame of the image, after being encoded by the image base model, generates a high-dimensional feature vector.

[0068] To accommodate the temporal dependencies in video data, temporal position embedding is incorporated into the features of each frame. This process, by adding temporal position encoding to each image frame, allows the model to effectively distinguish the temporal information of different frames, thus laying the foundation for subsequent spatiotemporal feature modeling. Temporal position encoding can employ a similar method to position encoding in the Transformer model, embedding the temporal information of the frame as part of the input into the features.

[0069] Step S103: Merge the encoding results of all frames to obtain the long sequence video features $V \in \mathbb{R}^{HWT \times C}$.

[0070] The feature vectors of all video frames are merged into a long sequence feature matrix $V \in \mathbb{R}^{HWT \times C}$, where $H$, $W$, and $T$ represent the spatial dimensions (height and width) and the temporal dimension (number of frames) of the video, respectively, and $C$ is the dimension of the feature vector for each frame. The resulting long sequence video features then serve as input for subsequent processing.
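
Steps S102 and S103 can be illustrated with a minimal sketch. It assumes each frame has already been encoded by the frozen image base model into H·W tokens of dimension C, and it assumes a sinusoidal temporal encoding, since the text only says the encoding may be similar to Transformer position encoding. The function names `temporal_encoding` and `build_long_sequence` are hypothetical.

```python
import math

def temporal_encoding(t, dim):
    """Sinusoidal encoding of frame index t (Transformer-style; an assumption)."""
    enc = []
    for k in range(dim):
        freq = 1.0 / (10000 ** ((k - k % 2) / dim))
        enc.append(math.sin(t * freq) if k % 2 == 0 else math.cos(t * freq))
    return enc

def build_long_sequence(frame_tokens):
    """frame_tokens[t][p] is the C-dim feature of spatial position p in frame t.

    Adds the same temporal encoding to every token of frame t, then
    concatenates all frames into one sequence of length T*H*W.
    """
    sequence = []
    for t, tokens in enumerate(frame_tokens):
        enc = temporal_encoding(t, len(tokens[0]))
        for token in tokens:
            sequence.append([f + e for f, e in zip(token, enc)])
    return sequence

# Toy input: T=3 frames, H*W=4 positions per frame, C=2 channels, all zeros.
frames = [[[0.0, 0.0] for _ in range(4)] for _ in range(3)]
V = build_long_sequence(frames)
print(len(V), len(V[0]))  # sequence length HWT and channel width C
```

Tokens from the same frame receive the same offset, so frames stay distinguishable after they are flattened into a single sequence.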

[0071] Step S2: Group the long sequence video features and calculate autocorrelation features within each subsequence. Specifically, step S2 may include steps S201-S203:

[0072] Step S201: Divide the long video feature sequence $V \in \mathbb{R}^{HWT \times C}$ using a 2D window of size $w \times w$, resulting in multiple subsequences, the number of which is $N = HWT / w^2$.

[0073] To improve computational efficiency and reduce the computational burden on the model, the long video feature sequence $V$ is segmented into multiple subsequences. Specifically, a 2D window of size $w \times w$ is used to divide the video feature matrix $V$ into several smaller subsequences. The feature dimension of each subsequence remains the same as that of the original video features. The number of subsequences after segmentation is $N = HWT / w^2$. This makes each subsequence a local feature block, thereby reducing the time complexity of each computation.
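
The window partition and its effect on attention cost can be sketched as follows. The sketch assumes the flat sequence is ordered frame by frame and row by row within each frame, and that H and W are divisible by w; neither assumption is stated in the text. The sizes T=8, H=W=14, w=7 are illustrative only, and `window_partition` is a hypothetical name.

```python
def window_partition(V, T, H, W, w):
    """Split a flat (T*H*W)-token sequence into w-by-w spatial windows per frame.

    Assumes frame-major, then row-major token order, with H and W divisible by w.
    Returns a list of windows, each a list of w*w token indices into V.
    """
    assert len(V) == T * H * W and H % w == 0 and W % w == 0
    windows = []
    for t in range(T):
        for r0 in range(0, H, w):
            for c0 in range(0, W, w):
                windows.append([
                    t * H * W + r * W + c
                    for r in range(r0, r0 + w)
                    for c in range(c0, c0 + w)
                ])
    return windows

T, H, W, w = 8, 14, 14, 7
V = list(range(T * H * W))   # stand-in tokens; real entries would be C-dim vectors
windows = window_partition(V, T, H, W, w)

n_windows = len(windows)                   # N = HWT / w^2
global_pairs = (T * H * W) ** 2            # token pairs under global attention
local_pairs = n_windows * (w * w) ** 2     # token pairs under windowed attention
print(n_windows, global_pairs, local_pairs)
```

With these sizes the windowed variant scores N times fewer token pairs than global attention, because the ratio of the two counts is HWT / w^2.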

[0074] Step S202: For each subsequence, for the i-th layer of the image base model, calculate its autocorrelation features using the weights of its pre-trained self-attention layer.

[0075] For each subsequence, the autocorrelation feature of that subsequence is calculated using the self-attention weights of the i-th layer of the image base model. Since the subsequences are spatially local, using the self-attention mechanism to calculate the autocorrelation feature can preserve local features while avoiding the high computational cost of global computation.
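
A minimal single-head sketch of the per-window self-attention is shown here. It assumes standard scaled dot-product attention and uses identity matrices as stand-ins for the frozen pre-trained projection weights; a real base-model layer would be multi-head with learned weights. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def window_self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention restricted to one window."""
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
        out.append([sum(wt * v[d] for wt, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for frozen pre-trained weights
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = window_self_attention(window, identity, identity, identity)
print(len(attended), len(attended[0]))
```

Each token attends only to tokens in its own window, so the output keeps the window's length and channel width.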

[0076] Step S203: Merge the calculation results of all subsequences to obtain the processing result $x \in \mathbb{R}^{HWT \times C}$ of the entire long video sequence.

[0077] The autocorrelation calculation results of all subsequences are combined to obtain the processed result $x \in \mathbb{R}^{HWT \times C}$ of the entire long video sequence features. This provides pre-processed spatiotemporal characteristics for subsequent adjustment steps.

[0078] Step S3: Process the entire long video sequence using a Mamba network and modulate the result with the original features using a modulation function.

[0079] To further enhance the spatiotemporal adaptability of video features, this method introduces the State-Space Model (SSM) of the Mamba Network to adjust the entire long sequence of video features obtained in step S2. The Mamba Network, through its State-Space Model, can effectively model long-term dependencies, exhibiting good computational efficiency and scalability.

[0080] Specifically, step S3 may include steps S301-S302:

[0081] Step S301: Introduce the state space model (SSM) in the Mamba network as a basic module and adjust the output dimension of its last layer so that it can accept the original video sequence as input and output two different feature sequences as modulation parameters: y1, y2 = SSM(x).

[0082] The state-space model of the Mamba Network is tuned so that the output dimension of its last layer is adapted to the input video sequence. The state-space model of the Mamba Network takes video features x as input and outputs two feature sequences y1 and y2, which are used as modulation parameters to adjust the original video features.
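
The idea of a state-space model that scans the sequence and emits two modulation sequences can be sketched with a deliberately simplified recurrence. This is not the Mamba architecture: Mamba uses a selective SSM whose parameters depend on the input, while the sketch below uses fixed per-channel scalars. The names `ssm_two_outputs`, `a`, `b`, `c1`, and `c2` are hypothetical.

```python
def ssm_two_outputs(x, a, b, c1, c2):
    """Diagonal linear state-space scan over a sequence of C-dim tokens.

    State update per channel:  h_t = a * h_{t-1} + b * x_t
    Two read-outs per channel: y1_t = c1 * h_t,  y2_t = c2 * h_t
    """
    channels = len(x[0])
    h = [0.0] * channels
    y1, y2 = [], []
    for token in x:
        h = [a[d] * h[d] + b[d] * token[d] for d in range(channels)]
        y1.append([c1[d] * h[d] for d in range(channels)])
        y2.append([c2[d] * h[d] for d in range(channels)])
    return y1, y2

x = [[1.0, 2.0], [0.0, 0.0], [1.0, 0.0]]
y1, y2 = ssm_two_outputs(x, a=[0.5, 0.5], b=[1.0, 1.0],
                         c1=[1.0, 1.0], c2=[2.0, 2.0])
print(y1)
print(y2)
```

The scan visits each token once, so its cost grows linearly with sequence length, and the state `h` carries information from earlier tokens (including earlier frames) to later ones.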

[0083] Step S302: Design a sequence modulation function (SeqMod) to modulate the original features. Its specific form is: SeqMod(x,y1,y2) = x⊙y1 + x + y2, where ⊙ denotes the Hadamard (element-wise) product.

[0084] A sequence modulation function (SeqMod) is constructed to further enhance the expressive power of spatiotemporal information by weighted modulation of the input features. The specific modulation process is as follows:

[0085] SeqMod(x,y1,y2) = x⊙y1 + x + y2

[0086] Here, the symbol ⊙ represents the Hadamard (element-wise) product. This operation fuses the video features with the modulation parameters output by the Mamba network, injecting more spatiotemporal information and helping the model better understand the dynamic changes in video data.
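
A small worked example of SeqMod follows, treating ⊙ as an element-wise product over tokens and channels. The input values are made up for illustration. Since the formula can be rewritten as x⊙(1 + y1) + y2, zero-valued modulation sequences return x unchanged; the sketch checks this property, although the text does not say how the Mamba outputs are initialized.

```python
def seq_mod(x, y1, y2):
    """SeqMod(x, y1, y2) = x * y1 + x + y2, applied element-wise per token."""
    return [
        [xi * g + xi + s for xi, g, s in zip(xt, y1t, y2t)]
        for xt, y1t, y2t in zip(x, y1, y2)
    ]

x = [[1.0, -2.0], [0.5, 4.0]]      # two tokens, two channels
y1 = [[1.0, 0.5], [0.0, -1.0]]     # multiplicative modulation (scale)
y2 = [[0.0, 1.0], [2.0, 0.0]]      # additive modulation (shift)
zeros = [[0.0, 0.0], [0.0, 0.0]]

modulated = seq_mod(x, y1, y2)
unchanged = seq_mod(x, zeros, zeros)
print(modulated)
print(unchanged == x)
```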

[0087] Step S4: The modulation result is fed into the subsequent forward propagation process of the image base model, and steps S2-S4 are repeated for each layer to finally obtain the feature representation of the current video. Specifically, step S4 may include steps S401-S403:

[0088] Step S401: For the i-th layer of the image base model, process the current sequence using the weights of its feed-forward part to complete the adaptation and adjustment of the current model layer.

[0089] The modulated feature x is fed into each layer of the image base model for forward propagation. For each layer, the weights of the forward propagation portion of that layer are used to adaptively adjust the current feature, allowing it to be better integrated into the output of that layer. This process is performed sequentially, layer by layer, until the final output for the entire video is obtained. The output feature representation of the l-th layer is denoted as $y_l \in \mathbb{R}^{HWT \times C}$.

[0090] Step S402: Repeat steps S2 to S401 until the adaptation adjustment for all layers of the current model is completed and the output result $y_L \in \mathbb{R}^{HWT \times C}$ is obtained.

[0091] The model not only adapts to video features layer by layer, but also injects spatiotemporal information after each layer is completed. This step continues iteratively until the adjustment process of all model layers is completed, ultimately yielding the complete feature representation $y_L \in \mathbb{R}^{HWT \times C}$ of the video.
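
The per-layer loop over steps S2 to S4 can be sketched structurally with placeholder callables. The sketch follows the detailed description in taking the windowed-attention result x as the input to both the state-space model and the modulation. It omits residual connections and normalization, and uses scalar tokens for brevity. The dictionary keys and the function `adapted_forward` are hypothetical.

```python
def adapted_forward(V, layers, seq_mod):
    """Run steps S2-S4 once per frozen base-model layer.

    Each entry of `layers` is a dict of three callables (hypothetical names):
      'local_attn': windowed self-attention using frozen weights (S2)
      'ssm':        trainable Mamba block returning (y1, y2)     (S3)
      'ffn':        frozen feed-forward part of the layer        (S4)
    Returns the final features and the per-layer outputs y_1..y_L.
    """
    current, per_layer = V, []
    for layer in layers:
        x = layer["local_attn"](current)
        y1, y2 = layer["ssm"](x)
        modulated = seq_mod(x, y1, y2)
        current = layer["ffn"](modulated)
        per_layer.append(current)
    return current, per_layer

def seq_mod(x, y1, y2):
    return [xi * a + xi + b for xi, a, b in zip(x, y1, y2)]

toy_layer = {
    "local_attn": lambda seq: list(seq),                       # identity stand-in
    "ssm": lambda seq: ([0.0] * len(seq), [1.0] * len(seq)),   # scale 0, shift +1
    "ffn": lambda seq: [2.0 * s for s in seq],                 # doubling stand-in
}
final, outputs = adapted_forward([1.0, 2.0], [toy_layer, toy_layer], seq_mod)
print(final)
print(outputs)
```

Returning the per-layer outputs alongside the final features keeps the intermediate results available for a layer-wise comparison against the unadjusted base model.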

[0092] Step S403: Average the final output $y_L$ of the network over the temporal and spatial dimensions to obtain the feature representation $y_o \in \mathbb{R}^{C}$ of the video.

[0093] The final output $y_L$ is averaged over both time and space to obtain the final feature representation $y_o \in \mathbb{R}^{C}$ of the video. This feature represents the global spatiotemporal features of the video and supports the subsequent classification steps.
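
The pooling step reduces an HWT×C sequence to a single C-dimensional vector. When every frame contributes the same number of tokens, averaging over space and then over time equals one mean over all HWT tokens, which is what this sketch computes. The name `spatiotemporal_mean` is hypothetical.

```python
def spatiotemporal_mean(y_L):
    """Average a (HWT x C) feature sequence over all tokens, giving a C-dim vector."""
    n = len(y_L)
    return [sum(tok[d] for tok in y_L) / n for d in range(len(y_L[0]))]

# Toy features: 4 tokens (for example 2 frames of 2 positions), 2 channels.
y_L = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y_o = spatiotemporal_mean(y_L)
print(y_o)
```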

[0094] Step S5: Feed the feature representation into the classifier to obtain the predicted probability distribution of the categories, compare it with the true labels, calculate the loss function, and train the network. Specifically, step S5 may include steps S501-S506:

[0095] Step S501: Based on the language names of all categories to be classified, process them using a pre-trained language model to obtain an untrainable category classifier.

[0096] The video feature $y_o$ obtained after adjustment and forward propagation is fed into a classifier for category prediction. The classifier is built by using a pre-trained language model to convert the language name of each video category into a non-trainable category classifier.

[0097] Step S502: Calculate the cosine similarity between the obtained features and each category representative of the classifier, and obtain the category prediction probability distribution based on the similarity.

[0098] The cosine similarity between the video feature $y_o$ and the classifier's representative of each category is calculated, from which the predicted probability distribution over the categories is obtained.
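
A minimal sketch of the similarity-based classifier is given below. It assumes the category representatives are fixed embedding vectors produced by the pre-trained language model, and it assumes the similarities are turned into probabilities with a temperature-scaled softmax. The text does not specify that conversion, and the value 0.07 is an illustrative choice only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_probabilities(video_feature, class_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (temperature is assumed)."""
    sims = [cosine(video_feature, e) / temperature for e in class_embeddings]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

class_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for frozen text embeddings
probs = class_probabilities([3.0, 0.5], class_embeddings)
print(probs)
```

Because the category embeddings are never updated, the classifier itself adds no trainable parameters.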

[0099] Step S503: Compare the predicted probability distribution with the true labels and calculate the cross-entropy as the classification loss function.

[0100] The predicted probability distribution is compared with the true labels, and the cross-entropy loss is calculated as the basic loss function for model training.

[0101] Step S504: Simultaneously, collect the intermediate calculation results of the network forward inference and compare them with the intermediate calculation results of the unadjusted base model, and calculate the mean square error of the two as the distillation loss function.

[0102] Collect the intermediate computation results $\{y_1, y_2, \ldots, y_L\}$ of each layer of the image base model during forward inference in steps S2 to S4 of the network. These intermediate results are compared with those of the unadjusted base model, and the mean squared error between the two is calculated as the distillation loss function. This distillation loss helps the model better combine the characteristics of the base model and the Mamba network, improving its performance in video recognition tasks.

[0103] Step S505: Weight the classification loss function and the distillation loss function to obtain the final loss function.

[0104] The classification loss and distillation loss are weighted and summed to obtain the total loss function.
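
The weighted combination of the two losses can be sketched as follows. The text does not give the weights, so `distill_weight` and its default of 0.5 are hypothetical. Each layer output is flattened to a plain list of numbers to keep the example short.

```python
import math

def cross_entropy(probs, true_index):
    """Classification loss: negative log-probability of the true category."""
    return -math.log(probs[true_index])

def distillation_mse(adapted_outputs, base_outputs):
    """Mean squared error over all layers and entries (outputs flattened per layer)."""
    diffs = [
        (a - b) ** 2
        for ya, yb in zip(adapted_outputs, base_outputs)
        for a, b in zip(ya, yb)
    ]
    return sum(diffs) / len(diffs)

def total_loss(probs, true_index, adapted_outputs, base_outputs, distill_weight=0.5):
    """Weighted sum of classification and distillation losses (weight is assumed)."""
    return (cross_entropy(probs, true_index)
            + distill_weight * distillation_mse(adapted_outputs, base_outputs))

probs = [0.5, 0.5]            # predicted category distribution
adapted = [[1.0, 2.0]]        # one layer's output from the adapted model
base = [[1.0, 0.0]]           # same layer's output from the unadjusted model
loss = total_loss(probs, 0, adapted, base)
print(loss)
```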

[0105] Step S506: Freeze all base model parameters in the network, train only the parameters of the Mamba network, backpropagate the loss function, and train the network using gradient descent until convergence.

[0106] With the base model parameters frozen, the loss is backpropagated through the network and only the Mamba network parameters are updated by gradient descent until the model converges.

[0107] In summary, this method has the following characteristics:

[0108] (1) Improve video recognition training accuracy: This method freezes most of the parameters of the pre-trained image base model, and can effectively adjust video features by introducing only a small number of Mamba network modules without changing the original image base model structure. This method can maximize the use of the image pre-trained model weights, better capture the spatiotemporal dynamics in the video, and thus significantly improve the accuracy of video recognition.

[0109] (2) Novel Video Recognition Framework: This method provides a unique video recognition framework that includes a "segmentation and modulation" stage. This framework first applies window-based spatial local attention at each layer of the image base model, then injects complete spatiotemporal information through a modulation function. This design not only improves feature representation capabilities but also enhances the network's generalization ability, making it more adaptable to variations in different video data.

[0110] (3) Efficient computation method: By introducing Mamba network and state space model (SSM), it is possible to efficiently process long sequence video features, reduce computational complexity, and significantly improve processing speed and memory utilization.

[0111] (4) Effective training process: By using a weighted training method of distillation loss function and classification loss function, combined with the adjustment of the base model and Mamba network, the ability to extract video features and the learning efficiency of the model can be optimized at the same time, further improving the performance of video recognition task.

[0112] Example 2

[0113] Based on Example 1 and referring to Figure 3, this embodiment provides a video classification system that adapts an image base model based on a Mamba network, used to implement the video classification method based on a Mamba network for adapting an image base model in Example 1. The video classification system includes:

[0114] (1) Image encoding module, used to acquire video, and use a pre-trained image base model to encode the features of each frame in the video to obtain long sequence video features including frame time information.

[0115] (2) Autocorrelation calculation module, used to group the long sequence video features to obtain multiple sub-sequences, and calculate the autocorrelation features of each sub-sequence based on the self-attention mechanism.

[0116] (3) Feature modulation module, used to take the merged autocorrelation features as input to the Mamba network to obtain the output feature sequence, and to perform sequence modulation on the output feature sequence and the long sequence video features to obtain the modulated features.

[0117] (4) Forward propagation module, used to perform forward propagation at each layer of the image base model based on the modulated features, repeating multiple times to obtain the final features of the video.

[0118] (5) Training module, used to freeze the parameters of the image base model, and based on the final features, use a classifier to achieve the category probability distribution, and train the Mamba network by calculating the loss function to realize the adaptive adjustment of the image base model.

[0119] This invention provides a novel video recognition framework that improves video recognition performance through the adjustment mechanism of Mamba networks without altering the basic model structure. Compared to traditional methods, this invention offers better spatiotemporal feature modeling capabilities and computational efficiency, making it suitable for large-scale video analysis tasks and possessing strong scalability and practical value.

[0120] Example 3

[0121] Based on the foregoing embodiments, this embodiment provides an electronic device, including: one or more processors and a memory, wherein the memory stores one or more programs, the one or more programs including instructions for executing the video classification method based on the Mamba network to adaptively adjust the image base model as described in Embodiment 1.

[0122] As shown in Figure 4, at the hardware level, the electronic device includes a processor, internal bus, network interface, memory, and non-volatile memory, and may also include other hardware required for the business operations. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the method described in Figure 1. Of course, in addition to software implementation, this invention does not exclude other implementation methods, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the processing flow is not limited to logic units, but can also be hardware or logic devices.

[0123] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0124] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0125] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A video classification method based on Mamba network for adaptive adjustment of image base model, characterized in that the target video is classified using an adaptively adjusted image base model, and the process of obtaining the adaptively adjusted image base model includes the following steps: Step S1: acquire a video, and use a pre-trained image base model to encode the features of each frame in the video to obtain long sequence video features including frame time information; Step S2: group the long sequence video features to obtain multiple sub-sequences, and calculate the autocorrelation features of each sub-sequence based on the self-attention mechanism; Step S3: use the merged autocorrelation features as input to the Mamba network to obtain an output feature sequence, and perform sequence modulation on the output feature sequence and the long sequence video features to obtain modulated features; Step S4: based on the modulated features, perform forward propagation at each layer of the image base model, and repeat steps S2-S4 to obtain the final features of the video; Step S5: freeze the parameters of the image base model, obtain the class probability distribution from the final features using a classifier, and train the Mamba network by calculating the loss function to achieve adaptive adjustment of the image base model; wherein, in step S3, the process of obtaining the modulated features includes the following steps: Step S301: configure the state space model in the Mamba network so that it can accept the original video sequence as input and output two different output feature sequences as modulation parameters; Step S302: use a sequence modulation function to perform sequence modulation on the output feature sequence and the long sequence video features to obtain the modulated features.

2. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that, in step S1, the process of obtaining the long sequence video features includes the following steps: Step S101: extract frames from the acquired video to obtain a video frame sequence; Step S102: use a pre-trained image base model to perform image-level encoding on each video frame, and add position encoding information based on the time series; Step S103: merge the encoding results of all video frames to obtain the long sequence video features.
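Steps S101 to S103 can be illustrated with a minimal Python sketch. The uniform sampling rule, the sinusoidal form of the temporal position encoding, and the names `sample_frame_indices`, `temporal_encoding`, and `build_long_sequence` are assumptions made for illustration; the claim does not specify them. The per-frame token features are taken as given here, whereas in the method they would come from the pre-trained image base model.

```python
import math

def sample_frame_indices(num_video_frames, num_samples):
    """S101 (assumed rule): pick frame indices spread uniformly over the video."""
    step = num_video_frames / num_samples
    return [min(num_video_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]

def temporal_encoding(t, dim):
    """Assumed sinusoidal encoding of frame index t (a learned embedding would also fit)."""
    enc = []
    for k in range(dim):
        freq = 1.0 / (10000 ** ((2 * (k // 2)) / dim))
        enc.append(math.sin(t * freq) if k % 2 == 0 else math.cos(t * freq))
    return enc

def build_long_sequence(frame_tokens):
    """S102-S103: add the temporal encoding to each token of frame t, then concatenate frames.

    frame_tokens[t][n] is the feature vector of token n in frame t.
    """
    sequence = []
    for t, tokens in enumerate(frame_tokens):
        pos = temporal_encoding(t, len(tokens[0]))
        for tok in tokens:
            sequence.append([v + p for v, p in zip(tok, pos)])
    return sequence
```

With T frames of N tokens each, the resulting sequence has T x N entries, and tokens from different frames differ by their temporal encoding even when their image features are identical.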

3. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that, in step S2, the process of calculating the autocorrelation features of each sub-sequence includes the following steps: Step S201: for the long sequence video features, use a 2D window to segment the video into multiple sub-sequences; Step S202: for each sub-sequence, calculate its autocorrelation features using the self-attention layer weights pre-trained by the image base model; Step S203: combine the calculation results of all sub-sequences to obtain the combined autocorrelation features of the entire long video sequence.
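The window partitioning and per-window self-attention of steps S201 to S203 can be sketched as follows. This is a toy illustration under stated assumptions: the 2D window is taken to span `win_t` frames by `win_n` tokens of a (frame, token) grid, and the query, key, and value projections are identity maps standing in for the pre-trained self-attention weights of the image base model. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(len(q))])
    return out

def windowed_self_attention(grid, win_t, win_n):
    """grid[t][n] is a token vector; each window covers win_t frames by win_n tokens."""
    T, N = len(grid), len(grid[0])
    merged = [[None] * N for _ in range(T)]
    for t0 in range(0, T, win_t):
        for n0 in range(0, N, win_n):
            # S201: collect the coordinates of one window (one sub-sequence)
            coords = [(t, n) for t in range(t0, min(t0 + win_t, T))
                      for n in range(n0, min(n0 + win_n, N))]
            # S202: attention is computed only among tokens of the same window
            attended = self_attention([grid[t][n] for t, n in coords])
            # S203: write the results back to their original positions
            for (t, n), vec in zip(coords, attended):
                merged[t][n] = vec
    return merged
```

Because attention is restricted to each window, its cost grows with the window size rather than with the full sequence length, and a change to a token in one window leaves the outputs of every other window unchanged.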

4. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that the modulation function is: [formula not reproduced in this text], in which the symbols denote, respectively, the modulated features, the long sequence video features, and the two distinct output feature sequences from the state-space model.
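The idea of a state-space model that emits two sequences used as modulation parameters can be sketched in Python. The modulation formula itself does not survive in this text, so the affine scale-and-shift form below is purely an assumption chosen for illustration and is not the claimed function. The scan is also a toy scalar linear recurrence with fixed coefficients, whereas a Mamba network uses input-dependent (selective) parameters. All names and default values are hypothetical.

```python
def ssm_two_outputs(inputs, decay=0.9, c_scale=0.5, c_shift=0.1):
    """Toy linear state-space scan: h_t = decay * h_{t-1} + x_t, with two readouts per step."""
    h = 0.0
    scales, shifts = [], []
    for x in inputs:
        h = decay * h + x
        scales.append(c_scale * h)   # first output sequence
        shifts.append(c_shift * h)   # second output sequence
    return scales, shifts

def modulate(features, scales, shifts):
    """Assumed affine modulation: x * (1 + scale) + shift, element by element."""
    return [x * (1.0 + a) + b for x, a, b in zip(features, scales, shifts)]
```

In the flow of claim 1, the scan would receive the merged autocorrelation features, and `modulate` would be applied to the long sequence video features. With this assumed form, zero-valued outputs from the state-space model leave the features unchanged, so an adapter initialised near zero starts from the behaviour of the unmodified base model.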

5. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that, in step S4, performing forward propagation at each layer of the image base model based on the modulated features and repeating steps S2-S4 to obtain the final features of the video includes the following steps: Step S401: for the i-th layer of the image base model, use the weights of its forward propagation part to process the current sequence, thus completing the adaptive adjustment of the current model layer; Step S402: repeat steps S2-S401 until the adaptive adjustment of all layers of the model is completed, and output the final features of the video.
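The layer-by-layer loop of steps S401 and S402 can be sketched as below. This is a structural illustration only: each base-model layer is represented by two hypothetical callables, `"attention"` and `"feed_forward"`, and reading "the weights of its forward propagation part" as a feed-forward sub-layer is an assumption. The affine modulation inside the loop is likewise an assumed form.

```python
def adapted_forward(sequence, layers, adapters):
    """Run steps S2 to S401 once per layer of the frozen base model.

    layers[i]["attention"]    : self-attention of layer i with pre-trained weights (S2)
    layers[i]["feed_forward"] : forward-propagation part of layer i (S401)
    adapters[i]               : trainable map from attention output to two sequences (S3)
    """
    intermediates = []
    for layer, adapter in zip(layers, adapters):
        autocorr = layer["attention"](sequence)
        scales, shifts = adapter(autocorr)
        # Assumed affine modulation of the current sequence by the adapter outputs
        modulated = [x * (1.0 + a) + b for x, a, b in zip(sequence, scales, shifts)]
        sequence = layer["feed_forward"](modulated)
        intermediates.append(sequence)
    return sequence, intermediates

# Toy usage: two layers that double their input, with adapters that output zeros
double = {"attention": lambda s: s, "feed_forward": lambda s: [2.0 * x for x in s]}
zero_adapter = lambda a: ([0.0] * len(a), [0.0] * len(a))
final, per_layer = adapted_forward([1.0, 2.0], [double, double], [zero_adapter, zero_adapter])
```

Only the adapters carry trainable parameters in this arrangement; the layer callables stay fixed. The per-layer outputs are returned as well because a training loss may compare them with those of the unadjusted model.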

6. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 5, characterized in that step S4 further includes: Step S403: perform temporal and spatial averaging on the final features of the video.
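Step S403 can be illustrated with a short Python sketch, assuming the final features are laid out as `features[t][n]` (frame `t`, token `n`); the layout and the function name are assumptions.

```python
def spatiotemporal_average(features):
    """Average over all frames and all tokens, returning one vector per video."""
    count = sum(len(frame) for frame in features)
    dim = len(features[0][0])
    return [sum(tok[d] for frame in features for tok in frame) / count for d in range(dim)]
```

When every frame has the same number of tokens, averaging over all tokens at once gives the same result as averaging spatially within each frame and then averaging the frame vectors over time.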

7. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that step S5 includes the following steps: Step S501: process the language names of all categories to be classified using a pre-trained language model to obtain a non-trainable category classifier; Step S502: calculate the cosine similarity between the final features of the video and each category representation of the classifier, and obtain the category prediction probability distribution based on the similarity; Step S503: based on the comparison between the category prediction probability distribution and the true labels, calculate the cross-entropy as the classification loss function; Step S504: compare the intermediate calculation results of the forward inference in step S4 with the intermediate calculation results of the unadjusted base model, and calculate the mean square error as the distillation loss function; Step S505: calculate the final loss function as a weighted sum of the classification loss function and the distillation loss function; Step S506: based on the final loss function, train the Mamba network using gradient descent until it converges.
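The loss computation of steps S502 to S505 can be sketched in Python. The softmax temperature, the loss weights, and the function names are assumptions; the claim states only that the probabilities are derived from cosine similarity and that the two losses are combined by weighted summation. The class embeddings are taken as given, whereas in step S501 they would come from a pre-trained language model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_probabilities(video_feature, class_embeddings, temperature=0.07):
    """S502: softmax over cosine similarities (the temperature value is an assumption)."""
    logits = [cosine(video_feature, c) / temperature for c in class_embeddings]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """S503: negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def mse(a, b):
    """S504: mean square error between adapted and unadjusted intermediate results."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def total_loss(probs, label, adapted_feats, base_feats, cls_weight=1.0, distill_weight=1.0):
    """S505: weighted sum of the classification loss and the distillation loss."""
    return cls_weight * cross_entropy(probs, label) + distill_weight * mse(adapted_feats, base_feats)
```

The distillation term penalises drift of the adapted intermediate results away from those of the unadjusted base model, while the cross-entropy term drives the classification itself. In step S506 only the Mamba network's parameters would receive gradient updates from this loss.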

8. The video classification method based on Mamba network for adaptive adjustment of image base model according to claim 1, characterized in that the image base model is CLIP or SigLIP.

9. A video classification system based on Mamba network for adaptive adjustment of image base model, characterized in that the system is used to implement the video classification method based on Mamba network for adaptive adjustment of image base model as described in any one of claims 1-8, and includes: an image encoding module, used to acquire a video and use a pre-trained image base model to encode the features of each frame in the video to obtain long sequence video features including frame time information; an autocorrelation calculation module, used to group the long sequence video features to obtain multiple sub-sequences, and calculate the autocorrelation features of each sub-sequence based on the self-attention mechanism; a feature modulation module, used to take the merged autocorrelation features as input to the Mamba network to obtain an output feature sequence, and to perform sequence modulation on the output feature sequence and the long sequence video features to obtain modulated features; a forward propagation module, used to perform forward propagation at each layer of the image base model based on the modulated features, repeating this process multiple times to obtain the final features of the video; and a training module, used to freeze the parameters of the image base model, obtain the category probability distribution from the final features using a classifier, and train the Mamba network by calculating the loss function to realize the adaptive adjustment of the image base model.