Visual Representation Method and Device Based on Bidirectional State-Space Model

By designing the Vim architecture, two-dimensional images are converted into one-dimensional sequences and combined with bidirectional state space modeling and position embedding, solving the problem of visual representation of high-resolution images and long-distance dependencies. This achieves efficient visual representation learning and is suitable for tasks such as image classification and semantic segmentation.

CN117876845B · Active · Publication Date: 2026-06-30 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2024-01-15
Publication Date
2026-06-30


Abstract

This invention discloses a visual representation method based on a bidirectional state-space model, Vision Mamba (Vim). The Vim model first splits the input image into a series of image patches and linearly projects them into a vector sequence, which is then fed into the Vim blocks for efficient sequence modeling. This method is the first to apply the Mamba state-space model to the field of computer vision, and it introduces bidirectional state-space modeling to compensate for the lack of global context in processing visual data. It also uses position embedding to provide spatial information and location awareness, making the model more robust in dense prediction tasks such as semantic segmentation, object detection, and instance segmentation. Furthermore, thanks to the efficient design of the Mamba algorithm, Vim exhibits sub-quadratic time complexity and linear memory complexity, a significant efficiency advantage over visual models based on the Transformer structure. This invention also provides a corresponding visual representation device based on the bidirectional state-space model.

Description

Technical Field

[0001] This invention belongs to the field of deep learning and computer vision technology, and more specifically, relates to a visual representation method and apparatus based on a bidirectional state-space model.

Background Technology

[0002] In recent years, with the continuous development of deep learning technology, many innovative models and methods have emerged in the field of visual representation learning, bringing new possibilities to image processing tasks. In this field, Convolutional Neural Networks (CNNs) have remained a fundamental model, achieving groundbreaking results in areas such as image classification, object detection, and semantic segmentation. Through local receptive fields and parameter sharing mechanisms, they have successfully captured spatial correlations. However, with the increasing complexity of tasks and the growing demand for global contextual understanding, CNNs have revealed certain limitations when handling large-scale or long-distance dependencies.

[0003] To overcome these limitations, the Transformer architecture emerged, initially achieving significant success in natural language processing. Subsequently, the Vision Transformer (ViT) introduced the Transformer to computer vision, treating an image as a one-dimensional sequence of patches. This approach gives each image patch adaptive global contextual information, addressing the difficulty that fixed filters in CNNs have in capturing features that vary with position. However, for high-resolution images, the self-attention computation in ViT is costly in both memory consumption and computational complexity, which limits efficiency, especially in dense prediction tasks.

[0004] Against this backdrop, the State Space Model (SSM) was introduced, a modeling framework derived from continuous systems theory. SSMs possess linear time complexity and hardware-friendly characteristics, demonstrating great potential in processing sequential data, especially extremely long sequences. Several SSM-based methods, such as S4 and Mamba, have been successfully applied to tasks involving audio, video, and other diverse sequential data, effectively solving the problem of modeling long-term dependencies. In certain scenarios, these SSM-based methods have even demonstrated superior scalability and computational efficiency compared to traditional Transformers.
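For reference, the continuous-time system behind S4-style and Mamba-style models is commonly written as shown below, together with its zero-order-hold discretization with step size $\Delta$. The symbols $A$, $B$, $C$, $\Delta$ and the hidden state $h$ follow the public SSM literature and are not defined in this patent text; the block is a sketch of the general framework, not a statement of the claimed method.

```latex
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), & y(t) &= C\,h(t) \\
\bar{A} &= \exp(\Delta A), & \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \\
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, & y_t &= C\,h_t
\end{aligned}
```

Because the discrete recurrence updates a fixed-size state once per token, a sequence of length M needs M state updates, which is where the linear scaling in sequence length comes from.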

[0005] Overall, deep learning technology has brought diverse models and methods to the field of visual representation learning. Convolutional Neural Networks (CNNs) play a crucial role in image processing, but they have some limitations in handling global context and long-range dependencies. The Transformer architecture and its applications in computer vision (such as the Vision Transformer) have solved some problems by introducing self-attention mechanisms, but they still face challenges in handling complex tasks and high-resolution images. Meanwhile, state-space model-based methods (such as S4 and Mamba) have demonstrated linear time complexity and hardware-friendly characteristics in processing sequential data, providing an effective solution for modeling long-term dependencies. In the future, the development of deep learning technology will continue to drive progress in visual representation learning, bringing more possibilities to the field of image processing.

Summary of the Invention

[0006] To address the challenges currently faced in visual representation learning, particularly in handling large-scale high-resolution images and long-range dependencies, we propose a novel solution. This solution, based on a state-space model, aims to explore how to leverage the advantages of bidirectional state-space models to solve the problems of global context understanding and position sensitivity in visual data. We design and introduce a new architecture called Vision Mamba (Vim), combining the efficient design of the Mamba algorithm, employing bidirectional compression modeling to adapt to the characteristics of visual tasks, and enhancing the model's spatial awareness through position embedding, providing a general and efficient solution for visual representation learning. By introducing bidirectional state-space modeling, Vim achieves significant progress in global context understanding. Its bidirectional compression modeling enables it to better capture long-range dependencies, improving the model's adaptability. Simultaneously, the introduction of position embedding enhances Vim's spatial awareness, allowing it to better understand the features of different regions in an image. This innovative combination of designs makes Vim a comprehensive visual representation learning model that achieves outstanding performance across various tasks. Notably, Vim not only demonstrates excellent performance but also high efficiency in computational resource utilization. Its low GPU memory footprint, FLOPs, and inference time cost provide a feasible solution for practical applications. This makes Vim not only a theoretical innovation but also a technological achievement with practical application potential. In summary, by introducing the Vim architecture, we have successfully integrated the advantages of state-space models into the field of visual representation learning, providing a comprehensive and efficient solution. 
This innovation is expected to open up new directions for the future development of visual representation learning technology and bring more flexible and powerful tools to the field of image processing.

[0007] To achieve the above objectives, according to one aspect of the present invention, a visual representation method based on a bidirectional state-space model is provided, comprising the following steps:

[0008] (1) Convert the two-dimensional image into a one-dimensional long sequence. This includes the following sub-steps:

[0009] (1.1) For a two-dimensional image in $\mathbb{R}^{H \times W \times C}$, perform a uniform cropping operation to obtain a set of image patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where H is the height of the image, W is the width of the image, C is the number of channels of the image, P is the side length of each cropped square image patch, and N is the total number of image patches obtained after cropping.

[0010] (1.2) Convert the cropped image patch set into a one-dimensional sequence, as shown in the following formula:

[0011] $T_0 = [x_{cls};\ x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W] + E_{pos}$

[0012] where $x_p^i$ represents the i-th image patch, $W \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the linear projection parameter matrix, $x_{cls} \in \mathbb{R}^{D}$ is the class token, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is the position embedding, and D is the dimension of an image patch after linear projection. Specifically, this includes the following sub-steps:

[0013] (1.2.1) Use linear projection to transform each image patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ into $x_p^i W \in \mathbb{R}^{D}$. The same learnable parameter matrix $W$ is used for every image patch. After this dimensional transformation, the patches form the sequence $[x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W]$.

[0014] (1.2.2) Concatenate the class token $x_{cls}$ to the beginning of the sequence of dimension-transformed image patches, which increases the sequence length by one.

[0015] (1.2.3) Add the position embedding $E_{pos}$ to the sequence to obtain the sequence data $T_0$ to be input into the model.
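As a worked example of the sizes in step (1), the arithmetic below assumes a 224 × 224 RGB image and a patch side length of 16. These values are illustrative only and do not come from the patent text.

```latex
\begin{aligned}
N &= \frac{H \cdot W}{P^2} = \frac{224 \cdot 224}{16^2} = 196 \\
P^2 \cdot C &= 16^2 \cdot 3 = 768 \\
\text{sequence length after adding the class token} &= N + 1 = 197
\end{aligned}
```

Under these assumptions each patch is a 768-dimensional vector before projection, and the encoder receives 197 tokens of dimension D.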

[0016] (2) Use multiple Vim blocks stacked together to form a Vim encoder to model the input sequence. This includes the following sub-steps:

[0017] (2.1) Preprocess the input sequence before state-space modeling. This includes the following sub-steps:

[0018] (2.1.1) Normalize the output $T_{l-1}$ of the previous Vim block to obtain $T'_{l-1} \in \mathbb{R}^{B \times M \times D}$, where B is the batch size, M is the sequence length, and D is the hidden dimension of the sequence.

[0019] (2.1.2) Use linear layers to perform a dimensional transformation on $T'_{l-1}$ to obtain $x \in \mathbb{R}^{B \times M \times d}$ and $z \in \mathbb{R}^{B \times M \times d}$, where d is the hidden dimension after transformation.

[0020] (2.2) Perform forward state-space modeling on $x$. This includes the following sub-steps:

[0021] (2.2.1) Apply a one-dimensional convolution to the sequence $x$ and activate the result with the SiLU function to obtain $x'_o$.

[0022] (2.2.2) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, and $C_o$ from $x'_o$.

[0023] (2.2.3) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula:

[0024] $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$

[0025] (2.2.4) Multiply $y_o$ element-wise with $z$ after the SiLU activation function to obtain the result of the forward state-space modeling, as shown in the following formula:

[0026] $y'_{forward} = y_o \odot \mathrm{SiLU}(z)$

[0027] (2.3) Perform backward (reverse) state-space modeling on $x$. This includes the following sub-steps:

[0028] (2.3.1) Flip the sequence $x$ along the sequence dimension so that the order of its elements is reversed.

[0029] (2.3.2) Apply a one-dimensional convolution to the reversed sequence and activate the result with the SiLU function to obtain $x'_o$.

[0030] (2.3.3) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, and $C_o$ from $x'_o$.

[0031] (2.3.4) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula:

[0032] $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$

[0033] (2.3.5) Multiply $y_o$ element-wise with $z$ after the SiLU activation function to obtain the result of the backward state-space modeling, as shown in the following formula:

[0034] $y'_{backward} = y_o \odot \mathrm{SiLU}(z)$

[0035] (2.4) Sum the forward modeling sequence $y'_{forward}$ and the backward modeling sequence $y'_{backward}$ obtained from the bidirectional state-space modeling, apply a linear projection, and add a residual connection with the input sequence to obtain the output sequence, as shown in the following formula:

[0036] $T_l = \mathrm{Linear}(y'_{forward} + y'_{backward}) + T_{l-1}$

[0037] where $T_{l-1}$ is the input sequence of the Vim block and Linear is a linear projection function.

[0038] (3) Extract the class token at its corresponding position in the modeled sequence, and apply a multilayer perceptron (MLP) to it to obtain the class prediction result for the image sequence.

[0039] (4) Use the Vim model trained in the above steps as the backbone network and fine-tune it on different downstream tasks.

[0040] According to another aspect of the present invention, a visual representation device based on a bidirectional state-space model is also provided, including at least one processor and a memory, wherein the at least one processor and the memory are connected via a data bus, and the memory stores instructions that can be executed by the at least one processor, wherein the instructions, after being executed by the processor, are used to complete the visual representation method based on the bidirectional state-space model.

[0041] In summary, the technical solutions conceived by this invention have the following beneficial effects compared with the prior art:

[0042] (1) High efficiency: The invention utilizes the Mamba architecture to build Vim, which achieves sub-quadratic time complexity computation in modern vision tasks, and the memory usage increases linearly.

[0043] (2) Good results: The bidirectional state-space model for global visual context modeling introduced into the Mamba structure improves Vim's visual representation performance. Extensive experiments have demonstrated Vim's excellent performance in image classification tasks and various downstream dense prediction tasks.

Attached Figure Description

[0044] Figure 1 is a flowchart of the visual representation method based on a bidirectional state-space model according to the present invention.

Detailed Implementation

[0045] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0046] This invention relates to the fields of deep learning and computer vision, covering image classification, semantic segmentation, object detection, and instance segmentation, which are fundamental tasks in computer vision. It also involves foundational visual models and the pre-training and fine-tuning paradigm.

[0047] As shown in Figure 1, the visual representation method based on a bidirectional state-space model of the present invention includes the following steps:

[0048] (1) Convert the two-dimensional image into a one-dimensional long sequence. This includes the following sub-steps:

[0049] (1.1) For a two-dimensional image in $\mathbb{R}^{H \times W \times C}$, perform a uniform cropping operation to obtain a set of image patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where H is the height of the image, W is the width of the image, C is the number of channels of the image, P is the side length of each cropped square image patch, and N is the total number of image patches obtained after cropping.

[0050] (1.2) Convert the cropped image patch set into a one-dimensional sequence, as shown in the following formula:

[0051] $T_0 = [x_{cls};\ x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W] + E_{pos}$

[0052] where $x_p^i$ represents the i-th image patch, $W \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the linear projection parameter matrix, $x_{cls} \in \mathbb{R}^{D}$ is the class token, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is the position embedding, and D is the dimension of an image patch after linear projection. Specifically, this includes the following sub-steps:

[0053] (1.2.1) Use linear projection to transform each image patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ into $x_p^i W \in \mathbb{R}^{D}$. The same learnable parameter matrix $W$ is used for every image patch. After this dimensional transformation, the patches form the sequence $[x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W]$.

[0054] (1.2.2) Concatenate the class token $x_{cls}$ to the beginning of the sequence of dimension-transformed image patches, which increases the sequence length by one.

[0055] (1.2.3) Add the position embedding $E_{pos}$ to the sequence to obtain the sequence data $T_0$ to be input into the model.
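Step (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `patchify` and `embed`, the toy sizes, and the random initialization are all hypothetical, and a single image is processed without a batch dimension.

```python
import numpy as np

def patchify(image, P):
    """Step (1.1): split an (H, W, C) image into N flattened P x P x C patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

def embed(image, P, W_proj, x_cls, E_pos):
    """Steps (1.2.1)-(1.2.3): project patches, prepend class token, add positions."""
    tokens = patchify(image, P) @ W_proj             # (N, D), one shared matrix
    seq = np.concatenate([x_cls[None, :], tokens])   # (N + 1, D), class token first
    return seq + E_pos                               # (N + 1, D)

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 16
N = (H // P) * (W // P)
image = rng.standard_normal((H, W, C))
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
x_cls = np.zeros(D)
E_pos = rng.standard_normal((N + 1, D)) * 0.02

T0 = embed(image, P, W_proj, x_cls, E_pos)
print(T0.shape)  # (17, 16): N = 16 patches plus one class token
```

In a trained model `W_proj`, `x_cls`, and `E_pos` would be learned parameters; here they are random or zero placeholders.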

[0056] (2) Use multiple Vim blocks stacked together to form a Vim encoder to model the input sequence. This includes the following sub-steps:

[0057] (2.1) Preprocess the input sequence before state-space modeling. This includes the following sub-steps:

[0058] (2.1.1) Normalize the output $T_{l-1}$ of the previous Vim block to obtain $T'_{l-1} \in \mathbb{R}^{B \times M \times D}$, where B is the batch size, M is the sequence length, and D is the hidden dimension of the sequence.

[0059] (2.1.2) Use linear layers to perform a dimensional transformation on $T'_{l-1}$ to obtain $x \in \mathbb{R}^{B \times M \times d}$ and $z \in \mathbb{R}^{B \times M \times d}$, where d is the hidden dimension after transformation.
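The preprocessing in step (2.1) can be sketched as below. The text only says "normalization", so the use of layer normalization over the hidden dimension is an assumption, as are the weight names `W_x` and `W_z` and the toy sizes.

```python
import numpy as np

def layer_norm(t, eps=1e-5):
    """Assumed normalization: zero mean, unit variance over the last axis."""
    mean = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
B, M, D, d = 2, 17, 16, 32                 # hypothetical sizes
T_prev = rng.standard_normal((B, M, D))    # output of the previous Vim block
W_x = rng.standard_normal((D, d)) * 0.02   # hypothetical linear layer for x
W_z = rng.standard_normal((D, d)) * 0.02   # hypothetical linear layer for z

T_norm = layer_norm(T_prev)   # step (2.1.1): (B, M, D)
x = T_norm @ W_x              # step (2.1.2): (B, M, d), input to both SSM branches
z = T_norm @ W_z              # step (2.1.2): (B, M, d), gate used in (2.2.4) and (2.3.5)
print(x.shape, z.shape)
```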

[0060] (2.2) Perform forward state-space modeling on $x$. This includes the following sub-steps:

[0061] (2.2.1) Apply a one-dimensional convolution to the sequence $x$ and activate the result with the SiLU function to obtain $x'_o$.

[0062] (2.2.2) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, and $C_o$ from $x'_o$.

[0063] (2.2.3) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula:

[0064] $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$

[0065] (2.2.4) Multiply $y_o$ element-wise with $z$ after the SiLU activation function to obtain the result of the forward state-space modeling, as shown in the following formula:

[0066] $y'_{forward} = y_o \odot \mathrm{SiLU}(z)$
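The forward branch of step (2.2) might look like the sketch below for a single sequence (no batch dimension). Several details are assumptions borrowed from the public Mamba formulation rather than taken from this text: the convolution is causal and depthwise, the step size uses a softplus, and the discretization is $\bar{A} = \exp(\Delta A)$, $\bar{B} = \Delta B$. All parameter names in the dictionary `p` are hypothetical.

```python
import numpy as np

def silu(v):
    return v / (1.0 + np.exp(-v))

def causal_conv1d(x, kernel):
    """Depthwise causal 1-D convolution (assumed). x: (M, d), kernel: (K, d)."""
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, x.shape[1])), x])
    return np.stack([(padded[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def selective_scan(x, A_bar, B_bar, C):
    """h_t = A_bar_t * h_{t-1} + B_bar_t * x_t,  y_t = h_t @ C_t.
    x: (M, d), A_bar and B_bar: (M, d, n), C: (M, n)."""
    M, d = x.shape
    h = np.zeros((d, A_bar.shape[-1]))
    y = np.empty((M, d))
    for t in range(M):
        h = A_bar[t] * h + B_bar[t] * x[t][:, None]
        y[t] = h @ C[t]
    return y

def ssm_branch(x, z, p):
    x_o = silu(causal_conv1d(x, p["conv"]))            # step (2.2.1)
    B = x_o @ p["W_B"]                                 # (M, n), input dependent
    C = x_o @ p["W_C"]                                 # (M, n), input dependent
    delta = np.log1p(np.exp(x_o @ p["W_delta"]))       # softplus step size, (M, d)
    A_bar = np.exp(delta[:, :, None] * p["A"][None])   # step (2.2.2), (M, d, n)
    B_bar = delta[:, :, None] * B[:, None, :]          # step (2.2.2), (M, d, n)
    y_o = selective_scan(x_o, A_bar, B_bar, C)         # step (2.2.3)
    return y_o * silu(z)                               # step (2.2.4), element-wise gate

rng = np.random.default_rng(0)
M, d, n, K = 6, 4, 3, 3                                # hypothetical sizes
p = {
    "conv": rng.standard_normal((K, d)) * 0.5,
    "W_B": rng.standard_normal((d, n)) * 0.5,
    "W_C": rng.standard_normal((d, n)) * 0.5,
    "W_delta": rng.standard_normal((d, d)) * 0.5,
    "A": -np.exp(rng.standard_normal((d, n))),         # negative entries keep the state stable
}
x = rng.standard_normal((M, d))
z = rng.standard_normal((M, d))
y_forward = ssm_branch(x, z, p)
print(y_forward.shape)  # (6, 4)
```

The explicit Python loop is written for clarity; it does not reflect the hardware-aware scan that gives Mamba its practical speed.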

[0067] (2.3) Perform backward (reverse) state-space modeling on $x$. This includes the following sub-steps:

[0068] (2.3.1) Flip the sequence $x$ along the sequence dimension so that the order of its elements is reversed.

[0069] (2.3.2) Apply a one-dimensional convolution to the reversed sequence and activate the result with the SiLU function to obtain $x'_o$.

[0070] (2.3.3) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, and $C_o$ from $x'_o$.

[0071] (2.3.4) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula:

[0072] $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$

[0073] (2.3.5) Multiply $y_o$ element-wise with $z$ after the SiLU activation function to obtain the result of the backward state-space modeling, as shown in the following formula:

[0074] $y'_{backward} = y_o \odot \mathrm{SiLU}(z)$
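To illustrate why the flip in step (2.3.1) gives each token access to later tokens, the toy example below replaces the full state-space model with a fixed scalar recurrence. The functions `causal_scan` and `backward_scan` are hypothetical, and flipping the result back so that positions line up with the forward branch is an assumption that the text does not state explicitly.

```python
import numpy as np

def causal_scan(x, a=0.5):
    """Toy recurrence h_t = a * h_{t-1} + x_t along axis 0 (stand-in for the SSM)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def backward_scan(x, a=0.5):
    flipped = x[::-1]             # step (2.3.1): reverse the token order
    y = causal_scan(flipped, a)   # same scan, now running from last token to first
    return y[::-1]                # assumption: flip back to realign positions

x = np.array([[1.0], [0.0], [0.0], [2.0]])
forward = causal_scan(x).ravel()     # values 1.0, 0.5, 0.25, 2.125
backward = backward_scan(x).ravel()  # values 1.25, 0.5, 1.0, 2.0
print(forward, backward)
```

In the forward result the first token sees only itself, while in the backward result it also carries a decayed contribution from the last token, which is the global context the bidirectional design is meant to supply.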

[0075] (2.4) Sum the forward modeling sequence $y'_{forward}$ and the backward modeling sequence $y'_{backward}$ obtained from the bidirectional state-space modeling, apply a linear projection, and add a residual connection with the input sequence to obtain the output sequence, as shown in the following formula:

[0076] $T_l = \mathrm{Linear}(y'_{forward} + y'_{backward}) + T_{l-1}$

[0077] where $T_{l-1}$ is the input sequence of the Vim block and Linear is a linear projection function.
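The combination in step (2.4) is a projection from the inner dimension d back to D followed by a residual addition. A minimal sketch, with hypothetical names `W_out` and `b_out` for the linear layer and random stand-ins for the two branch outputs:

```python
import numpy as np

def vim_block_output(y_forward, y_backward, T_prev, W_out, b_out):
    """Step (2.4): T_l = Linear(y'_forward + y'_backward) + T_{l-1}."""
    return (y_forward + y_backward) @ W_out + b_out + T_prev

rng = np.random.default_rng(0)
B, M, D, d = 2, 17, 16, 32                      # hypothetical sizes
y_forward = rng.standard_normal((B, M, d))      # stand-in for the forward branch
y_backward = rng.standard_normal((B, M, d))     # stand-in for the backward branch
T_prev = rng.standard_normal((B, M, D))         # block input T_{l-1}
W_out = rng.standard_normal((d, D)) * 0.02
b_out = np.zeros(D)

T_l = vim_block_output(y_forward, y_backward, T_prev, W_out, b_out)
print(T_l.shape)  # (2, 17, 16), same shape as the block input
```

Because the output has the same shape as the input, blocks of this form can be stacked to any depth, as step (2) requires.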

[0078] (3) Extract the class token at its corresponding position in the modeled sequence, and apply a multilayer perceptron (MLP) to it to obtain the class prediction result for the image sequence.

[0079] (4) Use the Vim model trained in the above steps as the backbone network and fine-tune it on different downstream tasks.
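The classification head of step (3) can be sketched as follows. The class token is read from index 0 because sub-step (1.2.2) places it at the beginning of the sequence; the two-layer MLP with a ReLU hidden layer, the weight names, and the sizes are assumptions made for illustration.

```python
import numpy as np

def classify(T_last, W1, b1, W2, b2, cls_index=0):
    """Read the class token from the encoder output and apply a small MLP."""
    cls_token = T_last[:, cls_index, :]             # (B, D)
    hidden = np.maximum(cls_token @ W1 + b1, 0.0)   # assumed ReLU hidden layer
    return hidden @ W2 + b2                         # (B, num_classes) logits

rng = np.random.default_rng(0)
B, M, D, hidden_dim, num_classes = 2, 17, 16, 32, 10   # hypothetical sizes
T_last = rng.standard_normal((B, M, D))                # output of the last Vim block
W1 = rng.standard_normal((D, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, num_classes)) * 0.1
b2 = np.zeros(num_classes)

logits = classify(T_last, W1, b1, W2, b2)
pred = logits.argmax(axis=-1)   # predicted class index per image
print(logits.shape, pred.shape)
```

For the downstream fine-tuning in step (4), a head like this would typically be replaced by a task-specific one while the encoder weights are reused.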

[0080] Furthermore, the present invention also provides a visual representation device based on a bidirectional state space model, including at least one processor and a memory, wherein the at least one processor and the memory are connected via a data bus, and the memory stores instructions that can be executed by the at least one processor. After being executed by the processor, the instructions are used to complete the visual representation method based on the bidirectional state space model.

[0081] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A visual representation method based on a bidirectional state-space model, characterized in that it includes the following steps:

(1) Convert a two-dimensional image into a one-dimensional long sequence, including the following sub-steps:

(1.1) For a two-dimensional image in $\mathbb{R}^{H \times W \times C}$, perform a uniform cropping operation to obtain a set of image patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where H is the height of the image, W is the width of the image, C is the number of channels of the image, P is the side length of each cropped square image patch, and N is the total number of image patches obtained after cropping;

(1.2) Convert the cropped image patch set into a one-dimensional sequence, as shown in the following formula: $T_0 = [x_{cls};\ x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W] + E_{pos}$, where $x_p^i$ represents the i-th image patch, $W \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the linear projection parameter matrix, $x_{cls} \in \mathbb{R}^{D}$ is the class token, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is the position embedding, and D is the dimension of an image patch after linear projection;

(2) Use multiple Vim blocks stacked together to form a Vim encoder that models the input sequence, where the input of the l-th Vim block is denoted $T_{l-1}$ and the corresponding output is denoted $T_l$; the modeling process includes the following sub-steps:

(2.1) Preprocess the input sequence $T_{l-1}$ before state-space modeling to obtain $x \in \mathbb{R}^{B \times M \times d}$ and $z \in \mathbb{R}^{B \times M \times d}$, where B is the batch size, M is the sequence length, and d is the transformed hidden dimension;

(2.2) Perform forward state-space modeling on $x$, specifically including the following sub-steps:

(2.2.1) Apply a one-dimensional convolution to the sequence $x$ and activate the result with the SiLU function to obtain $x'_o$;

(2.2.2) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, $C_o$ from $x'_o$;

(2.2.3) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula: $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$;

(2.2.4) Multiply $y_o$ element-wise with $z$ after SiLU activation to obtain the result of the forward state-space modeling, as shown in the following formula: $y'_{forward} = y_o \odot \mathrm{SiLU}(z)$;

(2.3) Perform backward (reverse) state-space modeling on $x$, specifically including the following sub-steps:

(2.3.1) Perform a sequence reversal operation on the sequence $x$ so that the order of its elements is reversed;

(2.3.2) Apply a one-dimensional convolution to the reversed sequence and activate the result with the SiLU function to obtain $x'_o$;

(2.3.3) Compute the state-space model parameter matrices $\bar{A}_o$, $\bar{B}_o$, $C_o$ from $x'_o$;

(2.3.4) Perform the state-space modeling operation to obtain $y_o$, as shown in the following formula: $y_o = \mathrm{SSM}(\bar{A}_o, \bar{B}_o, C_o)(x'_o)$;

(2.3.5) Multiply $y_o$ element-wise with $z$ after SiLU activation to obtain the result of the backward state-space modeling, as shown in the following formula: $y'_{backward} = y_o \odot \mathrm{SiLU}(z)$;

(2.4) Sum the forward modeling sequence $y'_{forward}$ and the backward modeling sequence $y'_{backward}$ obtained from the bidirectional state-space modeling, apply a linear projection, and add a residual connection with the input sequence to obtain the output sequence, as shown in the following formula: $T_l = \mathrm{Linear}(y'_{forward} + y'_{backward}) + T_{l-1}$, where $T_{l-1}$ is the input sequence of the l-th Vim block, $T_l$ is the output sequence of the l-th Vim block, and Linear is a linear projection function;

(3) Extract the class token at the corresponding position of the output sequence of the last Vim block, and apply a multilayer perceptron (MLP) to it to obtain the class prediction result of the output sequence;

(4) Use the Vim model trained in the above steps as the backbone network and fine-tune it on different downstream tasks.

2. The visual representation method based on a bidirectional state-space model as described in claim 1, characterized in that step (1.2) specifically includes the following sub-steps:

(1.2.1) Use linear projection to transform each image patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ into $x_p^i W \in \mathbb{R}^{D}$; the same learnable parameter matrix $W$ is used for every image patch; after this dimensional transformation, the patches form the sequence $[x_p^1 W;\ x_p^2 W;\ \dots;\ x_p^N W]$;

(1.2.2) Concatenate the class token $x_{cls}$ to the beginning of the sequence of dimension-transformed image patches, increasing the length of the sequence;

(1.2.3) Add the position embedding $E_{pos}$ to the sequence to obtain the sequence data $T_0$ to be input into the model.

3. The visual representation method based on a bidirectional state-space model as described in claim 1 or 2, characterized in that step (2.1) specifically includes the following sub-steps:

(2.1.1) Normalize the output $T_{l-1}$ of the previous Vim block to obtain $T'_{l-1} \in \mathbb{R}^{B \times M \times D}$, where B is the batch size, M is the sequence length, and D is the hidden dimension of the sequence;

(2.1.2) Use linear layers to perform a dimensional transformation on $T'_{l-1}$ to obtain $x \in \mathbb{R}^{B \times M \times d}$ and $z \in \mathbb{R}^{B \times M \times d}$, where d is the transformed hidden dimension.

4. A visual representation device based on a bidirectional state-space model, characterized in that it includes at least one processor and a memory, which are connected via a data bus. The memory stores instructions that can be executed by the at least one processor. When executed by the processor, the instructions carry out the visual representation method based on the bidirectional state-space model as described in any one of claims 1-3.