Efficient high-resolution image visual marker generation method for multi-modal large model

By dynamically slicing and cross-attention processing high-resolution images, compressed visual tag sequences are generated, solving the problem of high computational and memory requirements of multimodal large language models in high-resolution image processing, and achieving efficient visual tag generation and fine-grained understanding.

CN118735932B · Active · Publication Date: 2026-06-30 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2024-07-01
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing multimodal large language models significantly increase the length of visual marker sequences when processing high-resolution images, resulting in high computational and memory requirements and difficulty in handling tasks requiring fine details, such as dense OCR and visual localization of small objects. At the same time, existing methods are prone to information loss or distortion when adjusting image resolution.

Method used

A dynamic slicing method based on minimum fill is used to segment high-resolution images, generating several image patches. Multi-layer visual features are extracted through a visual encoder, and compressed visual label sequences are generated using cross-attention and connection processing. It supports input images with arbitrary aspect ratios.

Benefits of technology

It effectively compresses the length of visual tag sequences, preserves rich detail information in high-resolution images, improves the efficiency and performance of multimodal large language models, and is suitable for tasks requiring fine-grained visual understanding.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an efficient high-resolution image visual tag generation method for multimodal large-scale models. The invention includes the following steps: First, the input original image is segmented using a minimum-fill dynamic slicing method to obtain several image patches, which are then added to a set of image patches to be processed. The original image is preprocessed and also added to the set of image patches to be processed. Next, a visual encoder is used to extract multi-layer visual features from each image patch in the set of image patches to be processed. Then, the penultimate layer of features is downsampled to low-resolution features and cross-attention is performed with the high-resolution features to obtain a visual tag subsequence for the current image patch. Finally, the final compressed visual tag sequence is obtained. This invention can efficiently generate visual tag sequences and is applicable to various visual reasoning tasks such as visual question answering, document question answering, and optical character recognition, providing an efficient and accurate visual context representation method for multimodal large-scale language models.

Description

Technical Field

[0001] This invention relates to a visual tag generation method for multimodal large language models, specifically to an efficient method for generating visual tags for high-resolution images in multimodal large models.

Background Technology

[0002] With the rapid development of large language models, multimodal large language models have made significant progress in visual-language understanding, reasoning, and interaction capabilities. By projecting the output features of the visual encoder into the input space of the large language model, the language model is able to perceive the visual world, and the visual projector plays a crucial role in connecting the visual and language models. In current multimodal large language model frameworks, computational and memory requirements are primarily driven by the language model, which contains a large number of parameters. The visual encoder is typically much smaller than the language model; therefore, the length of the visual tag sequence generated by the visual projector significantly affects the overall efficiency of the multimodal large language model.

[0003] Most methods employ linear projectors or resamplers. For linear projectors, multilayer perceptron projection preserves all visual context through a one-to-one transformation, thus retaining detailed information including redundant labels. This significantly increases the length of the visual label sequence when processing high-resolution images or videos.
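The growth in sequence length under a one-to-one projector can be illustrated with simple arithmetic. The sketch below assumes a ViT-style encoder with a 14-pixel patch size (borrowed from the CLIP-ViT-L/14 encoder this document names elsewhere) and ignores any class token; `visual_token_count` is a hypothetical helper introduced only for illustration.

```python
def visual_token_count(height, width, patch_size=14):
    """Number of patch tokens a ViT-style encoder emits for one image.

    Assumes both sides are exact multiples of the patch size and that a
    one-to-one projector keeps every patch token.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


if __name__ == "__main__":
    for side in (224, 336, 1344):
        print(side, visual_token_count(side, side))
```

Under these assumptions the token count grows quadratically with the image side, which is the cost the paragraph above describes.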

[0004] In addition, some methods employ resamplers or Q-Formers to explicitly control the length of the visual label sequence using a set of learnable queries, and use cross-attention layers to force the extraction of the most relevant visual cues from the visual features. Meanwhile, some methods utilize convolutional layers to encourage local interactions of visual features and generate compressed labels. However, these methods inevitably lose detailed information, sacrificing the visual reasoning capabilities of multimodal large language models. Furthermore, some methods directly transfer visual features from the length dimension to the channel dimension through simple pixel shuffling or neighbor concatenation operations to reduce sequence length. While preserving all information, this may destroy the structural features of the visual features themselves.
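The "length dimension to channel dimension" operation mentioned above can be sketched as follows. This is a minimal NumPy illustration of folding each s x s neighborhood of tokens into one wider token; the function name and the (h, w, C) layout are assumptions made for the example, not details taken from any specific prior method.

```python
import numpy as np


def fold_neighbors_into_channels(tokens, s):
    """Move each s x s neighborhood of a token grid into the channel axis.

    tokens: array of shape (h, w, C), a 2-D grid of visual features.
    Returns an array of shape ((h // s) * (w // s), s * s * C).
    """
    h, w, c = tokens.shape
    if h % s or w % s:
        raise ValueError("grid sides must be multiples of s")
    # Split both spatial axes into (block index, offset inside block).
    blocks = tokens.reshape(h // s, s, w // s, s, c).swapaxes(1, 2)
    # Flatten the block grid into a sequence and the block content into channels.
    return blocks.reshape((h // s) * (w // s), s * s * c)
```

Every input value is kept, so nothing is lost, but neighboring tokens are simply stacked side by side in one channel vector, which is the structural concern the paragraph raises.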

[0005] Most current multimodal large models use CLIP-ViT as a visual encoder to capture visual information. However, this is limited by low-resolution input, such as only supporting images with resolutions of 224*224 or 336*336. This makes it difficult for multimodal large models to handle tasks requiring fine details, such as dense OCR, object counting, and visual localization of small objects. Existing methods directly use SAM or ConvNeXt encoders that support high-resolution input, directly merging high-resolution features with the original low-resolution features.

[0006] There are also methods that resize the image to an accessible size and use sliding windows to divide the image into blocks of the same size (e.g., 224x224). Because these methods change the original resolution to a fixed square size, the visual content can become blurred or distorted. Other methods use similar aspect ratios instead of a fixed square ratio for resizing. Nevertheless, unavoidable image resizing and padding operations still exist, which can lead to image distortion and wasted computational resources.

Summary of the Invention

[0007] To address the problems existing in the background technology, this invention proposes an efficient method for generating visual tags from high-resolution images for multimodal large models. This invention first dynamically slices the high-resolution image, then injects multi-level high-resolution visual features into a low-resolution query to generate a compressed visual tag sequence. This method supports input images with arbitrary aspect ratios, effectively compresses the length of the visual tag sequence, and improves the overall performance and efficiency of multimodal large models.

[0008] The technical solution adopted in this invention includes the following steps:

[0009] I. An efficient method for generating high-resolution image visual tags for multimodal large models

[0010] S1: The input original image is segmented based on the original aspect ratio using a dynamic slicing method with minimum fill to obtain several image blocks; and the original image is preprocessed to make it the same size as the image blocks, thereby obtaining the image blocks corresponding to the original image and forming a set of image blocks to be processed with the segmented image blocks.

[0011] S2: Use a visual encoder to extract multi-layer visual features of each image patch in the set of image patches to be processed, and extract single-layer low-resolution features;

[0012] S3: Based on the multi-layer visual features and single-layer low-resolution features of the current image patch, cross-attention and connection processing are performed to obtain the visual label subsequence of the current image patch;

[0013] S4: Repeat S2 and S3. After generating visual labels for the remaining image blocks in the set of image blocks to be processed, obtain all visual label subsequences, and then generate the compressed visual label sequence of the original image.

[0014] In step S1, the original image is an image with any aspect ratio, and the resolution of the original image includes high resolution.

[0015] In step S1, the input original image is segmented based on its original aspect ratio using a dynamic slicing method based on minimum fill, resulting in several image blocks, specifically:

[0016] Based on the dimensions of the original input image and the preset dimensions of the image patches, the parameters of the segmentation grid, including the number of rows and columns of the segmentation grid, are calculated using the following formula:

[0017]

[0018] $S_o(H, W, r, n_H, n_W) = \mathrm{IoU}\big((H, W), (n_H \times r, n_W \times r)\big)$

[0019]

[0020] Where $n_H^*$ and $n_W^*$ are the actual numbers of rows and columns of the segmentation grid; $n_H$ and $n_W$ are the numbers of rows and columns of the grid; $N_g$ is the maximum number of grid cells; $r$ is the preset size of the image patch; $S_p(\cdot)$ and $S_o(\cdot)$ are the fill score and overlap score, respectively; $H$ and $W$ are the height and width of the original image, respectively; $\mathrm{IoU}(\cdot)$ is the intersection-over-union of the images; $\beta$ is the weighting coefficient; and $\alpha$ is the image adjustment ratio. The two remaining symbols denote the set of grid segmentation configurations and the set of integers;

[0021] Then, based on the actual number of rows and columns of the segmentation grid, the input original image is segmented to obtain several image blocks. After filling the image blocks with a size smaller than the preset size, the final image block is obtained.
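The slicing-and-filling step can be sketched as follows. This is a minimal NumPy illustration, assuming the image already fits inside the grid canvas and that "filling" means zero padding at the bottom and right edges; `slice_into_tiles` is a hypothetical helper, not a name from the patent.

```python
import numpy as np


def slice_into_tiles(image, n_rows, n_cols, r):
    """Cut an (H, W, C) image into n_rows * n_cols tiles of size r x r.

    Tiles at the bottom/right edge that would be smaller than r x r are
    zero-filled. Tiles are returned in row-major order.
    """
    h, w, c = image.shape
    if h > n_rows * r or w > n_cols * r:
        raise ValueError("image does not fit inside the grid canvas")
    # Place the image on a zero canvas that exactly covers the grid.
    canvas = np.zeros((n_rows * r, n_cols * r, c), dtype=image.dtype)
    canvas[:h, :w] = image
    # Split both spatial axes into (grid index, offset inside tile).
    tiles = canvas.reshape(n_rows, r, n_cols, r, c).swapaxes(1, 2)
    return tiles.reshape(n_rows * n_cols, r, r, c)
```

For example, a 5 x 7 image cut on a 2 x 2 grid with r = 4 yields four 4 x 4 tiles, of which only the first is free of padding.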

[0022] In step S1, the original image is resized and pixel-filled to make it the same size as the image block.

[0023] In step S2, the penultimate layer visual features output from the visual encoder are downsampled to obtain a single-layer low-resolution feature.

[0024] Specifically, S3 is:

[0025] The multi-layer visual features of the current image patch are processed by a multi-layer perceptron and then reshaped and flattened to obtain high-resolution features at multiple scales, which are used as keys and values in the cross-attention mechanism. The single-layer low-resolution features of the current image patch are flattened and used as queries in the cross-attention mechanism. Then, multi-layer injection processing based on cross-attention is performed to obtain the visual tag subsequence of the current image patch.

[0026] The multi-layer injection process based on cross-attention specifically includes:

[0027] Perform a cross-attention operation to obtain the output of the cross-attention mechanism module and use it as the input of the connector. The connector generates a visual tag subsequence for the current image patch.

[0028] In step S4, a newline character is added to the end of each visual marker subsequence, and a separator is added between two adjacent visual marker subsequences, so that all visual marker subsequences are spliced together to generate a compressed visual marker sequence of the original image.

[0029] In S2, the visual encoder is specifically a CLIP-ViT-L/14 visual encoder.

[0030] The connector is a multilayer perceptron.

[0031] The separator includes a comma.

[0032] II. An efficient, high-resolution image visual tagging generation system for multimodal large models

[0033] The dynamic slicing module is used to perform image segmentation and preprocessing on the original image based on the original aspect ratio to obtain a set of image blocks to be processed.

[0034] The visual feature extraction module is used to extract visual features of image patches using a visual encoder;

[0035] The visual marker sequence generation module is used to generate visual marker sub-sequences for each image block using the multi-layer injection module, and to generate compressed visual marker sequences based on the visual marker sub-sequences.

[0036] III. A computer device

[0037] The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the method.

[0038] IV. A computer-readable storage medium

[0039] The computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method.

[0040] The beneficial effects of this invention are:

[0041] This invention proposes an efficient method for generating visual tags for high-resolution images in multimodal large models, capable of generating concise visual tag sequences between the visual encoder and the language model. High-resolution visual features are downsampled into low-resolution visual features through interpolation, serving as the basis for the overall visual representation. Through a multi-layer injection module, a single layer of low-resolution features acts as the query, while multiple layers of high-resolution visual features serve as reference keys and values. This ensures that detailed information is fully absorbed within the corresponding local context regions, effectively updating the low-resolution query and transforming it into a rich representation.

[0042] This invention has significant advantages in achieving efficient high-resolution image understanding and visual reasoning of multimodal large language models, providing an efficient and fine-grained solution for future research on visual-language models.

Attached Figure Description

[0043] Figure 1 is an overall flowchart of an efficient method for generating visual markers for high-resolution images in multimodal large models, according to an embodiment of the present invention.

[0044] Figure 2 is a flowchart of the image dynamic slicing module according to an embodiment of the present invention.

[0045] Figure 3 is a schematic diagram of a visual projector network that efficiently supports the generation of visual markers for high-resolution images, according to an embodiment of the present invention.

Detailed Implementation

[0046] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0047] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0048] As shown in Figure 1 and Figure 2, in one aspect, the efficient method for generating visual labels for high-resolution images in multimodal large models proposed in this invention includes the following steps:

[0049] S1: The input original image is segmented based on the original aspect ratio using a dynamic slicing method with minimum fill, resulting in several image blocks; the original image is then preprocessed to make it the same size as the image blocks, thus obtaining the image blocks corresponding to the original image and forming a set of image blocks to be processed with the segmented image blocks.

[0050] The original image can have any aspect ratio, and its resolution can be either high or low. Low resolution refers to 336x336 or 224x224, the native input resolution of CLIP-ViT. High resolution is also supported: this invention can handle resolutions of 1088x1088, 1344x1344, or even higher.

[0051] In S1, the input original image is segmented based on its original aspect ratio using a dynamic slicing method based on minimum fill, resulting in several image patches, specifically:

[0052] Based on the dimensions of the input original image and the preset size of the image patches (i.e., 336 as specified in the CLIP-ViT-L/14 image encoder), the parameters of the segmentation grid, including the number of rows and columns of the segmentation grid, are solved using the following formula:

[0053]

[0054] $S_o(H, W, r, n_H, n_W) = \mathrm{IoU}\big((H, W), (n_H \times r, n_W \times r)\big)$

[0055]

[0056] Where $n_H^*$ and $n_W^*$ are the actual numbers of rows and columns of the segmentation grid; $n_H$ and $n_W$ are the numbers of rows and columns of the grid; $N_g$ is the maximum allowed number of grid cells; $r$ is the preset size of the image patch; $S_p(\cdot)$ and $S_o(\cdot)$ are the fill score and overlap score, respectively; $H$ and $W$ are the height and width of the original image, respectively; $\mathrm{IoU}\big((H, W), (n_H \times r, n_W \times r)\big)$ is the intersection-over-union (IoU) of the original image and the large image formed by stitching together the segmented image patches; $\beta$ is the weighting coefficient; and $\alpha$ is the image adjustment ratio. The two remaining symbols denote the set of different grid segmentation configurations and the set of integers;

[0057] Then, based on the actual number of rows and columns of the segmentation grid, the input original image is segmented to obtain several image blocks. Image blocks smaller than the preset size are then filled to obtain the final image blocks. The above formula yields the optimal segmentation grid configuration for a given image by considering three key factors: 1) maintaining the original aspect ratio of the original image to avoid distortion; 2) minimizing the padding ratio so that most of the grid is occupied by the content of the original image; 3) among the segmentation grid configurations that satisfy the first two conditions, selecting the configuration with the resolution closest to the original image.
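The three factors above can be illustrated with a small grid search. This is only a sketch under stated assumptions: the overlap score follows the IoU expression given in the text (with both rectangles aligned at the origin), but the exact fill score and the way the two scores are combined are not legible in this text, so `fill_score`, the additive weighting by `beta`, and the default `max_cells=9` are hypothetical choices made for illustration.

```python
def overlap_score(H, W, r, n_h, n_w):
    """IoU of the image (H, W) and the grid canvas (n_h * r, n_w * r),
    treating both as rectangles anchored at the same corner."""
    gh, gw = n_h * r, n_w * r
    inter = min(H, gh) * min(W, gw)
    union = H * W + gh * gw - inter
    return inter / union


def fill_score(H, W, r, n_h, n_w):
    """Hypothetical fill score: fraction of the canvas covered by the image
    after an aspect-preserving resize that fits it inside the canvas."""
    gh, gw = n_h * r, n_w * r
    alpha = min(gh / H, gw / W)  # assumed meaning of the adjustment ratio
    return (H * alpha) * (W * alpha) / (gh * gw)


def best_grid(H, W, r=336, max_cells=9, beta=1.0):
    """Pick (rows, cols) with rows * cols <= max_cells maximizing the
    assumed combined score S_o + beta * S_p."""
    candidates = [
        (n_h, n_w)
        for n_h in range(1, max_cells + 1)
        for n_w in range(1, max_cells // n_h + 1)
    ]
    return max(
        candidates,
        key=lambda g: overlap_score(H, W, r, *g) + beta * fill_score(H, W, r, *g),
    )
```

With these assumptions, a 672 x 1344 image and r = 336 selects a 2 x 4 grid, since that canvas matches the image exactly and needs no padding.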

[0058] While keeping the original aspect ratio unchanged, the original image is resized (i.e. scaled) and zero-pixel padded according to the image adjustment ratio to make it the same size as the image block, serving as a macro overview of the original image to preserve the original information.
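The geometry of this overview patch can be sketched as follows. The example assumes that the image adjustment ratio is the scale that makes the longer side equal to the patch size r; the patent text here does not spell out that definition or where the zero padding is placed, so the function only reports the total padding per axis. `overview_geometry` is a hypothetical helper.

```python
def overview_geometry(H, W, r=336):
    """Aspect-preserving resize of an (H, W) image into an r x r patch.

    Returns (alpha, (new_h, new_w), (pad_h, pad_w)) where alpha is the
    assumed scale factor and pad_* is the total zero padding per axis.
    """
    alpha = r / max(H, W)
    new_h = max(1, round(H * alpha))
    new_w = max(1, round(W * alpha))
    return alpha, (new_h, new_w), (r - new_h, r - new_w)


if __name__ == "__main__":
    print(overview_geometry(672, 1344))
```

For a 672 x 1344 input this gives a 168 x 336 resized image with 168 rows of padding, so the overview keeps the 1:2 aspect ratio inside a square patch.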

[0059] S2: Use a visual encoder to extract multi-layer visual features of each image patch in the set of image patches to be processed, and extract single-layer low-resolution features;

[0060] As shown in Figure 3, in S2, bilinear interpolation is used to downsample the penultimate-layer visual features output from the visual encoder (i.e., the penultimate-layer visual features output from the Transformer in the CLIP-ViT-L/14 visual encoder) to obtain a single-layer low-resolution feature. Specifically, the penultimate-layer visual features $I_v$ of the Transformer in the CLIP-ViT-L/14-336px visual encoder are downsampled by bilinear interpolation with a scaling factor $s$ into low-resolution visual features $I'_v$, where $N$ is the length of the visual feature, $C$ is the dimension of the visual feature, and the length $M$ of the visual label sequence can be controlled by the downsampling ratio $s$. The low-resolution visual feature $I'_v$ can be viewed as a coarse representation of the original high-resolution visual features, where each pixel of the low-resolution feature $I'_v$ corresponds to a specific ($s \times s$) subregion of the high-resolution feature $I_v$.
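The downsampling step can be sketched in NumPy as follows. This is an illustration only: it assumes the N features are arranged as an h x w grid, that s is a positive integer, and that sampling uses the half-pixel-center convention; the patent text does not state which interpolation convention is used. `bilinear_downsample` is a hypothetical helper.

```python
import numpy as np


def bilinear_downsample(feat, s):
    """Downsample an (h, w, C) float feature grid by an integer factor s
    using bilinear interpolation at half-pixel-aligned sample points."""
    h, w, _ = feat.shape
    out_h, out_w = h // s, w // s
    # Source coordinates of each output pixel center.
    ys = (np.arange(out_h) + 0.5) * s - 0.5
    xs = (np.arange(out_w) + 0.5) * s - 0.5
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bottom = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a 24 x 24 grid (N = 576) and s = 2, the result is a 12 x 12 grid, so flattening it gives a query sequence of length M = 144.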

[0061] S3: Based on the multi-layer visual features and single-layer low-resolution features of the current image patch, cross-attention and connection processing are performed to obtain the visual label subsequence of the current image patch;

[0062] S3 specifically refers to:

[0063] This invention constructs a pairing relationship between each pixel of the low-resolution feature and the corresponding sub-region of the high-resolution feature (where $s$ is the scaling factor, $M$ is the length of the low-resolution feature, and $s^2 \times M$ is the length of the high-resolution feature), thus injecting detailed information of the high-resolution sub-region into each low-resolution coarsely represented pixel. Therefore, the multi-layer visual features of the current image patch are processed by a multi-layer perceptron (MLP) and then reshaped and flattened to obtain multi-scale high-resolution features, which are used as keys and values in the cross-attention mechanism. The single-layer low-resolution features of the current image patch are flattened and used as queries in the cross-attention mechanism. Then, multi-layer injection processing based on cross-attention is performed to obtain the visual label subsequence of the current image patch.

[0064] The multi-layer injection process based on cross-attention is as follows:

[0065] A cross-attention operation is performed to inject multi-layer high-resolution detail information into the low-resolution features. The output of the cross-attention mechanism module is obtained and used as the input of the connector, which generates a visual label subsequence for the current image patch. The multi-layer injection processing based on cross-attention is called the multi-layer injection module. Different layers of the CLIP-ViT visual encoder exhibit different biases toward different patterns: shallow features contain low-level information, while deep features perform better in semantic understanding. Therefore, the single-layer low-resolution features are flattened and used as queries, while the multi-layer high-resolution features are processed by an MLP, reshaped, and flattened, and then used as the reference keys and values. The cross-attention operation allows the low-resolution features to fully absorb fine-grained high-resolution feature information. Finally, a connector composed of multi-layer perceptrons converts the features into a visual label subsequence.
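The query/key/value roles described above can be sketched as a region-restricted cross-attention in NumPy. This is a simplified illustration, not the patented module: it omits the learned projections, the MLP applied to the high-resolution features, and the connector, and it assumes that features from different layers are concatenated along the region axis, which the text does not specify. `group_regions` and `inject` are hypothetical names.

```python
import numpy as np


def group_regions(feat, s):
    """Group an (h, w, C) grid into (M, s*s, C): one s x s region per query."""
    h, w, c = feat.shape
    blocks = feat.reshape(h // s, s, w // s, s, c).swapaxes(1, 2)
    return blocks.reshape((h // s) * (w // s), s * s, c)


def inject(query, layer_feats, s):
    """Each low-resolution query attends only to its own high-resolution region.

    query:       (M, C) flattened low-resolution features.
    layer_feats: list of (h, w, C) high-resolution grids, one per layer.
    Returns an (M, C) array of updated queries.
    """
    # Keys and values: regions from all layers, stacked per query.
    kv = np.concatenate([group_regions(f, s) for f in layer_feats], axis=1)
    scores = np.einsum("mc,mrc->mr", query, kv) / np.sqrt(query.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return np.einsum("mr,mrc->mc", weights, kv)
```

Because each of the M queries sees only the tokens of its paired region, the output length stays at M while every high-resolution token still contributes to exactly one output.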

[0066] S4: Repeat S2 and S3. After generating visual tags for the remaining image blocks in the set of image blocks to be processed, obtain all visual tag sub-sequences, and then generate the compressed visual tag sequence (tokens) of the original image. Specifically, it is generated by splicing the visual tag sub-sequences corresponding to all image blocks.

[0067] In S4, a newline character ('\n') is added to the end of each visual marker subsequence, and a separator (such as a comma (',')) is added between adjacent visual marker subsequences to represent the two-dimensional structural information of the image and avoid ambiguity in large language models. This concatenates all visual marker subsequences to generate a compressed visual marker sequence of the original image. The compressed visual marker sequence generated by this invention has a length that is $1/s^2$ of the original visual marker sequence while remaining a high-quality visual marker sequence, and it can then be used for reasoning tasks in large language models.
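The splicing rule can be sketched as follows, reading the paragraph literally: every subsequence is followed by a newline marker, and a separator is placed between adjacent subsequences. Tokens are represented as plain Python objects and the two markers as strings purely for illustration; how the markers are embedded in a real model is not covered here. `assemble` and `compressed_length` are hypothetical helpers.

```python
NEWLINE = "\n"
SEPARATOR = ","


def assemble(subsequences):
    """Concatenate per-patch token subsequences into one sequence."""
    sequence = []
    for index, sub in enumerate(subsequences):
        if index > 0:
            sequence.append(SEPARATOR)  # between adjacent subsequences
        sequence.extend(sub)
        sequence.append(NEWLINE)  # at the end of each subsequence
    return sequence


def compressed_length(n_tokens, s):
    """Length of one subsequence after compression by a factor of s * s."""
    return n_tokens // (s * s)


if __name__ == "__main__":
    print(assemble([["a", "b"], ["c"]]))
    print(compressed_length(576, 2))
```

Assuming 576 tokens per patch before compression, s = 2 leaves 144 tokens per patch, plus the structural markers added during splicing.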

[0068] In another aspect, this invention proposes an efficient system for generating visual labels for high-resolution images in multimodal large models, comprising:

[0069] The dynamic slicing module is used to perform image segmentation and preprocessing on the original image based on the original aspect ratio to obtain a set of image blocks to be processed.

[0070] The visual feature extraction module is used to extract visual features of image patches using a visual encoder;

[0071] The visual marker sequence generation module is used to generate visual marker sub-sequences for each image block using the multi-layer injection module, and to generate compressed visual marker sequences based on the visual marker sub-sequences.

[0072] In another aspect, the present invention proposes a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method.

[0073] In another aspect, the present invention proposes a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method.

[0074] The large language model simultaneously receives the visual tokens $T_v$ and the text tokens $T_t$, and then generates a coherent response autoregressively. For a response sequence of length $L$, the probability of generating the contextual target answer can be calculated using the following formula:

[0075]

[0076] Where $p(\cdot)$ is the probability of generating the target answer, and $y_i$ is the i-th word in the target answer sequence.
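For reference, the standard autoregressive factorization that matches these definitions can be written as below. The symbol $Y$ for the target answer and the notation $y_{<i}$ for the words preceding position $i$ are assumptions introduced here; the formula in paragraph [0075] itself is not legible in this text.

```latex
p(Y \mid T_v, T_t) = \prod_{i=1}^{L} p\left(y_i \mid T_v, T_t, y_{<i}\right)
```

Each factor conditions on the visual tokens, the text tokens, and all previously generated words, so a shorter visual sequence $T_v$ reduces the context length at every generation step.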

[0077] This invention proposes an efficient visual tag generation method for multimodal large-scale models that supports high-resolution images. This method effectively compresses the length of visual tag sequences while preserving rich details in high-resolution images, thereby significantly improving the efficiency and performance of multimodal large-scale language models. By interpolating to generate low-resolution features and injecting multiple layers of high-resolution features, a concise visual tag sequence is generated, achieving high-quality feature transfer between the visual encoder and the large-scale language model. This method overcomes the shortcomings of traditional methods in high-resolution image processing and is particularly suitable for scenarios requiring fine-grained visual understanding, such as OCR recognition and complex visual reasoning. Therefore, it is beneficial for the practical application of multimodal large-scale models.

[0078] Finally, it should be noted that the above embodiments and descriptions are only used to illustrate the technical solutions of the present invention and not to limit it. Those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the disclosure of the technical solutions of the present invention, and all such modifications and substitutions should be covered within the protection scope of the claims of the present invention.

Claims

1. An efficient high-resolution image visual marker generation method for a multi-modal large model, characterized in that it includes the following steps: S1: The input original image is segmented based on the original aspect ratio using a dynamic slicing method with minimum fill to obtain several image blocks; and the original image is preprocessed to make it the same size as the image blocks, thereby obtaining the image blocks corresponding to the original image and forming a set of image blocks to be processed with the segmented image blocks. S2: Use a visual encoder to extract multi-layer visual features of each image patch in the set of image patches to be processed, and extract single-layer low-resolution features; S3: Based on the multi-layer visual features and single-layer low-resolution features of the current image patch, cross-attention and connection processing are performed to obtain the visual label subsequence of the current image patch; specifically, S3 is: The multi-layer visual features of the current image patch are processed by a multi-layer perceptron and then reshaped and flattened to obtain high-resolution features at multiple scales, which are used as keys and values in the cross-attention mechanism. The single-layer low-resolution features of the current image patch are flattened and used as queries in the cross-attention mechanism. Then, multi-layer injection processing based on cross-attention is performed to obtain the visual tag subsequence of the current image patch. The multi-layer injection process based on cross-attention specifically includes: Perform a cross-attention operation to obtain the output of the cross-attention mechanism module and use it as the input of the connector. The connector generates a visual tag subsequence for the current image patch. S4: Repeat S2 and S3. After generating visual labels for the remaining image blocks in the set of image blocks to be processed, obtain all visual label subsequences, and then generate the compressed visual label sequence of the original image.

2. The method according to claim 1, wherein, in step S1, the original image is an image with any aspect ratio, and the resolution of the original image includes high resolution.

3. The efficient high-resolution image visual tag generation method for multimodal large models according to claim 1, characterized in that, in step S1, the input original image is segmented based on its original aspect ratio using a dynamic slicing method based on minimum fill, resulting in several image blocks, specifically: Based on the dimensions of the original input image and the preset dimensions of the image patches, the parameters of the segmentation grid, including the number of rows and columns of the segmentation grid, are calculated using the following formula, where $n_H^*$ and $n_W^*$ are the actual numbers of rows and columns of the segmentation grid, respectively; $n_H$ and $n_W$ are the numbers of rows and columns of the grid; $N_g$ is the maximum number of grid cells; $r$ is the preset size of the image patch; $S_p(\cdot)$ and $S_o(\cdot)$ are the fill score and the overlap score, respectively; $H$ and $W$ are the height and width of the original image, respectively; $\mathrm{IoU}(\cdot)$ is the intersection-over-union (IoU) of the images; $\beta$ is the weighting coefficient; $\alpha$ is the image adjustment ratio; and the two remaining symbols denote the set of grid segmentation configurations and the set of integers. Then, based on the actual number of rows and columns of the segmentation grid, the input original image is segmented to obtain several image blocks. After filling the image blocks with a size smaller than the preset size, the final image block is obtained.

4. The efficient high-resolution image visual tag generation method for multimodal large models according to claim 1, characterized in that, in step S1, the original image is resized and pixel-filled to make it the same size as the image block.

5. The efficient high-resolution image visual tag generation method for multimodal large models according to claim 1, characterized in that, in step S2, the penultimate layer visual features output from the visual encoder are downsampled to obtain a single-layer low-resolution feature.

6. The efficient high-resolution image visual tag generation method for multimodal large models according to claim 1, characterized in that, in step S4, a newline character is added to the end of each visual marker subsequence, and a separator is added between two adjacent visual marker subsequences, so that all visual marker subsequences are spliced together to generate a compressed visual marker sequence of the original image.

7. A high-efficiency, high-resolution image visual tagging generation system for multimodal large models, characterized in that it includes: The dynamic slicing module is used to perform image segmentation and preprocessing on the original image based on the original aspect ratio to obtain a set of image blocks to be processed. The visual feature extraction module is used to extract multi-layer visual features of image patches and extract single-layer low-resolution features using a visual encoder. A visual marker sequence generation module is used to generate visual marker sub-sequences for each image patch using a multi-layer injection module, and to generate a compressed visual marker sequence based on the visual marker sub-sequences of each image patch; including: The multi-layer visual features of each image patch are processed by a multi-layer perceptron and then reshaped and flattened to obtain high-resolution features at multiple scales, which are used as keys and values in the cross-attention mechanism. The single-layer low-resolution features of the current image patch are flattened and used as queries in the cross-attention mechanism. Then, multi-layer injection processing based on cross-attention is performed to obtain the visual tag subsequence of the current image patch. The multi-layer injection process based on cross-attention specifically includes: Perform a cross-attention operation to obtain the output of the cross-attention mechanism module and use it as the input of the connector. The connector generates a visual tag subsequence for the current image patch.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.