A Visual Large Language Model Construction Method for Interpreting Multi-Source Ship Remote Sensing Images
A visual large language model is constructed for interpreting multi-source ship remote sensing images. Image features are extracted with the CLIP and DINOv2 networks, and the LLaMA and SAM models are combined through transfer learning. This unifies the detection of multi-source, multi-type remote sensing images and enables efficient ship detection and segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2024-02-21
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies cannot use natural language to locate and extract remote sensing targets, nor can they unify the detection of multi-source and multi-type remote sensing images.
A visual large language model for interpreting multi-source ship remote sensing images is constructed. Image features are extracted through CLIP and DINOv2 visual backbone networks. The LLaMA large language model and SAM model are combined to align visual and linguistic features. Bias and scale matrices are added for transfer learning and multimodal input processing.
It achieves unified processing of multi-source ship remote sensing images, enabling ship horizontal bounding box detection, rotated bounding box detection, and pixel-level segmentation tasks, thus enhancing the ability to interpret remote sensing images.
Figure CN117994677B_ABST
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of remote sensing and computer vision, and specifically relates to a method for constructing a visual large language model for interpreting multi-source ship remote sensing images.
Background Technology
[0002] Ship detection based on remote sensing images aims to accurately identify and extract the position of ships in remote sensing images with complex background interference. Intelligent ship interpretation is a key task in marine environmental analysis and monitoring, and is widely used in fields such as maritime security, border defense, naval warfare, environmental protection, search and rescue operations, maritime traffic control, and fisheries management.
[0003] Existing deep learning-based ship detection algorithms fall mainly into two types: two-stage algorithms and one-stage algorithms. Two-stage methods are region-proposal-based detection algorithms: they first generate candidate bounding boxes for target regions, then classify the image content within these boxes and refine the box positions to achieve target detection. One-stage methods are regression-based detection algorithms: they detect targets by regular, dense sampling of bounding boxes at various locations, sizes, and aspect ratios on the image, and the network processes the input image directly to produce object category probabilities and location coordinates. These algorithms have proven effective in various remote sensing tasks, achieving state-of-the-art performance. However, these models cannot use natural language to locate and extract remote sensing targets, and they cannot unify detection across multiple image sources (e.g., optical or SAR) or across multiple detection types (e.g., horizontal bounding box detection and rotated bounding box detection).
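To make the dense sampling used by one-stage detectors concrete, the following minimal sketch enumerates candidate boxes over a regular grid of locations, sizes, and aspect ratios. The function name `dense_anchors` and the stride, size, and ratio values are illustrative assumptions and do not come from any specific detector discussed here.

```python
import math


def dense_anchors(img_w, img_h, stride, sizes, ratios):
    """Enumerate candidate boxes (cx, cy, w, h) on a regular grid.

    Each grid cell gets one box per (size, ratio) pair; ratio is w / h
    and every box keeps an area of size ** 2.
    """
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for size in sizes:
                for ratio in ratios:
                    w = size * math.sqrt(ratio)
                    h = size / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


# Hypothetical settings: a 64 x 64 image, stride 16, two sizes, three ratios.
anchors = dense_anchors(64, 64, stride=16, sizes=(16, 32), ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 4 x 4 grid cells x 2 sizes x 3 ratios = 96
```

A detector of this kind would then score and regress every candidate box, which is why the number of boxes grows quickly with image size.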
[0004] Therefore, a method for constructing a visual large language model for interpreting multi-source ship remote sensing images is needed to solve the above problems.
Summary of the Invention
[0005] In view of this, the purpose of this invention is to provide a method for constructing a visual large language model for interpreting multi-source ship remote sensing images, in order to solve the problems in the prior art that it is impossible to use natural language to locate and extract remote sensing targets, and that it is impossible to unify multi-source image detection and multi-type detection.
[0006] To achieve the above objectives, the present invention provides the following technical solution:
[0007] This invention provides a method for constructing a visual large language model for interpreting multi-source ship remote sensing images, comprising the following steps:
[0008] S1: For each input image I of the ship remote sensing images, visual features are extracted separately by the CLIP visual backbone network and the DINOv2 visual backbone network. The multi-scale visual features extracted by the CLIP encoder and the multi-scale visual features extracted by the DINOv2 encoder are concatenated along the channel dimension and then aligned, through a learnable mapping layer, with the language features $p_l$ represented by natural language instruction tokens, yielding the dimension-transformed image features $p_v$. The image features $p_v$ and the language features $p_l$ are concatenated to obtain the multimodal input.
[0009] S2: Based on the LLaMA large language model, LoRA is used to perform alignment training on the image features $p_v$ and the language features $p_l$, yielding the visual large language model Popeye.
[0010] S3: Two learnable parameters, a bias matrix $\Delta W_b$ and a scale matrix $\Delta W_s$, are added to construct a knowledge transfer framework for the ship remote sensing domain.
[0011] S4: The visual large language model Popeye is combined with the SAM model to achieve language-instructed, pixel-level ship segmentation: the input ship remote sensing image is encoded into an image embedding by the SAM image encoder, which is pre-trained with a masked autoencoder (MAE); the detection bounding boxes generated by the LLaMA large language model are encoded into a prompt embedding; and the image embedding and prompt embedding are combined by the mask decoder to output the predicted segmentation mask.
[0012] Furthermore, in step S1, the multi-scale visual features extracted by the CLIP encoder are represented as
[0013] where the symbol denotes the visual features, $H \times W \times C$ is the input image resolution, n is the number of scales, and $\mathbb{R}$ is the set of real numbers.
[0014] Furthermore, in step S1, the multi-scale visual features extracted by the DINOv2 encoder are represented as
[0015] where the symbol denotes the visual features and m is the number of scales.
[0016] Furthermore, in step S2, the alignment training specifically includes the following steps:
[0017] A1: The LLaMA large language model consists of multiple Transformer blocks and is pre-trained on the LAION-400M and COCO Caption datasets using LoRA technology.
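[0018] A2: The LoRA technique freezes the entire LLaMA weight matrix during pre-training and injects learnable rank-decomposition matrices into the top L layers of the Transformer architecture. For the l-th Transformer block, the single-head attention $\mathrm{Att}_h^l$ and the multi-head attention $\mathrm{MultiAtt}^l$ are calculated as follows:
[0019] $\mathrm{Att}_h^l = \mathrm{softmax}\left(\frac{Q_h^l (K_h^l)^T}{\sqrt{d_k}}\right) V_h^l$
[0020] $\mathrm{MultiAtt}^l = \mathrm{Concat}\left(\mathrm{Att}_1^l, \ldots, \mathrm{Att}_H^l\right) W_o^l$
[0021] In the formula, H is the number of attention heads, and each attention head of the l-th Transformer block is indexed by $h = 1, \ldots, H$. $Q_h^l$, $K_h^l$, and $V_h^l$ are the query, key, and value matrices of head h, obtained with real-valued weight matrices, and $W_o^l$ is the output projection weight matrix. $\mathbb{R}$ denotes the set of real numbers, D the feature dimension, and $d_k$ the dimension of the K matrix; $1/\sqrt{d_k}$ is the scaling factor, T denotes the matrix transpose, and softmax is an activation function;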
[0022] A3: In the top L layers of the Transformer architecture, learnable low-rank weight matrices are introduced. The output of the multi-head attention module of the l-th Transformer block after LoRA adaptation, denoted Adapted Attn, is given by the following formula:
[0023]
[0024] In the formula, the added terms denote the learnable low-rank weight matrices.
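The low-rank adaptation in steps A2 and A3 can be illustrated with a minimal sketch of a single adapted projection. It follows the commonly used LoRA formulation, in which a frozen weight W is complemented by a trainable product of two small matrices; the class name `LoRALinear`, the `alpha / rank` scaling, and the initialization values are assumptions for illustration and are not taken from the patent's own formula.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (assumed form)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, never updated
        # A starts as small Gaussian noise and B as zeros, so the update is
        # zero and the adapted layer initially equals the frozen layer.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]


layer = LoRALinear([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]], rank=1)
x = [1.0, 2.0, 3.0]
y_init = layer(x)  # identical to the frozen layer's output before training
```

Only `a` and `b` would be trained in such a scheme, which is what keeps the number of trainable parameters small relative to the frozen weight.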
[0025] Furthermore, in step S3, constructing a knowledge transfer framework for the remote sensing ship domain includes the following steps:
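[0026] B1: Insert a bias matrix $\Delta W_b$ and a scale matrix $\Delta W_s$ into each linear layer of the Transformer as two learnable parameters;
[0027] B2: Given a linear layer $f(x) = Wx$, transform the linear layer into $f(x) = \Delta W_s (Wx + \Delta W_b)$,
[0028] In the formula, the weight matrices $W, \Delta W_b, \Delta W_s \in \mathbb{R}^{D \times D}$; the bias matrix is initialized as $\Delta W_b = \mathrm{Init}(0)$, and the scale matrix $\Delta W_s$ is initialized from a random Gaussian distribution.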
[0029] Furthermore, in step S4, the segmentation process is performed using the following algorithm:
[0030] $F_{visual} = \mathrm{SAM}_{v\text{-}enc}(I)$
[0031] $B_{det} = \mathrm{Popeye}(I)$
[0032] $F_{prompt} = \mathrm{SAM}_{p\text{-}enc}(B_{det})$
[0033] $\Omega = \mathrm{SAM}_{dec}(F_{visual}, F_{prompt})$
[0034] In the formula, $I \in \mathbb{R}^{H \times W \times 3}$ denotes the input image, $F_{visual}$ denotes the image feature representation extracted by the SAM image encoder, $B_{det}$ denotes the sparse prompt comprising the ship detection bounding boxes, $F_{prompt}$ denotes the sparse prompt tokens encoded by the prompt encoder, and $\Omega$ denotes the set of predicted masks. $\mathrm{SAM}_{v\text{-}enc}$ is the image encoder of the SAM model, $\mathrm{SAM}_{p\text{-}enc}$ is the prompt encoder of the SAM model, $\mathrm{SAM}_{dec}$ is the mask decoder of the SAM model, and Popeye is the visual large language model based on the LLaMA large language model.
[0035] Furthermore, in step S1, the multimodal input is represented as:
[0036]
[0037] In the formula, $N_v$ is the length of the visual feature tokens and $N_l$ is the length of the language feature tokens. The visual feature tokens are taken from $p_v$, and the natural language instruction tokens are taken from $p_l$.
[0038] The beneficial effects of this invention are as follows:
[0039] This invention proposes a multimodal large language model for interpreting multi-source ship remote sensing images. This method can understand and uniformly process various ship interpretation tasks, such as ship horizontal bounding box detection, ship rotated bounding box detection, and ship pixel-level segmentation. Technically, we construct a novel visual-language aligned image interpretation method and design a knowledge adaptation and transfer paradigm to adapt to the field of ship remote sensing. Utilizing the powerful generalization ability of the large language model, we achieve a general paradigm for interpreting multi-source ship remote sensing images.
[0040] Other advantages, objectives, and features of the invention will be set forth in the following description and will be apparent to those skilled in the art in some respects, or may be learned by practice of the invention. The objectives and other advantages of the invention can be realized and obtained through the following description.
Description of the Drawings
[0041] To make the objectives, technical solutions, and beneficial effects of this invention clearer, the following figures are provided for illustration:
[0042] Figure 1 is a diagram of the LLaMA large language model framework of the present invention;
[0043] Figure 2 is a diagram of the ship detection results of the present invention;
[0044] Figure 3 is a diagram of the segmentation results of the present invention.
Detailed Implementation
[0045] As shown in Figures 1-3, this invention provides a method for constructing a visual large language model for interpreting multi-source ship remote sensing images, including the following steps:
[0046] S1: An image interpretation method based on a large language model to align vision and language. Remote sensing images of ships taken by satellite from a bird's-eye view inherently contain complex background interference. Due to this perspective, ship images contain overlapping visual elements and diverse textures of the marine environment and adjacent land. Therefore, a hybrid visual backbone network feature extractor is proposed, which extracts visual features separately and then fuses them to achieve a more powerful visual representation. Specifically, this includes:
[0047] A1: Multi-scale cross-modal feature fusion. Multi-scale visual features are extracted from each input image I by the CLIP (Contrastive Language-Image Pretraining, proposed by OpenAI) visual backbone network; the visual feature embedding generated by the CLIP encoder is represented as follows:
[0048] In the formula, the symbol denotes the visual features, $H \times W \times C$ is the input image resolution, and n is the number of scales.
[0049] A2: To explore implicit visual signals, a DINOv2 visual backbone network, a ViT trained with self-supervision, is employed to extract multi-scale visual features. In the representation of the visual features generated by the DINOv2 encoder, the symbol denotes the visual features and m is the number of scales;
[0050] A3: The feature tokens extracted by the CLIP encoder and the DINOv2 encoder are concatenated along the channel dimension and aligned with the dimension of the natural language instruction tokens using a learnable mapping layer, yielding a powerful image feature representation.
[0051] In the formula, $p_v$ denotes the image features after dimensional transformation.
[0052] The relation is:
[0053] A4: The natural language instruction tokens are denoted $p_l$. Concatenating $p_v$ and $p_l$ yields the multimodal input:
[0054]
[0055] In the formula, $N_v$ is the length of the visual feature tokens and $N_l$ is the length of the language feature tokens. The visual feature tokens are taken from $p_v$, and the natural language instruction tokens are taken from $p_l$. This step completes the fusion of visual and linguistic information.
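The construction of the multimodal input in sub-steps A1 to A4 can be sketched as follows. This is a minimal illustration under stated assumptions: the multi-scale features of each encoder are taken to be already flattened into the same number of tokens, the learnable mapping layer is modelled as a single weight matrix, visual tokens are placed before language tokens, and all sizes and function names are hypothetical.

```python
import random


def concat_channels(clip_tokens, dino_tokens):
    """Concatenate two token sequences of equal length along the channel axis."""
    if len(clip_tokens) != len(dino_tokens):
        raise ValueError("both encoders must yield the same number of tokens")
    return [c + d for c, d in zip(clip_tokens, dino_tokens)]


def project(tokens, weight):
    """Map each token from C_in to D channels with a (C_in x D) weight matrix."""
    d_out = len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(d_out)]
            for t in tokens]


def build_multimodal_input(clip_tokens, dino_tokens, weight, p_l):
    """Return the N_v visual tokens (p_v) followed by the N_l language tokens (p_l)."""
    p_v = project(concat_channels(clip_tokens, dino_tokens), weight)
    return p_v + p_l


# Hypothetical sizes: N_v = 4 visual tokens, N_l = 5 language tokens, D = 6.
rng = random.Random(0)
n_v, n_l, c_clip, c_dino, d = 4, 5, 3, 2, 6
clip_tokens = [[rng.random() for _ in range(c_clip)] for _ in range(n_v)]
dino_tokens = [[rng.random() for _ in range(c_dino)] for _ in range(n_v)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(c_clip + c_dino)]
p_l = [[rng.random() for _ in range(d)] for _ in range(n_l)]

p_in = build_multimodal_input(clip_tokens, dino_tokens, weight, p_l)
print(len(p_in), len(p_in[0]))  # 9 6
```

The mapping layer is what lets tokens from two different visual encoders share one sequence with the language tokens, since every token ends up with the same dimension D.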
[0056] S2: Visual-language alignment training is performed with LoRA, a parameter-efficient fine-tuning technique. To endow the large language model with basic visual understanding capabilities, widely used general-domain data, such as the LAION-400M and COCO Caption datasets, are used for pre-training. To avoid the expense of traditional full-parameter fine-tuning and the risk of overfitting, Low-Rank Adaptation (LoRA) is used for training. Specifically:
[0057] Based on the LLaMA large language model, the model was pre-trained using the LoRA (Low-Rank Adaptation) technique on the LAION-400M and COCO Caption datasets.
[0058] The LLaMA large language model consists of multiple Transformer blocks. The LoRA technique freezes the entire LLaMA weight matrix during pre-training and injects learnable rank-decomposition matrices into the top L layers of the Transformer architecture to reduce the number of trainable parameters for downstream tasks. For the l-th Transformer block, the single-head attention $\mathrm{Att}_h^l$ and the multi-head attention $\mathrm{MultiAtt}^l$ are calculated as follows:
[0059] $\mathrm{Att}_h^l = \mathrm{softmax}\left(\frac{Q_h^l (K_h^l)^T}{\sqrt{d_k}}\right) V_h^l$
[0060] $\mathrm{MultiAtt}^l = \mathrm{Concat}\left(\mathrm{Att}_1^l, \ldots, \mathrm{Att}_H^l\right) W_o^l$
[0061] In the formula, H is the number of attention heads, and each attention head of the l-th Transformer block is indexed by $h = 1, \ldots, H$. $Q_h^l$, $K_h^l$, and $V_h^l$ are the query, key, and value matrices of head h, obtained with real-valued weight matrices, and $W_o^l$ is the output projection weight matrix. These attention weights are initially adapted for downstream tasks. In particular, in the top layers of the Transformer architecture, learnable low-rank weight matrices are introduced. The multi-head attention module after LoRA adaptation is denoted Adapted Attn, and its output for the l-th Transformer block is defined as follows:
[0062]
[0063] In the formula, the learnable low-rank weight matrices lie in $\mathbb{R}^{D \times D}$, and the rank of each trainable low-rank adapter matrix is much smaller than D.
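The single-head and multi-head attention that LoRA adapts can be sketched in a few lines. This is the standard scaled dot-product formulation written with plain Python lists; the function names and the toy inputs are illustrative assumptions, and the per-head query, key, and value projections are taken as already computed.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head; q, k, v are lists of rows."""
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    weights = [softmax(row) for row in scores]
    return [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
            for row in weights]


def multi_head(head_outputs, w_o):
    """Concatenate per-head outputs along features, then project with W_o."""
    n_tokens = len(head_outputs[0])
    concat = [sum((head[i] for head in head_outputs), []) for i in range(n_tokens)]
    return [[sum(t[i] * w_o[i][j] for i in range(len(t))) for j in range(len(w_o[0]))]
            for t in concat]


# Toy example: one query token, two key/value tokens, two identical heads.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
head = attention(q, k, v)  # zero query gives uniform weights: [[2.0, 3.0]]
w_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
out = multi_head([head, head], w_o)  # [[4.0, 6.0]]
```

Dividing the scores by the square root of the key dimension keeps their magnitude stable as the dimension grows, which is the role of the scaling factor described above.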
[0064] In summary, the process starts with a frozen LLaMA and then fine-tunes it by optimizing four smaller matrices in LoRA, thereby enhancing the alignment and interactive understanding between vision and language.
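A short worked calculation shows why optimizing the small LoRA matrices is much cheaper than full fine-tuning. The numbers are illustrative assumptions: a model dimension D = 4096, an adapter rank r = 8, and four adapted D×D projections per attention block, each factorized into a D×r and an r×D matrix.

```python
def lora_param_counts(d_model, rank, num_matrices=4):
    """Compare trainable parameters of full D x D updates and rank-r LoRA updates."""
    full = num_matrices * d_model * d_model
    lora = num_matrices * 2 * d_model * rank
    return full, lora


# Hypothetical values; the source does not state D or the rank.
full, lora = lora_param_counts(d_model=4096, rank=8)
print(full)         # 67108864
print(lora)         # 262144
print(lora / full)  # 0.00390625
```

Under these assumptions, the adapters hold about 0.4% of the parameters of the projections they modify, while the original weights stay frozen.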
[0065] After pre-training, the proposed algorithm achieves appropriate visual and linguistic alignment. To further enhance the model's instruction-following ability and adapt it to the remote sensing ship domain, the visual large language model is further fine-tuned on a newly constructed multi-source ship multimodal dataset. To better stimulate the learning of the multi-source, cross-modal large language model, more learnable parameters are added and a transfer learning stage for the ship domain is incorporated, as follows:
[0066] S3: Construct a knowledge transfer framework to adapt to the field of ship remote sensing, specifically including:
[0067] Insert a bias matrix $\Delta W_b$ and a scale matrix $\Delta W_s$ into each linear layer of the Transformer as two learnable parameters. Given a linear layer $f(x) = Wx$, the linear layer is transformed into $f(x) = \Delta W_s (Wx + \Delta W_b)$,
[0068] where
[0069] $\Delta W_b = \mathrm{Init}(0)$
[0070] In the formula, the weight matrices $W, \Delta W_b, \Delta W_s \in \mathbb{R}^{D \times D}$. The bias matrix is initialized to zero (Init(0)), and the scale matrix is initialized from a random Gaussian distribution. This helps maintain the stability and effectiveness of fine-tuning.
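The transformed linear layer $f(x) = \Delta W_s (Wx + \Delta W_b)$ can be sketched as below. The source declares both added parameters as D×D matrices; so that the shapes compose for a single input vector x, this sketch assumes a bias vector of length D and applies the D×D scale matrix by matrix multiplication. The class name `BiasScaleLinear` and the Gaussian standard deviation `sigma` are also assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class BiasScaleLinear:
    """Frozen D x D weight W with learnable bias and scale: scale @ (W x + bias)."""

    def __init__(self, weight, sigma=0.02, seed=0):
        rng = random.Random(seed)
        d = len(weight)
        self.weight = weight   # frozen pre-trained weight
        self.bias = [0.0] * d  # Init(0), as stated in the text
        # Random Gaussian initialization of the scale, as stated in the text;
        # the standard deviation is an assumed value.
        self.scale = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(d)]

    def __call__(self, x):
        shifted = [y + b for y, b in zip(matvec(self.weight, x), self.bias)]
        return matvec(self.scale, shifted)


layer = BiasScaleLinear([[1.0, 2.0], [3.0, 4.0]])
y = layer([1.0, 1.0])  # scale @ (W x), since the bias starts at zero
```

During domain adaptation only `bias` and `scale` would be updated, leaving the aligned weight W untouched.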
[0071] Different parameter optimization methods are adopted in the visual-language alignment stage and the ship domain adaptation stage. This effectively resolves the interference between image-text understanding and instruction-following ability, thereby enhancing the emergent ability of the visual large language model Popeye to interactively handle multi-source ship horizontal bounding box and rotated bounding box detection tasks according to language instructions.
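The difference between the two detection outputs mentioned here can be illustrated with a small geometric sketch. It assumes a rotated box parameterized as (cx, cy, w, h, theta), with theta in radians, and a horizontal box as (x_min, y_min, x_max, y_max); the source does not specify either encoding, so both are hypothetical.

```python
import math


def rotated_box_corners(cx, cy, w, h, theta):
    """Return the four corners of a rotated box, rotating about its centre."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners


def enclosing_horizontal_box(corners):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# A 4 x 2 ship-like box rotated by 90 degrees about the origin.
corners = rotated_box_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2)
hbb = enclosing_horizontal_box(corners)  # approximately (-1, -2, 1, 2)
```

For elongated targets such as ships at an oblique heading, the enclosing horizontal box covers far more background than the rotated box, which is why both output types are of interest.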
[0072] S4: Combine the visual large language model Popeye with the SAM (Segment Anything) model to achieve language-instructed, pixel-level ship segmentation:
[0073] Segment Anything (SAM) performs prompt-based segmentation (with points, bounding boxes, masks, etc. as prompts). By designing appropriate prompts for the model, a wide range of segmentation tasks can be accomplished. However, remote sensing images contain complex background interference and blurred target edges, which severely degrade the segmentation performance of SAM on remote sensing images. Integrating the proposed model with SAM specifically enhances SAM's capabilities on remote sensing ship images;
[0074] The specific segmentation process includes: encoding the input remote sensing image into an image feature embedding using the SAM image encoder, which is pre-trained with a Masked Autoencoder (MAE); encoding the detection bounding boxes generated by the proposed model into a prompt embedding; and combining the image embedding and prompt embedding using the mask decoder to output the predicted segmentation mask. The entire process is represented as follows:
[0075] $F_{visual} = \mathrm{SAM}_{v\text{-}enc}(I)$
[0076] $B_{det} = \mathrm{Popeye}(I)$
[0077] $F_{prompt} = \mathrm{SAM}_{p\text{-}enc}(B_{det})$
[0078] $\Omega = \mathrm{SAM}_{dec}(F_{visual}, F_{prompt})$
[0079] In the formula, $I \in \mathbb{R}^{H \times W \times 3}$ denotes the input image, $F_{visual}$ denotes the image feature representation extracted by the SAM image encoder, $B_{det}$ denotes the sparse prompt comprising the ship detection bounding boxes, $F_{prompt}$ denotes the sparse prompt tokens encoded by the prompt encoder, and $\Omega$ denotes the set of predicted masks.
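The data flow of the four equations above can be sketched as a simple composition of callables. The function `segment_ships` and all four stub components are hypothetical stand-ins written for illustration; they are not the API of the SAM or Popeye implementations, and the stub decoder merely rasterizes each box into a mask.

```python
def segment_ships(image, sam_image_encoder, popeye_detect,
                  sam_prompt_encoder, sam_mask_decoder):
    """Compose the four stages: encode image, detect boxes, encode prompts, decode."""
    f_visual = sam_image_encoder(image)           # F_visual = SAM_v-enc(I)
    b_det = popeye_detect(image)                  # B_det = Popeye(I)
    f_prompt = sam_prompt_encoder(b_det)          # F_prompt = SAM_p-enc(B_det)
    return sam_mask_decoder(f_visual, f_prompt)   # Omega = SAM_dec(F_visual, F_prompt)


# Stub components (assumptions) so that the sketch runs end to end.
def stub_image_encoder(image):
    return {"height": len(image), "width": len(image[0])}


def stub_popeye(image):
    return [(1, 1, 3, 2)]  # one box as (x0, y0, x1, y1), inclusive


def stub_prompt_encoder(boxes):
    return list(boxes)


def stub_mask_decoder(f_visual, f_prompt):
    masks = []
    for x0, y0, x1, y1 in f_prompt:
        masks.append([[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0
                       for x in range(f_visual["width"])]
                      for y in range(f_visual["height"])])
    return masks


image = [[0] * 6 for _ in range(4)]  # a blank 4 x 6 placeholder image
masks = segment_ships(image, stub_image_encoder, stub_popeye,
                      stub_prompt_encoder, stub_mask_decoder)
```

Passing the components as arguments keeps the detector and the segmenter decoupled: the only interface between them is the set of detected boxes used as prompts.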
[0080] This invention proposes a multimodal large language model for interpreting multi-source ship remote sensing images. This method can understand and uniformly process various ship interpretation tasks, such as ship horizontal bounding box detection, ship rotated bounding box detection, and ship pixel-level segmentation. Technically, we construct a novel visual-language aligned image interpretation method and design a knowledge adaptation and transfer paradigm to adapt to the field of ship remote sensing. Utilizing the powerful generalization ability of the large language model, we achieve a general paradigm for interpreting multi-source ship remote sensing images.
[0081] Finally, it should be noted that the above preferred embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes can be made to it in form and detail without departing from the scope defined by the claims of the present invention.
Claims
1. A method for constructing a visual large language model for interpreting multi-source ship remote sensing images, characterized in that it includes the following steps: S1: for each input image I of the ship remote sensing images, extract visual features using the CLIP visual backbone network and the DINOv2 visual backbone network respectively; concatenate, along the channel dimension, the multi-scale visual features extracted by the CLIP encoder and the multi-scale visual features extracted by the DINOv2 encoder; align them, through a learnable mapping layer, with the language features $p_l$ represented by natural language instruction tokens to obtain the dimension-transformed image features $p_v$; and concatenate the image features $p_v$ and the language features $p_l$ to obtain the multimodal input; S2: based on the LLaMA large language model, use LoRA to perform alignment training on the image features $p_v$ and the language features $p_l$ to obtain the visual large language model Popeye; S3: add two learnable parameters, a bias matrix $\Delta W_b$ and a scale matrix $\Delta W_s$, to construct a knowledge transfer framework for the ship remote sensing domain; S4: combine the visual large language model Popeye with the SAM model to achieve language-instructed, pixel-level ship segmentation: the input ship remote sensing image is encoded into an image embedding by the SAM image encoder, which is pre-trained with a MAE; the detection bounding boxes generated by the LLaMA large language model are encoded into a prompt embedding; and the image embedding and prompt embedding are combined by the mask decoder to output the predicted segmentation mask.
2. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 1, characterized in that: in step S1, the multi-scale visual features extracted by the CLIP encoder are represented by visual features in which $H \times W \times C$ is the input image resolution, n is the number of scales, and $\mathbb{R}$ is the set of real numbers.
3. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 2, characterized in that: in step S1, the multi-scale visual features extracted by the DINOv2 encoder are represented by visual features in which m is the number of scales.
4. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 1, characterized in that: in step S2, the alignment training specifically includes the following steps: A1: the LLaMA large language model consists of multiple Transformer blocks and is pre-trained with LoRA on the LAION-400M and COCO Caption datasets; A2: LoRA is used to freeze the entire LLaMA weight matrix during pre-training and to inject learnable rank-decomposition matrices into the top L layers of the Transformer architecture, wherein the single-head attention $\mathrm{Att}_h^l$ and the multi-head attention $\mathrm{MultiAtt}^l$ of the l-th Transformer block are calculated as follows: $\mathrm{Att}_h^l = \mathrm{softmax}\left(\frac{Q_h^l (K_h^l)^T}{\sqrt{d_k}}\right) V_h^l$ and $\mathrm{MultiAtt}^l = \mathrm{Concat}\left(\mathrm{Att}_1^l, \ldots, \mathrm{Att}_H^l\right) W_o^l$; in the formula, H is the number of attention heads, each attention head of the l-th Transformer block is denoted h, the weight matrices are real-valued, $\mathbb{R}$ is the set of real numbers, D is the feature dimension, $d_k$ is the dimension of the K matrix, $1/\sqrt{d_k}$ is the scaling factor, T denotes the matrix transpose, and softmax is an activation function; A3: in the top L layers of the Transformer architecture, learnable low-rank weight matrices are introduced, and the output of the multi-head attention module of the l-th Transformer block after LoRA adaptation is given by a formula in which the four added terms denote learnable low-rank weight matrices.
5. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 4, characterized in that: in step S3, constructing the knowledge transfer framework for the ship remote sensing domain includes the following steps: B1: insert a bias matrix $\Delta W_b$ and a scale matrix $\Delta W_s$ into each linear layer of the Transformer as two learnable parameters; B2: given a linear layer $f(x) = Wx$, transform the linear layer into $f(x) = \Delta W_s (Wx + \Delta W_b)$; in the formula, the weight matrices $W, \Delta W_b, \Delta W_s \in \mathbb{R}^{D \times D}$, the bias matrix is initialized as $\Delta W_b = \mathrm{Init}(0)$, and the scale matrix $\Delta W_s$ is initialized from a random Gaussian distribution.
6. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 4, characterized in that: in step S4, the segmentation process is performed using the following algorithm: $F_{visual} = \mathrm{SAM}_{v\text{-}enc}(I)$, $B_{det} = \mathrm{Popeye}(I)$, $F_{prompt} = \mathrm{SAM}_{p\text{-}enc}(B_{det})$, $\Omega = \mathrm{SAM}_{dec}(F_{visual}, F_{prompt})$; in the formula, $I$ denotes the input image, $F_{visual}$ denotes the image feature representation extracted by the SAM image encoder, $B_{det}$ denotes the sparse prompt comprising the ship detection bounding boxes, $F_{prompt}$ denotes the sparse prompt tokens encoded by the prompt encoder, $\Omega$ denotes the set of predicted masks, $\mathrm{SAM}_{v\text{-}enc}$ is the image encoder of the SAM model, $\mathrm{SAM}_{p\text{-}enc}$ is the prompt encoder of the SAM model, $\mathrm{SAM}_{dec}$ is the mask decoder of the SAM model, and Popeye is the visual large language model.
7. The method for constructing a visual large language model for multi-source ship remote sensing image interpretation according to claim 1, characterized in that: in step S1, the multimodal input is represented by a formula in which $N_v$ is the length of the visual feature tokens and $N_l$ is the length of the language feature tokens, the visual feature tokens being taken from $p_v$ and the natural language instruction tokens being taken from $p_l$.