A visual-based marine work equipment identification method

By combining SAM and Grounding DINO models, automated identification and detection of offshore operational equipment are achieved, solving the problem of equipment status assessment under complex marine conditions and realizing high-precision and robust monitoring of offshore operational equipment.

CN119863700B · Active · Publication Date: 2026-06-19 · TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA SHENZHEN INTERNATIONAL GRADUATE SCHOOL
Filing Date: 2024-12-25
Publication Date: 2026-06-19

IPC: G06V20/10; G06V10/25; G06V10/26; G06V20/70; G06V10/774; G06V10/764; G06V10/80; G06V10/82; G06V10/766; G06N3/0464; G06N3/0895

AI Tagging

Application Domain

Character and pattern recognition Biological models

AI Technical Summary

Technical Problem

During offshore operations, passing ships increase the risk of collision. Existing technologies struggle to assess the operational status of equipment and to identify potential risk factors under complex marine conditions. In particular, waves, changes in water quality, and glare make it difficult to distinguish targets from the background, leading to frequent misjudgments.

Method used

The SAM model is used for image segmentation to generate high-quality target masks, which are then combined with the Grounding DINO model for automatic detection. The model is fine-tuned through cross-modal feature fusion and self-supervised learning strategies to improve detection accuracy and generalization ability, forming an iterative optimization process.

Benefits of technology

It has enabled automated and real-time detection of offshore equipment, improved the accuracy and robustness of detection, and can accurately identify offshore equipment in complex environments, thereby enhancing the safety and efficiency of offshore operations.

Abstract

This invention provides a vision-based method for monitoring offshore operational equipment, comprising: acquiring and preprocessing images of the offshore operation site; automatically generating high-quality target masks using the SAM model for image segmentation; combining user-input text prompts; automatically detecting offshore equipment in the images and generating bounding boxes; and refining the detection results by combining the bounding boxes with the segmentation masks from the SAM model. The Grounding DINO model is fine-tuned on the SAM automatically labeled dataset to improve its adaptability to the offshore environment, and its detection performance is continuously improved through an iterative optimization process. The method of this invention demonstrates superior performance on the ShipDataset dataset, with an Average Precision (AP) exceeding that of traditional models, especially in offshore equipment detection, exhibiting high robustness and accuracy, providing strong technical support for the safety and efficiency of offshore operations.

Description

Technical Field

[0001] This invention relates to computer vision and marine work equipment identification technology, and in particular to a vision-based, efficient method for identifying marine work equipment.

Background Technology

[0002] When operating at sea, the risk of collision increases when other vessels are passing nearby, which can severely impact the success rate of operations. To mitigate this risk, it is necessary to be able to more comprehensively assess the operational status of equipment under complex marine conditions and identify potential risk factors in advance.

[0003] In recent years, the technology for automatic detection of ships and objects on the water surface has made significant progress, mainly divided into three categories: (1) background modeling and subtraction methods, (2) methods based on human visual attention mechanisms, and (3) techniques based on edge and texture features. Background modeling and subtraction techniques model the sea surface background through median modeling or using Gaussian models and Gaussian mixture models, and then identify ships by subtracting the background. Although such methods are effective, their performance is limited on non-fixed camera platforms. The second category of methods is based on human visual attention models, using local contrast or spectral residuals to construct saliency maps to highlight target areas. Although these methods can reduce ocean background noise to some extent, they are not effective when dealing with large areas of waves. The third category of methods focuses on using edge and texture features, such as gradient information, Hough transform, local binary patterns, and gradient histograms, for ship detection. These techniques improve detection accuracy by analyzing the texture information of the sea surface background, but their computational complexity poses a significant challenge to real-time video processing.
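The first category above, median background modeling followed by subtraction, can be illustrated with a minimal pure-Python sketch. The function names, the toy frames, and the threshold of 30 grey levels are illustrative assumptions, not part of the patent.

```python
from statistics import median

def median_background(frames):
    """Per-pixel median over a list of equally sized grayscale frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]

def foreground_mask(frame, background, threshold=30):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return [
        [1 if abs(pixel - bg) > threshold else 0 for pixel, bg in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]

# Three 2x3 frames of a calm sea (values near 100); a bright object appears in the last one.
frames = [
    [[100, 101, 99], [100, 100, 100]],
    [[101, 100, 100], [99, 100, 101]],
    [[100, 100, 100], [100, 220, 100]],
]
background = median_background(frames)
mask = foreground_mask(frames[-1], background)
```

Because the median is taken per pixel position, the sketch only works when the camera is fixed, which is the limitation the paragraph above points out for moving platforms.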

[0004] Equipment detection on offshore platforms faces significant challenges (see Figure 1). The main difficulty is interference from light reflected off the water surface (glare), which reduces the accuracy of the algorithm. Furthermore, dynamic marine conditions such as waves, water quality, and weather changes further exacerbate the detection challenge. These factors degrade image quality, making it difficult to distinguish the target from the background. For example, the similar visual features of waves and small devices can easily lead to misjudgment, and changes in water transparency also affect image clarity.

[0005] It should be noted that the information disclosed in the background section above is only for understanding the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] The main objective of this invention is to overcome the deficiencies in the aforementioned background technology and provide a vision-based method for identifying marine work equipment.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A vision-based method for identifying marine work equipment includes the following steps:

[0009] S1. Collect image data from the offshore operation site and preprocess it to prepare a dataset for subsequent model training and testing;

[0010] S2. Image segmentation is performed using the SAM model to automatically generate high-quality target object masks. The SAM model performs pixel-level analysis on the preprocessed image to identify and segment the marine operation equipment in the image, generate the corresponding segmentation mask, and construct an automatically labeled dataset containing rich images of marine operation equipment and corresponding masks.

[0011] S3. Automatic detection using the Grounding DINO model, where the Grounding DINO model automatically detects marine equipment in the image and generates detection boxes based on user-input text prompts. These boxes are then combined with the segmentation mask generated by the SAM model to refine the detection results.

[0012] S4. On the automatically labeled dataset constructed by the SAM model, the Grounding DINO model is fine-tuned through cross-modal feature fusion and self-supervised learning strategies to better adapt it to the offshore operating platform environment and improve the model's generalization ability and detection accuracy.

[0013] S5. Iterative optimization of the annotation and target detection system, in which the fine-tuned Grounding DINO model is substituted back into the automatic annotation system. The automatic annotation system uses the fine-tuned model to generate more accurate detection boxes and segmentation masks, improving annotation accuracy, and the newly annotated data is used to continue fine-tuning the Grounding DINO model, forming an iterative optimization process that continuously improves detection performance.
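Step S3 states that detection boxes are combined with the segmentation mask to refine the result, but it does not say how. One plausible reading, sketched here purely as an assumption, is to shrink each box to the mask pixels that fall inside it. The helper names `mask_bounding_box` and `refine_box` are hypothetical.

```python
def mask_bounding_box(mask):
    """Return the tight (x0, y0, x1, y1) box around nonzero mask pixels, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, value in enumerate(row) if value]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

def refine_box(box, mask):
    """Shrink a detection box to the mask pixels inside it; keep the box if none are found."""
    x0, y0, x1, y1 = box
    clipped = [
        [value if (x0 <= x < x1 and y0 <= y < y1) else 0 for x, value in enumerate(row)]
        for y, row in enumerate(mask)
    ]
    tight = mask_bounding_box(clipped)
    return tight if tight is not None else box

# A 2x2 object plus one stray mask pixel outside the loose detection box.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
refined = refine_box((1, 0, 5, 4), mask)
```

Boxes use half-open pixel coordinates, so the refined box `(2, 1, 4, 3)` covers exactly the 2x2 object and ignores the stray pixel.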

[0014] Furthermore, the SAM model includes:

[0015] The input processing module is used to standardize the input image, including resizing, color space conversion, and enhancement techniques, to improve the robustness of the model.

[0016] The feature extraction network adopts a CNN-based backbone network, such as ResNet or EfficientNet, and gradually extracts high-level features of the image through multi-layer convolution operations. At the same time, a self-attention mechanism is introduced to improve the feature representation capability.

[0017] The segmentation mask generation module uses a combination of multilayer perceptron (MLP) and convolutional layers to generate the final segmentation mask. It also considers contextual information to improve segmentation accuracy and applies post-processing algorithms such as conditional random field (CRF) to optimize the segmentation results.

[0018] The post-processing module includes edge smoothing technology and pixel recalibration scheme, which uses graphics algorithms to optimize the segmented contours and enhance the quality of the final mask.

[0019] Furthermore, the SAM model is trained using a diverse dataset containing both labeled and unlabeled images to enhance its performance under self-supervised learning; a combination of cross-entropy loss and contrastive loss is employed to ensure the model's accuracy in pixel-level segmentation; and dynamic learning rate adjustment based on training progress is implemented, adjusting the learning rate in real time according to the model's training progress and actual performance to ensure that the model can converge quickly in the early stages of training and remain stable in the later fine-tuning stages.
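The dynamic learning-rate adjustment is described only in general terms. A minimal sketch of one way to combine a progress-based schedule with a performance-based correction is shown below; the cosine decay, the plateau rule, and all default values are assumptions chosen for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=1e-5):
    """Progress-based part: decay from base_lr to min_lr along a half cosine."""
    progress = min(step / max(1, total_steps), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def plateau_scale(val_losses, patience=2, factor=0.5):
    """Performance-based part: multiply by `factor` each time `patience`
    consecutive evaluations fail to improve on the best validation loss."""
    scale, best, stale = 1.0, float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                scale *= factor
                stale = 0
    return scale

def learning_rate(step, total_steps, val_losses):
    """Combine both signals into the rate used for the next update."""
    return cosine_lr(step, total_steps) * plateau_scale(val_losses)
```

The rate starts high so that early training moves quickly, then shrinks both with elapsed progress and whenever validation loss stalls, which matches the stated goal of fast early convergence and stable late fine-tuning.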

[0020] Furthermore, the Grounding DINO model is based on the DINO architecture, integrating self-supervised learning and dynamic feature extraction; it processes input images through multi-scale feature representation to enhance the model's ability to capture details; it utilizes self-supervised learning to train the model on unlabeled data; and the model dynamically adjusts the feature extraction process to adapt to different maritime environments and equipment characteristics.

[0021] Furthermore, the Grounding DINO model possesses the following characteristics:

[0022] The normalization parameters are dynamically adjusted based on the distribution of input features to improve feature representation and alleviate the performance degradation problem of batch normalization.

[0023] During training, the model employs a self-supervised learning strategy, training on unlabeled data to discover potential structures in the data, reducing reliance on labeled data and improving generalization ability.

[0024] The model uses a multi-layer convolutional neural network (CNN) structure to extract features at different levels, capture the diversity of objects, and combine the object's appearance features with contextual information to accurately locate the target object.

[0025] In the SAM-Grounding DINO framework, users interactively select portions of an image as expert cues, and the model automatically matches entities and generates accurate bounding boxes and masks, achieving automatic annotation of the entire image.

[0026] Furthermore, the operation process of the automatic annotation system includes:

[0027] The automatic segmentation capability of the SAM model is used to assist annotators in performing preliminary mask annotations on target objects in images, providing foundational data for subsequent automatic annotation and model training.

[0028] The manually labeled data is used to fine-tune the Grounding DINO model, enabling it to learn the characteristics of marine engineering equipment and improve its performance in detection tasks.

[0029] The finely tuned Grounding DINO model is applied to the unlabeled image to automatically detect marine engineering equipment and generate detection boxes. The detection results are input into the SAM model along with the unlabeled image to generate an accurate segmentation mask, thus completing fully automatic annotation.

[0030] Through cyclical adaptive learning, the model learns the features of marine engineering equipment in specific scenarios during the manual annotation stage, and applies these features during the automatic annotation stage, thereby expanding the annotation scope and efficiency.
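The automatic annotation stage described above (detect boxes from a text prompt, then segment each box) can be sketched as a small driver function. The callables `detect` and `segment` are hypothetical stand-ins for the fine-tuned detector and the box-prompted segmenter, and the score threshold of 0.35 is an assumed value; no real model API is implied.

```python
def auto_annotate(images, prompt, detect, segment, min_score=0.35):
    """Label each image with boxes from `detect` and masks from `segment`.

    `detect(image, prompt)` must return (box, score) pairs and
    `segment(image, box)` must return a mask. Both are hypothetical
    stand-ins for a text-prompted detector and a box-prompted segmenter.
    """
    annotations = []
    for image_id, image in enumerate(images):
        for box, score in detect(image, prompt):
            if score < min_score:
                continue  # discard low-confidence detections
            annotations.append({
                "image_id": image_id,
                "label": prompt,
                "box": box,
                "score": score,
                "mask": segment(image, box),
            })
    return annotations

# Toy stand-ins so the sketch runs without any model weights.
def toy_detect(image, prompt):
    return [((0, 0, 2, 2), 0.9), ((1, 1, 3, 3), 0.2)]

def toy_segment(image, box):
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(3)] for y in range(3)]

records = auto_annotate(["frame_a", "frame_b"], "crane", toy_detect, toy_segment)
```

Passing the models in as callables keeps the driver independent of any particular checkpoint, so the same loop can be reused after each fine-tuning round.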

[0031] Furthermore, the Grounding DINO model includes:

[0032] A text encoder receives input text and converts it into text features, providing semantic information to the model.

[0033] An image encoder processes an input image, extracts image features, and provides a numerical description of the image content.

[0034] The text-image fusion module combines text and image features and enhances feature representation capabilities through cross-modal learning strategies to understand and associate text and image content.

[0035] The language-guided query selection module utilizes the fused features to select image regions related to the input text through language guidance, thereby achieving target localization.

[0036] A cross-modal decoder combines text features (Key) and image features (Value) to generate accurate detection boxes, thus mapping text to image content.
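The language-guided selection step can be illustrated with a minimal sketch. It assumes one common formulation: score every image token by its highest dot-product similarity to any text token, then keep the top-scoring tokens as decoder queries. The function name `select_queries` and the toy feature vectors are hypothetical.

```python
def select_queries(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # For each image token, keep its best match over all text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]

image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.2, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
```

Here the first and third image tokens align best with the text features, so they are the ones passed on as queries.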

[0037] Furthermore, the fine-tuning of the Grounding DINO model includes:

[0038] Visual features are extracted from images using the Swin Transformer, text features are extracted from the descriptions using BERT, and cross-modal feature fusion is performed using a feature enhancer module.

[0039] Feature fusion is enhanced by adding image-to-text cross-attention and text-to-image cross-attention, as well as a language-guided query selection module;

[0040] Create a cross-modal decoder that fuses text and image features to improve modality alignment and feature representation capabilities;

[0041] During fine-tuning, L1 loss and GIoU loss are used for bounding box regression; following GLIP, a contrastive loss between predicted objects and language labels is used for classification.
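The box regression term can be made concrete with a short sketch of the L1 and GIoU losses for axis-aligned boxes. The weights 5.0 and 2.0 are assumed values commonly used in DETR-style detectors; the patent does not state the weighting.

```python
def giou(box_a, box_b):
    """Generalized IoU for two (x0, y0, x1, y1) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull

def box_regression_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of the coordinate-wise L1 distance and the GIoU loss (1 - GIoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes negative and decreases as the boxes move apart, so the loss still provides a useful signal when a predicted box misses the target entirely.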

[0042] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method.

[0043] A computer program product includes a computer program that, when executed by a processor, implements the method.

[0044] Compared with the prior art, the present invention has the following beneficial effects:

[0045] The main technical advantages of the vision-based marine equipment identification method of this invention include:

[0046] Automated processing and real-time detection: Utilizing the visual foundation models SAM and Grounding DINO, automated processing and real-time detection of marine platform data are achieved. Image data is processed automatically, and marine operational equipment is detected in real time. It can also quickly and accurately perceive and track changes in the physical world, reflecting these changes in the virtual scene in real time. This ensures the consistency between the virtual environment and the real marine operational environment, thus providing an efficient and automated solution for monitoring marine operational equipment.

[0047] High-precision image segmentation and annotation: The SAM model automatically identifies and segments marine operation equipment in images through pixel-level analysis, generates high-quality segmentation masks, provides accurate image segmentation information for automatic annotation systems, and constructs an automatic annotation dataset containing rich images of marine operation equipment and corresponding masks.

[0048] Enhanced detection accuracy: By combining user-input text prompts, the Grounding DINO model automatically detects marine equipment in images and generates detection boxes. The detection boxes are combined with the segmentation mask generated by the SAM model to refine the detection results and improve detection accuracy.

[0049] Adaptability and generalization ability: By fine-tuning the Grounding DINO model on the automatically labeled dataset built by the SAM model, the model is better adapted to the offshore operating platform environment, thereby improving the model's generalization ability and detection accuracy.

[0050] Iterative optimization: The fine-tuned Grounding DINO model is substituted back into the automatic annotation system. This model is then used to automatically generate more accurate detection boxes and segmentation masks, improving annotation accuracy. The newly annotated data is then used to continue fine-tuning the Grounding DINO model, forming an iterative optimization process that continuously improves detection performance.

[0051] Robustness and Efficiency: Experiments on the ShipDataset ship image dataset demonstrate that the proposed method exhibits superior performance compared to other methods, achieving the highest Average Precision (AP) and showcasing outstanding performance in ship detection. Furthermore, the method outperforms the YOLOv7 model on the marine engineering equipment dataset, proving its separation accuracy and robustness in the open domain.

[0052] Capability to cope with complex marine environments: The method of this invention can accurately detect target content based on user-input text in marine engineering scenarios, and can accurately label even in the case of multiple targets and complex backgrounds. It demonstrates the ability to identify wind turbines in wind farm environments, as well as the detection accuracy of large marine engineering vessels and their crane equipment in complex marine operation environments.

[0053] In summary, the marine equipment identification method of the present invention significantly improves the automation level and detection accuracy of marine equipment monitoring through steps such as automated processing, precise segmentation, text prompt detection, model fine-tuning, and iterative optimization, providing new solutions and technical support for virtual-real integration.

[0054] Other beneficial effects of the embodiments of the present invention will be further described below.

Attached Figure Description

[0055] Figure 1 illustrates the challenges faced in marine object detection.

[0056] Figure 2 is a flowchart illustrating the annotation and fine-tuning process based on a visual foundation model, as described in an embodiment of the present invention.

[0057] Figure 3 is a schematic diagram illustrating the perception capability of the SAM model in an embodiment of the present invention.

[0058] Figure 4 is an architecture diagram of the SAM model according to an embodiment of the present invention.

[0059] Figure 5 is an architecture diagram of the Grounding DINO model according to an embodiment of the present invention.

[0060] Figure 6 shows the output of the SAM model in the automatic labeling system for marine engineering equipment according to an embodiment of the present invention.

[0061] Figure 7 is a flowchart of the automatic labeling system for marine engineering equipment according to an embodiment of the present invention.

[0062] Figure 8 is a network flowchart of Grounding DINO according to an embodiment of the present invention.

[0063] Figure 9 shows the results of the embodiments of the present invention on the constructed marine engineering dataset.

Detailed Implementation

[0064] The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary and not intended to limit the scope and application of the present invention.

[0065] To comprehensively and accurately assess the operational status of offshore equipment under complex marine conditions and identify potential risk factors in advance, this invention proposes a vision-based method for monitoring offshore equipment. The method mainly includes: acquiring and preprocessing images of the offshore operation site; using the SAM model for image segmentation to automatically generate high-quality target masks; combining user-input text prompts; and using the Grounding DINO model to automatically detect offshore equipment in the images and generate bounding boxes, which are then combined with the segmentation masks from the SAM model to refine the detection results. The Grounding DINO model is fine-tuned on the SAM automatically labeled dataset to improve its adaptability to the marine environment, and its detection performance is continuously improved through an iterative optimization process. This method demonstrates superior performance on the ShipDataset dataset, with AP surpassing traditional models, especially in offshore equipment detection, exhibiting high robustness and accuracy, providing strong technical support for the safety and efficiency of offshore operations.

[0066] Referring to Figures 2 to 9, a vision-based method for identifying marine work equipment includes the following steps:

[0067] Step S1. Data acquisition and preprocessing, wherein image data of the offshore operation site is acquired and preprocessed to prepare a dataset for subsequent model training and testing;

[0068] Step S2. Use the SAM (Segment Anything Model) model to perform image segmentation and automatically generate high-quality target object masks. The SAM model performs pixel-level analysis on the preprocessed image to identify and segment the marine operation equipment in the image, generate the corresponding segmentation mask, and construct an automatically labeled dataset containing rich images of marine operation equipment and corresponding masks.

[0069] Step S3. Automatic detection using the Grounding DINO model, wherein the Grounding DINO model is used to automatically detect marine equipment in the image and generate detection boxes based on the text prompts input by the user. The detection boxes are then combined with the segmentation mask generated by the SAM model to refine the detection results and improve the accuracy of the detection.

[0070] Step S4. Fine-tuning the Grounding DINO model, which involves fine-tuning the Grounding DINO model on the automatically labeled dataset constructed by the SAM model to better adapt it to the offshore operating platform environment. The model's generalization ability and detection accuracy are improved through cross-modal feature fusion and self-supervised learning strategies.

[0071] Step S5. Iteratively optimize the annotation and target detection system, in which the fine-tuned Grounding DINO model is substituted back into the automatic annotation system. The automatic annotation system uses the fine-tuned model to generate more accurate detection boxes and segmentation masks, improving annotation accuracy, and the newly annotated data is used to continue fine-tuning the Grounding DINO model, forming an iterative optimization process that continuously improves detection performance.

[0072] In this method, step S1 provides basic data for subsequent model training and testing; steps S2 and S3 together realize the automatic detection and labeling of equipment; step S4 optimizes the model to adapt to specific environments; and step S5 further enhances the model's labeling and detection capabilities through iterative optimization.
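The interplay of steps S2 to S5 can be summarized as a loop that alternates automatic annotation and fine-tuning. The sketch below is only a control-flow illustration: `annotate`, `fine_tune`, and `evaluate` are hypothetical callables, and stopping when a round no longer improves the evaluation score is an assumed criterion that the patent does not specify.

```python
def iterative_optimization(detector, unlabeled_batches, annotate, fine_tune, evaluate, max_rounds=3):
    """Alternate automatic annotation and fine-tuning, keeping the best detector."""
    best_detector, best_score = detector, evaluate(detector)
    history = [best_score]
    for batch in unlabeled_batches[:max_rounds]:
        labels = annotate(best_detector, batch)        # S2 + S3: masks and boxes
        candidate = fine_tune(best_detector, labels)   # S4: adapt the detector
        score = evaluate(candidate)
        history.append(score)
        if score <= best_score:
            break  # assumed stopping rule: the round did not help
        best_detector, best_score = candidate, score   # S5: put the model back
    return best_detector, history

# Toy stand-ins: the "detector" is just an integer skill level.
best, history = iterative_optimization(
    detector=50,
    unlabeled_batches=[["img1", "img2"], ["img3"], []],
    annotate=lambda det, batch: list(batch),
    fine_tune=lambda det, labels: det + 10 * len(labels),
    evaluate=lambda det: det,
)
```

In the toy run the third batch is empty, so the candidate does not improve and the loop stops with the detector from the second round.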

[0073] Referring to Figure 4, in a preferred embodiment, the SAM model includes: an input processing module for standardizing the input image, including resizing, color space conversion, and enhancement techniques to improve the robustness of the model; a feature extraction network, employing a CNN-based backbone network such as ResNet or EfficientNet, and progressively extracting high-level features of the image through multi-layer convolutional operations, while introducing a self-attention mechanism to enhance feature representation capabilities; a segmentation mask generation module, using a combination of multi-layer perceptron (MLP) and convolutional layers to generate the final segmentation mask, while considering contextual information to improve segmentation accuracy, and applying post-processing algorithms such as Conditional Random Field (CRF) to optimize the segmentation results; and a post-processing module, including edge smoothing techniques and pixel recalibration schemes, using computer graphics algorithms to optimize the segmentation contours and enhance the quality of the final mask.

[0074] In some embodiments, the SAM model is trained using a diverse dataset containing both labeled and unlabeled images to enhance its performance under self-supervised learning; a combination of cross-entropy loss and contrastive loss is employed to ensure the model's accuracy in pixel-level segmentation; and dynamic learning rate adjustment based on training progress is implemented, adjusting the learning rate in real time according to the model's training progress and actual performance to ensure that the model can converge quickly in the early stages of training and remain stable in the later fine-tuning stages.
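The combination of cross-entropy loss and contrastive loss is not spelled out in the text. A minimal sketch under explicit assumptions follows: binary pixel-wise cross-entropy for the mask, an InfoNCE-style contrastive term on embedding vectors, and an assumed mixing weight of 0.5. Embeddings are assumed to be nonzero vectors.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive term: pull the anchor toward the positive, away from the negatives."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]

def segmentation_loss(probs, targets, anchor, positive, negatives, contrast_weight=0.5):
    """Weighted sum of the pixel-level and embedding-level terms."""
    return binary_cross_entropy(probs, targets) + contrast_weight * info_nce(anchor, positive, negatives)
```

The cross-entropy term supervises individual pixels, while the contrastive term shapes the embedding space, which is one way the unlabeled images mentioned above could contribute to training.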

[0075] Referring to Figure 5, the Grounding DINO model is based on the DINO architecture and integrates self-supervised learning and dynamic feature extraction. It processes input images through multi-scale feature representation to enhance the model's ability to capture details. Self-supervised learning allows the model to be trained on unlabeled data. The model dynamically adjusts the feature extraction process to adapt to different maritime environments and equipment characteristics.

[0076] In a preferred embodiment, the Grounding DINO model possesses the following characteristics: it dynamically adjusts the normalization parameters based on the distribution of input features to enhance feature representation and mitigate the performance degradation of batch normalization; during training, the model employs a self-supervised learning strategy, training on unlabeled data to discover potential structures within the data, reducing reliance on labeled data and improving generalization ability. The model utilizes a multi-layer convolutional neural network (CNN) structure to extract features at different levels, capturing object diversity and accurately locating target objects by combining object appearance features with contextual information.

[0077] In the SAM-Grounding DINO framework, users interactively select portions of an image as expert cues, and the model automatically matches entities and generates accurate bounding boxes and masks, achieving automatic annotation of the entire image.

[0078] Referring to Figure 7, in a preferred embodiment, the operation flow of the automatic annotation system includes the following stages. Manual annotation stage: utilizing the automatic segmentation capability of the SAM model, annotators are assisted in performing preliminary mask annotations on target objects in the image, providing basic data for subsequent automatic annotation and model training. Fine-tuning stage: the manually annotated data is used to fine-tune the Grounding DINO model, enabling it to learn the features of marine engineering equipment and improve performance in detection tasks. Automatic annotation stage: the fine-tuned Grounding DINO model is applied to the unannotated image, automatically detecting marine engineering equipment and generating detection boxes; the detection results, along with the unannotated image, are input into the SAM model to generate accurate segmentation masks, completing fully automatic annotation. Iterative optimization: through iterative adaptive learning, the model learns the features of marine engineering equipment in specific scenarios during the manual annotation stage and applies these features during the automatic annotation stage, expanding the annotation range and efficiency.

[0079] As shown in Figure 8, in a preferred embodiment, the Grounding DINO model includes: a text encoder, which receives input text, such as "Ship," and converts it into text features to provide semantic information for the model; an image encoder, which processes the input image, extracts image features, and provides a numerical description of the image content; a text-image fusion module, which combines text and image features and enhances feature representation capabilities through a cross-modal learning strategy to better understand and associate text and image content; a language-guided query selection module, which uses the fused features to select image regions related to the input text through language guidance to achieve accurate target localization; and a cross-modal decoder, which combines text features (Key) and image features (Value) to generate accurate detection boxes, achieving text-to-image content mapping. Through the collaborative work of these modules, automatic identification and labeling of offshore operational equipment is achieved, improving the automation level of monitoring.

[0080] The fine-tuning of the Grounding DINO model includes: Feature extraction and fusion: Visual features are extracted from images using a Swin Transformer, text features are extracted from the descriptions using BERT, and cross-modal feature fusion is performed through a feature enhancer module; Cross-attention mechanism: Feature fusion is enhanced by adding image-to-text cross-attention and text-to-image cross-attention, as well as a language-guided query selection module; Cross-modal decoder design: A cross-modal decoder is created to fuse text and image features to improve modality alignment and feature representation capabilities. During fine-tuning, L1 loss and GIoU loss are used for bounding box regression; following GLIP, a contrastive loss between predicted objects and language labels is used for classification.
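The image-to-text and text-to-image cross-attention mentioned above can be illustrated with single-head scaled dot-product attention. This sketch omits the learned query, key, and value projections and the multi-head structure that a real feature enhancer would use; the token values are toy data.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention from one modality to another."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0]]

# Text queries attend over image tokens, and image queries attend over text tokens.
text_to_image = cross_attention(text_tokens, image_tokens, image_tokens)
image_to_text = cross_attention(image_tokens, text_tokens, text_tokens)
```

Running attention in both directions lets each modality be updated with information from the other, which is the purpose of the bidirectional fusion described in the paragraph above.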

[0081] The following further describes specific embodiments and experimental verifications of the present invention.

[0082] This invention provides a vision-based method for identifying offshore equipment, which improves the safety and success rate of offshore operations, ensuring the safe and stable operation of equipment even in complex environments and with frequent ship traffic. The method develops a cascaded pipeline that combines the visual foundation model SAM and the Grounding DINO model. It uses the SAM-Grounding DINO framework to automatically annotate relevant data of offshore operating platforms in batches, and fine-tunes the Grounding DINO model to improve the accuracy of offshore equipment detection. On the ShipDataset ship image dataset, the fine-tuned Grounding DINO model achieved the best results compared to other methods. These experimental results demonstrate the significant potential of visual foundation models in offshore equipment detection tasks, providing the possibility for the automated generation and updating of content in virtual environments through their powerful image processing and recognition capabilities.

[0083] The proposed automatic image annotation method based on a visual foundation model can generate a large amount of high-quality training data, which is then used to further optimize the text-based target detection model. The process is shown in Figure 2. This approach fully leverages the robustness of the foundation model, effectively mitigating the adverse effects of marine environmental factors such as waves and glare on detection results, and provides strong technical support for the safety and success rate of maritime operations. The specific steps are as follows:

[0084] 1. Automatic labeling system for marine engineering equipment

[0085] The SAM (Segment Anything Model) model's power lies in its pre-training on large-scale datasets, enabling it to effectively identify and label different objects in images. This invention cascades the SAM model and the Grounding DINO model to automatically label marine engineering equipment. Through this step, the method can quickly generate high-quality target image annotations for marine engineering equipment, providing additional data for subsequent model fine-tuning.

[0086] 2. Fine-tuning of the Grounding DINO model

[0087] The Grounding DINO model was originally designed for natural environments. By fine-tuning it on the dataset obtained in the previous step, the model can be better adapted to the specific environment of offshore operating platforms, thereby improving the accuracy of marine equipment inspection. Furthermore, a significant advantage of the Grounding DINO model is its support for text-based prompt-based learning. This prompt-based learning approach can more effectively utilize the potential of large models, save significant computational resources, and improve the model's flexibility and efficiency in adapting to new tasks.

[0088] 3. Iterative optimization of the annotation and target detection system

[0089] After fine-tuning the Grounding DINO model, it replaces the corresponding model in the marine engineering equipment automatic annotation system to improve the accuracy of given annotations and enhance the automatic annotation capability of the SAM model. Then, the newly annotated data is used to further fine-tune the Grounding DINO model, forming a fully automated incremental optimization annotation and detection system.
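The three steps above form a loop, which can be sketched as follows. The callables `detector`, `segment`, and `fine_tune` are hypothetical placeholders for the Grounding DINO detector, the SAM segmenter, and a training routine; their signatures are assumptions and do not correspond to any particular library.

```python
def iterative_annotation(images, detector, segment, fine_tune, prompt, rounds=2):
    """Alternate between automatic annotation and detector fine-tuning.

    detector(image, prompt) -> list of boxes      (hypothetical)
    segment(image, box)     -> mask               (hypothetical)
    fine_tune(detector, annotations) -> detector  (hypothetical)
    """
    history = []
    for _ in range(rounds):
        annotations = []
        for image in images:
            boxes = detector(image, prompt)
            masks = [segment(image, box) for box in boxes]
            annotations.append({"image": image, "boxes": boxes, "masks": masks})
        # The improved detector produces the annotations of the next round.
        detector = fine_tune(detector, annotations)
        history.append(annotations)
    return detector, history
```

Each round annotates with the detector produced by the previous round, so annotation quality and detection quality can improve together.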

[0090] Automatic labeling system for marine engineering equipment

[0091] The automatic labeling system for marine engineering equipment of this invention is a tool based on artificial intelligence and big data technology, used to automatically label images, videos, or sensor data of marine engineering equipment. Through deep learning algorithms and computer vision technology, this system can identify and classify marine engineering equipment and its components, providing high-precision labeling information for subsequent monitoring, analysis, and maintenance.

[0092] SAM (Segment Anything Model) aims to solve image segmentation problems. Especially in the context of the increasing demand for accurate object segmentation in different application scenarios, the SAM model has shown its wide applicability and flexibility.

[0093] The SAM model introduces a self-attention mechanism into its architecture, allowing the model to focus on specific regions when processing images. This mechanism enables the model to capture the relationships between different regions in an image, resulting in more accurate segmentation. In image segmentation tasks, the features of one region may be influenced by other regions; therefore, this mechanism enhances the model's contextual understanding. More importantly, the SAM model is pre-trained on a large amount of diverse data covering different scenes, objects, and backgrounds. This not only improves the model's robustness in various situations but also allows it to adapt quickly to complex image segmentation tasks without extensive fine-tuning. Figure 3 shows the perception results. This data-driven approach makes the SAM model more flexible in practical applications.

[0094] In training the SAM model, a training set containing a diverse image dataset is first constructed. This dataset includes not only labeled images but also unlabeled images, so that the model can improve its performance within a self-supervised learning framework. The model utilizes a novel loss function design, typically combining cross-entropy loss with contrastive loss, to ensure that the model can accurately perform pixel-level segmentation.

[0095] To achieve more efficient training, the SAM model employs a dynamic learning rate strategy. This strategy adjusts the learning rate in real time based on the model's training progress and actual performance, ensuring rapid convergence in the early stages of training and stability during later fine-tuning phases. This flexible learning rate adjustment method further improves the convergence speed and final performance of the SAM model.
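The specification does not give the schedule itself. One common way to obtain fast early convergence and late-stage stability is linear warmup followed by cosine decay, sketched below. All parameter values are assumptions, and this sketch depends only on training progress; adjustment based on measured performance (for example, reducing the rate when validation loss stalls) is not covered.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-4, warmup_steps=100, min_lr=1e-6):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```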

[0096] The architecture of the SAM model mainly includes the following key components: an input processing module, a feature extraction network, a segmentation mask generation module, and a post-processing module, as shown in Figure 4. The seamless connection between these modules enables the model to perform segmentation tasks efficiently and handle complex visual scenes, where:

[0097] 1. Input processing module

[0098] To adapt to diverse input data, the SAM model first performs normalization processing on the input images, including resizing, color space conversion, and enhancement techniques. These steps aim to improve the model's robustness, ensuring that it maintains stable performance under different conditions.

[0099] 2. Feature Extraction Network

[0100] Feature extraction is the core of the model, and SAM employs a CNN-based backbone network, such as ResNet or EfficientNet. The backbone network's role is to progressively extract high-level features from the image through multiple convolutional operations. Building upon this, SAM also introduces a self-attention mechanism. This mechanism effectively captures the relationships between different regions in the image, thereby enhancing the expressive power of features, especially when dealing with complex backgrounds and fine-grained features, significantly improving the model's performance.

[0101] 3. Segmentation Mask Generation Module

[0102] After feature extraction, the model feeds the processed feature maps into the segmentation mask generation module. This module uses a combination of multilayer perceptron (MLP) and convolutional layers to generate the final segmentation mask. For each target, contextual information is also considered to improve segmentation accuracy. Furthermore, this module applies post-processing algorithms such as conditional random fields (CRF) to further optimize the segmentation results and eliminate potential noise and missegmentation.

[0103] 4. Post-processing module

[0104] To ensure high-quality output, SAM also features a powerful post-processing module that includes edge smoothing techniques and pixel recalibration. The post-processing module optimizes the segmented contours using computer graphics algorithms, based on the initial segmentation results, thereby enhancing the quality of the final mask.
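The edge smoothing mentioned for the post-processing module is not specified further. As a minimal sketch under that caveat, a 3x3 majority filter over a binary mask removes isolated pixels and fills small holes; the choice of filter and the nested-list mask format are assumptions made for illustration.

```python
def smooth_mask(mask):
    """Apply a 3x3 majority vote to a binary mask stored as nested lists."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    # Neighbors outside the image are ignored, not counted as 0.
                    if 0 <= ny < h and 0 <= nx < w:
                        votes += mask[ny][nx]
                        total += 1
            out[y][x] = 1 if 2 * votes > total else 0
    return out
```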

[0105] The automatic annotation system for marine engineering equipment in this invention also utilizes the Grounding DINO (Dynamic Instance Normalization and Optimization) model. Due to its advanced characteristics, the Grounding DINO model has gradually become an important tool in tasks such as object detection, segmentation, and localization. Its design philosophy mainly focuses on improving the robustness and accuracy of the model in complex scenes through dynamic instance normalization and optimization strategies. Traditional object detection models often face many challenges such as object occlusion, scale variations, and cluttered backgrounds. The Grounding DINO model was proposed precisely to solve these problems.

[0106] The Grounding DINO model is based on the DINO (Self-Distillation with No Labels) architecture, combining self-supervised learning and dynamic feature extraction methods. The model architecture is shown in Figure 5. By performing multi-scale feature representation on the input image, Grounding DINO can identify and locate different objects in the feature space. The core innovations of Grounding DINO are as follows:

[0107] 1. Dynamic instance normalization

[0108] Unlike traditional normalization methods, dynamic instance normalization (DIN) adjusts the normalization parameters according to the distribution of the current input features. This approach not only improves the expressive power of features but also effectively mitigates the performance degradation caused by batch normalization. When dealing with salient objects, DIN considers the feature differences among various object types, enabling it to specifically enhance certain feature dimensions and thus improve the model's ability to capture objects.

[0109] 2. Self-supervised learning strategy

[0110] Grounding DINO employs a self-supervised learning strategy during training, using unlabeled data to discover latent structures within the data. This approach not only reduces reliance on labeled data but also improves the model's generalization ability. During the self-supervised learning phase, the model generates multiple views by performing different enhancements on the images and leverages the consistency between these views to optimize the model's representational capabilities.

[0111] 3. Multi-scale feature extraction

[0112] Multi-scale feature extraction is a crucial component of Grounding DINO. The model employs a multi-layered convolutional neural network (CNN) structure to extract features at different levels, capturing the diversity of objects. During feature extraction, the model not only focuses on the object's appearance features but also incorporates contextual information. This fusion mechanism enables Grounding DINO to accurately locate target objects in complex scenes.
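The specification gives no formula for the dynamic instance normalization listed as the first innovation. A minimal sketch, assuming it behaves like instance normalization whose statistics are recomputed from each input rather than from a batch, is shown below; `gamma`, `beta`, and `eps` are assumed parameters.

```python
import math


def instance_norm(channel, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel of one sample using that sample's own statistics."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(var + eps)
    return [[(v - mean) * scale + beta for v in row] for row in channel]
```

Because the statistics come from the input itself, a global brightness shift such as glare on the water surface leaves the normalized features unchanged in this sketch.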

[0113] In the SAM-Grounding DINO framework of this invention, users can flexibly click to select parts of an image (expert hints), and these hints are then automatically matched with entities in the image. On this basis, a pre-trained object detection model takes the expert hints as input, and its output (labels) is passed to Grounded SAM, which generates accurate bounding boxes and masks for each instance, as shown in Figure 6.

[0114] This allows the entire image to be annotated automatically, realizing an automatic annotation system. The marine engineering equipment automatic annotation system proposed in this invention can automatically perform category prediction, as shown in Figure 7, and provides dense annotations for input images in various scenarios. This significantly reduces annotation costs and greatly improves the flexibility of image annotation.

[0115] Specifically, SAM can generate high-quality object masks from flexible cues (including points and bounding boxes) and demonstrates strong zero-shot performance across a range of segmentation tasks, further expanding its applicability. Nevertheless, SAM's zero-shot generalization ability in automatic prompting may not be sufficient to compete with baseline models, especially in specialized environments such as marine scenes.

[0116] Figure 7 shows the "data-driven mechanism" (data engine) that this invention develops based on the SAM model. By keeping the model in the loop of the dataset annotation process, the model gains strong adaptability and generalization to new datasets in specific scenarios. The data engine of this invention has two stages: manual annotation and automatic annotation.

[0117] 1. Manual annotation stage

[0118] In the manual annotation phase, annotators use the SAM model to create mask annotations for target objects. This phase is similar to classic interactive image segmentation tasks, where annotators use tools to refine the objects in the image at the pixel level, and the SAM model helps generate accurate segmentation masks. The advantage of this phase is that it doesn't impose excessive restrictions on the semantics of the annotated objects. Even when faced with non-standardized objects or complex backgrounds, such as blurry or extreme scenes, the system can automatically discard these extreme cases and move on to the next image, thus reducing unnecessary time wasted during annotation.

[0119] However, although SAM provides automatic segmentation capabilities, the granularity implied by input cue points can be ambiguous, so annotations need further refinement under human supervision. The main task of manual annotation is to ensure high annotation quality, especially for marine engineering equipment, laying a solid foundation for subsequent model training. Subsequently, the manually annotated data is used to fine-tune the object detection model Grounding DINO, enabling it to gradually learn the features of marine engineering equipment and perform better in subsequent detection tasks.

[0120] 2. Automatic annotation stage

[0121] In the automatic annotation phase, the annotation process gradually shifts from manual reliance to full automation. At this stage, the Grounding DINO model, fine-tuned in the manual annotation phase, is applied to the unannotated images. By inputting the name of the marine equipment, the Grounding DINO model can automatically detect the marine equipment in the image and generate corresponding bounding boxes.

[0122] Next, the detection results, along with the unlabeled images, are fed into the SAM model. Based on the detection bounding boxes output by Grounding DINO, the SAM model further generates a precise segmentation mask for each target, thus completing fully automated image annotation. This process can efficiently annotate a large number of images, significantly reducing the cost of manual annotation and improving the annotation consistency of the dataset.
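The box-to-mask step can be sketched as follows. The `segment` callable is a hypothetical stand-in for a box-prompted SAM call, and the record layout (`label`, `bbox`, `area`, `mask`) is an assumption chosen for illustration.

```python
def mask_to_bbox(mask):
    """Tight (x1, y1, x2, y2) box around the nonzero pixels of a binary mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def auto_annotate(image, detections, segment):
    """Turn (label, box) detections into mask annotations.

    segment(image, box) -> binary mask as nested lists (hypothetical).
    """
    records = []
    for label, box in detections:
        mask = segment(image, box)
        tight = mask_to_bbox(mask)
        if tight is None:
            continue  # the segmenter found nothing inside this box
        records.append({"label": label, "bbox": tight,
                        "area": sum(map(sum, mask)), "mask": mask})
    return records
```

Recomputing the box from the mask lets the segmentation tighten a loose detection box before the record is stored.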

[0123] The core advantage of this data-driven mechanism lies in the model's iterative adaptive learning. Through fine-tuning of Grounding DINO during the manual annotation phase, the model gradually learns the features of marine engineering equipment in specific scenarios, improving its recognition capabilities in specific domains. In the automatic annotation phase, the model can efficiently apply these features to new datasets, further expanding the annotation scope and efficiency. Simultaneously, the introduction of the SAM model significantly improves the accuracy of image segmentation, ensuring the accuracy of fully automatic annotation.

[0124] This phased annotation mechanism not only enables the model to exhibit stronger generalization ability in specific scenarios, but also reduces reliance on human resources through automated processes, providing an effective solution for the rapid annotation of large-scale image datasets. This method has important implications for future large-scale data annotation tasks and also provides technical support for the training of automated data annotation and detection models in specific domains.

[0125] Grounding DINO model fine-tuning

[0126] The results of manual annotation are fed into the Grounding DINO model for fine-tuning, enabling subsequent automated annotation and more accurate detection of marine engineering equipment. To this end, this invention integrates Grounding DINO as a cross-modal detector into the SAM module. The model structure of Grounding DINO is shown in Figure 8. It accepts (image, text) pairs as input and outputs multiple object boxes. For example, if the input image shows a pair of scissors on a table, the model locates the scissors and the table and extracts the words "scissors" and "table" from the text as labels. It is built on a DETR-like model called DINO. When a text description refers to objects that appear in the image, the model can recognize and locate those objects.

[0127] This model can understand both language and the visual content of images, and associate visual elements with text. It uses a Swin Transformer to extract visual features from images and BERT to extract features from text descriptions. After the vanilla image and text features are extracted, they are fed into a feature enhancer module for cross-modal feature fusion, which uses deformable self-attention to enhance the features.

[0128] To facilitate feature fusion, this invention also adds image-to-text cross-attention and text-to-image cross-attention, and designs a language-guided query selection module to select features related to the input text.
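Language-guided query selection can be sketched as picking the image tokens that respond most strongly to any text token. The feature vectors are plain Python lists here, and the dot-product similarity and the function name are assumptions made for illustration.

```python
def select_queries(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Score each image token by its best match over all text tokens.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]
```

The selected indices would then initialize the decoder queries, so that decoding starts from regions the text prompt already points to.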

[0129] To fuse features from text and image modes, this invention creates a cross-modal decoder. Each cross-modal decoder layer feeds the results of each cross-modal query into a self-attention layer, an image cross-attention layer to merge image features, a text cross-attention layer to merge text features, and a feedforward network layer. Because textual information is injected into the query to improve modality alignment, each decoding layer includes an additional text cross-attention layer compared to the DINO decoding layer.
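The order of sub-layers in one decoder layer can be sketched as follows. This is a deliberately simplified illustration: it uses single-head attention with no learned projections or layer normalization, and a ReLU stands in for the feedforward network. All of these simplifications are assumptions and not part of the specification.

```python
import math


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / norm
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual addition of two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def relu(vector):
    return [max(0.0, a) for a in vector]


def decoder_layer(queries, image_feats, text_feats, ffn=relu):
    """Self-attention, image cross-attention, text cross-attention, then FFN."""
    queries = add(queries, attention(queries, queries, queries))
    queries = add(queries, attention(queries, image_feats, image_feats))
    queries = add(queries, attention(queries, text_feats, text_feats))
    return add(queries, [ffn(q) for q in queries])
```

The third line of `decoder_layer` is the additional text cross-attention step that distinguishes this layer from a decoder layer that attends to image features only.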

[0130] During fine-tuning, L1 loss and GIoU loss are used for bounding box regression. In addition, following GLIP, the model uses a contrastive loss between predicted objects and language tokens for classification. This invention was tested on the ShipDataset ship image dataset. Experimental results show that the fine-tuned Grounding DINO model achieves the best detection results; detailed results are described below.
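The classification side can be illustrated by the scoring that such a contrastive loss trains: each predicted object feature is compared with the text feature of each language tag, and the best match above a threshold becomes the label. The sigmoid over a dot product and the `threshold` value are assumptions; the loss function itself is not shown.

```python
import math


def token_scores(query, text_feats):
    """Sigmoid of the dot product between one query and each text feature."""
    return [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(query, t))))
            for t in text_feats]


def classify(query_feats, text_feats, labels, threshold=0.5):
    """Assign each query the best-matching label, or None below the threshold."""
    results = []
    for q in query_feats:
        scores = token_scores(q, text_feats)
        best = max(range(len(scores)), key=lambda i: scores[i])
        label = labels[best] if scores[best] >= threshold else None
        results.append((label, scores[best]))
    return results
```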

[0131] Ship detection experiment

[0132] This invention uses the ShipDataset dataset for experimental verification. The ShipDataset ship image dataset was collected by operating a drone (DJI Phantom 4 Pro) over the waters of Shanghai, China. The entire collection process covered scenes from five different shooting angles and under different lighting conditions.

[0133] Three video clips were captured by the drone. The parameters of each video are as follows: resolution 3840×2160, frame rate 23.98 frames per second, flight altitude 500 meters, and focal length 24 mm; all footage was captured during traffic congestion. The videos contain 3261, 7553, and 7841 frames respectively. Based on these videos, a dataset was constructed, containing five scenes from different shooting angles to support ship detection and classification. To balance the sample distribution, the dataset was manually split into training, validation, and test sets in a ratio of 8.5:1:0.5. This dataset contains a total of 18655 images obtained from publicly available websites.
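The frame counts and the split ratio can be checked with a short calculation. The rounding rule below (largest remainder) is an assumption; the specification states only the ratio, not the resulting subset sizes.

```python
def split_counts(total, ratios):
    """Split `total` items by `ratios`, giving leftover items to the
    parts with the largest fractional remainders."""
    weight = sum(ratios)
    exact = [total * r / weight for r in ratios]
    counts = [int(e) for e in exact]
    leftover = total - sum(counts)
    by_fraction = sorted(range(len(ratios)),
                         key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


frames = 3261 + 7553 + 7841  # the three video clips
train, val, test = split_counts(frames, (8.5, 1, 0.5))
```

The three clip lengths sum to 18655, which matches the stated dataset size, and under this rounding rule the subsets also sum to 18655.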

[0134] Experimental results

[0135] To verify the performance of the proposed method, experiments were conducted on the ShipDataset dataset. The experimental results are shown in Table 5-1. Compared with other methods, the proposed method has the highest average precision (AP), indicating the best performance in ship detection. Of particular note, the automatic annotation tool is used to fine-tune the Grounding DINO model cyclically, which highlights the efficiency and robustness of both the automatic annotation tool and the Grounding DINO model.
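For reference, average precision can be computed from ranked detections as the area under the precision-recall curve. The sketch below uses all-point interpolation; the specification does not state which IoU threshold or interpolation scheme was used, so those details are assumptions. Each detection is assumed to be already marked as a true or false positive.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (score, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, hit) in enumerate(ordered, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the best precision at equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```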

[0136] Table 5-1 Comparison of object detection methods on ship datasets

[0137]

[0138] Marine engineering equipment detection experiment

[0139] This invention employs a self-developed automatic annotation system for marine engineering equipment to automatically annotate two datasets from different sources: one is a real-world dataset of ships, and the other is a dataset of boom cranes generated in a simulation environment. This automatic annotation process generates high-quality, finely annotated data, which is then used to fine-tune the Grounding DINO model to optimize its performance in marine equipment detection experiments. To evaluate the generalization ability of the fine-tuned Grounding DINO model, this invention uses Google image search to collect approximately 1000 original images related to marine engineering equipment. After rigorous manual screening, a marine engineering equipment dataset for model testing was constructed. This test dataset helps evaluate the model's performance and reliability in practical applications.

[0140] Experimental results

[0141] This invention compares the detection accuracy of YOLOv7 and the present method on a marine engineering equipment dataset, where YOLOv7 uses a pre-trained model and is fine-tuned on an automatically labeled dataset.

[0142] Table 5-2 Comparison of detection accuracy between this method and YOLOv7

[0143]

[0144] Table 5-2 shows the detection accuracy of the two methods. Experimental results demonstrate that the method proposed in this invention achieves better detection accuracy in the open domain, reflecting the robustness of the method.

[0145] In addition, as shown in Figure 9, the method of the present invention can accurately detect target content in offshore engineering scenarios based on user-input text. In these images, the user-input text prompts are "platform, ship, crane," and the method of the present invention successfully detected and labeled offshore oil and gas platforms, ships, and cranes, demonstrating that the method can accurately identify and distinguish different types of facilities and equipment in offshore engineering scenarios.

[0146] With the text prompt "wind turbine," the method effectively detected and labeled multiple offshore wind turbines in this scenario. This demonstrates the method's ability to identify wind turbines in a wind farm environment, accurately labeling them even in situations with multiple targets and complex backgrounds. With the text prompt "ship, crane," the method successfully identified and labeled a large offshore vessel and its crane equipment, proving the accuracy of target object detection in complex offshore operating environments (including large hulls, crane booms, and other devices).

[0147] In summary, this invention proposes a target detection method for marine engineering equipment based on a visual foundation model. This method helps VP systems quickly and accurately identify and track changes in the real world and reflect these changes in the virtual environment in a timely manner, thereby improving automation and detection accuracy. Through in-depth analysis of the needs of marine surveillance systems and the limitations of existing technologies, this invention proposes a method for automatically labeling and detecting various equipment on marine operation systems using large-scale visual models, particularly the SAM and Grounding DINO models. The method first generates high-quality labeled data automatically using the image segmentation capabilities of the SAM model. The Grounding DINO model is then fine-tuned on this data, improving its detection accuracy for marine engineering equipment. Furthermore, by iteratively optimizing the labeling and target detection systems, the method continuously improves detection performance. Experimental results show that the proposed method outperforms traditional models in both AP and AR on the ShipDataset dataset, and that it outperforms the YOLOv7 model on the marine engineering equipment dataset.

[0148] This invention also provides a storage medium for storing a computer program, which, when executed, performs at least the methods described above.

[0149] This invention also provides a control device, including a processor and a storage medium for storing a computer program; wherein the processor, when executing the computer program, performs at least the method described above.

[0150] This invention also provides a processor that executes a computer program and, in doing so, performs at least the method described above.

[0151] The storage medium can be implemented by any type of non-volatile storage device, or a combination thereof. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); the magnetic surface memory can be a disk drive or magnetic tape drive. The storage media described in the embodiments of this invention are intended to include, but are not limited to, these and any other suitable types of memory.

[0152] In the several embodiments provided by this invention, it should be understood that the disclosed systems and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.

[0153] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.

[0154] In addition, in the various embodiments of the present invention, each functional unit can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.

[0155] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0156] Alternatively, if the integrated units of this invention are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this invention, or the parts that contribute to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0157] The methods disclosed in the several method embodiments provided by this invention can be arbitrarily combined without conflict to obtain new method embodiments.

[0158] The features disclosed in the several product embodiments provided by this invention can be arbitrarily combined without conflict to obtain new product embodiments.

[0159] The features disclosed in the several method or device embodiments provided by the present invention can be arbitrarily combined without conflict to obtain new method or device embodiments.

[0160] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various equivalent substitutions or obvious modifications can be made without departing from the concept of the present invention, and all such modifications, achieving the same performance or application, should be considered within the scope of protection of the present invention.

Claims

1. A vision-based method for identifying marine work equipment, characterized in that it includes the following steps:
S1. Collecting image data from the offshore operation site and preprocessing it to prepare a dataset for subsequent model training and testing;
S2. Performing image segmentation using the SAM model to automatically generate high-quality target object masks, wherein the SAM model performs pixel-level analysis on the preprocessed image to identify and segment the marine operation equipment in the image and generate the corresponding segmentation mask, and in this process an automatically labeled dataset containing rich images of marine operation equipment and corresponding masks is constructed;
S3. Performing automatic detection using a Grounding DINO model, including using the model to automatically detect marine equipment in the image and generate detection boxes based on text prompts input by the user, wherein the detection box results are combined with the segmentation mask generated by the SAM model to refine the detection results;
S4. Fine-tuning the Grounding DINO model on the automatically labeled dataset constructed by the SAM model, through cross-modal feature fusion and self-supervised learning strategies, to better adapt it to the offshore operating platform environment, thereby improving the model's generalization ability and detection accuracy;
S5. Iteratively optimizing annotation and object detection, wherein the fine-tuned Grounding DINO model replaces the corresponding model in an automatic annotation system constructed jointly from the segmentation masks of the SAM model and the initial detection capability of the Grounding DINO model; the automatic annotation system uses the fine-tuned model to automatically generate more accurate detection boxes and segmentation masks to improve annotation accuracy, and uses the newly annotated data to continue to fine-tune the Grounding DINO model, forming an iterative optimization process that continuously improves detection performance.

2. The vision-based marine equipment identification method as described in claim 1, characterized in that the SAM model includes:
an input processing module, used to standardize the input image through size adjustment, color space conversion, and enhancement techniques, in order to systematically improve the robustness of the model;
a feature extraction network, which uses a CNN-based backbone network to gradually extract high-level features of the image through multi-layer convolutional operations, while introducing a self-attention mechanism to improve feature representation capabilities;
a segmentation mask generation module, which combines a multilayer perceptron (MLP) and convolutional layers to generate the final segmentation mask, considers contextual information to improve segmentation accuracy, and incorporates a conditional random field (CRF) post-processing strategy to optimize the segmentation results;
a post-processing module, which includes edge smoothing technology and a pixel recalibration strategy and optimizes the segmented contours through advanced graphics algorithms, further enhancing the quality of the final mask.

3. The vision-based marine equipment identification method as described in claim 1 or 2, characterized in that the SAM model is trained on a diverse dataset containing both labeled and unlabeled images to enhance its performance under self-supervised learning; a combination of cross-entropy loss and contrastive loss is used to ensure the model's accuracy in pixel-level segmentation; and a dynamic learning rate adjustment scheme adjusts the learning rate in real time according to the model's training progress and actual performance, so that the model converges quickly in the early stage of training and remains stable in the later fine-tuning stage.
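
Claim 3 describes a learning rate that depends on both training progress and actual performance, without naming a schedule. As a sketch only, the class below combines two common ingredients under that assumption: cosine decay driven by progress (large steps early, small steps late) and a reduction factor applied when the validation loss stops improving. The class name `ProgressAwareLR` and all default values are hypothetical.

```python
import math


class ProgressAwareLR:
    """Cosine decay by training progress, scaled down when validation stalls.

    Illustrative assumption; the claim does not specify this schedule.
    """

    def __init__(self, base_lr=1e-3, min_lr=1e-5, total_steps=1000,
                 patience=2, factor=0.5):
        self.base_lr = base_lr
        self.min_lr = min_lr
        self.total_steps = total_steps
        self.patience = patience      # non-improving evaluations tolerated
        self.factor = factor          # multiplier applied when patience runs out
        self.scale = 1.0
        self.best_loss = float("inf")
        self.stale_evals = 0

    def report_validation_loss(self, loss):
        """Feed back measured performance; shrink the rate on a plateau."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_evals = 0
        else:
            self.stale_evals += 1
            if self.stale_evals >= self.patience:
                self.scale *= self.factor
                self.stale_evals = 0

    def lr_at(self, step):
        """Learning rate for a given step, never below min_lr."""
        progress = min(max(step / self.total_steps, 0.0), 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        lr = self.min_lr + (self.base_lr - self.min_lr) * cosine
        return max(self.min_lr, lr * self.scale)
```

A training loop would call `lr_at(step)` before each update and `report_validation_loss(...)` after each evaluation.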

4. The vision-based marine equipment identification method as described in claim 1 or 2, characterized in that the Grounding DINO model is based on the DINO architecture and integrates self-supervised learning and dynamic feature extraction capabilities. It processes input images through multi-scale feature representation to enhance the model's ability to capture details, and utilizes self-supervised learning to enable the model to be trained on unlabeled data. The model can dynamically adjust the feature extraction process to adapt to different maritime environments and equipment characteristics.

5. The vision-based marine equipment identification method as described in claim 4, characterized in that the Grounding DINO model has the following characteristics: the normalization parameters are dynamically adjusted based on the distribution of input features to improve feature representation and alleviate the performance degradation problem of batch normalization; during training, the model employs a self-supervised learning strategy, training on unlabeled data to discover latent structure in the data, thereby reducing dependence on labeled data and improving generalization ability; the model uses a multi-layer convolutional neural network (CNN) architecture to extract features at different levels, capture the diversity of objects, and combine the object's appearance features with contextual information to achieve accurate localization of the target object; and in the SAM-Grounding DINO framework, users interactively select portions of an image as expert cues, and the model automatically matches entities and generates accurate bounding boxes and masks, achieving fully automated annotation of the entire image.

6. The vision-based marine equipment identification method as described in claim 1 or 2, characterized in that the operation process of the automatic annotation system includes: using the automatic segmentation capability of the SAM model to assist annotators in performing preliminary mask annotation of target objects, providing foundational data for subsequent fully automatic annotation and model training; fine-tuning the Grounding DINO model using the manually labeled data, enabling it to better learn the characteristics of marine engineering equipment and improve its performance in detection tasks; applying the fine-tuned Grounding DINO model to unlabeled images to automatically detect marine equipment and generate detection boxes; inputting the detection results together with the unlabeled images into the SAM model to generate accurate segmentation masks and complete fully automatic annotation; and, through cyclical adaptive learning, having the model actively learn the features of marine engineering equipment in the scene during the manual annotation stage and apply these features during the automatic annotation stage, expanding the scope of annotation and improving annotation efficiency.
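
The detect-then-segment annotation flow of claim 6, together with the iterative fine-tuning of step S5, can be sketched as a small loop. This is a schematic under stated assumptions: the detector, the segmenter and the fine-tuning routine are passed in as plain callables standing in for Grounding DINO, SAM and a training procedure, and the names `auto_annotate`, `annotation_loop` and `min_score` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]
# Stand-ins for the real models (assumed interfaces, not library APIs):
Detector = Callable[[Any, str], List[Tuple[Box, float]]]  # (image, prompt) -> [(box, score)]
Segmenter = Callable[[Any, Box], Any]                      # (image, box) -> mask


def auto_annotate(images: Sequence[Any], detect: Detector, segment: Segmenter,
                  prompt: str, min_score: float = 0.5) -> List[Dict[str, Any]]:
    """Detect objects from a text prompt, then segment each confident box."""
    annotations = []
    for image in images:
        for box, score in detect(image, prompt):
            if score < min_score:
                continue  # skip low-confidence detections
            annotations.append({
                "image": image,
                "box": box,
                "score": score,
                "mask": segment(image, box),
            })
    return annotations


def annotation_loop(images, detect, segment, fine_tune, prompt,
                    rounds=2, min_score=0.5):
    """Alternate automatic annotation and detector fine-tuning."""
    annotations = []
    for _ in range(rounds):
        annotations = auto_annotate(images, detect, segment, prompt, min_score)
        # fine_tune returns an updated detector trained on the new labels.
        detect = fine_tune(detect, annotations)
    return detect, annotations
```

Passing the models as callables keeps the sketch independent of any particular SAM or Grounding DINO implementation.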

7. The vision-based marine equipment identification method as described in claim 1 or 2, characterized in that the Grounding DINO model consists of the following parts: a text encoder, which receives input text and converts it into text features, providing accurate semantic information for the model; an image encoder, which processes input images, extracts image features, and provides a numerical description of the image content; a text-image fusion module, which combines text features with image features and improves feature representation capabilities through cross-modal learning strategies, enabling the model to better understand and associate text and image content; a language-guided index extraction module, which uses the fused features to accurately extract image regions related to the input text through language guidance, thereby achieving target localization; and a cross-modal decoder, which combines text and image features to generate accurate detection boxes, achieving precise mapping from text to image content.
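
The language-guided extraction described in claim 7 selects image regions that are relevant to the input text. One common way to realize this, shown here as a sketch and not as the claimed implementation, is to score each image feature by its highest dot-product similarity to any text feature and keep the top-scoring indices as decoder queries. The function name `select_queries` and the list-of-lists feature format are assumptions.

```python
from typing import List, Sequence


def select_queries(image_features: Sequence[Sequence[float]],
                   text_features: Sequence[Sequence[float]],
                   num_queries: int) -> List[int]:
    """Return indices of the image features most similar to any text feature.

    Each image feature is scored by its maximum dot product over all text
    features; the indices of the `num_queries` best scores are returned in
    descending score order (ties keep their original order).
    """
    scores = []
    for img_vec in image_features:
        best = max(
            sum(a * b for a, b in zip(img_vec, txt_vec))
            for txt_vec in text_features
        )
        scores.append(best)

    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]
```

The selected indices would then pick the image features that initialize the decoder queries.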

8. The vision-based marine equipment identification method as described in claim 1 or 2, characterized in that the fine-tuning of the Grounding DINO model includes: extracting visual features from images using the Swin Transformer, extracting text features from the text descriptions using BERT, and performing cross-modal feature fusion using the feature enhancer module; adding image-to-text cross-attention and text-to-image cross-attention, as well as a language-guided query selection module, to enhance feature fusion; creating a cross-modal decoder to optimize the fusion of text and image features, thereby improving modality alignment and feature representation capabilities; and, during fine-tuning, using L1 loss and GIoU loss for bounding box regression and, following GLIP, performing classification using a contrastive loss between predicted objects and language labels.
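
The bounding box regression terms named in claim 8 can be written out directly. The sketch below computes the generalized IoU of two axis-aligned boxes and a weighted sum of the L1 and GIoU losses. It assumes boxes in `(x_min, y_min, x_max, y_max)` form with positive area; the weights `l1_weight=5.0` and `giou_weight=2.0` are illustrative values, not taken from the claims.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two boxes with positive area; value in (-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose


def box_regression_loss(pred: Box, target: Box,
                        l1_weight: float = 5.0,
                        giou_weight: float = 2.0) -> float:
    """Weighted sum of the L1 loss and the GIoU loss (1 - GIoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))
```

Unlike plain IoU, the GIoU term still provides a useful signal when the predicted and target boxes do not overlap, because the enclosing-box penalty grows as they move apart.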

9. A computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the computer program implements the method as described in any one of claims 1 to 8.

10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method as described in any one of claims 1 to 8.