Object detection and coordinate output method based on visual large language model

CN122289221A · Pending · Publication Date: 2026-06-26 · CHENGDU XINHAOSI ELECTRONICS DETECTING TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CHENGDU XINHAOSI ELECTRONICS DETECTING TECH CO LTD
Filing Date: 2026-04-01
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies cannot both detect an arbitrary user-specified object and output its precise bounding box coordinates, as customized industrial inspection and intelligent security scenarios require. Traditional detection methods are restricted to a closed set of predefined categories, and cloud APIs offer only limited functionality.

Method used

The method performs object detection with a visual large language model (VLM). The image is preprocessed by bilinear interpolation and center padding, then segmented and encoded by a ViT visual encoder within a transformer architecture. Self-attention strengthens the focus on the target object, and accurate coordinates are output through autoregressive language modeling.
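The preprocessing step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows, a square target canvas, and pixel-center alignment for the interpolation; none of these details come from the patent, and the names `bilinear_resize` and `letterbox` are hypothetical. The sketch also returns the scale and padding offsets, since they are needed later to map predicted coordinates back to the original image.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel values (assumption)


def bilinear_resize(img: Image, new_w: int, new_h: int) -> Image:
    """Resize with bilinear interpolation, aligning pixel centers."""
    src_h, src_w = len(img), len(img[0])
    out: Image = []
    for y in range(new_h):
        # Map the output pixel center back into source coordinates, then clamp.
        sy = min(max((y + 0.5) * src_h / new_h - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        wy = sy - y0
        row = []
        for x in range(new_w):
            sx = min(max((x + 0.5) * src_w / new_w - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            wx = sx - x0
            # Blend horizontally on the two neighboring rows, then vertically.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def letterbox(
    img: Image, target: int, pad_value: float = 0.0
) -> Tuple[Image, float, int, int]:
    """Scale the longer side to `target`, then center-pad to target x target.

    Returns (canvas, scale, pad_left, pad_top).
    """
    src_h, src_w = len(img), len(img[0])
    scale = target / max(src_w, src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    resized = bilinear_resize(img, new_w, new_h)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    canvas = [[pad_value] * target for _ in range(target)]
    for y in range(new_h):
        canvas[pad_top + y][pad_left:pad_left + new_w] = resized[y]
    return canvas, scale, pad_left, pad_top
```

Padding around the center keeps the aspect ratio of the original image, so objects are not stretched before the encoder sees them. A production pipeline would more likely use an image library for the resize; the pure-Python loop here only makes the arithmetic visible.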

Benefits of technology

The method enables flexible detection of user-defined objects with high-precision bounding box output. It overcomes the closed category set of traditional methods and the functional limitations of cloud APIs, improving the flexibility and business adaptability of the detection system.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122289221A_ABST
Patent Text Reader

Abstract

This invention discloses an object detection and coordinate output method based on a visual large language model (VLM), relating to the field of image detection. The method includes: inputting a user's instruction to detect a target object; embedding the instruction into the original image; preprocessing the image using bilinear interpolation and center padding; segmenting and encoding the processed image with a ViT visual encoder; introducing self-attention to enhance the detection focus on the target object; and jointly decoding the visual encoding and the embedded instruction within the same transformer architecture to output a detection report. The invention avoids the quantization error caused by direct normalization. It offers output flexibility: the report is generated autoregressively, with the output format dynamically controlled by the instruction and not limited by predefined templates. It also offers semantic filtering: the common-sense reasoning capability of the VLM performs high-level semantic judgment in the hidden state space, eliminating the need to train a separate classifier.
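Because the detection report is generated as text, a downstream consumer has to extract the boxes from it and undo the preprocessing. The sketch below shows one way this could look. The report line format `label: [x1, y1, x2, y2]`, the assumption that coordinates are expressed in pixels of the padded canvas, and the names `parse_report` and `to_original` are all hypothetical; the abstract does not specify any of them.

```python
import re
from typing import List, Tuple

Box = Tuple[str, float, float, float, float]  # (label, x1, y1, x2, y2)

# Hypothetical report line: "<label>: [x1, y1, x2, y2]"
_LINE_RE = re.compile(r"^\s*(?P<label>[^:\[\]]+?)\s*:\s*\[(?P<nums>[^\]]*)\]\s*$")


def parse_report(text: str) -> List[Box]:
    """Extract labeled boxes from generated text, skipping malformed lines."""
    boxes: List[Box] = []
    for line in text.splitlines():
        match = _LINE_RE.match(line)
        if not match:
            continue
        parts = match.group("nums").split(",")
        if len(parts) != 4:
            continue
        try:
            x1, y1, x2, y2 = (float(p) for p in parts)
        except ValueError:
            continue  # generated text is not guaranteed to be numeric
        boxes.append((match.group("label"), x1, y1, x2, y2))
    return boxes


def to_original(
    box: Box, scale: float, pad_left: int, pad_top: int, src_w: int, src_h: int
) -> Box:
    """Map a box from the padded canvas back to original image pixels."""
    label, x1, y1, x2, y2 = box

    def unmap(value: float, pad: int, limit: int) -> float:
        # Remove the padding offset, undo the scaling, clamp to the image.
        return min(max((value - pad) / scale, 0.0), float(limit))

    return (
        label,
        unmap(x1, pad_left, src_w),
        unmap(y1, pad_top, src_h),
        unmap(x2, pad_left, src_w),
        unmap(y2, pad_top, src_h),
    )


report = "Detected objects:\nscrew: [100, 220, 140, 260]\nno box here\n"
boxes = parse_report(report)
restored = [to_original(b, 0.5, 0, 200, 1600, 800) for b in boxes]
```

Since the format is controlled by the instruction rather than a fixed template, the parser tolerates lines it cannot interpret instead of failing on them.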