Object detection and coordinate output method based on visual large language model
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: CHENGDU XINHAOSI ELECTRONICS DETECTING TECH CO LTD
- Filing Date: 2026-04-01
- Publication Date: 2026-06-26
AI Technical Summary
Existing technologies cannot simultaneously detect an arbitrary user-specified object and output its precise bounding box coordinates, as customized industrial inspection and intelligent security scenarios require. Traditional detection methods are restricted to a closed set of categories, and cloud APIs offer only limited functionality.
The method performs object detection with a visual large language model. The image is preprocessed by bilinear interpolation and center padding, then processed by a ViT visual encoder and a transformer architecture. Self-attention strengthens the focus on the target to be detected, and the bounding box coordinates are output through autoregressive language modeling.
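The preprocessing step can be illustrated with a minimal sketch. It assumes a square model input, a constant fill value for the padding, and boxes given as (x1, y1, x2, y2) in pixels. The abstract specifies none of these, and the function names (`letterbox_geometry`, `resize_bilinear`, `preprocess`, `box_to_original`) are hypothetical. The sketch resizes a grayscale image by bilinear interpolation, pads it symmetrically around the center, and maps a predicted box back to original-image coordinates.

```python
def letterbox_geometry(src_w, src_h, target):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit."""
    scale = min(target / src_w, target / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    # Center padding: split the leftover space evenly on both sides.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def resize_bilinear(img, new_w, new_h):
    """Resize a grayscale image (list of rows of floats) bilinearly."""
    src_h, src_w = len(img), len(img[0])
    out = []
    for y in range(new_h):
        # Map the output pixel center back into source coordinates.
        fy = min(max((y + 0.5) * src_h / new_h - 0.5, 0.0), src_h - 1)
        y0 = int(fy)
        y1 = min(y0 + 1, src_h - 1)
        wy = fy - y0
        row = []
        for x in range(new_w):
            fx = min(max((x + 0.5) * src_w / new_w - 0.5, 0.0), src_w - 1)
            x0 = int(fx)
            x1 = min(x0 + 1, src_w - 1)
            wx = fx - x0
            # Blend the four neighbours: horizontally first, then vertically.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def preprocess(img, target, fill=0.0):
    """Bilinear resize followed by center padding to target x target."""
    src_h, src_w = len(img), len(img[0])
    new_w, new_h, pad_left, pad_top = letterbox_geometry(src_w, src_h, target)
    resized = resize_bilinear(img, new_w, new_h)
    canvas = [[fill] * target for _ in range(target)]
    for y, row in enumerate(resized):
        canvas[pad_top + y][pad_left:pad_left + new_w] = row
    return canvas


def box_to_original(box, src_w, src_h, target):
    """Map (x1, y1, x2, y2) from padded-image pixels to original pixels."""
    new_w, new_h, pad_left, pad_top = letterbox_geometry(src_w, src_h, target)
    sx, sy = src_w / new_w, src_h / new_h
    x1, y1, x2, y2 = box
    return ((x1 - pad_left) * sx, (y1 - pad_top) * sy,
            (x2 - pad_left) * sx, (y2 - pad_top) * sy)
```

Because padding shifts and scaling shrinks the image content, any coordinates the model emits for the padded input have to be un-padded and rescaled, as `box_to_original` does, before they refer to the original image.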
The method enables flexible detection of user-defined objects with high-precision bounding box output. It overcomes the closed category sets of traditional methods and the functional limitations of cloud APIs, improving the flexibility and business adaptability of the detection system.
Smart Images

Figure CN122289221A_ABST