Object detection and coordinate output method based on visual large language model

CN122289221A · Pending · Publication Date: 2026-06-26 · CHENGDU XINHAOSI ELECTRONICS DETECTING TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CHENGDU XINHAOSI ELECTRONICS DETECTING TECH CO LTD
Filing Date: 2026-04-01
Publication Date: 2026-06-26

Application Information

IPC: G06T7/00; G06V10/44; G06V10/80; G06V10/766; G06V10/82; G06N3/045; G06N3/0455; G06N3/048; G06N5/04

AI Tagging

Technology Topics

Semantic filtering; Linguistic model

AI Technical Summary

Technical Problem

In industrial customized inspection and intelligent security scenarios, existing technologies cannot both detect an arbitrary user-specified object and output its precise bounding box coordinates. Traditional detectors are restricted to a closed set of categories, and cloud APIs offer limited functionality.

Method used

The method performs object detection with a visual large language model (VLM). The image is preprocessed with bilinear interpolation and center padding, then encoded by a ViT visual encoder within a transformer architecture. Self-attention strengthens the focus on the target to be detected, and the bounding box coordinates are output through autoregressive language modeling.
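The preprocessing step named above (bilinear interpolation followed by center padding) can be sketched in plain Python. This is only an illustrative sketch, not the patent's implementation: the square canvas, the aspect-preserving scale, the zero fill value, the pixel-centre sampling convention, and all function names (`letterbox_params`, `bilinear_resize`, `center_pad`, `preprocess`) are assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def letterbox_params(w: int, h: int, target: int) -> Tuple[float, int, int, int, int]:
    """Return scale, resized width/height, and left/top padding for a square canvas."""
    scale = target / max(w, h)  # assumption: the longer side fills the canvas
    new_w = max(1, round(w * scale))
    new_h = max(1, round(h * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def bilinear_resize(img: Image, new_w: int, new_h: int) -> Image:
    """Resize by bilinear interpolation, sampling at output pixel centres."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(new_h):
        # Map the output pixel centre back to source coordinates, clamped to the image.
        sy = min(max((y + 0.5) * h / new_h - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(new_w):
            sx = min(max((x + 0.5) * w / new_w - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Blend the four neighbouring pixels: horizontally first, then vertically.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def center_pad(img: Image, target: int, pad_left: int, pad_top: int,
               fill: float = 0.0) -> Image:
    """Place the resized image in the middle of a target x target canvas."""
    canvas = [[fill] * target for _ in range(target)]
    for y, row in enumerate(img):
        canvas[pad_top + y][pad_left:pad_left + len(row)] = row
    return canvas


def preprocess(img: Image, target: int):
    """Resize then pad; also return the transform needed to map boxes back."""
    h, w = len(img), len(img[0])
    scale, new_w, new_h, pad_left, pad_top = letterbox_params(w, h, target)
    resized = bilinear_resize(img, new_w, new_h)
    return center_pad(resized, target, pad_left, pad_top), (scale, pad_left, pad_top)
```

Returning `(scale, pad_left, pad_top)` alongside the canvas is a design choice of this sketch: any coordinates produced on the padded canvas can later be converted back to the original image by subtracting the padding and dividing by the scale.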

Benefits of technology

The method enables flexible detection of user-defined objects with high-precision bounding box output. It overcomes the closed category set of traditional methods and the functional limitations of cloud APIs, improving the flexibility and business adaptability of the detection system.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses an object detection and coordinate output method based on a visual large language model (VLM), relating to the field of image detection. The method includes: receiving a user's instruction to detect a target object; embedding the instruction into the original image; preprocessing the image using bilinear interpolation and centering; segmenting and encoding the processed image with a ViT visual encoder; introducing self-attention to enhance the detection focus on the target object; and jointly decoding the visual encoding and the embedded instruction within the same transformer architecture to output a detection report. The invention avoids the quantization error caused by direct normalization. It also offers output flexibility (the report is generated autoregressively, with the output format dynamically controlled by the instruction rather than limited by predefined templates) and semantic filtering (the common-sense reasoning capability of the VLM is used for high-level semantic judgment in the hidden state space, eliminating the need to train a separate classifier).
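Because the detection report is generated as text, a caller has to recover structured boxes from it. The sketch below shows one way that could look. The abstract states that the output format is controlled by the instruction, so the JSON shape used here (a list of objects with `label` and `bbox` keys, with `bbox` given as `[x1, y1, x2, y2]` in preprocessed-image pixels) is purely a hypothetical format, as are the function names `parse_report` and `to_original`.

```python
import json
import re
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def parse_report(text: str) -> List[Dict]:
    """Pull detections out of generated text; return [] if nothing usable is found.

    Assumes (hypothetically) that the instruction asked the model for a JSON
    list such as [{"label": "...", "bbox": [x1, y1, x2, y2]}].
    """
    # Take the span from the first '[' to the last ']' and try to parse it.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox")
        # Keep only entries whose bbox is exactly four numbers.
        if (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)):
            detections.append({
                "label": str(item.get("label", "")),
                "bbox": tuple(float(v) for v in box),
            })
    return detections


def to_original(box: Box, scale: float, pad_left: int, pad_top: int,
                width: int, height: int) -> Box:
    """Undo resize-and-center-pad preprocessing and clamp to the original image."""
    def clamp(v: float, upper: int) -> float:
        return min(max(v, 0.0), float(upper))

    x1, y1, x2, y2 = box
    return (
        clamp((x1 - pad_left) / scale, width),
        clamp((y1 - pad_top) / scale, height),
        clamp((x2 - pad_left) / scale, width),
        clamp((y2 - pad_top) / scale, height),
    )
```

For example, `parse_report('Detected: [{"label": "screw", "bbox": [2, 4, 6, 6]}]')` yields one detection, and `to_original` with `scale=2.0`, `pad_left=0`, `pad_top=2` on a 4 by 2 image maps that box to `(1.0, 1.0, 3.0, 2.0)`. Validating and clamping the parsed values is a defensive choice, since free-form generated text is not guaranteed to be well formed.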