A virtual supervision target detection method, system and electronic device

By using a cascaded HQP-DETR detection model and high-quality candidate box encoding, combined with a cascaded denoising query module, the problems of poor dataset quality and weak model generalization ability in virtual supervised object detection are solved, achieving high-precision detection and fast training on real images.

CN122244588A · Pending · Publication Date: 2026-06-19 · HARBIN INSTITUTE OF TECHNOLOGY (SHENZHEN) (INSTITUTE OF SCIENCE AND TECHNOLOGY INNOVATION HARBIN INSTITUTE OF TECHNOLOGY SHENZHEN)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: HARBIN INSTITUTE OF TECHNOLOGY (SHENZHEN) (INSTITUTE OF SCIENCE AND TECHNOLOGY INNOVATION HARBIN INSTITUTE OF TECHNOLOGY SHENZHEN)
Filing Date: 2026-03-17
Publication Date: 2026-06-19

IPC: G06V10/774; G06V10/82; G06V10/70; G06V10/26; G06N3/0455; G06N3/048; G06N3/045; G06N5/04; G06F16/334; G06F40/284; G06F40/30; G06F40/186

AI Tagging

Application Domain

Digital data information retrieval Semantic analysis

AI Technical Summary

Technical Problem

Existing virtual supervised object detection technologies suffer from problems such as poor quality of synthetic datasets, weak cross-domain generalization ability of detection models, and denoising mechanisms that are prone to overfitting synthetic domain features, resulting in insufficient detection accuracy and generalization ability of the models on real images.

Method used

A cascaded HQP-DETR detection model is constructed. By synthesizing a dataset based on the target frequency distribution in the real world, combining high-quality candidate box encoding and a cascaded denoising query module, image-specific geometric and semantic priors are injected, and the training weights of the denoising query features are dynamically adjusted to improve the training efficiency and cross-domain generalization ability of the model.

Benefits of technology

Achieving high-precision detection of real images with zero manual annotation cost improves the model's training efficiency and cross-domain generalization ability, and promotes the transformation of virtual supervised object detection from weak supervision to full supervision paradigm.

Abstract

This invention discloses a virtual supervised object detection method, system, and electronic device. The method includes synthesizing a dataset based on the frequency distribution of real-world objects and constructing a cascaded HQP-DETR detection model. In the cascaded HQP-DETR detection model, a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising are embedded in a multi-layer Transformer decoder. The target query module guides the model to learn low-level visual features consistent across domains by injecting image-specific geometric and semantic priors into the target query. The denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour. This invention achieves high-precision detection on real images without relying on large-scale manually labeled datasets, effectively improving model training efficiency, convergence speed, and cross-domain generalization ability.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, specifically to a virtual supervised target detection method, system, and electronic device.

Background Technology

[0002] Object detection is a fundamental task in computer vision, aiming to identify and locate foreground objects in images. With the development of deep learning technology, object detection models have made significant progress. However, these models typically require training on large-scale, manually labeled datasets, a process that is costly and time-consuming. To address this issue, Imaginary Supervised Object Detection (ISOD) technology has emerged. Its core idea is to train the model on synthetic images and then test it on real images, thereby reducing or eliminating the reliance on manually labeled data.

[0003] Early synthetic image generation methods mainly relied on digital processing or augmentation of real images, such as: (1) cut and paste: extracting foreground objects from real images and randomly pasting them onto different background images; (2) domain randomization: creating new synthetic images by changing image parameters (such as lighting, texture, and color); (3) 3D rendering: using 3D models and scene information to synthesize datasets. With the development of text-to-image generation technology (GANs and diffusion models), the scale and quality of synthetic datasets have improved. For example, some works fine-tune diffusion models on detection datasets to generate labeled datasets. Among them, ImaginaryNet first proposed ISOD: it uses GPT-2 to generate simple text prompts, inputs them into the DALL-E 2 model for image synthesis, and uses the categories in the text prompts to train the model under the weakly supervised object detection paradigm.

[0004] At the object detection model level, early convolutional neural network-based detection models were divided into two-stage and one-stage methods. Two-stage methods, represented by the R-CNN series (Faster R-CNN, Cascade R-CNN), first generate candidate boxes and then perform classification and regression on these boxes. Early methods for generating candidate boxes (such as Selective Search, EdgeBoxes, and BING) relied on hand-designed heuristics and low-level features, while deep learning-based methods (such as RPN) require training on real images, which violates the ISOD principle. One-stage methods (such as the YOLO series and SSD) directly predict bounding boxes and categories in a single forward pass. With the great success of the Transformer architecture in NLP, object detection models gradually evolved towards the Transformer architecture. DETR reformulated object detection as a set prediction problem. Its network includes a backbone, an encoder-decoder Transformer, and a Hungarian matching module. It uses bipartite matching to establish a one-to-one correspondence between the bounding boxes predicted by the decoder from the queries and the ground-truth boxes, and employs classification loss, L1 loss, and GIoU loss for end-to-end optimization. Subsequent works such as Deformable DETR, Conditional DETR, and SAM-DETR improve detection performance by refining the attention mechanism in the Transformer, while Anchor DETR and DAB-DETR effectively accelerate model convergence by introducing explicit anchor box priors for object queries.
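The box-regression terms named above (IoU and GIoU) can be illustrated with a minimal sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) format. The function names are hypothetical and are not taken from the patent or from any DETR codebase. The GIoU loss used for optimization is commonly taken as 1 minus the GIoU value.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); degenerate boxes give 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes in (x1, y1, x2, y2) format."""
    # Intersection rectangle (may be empty, in which case its area is 0).
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both inputs.
    enclose = box_area((min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3])))
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return iou, giou
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes, because the enclosing-box term still changes as the boxes move apart.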

[0005] DN-DETR adds noise to the ground truth (GT) boxes as a denoising query, and inputs it into the Transformer decoder along with the regular matching query. The model performs an additional denoising task to reconstruct the original GT boxes corresponding to the denoising query.
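As a rough illustration of how denoising queries can be built from ground-truth boxes, the sketch below jitters the center and size of each box. It assumes boxes in normalized (cx, cy, w, h) format. The function name, the `box_noise_scale` parameter, and its default value are illustrative assumptions, not values stated in this document.

```python
import random


def make_denoising_queries(gt_boxes, box_noise_scale=0.4, rng=None):
    """Return noised copies of ground-truth boxes given as (cx, cy, w, h) in [0, 1].

    Assumption: the center is shifted by at most box_noise_scale * size / 2 and
    the width and height are rescaled by at most box_noise_scale.
    """
    rng = rng or random.Random()

    def clip(value):
        return min(1.0, max(0.0, value))

    noised = []
    for cx, cy, w, h in gt_boxes:
        dx = rng.uniform(-1.0, 1.0) * box_noise_scale * w / 2
        dy = rng.uniform(-1.0, 1.0) * box_noise_scale * h / 2
        dw = rng.uniform(-1.0, 1.0) * box_noise_scale * w
        dh = rng.uniform(-1.0, 1.0) * box_noise_scale * h
        noised.append((clip(cx + dx), clip(cy + dy), clip(w + dw), clip(h + dh)))
    return noised
```

During training, the decoder would be asked to reconstruct the original box from each noised copy. At inference time no ground truth exists, so this branch is not used.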

[0006] ISOD technology must address both the quality of synthesized images and the shortcomings of object detection models. Regarding synthesized images, early techniques such as cut-and-paste and domain randomization still rely on real images and require additional annotation using auxiliary models (such as segmentation models). They also suffer from insufficient diversity in generated images and may violate the logic of real-world scenes. Later text-to-image works, such as ImaginaryNet, suffer from simple prompt templates, insufficient scene diversity, and difficulty in depicting relationships between categories. Furthermore, the performance of the text-to-image models they rely on is limited, resulting in low-quality generated images, distorted details, and compositions that may violate real-world physics. In addition, they use only the category information from the prompts as weak supervision signals, resulting in insufficient supervision strength, limited model performance, and poor generalization to real-world scenes.

[0007] In object detection models, the final accuracy of two-stage methods is highly dependent on the quality of candidate boxes. Early candidate box generation techniques such as Selective Search, BING, and EdgeBoxes relied on hand-designed heuristics and some low-level image features. They often generate thousands of redundant candidate boxes per image and still may not cover all objects. While deep learning-based methods such as HyperNet and SPG improve recall, they require training the candidate box generation network on real images, which violates the ISOD principle, and they also generate a large number of redundant candidate boxes.

[0008] Transformer-based object detection models such as DETR randomly initialize object queries during training, independent of the input image. This lack of explicit image-related geometric and semantic priors leads to slow convergence, because the model must learn the query distribution from scratch. Under ISOD constraints, the model is prone to overfitting to specific patterns in synthetic datasets (such as fixed backgrounds and layouts), making it difficult to generalize effectively to real-world images. Meanwhile, DETR variants that incorporate denoising mechanisms, such as DN-DETR, also suffer from overfitting because they impose uniform training constraints on all denoising queries. Under ISOD constraints, the pseudo-labels generated by automatic annotation tools inevitably contain labeling errors. A uniform training strategy causes the model to fit inaccurate pseudo-labels indiscriminately, so it tends to fit noisy labels rather than learn stable low-level visual features such as object contours.

Summary of the Invention

[0009] To address the aforementioned issues, this invention provides a virtual supervised object detection method, system, and electronic device. It aims to overcome three core shortcomings of existing virtual supervised object detection technologies: poor quality of synthetic datasets, weak cross-domain generalization ability of detection models, and the tendency for denoising mechanisms to overfit synthetic domain features. This method achieves high-precision detection on real images without relying on large-scale manually labeled datasets, effectively improving model training efficiency, convergence speed, and cross-domain generalization ability. It promotes the transformation of virtual supervised object detection from weak supervision to full supervision, reducing the training cost and application threshold of object detection models.

[0010] According to a first aspect of the present disclosure, a virtual supervised target detection method is provided, the method comprising the following steps: synthesizing a dataset based on the real-world target frequency distribution; constructing a cascaded HQP-DETR detection model comprising a backbone network, a multi-layer Transformer encoder, a DenseFusion module, and a multi-layer Transformer decoder, wherein the backbone network is used for feature extraction, the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network, the DenseFusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimensions, and inputs them into the multi-layer Transformer decoder, and the multi-layer Transformer decoder outputs bounding box offsets and class predictions; training and testing the cascaded HQP-DETR detection model using the synthetic dataset; and inputting the image to be detected into the trained cascaded HQP-DETR detection model for detection. The multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising. The target query module guides the model to learn low-level visual features that are consistent across domains by injecting image-specific geometric and semantic priors into the target query. The denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour.

[0011] A further technical solution of the present invention is as follows: synthesizing the dataset specifically includes the following steps: given a category set containing C target categories, dividing it into a common category set, a medium-frequency category set, and a rare category set, and assigning sampling weights to the three levels based on the real-world target frequency distribution; for each sample to be generated, randomly selecting the number of categories and drawing that many categories from the category set according to the sampling weight of each category to form a category combination; converting the category combination into a basic prompt template; using a large language model with an expert prompt to enhance the basic prompt template and generate an enhanced prompt; inputting the enhanced prompt into the text-to-image model to generate an image; constructing a detection prompt from the category names in the basic prompt template; and inputting the generated image and the detection prompt into the open-vocabulary detector, which outputs annotation information containing bounding box coordinates and category labels.

[0012] A further technical solution of the present invention is as follows: the method injects specific geometric and semantic priors of the input image into the target query by anchor box initialization based on SAM candidate boxes and semantic perception priors based on region features.

[0013] A further technical solution of the present invention is: anchor box initialization based on SAM candidate boxes, specifically including: uniformly sampling 64 points along each of the length and width of the image as automatic prompts, and applying non-maximum suppression to the results to obtain N category-independent segmentation masks; for each segmentation mask, calculating the minimum bounding rectangle and using it as a candidate box, thus obtaining the candidate box set; and initializing the anchor box parameters of the target query using the candidate box set.

[0014] A further technical solution of the present invention is: semantic perception prior based on region features, specifically including: Perform RoI Align operation on each candidate bounding box to extract features of a fixed size; The extracted features are projected onto the hidden layer of the decoder through the Neck module, which consists of two linear layers with ReLU activation.

[0015] A further technical solution of the present invention is as follows: the denoising query module dynamically adjusts the training weights by progressively increasing the IoU threshold at each layer of the decoder. The threshold of each layer is given by a formula that is not reproduced in this text.

[0016] The threshold formula is parameterized by an initial threshold, a total increment, and the total number of decoder layers. Based on the threshold of a given layer, a dynamic weight is calculated for each denoising query j in that layer.

[0017] In each decoder layer, the IoU is calculated between the bounding box reconstructed from the j-th denoising query and its corresponding ground-truth bounding box. The deviation of this IoU from the layer threshold is then mapped to the training weight by a formula that is likewise not reproduced in this text.

[0018] In that mapping, a Sigmoid function is applied together with a temperature coefficient.
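The layer-wise threshold and the sigmoid weighting described in paragraphs [0015] to [0018] can be sketched as follows. The formulas themselves are not recoverable from this text, so several things here are assumptions rather than the patent's definitions: the linear schedule (initial threshold plus a share of the total increment), the sign convention (higher IoU gives a higher weight), the function names, and every numeric value.

```python
import math


def layer_thresholds(initial_threshold, total_increment, num_layers):
    """Assumed linear schedule: rises from the initial threshold by the total
    increment, spread evenly across the decoder layers."""
    if num_layers == 1:
        return [initial_threshold]
    step = total_increment / (num_layers - 1)
    return [initial_threshold + step * layer for layer in range(num_layers)]


def denoising_weight(iou, threshold, temperature):
    """Assumed mapping: sigmoid of the IoU-to-threshold deviation, scaled by a
    temperature coefficient. Equals 0.5 exactly at the threshold."""
    return 1.0 / (1.0 + math.exp(-(iou - threshold) / temperature))


# Hypothetical settings: 6 decoder layers, threshold rising from 0.5 to 0.75.
thresholds = layer_thresholds(0.5, 0.25, 6)
weights_for_fixed_iou = [denoising_weight(0.7, t, 0.1) for t in thresholds]
```

Under these assumptions, a reconstruction of fixed quality receives a smaller weight in deeper layers, because the bar it is compared against keeps rising.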

[0019] A further technical solution of the present invention is as follows: During the training phase of the cascaded HQP-DETR detection model, the input image is processed by a CNN backbone network and a Transformer encoder to generate a feature map rich in contextual information; the Transformer decoder isolates the target query module based on high-quality candidate box encoding and the denoising query module based on cascaded denoising to perform target query and denoising query in parallel.

[0020] During the inference phase, the cascaded HQP-DETR detection model only performs target queries based on high-quality candidate box encoding. After optimization by the decoder, it finally outputs bounding boxes and class predictions.

[0021] According to a second aspect of the present disclosure, a virtual supervised target detection system is provided, comprising the following modules: a dataset synthesis module, used to synthesize datasets based on the real-world target frequency distribution; a model construction module, used to construct a cascaded HQP-DETR detection model comprising a backbone network, a multi-layer Transformer encoder, a DenseFusion module, and a multi-layer Transformer decoder, wherein the backbone network is used for feature extraction, the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network, the DenseFusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimensions, and inputs them into the multi-layer Transformer decoder, and the multi-layer Transformer decoder outputs bounding box offsets and class predictions; a model training module, used to train and test the cascaded HQP-DETR detection model using the synthetic dataset; and a detection module, used to input the image to be detected into the trained cascaded HQP-DETR detection model for detection. The multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising. The target query module guides the model to learn low-level visual features that are consistent across domains by injecting image-specific geometric and semantic priors into the target query. The denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour.

[0022] According to a third aspect of the present disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the virtual supervised target detection method described above.

[0023] According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, wherein computer instructions are stored on the storage medium, and when executed by a processor, the instructions implement the steps of the virtual supervised target detection method described above.

[0024] The virtual supervised target detection method, system, and electronic device provided in this disclosure have the following specific beneficial effects: 1) To address the shortcomings of existing synthetic datasets, such as limited prompts, poor image quality, and weak supervision signals, this invention constructs a high-quality synthetic dataset generation pipeline. This pipeline generates synthetic data with diverse scenes, realistic image details, and accurate and complete annotations, thereby upgrading virtual supervised object detection from a weakly supervised paradigm to a fully supervised paradigm. This solves the core problem that existing synthetic datasets cannot support the training of high-precision object detection models.

[0025] 2) To address the shortcomings of existing DETR series models, such as random initialization of target queries leading to a lack of semantic priors for specific image locations, slow convergence speed, and susceptibility to overfitting specific patterns in synthetic data, this invention proposes a query encoding mechanism guided by high-quality candidate boxes. This mechanism injects image-specific geometric and semantic priors into the target query, guiding the model to learn consistent low-level visual features across domains, rather than overfitting the inherent patterns of synthetic data. This accelerates the model's convergence speed and improves its generalization ability on real images.

[0026] 3) To address the shortcomings of existing denoising mechanisms that impose uniform training constraints on denoising queries, which can easily lead to model overfitting to inaccurate pseudo-labels, this invention designs a cascaded denoising algorithm. By dynamically adjusting the training weights of denoising query features based on prediction quality, the algorithm guides the model to prioritize learning stable and reliable visual features such as target contours, avoiding blind fitting to inaccurate pseudo-labels and further improving the model's localization accuracy and detection robustness.

[0027] In summary, this invention provides a complete, efficient, and high-precision virtual supervised target detection solution. It enables the model to stably generalize to real-world scenarios with zero manual annotation costs, thus promoting the application of target detection technology in scenarios where annotation is difficult.

[0028] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0029] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

[0030] Figure 1 is a flowchart of the virtual supervised target detection method in an embodiment of the present invention; Figure 2 is a schematic diagram of the cascaded HQP-DETR detection model in an embodiment of the present invention; Figure 3 is a structural diagram of the virtual supervised target detection system in an embodiment of the present invention; Figure 4 is a schematic diagram of an electronic device according to an embodiment of the present invention; Figure 5 shows example images from the FluxVOC dataset in an embodiment of the present invention; Figure 6 is a kernel density estimation map of bounding box centers in an embodiment of the present invention; Figure 7 shows the training curves of each model over the training epochs in an embodiment of the present invention; Figure 8 shows example detection results of Cascade HQP-DETR on the PascalVOC2007 test set in an embodiment of the present invention; Figure 9 shows example images from the FluxCOCO dataset in an embodiment of the present invention.

Detailed Implementation

[0031] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be noted that, for ease of description, only the parts relevant to the present invention are shown in the drawings, not the entire structure.

[0032] Before discussing the exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the steps as sequential processes, many of these steps can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the steps can be rearranged. The process can be terminated when its operation is complete, but may also have additional steps not included in the figures. The process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0033] This invention relates to a virtual supervised target detection method. As shown in Figure 1, the method includes the following steps: S1. Synthesize a dataset based on the real-world target frequency distribution; S2. Construct a cascaded HQP-DETR detection model, including a backbone network, a multi-layer Transformer encoder, a DenseFusion module, and a multi-layer Transformer decoder: the backbone network is used for feature extraction; the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network; the DenseFusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimensions, and then inputs them into the multi-layer Transformer decoder; the multi-layer Transformer decoder outputs bounding box offsets and class predictions; S3. Train and test the cascaded HQP-DETR detection model using the synthetic dataset; S4. Input the image to be detected into the trained cascaded HQP-DETR detection model for detection. The multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising. The target query module guides the model to learn low-level visual features that are consistent across domains by injecting image-specific geometric and semantic priors into the target query. The denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour.

[0034] Synthesizing the dataset in S1 specifically includes the following steps: given a category set containing C target categories, dividing it into a common category set, a medium-frequency category set, and a rare category set, and assigning sampling weights to the three levels based on the real-world target frequency distribution; for each sample to be generated, randomly selecting the number of categories and drawing that many categories from the category set according to the sampling weight of each category to form a category combination; converting the category combination into a basic prompt template; using a large language model with an expert prompt to enhance the basic prompt template and generate an enhanced prompt; inputting the enhanced prompt into the text-to-image model to generate an image; constructing a detection prompt from the category names in the basic prompt template; and inputting the generated image and the detection prompt into the open-vocabulary detector, which outputs annotation information containing bounding box coordinates and category labels.

[0035] In a specific implementation, the generation of a high-quality synthetic dataset proceeds as follows. Given a category set containing C target categories, the set is divided into a common category set, a medium-frequency category set, and a rare category set, and sampling weights are assigned to the three levels to reflect the target frequency distribution in the real world.

[0036] For each sample to be generated, the number of targets n_obj is randomly selected, and n_obj categories are randomly drawn from the category set according to the sampling weight of each category to form a category combination. The category combination is then converted into a basic prompt template. For example, in one embodiment: "A realistic photo which contains c_s1, c_s2, ..., and c_s(n_obj)." Then the large language model Llama 3 is used with an expert prompt to enhance the basic template. The expert prompt reads as follows: You are an expert in generating high-quality prompts for text-to-image models. Your task is to take a basic prompt and enhance it by adding vivid details, specific attributes, context, lighting, and atmosphere. Also, for each object mentioned, determine and assign a plausible number of instances that would naturally fit the scene you are creating. The count should be guided by realism and the typical context of the objects. The final output must be a single, concise, highly descriptive prompt in English. Basic Prompt: (followed by the basic template). The input is the basic template and the output is the enhanced prompt.
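The tiered category sampling and the basic prompt template can be sketched as below. This is a minimal illustration under stated assumptions: the tier names, the weight values, the category lists in the usage lines, and the choice of sampling without replacement are all hypothetical, since the text does not specify them. The template wording follows the example in the paragraph above.

```python
import random


def sample_categories(tiers, tier_weights, n_obj, rng):
    """Draw n_obj distinct categories; each category inherits its tier's weight.

    Assumption: sampling is without replacement, so a category appears at most
    once per prompt.
    """
    names, weights = [], []
    for tier, categories in tiers.items():
        for category in categories:
            names.append(category)
            weights.append(tier_weights[tier])
    chosen = []
    for _ in range(min(n_obj, len(names))):
        index = rng.choices(range(len(names)), weights=weights, k=1)[0]
        chosen.append(names.pop(index))
        weights.pop(index)
    return chosen


def basic_prompt(categories):
    """Fill the basic template with the sampled category names."""
    if len(categories) == 1:
        listing = categories[0]
    elif len(categories) == 2:
        listing = " and ".join(categories)
    else:
        listing = ", ".join(categories[:-1]) + ", and " + categories[-1]
    return f"A realistic photo which contains {listing}."


# Hypothetical tiers and weights, for illustration only.
tiers = {"common": ["person", "car"], "medium": ["dog", "cat"], "rare": ["sheep"]}
tier_weights = {"common": 3.0, "medium": 2.0, "rare": 1.0}
combo = sample_categories(tiers, tier_weights, n_obj=3, rng=random.Random(0))
prompt = basic_prompt(combo)
```

The resulting string would then be handed to the language model for enhancement. That step is not sketched here because it depends on the model interface.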

[0037] The enhanced prompt is then input into the Flux text-to-image model to generate a high-fidelity image. Flux models can generate synthetic images with realistic detail, diversity, and physical consistency.

[0038] Detection prompts are then constructed from the category names in the basic template. The specific rule is to separate category names with periods. For example, if the basic template is "a realistic photo of a cat and a dog," the corresponding detection prompt is "cat. dog." Then the open-vocabulary detector Grounding DINO takes the synthesized image and the detection prompt as input and outputs annotation information including bounding box coordinates and category labels. Based on this process, two datasets were generated in a specific embodiment: ① FluxVOC: 10,000 images, 20 categories; ② FluxCOCO: 80,000 images, 80 categories.
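The period-separated rule for detection prompts is simple enough to show directly. The function name is hypothetical, and the lowercasing and whitespace stripping are an added assumption for tidy input. The text specifies only the period separation.

```python
def build_detection_prompt(category_names):
    """Join category names into a period-separated detection prompt."""
    cleaned = [name.strip().lower() for name in category_names if name.strip()]
    return " ".join(f"{name}." for name in cleaned)


detection_prompt = build_detection_prompt(["cat", "dog"])
```

For the example in the paragraph above, this yields the string "cat. dog.".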

[0039] Anchor box initialization based on SAM candidate boxes and semantic-aware priors based on region features are used to inject input image-specific geometric and semantic priors into the target query.

[0040] Specifically, the cascaded HQP-DETR detection model is shown in Figure 2. It comprises two core components: a query encoding mechanism guided by high-quality candidate boxes and a cascaded denoising algorithm. The cascaded HQP-DETR detection model includes the following parts: a CNN backbone network, which uses ResNet-50 for feature extraction and outputs the feature map of the final convolutional stage C5; a Transformer encoder, which contains 6 encoder layers with a hidden dimension of 256 and performs contextual modeling on the features output by the backbone network; a DenseFusion module, which inherits the design of Stable-DINO, concatenating the encoder output features with the backbone output features, projecting them back to the original dimension, and inputting them into the decoder; and a Transformer decoder, which contains 6 decoder layers with a hidden dimension of 256 and processes two parallel branches: ① the HQP branch, with target queries initialized from high-quality candidate boxes; ② the denoising branch, with denoising queries obtained by adding noise to the ground-truth boxes. Bounding box offsets and class predictions are output after each decoder layer.

[0041] DAB-DETR represents each target query as a combination of a four-dimensional anchor box and a content embedding. However, both components are randomly initialized without incorporating input image information. This "double-blind" initialization strategy has two major limitations in the ISOD task: ① The model needs to learn the target query distribution from scratch, resulting in a significantly slower convergence speed; ② The model is prone to overfitting to specific patterns in synthetic datasets (such as background style, target layout, etc.), thereby reducing its cross-domain generalization ability in real-world scenes. To address this, a high-quality candidate box-guided query encoding mechanism is proposed. This mechanism injects specific geometric and semantic priors of the input image into the target query through anchor box initialization based on SAM candidate boxes and semantically aware priors based on region features.

[0042] Furthermore, the anchor box initialization based on SAM candidate boxes specifically includes: uniformly sampling 64 points along each of the width and height of the image as automatic prompts, and filtering the resulting masks by non-maximum suppression to obtain N category-independent segmentation masks; for each segmentation mask, calculating the minimum bounding rectangle and using it as a candidate box, thus obtaining the candidate box set; and initializing the anchor box parameters of the target queries using the candidate box set.
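The mask-to-candidate-box step can be illustrated with a minimal sketch. It assumes that the "minimum bounding rectangle" is the tight axis-aligned box of each binary mask and that anchors are stored as normalized (cx, cy, w, h); the function names are hypothetical, and the masks would in practice come from SAM rather than from hand-written lists.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]                    # binary mask, mask[row][col] in {0, 1}
Anchor = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)


def mask_to_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Tight axis-aligned box (x0, y0, x1, y1) in pixels, x1/y1 exclusive.

    Returns None for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


def to_normalized_cxcywh(box, width: int, height: int) -> Anchor:
    """Convert a pixel box to a normalized (cx, cy, w, h) anchor."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2 / width, (y0 + y1) / 2 / height,
            (x1 - x0) / width, (y1 - y0) / height)


def masks_to_anchor_boxes(masks: List[Mask], width: int, height: int) -> List[Anchor]:
    """One anchor per non-empty mask, so the query count adapts to the image."""
    anchors = []
    for mask in masks:
        box = mask_to_box(mask)
        if box is not None:
            anchors.append(to_normalized_cxcywh(box, width, height))
    return anchors
```

Because one anchor is produced per surviving mask, the number of target queries varies from image to image instead of being fixed in advance.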

[0043] In practice, the Segment Anything Model (SAM), a foundation model for visual segmentation trained on massive labeled data, possesses powerful zero-shot generalization capabilities. It can generate high-precision segmentation masks for images without requiring fine-tuning for specific domains. The model supports interactive prompts in the form of points, bounding boxes, and masks, and outputs the corresponding target segmentation mask based on the input prompts. In this work, the automatic segmentation mode of SAM-H (Huge) is used: 64 points are sampled uniformly along each of the width and height of every image as automatic prompts, and the outputs are filtered by non-maximum suppression to obtain N category-independent segmentation masks. For each mask, its minimum bounding rectangle is calculated, which yields a set of candidate boxes. This set has the following attributes: ① High recall: at an IoU threshold of 0.5, the average recall reaches 0.95, which essentially ensures full coverage of the foreground objects in the image.

[0044] ② Low redundancy: The number of candidate boxes generated per image is usually between 150 and 200, which is far less than the thousands of candidate boxes generated by traditional candidate box generation methods (such as RPN, EdgeBoxes, Selective Search).

[0045] ③ Domain consistency: SAM can accurately cover the foreground target in both the synthetic domain and the real domain, and has good cross-domain segmentation capabilities.

[0046] The candidate box set is used directly to initialize the anchor box parameters (x, y, w, h) of the target queries. Unlike the fixed strategy of using 300 randomly initialized target queries in DAB-DETR, this method generates an adaptive number of target queries and focuses the queries on the regions where foreground objects are located from the initial training stage, providing a more targeted geometric prior for subsequent detection.

[0047] Furthermore, the semantic-aware prior based on region features specifically includes: performing the RoI Align operation on each candidate box to extract features of a fixed size; and projecting the extracted features onto the hidden dimension of the decoder through the Neck module, which consists of two linear layers with ReLU activation.

[0048] In the specific implementation process, in order to supplement the preceding geometric prior with the corresponding semantic prior, the corresponding features are extracted from the feature map output by the backbone network and used to initialize the content embedding.

[0049] For an input image I, the CNN backbone network (such as ResNet-50) outputs a feature map $F \in \mathbb{R}^{C \times H \times W}$ from its last convolutional layer, where C is the channel dimension. For each candidate box $b_i$, the RoI Align operation is applied:

$$f_i = \mathrm{RoIAlign}(F, b_i)$$

[0050] This operation extracts fixed-size features by spatially pooling the corresponding region. The extracted features are then projected onto the hidden dimension of the decoder using the Neck module:

$$q_i = \mathrm{Neck}(f_i)$$

[0051] where the Neck contains two linear layers with ReLU activation and its output dimension is 256. In this way, each content embedding $q_i$ obtains semantic information from its corresponding region.
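As a rough illustration of the region-feature step, the sketch below implements a deliberately simplified RoI Align (a single channel, one bilinear sample at the centre of each bin, pixel centres at integer coordinates) followed by a two-layer Neck. These simplifications, the tiny dimensions, the random weights, and all function names are assumptions for readability; the experiments in this document use 7×7 RoI features, a hidden dimension of 4096, and a 256-dimensional output.

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def bilinear(fmap: Matrix, x: float, y: float) -> float:
    """Bilinearly interpolate a single-channel feature map at (x, y)."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap: Matrix, box: Sequence[float], out_size: int = 7) -> List[float]:
    """Simplified RoI Align: one sample per bin, flattened row by row.

    box = (x0, y0, x1, y1) in feature-map coordinates.
    """
    x0, y0, x1, y1 = box
    bin_w = (x1 - x0) / out_size
    bin_h = (y1 - y0) / out_size
    return [bilinear(fmap, x0 + (j + 0.5) * bin_w, y0 + (i + 0.5) * bin_h)
            for i in range(out_size) for j in range(out_size)]


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> List[float]:
    """y = W x + b with W stored as rows of output units."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def neck(x, w1, b1, w2, b2) -> List[float]:
    """Two linear layers with a ReLU in between."""
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights, used only to make the sketch runnable."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
```

A content embedding for one candidate box would then be `neck(roi_align(fmap, box), w1, b1, w2, b2)`, computed once per box before the decoder runs.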

[0052] Geometric and semantic priors complement each other synergistically. Geometric priors guide the target query to focus on the foreground region in the image, while semantic priors inject corresponding regional semantic information into the target query, providing initial semantic awareness. The combination of these two approaches transforms the decoder's task paradigm from global exploration to local refinement and classification of the foreground region of the image, significantly reducing optimization difficulty and improving training efficiency.

[0053] The denoising query module dynamically adjusts training weights by progressively increasing the IoU threshold at each layer of the decoder. Specifically, DN-DETR accelerates the convergence of DETR through a denoising training task, but it applies uniform training pressure to all denoising queries. Under the ISOD setting, the pseudo-labels generated by the automatic annotation tool are inevitably inaccurate in places, so this strategy causes the model to fit all pseudo-labels indiscriminately, including the inaccurate ones. To address this issue, a cascaded denoising algorithm inspired by Cascade R-CNN is proposed, which implements this layer-wise threshold schedule.

[0054] For each ground truth bounding box (GTBox) in the training image, random noise is added to its center coordinates, width, and height to generate a noise box used to initialize the anchor box portion of the denoising query. Similarly, the RoI Align and Neck modules are applied to the noise box to extract the corresponding region features to initialize the content embedding of the denoising query. Then, the denoising query is sent to the decoder.
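A minimal sketch of the noise-box generation is given below. The source only states that random noise is added to the center coordinates, width, and height; the specific scheme here (center shift bounded by a fraction of the half-size, multiplicative size jitter, in the spirit of DN-DETR) and the scale parameters are assumptions, and the function names are hypothetical.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def make_noisy_box(gt: Box, center_scale: float, size_scale: float,
                   rng: random.Random) -> Box:
    """Jitter a ground-truth box to initialize a denoising-query anchor.

    center_scale and size_scale are hypothetical hyperparameters in [0, 1).
    """
    cx, cy, w, h = gt
    # Shift the center by at most center_scale times the half-size.
    noisy_cx = cx + rng.uniform(-1.0, 1.0) * center_scale * w / 2
    noisy_cy = cy + rng.uniform(-1.0, 1.0) * center_scale * h / 2
    # Scale width and height by a factor in [1 - size_scale, 1 + size_scale].
    noisy_w = w * (1 + rng.uniform(-1.0, 1.0) * size_scale)
    noisy_h = h * (1 + rng.uniform(-1.0, 1.0) * size_scale)
    return (clamp(noisy_cx), clamp(noisy_cy),
            clamp(noisy_w, 1e-4), clamp(noisy_h, 1e-4))
```

The noisy box then plays the same role for a denoising query that a SAM candidate box plays for a target query: it is the anchor, and region features extracted over it initialize the content embedding.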

[0055] An increasing threshold is set for each decoder layer. The threshold of the $l$-th layer is:

$$\tau_l = \tau_0 + \frac{l-1}{N-1}\,\Delta\tau$$

[0056] where $\tau_0$ is the initial threshold, $\Delta\tau$ is the total increment, and $N = 6$ is the total number of decoder layers. The threshold increases linearly from 0.3 in the first layer to 0.8 in the last layer.
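The per-layer schedule can be sketched as follows, assuming linear interpolation between the first-layer value 0.3 and the last-layer value 0.8 stated in this document (so a total increment of 0.5 over 6 layers). The function and parameter names are hypothetical.

```python
from typing import List


def layer_thresholds(tau_0: float = 0.3, delta: float = 0.5,
                     num_layers: int = 6) -> List[float]:
    """IoU threshold for decoder layers 1..num_layers, increasing linearly."""
    if num_layers < 2:
        return [tau_0]
    return [tau_0 + delta * (layer - 1) / (num_layers - 1)
            for layer in range(1, num_layers + 1)]
```

With the default values the thresholds advance in equal steps of 0.1 per layer.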

[0057] Based on the layer threshold $\tau_l$, a dynamic weight is calculated for each denoising query $j$ in the $l$-th layer. In each decoder layer, the IoU between the bounding box $\hat{b}_j^{\,l}$ reconstructed from the $j$-th denoising query and its corresponding ground-truth bounding box $b_j$ is denoted as $u_j^l$. The deviation of $u_j^l$ from the threshold $\tau_l$ is then mapped to a training weight:

$$w_j^l = \sigma\!\left(\frac{u_j^l - \tau_l}{T}\right)$$

[0058] where $\sigma$ is the Sigmoid function and the temperature coefficient $T$ controls the steepness of the weight curve. This weight directly modulates the features of the denoising query.

[0059] This creates three training regimes: ① High quality: when $u_j^l \gg \tau_l$, $w_j^l \to 1$, and the query receives the full training pressure.

[0060] ② Low quality: when $u_j^l \ll \tau_l$, $w_j^l \to 0$, and the query is almost completely suppressed.

[0061] ③ Medium quality: when $u_j^l \approx \tau_l$, $w_j^l \approx 0.5$.
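The three regimes can be checked numerically with a small sketch. It assumes the weight is a sigmoid of the IoU-minus-threshold gap divided by the temperature; the exact functional form and the temperature value 0.1 are assumptions, since the source states only that a Sigmoid and a temperature coefficient are involved. Boxes are (x0, y0, x1, y1).

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def denoising_weight(iou: float, tau: float, temperature: float = 0.1) -> float:
    """Sigmoid of the gap between IoU and the layer threshold.

    temperature is a hypothetical value; smaller values give a steeper curve.
    """
    return 1.0 / (1.0 + math.exp(-(iou - tau) / temperature))
```

Under these assumptions, a reconstruction with IoU 0.5 receives a weight near 0.88 at a first-layer threshold of 0.3 but only about 0.05 at a last-layer threshold of 0.8, which is the cascading effect the text describes.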

[0062] This allows early decoder layers to be trained on queries with low IoU, while later layers focus on queries with high IoU. Since each decoder layer reconstructs the bounding box from its denoised query, this creates a progressive refinement process: the model first learns coarse localization, then focuses on precise bounding box refinement.

[0063] During the training phase, the input image of the cascaded HQP-DETR detection model is processed by a CNN backbone network and a Transformer encoder to generate a feature map rich in contextual information. In the Transformer decoder, the two groups of queries are isolated from each other by attention masks, and the target query module based on high-quality candidate box encoding and the denoising query module based on cascaded denoising process the target queries and the denoising queries in parallel.
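One way to realize the isolation by attention masks is sketched below. The direction of the masking is an assumption borrowed from the DN-DETR convention: target queries are blocked from attending to denoising queries (which carry ground-truth information), while denoising queries may still attend to target queries. A single denoising group is assumed, and the function name is hypothetical.

```python
from typing import List


def build_query_attention_mask(num_target: int, num_denoise: int) -> List[List[bool]]:
    """Self-attention mask over queries ordered [target..., denoising...].

    mask[i][j] is True when query i must NOT attend to query j.
    """
    total = num_target + num_denoise
    mask = [[False] * total for _ in range(total)]
    for i in range(num_target):
        for j in range(num_target, total):
            mask[i][j] = True  # hide denoising queries from target queries
    return mask
```

Because the target queries never see the denoising queries, removing the denoising branch at inference does not change what the target queries attend to.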

[0064] During the inference phase, the cascaded HQP-DETR detection model only performs target queries based on high-quality candidate box encoding. After optimization by the decoder, it finally outputs bounding boxes and class predictions.

[0065] Specifically, in addition to the two core modules mentioned above, Cascade HQP-DETR also incorporates two strategies from Stable-DINO: ① Dense fusion: The features output by the encoder are concatenated with the features output by the backbone network, and then projected onto the hidden layer dimension of the decoder before being fed into the decoder; ② Stable matching: A position modulation term is introduced into the Hungarian matching cost; at the same time, a position-supervised classification loss is added to the classification branch to enhance the consistency between classification confidence and localization quality.
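The dense fusion step can be sketched per token as a concatenation followed by a projection. The sketch assumes that encoder and backbone features are already flattened to aligned token lists and that the projection is a single linear layer; the source does not specify the layer type, and the names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def dense_fusion(encoder_tokens: Sequence[Sequence[float]],
                 backbone_tokens: Sequence[Sequence[float]],
                 weight: Matrix, bias: Sequence[float]) -> Matrix:
    """Concatenate encoder and backbone features per token, then project.

    weight has one row per output channel and
    len(encoder token) + len(backbone token) columns.
    """
    fused = []
    for enc, bb in zip(encoder_tokens, backbone_tokens):
        concat = list(enc) + list(bb)
        fused.append([sum(w * v for w, v in zip(row, concat)) + b
                      for row, b in zip(weight, bias)])
    return fused
```

In the configuration described in this document the output width would be the decoder hidden dimension of 256, so the decoder sees both contextualized and raw backbone information in a single tensor.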

[0066] The training process is as follows: the input image is processed by a CNN backbone network and a Transformer encoder (with the dense fusion module integrated) to generate a feature map rich in contextual information. Subsequently, the Transformer decoder uses an attention mask to isolate two sets of queries from each other and processes them in parallel: ① target queries initialized with high-quality candidate boxes; ② denoising queries progressively optimized using the cascaded denoising algorithm proposed in this work. The final total loss function consists of a classification loss, a regression loss, and a denoising loss.

[0067] Here, the classification loss is guided by IoU, the regression loss comprises an L1 error loss and a generalized intersection over union (GIoU) loss, and the remaining term is the denoising loss.

[0068] During the inference phase, the denoising branch is removed, retaining only the target queries initialized with high-quality candidate boxes. After refinement by the decoder, the final output is the bounding boxes and class predictions. The predictions are then filtered by a confidence threshold and post-processed with non-maximum suppression.
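The inference-time post-processing can be sketched as confidence filtering followed by greedy non-maximum suppression. The threshold values and the choice of class-wise suppression are assumptions (the source gives neither), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def postprocess(boxes: Sequence[Box], scores: Sequence[float],
                labels: Sequence[int], score_thr: float = 0.3,
                iou_thr: float = 0.5) -> List[int]:
    """Return indices of detections kept after filtering and class-wise NMS."""
    # 1) Drop low-confidence predictions, then visit the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # 2) Keep a box unless a higher-scoring box of the same class overlaps it.
        if all(labels[i] != labels[k] or box_iou(boxes[i], boxes[k]) <= iou_thr
               for k in keep):
            keep.append(i)
    return keep
```

Suppression is still useful here because several SAM candidate boxes can cover the same object, so multiple target queries may converge on one detection.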

[0069] Another embodiment provides a virtual supervised target detection system 300, comprising the following modules: a synthetic dataset module 310, used to synthesize a dataset based on real-world target frequency distributions; a cascaded HQP-DETR detection model module 320, used to construct the cascaded HQP-DETR detection model, which includes a backbone network, a multi-layer Transformer encoder, a Dense Fusion module, and a multi-layer Transformer decoder, wherein the backbone network is used for feature extraction, the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network, the Dense Fusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimension, and inputs them into the multi-layer Transformer decoder, and the multi-layer Transformer decoder outputs bounding box offsets and class predictions; a model training module 330, used to train and test the cascaded HQP-DETR detection model using the synthetic dataset; and a detection module 340, used to input the image to be detected into the trained cascaded HQP-DETR detection model for detection. The multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising. The target query module guides the model to learn low-level visual features that are consistent across domains by injecting image-specific geometric and semantic priors into the target queries. The denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour.

[0070] In addition to the modules described above, the virtual supervised target detection system 300 may also include other components; however, since these components are not relevant to the embodiments of this disclosure, their illustrations and descriptions are omitted herein. Other specific working processes for virtual supervised target detection using the virtual supervised target detection system 300 described above are as described in the above virtual supervised target detection method embodiment and will not be repeated here.

[0071] In another embodiment, the system of the present invention can also be implemented by means of the computing device architecture shown in Figure 4. As shown in Figure 4, the computer system 410 includes a system bus 430, one or more CPUs 440, input/output 420, and memory 450. The memory 450 can store various data or files used for computer processing and/or communication, as well as program instructions executed by the CPU, including those of the virtual supervised target detection method. The architecture shown in Figure 4 is merely exemplary; when implementing different devices, one or more components in Figure 4 can be adjusted according to actual needs. The memory 450, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the virtual supervised target detection method described above in this embodiment of the invention. One or more CPUs 440 execute the various functional applications and data processing of the system of the present invention by running the software programs, instructions, and modules stored in the memory 450. Of course, the processor of the server provided in the embodiments of the present invention is not limited to performing the method operations described above, but can also perform related operations in the virtual supervised target detection method provided in any embodiment of the present invention.

[0072] The memory 450 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on terminal usage. Furthermore, the memory 450 may include high-speed random access memory and non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some instances, the memory 450 may further include memory remotely configured relative to one or more CPUs 440, these remote memories being connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0073] Input / output 420 can be used to receive input digital or character information, and to generate key signal inputs related to user settings and function control of the device. Input / output 420 may also include a display device such as a display screen.

[0074] This invention also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, this computer program implements the virtual supervised target detection method described in the above embodiments. The computer-readable storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0075] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, which can send, propagate, or transmit programs for use by or in connection with an instruction execution system, apparatus, or device.

[0076] The program code contained on the storage medium can be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0077] Furthermore, other specific operational processes of the non-transitory computer-readable storage medium are as described in the above embodiments of the virtual supervised target detection method and will not be repeated here.

[0078] To verify the effectiveness of this invention, Cascade HQP-DETR was trained on two datasets for the ISOD task: FluxVOC (10k images, 20 categories) and FluxCOCO (80k images, 80 categories), whose sizes are comparable to the PASCAL VOC 2012 training and validation sets and the MSCOCO 2014 training set, respectively. To verify the model's generality, it was also trained on the PASCAL VOC 2012 training and validation sets. The trained model was evaluated on the PASCAL VOC 2007 test set, and the reported metrics are mAP@0.5, AP@[0.5:0.95], and AP@0.75. All experiments used ResNet-50 as the backbone network. FluxVOC and FluxCOCO were automatically annotated with Grounding DINO, with a text confidence threshold of 0.30. The encoder and decoder each had 6 layers, and the hidden dimension of the model was 256. For the query encoding mechanism guided by high-quality candidate boxes, the SAM-H model was used to obtain candidate boxes through automatic mask generation (sampling 64 points per side), and the 7×7 RoI Align features were projected into 256-dimensional queries by a two-layer MLP (hidden dimension 4096). For the cascaded denoising algorithm, the threshold of each decoder layer increased linearly from 0.3 to 0.8. The classification loss used Focal Loss (…). The loss weights for the matching branch are …, and the denoising loss is added to the overall objective with coefficient ….

[0079] Evaluation of datasets. Figure 5 shows samples from the FluxVOC dataset. The first column shows various modes of transportation, the second common indoor objects, the third complex human-centric scenes, and the fourth diverse animal categories. The visualization shows that the dataset construction process proposed in this work can generate high-quality images across different domains and effectively handle challenging scenes, including small objects, densely distributed elements, and occlusion.

[0080] To evaluate the quality of the FluxVOC dataset, it was compared with a re-annotated ImaginaryNet and the real-world PASCAL VOC 2012 training and validation set. Since the original ImaginaryNet does not include bounding box annotations, CLIP was first used for class prediction: for each image, the prompt "a photo of a {category}" was constructed for each of the 20 VOC categories, the image and the prompts were encoded with CLIP, and the category with the highest cosine similarity was selected as the predicted category of the image. The predicted category label was then used as the detection prompt, from which Grounding DINO generated the corresponding bounding boxes. The evaluation focused on three aspects: the number and distribution of instances, the diversity of spatial and visual distributions, and the performance of a specific model trained on each dataset.
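The category-selection rule used for re-annotation can be sketched as an argmax over cosine similarities. The embeddings below are stand-ins supplied by the caller; in the described procedure they would come from the CLIP image and text encoders, which this sketch does not call. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def build_prompts(categories: Sequence[str]) -> List[str]:
    """One text prompt per category, following the template in the text."""
    return [f"a photo of a {name}" for name in categories]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_category(image_emb: Sequence[float],
                     text_embs: Sequence[Sequence[float]],
                     categories: Sequence[str]) -> Tuple[str, float]:
    """Pick the category whose prompt embedding is most similar to the image."""
    sims = [cosine(image_emb, t) for t in text_embs]
    best = max(range(len(categories)), key=sims.__getitem__)
    return categories[best], sims[best]
```

The selected category name would then be passed to the open-vocabulary detector as its text prompt.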

[0081] As shown in Table 1, FluxVOC contains 34,750 target instances, a 220.5% increase compared to ImaginaryNet's 10,841 instances, more than three times the size of the latter. Significant growth was achieved across all 20 categories (e.g., "Person" increased by 1204%, "Potted Plant" by 360.4%, and "TV" by 311.2%). Furthermore, the total number of instances in FluxVOC is closer to the 31,561 instances in the real-world VOC2012 training and validation dataset, demonstrating the strong scalability of our synthesis pipeline.

[0082] Table 1: Comparison of Instance Count by Category

[0083] To visually compare the diversity of target spatial distributions across datasets, the bounding box center coordinates were normalized and visualized using kernel density estimation (KDE). As shown in Figure 6 (left), ImaginaryNet exhibits a significant center bias: the density of bounding box centers forms a sharp single peak, indicating that the vast majority of foreground targets are concentrated in a narrow region at the center of the image. This highly homogeneous and overly simple distribution makes the model prone to overfitting to the false prior that targets are located in the center, which significantly reduces its generalization performance in real-world scenes, where targets are often off-center. The PASCAL VOC 2012 dataset, as a real-world dataset, has a more natural spatial distribution. Although it exhibits a certain central tendency, it covers a wider area and has a more balanced density, consistent with the diversity of target locations in real-world scenes. In contrast to ImaginaryNet, the FluxVOC dataset constructed in this work shows a spatial distribution complexity highly consistent with the real dataset. Its bounding box centers cover a wide area, exhibit a multi-peak density, and form multiple distribution "islands" near the image edges. This spatial distribution, which is closer to real-world scenes, can effectively alleviate the model's overfitting to the center prior and significantly improve its cross-domain generalization ability.

[0084] To quantify the quality of different datasets, DAB-DETR was trained on various data sources and evaluated on the PASCAL VOC 2007 test set, as shown in Table 2. When trained only on synthetic data, FluxVOC achieved an mAP@0.5 of 44.92%, 9.78 percentage points higher than the re-annotated ImaginaryNet (35.14%), and it also performed better on the stricter metrics: a mean AP (AP@[0.5:0.95]) of 22.63% (vs. 19.52%) and an AP@0.75 of 19.37% (vs. 17.84%). The difference in dataset quality was even more pronounced when the synthetic data was combined with the PASCAL VOC 2012 dataset. FluxVOC+VOC12 achieved an mAP@0.5 of 74.50%, exceeding ImaginaryNet+VOC12 (68.14%) by 6.36 percentage points and the VOC12-only model (68.16%) by 6.34 percentage points. Notably, mixing ImaginaryNet with real data did not improve performance and in fact lowered it slightly (68.14% vs. 68.16%), whereas FluxVOC acted as an effective source of data augmentation and significantly improved model performance. This comparison indicates that in ISOD tasks, the quality of synthetic data, rather than its quantity, determines its practical value.

[0085] Table 2: Performance comparison of DAB-DETR trained on different data sources, evaluated on the VOC 2007 test set.

[0086] In Table 2, one marker denotes the re-annotated dataset and the other denotes the original dataset.

[0087] Model evaluation in the synthetic domain. The Cascade HQP-DETR model was compared with baseline models based on the DETR architecture. All models were trained on the FluxVOC dataset and evaluated on the PASCAL VOC 2007 test set; the results are shown in Table 3. To compare training efficiency, most baseline models were trained for 30 epochs, while the proposed method was trained for only 12 epochs.

[0088] Cascade HQP-DETR achieved an mAP@0.5 of 62.87%. Despite a 60% reduction in the number of training epochs, this is 5.53 percentage points higher than the best-performing baseline with a ResNet-50 backbone, Stable-DINO (57.34%). Notably, the proposed method also outperforms DEIMv2-S (61.02% mAP@0.5) by 1.85 percentage points, even though DEIMv2-S employs the significantly more powerful DINOv3 backbone, which supports the effectiveness of the proposed architectural changes. The advantage is consistent on the stricter metrics: compared with Stable-DINO, mean AP (AP@[0.5:0.95]) increased from 38.33% to 43.47% (+5.14 percentage points), and AP@0.75 increased from 40.48% to 46.91% (+6.43 percentage points).

[0089] Table 3

[0090] Figure 7 shows the training curves of each model. Cascade HQP-DETR achieved 42.01% mAP@0.5 after the first epoch, a performance comparable to the early stages of the baseline models; by the 12th epoch, the model converged to 62.87% mAP@0.5. In contrast, all baseline models required 30 epochs of training and still reached lower final accuracy. This validates that high-quality candidate box initialization provides effective image-specific priors, allowing the model to focus on learning transferable feature representations rather than exploring the distribution space of the target queries from scratch.

[0091] Figure 8 shows some detection results of the model on real images. The visualizations show that the proposed model generalizes well across various complex and challenging scenarios: it accurately detects riders partially occluded by motorcycles, small-scale sheep in the distance in a field, and densely packed crowds around a dining table. This indicates that FluxVOC, as a synthetic image training set, can rival real-world image training sets, and it also reflects the generalization ability of the Cascade HQP-DETR model.

[0092] Evaluation in the real domain. To verify that the generalization ability of Cascade HQP-DETR is not limited to ISOD, all models were also trained on the VOC 2012 trainval set and evaluated on the VOC 2007 test set, as shown in Table 4. Cascade HQP-DETR achieved an mAP@0.5 of 80.41% and a mean AP of 60.31%, improvements of 1.31 and 2.21 percentage points, respectively, over Stable-DINO. Notably, although DEIMv2-S uses the significantly more powerful DINOv3 backbone network, the proposed method still outperforms it (79.55% mAP@0.5 and 58.94% mean AP). These results indicate that the proposed architectural improvements are applicable not only to ISOD but also to fully supervised object detection.

[0093] Table 4

[0094] Training results on FluxCOCO. Figure 9 shows example images from the FluxCOCO dataset, illustrating its high-quality, high-precision annotation of various targets, including animals, outdoor scenes, indoor objects, and sports equipment.

[0095] The model was trained on the FluxCOCO dataset and tested on the COCO14 validation set. The experimental results are shown in Table 5. Compared with the 11.40% AP50 achieved by ImaginaryNet, Cascade HQP-DETR achieved an AP50 of 31.07%, about 2.7 times as high. Table 6 shows the test results for specific categories.

[0096] Table 5

[0097] Table 6

[0098] In this document, the terms “comprising,” “including,” or any other variations thereof are intended to cover non-exclusive inclusion, such that a step or method that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a step or method.

[0099] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.

Claims

1. A virtual supervised target detection method, characterized in that the method includes the following steps: synthesizing a dataset based on a real-world target frequency distribution; constructing a cascaded HQP-DETR detection model comprising a backbone network, a multi-layer Transformer encoder, a Dense Fusion module, and a multi-layer Transformer decoder, wherein the backbone network is used for feature extraction, the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network, the Dense Fusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimension, and inputs them into the multi-layer Transformer decoder, and the multi-layer Transformer decoder outputs bounding box offsets and class predictions; training and testing the cascaded HQP-DETR detection model using the synthetic dataset; and inputting the image to be detected into the trained cascaded HQP-DETR detection model for detection; wherein the multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising, the target query module guides the model to learn low-level visual features that are consistent across domains by injecting image-specific geometric and semantic priors into the target queries, and the denoising query module dynamically adjusts the training weights of the denoising query features according to the prediction quality, guiding the model to prioritize learning visual features including the target contour.

2. The virtual supervised target detection method according to claim 1, characterized in that synthesizing the dataset includes the following steps: given a category set containing C target categories, dividing it into a common category set, a medium-frequency category set, and a rare category set, and assigning sampling weights to the three tiers based on the real-world target frequency distribution; for each sample to be generated, randomly selecting a number of categories from the category set according to the sampling weight of each category to form a category combination; converting the category combination into a basic prompt template; using a large language model to enhance the basic prompt template with expert prompts to generate an enhanced prompt; inputting the enhanced prompt into a text-to-image model to generate an image; constructing a detection prompt based on the category names in the basic prompt template; and inputting the generated image and the detection prompt into an open-vocabulary detector, which outputs annotation information containing bounding box coordinates and category labels.

3. The virtual supervised target detection method according to claim 1, characterized in that, The method injects specific geometric and semantic priors of the input image into the target query through anchor box initialization based on SAM candidate boxes and semantic-aware priors based on region features.

4. The virtual supervised target detection method according to claim 3, characterized in that the anchor box initialization based on SAM candidate boxes specifically includes: uniformly sampling 64 points along each of the width and height of the image as automatic prompts, and filtering the resulting masks by non-maximum suppression to obtain N category-independent segmentation masks; for each segmentation mask, calculating the minimum bounding rectangle and using it as a candidate box, thus obtaining the candidate box set; and initializing the anchor box parameters of the target queries using the candidate box set.

5. The virtual supervised target detection method according to claim 4, characterized in that, Semantic perception priors based on region features specifically include: Perform RoI Align operation on each candidate bounding box to extract features of a fixed size; The extracted features are projected onto the hidden layer of the decoder through the Neck module, which consists of two linear layers with ReLU activation.

6. The virtual supervised target detection method according to claim 1, characterized in that the denoising query module dynamically adjusts the training weights by progressively increasing the IoU threshold at each layer of the decoder. The threshold of a given layer is determined by an initial threshold, a total increment, and N, the total number of layers in the decoder. Based on the threshold of that layer, a dynamic weight is calculated for each denoising query j in the layer: in each decoder layer, the IoU is calculated between the bounding box reconstructed from the j-th denoising query and its corresponding ground-truth bounding box, and the deviation of this IoU from the layer threshold is mapped to the training weight through the Sigmoid function with a temperature coefficient.
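
The cascaded weighting of claim 6 can be sketched as follows. The exact formulas are not reproduced in the text above, so both functional forms here are assumptions: a linear threshold schedule that rises from the initial threshold at the first layer to the initial threshold plus the total increment at the last layer, and a weight of the form sigmoid((IoU - threshold) / temperature). The numeric values in the usage note are placeholders.

```python
import math


def layer_threshold(layer, num_layers, tau0, delta):
    """Assumed linear schedule: tau0 at layer 1, tau0 + delta at the last layer."""
    if num_layers == 1:
        return tau0
    return tau0 + delta * (layer - 1) / (num_layers - 1)


def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dynamic_weight(iou_value, threshold, temperature):
    """Assumed mapping: sigmoid of the IoU deviation scaled by the temperature."""
    return 1.0 / (1.0 + math.exp(-(iou_value - threshold) / temperature))
```

For a six-layer decoder with an initial threshold of 0.5 and a total increment of 0.25, a denoising query whose reconstructed box has an IoU of 0.6 receives a weight above 0.5 in the first layer and below 0.5 in the last, so later layers emphasize only the better reconstructions.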

7. The virtual supervised target detection method according to claim 1, characterized in that, during the training phase, the cascaded HQP-DETR detection model passes the input image through a CNN backbone network and a Transformer encoder to generate feature maps rich in contextual information; the Transformer decoder processes the target queries and the denoising queries in parallel, isolated from each other by attention masks, using the target query module based on high-quality candidate box encoding and the denoising query module based on cascaded denoising; and during the inference phase, the cascaded HQP-DETR detection model uses only the target queries based on high-quality candidate box encoding, which are refined by the decoder to output the final bounding boxes and class predictions.
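
The query handling of claim 7 can be sketched as follows. This is a minimal sketch under stated assumptions: the denoising queries are placed before the target queries, and the attention mask blocks attention in both directions between the two groups. The claim only states that attention masks provide the isolation, so the exact mask pattern may differ.

```python
def build_attention_mask(num_target, num_denoising):
    """Return a square mask where mask[i][j] is True if query i must NOT attend to j.

    Assumed layout: denoising queries first, then target queries.
    Assumed pattern: no attention across the two groups in either direction.
    """
    total = num_denoising + num_target
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            i_is_dn = i < num_denoising
            j_is_dn = j < num_denoising
            if i_is_dn != j_is_dn:
                mask[i][j] = True
    return mask


def active_queries(target_queries, denoising_queries, training):
    """Both query groups are used in training; only target queries at inference."""
    if training:
        return list(denoising_queries) + list(target_queries)
    return list(target_queries)
```

Because the denoising queries are dropped at inference, the mask is only needed during training.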

8. A virtual supervised target detection system, characterized in that it comprises: a synthetic dataset module, used to synthesize a dataset based on the real-world target frequency distribution; a detection model construction module, used to construct a cascaded HQP-DETR detection model comprising a backbone network, a multi-layer Transformer encoder, a Dense Fusion module, and a multi-layer Transformer decoder, wherein the backbone network is used for feature extraction, the multi-layer Transformer encoder performs contextual modeling on the features output by the backbone network, the Dense Fusion module concatenates the features output by the multi-layer Transformer encoder with the features output by the backbone network, projects them back to the original dimension, and inputs them into the multi-layer Transformer decoder, and the multi-layer Transformer decoder outputs bounding box offsets and class predictions; a model training module, used to train and test the cascaded HQP-DETR detection model using the synthetic dataset; and a detection module, used to input the image to be detected into the trained cascaded HQP-DETR detection model for detection; wherein the multi-layer Transformer decoder embeds a target query module based on high-quality candidate box encoding and a denoising query module based on cascaded denoising; the target query module guides the model to learn cross-domain consistent low-level visual features by injecting image-specific geometric and semantic priors into the target query; and the denoising query module dynamically adjusts the training weights of the denoising query features based on the prediction quality, guiding the model to prioritize learning visual features including the target contour.
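
The Dense Fusion step of claim 8 can be sketched per token as a concatenation followed by a projection. The sketch assumes that the encoder and backbone features have the same number of tokens and the same channel dimension d, and that the projection back to d is a single linear layer; neither detail is stated in the claim.

```python
def dense_fusion(encoder_tokens, backbone_tokens, weight, bias):
    """Concatenate encoder and backbone features per token, then project 2d -> d.

    `weight` has d rows of length 2d and `bias` has length d (assumed shapes).
    """
    fused = []
    for enc, bb in zip(encoder_tokens, backbone_tokens):
        concat = list(enc) + list(bb)  # channel-wise concatenation, length 2d
        fused.append(
            [sum(w * v for w, v in zip(row, concat)) + b for row, b in zip(weight, bias)]
        )
    return fused
```

With d = 2 and a projection that averages the matching channels, `dense_fusion([[1.0, 2.0]], [[3.0, 4.0]], [[0.5, 0, 0.5, 0], [0, 0.5, 0, 0.5]], [0.0, 0.0])` returns `[[2.0, 3.0]]`, which has the same shape as the encoder output and can be passed to the decoder.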

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the virtual supervised target detection method according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having computer instructions stored thereon, characterized in that, when the instructions are executed by a processor, they implement the steps of the virtual supervised target detection method according to any one of claims 1 to 7.