A target detection method and device
By improving the feature extraction and fusion module of the RT-DETR model, and combining lightweight neural networks and cross-scale feature fusion, the problem of poor detection performance on edge devices was solved, and real-time target detection with high accuracy and high frame rate was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-01-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing target detection models suffer from poor detection performance due to hardware differences between cloud servers and edge devices, making it difficult to achieve high-precision and high-frame-rate real-time detection on edge devices.
An improved RT-DETR model is adopted. The feature extraction and feature fusion modules are improved by using a lightweight MobileNetV4 neural network and a cross-scale feature fusion module (CCFM), and the UIB module configuration is optimized with neural architecture search (NAS) to achieve multi-scale feature extraction and fusion.
It improves the detection performance of edge devices, ensures the real-time performance and accuracy of detection, reduces the number of model parameters and floating-point operations, improves computational efficiency, and meets the real-time requirements in resource-constrained environments.
Figure CN122244491A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a target detection method and apparatus.
Background Technology
[0002] Traditional object detection systems typically consist of a cloud server and edge devices. The cloud server trains the object detection model, which is then deployed to the edge devices and used there for object detection. Because the processing unit configurations of the cloud server and the edge devices differ, the model is trained and executed under different conditions on each, which degrades its detection performance.
Summary of the Invention
[0003] In view of this, the purpose of this application is to provide a target detection method and apparatus.
[0004] To achieve the above objectives, embodiments of this application provide a target detection method, including:
[0005] Acquire the image to be tested; extract low-resolution, medium-resolution, and high-resolution features from the image to be tested; fuse the low-resolution, medium-resolution, and high-resolution features to obtain enhanced features; and perform target detection based on the enhanced features to obtain the target detection result.
[0006] Optionally, the extraction of low-resolution features, medium-resolution features, and high-resolution features from the image to be tested includes: The image to be tested is subjected to two-dimensional convolution and depthwise convolution operations in sequence to obtain preliminary image features; The first general-purpose inverted bottleneck module is used to extract features from the preliminary image features to obtain high-resolution features; The high-resolution features are extracted using the second general-purpose inverted bottleneck module to obtain medium-resolution features; The medium-resolution features are subjected to a depthwise convolution operation to obtain low-resolution features.
[0007] Optionally, the low-resolution features, medium-resolution features, and high-resolution features are fused to obtain enhanced features, including: The low-resolution features are adjusted by convolutional layers, then upsampled for size adjustment. The resized low-resolution features are concatenated with the medium-resolution features. The concatenated features are then processed by a partial convolutional layer, followed by feature refinement to obtain fused features. These fused features are then upsampled for size adjustment, and concatenated with the high-resolution features. The concatenated features are then processed by a partial convolutional layer to obtain enhanced high-resolution features. These enhanced high-resolution features are then adjusted by convolutional layers and concatenated with the fused features. After processing by a partial convolutional layer, enhanced medium-resolution features are output. These enhanced medium-resolution features are then processed by convolutional layers and concatenated with the feature-adjusted low-resolution features. Finally, after processing by a partial convolutional layer, enhanced low-resolution features are output.
[0008] Optionally, target detection is performed based on the enhanced features to obtain target detection results, including: The enhanced low-resolution features, enhanced medium-resolution features, and enhanced high-resolution features are input into the RT-DETR decoder and prediction head, and the RT-DETR decoder and prediction head output the predicted target's category, confidence level, and bounding box coordinates.
[0009] Optionally, the partial convolutional layer performs 3×3 convolution on a portion of the input feature map to extract spatial features, while the remaining portion is not processed and is directly retained. The extracted spatial features and the directly retained features are concatenated and then input into a pointwise convolutional layer consisting of two consecutive 1×1 convolutions for processing.
[0010] This application also provides a mine target detection device, including: The acquisition module is used to acquire the image to be tested; The feature extraction module is used to extract low-resolution, medium-resolution, and high-resolution features from the image under test. The feature fusion module is used to fuse the low-resolution features, medium-resolution features and high-resolution features to obtain enhanced features; The detection module is used to perform target detection based on the enhanced features and obtain the target detection result.
[0011] Optionally, the feature extraction module is used to sequentially perform two-dimensional convolution and depthwise convolution operations on the image to be tested to obtain preliminary image features; use a first general inverted bottleneck module to extract features from the preliminary image features to obtain high-resolution features; use a second general inverted bottleneck module to extract features from the high-resolution features to obtain medium-resolution features; and perform depthwise convolution operations on the medium-resolution features to obtain low-resolution features.
[0012] Optionally, a feature fusion module is used to adjust the low-resolution features through convolutional layers, then resize them through upsampling. The resized low-resolution features are then concatenated with the medium-resolution features. The concatenated features are processed by a partial convolutional layer, then refined through another convolutional layer to obtain fused features. The fused features are then resized through upsampling, and concatenated with the high-resolution features. The concatenated features are then processed by a partial convolutional layer to obtain enhanced high-resolution features. The enhanced high-resolution features are then adjusted by convolutional layers and concatenated with the fused features. After processing by a partial convolutional layer, enhanced medium-resolution features are output. The enhanced medium-resolution features are then processed by convolutional layers and concatenated with the feature-adjusted low-resolution features. After processing by a partial convolutional layer, enhanced low-resolution features are output.
[0013] Optionally, a detection module is used to input the enhanced low-resolution features, enhanced medium-resolution features, and enhanced high-resolution features into the RT-DETR decoder and prediction head, and the RT-DETR decoder and prediction head output the predicted target's category, confidence level, and bounding box coordinates.
[0014] Optionally, the partial convolutional layer performs 3×3 convolution on a portion of the input feature map to extract spatial features, while the remaining portion is not processed and is directly retained. The extracted spatial features and the directly retained features are concatenated and then input into a pointwise convolutional layer consisting of two consecutive 1×1 convolutions for processing.
[0015] As can be seen from the above, the target detection method and apparatus provided in this application acquire an image to be tested, extract low-resolution, medium-resolution, and high-resolution features from it, fuse the low-resolution, medium-resolution, and high-resolution features to obtain enhanced features, and perform target detection based on the enhanced features to obtain the target detection result. This application provides a target detection model suitable for edge devices, which can improve the detection effect and ensure real-time detection.
Attached Figure Description
[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a schematic diagram of the method flow of an embodiment of this application; Figure 2 is a schematic diagram of the feature extraction module structure in an embodiment of this application; Figure 3 is a schematic diagram illustrating the instantiation of UIB in an embodiment of this application; Figure 4 is a schematic diagram of the feature fusion module structure in an embodiment of this application; Figure 5 is a schematic diagram of the PConv module structure according to an embodiment of this application; Figure 6 is a schematic diagram of the model architecture of an embodiment of this application; Figure 7 is a schematic diagram of the system architecture of an embodiment of this application; Figure 8 is a schematic diagram of the device structure according to an embodiment of this application; Figure 9 is a block diagram of the electronic device structure according to an embodiment of this application.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.
[0019] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.
[0020] In related technologies, object detection models are generally trained and validated on cloud server processing units (e.g., Graphics Processing Units (GPUs)), while the model runs on edge device AI processing units (e.g., Tensor Processing Units (TPUs)). Because the hardware architectures differ, a model trained on GPUs may run inference significantly more slowly on an AI processing unit owing to unsupported operators, poor computational graph optimization, and memory mismatches. This prevents the full utilization of the AI processing unit's computing power and makes it difficult to meet real-time detection requirements. For example, object detection models based on the YOLO model are lightweight and efficient on GPUs, but when ported to TPUs they must either sacrifice accuracy to maintain speed or fail to meet speed requirements while pursuing accuracy, making it difficult to achieve high-precision and high-frame-rate real-time detection simultaneously on edge devices. Object detection models based on the RT-DETR (Real-Time Detection Transformer) model are generally trained and optimized on NVIDIA GPU clusters. Training commonly uses FP32 full-precision floating-point operations, and the model design prioritizes accuracy and relies on large-scale parallel computing, resulting in high memory consumption. On edge devices, however, an INT8 model is usually used, and the memory bandwidth is usually much lower than that of a GPU. The ResNet backbone network of the RT-DETR model has a large number of parameters, making it difficult to run multiple instances efficiently in the limited memory of edge devices, which degrades the actual detection performance of the object detection model.
[0021] In view of this, the embodiments of this application provide a target detection method, which improves the feature extraction module and feature fusion module based on the RT-DETR model, and provides a lightweight target detection model suitable for edge devices. This model can make full use of the computing power of the artificial intelligence processing unit of the edge device, improve the detection effect of the model, and ensure real-time detection.
[0022] The technical solution of this application will be further described in detail below through specific embodiments.
[0023] As shown in Figure 1, this application provides a target detection method, including: S101: Acquire the image to be tested; In this embodiment, an image acquisition unit acquires the image to be tested in the monitoring area, and the image is input into the target detection model deployed on the edge device. The target detection model then detects targets in the monitoring area.
[0024] In some applications, the method of this application can be applied to scenarios with certain safety requirements, such as mines, utility tunnels, and petrochemical industrial parks. For example, in dimly lit, damp, and structurally complex mines, target detection of personnel and equipment is required to monitor for intrusion, proper use of protective equipment, and normal equipment status; target detection of personnel and equipment in tunnels is also required to monitor the safety of distances between equipment and personnel. The above are merely illustrative examples, and specific application scenarios are not limited.
[0025] S102: Extract low-resolution, medium-resolution, and high-resolution features from the image to be tested; In this embodiment, the target detection model includes a feature extraction module, a feature fusion module, and a detection module. The image to be tested is input into the target detection model. First, the feature extraction module extracts low-resolution features, medium-resolution features, and high-resolution features from the image to be tested. Specifically, this includes: the image to be tested is subjected to two-dimensional convolution and depthwise convolution operations in sequence to obtain preliminary image features; the first universal inverted bottleneck module extracts features from the preliminary image features to obtain high-resolution features; the second universal inverted bottleneck module extracts features from the high-resolution features to obtain medium-resolution features; and a depthwise convolution operation is performed on the medium-resolution features to obtain low-resolution features.
[0026] As shown in Figure 2, the feature extraction module includes a two-dimensional convolutional layer (Conv2d), a depthwise convolutional layer (ConvBN), and a universal inverted bottleneck module (UIB). The image to be tested first undergoes a 1×1 convolution in the two-dimensional convolutional layer to expand the number of channels (for example, expanding the channel dimension from 64 to 128). Spatial features are then extracted through two depthwise convolutional layers to obtain preliminary image features. The first universal inverted bottleneck module extracts features from the preliminary image features to obtain high-resolution features S3. The second universal inverted bottleneck module extracts features from the high-resolution features to obtain medium-resolution features S4. The medium-resolution features S4 are convolved by a depthwise convolutional layer to obtain low-resolution features S5. Multi-scale feature extraction meets the requirements of multi-scale detection.
[0027] In some embodiments, the downsampling rate of high-resolution feature S3 is 8× (e.g., when the input image resolution is 640×640, the feature map resolution is 80×80 after 8× downsampling). High-resolution features contain rich detail information and are suitable for detecting small targets. The downsampling rate of medium-resolution feature S4 is 16× (e.g., when the input image resolution is 640×640, the feature map resolution is 40×40 after 16× downsampling). Medium-resolution features balance detail and semantic information and are used for medium-scale target detection. The downsampling rate of low-resolution feature S5 is 32× (e.g., when the input image resolution is 640×640, the feature map resolution is 20×20 after 32× downsampling). Low-resolution features contain high-level semantic information and can identify the overall outline of the target, making them suitable for large target detection.
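The relationship between downsampling rate and feature map size described above can be checked with a short calculation. This is only an illustrative sketch: the function name and the assumption of a square input that divides evenly by each stride are ours, not part of the application.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Return the side length of each feature map for a square input.

    Assumes the input side is evenly divisible by every stride.
    """
    names = ("S3", "S4", "S5")  # high-, medium-, low-resolution features
    return {name: input_size // stride for name, stride in zip(names, strides)}


sizes = feature_map_sizes(640)
for name, side in sizes.items():
    print(f"{name}: {side}x{side}")
# S3: 80x80
# S4: 40x40
# S5: 20x20
```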
[0028] In some embodiments, the object detection model is based on the RT-DETR model. To address the large number of parameters in the ResNet backbone network and its poor support on edge devices, the ResNet backbone network of the RT-DETR model is replaced with an improved MobileNetV4 lightweight neural network model. This model uses multiple universal inverted bottleneck (UIB) modules to output features at three different resolutions. The UIB module adopts an "expansion-transformation-compression" process: first, the number of channels is expanded through a 1×1 convolution (Conv2d); then spatial features are extracted through a depthwise convolution; finally, the number of channels is compressed through a 1×1 convolution to restore the original dimension, which can significantly reduce the number of parameters. The UIB module uses only regular convolution structures and avoids complex control flow, so it matches the systolic array computation mode of the artificial intelligence processing unit and achieves efficient multi-scale object detection.
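To see why the "expansion-transformation-compression" structure saves parameters, the sketch below counts weights for such a block and for a hypothetical variant whose spatial step is a dense convolution at the same expanded width. The channel count (128), expansion ratio (4), and kernel size (3) are illustrative assumptions, and biases and normalization parameters are ignored.

```python
def uib_params(channels, expand=4, kernel=3):
    """Weights of a 1x1 expand -> depthwise kxk -> 1x1 project block."""
    wide = channels * expand
    expand_1x1 = channels * wide          # 1x1 conv: channels -> wide
    depthwise = kernel * kernel * wide    # one kxk filter per channel
    project_1x1 = wide * channels         # 1x1 conv: wide -> channels
    return expand_1x1 + depthwise + project_1x1


def dense_spatial_params(channels, expand=4, kernel=3):
    """Same block, but with a dense kxk conv at the expanded width (for comparison)."""
    wide = channels * expand
    return channels * wide + kernel * kernel * wide * wide + wide * channels


light = uib_params(128)
heavy = dense_spatial_params(128)
print(light, heavy, f"{light / heavy:.3f}")
```

Under these assumptions the depthwise step contributes only a few thousand weights, while the dense alternative contributes millions, which is where most of the saving comes from.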
[0029] At the same time, Neural Architecture Search (NAS) technology is employed to achieve hardware-independent Pareto-optimal performance, automatically selecting the optimal convolutional kernel size and number of channels on the AI processing unit. NAS automatically optimizes the UIB module configuration on the AI processing unit, enabling the model to improve inference speed while maintaining accuracy. For example, as shown in Figure 3, NAS dynamically optimizes the UIB module and instantiates it as an ExtraDW, IB, ConvNext-Like, or FFN module based on the search result.
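The four instantiations named above can be described by two on/off choices. The sketch below follows the published MobileNetV4 description of UIB, in which two optional depthwise convolutions (one before the expansion, one between expansion and projection) determine the variant; the application itself does not spell this mapping out, and the class and field names here are hypothetical.

```python
from typing import NamedTuple


class UIBConfig(NamedTuple):
    start_dw: bool   # optional depthwise conv before the 1x1 expansion
    middle_dw: bool  # optional depthwise conv between expansion and projection


def uib_variant(cfg):
    """Map the two optional depthwise convolutions to the UIB variant name."""
    table = {
        (True, True): "ExtraDW",
        (False, True): "IB",
        (True, False): "ConvNext-Like",
        (False, False): "FFN",
    }
    return table[(cfg.start_dw, cfg.middle_dw)]


for cfg in (UIBConfig(True, True), UIBConfig(False, True),
            UIBConfig(True, False), UIBConfig(False, False)):
    print(cfg, "->", uib_variant(cfg))
```

A search procedure would pick one such configuration (together with kernel sizes and channel counts) per block; the search itself is outside the scope of this sketch.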
[0030] S103: Feature fusion is performed on low-resolution features, medium-resolution features and high-resolution features to obtain enhanced features; In this embodiment, the feature fusion module of the target detection model is implemented based on the lightweight cross-scale feature fusion module (CCFM module). Low-resolution features, medium-resolution features and high-resolution features of multiple scales are input into the feature fusion module, which then fuses the multi-scale features to obtain enhanced low-resolution features, enhanced medium-resolution features and enhanced high-resolution features that fuse multiple levels of features.
[0031] As shown in Figure 4, in this embodiment, the low-resolution feature S5 is first adjusted by a convolutional layer (Conv), and the adjusted low-resolution feature is upsampled so that its spatial size matches that of the medium-resolution feature S4. The resized low-resolution feature and the medium-resolution feature S4 are concatenated, and the concatenated features are input into a partial convolutional layer (PConv) for efficient processing and then refined by a convolutional layer (Conv) to obtain a fused feature that combines the semantic information of the low-resolution feature S5 and the mid-level information of the medium-resolution feature S4. This fused feature is then upsampled so that its size matches that of the high-resolution feature S3. The resized fused feature is concatenated with the high-resolution feature S3, and the concatenated features are processed by a partial convolutional layer (PConv) to extract key spatial features, yielding an enhanced high-resolution feature for identifying smaller targets and image details. The enhanced high-resolution feature is further convolved to adjust its channels and then concatenated with the fused feature. After processing by a partial convolutional layer, an enhanced medium-resolution feature for identifying medium-sized targets is output. The enhanced medium-resolution feature is then processed by a convolutional layer and concatenated with the adjusted low-resolution feature, and an enhanced low-resolution feature is extracted by a partial convolutional layer. The enhanced low-resolution feature integrates global contextual information, has a lower resolution, and is used to identify larger targets, providing an overall semantic understanding of the image.
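The fusion order described above can be traced at the level of tensor shapes. The sketch below tracks only (channels, height, width) and performs no real convolution. The input channel counts (128, 256, 512), the common width of 256, and the use of stride-2 convolutions on the bottom-up path (needed so the spatial sizes match before concatenation) are assumptions for illustration.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    c: int
    h: int
    w: int


def conv(x, out_c, stride=1):
    """Shape after a convolution that sets channels and optionally downsamples."""
    return Shape(out_c, x.h // stride, x.w // stride)


def upsample(x, factor=2):
    return Shape(x.c, x.h * factor, x.w * factor)


def concat(a, b):
    assert (a.h, a.w) == (b.h, b.w), "spatial sizes must match before concatenation"
    return Shape(a.c + b.c, a.h, a.w)


def pconv(x, out_c):
    """Shape after the partial-convolution block (spatial size preserved)."""
    return Shape(out_c, x.h, x.w)


def fuse(s3, s4, s5, width=256):
    p5 = conv(s5, width)                                          # adjusted low-res
    fused = conv(pconv(concat(upsample(p5), s4), width), width)   # fused feature
    out3 = pconv(concat(upsample(fused), s3), width)              # enhanced high-res
    out4 = pconv(concat(conv(out3, width, stride=2), fused), width)  # enhanced mid-res
    out5 = pconv(concat(conv(out4, width, stride=2), p5), width)     # enhanced low-res
    return out3, out4, out5


outs = fuse(Shape(128, 80, 80), Shape(256, 40, 40), Shape(512, 20, 20))
print(outs)
```

Each enhanced output keeps the spatial size of its corresponding input scale, so the three maps can be passed on to the decoder as a multi-scale set.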
[0032] As shown in Figure 5, the feature fusion module employs a lightweight partial convolutional layer, PConv, which performs 3×3 convolutions (Conv3) on only a portion of the input feature map channels (e.g., the first 1/4) to extract spatial features. The remaining channels are retained without processing, which avoids much of the computational overhead of conventional convolution. The convolved features and the directly retained features are concatenated and then input into a pointwise convolutional layer, PWConv, consisting of two consecutive 1×1 convolutions (Conv). The channel expansion, activation, and channel reduction steps maintain feature integrity, balance accuracy and speed, and efficiently enhance global information interaction and integration across all channels. Since the partial convolutional layer does not introduce any special operators or complex operations, its inference process is implemented entirely by standard convolutional layers, so it is supported and efficiently accelerated on various hardware platforms without customized operator libraries or additional adaptation measures. Compared to standard convolution, it requires fewer floating-point operations (the 3×3 convolution is reduced to 1/16 of an ordinary convolution when 1/4 of the channels are convolved) and less memory access, while still extracting spatial features from the feature map effectively.
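The 1/16 figure follows from the fact that the cost of a convolution scales with the product of its input and output channel counts: convolving 1/4 of the channels gives (1/4)² = 1/16. The sketch below counts multiply-accumulate operations for an example 40×40 map with 256 channels; the map size, channel count, and the expansion factor of 2 in the pointwise stage are illustrative assumptions.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulates of a dense convolution (stride 1, same padding)."""
    return height * width * kernel * kernel * c_in * c_out


def pconv_block_macs(height, width, channels, split=4, expand=2):
    """Return (spatial MACs, pointwise MACs) for a PConv + two 1x1 convs block."""
    c_part = channels // split                      # channels that get the 3x3 conv
    spatial = conv_macs(height, width, c_part, c_part, 3)
    pw_up = conv_macs(height, width, channels, channels * expand, 1)
    pw_down = conv_macs(height, width, channels * expand, channels, 1)
    return spatial, pw_up + pw_down


dense = conv_macs(40, 40, 256, 256, 3)
spatial, pointwise = pconv_block_macs(40, 40, 256)
print(dense // spatial)  # 16
```

The ratio applies to the 3×3 spatial step only; the two 1×1 convolutions that follow add their own cost, which the second return value reports separately.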
[0033] S104: Target detection is performed based on enhanced features to obtain the target detection results.
[0034] In this embodiment, after feature fusion to obtain enhanced features, the RT-DETR decoder is used to perform target detection based on the enhanced features. The RT-DETR encoder establishes global contextual association through a self-attention mechanism, and the decoder outputs end-to-end target detection results based on the enhanced features. The results include a series of predicted target bounding boxes and their corresponding class labels and confidence scores. The entire process requires no complex post-processing (e.g., non-maximum suppression), enabling efficient and real-time target detection.
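Because each decoder query yields at most one prediction, turning the decoder output into detections can be as simple as picking the best class per query and applying a score threshold, with no non-maximum suppression step. The sketch below is a minimal illustration of that idea; the data layout, the threshold value, and the function name are hypothetical and do not describe the actual RT-DETR output format.

```python
def decode_queries(queries, score_threshold=0.5):
    """Keep one (label, score, box) per query whose best class score passes the threshold.

    `queries` is a list of (class_scores, box) pairs, where class_scores is a
    list of per-class confidences and box is (x_min, y_min, x_max, y_max).
    """
    results = []
    for class_scores, box in queries:
        label = max(range(len(class_scores)), key=class_scores.__getitem__)
        score = class_scores[label]
        if score >= score_threshold:
            results.append({"label": label, "score": score, "box": box})
    # Highest-confidence detections first; no NMS is applied.
    return sorted(results, key=lambda r: r["score"], reverse=True)


example = [
    ([0.05, 0.90], (10, 20, 110, 220)),   # confident prediction of class 1
    ([0.30, 0.20], (0, 0, 50, 50)),       # below threshold, dropped
    ([0.70, 0.10], (200, 40, 260, 180)),  # confident prediction of class 0
]
for det in decode_queries(example):
    print(det)
```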
[0035] As shown in Figure 6, in some embodiments, the acquired image to be tested is input into the target detection model of the edge device. A feature extraction module extracts high-resolution features S3, medium-resolution features S4, and low-resolution features S5 with different semantic depths and spatial resolutions from the image. The low-resolution feature S5 is then input separately into an adaptive intra-feature interaction (AIFI) module for feature enhancement before being input into a feature fusion module. The feature fusion module fuses the input low-resolution, medium-resolution, and high-resolution features to obtain enhanced low-resolution, medium-resolution, and high-resolution features. These enhanced features are then input into the RT-DETR decoder and prediction head to predict the target's category, confidence level, and bounding box coordinates, thus achieving target detection.
[0036] As shown in Figure 7, this embodiment provides a mine video safety detection system applied to target detection in mines. The system is configured on edge devices: after the target detection model is trained and optimized on a cloud server, the trained model is deployed on the edge devices. The system includes a perception layer, an edge computing layer, a user interface, an interactive screen, a service layer, and a data layer. The perception layer deploys video acquisition equipment and dust control equipment in the monitoring area. The video acquisition equipment collects video within the monitoring range, and the dust control equipment is used for dust removal within the mine. The edge computing layer uses the target detection model to perform target detection on image frames in the video and issues alarms based on the target detection results. The user interface provides video display and interactive functions. Users can send datasets and update-training commands to the cloud server through the user interface, which improves the model's adaptability to changes in the mine environment, preserves its long-term effectiveness, and extends the algorithm's lifespan. The data layer provides an algorithm database for storing model and algorithm-related data, an alarm database for storing alarm-related data, a user database for storing user-related data, and a video stream database for storing video data. The interaction layer uses vue-router to set up front-end routes and interacts with the service layer via AJAX. The service layer adopts a microservice architecture built on Spring Boot; it is responsible for receiving, storing, and managing structured data uploaded from the edge layer, provides business logic such as user management, device management, and alarm handling, and offers centralized data aggregation and complex business processing.
[0037] The target detection method provided in this application improves the feature extraction and feature fusion modules of the RT-DETR model, offering a target detection model suitable for edge devices. Through hardware-adaptive optimization, the improved model raises the computational efficiency of the AI processing unit and fully leverages its computing power. Through lightweight reconstruction and quantization-aware training, the model maintains high detection accuracy (mAP50 ≥ 89.5%, an accuracy loss of only 0.4%) while achieving high-speed inference (FPS ≥ 26), with inference latency reduced from 53.94 ms to 36.93 ms. This overcomes the performance trade-offs in resource-constrained environments, solves the problem of balancing accuracy and speed on edge devices, and meets real-time detection requirements. With the UIB module and the lightweight PConv design, the model's parameter count is reduced from 32.8M to 13.7M, a reduction of 58.2%, and the number of floating-point operations is reduced from 108G to 36G, a reduction of 66.7%. In testing, the improved target detection model raised precision (P) from 89.3% to 90.6% and recall (R) from 80.4% to 83.2%, ensuring detection reliability while maintaining a lightweight design.
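The percentages quoted above can be reproduced from the before and after values. The short check below also converts the reported latencies into frame rates, under the assumption (ours, not stated in the application) that frames are processed one at a time so that FPS is the reciprocal of per-frame latency.

```python
def reduction_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


def fps_from_latency(latency_ms):
    """Frames per second if each frame takes `latency_ms` and frames run sequentially."""
    return 1000 / latency_ms


print(f"parameters: {reduction_pct(32.8, 13.7):.1f}% fewer")   # 58.2% fewer
print(f"FLOPs:      {reduction_pct(108, 36):.1f}% fewer")      # 66.7% fewer
print(f"latency:    {reduction_pct(53.94, 36.93):.1f}% lower")
print(f"frame rate: {fps_from_latency(53.94):.1f} -> {fps_from_latency(36.93):.1f} FPS")
```

Under that assumption, the 36.93 ms latency corresponds to roughly 27 frames per second, which is consistent with the stated FPS ≥ 26.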
[0038] It should be noted that the method in this embodiment can be executed by a single device, such as a computer or server. The method can also be applied in a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method in this embodiment, and the multiple devices will interact with each other to complete the method described.
[0039] It should be noted that the above description describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims may be performed in a different order than that shown in the embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0040] As shown in Figure 8, this application also provides a target detection device, comprising: an acquisition module, used to acquire the image to be tested; a feature extraction module, used to extract low-resolution, medium-resolution, and high-resolution features from the image to be tested; a feature fusion module, used to fuse the low-resolution features, medium-resolution features, and high-resolution features to obtain enhanced features; and a detection module, used to perform target detection based on the enhanced features and obtain the target detection result.
[0041] For ease of description, the above device is described in terms of function, divided into various modules. Of course, in implementing the embodiments of this application, the functions of each module can be implemented in one or more pieces of software and/or hardware.
[0042] The apparatus described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0043] Figure 9 illustrates a more specific hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.
[0044] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.
[0045] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
[0046] The input/output interface 1030 is used to connect input/output modules to realize information input and output. Input/output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.
[0047] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate by wired means (such as USB or Ethernet cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).
[0048] The bus 1050 includes a pathway for transmitting information between the components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
[0049] It should be noted that although the above-described device shows only the processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in specific implementations the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may include only the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.
[0050] The electronic device described above is used to implement the corresponding methods in the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0051] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
[0052] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this disclosure (including the claims) is limited to these examples. Within the framework of this disclosure, the technical features of the above embodiments or of different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the embodiments of this application as described above, which are not described in detail for the sake of brevity.
[0053] Additionally, to simplify the description and discussion, and to avoid obscuring the embodiments of this application, the well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring the embodiments of this application, and this also takes into account the fact that the details of the implementation of these block diagram apparatuses are highly dependent on the platform on which the embodiments of this application will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuits) have been set forth to describe exemplary embodiments of this disclosure, it will be apparent to those skilled in the art that the embodiments of this application can be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.
[0054] Although this disclosure has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.
[0055] The embodiments of this application are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the embodiments of this application should be included within the protection scope of this disclosure.
Claims
1. A target detection method, characterized by comprising: acquiring an image to be tested; extracting low-resolution, medium-resolution, and high-resolution features from the image to be tested; fusing the low-resolution, medium-resolution, and high-resolution features to obtain enhanced features; and performing target detection based on the enhanced features to obtain a target detection result.
2. The method according to claim 1, characterized in that extracting the low-resolution, medium-resolution, and high-resolution features from the image to be tested comprises: sequentially performing a two-dimensional convolution operation and a depthwise convolution operation on the image to be tested to obtain preliminary image features; extracting features from the preliminary image features using a first universal inverted bottleneck (UIB) module to obtain the high-resolution features; extracting features from the high-resolution features using a second universal inverted bottleneck (UIB) module to obtain the medium-resolution features; and performing a depthwise convolution operation on the medium-resolution features to obtain the low-resolution features.
3. The method according to claim 1, characterized in that fusing the low-resolution, medium-resolution, and high-resolution features to obtain the enhanced features comprises: adjusting the low-resolution features through convolutional layers and then upsampling them to adjust their size; concatenating the resized low-resolution features with the medium-resolution features, processing the concatenated features through a partial convolutional layer, and then performing feature refinement to obtain fused features; upsampling the fused features to adjust their size, concatenating them with the high-resolution features, and processing the concatenated features through a partial convolutional layer to obtain enhanced high-resolution features; adjusting the enhanced high-resolution features through convolutional layers, concatenating them with the fused features, and processing the result through a partial convolutional layer to output enhanced medium-resolution features; and processing the enhanced medium-resolution features through convolutional layers, concatenating them with the convolution-adjusted low-resolution features, and processing the result through a partial convolutional layer to output enhanced low-resolution features.
4. The method according to claim 3, characterized in that performing target detection based on the enhanced features to obtain the target detection result comprises: inputting the enhanced low-resolution features, enhanced medium-resolution features, and enhanced high-resolution features into the RT-DETR decoder and prediction head, which output the category, confidence level, and bounding box coordinates of the predicted target.
5. The method according to claim 3, characterized in that the partial convolutional layer performs a 3×3 convolution on a portion of the input feature map to extract spatial features, while the remaining portion is retained without processing; the extracted spatial features and the directly retained features are then concatenated and processed by a pointwise convolutional layer consisting of two consecutive 1×1 convolutions.
6. A target detection device for mines, characterized by comprising: an acquisition module, used to acquire an image to be tested; a feature extraction module, used to extract low-resolution, medium-resolution, and high-resolution features from the image to be tested; a feature fusion module, used to fuse the low-resolution, medium-resolution, and high-resolution features to obtain enhanced features; and a detection module, used to perform target detection based on the enhanced features to obtain a target detection result.
7. The device according to claim 6, characterized in that the feature extraction module is used to: sequentially perform a two-dimensional convolution operation and a depthwise convolution operation on the image to be tested to obtain preliminary image features; extract features from the preliminary image features using a first universal inverted bottleneck (UIB) module to obtain the high-resolution features; extract features from the high-resolution features using a second universal inverted bottleneck (UIB) module to obtain the medium-resolution features; and perform a depthwise convolution operation on the medium-resolution features to obtain the low-resolution features.
8. The device according to claim 6, characterized in that the feature fusion module is used to: adjust the low-resolution features through convolutional layers and then upsample them to adjust their size; concatenate the resized low-resolution features with the medium-resolution features, process the concatenated features through a partial convolutional layer, and then refine them through another convolutional layer to obtain fused features; upsample the fused features to adjust their size, concatenate them with the high-resolution features, and process the concatenated features through a partial convolutional layer to obtain enhanced high-resolution features; adjust the enhanced high-resolution features through convolutional layers, concatenate them with the fused features, and process the result through a partial convolutional layer to output enhanced medium-resolution features; and process the enhanced medium-resolution features through convolutional layers, concatenate them with the low-resolution features, and process the result through a partial convolutional layer to output enhanced low-resolution features.
9. The device according to claim 8, characterized in that the detection module is used to input the enhanced low-resolution features, enhanced medium-resolution features, and enhanced high-resolution features into the RT-DETR decoder and prediction head, which output the category, confidence level, and bounding box coordinates of the predicted target.
10. The device according to claim 8, characterized in that the partial convolutional layer performs a 3×3 convolution on a portion of the input feature map to extract spatial features, while the remaining portion is retained without processing; the extracted spatial features and the directly retained features are then concatenated and processed by a pointwise convolutional layer consisting of two consecutive 1×1 convolutions.
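For illustration only, the partial convolutional layer recited in claims 5 and 10 can be sketched in plain Python on nested lists. This is a minimal sketch under stated assumptions, not the implementation of this application: it assumes the "portion" of the input feature map is the first `n_conv` channels, uses zero padding so the spatial size is preserved, and omits bias, normalization, and activation. The function names `conv3x3_same`, `conv1x1`, and `partial_conv`, and all weight arguments, are hypothetical.

```python
from typing import List

Channel = List[List[float]]   # one H x W plane
FeatureMap = List[Channel]    # C planes


def conv3x3_same(channels: FeatureMap, kernels: List[List[Channel]]) -> FeatureMap:
    """3x3 convolution with zero padding; kernels[o][i] maps input i to output o."""
    h, w = len(channels[0]), len(channels[0][0])
    out = []
    for per_output in kernels:
        plane = [[0.0] * w for _ in range(h)]
        for kernel, chan in zip(per_output, channels):
            for y in range(h):
                for x in range(w):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < h and 0 <= xx < w:  # zero padding
                                plane[y][x] += kernel[dy + 1][dx + 1] * chan[yy][xx]
        out.append(plane)
    return out


def conv1x1(channels: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Pointwise convolution; weights[o][i] mixes input channel i into output o."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wt * chan[y][x] for wt, chan in zip(row, channels))
          for x in range(w)] for y in range(h)]
        for row in weights
    ]


def partial_conv(features: FeatureMap, n_conv: int,
                 k3: List[List[Channel]],
                 w1: List[List[float]], w2: List[List[float]]) -> FeatureMap:
    # 3x3 convolution on one portion (assumed: the first n_conv channels).
    convolved = conv3x3_same(features[:n_conv], k3)
    # The remaining portion is retained without processing.
    kept = features[n_conv:]
    # Concatenate, then apply two consecutive 1x1 (pointwise) convolutions.
    merged = convolved + kept
    return conv1x1(conv1x1(merged, w1), w2)
```

Because only part of the input passes through the 3×3 convolution, the spatial filtering cost scales with the size of that portion rather than with the whole feature map, while the two 1×1 convolutions let information from both portions mix.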