Target detection method and device, and non-transitory storage medium

By performing resolution processing and joint feature image analysis on the input image, the problem of existing algorithms being unable to detect small targets is solved, and high-precision small target detection and attribute information extraction are achieved.

CN116363442B · Active · Publication Date: 2026-06-23 · TSINGHUA UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2021-12-23
Publication Date: 2026-06-23


IPC: G06V10/774; G06V10/764; G06V10/25; G06V10/82; G06N3/048; G06N3/0464; G06N3/08

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods

Technology Topics

Image resolution; Radiology




Abstract

A target detection method, a target detection device and a non-transitory storage medium. The target detection method comprises: obtaining an input image; obtaining initial feature images of multiple levels with different resolutions based on the input image; performing scale processing on the initial feature images of the multiple levels to obtain multiple intermediate feature images with the same resolution; performing joint processing on the multiple intermediate feature images to obtain a joint feature image; performing region nomination processing based on the joint feature image to determine a candidate target object and a first candidate box of the candidate target object; extracting attribute information of the candidate target object and determining a second candidate box of the candidate target object based on the joint feature image and the first candidate box; and performing filtering processing on the candidate target object based on the attribute information and the second candidate box to obtain a final target object and a detection box of the final target object.


Description

Technical Field

[0001] Embodiments of this disclosure relate to a target detection method, a target detection device, and a non-transitory storage medium.

Background Technology

[0002] The goal of computer vision research is to use computers to realize the human functions of perception, recognition, and understanding of the objective world. Object detection, as one of the core research topics in computer vision, has received widespread attention in theoretical research and has broad application prospects. Object detection technology integrates cutting-edge technologies from many fields such as object detection, pattern recognition, artificial intelligence, and computer vision, and has been widely applied in intelligent transportation systems, intelligent monitoring systems, human-computer interaction, autonomous driving, image retrieval, and intelligent robots.

Summary of the Invention

[0003] At least some embodiments of this disclosure provide a target detection method, comprising: acquiring an input image; obtaining initial feature images of multiple levels with different resolutions based on the input image; performing scale processing on the initial feature images of the multiple levels to obtain multiple intermediate feature images of the same resolution; performing joint processing on the multiple intermediate feature images to obtain a joint feature image; performing region nomination processing based on the joint feature image to determine a candidate target object and a first candidate box of the candidate target object; extracting attribute information of the candidate target object and determining a second candidate box of the candidate target object based on the joint feature image and the first candidate box; and performing filtering processing on the candidate target object based on the attribute information and the second candidate box to obtain a final target object and a detection box of the final target object.

[0004] For example, in some embodiments of the target detection method provided in this disclosure, obtaining the input image includes: obtaining an original input image; and preprocessing the original input image to obtain the input image, wherein the preprocessing includes at least one of cropping and resolution conversion processing.

[0005] For example, in some embodiments of the target detection method provided in this disclosure, obtaining initial feature images of multiple levels with different resolutions based on the input image includes: performing M consecutive analysis processes based on the input image to obtain M sets of initial feature images with different resolutions; and selecting N sets of initial feature images from the M sets of initial feature images as the initial feature images of the multiple levels; wherein M and N are both positive integers, and M≥N≥2.

[0006] For example, in some embodiments of the target detection method provided in this disclosure, each of the M analysis processes includes convolution processing, and the resolution of the output of each analysis process decreases sequentially.

[0007] For example, in some embodiments of the target detection method provided in this disclosure, performing the scale processing on the initial feature images of the multiple levels to obtain the multiple intermediate feature images with the same resolution includes: in response to the resolution of the initial feature image of any level being greater than a predetermined resolution, downsampling that initial feature image to obtain the corresponding intermediate feature image; in response to the resolution of the initial feature image of any level being equal to the predetermined resolution, using that initial feature image as the corresponding intermediate feature image; and in response to the resolution of the initial feature image of any level being less than the predetermined resolution, upsampling that initial feature image to obtain the corresponding intermediate feature image.
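The three-way branch in the paragraph above can be sketched as follows. This is a minimal illustration and not the disclosed implementation: the function names are hypothetical, nearest-neighbour resampling stands in for whatever downsampling or upsampling layer an embodiment uses, and "resolution" is assumed to be compared by pixel count.

```python
def resize_nearest(image, out_h, out_w):
    """Resample a 2-D list-of-lists image with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_common_resolution(feature, target_h, target_w):
    """Return (intermediate_feature, action) for one initial feature image."""
    h, w = len(feature), len(feature[0])
    if (h, w) == (target_h, target_w):
        return feature, "keep"  # already at the predetermined resolution
    # Assumption: resolution is compared via pixel count (height * width).
    action = "downsample" if h * w > target_h * target_w else "upsample"
    return resize_nearest(feature, target_h, target_w), action


# Three hypothetical levels: 8x8, 4x4 and 2x2 feature images.
levels = [
    [[float(r * size + c) for c in range(size)] for r in range(size)]
    for size in (8, 4, 2)
]
results = [to_common_resolution(f, 4, 4) for f in levels]
actions = [action for _, action in results]
shapes = [(len(f), len(f[0])) for f, _ in results]
print(actions, shapes)
```

After this step every level has the same height and width, which is what allows the later joint processing to combine them.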

[0008] For example, in some embodiments of the target detection method provided in this disclosure, performing the region nomination processing based on the joint feature image to determine the candidate target object and the first candidate box of the candidate target object includes: performing the region nomination processing using a region nomination network based on the joint feature image to determine the candidate target object and the first candidate box of the candidate target object.

[0009] For example, in some embodiments of the target detection method provided in this disclosure, extracting attribute information of the candidate target object and determining a second candidate box of the candidate target object based on the joint feature image and the first candidate box includes: analyzing and processing the joint feature image to obtain a first feature image; determining the nomination region corresponding to the first candidate box on the first feature image as a first region of interest, performing region of interest pooling processing on the first region of interest to obtain a second feature image; and extracting attribute information of the candidate target object and determining a second candidate box of the candidate target object based on the second feature image.
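Region of interest pooling, as mentioned in the paragraph above, converts a variable-sized region into a fixed-size feature image. The sketch below shows one common form (max pooling over a grid of bins) in plain Python. The function names, the box convention, and the bin-boundary rule are assumptions for illustration; the disclosure does not specify them here.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature, box, out_h, out_w):
    """Max-pool the region box = (x0, y0, x1, y1) of a 2-D feature image
    into a fixed out_h x out_w grid (x1 and y1 are exclusive)."""
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        r_start = y0 + (i * h) // out_h
        r_end = y0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + (j * w) // out_w
            c_end = x0 + ceil_div((j + 1) * w, out_w)
            # Each output cell is the maximum over its bin of the region.
            row.append(max(
                feature[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 "first feature image" and a 4x4 region of interest.
feature = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
second_feature = roi_max_pool(feature, box=(1, 0, 5, 4), out_h=2, out_w=2)
print(second_feature)
```

Because the output grid size is fixed, regions of different sizes yield feature images of identical shape, which suits a downstream classifier or box regressor with a fixed input size.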

[0010] For example, in some embodiments of the target detection method provided in this disclosure, the attribute information includes first attribute information, and performing the filtering processing on the candidate target object based on the attribute information and the second candidate box to obtain the final target object and the detection box of the final target object includes: determining the region corresponding to the second candidate box on the input image as a second region of interest, and extracting the second region of interest and its neighboring regions from the input image to obtain an intermediate input image; classifying the intermediate input image according to the first attribute information to determine second attribute information of the intermediate input image; in response to the first attribute information being consistent with the second attribute information, using the candidate target object as the final target object, and using the second candidate box of the candidate target object as the detection box of the final target object; and in response to the first attribute information being inconsistent with the second attribute information, filtering out the candidate target object and the second candidate box of the candidate target object.
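The consistency check described above can be sketched as follows. Everything named here is hypothetical: `Candidate`, `expand_box`, `filter_candidates`, the `margin` that models "neighboring regions", and especially `toy_classifier`, which is a trivial stand-in for the classification network an embodiment would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class Candidate:
    first_attribute: str  # attribute extracted together with the second box
    second_box: Box       # refined (second) candidate box


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow the box by `margin` pixels per side, clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def filter_candidates(candidates: List[Candidate],
                      classify_crop: Callable[[Box], str],
                      image_size: Tuple[int, int],
                      margin: int = 8) -> List[Candidate]:
    """Keep candidates whose re-classified attribute matches the first one."""
    width, height = image_size
    kept = []
    for cand in candidates:
        crop_box = expand_box(cand.second_box, margin, width, height)
        second_attribute = classify_crop(crop_box)
        if second_attribute == cand.first_attribute:
            kept.append(cand)  # consistent: second box becomes detection box
    return kept


def toy_classifier(crop_box: Box) -> str:
    # Hypothetical stand-in: labels wide crops "car" and tall crops "truck".
    x0, y0, x1, y1 = crop_box
    return "car" if (x1 - x0) >= (y1 - y0) else "truck"


candidates = [Candidate("car", (10, 10, 40, 30)),
              Candidate("car", (50, 5, 60, 45))]
final = filter_candidates(candidates, toy_classifier, image_size=(100, 100))
print([c.second_box for c in final])
```

The second classification looks at the input image with extra context around the box, so a mismatch between the two attributes is treated as evidence of a false detection.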

[0011] At least some embodiments of this disclosure also provide a target detection apparatus, including: a memory for non-transitory storage of computer-readable instructions; and a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform a target detection method provided in any embodiment of this disclosure.

[0012] At least some embodiments of this disclosure also provide a non-transitory storage medium for storing computer-readable instructions in a non-transitory manner, wherein when the non-transitory computer-readable instructions are executed by a computer, the target detection method provided in any embodiment of this disclosure can be executed.

Attached Figure Description

[0013] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings of the embodiments will be briefly described below. Obviously, the drawings described below only relate to some embodiments of this disclosure and are not intended to limit this disclosure.

[0014] Figure 1 is a flowchart of a target detection method provided by at least some embodiments of this disclosure;

[0015] Figure 2 is an exemplary flowchart of step S100 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0016] Figure 3 is a schematic diagram of an original input image and its corresponding plurality of input images, provided by at least some embodiments of this disclosure;

[0017] Figure 4 is an exemplary flowchart of step S200 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0018] Figure 5 is an exemplary network architecture flowchart of steps S200, S300, and S400 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0019] Figure 6 is an exemplary flowchart of step S300 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0020] Figure 7 is an exemplary network architecture flowchart of step S500 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0021] Figure 8 is an exemplary flowchart of step S600 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0022] Figure 9 is an exemplary network architecture flowchart of step S600 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0023] Figure 10 is an exemplary network architecture flowchart of step S700 shown in Figure 1, provided by at least some embodiments of this disclosure;

[0024] Figure 11 is a schematic diagram of a spiral cyclic learning rate setting provided by at least some embodiments of this disclosure;

[0025] Figure 12 is a schematic block diagram of a target detection device provided by at least some embodiments of this disclosure; and

[0026] Figure 13 is a schematic diagram of a non-transitory storage medium provided by at least some embodiments of this disclosure.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the described embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.

[0028] Unless otherwise defined, the technical or scientific terms used in this disclosure shall have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms “first,” “second,” and similar terms used in this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Similarly, the terms “an,” “a,” or “the,” and similar terms do not indicate a quantity limitation, but rather indicate the presence of at least one. The terms “including,” “comprising,” or “containing,” and similar terms mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. The terms “connected,” “linked,” or similar terms are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. The terms “upper,” “lower,” “left,” and “right,” etc., are used only to indicate relative positional relationships, and these relative positional relationships may change accordingly when the absolute position of the described objects changes.

[0029] The present disclosure will now be described through several specific embodiments. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components may be omitted. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or similar reference numerals in each drawing.

[0030] Object detection refers to the process of accurately locating objects in an image and identifying their category. Small object detection refers to locating and identifying objects in an image that contain only a small number of pixels; for example, in some cases, if an object in an image is smaller than, say, 32*32 pixels, it can be considered a small object. The above examples of small object definitions are illustrative, and this disclosure includes, but is not limited to, these. It is understood that objects in an image other than small objects (i.e., objects containing a large number of pixels) can be considered large objects, and this disclosure does not specifically classify large objects. It should be noted that small object detection has extremely high application prospects in areas such as vehicle recognition from the perspective of drones, road sign recognition in autonomous driving, and identification of personal belongings in the security field.

[0031] With the widespread application of deep learning technology in computer vision, convolutional neural networks (CNNs) have become increasingly popular in object detection tasks due to their ability to significantly improve accuracy. Common object detection algorithms / models include, but are not limited to, region-proposal-based CNNs such as R-CNN (Region-based Convolutional Neural Networks), SPP-net (Spatial Pyramid Pooling-net), Fast R-CNN, Faster R-CNN, and R-FCN (Region-based Fully Convolutional Networks), as well as end-to-end CNNs such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). However, these algorithms / models are primarily designed for detecting large objects and are typically inefficient in detecting small objects in images. Furthermore, the output of these algorithms / models usually only includes the object category and its bounding box, which is not conducive to further analysis of the detected objects.

[0032] This disclosure provides at least some embodiments of a target detection method. The target detection method includes: acquiring an input image; obtaining multiple initial feature images of different resolutions based on the input image; scaling the multiple initial feature images to obtain multiple intermediate feature images of the same resolution; jointly processing the multiple intermediate feature images to obtain a joint feature image; performing region nomination processing based on the joint feature image to determine candidate target objects and first candidate bounding boxes of the candidate target objects; extracting attribute information of the candidate target objects and determining second candidate bounding boxes of the candidate target objects based on the joint feature image and the first candidate bounding boxes; and filtering the candidate target objects based on the attribute information and the second candidate bounding boxes to obtain a final target object and its detection bounding box.

[0033] At least some embodiments of this disclosure also provide a target detection device and a non-transitory storage medium corresponding to the target detection method described above.

[0034] The target detection method provided in this disclosure can extract attribute information of the detected target object while performing target detection. This not only improves detection accuracy but also facilitates obtaining more comprehensive detection results for further analysis. Furthermore, this target detection algorithm can effectively detect small target objects in the input image.

[0035] It should be noted that, in this disclosure, processing operations such as convolution, downsampling, and upsampling can be executed or implemented through layers such as convolutional layers, downsampling layers, and upsampling layers, respectively. Correspondingly, these layers can also be used to refer to the corresponding processing operations, which will not be repeated below.

[0036] The following detailed description, with reference to the accompanying drawings, outlines some embodiments and examples of this disclosure. It should be understood that the specific implementations described herein are for illustrative and explanatory purposes only and are not intended to limit the scope of this disclosure.

[0037] Figure 1 is a flowchart of a target detection method provided by at least some embodiments of this disclosure. For example, this target detection method can be applied to a computing device, which includes any electronic device with computing capabilities, such as a smartphone, laptop, tablet, desktop computer, or server; the embodiments of this disclosure are not limited thereto. For example, the computing device has a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU), and also includes memory. This memory is, for example, non-volatile memory (e.g., read-only memory, ROM) on which operating system code is stored. For example, the memory also stores code or instructions which, when run, implement the target detection method provided in the embodiments of this disclosure.

[0038] Figure 2 is an exemplary flowchart of step S100 shown in Figure 1; Figure 3 is a schematic diagram of an original input image and its corresponding plurality of input images; Figure 4 is an exemplary flowchart of step S200 shown in Figure 1; Figure 5 is an exemplary network architecture flowchart of steps S200, S300, and S400 shown in Figure 1; Figure 6 is an exemplary flowchart of step S300 shown in Figure 1; Figure 7 is an exemplary network architecture flowchart of step S500 shown in Figure 1; Figure 8 is an exemplary flowchart of step S600 shown in Figure 1; Figure 9 is an exemplary network architecture flowchart of step S600 shown in Figure 1; and Figure 10 is an exemplary network architecture flowchart of step S700 shown in Figure 1, each provided by at least some embodiments of this disclosure. The target detection method shown in Figure 1 is described in detail below with reference to Figures 2 to 10, but this description should not be construed as limiting the embodiments of this disclosure.

[0039] For example, as shown in Figure 1, the target detection method includes the following steps S100 to S700.

[0040] Step S100: Obtain the input image.

[0041] For example, the input image may include photos captured by cameras such as drone cameras, traffic cameras, smartphone cameras, tablet cameras, personal computer cameras, digital camera lenses, surveillance cameras, or webcams. These images may include pictures of people, animals, landscapes, and various objects (e.g., vehicles). The input image may also include a target object to be detected, which may include, but is not limited to, people, animals, buildings, and vehicles. Alternatively, the input image may be an image obtained by preprocessing the aforementioned captured photos (i.e., the original input image).

[0042] Step S100 is described in detail below with reference to Figure 2 and Figure 3, but this description should not be construed as limiting the embodiments of this disclosure.

[0043] For example, as shown in Figure 2, obtaining the input image, i.e., step S100, may include the following steps S110 and S120.

[0044] Step S110: Obtain the original input image.

[0045] For example, the original input image can be a photograph captured as described above; it should be noted that the embodiments of this disclosure include, but are not limited to, this.

[0046] Step S120: Preprocess the original input image to obtain the input image, wherein the preprocessing includes at least one of cropping and resolution conversion.

[0047] For example, the original input image is typically large in size (resolution), and processing it directly places high demands on the hardware capabilities of the computing device. In this case, as shown in Figure 3, the original input image can be cropped to obtain multiple input images, each of which can be used separately for target detection processing, thereby reducing the computational resources consumed in the target detection process. For example, the multiple input images may or may not overlap; they may be identical, partially identical, or different in size; and their centers may be uniformly or non-uniformly distributed in the original input image. The embodiments of this disclosure do not limit any of these choices.
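One possible cropping scheme consistent with the paragraph above is a sliding window with overlap. The sketch below only computes the crop windows; the function name `tile_boxes`, the square tile shape, and the choice to add a final window flush with the image border are assumptions, since the disclosure leaves size, overlap, and placement open.

```python
def tile_boxes(width, height, tile, stride):
    """Return (x0, y0, x1, y1) crop windows that together cover the image.

    Windows are tile x tile (smaller only if the image itself is smaller).
    A stride below the tile size makes neighbouring windows overlap.
    """
    def starts(length):
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile + 1, stride))
        if positions[-1] != length - tile:
            positions.append(length - tile)  # extra window flush with border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Hypothetical 1000x600 original input image cut into 512x512 input images.
boxes = tile_boxes(width=1000, height=600, tile=512, stride=384)
for box in boxes:
    print(box)
```

Overlapping windows reduce the chance that a small target is split across a crop boundary, at the cost of processing some pixels more than once.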

[0048] For example, the size of the input image can be set according to actual needs, and the embodiments of this disclosure do not limit this. For example, in some embodiments, the above-mentioned cropping process can be performed according to the size requirements of the input image; for example, in other embodiments, if the size of the cropped image does not match the size requirements of the input image, the cropped image can be subjected to resolution conversion processing (i.e., scaling processing) to obtain the input image. For example, in still other embodiments, if the size (resolution) of the original input image is relatively small, the original input image can also be directly subjected to resolution conversion processing (i.e., scaling processing) to obtain the input image.

[0049] For example, in some examples, the scaling process can be uniform scaling, meaning that the width and height of the image are scaled using the same proportional factor; in other examples, the scaling process can be non-uniform scaling, meaning that the width and height of the image are scaled using different proportional factors. It should be noted that the scaling factor can be set according to actual needs, and the embodiments of this disclosure do not limit this. For example, the scaling process can be implemented using interpolation algorithms, and the embodiments of this disclosure include, but are not limited to, these. For example, interpolation algorithms can include, but are not limited to, interpolation, bilinear interpolation, and bicubic interpolation.
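To make the difference between uniform and non-uniform scaling concrete, the following sketch resizes a tiny image with bilinear interpolation in plain Python. The function names and the corner-aligned coordinate mapping are assumptions chosen for simplicity; image libraries often use a different sampling convention.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D list-of-lists image with bilinear interpolation.

    Assumption: corner-aligned mapping, i.e. the first and last output
    pixels coincide with the first and last input pixels.
    """
    in_h, in_w = len(image), len(image[0])
    out = []
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def scale(image, factor_h, factor_w):
    """Scale height and width by separate proportional factors."""
    in_h, in_w = len(image), len(image[0])
    return resize_bilinear(image, round(in_h * factor_h), round(in_w * factor_w))


image = [[0.0, 10.0], [20.0, 30.0]]
uniform = scale(image, 1.5, 1.5)      # same factor for height and width
non_uniform = scale(image, 1.0, 1.5)  # only the width is stretched
print(uniform)
print(non_uniform)
```

Uniform scaling preserves the aspect ratio of objects, while non-uniform scaling distorts it, which may matter when target shape is a useful cue.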

[0050] It is understood that, in some embodiments, preprocessing may further include image denoising of the input image to remove irrelevant or noise information in the input image, so as to better perform target detection processing on the input image.

[0051] For example, in some embodiments, the input image can be a color image. For example, a color image includes, but is not limited to, a color image with three color channels. For example, the three color channels include a first color channel, a second color channel, and a third color channel. For example, the three color channels correspond to the three primary colors. For example, in some examples, the first color channel is the red (R) channel, the second color channel is the green (G) channel, and the third color channel is the blue (B) channel; that is, the aforementioned color image can be an RGB format color image. It should be noted that the embodiments of this disclosure include, but are not limited to, this. For example, in other embodiments, the input image can also be a grayscale image.

[0052] In the following, the embodiments of this disclosure are described by taking as an example the case where the original input image is a large-scale aerial image captured from a drone's viewpoint (i.e., an overhead view), such as the original input image shown in Figure 3, and the target object to be detected is a vehicle. This example should not be construed as limiting this disclosure.

[0053] Step S200: Based on the input image, obtain initial feature images of multiple levels with different resolutions.

[0054] For example, multiple consecutive analysis processes can be performed on the input image to extract multiple sets of initial feature images with different resolutions (e.g., arranged from high to low resolution), each set of initial feature images corresponding to a level; then, several sets of initial feature images are selected from these multiple sets of initial feature images as the initial feature images for multiple levels in step S200. It is understood that in these multiple consecutive analysis processes, the output of each analysis process is a set of initial feature images, the input of the first analysis process is the input image, and except for the first analysis process, the input of each subsequent analysis process is the output of the previous analysis process.

[0055] Step S200 is described in detail below with reference to Figure 4 and Figure 5, but this description should not be construed as limiting the embodiments of this disclosure.

[0056] For example, as shown in Figure 4, obtaining initial feature images of multiple levels with different resolutions based on the input image, i.e., step S200, may include the following steps S210 and S220.

[0057] Step S210: Based on the input image, perform M consecutive analysis processes to obtain M sets of initial feature images with different resolutions; and

[0058] Step S220: Select N sets of initial feature images from M sets of initial feature images as initial feature images for multiple levels, where M and N are both positive integers, and M≥N≥2.

[0059] For example, in step S210, the analysis processing may typically include convolution processing, activation processing, and downsampling processing. For example, the analysis processing may further include normalization processing. It should be noted that the embodiments of this disclosure are not limited in this regard. For example, the resolution of the output of each analysis processing step decreases sequentially.

[0060] Convolution processing can be implemented through convolutional layers. A convolutional layer applies several convolution kernels (also called filters) to an input image to extract various types of features, with each kernel extracting one type of feature. Convolution kernels are typically initialized as matrices of small random values, and appropriate weights are learned during the training of the convolutional neural network. The result obtained after applying a convolution kernel to the input image is called a feature image, and the number of feature images is equal to the number of convolution kernels.
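The relationship "one kernel, one feature image" can be illustrated with a minimal single-channel sketch. The helper names, the 3x3 kernel size, the random initialization range, and the absence of padding and bias are all assumptions made for brevity; as in most CNN libraries, the kernel is applied without flipping.

```python
import random


def conv2d_valid(image, kernel):
    """Apply one kernel to a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def conv_layer(image, kernels):
    """Return one feature image per kernel."""
    return [conv2d_valid(image, k) for k in kernels]


# Four 3x3 kernels initialized with small random values (before training).
rng = random.Random(0)
kernels = [
    [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
    for _ in range(4)
]
image = [[float((r + c) % 5) for c in range(6)] for r in range(6)]
feature_images = conv_layer(image, kernels)
print(len(feature_images), len(feature_images[0]), len(feature_images[0][0]))
```

With a 6x6 input and 3x3 kernels, each of the four feature images is 4x4; training would then adjust the kernel values so that each one responds to a useful pattern.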

[0061] Activation processing can be implemented through activation layers, which include activation functions. These activation functions introduce nonlinearity into the convolutional neural network (CNN), enabling it to better solve more complex problems. Activation functions can include ReLU (Rectified Linear Unit), sigmoid, or tanh functions. ReLU is a non-saturating nonlinear function, while sigmoid and tanh are saturating nonlinear functions. For example, an activation layer can be a standalone layer in a convolutional neural network, or it can be included within a convolutional layer.
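The difference between saturating and non-saturating activation functions can be seen numerically. The short sketch below uses only the standard library; the sample inputs are arbitrary.

```python
import math


def relu(x):
    """Rectified Linear Unit: passes positive values, zeroes negative ones."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-3.0, 0.0, 10.0, 20.0):
    print(f"x={x:6.1f}  relu={relu(x):6.1f}  "
          f"sigmoid={sigmoid(x):.6f}  tanh={math.tanh(x):.6f}")
```

For large positive inputs, sigmoid and tanh flatten out near 1, so their gradients become very small, whereas ReLU keeps growing linearly. That is the sense in which ReLU is non-saturating.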

[0062] Downsampling is used to reduce the size (resolution) of a feature image, thereby reducing the amount of data in the feature image. This can be achieved through downsampling layers, but is not limited to these methods. For example, downsampling layers can employ downsampling methods such as max pooling, average pooling, strided convolution, decimation (e.g., selecting a fixed number of pixels), and demuxout (splitting the input image into multiple smaller images). Downsampling layers can also use interpolation algorithms such as interpolation, bilinear interpolation, bicubic interpolation, and Lanczos interpolation. For instance, when using interpolation algorithms for downsampling, only the interpolated values can be retained while the original pixel values are removed, thus reducing the size of the feature image.
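Three of the downsampling methods listed above (max pooling, average pooling, and decimation) can be compared on a small example. This is a plain-Python sketch with a fixed 2x2 window and stride 2; the function names are illustrative.

```python
def pool2x2(image, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2x2 block of a 2-D image."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            reduce_fn([
                image[2 * r][2 * c], image[2 * r][2 * c + 1],
                image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1],
            ])
            for c in range(w)
        ]
        for r in range(h)
    ]


def max_pool(image):
    return pool2x2(image, max)


def avg_pool(image):
    return pool2x2(image, lambda values: sum(values) / len(values))


def decimate(image):
    """Keep every second pixel in each direction (fixed-pixel selection)."""
    return [row[::2] for row in image[::2]]


image = [
    [1.0, 2.0, 5.0, 6.0],
    [3.0, 4.0, 7.0, 8.0],
    [9.0, 10.0, 13.0, 14.0],
    [11.0, 12.0, 15.0, 16.0],
]
print(max_pool(image))
print(avg_pool(image))
print(decimate(image))
```

All three halve the height and width, but they retain different information: max pooling keeps the strongest response in each block, average pooling smooths it, and decimation simply discards pixels.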

[0063] Normalization can be achieved through a normalization layer, which allows the pixel values of the feature image to vary within a predetermined range, thereby simplifying the feature image generation process and improving the image processing effect. For example, the predetermined range can be [-1, 1], etc.
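The disclosure does not specify how the normalization layer maps pixel values into the predetermined range. As one possible illustration, the sketch below assumes simple min-max scaling to [-1, 1]; the function name `normalize` and the sample values are hypothetical.

```python
def normalize(feature, low=-1.0, high=1.0):
    """Min-max scale a 2-D feature image into [low, high] (assumed scheme)."""
    values = [v for row in feature for v in row]
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant image carries no contrast; map it to the middle of the range.
        mid = (low + high) / 2.0
        return [[mid for _ in row] for row in feature]
    scale = (high - low) / (v_max - v_min)
    return [[low + (v - v_min) * scale for v in row] for row in feature]


normalized = normalize([[0, 5], [10, 20]])
```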

[0064] It can be understood that the M initial feature images obtained in step S210 form a feature pyramid.

[0065] For example, in step S210, the downsampling factor in each of the M analysis processes can be set to 2. It should be noted that embodiments of this disclosure include, but are not limited to, this.

[0066] For example, when M=N, the selection process in step S220 can be omitted, that is, N initial feature images arranged from high to low resolution can be obtained directly through step S210.

[0067] For example, when M>N, the N sets of initial feature images typically include the first set of initial feature images (i.e., the initial feature image with the largest size or highest resolution) and the last set of initial feature images (i.e., the initial feature image with the smallest size or lowest resolution) from the M sets of initial feature images obtained in step S210, as well as N-2 sets of initial feature images selected from the middle M-2 sets of initial feature images. Thus, these multiple levels of initial feature images can include shallow-level feature images (e.g., the first set of initial feature images), deep-level feature images (e.g., the last set of initial feature images), and intermediate-level feature images (e.g., N-2 sets of initial feature images selected from the middle M-2 sets of initial feature images). Since the pixel information of shallow-level feature images is more suitable for precise localization, and the pixel information of deep-level feature images is more suitable for accurate classification, subsequent processing based on the aforementioned multiple levels of feature images (shallow, intermediate, and deep) ensures that rich feature information is included in the calculation process, which is beneficial for small target detection.

[0068] It can be understood that the N initial feature images used for subsequent processing also form a feature pyramid.

[0069] For example, in a specific example shown in Figure 5, M=5, N=4, and the downsampling factor for each analysis process is 2. Specifically, as shown in Figure 2, the input image can be analyzed and processed five times consecutively to obtain five sets of initial feature images F1 to F5 arranged from high to low resolution. The input image is analyzed and processed to obtain the first set of initial feature images F1, whose resolution is half that of the input image. The first set of initial feature images F1 is then analyzed and processed to obtain the second set of initial feature images F2, whose resolution is half that of the first set of initial feature images F1, i.e., 1/4 of the input image's resolution. The second set of initial feature images F2 is then analyzed and processed to obtain the third set of initial feature images F3, whose resolution is half that of the second set of initial feature images F2, i.e., 1/8 of the input image's resolution. The third set of initial feature images F3 is analyzed and processed to obtain the fourth set of initial feature images F4, whose resolution is half that of the third set of initial feature images F3, i.e., 1/16 of the input image's resolution. The fourth set of initial feature images F4 is analyzed and processed to obtain the fifth set of initial feature images F5, whose resolution is half that of the fourth set of initial feature images F4, i.e., 1/32 of the input image's resolution. Then, four sets of initial feature images (i.e., the first set F1, the third set F3, the fourth set F4, and the fifth set F5) can be selected from the five sets of initial feature images F1 to F5 for subsequent processing.
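The resolution arithmetic of this example can be checked with a short sketch. The 512x512 input size is an assumption chosen only so that every level has an integer size, and "half the resolution" is interpreted here as halving each side length; the function name `pyramid_sizes` is hypothetical.

```python
def pyramid_sizes(input_size, num_levels, factor=2):
    """Return the (height, width) of each pyramid level F1..F_M."""
    h, w = input_size
    sizes = []
    for _ in range(num_levels):
        h, w = h // factor, w // factor
        sizes.append((h, w))
    return sizes


# M = 5 analysis processes, each downsampling by a factor of 2.
sizes = pyramid_sizes((512, 512), num_levels=5)

# N = 4 levels kept for subsequent processing: F1, F3, F4 and F5.
selected_levels = [1, 3, 4, 5]
selected = {f"F{k}": sizes[k - 1] for k in selected_levels}
```

With these assumptions, F1 to F5 have side lengths 256, 128, 64, 32, and 16, i.e., 1/2, 1/4, 1/8, 1/16, and 1/32 of the input side length.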

[0070] For example, as shown in Figure 2, the above five analysis processes can be performed using analysis modules A1 to A5. For example, depending on actual needs, each analysis module may include convolutional layers, activation layers, downsampling layers, and normalization layers. For example, the convolutional layers in analysis modules A1 to A5 may use 7*7, 5*5, or 3*3 convolutional kernels; the embodiments disclosed herein include, but are not limited to, these.

[0071] It should be understood that the specific example shown in Figure 5 is illustrative and should not be construed as limiting the embodiments of this disclosure.

[0072] Step S300: Scale the initial feature images of the multiple levels to obtain multiple intermediate feature images with the same resolution.

[0073] For example, based on the comparison between the resolution of the initial feature images at each level and the predetermined resolution, and combined with downsampling or upsampling operations, the initial feature images at each level can be processed accordingly to obtain the corresponding intermediate feature images, and the resolution of the intermediate feature images corresponding to the initial feature images at each level is the same.

[0074] Step S300 is described in detail below with reference to Figure 5 and Figure 6, but this should not be construed as limiting the embodiments of this disclosure.

[0075] For example, as shown in Figure 6, scaling the initial feature images of the multiple levels to obtain multiple intermediate feature images with the same resolution, i.e., step S300, may include the following steps S310 to S330.

[0076] Step S310: In response to the initial feature image of any level having a resolution greater than a predetermined resolution, the initial feature image of any level is downsampled to obtain the intermediate feature image corresponding to the initial feature image of any level.

[0077] Step S320: In response to the initial feature image of any level having a resolution equal to a predetermined resolution, the initial feature image of that level is used as the intermediate feature image corresponding to the initial feature image of that level; and

[0078] Step S330: In response to the initial feature image of any level having a resolution less than a predetermined resolution, upsampling is performed on the initial feature image of that level to obtain an intermediate feature image corresponding to the initial feature image of that level.

[0079] The specific details and implementation process of downsampling can be found in the foregoing descriptions. Upsampling is used to increase the size of the feature image, thereby increasing the amount of data in the feature image. This can be achieved through upsampling layers, but is not limited to these methods. For example, upsampling layers can employ strided transposed convolution, interpolation algorithms, and other upsampling methods. Interpolation algorithms can include, for example, interpolation, bilinear interpolation, bicubic interpolation, and Lanczos interpolation. For instance, when using interpolation algorithms for upsampling, both the original pixel values and the interpolated values can be preserved, thereby increasing the size of the feature image.

[0080] For example, the predetermined resolution can be set according to actual needs. It can be the same as or different from the resolution of the initial feature image of a certain level among the multiple initial feature images. For example, the predetermined resolution is usually set to an integer multiple of the resolution of the initial feature image of a certain level. The initial feature images of each level either have this predetermined resolution or can have this predetermined resolution by upsampling or downsampling at an integer multiple. It should be noted that the embodiments of this disclosure do not impose any limitations on this.

[0081] For example, in the specific example shown in Figure 5, taking the case where the predetermined resolution is the same as the resolution of the fourth initial feature image F4, the scaling processing module T1 (e.g., including a downsampling layer) can be used to downsample the first initial feature image F1 to obtain the intermediate feature image P1; the scaling processing module T2 (e.g., including a downsampling layer) can be used to downsample the third initial feature image F3 to obtain the intermediate feature image P2; the scaling processing module T3 (e.g., without any layer structure) can be used to perform empty processing on the fourth initial feature image F4 to obtain the intermediate feature image P3, that is, the fourth initial feature image F4 is directly used as the intermediate feature image P3; and the scaling processing module T4 (e.g., including an upsampling layer) can be used to upsample the fifth initial feature image F5 to obtain the intermediate feature image P4. It is understood that the downsampling factor of the downsampling processing and the upsampling factor of the upsampling processing can be set according to the actual situation.
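The decision logic of steps S310 to S330, applied to this example, can be sketched as follows. The side lengths (256, 64, 32, 16) assume a 512x512 input, and the names `plan_scaling` and `upsample_nearest` are hypothetical. Nearest-neighbor repetition is used for upsampling purely for brevity; the disclosure also allows transposed convolution and other interpolation algorithms.

```python
def plan_scaling(level_size, target_size):
    """Choose the operation that brings one level to the predetermined resolution."""
    if level_size > target_size:
        return ("downsample", level_size // target_size)  # step S310
    if level_size < target_size:
        return ("upsample", target_size // level_size)    # step S330
    return ("identity", 1)                                # step S320


def upsample_nearest(feature, factor):
    """Nearest-neighbor upsampling: repeat each pixel factor times per axis."""
    return [
        [v for v in row for _ in range(factor)]
        for row in feature
        for _ in range(factor)
    ]


# Side lengths of the selected levels for an assumed 512x512 input image.
level_sizes = {"F1": 256, "F3": 64, "F4": 32, "F5": 16}
target = level_sizes["F4"]  # predetermined resolution equals that of F4
plan = {name: plan_scaling(size, target) for name, size in level_sizes.items()}

upsampled = upsample_nearest([[1, 2], [3, 4]], factor=2)
```

Under these assumptions, T1 downsamples F1 by a factor of 8, T2 downsamples F3 by a factor of 2, T3 passes F4 through unchanged, and T4 upsamples F5 by a factor of 2.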

[0082] Step S400: Perform joint processing on the multiple intermediate feature images to obtain a joint feature image.

[0083] For example, the joint processing can be a concatenation (concatenate) process, which stacks the channel images of multiple (e.g., two or more) intermediate feature images to be concatenated, so that the number of channels of the concatenated image (i.e., the joint feature image) is the sum of the numbers of channels of the multiple intermediate feature images to be concatenated.

[0084] For example, in the specific example shown in Figure 5, the number of channels of the joint feature image is the sum of the number of channels of the first set of initial feature images F1, the third set of initial feature images F3, the fourth set of initial feature images F4, and the fifth set of initial feature images F5 used for jointing.
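Channel-wise concatenation can be sketched as follows, representing each feature image as a list of 2-D channel images. The channel counts (2, 3, 4, 5), the 4x4 size, and the helper names are assumptions for illustration only.

```python
def concatenate_channels(feature_images):
    """Stack the channels of several same-resolution feature images."""
    first = feature_images[0]
    size = (len(first[0]), len(first[0][0]))
    joint = []
    for image in feature_images:
        for channel in image:
            if (len(channel), len(channel[0])) != size:
                raise ValueError(
                    "all intermediate feature images must share one resolution")
            joint.append(channel)
    return joint


def make_image(channels, height, width, value):
    """Build a constant feature image with the given number of channels."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Four intermediate feature images P1..P4 with hypothetical channel counts.
p1 = make_image(2, 4, 4, 0.1)
p2 = make_image(3, 4, 4, 0.2)
p3 = make_image(4, 4, 4, 0.3)
p4 = make_image(5, 4, 4, 0.4)
joint_feature = concatenate_channels([p1, p2, p3, p4])
```

The joint feature image has 2 + 3 + 4 + 5 = 14 channels, which is why all intermediate feature images must first be brought to the same resolution in step S300.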

[0085] Step S500: Based on the joint feature image, perform region nomination processing to determine the candidate target object and the first candidate bounding box of the candidate target object.

[0086] For example, based on joint feature images, Region Proposal Networks (RPNs) can be used to perform region proposal processing to determine candidate target objects and their first candidate bounding boxes. For details and implementation processes of Region Proposal Networks, please refer to relevant descriptions of RPNs in the field of computer vision.

[0087] Step S500 is described in detail below with reference to Figure 7, but this should not be construed as a limitation on the embodiments of this disclosure.

[0088] For example, as shown in Figure 7, multiple anchor boxes are first generated. Anchor boxes can be understood as candidate boxes or candidate regions. Anchor box parameters include the anchor box area (scale) and the anchor box aspect ratio (aspect). One set of anchor box parameters (i.e., one area and one aspect ratio) characterizes one anchor box. For example, three areas and three aspect ratios can be combined to form nine anchor boxes. Each location in the image to be processed (e.g., a joint feature image) can correspond to nine anchor boxes. For example, for a feature image of size W*H, which includes W*H locations (which can be understood as W*H pixels), there can be W*H*9 anchor boxes. It should be noted that in practical applications, most region-nomination-based object detection methods use nine or 25 anchor boxes. Therefore, nine or 25 anchor boxes can also be used in step S500. It should be noted that the embodiments of this disclosure include, but are not limited to, these examples.
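The anchor box count described above can be illustrated with a short sketch. The specific areas (64, 256, 1024), aspect ratios (0.5, 1, 2), the definition of aspect ratio as width divided by height, and the placement of anchors at pixel centers are all assumptions for this example; the disclosure does not fix these values.

```python
import math


def anchor_shapes(areas, aspect_ratios):
    """Combine each area with each aspect ratio (width / height)."""
    shapes = []
    for area in areas:
        for ratio in aspect_ratios:
            w = math.sqrt(area * ratio)
            h = math.sqrt(area / ratio)
            shapes.append((w, h))
    return shapes


def generate_anchors(width, height, shapes):
    """Place every anchor shape at the center of every feature location."""
    anchors = []
    for y in range(height):
        for x in range(width):
            for w, h in shapes:
                anchors.append((x + 0.5, y + 0.5, w, h))
    return anchors


# Three areas times three aspect ratios give nine anchor shapes.
shapes = anchor_shapes(areas=[64, 256, 1024], aspect_ratios=[0.5, 1.0, 2.0])
# A toy W*H = 4*3 feature image therefore has 4*3*9 anchor boxes.
anchors = generate_anchors(width=4, height=3, shapes=shapes)
```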

[0089] For example, as shown in Figure 7, a fully connected layer L1 can be used to extract the feature vector of the nomination region of the joint feature image corresponding to the anchor box. Based on this feature vector, a classification network L2 (e.g., a binary SoftMax classifier, which may include a fully connected layer) is used to determine whether the nomination region corresponding to the anchor box is foreground or background. For example, the output (score) of the classification network L2 is used to characterize the probability that the nomination region belongs to the foreground or background. If the nomination region is foreground, it is a region of interest (ROI), which is likely to include the target object. Simultaneously, based on this feature vector, a bounding box regression network L3 (e.g., which may include a fully connected layer) can be used to perform bounding box regression to determine the parameters (bb_reg) of the detection box of the region of interest. For example, the parameters of the detection box may include the center coordinates x and y of the detection box and the width w and height h of the detection box; or, the parameters of the detection box may include the coordinates x1 and y1 of the top left corner of the detection box and the width w and height h of the detection box. It should be noted that embodiments of this disclosure include, but are not limited to, these. Therefore, the candidate target objects and the first candidate bounding boxes of the candidate target objects (i.e., the detection boxes mentioned above) can be determined.
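Two pieces of this paragraph lend themselves to a small sketch: the binary SoftMax score and the two equivalent detection box parameterizations. The function names and the raw scores (logits) passed in are hypothetical; in the disclosed method the scores would come from the classification network L2.

```python
import math


def softmax2(bg_logit, fg_logit):
    """Binary SoftMax: turn two raw scores into background/foreground probabilities."""
    m = max(bg_logit, fg_logit)  # subtract the maximum for numerical stability
    e_bg = math.exp(bg_logit - m)
    e_fg = math.exp(fg_logit - m)
    total = e_bg + e_fg
    return e_bg / total, e_fg / total


def center_to_corner(x, y, w, h):
    """(center x, center y, w, h) -> (top-left x1, top-left y1, w, h)."""
    return (x - w / 2.0, y - h / 2.0, w, h)


def corner_to_center(x1, y1, w, h):
    """(top-left x1, top-left y1, w, h) -> (center x, center y, w, h)."""
    return (x1 + w / 2.0, y1 + h / 2.0, w, h)


p_bg, p_fg = softmax2(bg_logit=0.0, fg_logit=2.0)
is_region_of_interest = p_fg > p_bg

corner_box = center_to_corner(10.0, 10.0, 4.0, 6.0)
round_trip = corner_to_center(*corner_box)
```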

[0090] Step S600: Based on the joint feature image and the first candidate box, extract the attribute information of the candidate target object and determine the second candidate box of the candidate target object.

[0091] For example, in step S600, not only can the attribute information of the candidate target object be extracted, but the first candidate box can also be refined to obtain a more accurate second candidate box.

[0092] Step S600 is described in detail below with reference to Figure 8 and Figure 9, but this should not be construed as a limitation on the embodiments of this disclosure.

[0093] For example, as shown in Figure 8, extracting the attribute information of the candidate target object and determining the second candidate box of the candidate target object based on the joint feature image and the first candidate box, i.e., step S600, may include the following steps S610 to S630.

[0094] Step S610: Analyze and process the joint feature image to obtain the first feature image.

[0095] For example, in some examples, as shown in Figure 9, the analysis process in step S610 can be performed using analysis module A6. For example, depending on actual needs, analysis module A6 may include convolutional layers, activation layers, downsampling layers, normalization layers, etc.

[0096] Step S620: Determine the nomination region corresponding to the first candidate box on the first feature image as the first region of interest, and perform region of interest pooling on the first region of interest to obtain the second feature image.

[0097] For example, in some cases, the size of the nomination region is not fixed; that is, the size of the first region of interest is not fixed. In this case, as shown in Figure 9, a region of interest (ROI) pooling process can be performed on the first region of interest to obtain a second feature image with a fixed size, which facilitates subsequent processing (e.g., inputting the second feature image into a subsequent fully connected layer L4). For example, the size of the output of the ROI pooling process (i.e., the second feature image) can be 7*7, and embodiments of this disclosure include, but are not limited to, this.
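The way ROI pooling turns regions of different sizes into a fixed-size output can be sketched as follows. This is a simplified single-channel max-pooling variant with integer box coordinates; the function name, the toy 8x8 feature image, and the box values are assumptions for illustration.

```python
def roi_max_pool(feature, box, out_size=(7, 7)):
    """Max-pool the region box = (x1, y1, w, h) into a fixed out_size grid."""
    x1, y1, w, h = box
    out_h, out_w = out_size
    pooled = []
    for i in range(out_h):
        r0 = y1 + (i * h) // out_h              # bin start: floor
        r1 = y1 + -((-(i + 1) * h) // out_h)    # bin end: ceiling division
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + -((-(j + 1) * w) // out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 8x8 single-channel feature image whose value encodes its position.
feature = [[r * 8 + c for c in range(8)] for r in range(8)]

# Two regions of interest of different sizes, pooled to the same 2x2 output.
pooled_full = roi_max_pool(feature, box=(0, 0, 8, 8), out_size=(2, 2))
pooled_small = roi_max_pool(feature, box=(1, 2, 5, 3), out_size=(2, 2))

# The default output size follows the 7*7 example given in the text.
pooled_7x7 = roi_max_pool(feature, box=(0, 0, 8, 8))
```

Because every output has the same size regardless of the region's size, it can be fed directly into a fully connected layer with a fixed input dimension.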

[0098] Step S630: Based on the second feature image, extract the attribute information of the candidate target object and determine the second candidate bounding box of the candidate target object.

[0099] For example, in some examples, as shown in Figure 9, a fully connected layer L4 can be used to extract the feature vector of the second feature image. As shown in Figure 9, based on this feature vector, one or more attribute information extraction networks (e.g., the two attribute information extraction networks L5 and L6 shown in Figure 9) can be used to extract attribute information (e.g., attribute information Attr1 and Attr2) of the candidate target objects. It can be understood that an attribute information extraction network can essentially be a classification network (e.g., it may include fully connected layers). For example, taking a vehicle as a candidate target object, the attribute information of the candidate target object may include one or more of the following: vehicle type, vehicle color, whether it is carrying cargo, and whether it is occluded. Meanwhile, as shown in Figure 9, based on this feature vector, a bounding box regression network L7 (e.g., which may include fully connected layers) can be used to perform bounding box regression to refine the first candidate box and obtain the parameters (bb_reg) of the second candidate box. This allows for the acquisition of structured attribute information of the candidate target objects, which is beneficial for obtaining more comprehensive detection results for further analysis. For example, in scenarios where large-scale aerial images are used to assist intelligent transportation systems, the detected structured attribute information of vehicles helps the intelligent transportation system track specific vehicles and plan routes for special vehicles. Furthermore, the structured attribute information of candidate target objects can also be used to filter candidate target objects (refer to the relevant description of step S700 below) to improve detection accuracy.

[0100] Step S700: Based on the attribute information and the second candidate box, filter the candidate target objects to obtain the final target object and the detection box of the final target object.

[0101] In practical applications, the candidate target objects and their first candidate bounding boxes determined in step S500 may contain false positives, such as incorrectly predicting the background as the foreground. Correspondingly, the attribute information of the candidate target objects extracted in step S600 and the determined second candidate bounding boxes may be affected. In this case, based on the attribute information and the second candidate bounding boxes, a filtering network (e.g., a post-filter) can be used to filter the candidate target objects to obtain the final target object and its detection bounding box.

[0102] Step S700 is described in detail below with reference to Figure 10, but this should not be construed as a limitation on the embodiments of this disclosure.

[0103] For example, the attribute information may include first attribute information. In this case, as shown in Figure 10, filtering the candidate target objects based on the attribute information and the second candidate box to obtain the final target object and the detection box of the final target object, i.e., step S700, may include the following steps S710 to S740.

[0104] Step S710: Determine the region corresponding to the second candidate box on the input image as the second region of interest, and extract the second region of interest and its neighboring regions from the input image to obtain the intermediate input image;

[0105] Step S720: Based on the first attribute information, classify the intermediate input image to determine the second attribute information of the intermediate input image;

[0106] Step S730: In response to the consistency between the first attribute information and the second attribute information, the candidate target object is taken as the final target object, and the second candidate bounding box of the candidate target object is taken as the detection bounding box of the final target object; and

[0107] Step S740: In response to the inconsistency between the first attribute information and the second attribute information, filter out the candidate target object and the second candidate box of the candidate target object.

[0108] For example, if the second region of interest is a rectangular region, the intermediate input image is equivalent to a region image defined by extending the second region of interest outward by a certain number of pixels in at least one of the four directions (up, down, left, and right). For example, the number of pixels extended outward can be set according to actual needs; for example, the number can be 5 to 20, or for example, 10. It should be noted that the embodiments of this disclosure include, but are not limited to, this.

[0109] For example, taking vehicle type as the first attribute information, the intermediate input image can be classified according to vehicle type (which can also be understood as extracting vehicle type attribute information) to determine the second attribute information. It is understood that the first attribute information can include one or more attribute information, and correspondingly, the second attribute information can also include that one or more attribute information. It is also understood that when the first attribute information includes only one attribute information, a single classification network can be used for classification processing; when the first attribute information includes multiple attribute information, multiple classification networks can be used to perform corresponding classification processing respectively.

[0110] For example, after obtaining the first attribute information and the second attribute information, it is possible to compare whether the two are consistent, and perform the operation of step S730 or step S740 according to the comparison result to realize the filtering process in step S700, thereby improving the detection accuracy.
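Steps S710 to S740 can be sketched as follows. The classifier is passed in as a callable because the disclosure does not specify the classification network; the stub classifier, the candidate dictionaries, and the choice to extend the region by the same margin in all four directions (clipped to the image border) are assumptions made only for this illustration.

```python
def expand_box(box, margin, image_width, image_height):
    """Extend box = (x1, y1, w, h) outward by margin pixels, clipped to the image."""
    x1, y1, w, h = box
    nx1 = max(0, x1 - margin)
    ny1 = max(0, y1 - margin)
    nx2 = min(image_width, x1 + w + margin)
    ny2 = min(image_height, y1 + h + margin)
    return (nx1, ny1, nx2 - nx1, ny2 - ny1)


def crop(image, box):
    x1, y1, w, h = box
    return [row[x1:x1 + w] for row in image[y1:y1 + h]]


def filter_candidates(image, candidates, classify, margin=10):
    """Keep a candidate only if the second classification agrees with its attribute."""
    height, width = len(image), len(image[0])
    kept = []
    for cand in candidates:
        region = expand_box(cand["box"], margin, width, height)  # step S710
        second_attr = classify(crop(image, region))              # step S720
        if second_attr == cand["attr"]:                          # step S730
            kept.append(cand)
        # otherwise the candidate is filtered out (step S740)
    return kept


def stub_classifier(patch):
    """Hypothetical stand-in for the classification network."""
    values = [v for row in patch for v in row]
    return "truck" if sum(values) / len(values) > 0.5 else "car"


# Toy 40x40 image: a bright 20x20 block on a dark background.
image = [[1 if 10 <= r < 30 and 10 <= c < 30 else 0 for c in range(40)]
         for r in range(40)]
candidates = [
    {"box": (12, 12, 16, 16), "attr": "truck"},  # consistent, kept
    {"box": (0, 0, 5, 5), "attr": "truck"},      # inconsistent, filtered out
]
final_targets = filter_candidates(image, candidates, stub_classifier, margin=2)
```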

[0111] It should be noted that before processing the input image using the image processing method provided in the embodiments of this disclosure, it is usually necessary to train the neural network structures (such as convolutional layers, fully connected layers, etc.) involved in steps S200 to S700. The training process can refer to common training methods, which will not be described in detail here. It is understood that the neural network structures involved in steps S200 to S600 can be trained as a whole; while the neural network structure involved in step S700 can be trained separately.

[0112] At least some embodiments of this disclosure also provide a method for setting a learning rate, which can be applied to the aforementioned training process. Figure 11 is a schematic diagram of a spiral cyclic learning rate setting provided by at least some embodiments of this disclosure. For example, as shown in Figure 11, an iteration baseline value V is preset, which can represent a certain number of iterations or a certain number of epochs. The following describes in detail the setting method of the spiral cyclic learning rate, taking the case where the iteration baseline value V represents a certain number of iterations as an example, but this should not be considered as a limitation on the embodiments of this disclosure.

[0113] For example, as shown in Figure 11, the total number of iterations in the first training phase is V. In the first training phase, the learning rate decays from its maximum value to its minimum value, and the decay process follows a cosine function relationship. The total number of iterations in the second training phase is 2V. In the second training phase, the learning rate decays from its maximum value to its minimum value, and the decay process follows a cosine function relationship. The total number of iterations in the third training phase is 4V. In the third training phase, the learning rate decays from its maximum value to its minimum value, and the decay process follows a cosine function relationship. And so on: the total number of iterations in the nth training phase is $2^{n-1}V$, and in the nth training phase, the learning rate decays from its maximum value to its minimum value, and the decay process follows a cosine function relationship.
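The schedule described above can be sketched as follows. The text states only that the decay within each phase follows a cosine relationship; the half-cosine formula used here, the function name `spiral_cosine_lr`, and the numeric values of V and the learning rate bounds are assumptions for illustration.

```python
import math


def spiral_cosine_lr(iteration, base_v, lr_max, lr_min):
    """Cosine decay from lr_max to lr_min within phases of length V, 2V, 4V, ..."""
    phase_len = base_v
    start = 0
    # Find the phase that contains this iteration; each phase is twice as long.
    while iteration >= start + phase_len:
        start += phase_len
        phase_len *= 2
    progress = (iteration - start) / phase_len  # 0 at phase start, approaches 1
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


V = 10
schedule = [spiral_cosine_lr(t, V, lr_max=0.1, lr_min=0.001) for t in range(7 * V)]
```

With V = 10, the learning rate restarts at its maximum at iterations 0, 10, and 30, which are the starts of the phases of length V, 2V, and 4V.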

[0114] It should be noted that manually adjusting the learning rate during training is easily affected by the initial value. If the initial value is higher than the optimal value, it may cause the system to deviate from the objective function; if the initial value is lower than the optimal value, it may result in a very slow learning speed. Non-convex optimization of neural networks often gets trapped in local optima, making it more difficult to obtain the global optimum. The spiral learning rate setting method provided in the embodiments of this disclosure introduces a new dynamic learning rate to alleviate the task of selecting the learning rate. This dynamic learning rate only uses first-order information (the total number of iterations in each training phase is ), and its value can be determined by only a small amount of additional calculation in each iteration of gradient descent. The advantages of this method are as follows: it is insensitive to hyperparameters, has low computational cost, easily iterates to obtain better values, and is effective for different model architectures.

[0115] It should be noted that, in the embodiments of this disclosure, the flow of the target detection method described above may include more or fewer operations, which may be executed sequentially or in parallel. Although the flow of the target detection method described above includes multiple operations appearing in a specific order, it should be clearly understood that the order of the multiple operations is not limited. The target detection method described above may be executed once or multiple times according to predetermined conditions.

[0116] The target detection method provided in this disclosure can extract attribute information of the detected target object while performing target detection. This not only improves detection accuracy but also facilitates obtaining more comprehensive detection results for further analysis. Furthermore, this target detection algorithm can effectively detect small target objects in the input image.

[0117] At least some embodiments of this disclosure also provide a target detection device. Figure 12 is a schematic block diagram of a target detection device provided by at least some embodiments of this disclosure. For example, as shown in Figure 12, the target detection device 100 includes a memory 110 and a processor 120.

[0118] For example, memory 110 is used to store computer-readable instructions in a non-transitory manner, and processor 120 is used to run the computer-readable instructions, which are executed by processor 120 to perform the target detection method provided in any embodiment of this disclosure.

[0119] For example, memory 110 and processor 120 can communicate with each other directly or indirectly. For example, in some examples, as shown in Figure 12, the target detection device 100 may further include a system bus 130, through which the memory 110 and the processor 120 can communicate with each other. For example, the processor 120 can access the memory 110 through the system bus 130. In other examples, components such as the memory 110 and the processor 120 may communicate via a network connection. The network may include a wireless network, a wired network, and / or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things (IoT) based on the Internet and / or a telecommunications network, and / or any combination of the above networks. Wired networks may use methods such as twisted-pair cables, coaxial cables, or fiber optic transmission for communication, while wireless networks may use methods such as 3G / 4G / 5G mobile communication networks, Bluetooth, Zigbee, or Wi-Fi. This disclosure does not limit the type and function of the network.

[0120] For example, processor 120 can control other components in the target detection device to perform desired functions. Processor 120 can be a device with data processing and / or program execution capabilities, such as a central processing unit (CPU), tensor processing unit (TPU), or graphics processing unit (GPU). The CPU can be based on x86 or ARM architectures. The GPU can be integrated directly onto the motherboard or built into the motherboard's northbridge chip. The GPU can also be integrated into the CPU.

[0121] For example, memory 110 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, etc.

[0122] For example, one or more computer instructions may be stored on memory 110, and processor 120 may execute the computer instructions to perform various functions. Various application programs and various data may also be stored in the computer-readable storage medium, such as input images, initial feature images, intermediate feature images, joint feature images, first candidate boxes, second candidate boxes, detection boxes of the final target object, and various data used and / or generated by the application programs.

[0123] For example, some computer instructions stored in memory 110 can be executed by processor 120 to perform one or more steps in the target detection method described above.

[0124] For example, such as Figure 12 As shown, the target detection device 100 may further include an input interface 140 that allows external devices to communicate with the target detection device 100. For example, the input interface 140 may be used to receive instructions or data (e.g., input images, etc.) from external computer devices, users, etc. The target detection device 100 may further include an output interface 150 that enables the target detection device 100 to connect to one or more external devices. For example, the target detection device 100 may output target detection results (e.g., first candidate box, second candidate box, final target object detection box, etc.) through the output interface 150. External devices that communicate with the target detection device 100 through the input interface 140 and the output interface 150 may be included in an environment that provides any type of user interface that a user can interact with. Examples of user interface types include graphical user interfaces, natural user interfaces, etc. For example, a graphical user interface may accept input from a user using an input device such as a keyboard, mouse, remote control, etc., and provide output on an output device such as a display. Furthermore, a natural user interface allows a user to interact with the target detection device 100 in a manner that is not constrained by input devices such as a keyboard, mouse, remote control, etc. In contrast, natural user interfaces can rely on voice recognition, touch and stylus recognition, on-screen and near-screen gesture recognition, air gestures, head and eye tracking, voice and semantics, vision, touch, gestures, and machine intelligence.

[0125] In addition, the target detection device 100, although in Figure 12 While shown as a single system, it is understood that the target detection device 100 can also be a distributed system, and may be deployed as a cloud facility (including a public or private cloud). Thus, for example, several devices can communicate via a network connection and collaboratively perform tasks described as being performed by the target detection device 100. For example, in some embodiments, an input image can be acquired by a client and uploaded to a server; the server performs a target detection process based on the received input image and returns, for example, a detection bounding box of the final target object to the client for the user.

[0126] For example, for a detailed description of the implementation of the target detection method, refer to the relevant descriptions in the above embodiments of the target detection method; the details are not repeated here.

[0127] For example, in some cases, the target detection device may include, but is not limited to, smartphones, tablets, personal computers, personal digital assistants (PDAs), servers, etc.

[0128] It should be noted that the target detection device provided in the embodiments of this disclosure is exemplary and not restrictive. Depending on actual application needs, the target detection device may also include other conventional components or structures. For example, to realize the necessary functions of the target detection device, those skilled in the art can provide other conventional components or structures according to the specific application scenario. The embodiments of this disclosure are not limited in this respect.

[0129] For the technical effects of the target detection device provided in the embodiments of this disclosure, refer to the corresponding descriptions of the target detection method in the above embodiments; they are not repeated here.

[0130] At least some embodiments of this disclosure also provide a non-transitory storage medium. Figure 13 is a schematic diagram of a non-transitory storage medium provided by an embodiment of the present disclosure. For example, as shown in Figure 13, the non-transitory storage medium 200 stores computer-readable instructions 201 in a non-transitory manner. When the computer-readable instructions 201 are executed by a computer (including a processor), the target detection method provided in any embodiment of this disclosure can be executed.

[0131] For example, one or more computer instructions may be stored on the non-transitory storage medium 200. Some of the computer instructions stored on the non-transitory storage medium 200 may be, for example, instructions for implementing one or more steps in the target detection method described above.

[0132] For example, the non-transitory storage medium may include a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), flash memory, any combination of the above storage media, or another suitable storage medium.

[0133] For the technical effects of the non-transitory storage medium provided in the embodiments of this disclosure, refer to the corresponding descriptions of the target detection method in the above embodiments; they are not repeated here.

[0134] The following points need to be clarified regarding this disclosure:

[0135] (1) The accompanying drawings of the embodiments of this disclosure show only the structures related to the embodiments of this disclosure; for other structures, refer to common designs.

[0136] (2) Where there is no conflict, the embodiments of this disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.

[0137] The above are merely specific embodiments of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that those skilled in the art can readily conceive within the technical scope disclosed herein should fall within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Claims

1. A target detection method, comprising:
obtaining an input image;
obtaining, based on the input image, initial feature images of multiple levels with different resolutions;
scaling the initial feature images of the multiple levels to obtain multiple intermediate feature images with the same resolution;
jointly processing the multiple intermediate feature images to obtain a joint feature image;
performing region nomination processing based on the joint feature image to determine a candidate target object and a first candidate bounding box of the candidate target object;
extracting, based on the joint feature image and the first candidate bounding box, attribute information of the candidate target object and determining a second candidate bounding box of the candidate target object, wherein the attribute information includes first attribute information; and
filtering the candidate target object based on the attribute information and the second candidate bounding box to obtain a final target object and a detection bounding box of the final target object, wherein the filtering comprises:
determining the region corresponding to the second candidate bounding box on the input image as a second region of interest, and extracting the second region of interest and its neighboring region from the input image to obtain an intermediate input image;
classifying the intermediate input image based on the first attribute information to determine second attribute information of the intermediate input image;
in response to the first attribute information being consistent with the second attribute information, taking the candidate target object as the final target object and taking the second candidate bounding box of the candidate target object as the detection bounding box of the final target object; and
in response to the first attribute information being inconsistent with the second attribute information, filtering out the candidate target object and the second candidate bounding box of the candidate target object.
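For illustration only, the filtering step recited at the end of claim 1 can be sketched as follows. This is a minimal sketch and not the claimed implementation. It assumes that attribute information can be represented as a string label, that the "neighboring region" is a fixed pixel margin around the second candidate bounding box, and that a classifier is supplied by the caller. The names `Candidate`, `expand_box`, `crop`, `filter_candidates`, `classify`, and `margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), end-exclusive, in input-image pixels
Image = List[List[int]]           # grayscale image as rows of pixel values (assumption)


@dataclass
class Candidate:
    box: Box          # second candidate bounding box
    attribute: str    # first attribute information (label form is an assumption)


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow the box by `margin` pixels on every side, clipped to the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width, x2 + margin), min(height, y2 + margin))


def crop(image: Image, box: Box) -> Image:
    """Cut the intermediate input image out of the input image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def filter_candidates(image: Image,
                      candidates: List[Candidate],
                      classify: Callable[[Image, str], str],
                      margin: int = 2) -> List[Candidate]:
    """Keep a candidate only if the second classification agrees with the first."""
    height, width = len(image), len(image[0])
    kept = []
    for cand in candidates:
        # Second region of interest plus its neighboring region.
        region = expand_box(cand.box, margin, width, height)
        patch = crop(image, region)
        # The classifier receives the first attribute information as context.
        second_attribute = classify(patch, cand.attribute)
        if second_attribute == cand.attribute:
            kept.append(cand)   # becomes a final target object
        # Otherwise the candidate and its bounding box are filtered out.
    return kept
```

In this sketch the second classification sees more context than the detector's region of interest, which is what lets it reject candidates whose first attribute information does not hold up on the original pixels.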

2. The target detection method according to claim 1, wherein obtaining the input image comprises:
obtaining an original input image; and
preprocessing the original input image to obtain the input image, wherein the preprocessing includes at least one of cropping and resolution conversion.

3. The target detection method according to claim 1 or 2, wherein obtaining, based on the input image, the initial feature images of multiple levels with different resolutions comprises:
performing M consecutive analysis processes based on the input image to obtain M sets of initial feature images with different resolutions; and
selecting N sets of initial feature images from the M sets of initial feature images as the initial feature images of the multiple levels,
wherein M and N are both positive integers, and M ≥ N ≥ 2.

4. The target detection method according to claim 3, wherein each of the M analysis processes includes convolution processing, and the resolutions of the outputs of the M analysis processes decrease sequentially.
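For illustration only, the resolution bookkeeping implied by claims 3 and 4 can be sketched as follows. This is a minimal sketch under assumptions not stated in the claims: each analysis process is assumed to reduce the resolution by a fixed integer stride (2 here), and the levels to keep are chosen by index. The names `stage_resolutions`, `select_levels`, and `stride` are hypothetical.

```python
from typing import List, Sequence, Tuple

Size = Tuple[int, int]  # (height, width)


def stage_resolutions(input_size: Size, m: int, stride: int = 2) -> List[Size]:
    """Resolution after each of M consecutive analysis processes.

    Assumes every process downsamples by `stride`, so the resolutions
    decrease sequentially as recited in claim 4.
    """
    h, w = input_size
    sizes = []
    for _ in range(m):
        h, w = max(1, h // stride), max(1, w // stride)
        sizes.append((h, w))
    return sizes


def select_levels(sizes: Sequence[Size], indices: Sequence[int]) -> List[Size]:
    """Pick N of the M levels, enforcing M >= N >= 2 from claim 3."""
    if not 2 <= len(indices) <= len(sizes):
        raise ValueError("need M >= N >= 2")
    return [sizes[i] for i in indices]


# Example: M = 5 processes on a 512 x 512 input, keeping the last N = 3 levels.
all_levels = stage_resolutions((512, 512), m=5)
chosen = select_levels(all_levels, [2, 3, 4])
```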

5. The target detection method according to claim 1 or 2, wherein scaling the initial feature images of the multiple levels to obtain the multiple intermediate feature images with the same resolution comprises:
in response to the initial feature image of any level having a resolution greater than a predetermined resolution, downsampling the initial feature image of that level to obtain the intermediate feature image corresponding to the initial feature image of that level;
in response to the initial feature image of any level having a resolution equal to the predetermined resolution, using the initial feature image of that level as the intermediate feature image corresponding to the initial feature image of that level; and
in response to the initial feature image of any level having a resolution less than the predetermined resolution, upsampling the initial feature image of that level to obtain the intermediate feature image corresponding to the initial feature image of that level.
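For illustration only, the three branches of claim 5 can be sketched as follows. This is a minimal sketch under assumptions not stated in the claim: feature images are square, single-channel, and related to the predetermined resolution by an integer factor; downsampling is done by average pooling and upsampling by nearest-neighbor repetition. The names `downsample`, `upsample`, and `scale_to` are hypothetical, and other resampling methods could be substituted.

```python
from typing import List

Feature = List[List[float]]  # square single-channel feature image (assumption)


def downsample(feature: Feature, factor: int) -> Feature:
    """Average-pool non-overlapping factor x factor blocks."""
    size = len(feature) // factor
    return [[sum(feature[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(size)]
            for y in range(size)]


def upsample(feature: Feature, factor: int) -> Feature:
    """Repeat every value factor times along both axes (nearest neighbor)."""
    size = len(feature) * factor
    return [[feature[y // factor][x // factor] for x in range(size)]
            for y in range(size)]


def scale_to(feature: Feature, target: int) -> Feature:
    """Bring one initial feature image to the predetermined resolution."""
    size = len(feature)
    if size > target:      # resolution greater than predetermined: downsample
        return downsample(feature, size // target)
    if size < target:      # resolution less than predetermined: upsample
        return upsample(feature, target // size)
    return feature         # resolution already equal: use as-is
```

After this step every level has the same height and width, so the intermediate feature images can be combined position by position in the joint processing step.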

6. The target detection method according to claim 1 or 2, wherein performing the region nomination processing based on the joint feature image to determine the candidate target object and the first candidate bounding box of the candidate target object comprises:
performing the region nomination processing on the joint feature image using a region nomination network to determine the candidate target object and the first candidate bounding box of the candidate target object.

7. The target detection method according to claim 1 or 2, wherein extracting, based on the joint feature image and the first candidate bounding box, the attribute information of the candidate target object and determining the second candidate bounding box of the candidate target object comprises:
analyzing the joint feature image to obtain a first feature image;
determining the nomination region corresponding to the first candidate bounding box on the first feature image as a first region of interest;
performing region-of-interest pooling on the first region of interest to obtain a second feature image; and
extracting, based on the second feature image, the attribute information of the candidate target object and determining the second candidate bounding box of the candidate target object.
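For illustration only, the region-of-interest pooling step of claim 7 can be sketched as follows. This is a minimal sketch assuming max pooling over a single-channel feature image, with the region of interest given in feature-image coordinates and an end-exclusive convention. The name `roi_max_pool` and its parameters are hypothetical, and the claim does not limit the pooling to this variant.

```python
from typing import List, Tuple

Feature = List[List[float]]
Roi = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end-exclusive, feature coordinates


def roi_max_pool(feature: Feature, roi: Roi, out_h: int, out_w: int) -> Feature:
    """Pool a region of any size into a fixed out_h x out_w grid of maxima."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    if roi_h < 1 or roi_w < 1:
        raise ValueError("region of interest must cover at least one cell")
    pooled = []
    for i in range(out_h):
        # Bin rows: floor of the start, ceiling of the end, so no bin is empty.
        ys = y1 + (i * roi_h) // out_h
        ye = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x1 + (j * roi_w) // out_w
            xe = x1 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Because the output grid has a fixed size regardless of the size of the first region of interest, the resulting second feature image can be passed to fixed-size layers that extract the attribute information and refine the bounding box.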

8. A target detection device, comprising:
a memory configured to store computer-readable instructions in a non-transitory manner; and
a processor configured to execute the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the target detection method according to any one of claims 1-7.

9. A non-transitory storage medium storing computer-readable instructions in a non-transitory manner, wherein, when the computer-readable instructions are executed by a computer, the target detection method according to any one of claims 1-7 can be performed.