Object position detection method and apparatus, electronic device, and storage medium
A pre-trained classification model extracts a heatmap, the center point of the heatmap's response region is calculated, and both are passed to a large image segmentation model. This addresses the problem that object location detection relies on manual annotation, which is prone to inaccuracy, and yields efficient and accurate detection of target object locations.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NANJING YIMU INTELLIGENT TECHNOLOGY CO LTD
- Filing Date
- 2025-06-04
- Publication Date
- 2026-06-18
AI Technical Summary
Existing object location detection methods rely on manual annotation, which is prone to inaccuracy. This results in a large amount of manpower being spent on data annotation, and the detection results are not accurate enough.
A heatmap of the image to be detected is extracted using a pre-trained classification model, the center point of the response region is calculated, and the image, response region, and center point position are input into a pre-set image segmentation model to obtain the precise location of the target object.
The method reduces reliance on manual annotation, improves the efficiency of the data annotation process, and enhances the accuracy of detection results.
Figure CN2025099118_18062026_PF_FP_ABST
Description
An object location detection method, apparatus, electronic device, and storage medium
Technical Field
[0001] This invention relates to the field of computer vision technology, and in particular to an object position detection method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Object localization or detection, also known as object location detection, is a fundamental area of visual research. Currently, object detection primarily relies on deep learning. However, the application of deep learning models requires a large amount of manually labeled data to train them. This labeling process demands the collection of massive amounts of data and significant manual labor.
[0003] Improving the efficiency of manual annotation is currently a hot topic in the field of visual AI (Artificial Intelligence). In actual data annotation, the most labor-intensive part is accurately annotating target locations, which in the vision field means segmentation annotation. There are several ways to improve the efficiency of manual annotation. The first method is semi-automatic learning: a portion of the data is manually annotated to train a model, and the model then runs inference on the remaining data. Correct inferences are used directly as annotations, while incorrect inferences are corrected manually. This method still requires some manual annotation, and because the quantity and quality of the manual annotations directly affect how well the initial model trains, it still requires significant manpower. The second method is segmentation annotation based on manual prompts, represented by SAM (Segment Anything Model). This method requires a point or bounding-box prompt to be given manually for each target. In practice, more than two prompts per target are needed on average to obtain a good segmentation prediction, so this method also requires considerable manpower.
Summary of the Invention
[0004] This invention provides an object location detection method, apparatus, electronic device, and storage medium to solve the problem that existing object location detection methods rely on manual annotation, which is prone to inaccuracy. The invention reduces reliance on manual labor, improves the efficiency of the data annotation process, and enhances the accuracy of the detection results.
[0005] According to one aspect of the present invention, an object location detection method is provided, the method comprising:
[0006] extracting a heatmap of at least one target object in an image to be detected using a pre-trained classification model, wherein the heatmap represents the response region of the at least one target object to the classification result;
[0007] calculating the center point position of the response region; and
[0008] inputting the image to be detected, the response region, and the center point position into a preset large image segmentation model to obtain the position detection result of the at least one target object in the image to be detected.
[0009] According to another aspect of the present invention, an object position detection device is provided, the device comprising:
[0010] a heatmap extraction module, used to extract a heatmap of at least one target object in an image to be detected using a pre-trained classification model, wherein the heatmap represents the response region of the at least one target object to the classification result;
[0011] a center point position calculation module, used to calculate the center point position of the response region; and
[0012] a position detection result acquisition module, used to input the image to be detected, the response region, and the center point position into a preset large image segmentation model to obtain the position detection result of the at least one target object in the image to be detected.
[0013] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising:
[0014] At least one processor; and
[0015] A memory communicatively connected to the at least one processor; wherein,
[0016] The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the object location detection method according to any embodiment of the present invention.
[0017] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement the object location detection method according to any embodiment of the present invention.
[0018] The technical solution of this invention utilizes a pre-trained classification model to extract a heatmap of at least one target object in the image to be detected. The heatmap represents the response region of the at least one target object to the classification result. The center point position of the response region is calculated. The image to be detected, the response region, and the center point position are input into a preset large image segmentation model to obtain the location detection result of the at least one target object in the image to be detected. By using the heatmap calculated by the classification model as the approximate position of the target object in the image, and then using that approximate position as the input to the image segmentation model to obtain the precise position of the target object, this method solves the problem that existing object location detection methods rely on manual annotation, which is prone to inaccuracy. It reduces the reliance on manual annotation, improves the efficiency of the data annotation process, and enhances the accuracy of the detection results.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of an object location detection method provided in Embodiment 1 of the present invention;
[0021] Figure 2 is an example diagram of a Grad-CAM heatmap obtained through a classification model according to an embodiment of the present invention;
[0022] Figure 3 is an example diagram of a heatmap mask obtained through a classification model according to an embodiment of the present invention;
[0023] Figure 4 is an example diagram of the target object location detection result obtained by a large image segmentation model according to an embodiment of the present invention;
[0024] Figure 5 is a structural schematic diagram of an object position detection device provided in Embodiment 2 of the present invention;
[0025] Figure 6 is a schematic diagram of the structure of an electronic device that implements the object position detection method of the present invention.
Detailed Implementation
[0026] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0027] It should be noted that the terms "target," "initial," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in sequences other than those illustrated or described herein.
[0028] Embodiment 1
[0029] Figure 1 is a flowchart of an object location detection method provided in Embodiment 1 of the present invention. This embodiment is applicable to the situation of detecting the location of objects in an image. The method can be executed by an object location detection device, which can be implemented in hardware and / or software and can be configured in a server. As shown in Figure 1, the method includes:
[0030] S110. Use a pre-trained classification model to extract a heatmap of at least one target object in the image to be detected. The heatmap represents the response region of at least one target object to the classification result.
[0031] The pre-trained classification model can be a model that classifies the objects in an image. Its classification results can indicate objects such as birds, cars, or pedestrians, or even more specific categories. The at least one target object refers to an object in the image to be detected, and one image to be detected can contain one or more target objects. The heatmap shows the region of the image that responds to the classification result. As shown in Figure 2, the heatmap represents the region of the image that responds to the bird detection result; regions A, B, C, and D in the figure correspond to different colors on the heatmap.
[0032] In this embodiment, a pre-trained classification model can be used to perform object classification and detection on the image to be detected, and obtain the heatmap corresponding to the classification result.
[0033] In one alternative implementation, the pre-trained classification model can be pre-trained as follows: A sample image set is obtained, where each sample image includes at least one object, and each object has a corresponding classification label; the current sample image is input into an initial classification model, which outputs a target sample image, containing the classification result of at least one object; a loss function is constructed based on the difference between the classification result of at least one object and the classification label of at least one object in the current sample image, and the initial classification model is optimized based on the loss function; after each sample image in the sample image set has been processed by the initial classification model, the pre-trained classification model is obtained.
[0034] In this embodiment, the sample images used to train the classification model can be pre-labeled with classification labels. This embodiment only requires labeling the sample images with classification labels, which is simpler than labeling the segmentation mask in the prior art, saves manpower, and greatly improves the efficiency of the data labeling process.
[0035] In this embodiment, each sample image in the sample image set can be input into the initial classification model to train the initial classification model. Specifically, after processing the current sample image, the initial classification model outputs a target sample image containing the predicted classification result. A loss function is then constructed based on the difference between the predicted classification result of the target sample image and the pre-labeled classification label of the current sample image. The initial classification model is then optimized based on the loss function until all sample images in the sample image set have been processed, resulting in a pre-trained classification model. For example, the predicted classification result of the target sample image may include the object's category and the bounding box of the object's location.
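The training procedure described above can be sketched as follows. This is a minimal illustration only: it assumes a linear softmax classifier over feature vectors, a cross-entropy loss, and per-sample gradient descent, none of which is specified in the source. The names `train_classifier`, `predict`, `lr`, and `epochs` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def train_classifier(samples, labels, num_classes, lr=0.1, epochs=20, seed=0):
    """Per-sample gradient descent on a linear softmax classifier.

    The linear model is an assumed stand-in for the 'initial classification
    model'; the source does not name an architecture.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(num_classes, samples.shape[1]))
    b = np.zeros(num_classes)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = softmax(W @ x + b)       # predicted classification result
            grad = p.copy()              # gradient of the cross-entropy loss
            grad[y] -= 1.0               # w.r.t. the logits: p - onehot(y)
            W -= lr * np.outer(grad, x)  # optimize the model from the loss
            b -= lr * grad
    return W, b

def predict(W, b, x):
    return int(np.argmax(W @ x + b))
```

Only class labels are consumed during training, which mirrors the point made in [0034] that no segmentation masks need to be annotated.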
[0036] It should be noted that, when sample images are labeled with classification labels in this embodiment, if a sample image contains multiple target objects, a classification label can be assigned to each target object in the sample image. That is, one sample image can carry multiple classification labels. In this case, the classification model being trained is a multi-classification model.
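One common way to train with several labels per image is to encode the labels as a multi-hot vector and apply a per-class sigmoid cross-entropy loss. The sketch below assumes that formulation; the source does not state which loss is used, and `multi_hot` and `bce_with_logits` are hypothetical helper names.

```python
import numpy as np

def multi_hot(label_ids, num_classes):
    """Encode the set of class labels attached to one sample image."""
    target = np.zeros(num_classes)
    target[list(label_ids)] = 1.0
    return target

def bce_with_logits(logits, targets):
    """Mean sigmoid cross-entropy over classes (assumed loss, not from the source).

    Uses the numerically stable form max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    z = np.asarray(logits, dtype=float)
    t = np.asarray(targets, dtype=float)
    return float(np.mean(np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z)))))
```

With this encoding, an image containing both a bird and a car simply has two entries set to 1 in its target vector.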
[0037] Based on the pre-trained classification model mentioned above, extracting a heatmap of at least one target object in the image to be detected using the pre-trained classification model can include: inputting the image to be detected into the pre-trained classification model to obtain the classification result of at least one target object in the image to be detected, and displaying the classification result of the target object in the form of a heatmap.
[0038] For example, by inputting the image to be detected into the pre-trained classification model, a Grad-CAM heatmap corresponding to the image to be detected can be obtained, as shown in Figure 2.
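The Grad-CAM computation itself can be illustrated with a short sketch. It assumes that the feature maps of a convolutional layer and the gradients of the chosen class score with respect to those feature maps have already been obtained from the classification model; how they are extracted depends on the framework and is not covered by the source. The function name `grad_cam` is hypothetical, and upsampling the map to the image size is omitted.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's feature maps.

    activations: array (K, H, W), feature maps of a convolutional layer.
    gradients:   array (K, H, W), d(class score) / d(activations).
    Returns an (H, W) map scaled to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Pixels with high values in the returned map are those that contributed most to the classification result, which is why the map can serve as the response region.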
[0039] S120. Calculate the center point position of the response region.
[0040] In one optional implementation, calculating the center point position of the response region may include: obtaining the current response region; obtaining the target response region from the current response region, wherein the pixel value of each pixel in the target response region is greater than a preset pixel threshold; and calculating the center point position of the target response region.
[0041] Here, the current response region refers to the Grad-CAM heatmap currently being processed. The target response region refers to the heatmap mask corresponding to the current response region. The preset pixel threshold is the pixel cutoff value used when deriving the target response region from the current response region. The target response region can serve as the approximate location of the target object in the image to be detected.
[0042] In this embodiment, the pre-trained classification model processes the image to be detected to obtain a Grad-CAM heatmap (equivalent to the current response region). A preset pixel threshold is then used to filter the pixels of the Grad-CAM heatmap to obtain a heatmap mask (as shown in Figure 3, equivalent to the target response region). Specifically, the pixels of the Grad-CAM heatmap whose values are greater than the preset pixel threshold are used as the pixels of the heatmap mask. The center point position of the heatmap mask is then calculated.
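The thresholding and center point calculation can be sketched as below. The sketch assumes the heatmap is a 2D array scaled to [0, 1] and takes the center point to be the centroid of the mask pixels; the source specifies neither the threshold value nor the exact definition of the center, so the default of 0.5 and the names `heatmap_to_mask` and `mask_center` are illustrative assumptions.

```python
import numpy as np

def heatmap_to_mask(heatmap, threshold=0.5):
    """Keep the pixels whose value is greater than the preset pixel threshold."""
    return heatmap > threshold

def mask_center(mask):
    """Centroid of the mask as (x, y) in pixel coordinates, or None if empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

For a strongly non-convex mask the centroid can fall outside the mask, so an implementation might choose a different definition of the center; that choice is left open here.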
[0043] S130. Input the image to be detected, the response region, and the center point position into the preset large image segmentation model to obtain the position detection result of at least one target object in the image to be detected.
[0044] The preset large image segmentation model can refer to a model that takes prompts as input, performs image segmentation based on those prompts, and outputs the segmented image regions. For example, the preset large image segmentation model in this embodiment can be the open-source SAM large vision model.
[0045] Based on the optional implementation of S120 above, inputting the image to be detected, the response region, and the center point position into the preset large image segmentation model to obtain the position detection result of at least one target object in the image to be detected may include: inputting the image to be detected, the target response region, and the center point position of the target response region into the preset large image segmentation model to obtain the position detection result of at least one target object in the image to be detected.
[0046] In this embodiment, the image to be detected, its heatmap mask, and the center point position of the heatmap mask are specifically input as prompting information into a preset image segmentation model. This allows the preset image segmentation model to output the location detection result of the target object in the image to be detected using image segmentation techniques. For example, the location detection result of the target object in the image to be detected can be as shown in Figure 4, where a fine mask of the target object can be selected as the location detection result.
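The prompting step can be sketched as follows. The sketch makes two assumptions not stated in the source: the heatmap mask is summarized as a bounding box in (x_min, y_min, x_max, y_max) order, and the segmentation model is represented by a hypothetical callable `segment_fn` rather than a concrete library call. The names `build_prompts` and `locate` are also hypothetical.

```python
import numpy as np

def build_prompts(mask):
    """Turn a heatmap mask into a point prompt and a box prompt."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("heatmap mask is empty; no prompt can be built")
    # One foreground point at the mask centroid, in (x, y) pixel order.
    point_coords = np.array([[xs.mean(), ys.mean()]], dtype=np.float32)
    point_labels = np.array([1], dtype=np.int32)      # 1 marks a foreground point
    # Tight bounding box around the mask (assumed summary of the response region).
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)
    return {"point_coords": point_coords, "point_labels": point_labels, "box": box}

def locate(image, mask, segment_fn):
    """Pass the image and the prompts to the segmentation callable."""
    return segment_fn(image, **build_prompts(mask))
```

If the open-source `segment_anything` package is used, `segment_fn` could wrap `SamPredictor.set_image` followed by `SamPredictor.predict`, whose keyword arguments include `point_coords`, `point_labels`, and `box`; that wiring is left out of the sketch.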
[0047] Furthermore, the image to be detected, the target response region, and the center point position of the target response region are input into a preset image segmentation model to obtain the location detection result of at least one target object in the image to be detected. This can include: if the image to be detected includes one target object, inputting the image to be detected, the target response region, and the center point position of the target response region into the preset image segmentation model to obtain the location detection result of the target object in the image to be detected; if the image to be detected includes multiple target objects, inputting the image to be detected, the currently processed target response region, and the center point position of the currently processed target response region into the preset image segmentation model to obtain the location detection result of the currently processed target object in the image to be detected.
[0048] In this embodiment, the target object in the image to be detected can be one or more. When there is only one target object, the image to be detected, its heatmap mask, and the center point position of the heatmap mask can be directly input into a preset image segmentation model to obtain the target object's location detection result. When there are multiple target objects, each target object in the image to be detected can have a corresponding heatmap mask. Therefore, each heatmap mask and its corresponding center point position can be input into the preset image segmentation model for location detection.
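The multi-target case can be sketched as a loop over one heatmap per target object. The sketch assumes the heatmaps arrive as a dictionary keyed by a target name and that the segmentation model is a hypothetical callable `segment_fn(image, mask, center)`; the threshold default and the name `detect_all` are likewise assumptions.

```python
import numpy as np

def detect_all(image, heatmaps, segment_fn, threshold=0.5):
    """Run position detection once per target object.

    heatmaps: dict mapping a target name to its (H, W) heatmap.
    Returns a dict mapping each target name to the segmentation result.
    """
    results = {}
    for name, heatmap in heatmaps.items():
        mask = heatmap > threshold                     # target response region
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            continue                                   # no response for this target
        center = (float(xs.mean()), float(ys.mean()))  # center point as (x, y)
        results[name] = segment_fn(image, mask, center)
    return results
```

The single-target case is the same loop with a dictionary holding one entry.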
[0049] The technical solution of this embodiment extracts a heatmap of at least one target object in the image to be detected using a pre-trained classification model. The heatmap represents the response region of at least one target object to the classification result. The center point position of the response region is calculated. The image to be detected, the response region, and the center point position are input into a preset image segmentation model to obtain the position detection result of at least one target object in the image to be detected. By using the heatmap calculated by the classification model as the approximate position of the target object in the image, and further using the approximate position of the target object as the input to the image segmentation model to obtain the precise position of the target object, this method solves the problem that existing object position detection methods rely on manual annotation and are prone to inaccuracy. It reduces the dependence on manual annotation, improves the efficiency of the data annotation process, and improves the accuracy of the detection results.
[0050] Embodiment 2
[0051] Figure 5 is a structural schematic diagram of an object position detection device provided in Embodiment 2 of the present invention. As shown in Figure 5, the device includes a heatmap extraction module 210, a center point position calculation module 220, and a position detection result acquisition module 230, wherein:
[0052] The heatmap extraction module 210 is used to extract a heatmap of at least one target object in the image to be detected using a pre-trained classification model. The heatmap represents the response region of the at least one target object to the classification result.
[0053] The center point position calculation module 220 is used to calculate the center point position of the response region;
[0054] The position detection result acquisition module 230 is used to input the image to be detected, the response region, and the center point position into a preset large image segmentation model to obtain the position detection result of at least one target object in the image to be detected.
[0055] The technical solution of this invention utilizes a pre-trained classification model to extract a heatmap of at least one target object in the image to be detected. The heatmap represents the response region of at least one target object to the classification result. The center point position of the response region is calculated. The image to be detected, the response region, and the center point position are input into a preset image segmentation model to obtain the location detection result of at least one target object in the image to be detected. By using the heatmap calculated by the classification model as the approximate position of the target object in the image, and further using the approximate position of the target object as the input to the image segmentation model to obtain the precise position of the target object, this method solves the problem that existing object location detection methods rely on manual annotation, which is prone to inaccuracy. It reduces the reliance on manual annotation, improves the efficiency of the data annotation process, and enhances the accuracy of the detection results.
[0056] Optionally, the object location detection device further includes a classification model pre-training module, used for:
[0057] Obtain a sample image set, wherein each sample image in the sample image set includes at least one object, and the at least one object has a corresponding classification label;
[0058] The current sample image is input into the initial classification model, and the output is a target sample image, wherein the target sample image carries the classification result of the at least one object;
[0059] A loss function is constructed based on the difference between the classification result of the at least one object and the classification label of the at least one object in the current sample image, and the initial classification model is optimized based on the loss function;
[0060] After each sample image in the sample image set has been processed by the initial classification model, the pre-trained classification model is obtained.
[0061] Optionally, the heatmap extraction module 210 can be used for:
[0062] The image to be detected is input into the pre-trained classification model to obtain the classification result of at least one target object in the image to be detected, and the classification result of the target object is displayed in the form of a heatmap.
[0063] Optionally, the center point position calculation module 220 can be used for:
[0064] Obtain the current response region;
[0065] Obtain the target response region from the current response region, wherein the pixel value of each pixel in the target response region is greater than a preset pixel threshold;
[0066] Calculate the center point location of the target response region.
[0067] Optionally, the location detection result acquisition module 230 may include:
[0068] The location detection result acquisition unit is used to input the image to be detected, the target response region, and the center point position of the target response region into a preset large image segmentation model to obtain the location detection result of at least one target object in the image to be detected.
[0069] Optionally, the location detection result acquisition unit can be used for:
[0070] When the image to be detected includes one target object, the image to be detected, the target response region, and the center point position of the target response region are input into a preset large image segmentation model to obtain the position detection result of the target object in the image to be detected.
[0071] When the image to be detected includes multiple target objects, the image to be detected, the currently processed target response region, and the center point position of the currently processed target response region are input into a preset large image segmentation model to obtain the position detection result of the currently processed target object in the image to be detected.
[0072] The object position detection device provided in the embodiments of the present invention can execute the object position detection method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method execution.
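The three-module structure of the device can be sketched as a small class that chains three callables. This is only an illustration of the composition under the assumption that each module is a plain function; the class name `ObjectPositionDetector` and its field names are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ObjectPositionDetector:
    """Chains the three modules of the device (hypothetical structure)."""
    extract_heatmap: Callable[[Any], Any]     # role of module 210
    compute_center: Callable[[Any], Any]      # role of module 220
    segment: Callable[[Any, Any, Any], Any]   # role of module 230

    def detect(self, image):
        region = self.extract_heatmap(image)        # response region
        center = self.compute_center(region)        # center point position
        return self.segment(image, region, center)  # position detection result
```

Keeping the modules as injected callables lets the classification model or the segmentation model be swapped without changing the detection flow.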
[0073] Embodiment 3
[0074] Figure 6 shows a schematic diagram of an electronic device 300 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers or various forms of mobile devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the invention described and / or claimed herein.
[0075] As shown in Figure 6, the electronic device 300 includes at least one processor 301 and a memory, such as a read-only memory (ROM) 302 and a random access memory (RAM) 303, communicatively connected to the at least one processor 301. The memory stores computer programs executable by the at least one processor. The processor 301 can perform various appropriate actions and processes based on the computer program stored in the ROM 302 or loaded into the RAM 303 from storage unit 308. The RAM 303 can also store various programs and data required for the operation of the electronic device 300. The processor 301, ROM 302, and RAM 303 are interconnected via a bus 304. An input / output (I / O) interface 305 is also connected to the bus 304.
[0076] Multiple components in electronic device 300 are connected to I / O interface 305, including: input unit 306, such as keyboard, mouse, etc.; output unit 307, such as various types of displays, speakers, etc.; storage unit 308, such as disk, optical disk, etc.; and communication unit 309, such as network card, modem, wireless transceiver, etc. Communication unit 309 allows electronic device 300 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0077] Processor 301 can be any of a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 301 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 301 performs the various methods and processes described above, such as the object location detection method.
[0078] In some embodiments, the object location detection method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and / or installed onto electronic device 300 via ROM 302 and / or communication unit 309. When the computer program is loaded into RAM 303 and executed by processor 301, one or more steps of the object location detection method described above may be performed. Alternatively, in other embodiments, processor 301 may be configured to perform the object location detection method by any other suitable means (e.g., by means of firmware).
[0079] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A method for detecting the location of an object, characterized in that it comprises: extracting a heatmap of at least one target object in an image to be detected using a pre-trained classification model, wherein the heatmap represents the response region of the at least one target object to the classification result; calculating the center point position of the response region; and inputting the image to be detected, the response region, and the center point position into a preset image segmentation model to obtain the position detection result of the at least one target object in the image to be detected.
2. The method according to claim 1, characterized in that, The pre-trained classification model is pre-trained in the following manner: Obtain a sample image set, wherein each sample image in the sample image set includes at least one object, and the at least one object has a corresponding classification label; The current sample image is input into the initial classification model, and the output is a target sample image, wherein the target sample image carries the classification result of the at least one object; A loss function is constructed based on the difference between the classification result of the at least one object and the classification label of the at least one object in the current sample image, and the initial classification model is optimized based on the loss function; After each sample image in the sample image set has been processed by the initial classification model, the pre-trained classification model is obtained.
3. The method according to claim 1, characterized in that, Extracting heatmaps of at least one target object in an image to be detected using a pre-trained classification model, including: The image to be detected is input into the pre-trained classification model to obtain the classification result of at least one target object in the image to be detected, and the classification result of the target object is displayed in the form of a heatmap.
4. The method according to claim 1, characterized in that calculating the center point position of the response region comprises: obtaining the current response region; obtaining the target response region from the current response region, wherein the pixel value of each pixel in the target response region is greater than a preset pixel threshold; and calculating the center point position of the target response region.
5. The method according to claim 4, characterized in that, The image to be detected, the response region, and the center point position are input into a preset image segmentation model to obtain the location detection result of at least one target object in the image to be detected, including: The image to be detected, the target response region, and the center point position of the target response region are input into a preset image segmentation model to obtain the position detection result of at least one target object in the image to be detected.
6. The method according to claim 5, characterized in that inputting the image to be detected, the target response region, and the center point position of the target response region into a preset large image segmentation model to obtain the position detection result of at least one target object in the image to be detected comprises: when the image to be detected includes one target object, inputting the image to be detected, the target response region, and the center point position of the target response region into the preset large image segmentation model to obtain the position detection result of the target object in the image to be detected; and when the image to be detected includes multiple target objects, inputting the image to be detected, the currently processed target response region, and the center point position of the currently processed target response region into the preset large image segmentation model to obtain the position detection result of the currently processed target object in the image to be detected.
7. An object position detection device, characterized in that it comprises: a heatmap extraction module, used to extract a heatmap of at least one target object in an image to be detected using a pre-trained classification model, wherein the heatmap represents the response region of the at least one target object to the classification result; a center point position calculation module, used to calculate the center point position of the response region; and a position detection result acquisition module, used to input the image to be detected, the response region, and the center point position into a preset large image segmentation model to obtain the position detection result of the at least one target object in the image to be detected.
8. The apparatus according to claim 7, characterized in that, It also includes a classification model pre-training module, used for: Obtain a sample image set, wherein each sample image in the sample image set includes at least one object, and the at least one object has a corresponding classification label; The current sample image is input into the initial classification model, and the output is a target sample image, wherein the target sample image carries the classification result of the at least one object; A loss function is constructed based on the difference between the classification result of the at least one object and the classification label of the at least one object in the current sample image, and the initial classification model is optimized based on the loss function; After each sample image in the sample image set has been processed by the initial classification model, the pre-trained classification model is obtained.
9. An electronic device, characterized in that, The electronic device includes: At least one processor; and A memory communicatively connected to the at least one processor; wherein, The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the object location detection method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer instructions that cause a processor to execute the object location detection method according to any one of claims 1-7.