Target pre-labeling method, device, and medium

By performing depth recognition and segmentation masking on the initial results of the target detection algorithm using a depth model and a segmentation model, and combining this with a multimodal large model for secondary verification, the problem of improper confidence threshold setting during the pre-annotation process of the target detection algorithm is solved, thereby improving the accuracy of the detection results and the annotation efficiency.

CN122199899A · Pending · Publication Date: 2026-06-12 · JINAN BOGUAN INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JINAN BOGUAN INTELLIGENT TECH CO LTD
Filing Date
2024-12-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In the pre-annotation process, improper setting of the confidence threshold in existing object detection algorithms can lead to inaccurate detection results, affecting the quality and efficiency of the training dataset.

Method used

A depth model and a segmentation model are used to perform depth recognition and segmentation masking on the initial results of the target detection algorithm, and the outputs are combined with a multimodal large model for secondary verification to improve the accuracy of the detection results.

Benefits of technology

By combining depth models, segmentation models, and multimodal large models, the accuracy of object detection results is improved, the workload of manual correction is reduced, and the annotation quality and efficiency of the training dataset are improved.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a target pre-labeling method and device and a medium. The method comprises the following steps: performing target category detection on a target image based on a target detection algorithm to obtain an initial target detection result; performing depth recognition on the target image based on a depth model to obtain a depth map; performing target segmentation on the target image based on a segmentation model to obtain a segmentation mask map; wherein the initial target detection result comprises position information of a candidate prediction box, category information corresponding to the candidate prediction box and confidence; determining a target prediction box to be verified from the candidate prediction box according to the confidence; inputting the target image, the position information of the target prediction box, the depth map and the segmentation mask map into a multimodal large model to obtain a pre-labeling result of the target prediction box. The application improves the accuracy of the target detection result obtained by the target detection algorithm and improves the quality and efficiency of the overall training data set labeling.

Description

Technical Field

[0001] This invention relates to the field of deep learning technology, and in particular to a target pre-labeling method, apparatus, and medium.

Background Technology

[0002] Using deep learning object detection models to detect and identify objects is a mainstream technique in the field of machine vision. Its goal is to accurately locate and identify specific objects in images or videos. The key to accurate detection results lies in whether the object detection model can effectively learn the features of the target object, and high-quality data annotation is fundamental to achieving this goal. The training data for the object detection model needs to accurately label the location and category of the target in the image.

[0003] While manual annotation ensures accuracy, its efficiency is relatively low and cannot meet the needs of large-scale data processing. With the continuous improvement of object detection algorithm performance, a more efficient and accurate annotation method is gaining popularity: pre-annotating using object detection algorithms followed by manual correction. This method not only improves annotation efficiency but also guarantees annotation quality. For example, when pre-annotating a training dataset using object detection algorithms, a confidence threshold is typically set; annotations with confidence scores below this threshold are corrected manually.

[0004] However, if the confidence threshold is set too high, although it can ensure the accuracy of the detection boxes, it will cause some correct detection results to be missed, increasing the cost of manual correction later. Conversely, if the confidence threshold is set too low, it will introduce a large number of false detections, making the training dataset inaccurate and affecting the detection accuracy of the trained target detection model.

Summary of the Invention

[0005] This invention provides a target pre-labeling method, apparatus, and medium to solve the problem of inaccurate detection results when using target detection algorithms to pre-label target detection training datasets, thereby improving the efficiency of the overall labeling process.

[0006] According to one aspect of the present invention, a target pre-labeling method is provided, comprising:

[0007] The target image is subjected to target category detection based on the target detection algorithm to obtain an initial target detection result; the target image is subjected to depth recognition based on the depth model to obtain a depth map; the target image is subjected to target segmentation based on the segmentation model to obtain a segmentation mask map; wherein, the initial target detection result includes the position information of the candidate prediction box, the category information corresponding to the candidate prediction box, and the confidence score;

[0008] Based on the confidence level, a target prediction box to be verified is determined from the candidate prediction boxes;

[0009] The target image, the position information of the target prediction box, the depth map, and the segmentation mask map are input into a multimodal large model to obtain the pre-labeling result of the target prediction box.

[0010] According to another aspect of the present invention, a target pre-labeling device is provided, comprising:

[0011] The target image processing module is used to perform target category detection on the target image based on the target detection algorithm to obtain an initial target detection result; to perform depth recognition on the target image based on the depth model to obtain a depth map; and to perform target segmentation on the target image based on the segmentation model to obtain a segmentation mask map; wherein, the initial target detection result includes the position information of the candidate prediction box, the category information corresponding to the candidate prediction box, and the confidence level;

[0012] A target prediction box determination module is used to determine a target prediction box to be verified from the candidate prediction boxes based on the confidence level.

[0013] The large model recognition module is used to input the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into the multimodal large model to obtain the pre-labeling result of the target prediction box.

[0014] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising:

[0015] At least one processor; and

[0016] A memory communicatively connected to the at least one processor; wherein,

[0017] The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the target pre-labeling method according to any embodiment of the present invention.

[0018] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions that, when executed, cause a processor to implement the target pre-labeling method according to any embodiment of the present invention.

[0019] The technical solution of this invention verifies the accuracy of the predicted bounding boxes obtained by the object detection algorithm through depth models, segmentation models, and multimodal large models, thereby improving the accuracy of the object detection results obtained by the object detection algorithm, reducing the workload of manual correction after pre-labeling the training dataset of the object detection model, and thus improving the overall quality and efficiency of training dataset labeling.

[0020] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Description of the Drawings

[0021] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a flowchart of a target pre-labeling method provided according to an embodiment of the present invention;

[0023] Figure 2 is a flowchart of another target pre-labeling method provided according to an embodiment of the present invention;

[0024] Figure 3 is a flowchart of another target pre-labeling method provided according to an embodiment of the present invention;

[0025] Figure 4 is a schematic diagram of the structure of a target pre-labeling device according to an embodiment of the present invention;

[0026] Figure 5 is a schematic diagram of the structure of an electronic device that implements the target pre-labeling method of the present invention.

Detailed Implementation

[0027] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0028] It should be noted that the terms "candidate," "target," etc., used in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0029] Figure 1 is a flowchart of a target pre-labeling method provided by an embodiment of the present invention. The method is applicable to situations requiring pre-labeling of training datasets for target detection models. It can be executed by a target pre-labeling device, which can be implemented in hardware and/or software and configured on a server with computing power. As shown in Figure 1, the method includes:

[0030] S110. Target category detection is performed on the target image based on the target detection algorithm to obtain the initial target detection result; depth recognition is performed on the target image based on the depth model to obtain the depth map; target segmentation is performed on the target image based on the segmentation model to obtain the segmentation mask map.

[0031] The initial target detection results include the location information of the candidate prediction boxes, the category information corresponding to the candidate prediction boxes, and the confidence level.

[0032] Here, "object detection algorithm" refers to existing object detection algorithms, such as traditional object detection algorithms or open-source object detection models; "object image" refers to the image to be detected, which can be an image with detection requirements or a sample image in the training dataset of the object detection model to be trained; and "object category" is the category of the object to be detected by the object detection model to be trained.

[0033] The depth model is used to identify the distance between the target and the image acquisition device corresponding to each pixel in the target image. In the depth map, a pixel value closer to 1 indicates that the target is close to the image acquisition device and located in the foreground; a pixel value closer to 0 indicates that the target is far from the image acquisition device and located in the background. This embodiment of the invention does not limit the depth model used. The segmentation model is used to identify and segment the target to be detected in the target image. The segmentation mask map includes masks corresponding to all targets to be detected. The targets to be detected are the objects to be detected by the target detection model to be trained. This embodiment of the invention does not limit the segmentation model used.

[0034] Specifically, the target objects and corresponding target categories of the target detection model to be trained are pre-determined, as is the training dataset for the model. Each sample image in the training dataset is used as a target image, and the target image is detected using an object detection algorithm. The location information, category information, and confidence score of the candidate predicted bounding boxes corresponding to the detected objects are obtained. Simultaneously, the target images are input into a depth model and a segmentation model. The output of the depth model is the depth map corresponding to the target image, and the output of the segmentation model is the segmentation mask map corresponding to the target image.

[0035] S120. Determine the target prediction box to be verified from the candidate prediction boxes based on the confidence level.

[0036] Since the confidence level reflects how accurately the object detection algorithm has detected each candidate prediction box, candidate prediction boxes with low detection accuracy need to undergo secondary verification to confirm the accuracy of the detection results. Specifically, a confidence threshold is determined, and candidate prediction boxes with confidence levels below the threshold are taken as the target prediction boxes to be verified.

[0037] Since the accuracy of the target prediction boxes undergoes secondary verification by a multimodal large model, the confidence threshold does not need to be chosen strictly. To improve the accuracy of the detection results, a higher confidence threshold can be set, so that secondary verification is performed on most of the candidate prediction boxes in the initial target detection results obtained by the target detection algorithm. Optionally, all candidate prediction boxes can be determined as target prediction boxes, that is, the depth model, segmentation model, and multimodal large model are used to perform secondary verification on all candidate prediction boxes to ensure the accuracy of the final pre-labeling results.
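As a rough illustration of S120, the selection can be sketched in Python as a simple split on the confidence threshold. The `Candidate` record, the `(x_min, y_min, x_max, y_max)` box layout, and the threshold value 0.6 are assumptions made for this example and are not specified in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate prediction box (hypothetical record layout)."""
    box: tuple          # assumed layout: (x_min, y_min, x_max, y_max)
    category: str
    confidence: float


def select_for_verification(candidates, threshold):
    """Split candidates into boxes to verify (confidence below the threshold)
    and boxes accepted without secondary verification."""
    to_verify = [c for c in candidates if c.confidence < threshold]
    accepted = [c for c in candidates if c.confidence >= threshold]
    return to_verify, accepted


candidates = [
    Candidate((10, 10, 50, 80), "person", 0.92),
    Candidate((40, 30, 120, 90), "bicycle", 0.41),
]

# Only the low-confidence box is sent to secondary verification.
to_verify, accepted = select_for_verification(candidates, threshold=0.6)

# An infinite threshold corresponds to the optional case of verifying every box.
verify_all, accepted_none = select_for_verification(candidates, threshold=float("inf"))
```

Raising the threshold moves more boxes into `to_verify`, which trades extra model calls for fewer unverified labels.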

[0038] Optionally, depth recognition is performed on the target image based on a depth model to obtain a depth map; target segmentation is performed on the target image based on a segmentation model to obtain a segmentation mask map, including:

[0039] The target detection region is determined in the target image based on the position information of the target prediction box. For example, the target detection region is the region that extends outward from the edge of the target prediction box by a preset distance. The depth of the target detection region is recognized based on the depth model to obtain a depth map, and the target detection region is segmented based on the segmentation model to obtain a segmentation mask map.

[0040] Since only the target prediction boxes with low confidence in the initial target detection results need to be verified by the multimodal large model, the other prediction boxes with high confidence are considered to have been labeled accurately by the target detection algorithm and do not need to be input into the multimodal large model. Therefore, depth maps and segmentation mask maps are obtained only for the regions corresponding to the target prediction boxes to be verified, which reduces the consumption of computing power and improves the efficiency of target pre-labeling.
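The region described above, obtained by extending the target prediction box outward by a preset distance, might be computed as in the following sketch. The `(x_min, y_min, x_max, y_max)` box layout, the `margin` parameter name, and clamping the region to the image bounds are assumptions for illustration.

```python
def expand_box(box, margin, image_width, image_height):
    """Extend a box outward by `margin` pixels on every side.

    Assumption: the region is clamped to the image so that the depth and
    segmentation models are never asked to process out-of-bounds pixels.
    """
    x_min, y_min, x_max, y_max = box
    return (
        max(0, x_min - margin),
        max(0, y_min - margin),
        min(image_width, x_max + margin),
        min(image_height, y_max + margin),
    )


# A box near the right and bottom edges of a 128x96 image.
region = expand_box((40, 30, 120, 90), margin=16, image_width=128, image_height=96)
print(region)  # (24, 14, 128, 96)
```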

[0041] S130. Input the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into the multimodal large model to obtain the pre-labeling result of the target prediction box.

[0042] Multimodal large models refer to large-scale machine learning models capable of processing and understanding multiple types of data (such as text, images, and audio). By integrating information from different modalities, these models can more comprehensively understand complex scenes, providing richer and more accurate analytical results, and are suitable for various tasks such as image description, visual question answering, and text generation. For example, when processing image content description tasks, multimodal large models can not only identify objects in the image but also understand the relationships between these objects, and even generate more natural and fluent descriptions by combining contextual information.

[0043] The depth map and the segmentation mask map also carry the feature relationships between the detected object and the target category assigned to the target prediction box, including the front-back and left-right spatial relationships around the target prediction box. For this reason, the target image, the position information of the target prediction box, the depth map, and the segmentation mask map are input into the multimodal large model. The multimodal large model identifies these feature relationships and further judges the detection accuracy of the target prediction box, and the pre-labeling result of the target prediction box is determined based on the output of the multimodal large model.

[0044] In a feasible embodiment, the target image, the position information of the target prediction box, the depth map, and the segmentation mask map are input into a multimodal large model to obtain the pre-labeling result of the target prediction box, including:

[0045] The target image, the location information of the target prediction box, the depth map, and the segmentation mask map are input into the multimodal large model. The output of the multimodal large model is whether the target prediction box matches the corresponding category information successfully.

[0046] If the target prediction box matches the corresponding category information, the pre-labeling result of the target prediction box is determined to be that the detection information of the target prediction box obtained by the target detection algorithm is accurate.

[0047] If the target prediction box fails to match the corresponding category information, the pre-labeling result of the target prediction box is determined to be that the detection information of the target prediction box obtained by the target detection algorithm is incorrect.

[0048] The pre-labeling result of the target prediction box is determined based on the output of the multimodal large model. The pre-labeling result is one of two outcomes: the detection information of the target prediction box obtained by the target detection algorithm is accurate, or that detection information is incorrect. The detection information of the target prediction box includes the position information and category information of the target prediction box obtained by the target detection algorithm.

[0049] The output of the multimodal large model is whether the target prediction box matches the category information corresponding to the target prediction box in the initial target detection result. If the match is successful, it is determined that the pre-labeling result of the target prediction box by the target detection algorithm is accurate, that is, the position information and category information of the target prediction box obtained by the target detection algorithm are accurate. If the match fails, it is determined that the pre-labeling result of the target prediction box by the target detection algorithm is inaccurate, that is, the position information and / or category information of the target prediction box obtained by the target detection algorithm are incorrect, and the target image is determined to be an image that needs to be manually re-labeled.
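The mapping from the model output to a pre-labeling result can be sketched as follows. This assumes the multimodal large model's answer has already been reduced to a boolean `matched`; the dictionary keys and status strings are hypothetical names chosen for the example.

```python
def pre_label_result(detection, matched):
    """Turn the match/no-match output into a pre-labeling record.

    `detection` holds the box position and category from the detection
    algorithm. A failed match flags the image for manual re-labeling.
    """
    if matched:
        return {**detection, "status": "accurate", "needs_manual_relabel": False}
    return {**detection, "status": "incorrect", "needs_manual_relabel": True}


detection = {"box": (40, 30, 120, 90), "category": "bicycle"}
ok = pre_label_result(detection, matched=True)
bad = pre_label_result(detection, matched=False)
```

Keeping the original detection fields in the record means a reviewer can see what the detector proposed when correcting a flagged image.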

[0050] For example, textual description information is determined based on the location information of the target prediction box, the depth map, and the segmentation mask map. This textual description information is used to describe the location information of the target prediction box, the depth features corresponding to the target prediction box in the depth map, and the mask features corresponding to the target prediction box in the segmentation mask map. The target image and textual description information are input into the multimodal large model to obtain the output of the multimodal large model.
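One possible way to build such textual description information is sketched below. The prompt wording, the use of the mean depth value and the mask coverage as the "depth features" and "mask features", and the nested-list image representation are all assumptions for illustration; the application does not specify them.

```python
def describe_box(box, category, depth_map, mask_map):
    """Summarize a box's position, depth, and mask features as text.

    depth_map and mask_map are row-major nested lists indexed [y][x];
    box is (x_min, y_min, x_max, y_max) with exclusive maxima (assumed).
    """
    x_min, y_min, x_max, y_max = box
    coords = [(x, y) for y in range(y_min, y_max) for x in range(x_min, x_max)]
    if not coords:
        raise ValueError("box must cover at least one pixel")
    mean_depth = sum(depth_map[y][x] for x, y in coords) / len(coords)
    coverage = sum(mask_map[y][x] for x, y in coords) / len(coords)
    return (
        f"Box {box} is labeled '{category}'. "
        f"Mean depth value inside the box: {mean_depth:.2f} "
        "(closer to 1 means nearer the camera). "
        f"Mask coverage inside the box: {coverage:.0%}. "
        "Does the box match its category? Answer yes or no."
    )


depth_map = [[0.8, 0.8, 0.1, 0.1], [0.8, 0.8, 0.1, 0.1], [0.1] * 4, [0.1] * 4]
mask_map = [[1, 1, 0, 0], [1, 0, 0, 0], [0] * 4, [0] * 4]
text = describe_box((0, 0, 2, 2), "person", depth_map, mask_map)
```

The resulting string would be passed to the multimodal large model together with the target image.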

[0051] By performing secondary detection verification using a multimodal large model, the target prediction boxes with low confidence in the initial target detection results obtained by the target detection algorithm are re-detected. If the detected target prediction box is correct, manual correction of the target prediction box is avoided; if the detected target prediction box is incorrect, manual correction is required. This reduces the amount of manual correction after the target detection algorithm pre-labels the training dataset, thereby improving the overall labeling efficiency of the training dataset.

[0052] Before pre-labeling, the categories to be labeled must first be determined. Before the object detection algorithm is run, the detection category information and the confidence value are set; after the target image is input, candidate prediction boxes are obtained. Targets in real images often occlude one another, especially in dense scenes. For example, when the target image contains a bicycle in the background and a person in the foreground, and their bounding boxes are close together or overlap, inputting only the target prediction box and the target image into the multimodal large model may lead to misrecognition: the model may produce a logically flawed description such as "a person is riding a bicycle". This is because the category and position information of a prediction box convey only the relative positions of targets in the two-dimensional image plane and do not capture three-dimensional spatial context such as depth. Depth information provides the front-to-back relationship between targets. Therefore, this embodiment of the invention uses a segmentation model and a depth model to obtain the segmentation mask map and the depth map of the target image, respectively, and combines them for secondary verification of the target prediction box, which improves the accuracy of data pre-labeling by the object detection algorithm.
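The front-to-back relationship in the person and bicycle example can be illustrated by comparing the mean depth value inside each target's mask, using the convention stated earlier that values closer to 1 are nearer the image acquisition device. Averaging over the mask and the function names below are assumptions for this sketch, not steps prescribed by the application.

```python
def mean_depth_in_mask(depth_map, mask):
    """Average the depth values of the pixels where the mask is nonzero."""
    values = [
        d
        for depth_row, mask_row in zip(depth_map, mask)
        for d, m in zip(depth_row, mask_row)
        if m
    ]
    if not values:
        raise ValueError("mask contains no pixels")
    return sum(values) / len(values)


def nearer_target(depth_map, masks):
    """Return the name of the target with the largest mean depth value,
    i.e. the one closest to the camera under the 0-to-1 convention."""
    return max(masks, key=lambda name: mean_depth_in_mask(depth_map, masks[name]))


# Left half of the image is near the camera, right half is far away.
depth_map = [[0.9, 0.9, 0.2, 0.2], [0.9, 0.9, 0.2, 0.2]]
person_mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
bicycle_mask = [[0, 0, 1, 1], [0, 0, 1, 1]]

front = nearer_target(depth_map, {"person": person_mask, "bicycle": bicycle_mask})
print(front)  # person
```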

[0053] The technical solution of this invention verifies the accuracy of the predicted bounding boxes obtained by the object detection algorithm through depth models, segmentation models, and multimodal large models, thereby improving the accuracy of the object detection results obtained by the object detection algorithm, reducing the workload of manual correction after pre-labeling the training dataset of the object detection model, and thus improving the overall quality and efficiency of training dataset labeling.

[0054] Figure 2 is a flowchart of another target pre-labeling method provided by an embodiment of the present invention. This embodiment verifies the accuracy of the target prediction boxes on the basis of the above embodiments. As shown in Figure 2, the method includes:

[0055] S210. Target category detection is performed on the target image based on the target detection algorithm to obtain the initial target detection result; depth recognition is performed on the target image based on the depth model to obtain the depth map; target segmentation is performed on the target image based on the segmentation model to obtain the segmentation mask map.

[0056] The initial target detection results include the location information of the candidate prediction boxes, the category information corresponding to the candidate prediction boxes, and the confidence level.

[0057] S220. Determine the target prediction box to be verified from the candidate prediction boxes based on the confidence level.

[0058] S230. Determine the overlap value between the target prediction box and the corresponding target mask region based on the position information of the target prediction box and the segmentation mask image.

[0059] The overlap value characterizes the degree of overlap between the target prediction box and the corresponding target mask region. For example, the overlap value is determined based on the area ratio of the target prediction box to the corresponding target mask region. The target mask region is either the region of the segmentation mask image at the position of the target prediction box, or the region of the segmentation mask image occupied by the target mask corresponding to the target prediction box. Because the segmentation mask image contains the target segmentation results, its reliability is relatively high. Therefore, the position of the ground-truth bounding box can be determined from the segmentation mask image, and the overlap value between the target prediction box and the corresponding target mask region characterizes the degree of offset between the target prediction box and the corresponding ground-truth bounding box.

[0060] Specifically, based on the location information of the target predicted bounding box, the target mask bounding box corresponding to the target predicted bounding box in the segmentation mask image is determined, and the area where the target mask bounding box is located is determined as the target mask region; the overlap value is determined based on the relevant information of the target mask region and the target predicted bounding box. For example, the overlap value is determined based on the distance between the center points of the target mask region and the target predicted bounding box.

[0061] In one feasible embodiment, S230 includes:

[0062] Determine the target mask region in the segmentation mask image that corresponds to the target prediction box based on the location information of the target prediction box;

[0063] Determine the mask area based on the mask value in the target mask region;

[0064] The overlap value is determined based on the ratio of the mask area to the area of the target prediction box.

[0065] In this embodiment, the overlap value is determined based on the area offset information of the target mask region and the target prediction box.

[0066] Specifically, for all target prediction bounding boxes, the area ratio between the target mask region and the target prediction bounding box is determined using the following formula:

[0067] $$r = \frac{\sum_{i \in \text{bbox\_region}} \text{mask}(i)}{\text{bbox\_area}}$$

[0068] Where r represents the area ratio of the target mask region to the target prediction box, which serves as the overlap value between the target prediction box and the corresponding target mask region; bbox_region represents the target mask region; mask(i) represents the mask value of the i-th point in the target mask region; and bbox_area represents the area of the target prediction box.
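The ratio r defined above can be computed as in this minimal sketch. It assumes a binary mask stored as a row-major nested list, takes bbox_region to be the pixels inside the target prediction box, and uses an `(x_min, y_min, x_max, y_max)` box with exclusive maxima; these conventions are illustrative choices.

```python
def overlap_ratio(mask_map, box):
    """Compute r = sum of mask values inside the box / area of the box."""
    x_min, y_min, x_max, y_max = box
    bbox_area = (x_max - x_min) * (y_max - y_min)
    if bbox_area <= 0:
        raise ValueError("box must have positive area")
    mask_sum = sum(
        mask_map[y][x]
        for y in range(y_min, y_max)
        for x in range(x_min, x_max)
    )
    return mask_sum / bbox_area


# A 2x2 target mask in the middle of a 4x4 image.
mask_map = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

loose = overlap_ratio(mask_map, (0, 0, 4, 4))  # box much larger than the mask
tight = overlap_ratio(mask_map, (1, 1, 3, 3))  # box fits the mask exactly
print(loose, tight)  # 0.25 1.0
```

A box that is too large or shifted off the target yields a small r, which is the case S240 treats as needing position correction.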

[0069] S240. If the overlap value is less than the preset overlap threshold, the position information of the target prediction box is corrected according to the segmentation mask image.

[0070] Since the overlap value between the predicted target bounding box and its corresponding target mask region can characterize the degree of offset between the predicted target bounding box and its corresponding ground truth bounding box, a larger overlap value indicates a smaller offset between them. If the overlap value is less than a preset overlap threshold, it is determined that the offset between the predicted target bounding box and the ground truth bounding box is large, indicating that the predicted target bounding box's position is inaccurate and needs to be corrected. Because the segmentation results in the segmentation mask image are relatively accurate, the position information of the predicted target bounding box is corrected based on the target mask corresponding to it in the segmentation mask image to improve the accuracy of the corrected position information.

[0071] Specifically, if the overlap value of the target prediction box is greater than or equal to the preset overlap threshold, the offset between the target prediction box and the ground-truth bounding box of the detected target is small and can be ignored. That is, the position information of the target prediction box is highly accurate, does not need to be corrected, and can be used directly as data for subsequent verification. If the overlap value is less than the preset overlap threshold, the position information of the target prediction box is corrected according to the position information of the corresponding target mask in the segmentation mask image. The corrected position information of the target prediction box is then used as data for subsequent verification.

[0072] For example, based on the position information of the target prediction box, a target mask corresponding to the target prediction box is determined in the segmentation mask image. The positions of the target mask and the target prediction box are then combined, for example by weighted averaging, to obtain the corrected position information of the target prediction box. The preset overlap threshold can be set according to the actual scenario and is not limited here.

[0073] In one feasible embodiment, the position information of the target prediction box is corrected based on the segmentation mask image, including:

[0074] Based on the location information of the target prediction box, determine the target mask corresponding to the target prediction box in the segmentation mask image, and determine the minimum bounding rectangle of the target mask;

[0075] The position of the centroid of the minimum bounding rectangle is determined based on the position information of the minimum bounding rectangle and the corresponding mask value, and is used as the center position of the corrected prediction box.

[0076] The corrected prediction box width is determined by the weighted sum of the width of the minimum bounding rectangle and the width of the target prediction box;

[0077] The corrected prediction box height is determined by the weighted sum of the height of the minimum bounding rectangle and the height of the target prediction box.

[0078] The position information of the corrected target prediction box is determined based on the center position, width, and height of the corrected prediction box.

[0079] The position information of the target prediction box is corrected by a weighted blend of the position information of the minimum bounding rectangle corresponding to the target mask and the position information of the target prediction box.

[0080] Based on the location information of the target prediction box, determine the complete mask of the target object at the corresponding position in the segmentation mask image, which is the target mask. Determine the center position of the corrected prediction box based on the center point of the minimum bounding rectangle of the target mask. Then, determine the size information of the corrected prediction box based on the weighted sum of the size information of the minimum bounding rectangle and the size information of the target prediction box.

[0081] Specifically, the center position of the corrected prediction box is determined according to the following formula:

[0082] $$C_x = \frac{\sum_i x_i m_i}{\sum_i m_i}, \qquad C_y = \frac{\sum_i y_i m_i}{\sum_i m_i}$$

[0083] where $(C_x, C_y)$ represents the coordinates of the center position of the corrected prediction box, $x_i$ and $y_i$ represent the x-coordinate and y-coordinate of the i-th pixel in the region corresponding to the minimum bounding rectangle, and $m_i$ represents the mask value of the i-th pixel.

[0084] The corrected predicted bounding box width and height are determined using the following formulas:

[0085] $\text{width}_{new} = \beta \cdot \text{width}_{mask} + (1-\beta) \cdot \text{width}_{box}$;

[0086] $\text{height}_{new} = \beta \cdot \text{height}_{mask} + (1-\beta) \cdot \text{height}_{box}$;

[0087] where $\text{width}_{new}$ and $\text{height}_{new}$ represent the width and height of the corrected prediction box, $\text{width}_{mask}$ and $\text{height}_{mask}$ represent the width and height of the minimum bounding rectangle, $\text{width}_{box}$ and $\text{height}_{box}$ represent the width and height of the target prediction box, and $\beta$ represents the correction weight of the minimum bounding rectangle, which can be set according to the actual situation. Because the segmentation mask is usually more accurate than the target prediction box obtained by the detection algorithm, the segmentation mask is given priority: the weight of the minimum bounding rectangle corresponding to the segmentation mask is greater than that of the original target prediction box. For example, setting $\beta$ to 0.6 does not completely discard the position information of the original target prediction box but gives more weight to the position information of the target mask. This is suitable for most cases and avoids large shifts in the corrected prediction box.

[0088] Finally, the position coordinates of the corrected target prediction box are determined based on the center position, width, and height of the corrected prediction box.
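The correction steps above can be sketched as follows. This is a rough illustration, not the patented implementation: it assumes the mask has already been restricted to the single target object, that the minimum bounding rectangle is axis-aligned, that coordinates are pixel indices, and that boxes are given as (x1, y1, x2, y2). The function name `correct_box` is hypothetical; the default weight of 0.6 is the example value from the text.

```python
def correct_box(mask, box, beta=0.6):
    """Blend a detector box with the minimum bounding rectangle of its mask.

    mask: 2D list of mask values indexed as mask[y][x], assumed to contain
          only the target object (all other positions are 0).
    box:  (x1, y1, x2, y2) of the original target prediction box.
    beta: weight of the mask rectangle; 0.6 is the example value in the text.
    """
    pixels = [(x, y, m) for y, row in enumerate(mask)
              for x, m in enumerate(row) if m > 0]
    if not pixels:
        return box  # assumption: with no mask pixels, keep the original box

    xs = [x for x, _, _ in pixels]
    ys = [y for _, y, _ in pixels]
    # Axis-aligned minimum bounding rectangle of the mask (inclusive extents).
    width_mask = max(xs) - min(xs) + 1
    height_mask = max(ys) - min(ys) + 1

    # Mask-weighted centroid; zero-valued pixels inside the rectangle
    # contribute nothing, so summing over mask pixels is equivalent.
    total = sum(m for _, _, m in pixels)
    cx = sum(x * m for x, _, m in pixels) / total
    cy = sum(y * m for _, y, m in pixels) / total

    x1, y1, x2, y2 = box
    width_new = beta * width_mask + (1 - beta) * (x2 - x1)
    height_new = beta * height_mask + (1 - beta) * (y2 - y1)
    return (cx - width_new / 2, cy - height_new / 2,
            cx + width_new / 2, cy + height_new / 2)
```

For a 2x2 mask centered at (2.5, 2.5) and an original 4x4 box, the blended width and height are 0.6 × 2 + 0.4 × 4 = 2.8, so the corrected box is centered on the mask centroid and sits between the two sizes.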

[0089] S250. Input the target image, the location information of the target prediction box, the depth map and the segmentation mask map into the multimodal large model to obtain the pre-calibration result of the target prediction box.

[0090] In step S250, the location information of the target prediction box is the corrected location information of the target prediction box.

[0091] The technical solution of this invention determines the overlap value between the target prediction box and the corresponding target mask region based on the position information of the target prediction box and the segmentation mask image. When the offset of the original target prediction box is large based on the overlap value, the position information of the target prediction box is corrected based on the segmentation mask image, thereby improving the accuracy of the position information of the target prediction box and thus improving the accuracy of the secondary verification results through the large model.

[0092] Figure 3 is a flowchart of another target pre-calibration method provided by an embodiment of the present invention. This embodiment further refines the features input into the multimodal large model in the above embodiments. As shown in Figure 3, the method includes:

[0093] S310. Target category detection is performed on the target image based on the target detection algorithm to obtain the initial target detection result; depth recognition is performed on the target image based on the depth model to obtain the depth map; target segmentation is performed on the target image based on the segmentation model to obtain the segmentation mask map.

[0094] The initial target detection results include the location information of the candidate prediction boxes, the category information corresponding to the candidate prediction boxes, and the confidence level.

[0095] S320. Determine the target prediction box to be verified from the candidate prediction boxes based on the confidence level.

[0096] S330. Determine the target depth information based on the target prediction bounding box location information and the depth map.

[0097] Among them, the target depth information represents the depth information of the target object corresponding to the target prediction box in the depth map.

[0098] Specifically, the depth region corresponding to the target prediction box in the depth map is determined based on the location information of the target prediction box, and the target depth information is determined based on a statistic of the depth values in the depth region, such as the average or median of the depth values in the depth region.

[0099] In one feasible embodiment, S330 includes:

[0100] Smooth the depth map to obtain a smoothed depth map;

[0101] Determine the target depth region corresponding to the target prediction box in the smoothed depth map based on the location information of the target prediction box;

[0102] Invalid depth values are removed from the target depth region to obtain a set of valid depth values;

[0103] Target depth information is determined based on the average value of the effective depth value set.

[0104] To improve the stability of the depth information in the depth map and reduce the influence of points with abnormal depth values, a smoothing operation, such as Gaussian smoothing, is performed on the depth map to obtain a smoothed depth map. Based on the location information of the target prediction box, the set of depth values corresponding to the target depth region is extracted from the smoothed depth map. Invalid depth values in the set are removed to obtain the set of valid depth values, and the average of that set is used as the target depth information. Which depth values count as invalid is determined by the actual scene. A depth value closer to 1 indicates that the target at that pixel is close to the image acquisition device and located in the foreground, while a depth value closer to 0 indicates that the target is far from the image acquisition device and located in the background. A depth value of 0 is therefore treated as invalid, that is, the background is ignored, which further improves how accurately the target depth information represents the depth of the target depth region.

[0105] Specifically, the target depth information is determined according to the following formula:

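[0106] $$D_{avg} = \frac{1}{|S_{valid}|} \sum_{(i,j) \in S_{valid}} D'(i,j)$$

[0107] where $D_{avg}$ represents the target depth information, $D'(i,j)$ represents the depth value of pixel $(i,j)$ in the target depth region of the smoothed depth map, and $S_{valid}$ represents the set of valid depth values, that is, the pixels of the target depth region whose depth value is valid.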
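The depth computation of S330 can be sketched in plain Python as follows. This is a sketch under assumptions: the depth map is a nested list of values in [0, 1], the smoothing is a small separable 1-2-1 kernel standing in for the Gaussian smoothing mentioned above, a value of exactly 0 is the only invalid value, and the box is (x1, y1, x2, y2) with exclusive upper bounds. The function names are hypothetical.

```python
def smooth3x3(depth):
    """Separable 1-2-1 blur (a small Gaussian-like kernel), edges replicated."""
    h, w = len(depth), len(depth[0])

    def at(img, y, x):
        # Clamp indices so border pixels reuse their nearest neighbour.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    horiz = [[(at(depth, y, x - 1) + 2 * at(depth, y, x) + at(depth, y, x + 1)) / 4
              for x in range(w)] for y in range(h)]
    return [[(at(horiz, y - 1, x) + 2 * at(horiz, y, x) + at(horiz, y + 1, x)) / 4
             for x in range(w)] for y in range(h)]


def target_depth(depth, box):
    """Mean of the valid (non-zero) smoothed depth values inside the box.

    Returns None when the region holds no valid value (an assumption; the
    text does not say how this case is handled).
    """
    x1, y1, x2, y2 = box
    smoothed = smooth3x3(depth)
    valid = [smoothed[y][x] for y in range(y1, y2) for x in range(x1, x2)
             if smoothed[y][x] > 0]
    return sum(valid) / len(valid) if valid else None
```

Because smoothing runs before invalid values are removed, as in the step order above, a zero pixel next to non-zero neighbours becomes non-zero after smoothing and is then counted as valid.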

[0108] S340. Determine the target proportion information based on the target prediction bounding box position information and the segmentation mask image.

[0109] Among them, the target scale information represents the scale information of the target object corresponding to the target prediction box relative to the original image.

[0110] Specifically, the target mask or target mask region corresponding to the target prediction box is determined in the segmentation mask map based on the position information of the target prediction box, and the target proportion information is determined based on the ratio of the size of the target mask or target mask region to the size of the segmentation mask map.

[0111] In one feasible embodiment, S340 includes:

[0112] Determine the target mask corresponding to the target prediction box in the segmentation mask image based on the location information of the target prediction box;

[0113] The target proportion information is determined based on the ratio of the target mask to the size of the segmented mask image.

[0114] Based on the location information of the target prediction box, the mask of the target object corresponding to the target prediction box in the segmentation mask image is determined as the target mask. The mask values of all positions in the segmentation mask image other than the target mask are set to 0 to obtain the target mask image, that is, the target mask image includes only the one mask corresponding to the target prediction box.

[0115] The target proportion information is determined based on the proportion occupied by the target mask in the target mask image, such as by using the following formula:

[0116] $$R = \frac{\sum_{i=1}^{H} \sum_{j=1}^{W} M(i,j)}{H \times W}$$

[0117] where $R$ represents the target scale information, $H$ and $W$ are the height and width of the target mask image, i.e. the height and width of the target image, and $M(i,j)$ represents the mask value of point $(i,j)$ in the target mask image.
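The two steps of S340 can be sketched as follows. The text does not say how the segmentation mask map encodes individual objects, so this sketch assumes an integer instance-label map with 0 as background and picks the label that occurs most often inside the box. Both function names are hypothetical.

```python
from collections import Counter


def isolate_target_mask(label_map, box):
    """Keep only the object that dominates the box; zero out everything else.

    label_map: 2D list of integer instance labels, 0 = background (assumed).
    box:       (x1, y1, x2, y2), upper bounds exclusive (assumed).
    """
    x1, y1, x2, y2 = box
    counts = Counter(label_map[y][x]
                     for y in range(y1, y2) for x in range(x1, x2)
                     if label_map[y][x] != 0)
    if not counts:
        return [[0] * len(row) for row in label_map]
    target_label = counts.most_common(1)[0][0]
    return [[1 if v == target_label else 0 for v in row] for row in label_map]


def target_ratio(target_mask):
    """R: sum of mask values divided by H * W of the target mask image."""
    h, w = len(target_mask), len(target_mask[0])
    return sum(sum(row) for row in target_mask) / (h * w)
```

For a 4x5 image in which the selected object covers 4 pixels, the ratio is 4 / 20 = 0.2, regardless of how many other objects appear in the label map.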

[0118] S350. Input the target image, the position information of the target prediction box, the target depth information, and the target scale information into the multimodal large model to obtain the pre-calibration result of the target prediction box.

[0119] If the depth map and segmentation mask map are directly input into the multimodal large model, the multimodal large model will be confused because the depth map and segmentation mask map also include features corresponding to other non-target prediction boxes. In other words, invalid information will be input into the multimodal large model, which will easily lead to inaccurate output results of the multimodal large model.

[0120] Therefore, in this embodiment, the target depth information and target proportion information are obtained by processing the depth map and segmentation mask map. The complex depth map, segmentation mask map and prediction box coordinates are processed into easily describable and understandable feature information and provided to the multimodal large model, which can better help the multimodal large model understand the image features and improve the accuracy of the verification results.

[0121] Target depth information and target scale information are parameters that influence the correspondence between the pixel information within the target prediction box and the target category information. For example, target depth information characterizes the front-to-back context of the target in three-dimensional space, and target scale information characterizes the relative size of the target in the two-dimensional image. Using target depth information and target scale information to simplify the target features of the target prediction box improves the accuracy of the subsequent verification of the target prediction box.

[0122] For example, text description information is determined based on the location information, depth information, and scale information of the target prediction box. This text description information is used to describe the location information of the target prediction box, the depth features corresponding to the target prediction box represented by the target depth information, and the mask features corresponding to the target prediction box represented by the target scale information. The target image and text description information are input into the multimodal large model to obtain the output of the multimodal large model.
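As an illustration of how such text description information might be assembled, the sketch below formats the category, box coordinates, target depth information, and target scale information into one string. The wording of the string and the function name `describe_box` are assumptions made for this example; the text does not fix an exact template.

```python
def describe_box(category, box, depth, ratio):
    """Build a text description for one target prediction box.

    category: category name predicted by the detector.
    box:      (x1, y1, x2, y2), top-left and bottom-right coordinates.
    depth:    target depth information in [0, 1] (closer to 1 = nearer).
    ratio:    target scale information, fraction of the image in [0, 1].
    """
    x1, y1, x2, y2 = box
    return (
        f"Category: {category}. "
        f"Relative spatial position (x1, y1, x2, y2): ({x1}, {y1}, {x2}, {y2}). "
        f"Relative lens distance: {depth:.2f}. "
        f"Relative size proportion in the image: {ratio:.1%}."
    )
```

A call such as `describe_box("dog", (10, 20, 110, 220), 0.8, 0.125)` yields a single sentence group that can be sent to the model together with the target image.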

[0123] In the use of multimodal large models, prompts are an extremely important component; prompts are textual descriptive information. Prompts can be understood as input provided by the user to the multimodal large model, guiding it on how to generate output results. The design and optimization of prompts play a crucial role in improving the performance, accuracy, and relevance of multimodal large models. Through multiple trials with output results, this invention has designed a set of applicable labeling prompts as follows:

[0124] You are a helpful image-text matching assistant. Your task is to determine whether the content and category at specific coordinates in an input image match. For example, if the input category is "dog," and the content at the corresponding coordinates in the image is indeed a dog, then the match is successful; otherwise, the match fails. You will use the object provided later and its additional information, including but not limited to the following: 1. Relative spatial position: The specific position of the object in the image, along with the x-axis (horizontal) and y-axis (vertical). Input (x1, y1, x2, y2) represents the coordinates of the top-left and bottom-right points in the original image. 2. Relative lens distance: The relative distance between the object and the lens, which measures the distance between the object and the camera in the image. The closer the value is to 1, the closer the object is to the camera, located in the foreground. Conversely, a value closer to 0 indicates that the object is far from the camera, located in the background. 3. Relative size proportion in the image: The relative size of the object in the image (e.g., occupying 10%, 20%, etc.). Your response should only contain two results: successful match or failed match, and should not include any other content (such as the reason for failure).

[0125] By continuously refining and supplementing information, the prompts become more specific and clear, improving the accuracy of the multimodal large model's output. Based on the output of the multimodal large model, secondary matching is performed on target detection boxes predicted by the target detection algorithm with confidence levels below a threshold. This allows for manual modification of target prediction boxes that fail to match in the multimodal large model's output, thus improving both efficiency and quality.
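The routing described above can be sketched as follows. The detection structure, the threshold value of 0.5, and the `verify` callback are all hypothetical; `verify` stands in for the call to the multimodal large model, which is not shown. Accepting boxes at or above the threshold without verification is an assumption of this sketch.

```python
def triage(candidates, verify, confidence_threshold=0.5):
    """Split detections into accepted pre-labels and boxes for manual review.

    candidates: list of dicts with a "confidence" key (hypothetical structure).
    verify:     callable returning True for a successful match; a stand-in
                for the multimodal large model.
    """
    accepted, manual_review = [], []
    for det in candidates:
        if det["confidence"] >= confidence_threshold:
            accepted.append(det)       # assumption: confident boxes skip verification
        elif verify(det):
            accepted.append(det)       # low confidence, but the match succeeded
        else:
            manual_review.append(det)  # failed match: left for manual modification
    return accepted, manual_review
```

Only the boxes in `manual_review` need a human annotator, which is where the efficiency gain described above comes from.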

[0126] The technical solution of this invention further extracts features from the depth map and the segmentation mask map to obtain target depth information and target proportion information. The simplified features are then input into the multimodal large model, which facilitates the multimodal large model to accurately identify all input features and improves the accuracy of the output results.

[0127] Figure 4 is a schematic diagram of a target pre-calibration device provided in an embodiment of the present invention. As shown in Figure 4, the device includes:

[0128] The target image processing module 410 is used to perform target category detection on the target image based on the target detection algorithm to obtain an initial target detection result; perform depth recognition on the target image based on the depth model to obtain a depth map; and perform target segmentation on the target image based on the segmentation model to obtain a segmentation mask map; wherein, the initial target detection result includes the position information of the candidate prediction box, the category information corresponding to the candidate prediction box, and the confidence level;

[0129] The target prediction box determination module 420 is used to determine the target prediction box to be verified from the candidate prediction boxes based on the confidence level.

[0130] The large model recognition module 430 is used to input the target image, the position information of the target prediction box, the depth map and the segmentation mask map into the multimodal large model to obtain the pre-calibration result of the target prediction box.

[0131] The technical solution of this invention verifies the accuracy of the prediction boxes obtained by the target detection algorithm by means of the depth model, the segmentation model, and the multimodal large model. This improves the accuracy of the detection results obtained by the target detection algorithm, reduces the workload of manual correction after pre-labeling the training dataset of the target detection model, and thus improves the overall quality and efficiency of training dataset labeling.

[0132] Optionally, the apparatus further includes a prediction box position correction module, which operates after the target prediction box to be verified has been determined from the candidate prediction boxes based on the confidence level, and which includes:

[0133] The overlap value determination unit is used to determine the overlap value between the target prediction box and the corresponding target mask region based on the position information of the target prediction box and the segmentation mask image;

[0134] The position correction unit is used to correct the position information of the target prediction box according to the segmentation mask image if the overlap value is less than a preset overlap threshold.

[0135] Optionally, the overlap value determination unit is specifically used for:

[0136] Based on the location information of the target prediction box, determine the target mask region in the segmentation mask image corresponding to the target prediction box;

[0137] The mask area is determined based on the mask value in the target mask region;

[0138] The overlap value is determined based on the ratio of the mask area to the area of the target prediction box.

[0139] Optionally, the position correction unit is specifically used for:

[0140] Based on the position information of the target prediction box, determine the target mask corresponding to the target prediction box in the segmentation mask image, and determine the minimum bounding rectangle of the target mask;

[0141] The position of the centroid of the minimum bounding rectangle is determined based on the position information of the minimum bounding rectangle and the corresponding mask value, and is used as the center position of the corrected prediction box.

[0142] The corrected prediction box width is determined by the weighted sum of the width of the minimum bounding rectangle and the width of the target prediction box;

[0143] The corrected prediction box height is determined by the weighted sum of the height of the minimum bounding rectangle and the height of the target prediction box;

[0144] The position information of the corrected target prediction box is determined based on the center position of the corrected prediction box, the width of the corrected prediction box, and the height of the corrected prediction box.

[0145] Optionally, the large model recognition module includes:

[0146] A depth information determination unit is used to determine the target depth information based on the position information of the target prediction box and the depth map;

[0147] A proportion information determination unit is used to determine target proportion information based on the position information of the target prediction box and the segmentation mask image;

[0148] The recognition unit is used to input the target image, the position information of the target prediction box, the target depth information and the target scale information into the multimodal large model to obtain the pre-calibration result of the target prediction box.

[0149] Optionally, the depth information determination unit is specifically used for:

[0150] The depth map is smoothed to obtain a smoothed depth map;

[0151] Based on the location information of the target prediction box, determine the target depth region in the smoothed depth map corresponding to the target prediction box;

[0152] Invalid depth values in the target depth region are removed to obtain a set of valid depth values;

[0153] The target depth information is determined based on the average value of the set of effective depth values.

[0154] Optionally, the proportion information determination unit is specifically used for:

[0155] Determine the target mask corresponding to the target prediction box in the segmentation mask image based on the position information of the target prediction box;

[0156] The target proportion information is determined based on the proportion of the target mask to the size of the segmented mask image.

[0157] Optionally, the large model recognition module is specifically used for:

[0158] The target image, the location information of the target prediction box, the depth map, and the segmentation mask map are input into the multimodal large model, and the output of the multimodal large model is whether the target prediction box matches the corresponding category information successfully;

[0159] If the target prediction box successfully matches the corresponding category information, the pre-calibration result of the target prediction box is that the detection information of the target prediction box obtained by the target detection algorithm is accurate.

[0160] If the target prediction box fails to match the corresponding category information, the pre-calibration result of the target prediction box is that the detection information of the target prediction box obtained by the target detection algorithm is erroneous.

[0161] The target precalibration device provided in the embodiments of the present invention can execute the target precalibration method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method execution.

[0162] The acquisition, storage, use, and processing of data in this application comply with relevant national laws and regulations and do not violate public order and good morals.

[0163] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0164] Figure 5 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and/or claimed herein.

[0165] As shown in Figure 5, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

[0166] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0167] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as the target pre-calibration method.

[0168] In some embodiments, the target pre-calibration method can be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program can be loaded and/or installed on electronic device 10 via ROM 12 and/or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the target pre-calibration method described above can be performed. Alternatively, in other embodiments, processor 11 can be configured to perform the target pre-calibration method by any other suitable means (e.g., by means of firmware).

[0169] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transferring data and instructions to the storage system, the at least one input device, and the at least one output device.

[0170] Computer programs used to implement the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer programs cause the functions / operations specified in the flowcharts and / or block diagrams to be performed. The computer programs may be executed entirely on a machine, partially on a machine, or as a standalone software package, partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0171] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0172] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0173] The systems and technologies described herein can be implemented in computing systems that include back-end components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include front-end components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.

[0174] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.

[0175] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.

[0176] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A target pre-labeling method, characterized in that the method includes: performing target category detection on a target image based on a target detection algorithm to obtain an initial target detection result; performing depth recognition on the target image based on a depth model to obtain a depth map; performing target segmentation on the target image based on a segmentation model to obtain a segmentation mask map; wherein the initial target detection result includes position information of candidate prediction boxes, category information corresponding to the candidate prediction boxes, and confidence; determining, from the candidate prediction boxes and according to the confidence, a target prediction box to be verified; and inputting the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into a multimodal large model to obtain a pre-labeling result of the target prediction box.
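The confidence-based selection step in claim 1 can be illustrated with a minimal sketch. The claim states only that the boxes to be verified are chosen according to the confidence; the rule used here (send boxes whose confidence falls in an uncertain band between two thresholds), the `Candidate` type, the function name, and the threshold values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One candidate prediction box from the detector (hypothetical type)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    category: str
    confidence: float


def select_boxes_to_verify(
    candidates: List[Candidate], low: float = 0.3, high: float = 0.8
) -> List[Candidate]:
    """Return candidates whose confidence lies in [low, high).

    Assumption: boxes below `low` are discarded and boxes at or above
    `high` are accepted without verification; the claim does not fix
    this rule or these values.
    """
    return [c for c in candidates if low <= c.confidence < high]
```

Under this assumed rule, only the uncertain boxes are passed to the multimodal large model, which keeps the number of verification calls small.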

2. The method according to claim 1, characterized in that, after determining the target prediction box to be verified from the candidate prediction boxes according to the confidence, the method further includes: determining an overlap value between the target prediction box and the corresponding target mask region according to the position information of the target prediction box and the segmentation mask map; and if the overlap value is less than a preset overlap threshold, correcting the position information of the target prediction box according to the segmentation mask map.

3. The method according to claim 2, characterized in that determining the overlap value between the target prediction box and the corresponding target mask region according to the position information of the target prediction box and the segmentation mask map includes: determining, according to the position information of the target prediction box, the target mask region corresponding to the target prediction box in the segmentation mask map; determining a mask area according to the mask values in the target mask region; and determining the overlap value according to the ratio of the mask area to the area of the target prediction box.
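The overlap computation of claims 2 and 3 can be sketched as follows. This is an illustrative sketch only: it assumes a binary mask stored as a list of rows, integer pixel coordinates with exclusive `x2` and `y2`, and a hypothetical function name. The mask area is counted as the number of non-zero mask pixels inside the box.

```python
from typing import List, Tuple


def overlap_value(mask: List[List[int]], box: Tuple[int, int, int, int]) -> float:
    """Ratio of the mask area inside the box to the area of the box."""
    x1, y1, x2, y2 = box
    box_area = (x2 - x1) * (y2 - y1)
    if box_area <= 0:
        return 0.0
    # Target mask region: the part of the mask covered by the box.
    mask_area = sum(1 for row in mask[y1:y2] for v in row[x1:x2] if v > 0)
    return mask_area / box_area
```

A box that tightly encloses its mask yields a value near 1, while a box that is much larger than the mask yields a small value and would fall below the preset overlap threshold of claim 2.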

4. The method according to claim 2, characterized in that correcting the position information of the target prediction box according to the segmentation mask map includes: determining, according to the position information of the target prediction box, the target mask corresponding to the target prediction box in the segmentation mask map, and determining the minimum bounding rectangle of the target mask; determining the centroid position of the minimum bounding rectangle according to the position information of the minimum bounding rectangle and the corresponding mask values, and using it as the center position of the corrected prediction box; determining the width of the corrected prediction box as a weighted sum of the width of the minimum bounding rectangle and the width of the target prediction box; determining the height of the corrected prediction box as a weighted sum of the height of the minimum bounding rectangle and the height of the target prediction box; and determining the position information of the corrected target prediction box according to the center position, width, and height of the corrected prediction box.
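One possible reading of the correction in claim 4 is sketched below. The claim does not give the weights, so the single weight `alpha` (applied to the rectangle, with `1 - alpha` applied to the original box) is an assumption, as are the function name, the axis-aligned rectangle, and the use of pixel centers when computing the mask-weighted centroid.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def correct_box(mask: List[List[float]], box: Tuple[int, int, int, int],
                alpha: float = 0.5) -> Box:
    """Correct a prediction box using the mask pixels it covers."""
    x1, y1, x2, y2 = box
    height, width = len(mask), len(mask[0])
    # Target mask: non-zero mask pixels inside the (clamped) box.
    pixels = [
        (x, y, mask[y][x])
        for y in range(max(y1, 0), min(y2, height))
        for x in range(max(x1, 0), min(x2, width))
        if mask[y][x] > 0
    ]
    if not pixels:
        return box  # nothing to correct against
    xs = [x for x, _, _ in pixels]
    ys = [y for _, y, _ in pixels]
    # Minimum (axis-aligned) bounding rectangle of the target mask.
    rect_w = max(xs) - min(xs) + 1
    rect_h = max(ys) - min(ys) + 1
    # Mask-weighted centroid, used as the corrected box center.
    total = sum(v for _, _, v in pixels)
    cx = sum((x + 0.5) * v for x, _, v in pixels) / total
    cy = sum((y + 0.5) * v for _, y, v in pixels) / total
    # Weighted sums of rectangle size and original box size (assumed weights).
    new_w = alpha * rect_w + (1 - alpha) * (x2 - x1)
    new_h = alpha * rect_h + (1 - alpha) * (y2 - y1)
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)
```

Blending the rectangle size with the original box size, rather than replacing the box outright, limits the damage when the segmentation mask itself is incomplete.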

5. The method according to any one of claims 1-4, characterized in that inputting the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into the multimodal large model to obtain the pre-labeling result of the target prediction box includes: determining target depth information according to the position information of the target prediction box and the depth map; determining target proportion information according to the position information of the target prediction box and the segmentation mask map; and inputting the target image, the position information of the target prediction box, the target depth information, and the target proportion information into the multimodal large model to obtain the pre-labeling result of the target prediction box.

6. The method according to claim 5, characterized in that determining the target depth information according to the position information of the target prediction box and the depth map includes: smoothing the depth map to obtain a smoothed depth map; determining, according to the position information of the target prediction box, the target depth region corresponding to the target prediction box in the smoothed depth map; removing invalid depth values from the target depth region to obtain a set of valid depth values; and determining the target depth information according to the average of the set of valid depth values.
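The depth steps of claim 6 can be sketched as follows. The claim names neither the smoothing filter nor the criterion for an invalid depth value, so the 3x3 mean filter and the rule that non-finite or non-positive values are invalid are assumptions, as are the function names.

```python
import math
from typing import List, Optional, Tuple


def smooth(depth: List[List[float]]) -> List[List[float]]:
    """3x3 mean filter over in-bounds neighbors (assumed smoothing method)."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                depth[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def target_depth(depth: List[List[float]],
                 box: Tuple[int, int, int, int]) -> Optional[float]:
    """Mean of the valid smoothed depth values inside the box."""
    x1, y1, x2, y2 = box
    smoothed = smooth(depth)
    # Target depth region: smoothed values covered by the box.
    region = [v for row in smoothed[y1:y2] for v in row[x1:x2]]
    # Assumed validity rule: finite and strictly positive.
    valid = [v for v in region if math.isfinite(v) and v > 0]
    return sum(valid) / len(valid) if valid else None
```

With a plain mean filter, a NaN in the input spreads to its neighbors during smoothing and is then dropped by the validity check, so a few missing readings do not distort the averaged depth.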

7. The method according to claim 5, characterized in that determining the target proportion information according to the position information of the target prediction box and the segmentation mask map includes: determining, according to the position information of the target prediction box, the target mask corresponding to the target prediction box in the segmentation mask map; and determining the target proportion information according to the proportion of the target mask relative to the size of the segmentation mask map.
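The proportion in claim 7 differs from the overlap value of claim 3 only in its denominator: the target mask is compared with the whole segmentation mask map rather than with the box. A minimal sketch, assuming a binary mask stored as a list of rows, sizes measured in pixels, and a hypothetical function name:

```python
from typing import List, Tuple


def target_proportion(mask: List[List[int]], box: Tuple[int, int, int, int]) -> float:
    """Share of the whole mask map occupied by the target mask inside the box."""
    x1, y1, x2, y2 = box
    # Target mask: non-zero mask pixels covered by the box.
    target_pixels = sum(1 for row in mask[y1:y2] for v in row[x1:x2] if v > 0)
    image_pixels = len(mask) * len(mask[0])
    return target_pixels / image_pixels
```

This value gives the multimodal large model a rough cue about how large the object appears in the frame, which can be read together with the target depth information.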

8. The method according to claim 1, characterized in that inputting the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into the multimodal large model to obtain the pre-labeling result of the target prediction box includes: inputting the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into the multimodal large model, the output of the multimodal large model being whether the target prediction box matches the corresponding category information; if the target prediction box matches the corresponding category information, determining that the pre-labeling result of the target prediction box is that the detection information of the target prediction box obtained by the target detection algorithm is accurate; and if the target prediction box does not match the corresponding category information, determining that the pre-labeling result of the target prediction box is that the detection information of the target prediction box obtained by the target detection algorithm is erroneous.
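The decision rule of claim 8 can be sketched as a small wrapper around the verification call. The `verify` callable is a hypothetical stand-in for the multimodal large model, which in the claim also receives the target image, the depth map, and the segmentation mask map; the function name, the result dictionary, and the status strings are likewise assumptions.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[int, int, int, int]


def pre_label(box: Box, category: str,
              verify: Callable[[Box, str], bool]) -> Dict[str, object]:
    """Map the model's match/no-match answer to a pre-labeling result."""
    matched = verify(box, category)  # stand-in for the multimodal large model
    status = "detection_accurate" if matched else "detection_error"
    return {"box": box, "category": category, "status": status}
```

Boxes marked `detection_error` are the ones that would be routed to manual correction, while boxes marked `detection_accurate` keep the detector's label.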

9. A target pre-labeling device, characterized in that the device includes: a target image processing module, used to perform target category detection on a target image based on a target detection algorithm to obtain an initial target detection result, to perform depth recognition on the target image based on a depth model to obtain a depth map, and to perform target segmentation on the target image based on a segmentation model to obtain a segmentation mask map, wherein the initial target detection result includes position information of candidate prediction boxes, category information corresponding to the candidate prediction boxes, and confidence; a target prediction box determination module, used to determine, from the candidate prediction boxes and according to the confidence, a target prediction box to be verified; and a large model recognition module, used to input the target image, the position information of the target prediction box, the depth map, and the segmentation mask map into a multimodal large model to obtain a pre-labeling result of the target prediction box.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that cause a processor to execute the target pre-labeling method according to any one of claims 1-8.