Image positioning calibration method and device, electronic equipment and computer readable medium

By overlaying anchor point object images onto the image to be calibrated and applying a positioning transformation matrix for image positioning calibration, the method solves the problem that the object position information output by a large model does not match the original image size, thereby improving calibration efficiency and reducing wasted computational resources.

CN121937535B · Active · Publication Date: 2026-06-26 · LINGBAN INTELLIGENT (HANGZHOU) INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: LINGBAN INTELLIGENT (HANGZHOU) INFORMATION TECHNOLOGY CO LTD
Filing Date: 2026-03-31
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Large models output object position information based on images that match the model's input size, rather than the resolution and size of the original input image, resulting in low image annotation efficiency and wasted computational resources.

Method used

By overlaying anchor point object images onto the image to be calibrated as reference points for localization and calibration, the object position information set is determined using a large uncalibrated object detection model, and image localization and calibration are performed using a localization transformation matrix, simplifying the image calibration process and reducing computational resource costs.

Benefits of technology

It improves the efficiency of image calibration, shortens the calibration time, reduces the waste of computing resources, and adapts to image calibration with different input sizes.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present disclosure disclose an image positioning calibration method and device, electronic equipment and a computer readable medium. A specific implementation of the method comprises: obtaining a target to-be-calibrated image and an uncalibrated object detection large model, wherein the target to-be-calibrated image is an image obtained by superimposing anchor object images, located at the positions given by a target position information set, on the to-be-calibrated image; determining, through the uncalibrated object detection large model, an object position information set of the target to-be-calibrated image; determining a positioning transformation matrix according to the object position information set corresponding to the anchor object images and the target position information set; determining an image position information set according to the positioning transformation matrix; and performing image calibration in the to-be-calibrated image according to the image position information set to obtain a calibrated image. By superimposing anchor object images on the to-be-calibrated image as reference points for positioning calibration and transforming the object position information set output by the model, the implementation can simplify the calibration process, reduce wasted computing resources, improve calibration efficiency and shorten calibration time.

Description

Technical Field

[0001] The embodiments disclosed herein relate to the field of computer technology, and more specifically to image positioning and calibration methods, apparatus, electronic devices, and computer-readable media.

Background Technology

[0002] To reduce inference costs, increase inference speed, and support different graphics cards, large models incorporate input compatibility processing (using dynamic resolution methods, adaptively adjusting the input image resolution according to user-defined expectations, adapting to the user's GPU (Graphics Processing Unit) model and memory size, or automatically adjusting based on inconsistent preprocessing behaviors of different types of large models). This results in the object position information output by the large model (e.g., bounding boxes, segmented surfaces, etc.) being based on the object position information on an image conforming to the input size requirements of the large model, rather than on the resolution and size of the original input image. This is an inherent structural defect of large models, leading to reduced quality and efficiency in image annotation. The common approach to handling the position mapping problem of the model output is to record the parameter set of all image processing steps (e.g., scaling, cropping, padding) for the original input image. Then, the processed image is input into the large model to obtain the object position information. Finally, based on the parameter set, an inverse transformation is performed on the object position information to obtain the object position information for the original input image.
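For illustration only, the conventional approach described in [0002] can be sketched for the simplest case, a uniform resize followed by centered padding. The names `LetterboxParams`, `letterbox_params` and `box_to_original` are hypothetical, and real preprocessing pipelines may involve more steps (cropping, tiling, dynamic resolution), each of which would have to be recorded and inverted in the same way.

```python
from dataclasses import dataclass


@dataclass
class LetterboxParams:
    """Parameters recorded while preprocessing (hypothetical structure)."""
    scale: float  # uniform resize factor applied to the original image
    pad_x: float  # padding added on the left after resizing, in pixels
    pad_y: float  # padding added on the top after resizing, in pixels


def letterbox_params(orig_w, orig_h, model_w, model_h):
    """Record the resize-and-pad parameters for one input image."""
    scale = min(model_w / orig_w, model_h / orig_h)
    pad_x = (model_w - orig_w * scale) / 2
    pad_y = (model_h - orig_h * scale) / 2
    return LetterboxParams(scale, pad_x, pad_y)


def box_to_original(box, params):
    """Invert the recorded steps: remove the padding, then undo the scaling."""
    x1, y1, x2, y2 = box
    return (
        (x1 - params.pad_x) / params.scale,
        (y1 - params.pad_y) / params.scale,
        (x2 - params.pad_x) / params.scale,
        (y2 - params.pad_y) / params.scale,
    )
```

This sketch only works when the caller knows exactly which preprocessing the model applied, which is the dependency the disclosed method seeks to avoid.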

[0003] However, in practice, mapping the positions output by the model in this way often runs into the following technical problems: because image processing involves multiple complex steps, the inverse transformation performed after obtaining the object position information requires a large amount of computing resources and is difficult to adapt to input images of different sizes. In fields that involve coordinate calibration and positioning, such as image annotation and image processing, this lowers the efficiency of image annotation, prolongs annotation time, and wastes a large amount of computing resources.

Summary of the Invention

[0004] The summary portion of this disclosure is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description portion. This summary portion is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.

[0005] Some embodiments of this disclosure provide image localization and calibration methods, apparatuses, electronic devices, and computer-readable media to address one or more of the technical problems mentioned in the background section above.

[0006] In a first aspect, some embodiments of this disclosure provide an image localization and calibration method, comprising: acquiring a target image to be calibrated and an uncalibrated object detection large model, wherein the target image to be calibrated is an image on which anchor point object images located in a target position information set are superimposed; determining the object position information set of each object included in the target image to be calibrated using the uncalibrated object detection large model; determining a localization transformation matrix based on the object position information set corresponding to the anchor point object images and the target position information set; determining the image position information set of the object position information set in the target image to be calibrated based on the localization transformation matrix; and performing image localization and calibration in the target image to be calibrated based on the image position information set to obtain a calibrated image.

[0007] Optionally, the above-mentioned determination of the object position information set of each object included in the target image to be calibrated through the above-mentioned uncalibrated object detection large model includes: determining target detection prompt information for each of the above-mentioned objects; performing intent recognition adjustment on the above-mentioned target detection prompt information to obtain adjusted target detection prompt information; and inputting the adjusted target detection prompt information and the target image to be calibrated into the above-mentioned uncalibrated object detection large model to obtain the object position information set for each of the above-mentioned objects.

[0008] Optionally, determining the positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set includes: generating a coordinate mapping equation set based on the coordinate mapping information of the object position information set and the target position information set; performing coordinate normalization on the coordinate mapping equation set to obtain a normalized coordinate mapping equation set; performing a linear transformation on the normalized coordinate mapping equation set to obtain a system of coordinate mapping linear equations; constructing a coordinate transformation coefficient matrix based on the system of coordinate mapping linear equations; and performing constraint decomposition and reshaping on the coordinate transformation coefficient matrix to obtain the positioning transformation matrix.
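The sequence in [0008] resembles the normalized direct linear transform (DLT) commonly used to estimate a planar homography. The following is a minimal sketch under that assumption, using NumPy. Reading "constraint decomposition and reshaping" as a singular value decomposition under a unit-norm constraint followed by a reshape to 3x3 is an interpretation, not something this passage states, and the function names are hypothetical.

```python
import numpy as np


def _normalization(points):
    """Similarity transform: centroid to the origin, mean distance sqrt(2)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])


def estimate_positioning_matrix(target_pts, detected_pts):
    """Estimate a 3x3 matrix H with detected ~ H @ target (homogeneous).

    target_pts:   known anchor positions in the image to be calibrated.
    detected_pts: the same anchors as reported by the detection model.
    At least four non-degenerate point pairs are assumed.
    """
    src = np.asarray(target_pts, dtype=float)
    dst = np.asarray(detected_pts, dtype=float)
    t_src, t_dst = _normalization(src), _normalization(dst)
    ones = np.ones((len(src), 1))
    src_n = (t_src @ np.hstack([src, ones]).T).T  # normalized coordinates
    dst_n = (t_dst @ np.hstack([dst, ones]).T).T

    # Two linear equations per point pair form the coefficient matrix.
    rows = []
    for (x, y, _), (u, v, _) in zip(src_n, dst_n):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    coeff = np.asarray(rows)

    # The unit-norm solution is the right singular vector belonging to the
    # smallest singular value; reshape it into a 3x3 matrix.
    _, _, vt = np.linalg.svd(coeff)
    h_norm = vt[-1].reshape(3, 3)

    # Undo the normalization and fix the overall scale.
    h = np.linalg.inv(t_dst) @ h_norm @ t_src
    return h / h[2, 2]
```

With four anchors the system is exactly determined; with more anchors the same code returns a least-squares estimate, which may help when individual detections are noisy.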

[0009] Optionally, determining the image position information set of the object position information set in the image to be calibrated based on the positioning transformation matrix includes: determining the matrix product of the inverse matrix of the positioning transformation matrix and the target object position information set to obtain an initial position information set, wherein the target object position information set is the set obtained by removing the object position information set corresponding to the target position information set from the object position information set; and performing resolution mapping on the initial position information set to obtain the image position information set.
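As a hedged sketch of [0009], the function below multiplies the inverse of the positioning transformation matrix with the non-anchor object positions in homogeneous coordinates. It assumes the matrix maps positions in the image to be calibrated to positions reported by the model. What "resolution mapping" involves is not spelled out in this passage; here it is assumed to be a per-axis rescale from the coordinate frame in which the target positions were specified to the pixel resolution of the image. The function name and parameters are hypothetical.

```python
import numpy as np


def map_to_image(points, positioning_matrix, coord_size=None, image_size=None):
    """Map model-output points back onto the image to be calibrated.

    points:             (N, 2) positions output by the model, with the
                        anchor detections already removed.
    positioning_matrix: 3x3 matrix, assumed to satisfy
                        model_point ~ matrix @ image_point.
    coord_size, image_size: optional (width, height) pairs for the assumed
                        resolution mapping step.
    """
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = (np.linalg.inv(positioning_matrix) @ homog.T).T
    xy = mapped[:, :2] / mapped[:, 2:3]  # back from homogeneous coordinates
    if coord_size is not None and image_size is not None:
        ratio = np.asarray(image_size, dtype=float) / np.asarray(coord_size, dtype=float)
        xy = xy * ratio  # assumed resolution mapping: per-axis rescale
    return xy
```

A bounding box can be mapped by passing its corner points; for a pure scale-and-shift matrix the result remains an axis-aligned box.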

[0010] Optionally, the above-mentioned acquisition of the target image to be calibrated and the large-scale uncalibrated object detection model includes: constructing object images that can be recognized by the large-scale uncalibrated object detection model as anchor object images; determining the position information set of the anchor object images in the image to be calibrated as the target position information set; and superimposing the anchor object images onto the image to be calibrated according to the target position information set to obtain the target image to be calibrated.

[0011] Optionally, the above-mentioned construction of the anchor object image that can be recognized by the above-mentioned uncalibrated object detection large model includes: performing feature recognition on the above-mentioned image to be calibrated to obtain global feature information to be calibrated; determining the information of the region to be superimposed on the above-mentioned image to be calibrated and the construction constraint information set for the above-mentioned anchor object image; inputting the above-mentioned global feature information to be calibrated, the above-mentioned region to be superimposed and the above-mentioned construction constraint information set into the anchor image generation model to obtain an initial anchor image; generating a binary encoding matrix according to a preset anchor mark dictionary; adding the above-mentioned binary encoding matrix to the above-mentioned initial anchor image to obtain the anchor object image.

[0012] Optionally, the above-mentioned intention recognition adjustment of the target detection prompt information to obtain the adjusted target detection prompt information includes: deblurring the target detection prompt information to obtain deblurred target detection prompt information; performing task decomposition processing on the deblurred target detection prompt information to obtain a sub-task prompt information set; extracting positive and negative keywords from the sub-task prompt information set to obtain a positive and negative keyword set; performing constraint normalization processing on the output constraint information set included in the sub-task prompt information set to obtain an output constraint normalization information set; and generating a sub-task prompt information set based on the positive and negative keyword set and the output constraint normalization information set, as the adjusted target detection prompt information.

[0013] In a second aspect, some embodiments of this disclosure provide an image localization and calibration apparatus, comprising: an acquisition unit configured to acquire a target image to be calibrated and an uncalibrated object detection large model, wherein the target image to be calibrated is an image on which anchor point object images located in a target position information set are superimposed; a first determination unit configured to determine, through the uncalibrated object detection large model, an object position information set of each object included in the target image to be calibrated; a second determination unit configured to determine a localization transformation matrix based on the object position information set corresponding to the anchor point object images and the target position information set; and a third determination unit configured to determine, based on the localization transformation matrix, an image position information set of the object position information set in the image to be calibrated, and to perform image localization and calibration in the image to be calibrated based on the image position information set to obtain a calibrated image.

[0014] In a third aspect, some embodiments of this disclosure provide an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, such that when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any implementation of the first aspect.

[0015] In a fourth aspect, some embodiments of this disclosure provide a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method as described in any implementation of the first aspect.

[0016] The above embodiments of this disclosure have the following beneficial effects: The image localization and calibration methods of some embodiments of this disclosure, by superimposing anchor point object images on the image to be calibrated as reference points for localization and calibration, transform the object position information set output by the model. This simplifies the calibration process on the image to be calibrated, reduces the waste of computational resources, improves calibration efficiency, and shortens the calibration time. Specifically, the reason for the low efficiency of related image annotation, the prolonged image annotation time, and the waste of a large amount of computational resources is that image processing involves multiple complex image processing steps. After obtaining the object position information, an inverse transformation is performed, requiring a large amount of computational resources. This makes it difficult to adapt to input images of different sizes. In fields involving coordinate calibration and localization, such as image annotation and image processing, this leads to low efficiency of image annotation, prolonged image annotation time, and a waste of a large amount of computational resources. Based on this, the image localization and calibration methods of some embodiments of this disclosure can first obtain the target image to be calibrated and the large model for detecting uncalibrated objects, wherein the target image to be calibrated is an image on which anchor point object images located in the target position information set are superimposed. Therefore, anchor objects—those with known target location information superimposed on the image to be calibrated and accurately identified by the subsequent large-scale uncalibrated object detection model—serve as the foundation for determining the object location information set and the localization transformation matrix. 
This provides a stable, predictable reference standard for subsequent construction. Furthermore, the target location information set reduces occlusion and interference with the main content of the image to be calibrated and ensures that the image to be calibrated can still be preserved after image processing such as cropping and scaling. Secondly, using the aforementioned large-scale uncalibrated object detection model, the object location information set of each object included in the target image to be calibrated is determined. This allows for accurate identification of the object location information set of each object (including anchor object images), improving the accuracy of the object location information set and providing a data foundation for determining the localization transformation matrix. Then, based on the object location information set corresponding to the anchor object images and the aforementioned target location information set, the localization transformation matrix is determined. Therefore, a geometric transformation relationship can be established between the object position information set of the anchor point object image and the target position information set, improving the accuracy of the positioning transformation matrix. This facilitates the subsequent precise determination of the object position information mapping transformation between the image to be calibrated and the image after image processing, as well as between images of different resolutions, thereby improving the efficiency and accuracy of calibrating each object in the image to be calibrated. 
Finally, based on the aforementioned positioning transformation matrix, the image position information set of the aforementioned object position information set in the aforementioned image to be calibrated is determined, and based on the aforementioned image position information set, image positioning and calibration are performed in the aforementioned image to be calibrated to obtain the calibrated image. Therefore, the localization transformation matrix converts the calibration of object boundaries on an image conforming to the input size of the uncalibrated object detection large model into precise calibration of the boundaries of each object on the image to be calibrated. This effectively simplifies the coordinate mapping process, adapts to calibration of images of different sizes, improves image calibration efficiency, shortens calibration time, and reduces the computational resources required for image calibration. In summary, this image localization and calibration method, by superimposing anchor point object images on the image to be calibrated as reference points for localization and calibration, transforms the object position information set output by the model. This effectively solves the output coordinate mapping problem caused by the network structure of the large model itself, simplifies the calibration process on the image to be calibrated, reduces the waste of computational resources, improves calibration efficiency, and shortens calibration time.

Attached Figure Description

[0017] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and elements are not necessarily drawn to scale.

[0018] Figure 1 is a flowchart of some embodiments of the image positioning and calibration method according to the present disclosure;

[0019] Figure 2 is a schematic diagram of a target image to be calibrated according to some embodiments of the image localization and calibration method of this disclosure;

[0020] Figure 3 is a schematic diagram showing the object position information set output by the uncalibrated object detection large model, displayed as bounding boxes on the target image to be calibrated, according to some embodiments of the image localization and calibration method of this disclosure;

[0021] Figure 4 is a comparative schematic diagram showing the object position information set mapped directly onto the image to be calibrated and the object position information set mapped onto the image to be calibrated after transformation based on the positioning transformation matrix, according to some embodiments of the image positioning and calibration method of this disclosure;

[0022] Figure 5 is a flowchart of some other embodiments of the image positioning and calibration method according to this disclosure;

[0023] Figure 6 is a schematic diagram of the structure of some embodiments of the image positioning and calibration apparatus according to this disclosure;

[0024] Figure 7 is a schematic diagram of the structure of an electronic device suitable for implementing some embodiments of the present disclosure.

Detailed Implementation

[0025] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0026] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.

[0027] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0028] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0029] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0030] This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.

[0031] Figure 1 shows a flowchart 100 of some embodiments of an image localization and calibration method according to the present disclosure. The image localization and calibration method includes the following steps:

[0032] Step 101: Obtain the target image to be calibrated and the large model for detecting uncalibrated objects.

[0033] In some embodiments, the executing entity of the above-described image localization and calibration method (e.g., a head-mounted display device) can acquire a target image to be calibrated and an uncalibrated object detection large model via a wired or wireless connection. The target image to be calibrated is an image on which anchor object images located in the target location information set are superimposed. The target image to be calibrated can be an image of objects to be identified and labeled. The target location information in the target location information set can be known location information of the anchor object images on the target image. For example, the target location information can be location information located at the lower right corner of the target image, a certain number of pixels away from the edge of the target image. The anchor object images can be images of objects that the large model can accurately identify, that are within the model's cognitive scope, and that are significantly different from the target image. For example, the anchor object images can be, but are not limited to, at least one of the following: QR code images, airplane images, bird images, and tank images. The uncalibrated object detection large model can be a deep neural network model that performs object detection on the input target image to be calibrated to output an object location information set. The aforementioned large-scale unlabeled object detection model can be a Vision-Language Model (VLM). This model can also include a visual encoder, a text encoder, a multi-layer self-attention mechanism layer, a cross-modal cross-attention mechanism layer, and a cross-modal decoder detection head. The object location information set can be determined through the following steps: First, the target image to be labeled is input into the visual encoder to obtain an image feature map. This visual encoder can be a Vision Transformer model. 
Second, the prompt information corresponding to the target image to be calibrated is input into the text encoder to obtain a text feature vector. This text encoder can be a BERT (Bidirectional Encoder Representations from Transformers) model. Then, the text feature vector and the image feature map are input into the multi-layer self-attention mechanism layer and the cross-modal cross-attention mechanism layer to obtain a fused feature vector. Finally, the fused feature vector is input into the cross-modal decoder detection head to obtain the object location information set. This cross-modal decoder detection head can be a Transformer decoder. The uncalibrated object detection large model can be trained by first performing large-scale image-text contrastive pre-training and then fine-tuning end-to-end on detection tasks (e.g., bounding box regression + classification). Figure 2 is a schematic diagram of the target image to be calibrated. It should be noted that the aforementioned wireless connection methods may include, but are not limited to, 3G/4G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra wideband) connections, and other currently known or future-developed wireless connection methods.

[0034] In some optional implementations of certain embodiments, the aforementioned execution entity can obtain the target image to be calibrated and the large-scale model for detecting uncalibrated objects through the following steps:

[0035] The first step is to construct object images that can be recognized by the aforementioned unlabeled object detection model, serving as anchor object images. These recognizable object images can be images containing the category and location information of objects detected by the unlabeled object detection model. In practice, the execution entity can input preset image generation information into an image generation model to obtain anchor object images. This preset image generation information can be object images selected by experts from the object categories recognizable by the unlabeled object detection model, along with constraint information related to the object image's size, brightness, and edge contours. For example, the image generation model could be a diffusion model.

[0036] The second step is to determine the set of positional information of the anchor point objects within the image to be calibrated, which will be used as the target positional information set. This set of positional information within the image to be calibrated can include the positional information of all objects within the image that are not occluded. This determination can be performed using a large language model.

[0037] The third step involves overlaying the anchor point object image onto the image to be calibrated, based on the aforementioned target location information set, to obtain the target image to be calibrated. In practice, the executing entity can overlay the anchor point object image onto the position corresponding to the target location information set of the image to be calibrated to obtain the target image to be calibrated.
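The overlay in [0037] can be sketched as a direct pixel copy, assuming both images are NumPy arrays in height-by-width(-by-channel) layout and that a target position is given as the top-left pixel of the anchor. The function name `overlay_anchor` and the choice to return the four corner points are illustrative assumptions.

```python
import numpy as np


def overlay_anchor(image, anchor, top_left):
    """Paste an anchor image onto a copy of the image to be calibrated.

    top_left is the (x, y) pixel at which the anchor's top-left corner is
    placed. Returns the new image and the anchor's four corner points,
    which can serve as known target positions.
    """
    x, y = top_left
    h, w = anchor.shape[:2]
    if x < 0 or y < 0 or y + h > image.shape[0] or x + w > image.shape[1]:
        raise ValueError("anchor does not fit inside the image at this position")
    out = image.copy()  # leave the original image untouched
    out[y:y + h, x:x + w] = anchor
    corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
    return out, corners
```

Calling this once per anchor and collecting the returned corners yields a target position information set in pixel coordinates of the image to be calibrated.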

[0038] In some optional implementations of certain embodiments, the aforementioned execution entity may construct an object image that can be recognized by the aforementioned unlabeled object detection large model, as an anchor object image, through the following steps:

[0039] The first step is to perform feature recognition on the image to be calibrated to obtain global feature information. This global feature information may include the brightness range, hue information, and texture frequency information of the image to be calibrated. Feature recognition may involve calculating the global histogram of the image to be calibrated.
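As one possible reading of [0039], the brightness part of the global feature information can be derived from a global histogram of a grayscale image. The sketch below assumes an 8-bit grayscale NumPy array; the function name and the returned keys are hypothetical, and hue and texture frequency information are not covered.

```python
import numpy as np


def global_brightness_features(gray, bins=256):
    """Global histogram and brightness range of an 8-bit grayscale image."""
    gray = np.asarray(gray)
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    return {
        "histogram": hist,                   # pixel count per brightness bin
        "brightness_min": int(gray.min()),   # lower end of the brightness range
        "brightness_max": int(gray.max()),   # upper end of the brightness range
        "brightness_mean": float(gray.mean()),
    }
```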

[0040] The second step involves determining the overlay region information for the image to be calibrated and the set of construction constraint information for the anchor object image. The overlay region information can be information about the regions in the image to be calibrated, excluding the regions corresponding to each included object. The construction constraint information in the set of construction constraint information can constrain the contrast, shape, size, and internal encoding information of the anchor object image, so that the subsequent large-scale uncalibrated object detection model can accurately identify it. For example, the construction constraint information may include, but is not limited to, at least one of the following: the anchor object image includes objects whose category and location information can be detected by the large-scale uncalibrated object detection model, and which have high contrast, strong edge information, and strong gradient features; provides shape and size information for at least four corner points of the image to be calibrated where there is no occlusion; and provides constraint information to prevent the large-scale uncalibrated object detection model from using Hamming error correction to uniquely identify the anchor image. In practice, the executing entity can first perform object detection on the image to be calibrated to obtain an image recognition information set. Then, the region information corresponding to the image region to be calibrated, after removing the object region information set corresponding to the image recognition information set, is determined as the region information to be superimposed. Finally, the construction constraint information set for the anchor point object image is determined.
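The removal of object regions described in [0040] can be illustrated with a boolean mask, assuming the image recognition information set is available as axis-aligned boxes `(x1, y1, x2, y2)` in integer pixel coordinates. This representation and the function name `overlay_region_mask` are assumptions made for the sketch.

```python
import numpy as np


def overlay_region_mask(image_shape, object_boxes):
    """Return a boolean mask that is True where no detected object lies.

    image_shape:  (height, width, ...) of the image to be calibrated.
    object_boxes: iterable of (x1, y1, x2, y2) boxes, end-exclusive.
    """
    h, w = image_shape[:2]
    free = np.ones((h, w), dtype=bool)
    for x1, y1, x2, y2 in object_boxes:
        # Clip each box to the image before marking it as occupied.
        free[max(0, y1):min(h, y2), max(0, x1):min(w, x2)] = False
    return free
```

Candidate anchor positions can then be restricted to windows in which the mask is entirely True, so that anchors do not occlude detected objects.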

[0041] The third step involves inputting the aforementioned global feature information to be calibrated, the information of the region to be overlaid, and the set of construction constraints into the anchor image generation model to obtain the initial anchor image. The anchor image generation model can be a neural network model that generates the initial anchor image based on the input global feature information to be calibrated, the information of the region to be overlaid, and the set of construction constraints. This anchor image generation model can be a model comprising: a CLIP (Contrastive Language-Image Pretraining) Text Encoder model, a T2I (Text-To-Image Model)-Adapter module, a multi-layer Denoising U-Net (Convolutional Networks for Biomedical Image Segmentation) network, and an encoding network. Training the anchor image generation model can be a loss function-based model training process that minimizes the mean squared error between the initial anchor image and the real anchor image. The real anchor image can be an image that satisfies the set of construction constraints.

[0042] In practice, the aforementioned execution entity can first perform feature encoding on the aforementioned global feature information to be calibrated, the aforementioned region information to be overlaid, and the aforementioned construction constraint information set to obtain a text feature encoding vector set. This feature encoding can be performed using the CLIP Text Encoder model. Secondly, a preset anchor point contour map is input into the T2I-Adapter module to obtain a multi-scale feature vector set. The preset anchor point contour map can be a contour map of an anchor point image with at least four corner points, obtained according to the construction constraint information set. For example, the preset anchor point contour map can be a map including a thick border and an internal geometric skeleton. Then, the aforementioned multi-scale feature vector set and the aforementioned text feature encoding vector set are input into the multi-layer Denoising U-Net network to obtain a denoising feature vector set. The Denoising U-Net can be a deep neural network based on a U-Net encoder-decoder architecture. Finally, the aforementioned denoising feature vector set is input into the encoding network to obtain the initial anchor point image.

[0043] The fourth step is to generate a binary encoding matrix based on a preset anchor point marker dictionary. This binary encoding matrix can be a binary matrix code with Hamming error correction capability that can be recognized by the large uncalibrated object detection model.

[0044] The fifth step is to add the aforementioned binary encoding matrix to the initial anchor point image to obtain the anchor point object image.

[0045] Step 102: Using the large uncalibrated object detection model, determine the object position information set of each object included in the target image to be calibrated.

[0046] In some embodiments, the execution entity can determine the object position information set of each object included in the target image to be calibrated using the large uncalibrated object detection model. The object position information in this set can be the position information of object boundary points representing object boundaries. For example, the object position information set can be a bounding box position information set, or it can be a set of boundary coordinate information such as the object's curved surface segmentation. Figure 3 shows a schematic of the objects detected in the target image to be calibrated, with their respective bounding boxes, obtained by using the large uncalibrated object detection model.

[0047] When addressing the aforementioned technical problems in the application scenario of industrial visual quality inspection based on large models, the following technical issue often arises: because object detection based on large models requires full storage of the key and value vectors of each attention layer, object detection efficiency is low, a large amount of GPU memory is required, image annotation time is prolonged, and considerable computational resources are wasted. Considering the following requirements of this application scenario (adaptability to irregular curved surfaces, to two-stage localization, to large models, and to key-value vector caching), the following solution is adopted:

[0048] In some optional implementations of certain embodiments, the execution entity may determine the set of object position information for each object included in the target image to be calibrated by using the aforementioned large uncalibrated object detection model through the following steps:

[0049] The first step involves performing object feature extraction on the target image to be calibrated and the preset target detection prompt information using the large uncalibrated object detection model, to obtain a global visual feature vector set, a visual key-value tensor, a global text feature vector set, and a text key-value tensor. The preset target detection prompt information can be information used to guide the large uncalibrated object detection model in target detection. The global visual feature vectors in the global visual feature vector set represent the spatial and visual semantic information of the target image to be calibrated (e.g., the shape, color, position, texture, and inter-region relationships of the objects). The global text feature vectors in the global text feature vector set represent the semantic and syntactic information of the preset target detection prompt information. The visual key-value tensor represents the attention computation context of the global visual feature vector set. The text key-value tensor represents the attention computation context of the global text feature vector set. In practice, the execution entity can first perform image preprocessing on the target image to be calibrated and then input it into the visual encoder included in the large uncalibrated object detection model to obtain the global visual feature vector set and the visual key-value tensor. The visual encoder can be a visual encoder included in a VLM (vision-language model). The image preprocessing can be image segmentation processing (dividing the image into image patches). Then, the preset target detection prompt information is tokenized and input into the language encoder included in the large uncalibrated object detection model to obtain the global text feature vector set and the text key-value tensor. The language encoder can be a language encoder included in a VLM.

[0050] The second step involves generating an initial object boundary position information set based on the global visual feature vector set and the global text feature vector set. The initial object boundary position information in this set can be the coordinates of a point on the boundary of an object in the target image to be calibrated. In practice, the execution entity can first perform tensor integration on the visual key-value tensor and the text key-value tensor to obtain a global key-value tensor. Next, cross-modal attention is calculated using the global key-value tensor as the key and value vectors and the global visual feature vector set and global text feature vector set as the query vectors, resulting in global cross-modal fusion feature vectors. Then, the global cross-modal fusion feature vectors corresponding to the image patches obtained by image segmentation are identified among the global cross-modal fusion feature vectors and serve as the visual fusion feature vector set. Next, the set of cosine similarities between the visual fusion feature vector set and the global text feature vector set is determined as the feature similarity set. Then, at least one visual fusion feature vector whose feature similarity is greater than or equal to a preset similarity threshold is selected from the visual fusion feature vector set. The preset similarity threshold can be a pre-defined screening threshold; its value can be determined based on specific circumstances and is not limited here. Then, the selected visual fusion feature vectors are input into a multilayer perceptron and subjected to non-maximum suppression to obtain a bounding box information set. Next, the visual fusion feature vectors corresponding to the bounding box information set are input into a boundary morphology generation model to obtain an object boundary contour information set. The boundary morphology generation model can be a deep neural network model that extracts the boundary contours of the objects included in the input bounding box information set. For example, for polygons or curves, the boundary morphology generation model can be a Bézier curve fitting layer that generates contour curves or thick polygon boundaries; for segmented surfaces or masks, it can be a convolutional neural network model including a Sigmoid activation function that generates pixel-level segmented surface masks. Finally, the object boundary contour information set and the bounding box information set are determined as the initial object boundary position information set.
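The similarity screening in the second step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the global text feature vector set has been pooled into a single vector and that feature vectors are plain lists of floats; the function names `cosine_similarity` and `select_patches` are hypothetical and do not come from the disclosure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_patches(visual_vectors, text_vector, threshold):
    """Return indices of visual fusion vectors whose similarity to the
    (assumed pooled) text vector is >= the preset similarity threshold."""
    return [
        index
        for index, vector in enumerate(visual_vectors)
        if cosine_similarity(vector, text_vector) >= threshold
    ]

# Hypothetical toy data: three patch vectors and one pooled text vector.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
selected = select_patches(patches, [1.0, 0.0], threshold=0.7)
```

Only the selected patch vectors would then be passed on to the multilayer perceptron and non-maximum suppression described above.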

[0051] The third step involves pruning the visual and text key-value tensors based on the initial object boundary position information set, resulting in a pruned key-value tensor set. The pruned key-value tensors in this set can be the combined visual and text key-value tensors corresponding to the initial object boundary position information set. In practice, the executing entity can first map the initial object boundary position information set to the block index range included in the global cross-modal fusion feature vector, obtaining a block index set. The block indices in this set can represent the row and column index ranges covered by the initial object boundary position information set in the global cross-modal fusion feature vector. Then, based on the block index set, the visual and text key-value tensors are pruned to obtain the pruned key-value tensor set.
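A minimal sketch of the block index mapping and pruning follows. It assumes a regular grid of square patches stored in row-major order and one key and one value entry per patch; the patch size, the grid layout, and the names `box_to_patch_indices` and `prune_kv` are illustrative assumptions, and the handling of text tokens is omitted.

```python
def box_to_patch_indices(box, patch_size, grid_rows, grid_cols):
    """Map a pixel-space box (x_min, y_min, x_max, y_max) to the flat,
    row-major indices of every patch the box touches."""
    x_min, y_min, x_max, y_max = box
    col_start = max(0, int(x_min // patch_size))
    col_end = min(grid_cols - 1, int(x_max // patch_size))
    row_start = max(0, int(y_min // patch_size))
    row_end = min(grid_rows - 1, int(y_max // patch_size))
    return [
        row * grid_cols + col
        for row in range(row_start, row_end + 1)
        for col in range(col_start, col_end + 1)
    ]

def prune_kv(keys, values, keep_indices):
    """Keep only the key/value entries whose patch index is in keep_indices."""
    kept = sorted(set(keep_indices))
    return [keys[i] for i in kept], [values[i] for i in kept]

# Hypothetical 4x4 grid of 16-pixel patches and one coarse bounding box.
indices = box_to_patch_indices((20, 5, 40, 30), patch_size=16, grid_rows=4, grid_cols=4)
pruned_keys, pruned_values = prune_kv(list(range(16)), list(range(100, 116)), indices)
```

Entries outside the returned indices correspond to background patches and can be released.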

[0052] The fourth step involves caching the pruned key-value tensor set to obtain a cache-optimized key-value tensor set. The cache-optimized key-value tensors in this set can be used to record the association between the pruned key-value tensor set and the initial object boundary position information set, and are cached by dividing them into fixed-size blocks to improve memory utilization. In practice, the execution entity can utilize the PagedAttention algorithm to perform cache optimization on the pruned key-value tensor set to obtain the cache-optimized key-value tensor set.
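The idea of caching in fixed-size blocks can be sketched with a toy block table. This is only an illustration of the block-table concept behind PagedAttention, not the algorithm's actual implementation or any library's API; the class `PagedKVCache`, its methods, and the region identifiers are hypothetical.

```python
class PagedKVCache:
    """Toy cache that stores key-value entries in fixed-size blocks and keeps,
    per region, a block table listing the blocks that belong to it."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []        # physical blocks, each holding <= block_size entries
        self.block_tables = {}  # region id -> list of physical block indices

    def store(self, region_id, kv_entries):
        """Split the entries of one region into blocks and record the block table."""
        table = []
        for start in range(0, len(kv_entries), self.block_size):
            self.blocks.append(list(kv_entries[start:start + self.block_size]))
            table.append(len(self.blocks) - 1)
        self.block_tables[region_id] = table

    def load(self, region_id):
        """Reassemble the entries of one region from its blocks, in order."""
        entries = []
        for block_index in self.block_tables[region_id]:
            entries.extend(self.blocks[block_index])
        return entries

# Hypothetical usage: two detected regions with pruned key-value entries.
cache = PagedKVCache(block_size=2)
cache.store("box0", ["kv_a", "kv_b", "kv_c"])
cache.store("box1", ["kv_d"])
```

The block table plays the role of the recorded association between pruned key-value tensors and the initial object boundary positions they belong to.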

[0053] The fifth step involves performing progressive distillation on the global cross-modal fusion feature vectors corresponding to the initial object boundary position information set to obtain a set of locally enhanced feature vectors. The locally enhanced feature vectors in this set can be derived from the global cross-modal fusion feature vectors corresponding to the initial object boundary position information set. These global cross-modal fusion feature vectors can be feature vectors obtained by performing cross-modal attention calculations on the global key-value tensor, the global visual feature vector set, and the global text feature vector set. In practice, the execution entity can first perform global pooling on the initial object boundary position information set to obtain boundary feature vectors. Then, progressive feature distillation and fusion is performed on the global cross-modal fusion feature vectors corresponding to the initial object boundary position information set and the boundary feature vectors to obtain distilled enhanced feature vectors. This progressive feature distillation and fusion can be achieved by computing a weighted sum of the boundary feature vector (weight 0.7) and the corresponding global cross-modal fusion feature vector (weight 0.3) and then performing L2 normalization on the result to generate the distilled enhanced feature vector. Finally, a convolutional neural network is used to extract contour features from the distilled enhanced feature vectors to obtain the set of locally enhanced feature vectors.
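The weighted fusion and L2 normalization described in this step can be sketched as follows. The weights 0.7 and 0.3 come from the text; representing feature vectors as plain lists of floats and the function name `fuse_and_normalize` are assumptions made for illustration.

```python
import math

def fuse_and_normalize(boundary_vec, global_vec, w_boundary=0.7, w_global=0.3):
    """Weighted sum of a boundary feature vector and the corresponding global
    cross-modal fusion feature vector, followed by L2 normalization."""
    fused = [w_boundary * b + w_global * g for b, g in zip(boundary_vec, global_vec)]
    norm = math.sqrt(sum(v * v for v in fused))
    if norm == 0.0:
        return fused  # nothing to normalize for an all-zero vector
    return [v / norm for v in fused]

# Hypothetical two-dimensional feature vectors.
distilled = fuse_and_normalize([1.0, 0.0], [0.0, 1.0])
```

After normalization the vector has unit length, so later similarity or attention computations are not dominated by feature magnitude.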

[0054] The sixth step involves performing object boundary calibration on the initial object boundary position information set based on the aforementioned local enhanced feature vector set and the aforementioned cached and optimized key-value tensor set, to obtain the calibrated boundary position information set as the object position information set, and mapping the object position information set onto the image to be calibrated to obtain the calibration image.

[0055] As an example, the aforementioned execution entity can first input the aforementioned local enhanced feature vector set and the aforementioned cached and optimized key-value tensor set into the cross-modal attention module to obtain a local cross-modal boundary feature vector set. Then, the aforementioned local cross-modal boundary feature vector set is input into the aforementioned boundary morphology generation model for boundary localization to obtain an object position information set.

[0056] The above-described technical solution and its related content, as an inventive point of this disclosure, solve the technical problem of "low object detection efficiency, requiring a large amount of GPU memory resources, prolonging image annotation time, and wasting a large amount of computational resources." The cause of this problem is often as follows: because object detection based on large models requires full storage of the key vectors and value vectors of each attention layer, object detection efficiency is low, a large amount of GPU memory is required, image annotation time is prolonged, and a large amount of computational resources is wasted. Addressing this cause can improve object detection efficiency, reduce GPU memory consumption, shorten image annotation time, and reduce the waste of computational resources. To achieve this effect, this disclosure first generates an initial object boundary position information set based on the global visual feature vector set and the global text feature vector set obtained by object feature extraction. The large uncalibrated object detection model is used to coarsely locate the initial object boundaries. Furthermore, cosine similarity is used to filter out interference from background regions in the target image to be calibrated, further improving the coverage and accuracy of the initial object boundary position information set. Secondly, the visual and text key-value tensors are pruned, retaining only the key-value tensors corresponding to the initial object boundary position information set and releasing the key-value tensors of the background region, thus reducing the GPU memory occupied by the key-value tensors. Next, the pruned key-value tensor set is cached and optimized so that secondary boundary recognition can be performed quickly from the cached pruned key-value tensors, which avoids re-detecting the target image to be calibrated and improves localization efficiency. Then, progressive distillation is performed on the global cross-modal fusion feature vectors corresponding to the initial object boundary position information set. Fine-grained local feature extraction is performed within the spatial range corresponding to the initial object boundary position information set, and progressive distillation improves the extraction of object boundary contours. Next, object boundary calibration is performed on the initial object boundary position information set. Secondary precise boundary localization is performed using the cached key-value tensors, which improves the accuracy of boundary localization and, by reusing key-value tensors, avoids redundant calculations and improves localization efficiency. Finally, the object position information set is mapped onto the image to be calibrated to obtain the calibration image. An accurate object position information set improves the accuracy and quality of calibration on the original image (i.e., the image to be calibrated) and avoids the inconsistency, common in large models, between the output coordinates and the corresponding object coordinates in the original image. This shortens the image annotation time and reduces the waste of computational resources.

[0057] Step 103: Determine the positioning transformation matrix based on the object position information set and target position information set corresponding to the anchor point object image.

[0058] In some embodiments, the execution entity may determine a positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set. The positioning transformation matrix may be a matrix representing the coordinate mapping relationship between the target position information set and the object position information set of the anchor point object image.

[0059] As an example, the aforementioned execution entity can utilize least-squares DLT (Direct Linear Transform) to determine the positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set.

[0060] In some optional implementations of certain embodiments, the execution entity may determine the positioning transformation matrix, based on the object position information set corresponding to the anchor point object image and the target position information set, through the following steps:

[0061] The first step is to generate a set of coordinate mapping equations based on the coordinate mapping information of the object position information set and the target position information set. The coordinate mapping equations in this set characterize the transformation between the object position information set and the target position information set via a planar projection transformation matrix. This planar projection transformation matrix can be an invertible transformation from the two-dimensional plane of the image corresponding to the object position information set (which conforms to the input of the large uncalibrated object detection model) to the two-dimensional plane of the image to be calibrated. It can be a 3x3 matrix with 8 degrees of freedom, capable of uniformly representing two-dimensional transformations such as translation, rotation, scaling, shearing, and perspective. The set of coordinate mapping equations can be expressed as follows:

$$ s \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} $$

where $H$ can represent the planar projection transformation matrix and $s = h_{31}x + h_{32}y + h_{33}$ is the homogeneous scale. $h_{11}$, $h_{12}$, $h_{21}$, and $h_{22}$ define the linear transformations (e.g., rotation, scaling, shearing), and $h_{13}$ and $h_{23}$ define the translation. $h_{31}$ and $h_{32}$ define the perspective components, which produce the effect of objects appearing larger when closer and smaller when farther away. $h_{33}$ is the scale factor, set to 1 for normalization. $(x', y')$ can represent target position information. $(x, y)$ can represent object position information.

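[0062] The second step is to perform coordinate normalization on the above coordinate mapping equation set to obtain a normalized coordinate mapping equation set. This normalized coordinate mapping equation set can be expressed as follows:

$$ x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \qquad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}} $$

where $(x, y)$ is object position information, $(x', y')$ is the corresponding target position information, and $h_{ij}$ are the elements of the planar projection transformation matrix.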

[0063] The third step involves performing a linear transformation on the normalized set of coordinate mapping equations to obtain a system of linear coordinate mapping equations. This system of linear coordinate mapping equations characterizes the process of transforming the nonlinear normalized set of coordinate mapping equations into linear equations. The system of linear coordinate mapping equations can be expressed as follows:

$$ \begin{aligned} h_{11}x + h_{12}y + h_{13} - h_{31}xx' - h_{32}yx' - h_{33}x' &= 0 \\ h_{21}x + h_{22}y + h_{23} - h_{31}xy' - h_{32}yy' - h_{33}y' &= 0 \end{aligned} $$

where $(x, y)$ is object position information, $(x', y')$ is the corresponding target position information, and $h_{ij}$ are the elements of the planar projection transformation matrix. In practice, the execution entity can first perform cross-multiplication on the normalized coordinate mapping equation set to eliminate the denominators, obtaining the multiplied equation set. Then, it can rearrange the multiplied equation set so that all terms are linear in the elements $h_{ij}$, obtaining the system of linear coordinate mapping equations.

[0064] The fourth step is to construct the coordinate transformation coefficient matrix based on the system of linear coordinate mapping equations. This matrix can be an 8x9 matrix built from four corner pairs, where each row corresponds to one linear coordinate mapping equation and each column corresponds to an element of the planar projection transformation matrix. Each corner pair corresponds to two linear coordinate mapping equations, for a total of eight linear coordinate mapping equations. The coordinate transformation coefficient matrix can be represented as follows:

$$ A = \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x'_1 & -y_1x'_1 & -x'_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y'_1 & -y_1y'_1 & -y'_1 \\ & & & & \vdots & & & & \\ x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4x'_4 & -y_4x'_4 & -x'_4 \\ 0 & 0 & 0 & x_4 & y_4 & 1 & -x_4y'_4 & -y_4y'_4 & -y'_4 \end{bmatrix} $$

where $(x_i, y_i)$ is the object position information and $(x'_i, y'_i)$ the target position information of the $i$-th corner pair. Each element of the coordinate transformation coefficient matrix can be a coefficient of the system of linear coordinate mapping equations. As an example, the execution entity can transform the system of linear coordinate mapping equations into the form $A\mathbf{h} = \mathbf{0}$ and use the coefficient matrix $A$ of this form as the coordinate transformation coefficient matrix. $\mathbf{h} = [h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}, h_{33}]^{T}$ can represent the column vector converted from the planar projection transformation matrix.

[0065] The fifth step involves constrained decomposition and reshaping of the coordinate transformation coefficient matrix to obtain the positioning transformation matrix. In practice, the execution entity can use the singular value decomposition algorithm to solve the homogeneous system defined by the coordinate transformation coefficient matrix, obtaining the solved matrix. Then, the solved matrix is reshaped to obtain the positioning transformation matrix. This reshaping can involve converting the solved nine-element column vector into a 3x3 matrix.
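The five steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the disclosed implementation: it assumes the positioning transformation matrix maps object position information (model output coordinates) to target position information (coordinates in the image to be calibrated), builds two coefficient rows per point pair, takes the right singular vector associated with the smallest singular value, and rescales so that the bottom-right element is 1. The function name `estimate_homography` and the sample coordinates are hypothetical.

```python
import numpy as np

def estimate_homography(src_points, dst_points):
    """Estimate a 3x3 matrix H with dst ~ H @ src from at least four point pairs."""
    rows = []
    for (x, y), (xp, yp) in zip(src_points, dst_points):
        # Two linear equations per point pair, after eliminating the denominator.
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp, -xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp, -yp])
    A = np.asarray(rows, dtype=float)
    # The solution of A h = 0 is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    # Normalize so that the scale factor (bottom-right element) equals 1.
    return H / H[2, 2]

# Hypothetical anchor corners: detected positions and their known target positions.
detected = [(0, 0), (1, 0), (1, 1), (0, 1)]
targets = [(3, 4), (5, 4), (5, 6), (3, 6)]
H = estimate_homography(detected, targets)
```

With exactly four pairs the coefficient matrix is 8x9 and the solution is exact; with more pairs the same call yields a least-squares estimate.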

[0066] When addressing the aforementioned technical problems in the application scenario of calibrating original images based on large models, the following technical issue often arises: the curve boundary points of the acquired object position information set and target position information set are sparse and difficult to segment, which leads to low accuracy of the positioning matrix and therefore poor coordinate mapping (calibration) quality on the original image. The positioning matrix then has to be determined repeatedly, which extends the calibration duration. Considering the following requirements of this application scenario (adaptability to equivariant self-attention mechanisms, equivariant cross-attention mechanisms, sparse point sets, and soft-matching confidence matrices), combined with existing advantages such as ample computational resources and a high-precision large model, the following solution is adopted:

[0067] In some optional implementations of certain embodiments, the execution entity may determine the positioning transformation matrix, based on the object position information set corresponding to the anchor point object image and the target position information set, through the following steps:

[0068] The first step involves performing coordinate scale normalization on the object position information set and the target position information set to obtain a normalized object position information set and a normalized target position information set. Both normalized sets remove the dependence on absolute position and scale. In practice, the execution entity can first determine the mean of the object position information set and the mean of the target position information set as the object centroid and the target centroid, respectively. Next, it determines the difference between each item of the object position information set and the object centroid to obtain an object position difference set, and the difference between each item of the target position information set and the target centroid to obtain a target position difference set. Then, it determines the mean of the object position difference set and the mean of the target position difference set to obtain the object position difference mean and the target position difference mean. Finally, it determines the ratio of the object position difference set to the object position difference mean as the normalized object position information set, and the ratio of the target position difference set to the target position difference mean as the normalized target position information set.
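A minimal sketch of this normalization is given below. One point needs an assumption: the plain mean of the differences from the centroid is zero by construction, so the sketch takes the "mean of the position difference set" to be the mean Euclidean distance to the centroid. The function name `normalize_points` is hypothetical.

```python
import math

def normalize_points(points):
    """Center 2D points on their centroid and divide by the mean distance to it.
    Returns (normalized_points, centroid, mean_distance)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    diffs = [(x - cx, y - cy) for x, y in points]
    # Assumption: the "difference mean" is the mean magnitude of the differences.
    mean_dist = sum(math.hypot(dx, dy) for dx, dy in diffs) / n
    if mean_dist == 0.0:
        return diffs, (cx, cy), 1.0  # all points coincide; leave them centered
    normalized = [(dx / mean_dist, dy / mean_dist) for dx, dy in diffs]
    return normalized, (cx, cy), mean_dist

# Hypothetical object positions (e.g., four anchor corners).
normalized, centroid, scale = normalize_points([(0, 0), (2, 0), (2, 2), (0, 2)])
```

Applying the same function to the target position information set gives two point sets that share a common origin and unit mean spread, regardless of image resolution.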

[0069] The second step involves determining the geometric context features of the normalized object position information set and the normalized target position information set, which serve as the initial object position feature vector set and the initial target position feature vector set, respectively. The initial object position feature vectors in the initial object position feature vector set can represent the geometric context features of the normalized object position information. Similarly, the initial target position feature vectors in the initial target position feature vector set can represent the geometric context features of the normalized target position information. In practice, the execution entity can first convert the object position information set and the target position information set into a three-dimensional object position information set and a three-dimensional target position information set, in which the vertical-axis value of every item of position information is 0. Next, for each item of three-dimensional object position information, the set of relative position vectors and the set of relative distances to the other items of three-dimensional object position information are determined as object geometric feature information; likewise, for each item of three-dimensional target position information, the set of relative position vectors and the set of relative distances to the other items of three-dimensional target position information are determined as target geometric feature information. Finally, the object geometric feature information and the target geometric feature information are input into the multilayer perceptron to obtain the initial object position feature vector set and the initial target position feature vector set.
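The geometric feature information described here can be sketched as follows. The sketch lifts each 2D point to 3D with a zero vertical coordinate and, for each point, collects the relative position vectors and distances to all other points; the multilayer perceptron that consumes these features is omitted, and the function name `geometric_context` is hypothetical.

```python
import math

def geometric_context(points_2d):
    """For each point, return (relative_vectors, relative_distances) to all
    other points after lifting the 2D points to 3D with z = 0."""
    points_3d = [(x, y, 0.0) for x, y in points_2d]
    features = []
    for i, p in enumerate(points_3d):
        rel_vectors = []
        rel_distances = []
        for j, q in enumerate(points_3d):
            if i == j:
                continue  # skip the point itself
            vector = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
            rel_vectors.append(vector)
            rel_distances.append(math.sqrt(sum(c * c for c in vector)))
        features.append((rel_vectors, rel_distances))
    return features

# Hypothetical two-point set.
context = geometric_context([(0, 0), (3, 4)])
```

Because only relative quantities are used, the features do not change when the whole point set is translated.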

[0070] The third step involves performing multi-layer equivariant self-attention processing on the initial object position feature vector set and the initial target position feature vector set, respectively, to obtain the object attention feature vector set and the target attention feature vector set. The object attention feature vectors in the object attention feature vector set can represent the aggregated global structural information of the object position information set. Similarly, the target attention feature vectors in the target attention feature vector set can represent the aggregated global structural information of the target position information set. This multi-layer equivariant self-attention processing can be performed using three self-attention layers of the SE3-Transformer.

[0071] The fourth step involves performing equivariant cross-attention processing on the aforementioned object attention feature vector set and target attention feature vector set to obtain object position feature vector set and target position feature vector set. The object position feature vector in the object position feature vector set can be obtained by performing cross-attention processing using the object attention feature vector set as the query vector and the target attention feature vector set as the key and value vectors. This vector represents the context-aware and geometric-aware information of each object's position. Similarly, the target position feature vector in the target position feature vector set can be obtained by performing cross-attention processing using the target attention feature vector set as the query vector and the object attention feature vector set as the key and value vectors. This vector represents the context-aware and geometric-aware information of each target's position. This equivariant cross-attention processing can be performed using three cross-attention layers in the SE3-Transformer.

[0072] The fifth step is to determine the point matching confidence matrix for the aforementioned object position feature vector set and the aforementioned target position feature vector set. Each element in the point matching confidence matrix represents the degree of matching between each object position information and each target position information. In practice, the execution entity can first determine the dot product of each object position feature vector and each target position feature vector to obtain the matching matrix. Then, the matching matrix, the aforementioned object position feature vector set, and the aforementioned target position feature vector set are input into a lightweight multilayer perceptron to obtain the point matching confidence matrix.

[0073] The sixth step is to perform, based on the point matching confidence matrix, a weighted decomposition on the object position feature vector set and the target position feature vector set to obtain a positioning transformation matrix, and then, based on the positioning transformation matrix, to map the object position information set onto the image to be calibrated to complete image calibration. In practice, the execution entity can use the Sinkhorn algorithm to normalize the point matching confidence matrix to obtain a doubly stochastic matrix of matching points. Each element of this doubly stochastic matrix can represent the matching probability or confidence level between an item of object position information and an item of target position information. Then, using a weighted singular value decomposition algorithm, a weighted decomposition is performed on the object position feature vector set and the target position feature vector set based on the doubly stochastic matrix of matching points to obtain the positioning transformation matrix. Finally, using the positioning transformation matrix, the image position information set of the object position information set in the image to be calibrated is determined. Based on the image position information set, image positioning calibration is performed in the image to be calibrated to obtain a calibrated image.
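The Sinkhorn normalization mentioned in this step can be sketched in a few lines. The sketch assumes a square confidence matrix with strictly positive entries (for example, scores passed through an exponential beforehand) and a fixed iteration count; both are illustrative assumptions, the function name `sinkhorn` is hypothetical, and the subsequent weighted singular value decomposition is not shown.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately normalize rows and columns of a positive square matrix so
    that it approaches a doubly stochastic matrix (rows and columns sum to 1)."""
    m = [[float(v) for v in row] for row in matrix]
    size = len(m)
    for _ in range(iterations):
        # Row normalization: each row sums to 1.
        for row in m:
            total = sum(row)
            for j in range(size):
                row[j] /= total
        # Column normalization: each column sums to 1.
        for j in range(size):
            total = sum(row[j] for row in m)
            for row in m:
                row[j] /= total
    return m

# Hypothetical 2x2 point matching confidence matrix.
soft_matches = sinkhorn([[2.0, 1.0], [1.0, 3.0]])
```

Each entry of the result can be read as a soft matching weight between one object point and one target point, which is what a weighted decomposition would consume.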

[0074] The above-described technical solution and its related content, as an inventive point of this disclosure, solve the technical problem of "low accuracy of the positioning matrix, poor calibration quality, and prolonged calibration time". The cause of this problem is often as follows: because the curve boundary points of the acquired object position information set and target position information set are sparse and difficult to match precisely, the accuracy of the positioning matrix between the two sets is low, resulting in poor coordinate mapping onto the original image, that is, poor calibration quality, so that the positioning matrix must be determined repeatedly and the calibration time is prolonged. Addressing this cause can improve the accuracy of the positioning matrix, improve calibration quality and efficiency, and shorten the calibration time. To achieve this effect, this disclosure first performs coordinate scale normalization on the object position information set and the target position information set, which effectively eliminates absolute position offsets and scale differences and allows the SE3-Transformer model to focus only on the relative geometric structure of the position information sets, significantly improving the model's generalization to different imaging conditions and to targets of different scales. Secondly, the geometric context features of the normalized object position information set and the normalized target position information set are determined. By encoding global geometric association information for each item of position information using global shape context features, the inherent geometric relationships within a sparse position information set can be effectively captured, avoiding the problem that local features cannot be extracted from sparse points. Subsequently, multi-layer equivariant self-attention processing allows the features of each item of position information to aggregate the global geometric information of the other items in the same position information set; equivariant cross-attention processing can accurately capture the geometric similarity between the object position information set and the target position information set, improving the accuracy of sparse position information matching. Then, the point matching confidence matrix is determined, which quantifies the matching similarity between the object position information set and the target position information set and effectively filters out low-quality and invalid matches. Next, a weighted decomposition is performed based on the point matching confidence matrix, converting the matching result into rigid body transformation parameters and completing the mapping from point matching to actual coordinates. The doubly stochastic matrix of matching points tolerates noise, outliers, and differences in point set size. Furthermore, by assigning weights, the weighted singular value decomposition algorithm overcomes the limitation that the ordinary singular value decomposition algorithm can only handle non-differentiable hard correspondences, thus improving the accuracy of the positioning transformation matrix. Finally, calibration on the image to be calibrated is completed using the positioning transformation matrix, which improves the accuracy of image calibration and shortens the calibration time.

[0075] Step 104: Based on the positioning transformation matrix, determine the image position information set of the object position information set in the image to be calibrated, and based on the image position information set, perform image positioning calibration in the image to be calibrated to obtain the calibration image.

[0076] In some embodiments, the execution entity can determine, based on the positioning transformation matrix, the image position information set of the object position information set in the image to be calibrated, and perform image positioning calibration in the image to be calibrated based on the image position information set to obtain a calibration image. Each piece of image position information in the image position information set can be the position of the corresponding object position information on the image to be calibrated. There is a one-to-one spatial correspondence between the image position information set and the object position information set. The calibration image can be an image that marks the position of each object in the image to be calibrated in the form of a rectangular bounding box. As shown in Figure 4, the upper part of the figure is a schematic of directly marking, on the image to be calibrated, the object position information set output by the uncalibrated object detection large model. The red triangle in Figure 4 represents an object to be identified in the image to be calibrated, while the black rectangle represents the bounding box given by the object position information set output by the uncalibrated object detection large model. The lower part of Figure 4 is a schematic of image calibration on the target image after the object position information set has been coordinate-mapped by the positioning transformation matrix.

[0077] In some optional implementations of certain embodiments, the execution entity may determine the image position information set of the object position information set in the image to be calibrated by means of the following steps, based on the positioning transformation matrix:

[0078] The first step is to determine the product of the inverse matrix of the above positioning transformation matrix and the matrix of the target object position information set to obtain the initial position information set, wherein the target object position information set is the set obtained by removing the object position information set corresponding to the target position information set from the object position information set.

[0079] The second step involves performing resolution mapping on the initial position information set to obtain the image position information set. In practice, the execution entity can determine the ratio of each initial position information in the initial position information set to the element in the third row and third column of the positioning transformation matrix to obtain the image position information set.
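The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: positions are single 2D points keyed by a label, the positioning transformation matrix is a 3x3 matrix that maps coordinates of the image to be calibrated into the model's output space, and the names `invert_3x3` and `map_detections_to_image` are hypothetical. The paragraph above describes the resolution mapping as a ratio against the element in the third row and third column of the matrix. The sketch instead divides by each point's own third homogeneous component, which is the usual homogeneous normalization and gives the same result when the matrix is affine with that element equal to 1. This substitution is an assumption of the sketch.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix given as nested lists, using the adjugate formula."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("positioning transformation matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def map_detections_to_image(detections, anchor_labels, transform):
    """Map model-space points back onto the image to be calibrated.

    detections: dict of label -> (x, y) in the model's output space.
    anchor_labels: labels of the anchor objects, which are removed first.
    transform: 3x3 positioning transformation matrix (image -> model space).
    """
    inv = invert_3x3(transform)
    mapped = {}
    for label, (x, y) in detections.items():
        if label in anchor_labels:
            continue  # anchors are reference points, not objects to calibrate
        # First step: multiply the inverse matrix by the homogeneous point.
        u = inv[0][0] * x + inv[0][1] * y + inv[0][2]
        v = inv[1][0] * x + inv[1][1] * y + inv[1][2]
        w = inv[2][0] * x + inv[2][1] * y + inv[2][2]
        # Second step: normalize the homogeneous result to pixel coordinates.
        mapped[label] = (u / w, v / w)
    return mapped
```

For example, with a matrix that halves both axes and shifts the vertical axis by 100 pixels, a detection at (320, 300) in model space maps back to (640, 400) on the image to be calibrated.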

[0080] The above embodiments of this disclosure have the following beneficial effects: The image localization and calibration methods of some embodiments of this disclosure, by superimposing anchor point object images on the image to be calibrated as reference points for localization and calibration, transform the object position information set output by the model. This simplifies the calibration process on the image to be calibrated, reduces the waste of computational resources, improves calibration efficiency, and shortens the calibration time. Specifically, the reason for the low efficiency of related image annotation, the prolonged image annotation time, and the waste of a large amount of computational resources is that image processing involves multiple complex image processing steps. After obtaining the object position information, an inverse transformation is performed, requiring a large amount of computational resources. This makes it difficult to adapt to input images of different sizes. In fields involving coordinate calibration and localization, such as image annotation and image processing, this leads to low efficiency of image annotation, prolonged image annotation time, and a waste of a large amount of computational resources. Based on this, the image localization and calibration methods of some embodiments of this disclosure can first obtain the target image to be calibrated and the large model for detecting uncalibrated objects, wherein the target image to be calibrated is an image on which anchor point object images located in the target position information set are superimposed. Therefore, anchor objects—those with known target location information superimposed on the image to be calibrated and accurately identified by the subsequent large-scale uncalibrated object detection model—serve as the foundation for determining the object location information set and the localization transformation matrix. 
This provides a stable, predictable reference standard for subsequent construction. Furthermore, the target location information set reduces occlusion and interference with the main content of the image to be calibrated and ensures that the image to be calibrated can still be preserved after image processing such as cropping and scaling. Secondly, using the aforementioned large-scale uncalibrated object detection model, the object location information set of each object included in the target image to be calibrated is determined. This allows for accurate identification of the object location information set of each object (including anchor object images), improving the accuracy of the object location information set and providing a data foundation for determining the localization transformation matrix. Then, based on the object location information set corresponding to the anchor object images and the aforementioned target location information set, the localization transformation matrix is determined. Therefore, a geometric transformation relationship can be established between the object position information set of the anchor point object image and the target position information set, improving the accuracy of the positioning transformation matrix. This facilitates the subsequent precise determination of the object position information mapping transformation between the image to be calibrated and the image after image processing, as well as between images of different resolutions, thereby improving the efficiency and accuracy of calibrating each object in the image to be calibrated. 
Finally, based on the aforementioned positioning transformation matrix, the image position information set of the aforementioned object position information set in the aforementioned image to be calibrated is determined, and based on the aforementioned image position information set, image positioning and calibration are performed in the aforementioned image to be calibrated to obtain the calibrated image. Therefore, the localization transformation matrix converts the calibration of object boundaries on an image that conforms to the input size of the uncalibrated object detection large model into precise calibration of the boundaries of each object on the image to be calibrated. This effectively simplifies the coordinate mapping process, adapts to calibration of images of different sizes, improves image calibration efficiency, shortens calibration time, and reduces the computational resources required for image calibration. In summary, this image localization and calibration method, by superimposing anchor point object images on the image to be calibrated as reference points for localization and calibration, transforms the object position information set output by the model. This effectively solves the output coordinate mapping problem caused by the network structure of the large model itself, simplifies the calibration process on the image to be calibrated, reduces the waste of computational resources, improves calibration efficiency, and shortens calibration time.
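The role of the anchors as reference points can be illustrated with a deliberately simplified sketch. It assumes the only difference between the image to be calibrated and the model's output space is a per-axis scale and offset (as with resizing plus padding), which is narrower than the general transformation the disclosure describes. The function name `fit_scale_offset` and the point lists are hypothetical.

```python
def fit_scale_offset(target_pts, detected_pts):
    """Least-squares fit of detected = scale * target + offset for each axis.

    target_pts: known anchor positions on the image to be calibrated.
    detected_pts: positions of the same anchors reported by the model.
    Returns a 3x3 matrix mapping image coordinates to model coordinates.
    """
    if len(target_pts) != len(detected_pts) or len(target_pts) < 2:
        raise ValueError("need at least two matched anchor positions")

    def fit_axis(ts, ds):
        n = len(ts)
        mean_t = sum(ts) / n
        mean_d = sum(ds) / n
        var_t = sum((t - mean_t) ** 2 for t in ts)
        if var_t == 0:
            raise ValueError("anchors must differ along every axis")
        cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(ts, ds))
        scale = cov / var_t
        return scale, mean_d - scale * mean_t

    sx, ox = fit_axis([p[0] for p in target_pts], [p[0] for p in detected_pts])
    sy, oy = fit_axis([p[1] for p in target_pts], [p[1] for p in detected_pts])
    return [[sx, 0.0, ox], [0.0, sy, oy], [0.0, 0.0, 1.0]]
```

With anchors at the four corners of a 1920x1080 image and detections at (0, 140), (640, 140), (0, 500) and (640, 500), the fit recovers a scale of one third on both axes and a vertical offset of 140, i.e. the resize and padding applied before the model saw the image.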

[0081] With further reference to Figure 5, a flowchart 500 of some other embodiments of the image localization and calibration method according to the present disclosure is shown. This image localization and calibration method includes the following steps:

[0082] Step 501: Obtain the target image to be calibrated and the large model for detecting uncalibrated objects.

[0083] In some embodiments, the specific implementation of step 501 and its resulting technical effects can be found in step 101 of the embodiments corresponding to Figure 1, and will not be repeated here.

[0084] Step 502: Determine the target detection prompt information for the target image to be calibrated.

[0085] In some embodiments, the execution entity may determine target detection prompt information for the target image to be calibrated. This target detection prompt information may be a prompt used to guide the uncalibrated object detection model to perform target detection on the target image to be calibrated. For example, the target detection prompt information may be "Please output the absolute coordinates of the QR code and the electric vehicle license plate in the image to be calibrated."

[0086] Step 503: Adjust the target detection prompt information by performing intent recognition to obtain the adjusted target detection prompt information.

[0087] In some embodiments, the execution entity can adjust the target detection prompt information by performing intent recognition to obtain adjusted target detection prompt information. The adjusted target detection prompt information can be an optimization of the target detection prompt information to enable the unlabeled object detection large model to accurately identify the user's intent. In practice, the execution entity can use a few-shot prompting method to adjust the target detection prompt information by performing intent recognition to obtain the adjusted target detection prompt information.

[0088] In some optional implementations of certain embodiments, the aforementioned execution entity may perform intent recognition adjustment on the aforementioned target detection prompt information through the following steps to obtain the adjusted target detection prompt information:

[0089] The first step is to deblur the target detection prompt information, that is, to remove its ambiguity, to obtain the deblurred target detection prompt information. This deblurring process can involve removing ambiguous words.

[0090] The second step involves decomposing the deblurred target detection prompts into sub-task prompt information sets. These sub-task prompts can be prompts specific to a single problem. In practice, the executing entity can first input the deblurred target detection prompts into a large language model to extract constraint information, resulting in a key prompt information set. This key prompt information set may include, but is not limited to, at least one of the following: final target information, style and tone constraint information, and quantification information. Finally, based on a preset decomposition rule information set and the key prompt information set, the deblurred target detection prompts are further decomposed into sub-tasks, yielding a sub-task prompt information set. This preset decomposition rule information set can be a rule information set obtained by inputting the key prompt information set into a large language model.

[0091] The third step involves extracting positive and negative keywords from the aforementioned subtask prompt information set, resulting in a set of positive and negative keyword groups. Each set of positive and negative keyword groups can include multiple positive keywords and multiple negative keywords. The positive keywords can represent the intent of the subtask prompt information. The negative keywords can represent constraint boundary information that prohibits the subtask prompt information from guiding the large-scale unlabeled object detection model.

[0092] The fourth step involves performing constraint normalization on the output constraint information set included in the aforementioned subtask prompt information set, resulting in an output constraint normalization information set. This output constraint normalization information set can be information that standardizes and quantifies the output format of the uncalibrated object detection large model. The constraint normalization can be performed using XML Schema 2.0 technology. Because XML Schema 2.0 technology can define the output framework, field rules, and data types for the uncalibrated object detection large model, the model fills in the output content within that framework rather than choosing a format on its own, which can eliminate format deviations and reduce the efficiency lost to format inference.

[0093] The fifth step is to generate, based on the aforementioned set of positive and negative keyword groups and the aforementioned output constraint normalization information set, a sub-task prompt information set that serves as the adjusted target detection prompt information. In practice, the executing entity can input the set of positive and negative keyword groups and the output constraint normalization information set into a combination template to obtain the sub-task prompt information set, which serves as the adjusted target detection prompt information.
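The final combination step can be sketched as a simple template fill. This is only an illustration: the `SubTask` structure, the wording of the template, and the plain-text output format are assumptions, and the earlier steps (deblurring, decomposition, keyword extraction, constraint normalization) are taken as already done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubTask:
    """One sub-task prompt with its positive and negative keyword group."""
    goal: str
    positive: List[str]   # keywords expressing the intent of the sub-task
    negative: List[str]   # keywords marking what the model must not do


def build_adjusted_prompt(subtasks, output_format):
    """Fill a fixed template with keyword groups and the output constraint."""
    lines = []
    for index, task in enumerate(subtasks, start=1):
        lines.append(f"Task {index}: {task.goal}")
        lines.append("Focus on: " + ", ".join(task.positive))
        lines.append("Do not: " + ", ".join(task.negative))
    # The normalized output constraint is stated once, after all sub-tasks.
    lines.append("Output format: " + output_format)
    return "\n".join(lines)
```

A call such as `build_adjusted_prompt([SubTask("Locate the QR code", ["QR code", "absolute coordinates"], ["relative coordinates"])], "one bounding box per object as x1,y1,x2,y2")` yields a four-line prompt in which the goal, the keyword groups, and the output constraint each occupy their own line.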

[0094] Step 504: Input the adjusted target detection prompt information and the target image to be calibrated into the uncalibrated object detection large model to obtain the object position information set.

[0095] In some embodiments, the execution entity may input the adjusted target detection prompt information and the target image to be calibrated into the uncalibrated object detection large model to obtain a set of object position information for each of the objects.

[0096] Step 505: Determine the positioning transformation matrix based on the object position information set and target position information set corresponding to the anchor point object image.

[0097] Step 506: Based on the positioning transformation matrix, determine the image position information set of the object position information set in the image to be calibrated, and perform image positioning calibration in the image to be calibrated based on the image position information set to obtain the calibration image.

[0098] In some embodiments, the specific implementation of steps 505-506 and the resulting technical effects can be found in steps 103-104 of the embodiments corresponding to Figure 1, and will not be repeated here.

[0099] As can be seen from Figure 5, compared with the description of some embodiments corresponding to Figure 1, the flowchart 500 of the image localization and calibration method in some embodiments corresponding to Figure 5 embodies the step of determining the localization transformation matrix of the large-scale uncalibrated object detection model based on the object position information set of the anchor point object image and the target position information set. Therefore, the scheme described in these embodiments can accurately detect the object position information set of each object included in the target image to be calibrated through a large language model, thereby reducing the cumulative error in the subsequent determination of the localization transformation matrix.

[0100] As shown in Figure 6, an image localization and calibration device 600 includes: an acquisition unit 601, a first determination unit 602, a second determination unit 603, and a third determination unit 604. The acquisition unit 601 is configured to acquire a target image to be calibrated and an uncalibrated object detection model, wherein the target image to be calibrated is an image on which anchor point object images located in a target position information set are superimposed. The first determination unit 602 is configured to determine the object position information set of each object included in the target image to be calibrated using the uncalibrated object detection model. The second determination unit 603 is configured to determine a localization transformation matrix based on the object position information set corresponding to the anchor point object images and the target position information set. The third determination unit 604 is configured to determine the image position information set of the object position information set in the target image to be calibrated based on the localization transformation matrix, and perform image localization and calibration in the target image to be calibrated based on the image position information set to obtain a calibrated image.

[0101] It can be understood that the units described in the image positioning and calibration device 600 correspond to the steps of the method described with reference to Figure 1. Therefore, the operations, features, and beneficial effects described above for the method also apply to the image positioning and calibration device 600 and the units contained therein, and will not be repeated here.

[0102] Referring now to Figure 7, a schematic structural diagram of an electronic device 700 suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in Figure 7 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of this disclosure.

[0103] As shown in Figure 7, the electronic device 700 may include a processing unit (e.g., a central processing unit, a graphics processor, etc.) 701, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing unit 701, ROM 702, and RAM 703 are interconnected via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

[0104] Typically, the following devices can be connected to the I/O interface 705: input devices 706 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 707 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 708 including, for example, magnetic tapes, hard disks, etc.; and communication devices 709. The communication device 709 allows the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 7 shows an electronic device 700 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed. Each box shown in Figure 7 can represent one device or multiple devices as needed.

[0105] In particular, according to some embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, some embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication device 709, or installed from storage device 708, or installed from ROM 702. When the computer program is executed by processing device 701, it performs the functions defined in the methods of some embodiments of this disclosure.

[0106] It should be noted that, in some embodiments of this disclosure, the computer-readable medium described above may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In some embodiments of this disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In some embodiments of this disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0107] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected with digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0108] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device. The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a target image to be calibrated and a large-scale uncalibrated object detection model, wherein the target image to be calibrated is an image on which anchor point object images located in the target position information set are superimposed; determine the object position information set of each object included in the target image to be calibrated using the large-scale uncalibrated object detection model; determine a localization transformation matrix based on the object position information set corresponding to the anchor point object images and the target position information set; determine the image position information set of the object position information set in the target image to be calibrated based on the localization transformation matrix; and perform image localization calibration in the target image to be calibrated based on the image position information set to obtain a calibrated image.

[0109] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0110] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0111] The units described in some embodiments of this disclosure can be implemented in software or hardware. The described units can also be housed in a processor; for example, a processor may be described as including an acquisition unit, a first determining unit, a second determining unit, and a third determining unit. The names of these units do not necessarily limit the specific unit; for example, the acquisition unit may also be described as "a unit for acquiring a target image to be calibrated and a large model for detecting uncalibrated objects."

[0112] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.

[0113] Some embodiments of this disclosure also provide a computer program product, including a computer program that, when executed by a processor, implements any of the above-described image positioning calibration methods.

[0114] The above description is merely a description of preferred embodiments of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, it covers technical solutions formed by interchanging the above-described features with technical features having similar functions disclosed in (but not limited to) the embodiments of this disclosure.

Claims

1. An image localization and calibration method, characterized in that the method comprises: obtaining a target image to be calibrated and an uncalibrated object detection large model, wherein the target image to be calibrated is an image on which anchor point object images located in the target position information set are superimposed; determining, by means of the uncalibrated object detection large model, the object position information set of each object included in the target image to be calibrated; determining the positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set; determining, based on the positioning transformation matrix, the image position information set of the object position information set in the image to be calibrated, wherein determining the image position information set of the object position information set in the image to be calibrated based on the positioning transformation matrix includes: determining the product of the inverse matrix of the positioning transformation matrix and the matrix of the target object position information set to obtain the initial position information set, wherein the target object position information set is the set obtained by removing the object position information set corresponding to the target position information set from the object position information set; and performing resolution mapping on the initial position information set to obtain the image position information set; and performing image positioning calibration in the image to be calibrated based on the image position information set to obtain a calibration image.

2. The method according to claim 1, characterized in that the step of determining the object position information set of each object included in the target image to be calibrated using the uncalibrated object detection large model includes: determining the target detection prompt information for the target image to be calibrated; adjusting the target detection prompt information by intent recognition to obtain the adjusted target detection prompt information; and inputting the adjusted target detection prompt information and the target image to be calibrated into the uncalibrated object detection large model to obtain the object position information set.

3. The method according to claim 1, characterized in that the step of determining the positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set includes: generating a set of coordinate mapping equations based on the coordinate mapping information of the object position information set and the target position information set; normalizing the set of coordinate mapping equations to obtain a normalized set of coordinate mapping equations; performing a linear transformation on the normalized set of coordinate mapping equations to obtain a system of linear coordinate mapping equations; constructing a coordinate transformation coefficient matrix based on the system of linear coordinate mapping equations; and performing constraint decomposition and reshaping on the coordinate transformation coefficient matrix to obtain the positioning transformation matrix.

4. The method according to claim 1, characterized in that the step of obtaining the target image to be calibrated and the uncalibrated object detection large model includes: constructing an object image that can be recognized by the uncalibrated object detection large model, and using it as the anchor point object image; determining the position information set of the anchor point object image in the image to be calibrated, and using it as the target position information set; and superimposing, based on the target position information set, the anchor point object image onto the image to be calibrated to obtain the target image to be calibrated.

5. The method according to claim 4, characterized in that the step of constructing an object image that can be recognized by the uncalibrated object detection large model, as the anchor point object image, includes: performing feature recognition on the image to be calibrated to obtain global feature information to be calibrated; determining the information of the region to be superimposed for the image to be calibrated and the set of construction constraint information for the anchor point object image; inputting the global feature information to be calibrated, the information of the region to be superimposed, and the set of construction constraint information into the anchor point image generation model to obtain an initial anchor point image; generating a binary encoding matrix based on a preset anchor point marker dictionary; and adding the binary encoding matrix to the initial anchor point image to obtain the anchor point object image.

6. The method according to claim 2, characterized in that adjusting the target detection prompt information based on intent recognition to obtain the adjusted target detection prompt information includes: disambiguating the target detection prompt information to obtain disambiguated target detection prompt information; performing task decomposition on the disambiguated target detection prompt information to obtain a subtask prompt information set; extracting positive and negative keywords from the subtask prompt information set to obtain a set of positive and negative keyword groups; performing constraint normalization on the output constraint information set included in the subtask prompt information set to obtain an output constraint normalization information set; and generating, based on the set of positive and negative keyword groups and the output constraint normalization information set, a set of subtask prompt information that is used as the adjusted target detection prompt information.

7. An image positioning calibration device, characterized by comprising: an acquisition unit configured to acquire a target image to be calibrated and an uncalibrated object detection large model, wherein the target image to be calibrated is an image obtained by superimposing, on the image to be calibrated, the anchor point object image located at the target position information set; a first determining unit configured to determine, by using the uncalibrated object detection large model, the object position information set of each object included in the target image to be calibrated; a second determining unit configured to determine the positioning transformation matrix based on the object position information set corresponding to the anchor point object image and the target position information set; and a third determining unit configured to determine, based on the positioning transformation matrix, the image position information set of the object position information set in the image to be calibrated, wherein determining the image position information set of the object position information set in the image to be calibrated based on the positioning transformation matrix includes: determining the matrix product of the inverse matrix of the positioning transformation matrix and the target object position information set to obtain an initial position information set, wherein the target object position information set is the set obtained by removing, from the object position information set, the object position information set corresponding to the anchor point object image; performing resolution mapping on the initial position information set to obtain the image position information set; and performing image positioning calibration in the image to be calibrated based on the image position information set to obtain a calibrated image.
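
The inverse-matrix step and the resolution mapping recited in claim 7 can be sketched as follows. Assumptions not taken from the claim: the positioning transformation matrix is a 3x3 matrix mapping image coordinates to model-output coordinates, positions are axis-aligned boxes (x0, y0, x1, y1), and "resolution mapping" is read as rounding to integer pixels and clamping to the image bounds. All function names are hypothetical.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("positioning transformation matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def transform_point(m, point):
    """Apply a 3x3 matrix to a 2D point in homogeneous coordinates."""
    x, y = point
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return ((m[0][0] * x + m[0][1] * y + m[0][2]) / w,
            (m[1][0] * x + m[1][1] * y + m[1][2]) / w)


def clamp(value, low, high):
    return max(low, min(high, value))


def map_boxes(transform, boxes, anchor_indices, image_size):
    """Map model-output boxes back into the image to be calibrated.

    Boxes whose index is in `anchor_indices` belong to anchors and are
    removed first, leaving the target object positions.
    """
    inverse = invert_3x3(transform)
    width, height = image_size
    result = []
    for index, (x0, y0, x1, y1) in enumerate(boxes):
        if index in anchor_indices:
            continue
        px0, py0 = transform_point(inverse, (x0, y0))
        px1, py1 = transform_point(inverse, (x1, y1))
        # Assumed resolution mapping: snap to the pixel grid, stay in bounds.
        result.append((clamp(round(px0), 0, width - 1),
                       clamp(round(py0), 0, height - 1),
                       clamp(round(px1), 0, width - 1),
                       clamp(round(py1), 0, height - 1)))
    return result
```

Mapping only two corners is adequate for scale-and-offset transforms. Under a genuinely perspective matrix, one would map all four corners and take their bounding box.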

8. An electronic device, characterized by comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1-6.

9. A computer-readable medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method as described in any one of claims 1-6 is implemented.