Image labeling method and apparatus, device, program product, and storage medium

By combining multimodal large models and image segmentation models, image annotation is automatically performed, solving the problem of time-consuming and labor-intensive manual annotation in existing technologies, and realizing an efficient and accurate image annotation process.

WO2026123615A1 · PCT designated stage · Publication Date: 2026-06-18 · NANJING YIMU INTELLIGENT TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: NANJING YIMU INTELLIGENT TECHNOLOGY CO LTD
Filing Date: 2025-06-04
Publication Date: 2026-06-18

Application Information

Patent Timeline

04 Jun 2025: Application

18 Jun 2026: Publication

WO2026123615A1

IPC: G06V20/70

AI Tagging

Application Domain

Character and pattern recognition


AI Technical Summary

⚠Technical Problem

Existing image target detection models rely heavily on manual annotation, especially the accurate annotation of target locations, which is labor-intensive and leads to low efficiency.

⚗Method used

The system automatically performs image annotation using a multimodal large model and an image segmentation model. Through iterative optimization of the initial annotation information, it generates high-precision target annotation information.

🎯Benefits of technology

It enables efficient and accurate image annotation without human intervention, reducing the cost of manual annotation and improving the efficiency and accuracy of image annotation.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present invention disclose an image labeling method and apparatus, a device, a program product, and a storage medium. The method comprises: acquiring an image to be processed, and, on the basis of the image to be processed and a predetermined multi-modal large model, obtaining labeling task information of the image to be processed; for each piece of initial labeling information, on the basis of current initial labeling information and a pre-established image segmentation model, performing image segmentation on the image to be processed to obtain a segmentation result; determining whether the segmentation result satisfies a preset condition, and if the segmentation result does not satisfy the preset condition, optimizing the current initial labeling information on the basis of the segmentation result, the image segmentation model, and the multi-modal large model to obtain target labeling information corresponding to the current initial labeling information; and, on the basis of target labeling information corresponding to each piece of initial labeling information, performing image labeling on the image to be processed. The method of the present invention implements an efficient, accurate, and automatic image labeling process, reduces the costs of manual labeling, and improves user experience.


Description

An image annotation method, apparatus, device, program product, and storage medium

Technical Field

[0001] The embodiments of the present invention relate to the field of artificial intelligence, and in particular to an image annotation method, apparatus, device, program product and storage medium.

Background Technology

[0002] In the field of computer vision, image object detection and localization has become a fundamental and crucial research direction. Currently, deep learning-based image object detection methods have become the mainstream research approach, with significant achievements made in methods based on YOLO models and neural network models.

[0003] However, the successful application of these models relies on a large amount of high-quality data that requires manual annotation. In practice, data annotation is extremely labor-intensive, especially for precise annotation of target locations (i.e., segmentation annotation). Therefore, how to efficiently and automatically perform accurate image annotation has become an urgent problem to be solved.

Summary of the Invention

[0004] This invention provides an image annotation method, apparatus, device, program product, and storage medium that can automatically and accurately obtain image annotation positions using multimodal large models and image segmentation models, achieving precise image annotation without human intervention and improving the efficiency of image annotation.

[0005] In a first aspect, embodiments of the present invention provide an image annotation method, comprising:

[0006] acquiring an image to be processed, and obtaining annotation task information of the image to be processed based on the image to be processed and a predetermined multimodal large model, wherein the annotation task information includes at least one piece of initial annotation information;

[0007] for each piece of initial annotation information, performing image segmentation on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result;

[0008] determining whether the segmentation result meets the preset conditions and, if it does not, optimizing the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information; and

[0009] annotating the image to be processed according to the target annotation information corresponding to each piece of initial annotation information.

[0010] In a second aspect, embodiments of the present invention provide an image annotation apparatus, the apparatus comprising:

[0011] an image acquisition module, used to acquire an image to be processed and obtain annotation task information of the image to be processed based on the image to be processed and a predetermined multimodal large model, wherein the annotation task information includes at least one piece of initial annotation information;

[0012] an image segmentation module, used to perform, for each piece of initial annotation information, image segmentation on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result;

[0013] an information optimization module, used to determine whether the segmentation result meets the preset conditions and, if it does not, to optimize the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information; and

[0014] an image annotation module, used to annotate the image to be processed according to the target annotation information corresponding to each piece of initial annotation information.

[0015] In a third aspect, embodiments of the present invention also provide an electronic device, the electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the image annotation method as described in any of the embodiments of the present invention.

[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the image annotation method as described in any of the embodiments of the present invention.

[0017] In a fifth aspect, embodiments of the present invention provide a computer program product, including a computer program that, when executed by a processor, implements the image annotation method as described in any of the embodiments of the present invention.

[0018] In this embodiment of the invention, an image to be processed is acquired, and annotation task information for the image is obtained based on the image and a predetermined multimodal large model; the annotation task information includes at least one piece of initial annotation information. For each piece of initial annotation information, image segmentation is performed on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result. Whether the segmentation result meets the preset conditions is then determined. If it does not, the current initial annotation information is optimized based on the segmentation result, the image segmentation model, and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information. The image to be processed is annotated according to the target annotation information corresponding to each piece of initial annotation information. The method of this embodiment uses the multimodal large model to automatically generate the annotation information required by the image segmentation model, reducing manual annotation work. Iteratively optimizing the initial annotation information with the multimodal large model and the image segmentation model, in a coarse-to-fine manner, improves the accuracy of image annotation. The method thereby achieves an efficient, accurate, and automated image annotation process, reducing the cost of manual annotation while improving the user experience.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a first flowchart of an image annotation method provided in an embodiment of the present invention;

[0021] Figure 2 is a schematic diagram of the image to be processed provided in an embodiment of the present invention;

[0022] Figure 3 is a flowchart of the iterative optimization of initial annotation information provided in an embodiment of the present invention;

[0023] Figure 4 is a second flowchart of an image annotation method provided in an embodiment of the present invention;

[0024] Figure 5 is a schematic diagram of the image drawing provided in an embodiment of the present invention;

[0025] Figure 6 is a schematic diagram of an image annotation device provided in an embodiment of the present invention;

[0026] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0027] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

[0028] Figure 1 is a first flowchart of an image annotation method provided by an embodiment of the present invention. The method of this embodiment can automatically and accurately obtain the image annotation position using a multimodal large model and an image segmentation model, achieving precise image annotation without human intervention and improving the efficiency of image annotation. This method can be executed by an image annotation device provided by an embodiment of the present invention, which can be implemented in software and / or hardware. The following embodiments will illustrate this using the example of the device integrated into an electronic device, such as a server or computer device. Referring to Figure 1, the method specifically includes the following steps:

[0029] Step 101: Obtain the image to be processed, and obtain the annotation task information of the image to be processed based on the image to be processed and the pre-determined multimodal large model.

[0030] The annotation task information includes at least one piece of initial annotation information, and each piece of initial annotation information includes at least one initial annotation point and the position of each initial annotation point in the image to be processed. The initial annotation information is the annotation information that the multimodal large model produces when it annotates the image to be processed. The image to be processed is the image that needs to be annotated. The multimodal large model is a model capable of processing and understanding multiple types of input data (such as text and images) and generating comprehensive outputs about this data. In this scheme, the multimodal large model can be a networked multimodal large model accessed through pre-defined model interfaces, or it can be a pre-built and trained visual question-answering large model. Specifically, after the image to be processed is acquired, the information to be input to the multimodal large model is determined based on the basic information of the image to be processed (such as size information and image content), and this information is input into the multimodal large model together with the image to be processed. The multimodal large model then generates the annotation task information based on the input information and the image to be processed. Optionally, in this scheme, obtaining the annotation task information of the image to be processed based on the image to be processed and a predetermined multimodal large model includes: determining annotation guidance information based on the image to be processed and pre-acquired annotation requirement information; and inputting the annotation guidance information into the multimodal large model to obtain the annotation task information output by the multimodal large model.

[0031] The annotation guidance information is used to guide the multimodal large model in annotating the images to be processed. The annotation requirement information includes the object information to be annotated and the annotation requirements. The object information to be annotated includes objects, specific points, scenes or specific regions in the image to be processed; the annotation requirements include the annotation format requirements and the annotation accuracy requirements.

[0032] Specifically, when image annotation is required for an image to be processed, the process acquires the image to be processed and annotation requirements, and generates annotation guidance information for the image based on the basic information of the image to be processed and the annotation requirements. For example, the image to be processed is an image containing multiple garments, and the annotation requirements are: mark the center point of each garment in coordinate form. Based on the basic information of the image to be processed and the annotation requirements, the annotation guidance information is determined to be: "The image size is 800mm wide and 600mm high; identify the garments in the image and output the coordinates of the center point of each garment."
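The way the guidance text in paragraph [0032] is assembled can be sketched as follows. This is a minimal illustration only: the `AnnotationRequirement` class, the `build_annotation_guidance` function, and their parameters are hypothetical names that do not appear in the patent, and the wording of the template simply mirrors the example sentence above.

```python
from dataclasses import dataclass


@dataclass
class AnnotationRequirement:
    """Hypothetical container for the annotation requirement information."""
    target_objects: str  # what to annotate, e.g. "garments"
    output_format: str   # requested annotation, e.g. "coordinates of the center point"


def build_annotation_guidance(width: int, height: int, unit: str,
                              req: AnnotationRequirement) -> str:
    """Combine basic image information and the requirements into one guidance string."""
    return (
        f"The image size is {width}{unit} wide and {height}{unit} high; "
        f"identify the {req.target_objects} in the image and output the "
        f"{req.output_format} of each one."
    )


requirement = AnnotationRequirement("garments", "coordinates of the center point")
guidance = build_annotation_guidance(800, 600, "mm", requirement)
print(guidance)
```

The resulting string would then be sent to the multimodal large model together with the image; that call is omitted here because the patent does not specify a model interface.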

[0033] After the annotation guidance information is determined, the annotation guidance information and the image to be processed are input into the multimodal large model. Upon receiving the annotation guidance information, the multimodal large model performs text analysis on it to determine how the image to be processed needs to be processed. Based on the text analysis results, the model processes the image to be processed, and the image processing result is taken as the annotation task information. For example, Figure 2 is a schematic diagram of the image to be processed provided in an embodiment of the present invention. As shown in Figure 2, the image to be processed includes two garments. After the image to be processed and the annotation guidance information are input into the multimodal large model, the model analyzes the image and outputs the annotation task information, which includes two pieces of initial annotation information: the initial annotation information for garment 1 and the initial annotation information for garment 2, that is, the coordinates of the center point of each garment. The center point coordinates of garment 1 are (300, 400), and the center point coordinates of garment 2 are (400, 350).
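As an illustration of how such a reply could be turned into initial annotation points, the sketch below extracts coordinate pairs from free text. The reply wording and the `parse_center_points` helper are assumptions made for this example; the patent does not define the output format of the multimodal large model.

```python
import re


def parse_center_points(reply: str) -> list[tuple[int, int]]:
    """Extract every "(x, y)" integer pair from a free-text model reply."""
    pairs = re.findall(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", reply)
    return [(int(x), int(y)) for x, y in pairs]


# Hypothetical reply text, modeled on the garment example above.
reply = ("The center point coordinates of garment 1 are (300, 400); "
         "the center point coordinates of garment 2 are (400, 350).")
initial_points = parse_center_points(reply)
print(initial_points)
```

Each extracted pair corresponds to one piece of initial annotation information and is processed separately in the following steps.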

[0034] In the above steps, the generalization image recognition capability of the multimodal large model can be used to perform preliminary recognition of the image to be processed, obtain approximate initial annotation information, lay the foundation for obtaining accurate annotation points, save manpower, and improve the efficiency of image annotation.

[0035] Step 102: For each initial annotation information, perform image segmentation on the image to be processed based on the current initial annotation information and the pre-established image segmentation model to obtain the segmentation result.

[0036] The initial annotation information includes at least one initial annotation point and the position information of each initial annotation point in the image to be processed. The image segmentation model is used to perform fine segmentation of the image to be processed based on the initial annotation information output by the multimodal large model, in order to obtain more accurate image annotation information. The segmentation result of the image to be processed includes the image region corresponding to the initial annotation point. The image segmentation model in this scheme can be a Segment Anything Model (SAM). In an optional implementation, after obtaining each initial annotation information, the image segmentation model can be called multiple times to process each initial annotation information. For each initial annotation information, the current initial annotation information is input into the SAM. The SAM performs image segmentation on the image to be processed based on the position information of each initial annotation point in the current initial annotation information, obtaining one or more binary masks (segmentation results) corresponding to the image to be processed. This binary mask represents the set of pixels in the image to be processed that belong to the image region corresponding to the initial annotation point.
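To make the notion of a point-prompted binary mask concrete, the sketch below uses a toy flood fill as a stand-in for the segmentation model. It is explicitly not SAM and does not use any SAM API; `segment_from_point` is a hypothetical helper that marks the connected region of equal pixel values around the prompt point, which is enough to show the input (a point) and the output (a binary mask of the same size as the image).

```python
from collections import deque


def segment_from_point(image, point):
    """Toy stand-in for a point-prompted segmenter (NOT SAM): flood-fill the
    4-connected region whose value equals the value at the prompt point."""
    height, width = len(image), len(image[0])
    x0, y0 = point  # (x, y) = (column, row)
    target = image[y0][x0]
    mask = [[0] * width for _ in range(height)]
    mask[y0][x0] = 1
    queue = deque([(x0, y0)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and not mask[ny][nx] and image[ny][nx] == target:
                mask[ny][nx] = 1
                queue.append((nx, ny))
    return mask


# A tiny single-channel image: the 2x2 block of 7s plays the role of a garment.
image = [
    [0, 0, 0, 0, 0],
    [0, 7, 7, 0, 0],
    [0, 7, 7, 0, 9],
    [0, 0, 0, 0, 9],
]
mask = segment_from_point(image, (1, 1))
for row in mask:
    print(row)
```

In the scheme described above, the mask would instead come from the image segmentation model, called once per piece of initial annotation information.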

[0037] Step 103: Determine whether the segmentation result meets the preset conditions. If the segmentation result does not meet the preset conditions, optimize the current initial annotation information based on the segmentation result, image segmentation model and multimodal large model to obtain the target annotation information corresponding to the current initial annotation information.

[0038] The preset conditions are used to determine whether the initial annotation points corresponding to the segmentation results are accurate enough or meet the annotation requirements. If the segmentation results meet the preset conditions, it means that the initial annotation points are accurate enough, and they can be directly determined as the target annotation points of the image to be processed. If the segmentation results do not meet the preset conditions, it means that the initial annotation points are not accurate enough, and the initial annotation information needs to be continuously optimized based on the segmentation model and the multimodal large model until initial annotation information that meets the preset conditions is obtained, and the initial annotation information that meets the preset conditions is determined as the target annotation information. In an optional implementation, after obtaining the segmentation results, the accuracy of the segmentation results can be evaluated using the multimodal large model to obtain the accuracy evaluation result of the segmentation results. The accuracy evaluation result of the segmentation results is used to determine whether the segmentation results meet the preset conditions.

[0039] In an optional implementation, if the segmentation result does not meet the preset conditions, the initial annotation information is adjusted based on the accuracy evaluation information of the segmentation result to obtain candidate annotation information; the current initial annotation information is updated according to the candidate annotation information, and the step of segmenting the image to be processed based on the current initial annotation information and the pre-established image segmentation model is repeated until the segmentation result meets the preset conditions, and the target annotation information of the image to be processed is obtained. Figure 3 is a flowchart of the iterative optimization of the initial annotation information provided in the embodiment of the present invention. As shown in Figure 3, after obtaining the image to be processed, the image to be processed is input into the multimodal large model to obtain the initial annotation information. The initial annotation information is input into the image segmentation model for image segmentation to obtain the segmentation result. The segmentation result is input into the multimodal large model to determine whether the segmentation result meets the preset conditions. If it does, the initial annotation information is determined as the target annotation information. If the segmentation result does not meet the preset conditions, candidate annotation information is obtained according to the segmentation result, and the candidate annotation information is added to the initial annotation information. The step of inputting the initial annotation information into the image segmentation model for image segmentation to obtain the segmentation result is repeated until a segmentation result that meets the preset conditions is obtained, and the target annotation information is obtained.
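The loop of Figure 3 can be sketched with the two models passed in as plain callables. The function name `refine_annotation`, the callable signatures, and the `max_rounds` cap are assumptions added for this example (the patent describes repeating until the preset conditions are met and does not mention an iteration limit); the stubs at the bottom only exist so the loop can run without any model.

```python
def refine_annotation(initial_points, segment, evaluate, propose, max_rounds=10):
    """Repeat segment -> evaluate -> propose until the evaluation passes.

    segment(points)          -> segmentation result
    evaluate(result)         -> True if the preset conditions are met
    propose(points, result)  -> list of candidate points to add
    max_rounds is a safety cap assumed for this sketch.
    """
    points = list(initial_points)
    for _ in range(max_rounds):
        result = segment(points)
        if evaluate(result):
            return points, result  # current points become the target annotation
        points = points + propose(points, result)
    raise RuntimeError("preset conditions not met within max_rounds")


# Stub callables standing in for the segmentation model and the multimodal model.
def segment(points):
    return {"n_points": len(points)}


def evaluate(result):
    return result["n_points"] >= 3


def propose(points, result):
    last_x, last_y = points[-1]
    return [(last_x + 5, last_y + 8)]


target_points, final_result = refine_annotation([(100, 150)], segment, evaluate, propose)
print(target_points)
```

Passing the models in as callables keeps the control flow independent of any particular model interface.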

[0040] Step 104: Annotate the image to be processed according to the target annotation information corresponding to each initial annotation information.

[0041] The target annotation information can be the coordinates of the target annotation points, the bounding box information of the target region, or other forms of information. After obtaining the target annotation information, image annotation is performed on the image to be processed based on the target annotation information and the annotation requirements.
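When the requested form is a bounding box, one plausible way to derive it is from the final binary mask. The sketch below assumes the mask is a list of rows of 0/1 values; `mask_to_bbox` is a hypothetical helper, not something named in the patent.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if the mask is empty."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(mask))
```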

[0042] In the technical solution of this embodiment, an image to be processed is acquired, and annotation task information for the image is obtained based on the image and a predetermined multimodal large model; the annotation task information includes at least one piece of initial annotation information. For each piece of initial annotation information, image segmentation is performed on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result. Whether the segmentation result meets the preset conditions is then determined. If it does not, the current initial annotation information is optimized based on the segmentation result, the image segmentation model, and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information. The image to be processed is then annotated according to the target annotation information corresponding to each piece of initial annotation information. This technical solution uses the multimodal large model to automatically generate the annotation information required by the image segmentation model, reducing manual annotation work. Iteratively optimizing the initial annotation information with the multimodal large model and the image segmentation model, in a coarse-to-fine manner, improves the accuracy of image annotation. The method thereby achieves an efficient, accurate, and automated image annotation process, reducing the cost of manual annotation while improving the user experience.

[0043] Figure 4 is a second flowchart of an image annotation method provided by an embodiment of the present invention. This embodiment is a refinement based on the above embodiment. The specific method is shown in Figure 4, and the method may include the following steps:

[0044] Step 401: Obtain the image to be processed, and obtain the annotation task information of the image to be processed based on the image to be processed and the pre-determined multimodal large model.

[0045] The annotation task information includes at least one initial annotation information.

[0046] Step 402: For each initial annotation information, perform image segmentation on the image to be processed based on the current initial annotation information and the pre-established image segmentation model to obtain the segmentation result.

[0047] Step 403: Update the image to be processed based on the segmentation results to obtain the drawn image of the image to be processed; analyze the drawn image through a multimodal large model to obtain the annotation offset information of the segmentation results.

[0048] The annotation offset information includes the offset amount and the offset direction. The magnitude of the offset indicates the accuracy of the initial annotation points corresponding to the segmentation result; the smaller the offset, the higher the accuracy of the segmentation result and of the initial annotation points. The drawn image is used to visually display the segmentation result on the image to be processed, allowing the multimodal large model to evaluate the accuracy of the segmentation result. Specifically, after the segmentation result is obtained, the segmentation result (such as a binary mask) is superimposed on the image to be processed to obtain a new image, namely the drawn image of the image to be processed. The multimodal large model then performs image analysis on the drawn image to obtain the annotation offset information of the segmentation result. In this scheme, optionally, analyzing the drawn image through the multimodal large model to obtain the annotation offset information of the segmentation result includes: determining evaluation guidance information based on the drawn image; and inputting the drawn image and the evaluation guidance information into the multimodal large model to obtain the annotation offset information output by the multimodal large model.
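Superimposing a binary mask on the image can be sketched as a simple alpha blend. This is one possible rendering under stated assumptions: the image is a list of rows of RGB tuples, and the `draw_mask` helper, the overlay color, and the `alpha` value are all hypothetical choices; the patent only states that the segmentation result is superimposed on the image.

```python
def draw_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into every pixel where the mask is nonzero; other pixels are unchanged."""
    drawn = []
    for image_row, mask_row in zip(image, mask):
        row = []
        for pixel, flag in zip(image_row, mask_row):
            if flag:
                pixel = tuple(
                    round((1 - alpha) * p + alpha * c) for p, c in zip(pixel, color)
                )
            row.append(pixel)
        drawn.append(row)
    return drawn


image = [[(100, 100, 100), (100, 100, 100)],
         [(100, 100, 100), (100, 100, 100)]]
mask = [[0, 1],
        [0, 0]]
drawn = draw_mask(image, mask, color=(200, 0, 0))
print(drawn[0])
```

The input image is left untouched, so the same original can be redrawn with a new mask in every refinement round.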

[0049] The evaluation guidance information is used to instruct the multimodal large model on how to evaluate the accuracy of the segmentation results. In this scheme, the evaluation guidance information includes image description information, accuracy evaluation criteria for the segmentation results, and evaluation precision. After the evaluation guidance information is obtained, it is input into the multimodal large model along with the drawn image. The multimodal large model then analyzes the drawn image based on the evaluation guidance information and calculates the offset amount and offset direction of the segmentation results.

[0050] For example, Figure 5 is a schematic diagram of the drawn image provided in an embodiment of the present invention. After the segmentation result corresponding to the first piece of initial annotation information in Figure 2 (the initial annotation information of garment 1) is obtained, the segmentation result is drawn onto the image to be processed, resulting in the drawn image shown in Figure 5. The dashed lines in Figure 5 represent the outline of garment 1 obtained according to the segmentation result. The evaluation guidance information determined from the drawn image is: "Confirm whether the garment outline drawn by the dashed lines is accurate. If accurate, output no deviation. If inaccurate, output deviation, and output the offset direction and offset amount of the outline center." The drawn image and the evaluation guidance information are input into the multimodal large model to obtain the annotation offset information output by the multimodal large model. If the multimodal large model outputs "no deviation," the offset amount is determined to be 0, and the offset direction is "none."
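How the model's reply is mapped to an offset amount and direction could look like the sketch below. The reply format for the deviation case ("direction: (dx, dy); amount: d") is purely an assumption for this example, as is the `parse_offset_reply` helper; the patent only fixes the "no deviation" case, which maps to an offset of 0 and no direction.

```python
import re


def parse_offset_reply(reply: str):
    """Return (offset_amount, offset_direction); direction is None when there is no deviation."""
    text = reply.strip().lower()
    if text.startswith("no deviation"):
        return 0.0, None
    # Assumed format: "... direction: (dx, dy); amount: d"
    match = re.search(
        r"direction:\s*\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\).*?amount:\s*([\d.]+)",
        text,
    )
    if match is None:
        raise ValueError(f"unrecognized reply: {reply!r}")
    dx, dy, amount = (float(group) for group in match.groups())
    return amount, (dx, dy)


print(parse_offset_reply("No deviation"))
print(parse_offset_reply("Deviation; direction: (0.5, 0.8); amount: 10"))
```

In practice the evaluation guidance would need to request a fixed reply format for parsing like this to be reliable.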

[0051] The above steps can be used to evaluate the accuracy of the segmentation results through a multimodal large model, providing valuable feedback for further optimization of the annotation information and helping to gradually improve the overall accuracy of image segmentation and annotation.

[0052] Step 404: If the offset is greater than or equal to the preset offset, the segmentation result is determined not to meet the preset conditions.

[0053] The preset offset is determined in advance based on domain big data and annotation requirements, and is used to determine the accuracy of the segmentation result. Preset conditions determine whether the initial annotation points need adjustment. If the segmentation result meets the preset conditions, it means that no further adjustment is needed to the initial annotation points corresponding to the segmentation result, and the initial annotation points can be directly determined as the target annotation points. If the segmentation result does not meet the preset conditions, it means that the initial annotation points corresponding to the segmentation result need optimization. If the offset is greater than or equal to the preset offset, it indicates that the accuracy of the segmentation result is low, and it can be further determined that the segmentation result does not meet the preset conditions.
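The check in Step 404 reduces to a single comparison, sketched below. The function name and the threshold value of 5 are hypothetical; the patent says only that the preset offset is determined in advance.

```python
def meets_preset_condition(offset: float, preset_offset: float) -> bool:
    """True when the offset is strictly below the preset offset.

    Per Step 404, an offset greater than or equal to the preset offset fails.
    """
    return offset < preset_offset


PRESET_OFFSET = 5.0  # hypothetical threshold chosen for this example

for offset in (0.0, 5.0, 10.0):
    print(offset, meets_preset_condition(offset, PRESET_OFFSET))
```

Note that the boundary case counts as a failure, so an offset exactly equal to the preset offset triggers another round of optimization.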

[0054] Step 405: If the segmentation result does not meet the preset conditions, obtain candidate annotation information based on the annotation offset information of the segmentation result.

[0055] Specifically, if the segmentation result does not meet the preset conditions, it means that the initial annotation points are not accurate enough. After obtaining the segmentation result, the annotation points of the image to be processed are recalculated based on the offset amount and offset direction of the annotation offset information of the segmentation result to obtain candidate annotation information.

[0056] For example, the initial coordinates of an annotation point in the image to be processed are (x1, y1), with x1 = 100 and y1 = 150. The offset amount in the annotation offset information is d = 10, and the offset direction is (dx, dy), where dx and dy are the components of the offset direction along the x and y axes, with dx = 0.5 and dy = 0.8. The candidate annotation point = initial annotation point + offset amount * offset direction, which gives: candidate annotation point = (x1, y1) + d * (dx, dy) = (100, 150) + 10 * (0.5, 0.8) = (100 + 5, 150 + 8) = (105, 158).
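The arithmetic of this example can be checked with a few lines of code. The `candidate_point` function name is an assumption; the formula and the numbers are taken from the paragraph above.

```python
def candidate_point(point, offset, direction):
    """Shift `point` by `offset` along `direction`: (x + d * dx, y + d * dy)."""
    x, y = point
    dx, dy = direction
    return (x + offset * dx, y + offset * dy)


candidate = candidate_point((100, 150), 10, (0.5, 0.8))
print(candidate)  # (105.0, 158.0)
```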

[0057] Step 406: Update the current initial annotation information according to the candidate annotation information, and repeat the step of segmenting the image to be processed based on the current initial annotation information and the pre-established image segmentation model until the segmentation result meets the preset conditions, and obtain the target annotation information of the image to be processed.

[0058] The initial annotation information includes at least one initial annotation point and its position information in the image to be processed; the candidate annotation information includes at least one candidate annotation point and its position information in the image to be processed. In an optional implementation, after the candidate annotation information is obtained, it is added to the current initial annotation information to obtain the updated current initial annotation information. That is, the updated initial annotation information contains both the candidate annotation points and the initial annotation points from before the update, and all of these points are input into the image segmentation model, which segments the image to be processed again to obtain a new segmentation result. The multimodal large model then evaluates the accuracy of this segmentation result to determine whether it meets the preset conditions. If it does not, the updated initial annotation information is updated again, and the process repeats until a segmentation result that meets the preset conditions is obtained; the initial annotation information corresponding to that segmentation result is determined as the target annotation information.

[0059] By iteratively updating the initial annotation information based on the multimodal large model and the image segmentation model, the accuracy of the segmentation results can be gradually improved, and high-precision target annotation information can be obtained in the end.
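The iteration of Steps 405 and 406 can be sketched as a loop. This is a rough sketch under several assumptions that the specification does not state: the image segmentation model and the multimodal large model are represented by the hypothetical callables `segment` and `evaluate`, the candidate point is computed from the most recently added point, and a `max_rounds` cap is added so the loop always terminates. The stand-in functions `fake_segment` and `fake_evaluate` exist only to make the example runnable.

```python
from math import hypot
from typing import Callable, List, Tuple

Point = Tuple[float, float]
OffsetInfo = Tuple[float, Point]  # (offset amount, offset direction)


def refine_annotation_points(
    points: List[Point],
    segment: Callable[[List[Point]], object],
    evaluate: Callable[[object], OffsetInfo],
    preset_offset: float,
    max_rounds: int = 10,  # assumption: the source gives no iteration cap
) -> List[Point]:
    """Repeat segmentation until the offset amount drops below preset_offset."""
    current = list(points)
    for _ in range(max_rounds):
        result = segment(current)            # image segmentation model
        amount, (dx, dy) = evaluate(result)  # multimodal large model
        if amount < preset_offset:
            return current                   # target annotation information
        # Assumption: shift the latest point to obtain the candidate point,
        # then add it to the current annotation points.
        x, y = current[-1]
        current.append((x + amount * dx, y + amount * dy))
    raise RuntimeError("no segmentation result met the preset condition")


# Stand-in models for demonstration only.
TARGET = (110.0, 150.0)


def fake_segment(points: List[Point]) -> Point:
    return points[-1]


def fake_evaluate(result: Point) -> OffsetInfo:
    dx, dy = TARGET[0] - result[0], TARGET[1] - result[1]
    dist = hypot(dx, dy)
    if dist == 0.0:
        return 0.0, (0.0, 0.0)
    return dist, (dx / dist, dy / dist)


refined = refine_annotation_points(
    [(100.0, 150.0)], fake_segment, fake_evaluate, preset_offset=2.0
)
print(refined)
```

Keeping the earlier points alongside each candidate point mirrors the optional implementation in which candidate annotation information is added to, rather than substituted for, the current initial annotation information.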

[0060] Step 407: If the offset is less than the preset offset, determine that the segmentation result meets the preset conditions.

[0061] Step 408: If the segmentation result meets the preset conditions, then the current initial annotation information is determined as the target annotation information.

[0062] Step 409: Perform image annotation on the image to be processed based on the target annotation information.

[0063] In the technical solution of this embodiment, an image to be processed is acquired, and annotation task information for the image is obtained based on the image and a pre-determined multimodal large model, where the annotation task information includes at least one piece of initial annotation information. For each piece of initial annotation information, the image to be processed is segmented based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result. The image to be processed is updated based on the segmentation result to obtain a rendered image, and the rendered image is analyzed using the multimodal large model to obtain annotation offset information for the segmentation result, where the annotation offset information includes an offset amount and an offset direction. If the offset amount is greater than or equal to a preset offset, the segmentation result is determined not to meet the preset conditions; if the offset amount is less than the preset offset, the segmentation result is determined to meet the preset conditions. When the segmentation result does not meet the preset conditions, candidate annotation information is obtained based on the annotation offset information of the segmentation result, the current initial annotation information is updated based on the candidate annotation information, and the step of segmenting the image based on the current initial annotation information and the pre-established image segmentation model is repeated until the segmentation result meets the preset conditions, at which point the target annotation information for the image to be processed is obtained. This embodiment uses the multimodal large model to generate the initial annotation information automatically, which reduces the cost of manual image annotation. Segmenting the image finely based on the initial annotation information and the image segmentation model improves the accuracy of image annotation. Continuously updating the initial annotation information based on the annotation offset information of the segmentation results optimizes the annotation information gradually and automatically, which ensures the accuracy of image annotation while reducing reliance on manual annotation. Especially when large amounts of image data are processed, this significantly reduces labor costs and improves the efficiency of image annotation.

[0064] Figure 6 is a schematic diagram of an image annotation device provided in an embodiment of the present invention. This device is suitable for executing the image annotation method provided in an embodiment of the present invention. As shown in Figure 6, the device may specifically include:

[0065] Image acquisition module 601 is used to acquire an image to be processed and obtain annotation task information of the image to be processed based on the image to be processed and a pre-determined multimodal large model; wherein, the annotation task information includes at least one initial annotation information;

[0066] Image segmentation module 602 is used to segment the image to be processed based on the current initial annotation information and a pre-established image segmentation model, and obtain the segmentation result;

[0067] Information optimization module 603 is used to determine whether the segmentation result meets the preset conditions. If the segmentation result does not meet the preset conditions, the current initial annotation information is optimized according to the segmentation result, the image segmentation model and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information.

[0068] The image annotation module 604 is used to annotate the image to be processed according to the target annotation information corresponding to each initial annotation information.

[0069] Optionally, the image acquisition module 601 is specifically used to: determine annotation guidance information based on the image to be processed and the pre-acquired annotation requirement information;

[0070] The annotation guidance information is input into the multimodal large model to obtain the annotation task information output by the multimodal large model.

[0071] Optionally, the information optimization module 603 is specifically used to: update the image to be processed according to the segmentation result to obtain a rendered image of the image to be processed;

[0072] The rendered image is analyzed using the multimodal large model to obtain the annotation offset information of the segmentation result, wherein the annotation offset information includes the offset amount and the offset direction;

[0073] If the offset is less than a preset offset, the segmentation result is determined to meet the preset condition; if the offset is greater than or equal to the preset offset, the segmentation result is determined not to meet the preset condition.

[0074] Optionally, the information optimization module 603 is further configured to: determine evaluation guidance information based on the rendered image;

[0075] The rendered image and the evaluation guidance information are input into the multimodal large model to obtain the annotation offset information output by the multimodal large model.
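One way the annotation offset information returned by the multimodal large model might be read is sketched below. The specification does not define an output format, so the JSON shape with the keys `"offset"` and `"direction"`, the `AnnotationOffset` class, and the `parse_offset` function are all assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnnotationOffset:
    amount: float                   # offset amount
    direction: Tuple[float, float]  # offset direction (dx, dy)


def parse_offset(reply: str) -> AnnotationOffset:
    """Parse a model reply, assumed (hypothetically) to be a JSON object."""
    data = json.loads(reply)
    amount = float(data["offset"])
    dx, dy = data["direction"]
    if amount < 0:
        raise ValueError("offset amount must be non-negative")
    return AnnotationOffset(amount, (float(dx), float(dy)))


# Hypothetical reply text; a real model's output format may differ.
reply = '{"offset": 10, "direction": [0.5, 0.8]}'
print(parse_offset(reply))
```

Validating the reply before using it keeps a malformed model output from silently producing a wrong candidate annotation point.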

[0076] Optionally, the information optimization module 603 is further configured to: if the segmentation result does not meet the preset conditions, obtain candidate annotation information based on the annotation offset information of the segmentation result;

[0077] The current initial annotation information is updated according to the candidate annotation information, and the step of performing image segmentation on the image to be processed based on the current initial annotation information and the pre-established image segmentation model is repeated until the segmentation result meets the preset condition, and the target annotation information of the image to be processed is obtained.

[0078] Optionally, the initial annotation information includes at least one initial annotation point and the position information of each initial annotation point in the image to be processed; the candidate annotation information includes at least one candidate annotation point and the position information of each candidate annotation point in the image to be processed; the information optimization module 603 is further configured to: add the candidate annotation information to the current initial annotation information to obtain the updated current initial annotation information.

[0079] The image annotation device provided in this embodiment of the invention can execute the image annotation method provided in any embodiment of the invention, and has the functional modules and beneficial effects corresponding to that method. For content not described in detail in this embodiment, refer to the description in any method embodiment of the invention.

[0080] This invention also provides a computer program product.

[0081] Various embodiments of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer program products, which may include one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be an application-specific or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0082] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention. Referring to Figure 7, the electronic device 12 shown in Figure 7 is merely an example and should not impose any limitations on the function and scope of use of the embodiments of this application. As shown in Figure 7, the electronic device 12 is presented in the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting different system components (including system memory 28 and processing unit 16).

[0083] Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

[0084] Electronic device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by electronic device 12, including volatile and non-volatile media, removable and non-removable media.

[0085] System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and / or cache memory 32. Electronic device 12 may further include other removable / non-removable, volatile / non-volatile computer system storage media. By way of example only, storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 7, commonly referred to as "hard disk drives"). Although not shown in FIG. 7, disk drives for reading and writing to removable non-volatile disks (e.g., "floppy disks") and optical disk drives for reading and writing to removable non-volatile optical disks (e.g., CD-ROMs, DVD-ROMs, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 via one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of this application.

[0086] A program / utility 40 having a set (at least one) of program modules 46 may be stored, for example, in memory 28. Such program modules 46 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 46 typically perform the functions and / or methods described in the embodiments of this application.

[0087] Electronic device 12 can also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), and with one or more devices that enable a user to interact with electronic device 12, and / or with any device that enables electronic device 12 to communicate with one or more other computing devices (e.g., network card, modem, etc.). This communication can be performed via input / output (I / O) interface 22. Furthermore, electronic device 12 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with other modules of electronic device 12 via bus 18. It should be understood that, although not shown in Figure 7, other hardware and / or software modules can be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0088] The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing an image annotation method provided in this embodiment of the invention: acquiring an image to be processed, and obtaining annotation task information of the image to be processed based on the image to be processed and a pre-determined multimodal large model; wherein, the annotation task information includes at least one initial annotation information; for each initial annotation information, performing image segmentation on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result; determining whether the segmentation result meets a preset condition, and if the segmentation result does not meet the preset condition, optimizing the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain target annotation information corresponding to the current initial annotation information; and annotating the image to be processed based on the target annotation information corresponding to each initial annotation information.

[0089] This invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image annotation method as provided in the embodiments of this invention: acquiring an image to be processed, and obtaining annotation task information for the image to be processed based on the image to be processed and a pre-determined multimodal large model, wherein the annotation task information includes at least one piece of initial annotation information; for each piece of initial annotation information, performing image segmentation on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result; determining whether the segmentation result meets a preset condition; if the segmentation result does not meet the preset condition, optimizing the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain target annotation information corresponding to the current initial annotation information; and annotating the image to be processed based on the target annotation information corresponding to each piece of initial annotation information. The computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0090] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that is capable of sending, propagating, or transmitting a program for use by or in conjunction with an instruction execution system, apparatus, or device.

Claims

1. An image annotation method, characterized in that the method comprises: acquiring an image to be processed, and obtaining annotation task information of the image to be processed based on the image to be processed and a pre-determined multimodal large model, wherein the annotation task information includes at least one piece of initial annotation information; for each piece of initial annotation information, performing image segmentation on the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result; determining whether the segmentation result meets preset conditions, and if the segmentation result does not meet the preset conditions, optimizing the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain target annotation information corresponding to the current initial annotation information; and annotating the image to be processed according to the target annotation information corresponding to each piece of initial annotation information.

2. The method according to claim 1, characterized in that acquiring the image to be processed and obtaining the annotation task information of the image to be processed based on the image to be processed and the pre-determined multimodal large model comprises: determining annotation guidance information based on the image to be processed and pre-acquired annotation requirement information; and inputting the annotation guidance information into the multimodal large model to obtain the annotation task information output by the multimodal large model.

3. The method according to claim 1, characterized in that determining whether the segmentation result meets the preset conditions comprises: updating the image to be processed based on the segmentation result to obtain a rendered image of the image to be processed; analyzing the rendered image using the multimodal large model to obtain annotation offset information of the segmentation result, wherein the annotation offset information includes an offset amount and an offset direction; if the offset amount is less than a preset offset, determining that the segmentation result meets the preset conditions; and if the offset amount is greater than or equal to the preset offset, determining that the segmentation result does not meet the preset conditions.

4. The method according to claim 3, characterized in that analyzing the rendered image using the multimodal large model to obtain the annotation offset information of the segmentation result comprises: determining evaluation guidance information based on the rendered image; and inputting the rendered image and the evaluation guidance information into the multimodal large model to obtain the annotation offset information output by the multimodal large model.

5. The method according to claim 3, characterized in that, if the segmentation result does not meet the preset conditions, optimizing the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain the target annotation information corresponding to the current initial annotation information comprises: if the segmentation result does not meet the preset conditions, obtaining candidate annotation information based on the annotation offset information of the segmentation result; and updating the current initial annotation information according to the candidate annotation information, and repeating the step of performing image segmentation on the image to be processed based on the current initial annotation information and the pre-established image segmentation model until the segmentation result meets the preset conditions, to obtain the target annotation information of the image to be processed.

6. The method according to claim 5, characterized in that the initial annotation information includes at least one initial annotation point and the position information of each initial annotation point in the image to be processed; the candidate annotation information includes at least one candidate annotation point and the position information of each candidate annotation point in the image to be processed; and updating the current initial annotation information according to the candidate annotation information comprises: adding the candidate annotation information to the current initial annotation information to obtain the updated current initial annotation information.

7. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the image annotation method according to any one of claims 1 to 6.

8. An image annotation device, characterized in that the device comprises: an image acquisition module, used to acquire an image to be processed and obtain annotation task information of the image to be processed based on the image to be processed and a pre-determined multimodal large model, wherein the annotation task information includes at least one piece of initial annotation information; an image segmentation module, used to segment the image to be processed based on the current initial annotation information and a pre-established image segmentation model to obtain a segmentation result; an information optimization module, used to determine whether the segmentation result meets preset conditions and, if the segmentation result does not meet the preset conditions, to optimize the current initial annotation information based on the segmentation result, the image segmentation model, and the multimodal large model to obtain target annotation information corresponding to the current initial annotation information; and an image annotation module, used to annotate the image to be processed according to the target annotation information corresponding to each piece of initial annotation information.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the image annotation method according to any one of claims 1 to 6.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the image annotation method according to any one of claims 1 to 6.