Annotation processing method and apparatus, server, and storage medium

By combining multiple recognition models with large visual language models and large language models to perform detailed image annotation, the problem of insufficient detail description in image annotation by visual language models is solved, thereby improving the ability of autonomous vehicles to recognize their surrounding environment.

WO2026137518A1 · PCT designated stage · Publication Date: 2026-07-02 · GUANGZHOU XIAOPENG MOTORS TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GUANGZHOU XIAOPENG MOTORS TECH CO LTD
Filing Date
2025-01-02
Publication Date
2026-07-02

Abstract

An annotation processing method and apparatus, a server, and a storage medium. The annotation processing method comprises: performing information extraction on an image by means of at least two preset recognition models to obtain first description information of the image; performing first processing on the image to obtain a first processed image; and inputting the first description information of the image and the first processed image into a preset vision-language model to obtain first annotation information of the image. The solution can improve the accuracy and comprehensiveness of annotation description, and improve the ability of a vehicle to accurately identify the surrounding environment.

Description

Annotation processing method and apparatus, server, and storage medium

[0001] This application claims priority to Chinese Patent Application No. 2024119634407, filed with the State Intellectual Property Office of China on December 28, 2024, entitled “Annotation Processing Method, Apparatus, Server and Storage Medium”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of autonomous driving technology, and in particular to an annotation processing method, apparatus, server, and storage medium.

Background Technology

[0003] In intelligent autonomous driving technology, accurate understanding and processing of images are crucial. Training machine learning models in autonomous driving systems requires a large amount of image data with detailed annotations.

[0004] In related technologies, image text annotation can be performed based on visual language models. However, methods for text annotation of images based on visual language models lack the ability to describe image details. In particular, their ability to infer the specific location and relationships of key objects in an image is poor. At the same time, current visual language models lack the ability to recognize text in natural scenes such as road signs and license plates in autonomous driving scenarios.

[0005] Therefore, when related technologies use large visual language models to annotate images, the accuracy and comprehensiveness of the annotation descriptions are poor, which affects the vehicle's ability to accurately identify the surrounding environment.

Summary of the Invention

[0006] To address or partially address the problems existing in related technologies, this application provides an annotation processing method, apparatus, server, and storage medium, which can improve the accuracy and comprehensiveness of annotation descriptions and enhance the vehicle's ability to accurately identify its surrounding environment.

[0007] The first aspect of this application provides an annotation processing method, including:

[0008] Information is extracted from the image using at least two preset recognition models to obtain the first descriptive information of the image;

[0009] By performing a first process on the image, a first processed image is obtained;

[0010] The first description information of the image and the first processed image are input into a preset visual language model to obtain the first annotation information of the image.

[0011] In one embodiment, the method further includes:

[0012] The first description information and the first annotation information of the image are input into a preset language model to obtain the second annotation information of the image.

[0013] In one embodiment, the step of extracting information from the image using at least two preset recognition models to obtain the first descriptive information of the image includes:

[0014] The image is segmented using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0015] The instance detection model is used to detect instances in the image to obtain information about objects with instance concepts in the image;

[0016] The information of objects in the image that do not have the concept of instance is combined with the information of objects in the image that have the concept of instance to obtain the first descriptive information of the image.

[0017] In one embodiment, the step of extracting information from the image using at least two preset recognition models to obtain the first descriptive information of the image includes:

[0018] The image is segmented using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0019] The instance detection model is used to detect instances in the image to obtain information about objects with instance concepts in the image;

[0020] Traffic sign detection is performed on the image using a road surface sign detection model to obtain road surface traffic sign information in the image;

[0021] The information of objects in the image that do not have the concept of instance, the information of objects in the image that have the concept of instance, and the road surface traffic sign information in the image are combined and processed to obtain the first descriptive information of the image.

[0022] In one embodiment, the step of performing instance detection on the image using an instance detection model to obtain information about objects with instance concepts in the image includes:

[0023] Traffic light devices in an image are detected using an instance detection model, and traffic light information within the traffic light devices is obtained using a traffic light detection model; and / or

[0024] The instance detection model detects traffic scene objects containing text in the image, and the natural scene text recognition model is used to identify the text information in the traffic scene objects.

[0025] In one embodiment, the step of performing a first processing on the image to obtain a first processed image includes:

[0026] The 2D bounding boxes of objects with instance concepts are superimposed onto the image to obtain the first processed image.

[0027] In one embodiment, the step of inputting the first description information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image includes:

[0028] The first description information of the image, the first annotation information of the image, and the obtained description requirement information are input into a preset language model to obtain the second annotation information of the image.

[0029] A second aspect of this application provides an annotation processing apparatus, comprising:

[0030] The first information determination module is used to extract information from the image using at least two preset recognition models to obtain the first descriptive information of the image;

[0031] A first image processing module is used to obtain a first processed image by performing a first processing on the image;

[0032] The first annotation module is used to input the first description information of the image and the first processed image into a preset visual language model to obtain the first annotation information of the image.

[0033] In one embodiment, the apparatus further includes:

[0034] The second annotation module is used to input the first description information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image.

[0035] In one embodiment, the first information determining module includes:

[0036] The semantic segmentation submodule is used to segment images using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0037] The instance detection submodule is used to perform instance detection on the image using an instance detection model to obtain information about objects with instance concepts in the image;

[0038] The combination processing submodule is used to combine information about objects in the image that do not have an instance concept with information about objects in the image that do have an instance concept to obtain the first descriptive information of the image.

[0039] In one embodiment, the first information determining module includes:

[0040] The semantic segmentation submodule is used to segment images using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0041] The instance detection submodule is used to perform instance detection on the image using an instance detection model to obtain information about objects with instance concepts in the image;

[0042] The road marking detection submodule is used to detect traffic signs in the image using the road marking detection model, and obtain the road traffic sign information in the image.

[0043] The combined processing submodule is used to combine information about objects in the image that do not have an instance concept, information about objects in the image that have an instance concept, and road surface traffic sign information in the image to obtain the first descriptive information of the image.

[0044] In one embodiment, the instance detection submodule detects traffic light devices in the image using an instance detection model, and obtains traffic light information from the traffic light devices using the traffic light detection model; and / or,

[0045] The instance detection model detects traffic scene objects containing text in the image, and the natural scene text recognition model is used to identify the text information in the traffic scene objects.

[0046] A third aspect of this application provides a server, comprising:

[0047] A processor; and

[0048] A memory that stores executable code, which, when executed by the processor, causes the processor to perform the method described above.

[0049] A fourth aspect of this application provides a computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the method described above.

[0050] The technical solution provided in this application may include the following beneficial effects:

[0051] This application extracts information from an image using at least two preset recognition models to obtain first descriptive information. Then, it performs a first processing on the image to obtain a first processed image. Finally, it inputs the first descriptive information and the first processed image into a preset visual language model to obtain first annotation information. Because this application integrates multiple traditional deep visual neural networks into the annotation processing flow, it utilizes at least two preset recognition models to extract fine-grained information from the image. This extracted information is then analyzed and processed by a visual language model, combining image and text information. The visual language model generates relevant image annotation information by analyzing image content and context, thereby improving the accuracy and comprehensiveness of the annotation description and enhancing the vehicle's ability to accurately identify its surrounding environment.

[0052] Furthermore, this application can input the first descriptive information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image. In other words, this application further utilizes the high-level understanding and generalization capabilities of a large language model to simulate the human language cognition and generation process, ultimately achieving pixel-level descriptive annotation of autonomous driving scene images; thereby further improving the accuracy and comprehensiveness of the annotation description, and further improving the vehicle's accurate recognition capability of the surrounding environment.

[0053] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0054] The above and other objects, features and advantages of this application will become more apparent from the more detailed description of exemplary embodiments thereof in conjunction with the accompanying drawings, wherein the same reference numerals generally represent the same components in the exemplary embodiments thereof.

[0055] Figure 1 is a schematic diagram of the first process of the annotation processing method shown in this application;

[0056] Figure 2 is a schematic diagram of the second process of the annotation processing method shown in this application;

[0057] Figure 3 is a schematic diagram of the third process of the annotation processing method shown in this application;

[0058] Figure 4 is a schematic diagram of the application framework corresponding to Figure 3 of this application;

[0059] Figure 5 is a schematic diagram of the application process of the annotation processing method shown in this application;

[0060] Figure 6 is a schematic diagram of another application framework shown in this application;

[0061] Figure 7 is a first structural schematic diagram of the annotation processing device shown in this application;

[0062] Figure 8 is a schematic diagram of the second structure of the annotation processing device shown in this application;

[0063] Figure 9 is a schematic diagram of the server structure shown in this application.

Detailed Implementation

[0064] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to make this application more thorough and complete, and to fully convey the scope of this application to those skilled in the art.

[0065] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0066] It should be understood that although the terms "first," "second," "third," etc., may be used in this application to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0067] The related technology uses a large visual language model to annotate images, but the accuracy and comprehensiveness of the annotation descriptions are poor, which affects the vehicle's ability to accurately identify the surrounding environment.

[0068] To address the aforementioned issues, this application provides an annotation processing method that can improve the accuracy and comprehensiveness of annotation descriptions and enhance the vehicle's ability to accurately identify its surrounding environment.

[0069] The technical solution of this application is described in detail below with reference to the accompanying drawings.

[0070] Figure 1 is a schematic diagram of the first process of the annotation processing method shown in this application.

[0071] Referring to Figure 1, the method includes:

[0072] S101, extract information from the image using at least two preset recognition models to obtain the first descriptive information of the image.

[0073] This step may include:

[0074] The image is segmented using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0075] The instance detection model is used to detect instances in the image to obtain information about objects with instance concepts in the image;

[0076] The information of objects in the image that do not have the concept of instance is combined with the information of objects in the image that have the concept of instance to obtain the first descriptive information of the image.

[0077] S102, by performing a first processing on the image, a first processed image is obtained.

[0078] This step may include: overlaying the 2D bounding box of the object with instance concept onto the image to obtain the first processed image.

[0079] S103, input the first description information of the image and the first processed image into the preset visual language model to obtain the first annotation information of the image.

[0080] Large visual language models typically consist of image recognition and natural language processing components. Using deep learning techniques, they combine image and text information into a model capable of understanding and generating the relationships between images and text. In this step, the large visual language model receives the input information, analyzes and processes it, and outputs the first annotation information of the image.

[0081] As can be seen from this embodiment, this application integrates multiple traditional deep visual neural networks into the annotation processing flow. It utilizes at least two preset recognition models to extract fine-grained information from images, and then analyzes and processes this extracted information through a visual language model. By combining image and text information, the visual language model generates relevant image annotation information by analyzing image content and context, thereby improving the accuracy and comprehensiveness of annotation descriptions and enhancing the vehicle's ability to accurately identify its surrounding environment.
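
As a minimal sketch of the first processing in S102, the following Python draws the outline of each 2D bounding box onto a copy of the image. The image representation (rows of RGB tuples), the name overlay_boxes, the pixel-coordinate box format, and the outline color are assumptions made for illustration; the application does not specify how the overlay is rendered.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels, inclusive
Pixel = Tuple[int, int, int]     # (R, G, B)
Image = List[List[Pixel]]        # rows of pixels


def overlay_boxes(image: Image, boxes: List[Box], color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy of `image` with a 1-pixel outline drawn for each box."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(row) for row in image]  # copy so the input image is untouched
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp the box to the image so out-of-range detections do not crash.
        x0, x1 = max(0, x_min), min(width - 1, x_max)
        y0, y1 = max(0, y_min), min(height - 1, y_max)
        if x0 > x1 or y0 > y1:
            continue  # box lies completely outside the image
        for x in range(x0, x1 + 1):  # top and bottom edges
            out[y0][x] = color
            out[y1][x] = color
        for y in range(y0, y1 + 1):  # left and right edges
            out[y][x0] = color
            out[y][x1] = color
    return out
```

Drawing on a copy keeps the original image available to the recognition models, while the visual language model receives the version with the boxes made visible.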

[0082] Figure 2 is a schematic diagram of the second process of the annotation processing method shown in this application.

[0083] Referring to Figure 2, the method includes:

[0084] S201, extract information from the image using at least two preset recognition models to obtain the first descriptive information of the image.

[0085] This step may include:

[0086] The image is segmented using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0087] The instance detection model is used to detect instances in the image to obtain information about objects with instance concepts in the image;

[0088] Traffic sign detection is performed on the image using a road surface sign detection model to obtain road surface traffic sign information in the image;

[0089] The first descriptive information of the image is obtained by combining information about objects in the image that do not have the concept of instance, information about objects in the image that have the concept of instance, and road surface traffic sign information in the image.

[0090] The instance detection model is used to detect instances in the image, obtaining information about objects with instance concepts in the image, which may include:

[0091] Traffic light devices in an image are detected using an instance detection model, and traffic light information within the traffic light devices is obtained using a traffic light detection model; and / or

[0092] The instance detection model detects traffic scene objects containing text in the image, and the natural scene text recognition model is used to identify the text information in the traffic scene objects.

[0093] S202, by performing a first processing on the image, a first processed image is obtained.

[0094] This step may include: overlaying the 2D bounding box of the object with instance concept onto the image to obtain the first processed image.

[0095] S203, input the first description information of the image and the first processed image into the preset visual language model to obtain the first annotation information of the image.

[0096] Large visual language models typically consist of image recognition and natural language processing components. Using deep learning techniques, they combine image and text information into a model capable of understanding and generating the relationships between images and text. In this step, the large visual language model receives the input information, analyzes and processes it, and outputs the first annotation information of the image.

[0097] S204, input the first description information and the first annotation information of the image into the preset language model to obtain the second annotation information of the image.

[0098] This step may also include: inputting the first description information of the image, the first annotation information of the image, and the obtained description requirement information into a preset language model to obtain the second annotation information of the image.

[0099] Large language models can, to some extent, simulate the human language cognition and generation process, generating natural language text or understanding the meaning of language text. They learn and simulate the complex rules of human language, achieving near-human-level text generation capabilities. In this step, after acquiring the input information, the large language model analyzes and processes it, outputting the second annotation information of the image.

[0100] As can be seen from this embodiment, after obtaining the first annotation information of the image, this application can further input the first description information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image. In other words, this application further utilizes the high-level understanding and generalization capabilities of a large language model to simulate the human language cognition and generation process, ultimately achieving pixel-level description annotation of autonomous driving scene images; thereby further improving the accuracy and comprehensiveness of the annotation description, and further improving the vehicle's accurate recognition capability of the surrounding environment.
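
The flow of S201 to S204 can be sketched in Python as a chain of pluggable callables. This is only an illustrative outline: the function annotate_image, its parameter names, and the way the description is joined into text are hypothetical, and each callable stands in for a preset model whose interface the application does not define.

```python
from typing import Callable, Dict, List, Optional


def annotate_image(
    image: object,
    recognizers: Dict[str, Callable[[object], List[str]]],
    first_processing: Callable[[object], object],
    vision_language_model: Callable[[str, object], str],
    language_model: Optional[Callable[[str], str]] = None,
    description_requirements: str = "",
) -> Dict[str, str]:
    """Chain S201 to S204; every callable is a hypothetical stand-in for a preset model."""
    if len(recognizers) < 2:
        raise ValueError("at least two recognition models are expected")

    # S201: run each recognition model and combine the outputs into one text.
    fragments: List[str] = []
    for name, recognize in recognizers.items():
        fragments.extend(f"{name}: {item}" for item in recognize(image))
    first_description = "\n".join(fragments)

    # S202: first processing, for example overlaying 2D bounding boxes.
    processed_image = first_processing(image)

    # S203: the visual language model sees both the text and the processed image.
    first_annotation = vision_language_model(first_description, processed_image)
    result = {
        "first_description": first_description,
        "first_annotation": first_annotation,
    }

    # S204 (optional): refine with a language model, adding any description requirements.
    if language_model is not None:
        parts = [first_description, first_annotation, description_requirements]
        result["second_annotation"] = language_model("\n\n".join(p for p in parts if p))
    return result
```

Passing the models in as callables keeps the sketch independent of any particular model framework; omitting language_model reproduces the shorter flow of Figure 1.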

[0101] Figure 3 is a schematic diagram of the third process of the annotation processing method shown in this application. Reference may also be made to Figures 4 and 5: Figure 4 is a schematic diagram of the application framework corresponding to Figure 3 of this application, and Figure 5 is a schematic diagram of the application process of the annotation processing method shown in this application.

[0102] This application integrates various traditional deep visual neural networks into the annotation process, utilizing different types of related models, such as semantic segmentation models, instance detection models (object detection models), road marking detection models, and natural scene text recognition models, to extract fine-grained information from images. This extracted information is then input into a large visual language model as visual and textual prompts, and the high-level understanding and generalization capabilities of the large language model are further leveraged, ultimately achieving pixel-level descriptive annotation of autonomous driving scene images.

[0103] Referring to Figure 3, the method includes:

[0104] S301, the image is segmented using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0105] After acquiring images from vehicle sensors, this application can perform semantic segmentation on the images using a semantic segmentation model. Vehicle sensors may include, but are not limited to, cameras, LiDAR, etc.

[0106] Image semantic segmentation is an important technique in computer vision. Its goal is to segment an image into regions with different semantic information and label each region with a corresponding semantic label. Semantic segmentation can assign a specific category to each pixel in an image. For example, it can analyze objects in an image or video stream and label their category pixel by pixel. Semantic segmentation is widely used in fields such as autonomous driving.

[0107] Semantic segmentation models can be pre-trained using relevant deep neural network training methods, and this application does not limit this. These deep neural network training methods may include: methods for training a single network using fully labeled data, methods for training multiple networks based on the data labeling method, and methods for continuously adjusting the training categories using incremental learning, etc.

[0108] In this step, the semantic segmentation model can identify objects in the image that lack instance concepts, such as the sky, grass, and trees. 2D bounding boxes for these objects can be calculated from the segmentation mask. Objects that lack instance concepts are also regarded as uncountable objects.

[0109] Segmentation masking is a technique in computer vision used to precisely separate objects from the background in an image. It achieves fine-grained segmentation of image regions by classifying and labeling each pixel. Each pixel is assigned a label indicating whether it belongs to the foreground, background, or a different object category. This labeling information forms a two-dimensional matrix, the segmentation mask. The mask can accurately describe the location and boundaries of different objects in an image.
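
Calculating a 2D bounding box from a segmentation mask amounts to taking the minimum and maximum row and column of the pixels that carry each label. The sketch below assumes the mask is a two-dimensional list of integer class labels; the function name boxes_from_mask and the output box format are illustrative choices, not taken from the application.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels, inclusive


def boxes_from_mask(mask: List[List[int]]) -> Dict[int, Box]:
    """Return one tight bounding box per label found in a 2D label mask."""
    boxes: Dict[int, Box] = {}
    for y, row in enumerate(mask):
        for x, label in enumerate(row):
            if label not in boxes:
                boxes[label] = (x, y, x, y)  # first pixel seen for this label
            else:
                x0, y0, x1, y1 = boxes[label]
                # Grow the box just enough to include the current pixel.
                boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes
```

Because uncountable objects such as the sky have no instances, one box per label is produced even when the labeled pixels form several disconnected regions.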

[0110] S302, perform instance detection on the image using an instance detection model to obtain information about objects with instance concepts in the image; obtain the 2D bounding boxes of objects with instance concepts.

[0111] In this step, an instance detection model can be used to perform instance detection on the image, filtering out objects that do not fit the driving scenario, and obtaining objects with instance concepts (instance objects) in the image. The 2D bounding boxes of these instance objects are then obtained. These instance objects are all countable objects.

[0112] Instance detection is a computer vision task whose goal is to identify the location and category of each object in a given image. Instance detection algorithms can include feature-based methods and deep learning-based methods, among others.

[0113] The instance detection model used in this step can detect the bounding boxes of various countable objects (instance objects). For cars on the road, for example, it can detect all vehicles and obtain their 2D bounding boxes.

[0114] In this step, the detection of objects with the concept of instance (instance objects) mainly involves detecting common objects in traffic scenes. These objects with the concept of instance (instance objects) include, for example, traffic lights, road signs, traffic signs, and license plate locations.
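
Filtering out detections that do not fit the driving scenario can be illustrated with a simple category filter. The Detection record, the category names, and the confidence threshold below are assumptions for the sake of the example; the application does not state how the filtering is implemented or which threshold, if any, is applied.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Detection:
    category: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                    # detector confidence in [0, 1]


# Hypothetical category names based on the objects mentioned in the text.
DRIVING_CATEGORIES: FrozenSet[str] = frozenset(
    {"car", "traffic light", "road sign", "traffic sign", "license plate"}
)


def keep_driving_objects(
    detections: List[Detection],
    allowed: FrozenSet[str] = DRIVING_CATEGORIES,
    min_score: float = 0.5,  # assumed threshold, not specified in the source
) -> List[Detection]:
    """Keep detections whose category fits the driving scene and whose score is high enough."""
    return [d for d in detections if d.category in allowed and d.score >= min_score]
```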

[0115] S303, detect traffic light devices in the image using the instance detection model, and obtain traffic light information within the traffic light devices using the traffic light detection model.

[0116] This step utilizes the traffic light locations detected by the instance detection model to extract the image content at the corresponding locations, and then uses the traffic light detection model to accurately identify the colors of the traffic lights.

[0117] Traffic light semantic information has a significant impact on vehicle movement. Therefore, traffic light detection is a crucial functional module in autonomous driving systems, and the detection results are vital for vehicle motion planning. By applying a traffic light detection model to traffic light images, the corresponding traffic light category and color can be obtained. This model can consist of a deformable convolutional neural network and an attention-based detection framework, but is not limited to these. The deformable convolutional neural network can be used for feature extraction from traffic light images; the attention-based detection framework can be used to perceive the extracted features, obtaining the traffic light classification information and color.

[0118] S304, the traffic scene objects containing text in the image are detected by the instance detection model, and the text information in the traffic scene objects is identified by the natural scene text recognition model.

[0119] This step utilizes the locations of road signs, traffic signs, and license plates detected by the instance detection model to extract the corresponding image portions containing text information. Then, a natural scene text recognition model is used to perform text recognition and extraction on these image blocks, accurately obtaining the specific text information in the image.
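
S303 and S304 share the same two-stage pattern: take the location reported by the instance detection model, extract the image content at that location, and pass it to a specialized model. A minimal sketch of that pattern follows; the helper names crop and recognize_regions, the (category, box) detection format, and the recognizer callable (standing in for the traffic light detection model or the natural scene text recognition model) are all hypothetical.

```python
from typing import Callable, Collection, List, Sequence, Tuple, TypeVar

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels, inclusive
T = TypeVar("T")
R = TypeVar("R")


def crop(image: Sequence[Sequence[T]], box: Box) -> List[List[T]]:
    """Cut the rectangle `box` out of a row-major image, clamped to the image."""
    x_min, y_min, x_max, y_max = box
    x_min, y_min = max(0, x_min), max(0, y_min)  # avoid negative slice indices
    if x_max < x_min or y_max < y_min:
        return []  # nothing of the box lies inside the image
    return [list(row[x_min:x_max + 1]) for row in image[y_min:y_max + 1]]


def recognize_regions(
    image: Sequence[Sequence[T]],
    detections: Sequence[Tuple[str, Box]],
    target_categories: Collection[str],
    recognizer: Callable[[List[List[T]]], R],
) -> List[Tuple[str, Box, R]]:
    """Run `recognizer` on the cropped patch of every detection in the target categories."""
    results = []
    for category, box in detections:
        if category in target_categories:
            results.append((category, box, recognizer(crop(image, box))))
    return results
```

Keeping the box alongside the recognizer output lets the later combination step report both where the object is and what the second-stage model read from it.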

[0120] Text detection is an active research topic in computer vision. It aims to locate text in natural scene images for further recognition, thereby converting the image content into text information that can be processed by a computer. Text recognition algorithms can include CRNN (Convolutional Recurrent Neural Network) algorithms or natural scene text recognition methods based on Transformer models. CRNN is a neural network model that combines a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) and is suited to recognizing sequences in images. For example, the CRNN algorithm uses a CNN to extract features from an image, slices the features into a feature sequence, inputs the sequence into a bidirectional LSTM (Long Short-Term Memory) network for recognition, and finally uses the CTC (Connectionist Temporal Classification) algorithm to align the recognition results and obtain the final recognition result. CTC is commonly used in speech recognition, text recognition, and other fields to handle input and output sequences that differ in length and are not aligned.
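
To illustrate how CTC reconciles a long per-frame output with a shorter text, the sketch below applies greedy (best-path) CTC decoding: pick the best label for each frame, collapse consecutive repeats, then drop the blank label. Greedy decoding is one common choice and is an assumption here, since the application does not say which decoding is used; the function name, the labels list, and the blank index are likewise illustrative.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,
) -> str:
    """Decode per-frame class scores into text with best-path CTC decoding.

    `labels[i]` is the character for class i; the entry at `blank` is a placeholder.
    """
    # Best class index for each frame (for example each LSTM time step).
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded: List[str] = []
    previous = blank
    for index in best_path:
        # Emit a character only when it is not blank and differs from the previous frame.
        if index != blank and index != previous:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical labels is what allows doubled characters to survive the collapse step: the frame path A, A, blank, A decodes to "AA" rather than "A".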

[0121] S305, perform traffic sign detection on the image using a road marking detection model to obtain road surface traffic sign information in the image.

[0122] This step utilizes a road marking detection model to identify the location and information of traffic markings on the road surface, such as left-turn arrows, right-turn arrows, and left-turn / U-turn signs, to obtain the road surface traffic marking information in the image.

[0123] Road marking detection is a fundamental and core problem that needs to be solved for autonomous driving. Road markings include lines, arrows, and text on the road surface that convey traffic information such as guidance, restrictions, and warnings to road users, playing a vital role in regulating and guiding traffic.

[0124] The road marking detection model can employ traditional classification algorithms or convolutional neural networks to segment road marking images and determine the marking type from the segmented images. Alternatively, the road marking detection model can combine graph networks with attention mechanisms to detect road markings.

[0125] S306, combine the information of objects in the image that do not have the concept of instance, the information of objects in the image that have the concept of instance, and the road surface traffic sign information in the image to obtain the first descriptive information of the image.

[0126] This step can integrate the information obtained through the relevant visual network in the above steps, and combine the information of objects in the image that do not have the concept of instance, the information of objects in the image that have the concept of instance, and the road surface traffic sign information in the image to obtain the first descriptive information of the image. The first descriptive information can be a text prompt.

[0127] The text prompts may include, but are not limited to, the following information:

[0128] 1) Category names of each object;

[0129] 2) Traffic light location and color information;

[0130] 3) The location of the object containing text and the specific text content within it;

[0131] 4) The location and information of road traffic signs;

[0132] 5) Location information of other objects.

[0133] The object position information in this step can be expressed in the coordinates of a normalized 2D bounding box, such as [top left x-coordinate, top left y-coordinate, bottom right x-coordinate, bottom right y-coordinate].
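As an illustration of how such a text prompt might be assembled, the sketch below normalizes pixel boxes by the image width and height and writes one line per object. The record fields (`category`, `box`, `detail`), the function names, and the line format are assumptions chosen for the example; the application does not prescribe a particular prompt layout.

```python
from typing import Dict, List, Sequence


def normalize_box(
    box: Sequence[float], width: int, height: int, digits: int = 3
) -> List[float]:
    """Convert a pixel box [x1, y1, x2, y2] to coordinates in the range 0..1."""
    x1, y1, x2, y2 = box
    return [
        round(x1 / width, digits),
        round(y1 / height, digits),
        round(x2 / width, digits),
        round(y2 / height, digits),
    ]


def build_text_prompt(objects: Sequence[Dict], width: int, height: int) -> str:
    """Write one prompt line per object: category, normalized box, optional detail."""
    lines = []
    for obj in objects:
        box = normalize_box(obj["box"], width, height)
        line = f"{obj['category']} at {box}"
        # 'detail' may hold a traffic light color, recognized text, or marking type.
        if obj.get("detail"):
            line += f": {obj['detail']}"
        lines.append(line)
    return "\n".join(lines)
```

For example, a 1920x1080 image with a red traffic light at pixel box [192, 108, 960, 540] would yield the line `traffic light at [0.1, 0.1, 0.5, 0.5]: red`.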

[0134] S307, the 2D bounding box of the object with instance concept is superimposed on the image to obtain the first processed image.

[0135] In this step, the 2D bounding boxes of the objects detected in step S302 are superimposed onto the original image to obtain the first processed image.
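The superposition can be sketched as drawing each box outline onto a copy of the image. The example below is only an illustration under simple assumptions: a single-channel image stored as a list of pixel rows, pixel boxes [x1, y1, x2, y2] with inclusive corners, and a hypothetical function name `overlay_boxes`. A real implementation would more likely use an image library's rectangle-drawing routine.

```python
from typing import List, Sequence

Image = List[List[int]]  # assumed layout: list of pixel rows, one value per pixel


def overlay_boxes(
    image: Image, boxes: Sequence[Sequence[int]], color: int = 255
) -> Image:
    """Return a copy of the image with each box outline drawn in `color`."""
    out = [row[:] for row in image]  # the original image is left untouched
    height = len(out)
    width = len(out[0]) if height else 0
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the image so out-of-range detections cannot fail.
        x1, x2 = max(0, x1), min(width - 1, x2)
        y1, y2 = max(0, y1), min(height - 1, y2)
        if x1 > x2 or y1 > y2:
            continue  # box lies entirely outside the image
        for x in range(x1, x2 + 1):  # top and bottom edges
            out[y1][x] = color
            out[y2][x] = color
        for y in range(y1, y2 + 1):  # left and right edges
            out[y][x1] = color
            out[y][x2] = color
    return out
```

Only the outline is drawn, so the pixels inside each box remain visible to the model that later receives the processed image.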

[0136] S308, input the first description information of the image and the first processed image into the preset visual language model to obtain the first annotation information of the image.

[0137] This step inputs the first descriptive information (text prompt) of the image and the first processed image into a large visual language model, and requires the model to describe in as much detail as possible the weather, key objects, traffic signs, the location, state, and intentions of other road users, the intentions of the vehicle itself, and the actions it might take. The large visual language model accepts text and images as input and outputs text.

[0138] A Visual Language Model (VLM) is a model that combines image processing and natural language processing. Its main purpose is to understand and interpret the relationship between images and text, and to generate accurate and vivid natural language descriptions based on images. A VLM generates relevant textual descriptions by analyzing image content and context, exhibiting a visual understanding ability closer to that of humans.

[0139] Visual language models typically consist of image recognition and natural language processing components. They utilize deep learning techniques to combine image and text information, thereby constructing a model capable of understanding and generating the relationships between images and text.

[0140] In this step, after the large visual language model obtains the input information, it analyzes and processes the information to output the first annotation information of the image.

[0141] S309, input the first description information of the image, the first annotation information of the image, and the obtained description requirement information into the preset language model to obtain the second annotation information of the image.

[0142] After obtaining the first annotation information of the image, this application can further input the first descriptive information (text prompt), the first annotation information of the image, and the obtained description requirement information into a preset language model. The language model is required to produce a summary description of the image that is as detailed as possible, and to remove grammatical errors, duplicate sentences, and meaningless information. After obtaining this input, the language model analyzes and processes it, and outputs the second annotation information of the image.

[0143] Large Language Models (LLMs) are deep learning models trained on massive amounts of text data, enabling them to generate natural language text or understand the meaning of language text. By training on huge datasets, LLMs can provide in-depth knowledge and language production on various topics. Their core idea is to learn the patterns and structures of natural language through large-scale unsupervised training, simulating human language cognition and generation processes to a certain extent. By learning and simulating the complex rules of human language, they can achieve near-human-level text generation capabilities.

[0144] Large language models can generally only process text, so both input and output are text content. In this step, after the large language model obtains the input information, it analyzes and processes the information to output the second annotation information of the image.

[0145] It should be noted that the bounding boxes obtained in this application are represented by pixel coordinates. Therefore, the description of the final annotation information can be accurate to what objects are in certain pixel areas, thus ultimately achieving pixel-level description annotation of autonomous driving scene images.

[0146] It should also be noted that after performing step S308, this application can also directly use the first annotation information of the image obtained by the visual language model as the final annotation information. The application framework can be seen in Figure 6, which is a schematic diagram of another application framework shown in this application. In Figure 6, only the large visual language model is used, and the large language model is not used.
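The two application frameworks (visual language model only, or visual language model followed by a language model) can be sketched as one function with an optional second stage. This is a hedged outline rather than the actual implementation: `annotate`, the callable signatures for `vlm` and `llm`, and the section labels in the combined text are all hypothetical, and any real model interface would differ.

```python
from typing import Callable, Optional

# Assumed interfaces: the VLM takes (text prompt, processed image) and returns text;
# the LLM takes text and returns text.
VisionLanguageModel = Callable[[str, object], str]
LanguageModel = Callable[[str], str]


def annotate(
    description: str,
    processed_image: object,
    vlm: VisionLanguageModel,
    llm: Optional[LanguageModel] = None,
    requirements: str = "",
) -> str:
    """Return the first annotation, or the second annotation when an LLM is given."""
    first_annotation = vlm(description, processed_image)
    if llm is None:
        # Framework without the language model: the VLM output is final.
        return first_annotation
    parts = [
        f"Description information:\n{description}",
        f"First annotation:\n{first_annotation}",
    ]
    if requirements:
        parts.append(f"Requirements:\n{requirements}")
    # Second stage: the LLM sees only text, never the image itself.
    return llm("\n\n".join(parts))
```

Passing `llm=None` corresponds to using the first annotation information directly as the final annotation, while supplying a language model adds the summarizing and cleanup stage.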

[0147] As this example illustrates, this application combines relevant visual models with large models to achieve text annotation processing for images of driving scenarios. This application extracts various information from images using semantic segmentation models, instance detection models (object detection models), road marking detection models, and natural scene text recognition models. This information is then input into a large visual language model to output preliminary annotation information. Further, the preliminary annotation information, the image's first descriptive information (text prompt), and the description requirements are input into the large language model. Leveraging the high-level understanding and summarization capabilities of the large language model, pixel-level descriptive annotation of autonomous driving scenario images is ultimately achieved. Compared to related technologies, this improves the accuracy and comprehensiveness of the annotation description, thereby enhancing the vehicle's ability to accurately identify its surrounding environment.

[0148] Corresponding to the aforementioned method embodiments, this application also provides an annotation processing device, a server, and corresponding embodiments.

[0149] Figure 7 is a first structural schematic diagram of the annotation processing device shown in this application.

[0150] Referring to Figure 7, this application provides an annotation processing device 70, including: a first information determination module 71, a first image processing module 72, and a first annotation module 73.

[0151] The first information determination module 71 is used to extract information from the image using at least two preset recognition models to obtain the first descriptive information of the image. The first information determination module 71 can segment the image using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts; perform instance detection on the image using an instance detection model to obtain information about objects in the image that have instance concepts; and combine the information about objects in the image that do not have instance concepts with the information about objects in the image that have instance concepts to obtain the first descriptive information of the image.

[0152] The first image processing module 72 is used to obtain a first processed image by performing a first processing on the image. The first image processing module 72 superimposes the 2D bounding boxes of objects with instance concepts onto the image to obtain the first processed image.

[0153] The first annotation module 73 is used to input the first descriptive information of the image and the first processed image into a preset visual language model to obtain the first annotation information of the image. After the large visual language model obtains the input information, it analyzes and processes the information and outputs the first annotation information of the image.

[0154] As can be seen from this embodiment, the device provided in this application integrates multiple traditional deep visual neural networks into the annotation processing flow. It uses at least two preset recognition models to extract fine-grained information from images, and then analyzes and processes this extracted information through a visual language model. It combines image and text information, and the visual language model generates relevant image annotation information by analyzing image content and context. This can improve the accuracy and comprehensiveness of annotation descriptions and enhance the vehicle's ability to accurately identify its surrounding environment.

[0155] Figure 8 is a second structural schematic diagram of the annotation processing device shown in this application.

[0156] Referring to Figure 8, an annotation processing device 70 of this application includes: a first information determination module 71, a first image processing module 72, a first annotation module 73, and a second annotation module 74.

[0157] For the first information determination module 71, the first image processing module 72, and the first annotation module 73, refer to the description given with reference to Figure 7.

[0158] The second annotation module 74 is used to input the first description information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image.

[0159] The second annotation module 74 can also input the first description information of the image, the first annotation information of the image, and the obtained description requirement information into a preset language model to obtain the second annotation information of the image.

[0160] The first information determination module 71 includes: a semantic segmentation submodule 711, an instance detection submodule 712, and a combination processing submodule 713.

[0161] The semantic segmentation submodule 711 is used to segment images using a semantic segmentation model to obtain information about objects in the image that do not have instance concepts.

[0162] The instance detection submodule 712 is used to perform instance detection on the image using an instance detection model to obtain information about objects with instance concepts in the image.

[0163] The combination processing submodule 713 is used to combine the information of objects in the image that do not have the concept of instance with the information of objects in the image that have the concept of instance to obtain the first descriptive information of the image.

[0164] The first information determination module 71 may further include a road marking detection submodule 714.

[0165] The road marking detection submodule 714 is used to detect traffic signs in the image using the road marking detection model, and obtain the road traffic sign information in the image.

[0166] In this case, the combination processing submodule 713 combines the information of objects in the image that do not have the concept of instance, the information of objects in the image that have the concept of instance, and the road surface traffic sign information in the image to obtain the first descriptive information of the image.

[0167] Specifically, the instance detection submodule 712 can detect traffic light devices in the image using the instance detection model, and obtain the traffic light information in the traffic light devices using a traffic light detection model; and / or,

[0168] the instance detection submodule 712 can detect traffic scene objects containing text in the image using the instance detection model, and identify the text information in the traffic scene objects using the natural scene text recognition model.

[0169] In summary, the device provided in this application extracts various information from images through semantic segmentation models, instance detection models (object detection models), road marking detection models, and natural scene text recognition models. This information is then input into a large visual language model to output preliminary annotation information. Furthermore, the preliminary annotation information, the image's first descriptive information (textual prompts), and descriptive requirements are input into the large language model. Leveraging the high-level understanding and summarization capabilities of the large language model, pixel-level descriptive annotation of autonomous driving scene images is ultimately achieved. Compared with related technologies, this improves the accuracy and comprehensiveness of the annotation description, thereby enhancing the vehicle's ability to accurately identify its surrounding environment.

[0170] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated further here.

[0171] Figure 9 is a schematic diagram of the server structure shown in this application.

[0172] Referring to Figure 9, server 1000 includes memory 1010 and processor 1020.

[0173] The processor 1020 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0174] Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage devices. ROM may store static data or instructions required by processor 1020 or other modules of the computer. A permanent storage device may be a read-write storage device, and may be a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). System memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. System memory may store some or all of the instructions and data required by the processor during operation. Furthermore, memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic disks and / or optical disks may also be used. In some embodiments, memory 1010 may include a readable and / or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a high-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not include carrier waves or transitory electronic signals transmitted wirelessly or over wired connections.

[0175] The memory 1010 stores executable code which, when executed by the processor 1020, causes the processor 1020 to perform part or all of the methods described above.

[0176] Furthermore, the method according to this application can also be implemented as a computer program or computer program product, which includes computer program code instructions for performing some or all of the steps in the method described above.

[0177] Alternatively, this application may be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) storing executable code (or computer program or computer instruction code) thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the steps of the methods described above according to this application.

[0178] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. An annotation processing method, characterized by comprising: extracting information from an image using at least two preset recognition models to obtain first description information of the image; performing first processing on the image to obtain a first processed image; and inputting the first description information of the image and the first processed image into a preset visual language model to obtain first annotation information of the image.

2. The method according to claim 1, characterized in that the method further comprises: inputting the first description information and the first annotation information of the image into a preset language model to obtain second annotation information of the image.

3. The method according to claim 1, characterized in that the extracting information from the image using at least two preset recognition models to obtain the first description information of the image comprises: segmenting the image using a semantic segmentation model to obtain information about objects in the image that do not have an instance concept; performing instance detection on the image using an instance detection model to obtain information about objects in the image that have an instance concept; and combining the information about the objects that do not have an instance concept with the information about the objects that have an instance concept to obtain the first description information of the image.

4. The method according to claim 1, characterized in that the extracting information from the image using at least two preset recognition models to obtain the first description information of the image comprises: segmenting the image using a semantic segmentation model to obtain information about objects in the image that do not have an instance concept; performing instance detection on the image using an instance detection model to obtain information about objects in the image that have an instance concept; performing traffic sign detection on the image using a road surface sign detection model to obtain road surface traffic sign information in the image; and combining the information about the objects that do not have an instance concept, the information about the objects that have an instance concept, and the road surface traffic sign information in the image to obtain the first description information of the image.

5. The method according to claim 3, characterized in that the performing instance detection on the image using an instance detection model to obtain information about objects in the image that have an instance concept comprises: detecting traffic light devices in the image using the instance detection model, and obtaining traffic light information in the traffic light devices using a traffic light detection model; and / or detecting traffic scene objects containing text in the image using the instance detection model, and identifying the text information in the traffic scene objects using a natural scene text recognition model.

6. The method according to any one of claims 3 to 5, characterized in that the performing first processing on the image to obtain a first processed image comprises: superimposing the 2D bounding boxes of the objects with instance concepts onto the image to obtain the first processed image.

7. The method according to any one of claims 2 to 5, characterized in that the inputting the first description information and the first annotation information of the image into a preset language model to obtain the second annotation information of the image comprises: inputting the first description information of the image, the first annotation information of the image, and obtained description requirement information into the preset language model to obtain the second annotation information of the image.

8. An annotation processing device, characterized by comprising: a first information determination module, used to extract information from an image using at least two preset recognition models to obtain first description information of the image; a first image processing module, used to perform first processing on the image to obtain a first processed image; and a first annotation module, used to input the first description information of the image and the first processed image into a preset visual language model to obtain first annotation information of the image.

9. The apparatus according to claim 8, characterized in that the apparatus further comprises: a second annotation module, used to input the first description information and the first annotation information of the image into a preset language model to obtain second annotation information of the image.

10. The apparatus according to claim 8, characterized in that the first information determination module comprises: a semantic segmentation submodule, used to segment the image using a semantic segmentation model to obtain information about objects in the image that do not have an instance concept; an instance detection submodule, used to perform instance detection on the image using an instance detection model to obtain information about objects in the image that have an instance concept; and a combination processing submodule, used to combine the information about the objects that do not have an instance concept with the information about the objects that have an instance concept to obtain the first description information of the image.

11. The apparatus according to claim 8, characterized in that the first information determination module comprises: a semantic segmentation submodule, used to segment the image using a semantic segmentation model to obtain information about objects in the image that do not have an instance concept; an instance detection submodule, used to perform instance detection on the image using an instance detection model to obtain information about objects in the image that have an instance concept; a road marking detection submodule, used to detect traffic signs in the image using a road marking detection model to obtain road surface traffic sign information in the image; and a combination processing submodule, used to combine the information about the objects that do not have an instance concept, the information about the objects that have an instance concept, and the road surface traffic sign information in the image to obtain the first description information of the image.

12. The apparatus according to claim 10 or 11, characterized in that: the instance detection submodule detects traffic light devices in the image using the instance detection model, and obtains traffic light information in the traffic light devices using a traffic light detection model; and / or the instance detection submodule detects traffic scene objects containing text in the image using the instance detection model, and identifies the text information in the traffic scene objects using a natural scene text recognition model.

13. A server, characterized by comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method according to any one of claims 1-7.

14. A computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the method as described in any one of claims 1-7.