An entity segmentation method, device, equipment and medium based on dynamic programming

By combining the model capabilities of CLIP and SAM, and using dynamic programming algorithms to generate target semantic mask information, the problems of high annotation costs and complex training processes in existing technologies are solved, achieving efficient and accurate image segmentation and recognition.

CN120451561B · Active · Publication Date: 2026-06-30 · HANGZHOU SHIQU INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU SHIQU INFORMATION TECH CO LTD
Filing Date
2025-05-21
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing CLIP and SAM models have limitations in image segmentation and recognition, with high annotation costs, complex training processes, and poor performance in vertical fields.

Method used

The zero-shot recognition capability of CLIP is combined with the segmentation capability of SAM. A target model is trained on a preset dataset and used to generate a mask set containing semantic information; a dynamic programming algorithm then determines the target semantic mask information used for image segmentation.

Benefits of technology

It reduces annotation costs, simplifies the training process, and improves the accuracy and efficiency of image segmentation, especially in vertical fields where it can more accurately identify and segment specific objects, thus improving the quality of segmentation results.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a dynamic programming-based entity segmentation method, apparatus, device, and medium, relating to the field of computer vision. The method includes: inputting text information from an acquired preset dataset into a preset image recognition model to obtain corresponding vector information; training the preset entity segmentation model based on the vector information to obtain a target model; performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set; acquiring a target query request; generating a second mask set corresponding to the target query request based on the target model; using a preset dynamic programming algorithm to determine target mask information corresponding to the second mask set from the first mask set; determining target semantic mask information based on the target mask information and the second semantic mask information; segmenting target entity images from the target images using the target semantic mask information; and returning the target entity images to the user terminal to respond to the target query request.

Description

Technical Field

[0001] This application relates to the field of computer vision, and in particular to an entity segmentation method, apparatus, device, and medium based on dynamic programming.

Background Technology

[0002] With the rapid development of artificial intelligence technology, interactive applications such as augmented reality, virtual reality, and image and video editing are experiencing explosive growth. In these applications, accurate segmentation and recognition of specific objects in images has become a critical requirement. In the field of computer vision, CLIP (Contrastive Language-Image Pre-Training, a multimodal pre-trained model based on contrastive learning) and SAM (Segment Anything Model, a pre-trained model for image segmentation) are two highly influential models. CLIP can recognize objects in images without having seen samples of the specific category, while SAM can perform interactive segmentation in multiple domains. However, each has limitations: CLIP is relatively weak at segmentation, while SAM, despite its strong segmentation capability, cannot reliably identify the specific category of the object it segments.

[0003] Several related research results have been achieved. Examples include the Semantic-SAM general image segmentation model, the Open-Vocabulary SAM model based on the integration of SAM and CLIP, and the TAP (Tokenize Anything via Prompting, a unified prompting visual foundation model) model based on the SAM architecture. However, these research results still face many challenges. Labeling costs remain high, training processes are complex and cumbersome, and performance in vertical domains is unsatisfactory. For instance, in image segmentation tasks within specific industries, the need for large amounts of accurately labeled data and complex training processes leads to high application costs and unsatisfactory results.

Summary of the Invention

[0004] In view of this, the purpose of this application is to provide a dynamic programming-based entity segmentation method, apparatus, device, and medium that organically combines the zero-shot recognition capability of CLIP with the segmentation capability of SAM to achieve more efficient and accurate interactive segmentation and recognition of open vocabulary. The specific scheme is as follows:

[0005] Firstly, this application provides an entity segmentation method based on dynamic programming, including:

[0006] A preset dataset is obtained, and the text information in the preset dataset is input into a preset image recognition model to obtain vector information corresponding to the text information. The preset entity segmentation model is trained based on the vector information to obtain a target model. The preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object.

[0007] Based on the preset entity segmentation model, entity segmentation is performed on each target image in the preset dataset to obtain a corresponding first mask set; the first mask set includes mask information that does not contain semantic information obtained after performing entity segmentation on the target image;

[0008] The system obtains a target query request input from the user terminal, and generates a second mask set corresponding to the second target object in the target image that corresponds to the target query request based on the target model; the second mask set includes second semantic mask information containing the semantic information of the second target object;

[0009] Using a preset dynamic programming algorithm, target mask information corresponding to the second semantic mask information in the second mask set is determined from the first mask set. Target semantic mask information is then determined based on the target mask information and the second semantic mask information. Target entity images are then segmented from the target image using the target semantic mask information. The target entity images are then returned to the user terminal to respond to the target query request.

[0010] Optionally, training a preset entity segmentation model based on the vector information to obtain a target model includes:

[0011] The vector information is fused with the encoder in the preset entity segmentation model to obtain a text encoder, so that the preset entity segmentation model can recognize the acquired text information based on the fused text encoder.

[0012] The target model is obtained by training a preset entity segmentation model including the text encoder based on the target images in the preset dataset, the text information corresponding to each target image, and the first semantic mask information containing the semantic information of the first target object; wherein the first target object is an object with semantic information in the preset dataset.

[0013] Optionally, the step of performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set includes:

[0014] Based on the preset entity segmentation model, entity segmentation is performed on each target image in the preset dataset to obtain a third target object, and mask information without semantic information corresponding to each third target object is determined to obtain a first mask set; wherein, the third target object includes objects with semantic information and objects without semantic information.

[0015] Optionally, generating a second mask set corresponding to the second target object in the target image corresponding to the target query request based on the target model includes:

[0016] The target model obtains the target query request, so that the target model determines the second target object and the corresponding target region corresponding to each target query request from the target image, and generates second semantic mask information containing the semantic information of the second target object, so as to obtain the second mask set corresponding to each target query request.

[0017] Optionally, determining the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm includes:

[0018] Determine the second semantic mask information corresponding to the target query request, and determine the first object region corresponding to the second semantic mask information;

[0019] Select a mask from the mask information that has not been selected in the current first mask set as the current mask information;

[0020] The second object region corresponding to the current mask information is determined, and the second object region corresponding to the current mask information is merged with all merged object regions corresponding to existing mask information in the target mask information combination to obtain the current merged object region; wherein, the initial target mask information combination is empty;

[0021] Calculate the current intersection over union (IoU) between the first object region and the current merged object region. If the current IoU is not zero and is greater than the target IoU between the first object region and the merged object region corresponding to all existing mask information in the target mask information combination, add the current mask information to the target mask information combination.

[0022] Jump to the step of selecting a mask from the mask information that has not been selected in the current first mask set as the current mask information, until the first mask set has been traversed, and use the mask information in the target mask information combination as the target mask information corresponding to the second semantic mask information in the second mask set corresponding to the target query request.

[0023] Optionally, determining the target semantic mask information based on the target mask information and the second semantic mask information includes:

[0024] Using the semantic information corresponding to the second semantic mask information, the mask information in the target mask information that does not include the semantic information is removed to obtain new target mask information. The semantic information corresponding to the second semantic mask information is then fused with the new target mask information to obtain target semantic mask information that includes the semantic information.
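
The fusion step above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation disclosed in this application: regions are represented as sets of (row, col) pixels, a mask is assumed to "not include the semantic information" when it has no overlap with the region of the second semantic mask, and the names fuse_semantics and cut_out are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]        # (row, col)
Region = FrozenSet[Pixel]      # pixels covered by one mask


def fuse_semantics(target_masks: List[Region], semantic_region: Region,
                   label: str) -> Dict[str, object]:
    """Drop masks unrelated to the labelled region, merge the rest, attach the label."""
    # Assumption: a mask with no overlap with the semantic region carries none of its semantics.
    kept = [mask for mask in target_masks if mask & semantic_region]
    merged: Region = frozenset().union(*kept)
    return {"label": label, "region": merged}


def cut_out(image: List[List[int]], region: Region, background: int = 0) -> List[List[int]]:
    """Keep the pixels inside the region and replace everything else with background."""
    return [[value if (r, c) in region else background
             for c, value in enumerate(row)]
            for r, row in enumerate(image)]


image = [[10, 20, 30],
         [40, 50, 60]]
target_masks = [frozenset({(0, 0), (0, 1)}), frozenset({(1, 2)})]
semantic_region = frozenset({(0, 0), (0, 1), (1, 0)})

result = fuse_semantics(target_masks, semantic_region, "hat")
print(result["label"], cut_out(image, result["region"]))
# hat [[10, 20, 0], [0, 0, 0]]
```

The fine-grained boundary comes from the masks without semantic information, while the label comes from the coarse semantic mask; the overlap test is only one possible way to decide which masks to discard.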

[0025] Optionally, before determining the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm, the method further includes:

[0026] The mask information in the first mask set and the second mask set is downsampled based on a preset step size to obtain the preprocessed first mask set and the second mask set; the preset step size includes a preset horizontal step size and a preset vertical step size.

[0027] Secondly, this application provides an entity segmentation device based on dynamic programming, comprising:

[0028] The model determination module is used to acquire a preset dataset, input the text information in the preset dataset into a preset image recognition model to obtain vector information corresponding to the text information, and train a preset entity segmentation model based on the vector information to obtain a target model; the preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object.

[0029] The first mask set acquisition module is used to perform entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set; the first mask set includes mask information that does not contain semantic information obtained after performing entity segmentation on the target image;

[0030] The second mask set acquisition module is used to acquire the target query request input by the user terminal, and generate a second mask set corresponding to the second target object in the target image corresponding to the target query request based on the target model; the second mask set includes second semantic mask information containing the semantic information of the second target object;

[0031] The image segmentation module is used to determine target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm, and to determine target semantic mask information based on the target mask information and the second semantic mask information, so as to segment the target entity image from the target image using the target semantic mask information, and return the target entity image to the user terminal to respond to the target query request.

[0032] Thirdly, this application provides an electronic device, comprising:

[0033] A memory, used to store a computer program;

[0034] A processor, used to execute the computer program to implement the aforementioned dynamic programming-based entity segmentation method.

[0035] Fourthly, this application provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the aforementioned entity segmentation method based on dynamic programming.

[0036] In this application, a preset dataset is obtained, and the text information in the preset dataset is input into a preset image recognition model to obtain vector information corresponding to the text information. A preset entity segmentation model is trained based on the vector information to obtain a target model. The preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object. Entity segmentation is performed on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set. The first mask set includes mask information that does not contain semantic information obtained after entity segmentation of the target image. A target query request input by the user terminal is obtained, and a second mask set corresponding to a second target object in the target image that corresponds to the target query request is generated based on the target model. The second mask set includes second semantic mask information containing semantic information of the second target object. A preset dynamic programming algorithm is used to determine target mask information corresponding to the second semantic mask information in the second mask set from the first mask set. Target semantic mask information is then determined based on the target mask information and the second semantic mask information. The target semantic mask information is used to segment the target entity image from the target image, and the target entity image is returned to the user terminal to respond to the target query request.
As can be seen from the above, on the one hand, this application selects a preset dataset, obtains vector information corresponding to the text information in the preset dataset based on a preset image recognition model, and uses the vector information to train a preset entity segmentation model to obtain the target model. This requires only a small amount of preset data, reducing dependence on large-scale labeled data and lowering labeling costs. Furthermore, the training process is simple, requiring no complex schemes or the introduction of large-scale new datasets, thus reducing training costs and improving training efficiency. On the other hand, a first mask set without semantic information corresponding to the target images in the preset dataset is obtained using the preset entity segmentation model. A second mask set containing semantic information corresponding to the target query request input by the user is obtained from the target image using the target model. A preset dynamic programming algorithm is then used to determine the target mask information corresponding to the second mask set from the first mask set. Finally, based on the target mask information and the second semantic mask information in the second mask set, the target semantic mask information is determined, which is used to segment the target entity image from the target image. In this process, the target model provides the semantic information of the image, while the preset entity segmentation model provides fine-grained mask information. Dynamic programming bridges the results of the target model and the preset entity segmentation model. Combining the two results ensures that the final segmentation result, i.e., the target semantic mask information, has both precise boundaries and semantic labels. In vertical domains, the method can more accurately identify and segment specific objects, meeting the needs of scenarios with high requirements for segmentation results and improving performance in vertical domains.

Attached Figure Description

[0037] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0038] Figure 1 is a flowchart of an entity segmentation method based on dynamic programming disclosed in this application;

[0039] Figure 2 is a schematic diagram of the contents of a specific publicly available LIP dataset;

[0040] Figure 3 is a schematic diagram of an image obtained after entity segmentation using the SAM model disclosed in this application;

[0041] Figure 4 is a schematic diagram of a specific dynamic programming algorithm disclosed in this application;

[0042] Figure 5 is a schematic diagram of a specific entity segmentation method based on dynamic programming disclosed in this application;

[0043] Figure 6 is a schematic diagram of a dynamic programming-based entity segmentation device disclosed in this application;

[0044] Figure 7 is a schematic diagram of the structure of an electronic device disclosed in this application.

Detailed Implementation

[0045] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0046] Current general image segmentation models such as Semantic-SAM, Open-Vocabulary SAM models based on the integration of SAM and CLIP, and TAP models based on the SAM architecture still have many problems, such as high annotation costs, complex and cumbersome training processes, and unsatisfactory performance in vertical domains. To address these issues, this application proposes a dynamic programming-based entity segmentation method that organically combines the zero-shot recognition capability of CLIP with the segmentation capability of SAM, achieving more efficient and accurate interactive segmentation and recognition of open vocabulary.

[0047] As shown in Figure 1, this application discloses an entity segmentation method based on dynamic programming, including:

[0048] Step S11: Obtain a preset dataset and input the text information in the preset dataset into a preset image recognition model to obtain vector information corresponding to the text information. Train the preset entity segmentation model based on the vector information to obtain the target model. The preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of the first target object.

[0049] In this embodiment, a small number of target images (e.g., several thousand) can be selected to form the preset dataset. For example, target images can be selected from the LIP (Look into Person, a large-scale dataset) dataset. The LIP dataset includes images of various categories, together with text information corresponding to the images and mask information containing semantic information of the target objects. As shown in Figure 2, mask information is used to represent the region of a target object in an image. The categories of mask information include, but are not limited to, hat, hair, gloves, sunglasses, upper-clothes, dress, coat, socks, pants, jumpsuits, scarf, skirt, face, left-arm, right-arm, left-leg, right-leg, left-shoe, and right-shoe.

[0050] Then, the text information from the preset dataset is input into a preset image recognition model to obtain vector information corresponding to the text information. Based on the vector information, the preset entity segmentation model is trained to obtain the target model. Specifically, firstly, the vector information can be fused with the encoder in the preset entity segmentation model to obtain a text encoder, so that the preset entity segmentation model can recognize the acquired text information based on its fused text encoder. Then, based on the target images in the preset dataset, the text information corresponding to each target image, and the first semantic mask information containing the semantic information of the first target object, the preset entity segmentation model including the text encoder is trained to obtain the target model. Here, the first target object is an object with semantic information in the preset dataset.

[0051] For example, the preset image recognition model can be the CLIP model, and the preset entity segmentation model can be the SAM model. The CLIP model processes the text information in the preset dataset to convert it into vector form. Because the CLIP model has strong cross-modal understanding capabilities, it can map text into a feature space, representing the semantic information of the text as vectors. The obtained text vector information can then be fused with the encoder of the SAM model to obtain a text encoder. This fusion is similar to the way the vectors obtained by the SAM point encoder and bounding box encoder are fused. Fusing the text vectors with the SAM encoder aims to enable the SAM model to segment images by incorporating text semantic information. Furthermore, based on the target images in the preset dataset, the text information corresponding to each target image, and the first semantic mask information containing the semantic information of the first target object, the SAM model including the text encoder is trained to obtain the target model, namely the text-guided SAM model.

[0052] Step S12: Perform entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set; the first mask set includes mask information that does not contain semantic information obtained after performing entity segmentation on the target image.

[0053] In this embodiment, performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set may include: performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a third target object, and determining mask information corresponding to each third target object that does not contain semantic information to obtain a first mask set; wherein, the third target object includes objects with semantic information and objects without semantic information.

[0054] For example, the SAM model can be used to segment entities in each target image in the preset dataset to obtain the corresponding third target objects. The number of third target objects is no less than the number of first target objects in step S11. Furthermore, since the amount of text information in the preset dataset is limited, the target images in the dataset contain entity objects both with and without semantic information. The SAM model is used to segment each entity object in the target image so that each object is distinguished as an independent entity, determining the specific location and contour of each object in the image. The SAM model provides relatively accurate entity segmentation, but the mask information corresponding to the segmented third target objects does not contain semantic information, such as clothing categories. For example, the image shown in Figure 3 is the result of entity segmentation using the SAM model. Masks in different colors represent different entities, but carry no specific natural-language semantics.

[0055] Step S13: Obtain the target query request input by the user terminal, and generate a second mask set corresponding to the second target object in the target image corresponding to the target query request based on the target model; the second mask set includes second semantic mask information containing the semantic information of the second target object.

[0056] In this embodiment, the target query request input by the user (i.e., the text information in the preset dataset) is obtained. Then, based on the target model, a second mask set corresponding to the second target object in the target image corresponding to the target query request is generated. Specifically, the target model can obtain the target query request, determine from the target image the second target object and the corresponding target region for each target query request, and generate second semantic mask information containing the semantic information of the second target object, so as to obtain the second mask set corresponding to each target query request.

[0057] For example, a text-guided SAM model can be used to generate a second mask set corresponding to the second target object in the target image that matches the target query request. It should be noted that the text-guided SAM model allows users to input text information from the preset dataset, and the model then outputs the corresponding mask information to achieve image segmentation based on text prompts. However, because the amount of text information in the preset dataset is limited, the features and patterns learned by the SAM model during training are not rich enough. As a result, the mask information output by the trained text-guided SAM model is coarse, with imprecise segmentation boundaries and poor detail representation, and fails to meet the needs of applications requiring high-precision segmentation results.

[0058] Step S14: Using a preset dynamic programming algorithm, determine the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set, and determine the target semantic mask information based on the target mask information and the second semantic mask information, so as to segment the target entity image from the target image using the target semantic mask information, and return the target entity image to the user terminal to respond to the target query request.

[0059] In this embodiment, determining, using a preset dynamic programming algorithm, the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set can include the following. First, determine the second semantic mask information corresponding to the target query request, and determine the first object region corresponding to the second semantic mask information. Then, select one piece of mask information from the mask information in the current first mask set that has not yet been selected as the current mask information. Determine the second object region corresponding to the current mask information, and merge it with the merged object region corresponding to all existing mask information in the target mask information combination to obtain the current merged object region; the target mask information combination is initially empty. Calculate the current intersection over union (IoU) between the first object region and the current merged object region. If the current IoU is not zero and is greater than the target IoU between the first object region and the merged object region corresponding to all existing mask information in the target mask information combination, add the current mask information to the target mask information combination. Return to the selection step and repeat until the first mask set has been traversed, then use the mask information in the target mask information combination as the target mask information corresponding to the second semantic mask information in the second mask set corresponding to the target query request.
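
The traversal described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation disclosed in this application: each object region is represented as a set of (row, col) pixels, and the names iou and select_masks are hypothetical.

```python
from typing import AbstractSet, FrozenSet, List, Set, Tuple

Pixel = Tuple[int, int]        # (row, col)
Region = FrozenSet[Pixel]      # pixels covered by one mask


def iou(a: AbstractSet[Pixel], b: AbstractSet[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_masks(first_region: Region, first_mask_set: List[Region]) -> List[int]:
    """Return the indices of the masks added to the target mask combination."""
    selected: List[int] = []     # target mask information combination, initially empty
    merged: Set[Pixel] = set()   # merged object region of the masks selected so far
    target_iou = 0.0             # IoU between first_region and the merged region
    for index, second_region in enumerate(first_mask_set):
        current_merged = merged | second_region
        current_iou = iou(first_region, current_merged)
        # Add the mask only if the IoU is non-zero and improves on the existing combination.
        if current_iou != 0 and current_iou > target_iou:
            selected.append(index)
            merged = current_merged
            target_iou = current_iou
    return selected


reference = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})   # coarse semantic mask region
candidates = [
    frozenset({(0, 0), (0, 1)}),           # improves the IoU to 0.5
    frozenset({(3, 3)}),                   # would lower the IoU to 0.4, so it is skipped
    frozenset({(1, 0), (1, 1), (2, 0)}),   # improves the IoU to 0.8
]
print(select_masks(reference, candidates))  # [0, 2]
```

Because a mask is kept only when it raises the IoU of the merged region, the result depends on the order in which the first mask set is traversed.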

[0060] Before determining the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm, the process may further include: downsampling the mask information in the first and second mask sets based on a preset step size to obtain preprocessed first and second mask sets; the preset step size includes a preset horizontal step size and a preset vertical step size. This downsampling reduces the amount of mask information, lowers the computational complexity of subsequent processing, and to some extent avoids overfitting, thus improving processing efficiency.
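
As an illustration of this preprocessing, the sketch below downsamples a binary mask by keeping every step_h-th row and every step_w-th column. It assumes masks are stored as nested Python lists of 0 and 1; the function name downsample is hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def downsample(mask: Mask, step_h: int, step_w: int) -> Mask:
    """Keep every step_h-th row (vertical step) and every step_w-th column (horizontal step)."""
    if step_h < 1 or step_w < 1:
        raise ValueError("step sizes must be positive integers")
    return [row[::step_w] for row in mask[::step_h]]


mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
print(downsample(mask, step_h=2, step_w=2))  # [[1, 0], [0, 1]]
```

With both steps equal to 2, the mask shrinks to a quarter of its pixels, so each later IoU computation touches roughly a quarter as many values.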

[0061] It should be noted that different dynamic programming algorithms can be selected based on the actual situation to determine the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set. For example, the dynamic programming algorithm shown in Figure 4 proceeds as follows:

[0062] 1. Define the function for calculating IoU: The function CALCULATEIOU calculates the IoU (Intersection over Union) between mask_a (i.e., a mask from the first mask set) and mask_b (i.e., a mask from the second mask set). It computes the intersection and union of mask_a and mask_b, and then divides the number of pixels in the intersection by the number of pixels in the union to obtain the IoU value.
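
A minimal sketch of such a function is shown below, assuming both masks are equally sized nested lists of 0 and 1. The lowercase name calculate_iou and the choice to return 0.0 for two empty masks are assumptions made for this illustration.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def calculate_iou(mask_a: Mask, mask_b: Mask) -> float:
    """Return (pixels in the intersection) / (pixels in the union) of two binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Two empty masks have an empty union; return 0.0 instead of dividing by zero.
    return intersection / union if union else 0.0


mask_a = [[1, 1, 0],
          [0, 1, 0]]
mask_b = [[0, 1, 1],
          [0, 1, 0]]
print(calculate_iou(mask_a, mask_b))  # 2 shared pixels / 4 covered pixels = 0.5
```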

[0063] 2. Initialization and Preprocessing: Obtain the height h and width w of the reference mask mask_b, and calculate the step size step_h and step_w for sliding on the mask accordingly. The step size is calculated for subsequent mask sampling operations; sample the input mask list masks_a, and generate a new mask list adapted to the size of the reference mask mask_b through the operation [tmp[::step_h, ::step_w] for tmp in masks_a], which is convenient for subsequent calculations; create a zero matrix dp of size (n + 1)*(m + 1) to store intermediate results in the dynamic programming process, where n is the number of masks in masks_a and m is the number of masks to be selected; create a zero matrix mask_all with the same shape as mask_b to temporarily store mask information, and repeat it through the repeat(mask_all, m + 1) operation to adapt to subsequent calculations.

[0064] 3. Dynamic Programming for IoU Calculation: The algorithm iterates through masks in `masks_a` using two nested loops, considering the number of masks to be selected (`m`) and the position `j` (from 1 to `m`) of each selected mask. Specifically, for each current mask `current_mask` (taken from `masks_a`), the IoU value `iou_cur` between the current mask and the reference mask is calculated; the IoU value `iou_top` between the current mask and the selected mask combination (here, the selected mask combination refers to the set of masks to be selected determined before the current loop) is calculated; the current mask is merged with the previous mask in the selected mask combination, and the IoU value `iou_left` between the merged mask and the reference mask is calculated. `iou_cur`, `iou_top`, and `iou_left` are compared, and the maximum value is taken. Based on the maximum value, the dp matrix and the mask information in `mask_all.repeat` are updated. If the maximum value is iou_cur, then store the current mask in mask_all.repeat[i][j]; if the maximum value is iou_top, then keep mask_all.repeat[i][j] as the previous value; if the maximum value is iou_left, then store the merged mask in mask_all.repeat[i][j].

[0065] 4. Selecting the mask index: After the loop ends, by iterating backwards from m to 1, check whether adjacent elements in the dp matrix are different. If they are different, add the corresponding index i - 1 to the selected_indices list. This process determines the index of the final selected mask in the original masks_a based on the result of dynamic programming.

[0066] 5. Output: Based on the indices in the selected_indices list, extract the corresponding masks from the original masks_a to form the selected_masks list. At the same time, record the maximum IoU value max_iou throughout the process. Finally, return selected_masks and max_iou as the output of the algorithm.
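
The five steps above can be approximated by the following sketch. It is a simplified reading of the description and not a reproduction of Figure 4: it assumes the masks in `masks_a` already have the size of the reference mask `mask_b`, represents masks as lists of rows of 0/1 values, and interprets `iou_top` as keeping the result of the previous row and `iou_left` as merging the current mask into the combination built with one fewer selection. The helper names and the `choice` table used for backtracking are hypothetical. Because IoU is not additive, the table yields a heuristic selection rather than a guaranteed optimum.

```python
from typing import List, Tuple

Mask = List[List[int]]  # hypothetical representation: rows of 0/1 values


def union(a: Mask, b: Mask) -> Mask:
    """Pixel-wise OR of two equally sized binary masks."""
    return [[1 if (x or y) else 0 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def iou(a: Mask, b: Mask) -> float:
    """Intersection pixels divided by union pixels (0.0 if the union is empty)."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    uni = sum(1 for x, y in pairs if x or y)
    return inter / uni if uni else 0.0


def dp_select(masks_a: List[Mask], mask_b: Mask, m: int) -> Tuple[List[int], float]:
    """Fill an (n + 1) x (m + 1) table, then backtrack the chosen indices."""
    n = len(masks_a)
    empty = [[0] * len(mask_b[0]) for _ in range(len(mask_b))]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    merged = [[empty] * (m + 1) for _ in range(n + 1)]  # masks are never mutated
    choice = [["top"] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        current = masks_a[i - 1]
        iou_cur = iou(current, mask_b)
        for j in range(1, m + 1):
            left_mask = union(merged[i - 1][j - 1], current)
            options = [
                (dp[i - 1][j], "top", merged[i - 1][j]),     # keep previous value
                (iou(left_mask, mask_b), "left", left_mask),  # merge current mask
                (iou_cur, "cur", current),                    # current mask alone
            ]
            # max() returns the first maximal option, so ties keep the previous value.
            dp[i][j], choice[i][j], merged[i][j] = max(options, key=lambda o: o[0])
    selected_indices: List[int] = []
    i, j = n, m
    while i > 0 and j > 0:
        if choice[i][j] == "top":
            i -= 1
        elif choice[i][j] == "left":
            selected_indices.append(i - 1)
            i, j = i - 1, j - 1
        else:  # "cur": the combination restarts with this mask only
            selected_indices.append(i - 1)
            break
    selected_indices.reverse()
    return selected_indices, dp[n][m]


mask_b = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
masks_a = [
    [[1, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 1]],
    [[0, 0, 0], [1, 1, 0], [0, 0, 0]],
]
selected_indices, max_iou = dp_select(masks_a, mask_b, m=2)
selected_masks = [masks_a[k] for k in selected_indices]
```

Storing the chosen option per cell is one way to recover the selected indices; the description instead compares adjacent entries of the dp matrix, which serves the same purpose.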

[0067] Furthermore, the semantic information corresponding to the second semantic mask information can be used to remove the mask information in the target mask information that does not contain semantic information to obtain new target mask information. The semantic information corresponding to the second semantic mask information can then be fused with the new target mask information to obtain target semantic mask information that contains semantic information.
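
One possible reading of this filtering and fusion step is sketched below. The criterion is an assumption made only for illustration: a selected mask is treated as not containing the semantic information when less than a hypothetical fraction `min_overlap` of its pixels falls inside the semantic region. The function name, the threshold, and the returned dictionary layout are likewise hypothetical.

```python
from typing import Dict, List

Mask = List[List[int]]  # hypothetical representation: rows of 0/1 values


def build_semantic_mask(target_masks: List[Mask], semantic_mask: Mask,
                        label: str, min_overlap: float = 0.5) -> Dict[str, object]:
    """Drop masks that barely overlap the semantic region, merge the rest,
    and attach the semantic label to the merged mask."""
    kept: List[Mask] = []
    for mask in target_masks:
        area = sum(1 for row in mask for value in row if value)
        inside = sum(1 for row_m, row_s in zip(mask, semantic_mask)
                     for x, y in zip(row_m, row_s) if x and y)
        if area and inside / area >= min_overlap:
            kept.append(mask)
    height, width = len(semantic_mask), len(semantic_mask[0])
    fused = [[0] * width for _ in range(height)]
    for mask in kept:
        fused = [[1 if (x or y) else 0 for x, y in zip(row_f, row_m)]
                 for row_f, row_m in zip(fused, mask)]
    return {"label": label, "mask": fused, "kept": len(kept)}


semantic = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
inside_mask = [[1, 1, 0], [0, 0, 0], [0, 0, 0]]
outside_mask = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
result = build_semantic_mask([inside_mask, outside_mask], semantic, "blue shirt")
```

The merged mask keeps the fine boundaries of the entity segmentation, while the label comes from the query-driven semantic mask.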

[0068] Understandably, the first mask set is obtained by performing entity segmentation on the target image using a pre-defined entity segmentation model. The target mask information in this set is typically quite fine at the segmentation boundaries, accurately delineating the object's outline. The second semantic mask information, generated by the target model based on the user's query request, while carrying clear semantics, may lack sufficient accuracy in segmentation boundaries. Combining the two allows the new target mask information to segment the target object more accurately, reducing segmentation errors and improving the quality of the segmentation results. Simultaneously, fusing the semantic information corresponding to the second semantic mask information with the new target mask information yields a target semantic mask information. This target semantic mask information not only accurately delineates the target region but also carries clear semantic labels, enabling the segmentation results to better meet the user's semantic-based query needs. This facilitates subsequent understanding, analysis, and application of the segmented entity image, such as in image retrieval and intelligent monitoring scenarios.

[0069] For example, as shown in Figure 5, the user's target query request is to obtain specific clothing items worn by a person (such as a blue shirt, blue backpack, orange shorts, and black shoes). The text-guided SAM model is used to generate, in the target image, a second mask set corresponding to the second target object in the target query request, shown as the yellow area in Figure 5. The second mask set and the entity segmentation results of the SAM model then enter the "dynamic programming" module, where this information is analyzed and integrated based on the dynamic programming algorithm. According to the algorithm's rules and objective, the optimal segmentation scheme is selected and combined from the numerous entity segmentation results. Then, based on the optimal segmentation scheme, the target entity image is segmented from the target image and returned to the user to respond to the target query request. For example, the purple area in Figure 5 represents the target entity image returned to the user.

[0070] As can be seen from the above, this embodiment selects a preset dataset, obtains the vector information corresponding to the text information in the preset dataset based on a preset image recognition model, and uses the vector information to train a preset entity segmentation model to obtain the target model. Only a small amount of preset data is required, reducing the dependence on large-scale labeled data and lowering the labeling cost. Moreover, the training process is simple, without the need for complex schemes or the introduction of large-scale new datasets, reducing training costs and improving training efficiency. On the other hand, the preset entity segmentation model is used to obtain a first mask set that does not contain semantic information corresponding to the target image in the preset dataset, and the target model is used to obtain a second mask set that contains semantic information corresponding to the target query request input by the user in the target image. A preset dynamic programming algorithm is used to determine the target mask information corresponding to the second mask set from the first mask set. Then, based on the target mask information and the second semantic mask information in the second mask set, the target semantic mask information is determined, so as to segment the target entity image from the target image using the target semantic mask information. The target model acquires image semantic results, providing semantic information, while the preset entity segmentation model provides fine-grained mask information. Dynamic programming bridges the results of the target model and the preset entity segmentation model. The combination of these two approaches results in a final segmentation result—the target semantic mask information—with precise boundaries and semantic labels. 
This enables more accurate identification and segmentation of specific objects in vertical domains, meeting the demands of scenarios with high segmentation performance requirements and improving vertical performance. Simultaneously, it effectively integrates the zero-shot recognition capability of the preset image recognition model and the segmentation capability of the preset entity segmentation model, avoiding the drawbacks of simple fusion and improving the accuracy and efficiency of open-vocabulary interactive segmentation and recognition.

[0071] As shown in Figure 6, this application also discloses an entity segmentation device based on dynamic programming, including:

[0072] The model determination module 11 is used to acquire a preset dataset, input the text information in the preset dataset into a preset image recognition model to obtain vector information corresponding to the text information, and train a preset entity segmentation model based on the vector information to obtain a target model; the preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object.

[0073] The first mask set acquisition module 12 is used to perform entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set; the first mask set includes mask information that does not contain semantic information obtained after performing entity segmentation on the target image;

[0074] The second mask set acquisition module 13 is used to acquire the target query request input by the user terminal, and generate a second mask set corresponding to the second target object in the target image corresponding to the target query request based on the target model; the second mask set includes second semantic mask information containing the semantic information of the second target object;

[0075] The image segmentation module 14 is used to determine target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm, and to determine target semantic mask information based on the target mask information and the second semantic mask information, so as to segment the target entity image from the target image using the target semantic mask information, and return the target entity image to the user terminal to respond to the target query request.
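
The way these modules hand data to one another can be sketched as follows. This is only a structural illustration under stated assumptions: each module is stood in for by a plain callable supplied by the caller, the training performed by the model determination module 11 is assumed to have already produced the query-driven segmenter, images and masks are lists of rows, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Mask = List[List[int]]   # hypothetical representation: rows of 0/1 values
Image = List[List[int]]  # hypothetical: single-channel image as rows of pixels


@dataclass
class EntitySegmentationDevice:
    # Stand-in for module 12: entity segmentation without semantics.
    segment_entities: Callable[[Image], List[Mask]]
    # Stand-in for module 13: the trained target model answering a text query.
    segment_by_query: Callable[[Image, str], Mask]
    # Stand-in for the mask selection performed inside module 14.
    select_masks: Callable[[List[Mask], Mask], List[Mask]]

    def respond(self, image: Image, query: str) -> Dict[str, object]:
        """Run the three stages and cut the target entity out of the image."""
        first_mask_set = self.segment_entities(image)
        semantic_mask = self.segment_by_query(image, query)
        chosen = self.select_masks(first_mask_set, semantic_mask)
        entity_image = [
            [pixel if any(mask[r][c] for mask in chosen) else 0
             for c, pixel in enumerate(row)]
            for r, row in enumerate(image)
        ]
        return {"label": query, "entity_image": entity_image}


# Stub callables used only to exercise the wiring.
device = EntitySegmentationDevice(
    segment_entities=lambda img: [[[1, 0], [0, 0]], [[0, 0], [0, 1]]],
    segment_by_query=lambda img, q: [[1, 0], [0, 0]],
    select_masks=lambda masks, sem: [m for m in masks if m == sem],
)
response = device.respond([[5, 6], [7, 8]], "shirt")
```

Passing the stages in as callables keeps the sketch independent of any particular model; a real device would bind them to the preset entity segmentation model, the target model, and the dynamic programming routine.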

[0076] As can be seen from the above, this application selects a preset dataset, obtains the vector information corresponding to the text information in the preset dataset based on a preset image recognition model, and uses the vector information to train a preset entity segmentation model to obtain the target model. Only a small amount of preset data is required, reducing the dependence on large-scale labeled data and lowering the labeling cost. Moreover, the training process is simple, without the need for complex schemes or the introduction of large-scale new datasets, reducing training costs and improving training efficiency. On the other hand, the preset entity segmentation model is used to obtain a first mask set that does not contain semantic information corresponding to the target image in the preset dataset, and the target model is used to obtain a second mask set that contains semantic information corresponding to the target query request input by the user in the target image. A preset dynamic programming algorithm is used to determine the target mask information corresponding to the second mask set from the first mask set. Then, based on the target mask information and the second semantic mask information in the second mask set, the target semantic mask information is determined, so as to segment the target entity image from the target image using the target semantic mask information. The target model acquires the semantic results of the image and provides semantic information, while the preset entity segmentation model provides fine mask information. The results of the target model and the preset entity segmentation model are bridged by dynamic programming. The combination of the two results makes the final segmentation result, namely the target semantic mask information, accurate in boundary and with semantic labels. 
In the vertical domain, it can more accurately identify and segment specific objects, meet the needs of scenarios with high requirements for segmentation results, and improve vertical performance.

[0077] In some specific embodiments, the model determination module 11 includes:

[0078] An information recognition unit is used to fuse the vector information with the encoder in a preset entity segmentation model to obtain a text encoder, so that the preset entity segmentation model can recognize the acquired text information based on the fused text encoder.

[0079] The model training unit is used to train a preset entity segmentation model including the text encoder based on the target images in the preset dataset, the text information corresponding to each target image, and the first semantic mask information containing the semantic information of the first target object to obtain a target model; wherein, the first target object is an object with semantic information in the preset dataset.

[0080] In some specific embodiments, the first mask set acquisition module 12 includes:

[0081] The first mask set determination unit is used to perform entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a third target object, and to determine the mask information corresponding to each third target object that does not contain semantic information, so as to obtain a first mask set; wherein, the third target object includes objects with semantic information and objects without semantic information.

[0082] In some specific embodiments, the second mask set acquisition module 13 includes:

[0083] The second mask set determination unit is used to obtain the target query request through the target model, so that the target model can determine the second target object and the corresponding target region corresponding to each target query request from the target image, and generate second semantic mask information containing the semantic information of the second target object, so as to obtain the second mask set corresponding to each target query request.

[0084] In some specific embodiments, the image segmentation module 14 includes:

[0085] The first region determination unit is used to determine the second semantic mask information corresponding to the target query request, and to determine the first object region corresponding to the second semantic mask information;

[0086] The mask information determination unit is used to select a mask information from the mask information that has not been selected in the current first mask set as the current mask information;

[0087] The second region determination unit is used to determine the second object region corresponding to the current mask information, and merge the second object region corresponding to the current mask information with all merged object regions corresponding to the existing mask information in the target mask information combination to obtain the current merged object region; wherein, the initial target mask information combination is empty;

[0088] The mask information update unit is used to calculate the current intersection-union ratio between the first object region and the currently merged object region. If the current intersection-union ratio is not zero and the current intersection-union ratio is greater than the target intersection-union ratio between the first object region and the merged object regions corresponding to all existing mask information in the target mask information combination, then the current mask information is added to the target mask information combination.

[0089] The set traversal unit is used to jump to the step of selecting a mask information from the mask information that has not been selected in the current first mask set as the current mask information, until the first mask set is traversed, and the mask information in the target mask information combination is used as the target mask information corresponding to the second semantic mask information in the second mask set corresponding to the target query request.

[0090] In some specific embodiments, the image segmentation module 14 includes:

[0091] The mask information determination unit is used to use the semantic information corresponding to the second semantic mask information to remove the mask information in the target mask information that does not include the semantic information to obtain a new target mask information, and to fuse the semantic information corresponding to the second semantic mask information with the new target mask information to obtain target semantic mask information containing semantic information.

[0092] In some specific embodiments, the image segmentation module 14 further includes:

[0093] The sampling unit is used to downsample the mask information in the first mask set and the second mask set based on a preset step size to obtain the preprocessed first mask set and second mask set; the preset step size includes a preset horizontal step size and a preset vertical step size.

[0094] Furthermore, embodiments of this application also disclose an electronic device. Figure 7 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application.

[0095] Figure 7 is a schematic diagram of the structure of an electronic device 20 provided in an embodiment of this application. Specifically, the electronic device 20 may include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the entity segmentation method based on dynamic programming disclosed in any of the foregoing embodiments. Optionally, the electronic device 20 in this embodiment may specifically be an electronic computer.

[0096] In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0097] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk or optical disk, etc. The resources stored thereon can include operating system 221, computer program 222, etc., and the storage method can be temporary storage or permanent storage.

[0098] The operating system 221, which may be Windows Server, Netware, Unix, Linux, etc., is used to manage and control the various hardware devices on the electronic device 20 and the computer program 222. In addition to including a computer program capable of performing the dynamic programming-based entity segmentation method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs capable of performing other specific tasks.

[0099] Furthermore, this application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned dynamic programming-based entity segmentation method. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0100] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.

[0101] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0102] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0103] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0104] The technical solutions provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A dynamic programming-based entity segmentation method, characterized in that it comprises: acquiring a preset dataset, inputting the text information in the preset dataset into a preset image recognition model to obtain vector information corresponding to the text information, and training a preset entity segmentation model based on the vector information to obtain a target model, wherein the preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object; performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set, wherein the first mask set includes mask information that does not contain semantic information and is obtained after performing entity segmentation on the target image; acquiring a target query request input by a user terminal, and generating, based on the target model, a second mask set corresponding to a second target object in the target image corresponding to the target query request, wherein the second mask set includes second semantic mask information containing the semantic information of the second target object; using a preset dynamic programming algorithm to determine, from the first mask set, target mask information corresponding to the second semantic mask information in the second mask set, determining target semantic mask information based on the target mask information and the second semantic mask information, segmenting a target entity image from the target image using the target semantic mask information, and returning the target entity image to the user terminal to respond to the target query request; wherein the step of using a preset dynamic programming algorithm to determine, from the first mask set, the target mask information corresponding to the second semantic mask information in the second mask set includes: determining the second semantic mask information corresponding to the target query request, and determining a first object region corresponding to the second semantic mask information; selecting one piece of mask information from the mask information that has not been selected in the current first mask set as current mask information; determining a second object region corresponding to the current mask information, and merging the second object region corresponding to the current mask information with all merged object regions corresponding to existing mask information in a target mask information combination to obtain a current merged object region, wherein the target mask information combination is initially empty; calculating a current intersection-union ratio between the first object region and the current merged object region, and, if the current intersection-union ratio is not zero and is greater than a target intersection-union ratio between the first object region and the merged object regions corresponding to all existing mask information in the target mask information combination, adding the current mask information to the target mask information combination; and returning to the step of selecting one piece of mask information from the mask information that has not been selected in the current first mask set as the current mask information, until the first mask set has been traversed, and using the mask information in the target mask information combination as the target mask information corresponding to the second semantic mask information in the second mask set corresponding to the target query request.

2. The entity segmentation method based on dynamic programming according to claim 1, characterized in that the step of training a preset entity segmentation model based on the vector information to obtain a target model includes: fusing the vector information with the encoder in the preset entity segmentation model to obtain a text encoder, so that the preset entity segmentation model can recognize the acquired text information based on the fused text encoder; and training the preset entity segmentation model including the text encoder based on the target images in the preset dataset, the text information corresponding to each target image, and the first semantic mask information containing the semantic information of the first target object to obtain the target model; wherein the first target object is an object with semantic information in the preset dataset.

3. The entity segmentation method based on dynamic programming according to claim 1, characterized in that the step of performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain the corresponding first mask set includes: performing entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain third target objects, and determining the mask information without semantic information corresponding to each third target object to obtain the first mask set; wherein the third target objects include objects with semantic information and objects without semantic information.

4. The entity segmentation method based on dynamic programming according to claim 1, characterized in that the step of generating, based on the target model, a second mask set corresponding to the second target object in the target image corresponding to the target query request includes: obtaining the target query request through the target model, so that the target model determines, from the target image, the second target object and the corresponding target region for each target query request, and generates second semantic mask information containing the semantic information of the second target object, so as to obtain the second mask set corresponding to each target query request.

5. The entity segmentation method based on dynamic programming according to claim 1, characterized in that determining the target semantic mask information based on the target mask information and the second semantic mask information includes: using the semantic information corresponding to the second semantic mask information to remove the mask information in the target mask information that does not include the semantic information, so as to obtain new target mask information; and fusing the semantic information corresponding to the second semantic mask information with the new target mask information to obtain target semantic mask information that includes the semantic information.

6. The entity segmentation method based on dynamic programming according to claim 1, characterized in that, before determining the target mask information corresponding to the second semantic mask information in the second mask set from the first mask set using a preset dynamic programming algorithm, the method further includes: downsampling the mask information in the first mask set and the second mask set based on a preset step size to obtain a preprocessed first mask set and a preprocessed second mask set; wherein the preset step size includes a preset horizontal step size and a preset vertical step size.

7. An entity segmentation apparatus based on dynamic programming, characterized in that it includes: a model determination module, used to acquire a preset dataset, input the text information in the preset dataset into a preset image recognition model to obtain vector information corresponding to the text information, and train a preset entity segmentation model based on the vector information to obtain a target model, wherein the preset dataset includes target images, text information corresponding to each target image, and first semantic mask information containing semantic information of a first target object; a first mask set acquisition module, used to perform entity segmentation on each target image in the preset dataset based on the preset entity segmentation model to obtain a corresponding first mask set, wherein the first mask set includes mask information that contains no semantic information and is obtained by performing entity segmentation on the target image; a second mask set acquisition module, used to acquire a target query request input by a user terminal and to generate, based on the target model, a second mask set corresponding to a second target object in the target image corresponding to the target query request, wherein the second mask set includes second semantic mask information containing semantic information of the second target object; and an image segmentation module, used to determine, from the first mask set using a preset dynamic programming algorithm, the target mask information corresponding to the second semantic mask information in the second mask set, to determine target semantic mask information based on the target mask information and the second semantic mask information, to segment a target entity image from the target image using the target semantic mask information, and to return the target entity image to the user terminal in response to the target query request; wherein determining, from the first mask set using the preset dynamic programming algorithm, the target mask information corresponding to the second semantic mask information in the second mask set includes: determining the second semantic mask information corresponding to the target query request, and determining a first object region corresponding to the second semantic mask information; selecting, as current mask information, a piece of mask information that has not yet been selected from the current first mask set; determining a second object region corresponding to the current mask information, and merging the second object region with the merged object region corresponding to all mask information already in a target mask information combination to obtain a current merged object region, wherein the target mask information combination is initially empty; calculating a current intersection-over-union (IoU) ratio between the first object region and the current merged object region, and, if the current IoU ratio is not zero and is greater than a target IoU ratio between the first object region and the merged object region corresponding to all mask information already in the target mask information combination, adding the current mask information to the target mask information combination; and returning to the step of selecting, as current mask information, a piece of mask information that has not yet been selected from the current first mask set, until the first mask set has been fully traversed, and then using the mask information in the target mask information combination as the target mask information corresponding to the second semantic mask information in the second mask set corresponding to the target query request.
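The mask selection loop recited in claim 7 can be sketched as follows. This is a minimal illustration, assuming each object region is represented as a set of `(row, col)` pixel coordinates and that the first mask set is traversed in list order. The names `iou` and `select_target_masks` are hypothetical, and the patent's actual data structures may differ.

```python
from typing import List, Set, Tuple

Region = Set[Tuple[int, int]]  # assumed representation: set of (row, col) pixels


def iou(a: Region, b: Region) -> float:
    """Intersection-over-union of two pixel regions (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_target_masks(
    first_region: Region, first_mask_set: List[Region]
) -> Tuple[List[int], float]:
    """Traverse the first mask set once, keeping each mask that raises the IoU.

    Returns the indices of the kept masks and the final IoU.
    """
    selected: List[int] = []   # the target mask information combination
    merged: Region = set()     # merged region of all masks kept so far
    target_iou = 0.0           # IoU of first_region vs. the empty combination

    for idx, region in enumerate(first_mask_set):
        candidate = merged | region               # current merged object region
        current_iou = iou(first_region, candidate)
        # Keep the mask only if the IoU is nonzero and strictly improves.
        if current_iou != 0 and current_iou > target_iou:
            selected.append(idx)
            merged = candidate
            target_iou = current_iou
    return selected, target_iou


if __name__ == "__main__":
    target = {(0, 0), (0, 1), (1, 0), (1, 1)}
    masks = [
        {(0, 0), (0, 1)},          # covers the top half of the target
        {(5, 5)},                  # unrelated region
        {(1, 0), (1, 1)},          # covers the bottom half of the target
        {(1, 1), (2, 2), (3, 3)},  # mostly outside the target
    ]
    print(select_target_masks(target, masks))
```

In this example the first and third masks are kept: each raises the IoU of the merged region against the target, while the other two lower it and are skipped.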

8. An electronic device, characterized in that it includes: a memory, used to store a computer program; and a processor, used to execute the computer program to implement the entity segmentation method based on dynamic programming according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that it is used to store a computer program which, when executed by a processor, implements the entity segmentation method based on dynamic programming according to any one of claims 1 to 6.