Image infringement retrieval method and device based on image semantic contrast pre-training model

By using the CLIP model and R101-FPN model combined with the SuperGlue algorithm, high-value elements in local images are extracted for feature matching and weighted summation, solving the problem of low efficiency in image infringement retrieval under massive data and achieving fast and accurate infringement detection.

CN118132792B · Active · Publication Date: 2026-06-12 · HUNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUNAN UNIV
Filing Date
2024-03-27
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing image infringement retrieval methods are inefficient with massive datasets, cannot effectively distinguish between image retrieval and infringement retrieval, and cannot accurately identify infringement caused by partial image modifications.

Method used

Image feature extraction is performed using a pre-trained CLIP model. Combined with the R101-FPN model and the SuperGlue algorithm, a comprehensive infringement similarity score is calculated by feature matching and weighted summation of local high-value elements to determine whether an image constitutes infringement.

Benefits of technology

Fast and accurate image infringement retrieval was achieved in massive datasets, improving retrieval efficiency and accuracy while effectively saving resource consumption.



Abstract

The application relates to an image infringement retrieval method and device based on image semantic comparison pre-training model. The method comprises the following steps: acquiring an image database and a to-be-detected image, performing feature extraction on the to-be-detected image by using a pre-trained CLIP model to obtain coarse screening features of each database image and the to-be-detected image; selecting images with a similarity greater than a preset value from the image database according to the similarity of the coarse screening features to form a coarse screening image set; extracting local high-value elements of the to-be-detected image by using a feature extraction network; performing feature matching on the local high-value elements and the images in the coarse screening image set to obtain feature matching scores of the local high-value elements; weighting and summing the feature matching scores according to corresponding area proportions, and judging whether the to-be-detected image constitutes infringement according to a comprehensive infringement similarity score obtained. The method solves the problem of fast and accurate retrieval in a large amount of data set, effectively saves retrieval resource consumption, and improves the retrieval speed.

Description

Technical Field

[0001] This application relates to the field of copyright protection technology, and in particular to an image infringement retrieval method and apparatus based on an image semantic comparison pre-trained model.

Background Technology

[0002] Existing image infringement retrieval methods mainly fall into two categories. The first is based on adding and retrieving watermarks and primarily aims to improve watermark robustness. The second is based on image feature extraction and similarity comparison; its basic approaches all start from extracting image features. Some methods also combine blockchain technology with infringement detection technology.

[0003] Image infringement detection presents two core challenges: one is the sheer volume of image databases, which requires rapid feature extraction and computation; the other is accurately matching the needs of image infringement detection with those of image similarity detection. To address the first challenge, patents CN116415210A and CN114495139A employ a segmented approach: first extracting image features, then classifying them, and finally searching within the same category (CN116415210A focuses on methodology, while CN114495139A emphasizes system construction). Patent CN112579986A, on an image infringement detection method, device, and system, does not use a classification method but instead performs a two-stage search: coarse screening and fine screening. None of these patents addresses the second challenge: they fail to distinguish between image retrieval and image infringement retrieval.

[0004] Facing the second challenge, the patent CN115017350A, which describes a poster image deduplication retrieval method, device, and electronic device based on deep learning, addresses this issue. This patent proposes a two-stage retrieval method: first, images are classified, and then similarity searches are performed within the same category. Its key feature is that it retrieves local features rather than overall image characteristics. However, this method does not address the first challenge: how to perform retrieval on massive datasets. Because the scheme uses object recognition technology, the identifiable categories are inherently limited, making massive-scale retrieval impossible.

[0005] Traditional image search methods often extract global features of the image as feature values for retrieval. Although the above methods can retrieve identical or similar images, image infringement retrieval and image retrieval are two similar but different tasks. The main differences are: (1) Common infringement behaviors include overall modification of the image and local modification of the image, and existing technologies can only solve the former type of infringement behavior; (2) Image infringement usually requires infringement retrieval in a massive database, and how to improve the efficiency of infringement retrieval is also a core problem to be solved.

Summary of the Invention

[0006] Therefore, it is necessary to provide an image infringement retrieval method and apparatus based on an image semantic comparison pre-trained model to address the aforementioned technical problems.

[0007] An image infringement retrieval method based on an image semantic contrast pre-trained model, the method comprising:

[0008] An image database is acquired, and a pre-trained CLIP model is used to extract features from each image in the database to obtain coarse-screen features for each image.

[0009] The image to be detected is acquired, and the CLIP model is used to extract features from the image to obtain coarse screening features.

[0010] Calculate the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and select images in the image database whose cosine similarity is greater than a preset value to form a coarse screening image set.

[0011] A feature extraction network is used to extract local high-value elements from the image to be detected.

[0012] Feature matching is performed on each image in the coarse-screened image set and the local high-value elements to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0013] The feature matching scores are weighted and summed according to the area ratio to obtain the comprehensive infringement similarity score of each image in the image to be detected and the coarse screening image set.

[0014] Based on the comprehensive infringement similarity score and the preset threshold, it is determined whether the image to be detected constitutes infringement.

[0015] In one embodiment, the feature extraction network is an R101-FPN model that combines a ResNet-101 backbone network with a feature pyramid network.

[0016] In one embodiment, a feature extraction network is used to extract local high-value elements of the image to be detected, including:

[0017] The image to be detected is input into the R101-FPN model to obtain multiple bounding boxes, which form a bounding box set.

[0018] Each bounding box is treated as a detected element.

[0019] Based on the bounding box set, remove low-value elements and retain locally high-value elements.

[0020] In one embodiment, based on the bounding box set, low-value elements are removed while locally high-value elements are retained, including:

[0021] Calculate the ratio of the area of each bounding box in the bounding box set to the area of the image to be detected, and obtain the proportion of the area occupied by each bounding box.

[0022] Remove bounding boxes whose area proportions are outside the preset range from the bounding box set to obtain the first bounding box set. Define an area proportion set P, which is used to record the area proportion of each bounding box in the first bounding box set.

[0023] Calculate the aspect ratio of each bounding box in the first bounding box set, and delete the bounding boxes with an aspect ratio of 0 from the first bounding box set to obtain the second bounding box set.

[0024] The CLIP model is used to extract the image features corresponding to each bounding box in the second bounding box set. The cosine similarity of the image features corresponding to any two bounding boxes in the second bounding box set is calculated. When the obtained cosine similarity is higher than the first preset threshold, the bounding box with the smaller area is deleted, and the area ratio set P is updated.

[0025] The area of each bounding box in the area ratio set P is normalized proportionally, and the normalized area ratio set P' is defined to obtain the filtered bounding box set. The filtered bounding box set includes the high-value element set and its corresponding normalized area ratio set P'. The normalized area ratio set P' is used to record the normalized weight of each local high-value element.

[0026] In one embodiment, feature matching is performed on the local high-value elements and each image in the coarse-screened image set to obtain a feature matching score between the local high-value elements and each image in the coarse-screened image set, including:

[0027] SuperGlue is used to extract and match features of local high-value elements in the image to be detected and each image in the coarse-screened image set, so as to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0028] In one embodiment, the feature matching scores are weighted and summed according to the area ratio to obtain a comprehensive infringement similarity score for each image in the image to be detected and the coarse-screened image set, including:

[0029] Each high-value element is assigned a weight based on its area ratio, and all weights are normalized to obtain the normalized weight of each high-value element.

[0030] Based on the normalized weights and feature matching scores of locally high-value elements, the comprehensive infringement similarity score for each image in the image to be detected and the coarse-screened image set is obtained as follows:

[0031] $$S(Q, I_i) = \sum_{b \in D'} p'_b \cdot \mathrm{score}(b, I_i)$$

[0032] Where $S(Q, I_i)$ is the comprehensive infringement similarity score between the image to be detected $Q$ and the i-th image $I_i$ in the coarse-screened image set, $D'$ is the set of local high-value elements of $Q$, $p'_b$ is the normalized weight of the local high-value element $b$, and $\mathrm{score}(b, I_i)$ is the feature matching score between the local high-value element $b$ and image $I_i$.

[0033] In one embodiment, the normalized weights of all locally high-value elements constitute the normalized set of area ratios:

[0034] $$P' = \left\{ p'_k \;\middle|\; p'_k = \frac{p_k}{\sum_{j \in J} p_j},\ k \in J \right\}$$

[0035] Where $P'$ is the normalized set of area proportions; $p'_k$ is the normalized weight of the k-th local high-value element; $p_k$ is the original area ratio of the k-th segmented object in the area ratio set $P$; $J$ is the set of indices of all bounding boxes; and $\sum_{j \in J} p_j$ is the sum of the area proportions of all bounding boxes in the area proportion set $P$.

[0036] An image infringement retrieval device based on an image semantic contrast pre-trained model, the device comprising:

[0037] The first coarse screening module is used to acquire the image database. It uses a pre-trained CLIP model to extract features from each image in the image database to obtain coarse screening features for each image. Then, it acquires the image to be detected and uses the CLIP model to extract features from the image to be detected to obtain coarse screening features for the image to be detected. Finally, it calculates the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and selects images in the image database with a cosine similarity greater than a preset value to form a coarse screening image set.

[0038] The local high-value element extraction and matching module is used to extract local high-value elements from the image to be detected using a feature extraction network; feature matching is performed on the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0039] The infringement judgment module is used to perform a weighted summation of feature matching scores based on area ratios to obtain a comprehensive infringement similarity score for each image in the image to be detected and the coarse-screened image set; based on the comprehensive infringement similarity score and a preset threshold, it determines whether the image to be detected constitutes infringement.

[0040] The aforementioned image infringement retrieval method and apparatus based on an image semantic comparison pre-trained model includes the following steps: acquiring an image database and images to be detected; performing feature extraction using a pre-trained CLIP model to obtain coarse-screen features for each database image and the image to be detected; selecting images from the image database with a cosine similarity greater than a preset value based on the similarity of the coarse-screen features to form a coarse-screen image set; extracting local high-value elements from the images to be detected using a feature extraction network; performing feature matching on the local high-value elements and each image in the coarse-screen image set to obtain feature matching scores for the local high-value elements; weighting and summing the feature matching scores according to the corresponding area ratios; and determining whether the image to be detected constitutes infringement based on the obtained comprehensive infringement similarity score. This method solves the problem of fast and accurate retrieval in massive datasets, effectively saving retrieval resources while improving retrieval speed.

Attached Figure Description

[0041] Figure 1 is a flowchart illustrating an image infringement retrieval method based on an image semantic comparison pre-trained model in one embodiment;

[0042] Figure 2 is a structural block diagram of an image infringement retrieval device based on an image semantic comparison pre-trained model in one embodiment;

[0043] Figure 3 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0044] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0045] In one embodiment, as shown in Figure 1, an image infringement retrieval method based on an image semantic contrast pre-trained model is provided. The method includes the following steps:

[0046] Step 100: Obtain the image database, and use the pre-trained CLIP model to extract features from each image in the image database to obtain the coarse-screen features of each image.

[0047] Specifically, the image database can be of any type and can contain different kinds of image data. A subset of images, which are those that infringe on the image to be detected, can be pre-stored in the database. Alternatively, images can be imported in batches from other poster databases, or images can be added manually to the database.

[0048] Once the image database is built, a neural network is used to extract coarse screening features from the images in the database. Specifically, a pre-trained CLIP model is used to extract image features, and the resulting feature vectors are saved. The CLIP model is a neural network pre-trained on a large number of image-text pairs, which gives it strong classification capabilities. With the CLIP model, image features can be extracted quickly and the semantics of the images are also captured, which helps to filter images that are conceptually similar. The CLIP model employs a multimodal learning strategy focused on the semantic alignment between images and text, and it contains two main components: an image encoder and a text encoder. The image encoder, typically a convolutional neural network (CNN) such as a ResNet or a Vision Transformer (ViT), processes image data and extracts visual features. In the feature extraction stage, the image encoder transforms the image into a high-dimensional feature vector that captures the visual information of the image; the saved vectors are a high-level abstract representation of the image. Through its training process, the CLIP model learns the semantic relationship between image content and the corresponding text, so it can not only extract image features but also relate these features to specific text descriptions. Besides feature vectors, the CLIP model can output matching scores between an image and a series of text descriptions, which allows the images that best match a text description to be selected.

[0049] Step 102: Obtain the image to be detected, and use the CLIP model to extract features from the image to obtain the coarse screening features of the image to be detected.

[0050] Specifically, a pre-trained CLIP model is used to extract coarse features from the image to be detected and the feature vector is saved.

[0051] Step 104: Calculate the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and select images in the image database with a cosine similarity greater than a preset value to form a coarse screening image set.

[0052] Specifically, the cosine similarity between the feature vector of the image to be detected and the feature vector of each image in the image database is calculated, giving a similarity score between the image to be detected and each database image; the score is usually within the range 0 to 1. All scores are then sorted from high to low and, preferably, the images corresponding to the top 200 scores are selected to form a coarse-screened image subset that is conceptually related to the image to be detected.

[0053] Specifically, the formula for calculating the cosine similarity $S_{\cos}$ between the coarse-screened features of image $l$ in the image database and the coarse-screened features of the image to be detected is as follows:

[0054] $$S_{\cos}(v_l, v_q) = \frac{v_l \cdot v_q}{\|v_l\| \, \|v_q\|} \tag{1}$$

[0055] Where $S_{\cos}(v_l, v_q)$ is the cosine similarity between the coarse screening features of image $l$ in the image database and those of the image to be detected, $v_l$ is the coarse screening feature of image $l$ in the image database, $v_q$ is the coarse screening feature of the image to be detected, $\cdot$ denotes the dot product, and $\|\cdot\|$ is the Euclidean norm.
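The coarse screening of steps 100 to 104 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names `cosine_similarities` and `coarse_screen`, the parameters `min_similarity` and `top_k`, and the toy 2-D vectors standing in for CLIP features are all assumptions introduced here. The sketch supports both selection rules mentioned in the text (a preset similarity value and the top 200 results).

```python
import numpy as np


def cosine_similarities(db_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Cosine similarity between the query vector and every row of db_feats."""
    # Assumes no feature vector is all zeros, so every norm is non-zero.
    db_unit = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    query_unit = query_feat / np.linalg.norm(query_feat)
    return db_unit @ query_unit


def coarse_screen(db_feats, query_feat, min_similarity=None, top_k=200):
    """Return indices of the database images kept by the coarse screen, best first."""
    sims = cosine_similarities(np.asarray(db_feats, dtype=float),
                               np.asarray(query_feat, dtype=float))
    order = np.argsort(-sims, kind="stable")  # highest similarity first
    if min_similarity is not None:
        order = order[sims[order] > min_similarity]  # optional preset-value filter
    return order[:top_k]


# Toy 2-D "features" standing in for CLIP embeddings (hypothetical values).
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
kept = coarse_screen(db, query, min_similarity=0.5)
print(kept.tolist())  # [0, 2]
```

Because the database features are computed once and stored, only the query embedding and one matrix-vector product are needed per search, which is where the speed of the coarse stage comes from.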

[0056] Step 106: Use a feature extraction network to extract local high-value elements from the image to be detected.

[0057] Specifically, to improve the efficiency of detecting infringing samples in image infringement retrieval tasks, the characteristics of image infringement behavior were analyzed in the embodiments. It was found that some elements in an image have a higher probability of being infringed. Extracting these high-value elements is therefore the core objective of this stage.

[0058] The local high-value elements of images are designed primarily for image infringement retrieval. They can effectively filter out image elements with a high probability of being copied, thereby improving the efficiency of infringement case detection.

[0059] Step 108: Perform feature matching on the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0060] Specifically, feature matching for local high-value elements and each image in the coarsely screened image set can be achieved using SuperGlue, or a feature extraction and feature matching model similar to SuperGlue can be used.

[0061] Step 110: The feature matching scores are weighted and summed according to the area ratio to obtain the comprehensive infringement similarity score of each image in the image to be detected and the coarse screening image set.

[0062] Specifically, by using a weighted summation method, the weight of each local high-value element b based on its area ratio is combined with the feature matching score provided in step 108 to calculate the overall infringement similarity. This approach not only considers the visual importance of elements but also utilizes the degree of matching between features extracted by the deep learning model, thus obtaining a comprehensive infringement similarity score.

[0063] The comprehensive infringement similarity score calculation method is designed primarily for image infringement retrieval problems. By assigning weights to different high-value elements, it focuses on image elements with a higher probability of being copied (elements with larger areas), thereby improving the efficiency of infringement case detection.

[0064] Step 112: Determine whether the image to be detected constitutes infringement based on the comprehensive infringement similarity score and the preset threshold.

[0065] Specifically, the image to be detected is compared with the coarse-screened image set to obtain infringement similarity values in the range [0, 1]. A similarity of 1 means that the image to be detected and an image in the image database are considered to constitute 100% infringement. Preferably, the threshold for determining infringement is set to 0.7: if the infringement similarity is greater than or equal to 0.7, the image to be detected and the corresponding image in the image database are considered to have an infringement relationship. This threshold can be modified according to different infringement retrieval scenarios.
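The threshold decision of step 112 can be illustrated with a short sketch. The function name `find_infringing`, the image identifiers, and the score values below are hypothetical; only the default threshold of 0.7 and the "greater than or equal to" comparison come from the text.

```python
from typing import Dict, List


def find_infringing(scores: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Return ids of coarse-screened images whose score meets the threshold, highest first."""
    hits = [(image_id, s) for image_id, s in scores.items() if s >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [image_id for image_id, _ in hits]


# Hypothetical comprehensive infringement similarity scores for three candidates.
example_scores = {"img_a": 0.91, "img_b": 0.70, "img_c": 0.69}
print(find_infringing(example_scores))  # ['img_a', 'img_b']
```

Keeping the threshold as a parameter reflects the statement that it can be adjusted for different infringement retrieval scenarios.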

[0066] By extracting high-value elements with a high probability of infringement and combining them with a two-stage screening mechanism, this method can effectively improve the efficiency and accuracy of image infringement retrieval by searching the image database for images that may infringe on the image to be detected.

[0067] The aforementioned image infringement retrieval method includes: acquiring an image database and images to be detected; using a pre-trained CLIP model for feature extraction to obtain coarse-screen features for each database image and the image to be detected; selecting images from the image database with a cosine similarity greater than a preset value based on the similarity of the coarse-screen features to form a coarse-screen image set; using a feature extraction network to extract local high-value elements from the images to be detected; performing feature matching on the local high-value elements and each image in the coarse-screen image set to obtain feature matching scores for the local high-value elements; weighting and summing the feature matching scores according to the corresponding area ratios; and determining whether the image to be detected constitutes infringement based on the obtained comprehensive infringement similarity score. This method solves the problem of fast and accurate retrieval in massive datasets, effectively saving retrieval resources while improving retrieval speed.

[0068] In one embodiment, the feature extraction network in step 106 is the R101-FPN model, which combines the ResNet-101 backbone network and the feature pyramid network.

[0069] Specifically, to extract the target completely, this embodiment uses the R101-FPN model and the pre-trained Mask R-CNN framework as the feature extraction network. This model combines the ResNet-101 backbone network with the Feature Pyramid Network (FPN) to optimize target detection and segmentation.

[0070] ResNet+FPN takes the feature maps of Conv2 (Layer 1), Conv3 (Layer 2), Conv4 (Layer 3), and Conv5 (Layer 4) from ResNet and puts them into FPN for processing.

[0071] In one embodiment, step 106 includes: inputting the image to be detected into the R101-FPN model to obtain multiple bounding boxes, forming a bounding box set; treating each bounding box as a detected element; and based on the bounding box set, deleting low-value elements and retaining local high-value elements.

[0072] Specifically, after the image Q to be detected is input into the R101-FPN model, the model outputs multiple bounding boxes (bboxes), forming a bounding box set M. Each bbox represents a detected object and is represented by its coordinates $(x_{left}, y_{top}, x_{right}, y_{bottom})$. Subsequently, low-value elements are filtered out through multiple steps, while high-value elements are retained.

[0073] In one embodiment, based on the bounding box set, low-value elements are removed and locally high-value elements are retained. This includes: calculating the ratio of the area of each bounding box in the bounding box set to the area of the image to be detected, obtaining the proportion of each bounding box's area; removing bounding boxes whose area proportions are not within a preset range from the bounding box set, obtaining a first bounding box set; defining an area proportion set P, which records the proportion of each bounding box's area in the first bounding box set; calculating the aspect ratio of each bounding box in the first bounding box set; removing bounding boxes with an aspect ratio of 0 from the first bounding box set, obtaining a second bounding box set; and using the CLIP model to extract the image features corresponding to each bounding box in the second bounding box set. The cosine similarity between the image features corresponding to any two bounding boxes in the second bounding box set is calculated. When the obtained cosine similarity is higher than a first preset threshold, the bounding box with the smaller area is deleted, and the area ratio set P is updated. The area of each bounding box in the area ratio set P is normalized, and the normalized area ratio set P' is defined to obtain the filtered bounding box set. The filtered bounding box set includes the high-value element set and its corresponding normalized area ratio set P'. The normalized area ratio set P' is used to record the normalized weight of each local high-value element.

[0074] Specifically, calculate the area occupied by each bounding box. For each bounding box bbox∈M, calculate its area as follows:

[0075] $$A_{bbox} = (x_{left} - x_{right}) \cdot (y_{top} - y_{bottom}) \tag{2}$$

[0076] Given the area $A_{bbox}$ of the bounding box, and letting $area(Q)$ denote the area of the image Q to be detected, the area proportion $p_{bbox}$ is calculated as:

[0077] $$p_{bbox} = \frac{A_{bbox}}{area(Q)} \tag{3}$$

[0078] The threshold range for $p_{bbox}$ is 5%-85%. Bounding boxes whose $p_{bbox}$ falls outside this range are removed from the bounding box set M. At the same time, an area ratio set P is defined to record the proportion $p_{bbox}$ occupied by each bounding box (bbox).

[0079] Calculate the aspect ratio. For each bounding box (bbox), let its width be $(x_{right} - x_{left})$ and its height be $(y_{bottom} - y_{top})$, both obtained from its coordinate points $(x_{left}, y_{top}, x_{right}, y_{bottom})$. The aspect ratio $C_b$ of the bounding box (bbox) is calculated as:

[0080] $$C_b = \frac{x_{right} - x_{left}}{y_{bottom} - y_{top}} \tag{4}$$

[0081] Check whether $C_b$ is non-zero, and remove from the bounding box set M every bounding box (bbox) whose $C_b$ is 0.
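The area-proportion and aspect-ratio filters described above can be sketched as below. This is an illustration under stated assumptions, not the patent's code: the name `filter_boxes`, treating the 5% and 85% limits as inclusive, and treating boxes with non-positive width or height as degenerate are choices made here. The box coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def filter_boxes(boxes: List[Box], image_w: float, image_h: float,
                 min_ratio: float = 0.05, max_ratio: float = 0.85):
    """Keep boxes with a non-zero aspect ratio whose area proportion is in range.

    Returns the kept boxes and the matching area proportion set P.
    """
    image_area = image_w * image_h
    kept, proportions = [], []
    for box in boxes:
        x_left, y_top, x_right, y_bottom = box
        width = x_right - x_left
        height = y_bottom - y_top
        if width <= 0 or height <= 0:
            continue  # degenerate box: aspect ratio is zero or undefined
        p = (width * height) / image_area
        if min_ratio <= p <= max_ratio:  # inclusive bounds are an assumption
            kept.append(box)
            proportions.append(p)
    return kept, proportions


# Hypothetical detections on a 100 x 100 image.
candidates = [(0, 0, 10, 10),     # 1% of the image: too small
              (0, 0, 50, 50),     # 25%: kept
              (0, 0, 100, 100),   # 100%: too large
              (10, 10, 10, 60),   # zero width: removed
              (20, 20, 80, 80)]   # 36%: kept
kept_boxes, P = filter_boxes(candidates, 100, 100)
print(kept_boxes)  # [(0, 0, 50, 50), (20, 20, 80, 80)]
```

The text applies the area filter first and the aspect-ratio filter second; the sketch checks both in one pass, which yields the same surviving boxes because a zero-width or zero-height box also has zero area.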

[0082] Deduplication is performed. To improve retrieval speed, the CLIP model is used to extract the image features corresponding to the bounding boxes, and the image similarity between any two bounding boxes l and q in the bounding box set M is calculated. The image features $v_l$ and $v_q$ of the images corresponding to bounding boxes l and q are extracted, and the cosine similarity between the two is calculated:

[0083] $$S_{\cos}(v_l, v_q) = \frac{v_l \cdot v_q}{\|v_l\| \, \|v_q\|} \tag{5}$$

[0084] Where $S_{\cos}(v_l, v_q)$ is the cosine similarity between the image features $v_l$ and $v_q$ of the images corresponding to bounding boxes l and q. The value of $S_{\cos}(v_l, v_q)$ ranges from 0 to 1, with higher scores indicating greater similarity. When the similarity is higher than the threshold of 0.7, the areas $A_l$ and $A_q$ of bounding boxes l and q are compared, the bounding box with the smaller area is deleted, and the area ratio set P is updated, thereby removing redundant objects.
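One possible reading of the deduplication step is sketched below. The text does not specify the order in which pairs are compared, so this sketch assumes a greedy pass from the largest box to the smallest, dropping a box when it is too similar to a larger box that has already been kept. The name `deduplicate` and the toy feature vectors (standing in for CLIP features of the cropped boxes) are hypothetical.

```python
import numpy as np


def deduplicate(features: np.ndarray, areas: np.ndarray, threshold: float = 0.7) -> list:
    """Return sorted indices of boxes kept after removing near-duplicate smaller boxes."""
    # Assumes every feature vector is non-zero.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T  # pairwise cosine similarities
    keep = []
    # Visit boxes from largest to smallest so the larger box of a pair survives.
    for i in np.argsort(-areas, kind="stable"):
        if any(sims[i, j] > threshold for j in keep):
            continue  # too similar to a larger kept box: treat as redundant
        keep.append(int(i))
    return sorted(keep)


# Toy features: boxes 0 and 1 are near-duplicates, box 2 is distinct.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
box_areas = np.array([100.0, 50.0, 30.0])
print(deduplicate(feats, box_areas))  # [0, 2]
```

After this step the area ratio set P would be restricted to the returned indices before normalization.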

[0085] Scale normalization. The final step involves normalizing the area of the cropped image, which standardizes the output set of segmented objects based on their relative proportions in the original image. This normalization is crucial for a consistent representation of object size in visualization analysis and retrieval. The goal of this embodiment is to normalize these scale values so that the sum of the area proportions of all segmented objects equals 1, thus obtaining the final scale set.

[0086] Finally, we obtain a set of filtered bounding boxes M', which is the set of high-value elements and its corresponding normalized area ratio set P'.

[0087] In one embodiment, step 108 includes: using SuperGlue to extract and match features of local high-value elements of the image to be detected and each image in the coarse-screened image set, to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0088] Specifically, this embodiment uses SuperGlue to extract image features from the high-value element set M' of the image to be detected and the images in the coarse-screened image set, and performs feature matching to obtain a feature matching score. SuperGlue is an end-to-end image deep feature extraction network used to match key points in images. It learns how to extract and compare image features through end-to-end training. SuperGlue first performs key point detection on the image to be detected and on each image in the coarse-screened image set. These key points are salient feature points in the image, usually corners, edges, or texture regions, which can be used for correspondence and matching between images. After key point detection, SuperGlue generates a feature descriptor for each key point. These descriptors are high-dimensional vectors that encode local image information around the key point, allowing key points to be compared between different images. Finally, during feature matching, the key point feature descriptors of the image to be detected are compared with the feature descriptors of the images in the coarse-screened image set to find matching pairs.

[0089] In one embodiment, step 110 includes: assigning weights to each high-value element based on its area ratio, normalizing all weights to obtain a normalized weight for each local high-value element; and obtaining a comprehensive infringement similarity score for each image in the image to be detected and the coarse-screened image set based on the normalized weights of the local high-value elements and the feature matching score.

[0090] $$S(Q, I_i) = \sum_{b \in D'} p'_b \cdot \mathrm{score}(b, I_i) \tag{6}$$

[0091] Where $S(Q, I_i)$ is the comprehensive infringement similarity score between the image to be detected $Q$ and the i-th image $I_i$ in the coarse-screened image set, $p'_b$ is the normalized weight of local high-value element $b$, and $\mathrm{score}(b, I_i)$ is the feature matching score between local high-value element $b$ and image $I_i$.

[0092] Specifically, for the image Q to be detected, a set of high-value elements D' has already been extracted in the previous step. The area proportion $p_b$ of each b ∈ D' is used to measure the importance of the different high-value elements in D': each local high-value element b is assigned a weight, and the weights of all elements in the set sum to 1, which yields the normalized area ratio set P'. The calculation expression for the comprehensive infringement similarity score between the image to be detected Q and the i-th image in the coarse-screened image set is shown in formula (6).

[0093] Using a weighted summation method, each high-value element b is assigned a weight $w_b$ based on its area ratio $p_b$, and the comprehensive infringement similarity score is calculated by combining these weights with the feature matching scores provided by the SuperGlue model. This approach considers not only the visual importance of the elements but also the degree of matching between the features extracted by the deep learning model.
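The weighted summation described above, in which each element's normalized weight is multiplied by its feature matching score and the products are summed, can be sketched as follows. The function name `infringement_score` and the dictionary-based inputs keyed by element identifier are illustrative assumptions.

```python
def infringement_score(weights, match_scores):
    """Weighted sum over local high-value elements b of
    (normalized weight of b) * (feature matching score of b against one image).

    weights:      dict mapping element id -> normalized weight (summing to 1)
    match_scores: dict mapping element id -> feature matching score
    """
    if weights.keys() != match_scores.keys():
        raise ValueError("weights and match_scores must cover the same elements")
    return sum(weights[b] * match_scores[b] for b in weights)
```

For example, with hypothetical weights 0.5, 0.3, and 0.2 and matching scores 0.9, 0.4, and 0.1, the score is 0.45 + 0.12 + 0.02 = 0.59.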

[0094] In one embodiment, the normalized weights of all locally high-value elements constitute the normalized set of area ratios:

[0095] $$P' = \left\{\, p'_k \;\middle|\; p'_k = \frac{p_k}{\sum_{j \in J} p_j},\ k \in J \right\}$$

[0096] Where $P'$ is the normalized set of area proportions; $p'_k$ is the normalized weight of the k-th local high-value element; $p_k$ is the original area ratio of the k-th segmented object in the area ratio set P; $J$ is the set of indices of all bounding boxes; and $\sum_{j \in J} p_j$ is the sum of the area proportions of all bounding boxes in the area proportion set P.

[0097] Specifically, this method ensures that all normalized area proportions in P′ add up to 1, which is crucial for ensuring a consistent representation of object size in visual parsing and retrieval.
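As a worked illustration with hypothetical values (not taken from this application), suppose three bounding boxes remain after filtering, with area proportions $p_1 = 0.10$, $p_2 = 0.30$, and $p_3 = 0.10$. Dividing each proportion by their sum gives weights that add up to 1:

```latex
\begin{aligned}
\sum_{j \in J} p_j &= 0.10 + 0.30 + 0.10 = 0.50,\\
p'_1 &= \frac{0.10}{0.50} = 0.2,\qquad
p'_2 = \frac{0.30}{0.50} = 0.6,\qquad
p'_3 = \frac{0.10}{0.50} = 0.2,\\
\sum_{k \in J} p'_k &= 0.2 + 0.6 + 0.2 = 1.
\end{aligned}
```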

[0098] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they can be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

[0099] In one embodiment, an image infringement retrieval method based on an image semantic contrast pre-trained model is provided, which specifically includes the following steps:

[0100] Step S1: Obtain the image database; extract features from each image in the image database using a neural network to obtain the coarse screening features of each image.

[0101] Step S2: Obtain the image to be detected, and extract features from the image to be detected through a neural network to obtain the coarse screening features of the image to be detected.

[0102] Step S3: Compare the coarse screening features of the image to be detected with the coarse screening features of the images in the image database one by one by calculating cosine similarity, and form the coarse-screened image set from the 200 images with the highest scores.

[0103] Step S4: Use the R101-FPN model to extract high-value elements from the image to be detected, calculate the feature similarity between the high-value elements and the images in the coarse-screened image set, and perform a weighted summation of the obtained similarities according to the area ratio to obtain the comprehensive infringement similarity score between the image to be detected and the images in the coarse-screened image set.

[0104] Step S5: Determine whether the image to be detected constitutes infringement by using a pre-set infringement similarity score threshold.
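The coarse screening of steps S1 to S3 can be sketched as follows, assuming the CLIP features have already been computed and are supplied as NumPy arrays. The function name `coarse_screen` and its parameters are hypothetical; the CLIP feature extraction itself is not shown.

```python
import numpy as np

def coarse_screen(query_feat, db_feats, top_k=200):
    """Rank database images by cosine similarity to the query feature.

    query_feat: array of shape (d,)   - feature of the image to be detected
    db_feats:   array of shape (n, d) - one feature per database image
    Returns (indices, similarities) of the top_k most similar images,
    ordered from most to least similar.
    """
    q = query_feat / np.linalg.norm(query_feat)
    d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every database image
    k = min(top_k, len(sims))         # the database may hold fewer than top_k images
    order = np.argsort(-sims)[:k]     # indices sorted by descending similarity
    return order, sims[order]
```

With a database of millions of images, the exhaustive product `d @ q` would typically be replaced by an approximate nearest-neighbour index; that substitution is outside the scope of this sketch.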

[0105] Specifically, the image infringement retrieval system contains an image database composed of multiple images. Each image in the database undergoes feature extraction using the CLIP model, and a feature index is constructed. The image to be detected is acquired, its features are extracted using the same method as in step S1, and they are compared with the feature index built from the image database. The 200 images with the highest similarity are used as the coarse-screened subset. Subsequently, in step S4, the image to be detected is input into a feature extraction network based on the R101-FPN model, which selects local high-value elements as the core infringement retrieval targets. SuperGlue is then used to extract each local feature of the image to be detected and the overall features of each image in the suspected set, and a similarity comparison is performed. The set of local high-value elements is weighted according to area ratio to generate a comprehensive image feature of the image to be detected, its similarity is compared against the 200 suspected images, and a list of infringement retrieval results is returned.

[0106] Among common methods of image infringement, besides manipulating the image itself, some involve reusing parts of the image, such as cropping out and using a portion of the image instead of the entire image. Therefore, the method in this application takes these common characteristics of image infringement into account and extracts key elements for local retrieval to improve search efficiency. To further improve search speed, a large model is used to classify image concepts, and a coarse screening based on these concepts helps define the initial search scope more quickly and accurately.

[0107] In a verification embodiment, the image infringement retrieval method of this application achieves a probability of 0.920 of finding infringing samples within the first 20 image results, which is significantly higher than that of commonly used image retrieval methods. A method without two-stage screening takes approximately 21 times as long as the method of this application to retrieve the same number of images to be detected. These results demonstrate that this method can effectively improve the accuracy and efficiency of image infringement retrieval.

[0108] In one embodiment, as shown in Figure 2, an image infringement retrieval device based on an image semantic comparison pre-trained model is provided, including: a first coarse screening module, a local high-value element extraction and matching module, and an infringement judgment module, wherein:

[0109] The first coarse screening module is used to acquire the image database. It uses a pre-trained CLIP model to extract features from each image in the image database to obtain coarse screening features for each image. Then, it acquires the image to be detected and uses the CLIP model to extract features from the image to be detected to obtain coarse screening features for the image to be detected. Finally, it calculates the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and selects images in the image database with a cosine similarity greater than a preset value to form a coarse screening image set.

[0110] The local high-value element extraction and matching module is used to extract local high-value elements from the image to be detected using a feature extraction network; feature matching is performed on the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0111] The infringement judgment module is used to perform a weighted summation of the feature matching scores based on area ratios to obtain a comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set; based on the comprehensive infringement similarity score and a preset threshold, it determines whether the image to be detected constitutes infringement.
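The threshold decision made by the infringement judgment module can be sketched as below. The function name `judge_infringement`, the dictionary input, and the choice of "greater than or equal to" for the comparison are assumptions; this application only states that a preset threshold is used.

```python
def judge_infringement(similarity_by_image, threshold):
    """Return (image id, score) pairs whose comprehensive infringement
    similarity score reaches the preset threshold, highest score first.

    similarity_by_image: dict mapping coarse-screened image id -> score
    threshold:           preset infringement similarity threshold (assumed inclusive)
    """
    hits = [(img, s) for img, s in similarity_by_image.items() if s >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits
```

Under this sketch, an empty result means that no image in the coarse-screened set reached the threshold, so the image to be detected would not be flagged.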

[0112] In one embodiment, the feature extraction network in the local high-value element extraction and matching module is the R101-FPN model, which combines the ResNet-101 backbone network and the feature pyramid network.

[0113] In one embodiment, the local high-value element extraction and matching module is further configured to input the image to be detected into the R101-FPN model to obtain multiple bounding boxes, forming a bounding box set; treat each bounding box as a detected element; and delete low-value elements and retain local high-value elements based on the bounding box set.

[0114] In one embodiment, the local high-value element extraction and matching module is further configured to: calculate the ratio of the area of each bounding box in the bounding box set to the area of the image to be detected, thereby obtaining the area proportion of each bounding box; delete from the bounding box set the bounding boxes whose area proportions are not within a preset range, to obtain a first bounding box set; define an area proportion set P, which records the area proportion of each bounding box in the first bounding box set; calculate the aspect ratio of each bounding box in the first bounding box set; delete bounding boxes with aspect ratios of 0 from the first bounding box set to obtain a second bounding box set; use the CLIP model to extract the image features corresponding to each bounding box in the second bounding box set; calculate the cosine similarity of the image features corresponding to any two bounding boxes in the second bounding box set and, if the obtained cosine similarity is higher than a first preset threshold, delete the bounding box with the smaller area and update the area proportion set P; and normalize the area proportion of each bounding box in the area proportion set P and define the normalized area ratio set P' to obtain the filtered bounding box set. The filtered bounding box set includes the high-value element set and its corresponding normalized area ratio set P'. The normalized area ratio set P' is used to record the normalized weight of each local high-value element.
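The filtering sequence above can be sketched as follows, assuming the bounding boxes come from a detector and the per-box CLIP features are supplied as an array. The function name `filter_boxes`, the default range and threshold values, and the greedy largest-first handling of near-duplicate boxes are illustrative assumptions; an aspect ratio of 0 is interpreted here as a box with zero width.

```python
import numpy as np

def filter_boxes(boxes, feats, image_w, image_h,
                 min_ratio=0.01, max_ratio=0.9, dup_sim=0.95):
    """Return [(box index, normalized weight)] for the retained boxes.

    boxes: list of (x0, y0, x1, y1) pixel coordinates
    feats: array (n, d), one feature vector per box (e.g. from CLIP)
    """
    image_area = float(image_w * image_h)
    kept = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        w, h = x1 - x0, y1 - y0
        ratio = (w * h) / image_area
        # Step 1: drop boxes whose area proportion is outside the preset range.
        if not (min_ratio <= ratio <= max_ratio):
            continue
        # Step 2: drop degenerate boxes (aspect ratio w / h equal to 0).
        if h <= 0 or w / h == 0:
            continue
        kept.append((i, ratio))

    # Step 3: among boxes with near-identical features, keep the larger one.
    kept.sort(key=lambda t: -t[1])    # largest area first
    unit = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)
    final = []
    for i, ratio in kept:
        if all(float(unit[i] @ unit[j]) <= dup_sim for j, _ in final):
            final.append((i, ratio))

    # Step 4: normalize the remaining area proportions so they sum to 1.
    total = sum(r for _, r in final)
    return [(i, r / total) for i, r in final] if final else []
```

Processing boxes from largest to smallest means that, of any two boxes judged too similar, the smaller one is the one discarded, which matches the deletion rule described above.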

[0115] In one embodiment, the local high-value element extraction and matching module is further used to use SuperGlue to extract and match features of local high-value elements of the image to be detected and each image in the coarse-screened image set, so as to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

[0116] In one embodiment, the infringement determination module is further configured to assign a weight to each local high-value element based on its area ratio, normalize all weights to obtain the normalized weight of each local high-value element, and obtain the comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set based on the normalized weights of the local high-value elements and the feature matching scores:

[0117] $$S(Q, I_i) = \sum_{b \in D'} p'_b \cdot \mathrm{score}(b, I_i)$$

[0118] Where $S(Q, I_i)$ is the comprehensive infringement similarity score between the image to be detected $Q$ and the i-th image $I_i$ in the coarse-screened image set, $p'_b$ is the normalized weight of local high-value element $b$, and $\mathrm{score}(b, I_i)$ is the feature matching score between local high-value element $b$ and image $I_i$.

[0119] In one embodiment, the normalized weight of locally high-value elements in the infringement determination module is:

[0120] $$P' = \left\{\, p'_k \;\middle|\; p'_k = \frac{p_k}{\sum_{j \in J} p_j},\ k \in J \right\}$$

[0121] Where $P'$ is the normalized set of area proportions; $p'_k$ is the normalized weight of the k-th local high-value element; $p_k$ is the original area ratio of the k-th segmented object in the area ratio set P; $J$ is the set of indices of all bounding boxes; and $\sum_{j \in J} p_j$ is the sum of the area proportions of all bounding boxes in the area proportion set P.

[0122] Specific limitations regarding the image infringement retrieval device based on an image semantic comparison pre-trained model can be found in the limitations of the image infringement retrieval method based on an image semantic comparison pre-trained model mentioned above, and will not be repeated here. Each module in the aforementioned image infringement retrieval device based on an image semantic comparison pre-trained model can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0123] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0124] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An image infringement retrieval method based on an image semantic comparison pre-trained model, characterized in that the method includes:
acquiring an image database, and using a pre-trained CLIP model to extract features from each image in the image database to obtain coarse screening features of each image;
acquiring an image to be detected, and using the CLIP model to extract features from the image to be detected to obtain coarse screening features of the image to be detected;
calculating the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and selecting images in the image database whose cosine similarity is greater than a preset value to form a coarse-screened image set;
using a feature extraction network to extract local high-value elements from the image to be detected, the feature extraction network being an R101-FPN model that combines a ResNet-101 backbone network and a feature pyramid network;
performing feature matching between the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set;
performing a weighted summation of the feature matching scores according to area ratio to obtain the comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set;
determining, based on the comprehensive infringement similarity score and a preset threshold, whether the image to be detected constitutes infringement;
wherein using the feature extraction network to extract local high-value elements from the image to be detected includes: inputting the image to be detected into the R101-FPN model to obtain multiple bounding boxes, which constitute a bounding box set; treating each of the bounding boxes as a detected element; and, based on the bounding box set, deleting low-value elements and retaining local high-value elements, which specifically includes:
calculating the ratio of the area of each bounding box in the bounding box set to the area of the image to be detected, to obtain the area proportion of each bounding box; deleting from the bounding box set the bounding boxes whose area proportions are not within a preset range, to obtain a first bounding box set; defining an area proportion set P, which records the area proportion of each bounding box in the first bounding box set; calculating the aspect ratio of each bounding box in the first bounding box set, and deleting bounding boxes with aspect ratios of 0 from the first bounding box set, to obtain a second bounding box set; using the CLIP model to extract the image features corresponding to each bounding box in the second bounding box set; calculating the cosine similarity of the image features corresponding to any two bounding boxes in the second bounding box set and, when the obtained cosine similarity is higher than a first preset threshold, deleting the bounding box with the smaller area and updating the area proportion set P; and normalizing the area proportion of each bounding box in the area proportion set P and defining a normalized area proportion set P' to obtain a filtered bounding box set M'; the filtered bounding box set M' includes the set of high-value elements and its corresponding normalized area proportion set P', and the normalized area proportion set P' is used to record the normalized weight of each local high-value element.

2. The method according to claim 1, characterized in that performing feature matching between the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set includes: using SuperGlue to extract and match features of the local high-value elements of the image to be detected and of each image in the coarse-screened image set, so as to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set.

3. The method according to claim 1, characterized in that performing a weighted summation of the feature matching scores according to area ratio to obtain the comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set includes: assigning a weight to each local high-value element based on its area ratio, and normalizing all weights to obtain the normalized weight of each local high-value element; and obtaining, based on the normalized weights of the local high-value elements and the feature matching scores, the comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set as follows:
$$S(Q, I_i) = \sum_{b \in D'} p'_b \cdot \mathrm{score}(b, I_i)$$
where $S(Q, I_i)$ is the comprehensive infringement similarity score between the image to be detected $Q$ and the i-th image $I_i$ in the coarse-screened image set, $p'_b$ is the normalized weight of local high-value element $b$, and $\mathrm{score}(b, I_i)$ is the feature matching score between local high-value element $b$ and image $I_i$.

4. The method according to claim 3, characterized in that the normalized weights of all local high-value elements constitute the normalized set of area proportions:
$$P' = \left\{\, p'_k \;\middle|\; p'_k = \frac{p_k}{\sum_{j \in J} p_j},\ k \in J \right\}$$
where $P'$ is the normalized set of area proportions; $p'_k$ is the normalized weight of the k-th local high-value element; $p_k$ is the original area ratio of the k-th segmented object in the area proportion set P; $J$ is the set of indices of all bounding boxes; and $\sum_{j \in J} p_j$ is the sum of the area proportions of all bounding boxes in the area proportion set P.

5. An image infringement retrieval device based on an image semantic comparison pre-trained model, characterized in that the device includes:
a first coarse screening module, used to acquire an image database; use a pre-trained CLIP model to extract features from each image in the image database to obtain coarse screening features of each image; acquire an image to be detected and use the CLIP model to extract features from the image to be detected to obtain coarse screening features of the image to be detected; and calculate the cosine similarity between the coarse screening features of the image to be detected and the coarse screening features of each image in the image database, and select images in the image database whose cosine similarity is greater than a preset value to form a coarse-screened image set;
a local high-value element extraction and matching module, used to extract local high-value elements of the image to be detected using a feature extraction network, the feature extraction network being an R101-FPN model combining a ResNet-101 backbone network and a feature pyramid network, which specifically includes: inputting the image to be detected into the R101-FPN model to obtain multiple bounding boxes, forming a bounding box set; treating each bounding box as a detected element; and deleting low-value elements and retaining local high-value elements based on the bounding box set, which specifically includes: calculating the ratio of the area of each bounding box in the bounding box set to the area of the image to be detected, to obtain the area proportion of each bounding box; deleting from the bounding box set the bounding boxes whose area proportions are not within a preset range, to obtain a first bounding box set; defining an area proportion set P, which records the area proportion of each bounding box in the first bounding box set; calculating the aspect ratio of each bounding box in the first bounding box set, and deleting bounding boxes with an aspect ratio of 0 from the first bounding box set, to obtain a second bounding box set; using the CLIP model to extract the image features corresponding to each bounding box in the second bounding box set, and calculating the cosine similarity between the image features corresponding to any two bounding boxes in the second bounding box set; when the obtained cosine similarity is higher than a first preset threshold, deleting the bounding box with the smaller area and updating the area proportion set P; and normalizing the area proportion of each bounding box in the area proportion set P and defining a normalized area proportion set P' to obtain a filtered bounding box set M'; the filtered bounding box set M' includes the set of high-value elements and its corresponding normalized area proportion set P', and the normalized area proportion set P' is used to record the normalized weight of each local high-value element;
the local high-value element extraction and matching module being further used to perform feature matching between the local high-value elements and each image in the coarse-screened image set to obtain the feature matching score between the local high-value elements and each image in the coarse-screened image set;
an infringement determination module, used to perform a weighted summation of the feature matching scores according to area ratio to obtain a comprehensive infringement similarity score between the image to be detected and each image in the coarse-screened image set, and to determine whether the image to be detected constitutes infringement based on the comprehensive infringement similarity score and a preset threshold.