[0035] The drawings, which constitute a part of the present invention, are provided for a further understanding of the present invention; the schematic embodiments and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
[0036] As shown in Figure 1, this embodiment provides a near-duplicate image retrieval method based on deep learning features of consistent regions. In the offline stage, SIFT features are extracted for all images in the image library, each SIFT feature is quantized into a visual word by the K-Means clustering method, and the results are stored in the constructed inverted index file. In the online stage, the same feature extraction and quantization are applied to the input query image, the similarity between the quantized SIFT features and the features in the index file is calculated, the resulting similarities are sorted, and the images related to the query image, i.e., the candidate images, are output in turn. The above process uses the Bag-of-Visual-Words (BOW) model for image retrieval. In addition, in order to reduce the computational complexity of detecting target regions in images and extracting features, the present invention uses the existing EdgeBox algorithm in the offline stage to detect target regions for all images in the above image library, and extracts CNN features for all target regions of the candidate images. In order to ensure that visually consistent region pairs are detected between near-duplicate images, the online stage makes full use of the properties of SIFT features to locate, in the query image, near-duplicate regions consistent with the target regions, forms near-duplicate region pairs, and extracts compact CNN features for these near-duplicate region pairs, which improves the accuracy and efficiency of near-duplicate image retrieval. The specific steps are as follows:
[0037] Step 1: Extract 128-dimensional SIFT features for all images in the image library.
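As a non-limiting illustration, Step 1 can be sketched with OpenCV's SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4+); the library path and the container names below are assumptions made only for illustration.

```python
import glob

import cv2


def extract_sift(image_path):
    """Extract keypoints and 128-dimensional SIFT descriptors from one image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors  # descriptors: (num_features, 128) or None


# Hypothetical image library location; adjust as needed.
library_features = {}
for path in glob.glob("image_library/*.jpg"):
    kps, descs = extract_sift(path)
    if descs is not None:
        library_features[path] = (kps, descs)
```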
[0038] Step 2: Perform BOW quantization on the extracted SIFT features: apply K-Means clustering to all extracted SIFT features and divide them into E categories, each represented by a visual word, so that SIFT features quantized to the same visual word fall into the same category. The set of all visual word labels constitutes the visual dictionary. Each image can therefore be described by several visual words.
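A minimal sketch of the quantization in Step 2, using scikit-learn's MiniBatchKMeans to keep memory manageable for large libraries; the dictionary size E and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def build_vocabulary(library_features, E=10000):
    """Cluster all SIFT descriptors into E visual words (the visual dictionary)."""
    all_descriptors = np.vstack([descs for _, descs in library_features.values()])
    kmeans = MiniBatchKMeans(n_clusters=E, batch_size=10000, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans


# Each image is then described by the visual words of its features, e.g.:
# words = kmeans.predict(descriptors)   # one visual-word ID per SIFT feature
```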
[0039] Step 3: In order to improve the efficiency of image retrieval, an inverted index is established for all SIFT features. Each indexed feature records not only the ID of the image it belongs to, but also its orientation, scale, coordinates and other related information. This information will be further used to generate potential near-duplicate region pairs. The inverted index is shown in Figure 2.
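The inverted index of Step 3 can be sketched as a mapping from each visual word to postings that record the image ID together with the feature's orientation, scale and coordinates; the posting layout below is an assumed structure, not a prescribed one.

```python
from collections import defaultdict


def build_inverted_index(library_features, kmeans):
    """Map each visual word to postings holding image ID, orientation, scale, coords."""
    inverted_index = defaultdict(list)
    for image_id, (keypoints, descriptors) in library_features.items():
        words = kmeans.predict(descriptors)
        for kp, word in zip(keypoints, words):
            inverted_index[int(word)].append({
                "image_id": image_id,
                "orientation": kp.angle,  # main direction of the SIFT feature
                "scale": kp.size,         # scale of the SIFT feature
                "coords": kp.pt,          # (x, y) coordinates
            })
    return inverted_index
```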
[0040] Step 4: By using the inverted index structure, any two SIFT features from different images that are quantized to the same visual word are considered matched, and the similarity between two images is measured by counting the number of matched SIFT feature pairs they share. An image in the image library is considered a candidate image when it shares 5 or more SIFT feature pairs with the input query image. In this way, a large number of irrelevant images can be filtered out, and the time complexity of region detection and feature extraction can be reduced.
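A sketch of the candidate filtering in Step 4 under the assumptions above: features quantized to the same visual word count as matches, and only images sharing at least 5 matches with the query are kept.

```python
from collections import Counter


def find_candidates(query_descriptors, kmeans, inverted_index, min_shared=5):
    """Return images sharing at least `min_shared` matched SIFT pairs with the query."""
    query_words = kmeans.predict(query_descriptors)
    votes = Counter()
    for word in query_words:
        for posting in inverted_index.get(int(word), []):
            votes[posting["image_id"]] += 1
    return [image_id for image_id, n in votes.items() if n >= min_shared]
```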
[0041] Step 5: Since the EdgeBox algorithm can achieve high recall by computing informative edge maps, it can detect meaningful object regions in an image, which are the regions most likely to be copied and propagated among near-duplicate images. In addition, the algorithm's edge computation is efficient, and the computed edge map is sparse, so the computational complexity is low. Most importantly, the algorithm detects target regions directly from the edge information of the image without a learning process based on a deep learning network, so it is highly flexible. The specific steps are as follows:
[0042] Step 5-1: Using the EdgeBox algorithm, a set of object regions is detected for each candidate image.
[0043] Step 5-2: To avoid the negative impact of small regions on image retrieval, this embodiment deletes regions whose area is smaller than (M/5)×(N/5), where M and N are the width and height of the image, respectively.
[0044] Step 5-3: In theory, for a detected target region, the number of SIFT features reflects its texture complexity to a certain extent, because the number of SIFT features extracted from a well-textured region is much larger than that extracted from a flat region. Therefore, in order to save computing resources, all object regions detected in a candidate image are sorted in descending order according to the number of SIFT features each region contains, the top k object regions (detected regions) are kept, and the other object regions are deleted.
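Steps 5-1 to 5-3 can be sketched with the EdgeBoxes implementation in opencv-contrib-python (cv2.ximgproc); the structured-edge model file, the maximum box count and the value of k are assumptions, and the return signature of getBoundingBoxes varies slightly across OpenCV versions.

```python
import cv2
import numpy as np


def detect_regions(image_path, keypoints, edge_model="model.yml.gz", k=10):
    """Detect object regions with EdgeBoxes, drop small ones, keep top k by SIFT count."""
    img = cv2.imread(image_path)
    N, M = img.shape[:2]  # height N, width M

    # Step 5-1: detect object regions with the EdgeBoxes algorithm.
    detector = cv2.ximgproc.createStructuredEdgeDetection(edge_model)
    rgb = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)) / 255.0
    edges = detector.detectEdges(rgb)
    orientation = detector.computeOrientation(edges)
    edges = detector.edgesNms(edges, orientation)
    edge_boxes = cv2.ximgproc.createEdgeBoxes()
    edge_boxes.setMaxBoxes(200)
    boxes, _ = edge_boxes.getBoundingBoxes(edges, orientation)  # older versions return boxes only

    # Step 5-2: delete regions whose area is smaller than (M/5) x (N/5).
    boxes = [b for b in boxes if b[2] * b[3] >= (M / 5) * (N / 5)]

    # Step 5-3: sort regions by the number of SIFT keypoints they contain, keep top k.
    def sift_count(box):
        x, y, w, h = box
        return sum(x <= kp.pt[0] <= x + w and y <= kp.pt[1] <= y + h
                   for kp in keypoints)

    boxes.sort(key=sift_count, reverse=True)
    return boxes[:k]
```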
[0045] Step 6: According to the SIFT feature pairs between the query image and any target region in any candidate image, locate the near-duplicate region in the query image that approximately repeats that target region, and combine the near-duplicate region with the target region to form a near-duplicate region pair. Specifically:
[0046] Step 6-1: Use the inverted index file to find the n SIFT feature pairs that exist between the query image and a certain target region in a candidate image. The number of SIFT feature pairs n may be as high as several hundred. Matching all SIFT features to locate the corresponding potential near-duplicate region pairs in the query image would locate many correct near-duplicate region pairs, but the computational cost is very large. In practice, it is only necessary to ensure that the located near-duplicate region pairs contain at least one truly matching SIFT feature pair to guarantee the correctness of near-duplicate image detection; a truly matching SIFT feature pair consists of two SIFT features from different images that describe the same image content. Therefore, in order to reduce the amount of computation, assume that the probability of a true match is p_T and that Y of the n feature pairs are true matches. When n_S matching SIFT feature pairs are randomly selected, the probability of including at least one truly matching SIFT feature pair is approximately:
[0047] P(Y ≥ 1) ≈ 1 − (1 − p_T)^(n_S)
[0048] Therefore, n_S SIFT feature matching pairs are selected to locate near-duplicate region pairs, which ensures that at least one of the selected pairs is a true match, so that at least one correct near-duplicate region pair can be located.
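Under the approximation above, the required number of sampled pairs n_S can be obtained by solving 1 − (1 − p_T)^(n_S) ≥ P for the smallest integer n_S; the values of p_T and the target probability in the sketch below are assumptions for illustration only.

```python
import math


def min_pairs_needed(p_true=0.3, target=0.99):
    """Smallest n_S with 1 - (1 - p_true)**n_S >= target (assumed p_T and target)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_true))


# Example: with p_T = 0.3, about 13 randomly selected pairs already give a
# 99% chance of containing at least one truly matching SIFT feature pair.
n_s = min_pairs_needed(0.3, 0.99)
```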
[0049] Step 6-2: SIFT feature detection is based on image content, so the scale, main direction, and coordinates of a local feature change with scaling, rotation, and translation transformations, respectively. Therefore, the parameters of the transformation can be estimated from the attribute changes between two matched local features.
[0050] Suppose two matched SIFT features f_Q and f_C have attributes [σ_Q, θ_Q, (x_Q, y_Q)^T] and [σ_C, θ_C, (x_C, y_C)^T], where f_Q is the SIFT feature in the query image, with σ_Q, θ_Q, and (x_Q, y_Q) denoting its scale, main direction, and coordinates, respectively; and f_C is the SIFT feature in the target region, with σ_C, θ_C, and (x_C, y_C) denoting its scale, main direction, and coordinates, respectively. The following formula is used to determine a near-duplicate region (localized region), so that n_S pairs of near-duplicate regions are obtained between the query image and the target regions:
[0051] s = σ_Q / σ_C;  Δθ = θ_Q − θ_C;
(u_Q, v_Q)^T = (x_Q, y_Q)^T + s · R(Δθ) · [(u_C, v_C)^T − (x_C, y_C)^T];
w_Q = s · w_C;  h_Q = s · h_C
[0052] where (u_Q, v_Q)^T, w_Q and h_Q are the center coordinates, width and height of the near-duplicate region R_Q in the query image, respectively; (u_C, v_C)^T, w_C and h_C are the center coordinates, width and height of the target region R_C; and R(Δθ) is the 2×2 rotation matrix for the angle Δθ.
[0053] Intuitively, if these two features are a true match, then R_C and R_Q are very likely to be a correct near-duplicate region pair.
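A sketch of the localization in Step 6-2, following the similarity-transform formula reconstructed above; the bounding-box representation (x, y, w, h) and the OpenCV keypoint attributes are illustrative assumptions.

```python
import math


def localize_region(kp_q, kp_c, box_c):
    """Map the target region R_C = (x, y, w, h) into the query image as R_Q."""
    x_c, y_c, w_c, h_c = box_c
    u_c, v_c = x_c + w_c / 2.0, y_c + h_c / 2.0          # centre of R_C

    s = kp_q.size / kp_c.size                            # scale ratio sigma_Q / sigma_C
    d_theta = math.radians(kp_q.angle - kp_c.angle)      # main-direction difference

    # Rotate and scale the offset of the region centre from the candidate feature,
    # then translate it to the query feature's position.
    dx, dy = u_c - kp_c.pt[0], v_c - kp_c.pt[1]
    u_q = kp_q.pt[0] + s * (math.cos(d_theta) * dx - math.sin(d_theta) * dy)
    v_q = kp_q.pt[1] + s * (math.sin(d_theta) * dx + math.cos(d_theta) * dy)

    return (u_q, v_q, s * w_c, s * h_c)                  # centre, width, height of R_Q
```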
[0054] Step 7: After the potential near-duplicate region pairs are detected, compact CNN features are extracted for these near-duplicate region pairs, with the following steps:
[0055] Step 7-1: Take any target region / near-duplicate region as the input image of the AlexNet model; the model outputs 256 feature maps of size W×H, from which a feature vector of dimension W×H×256 can be obtained.
[0056] Step 7-2: In the first sum-pooling stage, for an input region of any size, spatial sum-pooling over an m×m grid is applied to the activations of the region to obtain an m×m×256-dimensional feature map.
[0057] Step 7-3: In the second sum-pooling stage, the features are compressed by summing the activation values of the m×m×256-dimensional feature map and concatenating the pooled results to generate an m×m×d-dimensional feature vector, where 256 is a multiple of d. Finally, the generated m×m×d-dimensional feature vector is L2-normalized, and the normalized m×m×d-dimensional feature is used as the CNN feature.
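Steps 7-1 to 7-3 can be sketched with the convolutional trunk of torchvision's AlexNet; the values of m and d, and the channel grouping used in the second sum-pooling stage, are assumptions consistent only with the requirement that 256 be a multiple of d. In use, each region would be cropped, resized (e.g. to 224×224) and converted to a normalized tensor before being passed in.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Convolutional trunk of AlexNet: outputs 256 feature maps of size W x H.
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
conv_trunk = alexnet.features.eval()


def compact_cnn_feature(region_tensor, m=3, d=32):
    """region_tensor: (1, 3, height, width) crop of a target / near-duplicate region."""
    with torch.no_grad():
        fmap = conv_trunk(region_tensor)                 # (1, 256, H, W)
    _, c, H, W = fmap.shape

    # Step 7-2: spatial sum-pooling onto an m x m grid -> (1, 256, m, m).
    pooled = torch.zeros(1, c, m, m)
    for i in range(m):
        for j in range(m):
            h0, h1 = i * H // m, (i + 1) * H // m
            w0, w1 = j * W // m, (j + 1) * W // m
            pooled[0, :, i, j] = fmap[0, :, h0:h1, w0:w1].sum(dim=(1, 2))

    # Step 7-3: compress the 256 channels into d by summing groups of 256 // d
    # channels, concatenate, and L2-normalize the resulting m*m*d vector.
    compressed = pooled.view(1, d, c // d, m, m).sum(dim=2)
    feature = compressed.flatten()
    return F.normalize(feature, p=2, dim=0)
```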
[0058] Step 8: In the online retrieval stage, the CNN features of the near-duplicate region and the target region of a candidate image are compared to measure the similarity between the two images, so as to retrieve near-duplicate image versions. For a given near-duplicate region pair R_Q (near-duplicate region) and R_C (target region), with corresponding CNN features C(R_Q) and C(R_C), the cosine similarity is computed as:
[0059] sim(R_Q, R_C) = ⟨C(R_Q), C(R_C)⟩ / (‖C(R_Q)‖ · ‖C(R_C)‖)
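Since the CNN features are already L2-normalized, the cosine similarity of Step 8 reduces to a plain dot product; a minimal sketch:

```python
import torch


def cosine_similarity(feat_q, feat_c):
    """Cosine similarity of two CNN features; a dot product after L2 normalization."""
    return torch.dot(feat_q, feat_c).item()
```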
[0060] Step 9: Between the query image and a candidate image, the highest cosine similarity score among all near-duplicate region pairs is taken as the similarity score between the query image and the candidate image.
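A sketch of the aggregation in Step 9, reusing the cosine_similarity helper above; region_pairs is an assumed list of (query-region feature, target-region feature) tuples.

```python
def image_similarity(region_pairs):
    """Highest cosine similarity over all near-duplicate region pairs of two images."""
    if not region_pairs:
        return 0.0
    return max(cosine_similarity(feat_q, feat_c) for feat_q, feat_c in region_pairs)


# Candidate images can then be ranked by this score and returned in descending order.
```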
[0061] In addition, it should be noted that the various specific technical features described in the above embodiments may be combined in any suitable manner provided there is no contradiction. In order to avoid unnecessary repetition, the various possible combinations are not further described in the present invention.