Image deduplication method and system based on multi-layer feature fusion and improved similarity clustering
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INFORMATION CENT OF YUNNAN POWER GRID CO LTD
- Filing Date
- 2025-09-18
- Publication Date
- 2026-06-18
AI Technical Summary
Existing image deduplication methods are inefficient and lack feature extraction accuracy when dealing with multiple duplicate images. Furthermore, similarity calculation and clustering analysis are inefficient and cannot effectively handle the correlation between images.
We employ a multi-layer feature fusion and improved similarity clustering method. By extracting the main and detailed features of the image through three-layer wavelet packet decomposition, we form a three-dimensional vector. Combined with improved DBScan clustering analysis, we optimize the similarity calculation and clustering process.
It improves the comprehensiveness and reliability of image deduplication, reduces noise interference, and enhances computational efficiency and accuracy, effectively handling multiple duplicate images.
Abstract
Description
A method and system for image deduplication based on multi-layer feature fusion and improved similarity clustering
Technical Field
[0001] This invention relates to the field of computer image processing technology, specifically to an image deduplication method based on multi-layer feature fusion and improved similarity clustering.
Background Technology
[0002] With the development of science and technology and the continuous updating and iteration of image data acquisition methods, the amount of image data is gradually increasing. Inevitably, one or more duplicate images will be mixed in the image set. Using manual methods to classify or remove duplicate images is not only time-consuming and labor-intensive, but also difficult to ensure that no omissions are made. Therefore, it is necessary to develop an efficient, fast and accurate similar image redundancy elimination algorithm.
[0003] Duplicate images typically refer to images that are completely identical in content, color, size, orientation, etc. These images are completely consistent in characteristics, so deduplication of such images is relatively easy and can be achieved using general methods. However, in reality, in most cases, duplicate images are not completely identical. They may be images with similar content, especially images taken continuously by a camera. The content may not appear very different to the human eye, but it has already undergone significant changes to the computer. Or, images that have been processed by the computer, such as flipping, panning, or changing brightness, may also have significant changes to the computer, but in terms of image content, they are still considered duplicate images.
[0004] Currently, deduplication algorithms for duplicate images can be broadly divided into two categories: one is for deduplication algorithms for completely identical images, and the other is for deduplication algorithms for similar images through feature extraction.
[0005] For identical images, common deduplication algorithms include histogram comparison, MD5 comparison, cosine similarity comparison, and hash comparison. Histogram comparison-based algorithms represent the pixel intensity distribution of an image as a histogram and compare the differences in these histograms to determine if images are duplicates. Therefore, this algorithm is heavily influenced by the number of pixels in the image and does not perform well when the image's brightness varies. MD5-based algorithms require highly consistent data for duplicate images; any differences will result in different MD5 values. Therefore, this method can only deduplicate completely identical images, and its practical application performance is poor. Cosine similarity-based image deduplication algorithms require converting the image information into a set of vectors and then calculating the cosine of the angle between the vectors' inner product space, resulting in complex calculations and a large computational load. Hash-based deduplication algorithms require converting the image to grayscale and compressing the image resolution to ensure calculation accuracy. Although this method is fast, it loses a lot of image information and details during processing. Therefore, the similarity calculated based on hash values is only a rough estimate, resulting in poor image deduplication performance.
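The limitation of MD5 comparison noted above can be illustrated with a minimal sketch. The 4×4 greyscale "image" and the helper name `md5_digest` are hypothetical and serve only as an illustration; they are not part of the described method.

```python
import hashlib

def md5_digest(pixels: bytes) -> str:
    """Return the hexadecimal MD5 digest of raw image bytes."""
    return hashlib.md5(pixels).hexdigest()

# Hypothetical 4x4 greyscale image stored as 16 raw bytes.
original = bytes([10, 20, 30, 40] * 4)
# The same image with a single pixel brightened by one grey level.
tweaked = bytes([11]) + original[1:]

# A byte-for-byte copy produces the same digest.
same_file = md5_digest(original) == md5_digest(bytes(original))
# A visually identical image with one changed byte does not.
near_duplicate = md5_digest(original) == md5_digest(tweaked)
print(same_file, near_duplicate)
```

A byte-identical copy is detected, while a change of one grey level in one pixel already yields a different digest, which is why digest comparison cannot find near-duplicate images.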
[0006] Algorithms for deduplicating similar images based on feature extraction generally include ORB (Oriented FAST and Rotated BRIEF)-based deduplication algorithms, SIFT (Scale Invariant Feature Transform)-based deduplication algorithms, and machine learning-based deduplication algorithms. ORB-based algorithms, which extract local binary feature points, are fast, but are significantly affected by differences in image size and deformation. SIFT-based deduplication algorithms, while achieving good results, are computationally slow due to the complexity of feature extraction and matching calculations, and perform poorly on images after flipping or similar operations. Machine learning-based deduplication algorithms require large amounts of data and time for training, and their deduplication effectiveness depends heavily on the training results, leading to poor practical application performance.
[0007] Most existing image deduplication algorithms only work when there is only one duplicate image. They cannot handle multiple duplicate images well. Even if they do, they only compare the subsequent duplicate images with the first one, ignoring the correlation between the multiple duplicate images. As a result, the search for similar images may be missed or inaccurate.
Summary of the Invention
[0008] In view of the above-mentioned problems, the present invention is proposed.
[0009] Therefore, the technical problem solved by this invention is that existing image deduplication methods are insufficient for processing multiple duplicate images, have limited feature extraction efficiency and accuracy, and have low efficiency in similarity calculation and cluster analysis. The invention also addresses the optimization problem of how to achieve efficient image feature extraction and similarity calculation.
[0010] To address the aforementioned technical problems, this invention provides the following technical solution: a deduplication method for images based on multi-layer feature fusion and improved similarity clustering, comprising: extracting a first feature from the input image to generate an index; obtaining image similarity based on the feature vector; and reading the index to perform a first analysis on the image.
[0011] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, the first feature extraction of the input image includes feature extraction of the input image to obtain a feature vector.
[0012] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, the generated index includes an index of generated filenames plus features.
[0013] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, obtaining image similarity based on the feature vector includes reading in the feature vectors of the two images to be compared.
[0014] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, the method for obtaining image similarity includes integrating the similarity of each component.
[0015] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, the input index includes an index of the input file name plus features.
[0016] As a preferred embodiment of the image deduplication method based on multi-layer feature fusion and improved similarity clustering described in this invention, the first analysis of the image includes obtaining image clustering results.
[0017] Another objective of this invention is to provide an image deduplication system based on multi-layer feature fusion and improved similarity clustering. This system can calculate the similarity between images using a hierarchical comparison method based on feature vectors through a similarity calculation module, thus solving the problems of high computational cost, low efficiency, and susceptibility to noise or subtle changes in current similarity calculation methods.
[0018] As a preferred embodiment of the image deduplication system based on multi-layer feature fusion and improved similarity clustering described in this invention, it includes a feature extraction module, a similarity calculation module, and a cluster analysis module; the feature extraction module is used to extract first features from the input image and generate an index; the similarity calculation module is used to obtain image similarity based on feature vectors; and the cluster analysis module is used to read in the index and perform a first analysis on the image.
[0019] A computer device includes a memory and a processor, the memory storing a computer program, characterized in that the processor executes the computer program to implement a method for image deduplication based on multi-layer feature fusion and improved similarity clustering.
[0020] A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of a multi-layer feature fusion and improved similarity clustering image deduplication method.
[0021] The beneficial effects of this invention are as follows: The image deduplication method based on multi-layer feature fusion and improved similarity clustering provided by this invention extracts the main features and detail features of the image by using three-layer wavelet packet decomposition and two-layer wavelet packet decomposition respectively, and combines the features to form a three-dimensional vector, which improves the feature representation ability and anti-interference ability. By prioritizing the matching of main features, the computational efficiency is improved and the risk of misjudgment caused by noise or small changes is reduced. By abstracting the similarity results into the distance between samples and performing unsupervised clustering analysis, the comprehensiveness and reliability of duplicate image removal are improved. This invention achieves better results in terms of anti-interference, computational efficiency and reliability.
Attached Figure Description
[0022] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort. Wherein:
[0023] Figure 1 is an overall flowchart of the image deduplication method based on multi-layer feature fusion and improved similarity clustering provided in the first and second embodiments of the present invention.
[0024] Figure 2 is a schematic diagram of image feature extraction based on multi-layer feature fusion and improved similarity clustering image deduplication method provided in the first and second embodiments of the present invention.
[0025] Figure 3 is a schematic diagram of similarity calculation for an image deduplication method based on multi-layer feature fusion and improved similarity clustering provided in the second embodiment of the present invention.
[0026] Figure 4 is a schematic diagram of DBScan clustering analysis based on similarity improvement, which is based on multi-layer feature fusion and improved similarity clustering image deduplication method provided in the second embodiment of the present invention.
[0027] Figure 5 is an overall module diagram of the image deduplication system based on multi-layer feature fusion and improved similarity clustering provided in the fourth embodiment of the present invention.
Detailed Implementation
[0028] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.
[0029] Example 1
[0030] Referring to Figures 1-2, an embodiment of the present invention provides an image deduplication method based on multi-layer feature fusion and improved similarity clustering, including:
[0031] S1: Perform the first feature extraction on the input image to generate an index.
[0032] Furthermore, the first feature extraction of the input image includes extracting features from the input image to obtain a feature vector.
[0033] In this embodiment of the application, the first feature extraction is image feature extraction, and the input is a dataset of similar, redundant images that contains at least two image categories and may contain one or more redundant images, or none.
[0034] Determine whether the input image is already in HIS (hue, intensity, saturation) color space. If not, convert it from RGB space to HIS space. Split the converted image into its three components, H, I, and S, and flatten each component from a two-dimensional matrix into a one-dimensional vector.
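The conversion and flattening step can be sketched as follows. The description does not state which conversion formulas are used, so the commonly used geometric RGB-to-HSI formulas are assumed here; the function names `rgb_to_hsi` and `image_to_his_vectors` and the nested-list image layout are likewise assumptions made only for illustration.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert one RGB pixel (channels in [0, 1]) to (hue in degrees, saturation, intensity)."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # grey pixel: hue is undefined, 0 is used by convention
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        h = theta if b <= g else 360.0 - theta
    return h, s, i

def image_to_his_vectors(image):
    """Split a 2-D grid of RGB tuples into flattened H, I and S vectors (row-major order)."""
    vectors = {"H": [], "I": [], "S": []}
    for row in image:
        for r, g, b in row:
            h, s, i = rgb_to_hsi(r, g, b)
            vectors["H"].append(h)
            vectors["I"].append(i)
            vectors["S"].append(s)
    return vectors

# Hypothetical 2x2 image: red, blue, mid-grey, green.
tiny = [[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
        [(0.5, 0.5, 0.5), (0.0, 1.0, 0.0)]]
components = image_to_his_vectors(tiny)
print(components["H"])
```

Each of the three returned lists is a one-dimensional vector that can be passed to the wavelet packet decomposition.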
[0035] Each of the three components undergoes a three-level wavelet packet decomposition, which yields eight nodes, i.e., eight frequency bands, per component. The energy proportions of the eight frequency bands, arranged from high to low frequency, form a one-dimensional feature vector A1, referred to as the main feature.
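A minimal sketch of the main feature A1 is given below. The description does not name the wavelet basis, so the Haar wavelet is assumed; the function names are hypothetical, and the component length is assumed to be a multiple of 8 (a trailing odd sample would be dropped).

```python
import math

def haar_split(x):
    """One Haar analysis step: return (low-pass half, high-pass half)."""
    s = math.sqrt(2.0)
    pairs = range(0, len(x) - 1, 2)
    low = [(x[k] + x[k + 1]) / s for k in pairs]
    high = [(x[k] - x[k + 1]) / s for k in pairs]
    return low, high

def packet_bands(x, levels):
    """Full wavelet packet tree; returns 2**levels bands ordered from low to high frequency."""
    bands = [list(x)]
    for _ in range(levels):
        nxt = []
        for idx, band in enumerate(bands):
            low, high = haar_split(band)
            # Children of odd-indexed bands are swapped to keep frequency order.
            nxt.extend((low, high) if idx % 2 == 0 else (high, low))
        bands = nxt
    return bands

def main_feature(component, levels=3):
    """A1: energy proportion of each band, listed from high to low frequency."""
    energies = [sum(v * v for v in band) for band in packet_bands(component, levels)]
    total = sum(energies)
    shares = [e / total if total else 0.0 for e in energies]
    return shares[::-1]

flat = main_feature([1.0] * 8)          # constant signal
zigzag = main_feature([1.0, -1.0] * 4)  # sample-to-sample alternation
print(flat)
print(zigzag)
```

With these inputs the constant signal places its energy in the last entry of A1 (the lowest band), and the alternating signal places it in the first entry (the highest band).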
[0036] The two frequency bands with the largest energy proportions are each further decomposed by a two-level wavelet packet decomposition, yielding two groups of four nodes, i.e., four frequency bands per group. Within each group, the energy proportions of the four frequency bands are ordered by frequency. The two groups are then concatenated into a one-dimensional feature vector A2, referred to as the detail feature.
[0037] The two one-dimensional vectors A1 and A2 are combined into a two-dimensional vector for each component. The vectors of the three HIS components are then merged to obtain the overall three-dimensional feature vector of the image.
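The construction of the detail feature A2 and the merging of the three components can be sketched as follows. Several points are assumptions rather than statements of the description: the Haar wavelet, normalizing the energy proportions within each group of four bands, ordering the two groups by descending energy of their parent band, and a component length that is a multiple of 32. All function names are hypothetical.

```python
import math

def haar_split(x):
    """One Haar analysis step: return (low-pass half, high-pass half)."""
    s = math.sqrt(2.0)
    pairs = range(0, len(x) - 1, 2)
    return ([(x[k] + x[k + 1]) / s for k in pairs],
            [(x[k] - x[k + 1]) / s for k in pairs])

def packet_bands(x, levels):
    """Full wavelet packet tree; bands ordered from low to high frequency."""
    bands = [list(x)]
    for _ in range(levels):
        nxt = []
        for idx, band in enumerate(bands):
            low, high = haar_split(band)
            nxt.extend((low, high) if idx % 2 == 0 else (high, low))
        bands = nxt
    return bands

def shares(bands):
    """Energy proportion of each band (all zeros if the total energy is zero)."""
    energies = [sum(v * v for v in band) for band in bands]
    total = sum(energies)
    return [e / total if total else 0.0 for e in energies]

def component_feature(component):
    """Return (A1, A2) for one flattened H, I or S component."""
    level3 = packet_bands(component, 3)
    p = shares(level3)  # low -> high frequency
    # Indices of the two level-3 bands carrying the largest energy share.
    top_two = sorted(range(len(p)), key=lambda k: p[k], reverse=True)[:2]
    a2 = []
    for k in top_two:
        a2.extend(shares(packet_bands(level3[k], 2))[::-1])
    return p[::-1], a2

def image_feature(components):
    """Merge the per-component (A1, A2) pairs into one nested feature."""
    return {name: component_feature(vec) for name, vec in components.items()}

# Hypothetical flattened components of length 32.
demo = {"H": [1.0] * 32,
        "I": [1.0, -1.0] * 16,
        "S": [float(k % 4) for k in range(32)]}
feature = image_feature(demo)
print(feature["H"])
```

The result has three axes (component, main or detail vector, band), which corresponds to the three-dimensional feature vector described above.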
[0038] In an optional embodiment, the first feature extraction can also be implemented in other ways, such as feature extraction based on deep learning. Before feature extraction, the input image dataset is standardized, including resizing, pixel value normalization and data augmentation. All images are uniformly resized to a fixed size (e.g., 224×224 pixels) to ensure the consistency of network input. Pixel values are normalized to the [0,1] range to reduce the complexity of numerical computation and improve the stability of the model.
[0039] Using a pre-trained deep convolutional neural network model as the basic framework for feature extraction, the network is adjusted into a feature extraction module by removing the fully connected layers in the pre-trained model and retaining only the convolutional and pooling layers. The first few layers of the convolutional network are used to extract low-level features of the image, such as edges, textures, and color distribution. These features are mainly used to characterize the basic structural information of the image. High-level semantic features, such as the shape of objects, local structure, and overall layout in the image, are extracted through deep convolutional layers. These features have higher expressive power and can effectively distinguish the categories and similarities of images.
[0040] Low-level features and high-level semantic features are integrated through feature fusion. Features extracted from different convolutional layers are concatenated according to channel dimension to form a multi-dimensional feature matrix. Different weights are assigned to features at different levels to highlight features that are more discriminative for image deduplication. Principal component analysis or dimensionality reduction layers (such as global average pooling layers) are used to reduce the dimensionality of the fused high-dimensional features to obtain a fixed-length feature vector.
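The fusion step of this optional embodiment can be sketched on already extracted feature maps. The description names no specific network, weights, or tensor layout, so the nested-list layout `[channels][height][width]`, the weight values, and the function names are assumptions. Because feature maps from different layers usually differ in spatial size, this sketch applies global average pooling to each layer before weighting and concatenation, which is one possible ordering of the steps.

```python
def global_average_pool(feature_map):
    """Collapse a [channels][height][width] nested list to one mean value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def fuse_features(layer_maps, layer_weights):
    """Weight each layer's pooled channels and concatenate them into one fixed-length vector."""
    fused = []
    for fmap, weight in zip(layer_maps, layer_weights):
        fused.extend(weight * v for v in global_average_pool(fmap))
    return fused

# Hypothetical outputs of a shallow layer (1 channel, 2x2) and a deep layer (2 channels, 1x1).
low_level = [[[1.0, 3.0], [5.0, 7.0]]]
high_level = [[[2.0]], [[4.0]]]
vector = fuse_features([low_level, high_level], [0.5, 2.0])
print(vector)
```

The length of the fused vector equals the total number of channels across the selected layers, independent of the spatial size of each feature map.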
[0041] Furthermore, index generation includes generating an index with filenames and characteristics.
[0042] It should be noted that the file name and the extracted feature vector are added together as a key-value pair to the index dictionary. If the number of images is large, the index dictionary can be written to a JSON file for later use.
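The index dictionary and its optional JSON file can be sketched as follows. The file names, the feature values, and the helper names are hypothetical; the feature is stored as nested lists so that it survives JSON serialization unchanged.

```python
import json
import os
import tempfile

def add_to_index(index, filename, feature):
    """Store one image's feature under its file name (key-value pair)."""
    index[filename] = feature

def save_index(index, path):
    """Write the index dictionary to a JSON file for later use."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)

def load_index(path):
    """Read an index dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)

index = {}
# Shortened, made-up feature values: {component: [A1, A2]}.
add_to_index(index, "img_001.jpg", {"H": [[0.7, 0.3], [0.6, 0.4]]})
add_to_index(index, "img_002.jpg", {"H": [[0.2, 0.8], [0.5, 0.5]]})

path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(index, path)
restored = load_index(path)
print(sorted(restored))
```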
[0043] Example 2
[0044] Referring to Figures 1-4, an embodiment of the present invention provides an image deduplication method based on multi-layer feature fusion and improved similarity clustering, including:
[0045] S2: Obtain image similarity based on feature vectors.
[0046] Furthermore, based on the feature vector, the feature vectors of the two images to be compared are read in.
[0047] Furthermore, obtaining image similarity involves combining the similarity of each component.
[0048] It should be noted that, for each of the three HIS components, the system first checks whether the entries with the largest energy proportion in the A1 vectors of the two images belong to the same frequency band. If the result is True, the system goes on to compare the next entry in descending order of energy proportion. If the result is False, the system stops comparing and outputs a similarity score. This process is repeated until all values in the A1 feature vectors have been matched. The system then checks, in the same way, whether the entries with the largest energy proportion in the A2 vectors belong to the same frequency band, comparing the two groups in A2 one after the other. If the result is True, the system compares the next entry in descending order of energy proportion. If the result is False, the system stops comparing and outputs a similarity score. This process is repeated until all values in the feature vectors have been compared. Finally, the similarity scores of the three HIS components are combined to give the final similarity result.
[0049] It should also be noted that primary features are compared first, followed by detailed features. Features with higher energy percentages are compared first, followed by those with lower energy percentages. Once the priority features match, the others are compared. If the primary features do not match, the detailed features do not need to be compared. The lower the similarity, the faster the calculation speed; the higher the similarity, the longer the calculation time.
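The hierarchical comparison can be sketched as follows. The description does not give the scoring formula, so this sketch assumes that the score of one component is the fraction of energy ranks matched before the first mismatch, and that the final similarity is the mean of the three component scores. The function names and the sample feature values are hypothetical.

```python
def rank_order(proportions):
    """Band indices sorted from largest to smallest energy proportion."""
    return sorted(range(len(proportions)), key=lambda k: proportions[k], reverse=True)

def matched_prefix(p, q):
    """Number of leading ranks at which both vectors place the same frequency band."""
    count = 0
    for band_p, band_q in zip(rank_order(p), rank_order(q)):
        if band_p != band_q:
            break  # stop at the first mismatch
        count += 1
    return count

def component_similarity(feat_p, feat_q):
    """Compare main features first; compare detail features only if A1 matches fully."""
    a1_p, a2_p = feat_p
    a1_q, a2_q = feat_q
    total = len(a1_p) + len(a2_p)
    matched = matched_prefix(a1_p, a1_q)
    if matched == len(a1_p):
        half = len(a2_p) // 2
        for start in (0, half):  # the two groups in A2, one after the other
            m = matched_prefix(a2_p[start:start + half], a2_q[start:start + half])
            matched += m
            if m < half:
                break
    return matched / total

def image_similarity(img_p, img_q):
    """Combine the three component scores (assumed here: arithmetic mean)."""
    scores = [component_similarity(img_p[c], img_q[c]) for c in ("H", "I", "S")]
    return sum(scores) / len(scores)

a1 = [0.40, 0.20, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02]
a2 = [0.50, 0.30, 0.15, 0.05, 0.60, 0.25, 0.10, 0.05]
swapped = [0.20, 0.40, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02]  # dominant band differs

image_a = {c: (a1, a2) for c in ("H", "I", "S")}
image_b = {"H": (swapped, a2), "I": (a1, a2), "S": (a1, a2)}
print(image_similarity(image_a, image_a), image_similarity(image_a, image_b))
```

In this sketch a mismatch in the dominant band ends the comparison for that component after a single check, which mirrors the statement that dissimilar images are rejected faster than similar ones are confirmed.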
[0050] S3: Read in the index and perform the first analysis on the image.
[0051] Furthermore, the read index includes an index of read filenames plus characteristics, in which objects P are selected sequentially.
[0052] Furthermore, the first analysis of the image includes obtaining image clustering results.
[0053] In this embodiment of the application, the first analysis is DBScan clustering analysis based on similarity improvement, to determine whether P belongs to a certain cluster. If it belongs to a certain cluster, the object is reselected.
[0054] Determine if P is a core point. If it is a core point, mark P as processed. If it is not a core point, determine if P is a noise point. If it is a noise point, group P into a separate cluster and mark it as processed. If it is not a noise point, reselect the object.
[0055] Traverse all unlabeled points within the neighborhood of the core point P and group them with P into one cluster. The distance used to compute the neighborhood is derived from the similarity, i.e., d = 1 - S(P1, P2), where S(P1, P2) is the similarity between images P1 and P2. Mark all of these points as processed and check in turn whether each is a core point. If it is a core point, repeat the traversal. Once all core points and all unlabeled points in their neighborhoods have been processed, select a new object.
[0056] Repeat the loop until all items in the index are marked as processed, then output all clusters obtained from the clustering, with each cluster representing a class of similar images.
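The clustering loop can be sketched as a density-based clustering over the distance d = 1 - S. This is a minimal sketch rather than the exact procedure of the embodiment: the neighborhood radius `eps`, the core-point threshold `min_pts`, the similarity values, and the function names are assumptions, and noise points are emitted as single-image clusters.

```python
def dbscan_from_similarity(names, similarity, eps, min_pts):
    """Cluster images using d = 1 - similarity(p, q) as the distance."""
    def neighbours(p):
        # A point counts as its own neighbour because similarity(p, p) is 1.
        return [q for q in names if 1.0 - similarity(p, q) <= eps]

    assigned = {}
    clusters = []
    for p in names:
        if p in assigned:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            continue  # not a core point; it may still be reached from a core point later
        cluster_id = len(clusters)
        clusters.append([])
        queue = list(seeds)
        while queue:
            q = queue.pop()
            if q in assigned:
                continue
            assigned[q] = cluster_id
            clusters[cluster_id].append(q)
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_pts:  # q is a core point: expand through it
                queue.extend(n for n in q_neighbours if n not in assigned)
    for p in names:  # remaining points are noise: one cluster each
        if p not in assigned:
            assigned[p] = len(clusters)
            clusters.append([p])
    return clusters

# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
table = {frozenset(("a", "b")): 0.95,
         frozenset(("b", "c")): 0.90,
         frozenset(("a", "c")): 0.50}

def similarity(p, q):
    return 1.0 if p == q else table.get(frozenset((p, q)), 0.1)

clusters = dbscan_from_similarity(["a", "b", "c", "d"], similarity, eps=0.2, min_pts=2)
print(clusters)
```

Images a and c are not within `eps` of each other, yet they end up in the same cluster because both are close to b. This is how density clustering takes the correlation among several duplicates into account instead of comparing every image only with the first one.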
[0057] In an optional embodiment, the first analysis can also be implemented in other ways, such as hierarchical clustering analysis. Before the clustering analysis, it is necessary to calculate the similarity distance matrix between all images based on the feature vectors generated by the feature extraction module. The distance can be calculated using Euclidean distance, cosine distance, or based on the similarity formula generated by the similarity calculation module. The feature vector of each image is read from the index, and the distance between the feature vectors of each pair of images is calculated to form a symmetric distance matrix.
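The symmetric distance matrix can be sketched as follows, assuming each image's feature has been flattened into a plain list of numbers. The function names and the sample index are hypothetical; a similarity-based distance could be passed as `metric` in the same way.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(index, metric):
    """Return (names, matrix) where matrix[i][j] is the distance between images i and j."""
    names = sorted(index)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # compute each pair once, then mirror it
            d = metric(index[names[i]], index[names[j]])
            matrix[i][j] = matrix[j][i] = d
    return names, matrix

# Hypothetical flattened feature vectors.
sample_index = {"a.jpg": [0.0, 0.0], "b.jpg": [3.0, 4.0], "c.jpg": [0.0, 1.0]}
names, matrix = distance_matrix(sample_index, euclidean_distance)
print(names)
print(matrix)
```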
[0058] Each image is considered as an independent cluster, meaning there are n clusters in the initial state, and each cluster contains only one image;
[0059] Based on the values of the distance matrix, select the two clusters with the smallest distance, merge them, and update the distance matrix. The distance update can use one of the following methods. The single-linkage method takes the distance between the new cluster and another cluster as the distance between the nearest points in the two clusters.
[0060] The complete-linkage method takes the distance between the new cluster and another cluster as the distance between the farthest points in the two clusters.
[0061] The average-linkage method takes the average of the pairwise distances between all points in the two clusters as the distance between the new cluster and another cluster.
[0062] The centroid method takes the distance between the new cluster and another cluster as the distance between their centroids.
[0063] The clustering process continues until any of the following conditions are met: the minimum distance between all clusters is greater than a preset threshold, the total number of clusters reaches the specified target number of clusters, or no further merging is possible.
[0064] The final clustering result includes all clusters and the images they contain. Each cluster represents a group of similar images, while images that are not clustered are considered noise points.
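The merging loop of this optional embodiment can be sketched as follows, using a distance threshold as the stopping condition. The function name, the threshold, and the sample distances are assumptions. The centroid method is omitted because it needs the feature vectors themselves rather than only the distance matrix.

```python
def agglomerative(dist, threshold, linkage="single"):
    """Merge the closest clusters until the smallest inter-cluster distance exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]  # every image starts as its own cluster

    def cluster_distance(a, b):
        pairwise = [dist[i][j] for i in a for j in b]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # no pair is close enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical symmetric distance matrix for four images.
dist = [[0.00, 0.10, 0.40, 0.90],
        [0.10, 0.00, 0.15, 0.90],
        [0.40, 0.15, 0.00, 0.90],
        [0.90, 0.90, 0.90, 0.00]]
print(agglomerative(dist, 0.2, "single"))
print(agglomerative(dist, 0.2, "complete"))
```

With the sample matrix, single linkage chains images 0, 1 and 2 into one cluster, whereas complete linkage keeps image 2 separate because its distance to image 0 exceeds the threshold. Clusters that remain single images correspond to the noise points mentioned above.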
[0065] Example 3
[0066] One embodiment of the present invention provides an image deduplication method based on multi-layer feature fusion and improved similarity clustering. To verify the beneficial effects of the present invention, scientific demonstration is carried out through economic benefit calculation and simulation experiments.
[0067] A dataset containing multiple types of similar redundant images was selected, comprising 5000 images. The dataset consists of various types of images taken in different scenes, including images that have undergone various preprocessing steps such as flipping, brightness adjustment, and scaling, as well as several consecutively captured images with high similarity.
[0068] Each image is input for feature extraction. Principal features are extracted using three-layer wavelet packet decomposition, and detail features are extracted using two-layer wavelet packet decomposition. The principal and detail features are then fused to form a three-dimensional feature vector. During feature extraction, an index file is used to associate image names with feature vectors for easier subsequent processing. For any two images, the matching rate of principal features is first compared. If the matching rate is below a threshold, they are directly considered dissimilar. If the matching rate meets the requirements, detail features are further compared. An improved DBScan algorithm is used, with the image similarity results used as a distance parameter. Density clustering groups similar images together. The algorithm does not require pre-setting the number of image categories and can dynamically identify similar categories in the image set. The clustering results are compared with manually labeled similar groups to calculate the accuracy and error rate of clustering. The processing time for each group of images is recorded to evaluate efficiency.
[0069] As shown in Table 1, the matching rate of the main features is above 88%, and the matching rate of the detailed features is above 85%, indicating that the feature extraction has high expressive power in extracting global and local features of images, and can effectively distinguish similar and dissimilar images. The processing time is generally between 0.45 and 0.60 seconds, which is a significant improvement compared to the processing time of more than 1 second of traditional algorithms. The hierarchical strategy of prioritizing the matching of main features reduces the frequency of complex calculations, and the clustering error rate is generally below 1.2%. The error rate of most samples is controlled within the range of 0.5%-0.7%, which shows that the improved DBScan algorithm can accurately process large-scale image datasets and reduce the omission of similar images or classification errors.
[0070] Table 1 Experimental Data
[0071] Example 4
[0072] Referring to Figure 5, an embodiment of the present invention provides an image deduplication system based on multi-layer feature fusion and improved similarity clustering, including: a feature extraction module, a similarity calculation module, and a cluster analysis module.
[0073] The feature extraction module is used to extract the first feature from the input image and generate an index; the similarity calculation module is used to obtain the image similarity based on the feature vector; and the clustering analysis module is used to read in the index and perform the first analysis on the image.
[0074] If a function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0075] The logic and / or steps represented in the flowchart or otherwise described herein, for example, can be considered as a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a processor-included system, or other system that can fetch and execute instructions from, an instruction execution system, apparatus, or device). For the purposes of this specification, "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device.
[0076] More specific examples (a non-exhaustive list) of computer-readable media include: electrical connections (electronic devices) having one or more wires, portable computer disk drives (magnetic devices), random access memory (RAM), read-only memory (ROM), erasable and editable read-only memory (EPROM or flash memory), fiber optic devices, and portable optical disc read-only memory (CDROM). Furthermore, computer-readable media can even be paper or other suitable media on which programs can be printed, because programs can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in computer memory.
[0077] It should be understood that various parts of the present invention can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0078] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.
Claims
1. An image deduplication method based on multi-layer feature fusion and improved similarity clustering, characterized in that it comprises: performing first feature extraction on the input image to generate an index; obtaining image similarity based on feature vectors; and reading in the index and performing the first analysis on the image.
2. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 1, characterized in that: The first feature extraction of the input image includes extracting features from the input image to obtain a feature vector.
3. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 2, characterized in that: The generated index includes generating an index with filenames and features.
4. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 3, characterized in that: The step of reading in the feature vector includes reading in the feature vector.
5. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 4, characterized in that: The obtained image similarity includes the sum of the similarity of each component.
6. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 5, characterized in that: The read-in index includes an index of the read-in filename plus its characteristics.
7. The image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in claim 6, characterized in that: The first analysis of the image includes obtaining image clustering results.
8. A system employing the image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in any one of claims 1 to 7, characterized in that: It includes a feature extraction module, a similarity calculation module, and a cluster analysis module; The feature extraction module is used to extract the first feature from the input image and generate an index; The similarity calculation module is used to obtain image similarity based on feature vectors; The clustering analysis module is used to read in the index and perform the first analysis on the image.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the steps of the image deduplication method based on multi-layer feature fusion and improved similarity clustering as described in any one of claims 1 to 7.