Archive cleaning method and related apparatus, electronic device, and storage medium
By filtering and clustering the local feature similarity of object images, the problem of insufficient accuracy in archive cleaning is solved, achieving more efficient and accurate archive cleaning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG DAHUA TECH CO LTD
- Filing Date
- 2023-05-10
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, image clustering is not accurate enough during archive cleaning, which leads to erroneous cleaning.
Suspected impure archives are screened out of the archives to be processed, and their object images are clustered into several first clusters. The first clusters are then filtered, using the similarity of local features of the object images, to obtain second clusters, from which purified archives are obtained.
This improves the accuracy and efficiency of archive cleaning and broadens the applicability of the archive cleaning method.
Figure CN116704225B_ABST
Description
Technical Field
[0001] This application relates to the field of image clustering technology, and in particular to an archive cleaning method and related apparatus, electronic device, and storage medium.
Background Technology
[0002] With the development of science and technology, image clustering has been widely applied. For example, in identity verification, it is necessary to search relevant databases based on images of unidentified objects. Therefore, it is necessary to establish a database with one file per object, where images of the same object belong to the same file.
[0003] Currently, images of different objects are generally compared so that images of the same object are grouped into the same archive. However, due to the limitations of image comparison, errors often occur when images are clustered and new archives are generated. Improving the accuracy of archive cleaning has therefore become an urgent problem.
Summary of the Invention
[0004] The main technical problem addressed by this application is to provide a method for cleaning archives, as well as related apparatus, electronic devices, and storage media, which can improve the accuracy of archive cleaning.
[0005] To address the aforementioned technical problems, the first aspect of this application provides a file cleaning method, comprising: selecting suspected impure files from a number of files to be processed, wherein the object images in the suspected impure files are suspected to belong to different objects; clustering the object images in the suspected impure files to obtain several first clusters; further selecting the object images in the first clusters based on the similarity between local features of the same parts in different object images in the first clusters to obtain second clusters; and obtaining purified files based on the second clusters.
[0006] To address the aforementioned technical problems, a second aspect of this application provides an archive cleaning apparatus, comprising a first screening module, an image clustering module, a second screening module, and an archive purification module. The first screening module is used to screen suspected impure archives from a plurality of archives to be processed; wherein the object images in the suspected impure archives are suspected to belong to different objects. The image clustering module is used to cluster the object images in the suspected impure archives to obtain several first clusters. The second screening module is used to screen the object images in the first clusters based on the similarity between local features of the same parts in different object images within the first clusters to obtain second clusters. The archive purification module is used to obtain purified archives based on the second clusters.
[0007] To address the aforementioned technical problems, a third aspect of this application provides an electronic device including a memory and a processor coupled to each other. The memory stores program instructions, and the processor executes the program instructions to implement the file cleaning method described in the first aspect.
[0008] To address the aforementioned technical problems, a fourth aspect of this application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being used to implement the file cleaning method described in the first aspect.
[0009] In the above scheme, suspected impure archives, whose object images are suspected to belong to different objects, are screened out of the archives to be processed, and their object images are clustered into several first clusters. Next, based on the similarity between local features of the same parts in different object images within the first clusters, the object images in the first clusters are filtered to obtain second clusters, and purified archives are obtained from the second clusters. Clustering the object images of the suspected impure archives into first clusters helps improve clustering accuracy and thus the efficiency of archive cleaning, and the further filtering into second clusters enhances the applicability of the archive cleaning method. Together, these improve the accuracy of archive cleaning.
[0010] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application.
Attached Figure Description
[0011] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the technical solutions of this application.
[0012] Figure 1 is a flowchart illustrating an embodiment of the archive cleaning method of this application;
[0013] Figure 2 is a flowchart illustrating an embodiment of step S12 in Figure 1;
[0014] Figure 3 is a flowchart illustrating another embodiment of step S12 in Figure 1;
[0015] Figure 4 is a flowchart illustrating another embodiment of step S12 in Figure 1;
[0016] Figure 5 is a flowchart illustrating an embodiment of step S13 in Figure 1;
[0017] Figure 6 is a flowchart illustrating another embodiment of step S13 in Figure 1;
[0018] Figure 7 is a schematic diagram of the framework of an embodiment of the archive cleaning apparatus of this application;
[0019] Figure 8 is a schematic diagram of the framework of an embodiment of the electronic device of this application;
[0020] Figure 9 is a schematic diagram of the framework of an embodiment of the computer-readable storage medium of this application.
Detailed Implementation
[0021] The embodiments of this application will now be described in detail with reference to the accompanying drawings.
[0022] In the following description, specific details such as particular system architectures, interfaces, and technologies are presented for illustrative purposes rather than for limiting purposes, in order to provide a thorough understanding of this application.
[0023] In this document, the term "and/or" merely describes an association between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A alone, both A and B, or B alone. Additionally, the character "/" generally indicates that the preceding and following related objects are in an "or" relationship. Furthermore, "many" in this document means two or more. Moreover, the term "at least one" in this document means any one of a plurality of objects or any combination of at least two of them. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C. "Several" means at least one. The terms "first," "second," etc., in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0024] Please refer to Figure 1, which is a flowchart illustrating an embodiment of the archive cleaning method of this application.
[0025] Specifically, this may include the following steps:
[0026] Step S11: Select suspected impure files from a number of files to be processed.
[0027] In this embodiment of the disclosure, the object images in a suspected impure archive are suspected to belong to different objects. Of course, a suspected impure archive may also include object images whose owning object cannot be determined, images that contain no object, and so on. The object images in a suspected impure archive can be determined according to the actual situation, and no specific limitation is made here.
[0028] In a specific implementation scenario, the object image may include, but is not limited to, images of people, animals, vehicles, etc. Of course, the object image can be determined according to the actual situation, and no specific limitation is made here.
[0029] In one implementation scenario, a number of files to be processed may not contain any suspected impure files. When these files do not contain any suspected impure files, the files to be processed are screened to directly obtain purified files. It can be understood that the purified files are the files to be processed. Alternatively, the files to be processed may contain some suspected impure files. After cleaning the suspected impure files, the cleaned purified files and the purified files contained in the files to be processed are obtained. Furthermore, the files to be processed may only contain suspected impure files. After cleaning the suspected impure files, the cleaned purified files are obtained.
[0030] In one implementation scenario, to filter out suspected impure files, feature extraction can be performed on the object images in the files to be processed to obtain image features. The image similarity between the image features of different object images can be calculated, and then the mean of the image similarity can be calculated. The mean of the image similarity of the files to be processed is used as the third outlier. The third outlier is then compared with a preset threshold. Files to be processed with a third outlier not less than the preset threshold are considered suspected impure files, while files with a third outlier less than the preset threshold are considered purified files.
[0031] In another implementation scenario, to further improve the accuracy of identifying suspected impure archives, first outlier values of abnormal factors in the archives to be processed can be obtained first. The abnormal factors represent the dimensions in which an archive to be processed may be abnormal. They may include the similarity between different object images in the archive, the similarity between the target parts of different object images in the archive, the change in the number of object images in the archive, the update time of the object images in the archive, etc. A first outlier value is obtained for each abnormal factor and represents the degree of abnormality of the archive in the dimension of that factor. For example, object features and target part features are first extracted from the object images in the archive to be processed. Object features are obtained by extracting features directly from the object images, while target part features are obtained by cropping target part images from the object images and extracting features from the target part images. Then, the similarity between the object features of different object images and the similarity between different target part features are calculated. The mean similarity between object features and the mean similarity between target part features in the archive are then calculated, and each is used as a first outlier value. Of course, the minimum similarity between different object features and the minimum similarity between different target part features in the archive can also be selected as first outlier values.
Furthermore, a first outlier value can be determined based on whether the number of object images in the archive to be processed changes abruptly. For example, when the number of object images changes abruptly, the corresponding first outlier value is smaller, and when the number changes only slightly, the corresponding first outlier value is larger. Alternatively, a first outlier value can be determined based on the update time of the object images in the archive. For example, when the update interval of the object images is long, the corresponding first outlier value is smaller, and when the update interval is short, the corresponding first outlier value is larger. The first outlier values are then fused to obtain a second outlier value. Specifically, the second outlier value can be obtained by weighting the first outlier values or by directly summing them. How the second outlier value is obtained can be determined according to the actual situation and is not specifically limited here. It can be understood that when there is only one first outlier value, it can be used directly as the second outlier value. After the second outlier value is obtained, whether the archive to be processed is a suspected impure archive is determined based on whether the second outlier value meets a third condition. It can be understood that the third condition is that the second outlier value is not greater than a preset threshold, which could be 0.5, 0.6, 0.7, etc. When the second outlier value meets the third condition, the corresponding archive is considered a suspected impure archive. The third condition can be set according to how the abnormal factors and the second outlier value are determined; no specific limitation is made here.
The above method obtains the first outlier values of the abnormal factors in the archives to be processed and fuses them to obtain the second outlier value. Determining first outlier values for multiple abnormal factors improves the diversity of the first outlier values, and fusing them improves the accuracy of the second outlier value. Deciding whether an archive is a suspected impure archive based on whether the second outlier value meets the third condition then helps improve the accuracy of screening the archives to be processed, thereby improving the accuracy of obtaining suspected impure archives and further improving the accuracy of archive cleaning.
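The fusion and threshold check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the optional `weights` argument, and the default threshold of 0.5 are assumptions made for the example, and the convention that a lower value means "more abnormal" follows the third condition as stated in the text.

```python
from typing import Optional, Sequence


def fuse_outlier_values(first_values: Sequence[float],
                        weights: Optional[Sequence[float]] = None) -> float:
    """Fuse per-factor first outlier values into one second outlier value.

    Hypothetical helper: with weights it computes a weighted sum, without
    weights it sums directly; a single value is passed through unchanged.
    """
    if not first_values:
        raise ValueError("at least one first outlier value is required")
    if len(first_values) == 1:
        return first_values[0]
    if weights is None:
        return sum(first_values)
    if len(weights) != len(first_values):
        raise ValueError("weights and first_values must have the same length")
    return sum(w * v for w, v in zip(weights, first_values))


def is_suspected_impure(second_value: float, threshold: float = 0.5) -> bool:
    """Third condition from the text: the second outlier value is not
    greater than a preset threshold (0.5 here is only an example)."""
    return second_value <= threshold
```

For instance, `is_suspected_impure(fuse_outlier_values([0.9, 0.3], [0.5, 0.5]), threshold=0.7)` flags the archive, because the weighted sum of about 0.6 does not exceed 0.7.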
[0032] In a specific implementation scenario, the target part is at least one of several preset parts of the object that are richest in features, and it can be determined based on the object category. For example, for most living organisms, the face is usually the most feature-rich part, so the target part can be set as the face; for non-living objects such as vehicles, the front and rear of the vehicle are usually highly recognizable, so the target part can be set as the front and rear of the vehicle. It can be understood that the above is only one possible way of setting the target part and does not limit how it is set in actual application. The specific way of setting the target part can be determined according to the actual situation and is not limited here.
[0033] Step S12: Cluster the object images in the suspected impure archives to obtain several first clusters.
[0034] In one implementation scenario, to obtain the first cluster, features can be extracted from the object images in the suspected impure files to obtain the first features. Then, based on the similarity between the first features of different object images in the suspected impure files, the object images in the suspected impure files are clustered. The clustering method can be, but is not limited to, K-means, hierarchical clustering, and density-based clustering, etc., thereby obtaining several first clusters. Unlike the aforementioned implementation, after extracting the first features, the object images in the suspected impure files can be clustered based on the first features to obtain several third clusters. Then, based on the detection results of whether the object images in the third clusters contain the target part, the object images in the third clusters are filtered to obtain fourth clusters. On this basis, different fourth clusters that meet the first condition are merged to obtain several first clusters. The above method clusters object images in suspected impure archives based on a first feature to obtain several third clusters. Then, based on the detection results of whether the object images in the third clusters contain the target parts, the object images in the third clusters are filtered to obtain fourth clusters, which improves the accuracy of the fourth cluster filtering. Then, different fourth clusters that meet the first condition are merged to obtain several first clusters, which helps to improve the clustering effect of the first clusters and minimize the differences between object images in the first clusters, thereby further improving the accuracy of archive cleaning.
[0035] Please refer to Figure 2, which is a flowchart illustrating an embodiment of step S12 in Figure 1. Specifically, it may include the following steps:
[0036] Step S21: Treat each object image in the suspected impure archive as a fifth cluster.
[0037] Understandably, in order to improve the clustering of object images in suspected impure archives, each object image in the suspected impure archive can be treated as a fifth cluster, meaning the number of fifth clusters is equal to the number of object images in the suspected impure archive.
[0038] Step S22: Based on the first features, calculate the first distance between different fifth clusters.
[0039] In one implementation scenario, the first distance between different fifth clusters can be obtained by calculating the cosine distance between the first features, or by calculating the Euclidean distance between the first features. The calculation method for the first distance can be determined according to the actual situation, and no specific limitation is made here.
[0040] Step S23: In response to the first distance satisfying the second condition, merge the corresponding fifth clusters to obtain a first merged cluster.
[0041] In one implementation scenario, the second condition could be that the first distance is not greater than a preset threshold, which could be set to 0.3, 0.4, etc. The preset threshold can be determined based on the actual situation and is not specifically limited here.
[0042] In one implementation scenario, when the first distance does not meet the second condition, it indicates that the corresponding fifth cluster is suspected of not belonging to the same object. The process continues by calculating the first distance between different fifth clusters and determining whether the first distance meets the second condition. When the first distance meets the second condition, the corresponding fifth clusters are merged to obtain a first merged cluster. The process then continues by calculating the first distance between different clusters and determining whether the first distance meets the second condition. It can be understood that at this point, the different clusters include both fifth clusters and the first merged cluster. Therefore, when the first distance between any two clusters meets the second condition, the two corresponding clusters are merged to obtain a new first merged cluster, until the first distance between all different clusters no longer meets the second condition.
[0043] Step S24: Take the first merged clusters and the unmerged fifth clusters in the suspected impure archive as the third clusters.
[0044] It can be understood that the third clusters consist of the first merged clusters, obtained by merging fifth clusters whose first distance satisfies the second condition, together with the fifth clusters in the suspected impure archive that were never merged. Calculating the first distance between different fifth clusters and merging the corresponding clusters when that distance satisfies the second condition helps improve the efficiency of merging fifth clusters. Furthermore, taking both the first merged clusters and the unmerged fifth clusters as the third clusters helps improve the accuracy of the third clusters.
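Steps S21 to S24 resemble threshold-stopped agglomerative clustering, which can be sketched as below. This is a sketch under stated assumptions rather than the method as claimed: the text does not say how the distance between two multi-image clusters is computed at this stage or in which order pairs are merged, so the average pairwise cosine distance and the closest-pair-first order used here are illustrative choices, as are all function names. First features are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_distance(c1: List[int], c2: List[int],
                     features: Sequence[Vector]) -> float:
    """Assumed cluster distance: mean pairwise cosine distance."""
    dists = [cosine_distance(features[i], features[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def merge_by_threshold(features: Sequence[Vector],
                       threshold: float = 0.3) -> List[List[int]]:
    """Return clusters as lists of image indices."""
    # S21: every object image starts as its own (fifth) cluster.
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        # S22: find the closest pair of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], features)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Stop once no pair satisfies the second condition (distance <= threshold).
        if d > threshold:
            break
        # S23: merge the pair into a merged cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # S24: merged clusters plus never-merged singletons are the result.
    return clusters
```

With first features `[[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]` and a threshold of 0.3, the first two images are merged and the third stays in its own cluster.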
[0045] Please refer to Figure 3, which is a flowchart illustrating another embodiment of step S12 in Figure 1. Specifically, it may include the following steps:
[0046] Step S31: Determine whether the object image in the third cluster contains the target part; if not, proceed to step S32; otherwise, proceed to step S33.
[0047] In one implementation scenario, the object images in a third cluster may all contain the target part, only some of them may contain it, or none of them may contain it. It can be understood that object images and target parts are strongly associated: if an object image can be associated with a target part, the object image contains the target part; if it cannot, the object image does not contain the target part. Alternatively, whether an object image contains the target part can be detected using a network model, which can be, but is not limited to, a CNN (convolutional neural network), an RNN (recurrent neural network), etc.
[0048] Step S32: Remove the third cluster.
[0049] In one implementation scenario, if the third cluster does not contain the target part (meaning none of the object images in the third cluster contain the target part), the third cluster is removed. It can be understood that when none of the object images in a third cluster contain the target part, the object to which the images belong is uncertain. To improve the accuracy of archive cleaning, a third cluster that does not contain the target part is therefore removed.
[0050] Step S33: Take the third cluster as a fourth cluster.
[0051] In one implementation scenario, if the third cluster contains the target part (meaning all or some of the object images in the third cluster contain the target part), the third cluster is directly taken as a fourth cluster. It can be understood that when a third cluster contains the target part, the object to which the cluster may belong can be identified, allowing the corresponding object images to be evaluated further. This improves both the accuracy and the efficiency of archive cleaning. By determining whether object images contain the target part and removing the third clusters whose object images do not contain it, the above method helps improve the efficiency of archive cleaning.
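The decision in steps S31 to S33 amounts to keeping a cluster if at least one of its images contains the target part. A minimal sketch follows; the function name and the `has_target_part` callback are hypothetical stand-ins for whatever association lookup or detection model is actually used.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def filter_by_target_part(third_clusters: Sequence[List[Image]],
                          has_target_part: Callable[[Image], bool]
                          ) -> List[List[Image]]:
    """Keep clusters in which at least one image contains the target part."""
    fourth_clusters = []
    for cluster in third_clusters:
        # S31: does any object image in this cluster contain the target part?
        if any(has_target_part(image) for image in cluster):
            # S33: the cluster is kept as a fourth cluster.
            fourth_clusters.append(cluster)
        # S32: otherwise the cluster is dropped.
    return fourth_clusters
```

Clusters in which only some images contain the target part are kept whole, which matches the text: removal applies only when no image in the cluster contains it.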
[0052] Please refer to Figure 4, which is a flowchart illustrating another embodiment of step S12 in Figure 1. Specifically, it may include the following steps:
[0053] Step S41: Extract features from the target part image to obtain the second feature.
[0054] In this implementation scenario, the target part image is cropped from an object image in the suspected impure archive, and feature extraction is performed on the cropped target part image to obtain a second feature. Furthermore, a fourth cluster may include object images that contain no target part. For such an object image, the second feature of any object image in the same cluster that does contain the target part can be selected and used as its second feature.
[0055] Step S42: Based on the second feature, calculate the second distance between different fourth clusters.
[0056] In one implementation scenario, the second distance between different fourth clusters can be calculated from the second features. The second distance can be the minimum distance, the maximum distance, or the average distance between the two clusters. For example, when the maximum distance is used, the second distance is the distance between the two second features, one from each cluster, that are furthest apart. It can be understood that using the maximum distance between the two clusters can effectively separate noisy data between clusters.
[0057] Step S43: In response to the second distance satisfying the first condition, merge the corresponding fourth clusters to obtain a second merged cluster.
[0058] In one implementation scenario, the first condition can be that the second distance is not greater than a preset threshold, which could be 0.3, 0.4, etc.; the first condition can also be that the second distance is not greater than a preset distance, which could be the average distance between two clusters belonging to the same object. The first condition can be determined according to the actual situation and is not specifically limited here.
[0059] In one implementation scenario, when the second distance does not meet the first condition, it indicates that the corresponding fourth cluster is suspected of not belonging to the same object. The process continues by calculating the second distance between different fourth clusters and determining whether the second distance meets the first condition. When the second distance meets the first condition, the corresponding fourth clusters are merged to obtain a second merged cluster. The process then continues by calculating the second distance between different clusters and determining whether the second distance meets the first condition. It can be understood that at this point, the different clusters include the fourth cluster and the second merged cluster. Therefore, when the second distance between any two clusters meets the first condition, the corresponding two clusters are merged to obtain a new second merged cluster, until the second distance between all different clusters no longer meets the first condition.
[0060] Step S44: Take the second merged clusters and the unmerged fourth clusters as the first clusters.
[0061] It can be understood that the first clusters consist of the second merged clusters, obtained by merging fourth clusters whose second distance meets the first condition, together with the fourth clusters that were never merged. Calculating the second distance between different fourth clusters from the second features, and merging the corresponding fourth clusters when the second distance meets the first condition, provides a further check on whether the object images in a merged cluster belong to the same object. Furthermore, taking both the second merged clusters and the unmerged fourth clusters as the first clusters helps improve the accuracy of cluster merging, further improving the accuracy and stability of archive cleaning.
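The second distance used in steps S42 and S43 can be sketched as a linkage function over second features. The sketch below is illustrative only: Euclidean distance is an assumed metric, the names are hypothetical, and borrowing the first available second feature for images without a target part is just one way to pick "any" feature from the same cluster. The iterative merging loop itself is omitted.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def fill_missing_second_features(
        cluster: Sequence[Optional[Vector]]) -> List[Vector]:
    """Images without a target part (None) borrow a second feature from an
    image in the same cluster that has one (here: the first available)."""
    available = [f for f in cluster if f is not None]
    if not available:
        raise ValueError("a fourth cluster needs at least one second feature")
    donor = available[0]
    return [f if f is not None else donor for f in cluster]


def second_distance(c1: Sequence[Vector], c2: Sequence[Vector],
                    linkage: str = "max") -> float:
    """Minimum, maximum, or mean pairwise distance between two clusters."""
    dists = [math.dist(a, b) for a in c1 for b in c2]
    if linkage == "min":
        return min(dists)
    if linkage == "max":
        return max(dists)
    if linkage == "mean":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def meets_first_condition(distance: float, threshold: float = 0.3) -> bool:
    """First condition from the text: distance not greater than a threshold."""
    return distance <= threshold
```

Choosing `linkage="max"` corresponds to the example in the text, where two clusters are merged only if even their furthest pair of second features is close enough.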
[0062] Step S13: Based on the similarity between local features of the same part in different object images in the first cluster, filter the object images in the first cluster to obtain the second cluster.
[0063] In one implementation scenario, each object image in a first cluster can be segmented into multiple local images, and feature extraction is then performed on each local image to obtain local features. The segmentation method can be chosen separately for different first clusters, or the same segmentation method can be used for all first clusters. Specifically, a human image can be divided from top to bottom into several equal parts to obtain the local images; alternatively, the segmentation can be determined by a model that locates the upper body, lower body, shoes, left hand, etc., in the human image to obtain the local images.
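The simpler of the two segmentation options, dividing an image from top to bottom into equal parts, can be sketched as follows. The image is represented here as a plain list of pixel rows, and the function name is hypothetical; when the height is not evenly divisible, the strips differ by at most one row, which is an assumption of this sketch.

```python
from typing import List, Sequence, TypeVar

Row = TypeVar("Row")


def split_into_strips(image_rows: Sequence[Row], parts: int) -> List[List[Row]]:
    """Split an image (a sequence of rows) into `parts` horizontal strips."""
    height = len(image_rows)
    if parts <= 0 or parts > height:
        raise ValueError("parts must be between 1 and the image height")
    # Integer arithmetic keeps the strip boundaries exact and contiguous.
    bounds = [i * height // parts for i in range(parts + 1)]
    return [list(image_rows[bounds[i]:bounds[i + 1]]) for i in range(parts)]
```

Each strip would then be passed to a feature extractor to produce the local feature for that part.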
[0064] In one implementation scenario, the mean similarity between the local features of two different object images can be calculated and compared with a preset threshold. When the mean similarity is greater than the preset threshold, the corresponding first cluster is taken as a second cluster; when it is not greater than the preset threshold, the corresponding first cluster is removed. Alternatively, the number of contradictory points between any two object images can first be calculated based on the similarity between local features of the same parts in different object images within the first cluster. It can be understood that the number of contradictory points represents the number of parts in the two object images that are suspected to belong to different objects. Then, based on the number of contradictory points, the object images in the first cluster are filtered to obtain the second cluster. Filtering by the number of contradictory points in this way helps improve the accuracy of the filtering results for the object images in the first cluster, thereby improving the accuracy of archive cleaning.
[0065] Please refer to Figure 5, which is a flowchart illustrating an embodiment of step S13 in Figure 1. Specifically, it may include the following steps:
[0066] Step S51: Obtain the similarity between local features of the same part in two object images as the local similarity.
[0067] It is understandable that the methods for obtaining local features can refer to the previously disclosed methods for extracting local features, and will not be repeated here. Furthermore, the similarity between local features can be determined by obtaining the cosine similarity between features, or by obtaining the Euclidean distance between features. The similarity between local features can be determined according to the actual situation, and no specific limitations are made here.
[0068] Step S52: Determine whether the local similarity is less than the first threshold; if yes, proceed to step S53; otherwise, proceed to step S54.
[0069] In one implementation scenario, the first threshold can be set to 0.8, 0.9, etc. The first threshold can be determined according to the actual situation, and no specific limitation is made here.
[0070] Step S53: Determine that the part corresponding to the local feature is a contradictory point.
[0071] In one implementation scenario, when the local similarity is less than the first threshold, the similarity between the images of the same part in the different object images is low, meaning that the part corresponding to the local feature is a contradictory point.
[0072] Step S54: Determine that the part corresponding to the local feature is not a contradictory point.
[0073] In one implementation scenario, when the local similarity is not less than the first threshold, the similarity between the images of the same part in the different object images is high, meaning that the part corresponding to the local feature is not a contradictory point.
[0074] Understandably, once the similarity between each pair of local features of two object images in the first cluster has been compared with the first threshold, and the part corresponding to every local feature has been determined to be or not to be a contradictory point, the number of contradictory points between the two object images is obtained by summing the contradictory points between them. Deciding part by part whether a local feature corresponds to a contradictory point improves the accuracy of identifying contradictory points. Furthermore, obtaining the number of contradictory points between any two object images from the sum of their contradictory points helps improve the accuracy of that number, thereby improving the accuracy of archive cleaning.
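Steps S51 to S54 can be illustrated with a minimal sketch. It assumes that each object image is represented as a mapping from a part name to that part's local feature vector, that cosine similarity is the chosen similarity measure, and that the first threshold is 0.8. The part names, the function names, and the data layout are hypothetical and are not taken from the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def count_contradictory_points(features_a, features_b, first_threshold=0.8):
    """Count the parts whose local similarity is below the first threshold.

    Only parts present in both images are compared (an assumption of this
    sketch; the application does not say how missing parts are handled).
    """
    count = 0
    for part in features_a.keys() & features_b.keys():
        # S51: local similarity of the same part in the two images
        local_similarity = cosine_similarity(features_a[part], features_b[part])
        # S52: compare with the first threshold
        if local_similarity < first_threshold:
            count += 1  # S53: this part is a contradictory point
        # S54: otherwise the part is not a contradictory point
    return count


# Hypothetical local features for three parts of two object images.
image_a = {"head": [1.0, 0.0], "torso": [0.0, 1.0], "legs": [1.0, 1.0]}
image_b = {"head": [1.0, 0.1], "torso": [1.0, 0.0], "legs": [1.0, 1.0]}
print(count_contradictory_points(image_a, image_b))  # prints 1 (only "torso")
```

If Euclidean distance were used instead, the comparison direction would have to be adapted, since a larger distance means a lower similarity.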
[0075] Please refer to Figure 6, which is a flowchart of another embodiment of step S13 in Figure 1. Specifically, it may include the following steps:
[0076] Step S61: Obtain a first image that does not contain the target part in the first cluster, and obtain a second image that contains the target part in the first cluster.
[0077] It is understood that the method for determining whether a target part is contained in an object image can be referred to in the aforementioned disclosed embodiments, and will not be repeated here.
[0078] Step S62: Based on the number of contradictory points between the first image and each of the second images, obtain a first weight that the first image and each of the second images belong to different objects.
[0079] In one implementation scenario, the number of contradictory points between the first image and each second image can be used as the first weight for the corresponding first image and second image; alternatively, the ratio of the number of contradictory points between the first image and each second image to the number of local images in the object image can be used as the first weight. The first weight can be determined based on the actual situation and is not specifically limited here.
[0080] Step S63: Merge the first weights to obtain the second weights.
[0081] In one implementation scenario, the second weight can be the sum of the first weights between the first image and each of the second images in the first cluster, or it can be the average of those first weights. The second weight can be determined according to the actual situation, and no specific limitation is made here.
[0082] Step S64: Determine whether the second weight is less than the second threshold; if not, proceed to step S65; otherwise, proceed to step S66.
[0083] In one implementation scenario, the second threshold can be determined based on how the first weight and the second weight are calculated. For example, when the first weight is the ratio of the number of contradictory points between the first image and each second image to the number of local images in the object image, and the second weight is the average of the first weights between the first image and each second image in the first cluster, the second threshold can be 0.6, 0.7, etc. The second threshold can be determined according to the actual situation and is not specifically limited here.
[0084] Step S65: Remove the corresponding first image.
[0085] In one implementation scenario, when the second weight is not less than the second threshold, it indicates that the first image and the second image are suspected of not belonging to the same object. The corresponding first image is then removed to reduce noise in the file cleaning process and thus improve the accuracy of file cleaning.
[0086] Step S66: Assign the corresponding first image to the second cluster.
[0087] In this embodiment, the second cluster at least includes the second images in the first cluster that contain the target part. It can be understood that the second images containing the target part are placed in the second cluster, and it is then determined whether each first image belongs to the same object as the second images. When the second weight is less than the second threshold, the first image is considered to belong to the same object as the second images, and the first image is assigned to the second cluster. Assigning the first image to the second cluster when the second weight is less than the second threshold, and removing it when the second weight is not less than the second threshold, helps reduce noise in the archive cleaning process and thus improves the accuracy of archive cleaning.
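Steps S61 to S66 can be sketched as follows. The sketch assumes the ratio form of the first weight, the average form of the second weight, and a second threshold of 0.6, all of which are only options mentioned above. The function name, the image identifiers, and the `contradiction_count` callback are hypothetical.

```python
def filter_first_cluster(first_images, second_images, contradiction_count,
                         num_parts, second_threshold=0.6):
    """Return the second cluster as a list of image identifiers.

    first_images:  images in the first cluster without the target part
    second_images: images in the first cluster with the target part
    contradiction_count(a, b): number of contradictory points between a and b
    num_parts: number of local parts compared per image pair
    """
    # The second cluster at least contains the images with the target part.
    second_cluster = list(second_images)
    for first in first_images:
        # S62: first weight = contradictory points / number of parts
        first_weights = [contradiction_count(first, second) / num_parts
                         for second in second_images]
        if not first_weights:
            continue  # assumption: nothing to compare against, so drop it
        # S63: fuse the first weights by averaging
        second_weight = sum(first_weights) / len(first_weights)
        # S64: compare with the second threshold
        if second_weight < second_threshold:
            second_cluster.append(first)  # S66: keep the first image
        # S65: otherwise the first image is removed (not added)
    return second_cluster


# Hypothetical contradictory-point counts between image pairs (4 parts each).
counts = {("x", "a"): 0, ("x", "b"): 1, ("y", "a"): 3, ("y", "b"): 4}
result = filter_first_cluster(
    first_images=["x", "y"],
    second_images=["a", "b"],
    contradiction_count=lambda first, second: counts[(first, second)],
    num_parts=4,
)
print(result)  # prints ['a', 'b', 'x']
```

In this example image "x" has a second weight of 0.125 and is kept, while image "y" has a second weight of 0.875 and is removed.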
[0088] Step S14: Based on the second type of cluster, obtain the purified archive.
[0089] In one implementation scenario, after the suspected impure archive is cleaned to obtain at least one second cluster, the second cluster containing the most object images can be selected as the purified archive. Alternatively, unlike the foregoing implementation, after the suspected impure archive is cleaned to obtain the second clusters, each second cluster can be used as a purified archive. Obtaining purified archives by selecting among the second clusters improves cleaning efficiency; in addition, being able to determine the purified archives selectively further enhances the applicability of the cleaning method.
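The two options for step S14 can be sketched in a few lines. The function name and the `keep_largest_only` flag are hypothetical; the sketch only assumes that each second cluster is a list of object images.

```python
def purified_archives(second_clusters, keep_largest_only=True):
    """Derive purified archives from the second clusters.

    keep_largest_only=True  -> only the cluster with the most object images
    keep_largest_only=False -> every second cluster becomes its own archive
    """
    if not second_clusters:
        return []
    if keep_largest_only:
        # max() returns the first largest cluster if several tie in size.
        return [max(second_clusters, key=len)]
    return [list(cluster) for cluster in second_clusters]


clusters = [["a", "b"], ["c"], ["d", "e", "f"]]
print(purified_archives(clusters))                           # [['d', 'e', 'f']]
print(purified_archives(clusters, keep_largest_only=False))  # all three clusters
```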
[0090] The above scheme identifies potentially impure files from a pool of pending files, where the object images appear to belong to different objects. These images are then clustered into several first-class clusters. Next, based on the similarity of local features in common areas of different object images within these first-class clusters, a second-class cluster is obtained. Based on this second-class cluster, purified files are obtained. This process improves the accuracy of clustering objects into first-class clusters, thus increasing the efficiency of file cleaning. Furthermore, the second-class cluster selection enhances the applicability of the file cleaning method. Therefore, it improves the accuracy of file cleaning.
[0091] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which the steps are written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.
[0092] Please refer to Figure 7, which is a schematic diagram of the framework of an embodiment of the archive cleaning apparatus of this application. The archive cleaning apparatus 70 includes a first screening module 71, an image clustering module 72, a second screening module 73, and an archive purification module 74. The first screening module 71 is used to screen suspected impure archives from a plurality of archives to be processed, wherein the object images in the suspected impure archives are suspected to belong to different objects; the image clustering module 72 is used to cluster the object images in the suspected impure archives to obtain a plurality of first clusters; the second screening module 73 is used to screen the object images in the first clusters based on the similarity between local features of the same parts in different object images in the first clusters to obtain second clusters; the archive purification module 74 is used to obtain purified archives based on the second clusters.
[0093] The above scheme, on the one hand, filters suspected impure files from a number of files to be processed, and clusters the object images in these suspected impure files to obtain several first-class clusters. This helps improve the accuracy of obtaining the first-class clusters from the object images, thereby improving the efficiency of file cleaning. On the other hand, based on the similarity between local features of the same parts in different object images within the first-class clusters, the object images in the first-class clusters are filtered to obtain second-class clusters. These second-class clusters are then further filtered based on local features, further improving the accuracy of file cleaning. Furthermore, based on the second-class clusters, purified files are selected, which helps improve the applicability of the file cleaning method. Therefore, the accuracy of file cleaning can be improved.
[0094] In some disclosed embodiments, the second filtering module 73 includes a calculation submodule and a filtering submodule. The calculation submodule is used to calculate the number of contradictory points between any two object images based on the similarity between local features of the same parts in different object images within the first cluster, where the number of contradictory points represents the number of parts in the corresponding two object images that are suspected to belong to different objects. The filtering submodule is used to filter the object images in the first cluster based on the number of contradictory points to obtain a second cluster.
[0095] Therefore, by calculating the number of contradictions between any two object images based on the similarity between local features of the same parts in different object images in the first cluster, and then filtering the object images in the first cluster based on the number of contradictions, a second cluster is obtained. This helps to improve the accuracy of the filtering results of object images in the first cluster, thereby improving the accuracy of archive cleaning.
[0096] In some disclosed embodiments, the calculation submodule includes a judgment unit, a determination unit, and a calculation unit. The judgment unit is used to determine whether the similarity between local features of the same part in any two object images is less than a first threshold; the determination unit is used to determine the part corresponding to the local feature as a contradictory point in response to the similarity between the local features being less than the first threshold; and the calculation unit is used to obtain the number of contradictory points between any two object images based on the sum of the contradictory points between them.
[0097] Therefore, by determining whether the parts corresponding to local features are contradictory points, the accuracy of identifying contradictory points is improved. Furthermore, based on the sum of contradictory points between any two object images, the number of contradictory points between the corresponding two object images is obtained, which helps to improve the accuracy of identifying the number of contradictory points between any two object images, thereby improving the accuracy of file cleaning.
[0098] In some disclosed embodiments, the filtering submodule includes an acquisition unit, a determination unit, and a filtering unit. The acquisition unit is used to acquire a first image in the first cluster that does not contain the target part, and to acquire a second image in the first cluster that contains the target part. The determination unit is used to obtain, based on the number of contradictory points between the first image and each second image, a first weight that the first image and each second image belong to different objects. The filtering unit is used to filter the object images in the first cluster based on the first weight to obtain a second cluster.
[0099] In some disclosed embodiments, the filtering unit includes a fusion subunit, a judgment subunit, a first response subunit, and a second response subunit. The fusion subunit is used to fuse the first weights to obtain a second weight; the judgment subunit is used to determine whether the second weight is less than a second threshold; the first response subunit is used to assign the corresponding first image to a second cluster in response to the second weight being less than the second threshold; wherein the second cluster at least includes second images containing the target region from the first cluster; and the second response subunit is used to remove the corresponding first image in response to the second weight being not less than the second threshold.
[0100] Therefore, by determining whether the second weight is less than the second threshold, when the second weight is less than the second threshold, the corresponding first image is assigned to the second cluster, and when the second weight is not less than the second threshold, the corresponding first image is removed. This helps to reduce noise in the document cleaning process and thus improve the accuracy of document cleaning.
[0101] In some disclosed embodiments, the image clustering module 72 includes an extraction submodule, a clustering submodule, a filtering submodule, and a merging submodule. The extraction submodule extracts features from object images in suspected impure archives to obtain a first feature; the clustering submodule clusters the object images in suspected impure archives based on the first feature to obtain several third clusters; the filtering submodule filters the object images in the third clusters based on the detection results of whether the object images in the third clusters contain target parts to obtain fourth clusters; and the merging submodule merges different fourth clusters that meet a first condition to obtain several first clusters.
[0102] Therefore, by clustering object images in suspected impure archives based on the first feature, several third-class clusters are obtained. Based on the detection results of whether the object images in the third-class clusters contain target parts, the object images in the third-class clusters are filtered to obtain fourth-class clusters, thereby improving the accuracy of the fourth-class cluster filtering. Then, different fourth-class clusters that meet the first condition are merged to obtain several first-class clusters, which helps to improve the clustering effect of the first-class clusters and minimize the differences between object images in the first-class clusters, thereby further improving the accuracy of archive cleaning.
[0103] In some disclosed embodiments, the clustering submodule includes a first determining unit, a distance calculation unit, a cluster merging unit, and a second determining unit. The first determining unit is used to classify each object image in the suspected impure file as a fifth cluster; the distance calculation unit is used to calculate a first distance between different fifth clusters based on a first feature; the cluster merging unit is used to merge the corresponding fifth clusters in response to the first distance satisfying a second condition, obtaining a first merged cluster; and the second determining unit is used to classify the first merged cluster and the unmerged fifth clusters in the suspected impure file as a third cluster.
[0104] Therefore, by calculating the first distance between different fifth-category clusters and merging the corresponding clusters when the first distance meets the second condition, it helps to improve the efficiency of merging fifth-category clusters. Then, taking the first merged cluster and the unmerged fifth-category clusters in the suspected impure files as third-category clusters helps to improve the accuracy of third-category clusters.
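The clustering described above can be sketched as follows. The application does not specify the first distance or the second condition, so this sketch assumes Euclidean distance between first features and the condition "first distance below a fixed value". The function name and the `max_distance` parameter are hypothetical, and a union-find structure is used only as one convenient way to record merges.

```python
import math


def cluster_by_first_feature(features, max_distance):
    """Group images whose first features lie closer than max_distance.

    features: dict mapping an image identifier to its first feature vector.
    Returns a list of clusters, each a list of image identifiers.
    """
    ids = list(features)
    # Each object image starts as its own fifth cluster.
    parent = {image_id: image_id for image_id in ids}

    def find(image_id):
        # Follow parent links to the cluster representative.
        while parent[image_id] != image_id:
            parent[image_id] = parent[parent[image_id]]  # path halving
            image_id = parent[image_id]
        return image_id

    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            first_distance = math.dist(features[a], features[b])
            if first_distance < max_distance:  # assumed second condition
                parent[find(a)] = find(b)      # merge the two clusters

    # Merged clusters plus unmerged singletons form the third clusters.
    clusters = {}
    for image_id in ids:
        clusters.setdefault(find(image_id), []).append(image_id)
    return list(clusters.values())


first_features = {"p1": [0.0, 0.0], "p2": [0.1, 0.0], "p3": [5.0, 5.0]}
third_clusters = cluster_by_first_feature(first_features, max_distance=1.0)
print(third_clusters)
```

Here "p1" and "p2" are merged into one cluster, and "p3" remains an unmerged cluster of its own.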
[0105] In some disclosed embodiments, the filtering submodule includes a first response unit and a second response unit. The first response unit is configured to classify the third cluster as a fourth cluster in response to the object image in the third cluster containing the target part; the second response unit is configured to discard the third cluster in response to the object image in the third cluster not containing the target part.
[0106] Therefore, by determining whether an object image contains the target part and removing the third cluster corresponding to object images that do not contain the target part, the efficiency of document cleaning can be improved.
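The filtering of third clusters can be sketched as below. The text does not say whether one image or all images in a third cluster must contain the target part; this sketch assumes that one is enough, which is consistent with first clusters later containing images without the target part. The function name and the `has_target_part` detector callback are hypothetical.

```python
def filter_third_clusters(third_clusters, has_target_part):
    """Keep the third clusters that contain the target part.

    has_target_part(image) -> bool is a detector supplied by the caller.
    Assumption: a cluster is kept if at least one of its images contains
    the target part; clusters with no such image are removed.
    """
    return [cluster for cluster in third_clusters
            if any(has_target_part(image) for image in cluster)]


# Hypothetical detection results: images whose name ends in "_face"
# are treated as containing the target part.
third = [["a_face", "b_back"], ["c_back"], ["d_face"]]
fourth = filter_third_clusters(third, lambda image: image.endswith("_face"))
print(fourth)  # prints [['a_face', 'b_back'], ['d_face']]
```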
[0107] In some disclosed embodiments, the merging submodule includes an extraction unit, a calculation unit, a merging unit, and a determination unit. The extraction unit extracts features from the target region image to obtain a second feature, and the target region image is captured from an object image in a suspected impure file. The calculation unit calculates a second distance between different fourth clusters based on the second feature. The merging unit merges the corresponding fourth clusters in response to the second distance satisfying a first condition to obtain a second merged cluster. The determination unit combines the second merged cluster with the unmerged fourth clusters as a first cluster.
[0108] Therefore, by calculating the second distance between different fourth-category clusters based on the second feature, when the second distance meets the first condition, the corresponding fourth-category clusters are merged to obtain the second merged cluster. This helps to improve the judgment of whether the object images in the merged clusters belong to the same object. Then, the second merged cluster and the unmerged fourth-category clusters are used as the first cluster, which helps to improve the accuracy of cluster merging and further improve the accuracy and stability of archive cleaning.
[0109] In some disclosed embodiments, the first screening module 71 includes an acquisition submodule, a fusion submodule, and a determination submodule. The acquisition submodule acquires a first outlier value of an anomaly factor in the file to be processed, where the anomaly factor represents the anomaly dimension of the file to be processed; the first outlier value represents the degree of anomaly in the corresponding anomaly factor dimension of the file to be processed; the fusion submodule fuses the first outlier value to obtain a second outlier value; and the determination submodule determines whether the file to be processed is a suspected impure file based on whether the second outlier value meets a third condition.
[0110] Therefore, by obtaining the first outlier of the abnormal factors in the files to be processed and fusing the first outlier to obtain the second outlier, and by determining the first outlier of multiple abnormal factors in the files to be processed, the diversity of the first outlier can be improved. Furthermore, by fusing the first outlier to determine the second outlier, the accuracy of the second outlier can be improved. Then, based on whether the second outlier meets the third condition, it can be determined whether the files to be processed are suspected impure files. This helps to improve the accuracy of screening the files to be processed, thereby improving the accuracy of obtaining suspected impure files and further improving the accuracy of file cleaning.
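The screening of suspected impure archives can be sketched as follows. The application leaves the anomaly factors, the fusion rule, and the third condition open, so this sketch assumes a weighted average as the fusion and "second outlier value not less than a threshold" as the third condition. The factor names, weights, and threshold are hypothetical.

```python
def is_suspected_impure(first_outliers, weights=None, threshold=0.5):
    """Decide whether an archive to be processed is suspected impure.

    first_outliers: dict mapping an anomaly factor name to its first
                    outlier value (assumed to lie in [0, 1]).
    weights:        optional dict of per-factor weights; equal by default.
    """
    if not first_outliers:
        return False
    if weights is None:
        weights = {factor: 1.0 for factor in first_outliers}
    total_weight = sum(weights[factor] for factor in first_outliers)
    if total_weight <= 0:
        return False
    # Fuse the first outlier values into one second outlier value.
    second_outlier = sum(first_outliers[factor] * weights[factor]
                         for factor in first_outliers) / total_weight
    # Assumed third condition: second outlier value reaches the threshold.
    return second_outlier >= threshold


# Hypothetical anomaly factors for two archives.
print(is_suspected_impure({"time_span": 0.9, "location_spread": 0.7}))  # True
print(is_suspected_impure({"time_span": 0.1, "location_spread": 0.2}))  # False
```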
[0111] In some disclosed embodiments, the archive purification module 74 includes a selection submodule and a determination submodule. The selection submodule is used to select, from the second clusters, the cluster with the most object images as the purified archive; the determination submodule is used to take each second cluster as a purified archive.
[0112] Therefore, by selecting the second type of cluster, a purified archive is obtained, which improves the cleaning efficiency of the purified archive; in addition, selectively determining the purified archive further improves the applicability of the cleaned archive.
[0113] Please refer to Figure 8, which is a schematic diagram of the framework of an embodiment of the electronic device of this application. The electronic device 80 includes a memory 81 and a processor 82 coupled to each other. The memory 81 stores program instructions, and the processor 82 is used to execute the program instructions to implement the steps in any of the above-described archive cleaning method embodiments. Specifically, the electronic device 80 may include, but is not limited to, desktop computers, laptops, servers, mobile phones, and tablet computers.
[0114] Specifically, processor 82 controls itself and memory 81 to implement the steps in any of the above-described file cleaning method embodiments. Processor 82 can also be referred to as a CPU (Central Processing Unit). Processor 82 may be an integrated circuit chip with signal processing capabilities. Processor 82 can also be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor. Furthermore, processor 82 can be implemented using integrated circuit chips.
[0115] In the above scheme, the electronic device 80 can be used to implement the steps in any of the above-described document cleaning method embodiments. On the one hand, by screening suspected impure documents from a number of documents to be processed, and clustering the object images in the suspected impure documents to obtain several first clusters, it helps to improve the accuracy of obtaining the first clusters by clustering the object images, thereby improving the efficiency of document cleaning. On the other hand, based on the similarity between local features of the same parts in different object images in the first clusters, the object images in the first clusters are screened to obtain second clusters. Then, the object images in the first clusters are screened again based on local features, further improving the accuracy of document cleaning. In addition, based on the second clusters, purified documents are selected, which helps to improve the applicability of the document cleaning method. Therefore, the accuracy of document cleaning can be improved.
[0116] Please refer to Figure 9, which is a schematic diagram of the framework of an embodiment of the computer-readable storage medium of this application. The computer-readable storage medium 90 stores program instructions 91 that can be executed by a processor. The program instructions 91 are used to implement the steps in any of the above-described embodiments of the archive cleaning method.
[0117] The above scheme, where the computer-readable storage medium 90 can be used to implement the steps in any of the above-described document cleaning method embodiments, involves two aspects. Firstly, by filtering suspected impure documents from a number of documents to be processed and clustering the object images within these suspected impure documents to obtain several first clusters, the accuracy of clustering the object images into first clusters is improved, thereby increasing the efficiency of document cleaning. Secondly, based on the similarity between local features of the same parts in different object images within the first clusters, object images within the first clusters are filtered to obtain second clusters. Furthermore, object images within the first clusters are filtered again based on local features, further improving the accuracy of document cleaning. Additionally, based on the second clusters, purified documents are selected, which helps improve the applicability of the document cleaning method. Therefore, the accuracy of document cleaning is improved.
[0118] In some embodiments, the functions or modules of the apparatus provided in this disclosure can be used to perform the methods described in the above method embodiments. The specific implementation can be referred to the description of the above method embodiments, and for the sake of brevity, it will not be repeated here.
[0119] The description of the various embodiments above tends to emphasize the differences between them. For the parts that are the same or similar, the embodiments can be referred to one another, and for the sake of brevity, they will not be repeated here.
[0120] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus implementations described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation there may be other division methods; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
[0121] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.
[0122] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0123] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods of various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0124] If the technical solution of this application involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution of this application involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed that they have agreed to the collection of their personal information; or on the personal information processing device, with clear signs / information informing users of the personal information processing rules, authorization is obtained from the individual through pop-up information or by asking the individual to upload their personal information; wherein, the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.
Claims
1. An archive cleaning method, characterized by comprising: screening suspected impure archives from a plurality of archives to be processed, wherein the object images in the suspected impure archives are suspected to belong to different objects; clustering the object images in the suspected impure archives to obtain a plurality of first clusters; filtering the object images in the first clusters based on the similarity between local features of the same parts in different object images in the first clusters, to obtain second clusters; and obtaining purified archives based on the second clusters; wherein filtering the object images in the first clusters based on the similarity between local features of the same parts in different object images in the first clusters to obtain the second clusters comprises: calculating the number of contradictory points between any two object images based on the similarity between local features of the same parts in different object images in the first clusters, wherein the number of contradictory points represents the number of parts in the corresponding two object images that are suspected to belong to different objects; and filtering the object images in the first clusters based on the number of contradictory points to obtain the second clusters; and wherein clustering the object images in the suspected impure archives to obtain the plurality of first clusters comprises: performing feature extraction on the object images in the suspected impure archives to obtain first features; clustering the object images in the suspected impure archives based on the first features to obtain a plurality of third clusters; filtering the object images in the third clusters based on detection results of whether the object images in the third clusters contain a target part, to obtain fourth clusters; and merging different fourth clusters that satisfy a first condition to obtain the plurality of first clusters.
2. The method according to claim 1, characterized in that calculating the number of contradictory points between any two object images based on the similarity between local features of the same parts in different object images in the first clusters comprises: determining whether the similarity between local features of the same part in any two object images is less than a first threshold; in response to the similarity between the local features being less than the first threshold, determining that the part corresponding to the local features is a contradictory point; and obtaining the number of contradictory points between the two object images based on the sum of the contradictory points between them.
3. The method according to claim 1, characterized in that filtering the object images in the first clusters based on the number of contradictory points to obtain the second clusters comprises: obtaining a first image in the first cluster that does not contain the target part, and obtaining second images in the first cluster that contain the target part; obtaining, based on the number of contradictory points between the first image and each of the second images, a first weight that the first image and each of the second images belong to different objects; and filtering the object images in the first cluster based on the first weights to obtain the second cluster.
4. The method according to claim 3, characterized in that filtering the object images in the first cluster based on the first weights to obtain the second cluster comprises: fusing the first weights to obtain a second weight; determining whether the second weight is less than a second threshold; in response to the second weight being less than the second threshold, assigning the corresponding first image to the second cluster, wherein the second cluster at least includes the second images in the first cluster that contain the target part; and in response to the second weight being not less than the second threshold, removing the corresponding first image.
5. The method according to claim 1, wherein clustering the object images in the suspected impure archive based on the first feature to obtain several third-type clusters comprises: taking each object image in the suspected impure archive as a fifth-type cluster; calculating, based on the first feature, a first distance between different fifth-type clusters; in response to the first distance satisfying a second condition, merging the corresponding fifth-type clusters to obtain a first merged cluster; and taking the first merged cluster and the fifth-type clusters in the suspected impure archive that were not merged as the third-type clusters.
6. The method according to claim 1, wherein filtering the object images in the third-type cluster, based on the detection result of whether the object images in the third-type cluster contain the target part, to obtain the fourth-type cluster comprises: in response to the object images in the third-type cluster containing the target part, taking the third-type cluster as a fourth-type cluster; and in response to the object images in the third-type cluster not containing the target part, removing the third-type cluster.
7. The method according to claim 1, wherein merging different fourth-type clusters that satisfy the first condition to obtain several first-type clusters comprises: performing feature extraction on a target area image to obtain a second feature, wherein the target area image is cropped from an object image in the suspected impure archive; calculating, based on the second feature, a second distance between different fourth-type clusters; in response to the second distance satisfying the first condition, merging the corresponding fourth-type clusters to obtain a second merged cluster; and taking the second merged cluster and the fourth-type clusters that were not merged as the first-type clusters.
8. The method according to claim 1, wherein filtering out suspected impure archives from a number of archives to be processed comprises: obtaining a first outlier value for each outlier factor of the archive to be processed, wherein an outlier factor represents a dimension along which the archive to be processed may be an outlier, and the first outlier value represents the degree to which the archive to be processed is an outlier in the dimension of the corresponding outlier factor; fusing the first outlier values to obtain a second outlier value; and determining, based on whether the second outlier value satisfies a third condition, whether the archive to be processed is a suspected impure archive.
9. The method according to claim 1, wherein obtaining the purified archive based on the second-type cluster comprises: selecting the second-type cluster that contains the most object images as the purified archive; or taking each second-type cluster as a purified archive.
10. An archive cleaning device, characterized by comprising: a first filtering module, used to filter out suspected impure archives from a number of archives to be processed, wherein the object images in a suspected impure archive are suspected to belong to different objects; an image clustering module, used to cluster the object images in the suspected impure archive to obtain several first-type clusters; a second filtering module, used to filter the object images in the first-type cluster, based on the similarity between local features of the same part in different object images in the first-type cluster, to obtain a second-type cluster; and an archive purification module, used to obtain a purified archive based on the second-type cluster; wherein the second filtering module is used to calculate, based on the similarity between local features of the same part in different object images in the first-type cluster, the number of contradiction points between any two object images, wherein the number of contradiction points represents the number of parts in the corresponding two object images that are suspected to belong to different objects, and to filter the object images in the first-type cluster based on the number of contradiction points to obtain the second-type cluster; and wherein the image clustering module is used to perform feature extraction on the object images in the suspected impure archive to obtain a first feature, to cluster the object images in the suspected impure archive based on the first feature to obtain several third-type clusters, to filter the object images in the third-type cluster, based on a detection result of whether the object images in the third-type cluster contain a target part, to obtain a fourth-type cluster, and to merge different fourth-type clusters that satisfy a first condition to obtain the several first-type clusters.
11. An electronic device, characterized by comprising a memory and a processor coupled to each other, the memory storing program instructions, and the processor being used to execute the program instructions to implement the archive cleaning method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized by storing program instructions executable by a processor, the program instructions being used to implement the archive cleaning method according to any one of claims 1 to 9.
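Illustrative sketch (not part of the claims). The filtering recited in claims 2 to 4 can be illustrated by the following minimal Python sketch. The claims do not fix a similarity measure, a weight formula, a fusion rule, or threshold values, so everything concrete here is an assumption made for illustration: cosine similarity, a first weight equal to the fraction of shared parts that contradict, fusion by arithmetic mean, thresholds of 0.5, and the hypothetical part names and feature vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def contradiction_count(parts_a, parts_b, first_threshold=0.5):
    """Claim 2: count the parts, present in both images, whose local features
    have a similarity below the first threshold."""
    shared = parts_a.keys() & parts_b.keys()
    return sum(1 for p in shared
               if cosine(parts_a[p], parts_b[p]) < first_threshold)

def filter_cluster(first_images, second_images,
                   first_threshold=0.5, second_threshold=0.5):
    """Claims 3 and 4: keep every second image (contains the target part) and
    keep a first image only if its fused weight is below the second threshold."""
    kept = dict(second_images)
    for name, parts in first_images.items():
        first_weights = []
        for ref in second_images.values():
            shared = parts.keys() & ref.keys()
            if shared:
                # Assumed first weight: fraction of shared parts that contradict.
                count = contradiction_count(parts, ref, first_threshold)
                first_weights.append(count / len(shared))
        # Assumed fusion: arithmetic mean of the first weights. An image with
        # no comparable part gets weight 1.0 and is therefore removed.
        second_weight = (sum(first_weights) / len(first_weights)
                         if first_weights else 1.0)
        if second_weight < second_threshold:
            kept[name] = parts
    return kept

# Hypothetical first-type cluster: one image containing the target part and
# two images without it, each described by per-part local feature vectors.
second_images = {
    "with_target": {"upper_body": [1.0, 0.0], "lower_body": [0.0, 1.0]},
}
first_images = {
    "same_object": {"upper_body": [0.9, 0.1], "lower_body": [0.1, 0.9]},
    "other_object": {"upper_body": [0.0, 1.0], "lower_body": [1.0, 0.0]},
}

second_cluster = filter_cluster(first_images, second_images)
print(sorted(second_cluster))  # ['same_object', 'with_target']
```

In this example the image named same_object has no contradiction points against the image containing the target part and is kept, while other_object contradicts it on both parts and is removed. A different weight formula or fusion rule would change which images survive, so the outcome should be read as a property of these assumed choices rather than of the claimed method.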
Citation Information
Patent Citations
Picture clustering method and device
CN103390165A