Data deduplication via associative similarity search

A Locality Sensitive Hash (LSH) algorithm and a multi-level fingerprint database address the low deduplication efficiency of existing technologies, enabling more efficient data storage management and lower storage cost.

CN122285641A · Pending · Publication Date: 2026-06-26 · GSI TECHNOLOGY INC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GSI TECHNOLOGY INC
Filing Date
2020-08-11
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing data deduplication techniques are inefficient and costly when processing slightly different data blocks, especially in email systems where there is excessive redundant data storage.

Method used

Slightly different fingerprints are created for slightly different blocks using the Locality Sensitive Hash (LSH) algorithm. A similarity search and a difference calculation in an associative memory device allow only the difference blocks between similar blocks to be stored. A fingerprint database with a multi-level structure supports efficient search and storage management.

Benefits of technology

It improves the efficiency of data deduplication and storage utilization, reduces redundant storage requirements, lowers storage costs, and increases processing throughput.

Abstract

A deduplication system includes a similarity searcher, a difference calculator, and a storage manager. The similarity searcher searches a database storing multiple locality-sensitive fingerprints for a similar fingerprint that resembles a new fingerprint of an input block. The difference calculator calculates a difference block between the input block and the similar block associated with the similar fingerprint that was found. The storage manager updates the database with the new fingerprint and, if the difference block is not empty, stores the difference block in a storage repository. A method for deduplication includes: searching a database storing multiple locality-sensitive fingerprints for a similar fingerprint that resembles a new fingerprint of an input block; if a similar fingerprint is found, calculating a difference block between the input block and the similar block associated with the similar fingerprint; updating the database with the new fingerprint; and, if the difference block is not empty, storing the difference block in a storage unit.

Description

[0001] This application is a divisional application of the Chinese patent application filed on August 11, 2020, with application number 202010800152.5 and invention title "Data Deduplication via Association Similarity Search".

Cross-References to Related Applications

[0002] This application claims priority to U.S. Provisional Patent Application No. 62/888,580, filed August 19, 2019, and U.S. Provisional Patent Application No. 62/978,336, filed February 19, 2020, both of which are incorporated herein by reference.

Technical Field

[0003] This invention relates generally to techniques for data deduplication, and more specifically to improved deduplication techniques.

Background Technology

[0004] In computing, data deduplication is a technique used to eliminate duplicate copies of data. This technique is used to improve storage utilization and is particularly important when a large portion of stored data is redundant. For example, when an employee sends an email with attachments to a large email list of employees, the same data can be stored redundantly. In this case, the email system typically stores all copies of the attachments.

[0005] During deduplication, unique data blocks are identified and stored. New blocks are compared to the stored blocks, and whenever a match is found, the redundant block is replaced by a small reference pointing to the stored block. This reduces the amount of data that must be stored, assuming the same block might appear dozens, hundreds, or even thousands of times.

[0006] The purpose of deduplication is to examine large amounts of data and identify identical large sections (e.g., the entire file or large portions of a file) and replace these sections with shared copies. For example, a typical existing email system might contain 100 instances of identical 1 MB (megabyte) file attachments. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored, and subsequent instances are referenced back to the saved copy, resulting in a deduplication ratio of approximately 100 to 1.

[0007] One form of data deduplication is performed by comparing complete blocks of data to detect duplicates. The block size can be defined by physical layer constraints (e.g., a 4 KB block size) or can be as large as a complete file.

[0008] Other forms of deduplication are performed by first mapping each block to a much shorter bit string (called its fingerprint) and comparing that fingerprint, which uniquely identifies the original data for all practical purposes. It is understood that the fingerprint is capable of capturing the identity of a block; that is, the probability of two distinct blocks producing the same fingerprint is negligible.

[0009] Fingerprinting can be performed using any known algorithm, including the Secure Hash Algorithms (SHA) (e.g., SHA-1). The SHA-1 function operates on inputs of any size and creates a 20-byte (160-bit) output. It operates such that small changes between input blocks result in seemingly random changes in the 20-byte output; therefore, slight changes in the input blocks can produce significant differences in the fingerprint.
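
By way of illustration only, the following sketch uses Python's standard `hashlib` module to hash two sample blocks that differ in a single byte and to count how many digest bits differ. The sample data and the helper names `sha1_fingerprint` and `bit_difference` are assumptions made for this example.

```python
import hashlib

def sha1_fingerprint(block: bytes) -> bytes:
    """Return the SHA-1 digest of a block (20 bytes, i.e. 160 bits)."""
    return hashlib.sha1(block).digest()

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two sample blocks of equal length that differ in exactly one byte.
block_a = b"quarterly report, revision 1" * 100
block_b = b"quarterly report, revision 2" + block_a[28:]

fp_a = sha1_fingerprint(block_a)
fp_b = sha1_fingerprint(block_b)
changed_bits = bit_difference(fp_a, fp_b)
print(len(fp_a), "byte digest;", changed_bits, "of 160 bits differ")
```

For a well-behaved cryptographic hash, roughly half of the digest bits are expected to differ, so the distance between two such fingerprints says nothing about how similar the underlying blocks are.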

[0010] Existing systems that provide inline data deduplication (performing deduplication as data enters the system rather than at a later stage) divide the data into blocks, create a fingerprint for each block, and compare a newly entered fingerprint with fingerprints stored in a fingerprint table. If the comparison indicates that an identical fingerprint exists in the database, it means that the same block as the new block has already been stored, and therefore the entry in the database for the found fingerprint is updated to include the new block information. If the comparison fails (i.e., there is no entry in the fingerprint database that matches the new fingerprint), the new fingerprint is stored in the database, and the block is stored in a data repository.
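
The inline flow described above can be sketched as follows. This is a minimal illustration, not an existing product's implementation: the class `ExactDeduplicator`, its fields, and the use of SHA-1 as the fingerprint function are assumptions made for the example.

```python
import hashlib

class ExactDeduplicator:
    """Toy inline deduplicator keyed on exact SHA-1 fingerprints (illustrative only)."""

    def __init__(self) -> None:
        self.fingerprint_table: dict[bytes, int] = {}  # fingerprint -> block id
        self.ref_counts: dict[int, int] = {}           # block id -> number of references
        self.repository: list[bytes] = []              # stored unique blocks

    def add_block(self, block: bytes) -> int:
        """Store the block only if its fingerprint is new; return the block id."""
        fp = hashlib.sha1(block).digest()
        block_id = self.fingerprint_table.get(fp)
        if block_id is None:
            # Comparison failed: store the block and register its fingerprint.
            block_id = len(self.repository)
            self.repository.append(block)
            self.fingerprint_table[fp] = block_id
            self.ref_counts[block_id] = 0
        # In both cases the entry is updated to account for the new block.
        self.ref_counts[block_id] += 1
        return block_id

dedup = ExactDeduplicator()
attachment = b"\x00" * 4096
ids = [dedup.add_block(attachment) for _ in range(100)]  # 100 identical copies
edited_id = dedup.add_block(b"\x01" + attachment[1:])    # one byte changed
print(len(dedup.repository), dedup.ref_counts)
```

In this sketch the 100 identical copies occupy a single repository entry, whereas the copy with one changed byte is stored in full, which is the limitation of exact-match fingerprints for slightly different blocks.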

[0011] Existing systems that provide offline deduplication post-process the stored data to find and release duplicate blocks that have already been stored, without affecting the throughput of the online process. Offline deduplication systems typically perform the same comparison operations, but release duplicate blocks after the fact instead of preventing their storage in the first place.

[0012] Inline deduplication requires less storage and network traffic than offline deduplication because duplicate data is never stored or moved; however, hashing can be computationally expensive and may impact processing throughput.

Summary of the Invention

[0013] According to a preferred embodiment of the present invention, a deduplication system is provided, comprising a similarity searcher, a difference calculator, and a storage manager. The similarity searcher searches a fingerprint database storing multiple locality-sensitive fingerprints for a similar fingerprint that resembles a new fingerprint of an input block. If a similar fingerprint is found, the difference calculator calculates a difference block between the input block and the similar block associated with the similar fingerprint. The storage manager updates the fingerprint database with the new fingerprint and, if the difference block is not empty, stores the difference block in a storage unit.

[0014] Furthermore, according to a preferred embodiment of the invention, if no similar fingerprint is found, the storage manager stores the input block.

[0015] Furthermore, according to a preferred embodiment of the invention, the storage manager stores the fingerprints in columns of an associative memory device, and the similarity searcher performs the search within the associative memory device.

[0016] Furthermore, according to a preferred embodiment of the invention, the system also includes a fingerprint creator that creates the new fingerprint using a Locality Sensitive Hash (LSH) algorithm, which produces slightly different fingerprints for slightly different input blocks.

[0017] Additionally, according to a preferred embodiment of the invention, the fingerprint database is arranged in a multi-level structure, wherein higher levels include the centroids of clusters in lower levels, the centroids being calculated from the fingerprints, and the lowest level includes the fingerprints of blocks.

[0018] Furthermore, according to a preferred embodiment of the invention, the storage manager stores the highest-level fingerprints in columns of the associative memory device, wherein the search at the highest level is performed within the associative memory device and the search at the lower levels is performed by a CPU.

[0019] Furthermore, according to a preferred embodiment of the present invention, the system further includes: a block splitter for splitting an input block into smaller sub-blocks; a collision-resistant fingerprint creator for creating a collision-resistant fingerprint for each of the sub-blocks; and an exact searcher for searching a database of collision-resistant fingerprints for matching fingerprints. The storage manager updates the fingerprint database with the collision-resistant fingerprints and stores the sub-blocks for which no matching fingerprints were found.

[0020] According to a preferred embodiment of the present invention, a method for deduplication is provided. The method includes: searching a fingerprint database storing multiple locality-sensitive fingerprints for a similar fingerprint that resembles a new fingerprint of an input block; if a similar fingerprint is found, calculating a difference block between the input block and the similar block associated with the similar fingerprint; updating the fingerprint database with the new fingerprint; and, if the difference block is not empty, storing the difference block in a storage unit.

[0021] Furthermore, according to a preferred embodiment of the present invention, the storage step further includes: if no similar fingerprint is found, storing the input block.

[0022] Furthermore, according to a preferred embodiment of the invention, the method further includes loading the fingerprints into columns of an associative memory device, and the searching step is performed within the associative memory device.

[0023] Additionally, according to a preferred embodiment of the invention, the method includes creating the new fingerprint using a Locality Sensitive Hash (LSH) algorithm, which produces slightly different fingerprints for slightly different input blocks.

[0024] Furthermore, according to a preferred embodiment of the present invention, the fingerprint database is a hierarchical database arranged in a multi-level structure, wherein higher levels include the centroids of clusters in lower levels, which are calculated based on fingerprints, and the lowest level includes the fingerprints of blocks.

[0025] Additionally, according to a preferred embodiment of the invention, the loading includes loading the highest-level fingerprints into columns of the associative memory device, and the searching includes performing the search in the associative memory device at the highest level and by the CPU at the lower levels.

[0026] Furthermore, according to a preferred embodiment of the present invention, the method further includes: splitting the input block into smaller sub-blocks; creating a collision-resistant fingerprint for each of the sub-blocks; and performing an exact search to find an exact match in the fingerprint database for each of the collision-resistant fingerprints. The method also includes: updating the fingerprint database with the collision-resistant fingerprints; and storing the sub-blocks for which no matching fingerprints were found.
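
By way of illustration only, the sub-block steps can be sketched as follows. The sub-block size, the choice of SHA-256 as the fingerprint function, and the names `split_block` and `store_sub_blocks` are assumptions made for this example and are not specified by the source.

```python
import hashlib

SUB_BLOCK_SIZE = 1024  # hypothetical sub-block size in bytes

def split_block(block: bytes, size: int = SUB_BLOCK_SIZE) -> list:
    """Split an input block into consecutive sub-blocks of at most `size` bytes."""
    return [block[i:i + size] for i in range(0, len(block), size)]

def store_sub_blocks(block: bytes, table: dict, repository: list) -> list:
    """Return repository indices for the block's sub-blocks, storing only unseen ones."""
    refs = []
    for sub in split_block(block):
        fp = hashlib.sha256(sub).digest()  # fingerprint function assumed for this sketch
        if fp not in table:                # exact search found no matching fingerprint
            table[fp] = len(repository)    # update the fingerprint database
            repository.append(sub)         # store the unmatched sub-block
        refs.append(table[fp])
    return refs

table, repository = {}, []
first = b"A" * 1024 + b"B" * 1024 + b"C" * 1024 + b"D" * 1024
second = b"A" * 1024 + b"X" * 1024 + b"C" * 1024 + b"D" * 1024
refs_first = store_sub_blocks(first, table, repository)
refs_second = store_sub_blocks(second, table, repository)
print(refs_first, refs_second, len(repository))
```

Only the one sub-block of the second input that has no exact match is added to the repository; the other three are represented by references to sub-blocks already stored.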

[0027] Furthermore, according to a preferred embodiment of the present invention, the method further includes: performing an exact search between a fingerprint of the input block and a database of fingerprints created using a collision-resistant algorithm; and performing the similarity search if no exact match is found.

Description of the Drawings

[0028] The subject matter considered to be the invention is specifically pointed out and explicitly claimed in the concluding section of the specification. However, the invention, its objects, features, and advantages, relating to both the organization and the method of operation, can be best understood by referring to the following detailed description, in conjunction with the accompanying drawings.

[0029] Figure 1 is a schematic diagram of a process implemented by a deduplication system constructed and operated according to an embodiment of the present invention;

[0030] Figures 2A and 2B are schematic diagrams of a deduplication system, constructed and operated according to embodiments of the present invention, that implements the process of Figure 1;

[0031] Figure 3 is a schematic diagram of an alternative embodiment of a fingerprint database organized in a hierarchical structure, constructed and operated according to embodiments of the present invention;

[0032] Figures 4A, 4B and 4C are schematic diagrams of a deduplication system, constructed and operated according to embodiments of the present invention, that utilizes the hierarchical database of Figure 3;

[0033] Figure 5 is a schematic diagram of a process implemented by the system of Figures 4A, 4B and 4C, constructed and operated according to embodiments of the present invention;

[0034] Figure 6 is a schematic diagram of a hierarchical storage manager, constructed and operated according to embodiments of the present invention, that manages the hierarchical database of Figure 3; and

[0035] Figures 7A, 7B and 7C are schematic diagrams of a deduplication system that processes sub-blocks, constructed and operated according to an embodiment of the present invention.

[0036] It will be recognized that, for the sake of simplicity and clarity, the elements shown in the accompanying drawings are not necessarily drawn to scale. For example, for clarity, the dimensions of some elements may be enlarged relative to others. Furthermore, where deemed appropriate, reference numerals may be repeated between figures to indicate corresponding or similar elements.

Detailed Implementation

[0037] Numerous specific details are set forth in the following detailed description in order to provide a thorough understanding of the invention. However, those skilled in the art will understand that the invention can be practiced without these specific details. In other instances, well-known methods, processes, and components have not been described in detail so as not to obscure the invention.

[0038] The applicant has recognized that the performance and throughput of inline deduplication operations can be enhanced by improving both the fingerprint creation and fingerprint comparison steps. The applicant has also recognized that storage efficiency can be improved by storing only the relative changes between similar blocks instead of storing the modified blocks in their entirety (e.g., versions of documents, source code, or images/videos with only minor changes).

[0039] The applicant has further recognized that fingerprinting methods (e.g., Locality Sensitive Hash (LSH) algorithms) that provide slightly different outputs when operating on slightly different blocks can be used to provide slightly different fingerprints for slightly different blocks, and thus provide a way to distinguish between identical, similar, and dissimilar blocks (rather than distinguishing only between identical and dissimilar blocks as existing deduplication methods do). The applicant has further recognized that methods for searching similar fingerprints (e.g., Hamming distance or Tanimoto distance) can be used to compare fingerprints and find the most similar blocks.

[0040] Embodiments of the present invention can improve the performance and storage efficiency of deduplication operations by using the LSH algorithm to create similar fingerprints for similar blocks, and by storing the fingerprints in a memory device capable of associative processing, which can perform an efficient similarity search between a new fingerprint and the stored fingerprints to find a similar fingerprint, if one exists. The new fingerprint can then be stored, and either the entire block (when it is dissimilar to every stored block) or only the changes between the new block and the already stored similar block (indicated by the detected similar fingerprint) can be stored in a data repository. Further improvements can be achieved by storing the fingerprints hierarchically in columns of the associative memory device, as described below.

[0041] It is noted that embodiments of the present invention may use memory devices such as those described in U.S. Patent No. 9,558,812, which is assigned to the common assignee of the present invention and is incorporated herein by reference. It is also noted that the similarity search may be similar to that described in U.S. Patent Publication No. 2018/0341642, which is likewise assigned to the common assignee of the present invention and is incorporated herein by reference.

[0042] Reference is now made to Figure 1, which is a schematic diagram of a process 100 implemented by a system constructed according to an embodiment of the present invention.

[0043] In step 110, the system can receive a new block 101 and create a new fingerprint for the new block using the LSH algorithm. In step 120, the system can perform a similarity search between the new fingerprint and fingerprints stored in the Locality Sensitive Fingerprint Database. In step 130, the system can check if a similar fingerprint has been found, where a similar fingerprint can be a fingerprint whose distance from the created fingerprint is less than a predefined threshold. If a similar fingerprint is found, in step 140, the system can calculate the difference between the new block 101 and the stored block associated with the located similar fingerprint, update the fingerprint information in the fingerprint database, and store the calculated difference in a difference data repository. If no similar fingerprint is found, in step 150, the system can store the new block in the data repository and add the new fingerprint to the fingerprint database.
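
Steps 110 to 150 can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_fingerprint` is a deliberately crude stand-in for an LSH algorithm, and `THRESHOLD`, `byte_diff`, and `Process100` are hypothetical names that do not appear in the source.

```python
from dataclasses import dataclass, field

THRESHOLD = 4  # hypothetical: fingerprints closer than this many bits count as similar

def toy_fingerprint(block: bytes) -> int:
    """Stand-in for an LSH fingerprint: bit (b % 64) is set for each byte value b in the block."""
    fp = 0
    for b in set(block):
        fp |= 1 << (b % 64)
    return fp

def distance(a: int, b: int) -> int:
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")

def byte_diff(old: bytes, new: bytes):
    """Changed positions within the common length, the tail of `new`, and its length."""
    common = min(len(old), len(new))
    changes = [(i, new[i]) for i in range(common) if old[i] != new[i]]
    return changes, new[common:], len(new)

@dataclass
class Process100:
    records: list = field(default_factory=list)  # (fingerprint, id of associated stored block)
    blocks: list = field(default_factory=list)   # block data repository
    diffs: list = field(default_factory=list)    # difference data repository

    def ingest(self, block: bytes) -> str:
        new_fp = toy_fingerprint(block)                                  # step 110
        best = min(self.records, key=lambda r: distance(r[0], new_fp),
                   default=None)                                         # step 120
        if best is not None and distance(best[0], new_fp) < THRESHOLD:   # step 130
            block_id = best[1]                                           # step 140
            self.records.append((new_fp, block_id))
            if block == self.blocks[block_id]:
                return "duplicate"  # empty difference, nothing else to store
            self.diffs.append((block_id, byte_diff(self.blocks[block_id], block)))
            return "stored difference"
        self.blocks.append(block)                                        # step 150
        self.records.append((new_fp, len(self.blocks) - 1))
        return "stored block"

store = Process100()
base = b"The quick brown fox jumps over the lazy dog. " * 20
edited = base.replace(b"lazy", b"hazy", 1)  # one changed byte
unrelated = bytes(range(200, 256)) * 16
outcomes = [store.ingest(b) for b in (base, edited, base, unrelated)]
print(outcomes)
```

In this sketch the edited block adds a one-entry difference rather than a second full copy, and only the two dissimilar blocks are stored whole.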

[0044] Reference is now made to Figure 2A, which is a schematic diagram of a deduplication system 200 constructed and operated according to an embodiment of the present invention. The deduplication system 200 includes a fingerprint database 205 for storing fingerprints of blocks, an associative memory device 210, a block data repository 270 for storing entire data blocks, and a difference data repository 275 for storing the changes between stored blocks and slightly different blocks.

[0045] The associative memory device 210 includes a memory array 220, a locality-sensitive fingerprint creator 230, a similarity searcher 240, a difference calculator 250, and a storage manager 260.

[0046] The memory array 220 includes multiple columns for storing records 221, each record representing a block. Each record 221 includes a fingerprint 222 (the key for a specific block) and a value 223. The fingerprint 222 captures the identity of the block. The value 223 may contain the location, in the block data repository 270, of a block that is similar or identical to the represented block; when the represented block differs slightly from that stored block, the location, in the difference data repository 275, of a difference block holding the changes between the stored block and the block represented by record 221; and other data.

[0047] The locality-sensitive fingerprint creator 230 can receive a new block 101 and can create a new LSH fingerprint 202 using the LSH algorithm. As mentioned above, the LSH algorithm can create slightly different outputs for slightly different inputs, so the locality-sensitive fingerprint creator 230 can create slightly different fingerprints for slightly different blocks.
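
The source does not name a specific LSH algorithm. By way of illustration only, the sketch below uses SimHash, one well-known locality-sensitive scheme, over overlapping byte windows; the fingerprint width, the window size, and the function names are assumptions made for this example.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width
SHINGLE_SIZE = 8       # assumed length of the overlapping byte windows

def simhash(block: bytes) -> int:
    """SimHash: each window votes +1 or -1 per bit; positive tallies become 1-bits."""
    tallies = [0] * FINGERPRINT_BITS
    for i in range(max(1, len(block) - SHINGLE_SIZE + 1)):
        shingle = block[i:i + SHINGLE_SIZE]
        h = int.from_bytes(hashlib.blake2b(shingle, digest_size=8).digest(), "big")
        for bit in range(FINGERPRINT_BITS):
            tallies[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(FINGERPRINT_BITS) if tallies[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

base = b"The quick brown fox jumps over the lazy dog. " * 20
edited = base.replace(b"lazy", b"hazy", 1)                     # one changed byte
unrelated = bytes((i * 37 + 11) % 256 for i in range(len(base)))

near = hamming(simhash(base), simhash(edited))
far = hamming(simhash(base), simhash(unrelated))
print("edited block distance:", near, "unrelated block distance:", far)
```

Because a one-byte edit changes only a few of the windows that vote, the edited block's fingerprint is expected to stay close to the original, while an unrelated block lands far away.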

[0048] The similarity searcher 240 can compare the new LSH fingerprint 202 with the fingerprint 222 (of record 221) stored in the memory array 220. If the similarity searcher 240 finds a similar fingerprint 222x (fingerprint 222 of record 221x), then the similarity searcher 240 acquires the similar block 203 associated with fingerprint 222x. If the similarity searcher 240 finds more than one similar fingerprint 222x, then the similarity searcher 240 selects the fingerprint that is most similar to the new block 101. If the similarity searcher 240 does not find any similar fingerprints, then the similarity searcher 240 indicates that no similar fingerprints were found.

[0049] The similarity searcher 240 can use any known distance calculation (e.g., the Hamming distance or Tanimoto distance mentioned above) and select the fingerprint 222x with the smallest distance value as the most similar fingerprint. Fingerprint 222x is considered similar to the new LSH fingerprint 202 when the calculated distance between the two fingerprints is less than a predetermined value (threshold); otherwise, no similar fingerprint 222x is considered to have been found.
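
The two distance measures and the threshold test can be sketched as follows for fingerprints held as Python integers. The function names and the sample values are assumptions made for this example, and the sequential loop merely stands in for the search that the source performs inside the memory device.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def tanimoto_distance(a: int, b: int) -> float:
    """1 - |a AND b| / |a OR b| over the set bits; 0.0 for identical fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return 1.0 - bin(a & b).count("1") / union

def most_similar(new_fp, stored, threshold, distance=hamming_distance):
    """Return (index, distance) of the closest stored fingerprint below the threshold, else None."""
    best = None
    for index, fp in enumerate(stored):
        d = distance(new_fp, fp)
        if d < threshold and (best is None or d < best[1]):
            best = (index, d)
    return best

stored = [0b11110000, 0b10101010, 0b00001111]
hit = most_similar(0b11110001, stored, threshold=3)   # one bit away from stored[0]
miss = most_similar(0b01010101, stored, threshold=3)  # nothing within the threshold
print(hit, miss)
```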

[0050] If the similarity searcher 240 locates a similar fingerprint 222x, the similar block 203 can be retrieved from the block data repository 270. The difference calculator 250 can calculate the differences between the new block 101 and the similar block 203 and create a difference block 204 containing information about those differences, so that the new block 101 can be recovered from the similar block 203 and the difference block 204. The difference block 204 is typically smaller than the new block 101, so storing it takes less space than storing the entire block. When the data consists of documents, the difference calculator 250 can use an edit-distance method to create the difference block 204, which then includes the edit instructions that must be executed on the similar block 203 to generate the new block 101.
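
By way of illustration only, a difference block made of edit instructions can be sketched with Python's standard `difflib` module. The tuple layout and the names `make_difference_block` and `apply_difference_block` are assumptions made for this example; the source does not prescribe a format.

```python
from difflib import SequenceMatcher

def make_difference_block(similar: str, new: str) -> list:
    """Edit instructions that turn `similar` into `new`; empty if the texts are identical."""
    ops = SequenceMatcher(a=similar, b=new, autojunk=False).get_opcodes()
    # Keep only the non-matching spans, together with the replacement text.
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_difference_block(similar: str, diff: list) -> str:
    """Rebuild the new block from the similar block and the edit instructions."""
    out, cursor = [], 0
    for _tag, i1, i2, replacement in diff:
        out.append(similar[cursor:i1])  # unchanged text before the edit
        out.append(replacement)         # inserted or replacing text ("" for a delete)
        cursor = i2                     # skip the deleted or replaced span
    out.append(similar[cursor:])
    return "".join(out)

similar = "Deduplication stores each unique block once."
new = "Deduplication stores every unique block only once."
diff = make_difference_block(similar, new)
rebuilt = apply_difference_block(similar, diff)
print(diff)
print(rebuilt == new)
```

An empty list plays the role of an empty difference block, in which case nothing needs to be written to the difference repository.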

[0051] Storage manager 260 can process new block 101, difference block 204, and new LSH fingerprint 202. Storage manager 260 can store new block 101 in block data repository 270, store difference block 204 in difference data repository 275, create new record 221 based on the new LSH fingerprint 202 and information about the location of related blocks (similar blocks in block data repository 270 and difference block 204 in difference data repository 275) associated with new block 101, and insert new record 221 into fingerprint database 205.

[0052] Figure 2B provides an alternative view of system 200, in which the same reference numerals indicate matching elements.

[0053] The applicant has further recognized that the similarity searcher 240 and storage manager 260 can provide enhanced performance when records 221 are arranged hierarchically. When records 221 are organized in a flat layout, the similarity searcher 240 may need to compare the LSH fingerprint 202 of a new block 101 with all previously created fingerprints 222 stored in records 221 to find the most similar fingerprint, which is an expensive operation when the number of fingerprints is large (e.g., when petabytes of data are stored). However, if storage manager 260 manages records 221 in clusters arranged in a hierarchical structure, the similarity searcher 240 can search only the relevant cluster instead of the entire database, and thus improve search efficiency. The number of levels in the hierarchy can be determined by the expected size of the database or by any other characteristic that may affect storage size and/or search efficiency. It is understood that multi-level hierarchical structures of keys can also be used by other applications that require efficient searches.

[0054] Reference is now made to Figure 3, which is a schematic diagram of an alternative embodiment of a fingerprint database 305 organized in a three-level hierarchical structure. In this example, the fingerprint records 221 can be partitioned such that a first, highest level 310 stores 256 K fingerprint records 221, each being the centroid of a cluster of 1 K fingerprint records 221 stored in a second, intermediate level 320, and each fingerprint record 221 in the second level 320 being the centroid of a cluster of 1 K fingerprints in a third, lowest level 330.

[0055] Level 310, which stores the first-level centroids, is referred to herein as the highest level of the hierarchical database 305, and it stores the maximum number of fingerprints. Level 330, which stores the actual fingerprints, is referred to herein as the lowest level. Intermediate levels 320, located between the highest level 310 and the lowest level 330, may store the centroids of intermediate levels. It is understood that the hierarchical database 305 may include any number of intermediate levels 320, and is not limited to a single intermediate level.

[0056] It can be recognized that the centroid in each level comprises a fingerprint, which represents a group of fingerprints in the next level, where all members of the group are similar to each other and different from fingerprints in other groups. The centroid can be calculated as the center of the cluster it represents.
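
The source states only that the centroid is the center of the cluster it represents. For binary fingerprints, one possible choice, assumed here purely for illustration, is a bitwise majority vote, which minimizes the total Hamming distance to the cluster members. The function name `centroid` and the sample values are hypothetical.

```python
def centroid(fingerprints: list, bits: int = 64) -> int:
    """Bitwise majority vote: bit i is set when more than half of the cluster has it set."""
    result = 0
    for i in range(bits):
        ones = sum((fp >> i) & 1 for fp in fingerprints)
        if 2 * ones > len(fingerprints):
            result |= 1 << i
    return result

cluster = [0b1111_0000, 0b1111_0001, 0b1011_0000]
c = centroid(cluster, bits=8)
print(format(c, "08b"))
```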

[0057] It can be recognized that the number of fingerprint records 221 stored in this hierarchy is 256 K × 1 K × 1 K, which represents 1 PB of data stored in data repositories 270 and 275. As mentioned above, loading fingerprint records 221 into the associative memory device 210 enables efficient similarity searches to be performed.
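
The arithmetic behind these figures can be checked with the short sketch below. The 4 KB block size is an assumption (it is the example block size given in the background); with it, the record count corresponds to exactly 1 PB. The comparison counts assume that one cluster is searched per level.

```python
K = 1024
top_level = 256 * K   # centroids in the highest level
cluster_size = 1 * K  # records per cluster in the intermediate and lowest levels
block_size = 4 * K    # assumed block size in bytes (4 KB)

total_fingerprints = top_level * cluster_size * cluster_size
total_bytes = total_fingerprints * block_size

comparisons_flat = total_fingerprints                       # flat layout: compare with every record
comparisons_hier = top_level + cluster_size + cluster_size  # one cluster per level

print(total_fingerprints == 2 ** 38, total_bytes == 2 ** 50)
print(comparisons_flat // comparisons_hier)
```

Under these assumptions a hierarchical search touches roughly 264 thousand records instead of about 275 billion.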

[0058] Reference is now made to Figure 4A, which is a schematic diagram of a deduplication system 400 constructed and implemented according to an embodiment of the present invention. The deduplication system 400 is an alternative embodiment of the deduplication system 200, which includes a hierarchical fingerprint database 305 and an associative memory device 410 for performing similarity searches on the hierarchical fingerprint database 305. The deduplication system 400 also includes a block data repository 270 and a difference data repository 275.

[0059] The associative memory device 410 includes a hierarchical storage manager 460 for processing the fingerprint database 305, a memory array 220, a locality-sensitive fingerprint creator 230, a similarity searcher 240, and a difference calculator 250.

[0060] The hierarchical storage manager 460 can perform multi-level similarity searches on the hierarchical fingerprint database 305, which can have any number of levels. In the example of Figure 4A, three levels are shown. Each level can be stored in a storage unit within the fingerprint database 305 and can be downloaded to the memory array 220 when the similarity searcher 240 needs to perform a search on that level. It can be appreciated that when the clusters in the lower levels (intermediate and lowest levels) of the hierarchical fingerprint database 305 are small and contain approximately 1 K fingerprints, the similarity searcher 240 can use the CPU to perform similarity searches on these levels instead of downloading them to the memory array 220 and performing the similarity searches in the associative memory.

[0061] As mentioned above, each fingerprint record 221 includes a key and a value, and the similarity searcher 240 can use the key portion to search for similar keys in the memory array 220. The values at all levels except the last include information about the next level of the record; if a search is successful at the current level, the search can continue at that next level, since success indicates that similar blocks may have been stored in the data repository 270. It can be understood that the keys at all levels have the same structure. At higher levels (the highest and intermediate levels), the key is computed as the centroid of the keys in the next level, and at the lowest level, the key is the actual fingerprint computed for a specific block.

[0062] The similarity searcher 240 can complete its search when a similar fingerprint 222x is found at the last level or when no similar fingerprint is detected (at any level). If the distance between the fingerprint 222x in memory array 220 and the new LSH fingerprint 202 is less than a threshold, the similarity searcher 240 can determine that the fingerprint 222x in memory array 220 is similar to the new LSH fingerprint 202. Different thresholds can be defined for each level. If the similarity searcher 240 does not find any similar fingerprint 222x, it indicates that no similar fingerprint was found. If the similarity searcher 240 finds a similar fingerprint 222x at a level that is not the last level, the hierarchical storage manager 460 can download the fingerprints of the next level to memory array 220 and allow the similarity searcher 240 to perform a search in the newly downloaded fingerprints.

[0063] If the similarity searcher 240 finds a similar fingerprint 222x in the last level, the difference calculator 250 can calculate the difference block 204, the hierarchical storage manager 460 can store the difference block 204 in the difference data repository 275, and update the information associated with the new LSH fingerprint 202 to include the location of the similar block 203 associated with fingerprint 222x, as well as the location of the difference block 204 containing the difference.

[0064] Figures 4B and 4C together provide an alternative view of system 400, in which the same reference numerals indicate matching elements.

[0065] Reference is now made to Figure 5, which is a schematic diagram of a process 500 implemented by system 400, constructed and operated according to an embodiment of the present invention. In process 500, records 221 can be stored in the three levels of the hierarchical fingerprint database 305, wherein records in each level except the last level in the hierarchy can store the centroids of the next level, and the last level can store fingerprints pointing to the actual relevant blocks in block data repository 270 and difference data repository 275.

[0066] In step 510, the associative memory device 410 can receive the new block 101 and set the search level to the first level of the hierarchical fingerprint database 305, and the locality-sensitive fingerprint creator 230 can create a new LSH fingerprint 202 for the new block 101. In step 520, the hierarchical storage manager 460 can load the records 221 from the search level of the hierarchical fingerprint database 305 into the memory array 220, and the similarity searcher 240 can perform a similarity search between the new LSH fingerprint 202 and the records 221 stored in the memory array 220. In step 530, the similarity searcher 240 can determine whether a fingerprint 222x similar to the new LSH fingerprint 202 has been found. If no similar fingerprint has been found, the associative memory device 410 can proceed to step 570; however, if fingerprint 222x has been found, the search level is checked in step 540. If the similar fingerprint 222x has been found at the last level of the fingerprint database 305, the associative memory device 410 can proceed to step 550; otherwise, step 560 can be executed.

[0067] If a similar fingerprint 222x is found at the final level, in step 550, the difference calculator 250 can retrieve the similar block 203 (associated with the similar fingerprint 222x), calculate the difference between the new block 101 and the similar block 203, and create a difference block 204. The hierarchical storage manager 460 can store the difference block 204 in the difference data repository 275 and store a new record 221 (which may include the new LSH fingerprint 202, the position of the similar block 203 in the block data repository 270, and the position of the difference block 204 in the difference data repository 275) in the current search level of the hierarchical fingerprint database 305. If the cluster becomes unbalanced, the hierarchical storage manager 460 can optionally rebalance the fingerprint database 305.

[0068] If a similar fingerprint 222x is found in an intermediate level, the search level is advanced to the next level in step 560, and the associative memory device 210 can return to step 520 to process the new level in the fingerprint database 305. When no similar fingerprint 222x has been found, step 570 is reached, where the hierarchical storage manager 460 can store the new block 101 in the block data repository 270 and update the fingerprint database 305 with a new record 221 including the LSH fingerprint 202. If the fingerprint database 305 becomes unbalanced, the hierarchical storage manager 460 can optionally rebalance the fingerprint database 305.
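The level-by-level search of process 500 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: fingerprints are modeled as Python integers compared by Hamming distance, each non-final record holds a centroid and its next-level cluster, and the names `Record`, `hamming`, `find_similar`, `search`, and `THRESHOLD` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

THRESHOLD = 3  # assumed: fingerprints closer than this many bits count as similar


@dataclass
class Record:
    key: int                                   # centroid, or block fingerprint at the last level
    children: List["Record"] = field(default_factory=list)  # next-level cluster
    block_location: Optional[int] = None       # set only on last-level records


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def find_similar(records: List[Record], fingerprint: int) -> Optional[Record]:
    """Closest record within THRESHOLD at one level, else None."""
    best = min(records, key=lambda r: hamming(r.key, fingerprint), default=None)
    if best is not None and hamming(best.key, fingerprint) < THRESHOLD:
        return best
    return None


def search(first_level: List[Record], fingerprint: int) -> Optional[Record]:
    """Walk the hierarchy; return the matching last-level record or None."""
    level = first_level
    while True:
        hit = find_similar(level, fingerprint)
        if hit is None:
            return None                # no match: caller stores the whole block (step 570)
        if hit.block_location is not None:
            return hit                 # last level: caller stores a difference block (step 550)
        level = hit.children           # intermediate level: descend (step 560)
```

A `None` result corresponds to the branch in which the whole new block is stored; a returned record corresponds to the branch in which only a difference block is stored.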

[0069] As mentioned above, the fingerprint database 305 can be arranged in levels, and different portions of the database can be loaded into the memory array 220 at different steps. Each record can be stored as a vector in a column of the memory array 220. The vector includes a key and a value, where the key is the fingerprint identifying a specific block. In the last level, the value includes the location of the similar or identical block and, optionally, the location of the difference block. In all other levels (except the last one), the key of the record is the centroid representing a group of similar fingerprints, and the value includes the address of the cluster of vectors represented by the centroid.

[0070] Records at all levels can have the same key format, and the similarity searcher 240 can perform the same search at each level. Typically, the first level can contain the largest set of records and can be stored in the memory array 220, while other levels can contain a smaller number of records that can be loaded into the memory array 220 as needed, or processed by the CPU as mentioned above. In the example of Figure 3, the number of records in the first level is 256 K, and the number of records in each of the other levels is 1 K.

[0071] Reference is now made to Figure 6, which is a schematic diagram of a hierarchical storage manager 460 constructed and operated according to an embodiment of the present invention. The hierarchical storage manager 460 can be responsible for all aspects of the storage of the associative memory device 210 and the database used by it. The hierarchical storage manager 460 includes a centroid determiner 610 and an updater 620.

[0072] After the similarity searcher 240 fails to find a similar fingerprint at any level except the last level, the centroid determiner 610 can compute a new representative fingerprint 601 (centroid). The new representative fingerprint 601 can be calculated as the average of all fingerprints in the cluster it represents.
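For bit-vector fingerprints, one plausible reading of "average" is a per-bit mean rounded to the nearest bit, i.e. a majority vote. The text does not specify the rounding, so the tie rule below is an assumption, and the function name `centroid` is hypothetical.

```python
from typing import Sequence


def centroid(fingerprints: Sequence[int], width: int) -> int:
    """Per-bit majority vote over a cluster of `width`-bit fingerprints."""
    if not fingerprints:
        raise ValueError("cluster must contain at least one fingerprint")
    result = 0
    for bit in range(width):
        ones = sum((fp >> bit) & 1 for fp in fingerprints)
        # Assumed tie rule: an exact 50/50 split rounds up to 1.
        if 2 * ones >= len(fingerprints):
            result |= 1 << bit
    return result
```

For example, the cluster `0b1100`, `0b1010`, `0b1000` has only its top bit set in a majority of members, so the centroid under this rule is `0b1000`.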

[0073] In process 500 (Figure 5), when updating any level other than the last level, the updater 620 can use the new representative fingerprint 601; when updating the last level, the updater 620 can use the new LSH fingerprint 202.

[0074] The overhead of writing a new block 101 using the deduplication system 400 can include a search in the fingerprint database 305. If the search is successful, the write operation can include updating information associated with similar fingerprints 222x and storing the difference block 204 in the difference data repository 275. If the search fails, the write operation can include a write operation to write the new block 101 to the block data repository 270 and an update to the fingerprint database 305.

[0075] The overhead of obtaining block Y from deduplication system 400 may include: a read operation from block data repository 270 to obtain block 221y; and an optional combination of a read operation from difference data repository 275 and an operation to reconstruct the original block by applying differences to block 221y.
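The write-side difference calculation and the read-side reconstruction can be sketched as a pair of functions. The text does not define the format of a difference block, so this sketch assumes equal-sized blocks and represents the difference as a list of `(offset, replacement_bytes)` runs; `diff_block` and `reconstruct` are hypothetical names.

```python
from typing import List, Tuple

Runs = List[Tuple[int, bytes]]


def diff_block(similar: bytes, new: bytes) -> Runs:
    """Return the runs of bytes where `new` differs from `similar`."""
    if len(similar) != len(new):
        raise ValueError("this sketch assumes equal-sized blocks")
    runs: Runs = []
    start = None  # offset where the current differing run began
    for i, (a, b) in enumerate(zip(similar, new)):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            runs.append((start, new[start:i]))
            start = None
    if start is not None:
        runs.append((start, new[start:]))
    return runs


def reconstruct(similar: bytes, runs: Runs) -> bytes:
    """Rebuild the original block by applying the stored runs to the similar block."""
    out = bytearray(similar)
    for offset, data in runs:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

An empty list plays the role of an empty difference block: the new block is identical to the stored one and nothing needs to be written to the difference data repository.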

[0076] The size and performance of the associative memory device 210 can depend on the size of the data the system should process. In addition to the size of the data to be processed, the size of the memory array 220 can depend on the number of levels in the hierarchical database 305.

[0077] For a system designed to process 1 PB of data consisting of 256 G blocks of 4 KB each, with each block represented by a 256-bit fingerprint, the size of the levels determines the size of memory array 220. Assuming the first level is 256 K in size, and each subsequent level is 1 K in size, the required size is 256 K + 1 K + 1 K, or 258 K. The first and largest level (256 K) can be loaded offline into memory array 220 and can remain in memory array 220 throughout the operation of associative memory device 210. The next, much smaller levels (the second and third levels of the cluster, each 1 K in size) can be loaded into memory array 220 at runtime, or alternatively, processed in the CPU. It can be noted that the fingerprint database 305 used to store 1 PB of data in 4 KB blocks can be approximately 8 TB (256 bits / 8 × 256 K × 1 K × 1 K).
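The sizing figures can be checked with a few lines of arithmetic. The sketch assumes binary prefixes (K = 2^10, G = 2^30, and so on) and reads the 1 PB workload as 2^38 (256 G) blocks of 4 KB each; all variable names are illustrative.

```python
K, G, T, P = 2**10, 2**30, 2**40, 2**50

data_bytes = 1 * P
block_bytes = 4 * K
fingerprint_bytes = 256 // 8                 # 256-bit fingerprint = 32 bytes

num_blocks = data_bytes // block_bytes       # one last-level fingerprint per block
level_sizes = [256 * K, 1 * K, 1 * K]        # records per level (per cluster below the first)

# Records that must fit in the memory array at once: 256 K + 1 K + 1 K.
resident_records = sum(level_sizes)

# Total last-level fingerprints reachable through the hierarchy.
leaf_capacity = level_sizes[0] * level_sizes[1] * level_sizes[2]

# Size of the last level of the fingerprint database.
database_bytes = fingerprint_bytes * leaf_capacity

print(num_blocks // G, "G blocks")
print(resident_records // K, "K resident records")
print(database_bytes // T, "TB of fingerprints")
```

Under these assumptions the hierarchy addresses exactly one fingerprint per 4 KB block, and the 8 TB figure is the size of the last level alone.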

[0078] It can be recognized that the deduplication system 400 can be cheaper than existing deduplication systems. Fast similarity search is cheaper and requires less storage for the same number of blocks because it only stores changes between similar blocks (rather than storing entirely different blocks). Reducing the overall size of the storage device also reduces the number of components required to process the storage, such as fewer storage drivers, fewer storage servers, etc. The hierarchical fingerprint database 305 can be built online and rebalanced offline (to maintain its representativeness of the entire block storage and prevent skew) to maintain its performance. This may be an ideal solution for solid-state drive (SSD) storage, to which write frequency should be minimal.

[0079] It is understood that additional embodiments of the deduplication system may include a module that first performs an exact search between the fingerprint of the new block and a fingerprint database created using a collision-resistant algorithm such as SHA-1, as performed in existing systems, and, if no exact match is found, uses the associative memory device 410 described above to store only the differences.

[0080] Some users accustomed to existing deduplication systems (which store entire blocks) may be reluctant to store only the differences from which the original block is recreated, and may feel more secure accessing the original content without any re-creation process. A system that improves storage efficiency (i.e., minimizes duplication) compared to current deduplication systems, while preserving the original content, could perform similarity searches on new blocks as described above, but divide each new block into sub-blocks rather than store differences. Fingerprints could be created for each sub-block using collision-resistant hash functions (e.g., Message-Digest Algorithm MD5 or Cyclic Redundancy Check (CRC)) that create smaller fingerprints, and only the sub-blocks that differ between two similar blocks could be stored. The original block could then be easily composed from the stored sub-blocks.

[0081] Reference is now made to Figure 7A, which is a schematic diagram of a deduplication system 700, constructed and operated according to an embodiment of the present invention, that processes sub-blocks. The deduplication system 700 includes an associative memory device 710, a multi-key hierarchical fingerprint database 705, and a block storage 770. The block storage 770 can store sub-blocks 101-j of each new input block 101. The associative memory device 710 includes a locality-sensitive fingerprint creator 230, a block splitter 720, a collision-resistant fingerprint creator 725, a searcher 740 (which includes an exact searcher 745 in addition to the similarity searcher 240), and a multi-key storage manager 760.

[0082] The multi-key hierarchical fingerprint database 705 can store different types of keys at different levels of the hierarchy. At higher levels (all levels except the lowest), the fingerprint database 705 can store records where fingerprint 222 is an LSH key (such as the new LSH fingerprint 202), and it can also store records where the fingerprint is an anti-collision key 227 (such as the new anti-collision fingerprint 702). The key of a record in a higher level of the multi-key hierarchical fingerprint database 705 can be a centroid fingerprint, and the key of a record in the lowest level can be an anti-collision fingerprint 702 that provides access to sub-block 101-j of block 101.

[0083] The locality-sensitive fingerprint creator 230 can create a new LSH fingerprint 202. The block splitter 720 can split block 101 into smaller sub-blocks 101-j. For example, if block 101 is 4 KB in size, the block splitter 720 can split it into 16 sub-blocks 101-j of 256 bytes each. The anti-collision fingerprint creator 725 can create a new anti-collision fingerprint 702 for each sub-block 101-j. The similarity searcher 240 can perform similarity searches at the higher levels of the multi-key hierarchical fingerprint database 705, which store centroids, to find records with fingerprints similar to the new LSH fingerprint 202.
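The splitting and per-sub-block fingerprinting can be sketched with the Python standard library. The sketch uses SHA-256 from `hashlib` as a stand-in hash, which is a substitution and not one of the algorithms the description names; `split_block` and `sub_block_fingerprints` are hypothetical names.

```python
import hashlib
from typing import List


def split_block(block: bytes, n_sub: int = 16) -> List[bytes]:
    """Split a block into `n_sub` equal sub-blocks (4 KB -> 16 x 256 bytes)."""
    if not block or len(block) % n_sub:
        raise ValueError("block length must be a non-zero multiple of n_sub")
    size = len(block) // n_sub
    return [block[i:i + size] for i in range(0, len(block), size)]


def sub_block_fingerprints(block: bytes, n_sub: int = 16) -> List[bytes]:
    """One exact-match fingerprint per sub-block (SHA-256 is an assumed choice)."""
    return [hashlib.sha256(sub).digest() for sub in split_block(block, n_sub)]
```

Identical sub-blocks always yield identical fingerprints, which is what allows an exact search at the lowest level to detect them.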

[0084] The exact searcher 745 can perform an exact search between the new anti-collision fingerprint 702 of each sub-block 101-j and the records in the lowest level of the fingerprint database 705 to find the same fingerprint.

[0085] If the similarity searcher 240 does not find a similar fingerprint 222x (similar to the new LSH fingerprint 202) when searching at a higher level, i.e., no exact match is found, the multikey storage manager 760 can insert all sub-blocks 101-j of the new block 101 into the database 705, update the higher levels of the multikey hierarchical fingerprint database 705 with the relevant centroids created using the LSH fingerprint 202, and update the lowest level of the multikey hierarchical fingerprint database 705 with the anti-collision fingerprint 702 of the sub-blocks 101-j.

[0086] If the similarity searcher 240 finds a similar fingerprint 222x in a higher level, the exact searcher 745 can perform an exact search to locate each new anti-collision fingerprint 702. If the same anti-collision fingerprint has been found (an exact match has been found), the multi-key storage manager 760 can update the anti-collision fingerprint 702 to include the location of the same sub-block. If the same fingerprint has not yet been found in the lowest level of the multi-key hierarchical fingerprint database 705, the multi-key storage manager 760 can use the LSH fingerprint 202 to update the centroid in a higher level, insert the calculated centroid into the multi-key hierarchical fingerprint database 705, and insert the new sub-block 101-j into the database 705.
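The lowest-level behavior, in which only previously unseen sub-blocks are stored, can be sketched as below. The sketch deliberately omits the similarity search and centroid updates at the higher levels and models the lowest level as a Python dictionary keyed by fingerprint; SHA-256 is an assumed stand-in hash, and `store_block`, `load_block`, and `SUB_SIZE` are hypothetical names.

```python
import hashlib
from typing import Dict, List

SUB_SIZE = 256  # bytes per sub-block, matching the 4 KB / 16 example


def store_block(block: bytes, store: Dict[bytes, bytes]) -> List[bytes]:
    """Store unseen sub-blocks; return the fingerprints needed to rebuild the block."""
    recipe = []
    for i in range(0, len(block), SUB_SIZE):
        sub = block[i:i + SUB_SIZE]
        fp = hashlib.sha256(sub).digest()
        # Exact match found: reuse the stored sub-block. Otherwise insert it.
        store.setdefault(fp, sub)
        recipe.append(fp)
    return recipe


def load_block(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Compose the original block from its stored sub-blocks."""
    return b"".join(store[fp] for fp in recipe)
```

Because every block is rebuilt by concatenating stored sub-blocks, the original content is read back directly, with no difference to apply.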

[0087] Figures 7B and 7C together provide an alternative view of system 700, in which the same reference numerals indicate matching elements.

[0088] Embodiments of the present invention can be configured to locate blocks similar to the input block and retrieve a set of similar documents before or instead of storing new documents.

[0089] It can be recognized that the deduplication system 700 can require less data storage compared to a standard deduplication system because two blocks that share most of the same content in their data (e.g., if 15 sub-blocks are the same and only one is different) can consume less than twice the block size, which would be the storage consumption in a standard deduplication system (in this example, the storage consumption of the deduplication system 700 could be 1.06 times the block size).
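The 1.06 figure follows from counting sub-blocks: the first block contributes all 16, and the second contributes only the one that differs. A small helper makes the ratio explicit; `storage_in_blocks` is a hypothetical name and the calculation ignores fingerprint and metadata overhead.

```python
def storage_in_blocks(n_sub: int, n_changed: int) -> float:
    """Storage used by two similar blocks, in units of one block size."""
    return (n_sub + n_changed) / n_sub


print(storage_in_blocks(16, 1))   # one differing sub-block out of 16
print(storage_in_blocks(16, 16))  # no shared sub-blocks: same as storing both blocks
```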

[0090] Those skilled in the art will recognize that the steps shown in the different processes described herein are not intended to be limiting, and that these processes can be practiced with more or fewer steps or different orders of steps or any combination thereof.

[0091] Those skilled in the art will also recognize that the different components of the system shown in the different figures and described herein are not intended to be limiting, and that the system may be implemented with more or fewer components, or have different arrangements of components, or have one or more processors that perform the activities of the whole system, or any combination thereof.

[0092] Unless otherwise specifically stated, as is obvious from the foregoing discussion, it should be understood that throughout this specification, discussions using terms such as “processing,” “computing,” “operation,” “determining,” etc., refer to the actions and / or processes of any type of general-purpose computer (e.g., client / server systems, mobile computing devices, smart home appliances, cloud computing units, or similar electronic computing devices that manipulate data in the registers and / or memory of a computer system and / or convert it into other data in the memory, registers, or other such information storage, transmission, or display devices of a computing system).

[0093] Embodiments of the present invention may include means for performing the operations described herein. The means may be specifically configured for the desired purpose, or the means may include a computing device or system typically having at least one processor and at least one memory, which is selectively activated or reconfigured by a computer program stored in a computer. The resulting means, when instructed by software, can transform a general-purpose computer into an inventive element as discussed herein. The instructions may define the inventive means operating with the desired computer platform. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk, including optical disks, magneto-optical disks, read-only memory (ROM), volatile and non-volatile memory, random access memory (RAM), electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, disk-on-key flash drives, or any other type of medium suitable for storing electronic instructions and capable of being coupled to a computer system bus. Computer-readable storage media may also be implemented in cloud storage.

[0094] Some general-purpose computers may include at least one communication element to enable communication with data networks and / or mobile communication networks.

[0095] The processes and displays presented herein are not inherently related to any particular computer or other device. Various general-purpose systems may be used with the programs taught herein, or it may prove convenient to construct more specialized devices to perform the desired methods. The desired structures for various such systems will emerge from the above description. Furthermore, embodiments of the invention are described without reference to any particular programming language. It will be appreciated that the teachings of the invention as described herein can be implemented using various programming languages.

[0096] Although certain features of the invention have been shown and described herein, many modifications, substitutions, alterations, and equivalents will now occur to those skilled in the art. Therefore, it should be understood that the appended claims are intended to cover all such modifications and alterations falling within the true spirit of the invention.

Claims

1. A system for a storage unit, the system comprising:
an associative memory device for performing associative processing, including a memory array having columns divided into portions, wherein fingerprint portions store multiple fingerprints associated with data blocks, each fingerprint being stored in a separate column of the fingerprint portions;
the associative memory device further including:
a similarity searcher, operating on the columns, configured to receive an input fingerprint of an input block and perform a search within the columns of the fingerprint portion for similar fingerprints whose distance to the input fingerprint is less than a predetermined threshold; and
a difference calculator, operating on the columns, configured to calculate, if the similar fingerprint is found, a difference block indicating the relative change between the input block and a similar block associated with the similar fingerprint; and
a difference block storage manager configured to, if the difference block is a non-empty difference block, associate the input fingerprint with the similar block and with the difference block, store the input fingerprint in a column of the fingerprint portion, and store the non-empty difference block in the storage unit,
wherein the fingerprint portion is arranged in a multi-level structure, wherein higher levels include the centroids of clusters in lower levels, and the lowest level includes the fingerprints of blocks, the centroids being calculated based on the fingerprints; and
the similarity searcher alternatively performs a search within the centroids for a set of similar centroids whose distance from the input fingerprint is less than a predetermined threshold; and
a retrieval device for retrieving a set of data blocks associated with the set of similar fingerprints before or in lieu of storing the input block.

2. The system according to claim 1, further comprising: a fingerprint creator that uses the Locality Sensitive Hash (LSH) algorithm to create the new fingerprint.

3. The system according to claim 1, wherein the storage manager is configured to store the highest-level fingerprints in the columns, and the similarity searcher performs the search at the highest level, while the search at lower levels is performed in the CPU.

4. A storage method, comprising:
storing multiple fingerprints associated with data blocks in a fingerprint portion of an associative memory device, the associative memory device including a memory array having columns divided into portions, each fingerprint being stored in a separate column within the columns, and the data blocks being stored in a storage cell;
receiving an input fingerprint of an input block;
searching within the columns of the fingerprint portion for similar fingerprints that are less than a predetermined threshold distance from the input fingerprint;
if the similar fingerprint is found, calculating, within the columns of the fingerprint portion, a difference block of relative change between the input block and the similar block associated with the similar fingerprint;
if the difference block is a non-empty difference block:
associating the input fingerprint with the similar block and with the difference block,
adding the input fingerprint to a column of the fingerprint portion, and
storing the non-empty difference block in the storage unit,
wherein the fingerprint portion is arranged in a multi-level structure, wherein higher levels include the centroids of clusters in lower levels, and the lowest level includes the fingerprints of blocks, the centroids being calculated based on the fingerprints; and
alternatively, searching within the columns of the fingerprint portion for a set of similar fingerprints whose distance from the input fingerprint is less than a predetermined threshold; and
before storing the input block or instead of storing the input block, retrieving a set of data blocks associated with the set of similar fingerprints.

5. The method according to claim 4, further comprising: creating the new fingerprint using the Locality Sensitive Hash (LSH) algorithm.

6. The method according to claim 4, further comprising: loading the highest-level fingerprints into the columns, wherein the search includes: a search at the highest level performed by the associative memory device, and a search at a lower level performed by the CPU.

7. A deduplication system for storage cells, the deduplication system comprising:
an associative memory device for performing associative processing, including a memory array having columns divided into portions, wherein fingerprint portions store multiple fingerprints associated with data blocks, and wherein the fingerprint portions are arranged in a multi-level structure including a highest level, at least one intermediate level, and a lowest level, wherein the highest level and the at least one intermediate level store Locality Sensitive Hash (LSH) fingerprints as centroids representing clusters of fingerprints in the lower levels, and the lowest level stores collision-resistant fingerprints associated with individual data sub-blocks;
the associative memory device further including:
a locality-sensitive fingerprint creator for creating an input LSH fingerprint for the input block;
an anti-collision fingerprint creator for creating a set of anti-collision fingerprints for a set of sub-blocks of the input block;
a similarity searcher, operating on the columns, for performing a search for similar fingerprints whose distance from the input LSH fingerprint is less than a predetermined threshold within the columns of the highest level and the at least one intermediate level;
a storage manager configured to, if no similar fingerprint is found, store the set of sub-blocks in a storage unit, store the set of anti-collision fingerprints in the lowest level, and update the centroid using the input LSH fingerprint;
an exact searcher which, if a similar fingerprint is found, searches in the lowest level for an identical fingerprint matching each of the anti-collision fingerprints; and
for each identical fingerprint, the storage manager is configured to associate the anti-collision fingerprint with the associated sub-block of each identical fingerprint, and for each dissimilar fingerprint, the storage manager is configured to add the associated dissimilar sub-block to the storage cell, and update the centroid using the LSH fingerprint of the dissimilar fingerprint.

8. The system according to claim 7, wherein the storage manager is configured to store the highest-level fingerprints in the columns, and the similarity searcher performs the search at the highest level, while the search at lower levels is performed in the CPU.

9. A method for deduplication in storage cells, the method comprising:
storing, in a fingerprint portion of an associative memory device, multiple fingerprints associated with data blocks, the associative memory device including a memory array having columns divided into portions, each fingerprint being stored in a separate column within the columns, and the data blocks being stored in storage cells, wherein the fingerprint portion is arranged in a multi-level structure including a highest level, at least one intermediate level, and a lowest level, the highest level and the at least one intermediate level storing Locality Sensitive Hash (LSH) fingerprints that serve as centroids representing clusters of fingerprints at lower levels, and the lowest level storing collision-resistant fingerprints associated with individual data sub-blocks;
creating an input LSH fingerprint for the input block;
creating a set of anti-collision fingerprints for a set of sub-blocks of the input block;
performing a search for similar fingerprints whose distance from the input LSH fingerprint is less than a predetermined threshold within the columns of the highest level and the at least one intermediate level;
if no similar fingerprint is found, storing the set of sub-blocks in the storage unit, storing the set of anti-collision fingerprints in the lowest level, and updating the centroid using the input LSH fingerprint;
if a similar fingerprint is found, searching in the lowest level for an identical fingerprint that matches each of the anti-collision fingerprints; and
for each identical fingerprint, associating the anti-collision fingerprint with the associated sub-block of each identical fingerprint; and
for each distinct fingerprint, adding the associated distinct sub-block to the storage unit, and updating the centroid using the LSH fingerprint of the distinct fingerprint.

10. The method of claim 9, further comprising: loading the highest-level fingerprints into the columns, wherein the search includes: a search at the highest level performed by the associative memory device, and a search at a lower level performed by the CPU.