Cuckoo filter-based duplicate removal method and system for large-data-volume key

A cuckoo filter-based technology for deduplicating large volumes of key data, applied to encryption devices with shift registers/memories and to key distribution. It addresses the problem of efficiently and accurately deduplicating large-data-volume keys, achieving efficient deduplication queries and improving key quality and usability.

Active Publication Date: 2022-08-02
ZHEJIANG QUANTUM TECH CO LTD
13 Cites 0 Cited by

AI-Extracted Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to provide a cuckoo filter-based method and system for deduplicating large volumes of key data, so as to solve the problem that the field of quantum informatio...

Method used

Accurate deduplication module 207: used to traverse the hashMap collection that stores the statistics of possibly duplicated data and perform exact removal of duplicate keys. If the Value list of an element in the collection contains more than one entry, duplicates of the key indicated by the Key have been found; using the storage location information saved in the Value list, the duplicates are eliminated, only the first group of keys among the duplicates is kept, and the deduplication detection flag of that group is set to 0. If the Value list of an element contains exactly one entry, the key indicated by the Key is unique, and the deduplication detection flag of that group of keys is set to 0.
[0105] From the structure, method and embodiments of the present invention it can be seen that the invention adopts a weighted load-balancing scheduling method so that keys are uniformly routed to and stored in different storage units, which ensures that duplicate keys must land in the same data set. It also reduces ...

Abstract

The invention discloses a cuckoo filter-based deduplication method for large volumes of key data, comprising the steps of: initializing the deduplication system; acquiring the key data to be deduplicated; storing the key data in partitions; cuckoo-filtering and deduplicating the key data; deleting key data; traversing and counting the positive-data and overflow-data sets; and precisely deduplicating the key data, thereby completing accurate deduplication of a large volume of key data. The invention further provides a corresponding cuckoo filter-based deduplication system for large-data-volume keys. Compared with the prior art, the scheme facilitates dynamic adjustment of the storage units and reduces migration of large amounts of key data, improving the overall deduplication efficiency of the system on large volumes of key data. Compared with a Bloom filter, the cuckoo filter supports dynamic insertion and deletion of elements, provides higher lookup performance than a traditional Bloom filter, and occupies less space at the same low expected false-positive rate; at the same time, the quality and usability of the keys are improved.

Application Domain

Key distribution for secure communication; encryption apparatus with shift registers/memories; +1

Technology Topic

Storage cell; positive data; +6


Examples

  • Experimental program(3)
  • Effect test(1)

Example Embodiment

[0071] Embodiment 2:
[0072] As shown in Figures 2 to 6, the present invention also provides a cuckoo filter-based deduplication system for large volumes of key data, comprising the following components:
[0073] Deduplication system initialization module 201: used to create the storage units and cuckoo filters according to the input parameters. As shown in Figure 3, the deduplication system initialization module 201 includes the following submodules:
[0074] (1) Storage unit creation submodule 2011: used to obtain, from the total target key amount S and the number N of storage units preset by the system, a storage weight wt for each storage unit according to its hardware parameters, where the weights of all storage units sum to 1 and the expected storage capacity of a single persistent storage unit is n = S * wt. N database tables or N files are then created accordingly;
[0075] (2) Cuckoo filter creation submodule 2012: according to the expected storage capacity n of a single storage unit and the preset expected false-positive rate fpp, calculate the number b of elements that each bucket (subscript position) of the cuckoo filter can hold, the number k of hash functions, and the maximum number of kick-out relocations MaxNumKicks; then create the corresponding cuckoo filter, with one filter per storage unit.
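The patent does not give submodule 2012's sizing formulas, but a common way to derive the filter parameters from n and fpp (following the standard cuckoo filter analysis) can be sketched as follows; the function name, the default bucket size b = 4 and the load factor are illustrative assumptions, not taken from the patent:

```python
import math

def size_cuckoo_filter(n, fpp, b=4, load_factor=0.955):
    """Illustrative sizing step for submodule 2012 (assumed formulas).
    n: expected capacity of one storage unit; fpp: desired false-positive rate;
    b: entries per bucket; load_factor: typical achievable occupancy for b = 4."""
    # Fingerprint length in bits: a standard bound is ceil(log2(1/fpp) + log2(2b)).
    fingerprint_bits = math.ceil(math.log2(1 / fpp) + math.log2(2 * b))
    # Number of buckets, rounded up to a power of two so bucket indices
    # can be taken directly from hash bits.
    buckets = 1
    while buckets * b * load_factor < n:
        buckets *= 2
    return fingerprint_bits, buckets

bits, buckets = size_cuckoo_filter(n=1_000_000, fpp=0.001, b=4)
```

With n = 1,000,000 and fpp = 0.001 this yields a 13-bit fingerprint and 262,144 buckets of 4 slots each, comfortably above the target capacity.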
[0076] Data acquisition module 202: used to acquire the large volume of key data to be stored and deduplicated from key distribution systems such as a quantum key distribution system or a quantum key relay network system.
[0077] Divide-and-conquer storage module 203: used to take the low k bits of the input key X (generally 32 bits) as the key interception value K, and to construct M = 2^k virtual storage units. The interception value is compared against M: if K ∈ [M * (wt1 + … + wt(i-1)), M * (wt1 + … + wti)), the branch storage unit serial number obtained is i, meaning this group of keys should enter the processing unit with serial number i. The corresponding storage interface is called to route the group of input keys to the storage unit indicated by serial number i, i.e. a database table or a file, ensuring that duplicate keys are stored in the same storage unit and that the same cuckoo filter performs the duplicate-key query.
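The weighted routing rule of module 203 can be sketched as follows; the function name and the byte layout of the key are illustrative assumptions:

```python
def route_key(key: bytes, weights, k=32):
    """Sketch of module 203: route a key to a storage unit by its low k bits,
    using cumulative storage weights (the weights sum to 1)."""
    M = 2 ** k
    # Key interception value K: the low k bits of the key.
    K = int.from_bytes(key, "big") & (M - 1)
    cumulative = 0.0
    for i, wt in enumerate(weights, start=1):
        cumulative += wt
        # K in [M * sum(w1..w_{i-1}), M * sum(w1..w_i)) -> unit i.
        if K < M * cumulative:
            return i
    return len(weights)  # guard against floating-point rounding at the top edge

unit = route_key(b"\x00\x00\x00\x01\x02\x03\x04\x05", [0.25, 0.25, 0.5])
```

Because two identical keys have identical low k bits, they always map to the same unit, which is exactly the property the patent relies on for correctness of per-unit deduplication.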
[0078] Cuckoo deduplication module 204: when a key is stored, its presence is checked with the Cuckoo Filter algorithm. If the key's data fingerprint is not found, the group of keys is inserted into the cuckoo filter instance, the deduplication detection flag of the key in the storage unit is set to 0, and the key is marked as unique. If the fingerprint already exists, the key is judged to have already been stored in the cuckoo filter, the deduplication detection flag of the key in the storage unit is incremented by 1, and the group of keys is added, via the HashSet.add method, as an element of the hashSet collection that stores the positive data.
[0079] If the key's fingerprint is not present in the cuckoo filter, but the filter still cannot store the fingerprint after MaxNumKicks rounds of kick-out relocation, the corresponding buckets of the cuckoo filter are fully loaded and the key data overflows. The group of keys is then treated as overflow data and added, via the HashSet.add method, as an element of the hashSet collection that stores the overflow data, and the deduplication detection flag of the key in the storage unit is incremented by 1.
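The insert path of module 204, including the MaxNumKicks relocation loop and the overflow signal of paragraph [0079], can be sketched with a minimal partial-key cuckoo filter. This is an illustrative toy, not the patent's implementation; the class name, hash choices, and the 8-bit fingerprint are assumptions:

```python
import hashlib
import random

class CuckooFilterSketch:
    """Minimal illustration of module 204: two candidate buckets per fingerprint,
    kick-out relocation, and overflow detection on failure."""
    def __init__(self, num_buckets=16, bucket_size=4, max_num_kicks=500):
        self.num_buckets = num_buckets      # power of two, so XOR indexing works
        self.bucket_size = bucket_size      # b: elements per bucket
        self.max_num_kicks = max_num_kicks  # MaxNumKicks from submodule 2012
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, key: bytes) -> int:
        return hashlib.sha256(key).digest()[0] or 1   # 8-bit nonzero fingerprint

    def _index(self, key: bytes) -> int:
        return int.from_bytes(hashlib.sha256(b"i" + key).digest()[:4], "big") % self.num_buckets

    def _alt_index(self, i: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the alternate bucket depends only on (i, fp),
        # so an entry can be relocated without knowing the original key.
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
        return (i ^ h) % self.num_buckets

    def contains(self, key: bytes) -> bool:
        fp, i1 = self._fingerprint(key), self._index(key)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def insert(self, key: bytes) -> bool:
        """True on success; False signals overflow data (paragraph [0079])."""
        fp, i1 = self._fingerprint(key), self._index(key)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: relocate existing fingerprints up to MaxNumKicks times.
        i = random.choice((i1, i2))
        for _ in range(self.max_num_kicks):
            victim = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][victim] = self.buckets[i][victim], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter fully loaded: the caller records the key as overflow
```

A caller following paragraphs [0078]-[0079] would first call `contains` (a hit sends the key to the positive-data hashSet), then `insert` (a `False` return sends it to the overflow-data hashSet).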
[0080] Data deletion module 205: when key data is deleted, the presence of the key is checked with the Cuckoo Filter algorithm. If the key's fingerprint exists, the group of keys is deleted from the cuckoo filter instance; at the same time the positive-data set is queried for the key data, and if present the key is removed as an element via the HashSet.remove method. If the fingerprint does not exist in the cuckoo filter instance, the overflow-data set is queried for the key data, and if present the key is removed as an element via the HashSet.remove method. At the same time, the deduplication detection flag of the key in the storage unit is decremented by 1.
[0081] Positive-data and overflow-data traversal statistics module 206: for a given key deduplication process among the system's parallel processes, this module traverses the specified storage unit, takes out one group of keys at a time, and judges and counts it. If the key is duplicated, the screening result, containing the storage location information of the same group of duplicate keys, is saved to the hashMap collection in the form of an ArrayList. As shown in Figure 4, module 206 includes the following submodules:
[0082] (1) Positive-data traversal submodule 2061: each time a group of keys is taken out, it checks whether the group already exists in the hashSet collection of positive data. If not, the group of keys is unique and processing is skipped; if it does exist, the group of keys may be duplicated and is forwarded to the duplicate-result statistics submodule 2063 for processing;
[0083] (2) Overflow-data traversal submodule 2062: each time a group of keys is taken out, it checks whether the group already exists in the hashSet collection of overflow data. If not, the group of keys is unique and processing is skipped; if it does exist, the group of keys may be duplicated and is forwarded to the duplicate-result statistics submodule 2063 for processing;
[0084] (3) Duplicate-result statistics submodule 2063: when submodule 2061 or 2062 determines that a key may be duplicated, this submodule checks whether the hashMap collection that stores the statistics of possibly duplicated data already contains an element whose Key is this group of keys. If not, a key-value pair is created: the Key is the group of keys, and the Value is initialized as an ArrayList used to save key storage location information, such as a database primary key, file offset, or actual storage address; the storage location information of the group is then added to the list via the ArrayList.add method. If it already exists, duplicates of this Key and their storage location information have already been found; the corresponding key-value pair element is fetched using the group of keys as the Key, the storage location information of the group is appended to the Value ArrayList, and the updated key-value pair is written back to the hashMap collection via the HashMap.put method, completing the update of the element.
[0085] Precise deduplication module 207: used to traverse the hashMap collection that stores the statistics of possibly duplicated data and perform exact removal of duplicate keys. If the Value list of an element in the collection contains more than one entry, duplicates of the key indicated by the Key have been found; using the storage location information saved in the Value list, the duplicates are eliminated, only the first group of keys among the duplicates is retained, and the deduplication detection flag of that group is set to 0. If the Value list of an element contains exactly one entry, the key indicated by the Key is unique, and the deduplication detection flag of that group of keys is set to 0.
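The keep-the-first rule of module 207 can be sketched as follows; the function and parameter names are illustrative, and the flag updates from the patent are shown as comments:

```python
def precise_dedup(dup_map):
    """Sketch of module 207: dup_map maps a key (Key) to the list of storage
    locations (Value) at which it was seen. Returns the locations whose keys
    should be eliminated, keeping only the first occurrence of each key."""
    to_delete = []
    for key, locations in dup_map.items():
        if len(locations) > 1:
            # Duplicates found: keep the first group, eliminate the rest,
            # then set the kept group's deduplication detection flag to 0.
            to_delete.extend(locations[1:])
        # else: the key is unique; its detection flag is likewise set to 0.
    return to_delete

doomed = precise_dedup({b"k1": [10, 42, 77], b"k2": [5]})
```

The returned locations would then be deleted from the storage unit (database rows or file regions) by the caller.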

Example Embodiment

[0086] Embodiment 3:
[0087] As shown in Figure 7, on the basis of Embodiment 1, the process of step S6 (positive-data traversal statistics) is detailed, comprising substeps S601, S602, S603, S604, S605 and S606, as follows:
[0088] S601: Traverse and retrieve a set of keys X in the specified storage unit;
[0089] S602: Determine whether the key X already exists in the HashSet collection of positive data output in step S4, or in the HashSet collection of overflow data. If it does not exist, the key X is unique and needs no processing; jump to step S601 to start the next round of traversal. If it exists, proceed to S603;
[0090] S603: The key X exists in the HashSet collection of positive data or the HashSet collection of overflow data, indicating that X may be duplicated. Obtain the actual storage location information of X, i.e. its file offset or database primary key in the storage unit, or any other information identifying where the key is actually stored.
[0091] S604: Determine whether an element whose Key is the key X exists in the HashMap collection storing the positive-data duplicate statistics; if it does not exist, proceed to S605, otherwise to S606.
[0092] S605: Create a key-value pair element in which the Key is the key X and the Value is initialized as an ArrayList; add the actual storage location information of X to the list, and add the key-value pair to the HashMap collection storing the positive-data duplicate statistics.
[0093] S606: Extract the element whose Key is the key X from the HashMap collection storing the positive-data duplicate statistics, and append the actual storage location information of X to the ArrayList of that element's Value.
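The S601-S606 loop above can be sketched as a single traversal; the function signature and the representation of the storage unit as an iterable of (location, key) pairs are illustrative assumptions, while the container roles mirror the patent's HashSet and HashMap:

```python
def traverse_statistics(storage_unit, positive_set, overflow_set):
    """Sketch of substeps S601-S606: scan one storage unit and build the map
    of possibly duplicated keys and their storage locations."""
    dup_map = {}  # Key -> list of storage locations (the patent's ArrayList)
    for location, key in storage_unit:            # S601: take out a group of keys X
        if key not in positive_set and key not in overflow_set:
            continue                              # S602: X is unique, skip it
        # S603: X may be duplicated; 'location' is its actual storage position
        if key not in dup_map:                    # S604 -> S605: first sighting
            dup_map[key] = [location]
        else:                                     # S604 -> S606: append location
            dup_map[key].append(location)
    return dup_map

stats = traverse_statistics(
    [(0, b"a"), (1, b"b"), (2, b"a"), (3, b"c")],
    positive_set={b"a"},
    overflow_set=set(),
)
```

In this toy run only key `b"a"` was flagged by the filter stage, so the result records its two storage locations, ready for the precise deduplication step.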

Example Embodiment

[0094] Embodiment 4:
[0095] As shown in Figure 8, based on the structure of the key deduplication system of Figure 2, the present invention also provides a parallel processing framework for the key deduplication system, as follows:
[0096] Deduplication system instance Inst: the deduplication system instance Inst comprises N deduplication process instances, i.e. the process instances Inst1, ..., InstX, ..., InstN described below, where N is the number of storage units and also the number of cuckoo filters required by the deduplication system.
[0097] Deduplication process instance Inst1: the key deduplication system process instance Inst1 is the first process instance in the parallel process of the key deduplication system, including a storage unit 601, a cuckoo filter 602, a HashSet collection 603 and a HashMap collection 604.
[0098] Deduplication process instance InstX: the key deduplication system process instance InstX is the Xth process instance in the parallel process of the key deduplication system, including storage unit 601, cuckoo filter 602, HashSet collection 603 and HashMap collection 604.
[0099] Deduplication process instance InstN: the key deduplication system process instance InstN is the Nth process instance in the parallel processing of the key deduplication system, including a storage unit 601, a cuckoo filter 602, a HashSet collection 603 and a HashMap collection 604.
[0100] Storage unit 601: one of the N storage units created by the deduplication system initialization module 201 of Embodiment 2, used to store the specified key; according to the input key, the serial number of the specific storage unit is obtained by hashing the key and taking the remainder.


