[0071] Embodiment 2:
[0072] As shown in Figure 2, Figure 3, Figure 4, Figure 5, and Figure 6, the present invention also provides a cuckoo-filter-based deduplication system for large volumes of key data, comprising the following components:
[0073] Deduplication system initialization module 201: used to create the storage units and the cuckoo filters according to the input parameters. As shown in Figure 3, the deduplication system initialization module 201 includes the following sub-modules:
[0074] (1) Storage unit creation sub-module 2011: used to obtain, from the total number of target keys S and the number of storage units N preset by the system, a storage weight wt for each storage unit according to its hardware parameters, where the weights of all storage units sum to 1 and the expected storage capacity of a single persistent storage unit is n = S * wt. The sub-module then creates the N storage units, that is, N database tables or N files;
[0075] (2) Cuckoo filter creation sub-module 2012: used to calculate, from the expected storage capacity n of a single storage unit and the preset expected false positive rate fpp, the number b of elements that each subscript position of the cuckoo filter can hold, the number of hash functions k, and the maximum number of relocations MaxNumKicks, and to create one cuckoo filter with these parameters for each storage unit.
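The initialization step can be illustrated with a minimal Java sketch. The embodiment does not state how the weights are derived from hardware parameters or which formulas size the filter, so everything below is an assumption: the class and method names are hypothetical, the weights are obtained by normalizing arbitrary hardware scores, and the fingerprint length and bucket count follow commonly cited cuckoo filter rules of thumb rather than values from this embodiment.

```java
import java.util.Arrays;

// Hypothetical initialization helper. Class and method names are illustrative,
// and the sizing formulas are common cuckoo filter rules of thumb (assumptions),
// not formulas taken from the embodiment.
class InitializationPlanner {
    /** Normalizes raw hardware scores so that the storage weights sum to 1. */
    static double[] normalizeWeights(double[] hardwareScores) {
        double total = Arrays.stream(hardwareScores).sum();
        if (total <= 0) {
            throw new IllegalArgumentException("scores must sum to a positive value");
        }
        return Arrays.stream(hardwareScores).map(s -> s / total).toArray();
    }

    /** Expected capacity of each storage unit: n_i = S * wt_i, rounded up. */
    static long[] expectedCapacities(long totalKeys, double[] weights) {
        long[] capacities = new long[weights.length];
        for (int i = 0; i < weights.length; i++) {
            capacities[i] = (long) Math.ceil(totalKeys * weights[i]);
        }
        return capacities;
    }

    /** Fingerprint length in bits for b slots per position: ceil(log2(2b / fpp)). */
    static int fingerprintBits(int b, double fpp) {
        return (int) Math.ceil(Math.log(2.0 * b / fpp) / Math.log(2.0));
    }

    /** Smallest power-of-two number of positions holding n entries at the given load factor. */
    static long bucketCount(long n, int b, double loadFactor) {
        long needed = (long) Math.ceil(n / (b * loadFactor));
        long buckets = 1;
        while (buckets < needed) {
            buckets <<= 1;
        }
        return buckets;
    }
}
```

For example, hardware scores of 1, 1, and 2 give weights 0.25, 0.25, and 0.5, so S = 1000 keys would be split into expected capacities of 250, 250, and 500.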
[0076] Data acquisition module 202: used to acquire the large volume of key data to be stored and checked for duplicates from key distribution systems such as a quantum key distribution system or a quantum key relay network system.
[0077] Divide-and-conquer storage module 203: used to take the low k bits (generally 32 bits) of the input key X as the key interception value K, and to construct M = 2^k virtual storage units. The key interception value is then compared against M. If K ∈ [M * (wt_1 + … + wt_{i-1}), M * (wt_1 + … + wt_i)], the serial number of the target storage unit is i, which means that the key should enter the processing unit with serial number i. The corresponding storage interface is called to route the input key to the storage unit indicated by serial number i, that is, a database table or a file. This ensures that duplicate keys are stored in the same storage unit and are checked for duplication by the same cuckoo filter.
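The routing rule above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the key is assumed to be available as a long whose low bits are the low bits of the key, and interval boundaries are treated as half-open so that a boundary value maps to exactly one storage unit.

```java
// Hypothetical router; class and method names are illustrative assumptions.
class KeyRouter {
    /**
     * Returns the 1-based serial number i of the storage unit whose weight
     * interval contains the key interception value K (the low k bits of the key).
     */
    static int route(long key, int k, double[] weights) {
        if (k < 1 || k > 62) {
            throw new IllegalArgumentException("k must be in [1, 62]");
        }
        long m = 1L << k;                      // M = 2^k virtual storage units
        long interceptValue = key & (m - 1);   // K = low k bits of the key
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            // Assumption: intervals are half-open, [lower, upper), so a value
            // lying exactly on a boundary belongs to the next unit.
            if (interceptValue < m * cumulative) {
                return i + 1;
            }
        }
        return weights.length; // guards against rounding when the weights sum to just under 1
    }
}
```

Because the serial number depends only on the low k bits of the key, two identical keys always reach the same storage unit, which is the property the module relies on.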
[0078] Cuckoo deduplication module 204: when a key is stored, its existence is checked with the Cuckoo Filter algorithm. If the key data fingerprint is determined not to exist, the key is inserted into the cuckoo filter instance and the deduplication detection identification field value of the key in the storage unit is set to 0, marking the key as unique. If the key data fingerprint already exists, the key is judged to have been stored in the cuckoo filter, the deduplication detection identification field value of the key in the storage unit is set to +1, and the key is added as a Value element, through the HashSet.add method, to the HashSet collection that stores positive data in element form.
[0079] If the key data fingerprint does not exist in the cuckoo filter, and the fingerprint still cannot be stored after MaxNumKicks rounds of relocation, the corresponding positions of the cuckoo filter are fully loaded and the key data has overflowed. The key is then treated as overflow data: it is added as a Value element, through the HashSet.add method, to the HashSet collection that stores overflow data in element form, and the deduplication detection identification field value of the key in the storage unit is set to +1.
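The three outcomes described above (unique, positive, overflow) can be sketched with a deliberately small cuckoo filter. This is not the filter of the embodiment: all class names are hypothetical, keys are modelled as String values, fingerprints are fixed at 16 bits with two candidate positions, and a failed insertion undoes its relocations so that no stored fingerprint is lost, which is a design choice of this sketch.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Minimal illustrative cuckoo filter. All names are assumptions; keys are
// modelled as String values and fingerprints as 16-bit values.
class TinyCuckooFilter {
    private final short[][] buckets;          // value 0 marks an empty slot
    private final int maxNumKicks;
    private final Random random = new Random(42);

    /** numBuckets must be a power of two so that index masking works. */
    TinyCuckooFilter(int numBuckets, int slotsPerBucket, int maxNumKicks) {
        this.buckets = new short[numBuckets][slotsPerBucket];
        this.maxNumKicks = maxNumKicks;
    }

    private static int mix(int x) {           // 32-bit integer mixing function
        x ^= x >>> 16; x *= 0x85ebca6b;
        x ^= x >>> 13; x *= 0xc2b2ae35;
        return x ^ (x >>> 16);
    }
    private static short fingerprint(String key) {
        short fp = (short) mix(key.hashCode() * 31 + 7);
        return fp == 0 ? (short) 1 : fp;      // 0 is reserved for "empty"
    }
    private int index(String key) { return mix(key.hashCode()) & (buckets.length - 1); }
    private int altIndex(int i, short fp) { return (i ^ mix(fp)) & (buckets.length - 1); }

    private boolean has(int i, short fp) {
        for (short s : buckets[i]) if (s == fp) return true;
        return false;
    }
    private boolean put(int i, short fp) {
        for (int j = 0; j < buckets[i].length; j++) {
            if (buckets[i][j] == 0) { buckets[i][j] = fp; return true; }
        }
        return false;
    }
    boolean contains(String key) {
        short fp = fingerprint(key);
        int i1 = index(key);
        return has(i1, fp) || has(altIndex(i1, fp), fp);
    }
    /** Returns false if no slot is found within maxNumKicks relocations. */
    boolean insert(String key) {
        short fp = fingerprint(key);
        int i1 = index(key), i2 = altIndex(i1, fp);
        if (put(i1, fp) || put(i2, fp)) return true;
        int[] pathBucket = new int[maxNumKicks], pathSlot = new int[maxNumKicks];
        int i = random.nextBoolean() ? i1 : i2;
        for (int n = 0; n < maxNumKicks; n++) {
            int slot = random.nextInt(buckets[i].length);
            pathBucket[n] = i; pathSlot[n] = slot;
            short evicted = buckets[i][slot];
            buckets[i][slot] = fp;            // kick out the resident fingerprint
            fp = evicted;
            i = altIndex(i, fp);
            if (put(i, fp)) return true;
        }
        for (int n = maxNumKicks - 1; n >= 0; n--) {  // undo the kicks on failure
            short resident = buckets[pathBucket[n]][pathSlot[n]];
            buckets[pathBucket[n]][pathSlot[n]] = fp;
            fp = resident;
        }
        return false;
    }
}

// Hypothetical per-storage-unit state: one filter plus the two HashSet collections.
class DedupStore {
    final TinyCuckooFilter filter;
    final Set<String> positiveData = new HashSet<>();
    final Set<String> overflowData = new HashSet<>();

    DedupStore(TinyCuckooFilter filter) { this.filter = filter; }

    /** Returns the identification field value: 0 = unique, 1 = needs a precise check. */
    int store(String key) {
        if (filter.contains(key)) {           // fingerprint present: possible duplicate
            positiveData.add(key);
            return 1;
        }
        if (filter.insert(key)) return 0;     // fingerprint stored: treated as unique
        overflowData.add(key);                // filter full after MaxNumKicks relocations
        return 1;
    }
}
```

Storing the same key twice in a `DedupStore` returns 0 for the first call and 1 for the second, and the key then appears in `positiveData`. A key that cannot be placed ends up in `overflowData` instead.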
[0080] Data deletion module 205: when key data is deleted, the existence of the key is checked with the Cuckoo Filter algorithm. If the key data fingerprint is determined to exist, the key is deleted from the cuckoo filter instance; at the same time, the positive data set is queried for the key and, if the key is present, it is deleted as a Value element through the HashSet.remove method. If the key data fingerprint is determined not to exist in the cuckoo filter instance, the overflow data set is queried for the key and, if the key is present, it is deleted as a Value element through the HashSet.remove method. At the same time, the deduplication detection identification field value of the key in the storage unit is set to -1.
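The deletion branch logic can be sketched in Java as below. The `DeletableFilter` interface, the `ExactSetFilter` stand-in (an exact set used only to keep the example runnable, not a real cuckoo filter), and the `KeyDeleter` class are all hypothetical names introduced for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical abstraction over whichever cuckoo filter implementation is used.
interface DeletableFilter {
    boolean contains(String key);
    boolean delete(String key);
}

// Stand-in filter backed by an exact set, used only to make the sketch runnable.
class ExactSetFilter implements DeletableFilter {
    private final Set<String> entries = new HashSet<>();
    void insert(String key) { entries.add(key); }
    public boolean contains(String key) { return entries.contains(key); }
    public boolean delete(String key) { return entries.remove(key); }
}

class KeyDeleter {
    static final int DELETED_FLAG = -1;   // identification field value after deletion

    /** Applies the deletion rules and returns the new identification field value. */
    static int delete(String key, DeletableFilter filter,
                      Set<String> positiveData, Set<String> overflowData) {
        if (filter.contains(key)) {
            filter.delete(key);           // fingerprint exists: remove it from the filter
            positiveData.remove(key);     // HashSet.remove is a no-op if the key is absent
        } else {
            overflowData.remove(key);     // fingerprint absent: the key can only be overflow data
        }
        return DELETED_FLAG;
    }
}
```

Since `HashSet.remove` simply returns false when the element is missing, the separate "query, then delete if present" step in the text collapses into a single call here.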
[0081] Positive data and overflow data traversal statistics module 206: for a given key deduplication process among the processes run in parallel by the system, this module 206 traverses the specified storage unit, and each time a key is taken out, the key is checked and counted. If the key may be duplicated, the duplicate screening result, which contains the storage location information of all copies of the same key, is saved to a HashMap collection in the form of an ArrayList list. As shown in Figure 4, the positive data and overflow data traversal statistics module 206 includes the following sub-modules:
[0082] (1) Positive data traversal sub-module 2061: each time a key is taken out, this sub-module judges whether it already exists in the HashSet collection of positive data. If the key does not exist in the HashSet collection, the key is unique and processing is skipped; if the key exists in the HashSet collection, the key may be duplicated and is forwarded to the duplicate result statistics sub-module 2063 for processing;
[0083] (2) Overflow data traversal sub-module 2062: each time a key is taken out, this sub-module judges whether it already exists in the HashSet collection of overflow data. If the key does not exist in the HashSet collection, the key is unique and processing is skipped; if the key exists in the HashSet collection, the key may be duplicated and is forwarded to the duplicate result statistics sub-module 2063 for processing;
[0084] (3) Duplicate result statistics sub-module 2063: when sub-module 2061 or sub-module 2062 determines that a key may be duplicated, this sub-module 2063 checks whether the HashMap collection, which stores the statistical results of possibly duplicated data in element form, already contains the Key corresponding to this key. If it does not, a key-value pair is created: the Key is the key itself, and the Value is initialized as an ArrayList list used to save key storage location information, such as the database primary key, file offset, or actual storage address. The ArrayList.add method is then used to add the storage location information of the key to the ArrayList list. If the Key already exists, the duplicate key corresponding to this Key and its storage location information have been found before. In that case, the key is used as the Key to obtain the corresponding key-value pair element, the storage location information of the key is added to the Value ArrayList list, and the HashMap.put method is used to write the updated Key and Value pair back to the HashMap collection, completing the update of the corresponding key-value pair element.
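The traversal and statistics performed by sub-modules 2061 to 2063 can be sketched in Java as follows. The names are hypothetical, the storage unit is modelled as an in-memory list, and the list index stands in for the storage location information (primary key, file offset, or address). The sketch uses `HashMap.computeIfAbsent` as a compact equivalent of the "create if missing, otherwise update and write back" steps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical traversal helper; the list index plays the role of the
// key storage location information (an assumption of this sketch).
class DuplicateStatistics {
    /** Maps each possibly duplicated key to the locations of all of its copies. */
    static Map<String, List<Integer>> collect(List<String> storageUnit,
                                              Set<String> positiveData,
                                              Set<String> overflowData) {
        Map<String, List<Integer>> candidates = new HashMap<>();
        for (int location = 0; location < storageUnit.size(); location++) {
            String key = storageUnit.get(location);
            // Sub-modules 2061 and 2062: a key found in neither set is unique and skipped.
            if (!positiveData.contains(key) && !overflowData.contains(key)) {
                continue;
            }
            // Sub-module 2063: create the list on first sight, then append the location.
            candidates.computeIfAbsent(key, k -> new ArrayList<>()).add(location);
        }
        return candidates;
    }
}
```

For a storage unit holding `a, b, a, c, c` with `a` in the positive set and `c` in the overflow set, the result maps `a` to locations 0 and 2 and `c` to locations 3 and 4, while `b` is skipped.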
[0085] Precise deduplication module 207: used to traverse the HashMap collection that stores the statistical results of possibly duplicated data, in order to achieve accurate deduplication of duplicate keys. If the number of elements in the Value list of an element in the collection is greater than 1, the key indicated by the Key is duplicated. According to the storage location information of the duplicate keys saved in the Value list, the duplicates are eliminated and only the first copy of the key is retained, and the deduplication detection identification field value of the retained key is set to 0. If the Value list of an element contains exactly 1 element, the key indicated by the Key is unique, and the deduplication detection identification field value of that key is set to 0.
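The final pass can be sketched in Java as below. The class and method names are hypothetical, and storage locations are modelled as integers. The method only computes which locations should be eliminated; actually deleting rows or file records, and resetting the identification field to 0, is left to the storage layer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical precise-deduplication helper (names are assumptions).
class PreciseDeduplicator {
    /** Returns every location after the first in each list that has more than one entry. */
    static List<Integer> locationsToRemove(Map<String, List<Integer>> candidates) {
        List<Integer> toRemove = new ArrayList<>();
        for (List<Integer> locations : candidates.values()) {
            if (locations.size() > 1) {
                // Keep the first copy, eliminate the remaining duplicates.
                toRemove.addAll(locations.subList(1, locations.size()));
            }
        }
        Collections.sort(toRemove);   // deterministic order regardless of map iteration
        return toRemove;
    }
}
```

A key whose list holds a single location was only a filter false positive or an overflow entry without a second copy, so nothing is removed for it.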