A Bloom filter-based method and system for deduplicating large volumes of keys

This Bloom filter and large-data-volume technology, applied in digital transmission systems, database indexing, and related fields, addresses the difficulty of achieving accurate deduplication when processing large volumes of data. It improves deduplication and query efficiency, space usage, key quality, and security.

Active Publication Date: 2021-12-31
ZHEJIANG QUANTUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to provide a Bloom filter-based method and system for deduplicating large volumes of keys, so as to overcome the technical defect that prior-art deduplication methods for large-volume data struggle to achieve accurate deduplication.

Examples

Embodiment 1

[0068] Embodiment 1: As shown in Figure 1, the present invention provides a Bloom filter-based method for deduplicating large volumes of keys, comprising the following steps:

[0069] S1: Obtain the key data to be deduplicated. A key K is obtained from a secure key distribution system, such as a quantum key distribution system or a quantum key relay network system, and awaits deduplication detection.

[0070] S2: Initialize the deduplication system. Based on the total number of target keys S that the system is designed for and the expected storage capacity n of a single persistent storage unit, determine the number N of storage units, then create N database tables or N files. Based on the expected storage capacity n of a single storage unit and a preset expected false positive rate fpp, calculate the size m of a single Bloom filter's Bitmap array and the number k of hash functions, create a Bloom filter BF, and, according to the number of storage units, create a corresponding number...
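
The sizing in step S2 can be illustrated with a short sketch. The source text is truncated and does not state the formulas it uses, so this example assumes the standard Bloom filter sizing formulas, m = -n·ln(fpp) / (ln 2)² and k = (m/n)·ln 2, and assumes N is the ceiling of S/n. The function name `plan_dedup_system` is hypothetical.

```python
import math


def plan_dedup_system(total_keys: int, unit_capacity: int, fpp: float):
    """Return (N, m, k) for the quantities named in step S2.

    Assumptions (not stated in the source): N = ceil(S / n), and m and k
    follow the textbook Bloom filter sizing formulas.
    """
    if total_keys <= 0 or unit_capacity <= 0:
        raise ValueError("S and n must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must lie strictly between 0 and 1")

    # N: number of storage units (database tables or files), integer ceiling.
    num_units = -(-total_keys // unit_capacity)

    # m: bits in one Bloom filter's bitmap, sized for n keys at rate fpp.
    m = math.ceil(-unit_capacity * math.log(fpp) / (math.log(2) ** 2))

    # k: number of hash functions that minimizes the false positive rate.
    k = max(1, round(m / unit_capacity * math.log(2)))

    return num_units, m, k


if __name__ == "__main__":
    print(plan_dedup_system(10_000_000, 1_000_000, 0.01))
```

Under these assumptions, S = 10,000,000, n = 1,000,000 and fpp = 0.01 give 10 storage units, each with a bitmap of roughly 9.6 million bits and 7 hash functions.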

Embodiment 2

[0079] Embodiment 2: As shown in Figures 2, 3, and 4, the present invention also provides a Bloom filter-based system for deduplicating large volumes of keys, comprising the following components:

[0080] Key acquisition module 201: acquires the large volume of keys to be stored and deduplicated from key distribution systems such as a quantum key distribution system or a quantum key relay network system.

[0081] Deduplication system initialization module 202: creates storage units and Bloom filters according to the input parameters. As shown in Figure 3, the deduplication system initialization module 202 includes the following submodules:

[0082] (1) Storage unit creation submodule 2021: determines the number N of storage units according to the expected total number of keys S input to the system and the expected storage capacity n of a single persistent storage unit, then creates N database tables or N files;

[0083] ...

Embodiment 3

[0090] Embodiment 3: Building on Embodiment 1 and with reference to Figure 5, the positive data traversal statistics of step S5 are described in detail. The process includes sub-steps S501, S502, S503, S504, S505, and S506, as follows:

[0091] S501: Traverse the specified storage unit and take out a group of keys K;

[0092] S502: Determine whether the key K already exists in the HashSet of positive data output by step S4. If it does not, the key K is unique and needs no processing; jump to step S501 to start the next round of traversal statistics. If it does, proceed to S503;

[0093] S503: The key K exists in the HashSet of positive data, indicating that K may be a duplicate. Obtain the actual storage location information of the key K, that is, the file offset of the key K in the storage unit or its database primary key, either of which can represent information on the key's actual sto...
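
Sub-steps S501 to S503 can be sketched as follows. This is a minimal illustration only: the record layout (a location paired with a key), the function name `collect_positive_locations`, and the use of a dictionary to hold locations are all assumptions. What the method does with the collected locations in S504 to S506 is truncated in the source and is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def collect_positive_locations(
    unit_records: Iterable[Tuple[int, bytes]],
    positives: Set[bytes],
) -> Dict[bytes, List[int]]:
    """Record where each possibly-duplicated key is stored.

    unit_records yields (location, key) pairs from one storage unit, where
    location stands in for a file offset or a database primary key.
    positives is the set of positive keys produced by the Bloom filter pass.
    """
    locations: Dict[bytes, List[int]] = defaultdict(list)
    for location, key in unit_records:      # S501: traverse the storage unit
        if key not in positives:            # S502: not positive, so unique
            continue
        locations[key].append(location)     # S503: note the storage location
    return dict(locations)


if __name__ == "__main__":
    records = [(0, b"a"), (16, b"b"), (32, b"a")]
    print(collect_positive_locations(records, {b"a"}))
```

Because only keys in the positive set are tracked, the memory used by this pass scales with the number of Bloom filter positives rather than with the size of the storage unit.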

Abstract

A Bloom filter-based method for deduplicating large volumes of keys, comprising the following steps: acquiring the data to be deduplicated; initializing the deduplication system; storing the data in partitions; Bloom deduplication of the data; traversal statistics of the positive data; and accurate deduplication of the data, which completes the deduplication of the large-volume key data. The present invention also provides a Bloom filter-based system for deduplicating large volumes of keys. Compared with the prior art, the present invention proposes a divide-and-conquer storage method and a positive-data-based accurate deduplication method. Keys are distributed evenly to different storage units according to a hash remainder, which ensures that duplicate keys land in the same data set and also reduces the BitSet space and deduplication cost required by any single Bloom filter, improving the space and time efficiency of the Bloom filter during deduplication. Traversal statistics over the HashSet of positive data then achieve accurate deduplication of the key data, improving deduplication accuracy and key quality.
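
The pipeline summarized in the abstract can be sketched end to end in memory. This is an illustrative sketch, not the patented implementation: the SHA-256 based hashing, the double-hashing scheme for bit positions, the in-memory lists standing in for database tables or files, and the names `BloomFilter`, `route`, and `deduplicate` are all assumptions.

```python
import hashlib
from typing import Iterable, List, Set


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash positions per key."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: bytes) -> List[int]:
        # Double hashing: derive k positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add_if_absent(self, key: bytes) -> bool:
        """Set the key's bits; return True if all were already set."""
        already_present = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                already_present = False
                self.bits[byte] |= mask
        return already_present


def route(key: bytes, num_units: int) -> int:
    """Pick a storage unit by hash remainder, so equal keys share a unit."""
    digest = hashlib.sha256(b"route:" + key).digest()
    return int.from_bytes(digest[:8], "big") % num_units


def deduplicate(keys: Iterable[bytes], num_units: int, m: int, k: int) -> List[List[bytes]]:
    # Partition storage: one list per storage unit.
    units: List[List[bytes]] = [[] for _ in range(num_units)]
    for key in keys:
        units[route(key, num_units)].append(key)

    result: List[List[bytes]] = []
    for unit in units:
        # Bloom pass: positives are true duplicates or false positives.
        bloom = BloomFilter(m, k)
        positives = {key for key in unit if bloom.add_if_absent(key)}

        # Exact pass: only positive keys need to be tracked precisely.
        seen_positive: Set[bytes] = set()
        kept: List[bytes] = []
        for key in unit:
            if key in positives:
                if key in seen_positive:
                    continue  # confirmed duplicate, drop it
                seen_positive.add(key)
            kept.append(key)
        result.append(kept)
    return result


if __name__ == "__main__":
    sample = [b"k1", b"k2", b"k1", b"k3", b"k2", b"k1"]
    print(deduplicate(sample, num_units=3, m=1024, k=3))
```

A key that the Bloom filter never flags can appear only once in its unit, since any repeat would find all of its bits already set. The exact pass therefore only has to track the positive set, which is what keeps the final deduplication accurate without holding every key in a hash set.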

Description

Technical Field

[0001] The invention relates to the technical field of electrical digital data processing, in particular to a Bloom filter-based method and system for deduplicating large volumes of keys.

Background

[0002] With the continuous development of quantum key distribution and quantum key relay technology, servers in practical applications now store large volumes of key data. As the amount of key data continues to grow, removing duplicate keys from it has become an urgent need. Deduplicating keys helps ensure key security and improve key quality. Currently, the Bloom filter algorithm is often used to deduplicate data at this scale. It relies on multiple hash functions and a Bitmap binary vector, and its time and space efficiency are relatively high, but simply using this Bloom filter scheme has a misjudgm...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/215, G06F16/22, H04L9/08
CPC: H04L9/0894, G06F16/215, G06F16/2237
Inventor: 丁胜建, 封连重
Owner: ZHEJIANG QUANTUM TECH CO LTD