Stream type repetitive data detection method

A method for detecting repeated data in data streams, applied in data transformation, electrical digital data processing, instruments, etc. It addresses the unacceptable detection overhead of existing approaches and reduces memory overhead.

Active Publication Date: 2011-11-23
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

The main disadvantage of this method is that the overhead of detection is unacceptable when the value of N is too large


Embodiment Construction

[0018] The invention uses Bloom filter techniques to detect repeated data in a data stream. Before describing the inventive solution, the working principle of the Bloom filter is briefly introduced.

[0019] A Bloom filter is a space-efficient probabilistic data structure. It uses a bit array to represent a set compactly and can test whether an element belongs to that set. This efficiency comes at a price: when testing membership, the filter may report an element that is not in the set as belonging to it (a false positive). A Bloom filter is therefore unsuitable for applications that require zero error. Where a low error rate can be tolerated, it saves a great deal of storage space at the cost of very few errors.

[0020] Consider how a Bloom filter uses a bit array to represent a set. In the initial state, the Bloom filter is a bit array containing m bi...
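
The basic Bloom filter behaviour described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the class name `BloomFilter`, the parameter `k` (number of hash positions per element), and the use of salted SHA-256 digests as hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; names and hashing scheme are assumptions."""

    def __init__(self, m, k):
        self.m = m                  # number of bits in the array
        self.k = k                  # number of hash positions per element (k < 256)
        self.bits = bytearray(m)    # one byte per bit, all 0 initially, for clarity

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index (assumed scheme).
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set every bit the element maps to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if all mapped bits are set; this may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("packet-42")
print("packet-42" in bf)   # an added element is always reported as present
```

An element that was added is never reported as absent, whereas an element that was never added can occasionally be reported as present when its positions happen to have been set by other elements.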


Abstract

The invention provides a stream-type repetitive data detection method. The method constructs a TBFA (Timing Bloom Filter Array) to detect repetitive data flexibly and efficiently under a sliding window model. The TBFA consists of a plurality of TBFs (Timing Bloom Filters) with the same structure, and each TBF comprises a Bloom filter and a separate timer array used for storing timestamps. The whole TBFA works in a circular first-in first-out mode, discarding old elements that have left the data stream monitoring window while recording new elements. Because the method is implemented under the sliding window model and element monitoring is accurate to a single element, statistical results based on it have good stability. In addition, part of the timer arrays in the TBFA can be unloaded to disk, which reduces memory overhead. Theoretical analysis and experimental data show that more than 95% of query efficiency can be maintained when the DCBA (Detached Counting Bloom filters Array) loads less than 10% of its data contents into memory. The method provided by the invention is therefore superior to the traditional technical scheme in space efficiency and scalability.
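
The idea of pairing a Bloom filter with stored timestamps can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented TBFA: it uses a single filter rather than an array of TBFs, keeps the timer array entirely in memory, measures time as an element count, and treats stale timestamps as absent instead of explicitly evicting them. The names `TimingBloomFilter`, `seen_recently`, and `window` are hypothetical.

```python
import hashlib


class TimingBloomFilter:
    """Single-filter sketch of timestamp-based duplicate detection (assumed design)."""

    def __init__(self, m, k, window):
        self.m = m                   # number of slots
        self.k = k                   # hash positions per element (k < 256)
        self.window = window         # how many preceding elements count as "recent"
        self.stamps = [None] * m     # timer array: last arrival index per slot
        self.clock = 0               # number of elements observed so far

    def _positions(self, item):
        # Salted SHA-256 as a stand-in for k independent hash functions.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_recently(self, item):
        """Report whether item appears to lie in the window, then record it."""
        self.clock += 1
        oldest = self.clock - self.window    # earliest arrival index still in window
        positions = list(self._positions(item))
        duplicate = all(
            self.stamps[p] is not None and self.stamps[p] >= oldest
            for p in positions
        )
        for p in positions:
            self.stamps[p] = self.clock      # refresh timestamps for this arrival
        return duplicate


tbf = TimingBloomFilter(m=4096, k=3, window=2)
for element in ["a", "b", "a", "c", "d", "a"]:
    print(element, tbf.seen_recently(element))
```

Storing an arrival index per slot lets elements age out of the window without deleting anything explicitly, at the cost of a wider timer array than a plain bit array. As with any Bloom filter, a non-duplicate can still be reported as a duplicate when other recent elements have refreshed all of its slots.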

Description

Technical field

[0001] The invention belongs to the field of computer data transmission and storage systems, and in particular relates to a method for deleting duplicate data in data streams.

Background

[0002] The expansion of the Internet has led to explosive, geometric growth in the volume of data. Turing Award winner Jim Gray pointed out that the amount of new data added every 18 months in the network environment equals the sum of all data produced before it. With the continuous development of applications such as digital libraries, e-commerce, medical imaging, bioengineering, scientific computing, virtual reality, digital earth, and website multimedia, there is demand for high-performance, high-reliability mass information storage systems. The scale of future storage systems will reach the PB level or even the EB level. The transmission and storage of massive data places very high requirements on network systems, storage devices a...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F5/06
Inventor: 周可, 魏建生, 张攀峰, 李春花, 王桦
Owner: HUAZHONG UNIV OF SCI & TECH