Method for estimating top-n cardinal number data in high-speed data flow

A high-speed data flow and top-n technology, applied in data classification, input data processing, electrical digital data processing, etc., that achieves stable time efficiency, optimized space efficiency, and a simple method.

Inactive Publication Date: 2017-03-15
成都知道创宇信息技术有限公司

AI Technical Summary

Problems solved by technology

[0004] When computing top-n, only the first n entries need to be ranked, yet all of the data has to be kept in the hash table. As data becomes increasingly complex, this turns into a huge storage overhead.
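To make the storage problem concrete, the following is a minimal sketch of the exact hash-table approach, assuming the stream arrives as (type, item) pairs. The function name `exact_top_n` and the sample stream are illustrative and do not come from the patent text.

```python
from collections import defaultdict


def exact_top_n(stream, n):
    """Exact top-n data types by cardinality, using a hash table of sets.

    `stream` yields (data_type, item) pairs; this layout is an assumption
    made for illustration only.
    """
    seen = defaultdict(set)  # one set of distinct items per data type
    for data_type, item in stream:
        seen[data_type].add(item)
    # Every type and every distinct item is retained, even though only
    # the n largest cardinalities are reported at the end.
    ranked = sorted(seen.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(t, len(items)) for t, items in ranked[:n]]


stream = [("a", 1), ("a", 2), ("a", 2), ("b", 7), ("c", 1), ("c", 2), ("c", 3)]
result = exact_top_n(stream, 2)
print(result)
```

Memory here grows with the total number of distinct (type, item) pairs, which is the overhead that [0004] describes.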



Embodiment Construction

[0027] The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. In the data stream, neither the identity of the non-top-n data types nor their actual cardinality is of interest, and the cardinality of a non-top-n data type is much smaller than that of a top-n data type. Even if such cardinalities are mistakenly added together, the cardinality precision of the top-n data types is not significantly harmed.

[0028] The present invention uses a data structure called the "HyperLogLog Sketch matrix", denoted S, with width m and height n, in which each element is an HLL counter. Correspondingly, there are n mutually independent hash functions, each with hash values in the range 1 to m, denoted f_1, f_2, ..., f_n. As shown in Figure 2, when new data D appears, the following steps are performed:

[0029] 1. Classify D by business (service) and denote its type as X;

[0030] 2. ...
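The hash functions f_1, ..., f_n of [0028] can be sketched as follows. This is only an illustration: the patent text shown here does not say how the hash functions are constructed, so the salted BLAKE2b construction and the names `row_hash` and `cells_for` are assumptions. Salting one hash per row only approximates mutual independence.

```python
import hashlib


def row_hash(i, type_x, m):
    """Hypothetical f_i: map a data type to a column in 1..m for row i (1-based)."""
    digest = hashlib.blake2b(
        type_x.encode("utf-8"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),  # a different salt per row (assumption)
    ).digest()
    return int.from_bytes(digest, "little") % m + 1


def cells_for(type_x, m, n):
    """Cells (i, f_i(X)) of the m-wide, n-high matrix S that type X maps to."""
    return [(i, row_hash(i, type_x, m)) for i in range(1, n + 1)]


cells = cells_for("login", m=16, n=4)
print(cells)
```

Each data type touches exactly one cell per row, so the work per incoming item is n hash evaluations regardless of how many types have been seen.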


Abstract

The invention provides a method for estimating top-n cardinality data in a high-speed data flow. The method comprises the following steps: a HyperLogLog Sketch matrix data structure is defined and denoted S, with width m and height n, in which each element is an HLL counter; correspondingly, n mutually independent hash functions with hash values in the range 1 to m are denoted f1, f2, ..., fn; when new data D appears, it is classified by service and its type is denoted X; xi = fi(X) is calculated for i = 1, 2, ..., n; D is counted in the HLL counters at S(1, x1), S(2, x2), ..., S(n, xn), yielding the updated cardinalities Y1, Y2, ..., Yn, from which an estimated cardinality Y is obtained; the data type X and the estimated cardinality Y are updated into top-n. The method is simple and convenient to use, can be implemented in parallel in hardware, can calculate the cardinality of a data type without storing the data type, and has good security.
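The update procedure in the abstract can be sketched end to end as below. Several parts are assumptions rather than the patent's stated method: the abstract does not say how Y is derived from Y1, ..., Yn (the minimum is used here, in the style of a Count-Min sketch), how the hash functions are built (salted BLAKE2b here), or how the top-n structure is maintained (a small dictionary with eviction of the smallest entry here). The `HLL` class is a simplified HyperLogLog without bias-correction tables, and all class and function names are hypothetical.

```python
import hashlib
import math
import random


class HLL:
    """Simplified HyperLogLog counter with 2**p registers."""

    def __init__(self, p=8):
        self.p = p
        self.r = 1 << p  # number of registers
        self.reg = bytearray(self.r)

    def add(self, item):
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - self.p)  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.r)  # constant for r >= 128
        raw = alpha * self.r ** 2 / sum(2.0 ** -v for v in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.r and zeros:  # small-range (linear counting) correction
            return self.r * math.log(self.r / zeros)
        return raw


class HLLSketchMatrix:
    """Matrix S of width m and height n whose elements are HLL counters."""

    def __init__(self, m, n, p=8):
        self.m, self.n = m, n
        self.S = [[HLL(p) for _ in range(m)] for _ in range(n)]

    def _col(self, i, type_x):
        # Hypothetical f_i, returning a 0-based column for row i.
        d = hashlib.blake2b(type_x.encode(), digest_size=8,
                            salt=i.to_bytes(8, "little")).digest()
        return int.from_bytes(d, "little") % self.m

    def update(self, type_x, item):
        ys = []
        for i in range(self.n):
            counter = self.S[i][self._col(i, type_x)]
            counter.add(item)  # count D in S(i, x_i)
            ys.append(counter.estimate())  # Y_i
        return min(ys)  # ASSUMPTION: Y = min(Y_1, ..., Y_n)


def update_top(top, type_x, y, size):
    """ASSUMPTION: keep the `size` largest estimates in a plain dict."""
    top[type_x] = y
    if len(top) > size:
        del top[min(top, key=top.get)]


# Synthetic stream: types A, B, C with 500, 200 and 20 distinct items,
# each item appearing twice, in a fixed pseudo-random order.
stream = [(t, f"{t}-{k}") for t, card in (("A", 500), ("B", 200), ("C", 20))
          for k in range(card)] * 2
random.Random(0).shuffle(stream)

sketch = HLLSketchMatrix(m=32, n=3)
top = {}
for type_x, item in stream:
    update_top(top, type_x, sketch.update(type_x, item), size=2)
print(top)
```

Taking the minimum is a natural choice because a cell shared by several types can only overestimate the cardinality of each of them, but other combinations are possible. Note that the matrix itself never stores a type name; only the small top-n structure does.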

Description

Technical field

[0001] The invention relates to the field of high-speed data flow calculation, in particular to a method for estimating top-n cardinality data in a high-speed data flow.

Background technique

[0002] With the development of modern Internet technology and sensor technology, the scale of data is increasing day by day. In many production scenarios, data is generated rapidly and has complex content, far outpacing the growth of hardware processing performance, which calls for a way to process these data. The method used in small-scale data environments, solving cardinality and top-n exactly with a hash table, can no longer cope with such high-speed, large-scale data flows, so data flow algorithms came into being. The HyperLogLog Counting algorithm, which can estimate cardinality with high precision using only a very small storage space, is one of the widely used algorithms.

[0003] The method often adopted in the prior art is...
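As a rough illustration of the "very small storage space" attributed to HyperLogLog in [0002], the commonly cited relative standard error for a counter with r registers is shown below. The symbol r is introduced here to avoid confusion with the matrix width m, and the choice of 2^14 registers of 6 bits each is an illustrative configuration, not one taken from the patent.

```latex
\sigma_{\mathrm{rel}} \approx \frac{1.04}{\sqrt{r}}, \qquad
r = 2^{14}: \quad \sigma_{\mathrm{rel}} \approx \frac{1.04}{128} \approx 0.81\%, \qquad
\text{memory} \approx 2^{14} \times 6\ \text{bits} = 12\ \text{KiB}
```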


Application Information

Patent Type & Authority Applications(China)
IPC: IPC(8): G06F7/24
CPC: G06F7/24, G06F2207/228
Inventor 罗意, 王小虎, 石涵, 王春鹏, 赵晨晖
Owner 成都知道创宇信息技术有限公司