Method and system for eliminating duplication during data count as well as server and storage medium

A data counting and data storage technology, applied in the field of big data, that addresses the low accuracy of existing deduplication methods. It improves deduplication accuracy, makes duplicate checking more efficient, and reduces the probability that data is eliminated by mistake.

Active Publication Date: 2018-11-13
WUHAN DOUYU NETWORK TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] In view of this, embodiments of the present invention provide a method, system, server, and storage medium for deduplication during data counting, to address the low accuracy of existing deduplication methods.

Examples

Embodiment 1

[0028] Referring to Figure 1, a schematic flowchart of a data counting and deduplication method provided by an embodiment of the present invention, the method includes the following steps:

[0029] S101. After receiving a deduplication call request from the client, use the Dubbo component to perform load balancing and thereby assign a server to carry out the deduplication processing.

[0030] The client provides local services to the user and can request the deduplication service from the server. Here the client may be a deduplication request program on the client computer that is able to invoke the deduplication component on the server side. After receiving a request, the server verifies its legitimacy and then assigns a server through the load balancing of the Dubbo component. Dubbo is a distributed service framework that provides transparent remote procedure call (RPC) service invocation and has soft load balancing and fault tolerance mechanisms. S...
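
The request handling in [0030] can be sketched roughly as follows. This is a minimal Python illustration and not Dubbo itself, which is a Java framework. The `is_valid` check, the request field names, the server addresses, and the round-robin policy are all assumptions made for the example; the source does not say which load-balancing strategy is configured.

```python
from itertools import cycle

# Hypothetical pool of deduplication servers; the addresses are placeholders.
SERVERS = ["dedup-1:20880", "dedup-2:20880", "dedup-3:20880"]
_next_server = cycle(SERVERS)

# Fields assumed to be required in a deduplication call request.
REQUIRED_FIELDS = ("database", "partition", "level", "content")


def is_valid(request: dict) -> bool:
    """Stand-in legitimacy check: all required fields present, level >= 1."""
    if any(field not in request for field in REQUIRED_FIELDS):
        return False
    return isinstance(request["level"], int) and request["level"] >= 1


def dispatch(request: dict) -> str:
    """Validate the request, then assign it to the next server in rotation."""
    if not is_valid(request):
        raise ValueError("illegal deduplication request")
    return next(_next_server)
```

Round-robin is used here only because it is easy to follow; a real deployment would rely on whichever strategy the framework is configured with.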

Embodiment 2

[0043] On the basis of Figure 1, and with reference to Figure 2, step S102 (creating the deduplication service data storage unit) is described in detail as follows:

[0044] Figure 2 is a flowchart of step S102 provided by an embodiment of the present invention. It includes steps S1021, S1022, S1023, and S1024; the numbering does not imply an order of execution.

[0045] In step S1021, by parsing the request parameters, the database name, partition data, and deduplication level can be obtained.
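
As a rough illustration of step S1021, the parsing could look like the Python sketch below. The parameter keys (`database`, `partition`, `level`) and the `DedupRequest` type are hypothetical names chosen for the example; the source does not specify the request format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupRequest:
    database: str   # database name
    partition: str  # partition data
    level: int      # deduplication level


def parse_request(params: dict) -> DedupRequest:
    """Extract the three values named in step S1021 from raw request parameters."""
    level = int(params["level"])
    if level < 1:
        raise ValueError("deduplication level must be at least 1")
    return DedupRequest(
        database=str(params["database"]),
        partition=str(params["partition"]),
        level=level,
    )
```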

[0046] Before anything is stored in Redis, the Redis storage component must be queried to determine whether the data has already been stored, so that the same data is not stored repeatedly and does not occupy extra memory. Specifically, the database name and partition data content are taken from the request parameters and compared, by traversal, against the data in the Redis storage component; such interference can be eliminated through step S1022.

[0047] When there is no corresponding database name and partition data, creat...
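
The check-then-create behaviour described in [0046] and [0047] might be sketched as below. This is an assumption-laden illustration: a plain `dict` stands in for Redis, the key scheme `dedup:<database>:<partition>:<n>` and the bitmap size are invented for the example, and one bitmap per deduplication level is created, following the abstract's statement that the number of bitmaps corresponds to the level. A direct key lookup replaces the traversal mentioned in the text.

```python
BITMAP_BITS = 1 << 20  # assumed bitmap size; the source does not give one


def storage_keys(database: str, partition: str, level: int) -> list[str]:
    """Hypothetical key scheme: one bitmap key per deduplication level."""
    return [f"dedup:{database}:{partition}:{i}" for i in range(1, level + 1)]


def ensure_storage(store: dict, database: str, partition: str, level: int) -> bool:
    """Create the bitmaps only if absent. Returns True when anything was created."""
    created = False
    for key in storage_keys(database, partition, level):
        if key not in store:  # look up first, so existing data is not stored twice
            store[key] = bytearray(BITMAP_BITS // 8)
            created = True
    return created
```

Calling `ensure_storage` a second time with the same arguments leaves the store unchanged and returns `False`, which is the property the paragraph above is after.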

Embodiment 3

[0052] On the basis of Figure 1, and with reference to Figure 3, the process of creating the deduplication calculation unit in step S103 is described in detail as follows:

[0053] After the application request parameters are parsed, step S103 needs to obtain the configured deduplication level parameter. The deduplication counting itself is carried out in S301 and S302 using the Bloom Filter algorithm. For example, when the deduplication level is 1, the hash value of a group of data to be deduplicated is calculated, and the bitmap of the corresponding Redis storage unit is located according to the hash result and queried. If the data does not exist, the bits whose value is 0 at the corresponding bitmap positions are set to 1, the data is added to the storage unit for deduplication results, and the deduplication result is returned. Each time, the query result is returned according to this query process; if any bit returns a value of 0, it indicates that the query data does not...
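
The level-based Bloom Filter counting can be illustrated with the following Python sketch. It reflects one reading of the text and the abstract, stated here as an assumption: each level has its own bitmap and its own group of hash functions, and an item that level 1 reports as already seen is re-checked against the next level before it is treated as a duplicate. In-memory `bytearray` bitmaps stand in for Redis bitmaps, and the class name, the SHA-256-based hashing, and the default sizes are choices made for the example.

```python
import hashlib


class LeveledBloomFilter:
    """Bloom filter with one bitmap and one hash group per deduplication level."""

    def __init__(self, level: int = 1, bits: int = 1 << 16, hashes: int = 3):
        self.level = level
        self.bits = bits
        self.hashes = hashes
        self.bitmaps = [bytearray(bits // 8) for _ in range(level)]
        self.count = 0  # number of items judged distinct so far

    def _positions(self, item: str, lvl: int) -> list[int]:
        # Salting with the level and hash index gives each level its own hash group.
        return [
            int.from_bytes(
                hashlib.sha256(f"{lvl}:{i}:{item}".encode()).digest()[:8], "big"
            ) % self.bits
            for i in range(self.hashes)
        ]

    def _seen(self, item: str, lvl: int) -> bool:
        bitmap = self.bitmaps[lvl]
        # Any 0 bit proves the item was never recorded at this level.
        return all(bitmap[p // 8] & (1 << (p % 8)) for p in self._positions(item, lvl))

    def add(self, item: str) -> bool:
        """Return True if the item is new (and count it), False if a duplicate."""
        # Levels are checked in order; `all` stops at the first level that
        # reports a 0 bit, so higher levels are consulted only on a hit.
        duplicate = all(self._seen(item, lvl) for lvl in range(self.level))
        if duplicate:
            return False
        for lvl in range(self.level):
            bitmap = self.bitmaps[lvl]
            for p in self._positions(item, lvl):
                bitmap[p // 8] |= 1 << (p % 8)
        self.count += 1
        return True
```

With more than one level, an item is discarded as a duplicate only when every level's bitmap agrees, which is why a higher level lowers the chance of eliminating data by mistake. Against a real Redis instance, the bit reads and writes would map onto the `GETBIT` and `SETBIT` commands.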

Abstract

The invention discloses a method and system for deduplication during data counting, as well as a server and a storage medium, applicable to data deduplication in big data. The method comprises the following steps: receiving a call request and performing load balancing using a Dubbo component; parsing the request and, according to the preset deduplication level parameter in the request, creating a corresponding number of Redis data storage bitmaps on the server; and obtaining the deduplication content parameter and the deduplication level parameter from the request and calculating a deduplication result using the Bloom Filter algorithm. When the deduplication level is higher than level 1 and the returned deduplication result is 0, one group of hash functions is calculated again and deduplication is performed again using the Bloom Filter algorithm. In this method, load balancing is performed by the Dubbo component and counting deduplication is performed by the Bloom Filter algorithm at the level that was preset, so that data can be processed efficiently and quickly, the probability that data is eliminated by mistake is greatly reduced, and deduplication accuracy is improved.

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a data counting and deduplication method, system, server and storage medium.

Background

[0002] With the popularity of the Internet, network data has grown exponentially, and the sheer volume of data is a major test for deduplication technology. For counting data such as user visits, user comments, and user speeches, traditional simple group counting is clearly difficult to apply to tens or hundreds of millions of records.

[0003] At present, the Bloom Filter algorithm is often used for counting and deduplication of such large data sets. It uses multiple hash functions and bitmap storage to deduplicate data, but it can eliminate data by mistake, which lowers deduplication accuracy and makes it difficult to guarantee reliable results.

Contents of the invention

[0004] In view of this, embodiments of the present inve...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/54, G06F9/50, G06F17/30
CPC: G06F9/5083, G06F9/547
Inventor: 王毅, 张文明, 陈少杰
Owner: WUHAN DOUYU NETWORK TECH CO LTD