A method and device for deduplication of big data

A big data deduplication technology, applied to big data deduplication methods and devices, that addresses problems such as poor accuracy, information pollution, and data redundancy, and achieves high deduplication accuracy, high precision, and good versatility.

Active Publication Date: 2021-02-02

BEIJING SUREKAM CORP

Cites: 9 | Cited by: 0

Problems solved by technology

[0004] However, simple keyword deduplication in the related art cannot eliminate approximate data, so the accuracy of deduplication is poor.

Examples

Embodiment 1

[0049] An embodiment of the present invention provides a big data deduplication method. Referring to Figure 1, the network architecture on which the method is based includes a data collection device, a server cluster, and Redis server pairs. The data collection device collects the data to be deduplicated and uploads it to the server cluster. The server cluster includes multiple servers, and the execution subject of this embodiment is a server. The cluster distributes the deduplication work across as many nodes in the cluster environment as possible, so as to maximize the available computing capacity. Multiple Redis server pairs are provided in this embodiment, and each pair includes a Redis master server and a Redis backup server. The Redis server pair stores the intermediate data of the big data deduplication process, and passes through the master server p...
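The architecture above can be illustrated with a minimal sketch of how a cluster server might choose one of several Redis server pairs. The routing rule is an assumption: the source does not say how a pair is selected, so this sketch uses a stable hash of the data string. The class `RedisServerPair`, the function `select_pair`, and the host names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RedisServerPair:
    """One Redis master server plus its backup (addresses are placeholders)."""
    master: str
    backup: str


def select_pair(data_string: str, pairs: list[RedisServerPair]) -> RedisServerPair:
    """Pick a pair with a stable hash so equal data strings land on the same pair."""
    # Assumption: hash-based routing; the source does not state the actual rule.
    index = zlib.crc32(data_string.encode("utf-8")) % len(pairs)
    return pairs[index]


# Hypothetical deployment with three master/backup pairs.
pairs = [
    RedisServerPair("redis-master-0:6379", "redis-backup-0:6379"),
    RedisServerPair("redis-master-1:6379", "redis-backup-1:6379"),
    RedisServerPair("redis-master-2:6379", "redis-backup-2:6379"),
]

chosen = select_pair("plate-A12345", pairs)
```

Routing by data string means that any two records with the same data string are compared on the same pair, regardless of which cluster node handles them.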

Embodiment 2

[0098] Referring to Figure 5, this embodiment of the present invention provides a big data deduplication device for implementing the big data deduplication method provided in Embodiment 1. The device includes:

[0099] a receiving module 20, configured to receive data to be deduplicated, where the data to be deduplicated includes an occurrence time and a data string;

[0100] a generating module 21, configured to generate the Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string; and

[0101] a determination module 22, configured to insert the Redis key-value pair into the Redis server pair and to determine, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data.

[0102] The generating module 21 includes:

[0103] a generation unit, configured to generate the Redis key corresponding to the data to be deduplicated according to the ...
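The three modules can be sketched as three small functions. This is an illustration under stated assumptions, not the patented implementation: `InMemoryStore` is a hypothetical stand-in for a Redis server pair so that the example runs without a server, its `set_if_absent` method behaves like an insert-if-absent operation (comparable to Redis `SETNX`), and the key layout (data string plus the time truncated to the second) is assumed because the source text is truncated at that point.

```python
from datetime import datetime


class InMemoryStore:
    """Hypothetical stand-in for a Redis server pair (no network needed)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set_if_absent(self, key: str, value: str) -> bool:
        """Insert only if the key is new; True means the pair was inserted."""
        if key in self._data:
            return False
        self._data[key] = value
        return True


def receive(raw: dict) -> tuple[datetime, str]:
    """Receiving module: extract the occurrence time and the data string."""
    return datetime.fromisoformat(raw["time"]), raw["data"]


def generate(occurred_at: datetime, data_string: str) -> tuple[str, str]:
    """Generating module: build a key-value pair from the time and data string."""
    # Assumed key layout: data string, then the time truncated to the second.
    key = f"{data_string}|{occurred_at:%Y%m%d%H%M%S}"
    return key, occurred_at.isoformat()


def is_duplicate(store: InMemoryStore, key: str, value: str) -> bool:
    """Determination module: a rejected insert means the key was already present."""
    return not store.set_if_absent(key, value)


store = InMemoryStore()
record = {"time": "2021-02-02T08:00:00", "data": "plate-A12345@checkpoint-7"}
first = is_duplicate(store, *generate(*receive(record)))   # False: first sighting
second = is_duplicate(store, *generate(*receive(record)))  # True: same key again
```

Because the insert and the duplicate check are a single operation, the store's return value alone decides whether the record is a duplicate.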

Embodiment 3

[0111] An embodiment of the present invention provides a big data deduplication device. The device includes one or more processors and one or more storage devices, and one or more programs are stored in the one or more storage devices. When the one or more programs are loaded and executed by the one or more processors, the big data deduplication method provided in Embodiment 1 is implemented.

[0112] In this embodiment of the present invention, big data is deduplicated by a server cluster, and data operations are distributed across as many nodes in the cluster environment as possible. In addition, Redis, a key-value database that supports highly concurrent access, is used for deduplication, which keeps the space and time cost of the deduplication operation to a minimum. By extending the occurrence time of the data to be deduplicated to multiple adjacent times, it can effectively filter out approximat...

Abstract

The invention discloses a method and device for deduplication of big data. The method includes: receiving data to be deduplicated, where the data to be deduplicated includes an occurrence time and a data string; generating, according to the occurrence time and the data string, the Redis key-value pair corresponding to the data to be deduplicated; and inserting the Redis key-value pair into the Redis server pair and determining, according to the result returned by the Redis server pair, whether the data to be deduplicated is duplicate data. The invention deduplicates big data with a server cluster and distributes data operations across as many nodes in the cluster environment as possible. In addition, Redis, a key-value database that supports highly concurrent access, is used for deduplication, which keeps the space and time cost of the deduplication operation to a minimum. By extending the occurrence time of the data to be deduplicated to multiple adjacent times, the method effectively filters out approximate data that is close in time. It offers high deduplication accuracy, high precision, and good versatility, and can be applied to big data application scenarios whose data has time continuity.
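The idea of extending the occurrence time to adjacent times can be sketched as follows. The details are assumptions, since the abstract does not give them: times are integer seconds, the window is two seconds on each side, keys have the form `data_string|time`, and a plain Python `set` stands in for the Redis server pair. The names `adjacent_keys` and `check_and_record` are hypothetical.

```python
def adjacent_keys(data_string: str, occurred_at: int, window: int) -> list[str]:
    """Keys for the occurrence time and every adjacent second within the window."""
    return [
        f"{data_string}|{t}"
        for t in range(occurred_at - window, occurred_at + window + 1)
    ]


def check_and_record(seen: set[str], data_string: str, occurred_at: int,
                     window: int = 2) -> bool:
    """Return True if the record is a duplicate or near-duplicate in time."""
    if f"{data_string}|{occurred_at}" in seen:
        # An earlier record with the same data string covered this second.
        return True
    # New record: reserve its own second and the adjacent ones.
    seen.update(adjacent_keys(data_string, occurred_at, window))
    return False


seen: set[str] = set()
uploads = [
    ("plate-A12345", 100),  # first sighting
    ("plate-A12345", 101),  # one second later: approximate duplicate
    ("plate-A12345", 102),  # still inside the window
    ("plate-A12345", 105),  # outside the window: treated as a new pass
    ("plate-B67890", 101),  # different data string: unrelated record
]
results = [check_and_record(seen, plate, t) for plate, t in uploads]
# results == [False, True, True, False, False]
```

In this sketch the window size trades accuracy against storage: a wider window filters more near-duplicates but writes more keys per accepted record.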

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a big data deduplication method and device.

Background

[0002] At present, big data technology is widely used in various fields. In some big data application scenarios, the data has a certain time continuity. For example, in traffic big data, when a vehicle passes the reader at a checkpoint, the reader uploads the vehicle's passing record to the big data platform, and the passing record has a certain time continuity. If the vehicle slows down or is stationary at the checkpoint, the reader repeatedly uploads the vehicle's passing records within a short period of time, causing the big data platform to store a large amount of repeated or approximate data. Therefore, the big data platform needs to deduplicate the data it receives.

[0003] At present, a data deduplication method is provided in the related art, that is, each time a data is received in a dedupl...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F16/215, G06F16/2455

Inventor: 郭冰, 程广艺, 罗天成, 夏曙东

Owner: BEIJING SUREKAM CORP
