
A method and device for deduplication of big data

Big data deduplication technology, applied in the field of data processing, addresses problems such as poor deduplication accuracy, information pollution, and data redundancy, and achieves high deduplication accuracy, high precision, and good versatility.

Active Publication Date: 2021-02-02
BEIJING SUREKAM CORP

AI Technical Summary

Problems solved by technology

[0004] However, the simple keyword deduplication used in related technologies cannot eliminate approximate data, so its deduplication accuracy is poor.
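As a hedged illustration of this limitation, the sketch below assumes that "simple keyword deduplication" means exact matching on the whole record; the source does not define it further. The `exact_dedup` function, the record layout, and the sample plate string are all hypothetical.

```python
from typing import List, Tuple

# A record is (data string, occurrence time in seconds). Both fields are
# illustrative assumptions, not a format taken from the patent.
Record = Tuple[str, int]


def exact_dedup(records: List[Record]) -> List[Record]:
    """Drop records that are exactly equal, keeping first-seen order."""
    return list(dict.fromkeys(records))


uploads = [("PLATE123", 100), ("PLATE123", 100), ("PLATE123", 101)]
kept = exact_dedup(uploads)
# The exact repeat at time 100 is removed, but the near-duplicate at
# time 101 survives because its occurrence time differs by one second.
```

Under this assumption, a reader that uploads the same vehicle once per second would still leave one stored record per second.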



Examples


Embodiment 1

[0049] An embodiment of the present invention provides a big data deduplication method. Referring to Figure 1, the network architecture on which the method is based includes a data collection device, a server cluster, and Redis server pairs. The data collection device collects the data to be deduplicated and uploads it to the server cluster. The server cluster includes multiple servers, and a server in the cluster is the execution subject of this embodiment. The server cluster distributes the deduplication work across different nodes of the cluster environment as far as possible, so as to obtain the maximum computing capacity. Multiple groups of Redis server pairs are set up in this embodiment. Each Redis server pair includes a Redis master server and a Redis backup server, stores the intermediate data of the big data deduplication process, and passes through the master server p...
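The excerpt does not say how a server chooses among the multiple groups of Redis server pairs. One possible arrangement, shown here purely as an assumption, is to hash the data string so that records with the same data string always reach the same pair. The names `RedisServerPair` and `pick_pair` and the host addresses are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RedisServerPair:
    master: str  # hypothetical address of the Redis master server
    backup: str  # hypothetical address of the Redis backup server


def pick_pair(pairs: List[RedisServerPair], data_string: str) -> RedisServerPair:
    """Map a data string to one pair with a stable hash (an assumed policy)."""
    if not pairs:
        raise ValueError("at least one Redis server pair is required")
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    index = zlib.crc32(data_string.encode("utf-8")) % len(pairs)
    return pairs[index]


pairs = [
    RedisServerPair("redis-master-0:6379", "redis-backup-0:6379"),
    RedisServerPair("redis-master-1:6379", "redis-backup-1:6379"),
]
chosen = pick_pair(pairs, "PLATE123")
```

Because the hash is deterministic, any server in the cluster would select the same pair for the same data string, which a shared duplicate check needs.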

Embodiment 2

[0098] Referring to Figure 5, an embodiment of the present invention provides a big data deduplication device, which is used to implement the big data deduplication method provided in Embodiment 1 above. The device includes:

[0099] A receiving module 20, used to receive data to be deduplicated, where the data to be deduplicated includes an occurrence time and a data string;

[0100] A generating module 21, used to generate the Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string;

[0101] A determination module 22, used to insert the Redis key-value pair into the Redis server pair and to determine whether the data to be deduplicated is duplicate data according to the result returned by the Redis server pair.

[0102] The generating module 21 includes:

[0103] A generation unit, used to generate the Redis key corresponding to the data to be deduplicated according to the ...
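The three modules above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the key layout is hypothetical because the source text is truncated, and set-if-absent is assumed as the insert operation whose return result marks a duplicate. `FakeRedis` is an in-memory stand-in so the sketch runs without a server; the redis-py client exposes a `setnx(name, value)` call with the same return contract.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FakeRedis:
    """In-memory stand-in for a Redis server; models set-if-absent only."""
    store: Dict[str, str] = field(default_factory=dict)

    def setnx(self, key: str, value: str) -> bool:
        # True if the key was newly set, False if it already existed.
        if key in self.store:
            return False
        self.store[key] = value
        return True


@dataclass
class Record:
    occurrence_time: int  # assumed: Unix time in whole seconds
    data_string: str      # e.g. a plate number plus a checkpoint id


def generate_key_value(record: Record) -> Tuple[str, str]:
    """Generating module: hypothetical layout '<data string>:<time>'."""
    key = f"{record.data_string}:{record.occurrence_time}"
    return key, str(record.occurrence_time)


def is_duplicate(redis: FakeRedis, record: Record) -> bool:
    """Determination module: a failed insert is read as 'duplicate'."""
    key, value = generate_key_value(record)
    return not redis.setnx(key, value)


server = FakeRedis()
first = is_duplicate(server, Record(1612224000, "PLATE123@gate7"))
second = is_duplicate(server, Record(1612224000, "PLATE123@gate7"))
```

With a single set-if-absent call, the insert and the duplicate check are one operation, so two servers handling the same record cannot both see it as new.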

Embodiment 3

[0111] An embodiment of the present invention provides a big data deduplication device. The device includes one or more processors and one or more storage devices, and one or more programs are stored in the one or more storage devices. When the one or more programs are loaded and executed by the one or more processors, the big data deduplication method provided in Embodiment 1 above is implemented.

[0112] In the embodiment of the present invention, deduplication of big data is performed through a server cluster, and data operations are distributed across different nodes of the cluster environment as far as possible. In addition, Redis, a key-value database that supports highly concurrent access, is used for deduplication, which keeps the system resources occupied by the deduplication operation to a minimum in both space and time. By extending the occurrence time of the data to be deduplicated to multiple adjacent times, it can effectively filter out approximat...
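The idea of extending the occurrence time to adjacent times can be illustrated as follows. The excerpt is truncated before it gives details, so the window radius, the key layout, and the helper names `adjacent_keys` and `check_and_mark` are all assumptions, and a plain Python set stands in for the Redis key space.

```python
from typing import List, Set


def adjacent_keys(data_string: str, occurrence_time: int, radius: int) -> List[str]:
    """Keys for every second in [time - radius, time + radius] (assumed layout)."""
    return [
        f"{data_string}:{t}"
        for t in range(occurrence_time - radius, occurrence_time + radius + 1)
    ]


def check_and_mark(seen: Set[str], data_string: str,
                   occurrence_time: int, radius: int = 2) -> bool:
    """Return True if the record is a duplicate or a near-duplicate in time."""
    own_key = f"{data_string}:{occurrence_time}"
    if own_key in seen:
        return True
    # First sighting: also reserve the neighbouring times so that records
    # arriving within `radius` seconds are later recognised as approximate.
    seen.update(adjacent_keys(data_string, occurrence_time, radius))
    return False


seen: Set[str] = set()
results = [check_and_mark(seen, "PLATE123", t) for t in (100, 101, 110)]
```

In this sketch the record at time 101 falls inside the window reserved by the record at time 100 and is flagged, while the record at time 110 lies outside it and is kept.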


Abstract

The invention discloses a method and device for deduplication of big data. The method includes: receiving data to be deduplicated, where the data to be deduplicated includes an occurrence time and a data string; generating the Redis key-value pair corresponding to the data to be deduplicated according to the occurrence time and the data string; and inserting the Redis key-value pair into a Redis server pair and determining whether the data to be deduplicated is duplicate data according to the result returned by the Redis server pair. The invention deduplicates big data through a server cluster and distributes data operations across different nodes of the cluster environment as far as possible. In addition, Redis, a key-value database that supports highly concurrent access, is used for deduplication, which keeps the system resources occupied by the deduplication operation to a minimum in both space and time. By extending the occurrence time of the data to be deduplicated to multiple adjacent times, the method can effectively filter out approximate data that is close in time. It offers high deduplication accuracy, high precision, and good versatility, and can be applied to various big data application scenarios whose data has time continuity.

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a big data deduplication method and device.

Background

[0002] At present, big data technology is widely used in various fields. In some big data application scenarios, the data has a certain time continuity. For example, in traffic big data, when a vehicle passes the reader at a checkpoint, the reader uploads the vehicle's passing record to the big data platform, and the passing record has a certain time continuity. If the vehicle slows down or is stationary at the checkpoint, the reader repeatedly uploads the vehicle's passing records within a short period of time, causing the big data platform to store a large amount of repeated or approximate data. Therefore, the big data platform needs to deduplicate the data it receives.

[0003] At present, a data deduplication method is provided in the related art, that is, each time a data is received in a dedupl...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/215, G06F16/2455
Inventor: 郭冰程广艺罗天成夏曙东
Owner: BEIJING SUREKAM CORP