A method and system for deduplication of data

A data deduplication method and system, applied in the field of big data, that addresses problems such as heavy consumption of computing resources and incompatibility between existing deduplication approaches, and that achieves efficient data deduplication with low computing-resource usage and a unified algorithm.

Active Publication Date: 2022-07-05
南京苏宁云财信息技术有限公司

AI Technical Summary

Problems solved by technology

[0003] However, in a real environment with such a large amount of data, data deduplication consumes a very large amount of computing resources, and removing duplicate data both efficiently and accurately remains difficult.
In the existing technology, data deduplication is usually handled in one of two ways. In the first, the component layer of the data system runs a deduplication algorithm itself, for example the HyperLogLog algorithm in the Druid database, which deduplicates only approximately, or deduplication through Spark; the results are inaccurate and the approach consumes huge computing resources. In the second, a data dictionary is used to deduplicate the data: PostgreSQL, ClickHouse, and Druid each create a dictionary table in their own component layer. Although these three systems achieve accurate deduplication, the approach is scattered, not universal, and has not been unified. The same dictionary table sometimes has to be computed multiple times, so deduplication efficiency is likewise low. Both approaches also still occupy a large amount of computing and storage resources.
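The dictionary-based approach mentioned above can be illustrated with a minimal Python sketch. It assumes a dictionary is simply a mapping from each distinct raw value to a dense integer ID; the function name `build_dictionary` and the sample values are hypothetical and are not taken from the patent or from any of the named databases.

```python
def build_dictionary(values):
    """Assign each distinct value a dense integer ID in first-seen order.

    Hypothetical helper for illustration only; real systems persist the
    dictionary as a table so that it can be reused instead of recomputed.
    """
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


# Sample raw data containing duplicates (illustrative values).
raw = ["user_a", "user_b", "user_a", "user_c", "user_b"]

dictionary = build_dictionary(raw)
encoded = [dictionary[v] for v in raw]

print(dictionary)  # {'user_a': 0, 'user_b': 1, 'user_c': 2}
print(encoded)     # [0, 1, 0, 2, 1]
```

Building such a mapping once and sharing it is what avoids the repeated per-system computation that the passage describes as a weakness of scattered dictionary tables.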



Examples


Embodiment 1

[0030] As shown in Figure 1, an embodiment of the present invention discloses a method for data deduplication. The method includes the following steps:

[0031] S1: Design a deduplication dictionary table array in the database, add a data acceleration layer column to the deduplication dictionary table array, and perform dimension association matching between the deduplication dictionary table array and the data acceleration layer;

[0032] S2: Map the data to be deduplicated into the deduplication dictionary table array, then import the deduplication dictionary table array into the data management system of the data acceleration layer. The data management system converts the data to be deduplicated in the deduplication dictionary table array into a bit format and stores it in the Bitmap set, so that the data to be deduplicated becomes a new column in the data acceleration layer;

[0033] S3: In the Bitmap set, use a deduplic...
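The bitmap idea behind these steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: it assumes the data has already been mapped to integer IDs by a dictionary, uses a plain Python integer as the bitmap, and the helper names `to_bitmap`, `count_distinct`, and `distinct_ids` are hypothetical.

```python
def to_bitmap(ids):
    """Set one bit per integer ID; repeated IDs set the same bit again."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def count_distinct(bitmap):
    """Exact distinct count: the number of set bits."""
    return bin(bitmap).count("1")


def distinct_ids(bitmap):
    """Recover the deduplicated IDs in ascending order."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# IDs as a dictionary table might produce them (illustrative values).
encoded = [0, 1, 0, 2, 1]
bitmap = to_bitmap(encoded)

print(count_distinct(bitmap))  # 3
print(distinct_ids(bitmap))    # [0, 1, 2]
```

Because a duplicate ID only sets a bit that is already set, the result is exact rather than approximate. Production systems typically use compressed bitmap structures instead of a single large integer.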

Embodiment 2

[0039] As shown in Figure 3, an embodiment of the present invention also discloses a system for data deduplication. The system comprises:

[0040] Dictionary table array module 1, which is used to design a deduplication dictionary table array in the database, add a data acceleration layer column to the deduplication dictionary table array, and perform dimension association matching between the deduplication dictionary table array and the data acceleration layer;

[0041] Data processing module 2, which is used to map the data to be deduplicated into the deduplication dictionary table array and then import the deduplication dictionary table array into the data management system of the data acceleration layer. The data management system converts the data to be deduplicated in the deduplication dictionary table array into a bit format and stores it in the Bitmap set, so that the data to be deduplicated i...
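As a rough sketch of how the two modules described so far might fit together, the Python below models them as two small classes. Everything here is an assumption made for illustration: the class and method names are hypothetical, "dimension association" is interpreted as keeping one bitmap per dimension value, and any module described in the truncated text is not modeled.

```python
class DictionaryTableModule:
    """Hypothetical stand-in for module 1: maps raw values to integer IDs."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # Return the existing ID, or assign the next free one.
        return self._ids.setdefault(value, len(self._ids))


class DataProcessingModule:
    """Hypothetical stand-in for module 2: keeps one bitmap per dimension."""

    def __init__(self, dictionary_module):
        self._dictionary = dictionary_module
        self.bitmaps = {}  # dimension value -> bitmap stored as an int

    def ingest(self, dimension, value):
        bit = 1 << self._dictionary.encode(value)
        self.bitmaps[dimension] = self.bitmaps.get(dimension, 0) | bit

    def distinct_count(self, *dimensions):
        # OR the bitmaps together so shared values are counted once.
        merged = 0
        for dimension in dimensions:
            merged |= self.bitmaps.get(dimension, 0)
        return bin(merged).count("1")


processor = DataProcessingModule(DictionaryTableModule())
rows = [
    ("2022-07-01", "user_a"),
    ("2022-07-01", "user_b"),
    ("2022-07-01", "user_a"),
    ("2022-07-02", "user_b"),
    ("2022-07-02", "user_c"),
]
for day, user in rows:
    processor.ingest(day, user)

print(processor.distinct_count("2022-07-01"))                # 2
print(processor.distinct_count("2022-07-01", "2022-07-02"))  # 3
```

One reason to keep a bitmap per dimension is that exact distinct counts over several dimension values can be obtained by OR-ing the stored bitmaps, without rescanning the raw rows.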



Abstract

The invention discloses a data deduplication method and system. The method includes the following steps: designing a deduplication dictionary table array in a database, adding a data acceleration layer column to the deduplication dictionary table array, and performing dimension association matching between the deduplication dictionary table array and the data acceleration layer; mapping the data to be deduplicated into the deduplication dictionary table array, then importing the deduplication dictionary table array into the data management system of the data acceleration layer, where the data management system converts the data to be deduplicated into a bit format and stores it in the Bitmap set, so that the data to be deduplicated becomes a new column in the data acceleration layer; and, in the Bitmap set, using a deduplication function to uniformly deduplicate the data to be deduplicated and filter out the duplicate data. The embodiments of the present invention can efficiently and accurately filter out duplicate data in big data, improve the accuracy of big data applications, and reduce the cost of using big data.

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a method and system for deduplication of data.

Background technique

[0002] At present, big data is widely used in many fields. The application of big data is not only to master huge data information, but also to professionally process these meaningful data. In the process of collecting a large amount of data, a lot of data needs to be further screened to obtain the part of the data that customers need most. Filtering to remove duplicate data is an operation performed by many customers who apply big data.

[0003] However, in a real environment with such a large amount of data, the loss of computing resources caused by data deduplication is very huge, and it is currently difficult to remove duplicate data efficiently and accurately. In the existing technology, there are usually two operations in the process of data deduplication. One is that the component layer of the internal data of ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/215, G06F16/22
CPC: G06F16/215, G06F16/2237
Inventor: 范东孙迁汪金忠
Owner: 南京苏宁云财信息技术有限公司