Method and device for deleting duplicated data

A data and database technology, applied in the field of data processing, that addresses problems such as heavy system load and degraded data storage and query performance, and reduces hard disk usage and cost.

Inactive Publication Date: 2017-05-10
RUN TECH CO LTD BEIJING
3 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

The amount of data that enterprises must process has increased sharply. While big data brings convenience, it also places a burden on technicians. Massive data sets contain a large amount of duplicate data, which steadily increases system load and degrades data storage and query performance. How to delete large amounts of duplicate junk data and reduce hard disk usage has become an urgent problem in the big data era.

Examples

Embodiment 1

[0046] Figure 1 is a flowchart of a data deduplication method provided by Embodiment 1 of the present invention. This embodiment is applicable to the effective deduplication of massive data, and the method can be performed by a data deduplication device. The method specifically includes the following steps:

[0047] S110. Obtain the MD5 value of the data to be processed and the corresponding data identifier.

[0048] MD5 (Message-Digest Algorithm 5) is used to ensure the integrity and consistency of transmitted information. It is one of the hash algorithms widely used in computing and features compressibility, ease of computation, resistance to modification, and strong collision resistance. The data to be processed can be of a text type; the data can be read by row or by column, and the corresponding MD5 value of the data calculated. The data identifier can be used as a mark for each piece of data to dis...
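Step S110 can be illustrated with a minimal Python sketch. It assumes text rows encoded as UTF-8 and uses the row index as a stand-in for the data identifier; the function names `md5_of_row` and `rows_with_md5` are hypothetical and do not come from the patent.

```python
import hashlib


def md5_of_row(row: str) -> str:
    """Return the hexadecimal MD5 digest of one text row (UTF-8 encoded)."""
    return hashlib.md5(row.encode("utf-8")).hexdigest()


def rows_with_md5(rows):
    """Yield (md5_value, data_id) for each row.

    Assumption: the row index stands in for the data identifier.
    """
    for index, row in enumerate(rows):
        yield md5_of_row(row), f"row-{index}"


# Identical rows produce identical MD5 values but keep distinct identifiers.
pairs = list(rows_with_md5(["alpha", "beta", "alpha"]))
```

Because the digest has a fixed length of 32 hexadecimal characters regardless of row size, comparing digests is cheaper than comparing the rows themselves.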

Embodiment 2

[0066] Figure 2 is a flowchart of a data deduplication method provided by Embodiment 2 of the present invention. This embodiment is further optimized on the basis of the above embodiment: "obtaining the MD5 value of the data to be processed and the corresponding data identifier" is further refined as "reading the data to be processed row by row; calculating the MD5 value of the data to be processed; generating the data identifier of the data to be processed according to the reading time and/or the thread number when reading the data to be processed." The method specifically includes the following steps:

[0067] S210. Read the data to be processed row by row.

[0068] The data may come from the data access stage that precedes preprocessing. A transfer tool moves the data to the preprocessing server, where it waits to be processed. On the preprocessing server, the program reads the data row by row. The commonly used moving programs are all transmi...
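The refinement in Embodiment 2 (read row by row, hash each row, derive the identifier from the reading time and/or thread number) might look like the following sketch. The identifier format, a nanosecond timestamp joined to the thread number, is an assumption made for illustration, as are the names `make_data_id` and `read_rows`.

```python
import hashlib
import threading
import time


def make_data_id() -> str:
    """Build an identifier from the read time and the reading thread's number.

    Assumption: "<nanosecond timestamp>-<thread id>" is one possible format;
    the patent text does not specify a layout.
    """
    return f"{time.time_ns()}-{threading.get_ident()}"


def read_rows(lines):
    """Read rows one at a time and yield (md5_value, data_id) for each."""
    for line in lines:
        row = line.rstrip("\n")
        digest = hashlib.md5(row.encode("utf-8")).hexdigest()
        yield digest, make_data_id()


records = list(read_rows(["a\n", "b\n", "a\n"]))
```

On systems with a coarse clock, two rows read in quick succession by the same thread could receive the same timestamp, so a production identifier would likely need an additional counter.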

Embodiment 3

[0078] Figure 4 is a flowchart of a data deduplication method provided by Embodiment 3 of the present invention. This embodiment is further optimized on the basis of the above embodiments: "calculating the MD5 value of the data to be processed" is further refined as "if the data to be processed contains preset ignored data, removing the preset ignored data from the data to be processed; calculating the MD5 value of the data to be processed after the preset ignored data has been removed, and using it as the MD5 value of the data to be processed." The method specifically includes the following steps:

[0079] S410. Read the data to be processed row by row.

[0080] S420. If the data to be processed includes preset ignored data, remove the preset ignored data from the data to be processed.

[0081] Before the data to be processed is read, some data content can be designated as preset ignored data according to actual needs, for example, the port number of the data or some unn...
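Steps S410 and S420 can be sketched as follows. The ignore rule here, a regular expression that strips a `:<port>` suffix, is purely hypothetical; the patent mentions port numbers as an example but does not say how ignored data is specified. The names `IGNORED_PATTERNS`, `strip_ignored`, and `md5_ignoring` are likewise illustrative.

```python
import hashlib
import re

# Hypothetical ignore rule: a ":<port>" suffix such as the ":8080" in "10.0.0.1:8080".
IGNORED_PATTERNS = [re.compile(r":\d{1,5}\b")]


def strip_ignored(row: str, patterns=IGNORED_PATTERNS) -> str:
    """Remove every match of the preset ignore patterns from the row."""
    for pattern in patterns:
        row = pattern.sub("", row)
    return row


def md5_ignoring(row: str) -> str:
    """MD5 of the row after the preset ignored data has been removed."""
    return hashlib.md5(strip_ignored(row).encode("utf-8")).hexdigest()


# Rows that differ only in the port number now hash to the same value.
same = md5_ignoring("GET /index 10.0.0.1:8080") == md5_ignoring("GET /index 10.0.0.1:9090")
```

A pattern this broad would also strip other colon-digit sequences such as parts of a timestamp, so real ignore rules would need to be tailored to the data format.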

Abstract

The embodiment of the invention discloses a method and device for deleting duplicated data. The method comprises the following steps: obtaining the MD5 value of the data to be processed and a corresponding data identifier; forming a key-value pair for the data to be processed from the MD5 value and the data identifier; comparing the MD5 value in the key-value pair of the data to be processed with the MD5 value in the key-value pair of existing data; and, if the two MD5 values are the same, deleting the data to be processed and determining the data identifier of the existing data that is identical to the data to be processed. By forming the key-value pair from the MD5 value and the data identifier, comparing the MD5 value of the data to be processed with that of the existing data, and deleting the data to be processed whose MD5 value matches, the embodiment solves the problem of duplicated data in massive data sets, removes duplicates before storage, reduces hard disk usage, and lowers cost.
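The comparison-and-delete flow summarized above can be sketched in Python. This is a minimal illustration under stated assumptions: an in-memory dictionary stands in for the key-value store of existing data, the row index stands in for the data identifier, and the function name `deduplicate` is hypothetical.

```python
import hashlib


def deduplicate(rows, existing=None):
    """Drop rows whose MD5 value already appears among the existing data.

    `existing` maps md5_value -> data_id of data already stored (assumed to
    be an in-memory dict here). Returns the rows that were kept and a list
    of (dropped_id, existing_id) pairs for the rows that were dropped.
    """
    store = {} if existing is None else existing
    kept = []
    duplicates = []
    for index, row in enumerate(rows):
        digest = hashlib.md5(row.encode("utf-8")).hexdigest()
        data_id = f"row-{index}"  # assumption: index used as the identifier
        if digest in store:
            # Same MD5 as existing data: discard the row and record which
            # existing identifier it duplicates.
            duplicates.append((data_id, store[digest]))
        else:
            store[digest] = data_id
            kept.append(row)
    return kept, duplicates


kept, duplicates = deduplicate(["a", "b", "a", "c", "b"])
```

Keying the store by MD5 value makes each duplicate check a single lookup, and only the digest and identifier need to be retained for data that has already been stored.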

Description

Technical field

[0001] Embodiments of the present invention relate to data processing technologies, and in particular to a method and device for data deduplication.

Background

[0002] In today's big data era, with the development of informatization, letting the data speak has become the philosophy of many business operators. The amount of data that enterprises must process has increased sharply. While big data brings convenience, it also places a burden on technicians. Massive data sets contain a large amount of duplicate data, which steadily increases system load and degrades data storage and query performance. How to delete large amounts of duplicate junk data and reduce hard disk usage has become an urgent problem in the big data era.

Summary of the invention

[0003] The invention provides a method and device for deduplication of data, so as to realize deduplication of large-scale da...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 孙健
Owner: RUN TECH CO LTD BEIJING