
Data deduplication method, device, equipment and storage medium

A data deduplication technique in the database field that addresses the high maintenance cost and low efficiency of existing deduplication approaches, improving deduplication efficiency and shortening deduplication time.

Pending Publication Date: 2020-06-09
软通智慧信息技术有限公司

AI Technical Summary

Problems solved by technology

[0004] With the existing technical solutions described above, the temporary database method repeatedly deduplicates file data that has already been deduplicated, so its maintenance cost is relatively high. The full table comparison method, in turn, is inefficient at deduplication when the database holds a large amount of data.



Examples


Embodiment 1

[0029] Figure 1 is a flow chart of a data deduplication method provided by Embodiment 1 of the present invention. This embodiment applies to deduplicating data during file collection. The method can be executed by a data deduplication device, which can be implemented in software and/or hardware. The method specifically includes the following steps:

[0030] S110. Obtain the data to be processed in the file to be processed, and calculate a first hash value and a first MD5 value of the data to be processed.

[0031] For example, the file type of the file to be processed may be Excel (spreadsheet), txt (text document), csv (comma-separated values), and the like. The data format of the data to be processed depends on the file type of the file to be processed. For example, the data to be processed may be numbers, characters, symbols, and so on; neither the file to be processed nor the data to be processed is limited here.
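
Step S110 can be illustrated with a minimal sketch. The excerpt does not say which hash function produces the first hash value or how a row is serialised, so the CRC32 hash, the field separator, and the function name `fingerprint` below are assumptions made only for illustration.

```python
import hashlib
import zlib


def fingerprint(row, sep="\x1f"):
    """Return (hash_value, md5_hex) for one row of field values.

    Hypothetical helper: CRC32 stands in for the unspecified
    "first hash value"; hashlib.md5 gives the "first MD5 value".
    """
    # Serialise the fields in a fixed order so equal rows yield equal bytes.
    payload = sep.join(str(field) for field in row).encode("utf-8")
    hash_value = zlib.crc32(payload)             # unsigned 32-bit integer
    md5_hex = hashlib.md5(payload).hexdigest()   # 32 hexadecimal characters
    return hash_value, md5_hex
```

Two rows with identical field values always produce the same pair, which is what allows later steps to compare fingerprints instead of whole rows.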

[0032] ...

Embodiment 2

[0048] Figure 2 is a flow chart of a data deduplication method provided in Embodiment 2 of the present invention. The technical solution of this embodiment further refines the foregoing embodiments. Optionally, the retrieval data also includes primary key data corresponding to each piece of stored data. Correspondingly, after obtaining the data to be processed in the file to be processed, the method further includes: determining whether preset primary key data exists in the data to be processed; if it exists, judging whether the data to be processed is a duplicate based on the preset primary key data.

[0049] S210. Obtain the data to be processed in the file to be processed.

[0050] S220. Determine whether there is preset primary key data in the data to be processed; if yes, execute S230; if not, execute S240.
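
The branch in S220 can be sketched as follows. The excerpt does not show the text of S230 or S240, so this sketch assumes that S230 is the primary-key comparison mentioned in [0048] and that S240 falls back to the hash/MD5 comparison of Embodiment 1. The function name, the dictionary record format, and the treatment of empty fields as "no primary key" are all hypothetical.

```python
def choose_strategy(record, primary_key_fields):
    """Decide how one record should be checked for duplication.

    Returns ("primary_key", key_tuple) when every preset primary key
    field is present and non-empty (assumed to correspond to S230),
    otherwise ("fingerprint", None) (assumed to correspond to S240).
    """
    has_key = bool(primary_key_fields) and all(
        record.get(name) not in (None, "") for name in primary_key_fields
    )
    if has_key:
        return "primary_key", tuple(record[name] for name in primary_key_fields)
    return "fingerprint", None
```

A caller would then compare `key_tuple` against the stored primary key data, or compute the hash and MD5 values of the record when no usable key exists.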

[0051] Wherein, the primary key data refers to at least one field data in the data to be processed, and the primary key data can uniqu...

Embodiment 3

[0068] Figure 4 is a schematic diagram of a data deduplication device provided in Embodiment 3 of the present invention. This embodiment applies to deduplicating data when collecting files, and the device can be implemented in software and/or hardware. The data deduplication device includes a to-be-processed data acquisition module 310, a target hash partition determination module 320, a first MD5 value determination module 330, and a duplicate data determination module 340.

[0069] The to-be-processed data acquisition module 310 is used to obtain the data to be processed in the file to be processed, and to calculate the first hash value and the first MD5 value of the data to be processed;

[0070] The target hash partition determining module 320 is configured to determine the target hash partition for data comparison in the stored retrieval data according to the first hash value, wherein the retrieval data includes at least one hash partition, ...


Abstract

The embodiment of the invention discloses a data deduplication method, device, equipment, and storage medium. The method comprises: obtaining to-be-processed data from a to-be-processed file, and calculating a first hash value and a first MD5 value of the to-be-processed data; determining, according to the first hash value, a target hash partition for data comparison in the stored retrieval data, where the retrieval data comprises at least one hash partition and each hash partition comprises at least one MD5 value; determining whether the first MD5 value exists among the MD5 values of the target hash partition; and, if it exists, determining that the to-be-processed data is duplicate data and updating the stored data corresponding to the first MD5 value in a document database based on the to-be-processed data. By calculating the hash value and the MD5 value, the method solves the problem of deduplicating data that has no primary key, shortens the deduplication time, and thereby improves the efficiency of collecting data from files.
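
The lookup described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the class name `DedupIndex`, the CRC32-modulo partitioning rule, the partition count, and the in-memory dictionaries standing in for the retrieval data and the document database are all hypothetical. The abstract also describes only the duplicate case; storing a new record when no match is found is an assumption added so the example runs end to end.

```python
import hashlib
import zlib


class DedupIndex:
    """In-memory stand-in for the retrieval data and the document database."""

    def __init__(self, num_partitions=16):
        self.num_partitions = num_partitions
        self.partitions = {}  # partition id -> set of MD5 hex strings
        self.documents = {}   # MD5 hex string -> stored record

    def _fingerprint(self, row):
        # Assumed serialisation and hash choice; the source specifies neither.
        payload = "\x1f".join(str(field) for field in row).encode("utf-8")
        partition_id = zlib.crc32(payload) % self.num_partitions
        return partition_id, hashlib.md5(payload).hexdigest()

    def ingest(self, row):
        """Return True if row duplicates stored data, False if it is new."""
        partition_id, md5_hex = self._fingerprint(row)
        # Only the target partition is searched, not the whole index.
        bucket = self.partitions.setdefault(partition_id, set())
        duplicate = md5_hex in bucket
        bucket.add(md5_hex)
        # Duplicate: update the stored record. New: insert it (assumption).
        self.documents[md5_hex] = list(row)
        return duplicate
```

Because the hash value narrows the search to one partition, each membership test touches only the MD5 values in that partition, which is the source of the time saving the abstract claims over full table comparison.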

Description

Technical field

[0001] Embodiments of the present invention relate to the technical field of databases, and in particular to a data deduplication method, device, equipment, and storage medium.

Background

[0002] In the process of data aggregation, data files are usually deduplicated by means of a primary keyword, to avoid a large amount of duplicate data in the database and the resulting waste of storage resources. Specifically, the primary keyword uniquely identifies the data records in a table. It is also called the primary key and can consist of one field or multiple fields.

[0003] However, the data files supplied by data file providers usually have no primary key, and the same data files are often provided repeatedly. It has been proposed that data can be deduplicated through a temporary database or through full table comparison. Of these, the temporary database method collects the file data into a temporary datab...


Application Information

IPC(8): G06F16/172, G06F16/174
CPC: G06F16/172, G06F16/1748
Inventor 李猛
Owner 软通智慧信息技术有限公司