A method for data duplicate checking

A data and database technology, applied in the field of big-data duplicate checking, that addresses limited storage-system scalability, inefficient and costly processing, and complex metadata management and storage, with the effect of lowering memory usage, speeding up processing, and reducing connection time.

Active Publication Date: 2019-01-25
GLOBAL TONE COMM TECH

AI Technical Summary

Problems solved by technology

A purely block-based deduplication scheme mechanically runs the same deduplication process every time, without considering whether similar duplicate files already exist. This is inefficient and costly. It also generates a large amount of metadata whose management and storage are complicated, which greatly limits the scalability of existing storage systems, especially in cloud storage.

Examples

Embodiment 1

[0038] As shown in Figure 1, in a first aspect, the present invention provides a data duplicate-checking method comprising the following steps:

[0039] Step S100: read the massive resource data with the DBCursor cursor, using .next() to read records from the database in order, and close the connection with close() after reading;

[0040] Step S102: traverse the resource data list in batches; use the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, store the signature as a String, and divide it into n segments, where n is a natural number;

[0041] Step S104: use each segment of the simHash signature obtained above as the key, and the corresponding simHash + heap number as the value;

[0042] Step S106: use the keys of the hashMap to check the target text for duplicates, determining whether the Hamming distance over the n segments is less than the set threshold; if it is less than the threshold, it is ...
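
Steps S102 and S106 rely on three operations: computing a simHash fingerprint, splitting it into n segments, and comparing fingerprints by Hamming distance. The following is a minimal sketch of those operations, not the patented implementation. The source does not specify tokenization, the per-token hash, token weights, or the fingerprint width, so the whitespace tokenizer, the MD5-derived 64-bit hash, the equal weights, and the class and method names (`SimHashSketch`, `fingerprint`, `segments`, `hammingDistance`) are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; every name here is illustrative.
final class SimHashSketch {
    static final int BITS = 64; // assumed fingerprint width

    // Assumed token hash: the first 8 bytes of the token's MD5 digest.
    static long hash64(String token) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFFL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // simHash: each token votes +1 or -1 on every bit position; a result
    // bit is '1' where the vote total is positive. The fingerprint is
    // returned as a String of '0'/'1' characters, as in step S102.
    static String fingerprint(String text) {
        int[] votes = new int[BITS];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = hash64(token);
            for (int b = 0; b < BITS; b++) {
                votes[b] += ((h >>> (BITS - 1 - b)) & 1L) == 1L ? 1 : -1;
            }
        }
        StringBuilder sb = new StringBuilder(BITS);
        for (int v : votes) {
            sb.append(v > 0 ? '1' : '0');
        }
        return sb.toString();
    }

    // Split a fingerprint into n equal-length segments.
    static List<String> segments(String fp, int n) {
        if (n <= 0 || fp.length() % n != 0) {
            throw new IllegalArgumentException("length must be divisible by n");
        }
        int len = fp.length() / n;
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(fp.substring(i * len, (i + 1) * len));
        }
        return out;
    }

    // Hamming distance: the number of positions where two bit strings differ.
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("lengths differ");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }
}
```

Segmenting is useful because of the pigeonhole principle: if two fingerprints differ in fewer than n bits, at least one of their n segments must be identical, so an exact lookup on segments finds every candidate before the full Hamming distance is computed.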

Embodiment 2

[0053] As shown in Figure 2, in a second aspect, the present invention provides a logical flow for data duplicate checking, comprising:

[0054] Data reading: use the DBCursor cursor to read the data in the resource database;

[0055] Data query: use the id of the target resource data to check whether the simHash table already holds an entry for the current resource; if so, end the duplicate check for the current data; if not, compute the simHash fingerprint signature of the current data and split the signature into segments;

[0056] Traverse each simHash segment and look up the target resource's segmented simHash fingerprint in the resource data signature list. If there is no match, query the heap number information table and store the heap number + 1 in the cache. If there is a match, take out the simHash lists of all matching resources, calculate the Hamming distance to the current resource data one by one, a...
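
The matching flow in this embodiment can be sketched with a hashMap keyed by segment. The sketch below is illustrative only. The source text is truncated after the Hamming-distance comparison, so the rule used here (reuse the heap number of the closest stored fingerprint whose distance is below the threshold, otherwise allocate the last heap number + 1) is an assumption. The class `SegmentIndexSketch`, its `check` method, and the in-memory counter standing in for the heap number information table are hypothetical names, not taken from the patent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index: segment -> entries (simHash + heap number).
final class SegmentIndexSketch {
    // One stored value: the full fingerprint and its heap number.
    static final class Entry {
        final String fingerprint;
        final int heap;

        Entry(String fingerprint, int heap) {
            this.fingerprint = fingerprint;
            this.heap = heap;
        }
    }

    private final int n;          // number of segments per fingerprint
    private final int threshold;  // duplicate if distance < threshold
    private final Map<String, List<Entry>> index = new HashMap<>();
    private int lastHeap = 0;     // stand-in for the heap number table

    SegmentIndexSketch(int n, int threshold) {
        this.n = n;
        this.threshold = threshold;
    }

    // Checks one fingerprint, stores it, and returns its heap number.
    int check(String fp) {
        Set<String> segs = split(fp);
        Entry best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String s : segs) {
            // Only entries sharing an identical segment are candidates.
            for (Entry e : index.getOrDefault(s, Collections.<Entry>emptyList())) {
                int d = hamming(fp, e.fingerprint);
                if (d < threshold && d < bestDist) {
                    best = e;
                    bestDist = d;
                }
            }
        }
        // Assumed rule: join the closest match's heap, else open a new one.
        int heap = (best != null) ? best.heap : ++lastHeap;
        Entry added = new Entry(fp, heap);
        for (String s : segs) {
            index.computeIfAbsent(s, k -> new ArrayList<>()).add(added);
        }
        return heap;
    }

    // Distinct equal-length segments of the fingerprint, in order.
    private Set<String> split(String fp) {
        if (n <= 0 || fp.length() % n != 0) {
            throw new IllegalArgumentException("length must be divisible by n");
        }
        int len = fp.length() / n;
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i < n; i++) {
            out.add(fp.substring(i * len, (i + 1) * len));
        }
        return out;
    }

    private static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }
}
```

With 16-bit fingerprints, n = 4 and a threshold of 3, a fingerprint that differs from a stored one in a single bit lands in the same heap, while one that shares a segment but differs in six bits opens a new heap. The sketch keys on the bare segment, as the text describes; keying on segment position plus segment would narrow the candidate set further, but that is a design variation the source does not mention.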

Abstract

The invention provides a method for duplicate checking of massive data using a DBCursor cursor and a hashMap. The method reads the massive resource data with the DBCursor cursor, using .next() to read records from the database in sequence and close() to close the connection after reading. It then traverses the resource data list in batches, uses the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, stores the signature as a String, and divides it into n segments, where n is a natural number. Each segment of the simHash signature is used as a key, with the corresponding simHash + heap number as the value, and the target text is checked for duplicates using the keys of the hashMap. The method overcomes the low efficiency and high memory usage of traditional methods and balances efficiency against accuracy.
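
The read pattern in the abstract (advance with .next(), process in batches, call close() when done) can be sketched as follows. The source does not name the database driver, so this sketch does not call a real driver. The `Cursor` interface, the `ListCursor` in-memory implementation, and `readInBatches` are hypothetical stand-ins that assume a cursor exposing `hasNext()`, `next()` and `close()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; no real database driver is used.
final class CursorBatchSketch {
    // Assumed cursor shape: sequential iteration plus close().
    interface Cursor<T> extends Iterator<T>, AutoCloseable {
        @Override
        void close();
    }

    // In-memory stand-in so that the sketch runs without a database.
    static final class ListCursor<T> implements Cursor<T> {
        private final Iterator<T> it;
        private boolean closed = false;

        ListCursor(List<T> rows) {
            this.it = rows.iterator();
        }

        @Override
        public boolean hasNext() {
            return !closed && it.hasNext();
        }

        @Override
        public T next() {
            return it.next();
        }

        @Override
        public void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    // Reads the cursor in order, hands records to the handler in batches
    // of batchSize, and always closes the cursor. Returns the batch count.
    static <T> int readInBatches(Cursor<T> cursor, int batchSize,
                                 Consumer<List<T>> handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        try (Cursor<T> c = cursor) { // close() runs even if the handler throws
            List<T> batch = new ArrayList<>(batchSize);
            while (c.hasNext()) {
                batch.add(c.next());
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    batches++;
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) { // final partial batch
                handler.accept(batch);
                batches++;
            }
        }
        return batches;
    }
}
```

Streaming through a cursor keeps only one batch in memory at a time, and closing it promptly releases the connection, which is consistent with the memory and connection-time effects the document claims.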

Description

Technical field

[0001] The invention relates to the field of big-data duplicate checking, in particular to a method for duplicate checking of massive data using a DBCursor cursor and a hashMap.

Background

[0002] With the development of big data technology and the spread of the Internet, data is growing rapidly, so using limited storage space effectively to store this data has become an urgent problem. It is well known that much of the data in a massive data set is identical or similar, and this duplicate data occupies storage space. Various data deduplication technologies have developed rapidly to deal with duplicate data during storage. However, these technologies all face trade-offs between efficiency and accuracy to some degree. Traditional data deduplication usually consists of three parts: chunking, hash calculation, and deduplication. If only the block-based ded...

Application Information

IPC(8): G06F17/22, G06F16/31
CPC: G06F40/194
Inventors: 鄢亚东, 程国艮
Owner GLOBAL TONE COMM TECH