A method for data duplicate checking

A data-processing and database technique, applied to duplicate checking of big data, that addresses the limited storage-system scalability, inefficiency, and complex metadata management and storage of existing approaches. Its effects are lower memory usage, faster checking, and reduced connection time.

Active Publication Date: 2019-01-25
GLOBAL TONE COMM TECH

AI Technical Summary

Problems solved by technology

If only a block-based deduplication scheme is used, the existing deduplication process is always executed mechanically, without considering whether similar duplicate files exist. This is inefficient and costly. A larg

Examples

Example Embodiment

[0037] Example 1

[0038] As shown in Figure 1, in a first aspect, the present invention provides a data duplicate checking method comprising the following steps:

[0039] Step S100: Read the massive resource data with a DBCursor cursor, using .next() to read the records in the database in order, and close the connection with close() after reading;

[0040] Step S102: Traverse the resource data list in batches, use the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, store the signature as a String, and divide it into n segments, where n is a natural number;

[0041] Step S104: Use each segment of the simHash signature obtained above as the key, and the corresponding simHash + heap number as the value;

[0042] Step S106: Perform the duplicate check of the target text using the key-value pairs of the hashMap, and determine whether the Hamming distance of the n segments is less than a set threshold; if it is le...
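
Steps S102 to S106 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the whitespace tokenizer, the MD5-derived 64-bit token hash, the `SegmentIndex` class, the defaults `n=4` and `threshold=3`, and the use of (position, fragment) pairs as keys are all hypothetical choices. The source only states that fragments are the keys and that simHash + heap number is the value. Because the text of S106 is truncated, the strict `<` comparison is likewise an assumption.

```python
import hashlib
from collections import defaultdict

BITS = 64  # assumed fingerprint width


def simhash(text):
    """Return the fingerprint of `text` as a 64-character '0'/'1' string."""
    weights = [0] * BITS
    for token in text.lower().split():  # assumed tokenizer: whitespace split
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return "".join("1" if w > 0 else "0" for w in weights)


def segments(fp, n):
    """Split a fingerprint string into n equal fragments (n must divide len(fp))."""
    size = len(fp) // n
    return [fp[i * size:(i + 1) * size] for i in range(n)]


def hamming(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


class SegmentIndex:
    """Hypothetical hash map: (position, fragment) -> [(fingerprint, heap number)]."""

    def __init__(self, n=4, threshold=3):
        self.n = n
        self.threshold = threshold
        self.table = defaultdict(list)

    def add(self, fp, heap_no):
        for key in enumerate(segments(fp, self.n)):
            self.table[key].append((fp, heap_no))

    def find(self, fp):
        """Return the heap number of the first near-duplicate found, else None."""
        for key in enumerate(segments(fp, self.n)):
            for other, heap_no in self.table.get(key, ()):
                if hamming(fp, other) < self.threshold:  # assumed comparison
                    return heap_no
        return None
```

With n segments and a threshold no larger than n, two fingerprints that differ in fewer than `threshold` bits must agree on at least one whole segment, so the lookup only has to compute Hamming distances against candidates that share a key rather than against every stored fingerprint.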

Example Embodiment

[0052] Example 2

[0053] As shown in Figure 2, in a second aspect, the present invention provides a logic flow for data duplicate checking, comprising:

[0054] Data reading: use a DBCursor cursor to read the data in the resource database;

[0055] Data query: use the id of the target resource data to query whether the current resource already exists in the simHash table. If it does, the check of the current data ends. If not, compute the simHash fingerprint signature of the current data and segment the signature;

[0056] Fragment matching: for each simHash fragment of the target resource data, look for a matching fragment in the resource data signature list. If there is no match, query the heap number information table and store the heap number + 1 in the cache. If there is a match, take out the simHash list of all matching resources, calculate the Hamming distance from the current resource data one by one,...
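
The flow of Example 2 can be sketched in Python as below. It is a hedged illustration only: the `DuplicateChecker` class, its in-memory dictionaries (standing in for the simHash table, the signature list, and the heap number information table), and the defaults `n=4` and `threshold=3` are hypothetical. Fingerprints are passed in as precomputed bit strings. Since the source text is truncated after the Hamming distance step, the rule that a resource joins the heap of the first sufficiently close fingerprint, and otherwise opens a new heap, is an assumption.

```python
from collections import defaultdict


def hamming(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def split(fp, n):
    """Split a fingerprint string into n equal fragments (n must divide len(fp))."""
    size = len(fp) // n
    return [fp[i * size:(i + 1) * size] for i in range(n)]


class DuplicateChecker:
    def __init__(self, n=4, threshold=3):
        self.n = n
        self.threshold = threshold
        self.simhash_table = {}                 # resource id -> (fingerprint, heap number)
        self.segment_index = defaultdict(list)  # (position, fragment) -> [(fingerprint, heap number)]
        self.max_heap_no = 0                    # stand-in for the heap number information table

    def check(self, resource_id, fp):
        """Return the heap number assigned to the resource."""
        # Data query: a resource id that is already known ends the check early.
        if resource_id in self.simhash_table:
            return self.simhash_table[resource_id][1]

        keys = list(enumerate(split(fp, self.n)))
        heap_no = None
        # Fragment matching: only fingerprints sharing a fragment are compared.
        for key in keys:
            for other, other_heap in self.segment_index.get(key, ()):
                if hamming(fp, other) < self.threshold:  # assumed comparison
                    heap_no = other_heap
                    break
            if heap_no is not None:
                break

        # No near-duplicate found: open a new heap (last heap number + 1).
        if heap_no is None:
            self.max_heap_no += 1
            heap_no = self.max_heap_no

        self.simhash_table[resource_id] = (fp, heap_no)
        for key in keys:
            self.segment_index[key].append((fp, heap_no))
        return heap_no
```

In use, the records produced by the cursor would be fed to `check` one at a time, so that resources sharing a heap number can be treated as duplicates of one another.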

Abstract

The invention provides a method for duplicate checking of massive data using a DBCursor cursor and a hashMap. The method reads the massive resource data with a DBCursor cursor, using .next() to read the records in the database in sequence and close() to close the connection after reading. It then traverses the resource data list in batches, uses the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, stores the signature as a String, and divides it into n fragments, where n is a natural number. Each fragment of the simHash signature is used as a key, with the corresponding simHash + heap number as the value, and the duplicate check of the target text is performed using the key-value pairs of the hashMap. The method overcomes the low efficiency and high memory occupancy of traditional methods and balances efficiency against accuracy.

Description

Technical field

[0001] The invention relates to the field of big data duplicate checking, and in particular to a method for duplicate checking of massive data using a DBCursor cursor and a hashMap.

Background technique

[0002] With the development of big data technology and the spread of the Internet, data is growing rapidly, so how to use limited storage space effectively to store this data has become an urgent problem. Massive data sets contain a large amount of identical or similar data, and this duplicate data occupies storage space. Data deduplication technologies are developing rapidly in response, but all of them face problems of efficiency and accuracy to some degree. Traditional data deduplication usually comprises three parts: blocking, hash calculation, and deduplication. If only the block-based ded...
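
The three stages of traditional deduplication named in the background (blocking, hash calculation, deduplication) can be illustrated with a short Python sketch. It is an assumption-laden example rather than anything specified in the source: fixed-size blocks, SHA-256 digests, and the `dedup_blocks` and `restore` helpers are all hypothetical choices.

```python
import hashlib


def dedup_blocks(data, block_size=4):
    """Split `data` into fixed-size blocks and keep one copy per distinct block."""
    store = {}   # digest -> block bytes (each unique block stored once)
    recipe = []  # digests in original order, enough to rebuild the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]                 # blocking
        digest = hashlib.sha256(block).hexdigest()     # hash calculation
        store.setdefault(digest, block)                # deduplication
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes from the block store and the recipe."""
    return b"".join(store[d] for d in recipe)
```

A scheme like this detects only byte-identical blocks, which is why it cannot account for files that are merely similar.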

Application Information

IPC(8): G06F17/22, G06F16/31
CPC: G06F40/194
Inventor: 鄢亚东, 程国艮
Owner: GLOBAL TONE COMM TECH