A method for data duplicate checking

A data and database technology applied to big data duplicate checking. It addresses the limited storage-system scalability, inefficiency, and complex metadata management and storage of existing approaches, with the aim of lowering memory usage, speeding up processing, and reducing connection time.

Active Publication Date: 2019-01-25

GLOBAL TONE COMM TECH

Problems solved by technology

If only a block-based deduplication scheme is used, it mechanically runs the existing deduplication process without considering whether similar duplicate files exist. This is inefficient and costly: it generates a large amount of metadata whose management and storage are themselves complicated, which greatly limits the scalability of existing storage systems, especially with cloud storage.

Examples

Embodiment 1

[0038] As shown in Figure 1, in a first aspect, the present invention provides a data duplicate checking method comprising the following steps:

[0039] Step S100: read the massive resource data with a DBCursor cursor, using .next() to read the records in the database in order and close() to close the connection after reading;

[0040] Step S102: traverse the resource data list in batches, use the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, store the signature as a String, and divide it into n segments, where n is a natural number;

[0041] Step S104: use the segments of the simHash signature obtained above as keys, and the corresponding simHash + heap number as the value;

[0042] Step S106: use the key-value pairs of the hashMap to check the target text for duplicates, determining whether the Hamming distance of the n segments is less than the set threshold; if it is less than the threshold, it is ...
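
Steps S102 to S106 can be illustrated with a minimal Python sketch. It is not the patented implementation: the 64-bit fingerprint width, the MD5 token hash, n = 4, the threshold of 3, whitespace tokenization, and all function names are assumptions made for illustration. The stored value pairs the full fingerprint with a document id, standing in for the "simHash + heap number" value, and the sketch simply returns every indexed document whose Hamming distance is below the threshold.

```python
import hashlib
from collections import defaultdict

BITS = 64        # assumed fingerprint width
N_SEGMENTS = 4   # the "n" of step S102 (assumed value)
THRESHOLD = 3    # assumed Hamming-distance threshold


def simhash(text):
    """Return a simHash fingerprint as a string of BITS '0'/'1' characters."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Hash each token to BITS bits; MD5 is only a convenient stand-in.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is 1 when more tokens voted for it than against it.
    return "".join("1" if w > 0 else "0" for w in weights)


def segments(fp, n=N_SEGMENTS):
    """Split a fingerprint into n keys, each tagged with its position."""
    size = len(fp) // n
    return [f"{i}:{fp[i * size:(i + 1) * size]}" for i in range(n)]


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_index(docs):
    """Map each segment key to the (fingerprint, doc id) pairs containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        fp = simhash(text)
        for seg in segments(fp):
            index[seg].append((fp, doc_id))
    return index


def find_near_duplicates(index, text, threshold=THRESHOLD):
    """Return ids of indexed documents whose fingerprint is within the threshold."""
    fp = simhash(text)
    # Only documents sharing at least one identical segment are compared.
    candidates = {entry for seg in segments(fp) for entry in index.get(seg, [])}
    return sorted(doc_id for cand_fp, doc_id in candidates
                  if hamming(fp, cand_fp) < threshold)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "database cursors read records one at a time",
}
index = build_index(docs)
print(find_near_duplicates(index, "the quick brown fox jumps over the lazy dog"))
```

Segmenting keeps the lookup cheap without missing matches: two fingerprints that differ in fewer than n bits must agree exactly on at least one of their n segments, so any pair within such a threshold is guaranteed to share a hashMap key.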

Embodiment 2

[0053] As shown in Figure 2, in a second aspect, the present invention provides a logical flow for data duplicate checking, including:

[0054] Data reading: use a DBCursor cursor to read the data in the resource database;

[0055] Data query: look up the target resource data id in the simHash table to see whether the current resource is already recorded; if it is, end the duplicate check for the current data; if not, compute the simHash fingerprint signature of the current data and segment it;

[0056] Traverse each simHash segment and match the target resource's segmented simHash fingerprint against the resource data signature list. If there is no match, query the heap number information table and store the heap number information + 1 in the cache. If there is a match, take out all matching resource simHash lists, calculate the Hamming distance to the current resource data one by one, a...
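
The flow of this embodiment can be sketched in Python as follows. The class and attribute names are hypothetical, fingerprints are passed in as precomputed bit strings, and plain dictionaries stand in for the simHash table, the signature list, and the heap number information table. What happens after the Hamming distances are calculated is not shown in the text above, so the sketch assumes the resource joins the heap of its closest match when the distance is below the threshold and otherwise starts a new heap; that rule is an assumption, not the source's.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


class DuplicateChecker:
    """Hypothetical in-memory stand-in for the tables named in the flow."""

    def __init__(self, n_segments=4, threshold=3):
        self.n = n_segments
        self.threshold = threshold
        self.simhash_table = {}   # resource id -> (fingerprint, heap number)
        self.segment_index = {}   # "position:bits" -> list of resource ids
        self.last_heap = 0        # stand-in for the heap number information table

    def _segments(self, fp):
        size = len(fp) // self.n
        return [f"{i}:{fp[i * size:(i + 1) * size]}" for i in range(self.n)]

    def check(self, resource_id, fp):
        """Return the heap number assigned to this resource."""
        # Data query: a resource already in the simHash table is not re-checked.
        if resource_id in self.simhash_table:
            return self.simhash_table[resource_id][1]

        segs = self._segments(fp)
        candidates = {rid for s in segs for rid in self.segment_index.get(s, [])}

        heap = None
        if candidates:
            # Compare the current fingerprint with each matching resource.
            best = min(
                candidates,
                key=lambda rid: (hamming(fp, self.simhash_table[rid][0]), rid),
            )
            if hamming(fp, self.simhash_table[best][0]) < self.threshold:
                # ASSUMPTION: a near-duplicate joins the heap of its closest match.
                heap = self.simhash_table[best][1]
        if heap is None:
            # No usable match: issue the next heap number (heap number + 1).
            self.last_heap += 1
            heap = self.last_heap

        self.simhash_table[resource_id] = (fp, heap)
        for s in segs:
            self.segment_index.setdefault(s, []).append(resource_id)
        return heap


checker = DuplicateChecker(n_segments=4, threshold=3)
print(checker.check("r1", "0" * 16))        # first resource, new heap
print(checker.check("r2", "0" * 15 + "1"))  # one bit away from r1
print(checker.check("r3", "1" * 16))        # shares no segment with r1 or r2
```

Checking the id first means a resource that has already been fingerprinted costs only one dictionary lookup, which is where the flow saves work on repeated runs over the same database.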

Abstract

The invention provides a method for duplicate checking of massive data using a DBCursor cursor and a hashMap. The method reads the massive resource data with a DBCursor cursor, using .next() to read the records in the database in sequence and close() to close the connection after reading. It then traverses the resource data list in batches, uses the simHash algorithm to compute a simHash fingerprint signature for each text in the resource database, stores the signature as a String, and divides it into n segments, where n is a natural number. The segments of the simHash signature are used as keys and the corresponding simHash + heap number as the value, and the target text is checked for duplicates using the key-value pairs of the hashMap. The method overcomes the low efficiency and high memory usage of traditional methods and balances efficiency against accuracy.

Description

Technical field

[0001] The invention relates to the field of big data duplicate checking, in particular to a method for duplicate checking of massive data using a DBCursor cursor and a hashMap.

Background

[0002] With the development of big data technology and the spread of the Internet, data is growing rapidly, so how to use limited storage space effectively to store it has become an urgent problem. Massive data sets contain a large amount of identical or similar data, and this duplicate data occupies storage space. Various data deduplication technologies are developing rapidly to deal with duplicate data in storage, but all of them face, to some degree, problems of efficiency and accuracy. Traditional data deduplication usually consists of three parts: chunking, hash calculation, and deduplication. If only the block-based ded...
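
For orientation, the three parts of traditional deduplication named in the background (chunking, hash calculation, deduplication) can be sketched in Python. Fixed-size blocks, SHA-256, the tiny block size, and the function names are assumptions chosen to keep the example short; real systems differ in all of these.

```python
import hashlib


def dedup_blocks(files, block_size=4):
    """Split each file into fixed-size blocks and keep each distinct block once."""
    store = {}     # block hash -> block bytes (unique blocks only)
    recipes = {}   # file name -> ordered list of block hashes (the metadata)
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]      # chunking
            digest = hashlib.sha256(block).hexdigest()    # hash calculation
            store.setdefault(digest, block)               # deduplication
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(store, recipe):
    """Rebuild a file from its list of block hashes."""
    return b"".join(store[digest] for digest in recipe)


files = {"a": b"AAAABBBBCCCC", "b": b"AAAABBBBDDDD"}
store, recipes = dedup_blocks(files)
print(len(store), "unique blocks for", sum(len(r) for r in recipes.values()), "block references")
```

Every file carries one hash per block in its recipe regardless of how much of it is duplicated, which gives a sense of how the metadata grows with the data volume.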

Application Information

IPC(8): G06F17/22, G06F16/31

CPC: G06F40/194

Inventor: 鄢亚东, 程国艮

Owner: GLOBAL TONE COMM TECH
