Dynamic detection method for duplicate records in multiple data sets

A technology for the dynamic detection of duplicate records, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as unsuitability for multiple data sets, low efficiency, and the inability to incrementally detect changes in duplicate records.

Inactive Publication Date: 2011-08-31
JINAN UNIVERSITY

AI Technical Summary

Problems solved by technology

[0007] (1) Detection accuracy depends heavily on the sort key. If the key is poorly chosen, many potential duplicate records may end up far apart after sorting and never fall within the same sliding window, so some duplicate records are missed and accuracy is low;
[0008] (2) The size of the sliding window is hard to choose. A window that is too small hurts detection accuracy, while a window that is too large reduces detection efficiency;
[0009] (3) The algorithm suits only a single data set, not multiple data sets;
[0010] (4) If the data set is very large, the time cost of sorting is relatively high;
[0011] (5) It cannot meet the requirements of dynamic, real-time data processing.
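
The sensitivity to the sort key and window size described in [0007] and [0008] can be illustrated with a minimal sketch of a sliding-window (sorted-neighborhood) pass. The function names, the sample records, and the `normalize` matching rule are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def normalize(name):
    # Hypothetical matching rule: two names match if they are equal
    # after trimming whitespace and lowercasing.
    return name.strip().lower()

def sorted_neighborhood(records, key, window):
    """Sort record indices by `key`, then compare each record only with
    the next `window - 1` records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if normalize(records[i]) == normalize(records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

records = ["alice", "Bob", "Carol", "Alice "]

# Poor key (raw string): "Alice " and "alice" sort to opposite ends,
# so a window of 2 never compares them.
poor_key = sorted_neighborhood(records, key=lambda r: r, window=2)

# Same poor key, but a window spanning all records finds the pair
# at the cost of more comparisons.
wide_window = sorted_neighborhood(records, key=lambda r: r, window=4)

# A better key places the two records next to each other.
good_key = sorted_neighborhood(records, key=normalize, window=2)
```

With the raw-string key and a window of 2, the pair is missed; it is found only by widening the window or by choosing a better key, which is the trade-off the two paragraphs describe.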
Although some research results on collision-free Hash functions exist, their computation is complicated, and the resulting Hash codes are overly complex and random.
[0014] (2) Incremental detection of duplicate records in multiple data sets: existing duplicate-record detection methods cannot meet the dynamic, incremental detection requirements of multiple data sets.
Although current methods can compare the Hash code of a newly added record with the Hash codes of earlier records to decide whether it is a duplicate, they cannot incrementally detect changes in duplicate records caused by modifying the data set or deleting records, unless all records of the new data set are detected again.
[0015] (3) Sharing and management of record Hash buckets for multi-source data sets: in existing methods, the records of each data source are hashed into that source's own set of buckets, and the hashed records in the buckets of different data sources are then compared with one another, which is inefficient.
[0016] (4) The traditional Hash partition method requires that memory hold at least all the records of one Hash partition. If the information source is so large that memory cannot hold all the records of a single partition, the algorithm cannot run.
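
The limitation described in [0014] can be sketched as follows. `AppendOnlyDetector` is a hypothetical, simplified stand-in for a detector that only compares the hash of each new record with previously seen hashes; it uses Python's built-in `hash` and is not the method of any particular system.

```python
class AppendOnlyDetector:
    """Remembers the hash of every record ever added; has no notion of
    deletion or modification."""

    def __init__(self):
        self.seen = set()

    def add(self, record):
        # Returns True if a record with the same hash was added before.
        # (Equal hashes are treated as duplicates in this sketch.)
        h = hash(record)
        is_duplicate = h in self.seen
        self.seen.add(h)
        return is_duplicate

detector = AppendOnlyDetector()
first = detector.add(("Ann", "30"))    # new record, not a duplicate

# Suppose ("Ann", "30") is now deleted from its data set. The detector
# cannot be told, so its stored hash becomes stale.

second = detector.add(("Ann", "30"))   # reported as a duplicate, although
                                       # only one copy exists after the delete
```

Without re-scanning every record, such a detector has no way to correct the stale result, which is the gap that incremental detection of modifications and deletions is meant to close.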


Embodiment

[0082] As shown in Figure 1, a hash-based method for dynamically checking duplicate records in multiple datasets includes the following steps:

[0083] A method for dynamic detection of duplicate records in multiple datasets, comprising the following steps:

[0084] (1) Read a record from the initial data set. Suppose the record consists of N intrinsic fields, the i-th of which is f_i, where 1 ≤ i ≤ N;

[0085] (2) Compute the Hash code of the record, as follows:

[0086] The Hash function is as follows:

[0087] $H_i = \mathrm{hashCode}(f_1)$ for $i = 1$; $H_i$ ...


Abstract

The invention discloses a method for dynamically and concurrently detecting groups of records with identical content across the data sets of multiple information sources. In the method, each original record or change record is read from the data sets of the information sources; the Hash code and check code of each record are computed from the record's intrinsic fields using the Hash function and check-code function constructed by the invention; a group of buckets shared by the data sets of all information sources, together with the related bucket information, is dynamically created and modified; and the duplicate record groups distributed across the information sources are quickly detected. The method offers high efficiency, high accuracy, and high memory utilization, and it can also perform incremental detection dynamically.
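
The overall flow in the abstract (one set of buckets shared by all sources, keyed by a Hash code and a check code, updated as records are added, modified, or deleted) might be sketched as below. This is only an illustration under stated assumptions: the patent's own Hash and check-code functions are not reproduced here, so truncated SHA-256 and BLAKE2b digests stand in for them, and the class and method names are hypothetical.

```python
import hashlib
from collections import defaultdict

SEP = "\x1f"  # field separator; assumes field values never contain it

def hash_code(fields):
    # Stand-in for the patent's Hash function (assumption).
    return hashlib.sha256(SEP.join(fields).encode("utf-8")).hexdigest()[:8]

def check_code(fields):
    # Stand-in for the patent's check-code function (assumption):
    # a second, independent digest used to reduce false matches.
    return hashlib.blake2b(SEP.join(fields).encode("utf-8")).hexdigest()[:8]

class SharedBuckets:
    """One bucket set shared by all sources."""

    def __init__(self):
        self.buckets = defaultdict(set)  # (hash, check) -> {(source, record_id)}
        self.index = {}                  # (source, record_id) -> (hash, check)

    def add(self, source, record_id, fields):
        key = (hash_code(fields), check_code(fields))
        self.buckets[key].add((source, record_id))
        self.index[(source, record_id)] = key

    def delete(self, source, record_id):
        key = self.index.pop((source, record_id))
        self.buckets[key].discard((source, record_id))
        if not self.buckets[key]:
            del self.buckets[key]

    def modify(self, source, record_id, fields):
        # A modification moves the record from its old bucket to a new one.
        self.delete(source, record_id)
        self.add(source, record_id, fields)

    def duplicate_groups(self):
        # Records sharing both codes are reported as one duplicate group.
        return [sorted(members) for members in self.buckets.values()
                if len(members) > 1]

shared = SharedBuckets()
shared.add("A", 1, ["Ann", "30"])
shared.add("B", 7, ["Ann", "30"])
shared.add("A", 2, ["Bob", "41"])
groups_before = shared.duplicate_groups()

shared.modify("B", 7, ["Ann", "31"])
groups_after = shared.duplicate_groups()
```

In this sketch, each change touches only the buckets of the affected record, so a modification or deletion updates the duplicate groups without re-reading the other records. The sketch treats equal code pairs as duplicates without comparing the fields themselves, which is a simplification.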

Description

Technical field

[0001] The invention relates to the field of computer data processing, in particular to a dynamic detection method for duplicate records in multiple data sets.

Background art

[0002] As computer applications become more widespread, the rate of data growth is rising year by year. At the same time, the data redundancy rate of many enterprises is also rising; that is, a large amount of redundant data is distributed across LANs, WANs, and SANs (Storage Area Networks). This not only increases spending on storage equipment, storage-related operating costs, and management costs, but also seriously hinders the construction of information integration platforms and data centers and produces erroneous statistics and integrated data. Therefore, duplicate data detection and deletion technology is considered to be one of the hottest technologies in the information field. The core of data deduplication technology is the detection method of d...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 刘波, 潘久辉, 张武
Owner: JINAN UNIVERSITY