Improved method for sorting data set through sorting keywords

A sorting-keyword and data-set technology, applied in the field of big data, that addresses cumbersome data cleaning steps and the low efficiency of cleaning duplicate records. It increases the chance that potential duplicate records are initially clustered into adjacent positions, which raises the probability that they are identified as duplicates and improves cleaning efficiency.

Inactive Publication Date: 2018-05-04
ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD

AI Technical Summary

Problems solved by technology

[0010] The technical problem to be solved by the present invention is that existing data cleaning steps are cumbersome and duplicate records are cleaned inefficiently. The purpose is to provide an improved method for sorting data sets by sorting keywords that simplifies the data cleaning steps and improves efficiency.


Examples

Embodiment

[0029] An improved method for sorting data sets by sorting keywords, comprising:

[0030] Step 1, preprocessing;

[0031] Step 2, duplicate record detection: duplicate records are detected through field matching and record matching;

[0032] Step 3, database-level clustering of duplicate records: a database-level duplicate record detection algorithm clusters the duplicate records across the entire data set;

[0033] Step 4, using an external source file to correct errors in the sorting keywords and unify the data format;

[0034] Step 5, sorting the words within the sorting keywords;

[0035] Step 6, conflict handling: the duplicate records detected in the same duplicate record cluster are merged or deleted according to rules, and only the correct record is kept.
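Steps 2, 3 and 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field weights, the similarity threshold, the use of `difflib.SequenceMatcher` for field matching, the union-find clustering, and the "keep the most complete record" rule are all hypothetical choices made for the example, since the text does not specify them.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical field weights and threshold; a real deployment would tune these.
WEIGHTS = {"name": 0.6, "city": 0.4}
THRESHOLD = 0.85


def field_similarity(a: str, b: str) -> float:
    """Field matching: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: Record, r2: Record) -> float:
    """Record matching: weighted sum of the per-field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def cluster_duplicates(records: List[Record]) -> List[List[int]]:
    """Group record indices into duplicate clusters with a small union-find."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j]) >= THRESHOLD:
                parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def resolve_conflicts(records: List[Record], clusters: List[List[int]]) -> List[Record]:
    """Assumed rule: keep the record with the most non-empty fields per cluster."""
    def completeness(i: int) -> int:
        return sum(1 for v in records[i].values() if v.strip())

    return [records[max(c, key=completeness)] for c in clusters]


records = [
    {"name": "John Smith", "city": "Hefei", "phone": ""},
    {"name": "Jon Smith", "city": "Hefei", "phone": "555-0100"},
    {"name": "Mary Adams", "city": "Wuhu", "phone": "555-0199"},
]
clusters = cluster_duplicates(records)
cleaned = resolve_conflicts(records, clusters)
```

The sketch compares every pair of records, which is quadratic in the number of records. The preliminary sorting in step 1 and the keyword handling in steps 4 and 5 aim to bring likely duplicates close together, so a practical version could restrict comparisons to neighbouring records.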

[0036] The preprocessing of step 1 includes:

[0037] Step 11, attribute selection: selecting an attribute for record matching;

[0038] Step 12, preliminary clustering: sorting the records in the database;

[0...
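The two preprocessing sub-steps that are visible above (steps 11 and 12) might look like the following Python sketch. The selection heuristic (prefer the attribute with the fewest empty values, then the most distinct values) is an assumption for illustration; the text only says that an attribute is selected and the records are sorted.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def select_sort_attribute(records: List[Record]) -> str:
    """Step 11 (assumed heuristic): pick the attribute with the fewest
    empty values, breaking ties by the number of distinct values."""
    def score(attr: str) -> Tuple[int, int]:
        values = [r.get(attr, "").strip() for r in records]
        filled = sum(1 for v in values if v)
        distinct = len({v.lower() for v in values if v})
        return (filled, distinct)

    return max(records[0].keys(), key=score)


def preliminary_cluster(records: List[Record], attr: str) -> List[Record]:
    """Step 12: sort on the chosen attribute so similar values become adjacent."""
    return sorted(records, key=lambda r: r.get(attr, "").strip().lower())


records = [
    {"name": "Li Ming", "city": "Hefei", "phone": ""},
    {"name": "Wang Fang", "city": "Hefei", "phone": "555-0101"},
    {"name": "li ming", "city": "Hefei", "phone": "555-0100"},
]

attr = select_sort_attribute(records)
ordered = preliminary_cluster(records, attr)
```

After sorting, the two "Li Ming" records sit next to each other, which is the property that later duplicate detection relies on.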


Abstract

The invention discloses an improved method for sorting a data set through sorting keywords. The method comprises the steps of: 1, preprocessing; 2, duplicate record detection, wherein duplicate record detection is achieved through field matching and record matching; 3, database-level duplicate record clustering, wherein duplicate records in the whole data set are clustered through a database-level medical duplicate record detection algorithm; 4, correction of errors and data format unification, wherein the errors in the sorting keywords are corrected by using an external source file; 5, sorting of words in the sorting keywords; 6, conflict resolution, wherein according to a rule, the duplicate records detected in the same duplicate record cluster are combined or deleted, and only the correct record in the cluster is retained. The method increases the probability that potential duplicate records are preliminarily clustered into adjacent positions, and thereby increases the probability that records in the data set are identified as duplicates.
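The effect of steps 4 and 5 on adjacency can be illustrated with a short Python sketch. The `CORRECTIONS` dictionary is a hypothetical stand-in for the external source file, and the tokenization rule is an assumption; neither is specified in the text.

```python
import re

# Hypothetical stand-in for the external source file: maps known
# misspellings to one canonical token.
CORRECTIONS = {"jhon": "john", "smtih": "smith"}


def normalize_key(raw_key: str) -> str:
    """Correct known errors, unify case and spacing, then sort the words."""
    tokens = re.findall(r"[a-z0-9]+", raw_key.lower())
    corrected = [CORRECTIONS.get(tok, tok) for tok in tokens]
    return " ".join(sorted(corrected))


records = ["Smith John", "Adams Mary", "Jhon Smith", "Zhang Wei", "john  SMITH"]

# Sorting on the raw key leaves the three "John Smith" variants scattered;
# sorting on the normalized key places them next to each other.
by_raw = sorted(records)
by_normalized = sorted(records, key=normalize_key)
```

Because word order, case, spacing and known misspellings no longer affect the key, all three variants map to the same key `john smith` and end up in consecutive positions after sorting.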

Description

Technical field

[0001] The invention relates to the field of big data, in particular to an improved method for sorting data sets by sorting keywords.

Background

[0002] Similar terms have appeared over the history of data processing, including ultra-large-scale data and massive data. "Ultra-large-scale" generally refers to data at the GB level (1GB=1024MB), "massive" generally refers to data at the TB level (1TB=1024GB), and the current "big data" refers to data at the PB (1PB=1024TB) or EB (1EB=1024PB) level, or even above the ZB (1ZB=1024EB) level. In 2013, Gartner predicted that the data stored in the world would reach 1.2ZB. If these data were burned to CD-R discs and stacked, the pile would be five times as high as the distance from the earth to the moon. Different scales bring different technical problems and research challenges.

[0003] Big data refers to a collection of data that cannot be captured, managed and processed by conventional software to...
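The unit ladder in paragraph [0002] and the disc-stack comparison can be checked with a few lines of Python. The disc capacity (700 MB), disc thickness (1.2 mm) and mean Earth-Moon distance (384,400 km) are assumptions brought in for the estimate; they do not appear in the text.

```python
# Binary unit ladder as defined in paragraph [0002], in bytes.
MB = 1024 ** 2
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB
EB = 1024 * PB
ZB = 1024 * EB

# Assumed physical constants (not from the source text).
CD_CAPACITY_BYTES = 700 * MB   # nominal CD-R capacity
CD_THICKNESS_M = 1.2e-3        # standard disc thickness in metres
EARTH_MOON_M = 384_400e3       # mean Earth-Moon distance in metres

discs = 1.2 * ZB / CD_CAPACITY_BYTES
stack_height_m = discs * CD_THICKNESS_M
ratio = stack_height_m / EARTH_MOON_M
```

Under these assumptions `ratio` lands in the same range as the "five times" figure quoted above; the exact value depends on whether binary or decimal prefixes are used.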


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/215
Inventor: 石文威
Owner: ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD