Improved method for sorting data set through sorting keywords

A sorting-keyword and data-set technique, applied in the field of big data, that addresses cumbersome data cleaning steps and the low efficiency of cleaning duplicate records. It increases the chance that potential duplicate records are initially clustered into adjacent positions, increases the probability that they are identified as duplicate records, and improves cleaning efficiency.

Inactive Publication Date: 2018-05-04

ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD

Problems solved by technology

[0010] The technical problem to be solved by the present invention is that existing data cleaning steps are cumbersome and duplicate records are cleaned inefficiently. The purpose is to provide an improved method for sorting data sets by sorting keywords that simplifies the data cleaning steps and improves efficiency.

Embodiment

[0029] An improved method for sorting data sets by sorting keywords, comprising:

[0030] Step 1, preprocessing;

[0031] Step 2, duplicate record detection: duplicate records are detected through field matching and record matching;

[0032] Step 3, database-level clustering of duplicate records: a database-level duplicate record detection algorithm clusters the duplicate records in the entire data set;

[0033] Step 4, using external source files to correct errors in the sorting keywords and unify the data format;

[0034] Step 5, sorting the words within the sorting keywords;

[0035] Step 6, conflict handling: the duplicate records detected in the same duplicate record cluster are merged or deleted according to rules, and only the correct record is kept.
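Steps 2, 3 and 6 can be illustrated with a minimal Python sketch. The source does not specify the matching measure, the clustering algorithm, or the conflict rules, so everything here is an assumption made for illustration: field matching uses `difflib.SequenceMatcher`, record matching is the unweighted mean of field similarities, clustering is a simple union-find over all record pairs, the `0.85` threshold is arbitrary, and the conflict rule keeps the record with the most characters filled in. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Field matching (assumed measure): similarity of two values in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1, r2, fields):
    """Record matching (assumed rule): unweighted mean of field similarities."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)


def cluster_duplicates(records, fields, threshold=0.85):
    """Group records whose similarity reaches the threshold, using union-find."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], fields) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


def resolve_conflicts(cluster):
    """Conflict handling (assumed rule): keep the record with the most text."""
    return max(cluster, key=lambda r: sum(len(v) for v in r.values()))


records = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "Alice Wong", "city": "Seattle"},
]
clusters = cluster_duplicates(records, ["name", "city"])
cleaned = [resolve_conflicts(c) for c in clusters]
```

The all-pairs comparison is quadratic in the number of records; it is used here only to keep the sketch short and is not a claim about how the patented method compares records.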

[0036] The preprocessing of Step 1 includes:

[0037] Step 11, attribute selection: selecting the attributes used for record matching;

[0038] Step 12, preliminary clustering: sorting the records in the database;
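The two preprocessing sub-steps listed so far can be sketched in Python as follows. This is an illustration under assumptions, not the patent's implementation: the records are plain dictionaries, the sorting keyword is built by joining the selected attributes in lowercase, and the helper names `select_attributes`, `sort_key` and `preliminary_cluster` are hypothetical.

```python
def select_attributes(record, attributes):
    """Step 11 (sketch): keep only the attributes chosen for record matching."""
    return {name: record[name] for name in attributes}


def sort_key(record, key_attributes):
    """Build a sorting keyword by joining the trimmed, lowercased attributes."""
    return " ".join(record[name].strip().lower() for name in key_attributes)


def preliminary_cluster(records, key_attributes):
    """Step 12 (sketch): sort records so that similar keys become neighbours."""
    return sorted(records, key=lambda r: sort_key(r, key_attributes))


records = [
    {"id": "3", "name": "Smith John", "city": "Boston"},
    {"id": "1", "name": "Alice Wong", "city": "Seattle"},
    {"id": "2", "name": "smith john ", "city": "Boston"},
]
projected = [select_attributes(r, ["name", "city"]) for r in records]
ordered = preliminary_cluster(projected, ["name", "city"])
```

After sorting, the two "Smith John" variants sit next to each other, so a later matching pass only needs to compare nearby records.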

[0...

Abstract

The invention discloses an improved method for sorting a data set through sorting keywords. The method comprises the steps of: 1, preprocessing; 2, duplicate record detection, wherein duplicate record detection is achieved through field matching and record matching; 3, database-level duplicate record clustering, wherein duplicate records in the whole data set are clustered through a database-level medical duplicate record detection algorithm; 4, correction of errors and data format unifying, wherein the errors in the sorting keywords are corrected by using an external source file; 5, sorting of words in the sorting keywords; 6, conflict resolution, wherein according to a rule, the duplicate records which are detected in the same duplicate record cluster are combined or deleted, and only the correct record in the cluster is reserved. By means of the method, the probability that potential duplicate records are preliminarily clustered into adjacent positions can be increased, and the probability that the records in the data set are identified as duplicate records is increased.
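The adjacency effect of steps 4 and 5 can be illustrated with a short Python sketch. It is a hedged example, not the patented procedure: the `CORRECTIONS` dictionary is a hypothetical stand-in for the external source file, the sample names are invented, and `normalize_key` is an assumed helper that corrects known errors, unifies case and punctuation, and sorts the words inside each key.

```python
import re

# Hypothetical stand-in for the external source file: known misspellings
# and abbreviations mapped to a canonical form.
CORRECTIONS = {"jhon": "john", "st.": "street", "st": "street"}


def normalize_key(key, corrections=CORRECTIONS):
    """Correct known errors, unify the format, then sort the words in the key."""
    words = re.split(r"[\s,]+", key.strip().lower())
    words = [corrections.get(w, w) for w in words if w]
    return " ".join(sorted(words))


names = [
    "Smith, John",
    "Brown, Ann",
    "Jhon Smith",
    "Ann Brown",
    "Bob Adams",
    "Mary Jones",
]

# Sorting on the raw keys leaves the duplicate pairs separated.
raw_order = sorted(names, key=str.lower)

# Sorting on the normalized keys places each duplicate pair side by side.
normalized_order = sorted(names, key=normalize_key)
```

In this sample, sorting on the raw keys puts "Bob Adams" between "Ann Brown" and "Brown, Ann", while sorting on the normalized keys gives both entries the key `ann brown` and makes them neighbours.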

Description

Technical field

[0001] The invention relates to the field of big data, in particular to an improved method for sorting data sets by sorting keywords.

Background technique

[0002] Similar terms have appeared in the history of data development, including ultra-large-scale data and massive data. "Ultra-large-scale" generally refers to data at the GB level (1GB=1024MB), "massive" generally refers to data at the TB level (1TB=1024GB), and the current "big data" refers to data at the PB (1PB=1024TB) or EB (1EB=1024PB) level, or even above the ZB (1ZB=1024EB) level. In 2013, Gartner predicted that the data stored in the world would reach 1.2ZB. If these data were burned to CD-R read-only discs and piled up, the height would be five times the distance from the earth to the moon. Behind the different scales are different technical problems or challenging research problems.

[0003] Big data refers to a collection of data that cannot be captured, managed and processed by conventional software to...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/215

Inventor: 石文威

Owner: ANHUI KECHUANG INTELLIGENT INTPROP SERVICE CO LTD
