Method for removing duplicate objects based on metadata

A technology for removing duplicate objects based on metadata, applied in the field of data cleaning. It addresses two problems: exact duplicate-detection schemes cannot handle metadata that contains partial data errors, and existing schemes cannot fully meet the requirements of duplicate detection for metadata. By narrowing the scope of comparison, it reduces the workload and improves work efficiency.

Active Publication Date: 2008-10-15
利德科技发展有限公司 +1

AI Technical Summary

Problems solved by technology

Existing duplicate-detection schemes designed for unstructured data therefore cannot fully meet the requirements of duplicate detection for metadata.



Detailed Description of Embodiments

[0042] To address the heavy workload of removing dirty data in existing metadata cleaning, the present invention provides a method for removing duplicate objects based on metadata. Referring to Figure 1, the method includes the following steps:

[0043] 1) Standardize the metadata currently to be entered, and judge whether it is of good quality;

[0044] 2) Compare the good-quality metadata to be entered with each record in the data set, and determine whether the data set contains a record that duplicates the metadata to be entered;

[0045] 3) If a duplicate record exists, keep the better-quality record of the two in the data set.
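The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the `normalize` and `quality` helpers, the `min_quality` cutoff, and the choice to append a record when no duplicate is found are all hypothetical, and the duplicate test is passed in as a function because the steps do not define it.

```python
from typing import Callable

Record = dict[str, str]


def normalize(record: Record) -> Record:
    """Illustrative standardization: trim whitespace, lowercase field names."""
    return {key.strip().lower(): value.strip() for key, value in record.items()}


def quality(record: Record) -> int:
    """Hypothetical quality score: the number of non-empty fields."""
    return sum(1 for value in record.values() if value)


def enter_record(
    dataset: list[Record],
    incoming: Record,
    is_duplicate: Callable[[Record, Record], bool],
    min_quality: int = 3,  # assumed cutoff, not taken from the source
) -> bool:
    """Apply steps 1) to 3); return True if the data set was changed."""
    # Step 1: standardize, then judge quality (assumption: low quality is rejected).
    candidate = normalize(incoming)
    if quality(candidate) < min_quality:
        return False

    # Step 2: compare with each record in the data set.
    for index, existing in enumerate(dataset):
        if is_duplicate(existing, candidate):
            # Step 3: keep whichever of the two records has better quality.
            if quality(candidate) > quality(existing):
                dataset[index] = candidate
                return True
            return False

    # Assumption: a good-quality record with no duplicate is simply entered.
    dataset.append(candidate)
    return True
```

Passing the duplicate test as a parameter keeps this sketch independent of how similarity is actually measured.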

[0046] The information about a book on the Internet includes a large amount of metadata, most of which is dirty data, that is, data of poor quality. For example: Title: Romance of the Three Kingdoms; International Standard Book Number: ISBN7-305-01568-7; Publisher Number: 305; Publish...
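As one illustration of what standardizing such a record might involve, the sketch below cleans up an ISBN string like the one in the example and applies the standard ISBN-10 check-digit rule. The function names are hypothetical, and the source does not say that the method validates check digits or derives the publisher number this way; both are assumptions made for the example.

```python
import re
from typing import Optional


def standardize_isbn10(raw: str) -> Optional[str]:
    """Strip an 'ISBN' prefix, spaces and hyphens; return None if malformed."""
    body = re.sub(r"^isbn[:\s]*", "", raw.strip(), flags=re.IGNORECASE)
    body = re.sub(r"[\s-]", "", body).upper()
    # An ISBN-10 is nine digits followed by a digit or 'X'.
    return body if re.fullmatch(r"\d{9}[\dX]", body) else None


def isbn10_checksum_ok(isbn: str) -> bool:
    """ISBN-10 rule: the sum of digits weighted 10 down to 1 is divisible by 11."""
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def publisher_segment(raw: str) -> Optional[str]:
    """Return the second hyphen-separated segment of a four-part ISBN-10."""
    body = re.sub(r"^isbn[:\s]*", "", raw.strip(), flags=re.IGNORECASE)
    parts = body.split("-")
    return parts[1] if len(parts) == 4 else None


isbn = standardize_isbn10("ISBN7-305-01568-7")
print(isbn)                                    # 7305015687
print(isbn10_checksum_ok(isbn))                # True
print(publisher_segment("ISBN7-305-01568-7"))  # 305
```

For the ISBN quoted above, the second segment agrees with the listed publisher number 305, so a mismatch between the two fields could serve as one signal of poor quality.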


Abstract

The invention discloses a method for removing duplicate objects based on metadata. It relates to the field of metadata cleaning and addresses the heavy workload of removing duplicate data. The method first standardizes the metadata to be entered. During comparison, it narrows the comparison range, which reduces the workload and improves efficiency: from the data set, only records whose publisher field matches that of the metadata to be entered are selected, and within the selected records only the ISBN, title, author, publisher, publication date and price fields are compared. A weighted similarity function computes, for each of these fields, the similarity between the metadata to be entered and the corresponding attribute value in the data set record; each field's similarity value is multiplied by its weight, and the products are summed to give a composite similarity value. The composite similarity value is compared with a preset threshold; if it is not less than the threshold, the current record in the data set and the metadata to be entered are treated as duplicates.
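A minimal Python sketch of the comparison described in the abstract is given here. The weights, the threshold and the field names are hypothetical values chosen for illustration, and `difflib.SequenceMatcher` stands in for the per-field similarity function, which the abstract does not specify.

```python
from difflib import SequenceMatcher

Record = dict[str, str]

# Hypothetical weights for the six compared fields (assumed to sum to 1).
WEIGHTS = {
    "isbn": 0.40,
    "title": 0.25,
    "author": 0.15,
    "publisher": 0.05,
    "pub_date": 0.10,
    "price": 0.05,
}
THRESHOLD = 0.85  # assumed preset threshold


def field_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the source does not name the function."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def composite_similarity(a: Record, b: Record) -> float:
    """Multiply each field's similarity by its weight and sum the products."""
    return sum(
        weight * field_similarity(a.get(field, ""), b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def find_duplicates(dataset: list[Record], incoming: Record) -> list[Record]:
    """Return data set records judged to duplicate the incoming metadata."""
    # Narrow the comparison range: only records with the same publisher field.
    candidates = [
        record for record in dataset
        if record.get("publisher") == incoming.get("publisher")
    ]
    # A record is a duplicate if the composite value is not less than the threshold.
    return [
        record for record in candidates
        if composite_similarity(record, incoming) >= THRESHOLD
    ]
```

Filtering on the publisher field first means the more expensive field-by-field comparison runs only on a small subset of the data set, which is the workload reduction the abstract describes.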

Description

Technical field

[0001] The invention relates to a method for cleaning data, in particular to a method for removing duplicate objects from a data set.

Background

[0002] In the information society, information can be divided into two categories. One category can be represented by data or a unified structure, such as numbers and symbols; we call this structured data. The other category cannot be represented by numbers or a unified structure, such as text, images, sounds and web pages; we call this unstructured data. Structured data belongs to unstructured data and is a special case of it.

[0003] A structured data type is a user-defined data type that contains non-atomic elements; more precisely, such data types are divisible, and they can be used either individually or as a single unit.

[0004] In the library and information field, metadata is defined as: providing structured data about inform...


Application Information

IPC(8): G06F17/30, G06F19/00
Inventor 高飞
Owner 利德科技发展有限公司