Similar data de-duplication method

A data and database technology, applied in the fields of electrical digital data processing, special-purpose data processing applications, and instruments. It addresses the lack of an effective method for de-duplicating data that are not identical, and achieves fast de-duplication processing with improved accuracy and speed.

Active Publication Date: 2015-02-18
北京世纪读秀技术有限公司

AI Technical Summary

Problems solved by technology

[0004] However, most current de-duplication methods only handle rapid de-duplication of exactly identical data. There is still no effective method for de-duplicating data that are different but reflect the same information, that is, similar data. De-duplication of similar data has therefore become a new direction.

Examples

Embodiment Construction

[0028] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention.

[0029] In the present invention, duplication of similar data refers to duplication among data that are different but reflect the same information. The similar data may be a single item of similar data or multiple items of similar data.
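
To illustrate this distinction, the following minimal sketch compares two hypothetical records that differ in case, spacing, and punctuation but describe the same item. The record strings and the choice of SHA-256 are illustrative assumptions and are not taken from the invention; the point is only that an exact-match digest treats such records as distinct.

```python
import hashlib

# Two hypothetical records that differ as strings but describe the same book.
record_a = "Data Structures, 2nd Edition"
record_b = "data  structures 2nd edition"


def exact_digest(text: str) -> str:
    """Digest of the raw bytes; any difference in the text changes it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Exact-match de-duplication sees two different items here.
exact_duplicates = exact_digest(record_a) == exact_digest(record_b)
print(exact_duplicates)  # False
```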

[0030] Referring to Figure 1, in the first embodiment of the present invention, in which the obtained similar data is a single item of similar data, the similar data de-duplication method provided by the present invention includes the following steps:

[0031] Step 1: Input the obtained similar data into the server;

[0032] Step 2: Extract the feature vectors of the similar data, preprocess each piece of information in the feature vectors, an...

Abstract

The invention provides a similar data de-duplication method comprising the following steps: inputting the acquired similar data to a server; extracting the feature vectors of the similar data and preprocessing each piece of information in the feature vectors to obtain character-type index data for that information; performing code conversion on the index data to generate numerical hash data for the information; judging, one by one according to the weight of each piece of information, whether the hash data of the feature vectors are the same as the standard data stored in a database server; and feeding the results back to users. The method can quickly de-duplicate similar data, that is, data that are different but reflect the same information, with high accuracy and good stability.
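
The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field names, the weights, the preprocessing rule (lowercase, keep only letters and digits), the SHA-1-based numeric encoding, and the rule that a record counts as a duplicate only when every field matches are all hypothetical choices, since the source does not specify them.

```python
import hashlib

# Hypothetical field weights: higher-weight fields are compared first.
WEIGHTS = {"title": 3, "author": 2, "publisher": 1}


def preprocess(value: str) -> str:
    """Reduce a field to character-type index data.

    Assumed rule: lowercase the text and keep only letters and digits.
    """
    return "".join(ch for ch in value.lower() if ch.isalnum())


def to_numeric_hash(index_data: str) -> int:
    """Convert index data to a 64-bit integer (assumed encoding)."""
    digest = hashlib.sha1(index_data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(features: dict) -> dict:
    """Numeric hash for every weighted field; missing fields hash as empty."""
    return {
        name: to_numeric_hash(preprocess(features.get(name, "")))
        for name in WEIGHTS
    }


def is_duplicate(features: dict, standard_records: list) -> bool:
    """Compare field hashes in descending weight order against stored records."""
    candidate = hash_features(features)
    ordered = sorted(WEIGHTS, key=WEIGHTS.get, reverse=True)
    for standard in standard_records:
        # all() stops at the first mismatching field, so the
        # highest-weight field rules out most records early.
        if all(candidate[name] == standard[name] for name in ordered):
            return True
    return False


# Hypothetical standard data, stored as hashed feature vectors.
stored = [hash_features(
    {"title": "Data Structures", "author": "J. Smith", "publisher": "Example Press"}
)]
incoming = {"title": "DATA STRUCTURES", "author": "J Smith", "publisher": "Example  Press"}
unrelated = {"title": "Algorithms", "author": "J. Smith", "publisher": "Example Press"}

print(is_duplicate(incoming, stored))   # True
print(is_duplicate(unrelated, stored))  # False
```

Comparing the highest-weight field first is one plausible reading of "one by one according to the weight"; a weighted score with a threshold would be another.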

Description

Technical field

[0001] The invention relates to the technical field of data information processing, in particular to a method capable of de-duplicating large-scale similar data information.

Background technique

[0002] With the continuous development of information technology, large amounts of information of various types are emerging, and practical applications increasingly require de-duplication of large volumes of data. For example, a search engine system needs to determine which data information has already been collected. Because the amount of data information on the Internet is so large, a dedicated method is needed to judge newly discovered data information and check whether it has already been included in the information database. If the data information already exists, only its information source attribute needs to be updated; if it does not exist, it is necessary to collect data information and...
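
The search-engine check described in the background can be sketched as below. This is only an illustration: the in-memory dictionary standing in for the information database, the function name `record_item`, and the example keys and sources are hypothetical, and whatever follows collection in the truncated text is not modeled.

```python
def record_item(database: dict, key: str, source: str) -> bool:
    """Return True if the item was newly collected, False if it already existed.

    `database` is a hypothetical stand-in for the information database:
    it maps an item's key to the set of sources where it was found.
    """
    if key in database:
        database[key].add(source)  # item exists: update the source attribute only
        return False
    database[key] = {source}       # item is new: collect it
    return True


database = {}
first = record_item(database, "item-1", "site-a.example")
second = record_item(database, "item-1", "site-b.example")
print(first, second)              # True False
print(sorted(database["item-1"]))  # ['site-a.example', 'site-b.example']
```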

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 杨健
Owner: 北京世纪读秀技术有限公司