Method and device for removing duplication from data mining

A technique based on duplication patterns in data resources, applied in the field of data processing. It addresses problems such as scattered work focus, the inability to enumerate duplication categories, and difficult research and development, and achieves the effect of resolving duplication.

Active Publication Date: 2010-05-05
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0007] Mining repetition rules by manual classification has the following main drawbacks. First, the repetition classes cannot be enumerated. One iterative taxonomy is mined

Examples


Example 1

[0066] Extract the pattern and feature fields of the following URL:

[0067] www.gouwo.com/service/View.aspx?SubjectID=8040&page=3

[0068] Following the processing described above, steps 1-5 are executed: each component of the URL is scanned and processed from left to right.

[0069] First, split the URL at its delimiters into parts such as the site name, path name, and file name, and add the site name (i.e., www.gouwo.com) to the pattern.

[0070] Then process the directory path and file name, that is, add each part of "/service/View.aspx?SubjectID" to the pattern one by one. The purely numeric field "8040" is replaced with the replacement character "*". The remaining unprocessed parts of the URL are handled by the same rules. The resulting pattern is:

[0071] www.gouwo.com/service/View.aspx?SubjectID=*&page=*

[0072] It should be noted that in the above extraction process, since the purely numeric fields "8040" and "3" ...
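The walk-through in Example 1 can be sketched in Python as follows. This is a minimal illustration, not the patented procedure: the function name `extract_pattern`, the delimiter set `/ ? & =`, and the rule "a token made only of ASCII digits is a feature field" are assumptions chosen so that the sketch reproduces the pattern shown in [0071].

```python
import re


def extract_pattern(url: str, wildcard: str = "*") -> tuple[str, list[str]]:
    """Return (pattern, feature_fields) for a scheme-less URL.

    Assumption: tokens are separated by '/', '?', '&' and '=', and a token
    consisting only of ASCII digits is treated as a feature field.
    """
    # The capture group keeps the delimiters so the pattern can be rebuilt.
    tokens = re.split(r"([/?&=])", url)
    pattern_parts: list[str] = []
    feature_fields: list[str] = []
    for token in tokens:  # scan the components from left to right
        if re.fullmatch(r"[0-9]+", token):
            feature_fields.append(token)
            pattern_parts.append(wildcard)
        else:
            pattern_parts.append(token)
    return "".join(pattern_parts), feature_fields


pattern, fields = extract_pattern(
    "www.gouwo.com/service/View.aspx?SubjectID=8040&page=3"
)
print(pattern)  # www.gouwo.com/service/View.aspx?SubjectID=*&page=*
print(fields)   # ['8040', '3']
```

Because the site name is never split on ".", it is copied into the pattern unchanged, which matches the step described in [0069].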

Example 2

[0075] Extract the pattern and feature fields of the following URL:

[0076] istock.jrj.com.cn/forum456/mtopic789.html

[0077] Through the above extraction process, the following is obtained:

[0078] Pattern: istock.jrj.com.cn/forum#/mtopic#.html

[0079] There are two feature fields: the value of feature field 1 is 456, and the value of feature field 2 is 789.
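Example 2 differs from Example 1 in that the digits are embedded inside mixed fields such as "forum456". A minimal Python sketch that reproduces the result above is given below. The function name `extract_embedded` and the rule "replace every run of digits after the site name with '#'" are assumptions; the text shown here does not spell out how the method decides between the "*" and "#" replacement characters.

```python
import re


def extract_embedded(url: str, marker: str = "#") -> tuple[str, list[str]]:
    """Return (pattern, feature_fields) for a scheme-less URL.

    Assumption: the site name (everything before the first '/') is copied
    unchanged, and each run of ASCII digits after it is one feature field.
    """
    site, slash, rest = url.partition("/")
    feature_fields = re.findall(r"[0-9]+", rest)
    pattern = site + slash + re.sub(r"[0-9]+", marker, rest)
    return pattern, feature_fields


pattern, fields = extract_embedded("istock.jrj.com.cn/forum456/mtopic789.html")
print(pattern)  # istock.jrj.com.cn/forum#/mtopic#.html
print(fields)   # ['456', '789']
```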

[0080] According to the embodiment of the present invention, the principle of mining duplication based on feature values is as follows: first, patterns and feature fields are extracted from the identification information associated with data resources; these are then processed according to certain duplication-mining rules to obtain feature values; finally, the feature values are compared to judge whether they correspond to duplicate data resources, thereby achieving the purpose of mining duplication.

[0081] Therefore, how to extract the feature value from the identification inf...

Example 3

[0088] Extract the patterns and feature fields from the following URLs, and calculate the common feature field:

[0089] URL1: www.shufa.com/product/view.asp?id=112404

[0090] URL2: www.shufa.com/product/view.asp?id=112404&p=112404

[0091] URL3: www.shufa.com/product/view.asp?id=112404&n=112404

[0092] First, the patterns and feature fields of the above URLs are extracted separately. The patterns P1, P2, and P3 corresponding to URL1, URL2, and URL3 are:

[0093] P1: www.shufa.com/product/view.asp?id=*

[0094] P2: www.shufa.com/product/view.asp?id=*&p=*

[0095] P3: www.shufa.com/product/view.asp?id=*&n=*

[0096] Analysis shows that the common feature field is "112404". Next, record the position of the common feature field in each of the above URLs, and add this position to the corresponding pattern to obtain three pattern feature position strings:

[0097] PS1: www.shufa.com/product/view.asp?id=*1

[0098] PS2: w...
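The common-feature-field step of Example 3 can be sketched in Python as below. This is an illustration under stated assumptions: the helper names `feature_fields` and `common_feature_positions` are hypothetical, a feature field is taken to be any purely numeric token between the delimiters `/ ? & =`, and positions are reported as 1-based indices among the feature fields of each URL. How the method encodes those positions into the pattern feature position strings beyond PS1 is not visible in the text, so the sketch returns the positions as lists rather than building the strings.

```python
import re


def feature_fields(url: str) -> list[str]:
    """Purely numeric tokens of a URL, in left-to-right order (assumed rule)."""
    return [t for t in re.split(r"[/?&=]", url) if re.fullmatch(r"[0-9]+", t)]


def common_feature_positions(urls: list[str]) -> tuple[set[str], list[list[int]]]:
    """Return the values shared by every URL and where they occur in each URL."""
    per_url = [feature_fields(u) for u in urls]
    if not per_url:
        return set(), []
    # A value is "common" only if it appears among the fields of every URL.
    common = set(per_url[0]).intersection(*(set(f) for f in per_url[1:]))
    positions = [
        [i for i, value in enumerate(fields, start=1) if value in common]
        for fields in per_url
    ]
    return common, positions


urls = [
    "www.shufa.com/product/view.asp?id=112404",
    "www.shufa.com/product/view.asp?id=112404&p=112404",
    "www.shufa.com/product/view.asp?id=112404&n=112404",
]
common, positions = common_feature_positions(urls)
print(common)     # {'112404'}
print(positions)  # [[1], [1, 2], [1, 2]]
```

For URL1 the only common field sits at position 1, which is consistent with the trailing "1" in PS1.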


Abstract

The invention discloses a method, a device, and a system for removing duplication in data mining. The method comprises the following steps: receiving the feature values of two or more pieces of identification information; determining that the pieces of identification information are duplicates when their feature values are identical; and selecting one of the duplicate pieces of identification information to represent them.
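The three steps of the abstract (receive feature values, treat identical values as duplicates, select one representative) can be sketched in Python as follows. This is a minimal sketch, not the claimed implementation: the function name `deduplicate`, the example URLs, and the feature (characteristic) values "fv-a" and "fv-b" are hypothetical, and choosing the first identifier seen as the representative is an assumption, since the abstract only says that one is selected.

```python
from collections import defaultdict


def deduplicate(items: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each representative identifier to the identifiers it stands for.

    `items` holds (identification_info, feature_value) pairs. Identifiers
    with an identical feature value are treated as duplicates.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for identifier, feature_value in items:
        groups[feature_value].append(identifier)
    # Assumption: the first identifier seen in each group represents it.
    return {ids[0]: ids for ids in groups.values()}


# Hypothetical input: feature values are made up for illustration only.
items = [
    ("www.example.com/view.asp?id=1", "fv-a"),
    ("www.example.com/view.asp?id=1&p=1", "fv-a"),
    ("www.example.com/list.asp?id=2", "fv-b"),
]
for representative, members in deduplicate(items).items():
    print(representative, "represents", len(members), "identifier(s)")
```

Grouping through a dictionary keyed by feature value keeps the comparison linear in the number of identifiers, instead of comparing every pair.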

Description

Technical field

[0001] The present invention relates to the technical field of data processing, and more particularly to a method and device for removing duplication in data mining.

Background art

[0002] With the continuous development of computer technology, data processing technology has increasingly penetrated all aspects of people's work and life. In particular, with the rapid development of Internet technology, everyone has to deal with various data resources in daily life. When faced with numerous and complex data resources, how to identify the duplication among them has become an urgent problem to be solved.

[0003] Among the various data resources on the Internet, the problem of duplication is particularly serious. Duplicate webpage content often exists on the Internet, that is, two or more URLs (Uniform Resource Locators) point to webpages with exactly the same content. It is a very common pheno...


Application Information

IPC(8): G06F17/30
Inventor: 张岩马飞
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD