Method and device for removing duplication from data mining
A technique for identifying repetitive patterns in data resources, applied in the field of data processing. It addresses problems such as scattered work focus, the inability to enumerate every category of duplication, and difficult research and development, and achieves the effect of resolving duplication problems.
Examples
Example 1
[0066] Extract the pattern and feature fields of the following URL:
[0067] www.gouwo.com/service/View.aspx?SubjectID=8040&page=3
[0068] Following steps 1-5 of the process described above, each component of the URL is scanned and processed from left to right.
[0069] First, split the URL at its delimiters into parts such as the site name, path name, and file name, and add the site name (i.e., www.gouwo.com) to the pattern.
[0070] Then process the directory path and file name: add each part of "/service/View.aspx?SubjectID" to the pattern one by one, and replace the purely numeric field "8040" with the replacement character "*". The remaining unprocessed parts of the URL are extracted by the same rules. The resulting pattern is:
[0071] www.gouwo.com/service/View.aspx?SubjectID=*&page=*
[0072] It should be noted that in the above extraction process, since the purely numeric fields "8040" and "3" ...
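The extraction in Example 1 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_pattern`, the delimiter set `/ ? = &`, and the stripping of spaces around delimiters are all assumptions, since the text only says "delimiter splitting".

```python
import re

# Delimiters assumed for this sketch; the source does not list them.
DELIMS = r"([/?=&])"

def extract_pattern(url, placeholder="*"):
    """Return (pattern, feature_fields) for a URL written without a scheme."""
    url = url.replace(" ", "")       # assumption: spaces around delimiters are not significant
    tokens = re.split(DELIMS, url)   # the capture group keeps the delimiters as tokens
    parts, fields = [], []
    for i, tok in enumerate(tokens):
        # tokens[0] is the site name and is copied into the pattern unchanged
        if i > 0 and re.fullmatch(r"[0-9]+", tok):
            fields.append(tok)       # a purely numeric field becomes a feature field
            parts.append(placeholder)
        else:
            parts.append(tok)
    return "".join(parts), fields

pattern, fields = extract_pattern("www.gouwo.com/service/View.aspx?SubjectID=8040&page=3")
print(pattern)  # www.gouwo.com/service/View.aspx?SubjectID=*&page=*
print(fields)   # ['8040', '3']
```

Scanning the token list in order mirrors the left-to-right processing described above, and keeping the delimiters as tokens lets the pattern be rebuilt by simple concatenation.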
Example 2
[0075] Extract the pattern and feature fields of the following URL:
[0076] istock.jrj.com.cn/forum456/mtopic789.html
[0077] Applying the extraction process above gives:
[0078] Pattern: istock.jrj.com.cn/forum#/mtopic#.html
[0079] There are two feature fields: the value of feature field 1 is 456, and the value of feature field 2 is 789.
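In Example 2 the numeric parts are embedded inside path components ("forum456") rather than standing alone, and the placeholder is "#". A possible sketch of that case follows; the function name `extract_embedded` is hypothetical, and leaving the site name untouched is an assumption carried over from Example 1.

```python
import re

def extract_embedded(url, placeholder="#"):
    """Replace digit runs embedded in the path with a placeholder."""
    url = url.replace(" ", "")              # assumption: spaces around delimiters are not significant
    site, sep, rest = url.partition("/")    # site name is copied into the pattern unchanged
    fields = re.findall(r"[0-9]+", rest)    # each digit run is one feature field, in order
    pattern = site + sep + re.sub(r"[0-9]+", placeholder, rest)
    return pattern, fields

pattern, fields = extract_embedded("istock.jrj.com.cn/forum456/mtopic789.html")
print(pattern)  # istock.jrj.com.cn/forum#/mtopic#.html
print(fields)   # ['456', '789']
```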
[0080] According to the embodiment of the present invention, the principle of feature-value-based deduplication is as follows: first, extract the pattern and feature fields from the identification information associated with each data resource; next, process them according to certain deduplication rules to obtain feature values; finally, compare the feature values to judge whether they correspond to duplicate data resources, thereby achieving deduplication.
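The comparison step of this principle can be illustrated with a short Python sketch. The rule that turns identification information into a feature value is not given in this part of the text, so it is passed in as a parameter; `group_duplicates` and `toy_feature_value` are hypothetical names, and the toy rule (lowercasing and stripping spaces) is only a stand-in, not the patent's rule. The example.com URLs are made-up inputs.

```python
from collections import defaultdict

def group_duplicates(identifiers, feature_value):
    """Group identifiers whose feature values are equal.

    `feature_value` is any function mapping an identifier to a hashable
    value; identifiers that share a value are treated as duplicates.
    """
    groups = defaultdict(list)
    for ident in identifiers:
        groups[feature_value(ident)].append(ident)
    # Only groups with more than one member indicate duplication.
    return [members for members in groups.values() if len(members) > 1]

def toy_feature_value(ident):
    # Stand-in rule for illustration only.
    return ident.replace(" ", "").lower()

urls = [
    "www.example.com/View.aspx?id=1",
    "www.example.com/view.aspx?id=1",
    "www.example.com/view.aspx?id=2",
]
duplicates = group_duplicates(urls, toy_feature_value)
print(duplicates)
```

Keeping the feature-value rule separate from the comparison means different deduplication rules can be tried without changing the grouping logic.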
[0081] Therefore, how to extract the feature value from the identification inf...
Example 3
[0088] Extract the pattern and feature fields from the following URLs, and determine the common feature field:
[0089] URL1: www.shufa.com/product/view.asp?id=112404
[0090] URL2: www.shufa.com/product/view.asp?id=112404&p=112404
[0091] URL3: www.shufa.com/product/view.asp?id=112404&n=112404
[0092] First, extracting the pattern and feature fields of each URL gives the patterns P1, P2, and P3 corresponding to URL1, URL2, and URL3:
[0093] P1: www.shufa.com/product/view.asp?id=*
[0094] P2: www.shufa.com/product/view.asp?id=*&p=*
[0095] P3: www.shufa.com/product/view.asp?id=*&n=*
[0096] Analysis shows that the common feature field is "112404". Next, record the position of the common feature field in each URL and append that position to the corresponding pattern, which yields three pattern feature position strings:
[0097] PS1: www.shufa.com/product/view.asp?id=*1
[0098] PS2: w...
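The steps of Example 3 can be sketched in Python under several assumptions: the "position" is taken to be the 1-based index of a feature field among the URL's feature fields, the common feature field is taken to be a value present in every URL, and the position is appended directly after the "*" it belongs to. Only PS1 is visible in the text, so only that output is checked against it; what this sketch produces for URL2 and URL3 is a guess at the format. The function name `position_strings` is hypothetical.

```python
import re

def position_strings(urls):
    """Return (common_fields, pattern feature position strings)."""
    def is_field(i, tok):
        # tokens[0] is the site name; purely numeric tokens after it are feature fields
        return i > 0 and re.fullmatch(r"[0-9]+", tok) is not None

    # Assumption: delimiters are / ? = & and surrounding spaces are not significant.
    parsed = [re.split(r"([/?=&])", u.replace(" ", "")) for u in urls]
    field_sets = [{t for i, t in enumerate(toks) if is_field(i, t)} for toks in parsed]
    common = set.intersection(*field_sets) if field_sets else set()

    results = []
    for toks in parsed:
        out, pos = [], 0
        for i, tok in enumerate(toks):
            if is_field(i, tok):
                pos += 1  # 1-based index among this URL's feature fields
                out.append("*" + (str(pos) if tok in common else ""))
            else:
                out.append(tok)
        results.append("".join(out))
    return common, results

common, strings = position_strings([
    "www.shufa.com/product/view.asp?id=112404",
    "www.shufa.com/product/view.asp?id=112404&p=112404",
    "www.shufa.com/product/view.asp?id=112404&n=112404",
])
print(common)      # {'112404'}
print(strings[0])  # www.shufa.com/product/view.asp?id=*1
```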