Method and device for detecting repeated URL (Uniform Resource Locator)

An address and address-set technique, applied in the field of network applications, that addresses the limitations of existing methods, such as unbounded growth of storage space and unbearably time-consuming lookups. It achieves a high degree of generalization, is simple to implement, and avoids repeated downloading.

Active Publication Date: 2015-07-15
CHINA UNIONPAY

AI Technical Summary

Problems solved by technology

As the number of downloaded URLs grows, the storage unit's space cannot grow without bound, and searching the huge volume of URL address data becomes unbearably time-consuming.
[0006] The above methods for detecting duplicate URLs are therefore subject to many limitations, and researchers hope to obtain a more efficient and reliable method for detecting duplicate URLs.

Embodiment Construction

[0025] It should be noted that, in each embodiment of the present invention, the first URL address set includes a plurality of URL addresses; after they are grouped, each group includes at least one URL address. The elements of the second and third URL address sets are in one-to-one correspondence with these groups: each element of the second and third URL address sets is itself a next-level URL address set containing one or more URL addresses. Each element of the fourth URL address set includes only one URL address.
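The shape constraints in paragraph [0025] can be written down as a small check. This is only a sketch: the function name `check_shapes`, the use of Python lists and sets, and the example URLs are assumptions made for illustration and do not come from the patent text.

```python
def check_shapes(groups, second_set, third_set, fourth_set):
    """Check the structural constraints stated in paragraph [0025].

    groups      : list of lists of URLs (the grouped first URL address set)
    second_set  : list of sets of URLs, one element per group
    third_set   : list of sets of URLs, one element per group
    fourth_set  : list of single URL addresses (strings)
    """
    # Each group includes at least one URL address.
    assert all(len(g) >= 1 for g in groups)
    # Elements of the second and third sets correspond one-to-one with the groups.
    assert len(second_set) == len(groups) == len(third_set)
    # Each such element is a next-level set with one or more URL addresses.
    assert all(len(e) >= 1 for e in second_set + third_set)
    # Each element of the fourth set is exactly one URL address.
    assert all(isinstance(u, str) for u in fourth_set)
    return True


# Hypothetical data: two groups, so two elements in the second and third sets.
groups = [["http://a.example/p/1", "http://a.example/p/2"], ["http://b.example/q"]]
second = [{"http://a.example/p/{num}"}, {"http://b.example/q"}]
third = [{"http://a.example/p/{num}"}, {"http://b.example/q"}]
fourth = ["http://a.example/p/{num}", "http://b.example/q"]
ok = check_shapes(groups, second, third, fourth)
```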

[0026] In various embodiments of the present invention, the following definitions can be made:

[0027] Definition 1.1 (URL repetition): Given two URL addresses u_1 and u_2, if the corresponding webpage contents doc(u_1) and doc(u_2) are the same or nearly the same, u_1 and u_2 are said to be repeated.
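Definition 1.1 can be illustrated with a minimal sketch. The text does not say how "nearly the same" is measured, so the similarity ratio from Python's standard `difflib` module and the threshold of 0.9 are assumptions chosen only for this example; `is_repeated` is a hypothetical name.

```python
from difflib import SequenceMatcher


def is_repeated(doc_u1: str, doc_u2: str, threshold: float = 0.9) -> bool:
    """Return True if two page contents are the same or nearly the same.

    The similarity measure and the threshold are assumptions for illustration;
    Definition 1.1 itself does not fix either of them.
    """
    if doc_u1 == doc_u2:
        return True
    ratio = SequenceMatcher(None, doc_u1, doc_u2).ratio()
    return ratio >= threshold


# Hypothetical page contents fetched for two URLs.
same = is_repeated("<html>Product A, price 10</html>", "<html>Product A, price 10</html>")
different = is_repeated("<html>Product A</html>", "An unrelated help page about shipping")
```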

[0028] Definition 1.2 (URL patterns): A URL pattern is a generalization of a specific class of URLs. If a URL instance u_1 matches URL pattern r_1, ...

Abstract

The invention relates to a method and a device for detecting repeated URLs (Uniform Resource Locators). The method comprises the following steps: grouping all URL addresses in a first URL address set; for each group, independently generalizing the first characteristic part of each URL address, to form a second URL address set; for each element of the second URL address set, independently generalizing the second characteristic part of each URL address contained in that element, to form a third URL address set; for each element of the third URL address set, independently extracting the similarity part of all URL addresses contained in that element, to form a fourth URL address set; and, if a URL address to be downloaded matches any element of the fourth URL address set, judging that the webpage corresponding to that URL address has already been downloaded, the URL address to be downloaded having been obtained by a web crawler. In this way, webpages are prevented from being downloaded repeatedly, and the work efficiency of the web crawler is improved.
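The steps of the abstract can be sketched end to end as follows. This is not the patented procedure: the abstract does not say how URLs are grouped, what the first and second characteristic parts are, or how the similarity part is extracted. The sketch assumes grouping by host, path depth and query keys, treats numeric path segments as the first characteristic part and query values as the second, and keeps the positions on which all URLs of an element agree as the similarity part. All function names and placeholder tokens are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

NUM, VAL, ANY = "{num}", "{val}", "*"  # placeholder tokens (assumed notation)


def parse(url):
    """Split a URL into (host, path segments, sorted query pairs)."""
    parts = urlsplit(url)
    segs = tuple(s for s in parts.path.split("/") if s)
    query = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return parts.netloc, segs, query


def generalize_first(u):
    """Assumed first characteristic part: numeric path segments."""
    host, segs, query = u
    return host, tuple(NUM if s.isdigit() else s for s in segs), query


def generalize_second(u):
    """Assumed second characteristic part: query parameter values."""
    host, segs, query = u
    return host, segs, tuple((k, VAL) for k, _ in query)


def flatten(u):
    """Turn a parsed URL into one flat tuple of tokens."""
    host, segs, query = u
    return (host, *segs, *(f"{k}={v}" for k, v in query))


def similarity_part(element):
    """Keep tokens shared by every URL in the element; wildcard the rest."""
    rows = [flatten(u) for u in element]
    return tuple(c[0] if len(set(c)) == 1 else ANY for c in zip(*rows))


def build_fourth_set(first_set):
    groups = defaultdict(list)  # grouping key is an assumption
    for url in first_set:
        host, segs, query = parse(url)
        groups[(host, len(segs), tuple(k for k, _ in query))].append((host, segs, query))
    second_set = [{generalize_first(u) for u in g} for g in groups.values()]
    third_set = [{generalize_second(u) for u in e} for e in second_set]
    return [similarity_part(e) for e in third_set]


def already_downloaded(url, fourth_set):
    """Match a candidate URL against every pattern in the fourth set."""
    cand = flatten(generalize_second(generalize_first(parse(url))))
    return any(
        len(cand) == len(p) and all(a == b or b == ANY for a, b in zip(cand, p))
        for p in fourth_set
    )


# Hypothetical downloaded URLs (the first URL address set).
first_set = [
    "http://shop.example/item/1?ref=a",
    "http://shop.example/item/2?ref=b",
    "http://shop.example/about",
]
fourth_set = build_fourth_set(first_set)
hit = already_downloaded("http://shop.example/item/99?ref=z", fourth_set)
miss = already_downloaded("http://shop.example/cart", fourth_set)
```

Under these assumptions the fourth set holds one pattern per group, so lookup cost depends on the number of patterns rather than on the number of downloaded URLs. The wildcard rule is deliberately crude and can over-generalize when a group mixes unrelated pages.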

Description

Technical field

[0001] The invention relates to the technical field of network applications, and more specifically, to a method and device for detecting repeated URLs.

Background technique

[0002] In recent years, e-commerce websites have flourished and become the main entrance for people to shop online. The pages of these websites contain a large amount of product introduction information and user comment information. Collecting these data is the basis for e-commerce applications such as personalized recommendation, commodity marketing analysis, and sentiment analysis.

[0003] A web crawler is a program that automatically extracts web pages. It downloads web resources by traversal, and is also a common means of collecting and formulating web pages. Its working principle is as follows: the web crawler starts from one or more initially set URLs and obtains the corresponding web pages. According to relevance to the topic of interest, it filters out irrelevant URLs and puts relevant URLs into the ...
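The crawler loop described in paragraph [0003] can be sketched as below. The source text is truncated before it names where relevant URLs are placed, so the FIFO queue `frontier` is an assumption, as are the injected helpers `fetch_links` and `is_relevant` and the `limit` parameter. The `seen` set performs exact-match duplicate detection, which is the kind of approach whose storage and lookup costs grow with the number of downloaded URLs.

```python
from collections import deque


def crawl(seeds, fetch_links, is_relevant, limit=100):
    """Traverse pages starting from the seed URLs.

    fetch_links(url) -> iterable of URLs found on that page (hypothetical helper)
    is_relevant(url) -> bool, topic filter (hypothetical helper)
    """
    frontier = deque(seeds)   # assumed queue of URLs waiting to be downloaded
    seen = set(seeds)         # exact-match record of every URL already queued
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Drop irrelevant URLs and URLs that were already queued.
            if link not in seen and is_relevant(link):
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for real page downloads.
graph = {"a": ["b", "c", "x"], "b": ["a", "c"], "c": []}
order = crawl(["a"], lambda u: graph.get(u, []), lambda u: u != "x")
```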

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 冯亮, 尹亚伟, 费志军
Owner: CHINA UNIONPAY