A Method of Webpage Incremental Crawling for Effective Link Acquisition

A web page incremental crawling technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems that reduce crawling efficiency, and achieves the effects of avoiding repeated crawling, avoiding the crawling of invalid web pages, and increasing the crawling frequency.

Active Publication Date: 2018-04-03
NANJING UNIV

AI Technical Summary

Problems solved by technology

As the number of crawled pages grows, checking whether a URL is already in the set of crawled URLs seriously degrades crawling efficiency.




Embodiment Construction

[0046] To aid understanding of the technical content of the present invention, the invention is described in detail below with reference to the accompanying drawings.

[0047] Figure 1 is a flow chart of an incremental web crawling method for effective link acquisition according to an embodiment of the present invention. The method includes two stages: an effective link acquisition stage and an incremental crawling stage.

[0048] Step 0 is the initial state of the present invention.

[0049] In the effective link acquisition stage (steps 1-4), step 1 initializes the crawl with the entry URL link; the crawler then crawls layer by layer starting from that link.

[0050] Step 2 determines whether the entry URL web page is paginated by checking whether the text of any URL link in the page matches a paging marker such as "next page".
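The pagination check in step 2 could be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `AnchorCollector` class, the `find_paging_links` function, and the `PAGING_MARKERS` tuple are hypothetical names, and the only marker taken from the text is "next page". It uses Python's standard `html.parser` module and matches markers case-insensitively against the anchor text.

```python
from html.parser import HTMLParser

# Hypothetical marker list; the text only names "next page" as an example.
PAGING_MARKERS = ("next page",)


class AnchorCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_paging_links(html):
    """Return the hrefs whose link text contains a paging marker."""
    parser = AnchorCollector()
    parser.feed(html)
    return [
        href
        for href, text in parser.anchors
        if any(marker in text.lower() for marker in PAGING_MARKERS)
    ]
```

An empty result would mean the entry page is treated as having no pagination under this sketch.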

[0051] Step 3 finds the common (public) links by comparing the links in the entry URL web page with the links in its paging...
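One way to picture the comparison in step 3 is as a set intersection. The sketch below rests on an assumption the excerpt does not spell out: links that appear on both the entry page and its pagination page (menus, footers, advertisements) are treated as invalid, and the remaining links on the entry page are treated as valid. The function names `absolute_links` and `split_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve raw href values against the page URL so they can be compared."""
    return {urljoin(base_url, href) for href in hrefs}


def split_links(entry_url, entry_hrefs, paging_url, paging_hrefs):
    """Return (common, valid) link sets for an entry page and one pagination page."""
    entry = absolute_links(entry_url, entry_hrefs)
    paging = absolute_links(paging_url, paging_hrefs)
    common = entry & paging
    # ASSUMPTION: common links are page furniture, so valid = entry minus common.
    valid = entry - common
    return common, valid


common, valid = split_links(
    "http://example.com/list",
    ["/about", "/item/1", "/item/2", "/list?page=2"],
    "http://example.com/list?page=2",
    ["/about", "/item/3", "/list?page=2"],
)
```

In this example the "about" and paging links are shared by both pages, so only the two item links on the entry page remain as candidates for crawling.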


Abstract

The web page incremental crawling method for effective link acquisition comprises the following steps. 1) Effective link acquisition stage: (a) initialize the web page links to be crawled and specify the entry URL; (b) determine whether the entry web page is paginated; (c) compute the common (public) links of the entry web page and its pagination pages; (d) obtain the valid links by means of the common links; (e) end. 2) Incremental crawling stage: (a) construct a Bloom filter and use it to determine whether each valid web page link from step 1)-(d) has already been crawled; (b) crawl the not-yet-crawled web pages via HTTP requests; (c) end. The invention obtains valid web page links by filtering out invalid links, which avoids crawling invalid web pages. It also builds a Bloom filter to maintain the set of crawled links and uses random hashing to determine whether a web page has already been crawled, thereby implementing incremental crawling.
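The incremental crawling stage can be illustrated with a small Bloom filter. This is a sketch under stated assumptions rather than the patent's implementation: the bit-array size, the number of hash functions, and the use of salted SHA-256 digests as the hash family are illustrative choices, and `BloomFilter`, `incremental_crawl`, and the `fetch` callback (a stand-in for the HTTP request) are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (may report false positives)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # ASSUMPTION: salted SHA-256 digests stand in for the hash family.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:".encode("utf-8") + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def incremental_crawl(valid_links, seen, fetch):
    """Fetch only links the filter has not seen; return the links fetched."""
    fetched = []
    for url in valid_links:
        if url in seen:
            continue  # already crawled (or, rarely, a false positive)
        fetch(url)
        seen.add(url)
        fetched.append(url)
    return fetched
```

A membership test touches a fixed number of bits regardless of how many URLs have been recorded, which is why it does not slow down as the crawled set grows. The trade-off is that a Bloom filter can occasionally report an uncrawled URL as crawled, so such a page would be skipped.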

Description

Technical field

[0001] The invention relates to a method for acquiring valid web page links based on invalid link filtering, and to a technique for incrementally crawling the acquired valid web page links.

Background technique

[0002] The rapid development of Internet technology and the rapid spread of smart mobile terminals have led to explosive growth in information, and have brought new challenges for crawling the required information from the Internet quickly and efficiently.

[0003] Traditional web page crawling usually uses depth-first or breadth-first traversal: it starts from a specified web page link, crawls information layer by layer, and extracts the URL links in each layer as the starting links for the next layer. Not all links in a web page point to useful information (for example, menus, advertisements, and footers), so if invalid links cannot be effectively filtered, a large amount of invalid info...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 张雷刘有力资帅韩军华冯瀚洋谢俊元
Owner: NANJING UNIV