A Method of Webpage Incremental Crawling for Effective Link Acquisition

An incremental web page crawling technique, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems that affect crawling efficiency and achieves the effects of avoiding repeated crawling, avoiding the crawling of invalid web pages, and increasing the frequency of crawling.

Active Publication Date: 2018-04-03

NANJING UNIV


Problems solved by technology

As the number of crawled pages grows, checking whether a URL is already in the set of crawled URLs seriously degrades crawling efficiency.

Embodiment Construction

[0046] In order to better understand the technical content of the present invention, the present invention will be described in detail below in conjunction with the accompanying drawings.

[0047] Figure 1 is a flow chart of an incremental web crawling method for effective link acquisition according to an embodiment of the present invention. The method includes two stages: an effective link acquisition stage and an incremental crawling stage.

[0048] Step 0 is the initial state of the present invention;

[0049] In the effective link acquisition stage (steps 1-4), step 1 initializes crawling at the entry URL link, and the crawler proceeds layer by layer from there;

[0050] Step 2 determines whether the entry URL web page is paginated by checking whether the text of any URL link on the page matches a pagination marker such as "next page";
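The pagination check in step 2 could be sketched roughly as follows. This is an illustration only, not the patent's implementation: the names `AnchorCollector`, `find_pagination_links`, and `PAGINATION_MARKERS` are hypothetical, and the marker list contains only the one example the text names.

```python
from html.parser import HTMLParser

# Assumed marker list; the text only names "next page" as an example.
PAGINATION_MARKERS = ("next page",)


class AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pagination_links(html):
    """Return the hrefs whose link text contains a pagination marker."""
    collector = AnchorCollector()
    collector.feed(html)
    return [
        href
        for href, text in collector.anchors
        if any(marker in text.lower() for marker in PAGINATION_MARKERS)
    ]
```

A page would be treated as paginated when `find_pagination_links` returns a non-empty list. A real crawler would likely need more markers and would have to resolve relative hrefs against the page URL.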

[0051] Step 3 finds the common links by comparing the links in the entry URL web page with the links in its paginated pages...
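A minimal sketch of the comparison in step 3, under a stated assumption: the text here is truncated, so the rule that valid links are the ones outside the common set is inferred from the description of menus, advertisements, and footers as invalid links, and is not confirmed by the source. The function names are hypothetical.

```python
def common_links(entry_links, paging_links):
    """Links that appear both on the entry page and on a paginated page."""
    return set(entry_links) & set(paging_links)


def valid_links(page_links, common):
    """Assumption: links outside the common set are the valid ones.

    Keeps the original page order and drops duplicates.
    """
    seen = set()
    result = []
    for link in page_links:
        if link not in common and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

For example, if the entry page links to `/menu`, `/a1`, `/a2`, and `/footer`, and its second page links to `/menu`, `/a3`, and `/footer`, then the common links are `/menu` and `/footer`, and the entry page's remaining links are `/a1` and `/a2`.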


Abstract

The web page incremental crawling method for effective link acquisition includes the following steps:

1) Effective link acquisition stage: a) initialize the web page links to be crawled and specify the entry URL; b) determine whether the entry web page is paginated; c) compute the common links of the entry web page and its paginated pages; d) obtain the valid links through the common links; e) end.

2) Incremental crawling stage: a) construct a Bloom filter and use it to determine whether each valid web page link from step 1)-d has already been crawled; b) fetch the web pages not yet crawled by HTTP request; c) end.

The present invention filters out invalid links to obtain valid web page links, builds a Bloom filter to maintain the set of crawled links, and implements incremental crawling by using random hashing to determine whether a web page has already been crawled. Filtering out invalid links avoids crawling invalid web pages.
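The incremental crawling stage could look roughly like the sketch below. It is not the patent's implementation: the `BloomFilter` class, its default sizes, the use of salted SHA-256 as the hash family, and the `fetch` callback (standing in for the HTTP request) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash positions per item (sizes assumed)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def crawl_incrementally(links, seen, fetch):
    """Fetch only links the filter has not seen; return the newly fetched ones."""
    fetched = []
    for link in links:
        if link in seen:
            continue          # already crawled (or a false positive)
        fetch(link)           # hypothetical callback that issues the HTTP request
        seen.add(link)
        fetched.append(link)
    return fetched
```

A Bloom filter never reports a crawled link as new, but it can occasionally report a new link as already crawled, so a small fraction of pages may be skipped. In exchange, the membership check costs a fixed number of hash computations regardless of how many URLs have been crawled.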

Description

Technical field:

[0001] The invention relates to a method for acquiring valid web page links based on invalid link filtering, and to a technique for incrementally crawling the acquired valid web page links.

Background technique:

[0002] The rapid development of Internet technology and the rapid spread of smart mobile terminals have led to explosive growth of information, and have also brought new challenges for quickly and efficiently crawling the required information from the Internet.

[0003] Traditional web page information crawling usually uses depth-first or breadth-first traversal: it starts from a specified web page link, crawls information layer by layer, and extracts the URL links in each layer as the starting links for the next layer of crawling. The links in a web page do not all point to useful information (for example, menus, advertisements, and footers), so if invalid links cannot be effectively filtered, a large amount of invalid info...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F17/30

Inventor: 张雷, 刘有力, 资帅, 韩军华, 冯瀚洋, 谢俊元

Owner: NANJING UNIV
