A web crawler-based cascade crawling method and device for multi-level pages

A web crawler and page technology applied in the field of data crawling. It addresses problems such as difficult data access, task IDs that do not reflect hierarchical relationships, and complex association logic, with the effect of ensuring data integrity and accuracy.

Active Publication Date: 2022-04-01

厦门商集网络科技有限责任公司

Problems solved by technology

Because a crawler task ID only corresponds one-to-one with a crawler task, it does not reflect any hierarchical relationship, so the original data hierarchy cannot be restored from the crawler task ID.

When multi-level pages are associated with each other, existing crawler technology has difficulty verifying the integrity and accuracy of the captured multi-level data, because the association logic between the levels is complex.

At the same time, because the data is harder to access, the rules for using multi-level web data are more cumbersome.

Examples

Embodiment 1

[0030] As shown in Figure 1, a web crawler-based cascade crawling method for multi-level pages includes the following steps. Capture the upper-level page data and store it in the upper-level page data analysis table. In that table, set a primary key value for each object whose lower-level page still needs to be captured. The primary key value is unique, and the primary key values of the objects are all different. The primary key value identifies the upper-level page where the object is located and associates that page with the lower-level page. Click the URL link on the upper-level page, access the lower-level page through crawler simulation, capture the lower-level page data and store it in the lower-level page data analysis table, and set, in the lower-level page data analysis table, the foreign key value used to associate the upper-level pa...
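The steps in this embodiment can be illustrated with a minimal sketch. It is not the patented implementation: the table names `upper_page` and `lower_page`, the column names, the `FAKE_SITE` dictionary, and the `fetch` helper are all assumptions introduced for illustration, and the simulated `fetch` stands in for whatever page access a real crawler would perform. The sketch stores the upper-level objects first, then reads each object's primary key back from the upper-level table and writes it as the foreign key of every lower-level row.

```python
import sqlite3

# Hypothetical stand-in for a website: one upper-level listing page whose
# entries link to lower-level detail pages.
FAKE_SITE = {
    "/list": [("Item A", "/detail/a"), ("Item B", "/detail/b")],
    "/detail/a": ["A-row-1", "A-row-2"],
    "/detail/b": ["B-row-1"],
}


def fetch(url):
    """Simulated page access; a real crawler would request and parse the page."""
    return FAKE_SITE[url]


def cascade_crawl(conn, list_url):
    conn.executescript("""
        CREATE TABLE upper_page (
            id   INTEGER PRIMARY KEY,   -- unique primary key per object
            name TEXT,
            url  TEXT
        );
        CREATE TABLE lower_page (
            id       INTEGER PRIMARY KEY,
            upper_id INTEGER NOT NULL REFERENCES upper_page(id),  -- foreign key
            content  TEXT
        );
    """)

    # Step 1: capture the upper-level page and store each object that has a
    # lower-level page; SQLite assigns a distinct primary key to every row.
    for name, url in fetch(list_url):
        conn.execute(
            "INSERT INTO upper_page (name, url) VALUES (?, ?)", (name, url)
        )

    # Step 2: read each primary key back from the upper-level table, follow the
    # stored URL, and use the key as the foreign key of the lower-level rows.
    upper_rows = conn.execute(
        "SELECT id, url FROM upper_page ORDER BY id"
    ).fetchall()
    for upper_id, url in upper_rows:
        for content in fetch(url):
            conn.execute(
                "INSERT INTO lower_page (upper_id, content) VALUES (?, ?)",
                (upper_id, content),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
cascade_crawl(conn, "/list")
lower_rows = conn.execute(
    "SELECT upper_id, content FROM lower_page ORDER BY id"
).fetchall()
```

Because every lower-level row carries the key of the object it was reached from, the hierarchy is recorded at capture time and does not have to be inferred afterwards.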

Embodiment 2

[0042] A web crawler-based cascade crawling device for multi-level pages. The device includes a microprocessor and a memory; a program is stored in the memory, and the microprocessor runs the program to perform the following steps. Capture the upper-level page data and store it in the upper-level page data analysis table. In that table, set a primary key value for each object whose lower-level page still needs to be captured. The primary key value is unique. The primary key value identifies the upper-level page where the object is located and associates that page with the lower-level page. Click the URL link on the upper-level page, access the lower-level page through crawler simulation, capture the lower-level page data and store it in the lower-level page data analysis table, and set the foreign key value used to associate the upper-level page with the lower-level page dat...

Abstract

The invention relates to a web crawler-based cascade crawling method for multi-level pages, comprising the following steps: capture the upper-level pages and store the captured data in the upper-level page data analysis table; in that table, set a primary key value for each object whose lower-level page still needs to be captured, with a different primary key value for each object; capture the lower-level pages and store the captured data in the lower-level page data analysis table; set a foreign key value in the lower-level page data analysis table by obtaining, from the upper-level page data analysis table, the primary key value of the object corresponding to the lower-level page and using it as the foreign key value. This makes associated queries between upper-level and lower-level web pages possible after the captured data has been stored. The invention is a data collection mode that can restore the logical order of web pages: it ensures the integrity of web page capture, stores data in the order of the original web page hierarchy, and makes associated multi-level page data easy to retrieve.
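The associated query mentioned in the abstract can be sketched as a join on the primary key and foreign key pair. This is an illustration under stated assumptions, not the patent's own query: the table names `upper_page` and `lower_page`, their columns, the sample rows, and the helper names `restore_hierarchy` and `objects_missing_lower_data` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stored data: two upper-level objects and their lower-level rows.
conn.executescript("""
    CREATE TABLE upper_page (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lower_page (
        id       INTEGER PRIMARY KEY,
        upper_id INTEGER REFERENCES upper_page(id),
        content  TEXT
    );
    INSERT INTO upper_page (id, name) VALUES (1, 'Item A'), (2, 'Item B');
    INSERT INTO lower_page (upper_id, content) VALUES
        (1, 'A-row-1'), (1, 'A-row-2'), (2, 'B-row-1');
""")


def restore_hierarchy(conn):
    """Rebuild {upper-level object: [lower-level rows]} from the key pair."""
    query = """
        SELECT u.name, l.content
        FROM upper_page AS u
        LEFT JOIN lower_page AS l ON l.upper_id = u.id
        ORDER BY u.id, l.id
    """
    tree = {}
    for name, content in conn.execute(query):
        children = tree.setdefault(name, [])
        if content is not None:  # NULL means no lower-level row was stored
            children.append(content)
    return tree


def objects_missing_lower_data(conn):
    """List upper-level objects with no stored lower-level rows."""
    query = """
        SELECT u.name
        FROM upper_page AS u
        LEFT JOIN lower_page AS l ON l.upper_id = u.id
        WHERE l.id IS NULL
        ORDER BY u.id
    """
    return [name for (name,) in conn.execute(query)]


hierarchy = restore_hierarchy(conn)
missing = objects_missing_lower_data(conn)
```

The second helper shows one way such a key pair could support a completeness check: any upper-level object without matching lower-level rows is a candidate for a failed or skipped capture.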

Description

Technical field

[0001] The invention relates to a web crawler-based cascade crawling method and device for multi-level pages, and belongs to the field of data crawling.

Background

[0002] The existing method of crawling upper-level and lower-level pages is as follows: first capture the upper-level pages, then store the URL addresses found in the upper-level pages, repeatedly capture the lower-level pages according to these URL addresses, and finally identify and match the stored data through the crawler task. Each crawler task ID corresponds one-to-one with a crawler and with the data file that crawler writes out. When the crawler task ends and the data needs to be matched, the crawler task ID is used to parse the crawled data file into a structured file according to the logic of the original web page data. Because a crawler task ID only corresponds one-to-one with a crawler task, it does not reflect any hierarchical relationship; therefore, the origi...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F16/951, G06F16/953, G06F16/955

CPC: G06F16/951, G06F16/953, G06F16/955

Inventor: 邱涛, 丘水文, 陈昊, 陈耀才

Owner: 厦门商集网络科技有限责任公司
