
A web crawler-based cascade crawling method and device for multi-level pages

A web crawler and multi-level page technology, applied in the field of data crawling, that addresses problems such as difficult data access, task IDs that do not reflect hierarchical relationships, and complex association logic, with the effect of ensuring data integrity and accuracy.

Active Publication Date: 2022-04-01
厦门商集网络科技有限责任公司

AI Technical Summary

Problems solved by technology

Because a crawler task ID only corresponds one-to-one with a crawler task, it does not reflect hierarchical relationships, so the original data hierarchy cannot be restored from the task ID.
When multi-level pages are associated with one another, the association logic between levels is complex, which makes it difficult for existing crawler technology to verify the integrity and accuracy of the multi-level data it captures.
In addition, because the data is harder to access, the rules for using multi-level web data are more cumbersome.



Examples


Embodiment 1

[0030] As shown in Figure 1, a web crawler-based cascade crawling method for multi-level pages includes the following steps. Capture the upper-level page data and store the captured data in the upper-level page data parsing table. In that table, set a primary key value for each object whose lower-level page still needs to be crawled. The primary key value is unique: no two objects share the same value. The primary key value identifies the upper-level page on which the object is located and is used to associate that page with the lower-level page. The crawler then simulates clicking the URL link on the upper-level page to access the lower-level page, captures the lower-level page data, stores the captured data in the lower-level page data parsing table, and sets, in the lower-level page data parsing table, the foreign key value used to associate the upper-level pa...
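The steps of Embodiment 1 can be sketched with two relational tables. This is a minimal illustration under stated assumptions, not the patent's implementation: the table names `upper_page` and `lower_page`, their columns, the in-memory SQLite database, and the `fake_fetch_lower` stub that stands in for the simulated crawler visit are all hypothetical.

```python
import sqlite3


def create_tables(conn):
    """Create the two hypothetical parsing tables."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE upper_page ("
        " id INTEGER PRIMARY KEY,"  # unique primary key for each object
        " title TEXT,"
        " url TEXT)"
    )
    conn.execute(
        "CREATE TABLE lower_page ("
        " id INTEGER PRIMARY KEY,"
        # Foreign key: holds the primary key of the parent upper-level row.
        " upper_id INTEGER NOT NULL REFERENCES upper_page(id),"
        " content TEXT)"
    )


def cascade_crawl(conn, upper_items, fetch_lower):
    """Store upper-level objects, then their lower-level data, linked by key."""
    for title, url in upper_items:
        cur = conn.execute(
            "INSERT INTO upper_page (title, url) VALUES (?, ?)", (title, url)
        )
        upper_id = cur.lastrowid  # primary key assigned to this object
        # fetch_lower stands in for the crawler following the URL link.
        for content in fetch_lower(url):
            conn.execute(
                "INSERT INTO lower_page (upper_id, content) VALUES (?, ?)",
                (upper_id, content),
            )
    conn.commit()


# Hypothetical stand-in for real lower-level pages; no network access.
FAKE_SITE = {
    "https://example.com/a": ["a-detail-1", "a-detail-2"],
    "https://example.com/b": ["b-detail-1"],
}


def fake_fetch_lower(url):
    return FAKE_SITE.get(url, [])


conn = sqlite3.connect(":memory:")
create_tables(conn)
cascade_crawl(
    conn,
    [("A", "https://example.com/a"), ("B", "https://example.com/b")],
    fake_fetch_lower,
)
```

Because every lower-level row carries the primary key of its parent, the hierarchy is recorded when the data lands, rather than being reconstructed afterwards from a task ID.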

Embodiment 2

[0042] A web crawler-based cascade crawling device for multi-level pages. The device includes a microprocessor and a memory on which a program is stored; the microprocessor runs the program to perform the following steps. Capture the upper-level page data and store the captured data in the upper-level page data parsing table. In that table, set a primary key value for each object whose lower-level page still needs to be crawled. The primary key value is unique. The primary key value identifies the upper-level page on which the object is located and is used to associate that page with the lower-level page. The crawler then simulates clicking the URL link on the upper-level page to access the lower-level page, captures the lower-level page data, stores the captured data in the lower-level page data parsing table, and sets the foreign key value used to associate the upper-level page with the lower-level page dat...



Abstract

The invention relates to a web crawler-based cascade crawling method for multi-level pages, comprising the following steps: capture the upper-level pages and store the captured data in the upper-level page data parsing table; in that table, set a primary key value for each object whose lower-level page still needs to be crawled, with a different primary key value for each object; capture the lower-level pages and store the captured data in the lower-level page data parsing table; set a foreign key value in the lower-level page data parsing table by obtaining, from the upper-level page data parsing table, the primary key value of the object that corresponds to the lower-level page and using it as the foreign key value. This allows the upper-level and lower-level web pages to be queried together after the captured data has landed. The invention is a data collection mode that can restore the logical order of the web pages: it ensures that page capture is complete, stores data according to the original web page hierarchy, and makes associated multi-level page data convenient to retrieve.
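The associated query described in the abstract can be illustrated with a join on the key pair. The sketch below is an assumption-laden example rather than the patented procedure: the `upper_page` and `lower_page` tables, the sample rows, and the helper names `restore_hierarchy` and `find_orphans` are hypothetical, and the orphan check is only one possible way to test integrity.

```python
import sqlite3


def restore_hierarchy(conn):
    """Rebuild the upper -> lower page hierarchy from the landed tables."""
    query = (
        "SELECT u.id, u.title, l.content "
        "FROM upper_page AS u "
        "LEFT JOIN lower_page AS l ON l.upper_id = u.id "
        "ORDER BY u.id, l.id"
    )
    tree = {}
    for upper_id, title, content in conn.execute(query):
        children = tree.setdefault((upper_id, title), [])
        if content is not None:  # upper rows with no lower rows keep []
            children.append(content)
    return tree


def find_orphans(conn):
    """Return ids of lower-level rows whose foreign key matches no upper row."""
    query = (
        "SELECT l.id FROM lower_page AS l "
        "LEFT JOIN upper_page AS u ON u.id = l.upper_id "
        "WHERE u.id IS NULL"
    )
    return [row[0] for row in conn.execute(query)]


# Hypothetical landed data, including one deliberately broken association.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE upper_page (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE lower_page (
        id INTEGER PRIMARY KEY, upper_id INTEGER, content TEXT
    );
    INSERT INTO upper_page VALUES (1, 'List item A'), (2, 'List item B');
    INSERT INTO lower_page VALUES
        (10, 1, 'detail A1'), (11, 1, 'detail A2'), (12, 99, 'stray detail');
    """
)
tree = restore_hierarchy(conn)
orphans = find_orphans(conn)
```

Here `tree` groups each lower-level record under its upper-level object, and `orphans` lists lower-level rows that point at no upper-level object, which is the kind of gap a flat task ID cannot reveal.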

Description

Technical field

[0001] The invention relates to a web crawler-based cascade crawling method and device for multi-level pages, and belongs to the field of data crawling.

Background

[0002] The existing method of crawling upper-level and lower-level pages is as follows: first crawl the upper-level pages, store the URL addresses found in them, repeatedly crawl the lower-level pages at those URL addresses, and finally identify and match the landed data by crawler task. Each crawler task ID corresponds one-to-one with a crawler and with the data landing file captured by that crawler. When the crawler task ends and the data needs to be matched, the crawler task ID is used to parse the crawled data file into a structured file according to the logic of the original web page data. Because a crawler task ID only corresponds one-to-one with a crawler task, it does not reflect hierarchical relationships; therefore, the origi...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/953, G06F16/955
CPC: G06F16/951, G06F16/953, G06F16/955
Inventor: 邱涛, 丘水文, 陈昊, 陈耀才
Owner: 厦门商集网络科技有限责任公司