Cascade crawling method and device for multi-level pages based on web crawlers

Web crawler and page technology, applied in the field of data crawling, that addresses three problems: data that is difficult to access, an original data hierarchy that cannot be restored, and task IDs that do not reflect hierarchical relationships. The stated effect is to ensure data integrity and accuracy.

Active Publication Date: 2019-11-19
厦门商集网络科技有限责任公司

AI Technical Summary

Problems solved by technology

A crawler task ID corresponds one-to-one with a crawler task and does not reflect any hierarchical relationship, so the original data hierarchy cannot be restored from the task ID alone.
When multi-level pages are associated with one another, existing crawler technology makes it difficult to verify the integrity and accuracy of the captured hierarchical data, because the association logic between levels is complex.
Data access is also more difficult, which makes multi-level web data more cumbersome to use.

Examples


Embodiment 1

[0030] As shown in Figure 1, a web crawler-based cascade crawling method for multi-level pages includes the following steps. Crawl the upper-level page data and store the captured data in the upper-level page data analysis table. In that table, set a primary key value for each object whose lower-level page must also be crawled; each primary key value is unique, so the primary key values of the objects are all different. The primary key value identifies the upper-level page on which the object is located and is used to associate the lower-level page. Follow the URL link on the upper-level page, access the lower-level page through crawler simulation, capture the lower-level page data, and store it in the lower-level page data analysis table; in the lower-level page data analysis table, set the foreign key value used to associate the upper-level pa...
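The steps of Embodiment 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the table names `upper_page` and `lower_page`, the `fetch` helper, and the in-memory `FAKE_SITE` dictionary that stands in for real web pages are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import sqlite3

# Hypothetical stand-in for real pages (no network access):
# the list page yields (object name, link) pairs; detail pages yield records.
FAKE_SITE = {
    "/list": [("Company A", "/detail/a"), ("Company B", "/detail/b")],
    "/detail/a": ["A record 1", "A record 2"],
    "/detail/b": ["B record 1"],
}


def fetch(url):
    """Simulate the crawler visiting a page and returning its parsed content."""
    return FAKE_SITE[url]


def cascade_crawl(conn, list_url):
    cur = conn.cursor()
    cur.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the foreign key
    cur.execute(
        "CREATE TABLE upper_page (id INTEGER PRIMARY KEY, name TEXT, url TEXT)"
    )
    cur.execute(
        "CREATE TABLE lower_page ("
        "id INTEGER PRIMARY KEY, "
        "upper_id INTEGER NOT NULL REFERENCES upper_page(id), "
        "content TEXT)"
    )
    # Step 1: crawl the upper-level page; every object that has a lower-level
    # page receives its own unique primary key (assigned by SQLite here).
    for name, url in fetch(list_url):
        cur.execute("INSERT INTO upper_page (name, url) VALUES (?, ?)", (name, url))
    # Step 2: follow each stored link and save the lower-level data, using the
    # parent object's primary key as the foreign key of each lower-level row.
    parents = cur.execute("SELECT id, url FROM upper_page ORDER BY id").fetchall()
    for upper_id, url in parents:
        for content in fetch(url):
            cur.execute(
                "INSERT INTO lower_page (upper_id, content) VALUES (?, ?)",
                (upper_id, content),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
cascade_crawl(conn, "/list")
rows = conn.execute("SELECT upper_id, content FROM lower_page ORDER BY id").fetchall()
print(rows)
```

Each row printed at the end carries the primary key of the upper-level object it was reached from, so the parent of any lower-level record can be looked up without relying on a crawler task ID.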

Embodiment 2

[0042] A web crawler-based cascade crawling device for multi-level pages includes a microprocessor and a memory. A program is stored in the memory, and the microprocessor runs the program to perform the following steps. Crawl the upper-level page data and store the captured data in the upper-level page data analysis table. In that table, set a primary key value for each object whose lower-level page must also be crawled; the primary key value is unique. The primary key value identifies the upper-level page on which the object is located and is used to associate the lower-level page. Follow the URL link on the upper-level page, access the lower-level page through crawler simulation, capture the lower-level page data, and store it in the lower-level page data analysis table; set the foreign key value used to associate the upper-level page with the lower-level page dat...


Abstract

The invention relates to a web crawler-based cascade crawling method for multi-level pages. The method comprises: crawling an upper-level page, storing the captured data in an upper-level page data analysis table, and setting a primary key value in that table for each object whose lower-level page must also be crawled, the primary key values of the objects all being different; crawling the lower-level page and storing the captured data in a lower-level page data analysis table; and setting a foreign key value in the lower-level page data analysis table by obtaining, from the upper-level page data analysis table, the primary key value of the object corresponding to the lower-level page and using it as the foreign key value, so that upper-level and lower-level pages can be queried together once the captured data has been persisted. The method provides a data acquisition mode that can restore the logical order of the web pages, ensures that page capture is complete, stores the data according to the original page hierarchy, and makes associated multi-level page data convenient to obtain.
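The associated query mentioned in the abstract can be illustrated with a join on the primary key and foreign key. The sketch below is an assumption-laden example rather than the patented procedure: the table and column names, the sample rows, and the `restore_hierarchy` helper are hypothetical, and the final completeness check (listing upper-level objects that have no lower-level rows) is one possible way to inspect capture integrity, not a step stated in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE upper_page (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE lower_page (
        id INTEGER PRIMARY KEY,
        upper_id INTEGER REFERENCES upper_page(id),
        content TEXT
    );
    -- Hypothetical persisted crawl results; Company C has no lower-level rows.
    INSERT INTO upper_page (id, name)
        VALUES (1, 'Company A'), (2, 'Company B'), (3, 'Company C');
    INSERT INTO lower_page (upper_id, content)
        VALUES (1, 'A record 1'), (1, 'A record 2'), (2, 'B record 1');
    """
)


def restore_hierarchy(conn):
    """Rebuild the page hierarchy: upper-level object -> its lower-level records."""
    query = """
        SELECT u.name, l.content
        FROM upper_page AS u
        LEFT JOIN lower_page AS l ON l.upper_id = u.id
        ORDER BY u.id, l.id
    """
    tree = {}
    for name, content in conn.execute(query):
        children = tree.setdefault(name, [])
        if content is not None:  # LEFT JOIN yields NULL when no child row exists
            children.append(content)
    return tree


tree = restore_hierarchy(conn)
# Upper-level objects without any lower-level data may point to an incomplete crawl.
missing = [name for name, children in tree.items() if not children]
print(tree)
print(missing)
```

A `LEFT JOIN` is used instead of an inner join so that upper-level objects with no captured lower-level data still appear in the result and can be flagged.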

Description

Technical field

[0001] The invention relates to a web crawler-based cascade crawling method and device for multi-level pages, and belongs to the field of data crawling.

Background

[0002] The existing method of crawling upper-level and lower-level pages is as follows: first crawl the upper-level pages, then store the URL addresses found in them, crawl the lower-level pages one by one from these URL addresses, and finally identify and match the persisted data by crawler task. The crawler task ID corresponds one-to-one with the crawler and with the data file that the crawler writes out; when the crawler task ends and the data must be matched, the crawler task ID is used to parse the crawled data file into a structured file according to the logic of the original web page data. Because the crawler task ID only corresponds one-to-one with the crawler task, it does not reflect the hierarchical relationship; therefore, the origi...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/953, G06F16/955
CPC: G06F16/951, G06F16/953, G06F16/955
Inventor: 邱涛, 丘水文, 陈昊, 陈耀才
Owner: 厦门商集网络科技有限责任公司