Method for crawling webpage contents with paging

A technique for crawling paginated webpage content, applied in the field of the JAVA platform, that solves the problem that the paginated parts of a webpage cannot be crawled directly.

Inactive Publication Date: 2018-09-21
ZHUHAI HENGQIN SHENGDA ZHAOYE TECH INVESTMENT CO LTD

AI Technical Summary

Problems solved by technology

[0003] The technical problem to be solved by the present invention is to provide a method for crawling webpage content with paging, which solves the problem that the paginated parts of a paged webpage cannot be crawled directly.



Embodiment Construction

[0016] As shown in Figure 1, the present invention adopts the following steps:

[0017] Step 1. Check whether the URL of the current page to be crawled contains a page number. If not, use the browser's developer tools to analyze the page, find its query parameters and the requested URL, and splice them into a URL that contains a page number:

[0018] 1) Open the webpage to be crawled in a mainstream browser such as 360 Browser or Google Chrome;

[0019] 2) Open the developer tools;

[0020] 3) Find the Headers sub-tab in the Network tab;

[0021] 4) Find the requested main URL and request method in General;

[0022] 5) Obtain the contents of the Request Headers section;

[0023] 6) Find the parameters listed under Query String Parameters and assemble them with the main URL found above to generate a URL with a page number;
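The splicing in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's own code: the class name `PagedUrlBuilder`, the method `build`, the example host, and the parameter names `keyword` and `page` are all assumptions made for the example. It requires Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: splices a paged URL from values copied out of the developer tools. */
final class PagedUrlBuilder {

    private PagedUrlBuilder() {}

    /**
     * @param mainUrl     the request URL shown under "General" (assumed to carry no fragment)
     * @param queryParams the entries shown under "Query String Parameters"
     * @param pageParam   name of the parameter that carries the page number (an assumption)
     * @param pageNumber  the page number to put into the URL
     */
    static String build(String mainUrl, Map<String, String> queryParams,
                        String pageParam, int pageNumber) {
        // Copy so the caller's map is not modified, then set or overwrite the page parameter.
        Map<String, String> params = new LinkedHashMap<>(queryParams);
        params.put(pageParam, String.valueOf(pageNumber));

        // Join the encoded key=value pairs with '&'.
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }

        // Use '&' if the main URL already has a query string, '?' otherwise.
        String separator = mainUrl.contains("?") ? "&" : "?";
        return mainUrl + separator + query;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, calling `PagedUrlBuilder.build("https://example.com/list", params, "page", 3)` with `params` holding `keyword=news` yields `https://example.com/list?keyword=news&page=3`.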

[0024] Step 2. Use a network tool to load this URL and obtain the HTML content;

[0025] // 1) Initialize the network tool according to the request header information

[0026...
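The source text is cut off here, so the patent's own network tool is not visible. As a hedged sketch of what Step 2 describes, the following uses the JDK's `java.net.http.HttpClient` (Java 11 or later) as a stand-in network tool. The class name `PageLoader` and its methods are assumptions for illustration. Headers copied from the developer tools often include names the JDK client refuses to set (for example `Host`, or HTTP/2 pseudo-headers such as `:authority`), so the sketch skips any header the builder rejects.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

/** Hypothetical stand-in for the "network tool", built on the JDK HTTP client. */
final class PageLoader {

    private PageLoader() {}

    /** Builds a GET request carrying the request headers copied from the developer tools. */
    static HttpRequest buildRequest(String url, Map<String, String> requestHeaders) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        for (Map.Entry<String, String> header : requestHeaders.entrySet()) {
            try {
                builder.header(header.getKey(), header.getValue());
            } catch (IllegalArgumentException rejected) {
                // The JDK client rejects restricted or syntactically invalid headers;
                // this sketch simply leaves them out.
            }
        }
        return builder.build();
    }

    /** Sends the request and returns the response body as a string (the HTML content). */
    static String load(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A caller would typically create one client with `HttpClient.newHttpClient()` and reuse it for every page, passing each paged URL through `buildRequest` and `load` in turn.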


Abstract

The invention relates to the technical field of JAVA platforms, and in particular to a method for crawling webpage content with paging. The method includes the following steps: first, check whether the URL of the current page to be crawled contains a page number; if not, use a developer tool to find the query parameters and the requested URL, and splice them into a URL with a page number. Next, use a network tool to load the URL and obtain the HTML content, and use a crawler tool to extract information such as the total number of pages and the current page number. Then loop with the current page number as the starting value and the total number of pages as the end value, replacing the page number in the URL with the loop variable on each iteration to generate the URL of every page. Finally, use the network tool to load the URL of each page, use the crawler tool to extract the required content, and save the obtained data to a database. The method solves the problem that the pages of a paginated webpage that are not currently displayed cannot be crawled directly.
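The loop described in the abstract can be sketched as below. The patent does not name its crawler tool in the visible text, so a regular expression stands in for it here, and the markup it looks for (`class="total-pages"`) is purely hypothetical. The `{page}` placeholder, the class name `PageLoop`, and both method names are likewise assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the page loop: total pages as end value, current page as start value. */
final class PageLoop {

    // Assumed markup, e.g. <span class="total-pages">12</span>; real sites will differ.
    private static final Pattern TOTAL_PAGES =
            Pattern.compile("class=\"total-pages\"[^>]*>\\s*(\\d+)");

    private PageLoop() {}

    /** Stand-in for the crawler tool: pulls the total page count out of the HTML. */
    static int extractTotalPages(String html) {
        Matcher matcher = TOTAL_PAGES.matcher(html);
        if (!matcher.find()) {
            throw new IllegalArgumentException("total page count not found in HTML");
        }
        return Integer.parseInt(matcher.group(1));
    }

    /**
     * Generates one URL per page by substituting the loop variable for the
     * "{page}" placeholder in the template (the placeholder syntax is an assumption).
     */
    static List<String> pageUrls(String urlTemplate, int currentPage, int totalPages) {
        List<String> urls = new ArrayList<>();
        for (int page = currentPage; page <= totalPages; page++) {
            urls.add(urlTemplate.replace("{page}", String.valueOf(page)));
        }
        return urls;
    }
}
```

Each generated URL would then be loaded and parsed in turn, with the extracted records written to a database; those parts depend on the target site and storage schema and are left out of this sketch.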

Description

Technical field

[0001] The invention relates to the technical field of the JAVA platform, in particular to a method for crawling webpage content with paging.

Background

[0002] When crawling information from webpages, much of the content to be crawled is often paginated. Only the data of the page currently being viewed can be captured; the data on the other pages is loaded only after the pagination button is clicked. If the content spans tens of thousands of pages, manually clicking the button to load each page for crawling is impractical. To solve this problem, a function is needed that simulates clicking the pagination button and obtains the URLs of all pages, so that content that has not yet been loaded can be captured.

Contents of the invention

[0003] The technical problem solved by the present invention is to provide a method for crawling webpage content with paging; it so...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 陈林, 张来卿, 庞严冬
Owner: ZHUHAI HENGQIN SHENGDA ZHAOYE TECH INVESTMENT CO LTD