Webpage data crawling method and device

A webpage data crawling technology, applied in the field of Internet technology applications, that addresses the problems of high data request volume and high network resource consumption. By reducing repeated requests, it lowers the data request volume and the network resources consumed.

Active Publication Date: 2019-07-16
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0007] Embodiments of the present invention provide a method and device for crawling webpage data, so as to at least solve the technical problem in the related art that webpage crawlers issue a high volume of data requests when crawling webpages, which leads to high consumption of network resources.



Examples


Embodiment 1

[0043] According to an embodiment of the present invention, a method for crawling web page data is provided. It should be noted that the steps shown in the flow charts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in a different order.

[0044] In the embodiment of the present invention, a proxy server is added between the web crawler and the remote website. Every web page data crawling request that the web crawler sends for a network resource passes through the proxy server. The proxy server includes a caching mechanism for which preset rules can be formulated; network resources that meet the preset rules are saved in the cache after they have been successfully acquired.
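The arrangement in paragraph [0044] can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `RuleCache`, `CachingProxy`, and `fetch` are hypothetical, the preset rule is modelled as a plain predicate on the URL, and "successful acquisition" is assumed to mean an HTTP 200 status.

```python
from typing import Callable, Dict, Optional, Tuple


class RuleCache:
    """In-memory cache that stores a response only if it meets a preset rule."""

    def __init__(self, rule: Callable[[str], bool]) -> None:
        self._rule = rule                    # preset rule (assumed: predicate on the URL)
        self._store: Dict[str, bytes] = {}

    def get(self, url: str) -> Optional[bytes]:
        return self._store.get(url)

    def save_if_allowed(self, url: str, status: int, body: bytes) -> bool:
        # Save only after a successful acquisition (assumed: status 200)
        # of a resource that meets the preset rule.
        if status == 200 and self._rule(url):
            self._store[url] = body
            return True
        return False


class CachingProxy:
    """Sits between the crawler and the remote website."""

    def __init__(self, fetch: Callable[[str], Tuple[int, bytes]], cache: RuleCache) -> None:
        self._fetch = fetch                  # hypothetical function that contacts the remote site
        self._cache = cache

    def request(self, url: str) -> bytes:
        cached = self._cache.get(url)
        if cached is not None:
            return cached                    # repeated request served without a remote call
        status, body = self._fetch(url)
        self._cache.save_if_allowed(url, status, body)
        return body
```

With a rule such as `lambda url: url.endswith(".json")`, a second request for the same `.json` URL would be answered from the cache, while other resources would still be fetched from the remote website each time.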

[0045] Figure 3 is a schematic flow chart of a method for crawling webpage data acco...

Embodiment 2

[0082] Figure 6 is a schematic flow diagram of another method for crawling webpage data according to an embodiment of the present invention. As shown in Figure 6, on the remote website side, the method includes the following steps:

[0083] Step S602, receiving a webpage data crawling request forwarded by the proxy server;

[0084] Step S604, extracting corresponding data according to the web page data crawling request;

[0085] Step S606, returning the data to the proxy server.
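Steps S602 to S606 can be sketched from the remote website's point of view as below. This is only an illustration under stated assumptions: `SITE_DATA` and `handle_forwarded_request` are hypothetical names, the site content is a hard-coded dictionary keyed by URL path, and a missing resource is assumed to produce a 404 status.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical site content, keyed by URL path (not from the source).
SITE_DATA = {
    "/index.html": b"<html>home</html>",
    "/list.json": b'{"items": []}',
}


def handle_forwarded_request(url: str) -> Tuple[int, bytes]:
    # Step S602: receive the crawling request forwarded by the proxy server.
    path = urlparse(url).path
    # Step S604: extract the data that corresponds to the request.
    body = SITE_DATA.get(path)
    # Step S606: return the data (with a status code) to the proxy server.
    if body is None:
        return 404, b""
    return 200, body
```

Because the proxy server only forwards requests it cannot answer from its cache, a handler like this would be invoked less often for cacheable resources.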

[0086] In the webpage data crawling method provided by the embodiment of the present application, the webpage data crawling request forwarded by the proxy server is received, the corresponding data is extracted according to the webpage data crawling request, and the data is returned to the proxy server. This achieves the purpose of reducing repeated HyperText Transfer Protocol (HTTP) requests, thereby realizing the technical effect of reducing network resource consumption, and then solve the technical proble...

Embodiment 3

[0092] Figure 7 is a schematic diagram of a device for crawling web page data according to an embodiment of the present invention. As shown in Figure 7, on the proxy server side, the device includes:

[0093] The parsing module 72 is used to parse the received webpage data crawling request to obtain the requested resource type; the first judging module 74 is used to judge whether the requested resource type is the same as the preset cached requested resource type; the sending module 76 is used to send the webpage data crawling request to the remote website when the judgment result is that the types are different; the second judging module 78 is used, when the judgment result is that the types are the same, to judge whether data corresponding to the webpage data crawling request exists in the prestored data, and to perform corresponding operations according to that judgment result.
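The cooperation of modules 72 to 78 can be sketched as a single dispatch function. Several details here are assumptions rather than statements from the source: the resource type is taken to be the lowercase file extension of the URL path, the preset cached types are a set, and the "corresponding operations" are assumed to be returning the prestored data on a hit and fetching then storing on a miss. All function and parameter names are hypothetical.

```python
import posixpath
from typing import Callable, Dict, Set
from urllib.parse import urlparse


def parse_resource_type(url: str) -> str:
    """Parsing module 72 (assumed: the type is the lowercase file extension)."""
    extension = posixpath.splitext(urlparse(url).path)[1]
    return extension.lower().lstrip(".")


def crawl_via_proxy(
    url: str,
    cached_types: Set[str],
    prestored: Dict[str, bytes],
    send_to_remote: Callable[[str], bytes],
) -> bytes:
    resource_type = parse_resource_type(url)

    # First judging module 74: is this one of the preset cached resource types?
    if resource_type not in cached_types:
        # Sending module 76: different type, so go straight to the remote website.
        return send_to_remote(url)

    # Second judging module 78: is there prestored data for this request?
    if url in prestored:
        return prestored[url]          # assumed operation on a hit

    body = send_to_remote(url)         # assumed operation on a miss
    prestored[url] = body
    return body
```

Keying the prestored data by the full URL is the simplest choice for a sketch; a real proxy would also need to consider the request method, headers, and expiry.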

[0094] In the webpage data crawling device provided in the embodiment of the present application, the reques...


Abstract

The invention discloses a webpage data crawling method and device. The method comprises the following steps: parsing a received webpage data crawling request to obtain a requested resource type; judging whether the requested resource type is the same as the requested resource type of a preset cache; when the judgment result is that they are different, sending the webpage data crawling request to a remote website; and when the judgment result is that they are the same, judging whether data corresponding to the webpage data crawling request exists in the pre-stored data, and executing a corresponding operation according to that judgment result. The method and the device solve the technical problem in the related art of high network resource consumption caused by the high volume of data requests that webpage crawlers issue when crawling webpages.

Description

Technical field

[0001] The present invention relates to the field of Internet technology applications, in particular to a method and device for crawling web page data.

Background technique

[0002] A web crawler is a program that automatically extracts web pages. Figure 1 is a schematic diagram of the composition structure of an existing webpage. As shown in Figure 1, a webpage is located by its Uniform Resource Locator (URL) address. The general format of a URL is as follows: protocol type://server address (plus a port number if necessary)/path/file name. One URL can only correspond to one web page. The web crawler can obtain the content of a web page by specifying its URL address and sending a HyperText Transfer Protocol (HTTP) request.

[0003] In general, web crawlers only crawl HyperText Markup Language (HTML) type webpages, but in some cases, such as the page turning operation of some webpages, or some webpages ...
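The general URL format given in the background (protocol type, server address, optional port number, path, file name) can be illustrated with Python's standard `urllib.parse.urlsplit`. The helper name `split_url` and the example URLs are hypothetical; splitting the path at its last slash is simply one way to separate the directory path from the file name.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlsplit


def split_url(url: str) -> Dict[str, Optional[Union[str, int]]]:
    """Break a URL into the parts named in the general URL format."""
    parts = urlsplit(url)
    # Separate the directory path from the file name at the last "/".
    directory, _, file_name = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "server": parts.hostname,
        "port": parts.port,            # None when the URL gives no explicit port
        "path": directory or "/",
        "file_name": file_name,
    }


info = split_url("http://example.com:8080/news/2019/index.html")
print(info["protocol"], info["server"], info["port"], info["path"], info["file_name"])
```

A crawler or proxy could use the file name component, for instance, to decide what type of resource a request refers to.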


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/953, G06F16/958
CPC: G06F16/972, G06F16/951
Inventor: 曹志明
Owner: BEIJING GRIDSUM TECH CO LTD