A method and device for crawling network data

A network data crawling technology applied in the Internet field. It addresses the inability to crawl network data and the low efficiency of crawling servers when crawling network data, and achieves the effect of improving crawling efficiency.

Active Publication Date: 2018-11-23
TENCENT TECH (SHENZHEN) CO LTD
Cites: 4, Cited by: 0

AI Technical Summary

Problems solved by technology

[0005] Web servers usually set an upper limit on access frequency. Because the crawling server sends data requests to web servers in the order in which the URLs were obtained, a large number of data requests are often sent to a single web server within a short period of time. If the request frequency exceeds the access frequency limit of the website, the web server blocks the IP (Internet Protocol) address of the crawling server. The crawling server then cannot crawl network data from that web server for a certain period of time, which lowers its efficiency in crawling network data.
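The burst effect described in [0005] can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `peak_requests_per_domain`, the fixed sending interval, and the `a.example` / `b.example` hosts are hypothetical and do not come from the patent.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def peak_requests_per_domain(urls, send_interval, window):
    """Simulate sending `urls` in the given order, one every `send_interval`
    seconds (a simplifying assumption), and return for each domain the
    largest number of requests sent within any span shorter than `window`
    seconds."""
    recent = defaultdict(deque)  # domain -> send times still inside the window
    peak = defaultdict(int)      # domain -> largest burst seen so far
    for i, url in enumerate(urls):
        t = i * send_interval
        domain = urlsplit(url).hostname
        times = recent[domain]
        times.append(t)
        # Drop send times that are `window` seconds old or older.
        while t - times[0] >= window:
            times.popleft()
        peak[domain] = max(peak[domain], len(times))
    return dict(peak)
```

Feeding the same URLs in arrival order and then interleaved by domain shows the difference: URLs grouped by domain produce larger per-domain peaks than URLs alternated across domains, and a large peak is what risks exceeding a site's access frequency limit.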

Examples

Embodiment 1

[0022] An embodiment of the present invention provides a method for crawling network data. As shown in Figure 1, the processing flow of the method may include the following steps:

[0023] Step 101: according to a preset polling sequence, select domain names to be crawled one by one from a pre-stored domain name queue.

[0024] Step 102: each time a domain name to be crawled is selected, if the time interval between the time the selected domain name was last crawled and the current time exceeds a preset time interval threshold, extract a URL to be crawled from the URL queue corresponding to the selected domain name and crawl the network data at that URL; if the interval does not exceed the preset time interval threshold, select the next domain name to be crawled.
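Steps 101 and 102 can be sketched as a single polling pass over the domain name queue. This is a minimal illustration, not the patented implementation: the names `poll_once`, `url_queues`, `last_crawled`, `min_interval`, `fetch` and `clock` are hypothetical, and treating a never-crawled domain as immediately eligible is an assumption the text does not state.

```python
import time
from collections import deque


def poll_once(domains, url_queues, last_crawled, min_interval, fetch,
              clock=time.monotonic):
    """Visit each domain once, in polling order (Step 101).

    A domain is crawled only if more than `min_interval` seconds have passed
    since it was last crawled (Step 102); otherwise it is skipped and the
    next domain is selected. Returns the URLs fetched during this pass.
    """
    fetched = []
    for domain in domains:
        now = clock()
        last = last_crawled.get(domain)
        # Assumption: a domain that was never crawled is eligible right away.
        if last is not None and now - last <= min_interval:
            continue  # threshold not exceeded: select the next domain
        queue = url_queues.get(domain)
        if not queue:
            continue  # no URL left to crawl for this domain
        url = queue.popleft()
        fetch(url)  # caller-supplied download function
        last_crawled[domain] = now
        fetched.append(url)
    return fetched
```

Calling `poll_once` repeatedly with the same `last_crawled` dictionary gives the polling behaviour: at most one URL per domain is fetched per pass, and a domain is skipped until its interval has elapsed. The `clock` parameter exists only so the timing can be controlled in tests.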

[0025] In the embodiment of the present invention, according to the preset polling order, the domain n...

Embodiment 2

[0027] An embodiment of the present invention provides a method for crawling network data, and the method is executed by a crawling server. The crawling server may be a background server of a browser or a background server of a website, and it may be a single server or a server group composed of multiple servers.

[0028] The processing flow shown in Figure 1 is described in detail below in combination with specific implementations:

[0029] Step 101: according to a preset polling sequence, select domain names to be crawled one by one from a pre-stored domain name queue.

[0030] In implementation, technicians may pre-store domain names of multiple websites in the crawling server, and these domain names may be stored in the form of a domain name queue according to a preset polling sequence. The crawling server can also store a plurality of URLs under the domain name corresponding to each website do...
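One way to picture the stored data in [0030] is a domain name queue plus one URL queue per domain. The sketch below is an assumption-laden illustration: `build_queues` is a hypothetical helper, and using first-seen order as the polling sequence is a stand-in for whatever preset sequence the technicians configure.

```python
from collections import deque
from urllib.parse import urlsplit


def build_queues(urls):
    """Group URLs by domain name.

    Returns (domain_queue, url_queues): the domains in first-seen order
    (used here as a stand-in for the preset polling sequence) and a
    per-domain FIFO queue of URLs.
    """
    domain_queue = deque()
    url_queues = {}
    for url in urls:
        domain = urlsplit(url).hostname
        if domain is None:
            continue  # skip strings that have no host part
        if domain not in url_queues:
            url_queues[domain] = deque()
            domain_queue.append(domain)
        url_queues[domain].append(url)
    return domain_queue, url_queues
```

Keeping a separate queue per domain lets the crawler take exactly one URL from a chosen domain without scanning URLs that belong to other sites.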

Embodiment 3

[0051] Based on the same technical idea, an embodiment of the present invention also provides a device for crawling network data. As shown in Figure 3, the device includes:

[0052] A selection module 310, configured to select domain names to be crawled one by one from a pre-stored domain name queue according to a preset polling sequence;

[0053] A crawling module 320, configured to, each time a domain name to be crawled is selected, extract a URL to be crawled from the URL queue corresponding to the selected domain name and crawl the network data at that URL if the time interval between the time the selected domain name was last crawled and the current time exceeds a preset time interval threshold, and to select the next domain name to be crawled if the interval does not exceed the preset time interval threshold.
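The two modules could be mirrored in code as two small classes. This is a hypothetical sketch of the division of responsibilities only: the class names, method names, and the injected `fetch` and `clock` callables are assumptions, and a never-crawled domain is assumed to be eligible immediately.

```python
import itertools
import time
from collections import deque


class SelectionModule:
    """Hands out domain names one by one in a fixed polling order (cf. module 310)."""

    def __init__(self, domains):
        # Assumes at least one domain; an empty list makes next_domain() fail.
        self._cycle = itertools.cycle(list(domains))

    def next_domain(self):
        return next(self._cycle)


class CrawlingModule:
    """Crawls one URL of a domain if its interval threshold is exceeded (cf. module 320)."""

    def __init__(self, url_queues, min_interval, fetch, clock=time.monotonic):
        self.url_queues = url_queues
        self.min_interval = min_interval
        self.fetch = fetch
        self.clock = clock
        self.last_crawled = {}  # domain -> time of its most recent crawl

    def handle(self, domain):
        """Return the URL crawled for `domain`, or None if the domain was skipped."""
        now = self.clock()
        last = self.last_crawled.get(domain)
        if last is not None and now - last <= self.min_interval:
            return None  # too soon: caller should select the next domain
        queue = self.url_queues.get(domain)
        if not queue:
            return None  # nothing queued for this domain
        url = queue.popleft()
        self.fetch(url)
        self.last_crawled[domain] = now
        return url
```

A driver loop would call `selector.next_domain()` and pass the result to `crawler.handle(...)`, moving on whenever `None` comes back.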

[0054] Optionally, the crawling module 320 is confi...

Abstract

The invention discloses a method and apparatus for crawling network data, and belongs to the technical field of the Internet. The method comprises the following steps: according to a preset polling sequence, selecting domain names to be crawled one by one from a pre-stored domain name queue; each time a domain name to be crawled is selected, if the time interval between the time the selected domain name was last crawled and the current time exceeds a preset time interval threshold, extracting a URL to be crawled from the URL queue corresponding to the selected domain name and crawling the network data at that URL; and if the time interval does not exceed the preset time interval threshold, selecting the next domain name to be crawled. With this method and apparatus, the efficiency of crawling network data can be improved.

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method and device for crawling network data.

Background

[0002] With the development of Internet technology, the Internet is used more and more widely, and the amount of network data on the Internet keeps increasing. People can browse network data, such as news, videos and novels, on the Internet through a browser. To help users obtain more network data, some websites crawl network data from other websites and present the crawled data on their own site. The crawling of network data is usually performed by a crawling server.

[0003] A large number of URLs are stored in the crawling server. These URLs can be input by technicians, or they can be obtained by the crawling server during the process of crawling network data. The crawling server sends a data request to the website server. After receiving the data req...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): H04L29/12, G06F17/30
Inventor: 刘杰
Owner: TENCENT TECH (SHENZHEN) CO LTD