Method with verification for intelligently crawling network information in distributed way

The method applies distributed technology to the verified, intelligent crawling of network information. It addresses problems such as the slow speed of existing crawlers, and achieves balanced node load, improved efficiency, and reduced development time.

Active Publication Date: 2017-06-27
北京京拍档科技股份有限公司

AI Technical Summary

Problems solved by technology

However, this kind of crawler depends on a browser, is relatively slow, and is only suitable for stand-alone testing.

Detailed Description of the Embodiments

[0031] Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary and are intended to explain the present invention and should not be construed as limiting the present invention.

[0032] The invention proposes a verified, distributed, intelligent method for crawling network information. It supports automatic login, access to protected pages, automatic generation of mining scripts, and data crawling.
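Access to protected pages depends on replaying the session cookie obtained at login, and the abstract adds that a timed task keeps that session alive. The following is a minimal Python sketch of such a keep-alive task, not the patent's implementation. The class name `SessionKeepAlive`, the helper `cookie_header`, the injected `fetch` callable, the 300-second default interval, and the rule that any non-200 status means the session has expired are all assumptions made for illustration.

```python
import threading
from typing import Callable, Dict, Optional


def cookie_header(cookies: Dict[str, str]) -> str:
    """Serialize cookies into the value of an HTTP Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


class SessionKeepAlive:
    """Re-request a page on a timer so a logged-in session does not expire.

    `fetch` is a hypothetical callable: it receives the Cookie header value,
    performs the request, and returns the HTTP status code.
    """

    def __init__(self, fetch: Callable[[str], int], cookies: Dict[str, str],
                 interval_s: float = 300.0) -> None:
        self.fetch = fetch
        self.cookies = cookies
        self.interval_s = interval_s  # assumed default; the source gives no value
        self.alive = True
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def ping_once(self) -> bool:
        """Send one keep-alive request; treat any non-200 status as expired."""
        status = self.fetch(cookie_header(self.cookies))
        self.alive = (status == 200)
        return self.alive

    def _loop(self) -> None:
        # Event.wait returns False on timeout, so the body runs once per
        # interval until stop() sets the event.
        while not self._stop.wait(self.interval_s):
            if not self.ping_once():
                break  # a real system would trigger a fresh login here

    def start(self) -> None:
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

Injecting `fetch` keeps the timer logic separate from the HTTP client, so the same loop could drive a browser-based or a plain HTTP request.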

[0033] As shown in Figure 1 and Figure 2, the method for crawling network information with verified distributed intelligence in this embodiment of the present invention includes the following steps:

[0034] Step S1, when it is judged that the target page data of the website can onl...


Abstract

The invention proposes a method with verification for intelligently crawling network information in a distributed way. The method comprises the following steps: when it is determined that the target page data of a website can only be obtained after login verification, obtaining the corresponding login information from a database, logging in automatically through a browser, and submitting the verification information; starting a timed task that uses the cookie to access a web page on a schedule, keeping the session alive; starting a network packet capture detector, accessing the corresponding target page according to business requirements, analyzing the HTTP (Hypertext Transfer Protocol) messages, customizing a crawler script, and determining the amount of data each task crawls; and having the main node emit a broadcast to notify the corresponding task nodes and distribute the crawler script, whereupon each task node starts, applies for a task from the main node's task queue, crawls data according to that task, and places the crawled target data in a queue so that it can be stored in the database in batches. The method thus provides a fast, scalable, integrated distributed web crawler framework that can log in automatically, access protected pages, and automatically generate mining scripts.
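The final step of the abstract (task nodes apply for tasks from a main task queue, crawl, and hand results to a queue that is written to the database in batches) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented system: threads stand in for task nodes, in-process `queue.Queue` objects stand in for the main node's task queue and the result queue, and the names `run_crawl`, `fetch`, `store_batch`, `workers`, and `batch_size` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_crawl(urls: Iterable[str],
              fetch: Callable[[str], str],
              store_batch: Callable[[List[str]], None],
              workers: int = 3,
              batch_size: int = 2) -> int:
    """Distribute URLs to workers, then store the results in batches.

    Returns the number of records handed to `store_batch`.
    """
    tasks: "queue.Queue[str]" = queue.Queue()    # stands in for the main node task queue
    results: "queue.Queue[str]" = queue.Queue()  # crawled data awaiting storage
    for url in urls:
        tasks.put(url)

    def worker() -> None:
        while True:
            try:
                url = tasks.get_nowait()  # "apply for a task"
            except queue.Empty:
                return                    # no tasks left, so this node stops
            results.put(fetch(url))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so the result queue can be drained safely.
    stored = 0
    batch: List[str] = []
    while not results.empty():
        batch.append(results.get())
        if len(batch) == batch_size:
            store_batch(batch)
            stored += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        store_batch(batch)
        stored += len(batch)
    return stored
```

Because each worker pulls its next task only when it is free, faster workers naturally take more tasks, which is one simple way to get the balanced node load the summary mentions.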

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a method for crawling network information with verified distributed intelligence.

Background technique

[0002] Existing components for crawling web pages fall mainly into the following three types:

[0003] (1) Non-analytic crawler (HTTP-based): This kind of crawler is based on the HTTP(S) protocol (for example, JSoup and HTTPClient among Java components) and can construct page requests (HTTPRequest) according to HTTP message rules and connection parameters. However, this type of crawler has no JS engine module, cannot obtain content that the page generates dynamically, and cannot receive multiple connection pages at a time.

[0004] (2) Analytical crawler (browser-based): This kind of crawler is also based on the HTTP(S) protocol (for example, HtmlUnit among Java components). It can simulate browser behavior, not only accepting multiple connection pages, but also pa...
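The limitation of the non-analytic (HTTP-based) crawler in type (1) can be illustrated with a short Python sketch using only the standard library. It is an assumption-laden example rather than anything taken from the patent: the URL, the `example-crawler/0.1` user agent, the `build_request` helper, the `LinkExtractor` class, and the sample HTML are all hypothetical. The request is only constructed, not sent, and the parser sees just the static markup, so a link that a script would add at run time never appears.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.request import Request


def build_request(url: str, cookie: str) -> Request:
    """Construct (but do not send) a GET request from connection parameters."""
    headers = {"User-Agent": "example-crawler/0.1", "Cookie": cookie}
    return Request(url, headers=headers, method="GET")


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags present in the static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag: str,
                        attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sample page: one static link, plus a script that would add a second link
# only if a JS engine executed it.
STATIC_HTML = """
<html><body>
<a href="/page/1">one</a>
<script>
  var a = document.createElement('a');
  a.href = '/page/2';
  document.body.appendChild(a);
</script>
</body></html>
"""

request = build_request("https://example.com/list", "sid=abc")
parser = LinkExtractor()
parser.feed(STATIC_HTML)
print(request.get_method(), request.full_url)
print(parser.links)  # only the static link is found
```

The second link exists only after the script runs, which is why a crawler without a JS engine cannot obtain dynamically generated content.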


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F9/50
CPC: G06F9/505, G06F16/951, G06F16/955, Y02D10/00
Inventor: 王文峰, 杨振, 许千帆
Owner: 北京京拍档科技股份有限公司