Optimization method of distributed vertical crawler service system

An optimization method for a distributed vertical crawler service system. It addresses problems such as the vertical crawler service system failing to work normally, low performance of the crawler logic unit, and complicated webpage download and analysis logic. The stated effects are reduced difficulty, improved download efficiency, and improved analysis efficiency.

Inactive Publication Date: 2016-01-20
GUANGZHOU JISHUBAO DATA SERVICES CO LTD

AI Technical Summary

Problems solved by technology

With the advent of the big data era, the amount of information that vertical crawlers need to process keeps growing, and some websites have adopted dynamic page technology, which makes the logic for downloading and analyzing web pages increasingly complicated. The integrated


Examples

Embodiment Construction

[0046] The workflow of an existing general distributed vertical crawler service system is shown in Figure 1. Crawler services run in a multi-threaded or multi-process manner, and they may be deployed on one host or on multiple hosts within the enterprise. Each crawler service obtains a download task from the task queue, sends an HTTP request to the target URL, and saves the returned result in memory. It then uses a DOM analyzer to parse the content of the HTML page, selects the useful information with a DOM selector, and finally saves that information on the storage device.
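The integrated workflow in [0046] can be illustrated with a minimal sketch. It is not the patent's implementation: `TitleParser`, `crawl_once`, `fake_fetch`, the example URL, and the in-memory `store` are hypothetical names chosen for illustration, the HTTP request is replaced by a stub so the example runs offline, and Python's standard `html.parser` stands in for the DOM analyzer and selector.

```python
import queue
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Stand-in for the DOM analyzer plus selector: collects <title> text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_once(task_queue, fetch, storage):
    """One pass of the integrated workflow: download, parse, select, save."""
    url = task_queue.get()              # obtain a download task
    html = fetch(url)                   # HTTP request; result held in memory
    parser = TitleParser()
    parser.feed(html)                   # analyze the HTML page
    storage[url] = parser.title.strip() # select useful info and save it
    task_queue.task_done()


def fake_fetch(url):
    """Hypothetical stub replacing a real HTTP request."""
    return f"<html><head><title>Page for {url}</title></head></html>"


tasks = queue.Queue()
tasks.put("http://example.com/a")
store = {}                              # stand-in for the storage device
crawl_once(tasks, fake_fetch, store)
```

Because downloading and analysis happen inside the same function, a slow download or a complicated parsing step blocks the whole worker, which is the coupling the invention sets out to remove.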

[0047] The present invention optimizes the structure and flow of a general distributed vertical crawler; the optimized flow is shown in Figure 2. In the present invention, the original crawler service system is divided into two parts, a download service and page analysis logic. Both the download service and the analysis logic are deployed on multiple cloud hosts, and ...
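The split into a download service and page analysis logic, each fed by its own task queue, can be sketched as follows. This is a sketch under stated assumptions rather than the patented system: threads in a single process stand in for services on multiple cloud hosts, in-process `queue.Queue` objects stand in for the distributed download and analysis task queues, and `download_worker`, `analysis_worker`, `run_pipeline`, `fetch`, and `analyze` are hypothetical names.

```python
import queue
import threading


def download_worker(download_q, analysis_q, fetch):
    """Download service: takes URLs and emits analysis tasks."""
    while True:
        url = download_q.get()
        if url is None:                 # sentinel: no more download tasks
            break
        analysis_q.put((url, fetch(url)))


def analysis_worker(analysis_q, analyze, storage, lock):
    """Analysis logic: takes downloaded pages and stores extracted data."""
    while True:
        item = analysis_q.get()
        if item is None:                # sentinel: no more analysis tasks
            break
        url, html = item
        result = analyze(html)
        with lock:
            storage[url] = result


def run_pipeline(urls, fetch, analyze, n_download=2, n_analysis=2):
    download_q, analysis_q = queue.Queue(), queue.Queue()
    storage, lock = {}, threading.Lock()
    downloaders = [
        threading.Thread(target=download_worker,
                         args=(download_q, analysis_q, fetch))
        for _ in range(n_download)
    ]
    analyzers = [
        threading.Thread(target=analysis_worker,
                         args=(analysis_q, analyze, storage, lock))
        for _ in range(n_analysis)
    ]
    for t in downloaders + analyzers:
        t.start()
    for url in urls:
        download_q.put(url)
    for _ in downloaders:               # one sentinel per download worker
        download_q.put(None)
    for t in downloaders:               # all pages are queued after this
        t.join()
    for _ in analyzers:                 # one sentinel per analysis worker
        analysis_q.put(None)
    for t in analyzers:
        t.join()
    return storage


result = run_pipeline(["u1", "u2", "u3"], lambda u: f"<p>{u}</p>", len)
```

With the two stages decoupled, the number of download workers and analysis workers can be tuned independently (`n_download` and `n_analysis` here), which is one way to read the scaling benefit the text attributes to the split.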

Abstract

The invention provides an optimization method for a distributed vertical crawler service system. According to the method, the original crawler service system is split into two parts, a download service and page analysis logic. Both the download service and the analysis logic are deployed on a plurality of cloud hosts, and the task queue is likewise split into a download task queue and an analysis task queue. A crawler service system optimized by this method can process large volumes of vertical crawler data more efficiently, can better capture dynamic HTML pages that use lazy loading, can manage and extend the page download logic and the analysis logic effectively, and can circumvent the anti-crawler strategies of website owners.

Description

Technical field

[0001] The invention relates to a network data transmission method, and in particular to an optimization method of a distributed vertical crawler service system.

Background technique

[0002] With the development of the Internet, more and more information content is available online. Search engines help people find the content they are interested in within this mass of information. General search engines, such as Baidu, Google and Bing, provide Internet content search services to all users. These search engines need to obtain information from the Internet continuously through crawler technology and store it so that people can retrieve it easily. Because the amount of data to be crawled is huge, large-scale search engines often adopt a distributed processing mechanism, that is, they establish a distributed crawler service system. These crawlers obtain the target URL from the unified download queue, and then download and ...

Application Information

IPC(8): G06F17/30, G06F9/44
CPC: G06F16/951
Inventor: 闫峰, 李桂兵, 魏继超
Owner: GUANGZHOU JISHUBAO DATA SERVICES CO LTD