Method for realizing web crawler tasks

A web crawler task technology, applied in the field of web crawlers, that assures crawling speed, shortens the development cycle, and reduces development difficulty.

Inactive Publication Date: 2013-03-27
维我软件(上海)有限公司

AI Technical Summary

Problems solved by technology

[0005] The web crawlers in the prior art are usually designed for a specific website in advance, and it is difficult to modify the target website and its



Examples


[0032] Embodiment: A method for establishing web crawler tasks for a physician database system; specifically, a method for quickly building a fast, stable web crawler for each target website when different websites must be crawled. The specific implementation is as follows:

[0033] A. In step S11, write a template for storing page link addresses. This step creates, for each page that needs to be crawled, a template that stores the page's link address information. The template acts like a blank address book: it saves the link address of each crawled page together with the depth of that page. For example, the link to the detail page of a paper on Wanfang is:

[0034] http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx. With a page depth of 3, the template stores this address link and the depth value 3.
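The link-address template of step S11 can be pictured as a small record type. The following is a minimal sketch, not the patent's own implementation; the class name `PageLink` and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLink:
    """One entry in the link-address template (hypothetical name).

    Holds the two values the text says the template stores:
    the page's link address and the page's depth.
    """
    url: str
    depth: int


# The Wanfang detail page from the example above, at depth 3.
record = PageLink(
    url="http://d.wanfangdata.com.cn/Periodical_ahzylczz201203001.aspx",
    depth=3,
)
print(record.url, record.depth)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: it lets entries be placed in a set to detect pages that were already recorded.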

[0035] B. In step S12, write a link resolver. First, establish a regular expression, anal...
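The text of step S12 is cut off, so the regular expression the patent actually uses is not known. As a rough sketch of what a regex-based link resolver might look like, the example below uses a hypothetical pattern (`HREF_PATTERN`), a hypothetical function name (`resolve_links`), and made-up sample HTML; only the Wanfang host and paper URL come from the text.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the href value of <a> tags,
# whether it is quoted with double or single quotes.
HREF_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def resolve_links(html, base_url, parent_depth):
    """Return (absolute_url, depth) pairs for each distinct link in html."""
    links = []
    seen = set()
    for href in HREF_PATTERN.findall(html):
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in seen:
            seen.add(absolute)
            # A link found on a page sits one level deeper than that page.
            links.append((absolute, parent_depth + 1))
    return links


# Made-up page content for illustration.
sample_html = """
<a href="/Periodical_ahzylczz201203001.aspx">Paper</a>
<a class="nav" href='http://d.wanfangdata.com.cn/index.aspx'>Home</a>
"""
links = resolve_links(sample_html, "http://d.wanfangdata.com.cn/list.aspx", 2)
for url, depth in links:
    print(depth, url)
```

A regular expression is enough for a sketch like this, but it is fragile against irregular markup; an HTML parser would be the sturdier choice in practice.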



Abstract

The invention discloses a method for realizing web crawler tasks. The method comprises the following steps: 1. the link address of a web page to be crawled is initialized on the client; 2. the client packages the link address into a task request and sends it to the server; 3. the server sends an HTTP request to the page to be crawled and returns the required information to the client; 4. the client receives and processes the information; and 5. the process is repeated until the web pages in the crawl list have been crawled in sequence. The invention provides a general crawling framework for crawling different web content. With this method, a crawler for a specific website can be written quickly, which greatly reduces development difficulty and shortens the development cycle. Because the method is built on a distributed web crawler framework, crawling speed can also be assured. The method can be applied to medical information systems.
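The five steps of the abstract can be traced in a short sketch. This is an illustration under stated assumptions, not the patented framework: the names `TaskRequest`, `Server`, `Client`, and `FAKE_WEB` are hypothetical, the client and server run in one process, and the server looks pages up in a dictionary instead of sending a real HTTP request.

```python
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """Step 2: the link address packaged as a task request (hypothetical)."""
    url: str


# Stand-in for the web so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/a": "<html>page A</html>",
    "http://example.com/b": "<html>page B</html>",
}


class Server:
    def handle(self, task):
        # Step 3: a real server would send an HTTP request to task.url;
        # here the page body is looked up in FAKE_WEB instead.
        return FAKE_WEB.get(task.url, "")


class Client:
    def __init__(self, server, crawl_list):
        self.server = server
        self.crawl_list = list(crawl_list)  # Step 1: initialize link addresses
        self.results = {}

    def run(self):
        for url in self.crawl_list:          # Step 5: repeat over the crawl list
            task = TaskRequest(url=url)      # Step 2: package the task request
            body = self.server.handle(task)  # Step 3: server fetches and returns
            self.results[url] = len(body)    # Step 4: client processes the data
        return self.results


client = Client(Server(), ["http://example.com/a", "http://example.com/b"])
results = client.run()
print(results)
```

Here "processing" is reduced to recording the body length; a real client would parse the returned page and store the extracted fields.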

Description

Technical field

[0001] The invention relates to the technical field of web crawlers and, more specifically, to a method for realizing web crawler tasks, used mainly in medical information systems.

Background art

[0002] A web crawler (also known as a web spider or web robot) is a program or script that automatically collects information from the World Wide Web according to defined rules. It downloads web pages for search engines and is an important search engine component. A traditional crawler starts from the URLs of one or several initial web pages, obtains the URLs on those pages, and, while crawling, continuously extracts new URLs from the current page and adds them to a queue until a stop condition of the system is met.

[0003] Current practical web crawler programs are usually distributed. A distributed web crawler contains multiple crawlers. The tasks that each craw...
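The traditional crawl loop described in the background section (seed URLs, a queue of newly discovered URLs, a stop condition) can be sketched as a breadth-first traversal. The link graph `LINK_GRAPH`, the function name `crawl`, and the `max_pages` stop condition are assumptions for illustration; the text does not specify which stop condition a given system uses.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> links on that page.
LINK_GRAPH = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["seed"],
}


def crawl(seeds, max_pages):
    """Visit pages breadth-first from the seeds until max_pages are crawled."""
    queue = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)      # URLs already queued, to avoid repeats
    visited = []           # crawl order
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        # Extract new URLs from the current page and put them in the queue.
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["seed"], max_pages=4)
print(order)
```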


Application Information

IPC(8): G06F17/30
Inventor: 金博
Owner: 维我软件(上海)有限公司