Web page crawling method and spider

A web page crawling technology, classified under special data processing applications, instruments, and electrical digital data processing, that addresses problems of traditional methods such as difficulty collecting deep web pages and difficulty overcoming JavaScript interference.

Status: Inactive
Publication Date: 2013-09-11
FUJITSU LTD
3 Cites, 18 Cited by

AI Technical Summary

Problems solved by technology

[0006] Therefore, it is difficult for traditional methods to completely and efficiently

Embodiment Construction

[0028] Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such practical embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, such as meeting system-related and business-related constraints, and that these constraints may vary from implementation to implementation. Moreover, it should also be understood that such development work, while potentially complex and time-consuming, would be a routine undertaking for those skilled in the art having the benefit of this disclosure.

[0029] It should also be noted that, to avoid obscuring the present invention with unnecessary detail, only the device structure and/or processing steps closely relate...

Abstract

The invention discloses a web page crawling method and a spider. The method comprises the following steps: injecting a seed URL into a Web database; generating a URL list based on the Web database; feeding the URLs in the URL list to a web page crawler; crawling, by the web page crawler, the web page at each fed URL in conformance with the visit mode corresponding to that web page; and, based on the crawled web page, updating the URL state in the Web database and injecting newly found URLs. The visit mode comprises a request parameter socket, a response parameter socket, and the correspondence between the request parameter socket and the response parameter socket. The request parameter socket comprises a request parameter as well as the mapping relationship between the request parameter socket and the response parameter socket. The response parameter socket comprises a response parameter as well as extraction position information indicating where the response parameter is extracted from the HTTP response message.
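
The crawl cycle described in the abstract (inject, generate a URL list, fetch, update state, inject newly found URLs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `WebDatabase`, `VisitMode`, `UrlState`, `crawl`, `fetch`, and `mode_for` are all hypothetical, the visit mode is reduced to three plain dictionaries, and the fetcher is injected as a callable so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Iterable, List, Tuple


class UrlState(Enum):
    """Hypothetical URL states tracked in the Web database."""
    PENDING = "pending"
    FETCHED = "fetched"
    FAILED = "failed"


@dataclass
class VisitMode:
    """Hypothetical container for a page's visit mode (an assumption,
    loosely following the abstract's request/response parameter groups)."""
    # Request parameters to send with the HTTP request.
    request_params: Dict[str, str] = field(default_factory=dict)
    # Response parameter name -> where to extract it from the HTTP response.
    response_positions: Dict[str, str] = field(default_factory=dict)
    # Request parameter name -> response parameter name it is filled from.
    param_mapping: Dict[str, str] = field(default_factory=dict)


class WebDatabase:
    """In-memory stand-in for the Web database of URLs and their states."""

    def __init__(self) -> None:
        self.states: Dict[str, UrlState] = {}

    def inject(self, url: str) -> None:
        # A URL that is already known keeps its current state.
        self.states.setdefault(url, UrlState.PENDING)

    def generate_url_list(self, limit: int) -> List[str]:
        pending = [u for u, s in self.states.items() if s is UrlState.PENDING]
        return pending[:limit]

    def update(self, url: str, state: UrlState) -> None:
        self.states[url] = state


# A fetcher takes a URL and its visit mode and returns (success, new URLs).
Fetcher = Callable[[str, VisitMode], Tuple[bool, List[str]]]


def crawl(seeds: Iterable[str], fetch: Fetcher,
          mode_for: Callable[[str], VisitMode],
          batch_size: int = 100, max_rounds: int = 10) -> WebDatabase:
    db = WebDatabase()
    for seed in seeds:                      # step 1: inject seed URLs
        db.inject(seed)
    for _ in range(max_rounds):
        url_list = db.generate_url_list(batch_size)   # step 2: URL list
        if not url_list:
            break
        for url in url_list:                # step 3: feed URLs to the crawler
            ok, new_urls = fetch(url, mode_for(url))  # step 4: fetch per visit mode
            db.update(url, UrlState.FETCHED if ok else UrlState.FAILED)
            for new_url in new_urls:        # step 5: inject newly found URLs
                db.inject(new_url)
    return db


if __name__ == "__main__":
    # Fake in-memory site; example.test is a placeholder host.
    site = {"http://example.test/": ["http://example.test/a"],
            "http://example.test/a": []}

    def fake_fetch(url: str, mode: VisitMode) -> Tuple[bool, List[str]]:
        return (url in site, site.get(url, []))

    result = crawl(["http://example.test/"], fake_fetch, lambda url: VisitMode())
    for url, state in result.states.items():
        print(url, state.value)
```

Passing the fetcher in as a parameter keeps the loop independent of how a page is actually retrieved; a real fetcher would presumably use the visit mode to build the request and to pull response parameters out of the HTTP response, which this sketch does not attempt.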

Description

Technical field

[0001] The present invention generally relates to web page crawling methods and crawlers. Specifically, the present invention relates to a method and a crawler capable of crawling deep web pages (Deep Web).

Background art

[0002] In recent years, with the development of the Internet, a large amount of information has become available online. Web pages on the Internet can be roughly divided into two categories: surface web pages (Surface Web) and deep web pages (Deep Web). Surface web pages are the collection of pages that traditional search engines can index by following hyperlinks. The deep web is the part of the Internet that traditional search engines cannot index, and it mainly includes four types: (1) dynamic pages obtained by filling in forms to query back-end online databases; (2) content that requires login to access; (3) pages not indexed by search engines because no hyperlinks point to them; (4) non-web files accessible on the Inte...

Application Information

IPC(8): G06F17/30
Inventor: 邹纲, 皮冰锋, 张军, 钟朝亮, 于浩
Owner: FUJITSU LTD