Distributed web crawler system

A distributed web crawler system technology, applied in the field of distributed web crawler systems, that addresses problems such as the inability to determine the relevance between pages and topics, and web-page crawling speed and quality that fail to meet user requirements.

Active Publication Date: 2013-09-18
慧科教育科技集团有限公司
Cites: 4, Cited by: 63

AI Technical Summary

Problems solved by technology

[0004] However, existing technologies cannot determine the relevance between pages and topics, nor can they accommodate different topics in a single crawler system, so the speed and quality of web-page crawling fail to meet user requirements.
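To make the idea of page-topic relevance concrete, here is a minimal sketch that scores a page against a topic's keywords with cosine similarity over term counts. This is a common textbook measure used purely for illustration, not the method described in the patent; the function names `tokenize` and `relevance` and the keyword-list representation of a topic are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(page_text, topic_keywords):
    """Cosine similarity between page term counts and topic keyword counts.

    Hypothetical scoring function: returns a value in [0, 1], where 0 means
    the page shares no terms with the topic.
    """
    page = Counter(tokenize(page_text))
    topic = Counter(tokenize(" ".join(topic_keywords)))
    # Dot product over topic terms; Counter returns 0 for missing terms.
    dot = sum(page[term] * topic[term] for term in topic)
    page_norm = math.sqrt(sum(count * count for count in page.values()))
    topic_norm = math.sqrt(sum(count * count for count in topic.values()))
    norm = page_norm * topic_norm
    return dot / norm if norm else 0.0
```

A crawler could compare such a score against a per-topic threshold to decide whether a fetched page is worth storing; the threshold itself would have to be tuned and is not specified in the source.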



Examples


Embodiment Construction

[0044] The system of the present invention adopts a distributed architecture based on data extractors and consists of a central master control node and distributed crawler servers, as shown in Figure 1.

[0045] As shown in Figure 1, the present invention mainly consists of the following modules:

[0046] 1. Management Portal

[0047] The management portal is a web interface that the crawler system provides to the administrator. Through it, the administrator can view the logs of the central server and the sub-servers, set and add topics, update the URL seeds of a given topic, configure parameters such as the crawl frequency of each topic, and control the status of the crawler. The central node and the distributed crawlers form the main body of the system; they carry out topic operations, data extractor learning, page analysis, and storage of target pages.
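The administrative operations listed in [0047] can be pictured as a small in-memory model. The sketch below is an illustration under stated assumptions, not the patent's implementation: the class names `TopicConfig`, `ManagementPortal`, and `CrawlerState`, the three state values, and the use of hours as the crawl-frequency unit are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class CrawlerState(Enum):
    """Hypothetical crawler states an administrator might switch between."""
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


@dataclass
class TopicConfig:
    """Per-topic settings: seed URLs and crawl frequency (assumed unit: hours)."""
    name: str
    seed_urls: List[str] = field(default_factory=list)
    crawl_interval_hours: float = 24.0


class ManagementPortal:
    """In-memory stand-in for the portal's topic and state controls."""

    def __init__(self):
        self.topics: Dict[str, TopicConfig] = {}
        self.state = CrawlerState.STOPPED

    def add_topic(self, name, seed_urls, crawl_interval_hours=24.0):
        """Register a new topic; reject duplicates so settings are not lost."""
        if name in self.topics:
            raise ValueError(f"topic already exists: {name}")
        self.topics[name] = TopicConfig(name, list(seed_urls), crawl_interval_hours)

    def update_seeds(self, name, seed_urls):
        """Replace the URL seeds of an existing topic."""
        self.topics[name].seed_urls = list(seed_urls)

    def set_crawl_interval(self, name, hours):
        """Change how often an existing topic is crawled."""
        if hours <= 0:
            raise ValueError("crawl interval must be positive")
        self.topics[name].crawl_interval_hours = hours

    def set_state(self, state):
        """Switch the crawler between running, paused, and stopped."""
        self.state = CrawlerState(state)
```

For example, calling `add_topic("news", ["https://example.com/"])` followed by `set_state(CrawlerState.RUNNING)` registers a topic with one seed and marks the crawler as running. A real portal would persist these settings and forward them to the central node; log viewing is omitted here.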

[0048] 2. Central node server

[0049] The crawler's central master control node is the control center of the system. It mainly comprises a URL controller, an extractor module, and a theme cont...
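The source text is cut off before it explains how the URL controller works, so the following is only one plausible sketch of what such a component on a central node might do: deduplicate URLs and assign each one to a sub-node server by hashing its host. The class name `UrlController`, the node identifiers, and the hash-by-host policy are assumptions, not details from the patent.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


class UrlController:
    """Hypothetical central URL controller: dedupe, then route by host hash."""

    def __init__(self, sub_nodes):
        if not sub_nodes:
            raise ValueError("at least one sub-node is required")
        self.sub_nodes = list(sub_nodes)
        self.seen = set()                 # URLs already submitted
        self.queues = defaultdict(list)   # sub-node id -> pending URLs

    def _node_for(self, url):
        """Pick a sub-node from a stable hash of the URL's host."""
        host = urlsplit(url).netloc.lower()
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.sub_nodes)
        return self.sub_nodes[index]

    def submit(self, url):
        """Queue a URL for its sub-node; return the node id, or None if seen."""
        if url in self.seen:
            return None
        self.seen.add(url)
        node = self._node_for(url)
        self.queues[node].append(url)
        return node
```

Routing by host keeps every page of a site on the same sub-node, which makes per-site politeness limits easier to enforce. The trade-off is that one very large site can overload a single node.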


Abstract

A distributed web crawler system, suitable for the field of network information collection, comprises a management portal, a central node server, and distributed sub-node servers. The management portal is a web interface that the crawler system provides to the administrator; it is used to view the logs of the central node server and the distributed sub-node servers, set and add themes, update the URL (uniform resource locator) seeds of a theme, configure the capture frequency of a theme, and control the crawler state. The central node server and the distributed sub-node servers form the main body of the system and are used to operate on themes, train data extractors, analyze pages, and store target pages. With this system, a single crawler can accommodate the capture of different themes, webpage capture speed is increased, and capture quality meets user requirements.

Description

Technical field

[0001] The invention relates to a distributed web crawler system and belongs to the field of network information collection.

Background art

[0002] The rapid development of the network has brought explosive growth in the amount of information on the World Wide Web. As an Internet information retrieval tool, the traditional general-purpose search engine is becoming more and more important. However, because of inherent limitations such as low network coverage and a high missed-detection rate, it cannot provide users with accurate and comprehensive information. To overcome these shortcomings, topic search engines emerged, with the goal of providing users with the most accurate results in their fields of interest while consuming limited bandwidth and hardware resources.

[0003] The theme crawler is the foundation of the theme search engine, and its speed and quality of crawling web pages are im...


Application Information

IPC(8): G06F17/30
Inventor: 王宝会, 于雷, 王丽华, 王新河, 尹科
Owner 慧科教育科技集团有限公司