A multi-thread-based web crawler system and webpage crawling method thereof

Web crawler and multi-threading technology, applied to multi-threaded web crawler systems, addresses problems such as low efficiency, difficult maintenance, and program complexity, and improves concurrent efficiency.

Active Publication Date: 2019-06-14
泰州市东盛电脑科技有限公司

AI Technical Summary

Problems solved by technology

[0013] Existing multi-threaded web crawler systems generally suffer from slow crawling speed and low efficiency, and their programs are complicated and difficult to maintain.

Examples

Embodiment

[0046] As shown in Figure 1, a multi-thread-based web crawler system includes a URL processing module, a webpage crawling module, a webpage analysis module, and a webpage storage module.

[0047] The URL processing module obtains the host name, port number, and file name of each URL through URL class processing.

[0048] The general form of a URL is: <protocol>://<host>:<port>/<file>. This program handles it in a simplified way: a class for storing URLs is designed, which includes Host (host name), Port (port), File (file path), and Fname (the name given to this web page). The following code shows the members and member functions of the URL class:

[0049] class URL

[0050] {

[0051] public:

[0052] URL() {}

[0053] void SetHost(const string& host) {Host = host;}

[0054] string GetHost() {return Host;}

[0055] void SetPort(int port) {Port = port;}

[0056] int GetPort() {return Port;}

[0057] void SetFile(const string& file) {File = file;}

[0058] string GetFile() {return File;}

[...
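The URL processing step described above (obtaining the host name, port number, and file name from each URL) can be sketched as follows. This is a minimal illustration only, not the code from the patent: the `ParsedUrl` struct, the `ParseUrl` function, and the default port of 80 are assumptions introduced here, and the sketch requires C++17. It does not handle IPv6 literals, user info, or query-specific processing.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical plain-data counterpart of the URL class above.
struct ParsedUrl {
    std::string host;  // host name, e.g. "example.com"
    int port = 80;     // assumed default when the URL names no port
    std::string file;  // path on the host, always starting with '/'
};

// Splits "<protocol>://<host>[:<port>][/<file>]" into its parts.
// Returns std::nullopt when "://" is missing, the host is empty,
// or the port is not a number in the range 1..65535.
std::optional<ParsedUrl> ParseUrl(const std::string& url) {
    const std::string sep = "://";
    const std::size_t scheme_end = url.find(sep);
    if (scheme_end == std::string::npos) return std::nullopt;

    const std::size_t host_begin = scheme_end + sep.size();
    const std::size_t path_begin = url.find('/', host_begin);
    const std::string authority =
        (path_begin == std::string::npos)
            ? url.substr(host_begin)
            : url.substr(host_begin, path_begin - host_begin);

    ParsedUrl out;
    out.file = (path_begin == std::string::npos) ? "/" : url.substr(path_begin);

    const std::size_t colon = authority.find(':');
    out.host = authority.substr(0, colon);
    if (out.host.empty()) return std::nullopt;

    if (colon != std::string::npos) {
        const std::string port_text = authority.substr(colon + 1);
        if (port_text.empty() || port_text.size() > 5) return std::nullopt;
        int port = 0;
        for (char c : port_text) {
            if (c < '0' || c > '9') return std::nullopt;
            port = port * 10 + (c - '0');
        }
        if (port == 0 || port > 65535) return std::nullopt;
        out.port = port;
    }
    return out;
}
```

A caller could copy the three fields into the URL class through SetHost, SetPort, and SetFile. Returning `std::optional` keeps malformed links found during page analysis from reaching the crawling threads.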

Abstract

The invention discloses a multithreading-based web crawler system comprising a URL (Uniform Resource Locator) processing module, a web crawling module, a web analysis module, and a web storage module. The URL processing module obtains the host name, port number, and file name of each URL through URL-class processing. The web crawling module crawls web content in blocks and stores each captured page in a temporary storage module. The web analysis module extracts URLs, handles URL redirection, checks URLs for repetition, and deletes duplicate URLs. When storing a file, the web storage module checks whether the file already exists: if it does not, the crawled file is stored directly; if it does, and the content obtained by the current crawl is larger than the content crawled previously, the original file is overwritten; otherwise, the newly crawled file is discarded. In operation, a web page to be matched against a regular expression is first supplied and a web request is sent; a private function is then triggered to obtain the matching content; finally, the specific information containing the keywords is obtained. Crawling speed is high and efficiency is high.
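The duplicate-URL check and the storage rule in the abstract can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: `SeenUrls`, `StoreAction`, and `DecideStore` are hypothetical names, "more content" is assumed to mean a larger size in bytes, and a mutex-guarded set is one simple way to share the seen-URL record between crawler threads. The sketch requires C++17.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_set>

// Thread-safe record of URLs that have already been queued or crawled
// (hypothetical helper; the patent does not name its data structure).
class SeenUrls {
public:
    // Returns true only the first time a given URL is offered,
    // so duplicate URLs can be dropped by the caller.
    bool MarkIfNew(const std::string& url) {
        std::lock_guard<std::mutex> lock(mutex_);
        return seen_.insert(url).second;
    }

private:
    std::mutex mutex_;
    std::unordered_set<std::string> seen_;
};

// Possible outcomes of the storage rule.
enum class StoreAction { kWriteNew, kOverwrite, kDiscard };

// existing_size is the size of the stored file, or std::nullopt if no
// file exists yet. new_size is the size of the freshly crawled content.
// Assumption: content amounts are compared by size in bytes.
StoreAction DecideStore(std::optional<std::size_t> existing_size,
                        std::size_t new_size) {
    if (!existing_size) return StoreAction::kWriteNew;
    return new_size > *existing_size ? StoreAction::kOverwrite
                                     : StoreAction::kDiscard;
}
```

Keeping the decision in a pure function separates the rule from file I/O, so each storage thread can stat the target file, call `DecideStore`, and act on the result.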

Description

Technical field

[0001] The invention relates to a multi-thread-based web crawler system, and in particular to a multi-thread-based web crawler system and webpage crawling method that crawl web pages quickly and efficiently.

Background technique

[0002] A web crawler is like a spider crawling across the web of the Internet. Starting from a website's home page, the crawler reads the page content, finds the link addresses it contains, and follows them to the next web pages. By repeating this cycle, the crawler crawls all the required web pages of the website.

[0003] Crawlers fetch web pages automatically. In a search engine, the crawler's main function is to download web pages from the Internet, and it plays a key role in the engine. From the perspective of the crawler program, what directly affects the search results is the implementation strategy and operating efficiency it adopts. Each search engine has different needs, so it is necessary to choose the bes...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951
CPC: G06F16/951
Inventor: 黄金城曹瑞袁敏
Owner: 泰州市东盛电脑科技有限公司