A multi-thread-based web crawler system and webpage crawling method thereof

The invention applies web crawler and multi-threading technology to the field of multi-threaded web crawler systems. It addresses problems such as low efficiency, difficult maintenance, and program complexity, and improves concurrent efficiency.

Active Publication Date: 2019-06-14

泰州市东盛电脑科技有限公司

5 Cites, 0 Cited by


Problems solved by technology

[0013] However, existing multi-threaded web crawler systems generally crawl slowly and inefficiently, and their programs are complicated and difficult to maintain.



Embodiment

[0046] As shown in Figure 1, a multithread-based web crawler system includes a URL processing module, a webpage crawling module, a webpage analysis module, and a webpage storage module.

[0047] The URL processing module obtains the host name, port number, and file name of each URL through URL class processing.

[0048] The general form of a URL is: <protocol>://<host>:<port>/<file>. In this program the form can be simplified, so a class for storing URLs is designed, with the members Host (host name), Port (port), File (file path), and Fname (the name given to this web page). The following code lists all members and member functions of the URL class:

class URL
{
public:
    URL() {}
    void SetHost(const string& host) {Host = host;}
    string GetHost() {return Host;}
    void SetPort(int port) {Port = port;}
    int GetPort() {return Port;}
    void SetFile(const string& file) {File = file;}
    string GetFile() {return File;}

[...
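The Host, Port, and File values described above have to come from somewhere, so the following is a minimal sketch of how a URL string might be split into those three parts. It is not the patent's URL class: the `ParsedUrl` struct and the `parseUrl` function are hypothetical names, and the defaults (port 80, file path "/") are assumptions made for illustration. IPv6 literals and user-info components are not handled.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical plain struct mirroring the Host, Port and File members
// described in the text; it is not the patent's URL class.
struct ParsedUrl {
    std::string host;
    int port;
    std::string file;
};

// Splits "<protocol>://<host>[:<port>]/<file>" into its parts.
// Assumptions: the port defaults to 80 when absent, and the file path
// defaults to "/" when the URL has no path component.
ParsedUrl parseUrl(const std::string& url) {
    ParsedUrl result;
    result.port = 80;
    result.file = "/";

    // Skip the protocol prefix if one is present.
    std::string rest = url;
    const std::string::size_type schemeEnd = url.find("://");
    if (schemeEnd != std::string::npos) {
        rest = url.substr(schemeEnd + 3);
    }

    // Everything from the first '/' onward is the file path.
    std::string authority = rest;
    const std::string::size_type slash = rest.find('/');
    if (slash != std::string::npos) {
        authority = rest.substr(0, slash);
        result.file = rest.substr(slash);
    }

    // An optional ":<port>" suffix follows the host name.
    const std::string::size_type colon = authority.find(':');
    if (colon != std::string::npos) {
        result.host = authority.substr(0, colon);
        result.port = std::atoi(authority.substr(colon + 1).c_str());
    } else {
        result.host = authority;
    }
    return result;
}
```

Under these assumptions, `parseUrl("http://example.com:8080/docs/index.html")` yields host `example.com`, port `8080`, and file `/docs/index.html`. The results could then be passed to setters such as `SetHost`, `SetPort`, and `SetFile`.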


Abstract

The invention discloses a multithreading-based web crawler system comprising a URL (Uniform Resource Locator) processing module, a web crawling module, a web analysis module, and a web storage module. The URL processing module obtains the host name, port number, and file name of each URL through URL class processing. The web crawling module crawls web content in blocks and stores each captured page in a temporary storage module. The web analysis module extracts URLs, redirects URLs, checks URLs for repetition, and deletes repeated URLs. When storing a file, the web storage module checks whether the file already exists: if it does not, the page is crawled directly; if it does, the original file is overwritten when the content obtained by the current crawl is greater than the content crawled previously, and otherwise the newly crawled file is discarded. In operation, the web page to be matched against a regular expression is input first and a web request signal is sent; a private function is then triggered to obtain the matched content; finally, the specific information containing the keywords is obtained. Crawling speed is high and efficiency is high.
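Two of the decisions in the abstract, dropping repeated URLs and choosing whether to overwrite a stored file, can be illustrated with a short sketch. This is an interpretation rather than the patent's implementation: `StoreAction`, `decideStore`, and `SeenUrls` are hypothetical names, "more content" is assumed to mean a larger size in bytes, and the case where no file exists is assumed to result in a fresh write. The set shown here is not thread-safe; a multi-threaded crawler would need to guard it with a lock.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Outcome of the storage decision; the names are illustrative only.
enum class StoreAction { WriteNew, Overwrite, Discard };

// Decides what to do with freshly crawled content.
//   no file stored yet                   -> WriteNew
//   crawled content larger than stored   -> Overwrite
//   otherwise                            -> Discard
// Assumption: "more content" is measured as size in bytes.
StoreAction decideStore(bool fileExists,
                        std::size_t storedBytes,
                        std::size_t crawledBytes) {
    if (!fileExists) {
        return StoreAction::WriteNew;
    }
    return crawledBytes > storedBytes ? StoreAction::Overwrite
                                      : StoreAction::Discard;
}

// Tracks URLs that have already been seen so repeats can be dropped.
// Not thread-safe: callers sharing one instance across threads must lock.
class SeenUrls {
public:
    // Returns true the first time a URL is offered, false for repeats.
    bool markIfNew(const std::string& url) {
        return seen_.insert(url).second;
    }

private:
    std::unordered_set<std::string> seen_;
};
```

Keeping the storage decision as a pure function of the two sizes makes it easy to check separately from any file I/O.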

Description

Technical field

[0001] The invention relates to a multi-thread-based web crawler system, and in particular to a multi-thread-based web crawler system and a webpage crawling method that crawl web pages quickly and efficiently.

Background technique

[0002] A web crawler is like a spider crawling across the web of the Internet. Starting from the home page of a website, the crawler reads the page content, finds the link addresses it contains, and follows them to the next web pages. By repeating this cycle, the crawler fetches all the required web pages of the website.

[0003] Crawlers fetch web pages automatically. In a search engine, the crawler's main function is to download web pages from the Internet, and it plays a key role in the engine. From the perspective of the crawler program, what directly affects the search structure is the implementation strategy it adopts and its operating efficiency. Each search engine has different needs, so it is necessary to choose the bes...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F16/951

CPC: G06F16/951

Inventor: 黄金城曹瑞袁敏

Owner: 泰州市东盛电脑科技有限公司
