Multithreading-based web crawler system and web crawling method thereof

The invention applies multithreading technology to a web crawler system. It addresses problems of existing crawlers such as low efficiency, slow crawling speed, and difficult maintenance, and aims to improve the efficiency of concurrent crawling.

Active Publication Date: 2016-05-25

泰州市东盛电脑科技有限公司

Cites: 5. Cited by: 21.


Problems solved by technology

[0013] However, existing multithreaded web crawler systems generally crawl slowly and inefficiently, and their programs are complicated and difficult to maintain.

Embodiment

[0046] As shown in Figure 1, a multithreading-based web crawler system includes a URL processing module, a web page crawling module, a web page analysis module, and a web page storage module.

[0047] The URL processing module obtains the host name, port number, and file name of each URL through URL class processing.

[0048] The general form of a URL is <protocol>://<host>:<port>/<file>. This program uses a simplified form, so a class for storing URLs is designed. It holds Host (host name), Port (port), File (file path), and Fname (the name given to the web page). The following code shows the members and member functions of the URL class:

[0049] class URL

[0050] {

[0051] public:

[0052] URL() {}

[0053] void SetHost(const string& host) { Host = host; }

[0054] string GetHost() { return Host; }

[0055] void SetPort(int port) { Port = port; }

[0056] int GetPort() { return Port; }

[0057] void SetFile(const string& file) { File = file; }

[0058] string GetFile() { return File; }

[0059] void SetFname(const...
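The listing above is cut off in this copy of the document. As a rough, self-contained illustration of how such a class might be used, the following sketch completes it under stated assumptions: the `SetFname`/`GetFname` pair, the private data members, the default port of 80, and the `ParseUrl` helper are hypothetical and do not come from the patent text.

```cpp
#include <cstddef>
#include <string>

// Illustrative sketch only. The field names follow Host/Port/File/Fname from
// the text; the private layout and the default port are assumptions.
class URL {
public:
    URL() = default;
    void SetHost(const std::string& host) { Host = host; }
    const std::string& GetHost() const { return Host; }
    void SetPort(int port) { Port = port; }
    int GetPort() const { return Port; }
    void SetFile(const std::string& file) { File = file; }
    const std::string& GetFile() const { return File; }
    void SetFname(const std::string& fname) { Fname = fname; }
    const std::string& GetFname() const { return Fname; }

private:
    std::string Host;
    int Port = 80;
    std::string File;
    std::string Fname;
};

// Hypothetical helper: splits "protocol://host:port/file" into a URL object.
// Returns false when the host is empty or the port is not a number in 1..65535.
bool ParseUrl(const std::string& text, URL& out) {
    std::string rest = text;
    const std::size_t scheme_end = rest.find("://");
    if (scheme_end != std::string::npos) rest = rest.substr(scheme_end + 3);

    // Everything before the first '/' is host[:port]; the remainder is the file path.
    const std::size_t slash = rest.find('/');
    std::string authority = rest.substr(0, slash);
    const std::string file = (slash == std::string::npos) ? "/" : rest.substr(slash);

    int port = 80;
    const std::size_t colon = authority.find(':');
    if (colon != std::string::npos) {
        const std::string digits = authority.substr(colon + 1);
        if (digits.empty() || digits.size() > 5) return false;
        port = 0;
        for (char c : digits) {
            if (c < '0' || c > '9') return false;
            port = port * 10 + (c - '0');
        }
        if (port < 1 || port > 65535) return false;
        authority = authority.substr(0, colon);
    }
    if (authority.empty()) return false;

    out.SetHost(authority);
    out.SetPort(port);
    out.SetFile(file);
    // Assumed convention: the last path segment names the page.
    const std::string name = file.substr(file.rfind('/') + 1);
    out.SetFname(name.empty() ? "index.html" : name);
    return true;
}
```

With this sketch, parsing `http://example.com:8080/docs/index.html` yields the host `example.com`, the port 8080, the file `/docs/index.html`, and the page name `index.html`.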

Abstract

The invention discloses a multithreading-based web crawler system comprising a URL (Uniform Resource Locator) processing module, a web crawling module, a web analysis module, and a web storage module. The URL processing module obtains the host name, port number, and file name of each URL through URL-class processing. The web crawling module crawls web content in blocks and stores each captured page in a temporary storage module. The web analysis module extracts URLs, redirects them, checks them for repetition, and deletes repeated URLs. When storing a file, the web storage module checks whether the file already exists. If it does not exist, the crawled file is stored directly. If it exists and the content obtained by the current crawl is larger than the content crawled previously, the original file is overwritten; otherwise, the new file is discarded. In operation, a web page matching a regular expression is first input and a web request signal is sent; a private function is then triggered to obtain the matching content, and the specific information containing the keywords is finally obtained. Crawling is fast and efficient.
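Two rules stated in the abstract, deleting repeated URLs and deciding whether to keep a newly crawled file, can be sketched as follows. This is a minimal illustration and not the patent's implementation: the names `FilterNewUrls`, `StoreAction`, and `DecideStore` are hypothetical, and comparing content by byte count is an assumption about what "more content" means.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the URLs in `found` that have not been seen
// before, in their original order, and records them in `seen`.
std::vector<std::string> FilterNewUrls(const std::vector<std::string>& found,
                                       std::unordered_set<std::string>& seen) {
    std::vector<std::string> fresh;
    for (const std::string& url : found) {
        // insert() reports in .second whether the URL was newly added.
        if (seen.insert(url).second) fresh.push_back(url);
    }
    return fresh;
}

enum class StoreAction { Write, Overwrite, Discard };

// Storage rule as described in the abstract: store when no file exists,
// overwrite when the new crawl holds more content than the stored copy,
// and discard otherwise. Sizes are byte counts (an assumption).
StoreAction DecideStore(bool file_exists, std::size_t stored_size,
                        std::size_t new_size) {
    if (!file_exists) return StoreAction::Write;
    return new_size > stored_size ? StoreAction::Overwrite : StoreAction::Discard;
}
```

Keeping the decision separate from the file I/O makes the rule easy to check on its own; a caller would look up the stored file's size, call `DecideStore`, and then write or skip accordingly.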

Description

Technical field

[0001] The invention relates to a multithreading-based web crawler system, and in particular to a multithreading-based web crawler system and a web crawling method that crawl web pages quickly and efficiently.

Background technique

[0002] A web crawler is like a spider crawling around on the web of the Internet. Starting from the home page of a website, the crawler reads the page content and follows the link addresses it finds to reach further pages. By repeating this cycle, the crawler fetches all the required pages of the website.

[0003] Crawlers can fetch web pages automatically. In a search engine, the crawler's main function is to download web pages from the Internet, and it plays a key role in the engine. From the perspective of the crawler program, the implementation strategy and operating efficiency it adopts directly affect the search structure. Each search engine has different needs, so it is necessary to choose the bes...


Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/951

Inventor: 黄金城, 曹瑞, 袁敏

Owner: 泰州市东盛电脑科技有限公司
