The invention discloses a multithreading-based web crawler system, which comprises a URL (Uniform Resource Locator) processing module, a web crawling module, a web analysis module and a web storage module, wherein the URL processing module obtains the host name, the port number and the file name of each URL through URL-class processing; the web crawling module crawls web content in partitions and stores each captured page in a temporary storage module; the web analysis module extracts URLs, handles URL redirection, checks the URLs for duplicates and deletes the duplicate URLs; the web storage module checks whether the file already exists when a page is to be stored: if the file does not exist, the page is crawled and stored directly; if the file exists and the content obtained by the current crawl is larger than the content obtained by the previous crawl, the original file is overwritten; otherwise, the newly crawled file is discarded. In operation, a web page to be matched against a regular expression is first input and a web request signal is sent; a private function is then triggered to obtain the matched content; finally, the specific information containing the keywords is obtained. The crawling speed is high and the efficiency is high.
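
The URL processing, URL extraction, duplicate removal and storage rule described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names (`split_url`, `extract_urls`, `drop_duplicates`, `store_page`), the `href` pattern, the default port table and the use of byte length as the measure of "more content" are all assumptions made for illustration.

```python
import os
import re
from urllib.parse import urlsplit

# Assumption: default ports used when a URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Assumption: links are taken from quoted href attributes only.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def split_url(url):
    """Return (host name, port number, file name) for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    filename = parts.path or "/"
    return parts.hostname, port, filename


def extract_urls(html):
    """Extract candidate URLs from a page with a regular expression."""
    return HREF_PATTERN.findall(html)


def drop_duplicates(urls):
    """Delete repeated URLs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def store_page(path, content):
    """Apply the storage rule to crawled bytes.

    New file: store it. Existing file: overwrite only if the new
    content is larger (assumption: compared by byte length).
    Otherwise the new content is discarded.
    """
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(content)
        return "stored"
    if len(content) > os.path.getsize(path):
        with open(path, "wb") as f:
            f.write(content)
        return "overwritten"
    return "discarded"
```

In a multithreaded crawler, the set of seen URLs and the file writes would be shared between worker threads and would need to be guarded, for example with a `threading.Lock`; the sketch omits this to keep the storage rule itself visible.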