The invention provides a keyword-based topic-focused
web crawler design method. The method comprises the following steps: step (1), configuring a search URL for a topic keyword, thereby forming an
initial seed hyperlink originalURL; step (2), using the originalURL, querying a
search engine, downloading the resulting web pages, and extracting preliminary news fields from the webpage contents; step (3), according to a topic correlation
algorithm, obtaining the similarity between each news item and the topic, keeping the fields of news items relevant to the topic and putting them in a public
queue newsQueue, and filtering out news items not relevant to the topic; step (4), downloading the webpage content of the next page according to the next-page hyperlink nextPageURL, extracting the following nextPageURL and the topic-relevant fields as in step (3), putting the relevant fields into the public
queue newsQueue, and repeating step (4) until there is no next page
hyperlink nextPageURL; and step (5), taking each URL out of the newsQueue and handing it to a crawler
processing thread, that is, a
consumer thread. The keyword-based topic-focused
web crawler design method provided by the invention improves the
crawling efficiency of the topic-focused
web crawler, and enhances the effectiveness of crawled URL resources.
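
Steps (1) to (5) can be illustrated by the following minimal Python sketch, which is offered only as one possible reading of the method and not as the implementation of the invention. The topic correlation algorithm is not specified above, so a bag-of-words cosine similarity with an arbitrary threshold of 0.2 is assumed in its place. The search host search.example.com, the in-memory PAGES table that stands in for downloading and parsing result pages, and all function names (build_original_url, similarity, produce, consume, crawl) are hypothetical.

```python
import math
import queue
import re
import threading
from collections import Counter
from urllib.parse import quote_plus


def build_original_url(keyword):
    # Step (1): form the seed hyperlink (hypothetical search host).
    return "https://search.example.com/news?q=" + quote_plus(keyword)


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(text, topic):
    # Assumed stand-in for the topic correlation algorithm:
    # cosine similarity between bag-of-words vectors.
    a, b = Counter(tokens(text)), Counter(tokens(topic))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def produce(original_url, topic, fetch_page, news_queue, threshold=0.2):
    # Steps (2) to (4): walk the result pages, keep topic-relevant items.
    url, seen = original_url, set()
    while url and url not in seen:
        seen.add(url)  # guard against next-page loops
        items, next_page_url = fetch_page(url)
        for item in items:
            if similarity(item["title"], topic) >= threshold:
                news_queue.put(item)
        url = next_page_url  # None ends the loop: no next page


def consume(news_queue, handle, sentinel):
    # Step (5): a consumer thread takes items off the queue and
    # hands each URL to the processing callback.
    while True:
        item = news_queue.get()
        if item is sentinel:
            break
        handle(item["url"])


def crawl(keyword, fetch_page, n_consumers=2):
    news_queue = queue.Queue()
    sentinel = object()
    handled, lock = [], threading.Lock()

    def handle(url):
        with lock:
            handled.append(url)

    workers = [
        threading.Thread(target=consume, args=(news_queue, handle, sentinel))
        for _ in range(n_consumers)
    ]
    for w in workers:
        w.start()
    produce(build_original_url(keyword), keyword, fetch_page, news_queue)
    for _ in workers:
        news_queue.put(sentinel)  # one stop signal per consumer
    for w in workers:
        w.join()
    return sorted(handled)


# Hypothetical parsed result pages: URL -> (news items, next-page URL).
FIRST = build_original_url("solar energy")
PAGES = {
    FIRST: (
        [
            {"title": "Solar energy output hits record", "url": "https://news.example.com/a"},
            {"title": "Football cup final tonight", "url": "https://news.example.com/b"},
        ],
        FIRST + "&page=2",
    ),
    FIRST + "&page=2": (
        [{"title": "New solar panels cut energy costs", "url": "https://news.example.com/c"}],
        None,
    ),
}


def fetch_page(url):
    return PAGES[url]


if __name__ == "__main__":
    print(crawl("solar energy", fetch_page))
```

In this sketch the producer runs in the calling thread and the consumers run as worker threads that share newsQueue. A real crawler would replace fetch_page with an HTTP download and an HTML parser, and would replace the handle callback with the actual page processing.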