Berkeley DB database based topic crawler system
A topic crawler and database technology, applicable to database retrieval, network data retrieval, network data indexing, and similar fields. It addresses the problem of impractical crawling of web pages and improves system performance.
Embodiment Construction
[0015] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below. The invention is implemented using MyEclipse 8.5 and Berkeley DB.
[0016] 1. Topic crawler architecture
[0017] The topic crawler architecture, shown in Figure 1, includes the following components: page download, page analysis, relevance calculation, visited page information, URL importance score, and URL queue. The details are as follows.
[0018] (1) Page download: Take the first element from the URL priority queue, download the Web page for that URL using the Apache HttpClient tool class, and save it to the local disk.
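The page download step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `ScoredUrl`, `PageFetcher`, and `PageDownloader` are hypothetical names, the queue is assumed to be ordered by importance score with the highest score first, and the `PageFetcher` interface stands in for the Apache HttpClient call so that the example runs with the standard library only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.PriorityQueue;

/** A URL paired with its importance score (hypothetical type). */
class ScoredUrl {
    final String url;
    final double score;

    ScoredUrl(String url, double score) {
        this.url = url;
        this.score = score;
    }
}

/** Stand-in for the HTTP client; the described system uses Apache HttpClient here. */
interface PageFetcher {
    byte[] fetch(String url) throws IOException;
}

class PageDownloader {
    // Assumption: highest score first, so poll() returns the most important URL.
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());
    private final PageFetcher fetcher;
    private final Path saveDir;

    PageDownloader(PageFetcher fetcher, Path saveDir) {
        this.fetcher = fetcher;
        this.saveDir = saveDir;
    }

    void enqueue(String url, double score) {
        queue.add(new ScoredUrl(url, score));
    }

    /** Downloads the head of the queue and saves it; returns null if the queue is empty. */
    Path downloadNext() throws IOException {
        ScoredUrl next = queue.poll();
        if (next == null) {
            return null;
        }
        byte[] body = fetcher.fetch(next.url);
        // Illustrative file naming only: derive a name from the URL's hash code.
        Path target = saveDir.resolve(Integer.toHexString(next.url.hashCode()) + ".html");
        Files.write(target, body);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawler-pages");
        // Fake fetcher so the demo needs no network access.
        PageFetcher fake = url -> ("<html>" + url + "</html>").getBytes(StandardCharsets.UTF_8);
        PageDownloader downloader = new PageDownloader(fake, dir);
        downloader.enqueue("http://example.com/low", 0.2);
        downloader.enqueue("http://example.com/high", 0.9);
        System.out.println("Saved: " + downloader.downloadNext());
    }
}
```

Keeping the fetcher behind an interface means a real HTTP client can be plugged in without changing the queue logic, and the download step can be exercised offline.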
[0019] (2) Page analysis: This component analyzes the web pages that the page download module has saved to the local disk. It uses the HttpParser tool class to extract the URL, anchor text, web page title, web page content ...
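As a rough illustration of the page analysis step, the sketch below extracts the page title and each link's URL and anchor text from an HTML string. It is a simplification under stated assumptions: `Link`, `ParsedPage`, and `PageAnalyzer` are hypothetical names, and regular expressions are swapped in for the parser tool class named above. Regex matching is fragile on real-world HTML, so a proper HTML parser would be preferable in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One outgoing link: target URL plus its visible anchor text (hypothetical type). */
class Link {
    final String url;
    final String anchorText;

    Link(String url, String anchorText) {
        this.url = url;
        this.anchorText = anchorText;
    }
}

/** Result of analyzing one page (hypothetical type). */
class ParsedPage {
    final String title;
    final List<Link> links;

    ParsedPage(String title, List<Link> links) {
        this.title = title;
        this.links = links;
    }
}

class PageAnalyzer {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches <a ... href="...">...</a> with a quoted href value.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Removes nested tags and collapses runs of whitespace. */
    private static String plainText(String fragment) {
        return TAG.matcher(fragment).replaceAll("").replaceAll("\\s+", " ").trim();
    }

    static ParsedPage analyze(String html) {
        Matcher titleMatcher = TITLE.matcher(html);
        String title = titleMatcher.find() ? plainText(titleMatcher.group(1)) : "";

        List<Link> links = new ArrayList<>();
        Matcher anchorMatcher = ANCHOR.matcher(html);
        while (anchorMatcher.find()) {
            links.add(new Link(anchorMatcher.group(1).trim(), plainText(anchorMatcher.group(2))));
        }
        return new ParsedPage(title, links);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample page</title></head><body>"
                + "<a href=\"http://example.com/a\">First <b>link</b></a></body></html>";
        ParsedPage page = analyze(html);
        System.out.println("Title: " + page.title);
        for (Link link : page.links) {
            System.out.println(link.url + " -> " + link.anchorText);
        }
    }
}
```

The extracted URLs and anchor texts are the kind of input that later components, such as relevance calculation and URL importance scoring, could consume.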