Berkeley DB database based topic crawler system

A topic crawler and database technology, applicable to database retrieval, network data retrieval, and network data indexing. It addresses the impracticality of crawling every page on the Web and improves system performance.

Status: Inactive
Publication Date: 2015-10-14
XUCHANG UNIV
6 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

Although machine performance has improved considerably, crawling every page on the Web is unrealistic given the enormous number of URLs.


Examples


Embodiment Construction

[0015] To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below. It is implemented with MyEclipse 8.5 and Berkeley DB.

[0016] 1. Topic crawler architecture

[0017] The topic crawler architecture is shown in Figure 1. It includes several components: page download, page analysis, relevance calculation, visited page information, URL importance score, and URL queue. The details are as follows.

[0018] (1) Page download: Take the first element from the URL priority queue, download the corresponding Web page with the Apache HttpClient class, and save it to the local disk.

[0019] (2) Page analysis: Parses the web pages that the page download module saved to the local disk, using the HttpParser tool class, and extracts the URL, anchor text, web page title, web page content ...
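
The two modules above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the patent's Java implementation (which names HttpClient and HttpParser). The names `PageAnalyzer`, `UrlPriorityQueue`, and `anchor_score` are hypothetical, and the scoring rule (the fraction of topic words found in the anchor text) is an assumption made for the example; the source does not give the URL importance formula.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageAnalyzer(HTMLParser):
    """Collects the page title and (absolute URL, anchor text) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []        # list of (url, anchor_text)
        self._in_title = False
        self._href = None      # href of the <a> element currently open
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None


class UrlPriorityQueue:
    """Pops the highest-scoring URL first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0

    def push(self, url, score):
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def anchor_score(anchor_text, topic_words):
    """Assumed scoring rule: fraction of topic words in the anchor text."""
    text = anchor_text.lower()
    return sum(w in text for w in topic_words) / len(topic_words)
```

In use, a downloaded page would be fed to `PageAnalyzer.feed`, and each extracted link pushed into the queue with its score, so that the page download step always takes the most promising URL next.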


Abstract

The invention designs and implements a topic crawler system based on a Berkeley DB database. The system is intended as a domain information acquisition tool that fetches only web pages related to a specific topic, which saves software and hardware resources and lets pages be updated more quickly. The design works as follows: web pages are analyzed first; they are then filtered according to a topic relevance algorithm and a crawling policy; only the links of topic-related pages are retained and added to the to-be-crawled URL queue; the next URL to crawl is selected according to the crawling policy; and the process repeats until a system termination condition is met. While pages are downloaded, each URL and its summary information are inserted into the Berkeley DB database. When the database configuration object is created, deferred write is enabled for the database, so data is written to disk only after a specific amount has accumulated in memory, which improves system performance. In the parameter setting interface of the topic crawler, the user can select the topic word category to crawl, the seed URLs, and the number of threads; the operating interface shows URL information and the numbers of downloaded pages, analyzed URLs, pages waiting to be crawled, and valid pages.
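
The crawl loop and the deferred-write idea from the abstract can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the patented implementation: `DeferredWriteStore` is a hypothetical stand-in that buffers records in memory and appends them to a file once a byte threshold is reached, and it does not use the Berkeley DB API. The `relevance` function (fraction of topic words present in the page text), the 0.5 threshold, and the first-in, first-out queue are likewise assumptions; the source does not specify the relevance algorithm or the crawling policy. Pages are supplied by a caller-provided `fetch` function so the sketch runs without network access.

```python
import json
import os
import tempfile
from collections import deque


class DeferredWriteStore:
    """Buffers records in memory and appends them to disk in batches.

    Hypothetical stand-in for a database opened with deferred write.
    """

    def __init__(self, path, flush_threshold_bytes=4096):
        self.path = path
        self.flush_threshold_bytes = flush_threshold_bytes
        self._buffer = []
        self._buffered_bytes = 0
        self.flush_count = 0  # number of actual disk writes

    def put(self, url, summary):
        line = json.dumps({"url": url, "summary": summary}) + "\n"
        self._buffer.append(line)
        self._buffered_bytes += len(line.encode("utf-8"))
        if self._buffered_bytes >= self.flush_threshold_bytes:
            self.sync()

    def sync(self):
        """Write everything buffered so far to disk."""
        if not self._buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.writelines(self._buffer)
        self._buffer.clear()
        self._buffered_bytes = 0
        self.flush_count += 1


def relevance(text, topic_words):
    """Assumed relevance measure: fraction of topic words in the text."""
    text = text.lower()
    return sum(w in text for w in topic_words) / len(topic_words)


def focused_crawl(seeds, fetch, topic_words, store,
                  threshold=0.5, max_pages=100):
    """Crawl from seeds; fetch(url) returns (text, links) or None."""
    queue = deque(seeds)
    seen = set(seeds)
    downloaded = relevant = 0
    while queue and downloaded < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        downloaded += 1
        text, links = page
        if relevance(text, topic_words) < threshold:
            continue  # off-topic page: its links are not queued
        relevant += 1
        store.put(url, text[:80])  # URL plus a short summary
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    store.sync()  # flush whatever is still buffered at termination
    return {"downloaded": downloaded, "relevant": relevant,
            "queued": len(queue)}
```

The returned counters loosely mirror the figures the operating interface is described as showing (downloaded pages, valid pages, pages waiting to be crawled). Batching the writes trades durability for throughput: records still in the buffer are lost if the process dies before `sync` runs.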

Description

Technical field

[0001] The invention belongs to the field of Internet information collection, and specifically relates to a topic crawler system based on a Berkeley DB database.

Background

[0002] With the explosive growth of network resources, the number of web pages on the network has become very large. Although machine performance has improved considerably, crawling every page on the Web is unrealistic given the enormous number of URLs. For web crawlers, there is always an overabundance of URLs. Research on web crawlers has therefore shifted toward better URL selection and ranking strategies that fetch high-quality pages, or pages close to a fixed topic, first, rather than simply pursuing page coverage. Topic-oriented Web information collection (also called focused web crawling) mainly refers to information collection that selectively searches for pages related to a pre-defi...


Application Information

IPC(8): G06F17/30
CPC: G06F16/9535, G06F16/951
Inventors: 杨月华, 刘红雅
Owner: XUCHANG UNIV