Design method of theme network crawler system

A design method in the field of web crawler technology, applicable to computing, special data processing applications, instruments, etc. It addresses problems such as poor relevance of search results and slow crawling speed, and it improves the accuracy and recall of crawling results, increases crawling speed, and reduces the crawler's workload.

Status: Inactive. Publication date: 2010-01-20
KUNMING UNIV OF SCI & TECH
0 Cites, 49 Cited by

AI Technical Summary

Problems solved by technology

[0003] Purpose of the invention: existing crawler search technology returns search results of relatively poor relevance and crawls relatively slowly. To address these shortcomings, the present invention proposes a design method for a theme web crawler system based on the best-first search strategy.


Examples


Embodiment Construction

[0017] As shown in Figure 1, the web crawler system constructed by the method of the present invention mainly comprises: management interface 1, crawling database 2, thesaurus 3, topic determiner 4, webpage classifier 5, webpage selector 6, and web crawler main program 7. The topic determiner 4 is the foundation of the topic crawler. The webpage classifier 5 learns the characteristics of the crawling target, calculates each webpage's degree of relevance, and filters webpages accordingly. The webpage selector 6 calculates the importance of webpages and thereby dynamically determines the order in which they are visited.
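
The division of labour between the webpage classifier 5 and the webpage selector 6 can be sketched as follows. This is a minimal illustration under assumptions, not the patent's implementation: the names `Page` and `select_order`, the `relevance` and `importance` fields, and the threshold value 0.3 are all hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevance: float   # assumed output of the webpage classifier, in [0, 1]
    importance: float  # assumed output of the webpage selector


def select_order(pages, threshold=0.3):
    """Drop pages below the relevance threshold, then order the rest by importance."""
    heap = []
    for index, page in enumerate(pages):
        if page.relevance < threshold:
            continue  # classifier step: filter out weakly related pages
        # heapq is a min-heap, so importance is negated to pop the best page first;
        # index breaks ties so Page objects are never compared directly.
        heapq.heappush(heap, (-page.importance, index, page))
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap)[2].url)
    return ordered
```

Keeping the filter and the ordering as separate steps mirrors the split between the two components: a page must first pass the relevance check before its importance affects the visiting order.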

[0018] The design method is described in detail below:

[0019] Step (1): Establish a thesaurus, define the search topics, and assign a weight to each topic. Weights are usually set in one of two ways: feature extraction or manual setting. Feature extraction means that given a set of web pages related...
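
As an illustration of the manually set variant, a thesaurus can be represented as a mapping from topic words to weights. The sketch below is hypothetical: the example words, the weight values, and the scoring function `topic_score` (a weighted count of whitespace-separated tokens, normalized by text length) are assumptions made here and are not taken from the patent.

```python
def topic_score(text, thesaurus):
    """Average thesaurus weight per token of `text` (unknown tokens count as 0)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    total = sum(thesaurus.get(token, 0.0) for token in tokens)
    return total / len(tokens)


# Manually assigned weights (illustrative values only).
thesaurus = {"crawler": 1.0, "search": 0.5, "index": 0.5}
```

A score of this kind could then be compared against a set threshold to decide whether a page is related to the topic.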


Abstract

The invention discloses a design method for a theme web crawler system based on a best-first search strategy. The method mainly comprises the following steps: 1. establishing a theme thesaurus; 2. filtering crawled web pages and eliminating those whose theme relevance is below a set threshold; 3. computing the importance of the web pages and determining the order in which they are accessed; and 4. establishing four URL queues: a waiting queue, a running queue, a completed queue, and an exceptions queue. With this design method, the crawler's workload is greatly reduced, and the accuracy and recall of the crawling results are improved.
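
The four URL queues named in step 4 can be pictured as a small state tracker. The sketch below is an assumption-laden illustration rather than the patented design: the class `UrlQueues` and its method names are hypothetical, and the waiting queue is first-in, first-out here for brevity, whereas a best-first crawler would order it by page importance.

```python
from collections import deque


class UrlQueues:
    """Tracks each URL through the waiting, running, completed and exceptions queues."""

    def __init__(self):
        self.waiting = deque()
        self.running = set()
        self.completed = set()
        self.exceptions = set()

    def add(self, url):
        """Queue a URL unless it is already known in any of the four queues."""
        known = (url in self.waiting or url in self.running
                 or url in self.completed or url in self.exceptions)
        if known:
            return False
        self.waiting.append(url)
        return True

    def start_next(self):
        """Move the next waiting URL to the running queue and return it."""
        if not self.waiting:
            return None
        url = self.waiting.popleft()
        self.running.add(url)
        return url

    def finish(self, url, ok=True):
        """Move a running URL to completed on success, or to exceptions on failure."""
        self.running.discard(url)
        (self.completed if ok else self.exceptions).add(url)
```

Checking all four queues before adding a URL is what keeps the crawler from fetching the same page twice, including pages that previously failed.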

Description

Technical field

[0001] The present invention relates to a design method for a network data collection system, in particular to a design method for a theme web crawler system.

Background

[0002] Today's world is a world of information, and with the rapid development of the network, the volume of network information is growing exponentially. How to quickly find and obtain the information one needs or is interested in within this vast information space has therefore become one of the most fundamental problems of the information age. Most current search engines are oriented to all information and can be called comprehensive search engines. However, as information diversifies, a comprehensive search engine meant for all users clearly cannot satisfy the more in-depth query requirements of specific users. Their information needs are often for some limited areas and specific topics, and the information recall r...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 张云伟, 汪斌, 何庆华
Owner: KUNMING UNIV OF SCI & TECH