
Method for designing focused crawler

A focused-crawler design method, applied in computing, special data processing applications, instruments, etc. It addresses the problem of discarding web pages with weak domain correlation, and it improves URL-source processing efficiency, enriches the number of URL sources, and improves the effectiveness of the crawled information.

Inactive Publication Date: 2013-02-13
UNIV OF ELECTRONIC SCI & TECH OF CHINA
2 Cites, 38 Cited by

AI Technical Summary

Problems solved by technology

[0008] The above analysis of related focused crawlers shows that existing web crawlers still leave many areas open for research. For example, traditional focused crawlers often crawl only pre-specified websites or web pages. Little research addresses issues such as quickly discarding web pages with weak domain correlation based on domain information, or effectively locating the information resources to be collected.




Embodiment Construction

[0020] Specific embodiments of the present invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they would obscure the main content of the present invention.

[0021] Figure 1 is a flow chart of an embodiment of the focused crawler design method of the present invention.

[0022] In this embodiment, as shown in Figure 1, the focused crawler design method of the present invention comprises the following steps:

[0023] ST1. Configure the description information of the domain ontology and use it as the template for the focused crawler. This description information includes search keywords, filter keywords, and crawl keywords, which serve as the three levels of information in the crawler template.
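The three-level template in ST1 could be represented as a small data structure. The following is only a minimal sketch: the class name `CrawlerTemplate`, the field names, the `levels` helper, and the sample keywords are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CrawlerTemplate:
    """Hypothetical container for the domain ontology description of ST1."""

    # Level 1: search keywords.
    search_keywords: List[str] = field(default_factory=list)
    # Level 2: filter keywords.
    filter_keywords: List[str] = field(default_factory=list)
    # Level 3: crawl keywords.
    crawl_keywords: List[str] = field(default_factory=list)

    def levels(self) -> List[Tuple[str, List[str]]]:
        """Return the three keyword levels in template order."""
        return [
            ("search", self.search_keywords),
            ("filter", self.filter_keywords),
            ("crawl", self.crawl_keywords),
        ]


# Illustrative values only; a real template would be configured per domain.
template = CrawlerTemplate(
    search_keywords=["focused crawler"],
    filter_keywords=["crawler", "URL"],
    crawl_keywords=["title", "abstract"],
)
for name, keywords in template.levels():
    print(name, keywords)
```

Keeping the three keyword sets as separate fields mirrors the description of them as distinct levels of one template, so later steps can each read only the level they need.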

[0024] In this implem...


Abstract

The invention discloses a method for designing a focused crawler. In this method, a search engine is used to find domain-specific web page URL (Uniform Resource Locator) sources rather than relying on a few specific websites, which enriches the URL sources. Seed URLs are then selected for source crawling with a certain probability according to the domain correlation degree of each URL source; that is, URL sources with weak domain correlation are skipped with a certain probability, which improves the processing efficiency of the URL sources and makes source crawling faster. Finally, weight analysis of web page tags is used to obtain the targeted information to be crawled, which improves the effectiveness of the information.
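The probabilistic seed selection described in the abstract might look roughly like the sketch below. The abstract does not state how the probability is derived from the domain correlation degree, so this example assumes the simplest mapping: the correlation score, clamped to [0, 1], is used directly as the probability of keeping a URL. The function name `select_seeds`, the score dictionary, and the example URLs are all hypothetical.

```python
import random
from typing import Dict, List, Optional


def select_seeds(
    relevance: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Keep each URL source with probability equal to its relevance score.

    Assumption: the score is already a domain correlation degree in [0, 1];
    out-of-range values are clamped. Weakly correlated URLs are therefore
    usually, but not always, skipped.
    """
    rng = rng or random.Random()
    seeds = []
    for url, score in relevance.items():
        keep_probability = min(1.0, max(0.0, score))
        # rng.random() is uniform on [0, 1), so a score of 1.0 always keeps
        # the URL and a score of 0.0 never does.
        if rng.random() < keep_probability:
            seeds.append(url)
    return seeds


# Illustrative scores only.
scores = {
    "http://example.com/strong": 0.9,
    "http://example.com/weak": 0.1,
}
print(select_seeds(scores, random.Random(0)))
```

Skipping weak sources probabilistically rather than with a hard threshold still leaves them a small chance of being crawled, which matches the abstract's wording that such sources are "not processed with a certain probability".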

Description

Technical field

[0001] The invention belongs to the technical field of network information processing, and in particular relates to a design method for a focused crawler.

Background technique

[0002] With the rapid development of the Internet, it has become the carrier of a vast amount of information, and that information is growing explosively. These massive Internet information resources hold great potential value, and extracting and using them effectively and quickly has become a major problem. Various web crawler technologies have emerged in response, such as traditional general web crawlers, topic web crawlers, incremental web crawlers, and deep web crawlers.

[0003] A web crawler is a program that automatically fetches (crawls) relevant and useful web page resources from the Internet.

[0004] Traditional web crawlers start from one or several initial URLs, crawl webpage source code information and extract new URLs from...


Application Information

IPC(8): G06F17/30
Inventor: 陈端兵, 高辉, 傅彦, 张博
Owner UNIV OF ELECTRONIC SCI & TECH OF CHINA