A Focused Crawler Method Based on Link Analysis

A focused crawler technique based on link analysis, applicable to network data navigation, special data processing applications, instruments, etc. It addresses the low accuracy and efficiency of existing focused crawlers when fetching web pages and simplifies processing.

Active Publication Date: 2017-10-20
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0005] To address the shortcomings of the prior art, the present invention provides a focused crawler method based on link analysis that solves the low accuracy and efficiency of existing focused crawlers when fetching webpages.



Examples


Embodiment Construction

[0048] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0049] A focused crawler method based on link analysis, comprising the following steps:

[0050] (1) Fetch webpages and compare the structure of each with that of the target sample webpage to identify target webpages. Starting from the website's entry link, record every link path the crawler follows to a target webpage, and build the target webpage link tree.

[0051] The specific steps for establishing the link tree of the target web page are as follows:

[0052] (11) Select a target webpage as the target sample webpage, against which the structure of each downloaded webpage is compared;

[0053] (12) Initialize the link tree, that is, set the link tree to an empty tree;

[0054] (13) Initialize the link queue and add the website's entry link to the end of the link queue. The link queue is a storage structure used to store the links extracted from the webp...
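Steps (11) to (13) can be illustrated with a minimal Python sketch. It is not the patent's implementation: `LinkNode`, `build_link_tree`, `path_to`, the in-memory `SITE` map, and the `is_target` predicate (a stand-in for the structural comparison against the target sample webpage) are all hypothetical names introduced here, and the sketch assumes a breadth-first traversal that keeps every visited link in the tree.

```python
from collections import deque


class LinkNode:
    """One link in the link tree; parent pointers let us recover the path."""

    def __init__(self, url, parent=None):
        self.url = url
        self.parent = parent
        self.children = []


def build_link_tree(entry_url, fetch_links, is_target):
    """Crawl breadth-first from entry_url and return (root, target_nodes).

    fetch_links(url) -> list of links found on that page (assumed helper).
    is_target(url)   -> True if the page matches the sample page (assumed helper).
    """
    root = None                          # step (12): the link tree starts empty
    queue = deque([(entry_url, None)])   # step (13): entry link at the end of the queue
    seen = {entry_url}
    targets = []
    while queue:
        url, parent = queue.popleft()
        node = LinkNode(url, parent)
        if parent is None:
            root = node
        else:
            parent.children.append(node)
        if is_target(url):
            targets.append(node)
        for link in fetch_links(url):
            if link not in seen:         # avoid revisiting the same link
                seen.add(link)
                queue.append((link, node))
    return root, targets


def path_to(node):
    """Return the list of links from the entry link down to node."""
    path = []
    while node is not None:
        path.append(node.url)
        node = node.parent
    return path[::-1]


# Hypothetical three-level site used only for illustration.
SITE = {
    "http://example.com/": ["http://example.com/list/1", "http://example.com/about"],
    "http://example.com/list/1": ["http://example.com/item/101", "http://example.com/item/102"],
}

root, targets = build_link_tree(
    "http://example.com/",
    fetch_links=lambda url: SITE.get(url, []),
    is_target=lambda url: "/item/" in url,   # stand-in for structure comparison
)
target_paths = [path_to(t) for t in targets]
```

Each entry of `target_paths` is one recorded link path from the entry link to a target webpage, which is the information the link tree is meant to hold.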



Abstract

A focused crawler method based on link analysis, belonging to the fields of Internet information retrieval, search engines, etc., that addresses the low crawling accuracy of existing crawlers. The method compares webpage structure to identify target webpages; starting from the website's entry link, it records each link path from the crawler to a target webpage and builds a link tree of the target webpages. It then analyzes this link tree, generalizes the links on the paths to the target webpages, and replaces the links in the link tree with the generalized forms to produce a link template tree. The crawler uses the link template tree for navigation and fetches only the webpage links that match it, until the crawl cycle ends and all target webpages have been fetched. Because the link template tree guides it, the crawler fetches only valid links, which improves the efficiency and accuracy of crawling.
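The link template tree can be sketched as follows. The text visible here does not say how links are generalized, so this sketch assumes one simple rule: every run of digits in a link becomes the wildcard `\d+`. The names `to_template`, `build_template_tree`, and `allowed_links`, and the example URLs, are hypothetical.

```python
import re


def to_template(url):
    """Turn a link into a regex in which each digit run is replaced by \\d+.

    This generalization rule is an assumption made for illustration.
    """
    pieces = re.split(r"(\d+)", url)
    # With one capture group, odd positions hold the captured digit runs.
    return "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(pieces))


def build_template_tree(paths):
    """Merge link paths into a nested dict: template -> child templates."""
    tree = {}
    for path in paths:
        level = tree
        for url in path:
            level = level.setdefault(to_template(url), {})
    return tree


def allowed_links(level, links):
    """Keep only links matching a template at this level of the tree.

    Returns (link, sub_level) pairs so the crawler knows which templates
    apply to the links found on the next page.
    """
    kept = []
    for link in links:
        for template, sub_level in level.items():
            if re.fullmatch(template, link):
                kept.append((link, sub_level))
                break
    return kept


# Two recorded paths to target pages (hypothetical example data).
paths = [
    ["http://example.com/", "http://example.com/list/1", "http://example.com/item/101"],
    ["http://example.com/", "http://example.com/list/1", "http://example.com/item/102"],
]
tree = build_template_tree(paths)

# Links found on the entry page during a later crawl.
entry_level = tree[to_template("http://example.com/")]
found = ["http://example.com/list/2", "http://example.com/about"]
followed = [link for link, _ in allowed_links(entry_level, found)]
```

In this example the two recorded paths collapse into a single chain of templates, and only `http://example.com/list/2` is followed from the entry page, while the unrelated `about` link is skipped.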

Description

Technical field

[0001] A focused crawler method based on link analysis, used to guide crawlers so that they fetch webpages accurately. It relates to the fields of Internet information retrieval, search engines, etc., and specifically to webpage link analysis that builds a link template tree.

Background

[0002] Massive Web data poses unprecedented challenges for information retrieval, and general-purpose search engine technology is the main solution for Web information retrieval. With general-purpose search engines such as Google, Baidu, and Bing, users can conveniently enter keywords to obtain the Web information they need.

[0003] Crawler technology is an inseparable part of search engines. The Internet provides people with massive amounts of knowledge and information, and crawler technology is used to automatically download Web content from these massive Web resources. A crawl starts from a set of initial links, called seed links, and then uses these seed links as...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/954
Inventors: 屈鸿, 周安林, 张马路, 孙明, 邵领
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA