Anchor Text-Based Focused Web Crawler Search Method and System

A focused web crawler search technology, applied in the field of web crawler search methods and systems. It addresses problems such as slow crawling and slow or stalled growth in topic relevance, and meets the requirement of efficient information collection.

Active Publication Date: 2011-12-28
INST OF AUTOMATION CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0005] The present invention proposes an anchor text-based focused web crawler search method and system to solve the following technical problem with the existing topic relevance algorithms in the prior art: although a crawler guided by the existing breadth-first algorithm can accumulate topic relevance stably G

Embodiment Construction

[0021] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

[0022] Figure 1 is a flow chart of the anchor text-based focused web crawler search method according to the present invention. The method includes the following steps:

[0023] Step 1: The web crawler downloader obtains a URL from the URL priority queue, downloads the corresponding web page from the Internet, and stores it in the web page library, which holds the downloaded web pages.

[0024] The URL priority queue is divided into a main URL priority queue and a backup URL priority queue. When the system starts, the seed URLs specified by the user are stored in the main priority queue, and the backup priority queue is empty; the downloader starts from the URL priority queue. ...
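
The two-queue arrangement in [0024] can be illustrated with a minimal Python sketch. The paragraph is truncated before it explains how the backup queue is used, so several details here are assumptions rather than the patent's method: the class name `TwoTierURLQueue`, the seed score of 1.0, the duplicate check, and the rule that the backup queue is consulted only when the main queue is empty.

```python
import heapq
import itertools


class TwoTierURLQueue:
    """Main and backup URL priority queues; higher scores are served first."""

    def __init__(self, seed_urls):
        self._main = []
        self._backup = []
        self._counter = itertools.count()  # tie-breaker: equal scores pop in insertion order
        self._seen = set()                 # assumption: each URL is queued at most once
        # At start-up the seed URLs go into the main queue; the backup queue is empty.
        for url in seed_urls:
            self.push(url, score=1.0)      # assumption: seeds get the top score

    def push(self, url, score, backup=False):
        """Queue a URL; return False if it was already queued before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heap = self._backup if backup else self._main
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        """Return the next URL to download, or None when both queues are empty."""
        # Assumption: the backup queue is used only once the main queue is empty.
        for heap in (self._main, self._backup):
            if heap:
                _, _, url = heapq.heappop(heap)
                return url
        return None

    def sizes(self):
        """Return (main queue length, backup queue length)."""
        return len(self._main), len(self._backup)
```

A downloader loop would call `pop()` until it returns `None`, pushing newly scored URLs back with `push()` as pages are parsed.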

Abstract

The invention discloses an anchor text-based focused web crawler search method and a system thereof. The method mainly comprises the following steps: obtaining a URL (uniform resource locator) from a URL priority queue and downloading the corresponding Web page from the Internet; parsing the downloaded Web page and extracting its URLs and their anchor texts; filtering the extracted URLs and anchor texts; and using an algorithm that combines TF-IDF (term frequency-inverse document frequency) and LSI (latent semantic indexing) to calculate the topic relevance of each URL, and putting the URLs that satisfy the condition into the priority queue. The system comprises a URL priority queue, a web crawler downloader, a Web page library, a URL parser, a URL filter and a topic relevance identifier. With this method and system, both the topic relevance of the focused web crawler's results and its crawling efficiency are improved.
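
The parsing and scoring steps in the abstract can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the patented algorithm: it covers the TF-IDF part only and omits the LSI step, and the tokenizer, the smoothed IDF formula, the cosine measure, the `threshold` value, and all function and class names are hypothetical choices made for the example.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, anchor))
            self._href = None


def tokenize(text):
    # Assumption: lowercase ASCII word tokens are enough for this illustration.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_links(html, base_url, topic, threshold=0.1):
    """Return (url, anchor text, score) for links scoring at least `threshold`."""
    parser = AnchorExtractor(base_url)
    parser.feed(html)
    parser.close()
    anchors = [text for _, text in parser.links]
    vectors = tfidf_vectors([topic] + anchors)
    topic_vec, anchor_vecs = vectors[0], vectors[1:]
    scored = [(url, text, cosine(topic_vec, vec))
              for (url, text), vec in zip(parser.links, anchor_vecs)]
    return [item for item in scored if item[2] >= threshold]
```

In a crawler built along these lines, the URLs returned by `score_links` would be pushed into the priority queue with their scores, while links whose anchor text shares no vocabulary with the topic description are dropped.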

Description

Technical field

[0001] The invention relates to a crawler search method and a system thereof, in particular to a focused web crawler search method and a system thereof.

Background art

[0002] At present, the Internet has increasingly become the main channel through which people obtain information, and traditional search engines can no longer fully meet people's needs. As artificial intelligence technology matures and information services diversify, search engine technology is developing toward intelligence, personalization and domain specialization.

[0003] Vertical search engines are specialized search engines for specific fields. They aim to narrow the overall scope of a search, thereby achieving higher search accuracy and improving the search engine's ability to track network resources. As the core part of the vertical search engine, the focused web crawler takes on the important task of collecting and updating information from the Inte...

Application Information

IPC(8): G06F17/30
Inventors: 郝红卫, 台宪青, 王艳军, 殷绪成
Owner: INST OF AUTOMATION CHINESE ACAD OF SCI