Method for implementing topical crawler system based on learning URL string information

A technology concerning character string information and topical crawlers, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. Effects: computational complexity.

Inactive Publication Date: 2012-09-12
HANGZHOU DIANZI UNIV
Cites: 3; Cited by: 33

AI Technical Summary

Problems solved by technology

However, the existing approach cannot reflect the overall structure of the Web, and it has drawbacks such as high computational complexity and a threshold that is difficult to determine.

Examples

Detailed Description of the Embodiments

[0026] The present invention is further described below with reference to the accompanying drawing and a specific implementation process:

[0027] The implementation process of the present invention is described following the steps shown in Figure 1:

[0028] Step 1 - Select the seed URLs:

[0029] According to a given topic, combined with machine learning and manual selection, the URLs of K webpages related to the topic are selected as seed URLs, and the webpage downloader starts to download webpages from the seed URLs.
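
The seeding step can be sketched as a priority frontier from which the downloader takes its next URL. This is only a minimal sketch under stated assumptions: the `Frontier` class, the `seed_frontier` helper, and the initial score of 1.0 given to seeds are hypothetical names and values chosen for illustration, not taken from the patent.

```python
import heapq
from itertools import count


class Frontier:
    """Priority queue of URLs to download, highest relevance score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # preserves insertion order among equal scores

    def push(self, url, score):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._tie), url))
        return True

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def seed_frontier(seed_urls, seed_score=1.0):
    """Build a frontier from the K seed URLs (seed_score is an assumption)."""
    frontier = Frontier()
    for url in seed_urls:
        frontier.push(url, seed_score)
    return frontier
```

A priority queue is used here rather than a plain FIFO queue so that later steps can insert newly discovered URLs with their computed topic relevance as the priority.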

[0030] Step 2 - Analyze the downloaded page:

[0031] The webpage analyzer analyzes the downloaded webpage content and links, and extracts URL string information, webpage content, and anchor information of the webpage.
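
The analysis step can be illustrated with Python's standard `html.parser` and `urllib.parse` modules. The sketch below is an assumption about how such an analyzer might look, not the patent's implementation: `LinkExtractor`, `url_tokens`, and `analyze_page` are hypothetical names, and splitting the URL on non-alphanumeric characters is just one simple way to obtain "URL string information".

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, anchor text) pairs and the page text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # list of (url, anchor_text)
        self.text_parts = []   # text fragments (script/style not filtered)
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_parts.append(text)
            if self._href is not None:
                self._anchor.append(text)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None


def url_tokens(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parts = urlparse(url)
    raw = (parts.netloc + parts.path).lower()
    return [t for t in re.split(r"[^a-z0-9]+", raw) if t]


def analyze_page(base_url, html):
    """Return page content plus URL, anchor text and URL tokens per link."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.text_parts),
        "links": [
            {"url": u, "anchor": a, "url_tokens": url_tokens(u)}
            for u, a in parser.links
        ],
    }
```

Relative links are resolved against the page's own URL with `urljoin`, so every extracted link can be queued directly for download.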

[0032] Step 3 - Calculate topic relevance:

[0033] 1) Relevance calculation model:

[0034] The topic relevance calculation model adopts the vector space model, as follows:

[0035] (1)

[0036] in Indicates the w...
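
Equation (1) and its symbol definitions are not reproduced in this text, so the patent's exact formula is unknown here. As a hedged illustration only, a common way to compute relevance in a vector space model is the cosine similarity between a topic term vector and a document term vector. The sketch below assumes raw term frequencies as weights; the function names and the weighting scheme are assumptions, not the patent's definitions.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a term-frequency vector (assumed weighting: raw counts)."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)
```

With non-negative weights the score lies between 0 (no shared terms) and 1 (identical direction), which makes it convenient to compare against a relevance threshold.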

Abstract

The invention discloses a method for implementing a topical crawler system based on learning URL string information. First, the traditional relevance judgment method of topical crawlers is improved: a method is proposed for judging the relevance between a target URL and a topic based on URL string information, web page content, and anchor information. The information carried by URL strings is learned continuously through machine learning to update the topic's relevance vector dynamically, which improves the accuracy of the relevance judgment between the target URL and the topic. Finally, a crawling strategy that combines content analysis and link analysis is adopted without increasing computational complexity, which prevents the topical crawler from becoming trapped in a local optimum, improves the global coverage of the crawl, and improves the crawler's efficiency. The method can be used in the crawler module of a vertical search engine to crawl web pages of a particular field.
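
The two ideas in the abstract, scoring a target URL from three information sources and updating the topic vector from URL strings, can be sketched as follows. This is an illustrative assumption rather than the patented method: the weights, the threshold, the learning rate, and the names `combined_relevance` and `learn_from_url` are all hypothetical, and cosine similarity stands in for the relevance formula that is not reproduced in this text.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def combined_relevance(topic, url_tokens, anchor_tokens, content_tokens,
                       weights=(0.3, 0.3, 0.4)):
    """Weighted sum of three cosine scores (weights are illustrative)."""
    w_url, w_anchor, w_content = weights
    return (w_url * cosine(topic, Counter(url_tokens))
            + w_anchor * cosine(topic, Counter(anchor_tokens))
            + w_content * cosine(topic, Counter(content_tokens)))


def learn_from_url(topic, url_tokens, page_score, threshold=0.5, rate=0.1):
    """Return an updated copy of the topic vector.

    If a downloaded page scored at or above the (assumed) threshold,
    the tokens of its URL string are reinforced by the (assumed) rate.
    Otherwise the topic vector is returned unchanged.
    """
    updated = dict(topic)
    if page_score >= threshold:
        for tok in url_tokens:
            updated[tok] = updated.get(tok, 0.0) + rate
    return updated
```

In this sketch, a URL token such as `sports` that repeatedly appears in URLs of relevant pages gradually gains weight in the topic vector, so later URLs containing it score higher before their pages are downloaded.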

Description

Technical field

[0001] The invention belongs to the technical fields of data mining and search engines, and in particular relates to a method for implementing a topical crawler system based on learning URL string information.

Background

[0002] With the rapid growth of information on the Internet and users' rising expectations of search engines, the limitations of traditional search engines, such as low coverage, poor timeliness, inaccurate results, and too many irrelevant results, are gradually emerging. To solve these problems, researchers have proposed vertical search engines, which focus on content search in a specific field. The topical crawler system is the core component of a vertical search engine. Its main goal is to collect as many high-quality web pages related to a specified topic as possible under limited time and network bandwidth, while ignoring unrelated or low-quality web pages. quality we...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 徐向华, 任祖杰, 万健, 殷昱煜, 胡昔祥
Owner: HANGZHOU DIANZI UNIV