Method for implementing topical crawler system based on learning URL string information

Technology: URL string information and topical crawlers, applied in electrical digital data processing, special data processing applications, instruments, etc. Effect addressed: computational complexity.

Status: Inactive | Publication date: 2012-09-12

HANGZHOU DIANZI UNIV

Cites: 3 | Cited by: 33

AI Technical Summary

Problems solved by technology

The traditional relevance-judgment approach for topical crawlers cannot reflect the overall structure of the Web, and it has drawbacks such as high computational complexity and a threshold that is difficult to determine.

Embodiment Construction

[0026] The present invention is further described below with reference to the accompanying drawing and a specific implementation:

[0027] The implementation process of the present invention is illustrated by the execution steps shown in Figure 1:

[0028] Step 1 - Select the seed URLs:

[0029] For a given topic, the URLs of K web pages related to the topic are selected as seed URLs through a combination of machine learning and manual selection, and the web page downloader starts downloading web pages from the seed URLs.

[0030] Step 2 - Analyze the downloaded page:

[0031] The webpage analyzer analyzes the downloaded webpage content and links, and extracts URL string information, webpage content, and anchor information of the webpage.
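The analysis step above can be sketched in Python using only the standard library. This is a minimal illustration, not the patent's analyzer: the names `LinkExtractor`, `url_tokens`, and `analyze_page` are hypothetical, URL strings are tokenized by simply splitting on non-alphabetic characters (an assumption), and script or style text is not filtered out of the page content.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, anchor text) pairs and the page text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # list of (url, anchor_text)
        self.text_parts = []   # text fragments found in the page
        self._href = None      # href of the <a> currently open, if any
        self._anchor = []      # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._anchor).split())
            self.links.append((self._href, anchor))
            self._href = None


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphabetic tokens."""
    parsed = urlparse(url)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", parsed.netloc + parsed.path)]


def analyze_page(base_url, html):
    """Return the page content plus URL string and anchor info for each link."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    content = " ".join(" ".join(parser.text_parts).split())
    return {
        "content": content,
        "links": [
            {"url": u, "anchor": a, "url_tokens": url_tokens(u)}
            for u, a in parser.links
        ],
    }
```

Each outgoing link then carries the three inputs named in this step: the URL string tokens, the anchor text, and the content of the page it was found on.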

[0032] Step 3 - Calculation of topic relevance:

[0033] 1) Correlation calculation model:

[0034] The topic correlation calculation model adopts the vector space model, as follows:

[0035] (1)

[0036] in Indicates the w...
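Equation (1) and its symbol definitions are not reproduced in this text, so the following is only a sketch of the cosine measure that is commonly used with a vector space model, not the patent's formula. Using raw term counts as the term weights is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Build a sparse term vector; raw term frequency is the assumed weight."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)
```

With this measure, a document whose terms match the topic vector exactly scores 1, and a document sharing no terms with the topic scores 0.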

Abstract

The invention discloses a method for implementing a topical crawler system based on learning URL string information. First, the traditional relevance-judgment method of topical crawlers is improved: the relevance between a target URL and the topic is judged from URL string information, web page content, and anchor information. Second, machine learning is used to continuously learn the information carried by URL strings and dynamically update the topic-relevant vectors, which improves the accuracy of the relevance judgment between the target URL and the topic. Finally, a crawling strategy that combines content analysis and link analysis is adopted without increasing computational complexity; this prevents the topical crawler from becoming trapped in a local optimum, improves the global coverage of the crawl, and improves the crawler's efficiency. The method can be used in the crawler module of a vertical search engine to crawl web pages of a particular field.
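One way to picture the idea in the abstract is the Python sketch below. It is an illustration under stated assumptions, not the patented method: the weighted sum, the weight values, the threshold, and the additive update rule are all hypothetical, every name is invented for this example, and the link-analysis part of the crawling strategy is not implemented.

```python
import heapq
import math
from collections import Counter

# Hypothetical values; the source text does not give any of these.
WEIGHTS = {"url": 0.3, "anchor": 0.3, "content": 0.4}
RELEVANCE_THRESHOLD = 0.5  # assumed cut-off for learning from a page
LEARNING_RATE = 0.1        # assumed step size for the topic update


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def link_score(topic, url_tokens, anchor_tokens, content_tokens):
    """Assumed weighted sum of the three relevance signals for a target URL."""
    return (
        WEIGHTS["url"] * cosine(topic, Counter(url_tokens))
        + WEIGHTS["anchor"] * cosine(topic, Counter(anchor_tokens))
        + WEIGHTS["content"] * cosine(topic, Counter(content_tokens))
    )


def update_topic(topic, url_tokens, page_relevance):
    """Fold URL tokens of a page judged relevant into the topic vector."""
    if page_relevance >= RELEVANCE_THRESHOLD:
        for token in url_tokens:
            topic[token] += LEARNING_RATE
    return topic


class Frontier:
    """Best-first queue: the highest-scoring unseen URL is crawled next."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # min-heap, so negate

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, a URL token that keeps appearing on relevant pages gradually gains weight in the topic vector, so later links whose URL strings contain it are ranked higher in the frontier.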

Description

Technical field

[0001] The invention belongs to the technical fields of data mining and search engines, and in particular relates to a method for implementing a topical crawler system based on learning URL string information.

Background art

[0002] With the rapid growth of the amount of information on the Internet and people's increasingly high requirements for search engines, the limitations of traditional search engines are gradually emerging: low coverage, poor timeliness, inaccurate results, and too many irrelevant results. To solve these problems, researchers have proposed the vertical search engine, which focuses on content search in a specific field. The topical crawler system is the core part of a vertical search engine. Its main goal is to collect as many high-quality web pages related to a specified topic as possible under limited time and network bandwidth, ignoring unrelated or low-quality web pages. quality we...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventor: 徐向华, 任祖杰, 万健, 殷昱煜, 胡昔祥

Owner: HANGZHOU DIANZI UNIV
