Method for implementing topical crawler system based on learning URL string information

A method for implementing a topical crawler system that learns URL string information, applied in the field of topical crawler systems. It addresses the problems that thresholds are difficult to determine, that the overall structure of the Web is not reflected, and that computational complexity is high, and it achieves improved accuracy and reduced computational complexity.

Status: Inactive. Publication date: 2014-08-13

HANGZHOU DIANZI UNIV

Cites: 3. Cited by: 0.

Problems solved by technology

Existing crawling strategies cannot reflect the overall structure of the Web, and they suffer from high computational complexity and thresholds that are difficult to determine.

Embodiment Construction

[0026] The present invention will be further described below in conjunction with the drawings and the specific implementation and application process:

[0027] With reference to Figure 1, the implementation process of the present invention is described step by step:

[0028] Step 1: Select the seed URLs:

[0029] For a given topic, the URLs of K web pages related to the topic are selected as seed URLs through a combination of machine learning and manual selection, and the web page downloader starts downloading web pages from the seed URLs.

[0030] Step 2: Analyze the downloaded pages:

[0031] The web page analyzer analyzes the downloaded web content and links, and extracts the URL string information, web content, and anchor information of the web page.
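The analysis in Step 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation. The names `LinkExtractor`, `url_tokens`, and `analyze`, and the choice to tokenize a URL by splitting on non-alphanumeric characters, are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import re


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, anchor text) pairs and the page's text."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # list of (url, anchor_text)
        self.text_parts = []   # text fragments found in the page
        self._href = None      # href of the <a> element currently open
        self._anchor = []      # text fragments inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._href is not None:
            self._anchor.append(stripped)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None


def url_tokens(url):
    """Split host, path and query into lowercase word tokens (assumed scheme)."""
    parts = urlsplit(url)
    raw = " ".join([parts.netloc, parts.path, parts.query])
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", raw) if t]


def analyze(base_url, html):
    """Return page content plus, per link, its URL, anchor text and URL tokens."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.text_parts),
        "links": [
            {"url": u, "anchor": a, "url_tokens": url_tokens(u)}
            for u, a in parser.links
        ],
    }
```

Note that this sketch also collects text inside `<script>` and `<style>` elements; a real analyzer would likely filter those out.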

[0032] Step 3: Topic relevance calculation:

[0033] 1) Relevance calculation model:

[0034] The topic relevance calculation model uses the vector space model as follows:

[0035] (1)

[0036] where ... indicates the weight value of the feature ...
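Equation (1) is not reproduced in the text above. As a hedged illustration only: a vector space model commonly scores relevance as the cosine similarity between two feature-weight vectors. The sketch below assumes that form, and it assumes raw term frequency as the feature weight. Neither choice is confirmed by the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_weights(tokens):
    """Raw term-frequency weights; an assumed stand-in for the feature weights."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse weight vectors (dict-like)."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


# Usage: compare a topic vector with the tokens of a page.
topic = {"football": 1.0, "league": 1.0}
page = term_weights(["football", "football", "weather"])
relevance = cosine_similarity(topic, page)
```

The score lies between 0 and 1 for non-negative weights, which makes it convenient to compare against a relevance threshold.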

Abstract

The invention discloses a method for implementing a topical crawler system based on learning URL string information. First, the traditional relevance judgment method of topical crawlers is improved: a method is proposed for judging the relevance between a target URL and a topic based on URL string information, web content, and anchor information. Second, the information carried by URL strings is learned continuously using machine learning to dynamically update the topic relevance vectors, which improves the accuracy of the relevance judgment between the target URL and the topic. Finally, a crawling strategy that combines content analysis and link analysis is adopted without increasing computational complexity, which prevents the topical crawler from becoming trapped in local optima, improves the global coverage of the crawl, and improves crawler efficiency. The method can be used in the crawler module of a vertical search engine to crawl web pages in a particular field.
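One way to picture the combination of URL string, anchor, and content evidence, together with the continuous learning of URL string information, is the sketch below. It is an assumption-laden illustration, not the claimed method: the class `TopicModel`, the weighted-sum scoring, the weights `(0.4, 0.3, 0.3)`, the learning rate, and the relevance threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TopicModel:
    """Hypothetical topic representation: one vector for text, one for URL tokens."""

    def __init__(self, text_terms, url_terms, rate=0.2):
        self.text_vec = {t: 1.0 for t in text_terms}
        self.url_vec = {t: 1.0 for t in url_terms}
        self.rate = rate  # assumed learning rate for URL vector updates

    def score_link(self, url_tokens, anchor_tokens, parent_tokens,
                   weights=(0.4, 0.3, 0.3)):
        """Weighted sum of URL, anchor and parent-page content similarity."""
        w_url, w_anchor, w_content = weights
        return (w_url * cosine(Counter(url_tokens), self.url_vec)
                + w_anchor * cosine(Counter(anchor_tokens), self.text_vec)
                + w_content * cosine(Counter(parent_tokens), self.text_vec))

    def learn_url(self, url_tokens, page_relevance, threshold=0.5):
        """Fold the URL tokens of a page judged relevant into the URL vector."""
        if page_relevance < threshold:
            return  # pages judged irrelevant do not update the topic
        for token, count in Counter(url_tokens).items():
            old = self.url_vec.get(token, 0.0)
            # Exponential moving average toward the observed token count.
            self.url_vec[token] = (1 - self.rate) * old + self.rate * count


# Usage: a URL with no known topic tokens scores 0 until similar URLs are learned.
model = TopicModel(["football", "league"], ["football"])
candidate = ["example", "com", "soccer", "news"]
before = model.score_link(candidate, [], [])
model.learn_url(["example", "com", "soccer", "results"], page_relevance=0.8)
after = model.score_link(candidate, [], [])
```

In this sketch, generic tokens such as `com` are learned along with topical ones; a practical system would likely filter or down-weight them.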

Description

Technical field

[0001] The invention belongs to the technical field of data mining and search engines, and particularly relates to a method for implementing a topical crawler system based on learning URL string information.

Background art

[0002] With the rapid growth of information on the Internet and rising user expectations of search engines, the limitations of traditional search engines, such as low coverage, poor timeliness, inaccurate results, and too many irrelevant results, are becoming increasingly apparent. To solve these problems, researchers have proposed vertical search engines that focus on content search in a specific field. The topical crawler system is the core part of a vertical search engine. Its main goal is to collect as many high-quality web pages related to a specified topic as possible under limited time and network bandwidth, while ignoring pages that are not related to the specified topic or some low-quality ...

Application Information

Patent Type & Authority: Patents (China)

IPC: IPC(8): G06F17/30

Inventor: 徐向华, 任祖杰, 万健, 殷昱煜, 胡昔祥

Owner: HANGZHOU DIANZI UNIV
