Method for implementing topical crawler system based on learning URL string information

A technology for learning URL string information, applied in the field of topical crawler systems. It addresses the problems that existing methods cannot reflect the overall structure of the Web, have high computational complexity, and rely on thresholds that are difficult to determine, and it achieves the effect of improving accuracy and reducing computational complexity.

Status: Inactive
Publication Date: 2014-08-13
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

Existing approaches cannot reflect the overall structure of the Web, and they suffer from high computational complexity and thresholds that are difficult to determine.

Embodiment Construction

[0026] The present invention will be further described below in conjunction with the drawings and the specific implementation and application process:

[0027] The implementation process of the present invention is described step by step with reference to Figure 1:

[0028] Step 1-Select the seed URLs:

[0029] For a given topic, the URLs of K web pages related to the topic are selected as seed URLs by combining machine learning with manual selection, and the web page downloader starts downloading web pages from these seed URLs.
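The seed selection in Step 1 can be sketched as follows. This is only an illustration under assumptions: the text does not say how the machine-learned and manual selections are merged, so the function name `select_seeds`, the idea of a per-URL score from some classifier, and the rule "manual picks first, then highest-scored candidates" are all hypothetical.

```python
def select_seeds(manual_urls, scored_candidates, k):
    """Pick k seed URLs: manual picks first, then highest-scored candidates.

    manual_urls: URLs chosen by a person (assumed already topic-relevant).
    scored_candidates: dict mapping URL -> relevance score produced by some
        learned model (hypothetical; the source does not name one).
    """
    seeds = []
    for url in manual_urls:
        if url not in seeds:
            seeds.append(url)
    # Rank the machine-scored candidates from most to least relevant.
    ranked = sorted(scored_candidates.items(), key=lambda kv: kv[1], reverse=True)
    for url, _score in ranked:
        if len(seeds) >= k:
            break
        if url not in seeds:
            seeds.append(url)
    return seeds[:k]


seeds = select_seeds(
    ["http://example.com/a"],
    {"http://example.com/b": 0.9, "http://example.com/c": 0.5,
     "http://example.com/a": 0.7},
    k=2,
)
```

The resulting list would then be handed to the downloader as the initial crawl frontier.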

[0030] Step 2-Analyze the downloaded pages:

[0031] The web page analyzer analyzes the downloaded web content and links, and extracts the URL string information, web content, and anchor information of the web page.
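A minimal sketch of such an analyzer, using only Python's standard library, is shown below. The class name `PageAnalyzer`, the function `analyze_page`, and the shape of the returned dictionary are assumptions made for illustration; the source does not describe the analyzer's internals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageAnalyzer(HTMLParser):
    """Collects visible text plus (absolute URL, anchor text) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []          # list of (absolute_url, anchor_text)
        self.text_parts = []     # visible text fragments
        self._href = None        # URL of the <a> currently open, if any
        self._anchor_parts = []
        self._skip_depth = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._anchor_parts = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._anchor_parts).split())
            self.links.append((self._href, anchor))
            self._href = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor_parts.append(data)


def analyze_page(base_url, html):
    """Return the URL string, page content, and outgoing links with anchors."""
    parser = PageAnalyzer(base_url)
    parser.feed(html)
    parser.close()
    content = " ".join(" ".join(parser.text_parts).split())
    return {"url": base_url, "content": content, "links": parser.links}


page = analyze_page(
    "http://example.com/index.html",
    '<html><body><p>Deep learning news</p>'
    '<a href="/ml/intro.html">ML <b>intro</b></a></body></html>',
)
```

Each extracted link keeps its anchor text next to the resolved URL, so later steps can score the URL string, the anchor, and the parent page content separately.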

[0032] Step 3-Topic relevance calculation:

[0033] 1) Relevance calculation model:

[0034] The topic relevance calculation model uses the vector space model, as follows:

[0035] (1) [equation not reproduced in the source]

[0036] where [symbol not reproduced] denotes the weight value of the feature ...
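Equation (1) does not survive in this text, so its exact form is unknown. In the vector space model, relevance is commonly measured as the cosine of the angle between a document's feature-weight vector and the topic's feature-weight vector; the sketch below assumes that form. The function name and the use of plain dictionaries as sparse vectors are illustrative choices, and the patent's actual formula may differ.

```python
import math


def cosine_similarity(doc_weights, topic_weights):
    """Cosine similarity between two sparse feature-weight vectors.

    Both arguments map a feature (e.g. a term) to its weight.
    ASSUMPTION: cosine similarity stands in for the missing equation (1).
    """
    shared = set(doc_weights) & set(topic_weights)
    dot = sum(doc_weights[f] * topic_weights[f] for f in shared)
    norm_doc = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_doc == 0.0 or norm_topic == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_doc * norm_topic)


same = cosine_similarity({"crawler": 2.0, "topic": 1.0},
                         {"crawler": 2.0, "topic": 1.0})
disjoint = cosine_similarity({"crawler": 1.0}, {"fabric": 1.0})
```

Identical vectors score 1 and vectors with no shared features score 0, which gives a bounded value that a crawler can compare against a relevance threshold.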

Abstract

The invention discloses a method for implementing a topical crawler system based on learning URL string information. First, the traditional relevance judgment method of topical crawlers is improved: the relevance between a target URL and the topic is judged from the URL string information, the web page content, and the anchor information. Second, machine learning is used to learn continuously from the information carried by URL strings and to update the topic's relevance vectors dynamically, which improves the accuracy of the relevance judgment between a target URL and the topic. Finally, a crawling strategy that combines content analysis and link analysis is adopted without increasing computational complexity; this keeps the topical crawler from becoming trapped in a local optimum, improves the global coverage of the crawl, and improves the crawler's efficiency. The method can be used in the crawler module of a vertical search engine to crawl web pages in a particular field.
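The ideas in the abstract can be illustrated with a small sketch: score a target URL from its URL string, its anchor text, and the content relevance of the page that links to it; reinforce the topic vector with tokens from URLs that proved relevant; and crawl the highest-scoring URLs first. Everything specific here is an assumption for illustration, not the patented method: the tokenization rule, the stop-token list, the mixing weights (0.4, 0.3, 0.3), the update rule in `learn_from_url`, and the `Frontier` class are all hypothetical. The sketch also does not implement the link-analysis component or the local-optimum avoidance that the abstract mentions.

```python
import heapq
import re

# Hypothetical stop tokens that carry no topical signal in a URL.
URL_STOP_TOKENS = {"http", "https", "www", "com", "html", "htm"}


def url_tokens(url):
    """Split a URL string into lowercase alphabetic tokens."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [p for p in parts if len(p) > 1 and p not in URL_STOP_TOKENS]


def overlap_score(tokens, topic_vector):
    """Average topic weight of the tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return sum(topic_vector.get(t, 0.0) for t in tokens) / len(tokens)


def combined_relevance(url, anchor, parent_content_score, topic_vector,
                       weights=(0.4, 0.3, 0.3)):
    """Mix URL-string, anchor, and parent-content signals (weights assumed)."""
    w_url, w_anchor, w_content = weights
    return (w_url * overlap_score(url_tokens(url), topic_vector)
            + w_anchor * overlap_score(anchor.lower().split(), topic_vector)
            + w_content * parent_content_score)


def learn_from_url(url, topic_vector, rate=0.1):
    """Move the weight of each URL token toward 1.0 (hypothetical rule).

    Intended to be called for URLs whose downloaded page proved relevant.
    """
    for tok in url_tokens(url):
        old = topic_vector.get(tok, 0.0)
        topic_vector[tok] = old + rate * (1.0 - old)


class Frontier:
    """Priority queue of unvisited URLs, highest relevance first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate for max-first

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


topic = {"football": 1.0}
score = combined_relevance("http://example.com/football", "football scores",
                           0.5, topic)
learn_from_url("http://example.com/sports/football.html", topic)

frontier = Frontier()
frontier.push("http://example.com/a", 0.2)
frontier.push("http://example.com/b", 0.9)
best_url, best_score = frontier.pop()
```

After the update, the token `sports` carries a small positive weight, so later URLs that contain it score slightly higher even when their anchor text is uninformative.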

Description

Technical field

[0001] The invention belongs to the technical field of data mining and search engines, and particularly relates to a method for implementing a topical crawler system based on learning URL string information.

Background

[0002] With the rapid growth of information on the Internet and rising user expectations of search engines, the limitations of traditional search engines, such as low coverage, poor timeliness, inaccurate results, and too many irrelevant results, are becoming increasingly apparent. To solve these problems, researchers have proposed vertical search engines, which focus on content search in a specific field. The topical crawler system is the core component of a vertical search engine. Its main goal is to collect as many high-quality web pages related to a specified topic as possible under limited time and network bandwidth, while ignoring those that are not related to the specified topic or some low-quality ...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 徐向华, 任祖杰, 万健, 殷昱煜, 胡昔祥
Owner: HANGZHOU DIANZI UNIV