Subject crawling method based on link hierarchical classification in network search

A hierarchical link classification technique applied to network search. It addresses the limitations of existing methods: experiments confined to small collections, no large-scale crawling performance tests, and no mature system.

Status: Inactive
Publication Date: 2008-01-09
PEKING UNIV
0 Cites, 46 Cited by

AI Technical Summary

Problems solved by technology

However, current methods that use anchor text and URLs do not consider the deep structure of web pages and determine priority only by relevance to the topic. Experiments have also been limited to small-scale collections; there is no mature system, and large-scale crawling performance has not been tested.

Embodiment Construction

[0028] The method of the present invention is described in detail below with reference to a specific embodiment. In this embodiment, the relevant topic pages are course web pages on a university website.

[0029] The topic crawling method based on hierarchical link classification, whose flow chart is shown in Figure 6, proceeds as follows:

[0030] Construct training set

[0031] In the method of the invention, the training set consists of several classes of links. First, all web pages under the California Institute of Technology homepage (www.caltech.edu) were crawled on December 10, 2006, and their link structure was retained to generate a directed graph PageGraph(V, E), where each vertex v (v ∈ V) represents a web page and each e (e ∈ E) is a directed edge representing a link from one web page to another. Afterwards, 1543 course-related web pages were manually labeled, and 9 classes were set. This is an empirical value and can be adjusted according to differen...
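The directed graph described in [0031] can be sketched as a small adjacency structure. This is only an illustration: the class layout, the method names, and the example URLs are assumptions made for the sketch, not details taken from the patent, and the rule for assigning links to the 9 classes is not shown because the source text is truncated before it is described.

```python
from collections import defaultdict


class PageGraph:
    """Directed graph PageGraph(V, E): vertices are pages, edges are links."""

    def __init__(self):
        self.vertices = set()
        self.out_links = defaultdict(set)  # page -> pages it links to
        self.in_links = defaultdict(set)   # page -> pages linking to it

    def add_link(self, src, dst):
        """Record a directed edge e = (src, dst), i.e. a link from src to dst."""
        self.vertices.update((src, dst))
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)

    def edges(self):
        """Return all edges as a sorted list of (src, dst) pairs."""
        return [(s, d) for s in sorted(self.out_links)
                for d in sorted(self.out_links[s])]


# Hypothetical pages; the real crawl covered pages under www.caltech.edu.
graph = PageGraph()
graph.add_link("/", "/academics")
graph.add_link("/academics", "/courses/cs101")
graph.add_link("/", "/news")

# Manually labeled course-related pages (1543 of them in the embodiment).
course_pages = {"/courses/cs101"}
```

Keeping both outgoing and incoming link maps is a design choice for the sketch: it makes it cheap to ask which links lead toward a labeled course page, which is the kind of question a link-class training set has to answer.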

Abstract

The method includes the following steps: (1) construct a training set; (2) add the seed web pages to the to-crawl queue; (3) crawl every URL in the to-crawl queue, parse each newly crawled page, and extract all of its links; (4) classify each new link according to the classes of the training set, and determine its priority from its class; (5) examine the priority queues in order, move all URLs from the highest-priority non-empty queue into the to-crawl queue, leave the other queues unchanged, and return to step (3); (6) stop crawling when all priority queues are empty or the specified number of crawl cycles has been reached. Using information such as anchor text and URLs, the method analyzes the hierarchy of links and thereby analyzes and crawls the deep topology of web pages.
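Steps (2) through (6) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the patented implementation: `fetch_links` and `classify_link` are hypothetical callbacks standing in for page fetching/parsing and for the trained link classifier, class index 0 is assumed to be the highest priority, and the `seen` set for skipping already-discovered URLs is an added assumption. The in-memory `WEB` dictionary replaces real HTTP requests.

```python
from collections import deque


def crawl(seeds, fetch_links, classify_link, num_classes, max_cycles):
    """Crawl in cycles, draining one priority queue per cycle."""
    to_crawl = deque(seeds)                                  # step (2)
    priority_queues = [deque() for _ in range(num_classes)]  # index 0 = highest
    seen = set(seeds)
    crawled = []
    cycles = 0
    while True:
        # Step (3): crawl every URL in the to-crawl queue and extract links.
        while to_crawl:
            url = to_crawl.popleft()
            crawled.append(url)
            for link_url, anchor_text in fetch_links(url):
                if link_url in seen:
                    continue
                seen.add(link_url)
                # Step (4): the link's class decides its priority queue.
                link_class = classify_link(link_url, anchor_text)
                priority_queues[link_class].append(link_url)
        cycles += 1
        if cycles >= max_cycles:                             # step (6)
            break
        # Step (5): move the highest-priority non-empty queue to to_crawl,
        # leaving all other queues unchanged.
        for queue in priority_queues:
            if queue:
                to_crawl.extend(queue)
                queue.clear()
                break
        else:
            break                                            # step (6): all empty
    return crawled


# Hypothetical in-memory site: page -> list of (link URL, anchor text).
WEB = {
    "home": [("courses", "Courses"), ("news", "News")],
    "courses": [("cs101", "CS 101 course")],
    "news": [("story", "Story")],
}


def fetch_links(url):
    return WEB.get(url, [])


def classify_link(url, anchor_text):
    # Toy stand-in for the trained classifier: two classes keyed on anchor text.
    return 0 if "course" in anchor_text.lower() else 1


order = crawl(["home"], fetch_links, classify_link, num_classes=2, max_cycles=10)
```

In this toy run the course-related branch is exhausted before the crawler turns to the lower-priority news links, which is the behavior the queue ordering in step (5) is meant to produce.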

Description

Technical field

[0001] The invention belongs to the technical field of network search, and in particular relates to a method for topic search over Internet pages.

Background

[0002] Local specialization is one characteristic of how information is distributed on the Web. However, pages on any given topic make up a small proportion of the whole and are highly dispersed. Lacking effective content pre-analysis and filtering, traditional search strategies crawl too many off-topic pages, which has become a bottleneck limiting crawler efficiency. How to crawl topic pages quickly and accurately with limited bandwidth and storage has therefore become a focus of search engine crawling in recent years.

[0003] At present, a large amount of in-depth research has been carried out on topic search at home and abroad. The basic ideas and methods mainly come from the Focused Crawling s...

Application Information

IPC(8): G06F17/30, H04L29/06
Inventor: 张铭, 周毅, 江云亮
Owner: PEKING UNIV