
Ontology-based method for constructing a theme-based web crawler system

A web crawler construction method, applied to building theme-based web crawler systems. It addresses biased topic relevance evaluation, high computational overhead, the burden of maintaining high-dimensional data, and the difficulty of describing topics or page content with keywords alone, with the aim of improving accuracy, working efficiency, and intelligence.

Status: Inactive
Publication date: 2008-06-04
NANJING UNIV
Cites: 3; cited by: 2

AI Technical Summary

Problems solved by technology

Existing theme-based crawlers evaluate topic relevance from keyword statistics of page text. However, the content of web pages varies greatly and the corresponding keyword database is usually very large, so the system incurs a large computational overhead and must maintain a large amount of high-dimensional data.
In addition, because natural language itself exhibits polysemy and synonymy, it is often difficult to describe a topic or the content of a page through keywords alone, which biases the evaluation of topic relevance.




Embodiment Construction

[0023] As shown in Figure 1, the web crawler system constructed by the method of the present invention includes a basic crawler working module, a topic relevance evaluation module, and an ontology management system module. The topic relevance evaluation module in turn contains a preprocessing sub-module and a relevance calculation sub-module.

[0024] The flow of the method is shown in Figure 2 and proceeds as follows:

[0025] Step (1): Parse the HTML file of the current web page and separate out the text of its main content.
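Step (1) can be illustrated with a minimal sketch that uses only the Python standard library. The class name `MainTextExtractor` and the function `extract_text` are hypothetical, and the sketch assumes that "main content" simply means visible text outside `script` and `style` elements; the source does not say how main content is identified.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects visible text, skipping script and style elements (an assumption)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A production crawler would likely also drop navigation bars, advertisements, and similar page furniture, which this sketch does not attempt.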

[0026] Step (2): Preprocess the separated text. Typically, the number of occurrences N(w_i) of each keyword w_i in the current document is counted against the keyword list preset by the system.
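The keyword count N(w_i) from step (2) might be computed as in the sketch below. The function name `count_keywords` is hypothetical, and the sketch assumes single-word keywords matched case-insensitively against alphanumeric tokens; the source does not specify tokenization, and Chinese text would need a word segmenter instead of this regular expression.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Return N(w_i), the occurrence count of each preset keyword w_i in text."""
    # Assumption: tokens are lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    totals = Counter(tokens)
    # Counter returns 0 for keywords that never occur.
    return {w: totals[w.lower()] for w in keywords}
```

Keywords absent from the document are reported with a count of 0 rather than omitted, so every keyword in the preset list has an entry for later steps.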

[0027] Step (3): According to the keyword set corresponding to each ontology class in the ontology database, calculate the class frequency of the ontology class in the curre...
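The paragraph above is truncated in this copy, so the formula for class frequency is not available here. Purely as an illustration of converting word-level counts into ontology-level information, the sketch below assumes that the frequency of an ontology class is the plain sum of the counts of the keywords in its keyword set. Both that formula and the function name `class_frequencies` are assumptions, not the patent's definition.

```python
def class_frequencies(keyword_counts, class_keywords):
    """Aggregate keyword counts N(w_i) into one frequency per ontology class.

    Assumption: a class's frequency is the sum of the counts of the keywords
    in its keyword set. The source's actual formula is not shown.
    """
    return {
        cls: sum(keyword_counts.get(w, 0) for w in words)
        for cls, words in class_keywords.items()
    }
```

Working with a handful of ontology classes instead of the full keyword vector is what would reduce the dimensionality that the background section complains about.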



Abstract

The method comprises the following steps: (1) parse the web page; (2) preprocess the text of the current page to obtain word-level information; (3) convert the word-level information into ontology-level information; (4) calculate the topic relevance of the page; (5) if the topic relevance exceeds a set threshold, extract the URLs of all outgoing links on the current page, otherwise go to step 7; (6) for each link, if the target URL has already been visited, move on to the next link, otherwise insert it into the priority queue of URLs waiting to be visited, ordered by the topic relevance of the page on which the link appears; (7) take the first URL from the priority queue, i.e. the one with the highest priority, and visit it; (8) repeat steps 1-7 until no new URL satisfies the conditions or a set limit is reached. The method yields more accurate results with a small computation and storage load.
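The control flow of steps 5 to 8 can be sketched as a best-first crawl over a priority queue. This is a minimal illustration, not the patented implementation: `focused_crawl`, its parameters, and the callables `fetch`, `relevance`, and `out_links` are hypothetical and supplied by the caller, and the default `threshold` and `max_pages` are arbitrary. The sketch also treats URLs that are already queued as visited, which the abstract does not state.

```python
import heapq
import itertools


def focused_crawl(seed, fetch, relevance, out_links, threshold=0.5, max_pages=100):
    """Best-first crawl: links inherit the relevance of the page they were found on.

    fetch(url) -> page, relevance(page) -> float and out_links(page) -> iterable
    of URLs are hypothetical, caller-supplied callables.
    """
    order = itertools.count()            # tie-breaker so heap entries stay comparable
    queue = [(0.0, next(order), seed)]   # min-heap keyed on negated priority
    seen = {seed}                        # URLs already visited or already queued
    visited = []

    while queue and len(visited) < max_pages:    # step 8: stop on empty queue or limit
        _, _, url = heapq.heappop(queue)         # step 7: highest priority first
        page = fetch(url)                        # steps 1-3 live inside the callables
        score = relevance(page)                  # step 4: topic relevance of the page
        visited.append((url, score))
        if score <= threshold:                   # step 5: off-topic, do not expand
            continue
        for link in out_links(page):             # step 5: extract outgoing links
            if link in seen:                     # step 6: skip known URLs
                continue
            seen.add(link)
            heapq.heappush(queue, (-score, next(order), link))
    return visited
```

Because `heapq` is a min-heap, the relevance score is negated on insertion so that the most relevant parent page's links are popped first. The counter keeps insertion order among links with equal priority.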

Description

1. Technical field

[0001] The invention relates to a method for constructing a crawler system, in particular a theme-based web crawler system.

2. Background technology

[0002] The web crawler is one of the core components of a search engine, and making crawlers work more efficiently has drawn increasing attention from researchers. Crawlers for specific topics have become an active research area. The goal of a theme-based web crawler is to avoid visiting pages unrelated to the theme as far as possible and to concentrate on pages related to it. Such crawlers are mainly used in domain-specific search engines and Web information retrieval systems.

[0003] Current theme-based crawler systems evaluate topic relevance mainly from keyword statistics of the text of Web pages. However, the content of Web pages varies widely, and the ...


Application Information

Patent type and authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventors: 高阳, 苏畅
Owner: NANJING UNIV