Webpage classification technology based on vertical search and focused crawler

A focused-crawler and vertical search technology, applied to webpage classification. It addresses problems such as the lack of directional extraction of structured web page information, revisit strategies that are difficult to adapt to focused crawlers, and the difficulty of judging focused crawlers.

CN101520798A | Inactive | Publication Date: 2009-09-02 | 苏州锐创通信有限责任公司

Embodiment Construction

[0054] The navigation website warehousing engine and the broadband network user behavior analysis system developed according to this method adopt a B/S (browser/server) architecture, and the development platform is VS2005 + Oracle 9i. Users can easily connect the existing website classifications they need to the system. Deployment only requires modifying the configuration file, and the system can run on a single PC or on multiple PCs at the same time. The system has been verified in our development and construction work: the URLs it captured cover 98% of the Alexa Top 100 Chinese sites, 87% of the Alexa Top 500 global sites, and 56% of local characteristic websites. Actual operation and testing during development reflect the implementation effect of the webpage classification identification method based on vertical search and focused crawlers, and verify its accuracy.
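The source does not say how the coverage rates were measured. As a rough illustration only, the sketch below assumes coverage means the fraction of sites in a reference list (for example, a top-sites ranking) whose hostname appears among the captured URLs. The function names `host` and `coverage` and the sample data are hypothetical.

```python
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased hostname, without a leading 'www.'."""
    # Bare hostnames such as "a.example" need a "//" prefix to parse as a netloc.
    name = (urlsplit(url if "//" in url else "//" + url).hostname or "").lower()
    return name[4:] if name.startswith("www.") else name


def coverage(captured_urls, reference_sites):
    """Fraction of reference sites whose hostname was captured (assumed definition)."""
    captured = {host(u) for u in captured_urls}
    reference = {host(s) for s in reference_sites}
    if not reference:
        return 0.0
    return len(captured & reference) / len(reference)


# Hypothetical data: two of the four reference sites were captured.
rate = coverage(
    ["http://www.a.example/x", "https://b.example/"],
    ["a.example", "B.example", "c.example", "d.example"],
)
```

Matching on hostnames rather than full URLs is a design choice made for this sketch; a stricter or looser matching rule would give different percentages.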

Abstract

The invention provides a method for identifying webpage classification based on vertical search and a focused crawler. The method comprises two parts: a webpage source code acquisition method and a webpage content analysis method. The webpage content analysis method is the key part, and itself comprises two main parts: extraction of the structured information of the webpage, and the crawling strategy of the focused crawler. First, a URL is selected from a list of navigation site URLs and the source file of that URL is acquired. Then the webpage content analysis method identifies and acquires all classified URLs of the navigation site: it first extracts the structured information of the webpage, then captures URLs using a directional breadth-first search strategy based on webpage content features, and finally stores each captured URL and its corresponding website classification in a list named Category.
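The steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented algorithm: the "webpage content feature" is reduced to keyword matching on link anchor text, pages are served from an in-memory dictionary instead of the network, and the names `LinkExtractor`, `crawl`, `category_keywords`, and `max_depth` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs: a stand-in for structured extraction."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed_urls, fetch, category_keywords, max_depth=2):
    """Directional breadth-first crawl; returns a list of (url, category)."""
    category = []  # plays the role of the "Category" list in the abstract
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        source = fetch(url)  # webpage source code acquisition
        if source is None:
            continue
        parser = LinkExtractor()
        parser.feed(source)
        for href, text in parser.links:
            target = urljoin(url, href)
            if target in seen:
                continue
            # Assumed content feature: a category keyword in the anchor text.
            label = next((name for name, words in category_keywords.items()
                          if any(w in text.lower() for w in words)), None)
            if label is None:
                continue  # off-topic link: skipped, which makes the search directional
            seen.add(target)
            category.append((target, label))
            if depth + 1 < max_depth:
                queue.append((target, depth + 1))
    return category


# Hypothetical navigation site held in memory instead of fetched over HTTP.
PAGES = {
    "http://nav.example/": '<a href="/news">News sites</a><a href="/about">About us</a>',
    "http://nav.example/news": '<a href="http://daily.example/">Daily news</a><a href="/">Home</a>',
}
result = crawl(["http://nav.example/"], PAGES.get, {"news": ["news"]})
```

Passing `fetch` as a parameter keeps source acquisition separate from content analysis, mirroring the two-part split in the abstract; a real deployment would substitute an HTTP client and a richer feature test.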

Description

Technical field

[0001] The present invention concerns the webpage classification identification method used in a vertical search engine that works from a fixed list of navigation websites. It mainly studies how to obtain the classification information of webpages effectively using vertical search and focused crawler technology, and designs the identification model and algorithm for webpage classification. It involves multiple fields, such as vertical search, focused crawlers, web data extraction, machine learning, data mining, and natural language.

Background technique

[0002] With the continuous expansion of information, people are increasingly dependent on search engines. General search engines such as Baidu and Google provide a great deal of convenience, but as people's needs diversify and their requirements for the quality of search results keep rising, general search engines can no longer meet people's requirements in some sp...

Application Information

Patent Timeline: 02 Sep 2009, Publication
Publication number: CN101520798A
IPC: G06F17/30
Inventors: 王攀; 张顺颐