Webpage classification technology based on vertical search and focused crawler

This focused-crawler and vertical-search technology, applied in the field of webpage classification, addresses problems such as the difficulty a focused crawler has in judging which pages to crawl, the lack of directional extraction of structured information from web pages, and the difficulty of adapting a focused crawler's revisit strategy to page changes.

Inactive Publication Date: 2009-09-02
苏州锐创通信有限责任公司

AI Technical Summary

Problems solved by technology

[0005] First, it is difficult for a focused crawler to decide which URLs in the crawl queue point to the web pages most likely to contain topic-related information.
[0006] Second, many open source crawler systems cannot directionally extract structured information from the web pages they crawl.
[0007] Third, the content and structure of a given web page often change, and it is difficult for a focused crawler's revisit strategy to adapt to these changes.
[0008] It follows that traditional open source focused crawler technology has difficulty accurately identifying different types of web pages.

Examples

Embodiment Construction

[0054] The navigation website warehousing engine and the broadband network user behavior analysis system developed according to this method adopt a B/S (browser/server) architecture, and the development platform is VS2005 + Oracle 9i. Users can easily add the existing website classifications they need to the system. Deployment only requires modifying the configuration file, and the system can run on one PC or on multiple PCs at the same time. The system was verified during our development and construction: the URLs it captured cover 98% of the Alexa Top 100 Chinese sites, 87% of the Alexa Top 500 global sites, and 56% of local characteristic websites. This operation and testing demonstrate the effectiveness of the webpage classification recognition method based on vertical search and focused crawler, and verify its accuracy.
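The text does not say how the coverage rates were computed. As a minimal sketch, assuming coverage means the share of sites in a reference ranking whose host appears among the captured URLs, the calculation could look like the following. The function names, the host normalization, and the sample data are all hypothetical.

```python
from urllib.parse import urlsplit


def normalize_host(url):
    """Return the lower-cased host of a URL without a leading 'www.' (assumed normalization)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def coverage_rate(captured_urls, reference_sites):
    """Fraction of reference sites whose host appears among the captured URLs."""
    if not reference_sites:
        raise ValueError("reference_sites must not be empty")
    captured_hosts = {normalize_host(u) for u in captured_urls}
    hits = sum(1 for site in reference_sites if site.lower() in captured_hosts)
    return hits / len(reference_sites)


# Hypothetical data: two of the four reference sites were captured.
captured = ["http://www.a.example/news", "https://b.example/"]
reference = ["a.example", "b.example", "c.example", "d.example"]
rate = coverage_rate(captured, reference)
print(f"coverage: {rate:.0%}")
```

Matching on hosts rather than full URLs is a design choice made here so that a captured deep link still counts toward its site.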

Abstract

The invention provides a method for identifying webpage classifications based on vertical search and a focused crawler. The method comprises two parts: a webpage source code acquisition method and a webpage content analysis method. The webpage content analysis method is the key part and itself has two main components: extraction of the webpage's structured information and the crawling strategy of the focused crawler. First, a URL is selected from a list of navigation site URLs and its source file is acquired. Then the webpage content analysis method identifies and acquires all classified URLs of the navigation site: it first extracts the webpage's structured information, then crawls URLs using a directional breadth-first search strategy based on webpage content features, and finally stores each crawled URL and its corresponding website classification in a list named Category.
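The steps in the abstract can be sketched as follows. This is an illustrative sketch only, not the patented implementation: it assumes the "webpage content feature" is a keyword found in a link's anchor text, and it replaces real page fetching with an in-memory dictionary so the example runs offline. The names `LinkExtractor`, `crawl`, `PAGES`, and the example URLs are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Extracts structured information: (href, anchor text) pairs from page source."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed, fetch, keywords, max_depth=2):
    """Directional breadth-first crawl: follow only links whose anchor text
    contains a category keyword (an assumed content feature)."""
    category = []                      # (url, classification) pairs
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        source = fetch(url)            # webpage source code acquisition
        if source is None:
            continue
        parser = LinkExtractor()
        parser.feed(source)
        for href, text in parser.links:
            if href is None or href in seen:
                continue
            label = next((k for k in keywords if k in text.lower()), None)
            if label is None:
                continue               # off-topic link: do not record or follow
            seen.add(href)
            category.append((href, label))
            if depth + 1 < max_depth:
                queue.append((href, depth + 1))
    return category


# Hypothetical navigation site held in memory instead of fetched over HTTP.
PAGES = {
    "http://nav.example/": '<a href="http://nav.example/news">News sites</a>'
                           '<a href="http://nav.example/about">About us</a>',
    "http://nav.example/news": '<a href="http://news.example/">Daily news</a>'
                               '<a href="http://ads.example/">Sponsored</a>',
}
result = crawl("http://nav.example/", PAGES.get, ["news", "sports"])
print(result)
```

In a real deployment `fetch` would download the page source, and the relevance test would likely use richer features than a keyword match.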

Description

Technical field

[0001] The present invention concerns webpage classification identification in a vertical search engine that works from a fixed list of navigation websites. It mainly studies how to obtain the classification information of web pages effectively using vertical search and focused crawler technology, and designs an identification model and algorithm for webpage classification. It involves multiple fields, such as vertical search, focused crawlers, web data extraction, machine learning, data mining, and natural language.

Background technique

[0002] With the continuous expansion of information, people are increasingly inseparable from search engines. Although general search engines such as Baidu and Google provide a great deal of convenience, as people's needs diversify and their requirements for the quality of search results rise, general search engines can no longer meet people's requirements in some sp...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 王攀, 张顺颐, 宫婷
Owner: 苏州锐创通信有限责任公司