Webpage classification identifying system and method based on vertical search and focused crawler technology

A technology combining focused crawlers and web page classification, applied in the field of web search engines. It addresses the lack of directional extraction, the difficulty of deciding which pages to crawl, and the difficulty of identifying different types of web pages, thereby saving network bandwidth and improving efficiency.

Status: Inactive; Publication date: 2012-07-18
SUZHOU YAXINFENG INFORMATION TECH
Cites: 2; Cited by: 33

AI Technical Summary

Problems solved by technology

[0004] Webpage classification and recognition for vertical search with focused crawlers is difficult for the following reasons. First, it is hard for a focused crawler to decide which URLs in the to-be-crawled queue are most likely to lead to pages containing topic-related information.
Second, many open-source crawler systems cannot directionally extract structured information from the pages they crawl.
Third, the content and structure of a given webpage often change, and the focused crawler's revisit strategy struggles to adapt to such changes.
As a result, it is difficult to accurately identify different types of web pages with traditional open-source focused crawler technology.

Examples

Embodiment

[0034] The navigation website warehousing engine and broadband network user behavior analysis system developed in this embodiment adopts a B/S (browser/server) architecture, and the development platform is VS2005 + Oracle 9i. Users can easily access the existing URL categories in the system according to their needs. Only the configuration file needs to be modified during deployment, and the system can run on a single PC or on multiple PCs at the same time.

[0035] The following is a detailed introduction to the modules of the design and to their web page classification and recognition method based on vertical search and focused crawlers. The processing flow of the web page classification and recognition method is shown in attached Figure 1 and follows the steps below:

[0036] (1) Read the URL list of the preset URL navigation sites and check whether the list is empty.

[0037] If it is empty, go to step (8).

[0038] (2) Take out a site URL and put it in the list of unvisited URLs (UV_URL list).

[00...
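The two steps visible above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `crawl_sites` and the callback `handle_unvisited` are hypothetical, and because the later steps are truncated in the source, the callback only stands in for them.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_sites(
    site_urls: Iterable[str],
    handle_unvisited: Callable[[List[str]], None],
) -> int:
    """Walk the preset navigation-site URL list as in steps (1) and (2).

    `handle_unvisited` is a hypothetical placeholder for the steps that are
    not shown in the source; it receives the UV_URL list for one site.
    Returns the number of sites taken from the list.
    """
    pending = deque(site_urls)
    processed = 0
    # Step (1): when the list is empty the loop ends, which corresponds
    # to jumping to step (8) in the source.
    while pending:
        # Step (2): take out one site URL and put it in the unvisited list.
        uv_url = [pending.popleft()]
        handle_unvisited(uv_url)
        processed += 1
    return processed
```

Passing `seen.append` as the callback, for some list `seen`, simply records each UV_URL list, which is enough to observe the order in which sites are taken out.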

Abstract

The invention discloses a webpage classification and identification system based on vertical search and focused crawler technology. The system comprises an application expressing module, a data acquisition module, and a content analyzing module. The data acquisition module acquires webpage data through a Web protocol and passes the acquired page data to the content analyzing module. The content analyzing module performs HTML (HyperText Markup Language) analysis on the page data, extracts the hyperlinks in each page, and adds them to a URL (Uniform Resource Locator) queue, producing a correspondence table between website types and URLs. The application expressing module receives the keywords a user enters for a search and returns the matching websites of a specific field and/or their website types to the user. Actual operation and testing during development and construction demonstrated the effect of the webpage classification and identification method based on vertical search and focused crawlers and verified its accuracy.
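The hyperlink extraction performed by the content analyzing module might look roughly like the Python sketch below, which uses only the standard library. The names `LinkExtractor` and `enqueue_links` are hypothetical, and the `seen` set used to skip duplicate URLs is an assumption added for illustration; the abstract does not say how duplicates are handled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def enqueue_links(page_html, base_url, url_queue, seen):
    """Parse one page and append its unseen hyperlinks to the URL queue.

    Deduplication through `seen` is an assumption, not part of the source.
    """
    parser = LinkExtractor(base_url)
    parser.feed(page_html)
    parser.close()
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            url_queue.append(link)
```

A `deque` serves as the URL queue here so that links are crawled in the order they were found; a focused crawler would more likely order the queue by estimated topic relevance, which this sketch does not attempt.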

Description

Technical field

[0001] The invention belongs to the technical field of webpage search engines, and specifically relates to a webpage classification and recognition system and method based on vertical search and focused crawler technology.

Background technique

[0002] With the continuous expansion of information, people increasingly depend on search engines. Although general search engines such as Baidu and Google provide a great deal of convenience, people's needs have diversified and their expectations for the quality of search results have risen, so general search engines can no longer meet people's requirements in some specialized fields. Vertical search emerged in response. It is an accurate search technology that serves specific professional fields; it is more specialized and returns more targeted results. Using the domain knowledge of a specific industry topic, it can provide queries based on semantic information, which can meet the special search needs ...

Application Information

IPC(8): G06F17/30
Inventors: 曹武龙, 王国圃
Owner: SUZHOU YAXINFENG INFORMATION TECH