System and method for identifying webpage types

A technique for identifying web page types, applied in web data retrieval, web data retrieval using information identifiers, and special data processing applications. It addresses poor web page type recognition, inappropriate classifier feature selection, and low efficiency, and achieves great flexibility, fast recognition, and high recognition accuracy.

Active Publication Date: 2014-01-29
烟台中科网络技术研究所

AI Technical Summary

Problems solved by technology

[0008] The technical problem to be solved by the present invention is to provide a system and method for identifying web page types that addresses two shortcomings of the prior art: poor identification results when relying on heuristic rules, and inappropriate feature selection for classifiers. In particular, it addresses the large changes and low efficiency that the prior art incurs when identifying cross-language web pages.


Embodiment Construction

[0027] The principles and features of the present invention are described below in conjunction with the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

[0028] Figure 1 is a schematic flow chart of the method for identifying web page types in this embodiment. As shown in Figure 1, the method includes the following steps:

[0029] Based on the background knowledge of a specific application scenario, heuristic rules are predefined for one or more web page types and compiled into a heuristic rule list, which is stored in the rule storage. Each heuristic rule corresponds to a unique web page type. The content of a rule is defined differently for different web page types, and each rule's definition must fully conform to the characteristics of its web page type. If a rule is ambiguous, it is removed. For unambiguous ...
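The rule list described in [0029] can be illustrated with a minimal Python sketch. The source does not specify how rules are written or how ambiguity is detected, so everything here is an assumption made for illustration: the `HeuristicRule` class, the use of regular expressions on the URL and page source, the example page types, and the `prune_ambiguous` helper, which treats a rule as ambiguous if it matches a labeled sample page of a different type.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HeuristicRule:
    """Hypothetical rule format: one rule maps to exactly one page type."""
    page_type: str
    url_pattern: Optional[str] = None     # regex tested against the URL
    source_pattern: Optional[str] = None  # regex tested against the page source

    def matches(self, url: str, source: str) -> bool:
        # Every pattern that is present must match.
        if self.url_pattern and not re.search(self.url_pattern, url):
            return False
        if self.source_pattern and not re.search(self.source_pattern, source):
            return False
        # A rule with no pattern at all never matches.
        return bool(self.url_pattern or self.source_pattern)


def prune_ambiguous(
    rules: List[HeuristicRule],
    labeled_pages: List[Tuple[str, str, str]],  # (url, source, page_type)
) -> List[HeuristicRule]:
    """Assumed ambiguity test: drop a rule that fires on a page of another type."""
    kept = []
    for rule in rules:
        ambiguous = any(
            rule.matches(url, source) and label != rule.page_type
            for url, source, label in labeled_pages
        )
        if not ambiguous:
            kept.append(rule)
    return kept


# Illustrative rules and samples (not taken from the patent).
rules = [
    HeuristicRule("forum", url_pattern=r"/(forum|thread|bbs)/"),
    HeuristicRule("news", url_pattern=r"/\d{4}/\d{2}/"),
    HeuristicRule("blog", source_pattern=r'<meta name="generator" content="WordPress'),
]
samples = [
    ("http://example.com/forum/123", "<html></html>", "forum"),
    ("http://example.com/2013/05/story.html", "<html></html>", "news"),
    ("http://example.org/2013/06/my-post", "<html></html>", "blog"),
]
rule_list = prune_ambiguous(rules, samples)
```

In this toy data the date-based "news" rule also fires on a blog URL, so it is dropped and only the "forum" and "blog" rules remain in `rule_list`.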


Abstract

The invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying web page types. The method comprises the following steps: heuristic rules are predefined and a heuristic rule list is generated; predetermined features are extracted from training web pages to form a standard feature vector, which is optimized twice to form a simplified feature set; a classifier and a feature extractor are established, and a classification model is generated through the classifier; the URL and source code of a web page to be identified are matched against the heuristic rule list; if matching succeeds, the web page type of the page is output; if matching fails, the classifier is used to classify the page. The system and method are flexible and convenient to use, fast, and accurate. Identifying cross-language web pages requires no major changes, so identification efficiency is high and the approach has high practical value.
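The two-stage flow in the abstract (rule matching first, classifier only on a miss) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the tuple-based rule format, the function names `match_rules` and `identify_page_type`, and the keyword-based `toy_classifier` are all hypothetical. In particular, `toy_classifier` merely stands in for the trained classification model, whose features and training procedure are not reproduced here.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical rule format: (page_type, url_regex or None, source_regex or None).
Rule = Tuple[str, Optional[str], Optional[str]]


def match_rules(url: str, source: str, rules: List[Rule]) -> Optional[str]:
    """Return the page type of the first rule that matches, or None."""
    for page_type, url_re, source_re in rules:
        if url_re is None and source_re is None:
            continue  # a rule without any pattern cannot match
        url_ok = url_re is None or re.search(url_re, url) is not None
        source_ok = source_re is None or re.search(source_re, source) is not None
        if url_ok and source_ok:
            return page_type
    return None


def identify_page_type(
    url: str,
    source: str,
    rules: List[Rule],
    classify: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try the rule list first; fall back to the classifier if no rule matches.

    Returns (page_type, stage), where stage is "rule" or "classifier".
    """
    matched = match_rules(url, source, rules)
    if matched is not None:
        return matched, "rule"
    return classify(url, source), "classifier"


def toy_classifier(url: str, source: str) -> str:
    # Placeholder for the trained model: a single keyword check.
    return "product" if "price" in source.lower() else "other"


# Illustrative rules (not taken from the patent).
rules: List[Rule] = [
    ("forum", r"/(forum|thread)/", None),
    ("blog", None, r'class="post-content"'),
]

by_rule = identify_page_type(
    "http://example.com/forum/1", "<html></html>", rules, toy_classifier
)
by_model = identify_page_type(
    "http://example.com/item/7", "<span>Price: 10</span>", rules, toy_classifier
)
```

Returning the stage alongside the page type makes it easy to see how often the cheaper rule stage resolves a page before the classifier is invoked.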

Description

Technical field

[0001] The invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying web page types.

Background art

[0002] As the volume of network information grows, it is sometimes difficult for users to retrieve the documents they want through search engines. At the same time, how search results are presented to users has attracted increasing attention. Most traditional search systems return a large collection of web documents that match the user's query. However, the high recall and low precision of the returned documents make it increasingly difficult for users to find useful information. In recent years, researchers have studied topic-based document classification extensively and achieved good results. However, although documents can be successfully classified according to topics, there are still a large n...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/955
Inventor: 李海燕, 王海洋, 刘大伟, 刘玮, 余智华, 隋雪青
Owner: 烟台中科网络技术研究所