Automatic webpage type identification method based on Web structure characteristic mining

An automatic web page type identification technology, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses the problems that a URL does not include all the content information of a web page, that unrelated hyperlinks have a negative effect on classification, and that plain-text classification is unreliable. Its stated effects are reduced diversity, great versatility, and reduced page volume.

Inactive Publication Date: 2018-01-12
UNIV OF ELECTRONICS SCI & TECH OF CHINA
Cites: 8; Cited by: 20

AI Technical Summary

Problems solved by technology

[0008] URL feature mining, as used in web structure mining, has two problems. First, a URL contains only the location of a resource, not the full content of the web page. Second, not all hyperlinks in a web page are related to the content of that page, and the unrelated ones have a negative effect on classification.
Web page content mining relies on plain-text classification technology. Unlike structured text, a web page is a semi-structured document that carries much more than text, so classifying web pages with plain-text techniques alone is unreliable and impractical.


Examples


Embodiment Construction

[0041] After analyzing a large number of web pages of different types, the present invention finds that web pages under the same website have similar structures, while web pages of different types, or from websites with different domain names, show both recognizable similarities and clear differences. For example, the URL of a news page carries time and domain-name characteristics; tags show continuity, text concentration, and hierarchy; and the number of hyperlinks in the page source code and the length of the source code are also characteristic. Based on this summary of web page features, the present invention proposes an automatic web page type identification method based on Web structure feature mining. Feature extraction is the focus of the method and the basis for effectively identifying the type of a web page. The method first obtains the web page source code set through the crawler system, and calls the Java API interface to parse the webpage ...
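The feature groups named above (URL time and domain, tag hierarchy and text concentration, hyperlink count, source length) can be illustrated with a minimal Python sketch. It is not the patent's implementation, which is described as using a Java API. The function name `extract_features`, the date pattern, and the specific feature keys are assumptions made only for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed pattern for a date embedded in a URL path, e.g. /2016/06/30/ or /20160630/.
DATE_IN_URL = re.compile(
    r"(19|20)\d{2}[/\-_]?(0[1-9]|1[0-2])[/\-_]?(0[1-9]|[12]\d|3[01])"
)

# HTML elements that never have a closing tag, so they do not add nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureCounter(HTMLParser):
    """Collects simple structural counts while the page is parsed."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.tag_count = 0
        self.link_count = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())


def extract_features(url, html):
    """Return a dict of illustrative URL and structure features (hypothetical keys)."""
    parsed = urlparse(url)
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    return {
        "url_has_date": bool(DATE_IN_URL.search(parsed.path)),
        "domain": parsed.netloc,
        "link_count": counter.link_count,
        "source_length": len(html),
        "tag_count": counter.tag_count,
        "max_tag_depth": counter.max_depth,
        "text_chars": counter.text_chars,
    }
```

A feature dictionary like this would be turned into a numeric vector before being passed to whichever classification algorithm is chosen; the source does not name a specific one.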



Abstract

The invention discloses an automatic web page type identification method based on Web structure feature mining. The method comprises the following steps: S1, a web page source code set is obtained through a crawler system; S2, the web page source code is preprocessed; S3, web page features are extracted; S4, a classifier is built by applying a classification algorithm used in machine learning, and automatic web page type identification is completed through the classifier. Before the web page feature set is extracted, a depth-first traversal search strategy is used to find the noise tags that need to be removed, which reduces the volume of the web page, reduces the number of tags to be processed, and improves the performance of feature extraction. An HTML document feature set is extracted from four aspects closely related to the web page structure through Web structure mining, and the classification algorithm is then applied to build the classifier that completes automatic web page type identification. Compared with other web page type identification methods, the method is simple in concept, easy to implement, convenient to popularize, broadly applicable, and high in accuracy.
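The preprocessing idea in the abstract, a depth-first traversal that finds and removes noise tags before feature extraction, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented procedure: the abstract does not list which tags count as noise, so `NOISE_TAGS` is a guess, and `Node`, `TreeBuilder`, `parse`, `remove_noise`, and `count_tags` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Assumption: tags treated as noise. The source does not enumerate them.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A very small DOM-like node; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def remove_noise(root):
    """Depth-first (pre-order) traversal that drops noise subtrees; returns count removed."""
    removed = 0
    stack = [root]
    while stack:
        node = stack.pop()
        kept = []
        for child in node.children:
            if isinstance(child, Node) and child.tag in NOISE_TAGS:
                removed += 1  # the whole subtree is discarded
            else:
                kept.append(child)
        node.children = kept
        # Push in reverse so the left-most child is visited next.
        stack.extend(c for c in reversed(kept) if isinstance(c, Node))
    return removed


def count_tags(root):
    """Count element nodes below the synthetic document root."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if isinstance(child, Node):
                total += 1
                stack.append(child)
    return total
```

Calling `remove_noise(parse(html))` before feature extraction leaves a smaller tree with fewer tags to process, which is the effect the abstract attributes to this step.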

Description

Technical field

[0001] The invention belongs to the technical field of web page identification, and in particular relates to a web page type automatic identification method based on Web structural feature mining.

Background technique

[0002] With the rapid development of science and technology, the Internet has become the main place for people to acquire knowledge because it contains a large amount of information. In recent years, with the country’s vigorous promotion and large amounts of capital investment, the Internet has become more and more popular. The following results can be obtained from the statistics of Internet development status. The number of Chinese websites is increasing. According to statistics in June 2016, there were 4.54 million websites, an increase of 7.4% compared to December 2015.

[0003] Due to the explosive growth of the number of websites, the difficulty for users to obtain the resources of real interest has risen sharply, which is the phenomenon of "...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 于富财汪辉文友枥胡光岷费高雷
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA