The invention discloses an automatic webpage type identification method based on Web structure feature mining. The method comprises the following steps: S1, a set of webpage source code is obtained through a crawler system; S2, the webpage source code is preprocessed; S3, webpage features are extracted; S4, a classifier is built by applying a machine learning classification algorithm, and automatic webpage type identification is completed by the classifier. Before the webpage feature set is extracted, a depth-first traversal is used to find the noise tags that need to be removed, which reduces the size of the webpage and the number of tags to be processed and thereby improves the performance of feature set extraction. An HTML document feature set is extracted through Web structure mining from four aspects closely related to webpage structure, and a machine learning classification algorithm is then applied to build the classifier that completes automatic webpage type identification. Compared with other webpage type identification methods, the method is simple in concept, easy to implement, easy to popularize, broadly applicable, and highly accurate.
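
The preprocessing step described above (a depth-first traversal that locates and removes noise tags before feature extraction) can be illustrated with a minimal sketch. It is not the claimed implementation. The abstract does not say which tags count as noise, so the `NOISE_TAGS` set is an assumption, and the names `Node`, `TreeBuilder`, `parse`, `count_tags` and `remove_noise` are hypothetical helpers introduced only for illustration. The sketch uses Python's standard `html.parser` module and keeps only tag structure, ignoring text and attributes.

```python
from html.parser import HTMLParser

# Assumption: the abstract does not list the noise tags; this set is illustrative.
NOISE_TAGS = {"script", "style", "svg"}
# HTML void elements never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a minimal tag tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def count_tags(node):
    """Number of nodes in the subtree rooted at node, including node itself."""
    return 1 + sum(count_tags(child) for child in node.children)


def remove_noise(root):
    """Depth-first traversal that prunes noise subtrees in place.

    Returns the number of tags removed.
    """
    removed = 0
    stack = [root]  # explicit stack gives a pre-order depth-first walk
    while stack:
        node = stack.pop()
        kept = []
        for child in node.children:
            if child.tag in NOISE_TAGS:
                removed += count_tags(child)  # drop the whole subtree
            else:
                kept.append(child)
        node.children = kept
        stack.extend(reversed(kept))  # visit the leftmost child first
    return removed
```

Calling `remove_noise(parse(html))` returns how many tags were pruned, and the remaining tree is what a later feature extraction step would traverse. Pruning whole subtrees at the first noise tag encountered means their descendants are never visited, which is one plausible reading of how the traversal reduces the number of tags to be processed.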