Method for quickly classifying webpage topics based on HTML source code features

A fast classification technology based on HTML source code, applied in website content management, network data retrieval, and special data processing applications. It avoids the reliance on query logs, which are not easy to obtain, and achieves fast and accurate classification.

Active Publication Date: 2020-08-04
浙江网新恒天软件有限公司

AI Technical Summary

Problems solved by technology

Query logs are not easy to obtain. Collecting them requires special measures, such as deploying routers to track user queries, which raises user data privacy issues.

Embodiment Construction

[0025] To make the above objects, features, and advantages of the present invention easier to understand, specific implementations of the present invention are described in detail below with reference to the accompanying drawings.

[0026] Many specific details are set forth in the following description so that the present invention can be fully understood. The present invention can also be implemented in ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of the present invention. The present invention is therefore not limited to the specific examples disclosed below.

[0027] The present invention proposes a method for quickly and accurately classifying webpage topics based on HTML source code features. First, the target information is extracted by parsing the HTML source code of the webpage, and then data preprocessing is performed on the extracted information to make it a we...
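
As an illustration of the parsing step, the following minimal Python sketch walks the HTML source and records, for each tag, its nesting depth, the length of its direct text, and the length of its href value. The names TagFeatureParser and extract_tag_features, and the exact feature definitions, are assumptions made for this example; this excerpt does not specify how the invention computes them.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagFeatureParser(HTMLParser):
    """Hypothetical helper: one record per element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indices into self.records for currently open elements
        self.records = []  # dicts with tag, depth, text_len, link_len

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""  # attribute value may be None
        self.records.append({
            "tag": tag,
            "depth": len(self.stack),  # assumed measure of hierarchy
            "text_len": 0,             # assumed measure of content length
            "link_len": len(href),     # assumed measure of link length
        })
        if tag not in VOID_TAGS:
            self.stack.append(len(self.records) - 1)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; tolerate sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.records[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Credit text to the innermost open element only.
            self.records[self.stack[-1]]["text_len"] += len(text)


def extract_tag_features(html):
    """Parse an HTML string and return the list of per-tag feature records."""
    parser = TagFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Calling extract_tag_features on a page's source returns one dictionary per tag. Only the standard library is used, and no text semantics are analyzed, which matches the stated goal of classifying without semantic analysis.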

Abstract

The invention discloses a method for quickly classifying webpage topics based on HTML source code features. The method automatically parses the webpage source code to obtain image data that captures the layout characteristics of the webpage. The selected features are the content length and link length contained in each tag, the hierarchical relationship to which the selected tags belong, and the distance relationship between the selected tags; together these effectively reflect the layout information of the webpage. A deep learning model is trained on the image data generated from the webpage source code to learn the layout features it contains, so that massive numbers of webpages can be classified quickly and accurately by their layout features. The method makes effective use of the layout information contained in the webpage source code, extracts and learns that information automatically, and yields a classification model that is robust and fast.
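
To make the idea of image data built from tag features concrete, here is a hypothetical Python sketch that packs per-tag features into a small fixed-size grid. The layout is an assumption made purely for illustration: rows stand for tag depth, columns for position in document order, and the two channels for text length and link length. The abstract does not give the actual encoding, and the function name features_to_grid and its parameters are invented for this example.

```python
def features_to_grid(records, height=8, width=16):
    """Pack (depth, text_len, link_len) tuples into a height x width x 2 grid.

    Assumed layout, not taken from the patent:
      row     = tag depth, clipped to the last row
      column  = position in document order, binned into `width` columns
      channel = 0 for text length, 1 for link length, each scaled to [0, 1]
    """
    grid = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
    n = len(records)
    for i, (depth, text_len, link_len) in enumerate(records):
        row = min(depth, height - 1)
        col = i * width // n  # n > 0 whenever this loop body runs
        grid[row][col][0] += text_len
        grid[row][col][1] += link_len

    # Scale each channel by its own maximum so pages of any size are comparable.
    for ch in range(2):
        peak = max(cell[ch] for row in grid for cell in row)
        if peak > 0:
            for row in grid:
                for cell in row:
                    cell[ch] /= peak
    return grid
```

A grid of this shape can be fed to an image classifier in the same way as a two-channel picture. Binning by document order is only a rough stand-in for the distance relationship between tags mentioned above.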

Description

Technical field

[0001] The technical field of the present invention is rapid classification of massive webpages, especially in the case of no need to analyze text semantics, rapid multi-classification of webpage topics based on HTML source code features, which provides convenience for the next step of structured and efficient extraction of webpage information.

Background technique

[0002] With the explosive growth of Internet information, how to enable machines to extract information from huge web page data more effectively has become more and more important. The first step in extracting webpage information automatically and intelligently is to quickly identify and classify webpages; different types of webpages have different information extraction methods. The existing methods of webpage classification mainly use three types of webpage data: webpage text content, webpage layout features, and webpage query logs.

[0003] In the method of classifying webpages by using textu...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F16/958
CPC: G06F16/958, G06F18/241
Inventors: 简小云, 朱雨佳, 杨哲, 王莉芳, 陈金辉
Owner: 浙江网新恒天软件有限公司