A web page classification method based on deep learning with the fusion of text and structural features

A web page classification technique that fuses structural features with text features, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the insufficient accuracy of classifying web pages from text features alone and achieves more comprehensive, effective, and accurate classification.

Status: Inactive
Publication Date: 2018-12-11
ZHEJIANG UNIV
5 Cites, 15 Cited by

AI Technical Summary

Problems solved by technology

[0005] In view of the above deficiencies, the present invention provides a web page classification method based on deep learning that fuses text and structural features, which solves the problem that classification using web page text features alone is not accurate enough.

Embodiment Construction

[0027] The present invention will be further described below in conjunction with the accompanying drawings.

[0028] As shown in Figure 1, a web page classification method based on deep learning that fuses text and structural features includes the following steps:

[0029] (1) Obtain web page information

[0030] Given the URL of a web page, a Scrapy crawler fetches the page's HTML document and stores it in a MongoDB database.
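
The fetch-and-store step can be sketched roughly as follows. This is not the Scrapy/MongoDB setup the description names: to keep the example self-contained it assumes the standard library's `urllib` in place of Scrapy and a hypothetical in-memory `InMemoryStore` in place of a MongoDB collection. The record fields (`url`, `html`, `fetched_at`) are also assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it (stand-in for a Scrapy spider)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class InMemoryStore:
    """Hypothetical stand-in for a MongoDB collection, keyed by URL."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def upsert(self, record: dict) -> None:
        # Re-crawling the same URL overwrites the earlier record.
        self.docs[record["url"]] = record


def crawl(urls: Iterable[str], store: InMemoryStore,
          fetch: Callable[[str], str] = fetch_html) -> int:
    """Fetch each URL, store its HTML, and return the number of stored pages."""
    for url in urls:
        store.upsert({
            "url": url,
            "html": fetch(url),
            "fetched_at": datetime.now(timezone.utc).isoformat(),
        })
    return len(store.docs)
```

Passing `fetch` as a parameter lets the storage logic be exercised without network access; a real crawler would also need error handling, rate limiting, and robots.txt handling, which this sketch omits.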

[0031] (2) Extract web page text features

[0032] First, text information is extracted from the HTML tags that represent the title of the web page, meta information, headings at all levels, hyperlinks, etc., which together carry the main information of the page. The extracted text is then preprocessed: it is converted to lowercase, garbled characters are removed, abbreviations and numbers are removed, and stop words are removed. Stop words are frequently occurring words that contribute little to classification, and the stop...
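
A minimal sketch of the extraction and preprocessing described in [0032], using only the Python standard library. The tag set (`title`, `h1` to `h6`, `a`, plus `meta` description/keywords) and the short stop-word list are assumptions chosen for illustration; the exact tag list and stop-word list used by the method are not available here. Abbreviation removal is approximated by dropping single-letter tokens.

```python
import re
from html.parser import HTMLParser
from typing import List

# ASSUMPTION: illustrative tag set, not the method's confirmed list.
TEXT_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "a"}
# ASSUMPTION: tiny stop-word list for demonstration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "on"}


class KeyTextExtractor(HTMLParser):
    """Collects text inside TEXT_TAGS and from <meta> description/keywords."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0          # number of currently open tracked tags
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords") and attr.get("content"):
                self.chunks.append(attr["content"])
        elif tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.chunks.append(data.strip())


def preprocess(text: str) -> List[str]:
    """Lowercase, keep only a-z runs (drops digits and garbled characters),
    then drop single-letter tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


def extract_tokens(html: str) -> List[str]:
    parser = KeyTextExtractor()
    parser.feed(html)
    parser.close()
    return preprocess(" ".join(parser.chunks))
```

Text in tags outside the tracked set, such as `<p>`, is ignored, which is one simple way to skip body noise like advertisements; whether the method does this is not stated in the available text.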

Abstract

The invention provides a web page classification method based on deep learning that fuses text and structural features. First, the HTML (HyperText Markup Language) document of a web page is obtained by a crawler, key text information such as the title, meta information, and hyperlinks is extracted, and the text vocabulary is converted into vectors (word2vec) to represent the text features. Then the HTML tags are traversed and transformed into vectors to represent the structural features of the web page. Finally, the vectors are fed into a long short-term memory (LSTM) network, and the heterogeneous text features and structural features are fused by the neural network into a single trained model for classification. By combining these distinguishing features, the method represents web pages more comprehensively and improves classification accuracy.
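
One way the tag-traversal step could be realized is sketched below: the start tags are read in document order, mapped to integer IDs, and padded to a fixed length so they can be fed to a sequence model. The abstract does not specify how tags become vectors, so the integer-ID encoding, the `<pad>`/`<unk>` entries, and all function names here are assumptions. The word2vec and LSTM stages are not shown.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable, List


class TagSequence(HTMLParser):
    """Records every start tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> List[str]:
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def build_vocab(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer ID to each tag seen in the training pages."""
    vocab = {"<pad>": 0, "<unk>": 1}   # ASSUMPTION: reserved padding/unknown IDs
    for seq in sequences:
        for tag in seq:
            vocab.setdefault(tag, len(vocab))
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or pad a tag sequence to max_len integer IDs."""
    ids = [vocab.get(tag, vocab["<unk>"]) for tag in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))
```

In a full pipeline these ID sequences would typically pass through an embedding layer before the LSTM, alongside the word2vec vectors for the text; that arrangement is a common design, not something the abstract confirms.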

Description

Technical field

[0001] The invention relates to the field of web page classification, in particular to a web page classification method based on deep learning that fuses text and structural features.

Background

[0002] The Internet holds abundant information resources, and the amount of information on it has grown explosively over time. Web page classification helps retrieve and manage web page information, for example in developing and maintaining web directories, improving search engine quality, and filtering web content. One line of research classifies web pages into predefined categories, which is a supervised learning task. Web pages are a kind of unstructured data: content and structure differ from page to page, and pages contain noise such as advertisements and copyright notices, which makes classification challenging. At t...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 沈继忠, 邓立, 杜歆
Owner: ZHEJIANG UNIV