Website service classification method combining graph attention network and contrastive learning
By combining graph attention networks and contrastive learning, and utilizing BERT to extract semantic features of website text and construct DOM parsing trees and relationship graphs, the problem of low accuracy in website classification in existing technologies is solved, and efficient website service classification is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF TECH
- Filing Date
- 2023-08-11
- Publication Date
- 2026-06-26
AI Technical Summary
Existing website service classification methods fail to fully utilize website structure information, resulting in low classification accuracy and significant impact from text noise.
By combining graph attention networks and contrastive learning methods, we extract textual semantic features using BERT, construct webpage DOM parsing trees and relationship graphs, introduce attention mechanisms to extract structural features, and enhance feature representations through contrastive learning. Finally, we fuse textual and structural features for classification.
The method achieves 89.2% accuracy in website service classification on the 400-website test set, mitigating the problems of limited structural information and text noise.
Figure CN117171611B_ABST
Description
Technical Field
[0001] This invention relates to a website service classification method that combines graph attention networks and contrastive learning, and belongs to the field of computer and information science.
Background Technology
[0002] The number of websites on the internet is increasing rapidly, and people's demand for website retrieval is also growing. In cyberspace, websites serve as a vital bridge connecting users, information, and services, and website service classification plays a crucial role in research fields such as cybersecurity incident analysis and cyberspace mapping. Based on the nature of website services, websites can be categorized into industry enterprises, internet technology, leisure and entertainment, healthcare, lifestyle services, transportation and tourism, government organizations, and other categories. While directory-style websites like Alexa, Chinaz, and hao123 exist, their keyword-based classification requires significant manual intervention, resulting in slow updates and low efficiency, failing to meet the demands of classifying the vast number of websites. Existing website service classification methods can be broadly categorized into two types:
[0003] Website service classification algorithms based on text information. These methods extract and preprocess text information from websites to obtain long text data, then extract features and classify it, transforming website classification into a text classification task. However, the quality of text content on websites varies greatly, and websites may contain a large amount of text noise unrelated to service type. Therefore, existing website service classification methods face the problem of sparse text features and difficulty in extracting key features.
[0004] Website service classification algorithms based on website structure. A website is composed of multiple web pages with a certain physical and logical structure. A website's topology diagram can be constructed by analyzing the DOM structure within web pages or the URL links between pages, thereby extracting key information and classifying the website. However, due to different programming habits among developers, the structures of different web pages within the same website can vary significantly. Existing methods only utilize one topology diagram, providing limited information, ignoring the overall structure of the website, and failing to capture comprehensive structural features.
[0005] In summary, classifying website services solely based on the textual semantic information or structural information of a single webpage is insufficient, neglecting the overall picture of the website. Furthermore, textual noise within the website can negatively impact the model, resulting in low classification accuracy. Therefore, this invention proposes a website service classification method that combines graph attention networks and contrastive learning.
Summary of the Invention
[0006] This invention addresses the problem that existing methods do not fully utilize website structural information, resulting in low accuracy in website service classification. It proposes a website service classification method that combines graph attention networks and contrastive learning.
[0007] The design principle of this invention is as follows: First, BERT is used to extract the textual semantic features of the website; second, a webpage DOM parsing tree is constructed based on the HTML code of the webpage, and a webpage relationship graph is generated based on the URL links; then, a graph attention network is used in combination with a contrastive learning method to extract the structural features of the webpage DOM parsing tree and the webpage relationship graph to generate a website structural representation; finally, the textual semantic features and structural representation of the website are integrated to classify website services.
[0008] The technical solution of the present invention is achieved through the following steps:
[0009] Step 1: Extract semantic features of website text using the BERT pre-trained language model.
[0010] Step 2: Based on the HTML code and URL link relationships of the webpage, construct the webpage DOM parsing tree and webpage relationship graph.
[0011] Step 3: Introduce an attention mechanism to extract webpage parsing structure features and webpage link structure features respectively, and enhance the representation of website structure features through contrastive learning.
[0012] Step 4: Integrate the semantic and structural features of the website text and classify the website services using a fully connected classifier.
[0013] Beneficial effects
[0014] This invention addresses the problems of simplistic website information extraction methods that fail to adequately consider the overall structural relationships of the website, and the impact of text noise on the accuracy of website service classification. It proposes a website service classification method that combines graph attention networks and contrastive learning, thereby improving the accuracy of website service classification.
Attached Figure Description
[0015] Figure 1 is a schematic diagram illustrating the principle of the website service classification method combining graph attention networks and contrastive learning in this invention.
[0016] Figure 2 is a diagram of the DOM parsing tree.
Detailed Implementation
[0017] To better illustrate the purpose and advantages of the present invention, the implementation methods of the present invention will be further described in detail below with reference to examples.
[0018] The experimental data came from the domain names of 2000 websites to be classified, obtained from WHOIS. The training and test sets were split in a 4:1 ratio. The experimental data for website service classification is shown in Table 1.
[0019] Table 1. Experimental Data on Website Service Classification (Number of Websites)
[0020]
[0021] The experiment uses accuracy to evaluate the results of website service classification. The accuracy calculation method is shown in formula (1).
[0022] $$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \quad (1)$$
[0023] Among them, TP is the number of positive websites predicted as positive, FN is the number of positive websites predicted as negative, FP is the number of negative websites predicted as positive, and TN is the number of negative websites predicted as negative.
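As a minimal illustration of the accuracy metric in formula (1), the sketch below computes accuracy from true and predicted labels. The function name `accuracy` and the sample labels are hypothetical and are not data from the experiment. With several service categories, accuracy reduces to the fraction of websites whose predicted category matches the true one.

```python
def accuracy(y_true, y_pred):
    """Fraction of websites whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("label lists must not be empty")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for five websites (not data from the experiment).
y_true = ["healthcare", "government", "tourism", "healthcare", "technology"]
y_pred = ["healthcare", "government", "healthcare", "healthcare", "technology"]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```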
[0024] This experiment was conducted on one computer and one server. The computer's specific configuration was: Intel i7-6700, CPU 2.40GHz, 4GB RAM, and Windows 7 64-bit operating system; the server's specific configuration was: E7-4820v4, 256GB RAM, and Linux Ubuntu 64-bit operating system.
[0025] The specific procedure for this experiment is as follows:
[0026] Step 1: Preprocess the text obtained from the website, and use the BERT pre-trained model to extract the semantic feature vector $c_i$ of the text information of each module in the webpage. Then continue using BERT to extract, from all modules $[c_1, c_2, \ldots, c_m]$ within the webpage, the website's global semantic feature vector $c$.
[0027] Step 2: Based on the HTML code and URL link relationships of the webpage, construct the webpage DOM parsing tree and webpage relationship graph.
[0028] The webpage DOM parsing tree and webpage relationship graph model the website structure from two dimensions: within the webpage and between webpages, respectively, making full use of the overall structural relationships of the website.
[0029] Step 2.1: Crawl the entire website content using the entered website domain name, and use the BeautifulSoup library to convert the website's HTML code into a DOM parsing tree structure.
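The conversion from HTML to a DOM parsing tree in step 2.1 can be sketched as follows. The description names the BeautifulSoup library; this sketch swaps in Python's standard-library `html.parser` so that it runs without third-party packages, and it skips the crawling part. The names `DomNode`, `DomTreeBuilder`, `build_dom_tree`, and `edges` are hypothetical helpers, and the input is assumed to be reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomNode:
    """One element of the DOM parsing tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomTreeBuilder(HTMLParser):
    """Builds a tree of DomNode objects while the parser walks the HTML."""

    def __init__(self):
        super().__init__()
        self.root = DomNode("[document]")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if any.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.text += data.strip()


def build_dom_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def edges(node):
    """Yield (parent_tag, child_tag) pairs, depth first."""
    for child in node.children:
        yield (node.tag, child.tag)
        yield from edges(child)


html = "<html><body><div><p>Clinic hours</p><img src='a.png'></div></body></html>"
tree = build_dom_tree(html)
for parent, child in edges(tree):
    print(parent, "->", child)
```

Each node keeps its own text, so a per-node semantic vector (such as the BERT vector of step 1) could later be attached to it as the initial node feature.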
[0030] Step 2.2: Extract URL links from web page information and construct a web page relationship graph.
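One possible way to build the webpage relationship graph of step 2.2 is sketched below using only the standard library. It assumes the pages have already been crawled and are supplied as a dict from URL to HTML; the names `LinkExtractor` and `build_page_graph`, the example URLs, and the choice to keep only same-site links between crawled pages are assumptions of this sketch, not details from the description.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_page_graph(pages):
    """pages: dict mapping page URL -> HTML string (already crawled).

    Returns an adjacency dict: URL -> sorted list of same-site URLs it links to.
    """
    graph = {url: set() for url in pages}
    for url, html in pages.items():
        site = urlparse(url).netloc
        extractor = LinkExtractor()
        extractor.feed(html)
        extractor.close()
        for href in extractor.hrefs:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            # Keep only links to other crawled pages of the same site.
            if urlparse(target).netloc == site and target in graph and target != url:
                graph[url].add(target)
    return {url: sorted(targets) for url, targets in graph.items()}


# Hypothetical two-page site.
pages = {
    "https://example.com/": '<a href="/about">About</a><a href="https://other.org/">Ext</a>',
    "https://example.com/about": '<a href="/">Home</a><a href="#top">Top</a>',
}
page_graph = build_page_graph(pages)
print(page_graph)
```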
[0031] Step 3: Extract the structural features of the webpage DOM parsing tree and webpage relationship graph.
[0032] Step 3.1: Since the quality of website content varies, some websites contain a lot of irrelevant text information. Potentially irrelevant text may receive the same attention as normal text in the model. Therefore, an attention mechanism is introduced to reduce the adverse effects of potential irrelevant text on the model.
[0033] The node semantic feature vector $c_i$ is taken as the initial node vector of the DOM parsing tree. The weight calculation between different nodes in the DOM parsing tree is shown in formula (2).
[0034] $$a_{ij} = A(W_a c_i, W_a c_j) \quad (2)$$
[0035] Where $a_{ij}$ represents the weight between node i and node j, $W_a$ is a learnable linear augmented matrix used to project the initial node representations onto the same vector space, and $A$ is the attention function. To make the attention coefficients easy to compare between different nodes, they are normalized by applying the softmax function as shown in formula (3).
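[0036] $$\alpha_{ij} = \mathrm{softmax}_j(a_{ij}) = \frac{\exp(a_{ij})}{\sum_{k=1}^{m} \exp(a_{ik})} \quad (3)$$
[0037] Where $\alpha_{ij}$ is the normalized attention weight between node i and node j, and m is the total number of nodes in the DOM parsing tree.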
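The attention computation of formulas (2) and (3) can be sketched in plain Python as follows. The description does not specify the attention function A, so this sketch assumes the form used in the original graph attention network, a LeakyReLU applied to the dot product of a weight vector with the concatenated projected features. The names `attention_weights` and `aggregate` and all numeric values are hypothetical. Following the text, the softmax runs over all m nodes; a standard graph attention layer would restrict it to each node's neighbours.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def attention_weights(features, W, a):
    """Return (alpha, projected) for m node vectors.

    features: list of m node vectors c_i
    W: shared projection matrix (W_a in formula (2))
    a: weight vector of the assumed attention function A
    """
    projected = [matvec(W, c) for c in features]
    alpha = []
    for hi in projected:
        # Raw scores a_ij = A(W c_i, W c_j) for every node j.
        scores = [leaky_relu(sum(p * q for p, q in zip(a, hi + hj)))
                  for hj in projected]
        # Softmax over j, shifted by the max for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha.append([e / total for e in exps])
    return alpha, projected


def aggregate(alpha, projected):
    """New node vectors: attention-weighted sums of the projected vectors."""
    dim = len(projected[0])
    return [[sum(row[j] * projected[j][d] for j in range(len(projected)))
             for d in range(dim)] for row in alpha]


# Hypothetical 2-dimensional features for three DOM nodes.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for illustration only
a = [0.5, -0.5, 0.25, 0.25]    # untrained attention parameters
alpha, projected = attention_weights(features, W, a)
updated = aggregate(alpha, projected)
```

Each row of `alpha` sums to 1, so every updated node vector is a convex combination of the projected node vectors.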
[0038] By using a graph attention network to learn the DOM parsing tree, we can obtain the DOM structure feature vector x′ of the webpage.
[0039] Step 3.2: Take the webpage DOM structure feature vector $x'$ as the initial node vector of the webpage relationship graph; the subsequent processing is the same as in step 3.1. The weight calculation between different nodes in the webpage relationship graph is shown in formula (4).
[0040] $$b_{ij} = A(W_b x'_i, W_b x'_j) \quad (4)$$
[0041] Where $b_{ij}$ represents the weight between webpage i and webpage j, $x'_i$ and $x'_j$ are the DOM structure feature vectors of webpages i and j, $W_b$ is a learnable linear augmented matrix used to project the initial node representations onto the same vector space, and $A$ is the attention function.
[0042] By using a graph attention network to learn the webpage relationship graph, the feature vector y′ of the webpage link structure is obtained.
[0043] Step 3.3: Enhance the website structural feature representation by maximizing the mutual information between textual features and structural features through contrastive learning. The enhanced feature vectors $x_i$ and $y_i$ of website i are obtained from its webpage DOM structure feature vector $x'_i$ and webpage link structure feature vector $y'_i$ through the enhancement functions $f_1$ and $f_2$, as shown in formulas (5) and (6).
[0044] $$x_i = f_1(x'_i) \quad (5)$$
[0045] $$y_i = f_2(y'_i) \quad (6)$$
[0046] The contrastive learning loss functions L1 and L2 are shown in equations (7) and (8).
[0047]
[0048]
[0049] Where τ is the temperature coefficient. L1 and L2 are optimized respectively, and the parameters in the enhancement functions f1 and f2 are learned to obtain the enhanced webpage DOM structure feature vector x and webpage link structure feature vector y.
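The exact loss formulas are not reproduced in this text, so the following is only a sketch of one common contrastive objective with a temperature coefficient, the InfoNCE loss, under the assumption that the structural vector and the textual vector of the same website form the positive pair while textual vectors of other websites act as negatives. The function name `info_nce`, the use of cosine similarity, and the sample vectors are assumptions of this sketch and may differ from the patented formulas. All vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(structural, textual, tau=0.5):
    """Mean InfoNCE loss over a batch.

    structural[i] and textual[i] belong to the same website (positive pair);
    textual[j] with j != i serves as a negative for structural[i].
    """
    losses = []
    for i, s in enumerate(structural):
        logits = [cosine(s, t) / tau for t in textual]
        # log-sum-exp of the logits, shifted by the max for stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical enhanced structural vectors and textual vectors for two websites.
x = [[1.0, 0.0], [0.0, 1.0]]
c = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(x, c)
mismatched = info_nce(x, list(reversed(c)))
```

In this toy batch the loss is lower when each structural vector is paired with its own textual vector, which is the behaviour a contrastive objective is meant to reward. A smaller temperature sharpens the distinction between positive and negative pairs.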
[0050] Step 4: Combine the website global semantic feature vector c, the webpage DOM structure feature vector x, and the webpage link structure feature vector y to obtain the final website feature vector [c, x, y]. Then, use a fully connected classifier to classify the website services.
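The fusion and classification of step 4 can be sketched as a concatenation followed by a single fully connected layer with softmax. The names `fuse` and `classify`, the two example categories, and the weight values are hypothetical; in practice the weights would be learned during training, and the description does not state the number of layers in the classifier.

```python
import math


def fuse(c, x, y):
    """Concatenate the three feature vectors into [c, x, y]."""
    return list(c) + list(x) + list(y)


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias, labels):
    """Single fully connected layer followed by softmax.

    weights has one row per label; each row has len(features) entries.
    """
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


# Hypothetical low-dimensional features and untrained weights.
c = [0.2, 0.8]   # global semantic feature vector
x = [1.0]        # DOM structure feature vector
y = [0.5]        # link structure feature vector
features = fuse(c, x, y)
labels = ["government", "healthcare"]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 1.0, 1.0]]
bias = [0.0, 0.0]
label, probs = classify(features, weights, bias, labels)
print(label)
```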
[0051] Test Results: Combining graph attention networks and contrastive learning to classify website services, this invention achieved an accuracy of 89.2% in classifying 400 websites from WHOIS, demonstrating good performance in website service classification.
[0052] The above detailed description further illustrates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A website service classification method combining graph attention networks and contrastive learning, characterized in that the method includes the following steps: Step 1: Extract website text features using the BERT pre-trained language model; Step 2: Based on the HTML code and URL link relationships of the web pages, model the website structure from two dimensions, within the web page and between web pages, by constructing the web page DOM parsing tree and the web page relationship graph; Step 3: Introduce an attention mechanism to extract webpage parsing structure features and website link structure features respectively, and enhance the website structure feature representation through contrastive learning. Specifically, the webpage DOM structure feature vector $x'$ is taken as the initial node vector in the webpage relationship graph, and the weights between different nodes in the webpage relationship graph are calculated as $b_{ij} = A(W_b x'_i, W_b x'_j)$, where $b_{ij}$ represents the weight between webpage i and webpage j, $x'_i$ and $x'_j$ are the DOM structure feature vectors of webpages i and j, $W_b$ is a learnable linear augmented matrix used to project the initial node representations onto the same vector space, and $A$ is an attention function; Step 4: Integrate the semantic and structural features of the website text and classify the website services using a fully connected classifier.
2. The website service classification method combining graph attention networks and contrastive learning according to claim 1, characterized in that: In step 3, the node semantic feature vector $c_i$ is taken as the initial node vector of the DOM parsing tree, and the weights between different nodes in the DOM parsing tree are calculated as $a_{ij} = A(W_a c_i, W_a c_j)$, where $a_{ij}$ represents the weight between node i and node j, and $W_a$ is a learnable linear augmented matrix used to project the initial node representations onto the same vector space.
3. The website service classification method combining graph attention networks and contrastive learning according to claim 1, characterized in that: In step 3, contrastive learning maximizes the mutual information between textual features and structural features, thereby enhancing the representation of website structural features. The contrastive learning loss function is shown in the formula, where $L_1$ and $L_2$ are the contrastive learning losses, $\tau$ is the temperature coefficient, $x_i$ and $x_j$, and $y_i$ and $y_j$, represent the enhanced DOM structure feature vectors and webpage link structure feature vectors of different webpages i and j, and $c_i$ and $c_j$ are the semantic feature vectors of the different webpages i and j.