Website performance evaluation method based on multi-modal data fusion and AI deep learning

By constructing a heterogeneous website information graph model and a multi-head self-attention graph convolutional network, combined with a cross-modal Transformer fusion encoder and a deep residual regression network, the method addresses the multimodal data fusion and correlation modeling that existing website evaluation methods handle poorly, and produces a comprehensive, accurate, and automated evaluation of website performance.

CN122243266A · Pending · Publication Date: 2026-06-19 · CLOUD GUANGXI NETWORK TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CLOUD GUANGXI NETWORK TECH CO LTD
Filing Date: 2026-03-03
Publication Date: 2026-06-19

Application Information

Patent Timeline

03 Mar 2026: Application
19 Jun 2026: Publication (CN122243266A)

IPC: G06Q10/0639; G06F18/27; G06F18/25; G06F18/2137; G06N3/042; G06N3/045; G06N3/0464

AI Technical Summary

Technical Problem

Existing website evaluation methods fail to integrate multimodal heterogeneous data effectively and neglect the complex relationships between different dimensions of a website, which leads to biased and inaccurate evaluation results. They also lack effective modeling of network graph structure data and cannot uncover the impact of inter-page relationships on overall performance.

Method used

A heterogeneous information graph model of the website is constructed. A multi-head self-attention graph convolutional network captures the propagation characteristics of performance bottlenecks in the topology. A cross-modal Transformer fusion encoder performs global feature alignment and inter-modal interaction mapping. A deep residual regression network model performs multi-task learning and outputs quantitative evaluation scores of website performance covering multiple dimensions.

Benefits of technology

The method enables comprehensive, accurate, and automated evaluation of website performance, identifies the propagation paths of performance bottlenecks, improves evaluation accuracy and model generalization ability, and provides all-round performance diagnosis and optimization guidance.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a website performance evaluation method based on multimodal data fusion and AI deep learning. It extracts heterogeneous data from the target website, including text semantics, visual layout, link topology, and temporal performance, to construct a multimodal raw feature tensor. This feature tensor is mapped to a graph space, using web pages as nodes and hyperlinks as edges, to generate a heterogeneous information graph model of the website. Iterative convolutions are performed using a multi-head self-attention graph convolutional network to capture the propagation characteristics of performance bottlenecks in the topology. Global feature alignment and inter-modal interaction mapping are achieved through a cross-modal Transformer fusion encoder, generating a globally fused embedding vector. Finally, a deep residual regression network performs multi-task learning and outputs quantitative evaluation scores covering the technical performance, search engine optimization health, content quality, and user experience dimensions, providing website operators with comprehensive, accurate, and automated performance diagnosis and optimization guidance.

Description

Technical Field

[0001] This invention relates to the technical field of website evaluation, and in particular to a website performance evaluation method based on multimodal data fusion and AI deep learning.

Background Technology

[0002] With the rapid development of internet technology and the deepening of digital transformation, websites have become a core platform for businesses to conduct business, provide services, and shape their brand image. Statistics show that the number of active websites worldwide has exceeded 2 billion, and businesses are increasingly emphasizing website quality and operational effectiveness. Website performance evaluation, as a crucial aspect of website operation and management, directly impacts a company's market competitiveness and user satisfaction. However, existing website evaluation methods are gradually revealing numerous limitations when faced with increasingly complex website architectures and diverse evaluation needs.

[0003] Currently, mainstream website performance evaluation methods can be broadly categorized into three types. The first type consists of technical testing tools based on single metrics. These tools primarily focus on technical performance indicators such as page load speed and resource optimization, measuring response time and rendering performance by simulating user access. However, this method only provides a partial technical perspective and cannot comprehensively evaluate a website's content quality, user experience, and commercial value. Furthermore, it suffers from efficiency bottlenecks when evaluating large websites across the entire site. The second type comprises search engine optimization (SEO) auditing tools based on rule checks. These tools crawl websites and examine optimization elements such as meta tags, link structure, and content duplication to generate a list of optimization suggestions. However, this method relies on pre-set rule bases, making it difficult to adapt to the rapid iteration of search engine algorithms and unable to quantify the actual impact of rule violations on overall performance. The third type is experience monitoring platforms based on user behavior analysis. These platforms evaluate user experience by collecting real user behavior data such as clicks, scrolling, and bounce rates. However, this method heavily relies on actual website traffic data, lacking applicability for newly built websites or websites with low traffic. Additionally, the interpretation of behavioral data requires significant human experience, resulting in low automation.

[0004] More importantly, the aforementioned methods all employ isolated evaluation perspectives, analyzing different dimensions of a website in isolation and neglecting the inherent interconnectedness of the website as an organic whole. In reality, there are complex interrelationships among a website's text content, visual design, link structure, and performance metrics. For example, the homepage's loading performance not only affects its own user experience score but also impacts the overall website's accessibility assessment through link relationships; layout flaws in the core navigation page can cascade and affect the usability judgment of all downstream pages. Existing methods lack the ability to model such cross-page and cross-modal correlation effects, leading to biased and inaccurate evaluation results. Furthermore, traditional methods often use linear weighting or simple rule combinations for multi-indicator fusion, failing to effectively handle dimensional differences and semantic gaps between heterogeneous data, and struggling to capture complex non-linear mapping relationships, thus limiting the generalization ability and predictive accuracy of the evaluation model.

[0005] In recent years, although some scholars have attempted to introduce machine learning methods for website quality prediction, most of these studies have remained within the scope of shallow feature engineering and traditional supervised learning, employing manually designed statistical features combined with classic algorithms such as support vector machines or random forests. These methods face problems such as high feature engineering costs, weak feature representation capabilities, and poor model scalability, and fail to fully utilize the advantages of deep learning in representation learning and end-to-end optimization. Furthermore, existing research pays little attention to the important information source of website link topology, lacks effective modeling of network graph structure data, and fails to explore the impact of inter-page relationships on overall performance.

[0006] Therefore, there is an urgent need for a new website performance evaluation method that can integrate multimodal heterogeneous data, model complex relationships, and achieve end-to-end intelligent assessment to meet the higher requirements of modern website operation and management for comprehensiveness, accuracy, and automation.

Summary of the Invention

[0007] In view of this, the present invention provides a website performance evaluation method based on multimodal data fusion and AI deep learning. The purpose is to construct a heterogeneous information graph model of the website, use a multi-head self-attention graph convolutional network to capture the propagation characteristics of performance bottlenecks in the topology, use a cross-modal Transformer fusion encoder to achieve global feature alignment and inter-modal interaction mapping, and finally output quantitative evaluation scores of website performance covering multiple dimensions through a deep residual regression network model, thereby providing website operators with comprehensive, accurate, and automated performance diagnosis and optimization guidance.

[0008] To achieve the above objectives, this invention provides a website performance evaluation method based on multimodal data fusion and AI deep learning, comprising the following steps:
C1: Extract the original feature data of the target website, including text semantic features, visual layout features, structured link topology features, and time-series performance index features, and align the original feature data by page Uniform Resource Locator (URL) and timestamp to construct a multimodal original feature tensor;
C2: Map the multimodal original feature tensor to the graph space, with web pages as nodes and hyperlinks between pages as edges, and embed the feature vectors of each dimension into the corresponding nodes to generate a website heterogeneous information graph model that integrates page attributes and topology;
C3: Perform iterative convolution on the website heterogeneous information graph model using a multi-head self-attention graph convolutional network, dynamically calculate the influence weights of nodes and their different modal features on global performance through the attention mechanism, capture the propagation characteristics of performance bottlenecks in the topology, and output the enhanced node hidden layer feature vectors;
C4: Input the enhanced node hidden layer feature vectors into the cross-modal Transformer fusion encoder, use the self-attention mechanism to perform global feature alignment and inter-modal interaction mapping, eliminate the dimensional differences and redundancy between different dimensions of data, and generate a global fusion embedding vector that represents the overall operating status of the website;
C5: Input the global fusion embedding vector into the pre-trained deep residual regression network model, perform multi-task learning through multi-layer nonlinear mapping, and calculate and output quantitative evaluation scores of website performance covering dimensions such as technical performance, search engine optimization health, content quality, and user experience.

[0009] As a further improvement of the present invention: Optionally, step C1 further includes: Full-site data collection is performed on the target website, which includes obtaining the Hypertext Markup Language (HTML) source code, Cascading Style Sheets (CSS) files, and script files of the web pages. A data collection time window is set to cover a complete business cycle of the target website. For each web page, raw feature data of four modalities is extracted: text semantic features, visual layout features, structured link topology features, and time-series performance index features.

The extraction of text semantic features involves segmenting the content of the web page's title tag, meta-description tag, body paragraphs, and heading-level content into words. A pre-trained word embedding model is used to map the segmentation results into dense vector representations. Weighted average pooling is then performed on the multiple text segment vectors from the same page to obtain the text semantic feature vector $t_i$ of the $i$-th page, where $i$ denotes the page index, $i \in \{1, 2, \dots, N\}$, and $N$ denotes the total number of pages on the target website.

The extraction of visual layout features involves rendering the web page using a headless browser, capturing screenshots of the page, and then using a convolutional neural network to extract the spatial layout features from these screenshots, thus obtaining the visual layout feature matrix $V_i \in \mathbb{R}^{H \times W}$ of the $i$-th page, where $H$ denotes the height dimension of the visual layout feature matrix, $W$ denotes its width dimension, and $\mathbb{R}$ denotes the real numbers.

The extraction of structured link topology features involves parsing the hyperlink tags in the web pages, recording the link relationships between pages, constructing a link adjacency matrix, and calculating the in-degree, out-degree, and page weight of each page to form the structured link topology feature vector $s_i$ of the $i$-th page.

The extraction of time-series performance index features involves periodically monitoring the page load time, first contentful paint (FCP) time, largest contentful paint (LCP) time, and cumulative layout shift (CLS) value within the data collection time window. The performance indices from multiple sampling moments are then organized into a time series to obtain the time-series performance index sequence $P_i \in \mathbb{R}^{T \times D_p}$, where $T$ denotes the total number of sampling times for the performance metrics and $D_p$ denotes the number of performance metric dimensions.

The original feature data of the four modalities are aligned and matched according to the page Uniform Resource Locator (URL) and the data collection timestamp. Page samples missing the original feature data of any modality are removed. The aligned feature data are then concatenated along the page index dimension, modality type dimension, and feature dimension to construct the multimodal original feature tensor $\mathcal{X} \in \mathbb{R}^{N \times M \times D}$, where $M$ denotes the total number of modality types and $D$ denotes the dimension of the feature vectors after uniform alignment.
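The alignment at the end of step C1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each modality has already been reduced to a flat vector per page URL within one collection window, and it uses zero-padding or truncation as a stand-in for the unspecified uniform alignment. The names `build_feature_tensor` and `fit_to_dim` are hypothetical.

```python
from typing import Dict, List, Tuple

MODALITIES = ["text", "visual", "link", "perf"]  # the four modalities named in step C1


def fit_to_dim(vec: List[float], dim: int) -> List[float]:
    """Zero-pad or truncate a vector to a uniform length (assumed alignment rule)."""
    return (vec + [0.0] * dim)[:dim]


def build_feature_tensor(
    features: Dict[str, Dict[str, List[float]]], dim: int
) -> Tuple[List[str], List[List[List[float]]]]:
    """Return (urls, tensor), with tensor indexed as [page][modality][feature].

    features[modality][url] holds the raw vector extracted for that page.
    Pages missing any modality are dropped, as described in step C1.
    """
    complete = set.intersection(*(set(features[m]) for m in MODALITIES))
    urls = sorted(complete)  # fixes the page index order
    tensor = [[fit_to_dim(features[m][u], dim) for m in MODALITIES] for u in urls]
    return urls, tensor
```

Sorting the surviving URLs gives every page a stable index, so the same index can later identify the page's node in the graph.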

[0010] Optionally, step C2 further includes: Based on the multimodal original feature tensor and the link adjacency matrix obtained in step C1, a website heterogeneous information graph model is constructed, represented as $G = (\mathcal{V}, \mathcal{E}, X)$, where $\mathcal{V}$ denotes the set of nodes, $\mathcal{E}$ denotes the set of edges, and $X$ denotes the node feature matrix.

The set of nodes $\mathcal{V}$ consists of all pages of the target website; the total number of nodes equals the total number of pages $N$, and the $i$-th node $v_i$ corresponds to the $i$-th page.

The set of edges $\mathcal{E}$ consists of the hyperlinks between pages. If page $i$ contains a hyperlink pointing to page $j$, a directed edge $e_{ij}$ is established from node $v_i$ to node $v_j$. The edge weight $w_{ij}$ is calculated from the positional importance of the hyperlink and the relevance of its anchor text, specifically:

$$w_{ij} = \lambda_p \, p_{ij} + \lambda_a \, a_{ij}$$

where $\lambda_p$ denotes the position weighting coefficient, $p_{ij}$ denotes the link position score of the link from page $i$ to page $j$, $\lambda_a$ denotes the anchor text weighting coefficient, and $a_{ij}$ denotes the anchor text relevance score of the link from page $i$ to page $j$, satisfying $\lambda_p + \lambda_a = 1$.

The node feature matrix $X$ is obtained from the multimodal original feature tensor through a feature embedding transformation. For the $i$-th node, its corresponding text semantic feature vector, visual layout feature matrix, structured link topology feature vector, and time-series performance index sequence undergo dimensionality normalization and feature fusion to obtain the initial feature vector $x_i$ of the $i$-th node with a unified dimension. The initial feature vectors $x_i$ of all nodes are stacked to obtain the node feature matrix $X \in \mathbb{R}^{N \times d_0}$, where $d_0$ denotes the dimension of the initial feature vector of a node.
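A minimal sketch of the edge-weight calculation in step C2 follows. It assumes the position and anchor-text coefficients sum to 1, so that only one of them needs to be supplied. The default value 0.6, the dictionary keys, and the rule for duplicate links between the same pair of pages are illustrative assumptions rather than details taken from the patent.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def edge_weight(pos_score: float, anchor_score: float, lambda_p: float = 0.6) -> float:
    """Combine the link position score and the anchor-text relevance score.

    lambda_p is the position coefficient; the anchor coefficient is taken as
    1 - lambda_p (assumed constraint). 0.6 is an illustrative value only.
    """
    if not 0.0 <= lambda_p <= 1.0:
        raise ValueError("lambda_p must lie in [0, 1]")
    return lambda_p * pos_score + (1.0 - lambda_p) * anchor_score


def build_weighted_graph(links: List[dict], lambda_p: float = 0.6) -> Dict[Edge, float]:
    """Map each directed hyperlink (source -> target) to its edge weight."""
    graph: Dict[Edge, float] = {}
    for link in links:
        key = (link["source"], link["target"])
        w = edge_weight(link["pos_score"], link["anchor_score"], lambda_p)
        # Assumed rule: when a page links to the same target more than once,
        # keep the strongest link.
        graph[key] = max(w, graph.get(key, 0.0))
    return graph
```

Because the graph is directed, `("/", "/about")` and `("/about", "/")` are separate keys and can carry different weights.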

[0011] Optionally, step C3 further includes: A multi-head self-attention graph convolutional network is constructed, comprising $L$ graph attention convolutional layers, each containing $K$ attention heads, where $K$ denotes the number of attention heads. The website heterogeneous information graph model obtained in step C2 is used as the input to the multi-head self-attention graph convolutional network, and $L$ rounds of iterative convolution are performed.

In the $l$-th graph attention convolutional layer, for the $i$-th node $v_i$, the attention coefficient $\alpha_{ij}^{(l,k)}$ between it and each neighboring node $v_j$ is calculated in the $k$-th attention head. Based on the calculated attention coefficients $\alpha_{ij}^{(l,k)}$, the feature information of the neighboring nodes is aggregated in a weighted manner to update the hidden feature vector $h_i^{(l,k)}$ of node $v_i$ in the $k$-th attention head of the $l$-th layer, specifically:

$$h_i^{(l,k)} = \mathrm{ELU}\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l,k)} \, W^{(l,k)} \, h_j^{(l-1)} \right)$$

where $\mathrm{ELU}(\cdot)$ denotes the exponential linear unit activation function; $W^{(l,k)}$ denotes the linear transformation weight matrix of the $k$-th attention head in the $l$-th layer; $h_j^{(l-1)}$ denotes the hidden feature vector of neighboring node $v_j$ at layer $l-1$; and $\mathcal{N}(i)$ denotes the set of neighboring nodes of node $v_i$.

The hidden feature vectors output by all attention heads in the $l$-th layer are concatenated to obtain the final hidden layer feature vector $h_i^{(l)}$ of node $v_i$ at the $l$-th layer. After iterative convolution through the $L$ graph attention convolutional layers, the enhanced node hidden layer feature vector $h_i^{(L)}$ is output. The enhanced node hidden layer feature vector integrates the node's own multimodal attribute information and the neighborhood propagation information in the network topology.

[0012] This step effectively captures the propagation characteristics of performance bottlenecks between website pages through the iterative convolution mechanism of a multi-head self-attention graph convolutional network. Website performance issues are often not isolated; slow loading of one page can affect the user experience of other pages accessed via links, and layout problems on key navigation pages can impact the usability assessment of the entire website. Traditional independent page analysis methods cannot capture this interconnected effect in the topology. This step, however, dynamically calculates the influence weights between nodes through an attention mechanism, adaptively aggregating neighborhood information based on link relationships and feature similarity between pages. This allows the impact of performance bottlenecks to propagate within the graph structure and be accurately identified, providing a more accurate and comprehensive feature representation for subsequent global performance evaluation.
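The per-layer update described for step C3 (attention-weighted aggregation of neighbor features, ELU activation, concatenation across heads) can be illustrated with a small pure-Python sketch. The patent text here does not give the formula for the attention coefficients, so this sketch assumes a dot-product score between transformed features followed by a softmax over the neighborhood. The function names, and the fallback that treats a node with no listed neighbors as its own neighbor, are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def elu(x: float) -> float:
    return x if x > 0 else math.exp(x) - 1.0


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(
    h: Dict[int, Vector], neighbors: Dict[int, List[int]], heads: List[Matrix]
) -> Dict[int, Vector]:
    """One multi-head graph attention layer; heads holds one weight matrix per head."""
    out: Dict[int, Vector] = {}
    for i, nbrs in neighbors.items():
        nbrs = nbrs or [i]  # assumed fallback: an isolated node attends to itself
        concat: Vector = []
        for w in heads:
            zi = matvec(w, h[i])
            zs = [matvec(w, h[j]) for j in nbrs]
            # Assumed scoring rule: dot product, normalised over the neighborhood.
            alpha = softmax([sum(a * b for a, b in zip(zi, zj)) for zj in zs])
            agg = [sum(a * zj[d] for a, zj in zip(alpha, zs)) for d in range(len(zi))]
            concat.extend(elu(v) for v in agg)
        out[i] = concat  # concatenation of all head outputs
    return out
```

Calling `gat_layer` repeatedly, feeding each output back in as `h`, corresponds to the rounds of iterative convolution. Each round lets information travel one more hop along the links.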

[0013] Optionally, step C4 further includes: A cross-modal Transformer fusion encoder is constructed, comprising $L_T$ Transformer encoding layers, each containing a multi-head self-attention sublayer and a feedforward neural network sublayer. The enhanced hidden feature vectors of all nodes obtained in step C3 are organized into a sequence according to their node indices to form the node feature sequence $h_1^{(L)}, h_2^{(L)}, \dots, h_N^{(L)}$, which is used as the input to the cross-modal Transformer fusion encoder.

In the $l$-th Transformer encoding layer of the cross-modal Transformer fusion encoder, multi-head self-attention is computed first: for the feature vector of each node $v_i$, the global attention weight $\beta_{ij}^{(l,k)}$ between it and the feature vector of every node $v_j$ is calculated in the $k$-th attention head. Based on the calculated global attention weights, the feature information of all nodes is aggregated in a weighted manner to calculate the attention output feature vector $o_i^{(l,k)}$ of node $v_i$ in the $k$-th attention head of the $l$-th layer, specifically:

$$o_i^{(l,k)} = \sum_{j=1}^{N} \beta_{ij}^{(l,k)} \, W_V^{(l,k)} \, z_j^{(l-1)}$$

where $W_V^{(l,k)}$ denotes the value weight matrix of the $k$-th attention head in the $l$-th layer; $z_j^{(l-1)} \in \mathbb{R}^{d_z}$ denotes the output feature vector of node $v_j$ at the $(l-1)$-th Transformer encoding layer; and $d_z$ denotes the dimension of that output feature vector.

The attention output feature vectors of all attention heads in the $l$-th layer are concatenated and, after an output linear transformation and a residual connection, yield the output feature vector $\tilde{z}_i^{(l)}$ of the multi-head self-attention sublayer. This vector is input into the feedforward neural network sublayer, which contains two fully connected layers and a residual connection, to compute the final output feature vector $z_i^{(l)}$ of node $v_i$ at the $l$-th Transformer encoding layer.

After processing by the $L_T$ Transformer encoding layers, global pooling is performed on the output feature vectors of all nodes at the last Transformer encoding layer to generate the global fusion embedding vector $g$ representing the overall operating state of the website, specifically:

$$g = \mathrm{Pool}\left( z_1^{(L_T)}, z_2^{(L_T)}, \dots, z_N^{(L_T)} \right)$$

The global fusion embedding vector $g$ integrates the feature information from all pages of the website across four modalities (text semantics, visual layout, link topology, and temporal performance), eliminating dimensional differences and data redundancy between the modalities and forming a unified global representation.
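The core of step C4, global attention over the node sequence followed by pooling into one site-level vector, can be sketched as below. The sketch shows a single attention head with the standard scaled dot-product weighting and uses mean pooling. Both are assumptions, since the pooling type and the attention-weight formula are not recoverable from the text here. The output projection, residual connections, and feedforward sublayer are omitted for brevity, and all names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(seq: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> List[Vector]:
    """Every node attends to every node; returns one output vector per node."""
    q = [matvec(wq, x) for x in seq]
    k = [matvec(wk, x) for x in seq]
    v = [matvec(wv, x) for x in seq]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        # Global attention weights of this node over all nodes (assumed scaled dot product).
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


def mean_pool(seq: List[Vector]) -> Vector:
    """Assumed pooling choice: average the node vectors into one global embedding."""
    n = len(seq)
    return [sum(x[d] for x in seq) / n for d in range(len(seq[0]))]
```

Unlike the graph attention of step C3, the weights here are computed between all pairs of pages, so two pages that are far apart in the link graph can still influence each other's representation.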

[0014] Optionally, step C5 further includes: A deep residual regression network model is constructed, comprising $R$ residual blocks and a multi-task output layer. In the $m$-th residual block, nonlinear transformations through two fully connected layers and a residual connection are performed to compute the output feature vector $r^{(m)}$ of the $m$-th residual block. After processing by the $R$ residual blocks, the output feature vector $r^{(R)}$ of the last residual block is input into the multi-task output layer, which contains four independent output branches corresponding to the technical performance evaluation score, the search engine optimization health evaluation score, the content quality evaluation score, and the user experience evaluation score, respectively.

The technical performance evaluation score $y_{\text{tech}}$ is specifically:

$$y_{\text{tech}} = \sigma\left( w_{\text{tech}}^{\top} r^{(R)} + b_{\text{tech}} \right)$$

where $w_{\text{tech}} \in \mathbb{R}^{d_r}$ denotes the weight vector of the technical performance evaluation branch, $d_r$ denotes the dimension of that weight vector, $\top$ denotes the transpose operation, $b_{\text{tech}}$ denotes the bias term of the technical performance evaluation branch, and $\sigma(\cdot)$ denotes the activation function of the output layer.

The search engine optimization health evaluation score $y_{\text{seo}}$ is specifically:

$$y_{\text{seo}} = \sigma\left( w_{\text{seo}}^{\top} r^{(R)} + b_{\text{seo}} \right)$$

where $w_{\text{seo}}$ denotes the weight vector and $b_{\text{seo}}$ the bias term of the search engine optimization health evaluation branch.

The content quality evaluation score $y_{\text{cont}}$ is specifically:

$$y_{\text{cont}} = \sigma\left( w_{\text{cont}}^{\top} r^{(R)} + b_{\text{cont}} \right)$$

where $w_{\text{cont}}$ denotes the weight vector and $b_{\text{cont}}$ the bias term of the content quality evaluation branch.

The user experience evaluation score $y_{\text{ux}}$ is specifically:

$$y_{\text{ux}} = \sigma\left( w_{\text{ux}}^{\top} r^{(R)} + b_{\text{ux}} \right)$$

where $w_{\text{ux}}$ denotes the weight vector and $b_{\text{ux}}$ the bias term of the user experience evaluation branch.

The deep residual regression network model is pre-trained by multi-task joint training on a labeled dataset containing website samples with known performance ratings. The training process uses a weighted combination of mean squared error loss functions as the overall loss function, specifically:

$$\mathcal{L} = \gamma_{\text{tech}} \mathcal{L}_{\text{tech}} + \gamma_{\text{seo}} \mathcal{L}_{\text{seo}} + \gamma_{\text{cont}} \mathcal{L}_{\text{cont}} + \gamma_{\text{ux}} \mathcal{L}_{\text{ux}}$$

where $\gamma_{\text{tech}}$, $\gamma_{\text{seo}}$, $\gamma_{\text{cont}}$, and $\gamma_{\text{ux}}$ denote the weighting coefficients of the technical performance, search engine optimization health, content quality, and user experience losses, respectively; $\mathcal{L}_{\text{tech}}$, $\mathcal{L}_{\text{seo}}$, $\mathcal{L}_{\text{cont}}$, and $\mathcal{L}_{\text{ux}}$ denote the mean squared error losses between the predicted and actual values of the corresponding dimensions; and the coefficients satisfy $\gamma_{\text{tech}} + \gamma_{\text{seo}} + \gamma_{\text{cont}} + \gamma_{\text{ux}} = 1$.

The global fusion embedding vector $g$ obtained in step C4 is input into the pre-trained deep residual regression network model, which outputs the quantitative evaluation scores of website performance across four dimensions: the technical performance evaluation score $y_{\text{tech}}$, the search engine optimization health evaluation score $y_{\text{seo}}$, the content quality evaluation score $y_{\text{cont}}$, and the user experience evaluation score $y_{\text{ux}}$.

[0015] This step utilizes a multi-task learning framework to achieve a comprehensive, multi-dimensional evaluation of website performance. Website quality evaluation is a multifaceted issue. Technical performance focuses on underlying metrics such as page load speed and response time; search engine optimization (SEO) health assesses the website's visibility and ranking potential in search engines; content quality measures the richness and value of the website's information; and user experience reflects visitor satisfaction and usability. These four dimensions are inherently interconnected but each has its own emphasis, and a single-dimensional evaluation cannot fully reflect the website's overall performance level. This step employs a multi-task output structure of a deep residual regression network model, simultaneously learning the four evaluation tasks by sharing underlying feature representations. This leverages the correlations between tasks to improve overall prediction accuracy while providing website operators with a comprehensive performance diagnostic report, guiding targeted optimization and improvement efforts.
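The multi-task structure of step C5 can be sketched in plain Python: a residual block with two fully connected layers, four independent scoring branches on the shared representation, and a weighted sum of per-task mean squared errors. The sketch assumes a ReLU inside the block and a sigmoid as the output activation, since the text names neither. The task keys and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(w: Matrix, b: Vector, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def residual_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two fully connected layers plus a skip connection (ReLU is an assumption)."""
    hidden = [max(0.0, v) for v in dense(w1, b1, x)]
    return [xi + fi for xi, fi in zip(x, dense(w2, b2, hidden))]


def score_heads(r: Vector, heads: Dict[str, Tuple[Vector, float]]) -> Dict[str, float]:
    """One independent branch per task: sigmoid(w . r + b), giving a score in (0, 1)."""
    scores = {}
    for task, (w, b) in heads.items():
        logit = sum(wi * ri for wi, ri in zip(w, r)) + b
        scores[task] = 1.0 / (1.0 + math.exp(-logit))
    return scores


def weighted_mse(
    preds: List[Dict[str, float]], targets: List[Dict[str, float]], weights: Dict[str, float]
) -> float:
    """Overall loss: weighted sum of the per-task mean squared errors."""
    total = 0.0
    for task, lam in weights.items():
        mse = sum((p[task] - t[task]) ** 2 for p, t in zip(preds, targets)) / len(preds)
        total += lam * mse
    return total
```

All four branches read the same vector `r`, which is what lets the tasks share underlying features. The loss weights control how strongly each task shapes that shared representation during training.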

[0016] Compared with the prior art, the present invention has at least the following beneficial effects: This invention effectively addresses the problem of traditional methods neglecting the inter-page correlation effect by constructing a heterogeneous website information graph model and using a multi-head self-attention graph convolutional network for iterative convolution. The modeling approach, which maps website pages to graph nodes and hyperlink relationships to graph edges, enables the model to propagate and aggregate feature information within the graph topology. Through a dynamic calculation of the influence weights between nodes using an attention mechanism, the model can adaptively identify the propagation paths of critical pages and performance bottlenecks, such as the cascading impact of slow homepage loading on the user experience of downstream pages, and the ripple effect of core navigation page layout problems on the overall site usability.

[0017] This invention employs a cross-modal Transformer fusion encoder to achieve deep fusion of four heterogeneous modalities: text semantics, visual layout, link topology, and temporal performance, overcoming the limitations of traditional linear weighting methods in processing multimodal data. Global feature alignment is performed through a self-attention mechanism, enabling the model to automatically learn the correlations and complementarities between the features of different modalities, eliminate dimensional differences and data redundancy, and generate a unified global fusion embedding vector. This end-to-end deep fusion approach avoids the high cost and subjectivity of manual feature engineering, exploits the complex nonlinear relationships inherent in heterogeneous data, and improves the model's ability to represent and evaluate the overall operational status of a website.

[0018] This invention utilizes a multi-task learning framework based on a deep residual regression network model to achieve a comprehensive quantitative evaluation of website performance across multiple dimensions, providing website operators with more practical and actionable diagnostic reports. The four evaluation dimensions (technical performance, search engine optimization health, content quality, and user experience) are interconnected, yet each has its own emphasis. The multi-task output structure learns all four evaluation tasks simultaneously by sharing underlying feature representations. This not only leverages the correlation between tasks to improve overall prediction accuracy but also outputs specific scores for the different dimensions.

Attached Figure Description

[0019] Figure 1 is a flowchart of a website performance evaluation method based on multimodal data fusion and AI deep learning according to an embodiment of the present invention.

Figure 2 is a visualization of the attention weights of the cross-modal Transformer fusion encoder: (a) heatmap of the global attention weight matrix; (b) distribution of the attention weights of the homepage node across all pages.

Detailed Implementation

[0020] The present invention will be further described below with reference to the accompanying drawings, but this is not intended to limit the present invention in any way. Any modifications or substitutions made based on the teachings of the present invention shall fall within the protection scope of the present invention.

[0021] Example 1: A website performance evaluation method based on multimodal data fusion and AI deep learning, as shown in Figure 1, includes the following steps:

C1: Extract the raw feature data of the target website, including text semantic features, visual layout features, structured link topology features, and time-series performance index features; align the raw feature data by page Uniform Resource Locator (URL) and timestamp; and construct a multimodal raw feature tensor. Step C1 includes the following.

Full-site data collection is performed on the target website, including obtaining the Hypertext Markup Language (HTML) source code, Cascading Style Sheets (CSS) files, and script files of the web pages. A data collection time window is set to cover a complete business cycle of the target website. For each web page, raw feature data of four modalities are extracted: text semantic features, visual layout features, structured link topology features, and time-series performance index features.

Text semantic features: the content of the page's title tag, meta description tag, body paragraphs, and heading-level elements is segmented into words. A pre-trained word embedding model maps the segmentation results to dense vector representations, and weighted average pooling over the text segment vectors of the same page yields the text semantic feature vector $\mathbf{t}_i$ of the $i$-th page, where $i$ is the page index, $i = 1, 2, \ldots, N$, and $N$ is the total number of pages of the target website.

Visual layout features: the page is rendered in a headless browser, a screenshot of the page is captured, and a convolutional neural network extracts the spatial layout features from the screenshot, yielding the visual layout feature matrix $\mathbf{V}_i \in \mathbb{R}^{H \times W}$ of the $i$-th page, where $H$ is the height dimension of the visual layout feature matrix, $W$ is its width dimension, and $\mathbb{R}$ denotes the real numbers.

Structured link topology features: the hyperlink tags in the web pages are parsed, the link relationships between pages are recorded, and a link adjacency matrix is constructed. The in-degree, out-degree, and page weight value of each page are calculated to form the structured link topology feature vector $\mathbf{s}_i$ of the $i$-th page. As an alternative implementation, the page weight value can also be calculated using the HITS algorithm, which computes the authority value and the hub value of each page at the same time; these values are incorporated into $\mathbf{s}_i$ as additional topological feature components.

Time-series performance index features: within the data collection time window, the page load time, first contentful paint time, largest contentful paint time, and cumulative layout shift value are monitored periodically. The performance indices from the sampling moments are organized into a time series, yielding the time-series performance index sequence $\mathbf{P}_i \in \mathbb{R}^{T \times d_p}$, where $T$ is the total number of sampling moments and $d_p$ is the number of performance index dimensions.

The raw feature data of the four modalities are aligned and matched by page URL and data collection timestamp, and page samples missing the raw feature data of any modality are removed. The aligned feature data are then stacked along the page index dimension, the modality type dimension, and the feature dimension to construct the multimodal raw feature tensor $\mathcal{X} \in \mathbb{R}^{N \times M \times D}$, where $M$ is the total number of modality types ($M = 4$ in this embodiment) and $D$ is the dimension of the feature vectors after uniform alignment.
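The structured link topology features of step C1 can be illustrated with a short sketch. It is not the patent's implementation: it assumes a small 0/1 adjacency matrix, uses the HITS alternative mentioned above in place of the unspecified page weight value, and all function names (`degree_features`, `hits`, `topology_vectors`) are hypothetical.

```python
import math

def degree_features(adj):
    """adj[i][j] == 1 means page i links to page j. Returns (in_degree, out_degree)."""
    n = len(adj)
    out_deg = [sum(adj[i]) for i in range(n)]
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return in_deg, out_deg

def hits(adj, iterations=50):
    """Power iteration for HITS; returns L2-normalized (authority, hub) scores."""
    n = len(adj)
    auth, hub = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hub scores of pages linking to j.
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(a * a for a in auth)) or 1.0
        auth = [a / norm for a in auth]
        # Hub of page i: sum of authority scores of pages that i links to.
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return auth, hub

def topology_vectors(adj):
    """One feature vector per page: [in_degree, out_degree, authority, hub]."""
    in_deg, out_deg = degree_features(adj)
    auth, hub = hits(adj)
    return [[in_deg[i], out_deg[i], auth[i], hub[i]] for i in range(len(adj))]

# Hypothetical three-page site: page 0 links to 1 and 2, page 1 links to 2.
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]
features = topology_vectors(adjacency)
```

In this toy site the page with the most in-links receives the highest authority score and the page with the most out-links the highest hub score, which is the kind of signal the topology feature vector is meant to carry.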

[0022] C2: Map the multimodal raw feature tensor to a graph space, with web pages as nodes and hyperlinks between pages as edges, and embed the feature vectors of each dimension into the corresponding nodes to generate a website heterogeneous information graph model that integrates page attributes and topological structure. Step C2 includes the following.

Based on the multimodal raw feature tensor and the link adjacency matrix obtained in step C1, the website heterogeneous information graph model is constructed and represented as $G = (\mathcal{V}, \mathcal{E}, \mathbf{X})$, where $\mathcal{V}$ is the node set, $\mathcal{E}$ is the edge set, and $\mathbf{X}$ is the node feature matrix.

The node set $\mathcal{V}$ consists of all pages of the target website. The total number of nodes equals the total number of pages $N$, and the $i$-th node $v_i$ corresponds to the $i$-th page.

The edge set $\mathcal{E}$ consists of the hyperlinks between pages. If page $i$ contains a hyperlink pointing to page $j$, a directed edge $e_{ij}$ is established from node $v_i$ to node $v_j$. The edge weight $w_{ij}$ is calculated from the positional importance of the hyperlink and the relevance of its anchor text, specifically:

$$w_{ij} = \lambda_p \, p_{ij} + \lambda_q \, q_{ij}$$

where $\lambda_p$ is the position weight coefficient, $p_{ij}$ is the link position score of the link from page $i$ to page $j$, $\lambda_q$ is the anchor text weight coefficient, and $q_{ij}$ is the anchor text relevance score of the link from page $i$ to page $j$, with $\lambda_p$ and $\lambda_q$ subject to a joint constraint.

The node feature matrix $\mathbf{X}$ is obtained from the multimodal raw feature tensor by a feature embedding transformation. For the $i$-th node, the corresponding text semantic feature vector $\mathbf{t}_i$, visual layout feature matrix $\mathbf{V}_i$, structured link topology feature vector $\mathbf{s}_i$, and time-series performance index sequence $\mathbf{P}_i$ undergo dimension normalization and feature fusion to give the initial feature vector $\mathbf{x}_i$ of the $i$-th node with a unified dimension. The initial feature vectors of all nodes are stacked to obtain the node feature matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, where $d$ is the dimension of the node initial feature vector.

In this embodiment, the node initial feature vector $\mathbf{x}_i$ is computed by applying a separate embedding weight matrix to each modality, where $\mathbf{W}_t$ is the embedding weight matrix for text semantic features, $d_t$ is the original dimension of the text semantic feature vector $\mathbf{t}_i$, $\mathbf{W}_v$ is the embedding weight matrix for visual layout features, $\operatorname{Flatten}(\cdot)$ is the matrix flattening operation, $\mathbf{W}_s$ is the embedding weight matrix for structured link topology features, $d_s$ is the original dimension of the structured link topology feature vector $\mathbf{s}_i$, and $\mathbf{W}_p$ is the embedding weight matrix for time-series performance index features.
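As an illustration of the edge construction in step C2, the sketch below builds weighted directed edges from a list of links. It assumes the edge weight is a weighted sum of the link position score and the anchor text relevance score, and that the two coefficients sum to 1; both the coefficient values (0.6 and 0.4) and the names `edge_weight` and `build_edges` are hypothetical and are not taken from the patent.

```python
def edge_weight(position_score, anchor_score, lambda_p=0.6, lambda_q=0.4):
    """Weighted combination of the two link scores (coefficients are assumed values)."""
    # Assumption: the coefficients are normalized so that they sum to 1.
    if abs(lambda_p + lambda_q - 1.0) > 1e-9:
        raise ValueError("lambda_p and lambda_q are assumed to sum to 1")
    return lambda_p * position_score + lambda_q * anchor_score

def build_edges(links, lambda_p=0.6, lambda_q=0.4):
    """links: iterable of (source_url, target_url, position_score, anchor_score).

    Returns a dict mapping (source, target) to the edge weight, and a dict
    mapping each source page to the list of pages it points to.
    """
    edges, neighbors = {}, {}
    for src, dst, pos, anc in links:
        edges[(src, dst)] = edge_weight(pos, anc, lambda_p, lambda_q)
        neighbors.setdefault(src, []).append(dst)
    return edges, neighbors

# Hypothetical links: a prominent navigation link and a footer link.
example_links = [
    ("/", "/docs", 1.0, 0.5),
    ("/docs", "/", 0.2, 0.0),
]
edges, neighbors = build_edges(example_links)
```

A prominent link with relevant anchor text therefore yields a heavier edge than a footer link with generic anchor text, which is how positional importance and anchor relevance enter the graph.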

[0023] C3: Apply a multi-head self-attention graph convolutional network to perform iterative convolution on the website heterogeneous information graph model. The attention mechanism dynamically calculates the influence weights of the nodes and their different modal features on global performance, captures how performance bottlenecks propagate through the topology, and outputs enhanced node hidden-layer feature vectors. Step C3 includes the following.

A multi-head self-attention graph convolutional network is constructed, comprising $L$ graph attention convolutional layers, each containing $K$ attention heads, where $K$ is the number of attention heads. The website heterogeneous information graph model obtained in step C2 is used as the input of the network, and $L$ rounds of iterative convolution are performed.

In the $l$-th graph attention convolutional layer, for the $i$-th node $v_i$, the attention coefficient between the node and each of its neighboring nodes $v_j$ is calculated, a neighboring node being a node connected to $v_i$ by an edge in the edge set $\mathcal{E}$. In this embodiment, the attention coefficient $\alpha_{ij}^{(l,k)}$ of the $k$-th attention head between node $v_i$ and neighboring node $v_j$ is specifically:

$$\alpha_{ij}^{(l,k)} = \frac{\exp\left(\operatorname{LeakyReLU}\left(\mathbf{a}^{(l,k)\top}\left[\mathbf{W}^{(l,k)}\mathbf{h}_i^{(l-1)} \,\Vert\, \mathbf{W}^{(l,k)}\mathbf{h}_j^{(l-1)}\right]\right)\right)}{\sum_{n \in \mathcal{N}(i)} \exp\left(\operatorname{LeakyReLU}\left(\mathbf{a}^{(l,k)\top}\left[\mathbf{W}^{(l,k)}\mathbf{h}_i^{(l-1)} \,\Vert\, \mathbf{W}^{(l,k)}\mathbf{h}_n^{(l-1)}\right]\right)\right)}$$

where $\mathbf{a}^{(l,k)}$ is the attention parameter vector of the $k$-th attention head in layer $l$; $d_h$ is the dimension of the hidden-layer feature vectors; $\top$ denotes the vector transpose; $\mathbf{W}^{(l,k)}$ is the linear transformation weight matrix of the $k$-th attention head in layer $l$; $d^{(l-1)}$ is the input feature dimension of layer $l$; $\mathbf{h}_i^{(l-1)}$, $\mathbf{h}_j^{(l-1)}$, and $\mathbf{h}_n^{(l-1)}$ are the hidden feature vectors of node $v_i$, neighboring node $v_j$, and neighboring node $v_n$ at layer $l-1$; $\Vert$ denotes vector concatenation; $\operatorname{LeakyReLU}(\cdot)$ is the leaky rectified linear unit activation function; $\mathcal{N}(i)$ is the set of neighboring nodes of node $v_i$; $n$ is any node index in $\mathcal{N}(i)$; and $\exp(\cdot)$ is the natural exponential function.

The input feature vector of the first graph attention convolutional layer is the node initial feature vector obtained in step C2, i.e. $\mathbf{h}_i^{(0)} = \mathbf{x}_i$.

Based on the calculated attention coefficients $\alpha_{ij}^{(l,k)}$, the hidden feature vector $\mathbf{h}_i^{(l,k)}$ of the $k$-th attention head of node $v_i$ in layer $l$ is updated by weighted aggregation of the feature information of the neighboring nodes, specifically:

$$\mathbf{h}_i^{(l,k)} = \operatorname{ELU}\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l,k)} \, \mathbf{W}^{(l,k)} \mathbf{h}_j^{(l-1)}\right)$$

where $\operatorname{ELU}(\cdot)$ is the exponential linear unit activation function.

The hidden feature vectors output by all attention heads in layer $l$ are concatenated to obtain the final hidden-layer feature vector $\mathbf{h}_i^{(l)}$ of node $v_i$ in layer $l$. In this embodiment, specifically:

$$\mathbf{h}_i^{(l)} = \big\Vert_{k=1}^{K} \mathbf{h}_i^{(l,k)}$$

After iterative convolution through the $L$ graph attention convolutional layers, the enhanced node hidden-layer feature vector $\mathbf{h}_i^{(L)}$ is output. The enhanced node hidden-layer feature vector integrates the node's own multimodal attribute information and the neighborhood propagation information in the network topology.
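The attention and aggregation steps of one graph attention layer in step C3 can be sketched as follows. This is a minimal pure-Python illustration of the standard graph attention computation the text describes (leaky-ReLU scoring of concatenated transformed features, normalization with the exponential function over the neighbor set, ELU after aggregation, concatenation across heads). The function names, the leaky slope of 0.2, and the toy inputs are assumptions, and the sketch requires each node to have at least one neighbor.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(v, slope=0.2):  # slope value is an assumed hyperparameter
    return v if v > 0 else slope * v

def elu(v):
    return v if v > 0 else math.exp(v) - 1.0

def attention_coefficients(i, neighbors, h, W, a):
    """Normalized attention of node i over its neighbor list, for one head."""
    wh_i = matvec(W, h[i])
    scores = []
    for j in neighbors:
        concat = wh_i + matvec(W, h[j])  # list concatenation: [W h_i || W h_j]
        scores.append(leaky_relu(sum(p * q for p, q in zip(a, concat))))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_head(i, neighbors, h, W, a):
    """Attention-weighted sum of transformed neighbor features, then ELU."""
    alpha = attention_coefficients(i, neighbors, h, W, a)
    out = [0.0] * len(W)
    for coef, j in zip(alpha, neighbors):
        out = [o + coef * v for o, v in zip(out, matvec(W, h[j]))]
    return [elu(v) for v in out]

def gat_layer(i, neighbors, h, heads):
    """heads: list of (W, a) pairs; per-head outputs are concatenated."""
    out = []
    for W, a in heads:
        out.extend(gat_head(i, neighbors, h, W, a))
    return out

# Toy inputs: three nodes with 2-dimensional features, node 0 has neighbors 1 and 2.
h0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
a_vec = [0.1, 0.2, 0.3, 0.4]
alpha = attention_coefficients(0, [1, 2], h0, identity, a_vec)
updated = gat_layer(0, [1, 2], h0, [(identity, a_vec), (identity, a_vec)])
```

Because the two neighbors in the toy input carry identical features, they receive equal attention; with distinct features, the learned parameters would shift weight toward the neighbors that matter more for the node.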

[0024] C4: Input the enhanced node hidden-layer feature vectors into a cross-modal Transformer fusion encoder, and use the self-attention mechanism to perform global feature alignment and inter-modal interaction mapping, eliminating the scale differences and redundancy between data of different dimensions and generating a global fusion embedding vector that represents the overall operating state of the website. Step C4 includes the following.

A cross-modal Transformer fusion encoder is constructed, comprising $L_T$ Transformer encoding layers, each containing a multi-head self-attention sublayer and a feedforward neural network sublayer. The enhanced hidden feature vectors of all nodes obtained in step C3 are organized by node index into a node feature sequence $\mathbf{h}_1^{(L)}, \mathbf{h}_2^{(L)}, \ldots, \mathbf{h}_N^{(L)}$, which is used as the input of the cross-modal Transformer fusion encoder.

In the $m$-th Transformer encoding layer of the cross-modal Transformer fusion encoder, multi-head self-attention is computed first: for the feature vector of each node $v_i$, the global attention weights between it and the feature vectors of all nodes are calculated. In this embodiment, the global attention weight $\beta_{ij}^{(m,c)}$ of the $c$-th attention head between node $v_i$ and node $v_j$ is specifically:

$$\beta_{ij}^{(m,c)} = \frac{\exp\left(\left(\mathbf{W}_Q^{(m,c)}\mathbf{z}_i^{(m-1)}\right)^{\top}\left(\mathbf{W}_K^{(m,c)}\mathbf{z}_j^{(m-1)}\right) / \sqrt{d_k}\right)}{\sum_{n=1}^{N} \exp\left(\left(\mathbf{W}_Q^{(m,c)}\mathbf{z}_i^{(m-1)}\right)^{\top}\left(\mathbf{W}_K^{(m,c)}\mathbf{z}_n^{(m-1)}\right) / \sqrt{d_k}\right)}$$

where $c$ is the attention head index in the Transformer encoding layer; $C$ is the number of attention heads in the Transformer encoding layer; $\mathbf{W}_Q^{(m,c)}$ is the query weight matrix of the $c$-th attention head in layer $m$; $d_k$ is the feature dimension of the attention computation; $d_z$ is the input feature dimension of the Transformer encoding layer; $\mathbf{z}_i^{(m-1)}$ is the output feature vector of node $v_i$ at the $(m-1)$-th Transformer encoding layer; $\mathbf{W}_K^{(m,c)}$ is the key weight matrix of the $c$-th attention head in layer $m$; $\mathbf{z}_j^{(m-1)}$ is the output feature vector of node $v_j$ at the $(m-1)$-th Transformer encoding layer; $\sqrt{d_k}$ is the attention scaling factor; $n$ is the node index, $n = 1, 2, \ldots, N$; and $\mathbf{z}_n^{(m-1)}$ is the output feature vector of node $v_n$ at the $(m-1)$-th Transformer encoding layer.

Figure 2 shows heatmap visualization results of the global attention weight matrix of the first layer of the Transformer fusion encoder for a test website: Figure 2(a) is a global heatmap of the attention weight matrix, and Figure 2(b) is a bar chart of the attention weight distribution of the homepage node, showing the attention weights of the homepage to all other pages.

The input feature vector of the first Transformer encoding layer is the enhanced node hidden-layer feature vector obtained in step C3, i.e. $\mathbf{z}_i^{(0)} = \mathbf{h}_i^{(L)}$.

Based on the calculated global attention weights, the feature information of all nodes is aggregated in a weighted manner to calculate the attention output feature vector $\mathbf{o}_i^{(m,c)}$ of the $c$-th attention head of node $v_i$ in layer $m$, specifically:

$$\mathbf{o}_i^{(m,c)} = \sum_{j=1}^{N} \beta_{ij}^{(m,c)} \, \mathbf{W}_V^{(m,c)} \mathbf{z}_j^{(m-1)}$$

where $\mathbf{W}_V^{(m,c)}$ is the value weight matrix of the $c$-th attention head in layer $m$, $\mathbf{z}_j^{(m-1)}$ is the output feature vector of node $v_j$ at the $(m-1)$-th Transformer encoding layer, and $d_z$ is the dimension of that output feature vector.

The attention output feature vectors of all attention heads in layer $m$ are concatenated and, after an output linear transformation and a residual connection, give the output feature vector $\tilde{\mathbf{z}}_i^{(m)}$ of the multi-head self-attention sublayer. In this embodiment, specifically:

$$\tilde{\mathbf{z}}_i^{(m)} = \mathbf{z}_i^{(m-1)} + \mathbf{W}_O^{(m)}\left[\mathbf{o}_i^{(m,1)} \,\Vert\, \cdots \,\Vert\, \mathbf{o}_i^{(m,C)}\right]$$

where $\mathbf{W}_O^{(m)}$ is the output linear transformation weight matrix of layer $m$.

The output feature vector of the multi-head self-attention sublayer is input into the feedforward neural network sublayer, which contains two fully connected layers and a residual connection, to compute the final output feature vector $\mathbf{z}_i^{(m)}$ of node $v_i$ at the $m$-th Transformer encoding layer. In this embodiment, specifically:

$$\mathbf{z}_i^{(m)} = \tilde{\mathbf{z}}_i^{(m)} + \mathbf{W}_2^{(m)} \, \sigma\left(\mathbf{W}_1^{(m)} \tilde{\mathbf{z}}_i^{(m)}\right)$$

where $\mathbf{W}_1^{(m)}$ is the weight matrix of the first fully connected layer of the feedforward neural network sublayer in layer $m$, $d_{ff}$ is the hidden-layer dimension of the feedforward neural network sublayer, $\sigma(\cdot)$ is the activation function of the feedforward neural network sublayer, and $\mathbf{W}_2^{(m)}$ is the weight matrix of the second fully connected layer of the feedforward neural network sublayer in layer $m$.

After processing by the $L_T$ Transformer encoding layers, global pooling is performed over the output feature vectors $\mathbf{z}_i^{(L_T)}$ of all nodes at the last Transformer encoding layer to generate the global fusion embedding vector $\mathbf{g}$ representing the overall operating state of the website.

The global fusion embedding vector $\mathbf{g}$ integrates the feature information of all pages of the website across the four modalities of text semantics, visual layout, link topology, and time-series performance, eliminating the scale differences and data redundancy between modalities and forming a unified global representation.
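The global attention and pooling of step C4 can be sketched for a single attention head as below. This is a minimal illustration, assuming standard scaled dot-product attention; the pooling is taken to be a mean over nodes, which is an assumption because the text only says "global pooling". The names `attention_weights`, `attention_output`, and `mean_pool` are hypothetical, and the residual connection, output projection, and feedforward sublayer are left out for brevity.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention_weights(z, W_q, W_k):
    """Row i holds the attention weights of node i over all nodes (one head)."""
    d_k = len(W_q)  # dimension of the query/key space
    queries = [matvec(W_q, v) for v in z]
    keys = [matvec(W_k, v) for v in z]
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def attention_output(z, W_q, W_k, W_v):
    """Attention-weighted sum of value vectors for every node (one head)."""
    weights = attention_weights(z, W_q, W_k)
    values = [matvec(W_v, v) for v in z]
    dim = len(values[0])
    return [[sum(w * val[d] for w, val in zip(row, values)) for d in range(dim)]
            for row in weights]

def mean_pool(vectors):
    """Assumed pooling choice: average the node vectors into one site vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# Toy sequence of three node vectors with hypothetical projection matrices.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(nodes, identity, identity)
site_vector = mean_pool(attention_output(nodes, identity, identity, identity))
```

Each row of `weights` corresponds to one row of the heatmap described for Figure 2, and the homepage bar chart would be the row belonging to the homepage node.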

[0025] C5: Input the global fusion embedding vector into a pre-trained deep residual regression network model, perform multi-task learning through multi-layer nonlinear mapping, and calculate and output a quantitative website performance evaluation score covering the dimensions of technical performance, search engine optimization health, content quality, and user experience. Step C5 includes the following.

A deep residual regression network model is constructed, comprising $B$ residual blocks and a multi-task output layer.

In the $b$-th residual block, a nonlinear transformation with two fully connected layers and a residual connection is performed to compute the output feature vector $\mathbf{r}^{(b)}$ of the $b$-th residual block. In this embodiment, specifically:

$$\mathbf{r}^{(b)} = \mathbf{r}^{(b-1)} + \mathbf{U}_2^{(b)} \, \phi\left(\mathbf{U}_1^{(b)} \mathbf{r}^{(b-1)}\right)$$

where $\mathbf{r}^{(b)}$ is the output feature vector of the $b$-th residual block, $d_r$ is the feature dimension of the residual block, $\mathbf{U}_1^{(b)}$ is the weight matrix of the first fully connected layer of the $b$-th residual block, $\phi(\cdot)$ is the activation function of the residual block, and $\mathbf{U}_2^{(b)}$ is the weight matrix of the second fully connected layer of the $b$-th residual block.

The input feature vector of the first residual block is obtained by a dimension transformation of the global fusion embedding vector, i.e. $\mathbf{r}^{(0)} = \mathbf{W}_g \mathbf{g}$, where $\mathbf{W}_g$ is the dimension projection weight matrix.

After processing by the $B$ residual blocks, the output feature vector $\mathbf{r}^{(B)}$ of the last residual block is input into the multi-task output layer, which contains four independent output branches corresponding to the technical performance evaluation score, the search engine optimization health evaluation score, the content quality evaluation score, and the user experience evaluation score, respectively.

The technical performance evaluation score $y_{\text{tech}}$ is specifically:

$$y_{\text{tech}} = \psi\left(\mathbf{w}_{\text{tech}}^{\top} \mathbf{r}^{(B)} + b_{\text{tech}}\right)$$

where $\mathbf{w}_{\text{tech}}$ is the weight vector of the technical performance evaluation branch, $d_r$ is the dimension of that weight vector, $\top$ denotes the transpose operation, $b_{\text{tech}}$ is the bias term of the technical performance evaluation branch, and $\psi(\cdot)$ is the activation function of the output layer.

The search engine optimization health evaluation score $y_{\text{seo}}$ is specifically:

$$y_{\text{seo}} = \psi\left(\mathbf{w}_{\text{seo}}^{\top} \mathbf{r}^{(B)} + b_{\text{seo}}\right)$$

where $\mathbf{w}_{\text{seo}}$ is the weight vector and $b_{\text{seo}}$ is the bias term of the search engine optimization health evaluation branch.

The content quality evaluation score $y_{\text{con}}$ is specifically:

$$y_{\text{con}} = \psi\left(\mathbf{w}_{\text{con}}^{\top} \mathbf{r}^{(B)} + b_{\text{con}}\right)$$

where $\mathbf{w}_{\text{con}}$ is the weight vector and $b_{\text{con}}$ is the bias term of the content quality evaluation branch.

The user experience evaluation score $y_{\text{ux}}$ is specifically:

$$y_{\text{ux}} = \psi\left(\mathbf{w}_{\text{ux}}^{\top} \mathbf{r}^{(B)} + b_{\text{ux}}\right)$$

where $\mathbf{w}_{\text{ux}}$ is the weight vector and $b_{\text{ux}}$ is the bias term of the user experience evaluation branch.

The deep residual regression network model is pre-trained by multi-task joint training on a labeled dataset of website samples with known performance scores. Training uses a weighted combination of mean squared error loss functions as the overall loss function $\mathcal{L}$, specifically:

$$\mathcal{L} = \gamma_{\text{tech}} \mathcal{L}_{\text{tech}} + \gamma_{\text{seo}} \mathcal{L}_{\text{seo}} + \gamma_{\text{con}} \mathcal{L}_{\text{con}} + \gamma_{\text{ux}} \mathcal{L}_{\text{ux}}$$

where $\gamma_{\text{tech}}$, $\gamma_{\text{seo}}$, $\gamma_{\text{con}}$, and $\gamma_{\text{ux}}$ are the weighting coefficients of the technical performance loss, the search engine optimization health loss, the content quality loss, and the user experience loss; $\mathcal{L}_{\text{tech}}$, $\mathcal{L}_{\text{seo}}$, $\mathcal{L}_{\text{con}}$, and $\mathcal{L}_{\text{ux}}$ are the mean squared error losses between the predicted and actual values of technical performance, search engine optimization health, content quality, and user experience; and the four weighting coefficients are subject to a joint constraint.

The final output is the quantitative website performance evaluation score, comprising the scoring results in four dimensions: the technical performance evaluation score $y_{\text{tech}}$, the search engine optimization health evaluation score $y_{\text{seo}}$, the content quality evaluation score $y_{\text{con}}$, and the user experience evaluation score $y_{\text{ux}}$.
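The residual block, one output branch, and the weighted multi-task loss of step C5 can be sketched as follows. The activation functions are not named in the text, so ReLU inside the block and a sigmoid at the output are assumptions, as are the equal loss weights of 0.25 and all function names.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):  # assumed activation inside the residual block
    return v if v > 0 else 0.0

def sigmoid(v):  # assumed output activation, maps scores into (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def residual_block(r, U1, U2):
    """Two fully connected layers plus a skip connection back to the input."""
    hidden = [relu(v) for v in matvec(U1, r)]
    delta = matvec(U2, hidden)
    return [a + b for a, b in zip(r, delta)]

def score_branch(r, w, bias):
    """One output branch: activation of (w . r + bias)."""
    return sigmoid(sum(wi * ri for wi, ri in zip(w, r)) + bias)

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multitask_loss(pred, target, gamma):
    """Weighted sum of per-dimension mean squared errors."""
    return sum(gamma[name] * mse(pred[name], target[name]) for name in gamma)

# Toy forward pass with hypothetical weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
r_out = residual_block([1.0, -2.0], identity, identity)
tech_score = score_branch(r_out, [0.0, 0.0], 0.0)

# Toy loss with assumed equal weights over the four evaluation dimensions.
gamma = {"tech": 0.25, "seo": 0.25, "content": 0.25, "ux": 0.25}
pred = {"tech": [1.0, 0.0], "seo": [0.5, 0.5], "content": [0.2, 0.2], "ux": [0.9, 0.9]}
target = {"tech": [0.0, 0.0], "seo": [0.5, 0.5], "content": [0.2, 0.2], "ux": [0.9, 0.9]}
loss = multitask_loss(pred, target, gamma)
```

Raising one coefficient in `gamma` relative to the others makes training prioritize that evaluation dimension, which is the role the weighting coefficients play in the overall loss.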

[0026] It should be noted that the sequence numbers of the above embodiments of the present invention are merely for descriptive purposes and do not represent the superiority or inferiority of the embodiments. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.

[0027] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0028] The above are merely preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. A website performance evaluation method based on multimodal data fusion and AI deep learning, characterized in that it includes the following steps:
C1: extract the raw feature data of the target website, including text semantic features, visual layout features, structured link topology features, and time-series performance index features, and align the raw feature data by page Uniform Resource Locator (URL) and timestamp to construct a multimodal raw feature tensor;
C2: map the multimodal raw feature tensor to a graph space, with web pages as nodes and hyperlinks between pages as edges, and embed the feature vectors of each dimension into the corresponding nodes to generate a website heterogeneous information graph model that integrates page attributes and topological structure;
C3: perform iterative convolution on the website heterogeneous information graph model using a multi-head self-attention graph convolutional network, dynamically calculate the influence weights of the nodes and their different modal features on global performance through the attention mechanism, capture the propagation characteristics of performance bottlenecks in the topology, and output enhanced node hidden-layer feature vectors;
C4: input the enhanced node hidden-layer feature vectors into a cross-modal Transformer fusion encoder, use the self-attention mechanism to perform global feature alignment and inter-modal interaction mapping, eliminate the scale differences and redundancy between data of different dimensions, and generate a global fusion embedding vector that represents the overall operating state of the website;
C5: input the global fusion embedding vector into a pre-trained deep residual regression network model, perform multi-task learning through multi-layer nonlinear mapping, and calculate and output a quantitative website performance evaluation score covering the dimensions of technical performance, search engine optimization health, content quality, and user experience.

2. The website performance evaluation method based on multimodal data fusion and AI deep learning according to claim 1, characterized in that step C1 includes:
performing full-site data collection on the target website, including obtaining the Hypertext Markup Language (HTML) source code, Cascading Style Sheets (CSS) files, and script files of the web pages, and setting a data collection time window that covers a complete business cycle of the target website;
for each web page, extracting raw feature data of four modalities, including text semantic features, visual layout features, structured link topology features, and time-series performance index features;
the text semantic features being extracted by segmenting into words the content of the page's title tag, meta description tag, body paragraphs, and heading-level elements, mapping the segmentation results to dense vector representations with a pre-trained word embedding model, and performing weighted average pooling over the text segment vectors of the same page to obtain the text semantic feature vector $\mathbf{t}_i$ of the $i$-th page, where $i$ is the page index, $i = 1, 2, \ldots, N$, and $N$ is the total number of pages of the target website;
the visual layout features being extracted by rendering the page in a headless browser, capturing a screenshot of the page, and extracting the spatial layout features from the screenshot with a convolutional neural network to obtain the visual layout feature matrix $\mathbf{V}_i \in \mathbb{R}^{H \times W}$ of the $i$-th page, where $H$ is the height dimension of the visual layout feature matrix, $W$ is its width dimension, and $\mathbb{R}$ denotes the real numbers;
the structured link topology features being extracted by parsing the hyperlink tags in the web pages, recording the link relationships between pages, constructing a link adjacency matrix, and calculating the in-degree, out-degree, and page weight value of each page to form the structured link topology feature vector $\mathbf{s}_i$ of the $i$-th page;
the time-series performance index features being extracted by periodically monitoring the page load time, first contentful paint time, largest contentful paint time, and cumulative layout shift value within the data collection time window, and organizing the performance indices from the sampling moments into a time series to obtain the time-series performance index sequence $\mathbf{P}_i \in \mathbb{R}^{T \times d_p}$, where $T$ is the total number of sampling moments and $d_p$ is the number of performance index dimensions;
aligning and matching the raw feature data of the four modalities by page URL and data collection timestamp, removing page samples missing the raw feature data of any modality, and stacking the aligned feature data along the page index dimension, the modality type dimension, and the feature dimension to construct the multimodal raw feature tensor $\mathcal{X} \in \mathbb{R}^{N \times M \times D}$, where $M$ is the total number of modality types and $D$ is the dimension of the feature vectors after uniform alignment.

3. The website performance evaluation method based on multimodal data fusion and AI deep learning according to claim 2, characterized in that step C2 includes:
based on the multimodal raw feature tensor and the link adjacency matrix obtained in step C1, constructing the website heterogeneous information graph model, represented as $G = (\mathcal{V}, \mathcal{E}, \mathbf{X})$, where $\mathcal{V}$ is the node set, $\mathcal{E}$ is the edge set, and $\mathbf{X}$ is the node feature matrix;
the node set $\mathcal{V}$ consisting of all pages of the target website, the total number of nodes being equal to the total number of pages $N$, and the $i$-th node $v_i$ corresponding to the $i$-th page;
the edge set $\mathcal{E}$ consisting of the hyperlinks between pages, wherein, if page $i$ contains a hyperlink pointing to page $j$, a directed edge $e_{ij}$ is established from node $v_i$ to node $v_j$, and the edge weight $w_{ij}$ is calculated from the positional importance of the hyperlink and the relevance of its anchor text, specifically:
$$w_{ij} = \lambda_p \, p_{ij} + \lambda_q \, q_{ij}$$
where $\lambda_p$ is the position weight coefficient, $p_{ij}$ is the link position score of the link from page $i$ to page $j$, $\lambda_q$ is the anchor text weight coefficient, and $q_{ij}$ is the anchor text relevance score of the link from page $i$ to page $j$, with $\lambda_p$ and $\lambda_q$ subject to a joint constraint;
the node feature matrix $\mathbf{X}$ being obtained from the multimodal raw feature tensor by a feature embedding transformation, wherein, for the $i$-th node, the corresponding text semantic feature vector, visual layout feature matrix, structured link topology feature vector, and time-series performance index sequence undergo dimension normalization and feature fusion to give the initial feature vector $\mathbf{x}_i$ of the $i$-th node with a unified dimension, and the initial feature vectors of all nodes are stacked to obtain the node feature matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, where $d$ is the dimension of the node initial feature vector.

4. The website performance evaluation method based on multimodal data fusion and AI deep learning according to claim 3, characterized in that step C3 includes:
constructing a multi-head self-attention graph convolutional network comprising $L$ graph attention convolutional layers, each containing $K$ attention heads, where $K$ is the number of attention heads;
using the website heterogeneous information graph model obtained in step C2 as the input of the multi-head self-attention graph convolutional network and performing $L$ rounds of iterative convolution;
in the $l$-th graph attention convolutional layer, for the $i$-th node $v_i$, calculating the attention coefficient $\alpha_{ij}^{(l,k)}$ of the $k$-th attention head between the node and each of its neighboring nodes $v_j$;
based on the calculated attention coefficients $\alpha_{ij}^{(l,k)}$, updating the hidden feature vector $\mathbf{h}_i^{(l,k)}$ of the $k$-th attention head of node $v_i$ in layer $l$ by weighted aggregation of the feature information of the neighboring nodes, specifically:
$$\mathbf{h}_i^{(l,k)} = \operatorname{ELU}\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l,k)} \, \mathbf{W}^{(l,k)} \mathbf{h}_j^{(l-1)}\right)$$
where $\operatorname{ELU}(\cdot)$ is the exponential linear unit activation function, $\mathbf{W}^{(l,k)}$ is the linear transformation weight matrix of the $k$-th attention head in layer $l$, $\mathbf{h}_j^{(l-1)}$ is the hidden feature vector of neighboring node $v_j$ at layer $l-1$, and $\mathcal{N}(i)$ is the set of neighboring nodes of node $v_i$;
concatenating the hidden feature vectors output by all attention heads in layer $l$ to obtain the final hidden-layer feature vector $\mathbf{h}_i^{(l)}$ of node $v_i$ in layer $l$;
after iterative convolution through the $L$ graph attention convolutional layers, outputting the enhanced node hidden-layer feature vector $\mathbf{h}_i^{(L)}$, which integrates the node's own multimodal attribute information and the neighborhood propagation information in the network topology.

5. The website performance evaluation method based on multimodal data fusion and AI deep learning according to claim 4, characterized in that step C4 includes:
constructing a cross-modal Transformer fusion encoder comprising $L_T$ Transformer encoding layers, each containing a multi-head self-attention sublayer and a feedforward neural network sublayer;
organizing the enhanced hidden feature vectors of all nodes obtained in step C3 by node index into a node feature sequence $\mathbf{h}_1^{(L)}, \mathbf{h}_2^{(L)}, \ldots, \mathbf{h}_N^{(L)}$, and using the node feature sequence as the input of the cross-modal Transformer fusion encoder;
in the $m$-th Transformer encoding layer of the cross-modal Transformer fusion encoder, first performing multi-head self-attention computation, in which, for the feature vector of each node $v_i$, the global attention weight $\beta_{ij}^{(m,c)}$ of the $c$-th attention head between it and the feature vector of every node $v_j$ is calculated;
based on the calculated global attention weights, aggregating the feature information of all nodes in a weighted manner to calculate the attention output feature vector $\mathbf{o}_i^{(m,c)}$ of the $c$-th attention head of node $v_i$ in layer $m$, specifically:
$$\mathbf{o}_i^{(m,c)} = \sum_{j=1}^{N} \beta_{ij}^{(m,c)} \, \mathbf{W}_V^{(m,c)} \mathbf{z}_j^{(m-1)}$$
where $\mathbf{W}_V^{(m,c)}$ is the value weight matrix of the $c$-th attention head in layer $m$, $\mathbf{z}_j^{(m-1)}$ is the output feature vector of node $v_j$ at the $(m-1)$-th Transformer encoding layer, and $d_z$ is the dimension of that output feature vector;
concatenating the attention output feature vectors of all attention heads in layer $m$ and, after an output linear transformation and a residual connection, obtaining the output feature vector $\tilde{\mathbf{z}}_i^{(m)}$ of the multi-head self-attention sublayer;
inputting the output feature vector $\tilde{\mathbf{z}}_i^{(m)}$ of the multi-head self-attention sublayer into the feedforward neural network sublayer, which contains two fully connected layers and a residual connection, to compute the final output feature vector $\mathbf{z}_i^{(m)}$ of node $v_i$ at the $m$-th Transformer encoding layer;
after processing by the $L_T$ Transformer encoding layers, performing global pooling over the output feature vectors of all nodes at the last Transformer encoding layer to generate the global fusion embedding vector $\mathbf{g}$ representing the overall operating state of the website;
the global fusion embedding vector $\mathbf{g}$ integrating the feature information of all pages of the website across the four modalities of text semantics, visual layout, link topology, and time-series performance, eliminating the scale differences and data redundancy between modalities and forming a unified global representation.

6. The website performance evaluation method based on multimodal data fusion and AI deep learning according to claim 5, characterized in that step C5 includes:
constructing a deep residual regression network model comprising $B$ residual blocks and a multi-task output layer;
in the $b$-th residual block, performing a nonlinear transformation with two fully connected layers and a residual connection to compute the output feature vector $\mathbf{r}^{(b)}$ of the $b$-th residual block;
after processing by the $B$ residual blocks, inputting the output feature vector $\mathbf{r}^{(B)}$ of the last residual block into the multi-task output layer, which contains four independent output branches corresponding to the technical performance evaluation score, the search engine optimization health evaluation score, the content quality evaluation score, and the user experience evaluation score, respectively;
the technical performance evaluation score $y_{\text{tech}}$ being specifically:
$$y_{\text{tech}} = \psi\left(\mathbf{w}_{\text{tech}}^{\top} \mathbf{r}^{(B)} + b_{\text{tech}}\right)$$
where $\mathbf{w}_{\text{tech}}$ is the weight vector of the technical performance evaluation branch, $d_r$ is the dimension of that weight vector, $\top$ denotes the transpose operation, $b_{\text{tech}}$ is the bias term of the technical performance evaluation branch, and $\psi(\cdot)$ is the activation function of the output layer;
the search engine optimization health evaluation score $y_{\text{seo}}$ being specifically:
$$y_{\text{seo}} = \psi\left(\mathbf{w}_{\text{seo}}^{\top} \mathbf{r}^{(B)} + b_{\text{seo}}\right)$$
where $\mathbf{w}_{\text{seo}}$ is the weight vector and $b_{\text{seo}}$ is the bias term of the search engine optimization health evaluation branch;
the content quality evaluation score $y_{\text{con}}$ being specifically:
$$y_{\text{con}} = \psi\left(\mathbf{w}_{\text{con}}^{\top} \mathbf{r}^{(B)} + b_{\text{con}}\right)$$
where $\mathbf{w}_{\text{con}}$ is the weight vector and $b_{\text{con}}$ is the bias term of the content quality evaluation branch;
the user experience evaluation score $y_{\text{ux}}$ being specifically:
$$y_{\text{ux}} = \psi\left(\mathbf{w}_{\text{ux}}^{\top} \mathbf{r}^{(B)} + b_{\text{ux}}\right)$$
where $\mathbf{w}_{\text{ux}}$ is the weight vector and $b_{\text{ux}}$ is the bias term of the user experience evaluation branch;
the deep residual regression network model being pre-trained by multi-task joint training on a labeled dataset of website samples with known performance scores, the training using a weighted combination of mean squared error loss functions as the overall loss function $\mathcal{L}$, specifically:
$$\mathcal{L} = \gamma_{\text{tech}} \mathcal{L}_{\text{tech}} + \gamma_{\text{seo}} \mathcal{L}_{\text{seo}} + \gamma_{\text{con}} \mathcal{L}_{\text{con}} + \gamma_{\text{ux}} \mathcal{L}_{\text{ux}}$$
where $\gamma_{\text{tech}}$, $\gamma_{\text{seo}}$, $\gamma_{\text{con}}$, and $\gamma_{\text{ux}}$ are the weighting coefficients of the technical performance loss, the search engine optimization health loss, the content quality loss, and the user experience loss; $\mathcal{L}_{\text{tech}}$, $\mathcal{L}_{\text{seo}}$, $\mathcal{L}_{\text{con}}$, and $\mathcal{L}_{\text{ux}}$ are the mean squared error losses between the predicted and actual values of technical performance, search engine optimization health, content quality, and user experience; and the four weighting coefficients are subject to a joint constraint;
inputting the global fusion embedding vector $\mathbf{g}$ obtained in step C4 into the pre-trained deep residual regression network model, which finally outputs the quantitative website performance evaluation score, comprising the scoring results in four dimensions: the technical performance evaluation score $y_{\text{tech}}$, the search engine optimization health evaluation score $y_{\text{seo}}$, the content quality evaluation score $y_{\text{con}}$, and the user experience evaluation score $y_{\text{ux}}$.