Webpage element similarity detection method based on adaptive clustering
By using an adaptive clustering method, a general XPath expression is generated based on XPath path processing and multi-dimensional similarity calculation. This solves the problems of adaptability, accuracy and universality in web page element similarity detection in existing technologies, and achieves efficient and robust web page data extraction.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: SHENZHEN SKIEER INFORMATION TECH CO LTD
- Filing Date: 2026-04-07
- Publication Date: 2026-06-26
AI Technical Summary
Existing web page element similarity detection technologies have significant shortcomings in terms of adaptability, accuracy, robustness, and universality, and cannot effectively identify similar data records in diverse web page structures.
An adaptive clustering method is used to generate a general XPath expression through XPath path processing, multi-dimensional similarity calculation, kernel density estimation, and adaptive clustering to identify similar HTML elements in web pages.
It achieves highly adaptive, accurate, robust, efficient, and versatile webpage element similarity detection for diverse webpage structures, reducing maintenance costs and improving the accuracy and efficiency of data extraction.
[Figure CN121980109B_ABST: abstract drawing]
Description
Technical Field
[0001] This invention belongs to the field of web page structured data processing technology, specifically relating to a web page element similarity detection method based on adaptive clustering.
Background Technology
[0002] With the rapid development of internet technology, the value of web page data is becoming increasingly prominent, and structured web page data extraction has become a core requirement in fields such as data analysis, business decision-making, and automated testing. In the process of extracting structured web page data, when a web page contains multiple similar data records, such as product names, prices, and sales volume in an e-commerce product list, or titles, publication times, and authors in a news list, accurately identifying the relationships between the web page elements corresponding to each field is a crucial prerequisite for ensuring the accuracy of data extraction.
[0003] Similar data records on a webpage typically have similar but not identical HTML structures. Currently, industry solutions for webpage element similarity detection and field attribution identification are mainly based on fixed rules or preset templates. These solutions have several pain points in practical applications, as follows:
[0004] 1. Limited coverage of conditional judgments: Conditional judgment methods based on fixed rules (such as preset rules such as "the class name of the product name element contains 'name'" and "the tag of the price element contains the class name 'price'") can only cover a limited number of webpage structure change scenarios. When the class name, tag level, or attribute value of the webpage element exceeds the range of the preset rules (such as changing the product name class name to "item-title" and not containing "name"), the rules become invalid and cannot identify the target element.
[0005] 2. Insufficient Scenario Adaptability: Existing technical solutions are typically designed with judgment logic tailored to specific scenarios (such as a particular e-commerce platform or a certain type of news website), lacking the ability to adapt to diverse scenarios. For example, the product price element recognition logic designed for e-commerce platform A cannot be directly applied to platform B because platform A uses the "price-red" class name for price elements, while e-commerce platform B uses the "price-green" class name. When encountering unforeseen changes in webpage structure (such as when a webpage is adapted from PC to mobile, increasing the tag level from 3 layers to 5 layers), existing judgment logic often cannot adjust quickly, leading to failure in field attribution recognition.
[0006] 3. High sensitivity to structural changes: Even minor adjustments to the HTML structure of a webpage can cause traditional field recognition schemes to completely fail. For example, changing the class name of a webpage element from "news-title" to "news-headline," a difference of only one word, can cause the fixed rules based on class name matching to become invalid. This high sensitivity means that traditional solutions require frequent manual maintenance, resulting in extremely high maintenance costs.
[0007] 4. Lack of Universality: Solutions optimized for specific websites often rely on the unique structural characteristics of that website and cannot be directly transferred to other websites with different structural characteristics. For example, a user information extraction solution designed for a blog website cannot be reused because user information on that website is concentrated in a div container with the class name "user-info," while user information on other blog websites may be scattered in multiple containers. Even for websites of the same type (such as different news websites), due to differences in development frameworks and design styles, their web page element structures may vary significantly. Traditional solutions need to be designed separately for each website, resulting in low development efficiency.
[0008] 5. Insufficient similarity recognition capability: Existing technologies lack effective similarity analysis mechanisms, making it difficult to identify data records that are structurally similar but not completely identical. Traditional solutions typically employ single-dimensional matching (such as matching based solely on class names or tag paths), failing to comprehensively assess the similarity of elements.
[0009] In summary, existing webpage element similarity detection and field attribution identification technologies have significant shortcomings in terms of adaptability, universality, robustness, and accuracy. They cannot meet the needs of webpage data processing for adapting to diverse webpage structures, nor can they cope with the challenges brought about by dynamic changes in webpage structures. Therefore, there is an urgent need for a webpage element similarity detection method that can automatically adapt to diverse webpage structures and possesses stronger universality, robustness, and similarity identification capabilities to address the aforementioned technical pain points and improve the efficiency and accuracy of webpage data extraction.
Summary of the Invention
[0010] To address the shortcomings of existing technologies, this invention provides a web page element similarity detection method based on adaptive clustering, which is highly adaptive, accurate, robust, efficient, and versatile.
[0011] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0012] A webpage element similarity detection method based on adaptive clustering includes the following steps:
[0013] S1. Input Field XPath Processing: Receives at least one XPath path of the target field as input, and verifies the format validity of the XPath path. If there is a syntax error in the XPath path, an error message is returned and the current process is terminated.
[0014] S2. Element path extraction: Locate the corresponding web page HTML element based on the verified XPath path, extract the complete tag path of the HTML element from its own node to the root node of the web page, construct an element identifier containing the tag name of the HTML element, and cache the complete tag path and element identifier.
[0015] S3. Similarity Calculation: Pair all HTML elements corresponding to the complete tag paths extracted in step S2 to obtain element pairs; for each element pair, calculate the path structure similarity and class name similarity respectively, and fuse the path structure similarity and class name similarity using a weighted combination method to obtain the comprehensive similarity of the element pair; wherein, the path structure similarity is calculated based on the ratio of the common prefix length of the complete tag paths of the two elements to the longest path length, and the class name similarity is calculated based on the Jaccard similarity coefficient of the class name sets of the two elements;
[0016] S4. Distribution Analysis: Construct a symmetric similarity matrix from the comprehensive similarity of all element pairs obtained in step S3. Use kernel density estimation to perform statistical distribution analysis on the comprehensive similarity values in the similarity matrix, identify the distribution peak and corresponding significant interval of the comprehensive similarity values, and dynamically adjust the parameters of KDE analysis to optimize the distribution identification results.
[0017] S5. Adaptive Clustering: Based on the distribution peaks and significant intervals identified in step S4, multi-objective clustering is performed on the HTML elements. The tolerance range of the clustering is dynamically adjusted to adapt to the characteristics of the significant intervals. Clusters whose similarity meets the preset merging conditions in the clustering results are merged, and the quality of each merged cluster is evaluated. The quality evaluation includes evaluating the number of elements in the cluster, the standard deviation of the comprehensive similarity of elements within the cluster, and the path consistency of elements within the cluster.
[0018] S6. XPath Generation: Extract the common prefix of the complete tag path of all HTML elements from the cluster with the best quality assessment results in step S5, generate a general XPath expression based on the common prefix, and optimize the hierarchical structure of the general XPath expression to improve its adaptability to changes in web page structure.
[0019] S7. Output: Output an optimized general XPath expression, which is used to identify similar HTML elements in the webpage that correspond to the target field, supporting subsequent applications in webpage data crawling, automated testing, and other scenarios.
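Step S6 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes each complete tag path is already available as a list of segments ordered from the root, treats a trailing numeric index such as `[2]` as the "dynamic index" to delete, and does not attempt the further trimming of redundant levels. The function names and the sample paths are hypothetical.

```python
import re

def strip_index(segment: str) -> str:
    """Drop a trailing positional index such as [3] (treated here as dynamic)."""
    return re.sub(r"\[\d+\]$", "", segment)

def general_xpath(tag_paths: list[list[str]]) -> str:
    """Join the index-free segments that every path in the cluster shares."""
    if not tag_paths:
        raise ValueError("at least one tag path is required")
    cleaned = [[strip_index(seg) for seg in path] for path in tag_paths]
    prefix = []
    for segments in zip(*cleaned):      # zip stops at the shortest path
        if len(set(segments)) != 1:     # first level where the paths diverge
            break
        prefix.append(segments[0])
    return "/" + "/".join(prefix)

# Hypothetical cluster: three product-name links that differ only by list index.
paths = [
    ["html", "body", "div[@id='list']", "ul", "li[1]", "a"],
    ["html", "body", "div[@id='list']", "ul", "li[2]", "a"],
    ["html", "body", "div[@id='list']", "ul", "li[3]", "a"],
]
print(general_xpath(paths))
```

Because the positional indices are removed before the prefix is taken, the three sample paths collapse to a single expression that matches every list item rather than one specific item.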
[0020] Preferably, in step S2, the method of caching the complete tag path and element identifier is as follows: the complete tag path and the corresponding element identifier are associated and stored in a memory cache or a local disk cache, and the cache validity period is set according to the web page data update frequency; if the complete tag path and element identifier need to be called again within the cache validity period, they are read directly from the cache without relocating the HTML element and extracting the path.
[0021] Preferably, in step S3, the path structure similarity weight in the weighted combination ranges from 0.5 to 0.8, and the class name similarity weight ranges from 0.2 to 0.5. The specific steps for calculating path structure similarity are as follows:
[0022] S311. Segment the two complete tag paths in the element pair, and split each complete tag path into multiple path segments according to the tag level, with each path segment corresponding to an HTML tag node;
[0023] S312. Starting from the initial path segments of the two complete tag paths, compare the corresponding path segments one by one. If the tag names and fixed attributes of the path segments are completely identical, they are determined to be the same path segments. Count the number of consecutive identical path segments as the common prefix length $L_{c}$;
[0024] S313. Calculate the total number of path segments for each of the two complete tag paths to obtain the path lengths $L_{1}$ and $L_{2}$, and take the maximum of $L_{1}$ and $L_{2}$ as the longest path length $L_{\max} = \max(L_{1}, L_{2})$;
[0025] S314. Path structure similarity $S_{path} = L_{c} / L_{\max}$, where the value range of $S_{path}$ is [0,1]. The closer $S_{path}$ is to 1, the more similar the path structures of the two elements are.
[0026] Preferably, in step S3, the specific steps for calculating class name similarity are as follows:
[0027] S321. Extract the class names of the two HTML elements in an element pair. If a single HTML element has multiple class names, combine all of them into the class name set of that element, giving the sets $C_{1}$ and $C_{2}$ for the two elements. If an HTML element has no class name, the corresponding class name set is empty.
[0028] S322. Compute the intersection $I$ and the union $U$ of the class name sets $C_{1}$ and $C_{2}$, where $I = C_{1} \cap C_{2}$ and $U = C_{1} \cup C_{2}$;
[0029] S323. Class name similarity $S_{class} = |I| / |U|$, where $|I|$ is the number of elements in the intersection $I$, $|U|$ is the number of elements in the union $U$, and the value range of $S_{class}$ is [0,1]; if $U$ is empty, then $S_{class} = 1$, which indicates that there is no difference between the two elements in the class name dimension.
[0030] Preferably, in step S3, the path structure similarity weight of the weighted combination is set to 0.6, the class name similarity weight is set to 0.4, and the comprehensive similarity is $S = 0.6 \cdot S_{path} + 0.4 \cdot S_{class}$, where $S_{path}$ is the path structure similarity and $S_{class}$ is the class name similarity.
[0031] Preferably, in step S4, the parameters for dynamically adjusting the KDE analysis include adjusting the kernel function type and bandwidth; the kernel function type is selected from any one of the Gaussian kernel function, Epanechnikov kernel function, or triangular kernel function; the bandwidth is dynamically adjusted according to the number of samples and the degree of dispersion of the comprehensive similarity value.
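The distribution analysis of step S4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the method's actual parameter-adjustment logic: it uses a Gaussian kernel, picks the bandwidth with the normal-reference rule of thumb $1.06\,\sigma\,n^{-1/5}$ (one common way to make bandwidth depend on sample count and dispersion), and finds the peak by scanning an even grid over [0,1]. The function names and the sample similarity values are hypothetical.

```python
import math
from statistics import stdev

def rule_of_thumb_bandwidth(values):
    """Bandwidth that shrinks with more samples and grows with their spread."""
    n = len(values)
    spread = stdev(values) if n > 1 else 0.0
    return max(1.06 * spread * n ** (-1 / 5), 1e-3)  # floor avoids a zero width

def gaussian_kde(values, bandwidth):
    """Return a function evaluating the Gaussian kernel density estimate at x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

def density_peak(values, grid_size=101):
    """Scan [0, 1] on an even grid and return the point of highest density."""
    density = gaussian_kde(values, rule_of_thumb_bandwidth(values))
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=density)

# Hypothetical comprehensive similarity values taken from a similarity matrix.
sims = [0.90, 0.92, 0.88, 0.91, 0.35, 0.89, 0.93, 0.30, 0.90, 0.87]
peak = density_peak(sims)
print(peak)
```

In this sample most pairs sit near 0.9, so the reported peak falls in that region, while the two low values form a much weaker secondary bump.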
[0032] By adopting the above technical solution, the present invention has the following beneficial effects:
[0033] (1) Strong adaptability to diverse web page structures: This invention analyzes the similarity distribution characteristics through kernel density estimation and dynamically adjusts clustering parameters, such as the tolerance range, without the need to preset fixed parameters manually; at the same time, the weighted combination similarity calculation method can adjust the weight ratio according to the web page scenario, such as class name stability and path structure change frequency, to adapt to the structural features of different types of web pages, such as e-commerce, news, and social media.
[0034] (2) High accuracy and precise identification of similar elements: This invention adopts multi-dimensional similarity analysis such as path structure and class name to avoid misjudgment of similarity caused by single-dimensional analysis; through adaptive clustering and multi-dimensional quality assessment, such as element quantity, similarity consistency and path consistency, it ensures that the clustering results can accurately reflect the belonging relationship of elements and select the optimal set of similar elements. This invention has a high accuracy rate in identifying similar web page elements, which is significantly higher than the traditional fixed rule method, thus ensuring the accuracy of subsequent data extraction.
[0035] (3) Good robustness and strong resistance to structural changes in XPath expressions: The general XPath expressions generated by this invention are built based on the common prefixes of similar elements, and the dynamic index is deleted and redundant levels are trimmed. Only stable tag levels and fixed attributes are retained. This expression structure makes it highly adaptable to minor adjustments to the webpage structure, such as changes in tag indexes, class name modifications, and adjustments to the underlying level. When the webpage structure is adjusted, the positioning accuracy of the XPath expressions generated by this invention is still very high, while the positioning accuracy of traditional fixed XPath expressions is usually low. The robustness is significantly better than the existing technology, which greatly reduces the maintenance costs caused by webpage updates.
[0036] (4) High efficiency, fast processing speed and support for large-scale scenarios: This invention avoids repeatedly extracting web page element paths and reduces DOM tree traversal time through a path information caching mechanism; at the same time, similarity calculation and cluster analysis adopt optimized algorithms to reduce computational complexity.
[0037] (5) High versatility, no need for separate design for specific websites: This invention does not rely on the structural features of specific websites. Through adaptive clustering and dynamic parameter adjustment, it can be directly applied to all web pages containing similar data records. The generated general XPath expression does not need to be modified for specific websites and can be directly reused. There is no need to redesign the test logic for each version. This versatility greatly reduces the development and maintenance costs and improves the efficiency of web page data processing. It is especially suitable for scenarios that require cross-platform and cross-website data processing.
[0038] (6) Easy to use and lower technical threshold: The entire process of the present invention can be completed automatically. Users do not need to have in-depth knowledge of web development or algorithms. They only need to input the initial XPath of the target field to obtain a robust general XPath expression, which greatly reduces the threshold for using web data extraction technology and enables non-technical personnel to easily carry out web data processing work.
[0039] In summary, this invention has the advantages of strong adaptability, high accuracy, good robustness, high efficiency, and strong versatility.
Description of the Drawings
[0040] Figure 1 is a schematic diagram of the overall process of the present invention;
[0041] Figure 2 is a flowchart illustrating the similarity calculation process of the present invention;
[0042] Figure 3 is a schematic diagram of the clustering process of the present invention.
Detailed Description of the Embodiments
[0043] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments.
[0044] The components of the embodiments of the invention described and shown in the accompanying drawings can typically be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention.
[0045] Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0046] In the description of this invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0047] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.
[0048] Example 1
[0049] In this embodiment, the present invention achieves accurate identification and robust localization of similar web page elements through five core steps: element path extraction, multi-dimensional similarity calculation, kernel density distribution analysis, adaptive clustering, and robust XPath generation. That is, through adaptive clustering and multi-dimensional similarity analysis, it effectively solves the problems of adaptability, accuracy, robustness, and universality of existing web page element similarity detection technologies, providing an efficient and reliable technical solution for fields such as web page data crawling, automated testing, and structured data extraction. It has significant practical application value and broad prospects for promotion.
[0050] As shown in Figures 1-3, in one embodiment of the present invention, the webpage element similarity detection method based on adaptive clustering comprises the following steps:
[0051] S1. Input Field XPath Processing: Receives at least one XPath path of the target field as input, and verifies the format validity of the XPath path. If there is a syntax error in the XPath path, an error message is returned and the current process is terminated. Syntax errors include unclosed tags, abnormal index format, etc.
[0052] S2. Element path extraction: Locate the corresponding web page HTML element based on the verified XPath path, extract the complete tag path of the HTML element from its own node to the root node of the web page, construct an element identifier containing the tag name and class name information of the HTML element, and cache the complete tag path and element identifier.
[0053] In step S2, the method of caching the complete tag path and element identifier is as follows: the complete tag path and the corresponding element identifier are associated and stored in the memory cache or local disk cache (e.g., stored in a local file in JSON format), and the cache validity period is set according to the web page data update frequency (e.g., the cache validity period for e-commerce web pages is set to 30 minutes, and for news web pages it is set to 10 minutes); if the complete tag path and element identifier need to be called again within the cache validity period, they are read directly from the cache without relocating the HTML element and extracting the path.
[0054] Specifically, the purpose of element path extraction is to construct the foundational data for similarity analysis, which is a fundamental step in this invention. Its aim is to obtain complete structural information of target elements from web pages, providing accurate and comprehensive data support for subsequent similarity calculations. Specific steps include:
[0055] S2.1 Target Element Location: Receives an XPath path for at least one target field from user input. First, it validates the format and validity of the XPath path—checking if the XPath syntax is correct (e.g., whether tags are closed, attribute expressions are standardized), and verifying whether the XPath can locate at least one HTML element in the current webpage. If the XPath format is incorrect or the element cannot be located, an error message is returned to guide the user to correct the input; if the validation passes, all corresponding target HTML elements are located based on the XPath path, such as all product name tags in a product list, all news title tags in a news list, etc.
[0056] S2.2 Construction of Complete Tag Path: For each located HTML element, starting from the element's own node, traverse upwards layer by layer to the root node of the webpage (the html tag), recording the tag name and key attributes (such as the id attribute and fixed class names, excluding dynamic indexes) at each level, and finally constructing the complete tag path of the element. For example, the HTML node of a product name element is located under the list item tag, and its parent nodes are the list tag, container tag, body tag, and html tag in sequence. By traversing these levels, a complete tag path can be constructed. This path fully reflects the element's hierarchical position in the HTML DOM tree and is the core basis for calculating path structure similarity.
[0057] S2.3 Element Identifier Generation: A unique element identifier is constructed for each HTML element. The identifier includes the element's tag name (such as "a", "span", "div"), a complete set of class names (if the element has multiple class names, all are included in the set), and the parent node's tag name (used to help distinguish elements in different positions with the same tag). Element identifiers are not only used to quickly distinguish different elements, but also provide the data foundation for class name similarity calculation.
[0058] S2.4 Path Information Caching: The complete tag path and its corresponding element identifier are associated and stored in the cache. The caching medium can be memory caching or local disk caching (e.g., stored in a local file in JSON format). The cache validity period is set according to the webpage type and data update frequency—for e-commerce product pages, where product information updates frequently, the cache validity period can be set to 30 minutes; for news detail pages, where content is relatively stable, the cache validity period can be set to 2 hours. In subsequent similarity calculations or cluster analysis steps, if the element's path information needs to be used again, it can be read directly from the cache without retracing the HTML DOM tree to extract the path, significantly reducing redundant calculations and improving overall processing efficiency.
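Steps S2.2 and S2.3 can be sketched with Python's standard library as below. This is only an illustration: `xml.etree.ElementTree` parses well-formed XML, so real-world HTML would need a proper HTML parser; the sample page, the choice of `id` as the only recorded fixed attribute, and the function names are assumptions. The path is reported root-first so that it can later be compared starting from the html segment.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div id="list"><ul>
    <li><a class="goods-name red-text">Item A</a></li>
    <li><a class="goods-name">Item B</a></li>
  </ul></div>
</body></html>
"""

def describe(node):
    """Tag name, plus the id attribute when present (treated as a fixed attribute)."""
    node_id = node.get("id")
    return f"{node.tag}[@id='{node_id}']" if node_id else node.tag

def tag_path(node, parents):
    """Walk from the element up to the root, then return the path root-first."""
    segments = []
    while node is not None:
        segments.append(describe(node))
        node = parents.get(node)        # the root has no entry, which ends the loop
    return segments[::-1]

def identifier(node, parents):
    """Element identifier: tag name, class name set, and parent tag name."""
    parent = parents.get(node)
    classes = frozenset(node.get("class", "").split())
    return (node.tag, classes, parent.tag if parent is not None else None)

root = ET.fromstring(PAGE.strip())
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}
targets = root.findall(".//li/a")
for element in targets:
    print(tag_path(element, parents), identifier(element, parents))
```

No positional index is recorded in the segments, in line with the instruction to exclude dynamic indexes from the path.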
[0059] The core advantage of element path extraction lies in its ability to comprehensively capture the structural features of web page elements through complete tag paths and element identifiers, providing a reliable data foundation for subsequent multi-dimensional similarity analysis. At the same time, the introduction of a caching mechanism effectively reduces the time consumption of repetitive operations, providing performance assurance for large-scale web page element detection scenarios.
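The caching mechanism with a validity period can be sketched as a small in-memory cache. The class name `PathCache`, keying entries by the input XPath, and the injectable clock are illustrative assumptions; a disk-backed variant (for example a JSON file) would follow the same read-check-expire logic.

```python
import time

class PathCache:
    """In-memory cache whose entries expire after a fixed validity period."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so expiry can be tested
        self._entries = {}

    def put(self, xpath, tag_path, identifier):
        """Store the path and identifier together with the time of storage."""
        self._entries[xpath] = (self.clock(), tag_path, identifier)

    def get(self, xpath):
        """Return (tag_path, identifier), or None if missing or expired."""
        entry = self._entries.get(xpath)
        if entry is None:
            return None
        stored_at, tag_path, identifier = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[xpath]    # expired: caller must re-extract the path
            return None
        return tag_path, identifier

# 30 minutes, matching the validity period given for e-commerce pages.
cache = PathCache(ttl_seconds=30 * 60)
cache.put("//ul/li[1]/a", ["html", "body", "ul", "li", "a"],
          ("a", frozenset({"goods-name"}), "li"))
hit = cache.get("//ul/li[1]/a")
```

A `None` result tells the caller to locate the element and extract its path again, after which the fresh result can be stored with `put`.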
[0060] S3. Similarity Calculation: Pair all HTML elements corresponding to the complete tag paths extracted in step S2 to obtain element pairs; for each element pair, calculate the path structure similarity and class name similarity respectively, and fuse the path structure similarity and class name similarity using a weighted combination method to obtain the comprehensive similarity of the element pair; wherein, the path structure similarity is calculated based on the ratio of the common prefix length of the complete tag paths of the two elements to the longest path length, and the class name similarity is calculated based on the Jaccard similarity coefficient of the class name sets of the two elements;
[0061] In step S3, the path structure similarity weight in the weighted combination ranges from 0.5 to 0.8, and the class name similarity weight ranges from 0.2 to 0.5. The specific steps for calculating path structure similarity are as follows:
[0062] S311. Segment the two complete tag paths in the element pair, and split each complete tag path into multiple path segments according to the tag level, with each path segment corresponding to an HTML tag node;
[0063] S312. Starting from the initial path segment of the two complete tag paths (the path segment corresponding to the root node "html"), compare the corresponding path segments one by one. If the tag name and fixed attributes (such as the id attribute of a div, excluding dynamic indexes) of the path segments are completely identical, they are determined to be the same path segment. Count the number of consecutive identical path segments as the common prefix length $L_{c}$;
[0064] S313. Calculate the total number of path segments for each of the two complete tag paths to obtain the path lengths $L_{1}$ and $L_{2}$, and take the maximum of $L_{1}$ and $L_{2}$ as the longest path length $L_{\max} = \max(L_{1}, L_{2})$;
[0065] S314. Path structure similarity $S_{path} = L_{c} / L_{\max}$, where the value range of $S_{path}$ is [0,1]. The closer $S_{path}$ is to 1, the more similar the path structures of the two elements are;
[0066] In step S3, the specific steps for calculating class name similarity are as follows:
[0067] S321. Extract the class names of the two HTML elements in an element pair. If a single HTML element has multiple class names (e.g., a tag carrying both "goods-name" and "red-text"), combine all of them into the class name set of that element, giving the sets $C_{1}$ and $C_{2}$ for the two elements. If an HTML element has no class name, the corresponding class name set is empty.
[0068] S322. Compute the intersection $I$ (the class names shared by the two sets) and the union $U$ (all class names in both sets after removing duplicates) of the class name sets $C_{1}$ and $C_{2}$, where $I = C_{1} \cap C_{2}$ and $U = C_{1} \cup C_{2}$;
[0069] S323. Class name similarity $S_{class} = |I| / |U|$, where $|I|$ is the number of elements in the intersection $I$, $|U|$ is the number of elements in the union $U$, and the value range of $S_{class}$ is [0,1]; if $U$ is empty, then $S_{class} = 1$, which indicates that there is no difference between the two elements in terms of class name;
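The class name similarity of steps S321 to S323 is a Jaccard coefficient and can be sketched in a few lines. The function name is hypothetical, and returning 1.0 when both elements have no class names is this sketch's reading of the statement that an empty union means no difference in the class name dimension.

```python
def class_similarity(classes_a, classes_b):
    """Jaccard coefficient of two class name collections."""
    a, b = set(classes_a), set(classes_b)
    union = a | b
    if not union:
        return 1.0      # assumption: neither element has classes, so no difference
    return len(a & b) / len(union)

# One shared class name out of two distinct class names overall.
print(class_similarity({"goods-name", "red-text"}, {"goods-name"}))  # 0.5
# Both elements without class names.
print(class_similarity(set(), set()))  # 1.0
```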
[0070] More specifically, in step S3, the path structure similarity weight in the weighted combination is set to 0.6, the class name similarity weight is set to 0.4, and the comprehensive similarity is $S = 0.6 \cdot S_{path} + 0.4 \cdot S_{class}$, where $S_{path}$ is the path structure similarity and $S_{class}$ is the class name similarity. The weighting is based on statistics of webpage structure features: in most webpages, the influence of path structure (tag hierarchy relationship) on element positioning (about 60%) is higher than that of class name (about 40%). The weight ratio can be dynamically adjusted according to the specific webpage type (e.g., for news webpages with high class name stability the class name weight can be increased to 0.5, and for e-commerce webpages with frequent changes in path structure the path structure weight can be increased to 0.7).
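The weighted combination can be sketched as a single function. The function name is hypothetical, and the sketch assumes the two weights always sum to 1, which holds for every weight pair quoted in the text (0.6/0.4, 0.5/0.5, 0.7/0.3) but is not stated there as a rule.

```python
def comprehensive_similarity(path_sim, class_sim, path_weight=0.6):
    """Weighted fusion of path structure similarity and class name similarity."""
    if not 0.5 <= path_weight <= 0.8:
        raise ValueError("path weight outside the 0.5-0.8 range given in the text")
    class_weight = 1.0 - path_weight    # assumption: the weights sum to 1
    return path_weight * path_sim + class_weight * class_sim

# Default 0.6 / 0.4 weighting.
default_score = comprehensive_similarity(4 / 6, 0.5)
# News-style page: class name weight raised to 0.5.
news_score = comprehensive_similarity(4 / 6, 0.5, path_weight=0.5)
print(round(default_score, 3), round(news_score, 3))
```

Restricting the path weight to 0.5 through 0.8 automatically keeps the class name weight within 0.2 through 0.5 under the sum-to-1 assumption.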
[0071] More specifically, similarity calculation is the core step in identifying similar elements. Its purpose is to quantify the degree of similarity between two webpage elements along multiple dimensions (path structure and class name), providing data for subsequent clustering analysis. This invention combines path structure similarity and class name similarity through a weighted combination, avoiding the limitations of single-dimensional analysis. Specific steps include:
[0072] a. Element Pair Construction: All target HTML elements obtained from the element path extraction step are paired to generate element pairs. For example, if 5 product name elements (E1, E2, E3, E4, E5) are extracted, the generated element pairs include (E1, E2), (E1, E3), (E1, E4), (E1, E5), (E2, E3), (E2, E4), (E2, E5), (E3, E4), (E3, E5), (E4, E5), a total of 10 element pairs. The construction of element pairs ensures that the similarity between all elements is quantitatively analyzed, avoiding the omission of potential similar elements and providing a guarantee for the comprehensiveness of subsequent clustering.
[0073] b. Path structure similarity calculation: Path structure similarity reflects the degree of similarity between two elements in their hierarchical position in the HTML DOM tree. It is calculated based on the ratio of the length of the common prefix of the complete tag path to the length of the longest path. The specific process is as follows:
[0074] b1. Path Segmentation: The complete tag paths of the two elements in an element pair are split into multiple path segments according to the tag level. Each path segment corresponds to an HTML tag node. For example, the complete tag path of element E1 includes html, body, container tag, list tag, list item tag and product name tag, which is split into 6 corresponding path segments; the complete tag path of element E2 is similar to E1 in structure, only the index of the list item tag is different, and it is also split into 6 path segments.
[0075] b2. Common Prefix Length Count: Starting from the beginning of the path segment set (the path segment corresponding to the root node "html"), compare the corresponding path segments of the two sets one by one. If the tag name and fixed attributes (such as the class name of the container tag) of two path segments are completely identical, they are determined to be the same path segment. The number of consecutive identical path segments is the common prefix length L_common. For example, the first four path segments (html, body, container tag, list tag) of E1 and E2 are completely identical, but the fifth path segment (list item tag) differs because of its index; therefore the common prefix length is L_common = 4.
[0076] b3. Calculation of the longest path length: Count the total number of path segments in each of the two sets to obtain the path lengths L_1 (e.g., E1 has 6 path segments) and L_2 (e.g., E2 has 6 path segments), and take the maximum of L_1 and L_2 as the longest path length L_max (here L_max = 6).
[0077] b4. Path structure similarity calculation: The path structure similarity is S_path = L_common / L_max, with a value range of [0,1]. In the example above, S_path = 4 / 6 ≈ 0.667, indicating that the path structures of E1 and E2 have a moderate degree of similarity.
[0078] The design of path structure similarity is based on the fact that the hierarchical position (path structure) of web page elements is the core feature of their belonging relationship—elements belonging to the same field (such as product names in the same product list) are usually located in similar hierarchical positions, while elements from different fields (such as product names and product prices) have significantly different hierarchical positions. The ratio of the common prefix length to the longest path length can effectively quantify this hierarchical similarity, providing an important basis for identifying similar elements.
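The path structure similarity described in steps b1 to b4 can be sketched as follows. Representing each path segment as a string (tag name plus any fixed attribute or index) and the specific segment labels below are assumptions for illustration only.

```python
def path_similarity(path_a, path_b):
    """Common-prefix length divided by the length of the longer path."""
    if not path_a and not path_b:
        return 1.0  # assumption: two empty paths are treated as identical
    common = 0
    for seg_a, seg_b in zip(path_a, path_b):
        if seg_a != seg_b:
            break  # only consecutive matches from the root count
        common += 1
    return common / max(len(path_a), len(path_b))

# Hypothetical six-segment paths that differ only in the list item index.
e1 = ["html", "body", "div.container", "ul.goods-list", "li[1]", "a"]
e2 = ["html", "body", "div.container", "ul.goods-list", "li[2]", "a"]
print(round(path_similarity(e1, e2), 3))  # 0.667
```

Note that segments after the first mismatch do not contribute, even if they match again (the final "a" segment above is ignored).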
[0079] c. Class Name Similarity Calculation: Class name similarity reflects the degree of similarity between two elements in their CSS class name properties. It is calculated based on the Jaccard similarity coefficient of the class name set. The Jaccard similarity coefficient is a classic indicator for measuring the similarity between two sets. Its value is the ratio of the number of elements in the intersection of the two sets to the number of elements in the union of the two sets, which can effectively reflect the degree of overlap of class names. The specific process is as follows:
[0080] c1. Class Name Set Construction: Extract all class names from the two elements in an element pair and construct a class name set. For example, if the tag of element E1 contains the class name "goods-name", the class name set C1 = {"goods-name"}; if the tag of element E2 contains the class name "product-name", the class name set C2 = {"product-name"}; if an element has multiple class names (such as a span tag containing the class names "price" and "red-text"), then the class name set is {"price", "red-text"}; if an element has no class names, then the class name set is empty.
[0081] c2. Intersection and Union Calculations: Calculate the intersection C1 ∩ C2 of the two class name sets (i.e., the class names shared by the two sets) and the union C1 ∪ C2 (i.e., all class names in the two sets after removing duplicates). In the example above, the intersection C1 ∩ C2 is the empty set and the union C1 ∪ C2 = {"goods-name", "product-name"}; therefore |C1 ∩ C2| = 0 and |C1 ∪ C2| = 2.
[0082] c3. Class Name Similarity Calculation: The class name similarity is S_class = |C1 ∩ C2| / |C1 ∪ C2|, with a value range of [0,1]. In the example above, S_class = 0 / 2 = 0, indicating that the class name similarity between E1 and E2 is extremely low; if the class name sets of both elements are empty, then S_class = 1 by definition (no difference in the class name dimension); if the class name sets of the two elements are exactly the same (e.g., both are {"news-title"}), then S_class = 1, indicating that the class names are completely identical.
[0083] The design basis of class name similarity is that CSS class names are semantic identifiers given to elements by web developers. Elements belonging to the same field (such as product price) usually have similar or identical class names (such as "price" and "goods-price"). Therefore, class name similarity can serve as an important supplementary feature for identifying similar elements, complementing path structure similarity and improving the comprehensiveness of similarity assessment.
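A minimal sketch of the Jaccard-based class name similarity from steps c1 to c3 is given below. The function name `class_similarity` is an assumption; the empty-set convention follows the text above.

```python
def class_similarity(classes_a, classes_b):
    """Jaccard coefficient of two class name collections."""
    set_a, set_b = set(classes_a), set(classes_b)
    union = set_a | set_b
    if not union:
        return 1.0  # both elements have no class names: defined as 1 in the text
    return len(set_a & set_b) / len(union)

print(class_similarity(["goods-name"], ["product-name"]))  # 0.0
print(class_similarity(["price", "red-text"], ["price"]))  # 0.5
```

Because whole class names are compared, names that merely share a substring (for example "price" and "goods-price") count as different in this sketch.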
[0084] d. Weighted Combination Calculation of Comprehensive Similarity: Since path structure and class name have different degrees of influence on element similarity, this invention uses a weighted combination method to integrate path structure similarity and class name similarity into a comprehensive similarity score, thus fully reflecting the degree of element similarity. The formula for the weighted combination is: S = w_path × S_path + w_class × S_class, where S_path is the path structure similarity, S_class is the class name similarity, w_path is the path structure similarity weight, with a value ranging from 0.5 to 0.8, and w_class is the class name similarity weight, with a value ranging from 0.2 to 0.5; and w_path + w_class = 1.
[0085] Based on extensive statistical analysis of webpage structures, this invention sets the defaults w_path = 0.6 and w_class = 0.4: in most web pages, the influence of path structure on element attribution (approximately 60%) is higher than that of class name (approximately 40%). For example, in e-commerce product lists, even if the class names of product name elements differ (such as "goods-name" and "product-name"), their path structure (located within the "li" tags under "goods-list") tends to remain consistent, indicating higher stability of the path structure. However, in scenarios where class name stability is high (such as news websites, where headlines often use "news-title" or "news-headline" as class names, with minimal variation), the class name weight w_class can be increased to 0.5 and the path structure weight w_path reduced to 0.5 to increase the influence of the class name.
[0086] Taking E1 and E2 as examples, with path structure similarity S_path ≈ 0.667, class name similarity S_class = 0, and weights w_path = 0.6 and w_class = 0.4, the comprehensive similarity is S = 0.667 × 0.6 + 0 × 0.4 ≈ 0.400, indicating that E1 and E2 have a low comprehensive similarity. If another element E3 has path structure similarity S_path = 0.8 and class name similarity S_class = 0.7, then the comprehensive similarity is S = 0.8 × 0.6 + 0.7 × 0.4 = 0.48 + 0.28 = 0.76, indicating that E3 has a high degree of comprehensive similarity with the target element.
[0087] The advantage of comprehensive similarity design lies in taking into account both path structure and class name, two core features, through weighted combination, thus avoiding the limitations of single feature analysis. At the same time, the weights can be dynamically adjusted according to specific web page scenarios, improving the adaptability and flexibility of the method.
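The weighted combination can be sketched as below. The function and parameter names are assumptions; the weight bounds and the default of 0.6 are those stated in the text, and the class name weight is derived so that the two weights sum to 1.

```python
def combined_similarity(s_path, s_class, w_path=0.6):
    """Weighted sum of path structure and class name similarity."""
    if not 0.5 <= w_path <= 0.8:
        raise ValueError("w_path must lie in [0.5, 0.8]")
    w_class = 1.0 - w_path  # weights are constrained to sum to 1
    return w_path * s_path + w_class * s_class

print(round(combined_similarity(4 / 6, 0.0), 3))  # 0.4
print(round(combined_similarity(0.8, 0.7), 2))    # 0.76
```

Passing `w_path=0.5` reproduces the equal-weight setting suggested for pages with stable class names.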
[0088] S4. Distribution Analysis: Construct a symmetric similarity matrix from the comprehensive similarity of all element pairs obtained in step S3 (the value in the i-th row and j-th column of the matrix is equal to the value in the j-th row and i-th column, and the diagonal values represent the element's own similarity of 1). Perform statistical distribution analysis on the comprehensive similarity values in the similarity matrix using kernel density estimation to identify the distribution peak (the interval where the similarity values appear most frequently) and the corresponding significant interval (the range of continuous distribution of similarity values around the peak). Dynamically adjust the parameters of the KDE analysis to optimize the distribution identification results.
[0089] In step S4, the dynamic adjustment of KDE analysis parameters includes adjusting the kernel function type and bandwidth. The kernel function type is selected from any one of the following: Gaussian kernel function (suitable for scenarios with smooth similarity distribution), Epanechnikov kernel function (suitable for scenarios with obvious peaks in the distribution), or triangular kernel function (suitable for scenarios with a small number of samples). The bandwidth is dynamically adjusted according to the number of samples and the dispersion of the distribution of the comprehensive similarity value. The more samples or the more dispersed the distribution, the larger the bandwidth is set (e.g., when the number of samples exceeds 1000 and the standard deviation of the distribution is greater than 0.2, the bandwidth is set to 0.1; when the number of samples is less than 200 and the standard deviation of the distribution is less than 0.1, the bandwidth is set to 0.05) to ensure that KDE analysis can accurately capture the similarity distribution characteristics.
[0090] More specifically, distribution analysis guides the adaptive adjustment of clustering parameters. The purpose of distribution analysis is to identify the distribution characteristics of similarity (such as peak intervals and saliency intervals) by statistically analyzing the comprehensive similarity of all element pairs, providing a basis for parameter adjustment in subsequent adaptive clustering and avoiding poor clustering results caused by fixed clustering parameters. This invention uses kernel density estimation (KDE) for distribution analysis. KDE is a non-parametric statistical method that does not require a pre-set data distribution model and can adaptively estimate the probability density function through sample data, accurately capturing the distribution characteristics of the data. The specific steps include:
[0091] (1) Similarity Matrix Construction: Organize the comprehensive similarity of all element pairs into a symmetric similarity matrix. Assuming there are n target elements, the similarity matrix is an n×n square matrix. The element in the i-th row and j-th column of the matrix is the comprehensive similarity S_ij between the i-th element and the j-th element. Since S_ij = S_ji (the similarity between elements i and j is equal to the similarity between elements j and i), the matrix is a symmetric matrix. The diagonal element S_ii = 1 (the similarity between an element and itself is 1). For example, if there are 3 elements E1, E2, and E3, and their comprehensive similarities are respectively S_12 = 0.4, S_13 = 0.7, and S_23 = 0.6, then the similarity matrix is a 3×3 symmetric matrix, with all diagonal elements being 1 and off-diagonal elements filled according to their corresponding similarity. The similarity matrix structures the scattered element-pair similarity data, providing a unified and standardized data format for subsequent distribution analysis.
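As an illustration, the matrix for the three-element example can be built as follows. The function name, the zero-based indexing, and the assignment of the three example values to the pairs (E1, E2), (E1, E3), and (E2, E3) in that order are assumptions.

```python
def build_similarity_matrix(n, pair_scores):
    """Build an n x n symmetric matrix with ones on the diagonal."""
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), score in pair_scores.items():
        matrix[i][j] = score
        matrix[j][i] = score  # mirror the value to keep the matrix symmetric
    return matrix

# Indices 0, 1, 2 stand for E1, E2, E3.
matrix = build_similarity_matrix(3, {(0, 1): 0.4, (0, 2): 0.7, (1, 2): 0.6})
for row in matrix:
    print(row)
```

Only the upper triangle needs to be supplied; the off-diagonal values are what the subsequent distribution analysis consumes.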
[0092] (2) KDE analysis of similarity distribution: The KDE method is used to estimate the probability density of the off-diagonal elements in the similarity matrix (i.e., the comprehensive similarity of all element pairs, excluding their own similarity of 1), generating a similarity distribution curve. The core parameters of KDE analysis include kernel function type and bandwidth. This invention ensures the accuracy of the distribution analysis by dynamically adjusting these two parameters:
[0093] (2.1) Kernel function type selection: The kernel function is used to define how the sample data contributes to the probability density. This invention dynamically selects the kernel function type based on the distribution characteristics of the similarity data.
[0094] If the similarity data distribution is relatively smooth (e.g., most similarity values are concentrated between 0.5 and 0.8, with no obvious abrupt changes), choose the Gaussian kernel function, as its curve is smooth and can fit the smooth distribution well. If the similarity data has obvious peaks (e.g., the frequency of similarity values in a certain interval is significantly higher than in other intervals), choose the Epanechnikov kernel function, as it has higher fitting accuracy near the peak and can accurately capture peak features. If the number of similarity samples is small (e.g., the number of elements is less than 10, and the number of element pairs is less than 45), choose the triangular kernel function, as it is more robust to small sample data and can reduce fitting bias caused by insufficient sample size.
[0095] (2.2) Dynamic Bandwidth Adjustment: Bandwidth is a key parameter in KDE analysis, determining the smoothness of the distribution curve. Excessive bandwidth leads to an overly smooth curve, masking the true distribution peaks; insufficient bandwidth leads to an overly steep curve, resulting in false peaks. This invention dynamically adjusts the bandwidth based on the number of similarity samples (N) and the degree of distribution dispersion, adjusting and optimizing according to the Silverman rule to ensure accurate fitting of similarity data with different sample sizes and dispersion levels. For example, when the number of similarity samples is large and the distribution dispersion is high, the bandwidth setting is larger; when the number of samples is small and the distribution is concentrated, the bandwidth setting is smaller.
[0096] (3) Identify distribution peaks and salient intervals: Based on the similarity distribution curve generated by KDE, identify the distribution peaks and corresponding salient intervals:
[0097] (3.1) Distribution peak identification: Distribution peak refers to the local maximum point in the similarity distribution curve, corresponding to the interval where the similarity value appears most frequently. For example, if the distribution curve reaches its highest point at the similarity value of 0.7, and the curves on both sides of this point show a downward trend, then the interval where 0.7 is located (such as 0.65-0.75) is the distribution peak interval; if the curve has multiple local maximum points (such as the intervals of 0.5-0.6 and 0.7-0.8 are both peaks), then it is identified as a multi-peak distribution, indicating that there are multiple sets of similar elements in the webpage.
[0098] (3.2) Significant Interval Identification: A significant interval refers to the range of continuous distribution of similarity values around a peak, that is, the range extending from the peak interval to both sides until the distribution curve drops to half the peak height. For example, if the peak interval is 0.65-0.75, and the distribution curve drops to half the peak height at 0.6 and also at 0.8, then the significant interval is 0.6-0.8. The element pairs corresponding to the similarity values within the significant interval are determined to have potential similarity and are the main objects of subsequent clustering.
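The distribution analysis can be sketched in simplified form as below. This sketch assumes a Gaussian kernel with a fixed, caller-supplied bandwidth and locates only the single highest peak; kernel selection, dynamic bandwidth adjustment, and multi-peak handling are not shown. The sample values, grid resolution, and function names are hypothetical.

```python
import math

def gaussian_kde(samples, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

def peak_and_interval(grid, density):
    """Return (peak_x, left_x, right_x) using the half-peak-height rule."""
    peak = max(range(len(grid)), key=density.__getitem__)
    half = density[peak] / 2.0
    left = peak
    while left > 0 and density[left - 1] >= half:
        left -= 1  # extend left while the curve stays above half height
    right = peak
    while right < len(grid) - 1 and density[right + 1] >= half:
        right += 1  # extend right in the same way
    return grid[peak], grid[left], grid[right]

samples = [0.68, 0.70, 0.70, 0.72, 0.40]  # hypothetical pair similarities
grid = [i / 100 for i in range(101)]       # evaluation points on [0, 1]
density = gaussian_kde(samples, bandwidth=0.05, grid=grid)
peak_x, lo, hi = peak_and_interval(grid, density)
print(peak_x, lo, hi)
```

The returned interval plays the role of the significant interval: pair similarities between `lo` and `hi` are the candidates for clustering, while the isolated value 0.40 falls outside it.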
[0099] The core value of distribution analysis lies in capturing the objective distribution characteristics of similarity through KDE analysis, providing data support for the dynamic adjustment of clustering parameters (such as tolerance range), enabling the clustering process to adapt to the similarity distribution of different web pages, avoiding the subjectivity and limitations of manually setting fixed parameters, and improving the accuracy and rationality of clustering.
[0100] S5. Adaptive Clustering: Based on the distribution peaks and significant intervals identified in step S4, multi-objective clustering is performed on the HTML elements. The tolerance range of the clustering is dynamically adjusted to adapt to the characteristics of the significant intervals. Clusters whose similarity meets the preset merging conditions in the clustering results are merged, and the quality of each merged cluster is evaluated. The quality evaluation includes evaluating the number of elements in the cluster, the standard deviation of the comprehensive similarity of elements within the cluster, and the path consistency of elements within the cluster.
[0101] In step S5, the specific method for dynamically adjusting the tolerance range of clustering is as follows: based on the median value of the salient interval identified in step S4, the tolerance range is set to 1 / 5 to 1 / 3 of the width of the salient interval; if the width of the salient interval is greater than 0.2, the tolerance range is 0.05 to 0.08; if the width of the salient interval is less than or equal to 0.2, the tolerance range is 0.03 to 0.05; the adjustment of the tolerance range aims to ensure that clustering can cover all similar elements within the salient interval, while avoiding the inclusion of dissimilar elements.
[0102] In step S5, the preset merging conditions for cluster merging are: the average comprehensive similarity of the two clusters is greater than or equal to 0.85, and the sum of the number of elements in the two clusters is less than or equal to the preset maximum cluster size; the maximum cluster size is determined based on the total number of HTML elements corresponding to the target field in the webpage. If the total number is less than 100, the maximum cluster size is 1 / 3 of the total number; if the total number is greater than or equal to 100, the maximum cluster size is 50. This condition setting can avoid the decrease in positioning accuracy caused by excessively large clusters, and at the same time reduce redundant clusters.
[0103] In step S5, the specific indicators and weights of the quality assessment are as follows: the weight of the number of cluster elements is 0.3, the weight of the similarity consistency of elements within the cluster (derived from the standard deviation of their comprehensive similarity) is 0.4, and the weight of the path consistency of elements within the cluster is 0.3; the total quality assessment score = (element count score × 0.3) + (similarity consistency score × 0.4) + (path consistency score × 0.3), where the similarity consistency score is 1 - (standard deviation / 0.5), so that a smaller standard deviation yields a higher score; the cluster with the highest total score is the optimal cluster.
[0104] More specifically, adaptive clustering can achieve accurate grouping of similar elements. Adaptive clustering is the core step in similar element identification. Its purpose is to automatically group (cluster) web page elements with high similarity based on the similarity distribution characteristics obtained from distribution analysis, and to select the optimal cluster through clustering quality evaluation, providing a high-quality set of similar elements for subsequent XPath generation. The adaptive clustering of this invention differs from traditional fixed-parameter clustering. It achieves accurate grouping of similar elements by dynamically adjusting clustering parameters (tolerance range), merging redundant clusters, and conducting multi-dimensional quality evaluation. The specific steps include:
[0105] (1) Multi-objective clustering initialization: Based on the salient intervals identified by distribution analysis, the multi-objective clustering process is initialized:
[0106] (1.1) Cluster center selection: Select the element with the highest comprehensive similarity from all target elements as the initial cluster center, for example, select the element with the highest average comprehensive similarity with other elements; if there are multiple significant intervals (multi-peak distribution), select an initial cluster center for each significant interval to achieve multi-target clustering (i.e., cluster multiple groups of similar elements at the same time) and ensure that similar elements in different groups can be effectively identified.
[0107] (1.2) Tolerance range dynamic setting: The tolerance range is the core parameter of clustering, which defines the similarity threshold for elements to be added to the cluster. If the overall similarity between an element and the cluster center is greater than or equal to the lower limit of the tolerance range, then the element can be added to the cluster. This invention dynamically sets the tolerance range based on the characteristics of salient intervals: using the median of the salient interval as a benchmark, the tolerance range is set to 1 / 5-1 / 3 of the salient interval width. For example, if the salient interval is 0.6-0.8, the median is 0.7, and the width is 0.2, then the tolerance range is set to 0.7 ± (0.2 × 1 / 4) = 0.65-0.75 (taking 1 / 4 as an adjustment coefficient, located between 1 / 5 and 1 / 3). If the salient interval width is greater than 0.2 (e.g., 0.5-0.8, width 0.3), then the tolerance range takes the benchmark value ± 0.08 (1 / 3.75 of the width); if the salient interval width is less than or equal to 0.2 (e.g., 0.7-0.8, width 0.1), then the tolerance range takes the benchmark value ± 0.03 (1 / 3.33 of the width). This dynamic setting of the tolerance range ensures that clustering covers similar elements within the salient interval while avoiding the inclusion of dissimilar elements, balancing the completeness and accuracy of clustering.
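One possible reading of the tolerance rule is sketched below: the half-width is the interval width times an adjustment coefficient, then clamped to the band that applies to that width. The clamping, the default coefficient of 1/4, and the function name are assumptions; the text's second example instead picks the upper end of the band (0.08) directly.

```python
def tolerance_range(low, high, coefficient=0.25):
    """Return (lower, upper) similarity bounds around the interval midpoint."""
    width = round(high - low, 10)  # rounding guards against float noise near 0.2
    center = (low + high) / 2
    # Band for the half-width: 0.05-0.08 for wide intervals, else 0.03-0.05.
    band_min, band_max = (0.05, 0.08) if width > 0.2 else (0.03, 0.05)
    half_width = min(max(width * coefficient, band_min), band_max)
    return center - half_width, center + half_width

print(tolerance_range(0.6, 0.8))  # approximately (0.65, 0.75)
print(tolerance_range(0.7, 0.8))  # approximately (0.72, 0.78)
```

An element would join a cluster when its comprehensive similarity to the cluster center is at least the lower bound returned here.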
[0108] (1.3) Element clustering assignment: Calculate the comprehensive similarity between each element and each cluster center, and assign the element to the cluster whose center has the highest comprehensive similarity to it, provided that this similarity also meets the tolerance range requirement; if the comprehensive similarity between an element and all cluster centers is lower than the lower limit of the tolerance range, the element is marked as an "element to be assigned" and will be re-evaluated after subsequent cluster merging to avoid misjudgment of elements due to the initial clustering parameter settings.
[0109] (2) Cluster merging optimization: Since there may be redundancy in the initial clustering (such as two clusters having highly similar elements, but being divided into two clusters due to different initial centers), this invention reduces redundant clusters and improves clustering quality through cluster merging optimization:
[0110] (2.1) Cluster similarity calculation: Calculate the average comprehensive similarity between any two clusters, that is, the average of the comprehensive similarity between all elements in the first cluster and all elements in the second cluster, and use this as an indicator to measure the similarity between the two clusters.
[0111] (2.2) Merging condition judgment: If the average comprehensive similarity of the two clusters is greater than or equal to the preset merging threshold (default 0.85, which can be adjusted according to the web page scenario), and the sum of the number of elements in the two clusters is less than or equal to the preset maximum cluster size (determined according to the total number of elements: when the total number of elements is <100, the maximum size is 1 / 3 of the total number; when the total number of elements is ≥100, the maximum size is 50), then the two clusters will be merged into a new cluster. The center of the new cluster can be selected from the elements with the highest average comprehensive similarity with other elements in the merged cluster.
[0112] (2.3) Reassignment of elements to be assigned: After the clusters are merged, the overall similarity between the elements to be assigned and each new cluster center is recalculated. Elements that meet the tolerance range requirements are assigned to the corresponding clusters. If there are still unassigned elements, they are marked as "abnormal elements" (which may be non-target field elements or structurally abnormal elements) and will not participate in XPath generation in the future to avoid affecting the accuracy of XPath expressions.
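The merging condition from steps (2.1) and (2.2) can be sketched as follows. Clusters are represented as lists of element indices into a similarity matrix; this representation, the function names, and the example matrix values are assumptions for illustration.

```python
def max_cluster_size(total_elements):
    """Size cap from the text: one third of the total below 100, else 50."""
    return total_elements / 3 if total_elements < 100 else 50

def average_cross_similarity(cluster_a, cluster_b, matrix):
    """Mean comprehensive similarity over all cross-cluster element pairs."""
    scores = [matrix[i][j] for i in cluster_a for j in cluster_b]
    return sum(scores) / len(scores)

def should_merge(cluster_a, cluster_b, matrix, total_elements, threshold=0.85):
    """Both conditions must hold: similar enough and small enough."""
    similar = average_cross_similarity(cluster_a, cluster_b, matrix) >= threshold
    fits = len(cluster_a) + len(cluster_b) <= max_cluster_size(total_elements)
    return similar and fits

# Hypothetical similarities among four elements.
matrix = [
    [1.00, 0.95, 0.90, 0.88],
    [0.95, 1.00, 0.86, 0.90],
    [0.90, 0.86, 1.00, 0.93],
    [0.88, 0.90, 0.93, 1.00],
]
print(should_merge([0, 1], [2, 3], matrix, total_elements=12))  # True
print(should_merge([0, 1], [2, 3], matrix, total_elements=9))   # False
```

In the second call the average similarity (0.885) passes the threshold, but the merged size of 4 exceeds the cap of 3, so the clusters stay separate.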
[0113] (3) Cluster quality assessment: In order to select the optimal cluster of similar elements (i.e., the set of elements that best represents the target field), this invention assesses the quality of each cluster from three dimensions:
[0114] (3.1) Evaluation of the number of cluster elements: The number of cluster elements must meet the preset minimum cluster size (range 2-10, default 3), and the number should be as close as possible to a reasonable range (e.g., 1/5-1/2 of the total number of elements). For example, when the total number of elements is 10, the reasonable cluster size is 2-5 elements; if the number of elements in a cluster is 1 (less than the minimum cluster size of 3), the quality evaluation score of the cluster is 0; if the number of elements in a cluster is 4 (within the reasonable range), the score is 1 (full marks).
[0115] (3.2) Cluster Similarity Consistency Assessment: The similarity consistency of elements within a cluster is assessed by calculating the comprehensive similarity standard deviation of all element pairs within the cluster. The smaller the standard deviation, the more consistent the similarity of elements within the cluster, and the higher the cluster quality. For example, if the comprehensive similarity standard deviation of cluster A is 0.05 and the standard deviation of cluster B is 0.2, then the similarity consistency of cluster A is better than that of cluster B. The assessment score is calculated as: 1 - (standard deviation / 0.5) (0.5 is the maximum reasonable value of the standard deviation; if it exceeds this value, the score is 0). The score for cluster A is 1 - (0.05 / 0.5) = 0.9, and the score for cluster B is 1 - (0.2 / 0.5) = 0.6.
[0116] (3.3) Path consistency assessment within clusters: Path consistency reflects the degree of uniformity in the structure of elements within a cluster. It is calculated by taking into account the proportion of identical tag names at each level in the complete tag paths of all elements within the cluster. For example, if there are 3 elements in a cluster, and the 3rd level tag names in their paths are all div tags containing a specific class name, the similarity is 100%; in the 4th level tag names, two are ul tags containing a specific class name, and one is an ul tag with another class name, the similarity is 66.7%; then the path consistency is the average of the similarity of each level, i.e., (100% + 66.7%) / 2 ≈ 83.3%, and the assessment score is this average value (0.833).
[0117] (4) Optimal cluster selection: The evaluation scores of the three dimensions are weighted and summed according to their weights (number of elements weight 0.3, similarity consistency weight 0.4, path consistency weight 0.3) to obtain the total cluster quality score:
[0118] The total score is calculated as follows: (Element count score × 0.3) + (Similarity consistency score × 0.4) + (Path consistency score × 0.3). The cluster with the highest total score is selected as the optimal cluster, and the elements in this cluster are the set of similar web page elements corresponding to the target field.
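The scoring arithmetic can be sketched as below. The three per-dimension scores are taken as inputs; only the similarity consistency score is derived here, using the 1 - (standard deviation / 0.5) rule from step (3.2). Function names are assumptions.

```python
def similarity_consistency_score(std_dev, max_std=0.5):
    """Map a standard deviation to [0, 1]; smaller deviation scores higher."""
    return max(0.0, 1.0 - std_dev / max_std)

def total_quality_score(size_score, consistency_score, path_score):
    """Weighted sum with the weights 0.3, 0.4, and 0.3 given in the text."""
    return 0.3 * size_score + 0.4 * consistency_score + 0.3 * path_score

# Values taken from the examples above: full size score, std 0.05, path 0.833.
consistency = similarity_consistency_score(0.05)
print(round(total_quality_score(1.0, consistency, 0.833), 4))  # 0.9099
```

Computing this total for every cluster and taking the maximum gives the optimal cluster.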
[0119] The core advantage of adaptive clustering lies in its ability to adapt to different webpage structures through dynamic parameter adjustment (tolerance range), redundant cluster merging, and multi-dimensional quality assessment. This ensures the accuracy and reliability of the clustering results, provides a high-quality set of elements for generating robust XPath expressions, and lays the foundation for accurate data extraction.
[0120] S6. XPath Generation: Extract the common prefix (the sequence of tag hierarchy common to all paths) of the complete tag paths of all HTML elements from the cluster with the best quality assessment results in step S5. Generate a general XPath expression based on the common prefix and optimize the hierarchical structure of the general XPath expression to improve its adaptability to changes in web page structure.
[0121] In step S6, the specific method for optimizing the hierarchical structure of the general XPath expression is as follows: delete the tag levels containing dynamic indexes in the general XPath expression; if there is ambiguity after deleting the dynamic indexes, retain the outermost or innermost fixed index level; trim redundant tag levels in the expression, retain the core positioning level, and finally generate a concise and robust general XPath expression.
[0122] More specifically, XPath generation is the ultimate goal of this invention. Its purpose is to extract the common features of similar elements from the optimal cluster and to generate a general, robust XPath expression that accurately identifies all similar elements in a webpage corresponding to the target field and adapts well to changes in webpage structure. Specific steps include:
[0123] (1) Common prefix extraction: The common prefix is the sequence of common label levels of the complete label paths of all elements in the optimal cluster, and it is the core foundation of general XPath expressions. The specific process of extracting the common prefix is as follows:
[0124] (1.1) Path alignment: Align the complete label paths of all elements in the optimal cluster according to the level, that is, ensure that the number of levels of all paths is consistent (if there is a difference in the number of levels, take the number of levels of the shortest path as the benchmark, and ignore the excess levels of the longer path). For example, if there are 3 elements in the optimal cluster, and their complete label paths all contain 6 levels, no adjustment is needed, and they can be directly aligned.
[0125] (1.2) Hierarchical Tag Comparison: Starting from the initial level of the path (root node "html"), compare the tag names and fixed attributes (such as class and id attributes) of all corresponding levels of the paths one by one. If the tag names and fixed attributes of a certain level are completely consistent in all paths, then that level is the common level; if there is at least one path whose tag name or fixed attribute is different from other paths, then the comparison stops, and subsequent levels are no longer included in the common prefix. For example, if the first four levels of three element paths ("html", "body", "div tag with a specific class name", "ul tag with a specific class name") are completely consistent, and the fifth level is inconsistent due to different indices, then the common prefix is the path composed of the first four levels.
[0126] The basis for extracting common prefixes is that similar elements belonging to the same field usually maintain a stable upper-level tag hierarchy (common prefix), while the lower-level tag hierarchy (such as li tags with dynamic indexes and a tags with changing class names) may differ due to changes in element position and class name. Therefore, common prefixes are the key to ensuring the robustness of XPath expressions.
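Common prefix extraction across all paths of the optimal cluster can be sketched as follows. Each level is represented as a string combining tag name and fixed attribute; this encoding and the example paths are assumptions.

```python
def common_prefix(paths):
    """Return the leading levels shared by every path."""
    prefix = []
    for level in zip(*paths):  # zip stops at the shortest path (alignment)
        if any(segment != level[0] for segment in level):
            break  # first disagreement ends the prefix
        prefix.append(level[0])
    return prefix

# Hypothetical paths of three elements in the optimal cluster.
paths = [
    ["html", "body", "div.container", "ul.goods-list", "li[1]", "a"],
    ["html", "body", "div.container", "ul.goods-list", "li[2]", "a"],
    ["html", "body", "div.container", "ul.goods-list", "li[3]", "a"],
]
print(common_prefix(paths))
# ['html', 'body', 'div.container', 'ul.goods-list']
```

Using `zip` implements the alignment rule of step (1.1) implicitly, because levels beyond the shortest path are never compared.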
[0127] (2) Construction of general XPath expressions: Based on the extracted common prefix, combined with the tag type of the target element (e.g., the a tag), a preliminary general XPath expression is constructed. For example, if the target elements in the optimal cluster are all a tags, "//a" is appended after the common prefix to obtain the preliminary expression, where "//a" selects the a-tag descendant nodes of the node corresponding to the common prefix. The expression therefore covers a tags at any position under the common prefix, which prevents location failures caused by changes in the lower-level hierarchy.
[0128] If the target element has specific semantic features (such as product price elements usually containing the "¥" symbol, or news release time containing the "year / month / day" format), attribute filtering conditions can be added to the expression to further improve the accuracy of positioning.
[0129] (3) Hierarchical structure optimization: To further improve the robustness and simplicity of general XPath expressions, the hierarchical structure of the expressions is optimized, mainly including dynamic index deletion and redundant level pruning:
[0130] (3.1) Deletion of dynamic indexes: Dynamic indexes (such as the numeric indexes in "div[2]" and "li[3]") are the main reason why XPath expressions are sensitive to changes in web page structure - when the web page content is updated, the indexes are very easy to change (such as adding a new div tag, the original "div[2]" becomes "div[3]"). Therefore, when optimizing, first delete the tag levels containing dynamic indexes in the expression, and only keep the tag levels with fixed attributes (such as class, id). For example, if the common prefix contains "div[2]" and the div tag has no fixed attributes, then delete "div[2]"; if "div[2]" has a fixed id attribute, then keep the fixed attributes and delete the dynamic indexes.
[0131] (3.2) Redundant Hierarchy Pruning: Redundant hierarchies refer to tag hierarchies (usually global common container hierarchies) that do not affect element positioning after deletion. For example, if a common prefix contains "header tag with a specific id", this tag is a global header container for the webpage, and almost all elements are located under this container. After deletion, the remaining path can still accurately locate the target element. Therefore, "header tag with a specific id" is a redundant hierarchy and should be pruned. Pruning redundant hierarchies can make XPath expressions more concise and reduce the risk of positioning failure caused by changes in the redundant hierarchical structure.
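One way to test whether a level is redundant is to delete it and check that the expression still locates exactly the same elements. The sketch below does this with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample page, the helper names, and the choice to join the remaining levels with "//" are assumptions for illustration; a real HTML page would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body><header id="top">
<div class="nav"><a href="/home">Home</a></div>
<ul class="goods">
<li><a href="/goods/123">XX mobile phone</a></li>
<li><a href="/goods/456">YY mobile phone</a></li>
</ul>
</header></body></html>"""


def locate(root, steps, tag):
    """Return the set of elements matched by the steps joined with '//'."""
    path = ".//" + "//".join(list(steps) + [tag])
    return set(root.findall(path))


def prune_redundant_levels(root, steps, tag):
    """Drop every level whose removal leaves the located elements unchanged."""
    target = locate(root, steps, tag)
    kept = list(steps)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if locate(root, candidate, tag) == target:
            kept = candidate      # level i was redundant
        else:
            i += 1                # level i is needed for positioning
    return kept


root = ET.fromstring(PAGE)
pruned = prune_redundant_levels(
    root, ["header[@id='top']", "ul[@class='goods']"], "a"
)
```

In this sample the header level is pruned because the ul level alone still selects the two product links, while the ul level is kept because removing it would also match the navigation link.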
[0132] Through hierarchical structure optimization, robust and concise general XPath expressions are ultimately generated.
[0133] S7. Output: Output an optimized general XPath expression, which is used to identify similar HTML elements in the webpage that correspond to the target field, supporting subsequent applications in webpage data crawling, automated testing, and other scenarios.
[0134] More specifically, the optimized general XPath expression that is output can be applied directly to scenarios such as web scraping and automated testing. In web scraping scenarios, the XPath expression locates all similar elements on a webpage that correspond to the target field (such as the tags of all product names), after which the text content (such as the product name "XX mobile phone") or an attribute value (such as the product link "/goods/123") is extracted.
[0135] In web page automation testing scenarios, this XPath expression can be used as a UI element locator for functional testing (such as verifying the redirect function when clicking on a product name link) or regression testing (such as verifying whether the product name is displayed correctly and whether the price is loaded correctly), avoiding test process failures caused by changes in web page structure.
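As an illustration of the scraping use, the sketch below applies a general expression to a small page and collects the text and link of each matched element. The page content, the expression, and the function name `extract_field` are hypothetical; `xml.etree.ElementTree` is used only because it is in the standard library, and it requires well-formed markup and a relative path.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body><ul class="goods">
<li><a href="/goods/123">XX mobile phone</a></li>
<li><a href="/goods/456">YY mobile phone</a></li>
</ul></body></html>"""


def extract_field(page, xpath):
    """Return (text, href) for every element located by the expression."""
    root = ET.fromstring(page)
    return [((e.text or "").strip(), e.get("href")) for e in root.findall(xpath)]


records = extract_field(PAGE, ".//ul[@class='goods']//a")
```

The same locator string could be passed to a UI testing tool as an element locator; that integration is not shown here.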
[0136] This invention is applicable to fields such as web page data crawling (automatic extraction of e-commerce product information, news content, and social media data), automated web page testing (such as UI element positioning, functional testing, and element identification in regression testing), structured data extraction (such as automatically identifying and extracting structured information from web pages), competitor data monitoring (automatic collection and analysis of competitor prices and product information), and web crawler systems (which can improve the crawler's adaptability to changes in web page structure).
[0137] This invention provides a web page element similarity detection method based on adaptive clustering. Through five core steps, namely element path extraction, multi-dimensional similarity calculation, kernel density distribution analysis, adaptive clustering and robust XPath generation, it can achieve accurate identification and robust localization of similar web page elements.
[0138] This embodiment does not impose any limitation on the shape, material, structure, etc. of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention shall fall within the protection scope of the technical solution of the present invention.
Claims
1. A webpage element similarity detection method based on adaptive clustering, characterized in that it includes the following steps:
S1. Input Field XPath Processing: Receive at least one XPath path of the target field as input, and verify the format validity of the XPath path. If there is a syntax error in the XPath path, an error message is returned and the current process is terminated.
S2. Element path extraction: Locate the corresponding web page HTML element based on the verified XPath path, extract the complete tag path of the HTML element from its own node to the root node of the web page, construct an element identifier containing the tag name of the HTML element, and cache the complete tag path and element identifier.
S3. Similarity Calculation: Pair all HTML elements corresponding to the complete tag paths extracted in step S2 to obtain element pairs; for each element pair, calculate the path structure similarity and class name similarity respectively, and fuse the path structure similarity and class name similarity using a weighted combination method to obtain the comprehensive similarity of the element pair; wherein the path structure similarity is calculated based on the ratio of the common prefix length of the complete tag paths of the two elements to the longest path length, and the class name similarity is calculated based on the Jaccard similarity coefficient of the class name sets of the two elements.
S4. Distribution Analysis: Construct a symmetric similarity matrix from the comprehensive similarity of all element pairs obtained in step S3. Use kernel density estimation (KDE) to perform statistical distribution analysis on the comprehensive similarity values in the similarity matrix, identify the distribution peaks and corresponding significant intervals of the comprehensive similarity values, and dynamically adjust the parameters of the KDE analysis to optimize the distribution identification results.
S5. Adaptive Clustering: Based on the distribution peaks and significant intervals identified in step S4, multi-objective clustering is performed on the HTML elements. The tolerance range of the clustering is dynamically adjusted to adapt to the characteristics of the significant intervals. Clusters whose similarity meets the preset merging conditions in the clustering results are merged, and the quality of each merged cluster is evaluated. The quality evaluation includes evaluating the number of elements in the cluster, the standard deviation of the comprehensive similarity of elements within the cluster, and the path consistency of elements within the cluster.
S6. XPath Generation: Extract the common prefix of the complete tag paths of all HTML elements from the cluster with the best quality assessment result in step S5, generate a general XPath expression based on the common prefix, and optimize the hierarchical structure of the general XPath expression to improve its adaptability to changes in web page structure.
S7. Output: Output an optimized general XPath expression, which is used to identify similar HTML elements in the webpage that correspond to the target field, supporting subsequent webpage data crawling and automated testing scenarios.
In step S3, in the weighted combination, the weight of the path structure similarity ranges from 0.5 to 0.8, and the weight of the class name similarity ranges from 0.2 to 0.5.
The specific steps for calculating path structure similarity are as follows:
S311. Segment the two complete tag paths in the element pair, and split each complete tag path into multiple path segments according to the tag level, with each path segment corresponding to an HTML tag node;
S312. Starting from the initial path segments of the two complete tag paths, compare the corresponding path segments one by one. If the tag names and fixed attributes of the path segments are completely identical, they are determined to be the same path segment. Count the number of consecutive identical path segments as the common prefix length $L_{common}$;
S313. Calculate the total number of path segments for each of the two complete tag paths to obtain the path lengths $L_1$ and $L_2$, and take the maximum of $L_1$ and $L_2$ as the longest path length $L_{max}$;
S314. Path structure similarity $S_{path} = L_{common} / L_{max}$, where the value range of $S_{path}$ is [0,1]. The closer the value is to 1, the more similar the path structures of the two elements are.
In step S3, the specific steps for calculating class name similarity are as follows:
S321. Extract the class names of the two HTML elements in an element pair. If a single HTML element has multiple class names, combine all the class names into a set of class names for that element, giving the sets $C_1$ and $C_2$. If an HTML element has no class name, the corresponding set of class names is empty;
S322. Compute the intersection $I$ and union $U$ of the class name sets $C_1$ and $C_2$, where $I = C_1 \cap C_2$ and $U = C_1 \cup C_2$;
S323. Class name similarity $S_{class} = |I| / |U|$, where $|I|$ indicates the number of elements in the intersection $I$, $|U|$ indicates the number of elements in the union $U$, and the value range of $S_{class}$ is [0,1]; if $U$ is empty, then $S_{class} = 1$, which indicates that there is no difference between the two in terms of the class name dimension.
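For illustration only, the two similarity measures and their weighted fusion can be sketched as below. The function names are hypothetical, path segments are compared as whole strings (which assumes each segment holds only the tag name and its fixed attributes), and the default weights 0.6 and 0.4 are one choice inside the claimed ranges.

```python
def path_similarity(path_a, path_b):
    """Common prefix length divided by the longer path length."""
    seg_a = path_a.strip("/").split("/")
    seg_b = path_b.strip("/").split("/")
    common = 0
    for a, b in zip(seg_a, seg_b):
        if a != b:                 # stop at the first differing segment
            break
        common += 1
    return common / max(len(seg_a), len(seg_b))


def class_similarity(classes_a, classes_b):
    """Jaccard coefficient of two class name sets.

    Two empty sets are treated as identical in the class name dimension.
    """
    union = classes_a | classes_b
    if not union:
        return 1.0
    return len(classes_a & classes_b) / len(union)


def combined_similarity(s_path, s_class, w_path=0.6, w_class=0.4):
    """Weighted fusion of the two similarities (weights are assumptions)."""
    return w_path * s_path + w_class * s_class


s_path = path_similarity("/html/body/div/ul/li/a", "/html/body/div/ul/li/span")
s_class = class_similarity({"title", "link"}, {"title"})
s_total = combined_similarity(s_path, s_class)
```

In this example the paths share five of six segments and the class sets share one of two names, so the fused value weights a high structural match against a partial class name match.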
2. The webpage element similarity detection method based on adaptive clustering according to claim 1, characterized in that: In step S2, the method of caching the complete tag path and element identifier is as follows: the complete tag path and the corresponding element identifier are associated and stored in the memory cache or local disk cache, and the cache validity period is set according to the web page data update frequency; If the complete tag path and element identifier are needed again within the cache validity period, they will be read directly from the cache without needing to relocate the HTML element and extract the path.
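3. The webpage element similarity detection method based on adaptive clustering according to claim 1, characterized in that, In step S3, in the weighted combination, the weight of the path structure similarity is set to 0.6, the weight of the class name similarity is set to 0.4, and the comprehensive similarity is $S = 0.6 \times S_{path} + 0.4 \times S_{class}$, where $S_{path}$ is the path structure similarity and $S_{class}$ is the class name similarity.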
4. The webpage element similarity detection method based on adaptive clustering according to claim 1, characterized in that, In step S4, the parameters for dynamically adjusting the KDE analysis include adjusting the kernel function type and bandwidth; the kernel function type is selected from any one of the Gaussian kernel function, Epanechnikov kernel function, or triangular kernel function; the bandwidth is dynamically adjusted according to the number of samples and the degree of dispersion of the comprehensive similarity value.
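The sketch below illustrates a kernel density estimate over similarity values with the three kernel types named above. The bandwidth rule is an assumption: it uses Silverman's rule of thumb, which depends on the sample count and the standard deviation, as one possible way of adjusting bandwidth to sample size and dispersion. The function names, the grid over [0, 1], and the sample values are likewise hypothetical.

```python
import math
import statistics

# Kernel functions K(u); the last two are zero outside |u| <= 1.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "triangular": lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,
}


def rule_of_thumb_bandwidth(samples):
    """Silverman-style bandwidth from sample count and dispersion (assumption)."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    return max(1.06 * sigma * n ** (-0.2), 1e-3)   # floor avoids a zero bandwidth


def kde(samples, x, bandwidth, kernel="gaussian"):
    """Density estimate at x: mean of scaled kernels centred on the samples."""
    k = KERNELS[kernel]
    return sum(k((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)


def density_peaks(samples, kernel="gaussian", grid_size=101):
    """Grid points in [0, 1] where the estimated density has a local maximum."""
    h = rule_of_thumb_bandwidth(samples)
    xs = [i / (grid_size - 1) for i in range(grid_size)]
    ys = [kde(samples, x, h, kernel) for x in xs]
    return [xs[i] for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]]


# Hypothetical comprehensive similarity values forming two groups.
similarities = [0.12, 0.15, 0.18, 0.15, 0.82, 0.85, 0.88, 0.85]
peaks = density_peaks(similarities)
```

Each returned peak marks a similarity level around which element pairs concentrate; a significant interval could then be taken around each peak, which this sketch does not attempt.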