A method, system, computer, and storage medium for structured extraction of multimodal web page data based on visual features and large models.

By combining webpage screenshots with HTML source code analysis, and utilizing multimodal large model generation and verification of XPath, the problems of high adaptation costs and insufficient accuracy in multi-website structured data collection are solved. This enables automatic generation and real-time updating of XPath, reduces manual maintenance costs, and improves collection efficiency and stability.

CN122309876A · Pending · Publication Date: 2026-06-30 · SUZHOU AEROSPACE INFORMATION RES INST


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SUZHOU AEROSPACE INFORMATION RES INST
Filing Date: 2026-06-03
Publication Date: 2026-06-30


IPC: G06F16/958; G06F16/957; G06V10/32; G06V10/26; G06F40/30





AI Technical Summary

Technical Problem

Existing technologies suffer from core drawbacks in batch collection of structured data from multiple websites, including high cost of adapting to multiple websites, insufficient accuracy of XPath generation, inability to respond to page changes in real time, and reliance on manual maintenance. These shortcomings make it impossible to meet the adaptive maintenance requirements in batch collection scenarios from multiple websites.

Method used

By combining webpage screenshots and HTML source code analysis, XPath is generated using a multimodal large model. Combined with bimodal verification of visual and code features, XPath is automatically generated and updated in real time. A periodic monitoring mechanism is adopted to ensure the stability and adaptability of XPath.

Benefits of technology

It lowers the entry barrier for collecting structured data from multiple websites, improves collection efficiency and stability, reduces manual maintenance costs, supports extraction from multiple types of web pages, and eliminates the need to develop independent rules for different websites.



Abstract

This invention discloses a method, system, computer, and storage medium for multimodal webpage data structured extraction based on visual features and a large model. The method involves acquiring webpage screenshots and HTML source code via a browser; normalizing, enhancing, and segmenting the screenshots to output preprocessed screenshots and coordinates of the core content area; cleaning the HTML, adding XPath and hierarchical structured text; inputting both into a multimodal large model to generate an initial XPath; performing bimodal validation of the initial XPath using both code and visual rules; if either fails, retrying based on the same preprocessed data until the limit is reached or success is achieved; periodic monitoring is set up, re-collecting and preprocessing data, comparing visual and code features with historical preprocessed data corresponding to the previous valid XPath; if any feature changes, repeating the aforementioned steps to generate a new XPath. This invention enables automatic XPath generation and adaptive updates, reducing maintenance costs.


Description

Technical Field

[0001] This invention relates to network information extraction technology, specifically to a method, system, computer, and storage medium for the structured extraction of multimodal web page data based on visual features and large models.

Background Technology

[0002] XPath (XML Path Language) is a language for locating nodes in XML documents (including HTML documents). It uses path expressions to select elements or attributes in a document and is commonly used in scenarios such as data extraction and web page parsing.

[0003] In batch data acquisition applications involving structured data from multiple websites, XPath serves as a core tool for accurately locating web page elements, and its stability and adaptability directly determine the accuracy and efficiency of the acquisition. Currently, mainstream XPath acquisition methods still rely on manual coding or single-modal automatic generation, which has the following technical limitations.

[0004] Traditional manual XPath coding relies on developers' familiarity with the target webpage's DOM structure, requiring them to write XPath code individually for each website. When there are many websites, the manpower cost is extremely high. Furthermore, if the webpage structure changes even slightly, the XPath code will fail, requiring manual debugging, resulting in high maintenance costs and an inability to adapt to dynamically updated modern web pages.

[0005] Existing single-modal automatic XPath generation technologies fall into two categories: one is based solely on HTML source code parsing (e.g., CN121030536A), which generates XPath by analyzing the DOM structure through a large model or rule engine, but cannot perceive the actual rendering effect of the webpage, resulting in poor adaptability to webpages with chaotic DOM structures but clear visual layouts, and is prone to positioning errors; the other is based solely on page feature comparison (e.g., CN112579862A), which generates XPath through static comparisons such as MD5 values, lacks the ability to understand the semantics of a large model, and has limited adaptability.

[0006] Existing bimodal technologies: Some technologies (such as the Midscene.js tool and CN120894792A) introduce bimodal data of webpage screenshots and DOM parsing, but the core application scenario is web automation testing. The core goal is to improve the recognition accuracy of dynamic and hidden elements. It does not involve the automatic generation and real-time updating of XPath, and cannot meet the adaptive maintenance requirements of batch collection scenarios from multiple websites.

[0007] In summary, existing technologies generally suffer from core defects such as "high cost of adapting to multiple websites, insufficient accuracy of XPath generation, inability to respond to page changes in real time, and reliance on manual maintenance." There is an urgent need for a technical solution that can achieve automatic XPath generation, dual-modal verification, and real-time updates, thereby reducing the entry barrier and maintenance cost of multi-website structured data collection and improving collection efficiency and stability.

Summary of the Invention

[0008] The purpose of this invention is to provide a method, system, computer, and storage medium for the structured extraction of multimodal web page data based on visual features and large models.

[0009] The technical solution to achieve the purpose of this invention is: a method for structured extraction of multimodal web page data based on visual features and large models, comprising the following steps:

[0010] Step 1: Access the target webpage URL using a browser automation tool to obtain a screenshot of the webpage and the HTML source code;

[0011] Step 2: Preprocess the webpage screenshot, including size normalization, image enhancement, region segmentation and format conversion, and output the preprocessed screenshot data and the coordinates of the core content area;

[0012] Step 3: Preprocess the HTML source code, including cleaning up invalid tags and redundant attributes, adding XPath paths and node hierarchy identifiers to each DOM node, formatting tags and attributes, and outputting structured HTML text;

[0013] Step 4: Input the preprocessed screenshot data and structured HTML text into the multimodal large model, and guide the multimodal large model to generate the initial XPath of the target data item through structured prompt words;

[0014] Step 5: Perform bimodal validation on the initial XPath, including code rule validation and visual rule validation; if both validations pass, it is confirmed as a valid XPath. If either validation fails, re-execute Step 4 based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained.

[0015] Step 6: Set up periodic monitoring to periodically re-collect webpage screenshots and HTML source code. Perform the preprocessing steps 2 and 3 on the re-collected data to obtain new preprocessed screenshot data, coordinates of the core content area, and new structured HTML text. Compare the new preprocessed data with the historical preprocessed data corresponding to the previous valid XPath using visual and code features. If any feature is determined to have changed, automatically repeat steps 2 to 5 to generate a new valid XPath, achieving adaptive updating of XPath.

[0016] Furthermore, the size normalization mentioned in step 2 specifically involves: scaling the screenshots to a standard size of 1920×1080. If the original resolution is different, a bilinear interpolation algorithm is used to ensure that the image is not distorted. The region segmentation adopts a multi-feature fusion strategy, including edge density features, text density features, layout prior rules, and tag semantic features, to identify the core content area of the webpage and exclude the navigation bar and advertising area.

[0017] Furthermore, the cleaning of invalid tags and redundant attributes in step 3 includes deleting comment tags, empty tags, and JavaScript content that does not load data, and deleting inline style attributes.

[0018] Furthermore, the code rule verification in step 5 specifically involves: using initial XPath to locate nodes in the preprocessed HTML source code, verifying the existence of the nodes, and verifying whether the node text conforms to the semantic format of the target data item; the visual rule verification specifically involves: setting the browser viewport to 1920×1080, reloading the target webpage URL, obtaining the screen coordinates of the target node through initial XPath, verifying whether the coordinates fall within the core content area coordinates obtained in step 2, and simultaneously recognizing the text in the coordinate area through OCR and comparing its similarity with the DOM node text. If the similarity is not less than 90%, it is considered passed.

[0019] Furthermore, the preset retry limit mentioned in step 5 is 3 times. When the limit is exceeded, the system issues an alarm and retains the previous valid XPath.

[0020] Furthermore, in step 6, the visual feature comparison uses the average hash algorithm to calculate the hash value of the core content area obtained in step 2, and a similarity of less than 90% is considered a significant change; the code feature comparison includes comparing the hierarchy and tag type of the DOM node corresponding to the target data item.

[0021] Furthermore, the trigger time for the periodic monitoring mentioned in step 6 is 2:00 AM every day.

[0022] A multimodal webpage data structure extraction system based on visual features and large models, used to implement the aforementioned multimodal webpage data structure extraction method based on visual features and large models, includes:

[0023] The data acquisition module is used to access the target webpage URL through browser automation tools to obtain webpage screenshots and HTML source code;

[0024] The screenshot preprocessing module is used to preprocess webpage screenshots, including size normalization, image enhancement, region segmentation and format conversion, and outputs the preprocessed screenshot data and the coordinates of the core content area;

[0025] The HTML preprocessing module is used to preprocess HTML source code, including cleaning invalid tags and redundant attributes, adding XPath paths and node hierarchy identifiers to each DOM node, formatting tags and attributes, and outputting structured HTML text.

[0026] The XPath generation module is used to input preprocessed screenshot data and structured HTML text into the multimodal large model, and guide the multimodal large model to generate the initial XPath of the target data item through structured prompt words;

[0027] The dual-modal verification module is used to perform code rule verification and visual rule verification on the initial XPath. If both verifications pass, it is confirmed as a valid XPath. If either verification fails, the XPath generation module is retried based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained.

[0028] The periodic monitoring module is used to set up periodic monitoring, periodically re-collecting webpage screenshots and HTML source code. The re-collected data triggers the screenshot preprocessing module and the HTML preprocessing module respectively to obtain new preprocessed data. The new preprocessed data is compared with the historical preprocessed data corresponding to the previous valid XPath by visual features and code features. If any feature is determined to have changed, the screenshot preprocessing module, the HTML preprocessing module, the XPath generation module, and the bimodal verification module are automatically triggered again to generate new valid XPath.

[0029] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned method for structured extraction of multimodal web page data based on visual features and large models.

[0030] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned method for structured extraction of multimodal web page data based on visual features and large models.

[0031] Compared with existing technologies, the significant advantages of this invention are: 1) Combining webpage screenshots with HTML source code parsing can effectively address scenarios such as JavaScript rendering and CSS hiding; 2) Adaptive updates based on semantic difference detection can significantly reduce manual maintenance costs; 3) Supports extraction of multiple types of webpages without the need to develop independent rules for different websites.

Attached Figure Description

[0032] Figure 1 is a flowchart of the method for extracting structured data from web pages based on screenshots and large models.

Detailed Implementation

[0033] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0034] This invention discloses a method for structured extraction of multimodal web page data based on visual features and large models, aiming to solve the following core technical problems:

[0035] (1) How to establish a precise mapping between webpage visual features and DOM structure through dual-modal collaborative parsing, thereby improving the XPath generation accuracy of heterogeneous webpages with disordered DOM across multiple websites.

[0036] (2) How to design a periodic monitoring and verification mechanism for changes in web page structure, realize automatic iterative updates of XPath, and reduce manual maintenance costs.

[0037] (3) How to optimize the multi-website adaptation logic, reduce the technical entry threshold, and realize the generation and updating of end-to-end XPath that can be operated by non-professionals.

[0038] The core idea of this invention is to simultaneously collect visual information and underlying code information from web pages. Visual information is obtained by taking screenshots of the web page, while underlying code information is obtained from the HTML source code. After preprocessing, the visual and underlying code information is input into a multimodal large model. Through collaborative model parsing, a precise mapping between visual elements and DOM nodes is established, automatically generating XPath for target data items such as title, body text, publication time, and author. Simultaneously, a closed-loop mechanism of "periodic monitoring - bimodal verification - automatic update" is constructed to continuously adapt to changes in web page structure, ensuring the long-term stability and effectiveness of XPath, ultimately achieving batch and efficient collection of structured data from multiple websites. The detailed process is shown in Figure 1.

[0039] The specific implementation steps of this invention are as follows:

[0040] Step 1: Initialize Playwright by launching a headless browser using Python code, employing the Chrome engine, and setting the page load timeout to 30 seconds. Access the specified URL, and after the page loads, capture a screenshot (PNG format) of the current webpage and the corresponding HTML source code. Simultaneously, save the original URL, collection timestamp, and other relevant information for traceability.
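Step 1 could be sketched roughly as follows, using Playwright's synchronous Python API. This is a minimal illustration rather than the patented implementation: the function names `capture` and `build_record`, the record layout, and the 1920×1080 viewport (taken from the later verification step, not from this paragraph) are assumptions.

```python
from datetime import datetime, timezone


def build_record(url: str, html: str, png: bytes, collected_at: datetime) -> dict:
    """Bundle one capture with the metadata kept for traceability (hypothetical layout)."""
    return {
        "url": url,
        "collected_at": collected_at.isoformat(),
        "html": html,
        "screenshot_png": png,
    }


def capture(url: str, timeout_ms: int = 30_000) -> dict:
    """Load `url` in headless Chromium and return its screenshot and HTML source."""
    # Imported lazily so build_record() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Assumed viewport; chosen to match the 1920x1080 size used in later steps.
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        page.set_default_timeout(timeout_ms)          # 30-second timeout from Step 1
        page.goto(url, wait_until="load", timeout=timeout_ms)
        png = page.screenshot(type="png")             # PNG bytes of the current viewport
        html = page.content()                         # serialized HTML of the loaded page
        browser.close()
    return build_record(url, html, png, datetime.now(timezone.utc))
```

Keeping the URL and timestamp next to the raw capture makes it possible to trace any later XPath back to the exact page state it was generated from.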

[0041] Step 2, webpage screenshot preprocessing.

[0042] Step 2.1, size normalization: All captured screenshots are uniformly scaled to a standard size of 1920*1080. If the original resolution differs, a bilinear interpolation algorithm is used: for each pixel in the target image, its mapping position in the original image is calculated, and the gray values of its four surrounding pixels are weighted and averaged to obtain the new pixel value. This algorithm effectively avoids jagged edges or distortion during image scaling.
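The weighted averaging described in Step 2.1 can be illustrated with a small pure-Python sketch on a grayscale grid. It assumes a corner-aligned mapping between target and source pixels, which is only one of several common conventions; the function name `bilinear_resize` is hypothetical, and a real pipeline would more likely call an image library.

```python
def bilinear_resize(src: list[list[float]], out_w: int, out_h: int) -> list[list[float]]:
    """Resize a grayscale image (rows of pixel values) with bilinear interpolation."""
    src_h, src_w = len(src), len(src[0])
    # Assumed mapping: corner pixels of source and target coincide.
    sx = (src_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    sy = (src_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    out = []
    for j in range(out_h):
        y = j * sy                       # position of the target row in the source
        y0 = int(y)
        y1 = min(y0 + 1, src_h - 1)
        fy = y - y0                      # vertical weight of the lower neighbors
        row = []
        for i in range(out_w):
            x = i * sx                   # position of the target column in the source
            x0 = int(x)
            x1 = min(x0 + 1, src_w - 1)
            fx = x - x0                  # horizontal weight of the right neighbors
            top = src[y0][x0] * (1 - fx) + src[y0][x1] * fx
            bottom = src[y1][x0] * (1 - fx) + src[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


small = [[0.0, 100.0],
         [100.0, 200.0]]
print(bilinear_resize(small, 3, 3))
# [[0.0, 50.0, 100.0], [50.0, 100.0, 150.0], [100.0, 150.0, 200.0]]
```

Each new pixel is a blend of its four nearest source pixels, which is why the enlarged grid shows smooth intermediate values instead of blocky steps.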

[0043] Step 2.2, image enhancement: The screenshot is converted to grayscale to reduce noise interference, its contrast is adjusted to sharpen the distinction between text and background, and Gaussian filtering is applied to remove noise.

[0044] Step 2.3, Region Segmentation and Core Content Recognition, employs a multi-feature fusion strategy, as detailed below.

[0045] (1) Edge density features: The Canny operator is used to perform edge detection on the grayscale image after step 2.2. The Gaussian smoothing kernel size is 5×5, the standard deviation σ is set to 1.4, the gradient calculation is performed by the Sobel operator, and after non-maximum suppression, double threshold edge connection is performed with a low threshold of 50 and a high threshold of 150 to obtain a binary edge image. Morphological closing operation with a kernel size of 5×5 is performed on the binary edge image to connect adjacent edges and extract the bounding rectangles of all connected regions.

[0046] (2) Text density features: Based on the original HTML output in step 1, traverse the DOM nodes, obtain the position and size of each node in the viewport through the bounding_box method of Playwright, and calculate the text density = node text length / node area; mark the node area with text density greater than 0.2 as high text density area.

[0047] (3) Layout prior rules: The area within 10% of the top height of the page is defined as the navigation bar candidate area, the area within 15% of the bottom height (i.e., the Y coordinate is greater than 85% of the page height) is defined as the footer candidate area, and the area within 10% of the width on both sides (the X coordinate is less than 10% or greater than 90% of the page width) is defined as the sidebar or advertisement candidate area; the above areas are eliminated from the candidate rectangle.

[0048] (4) Tag semantic features: retain the node regions of <article> and <main> tags, and the node regions of tags whose class attribute contains keywords such as "content", "article", or "post".

[0049] The core content area determination logic is as follows: The intersection of the candidate edge rectangles and the high text density rectangles is taken. Areas that conform to the prior layout rules are eliminated, and areas with tag semantic features are prioritized. Finally, the rectangle with the largest area and an aspect ratio between 0.5 and 3 is retained as the core content area. The coordinates of the top-left corner (x1, y1) and the bottom-right corner (x2, y2) of this rectangle are recorded, with the origin of the coordinate system at the top-left corner of the page, and the unit is pixels.
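The fusion logic above might look like the following sketch. Several details are assumptions made for illustration only: a rectangle counts as lying in an excluded margin when its center falls there, aspect ratio is taken as width divided by height, and "prioritized" is read as "prefer candidates that overlap a semantic tag region when any exist". All function names are hypothetical.

```python
from typing import Optional

# Rectangles are (x1, y1, x2, y2) in pixels, origin at the page's top-left corner.


def is_dense(text_len: int, rect: tuple, threshold: float = 0.2) -> bool:
    """Text density = node text length / node area, compared with the 0.2 threshold."""
    area_px = (rect[2] - rect[0]) * (rect[3] - rect[1])
    return area_px > 0 and text_len / area_px > threshold


def area(r: tuple) -> float:
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def intersect(a: tuple, b: tuple) -> Optional[tuple]:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None


def in_margin(r: tuple, page_w: float, page_h: float) -> bool:
    """Layout prior: top 10%, bottom 15%, and 10% on each side (center-based, assumed)."""
    cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
    return (cy < 0.10 * page_h or cy > 0.85 * page_h
            or cx < 0.10 * page_w or cx > 0.90 * page_w)


def pick_core_area(edge_rects, dense_rects, semantic_rects, page_w, page_h):
    candidates = []
    for e in edge_rects:
        for d in dense_rects:
            r = intersect(e, d)          # edge candidates AND high text density
            if r is None or in_margin(r, page_w, page_h):
                continue
            ratio = (r[2] - r[0]) / (r[3] - r[1])
            if 0.5 <= ratio <= 3:        # aspect-ratio constraint
                candidates.append(r)
    preferred = [r for r in candidates
                 if any(intersect(r, s) for s in semantic_rects)]
    pool = preferred or candidates       # semantic regions take priority when present
    return max(pool, key=area) if pool else None


core = pick_core_area(
    edge_rects=[(0, 0, 1920, 100), (300, 150, 1500, 900)],
    dense_rects=[(320, 160, 1480, 880), (0, 0, 1920, 90)],
    semantic_rects=[(300, 150, 1500, 900)],
    page_w=1920, page_h=1080,
)
print(core)  # (320, 160, 1480, 880)
```

In this example the header strip is discarded by the layout prior, leaving the article body as the only candidate.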

[0050] Step 2.4, format conversion: Convert the preprocessed screenshot into a Base64 encoded string so that it can be transmitted to the multimodal large model via API later.

[0051] Step 3, HTML source code preprocessing.

[0052] Step 3.1, cleaning: Use BeautifulSoup to remove invalid tags, such as comment tags and empty tags; remove redundant attributes, such as the inline `style` attribute; and remove JavaScript content that does not load data.

[0053] Step 3.2, structured annotation, traverse the DOM tree, add a unique identifier to each node, including the XPath path and node level. For example, annotate the news title node as "/html/body/div[3]/div[2]/h1, level=3".

[0054] Step 3.3, standardize the format by converting all HTML source code tags to lowercase.
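The annotation in Steps 3.2 and 3.3 can be sketched with Python's standard `html.parser` instead of BeautifulSoup, which keeps the example dependency-free. Two choices here are assumptions: every step carries an explicit index (so `/html[1]/body[1]/div[2]/h1[1]` rather than the shorter form), and the level is counted as depth below `<body>`, which is one reading that reproduces the "level=3" example above. The class name `XPathAnnotator` is hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class XPathAnnotator(HTMLParser):
    """Record an absolute XPath and a level for every element in the document."""

    def __init__(self):
        super().__init__()
        self.stack = []      # [(tag, index)] from the root to the current element
        self.counts = [{}]   # per open element: child tag -> siblings seen so far
        self.nodes = []      # (xpath, level, tag)

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names, which covers Step 3.3.
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append((tag, seen[tag]))
        xpath = "".join(f"/{t}[{i}]" for t, i in self.stack)
        level = len(self.stack) - 2   # assumed: depth below <body>; <html> is -1
        self.nodes.append((xpath, level, tag))
        if tag in VOID:
            self.stack.pop()
        else:
            self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Close up to the nearest matching open element; ignore stray end tags.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.stack[pos][0] == tag:
                del self.stack[pos:]
                del self.counts[pos + 1:]
                return


annotator = XPathAnnotator()
annotator.feed("<HTML><body><div></div><div><br><H1>T</H1></div></body></HTML>")
for xpath, level, tag in annotator.nodes:
    print(xpath, f"level={level}")
```

Comments and the contents of `<script>` elements never reach `handle_starttag`, so they do not appear in the annotated node list.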

[0055] Step 4: Establish a mapping between visual elements and DOM nodes through a multimodal large model to generate the initial XPath of the target data item.

[0056] Step 4.1: Construct structured prompts, i.e., Prompts, to clarify the model task, input data, and output format, ensuring that the generated XPath accurately matches the target data items.

[0057] Model task: Based on the provided webpage screenshot (Base64 encoded) and standardized HTML source code, generate XPath for four data items: news title, author, publication time, and body text. The XPath should be able to accurately locate the corresponding elements and be compatible with similar news detail pages on the website.

[0058] Input data: Image input is the Base64 encoded screenshot string output in step 2, and HTML input is the standardized HTML string output in step 3.

[0059] Output data: The model must output a strict JSON structure, as shown in the example below:

[0060]-[0065]

{
  "Title": "/html/body/div[3]/div[2]/h1",
  "Author": "/html/body/div[3]/div[2]/div[1]/span",
  "Published Time": "/html/body/div[3]/div[2]/div[2]/time",
  "Body text": "/html/body/div[3]/div[3]/article"
}
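Because the model is asked for strict JSON, its reply can be checked mechanically before any validation against the page. The sketch below is illustrative: the helper name `parse_xpath_reply`, the tolerance for stray whitespace around slashes, and the simple "absolute path of tag steps" pattern are assumptions, not requirements stated in the patent.

```python
import json
import re

REQUIRED_KEYS = ("Title", "Author", "Published Time", "Body text")
# Assumed sanity pattern: an absolute path of tag steps, each with an optional [n].
XPATH_RE = re.compile(r"^(/[A-Za-z][\w-]*(\[\d+\])?)+$")


def parse_xpath_reply(reply: str) -> dict:
    """Parse the model's JSON reply and return a dict of normalized XPaths."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    cleaned = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not isinstance(value, str):
            raise ValueError(f"missing XPath for {key!r}")
        # Collapse whitespace around slashes, e.g. " / html / body" -> "/html/body".
        xpath = re.sub(r"\s*/\s*", "/", value.strip())
        if not XPATH_RE.match(xpath):
            raise ValueError(f"malformed XPath for {key!r}: {xpath}")
        cleaned[key] = xpath
    return cleaned


reply = json.dumps({
    "Title": " / html / body / div[3] / div[2] / h1",
    "Author": "/html/body/div[3]/div[2]/div[1]/span",
    "Published Time": "/html/body/div[3]/div[2]/div[2]/time",
    "Body text": "/html/body/div[3]/div[3]/article",
})
print(parse_xpath_reply(reply)["Title"])  # /html/body/div[3]/div[2]/h1
```

Rejecting a malformed reply at this stage is cheaper than discovering the problem during browser-based verification.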

[0066] Step 4.2: Call the large model API, assemble the results from Steps 2 and 3 into an API request body, send a POST request, and retrieve the model output. Different large models supporting multimodal parsing can be selected.

[0067] 4.3 Selection and Invocation Methods of Multimodal Large Models

[0068] The multimodal large model described in this invention employs existing commercial models with image-text joint understanding capabilities, such as Qwen-Max (or a similar multimodal large model), and is accessed via its official API. This type of model takes images and text as joint input and outputs text results. Its internal network structure is well-known in the field and is not limited in this invention. The core of this invention lies in inputting preprocessed screenshot data and structured HTML text into the model, and guiding the model to generate the initial XPath for the target data item through structured prompts.

[0069] 4.4 Sample Format and Training Parameters

[0070] If the above model needs to be fine-tuned to adapt to a specific webpage type, the following sample format can be used. Each training sample is a JSON structure containing three fields: image_base64 (the Base64-encoded string of the preprocessed screenshot), html_text (the structured HTML text after cleaning and hierarchy/XPath annotation), and xpath_label (a manually annotated or verified valid XPath JSON object, such as {"title": "/html/body/div[3]/h1"}). Training uses supervised fine-tuning with a cross-entropy loss. Typical ranges for the key training parameters are: learning rate 1×10^-5 to 5×10^-5, batch size 8 to 16, optimizer AdamW, and 10 to 20 training epochs. In actual deployment, commercial APIs can be called directly without self-training; the above fine-tuning scheme is only an optional implementation.

[0071] 4.5 Structured Prompts

[0072] Structured prompts are a key means of guiding a multimodal large model to generate the target XPath. These prompts use natural language templates and consist of three parts: a task description (e.g., "Generate XPaths for title, body, and other data items based on screenshots and HTML"), input data placeholders (used to insert Base64 strings of preprocessed screenshots and structured HTML text), and output format constraints (requiring a strict JSON format output, with the target data item name as the key and the XPath path as the value). The prompts are concatenated with the Base64 screenshot and structured HTML text, serving as the joint input to the multimodal large model, which then outputs the initial XPath. The prompts can be dynamically adjusted based on the bimodal validation results from step 5 (e.g., adding successful examples or correcting error patterns), while the parameters of the multimodal large model remain unchanged.
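A three-part prompt of this kind could be assembled as in the sketch below. The template wording, the `build_request` helper, the `failures` feedback format, and the returned dictionary layout are all hypothetical; the actual request body depends on the chosen model's API, which the patent does not specify.

```python
import base64

# Hypothetical template: task description, output constraint, optional feedback, input.
PROMPT_TEMPLATE = (
    "Task: Based on the webpage screenshot and the structured HTML below, "
    "generate an XPath for each of these data items: {items}.\n"
    "Output: strict JSON only, with the data item name as key and the XPath as value.\n"
    "{feedback}"
    "HTML:\n{html}\n"
)


def build_request(items, html_text, screenshot_png, failures=()):
    """Assemble the prompt text and the Base64 screenshot for one generation call."""
    feedback = ""
    if failures:
        # Feedback from bimodal validation; the model's parameters stay unchanged.
        lines = "\n".join(f"- {xpath}: {reason}" for xpath, reason in failures)
        feedback = "Previously rejected XPaths (do not repeat):\n" + lines + "\n"
    prompt = PROMPT_TEMPLATE.format(
        items=", ".join(items), feedback=feedback, html=html_text
    )
    return {
        "prompt": prompt,
        "image_base64": base64.b64encode(screenshot_png).decode("ascii"),
    }


request = build_request(
    ["title", "body"],
    "<html>...</html>",
    b"abc",  # stand-in bytes; a real call would pass the preprocessed PNG
    failures=[("/html/body/div[1]/h1", "node outside core content area")],
)
print(request["prompt"])
```

Adjusting only the prompt between attempts keeps the approach compatible with commercial APIs, where the model itself cannot be modified.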

[0073] Step 5: Verify the validity of XPath.

[0074] Step 5.1, code rule verification, i.e. DOM node matching: Load the preprocessed HTML source code using BeautifulSoup, locate nodes using initial XPath, verify that the nodes exist, and confirm that the node text conforms to the semantic format of the target data item. For example, the title node should contain non-empty text content, and the publication time should conform to the date format.
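The code rule check can be sketched with the standard library's `xml.etree.ElementTree`, used here as a stand-in because BeautifulSoup does not evaluate XPath expressions by itself. The sketch assumes the preprocessed markup is well-formed (XHTML-like), that date formats look like `2025-06-01` or `2025-06-01 10:00:00`, and that the helper names and the `kind` labels are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed date formats: YYYY-MM-DD, optionally followed by HH:MM or HH:MM:SS.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2}(:\d{2})?)?$")


def locate(root: ET.Element, xpath: str):
    """Resolve an absolute XPath such as /html/body/div[2]/h1 against `root`."""
    steps = xpath.strip("/").split("/")
    if steps[0].split("[")[0] != root.tag:
        return None
    rest = "/".join(steps[1:])
    # ElementTree only accepts relative paths, so the root step is stripped first.
    return root.find("./" + rest) if rest else root


def check_code_rule(markup: str, xpath: str, kind: str) -> bool:
    """Return True if the node exists and its text fits the expected format."""
    try:
        node = locate(ET.fromstring(markup), xpath)
    except (ET.ParseError, SyntaxError):
        return False                      # unparseable markup or unsupported XPath
    if node is None:
        return False
    text = "".join(node.itertext()).strip()
    if not text:
        return False                      # e.g. a title must contain non-empty text
    if kind == "published_time":
        return bool(DATE_RE.match(text))
    return True


page = ("<html><body><div>nav</div>"
        "<div><h1>Hello</h1><time>2025-06-01 10:00:00</time></div>"
        "</body></html>")
print(check_code_rule(page, "/html/body/div[2]/h1", "title"))             # True
print(check_code_rule(page, "/html/body/div[2]/time", "published_time"))  # True
print(check_code_rule(page, "/html/body/div[3]/h1", "title"))             # False
```

A production version would likely use a full XPath engine on real-world HTML, since ElementTree supports only a subset of XPath and requires well-formed input.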

[0075] Step 5.2, Visual Rule Verification, i.e., Screenshot Element Matching: First, restart a browser instance using Playwright, access the original URL from Step 1, and set the viewport size to 1920×1080, consistent with the normalized size from Step 2, ensuring coordinate system alignment. Then, obtain the corresponding ElementHandle object from the DOM node located using the initial XPath, and call the object's bounding_box() method to obtain the node's coordinates (x, y, width, height) in the viewport. Here, x and y represent the horizontal and vertical coordinates of the element's top-left corner relative to the main frame viewport, and width and height represent the node's rendered width and height in the viewport, respectively. Compare these coordinates with the core area coordinates from Step 2 to verify if the node is within the core content area. Simultaneously, use the OCR engine to crop out the (x, y, width, height) area from the current screenshot, identify the text within it, and compare its similarity with the DOM node's innerText, for example, using edit distance or Jaccard similarity. If the similarity is not less than 90% (the default threshold), the comparison is successful.
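The two visual checks, containment in the core area and OCR-to-DOM text similarity, reduce to a few lines once the bounding box and the OCR text are available. The sketch below uses normalized edit distance, one of the two measures the paragraph mentions; the whitespace normalization and all function names are assumptions, and the OCR call itself is left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """1 - distance / longer length, after collapsing whitespace (assumed)."""
    a, b = " ".join(a.split()), " ".join(b.split())
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def inside(box: tuple, core: tuple) -> bool:
    """box is (x, y, width, height); core is (x1, y1, x2, y2)."""
    x, y, w, h = box
    x1, y1, x2, y2 = core
    return x >= x1 and y >= y1 and x + w <= x2 and y + h <= y2


def check_visual_rule(box, core, ocr_text, dom_text, threshold=0.90) -> bool:
    return inside(box, core) and text_similarity(ocr_text, dom_text) >= threshold


core_area = (320, 160, 1480, 880)
title_box = (400, 200, 800, 60)
print(check_visual_rule(title_box, core_area, "Breaking Nevs Today", "Breaking News Today"))  # True
print(check_visual_rule(title_box, core_area, "Advertisement", "Breaking News Today"))        # False
```

A single OCR misread in a 19-character title still passes the 90% threshold, while text from an unrelated element does not.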

[0076] Step 5.3: If both validations pass, the result is a valid XPath, which can be directly used for data extraction. If either validation fails, the model will be automatically regenerated based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained.
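The generate-validate-retry loop can be expressed compactly. In this sketch the retry limit of 3 is read as one initial attempt plus up to three retries, which is an interpretation rather than something the text states explicitly; `generate`, `validate`, and `alarm` are hypothetical callables standing in for the model call, the bimodal check, and the alerting mechanism.

```python
from typing import Callable, Optional


def generate_valid_xpath(
    generate: Callable[[int], str],
    validate: Callable[[str], bool],
    previous: Optional[str] = None,
    max_retries: int = 3,
    alarm: Callable[[str], None] = print,
) -> Optional[str]:
    """Return the first candidate that passes validation, else the previous XPath."""
    # Assumed reading: attempt 0 is the initial try, attempts 1..max_retries are retries.
    for attempt in range(max_retries + 1):
        candidate = generate(attempt)   # same preprocessed inputs on every attempt
        if validate(candidate):         # both code and visual rules must pass
            return candidate
    alarm(f"no valid XPath after {max_retries} retries; keeping the previous one")
    return previous


replies = ["/html/body/div[1]", "/html/body/div[9]/h1", "/html/body/div[2]/h1"]
result = generate_valid_xpath(
    generate=lambda n: replies[min(n, len(replies) - 1)],
    validate=lambda xp: xp == "/html/body/div[2]/h1",
    previous="/html/body/div[3]/h1",
)
print(result)  # /html/body/div[2]/h1
```

Falling back to the previous valid XPath means a failed regeneration never leaves the collector without an extraction rule.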

[0077] Step 6: Set the monitoring cycle, such as automatically triggering data collection once a day at 2:00 AM. Perform the preprocessing steps 2 and 3 on the newly collected webpage screenshots and HTML source code respectively, and obtain the latest preprocessed webpage screenshots and HTML source code.
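The daily 2:00 AM trigger amounts to computing the next occurrence of a fixed local time. The helper below is a minimal sketch with a hypothetical name, `next_run`, and it assumes naive local datetimes; a deployment would more likely rely on cron or a scheduler library.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time = time(2, 0)) -> datetime:
    """Return the next occurrence of `at` strictly after `now` (naive local time)."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)   # today's slot has passed; use tomorrow's
    return candidate


print(next_run(datetime(2025, 6, 1, 10, 0)))  # 2025-06-02 02:00:00
print(next_run(datetime(2025, 6, 1, 1, 30)))  # 2025-06-01 02:00:00
```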

[0078] Step 6.1: Compare the new preprocessed data with the historical preprocessed data (stored locally or in a database) corresponding to the most recent valid XPath. The comparison covers the following two features:

[0079] Visual features: The average hash value of the core area is used. Specifically, the similarity is calculated based on the average hash value, and a similarity of <90% is considered a change.

[0080] Code characteristics: The hierarchy and tag type of the DOM node corresponding to the target data item, such as whether the title node is still an h1 tag and whether the hierarchy has changed.

[0081] Step 6.2: If either feature comparison fails, the webpage structure is judged to have changed. Steps 2 to 5 are then executed automatically to obtain a new valid XPath, and the structured webpage data is extracted using the new XPath.
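The change detection in Step 6 can be sketched as follows. The 8×8 hash grid, the definition of similarity as the fraction of matching hash bits, and the snapshot dictionary layout are assumptions for illustration; the patent specifies only the average hash algorithm, the 90% threshold, and the comparison of tag type and hierarchy.

```python
def average_hash(gray: list[list[float]], size: int = 8) -> list[int]:
    """Average hash: shrink to size x size by block means, then threshold at the mean."""
    h, w = len(gray), len(gray[0])
    cells = []
    for r in range(size):
        for c in range(size):
            ys = range(r * h // size, max((r + 1) * h // size, r * h // size + 1))
            xs = range(c * w // size, max((c + 1) * w // size, c * w // size + 1))
            block = [gray[y][x] for y in ys for x in xs]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if v > mean else 0 for v in cells]


def hash_similarity(a: list[int], b: list[int]) -> float:
    """Fraction of identical bits (assumed similarity measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def page_changed(old: dict, new: dict, threshold: float = 0.90) -> bool:
    """True if the visual feature or the code feature differs from the stored snapshot."""
    visual_changed = hash_similarity(old["hash"], new["hash"]) < threshold
    code_changed = (old["tag"], old["level"]) != (new["tag"], new["level"])
    return visual_changed or code_changed


# Synthetic 16x16 core areas: dark left half, bright right half, and its mirror image.
layout_a = [[0] * 8 + [255] * 8 for _ in range(16)]
layout_b = [[255] * 8 + [0] * 8 for _ in range(16)]

stored = {"hash": average_hash(layout_a), "tag": "h1", "level": 3}
same = {"hash": average_hash(layout_a), "tag": "h1", "level": 3}
retagged = {"hash": average_hash(layout_a), "tag": "h2", "level": 3}
redesigned = {"hash": average_hash(layout_b), "tag": "h1", "level": 3}

print(page_changed(stored, same))        # False
print(page_changed(stored, retagged))    # True
print(page_changed(stored, redesigned))  # True
```

Checking both features catches the two failure modes separately: a visual redesign that keeps the same tags, and a markup change that leaves the rendered page looking the same.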

[0082] This invention also proposes a multimodal web page data structure extraction system based on visual features and large models, used to implement the aforementioned multimodal web page data structure extraction method based on visual features and large models, including:

[0083] The data acquisition module is used to access the target webpage URL through browser automation tools to obtain webpage screenshots and HTML source code;

[0084] The screenshot preprocessing module is used to preprocess webpage screenshots, including size normalization, image enhancement, region segmentation and format conversion, and outputs the preprocessed screenshot data and the coordinates of the core content area;

[0085] The HTML preprocessing module is used to preprocess HTML source code, including cleaning invalid tags and redundant attributes, adding XPath paths and node hierarchy identifiers to each DOM node, formatting tags and attributes, and outputting structured HTML text.

[0086] The XPath generation module is used to input preprocessed screenshot data and structured HTML text into the multimodal large model, and guide the multimodal large model to generate the initial XPath of the target data item through structured prompt words;

[0087] The dual-modal verification module is used to perform code rule verification and visual rule verification on the initial XPath. If both verifications pass, it is confirmed as a valid XPath. If either verification fails, the XPath generation module is retried based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained.

[0088] The periodic monitoring module is used to set up periodic monitoring, periodically re-collecting webpage screenshots and HTML source code. The re-collected data triggers the screenshot preprocessing module and the HTML preprocessing module respectively to obtain new preprocessed data. The new preprocessed data is compared with the historical preprocessed data corresponding to the previous valid XPath by visual features and code features. If any feature is determined to have changed, the screenshot preprocessing module, the HTML preprocessing module, the XPath generation module, and the bimodal verification module are automatically triggered again to generate new valid XPath.

[0089] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the aforementioned method for structured extraction of multimodal web page data based on visual features and large models.

[0090] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned method for structured extraction of multimodal web page data based on visual features and large models.

[0091] In summary, this invention utilizes a dual-modal input of webpage screenshots and HTML code, leveraging a large model for visual and structural alignment to generate high-precision XPath. Through an end-to-end structured extraction process, it achieves a fully automated workflow from webpage loading, parsing, extraction to output, eliminating the need for manual rule writing. It supports local large model deployment with Playwright services, enabling low-latency, high-concurrency real-time extraction. This method is suitable for batch collection of structured data from various types of websites, including news websites, social media, forums, and blogs. It is particularly suitable for modern web pages with dynamic rendering, varied structures, and heterogeneous multi-website architecture, effectively addressing the technical pain points of high costs associated with manually maintaining XPath and the inability to adapt to page structure changes in real time.

[0092] Example

[0093] To verify the effectiveness of the present invention, the following experiment was conducted.

[0094] Step 1: Data Collection

[0095] Launch a headless Chrome browser using Playwright, set the viewport size to 1920×1080 pixels, and set a page load timeout of 30 seconds. Access the target product details page URL (e.g., https://www.example.com/product/12345.html), wait for the network to become idle, and then capture a full screenshot of the webpage (PNG format, 1920×1080 resolution) and the corresponding HTML source code. At the same time, save the original URL and the collection timestamp (e.g., 2025-06-01 10:00:00) to the local database.
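
A minimal sketch of this collection step using Playwright's synchronous Python API is given below. The function names `collect_page` and `make_record` and the record layout are hypothetical, the bundled Chromium build is used in place of Chrome, and writing the record to a database is left out. Playwright is imported inside the function so that the record helper can be used without it.

```python
from datetime import datetime


def make_record(url: str, screenshot_png: bytes, html: str, collected_at: datetime) -> dict:
    """Bundle one collection result with its URL and timestamp (hypothetical layout)."""
    return {
        "url": url,
        "collected_at": collected_at.strftime("%Y-%m-%d %H:%M:%S"),
        "screenshot_png": screenshot_png,
        "html": html,
    }


def collect_page(url: str, timeout_ms: int = 30_000) -> dict:
    """Load the page headlessly and return a screenshot plus the HTML source."""
    # Lazy import: requires the `playwright` package and an installed browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page(viewport={"width": 1920, "height": 1080})
            # Wait until the network is idle, with a 30-second load timeout.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            screenshot_png = page.screenshot(type="png")  # viewport-sized PNG
            html = page.content()
        finally:
            browser.close()
    return make_record(url, screenshot_png, html, datetime.now())
```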

[0096] Step 2: Screenshot Preprocessing

[0097] The screenshot resolution already conforms to the normalization standard and requires no scaling. After grayscale conversion, histogram equalization is used to enhance contrast, followed by 5×5 Gaussian filtering for noise reduction. A multi-feature fusion strategy is employed for region segmentation: Canny edge detection (low threshold 50, high threshold 150) extracts edge connected components. Combined with DOM node regions with text density greater than 0.2, the top navigation bar (Y<108) and bottom recommendation block (Y>918) are removed, resulting in the core content area coordinates (x1=180, y1=200, x2=1760, y2=980). The screenshot of this region is converted to a Base64 string with a length of approximately 120KB.
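
The final filtering and merging part of the region segmentation can be illustrated as follows. This sketch assumes that edge detection and DOM analysis have already produced candidate boxes with a text density value; the `Candidate` type, the function name, and the exact filtering rule (drop boxes lying entirely above Y=108 or starting below Y=918, then take the bounding box of the rest) are assumptions made for illustration, not the patented procedure.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    """A candidate block in screenshot coordinates (hypothetical structure)."""
    x1: int
    y1: int
    x2: int
    y2: int
    text_density: float


def core_content_area(
    candidates: List[Candidate],
    nav_bottom: int = 108,
    footer_top: int = 918,
    min_density: float = 0.2,
) -> Optional[Tuple[int, int, int, int]]:
    """Merge the surviving candidates into one (x1, y1, x2, y2) box."""
    kept = [
        c for c in candidates
        if c.text_density > min_density  # text-density feature
        and c.y2 >= nav_bottom           # not entirely inside the top navigation band
        and c.y1 <= footer_top           # does not start inside the bottom band
    ]
    if not kept:
        return None
    return (
        min(c.x1 for c in kept),
        min(c.y1 for c in kept),
        max(c.x2 for c in kept),
        max(c.y2 for c in kept),
    )
```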

[0098] Step 3: HTML Preprocessing

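[0099] Clean the HTML with BeautifulSoup: remove all <script> tags, comment tags, empty tags, and inline style attributes. Traverse the DOM tree and add an absolute XPath path and a node level to each node. For example, the product title node is annotated as /html/body/div[2]/div[3]/h1, level=3. Convert all tags to lowercase and convert relative paths to absolute paths. Output the structured HTML text, whose size is about 65% of the original HTML.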
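
The cleaning and annotation described here can be sketched with the Python standard library. Note the swapped-in technique: this sketch uses `xml.etree.ElementTree` on a well-formed snippet instead of BeautifulSoup, the function names `clean` and `annotate` are hypothetical, and "level" is taken to mean depth from the root (root = 0), which is an assumption about how the node level is counted. As a simplification, every childless, text-free element is treated as an empty tag, and tail text after a removed element is dropped.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def clean(node: ET.Element) -> None:
    """Remove <script> elements, empty elements, and inline style attributes in place."""
    node.attrib.pop("style", None)
    for child in list(node):
        clean(child)
        is_empty = len(child) == 0 and not (child.text or "").strip()
        if child.tag.lower() == "script" or is_empty:
            node.remove(child)


def annotate(root: ET.Element) -> List[Tuple[str, int]]:
    """Return (absolute XPath, level) for every element in document order."""
    result: List[Tuple[str, int]] = []

    def walk(node: ET.Element, path: str, level: int) -> None:
        result.append((path, level))
        totals: dict = {}
        for child in node:
            tag = child.tag.lower()
            totals[tag] = totals.get(tag, 0) + 1
        seen: dict = {}
        for child in node:
            tag = child.tag.lower()
            seen[tag] = seen.get(tag, 0) + 1
            # Add a positional index only when same-tag siblings exist.
            step = f"{tag}[{seen[tag]}]" if totals[tag] > 1 else tag
            walk(child, f"{path}/{step}", level + 1)

    walk(root, "/" + root.tag.lower(), 0)
    return result
```

ElementTree discards comments while parsing by default, so no separate comment-removal step appears in this sketch.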

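[0100] Step 4: Generating the Initial XPath with the Large Model

[0101] Call the multimodal large model API and, following the prompt, input the Base64 screenshot and the structured HTML. The prompt requires the model to generate XPaths for the following data items: product name, price, sales volume, and product description. The model returns the following JSON:

[0102] {

[0103] "product name": "/html/body/div[2]/div[3]/h1",

[0104] "price": "/html/body/div[2]/div[3]/div[@class='price']/span",

[0105] "sales volume": "/html/body/div[2]/div[3]/div[@class='sales']",

[0106] "product description": "/html/body/div[2]/div[4]/div[@id='description']/p"

[0107] }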

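[0108] Step 5: XPath Verification

[0109] Code rule verification: execute the above XPaths against the preprocessed HTML. Results: the product name node exists and its text is non-empty ("智能无线耳机", i.e., "Smart Wireless Earphones"); the price node text contains digits and a currency symbol ("¥299"); the sales volume node text conforms to a numeric format ("1.2万件", i.e., 12,000 units); and the product description node paragraph is longer than 50 characters. Verification passes.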
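
A hedged sketch of code rule verification for the four data items follows. It uses `xml.etree.ElementTree`, which supports only a subset of XPath and no absolute paths, so the helper `locate` rewrites `/html/...` into a path relative to the root. The item names are English renderings of the four data items, and the regular expressions in `RULES` are illustrative assumptions about what "contains digits and a currency symbol" or "conforms to a numeric format" might mean in practice.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical per-item format rules.
RULES = {
    "product name": lambda t: len(t) > 0,
    "price": lambda t: re.search(r"[¥$€]\s*\d", t) is not None,
    # A number optionally followed by a short unit, e.g. "1.2万件".
    "sales volume": lambda t: re.fullmatch(r"\d+(\.\d+)?\s*\D{0,3}", t) is not None,
    "product description": lambda t: len(t) > 50,
}


def locate(root: ET.Element, xpath: str) -> Optional[ET.Element]:
    """Resolve an absolute XPath such as /html/body/div[2] against the root element."""
    prefix = "/" + root.tag
    if xpath == prefix:
        return root
    if not xpath.startswith(prefix + "/"):
        return None
    return root.find("." + xpath[len(prefix):])


def code_rule_check(root: ET.Element, xpaths: Dict[str, str]) -> Dict[str, bool]:
    """Check that every node exists and that its text matches the item's rule."""
    results = {}
    for name, xpath in xpaths.items():
        node = locate(root, xpath)
        text = "".join(node.itertext()).strip() if node is not None else ""
        rule = RULES.get(name, lambda t: len(t) > 0)
        results[name] = node is not None and rule(text)
    return results
```

The initial XPath set passes code rule verification only if every value in the returned dictionary is true.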

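[0110] Visual rule verification: restart the browser and set the viewport to 1920×1080, locate each element handle by its XPath, and call bounding_box() to obtain its coordinates. The product name node has coordinates (x=180, y=210, width=600, height=36), and its center point (480, 228) lies within the core content area (180, 200, 1760, 980). OCR recognizes the text in this region as "智能无线耳机", which is consistent with the DOM text. The other three items pass verification in the same way.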
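
The geometric and textual parts of visual rule verification can be reproduced with a few lines of standard-library Python. The function names are hypothetical, the OCR result is passed in as a plain string, and `difflib.SequenceMatcher` is used as an assumed similarity measure; the 90% threshold follows claim 4.

```python
from difflib import SequenceMatcher
from typing import Tuple

Box = Tuple[float, float, float, float]     # (x, y, width, height)
Region = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def inside(point: Tuple[float, float], region: Region) -> bool:
    """True when the point lies within the region (borders included)."""
    px, py = point
    x1, y1, x2, y2 = region
    return x1 <= px <= x2 and y1 <= py <= y2


def text_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between OCR text and DOM text."""
    return SequenceMatcher(None, a, b).ratio()


def visual_rule_check(box: Box, region: Region, ocr_text: str, dom_text: str,
                      min_similarity: float = 0.9) -> bool:
    """Pass when the node center is in the core area and the texts agree."""
    return (inside(center(box), region)
            and text_similarity(ocr_text, dom_text) >= min_similarity)
```

With the values from the example, `center((180, 210, 600, 36))` gives (480.0, 228.0), which falls inside (180, 200, 1760, 980).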

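[0111] Step 6: Periodic Monitoring and Updating

[0112] The monitoring task is set to run automatically at 2:00 AM every day. In the first monitoring run, the newly collected screenshot is compared with the historical data, and the visual hash similarity is greater than 90%. In the code feature comparison, the tag of the product name XPath is still h1 and its level is unchanged. The original XPath is therefore retained. Suppose that three months later the website is redesigned and the tag type of the price field changes from the original tag to a new tag, causing the code feature comparison to fail. The system then automatically triggers Steps 2 to 5 and generates a new valid XPath, after which collection returns to normal.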
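
The change detection in this step can be sketched as an average-hash comparison plus a tag and level comparison. The sketch assumes the core content area has already been reduced to a small grayscale grid (for example 8×8), represents the code feature of a data item as a hypothetical `(tag, level)` tuple, and uses the 90% similarity threshold named in claim 6; the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def average_hash(gray: Sequence[Sequence[int]]) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the mean, else 0."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hash_similarity(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Fraction of matching bits (1 minus the normalized Hamming distance)."""
    if len(h1) != len(h2):
        raise ValueError("hashes must have the same length")
    differing = sum(a != b for a, b in zip(h1, h2))
    return 1 - differing / len(h1)


def needs_regeneration(
    old_gray: Sequence[Sequence[int]],
    new_gray: Sequence[Sequence[int]],
    old_node: Tuple[str, int],
    new_node: Tuple[str, int],
    threshold: float = 0.9,
) -> bool:
    """True when either the visual feature or the code feature has changed."""
    visual_changed = hash_similarity(average_hash(old_gray), average_hash(new_gray)) < threshold
    code_changed = old_node != new_node  # (tag type, level) of the target node
    return visual_changed or code_changed
```

A daily scheduler would call `needs_regeneration` for each monitored data item and re-run Steps 2 to 5 only when it returns true.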

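[0113] The technical features of the above embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

[0114] The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.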

Claims

1. A method for structured extraction of multimodal web page data based on visual features and large models, characterized in that it includes the following steps: Step 1: access the target webpage URL using a browser automation tool to obtain a screenshot of the webpage and the HTML source code; Step 2: preprocess the webpage screenshot, including size normalization, image enhancement, region segmentation and format conversion, and output the preprocessed screenshot data and the coordinates of the core content area; Step 3: preprocess the HTML source code, including cleaning up invalid tags and redundant attributes, adding XPath paths and node hierarchy identifiers to each DOM node, formatting tags and attributes, and outputting structured HTML text; Step 4: input the preprocessed screenshot data and structured HTML text into the multimodal large model, and guide the multimodal large model to generate the initial XPath of the target data item through structured prompt words; Step 5: perform bimodal validation on the initial XPath, including code rule validation and visual rule validation; if both validations pass, the initial XPath is confirmed as a valid XPath; if either validation fails, re-execute Step 4 based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained; Step 6: set up periodic monitoring to periodically re-collect webpage screenshots and HTML source code; perform the preprocessing of Steps 2 and 3 on the re-collected data to obtain new preprocessed screenshot data, coordinates of the core content area, and new structured HTML text; compare the new preprocessed data with the historical preprocessed data corresponding to the previous valid XPath using visual features and code features; if any feature is determined to have changed, automatically repeat Steps 2 to 5 to generate a new valid XPath, thereby achieving adaptive updating of the XPath.

2. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the size normalization in step 2 specifically involves scaling the screenshots to a standard size of 1920×1080; if the original resolution is different, a bilinear interpolation algorithm is used to ensure that the image is not distorted; and the region segmentation adopts a multi-feature fusion strategy, including edge density features, text density features, layout prior rules, and tag semantic features, to identify the core content area of the webpage and exclude the navigation bar and advertising areas.

3. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the cleaning up of invalid tags and redundant attributes in step 3 includes deleting comment tags, empty tags, and JavaScript content not used for data loading, and deleting inline style attributes.

4. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the code rule verification in step 5 specifically involves: using the initial XPath to locate nodes in the preprocessed HTML source code, verifying the existence of the nodes, and verifying whether the node text conforms to the semantic format of the target data item; and the visual rule verification specifically involves: setting the browser viewport to 1920×1080, reloading the target webpage URL, obtaining the screen coordinates of the target node through the initial XPath, verifying whether the coordinates fall within the core content area coordinates obtained in step 2, and recognizing the text in the coordinate area through OCR and comparing its similarity with the DOM node text, wherein the verification is considered passed if the similarity is not less than 90%.

5. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the preset retry limit in step 5 is 3 times; when the limit is exceeded, the system issues an alarm and retains the previous valid XPath.

6. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the visual feature comparison in step 6 uses the average hash algorithm to calculate the hash value of the core content area obtained in step 2, and a similarity of less than 90% is judged to be a significant change; and the code feature comparison includes comparing the hierarchy and tag type of the DOM node corresponding to the target data item.

7. The method for structured extraction of multimodal web page data based on visual features and large models according to claim 1, characterized in that the periodic monitoring in step 6 is triggered at 2:00 AM every day.

8. A multimodal webpage data structure extraction system based on visual features and large models, characterized in that the system is used to implement the method for structured extraction of multimodal web page data based on visual features and large models according to any one of claims 1-7, and includes: a data acquisition module, used to access the target webpage URL through browser automation tools to obtain webpage screenshots and HTML source code; a screenshot preprocessing module, used to preprocess webpage screenshots, including size normalization, image enhancement, region segmentation and format conversion, and to output the preprocessed screenshot data and the coordinates of the core content area; an HTML preprocessing module, used to preprocess HTML source code, including cleaning invalid tags and redundant attributes, adding XPath paths and node hierarchy identifiers to each DOM node, formatting tags and attributes, and outputting structured HTML text; an XPath generation module, used to input preprocessed screenshot data and structured HTML text into the multimodal large model, and to guide the multimodal large model to generate the initial XPath of the target data item through structured prompt words; a dual-modal verification module, used to perform code rule verification and visual rule verification on the initial XPath, wherein if both verifications pass, the initial XPath is confirmed as a valid XPath, and if either verification fails, the XPath generation module is retried based on the same preprocessed screenshot data and structured HTML text until the preset retry limit is reached or a valid XPath is obtained; and a periodic monitoring module, used to set up periodic monitoring and periodically re-collect webpage screenshots and HTML source code, wherein the re-collected data triggers the screenshot preprocessing module and the HTML preprocessing module respectively to obtain new preprocessed data, the new preprocessed data is compared, by visual features and code features, with the historical preprocessed data corresponding to the previous valid XPath, and if any feature is determined to have changed, the screenshot preprocessing module, the HTML preprocessing module, the XPath generation module, and the dual-modal verification module are automatically triggered again to generate a new valid XPath.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the multimodal web page data structure extraction method based on visual features and large models as described in any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, wherein when executed by a processor, the computer program implements the multimodal web page data structure extraction method based on visual features and large models as described in any one of claims 1-7.