Webpage automation method and device, computer device and readable storage medium
By acquiring the structural and visual features of web page elements, performing semantic encoding and DSL instruction parsing, the problem of fragile positioning and low collaborative efficiency of web page automation technology in complex environments is solved, achieving high robustness and efficient human-machine collaboration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TAIXIN TIANCHENG TECHNOLOGY CO LTD
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing web automation technologies suffer from problems such as semantic fragmentation, fragile positioning, limited perception, unreliable decision-making, difficulty in accumulating and reusing business experience, and inefficient human-machine collaboration when faced with the iteration of web application technology stacks and the increasing complexity of business in professional fields.
By acquiring the structural and visual features of web page elements, performing semantic encoding, constructing semantic anchors, receiving task description text and parsing DSL instructions, calculating matching scores, and executing operations or requesting user intervention, a highly robust localization of web page elements is achieved.
It significantly improves the robustness of web page automation technology in terms of positioning, unifies business semantics and execution operations, and improves the efficiency of human-machine collaboration.
Figure CN122242447A_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of web page automation technology, and in particular to a web page automation method, apparatus, computer device, and readable storage medium.
Background Technology
[0002] With the iteration of Web application technology stacks (SPA, Canvas rendering) and the increasing complexity of business in professional fields (legal document monitoring, cross-border patent data crawling, financial information collection, etc.), the defects of web automation technology have become apparent: semantic fragmentation, fragile positioning, limited perception, unreliable decision-making, difficulty in accumulating and reusing business experience, and inefficient human-machine collaboration.
Summary of the Invention
[0003] The embodiments of the present invention provide a web page automation method, apparatus, computer device, and readable storage medium, aiming to solve the technical problem that existing web page automation methods, which rely solely on the document object model, are difficult to adapt to changes in the front end.
[0004] In a first aspect, embodiments of the present invention provide a webpage automation method, comprising:
Obtaining the element features of all web page elements in the current web page, where the element features include structural features and visual features, the structural features are the A11Y Tree path and DOM path of the web page element, and the visual features include at least the spatial position, color, shape and icon features of the web page element;
Obtaining the element tags of each web page element, and performing semantic encoding on the element tags to obtain a business meaning vector;
Constructing corresponding semantic anchors based on the structural features, visual features, and business meaning vectors corresponding to each web page element, where each semantic anchor contains the element ID and anchor features of the corresponding web page element, and the anchor features include anchor structural features, anchor visual features, and anchor business features;
Receiving the description text of the current task, and parsing and verifying the description text to obtain at least one DSL instruction, where the DSL instruction carries the current intent element and the corresponding current semantic features, and the current semantic features include the current structural features, current visual features and current business features of the current intent element;
Parsing the DSL instruction based on the current semantic features to obtain the target element;
If the target element exists, calculating the matching score between the current semantic features and the anchor features corresponding to the target element;
If the matching score is greater than or equal to a preset upper threshold, executing the DSL instruction according to the anchor features of the current intent element;
If the matching score is greater than or equal to a lower threshold and less than the upper threshold, outputting the possible candidate elements of the current intent element in the web page element tree;
If the matching score is less than the lower threshold, initiating a user intervention request.
[0005] In a second aspect, embodiments of the present invention provide a webpage automation device, comprising:
A feature acquisition module, used to acquire the element features of all web page elements in the current web page, where the element features include structural features and visual features, the structural features are the A11Y Tree path and DOM path of the web page element, and the visual features include at least the spatial position, color, shape and icon features of the web page element;
An encoding module, used to obtain the element tags of each web page element and perform semantic encoding on the element tags to obtain a business meaning vector;
An anchor point construction module, used to construct corresponding semantic anchors based on the structural features, visual features, and business meaning vectors of each web page element, where each semantic anchor contains the element ID and anchor features of the corresponding web page element, and the anchor features include anchor structural features, anchor visual features, and anchor business features;
A task parsing module, used to receive the description text of the current task, and to parse and verify the description text to obtain at least one DSL instruction, where the DSL instruction carries the current intent element and the corresponding current semantic features, and the current semantic features include the current structural features, current visual features and current business features of the current intent element;
A monitoring module, used to parse the DSL instruction based on the current semantic features to obtain the target element;
A calculation module, used to calculate the matching score between the current semantic features and the anchor features corresponding to the target element;
A decision module, used to execute the DSL instruction based on the anchor features of the current intent element if the matching score is greater than or equal to a preset upper threshold; to output the possible candidate elements of the current intent element in the web page element tree if the matching score is greater than or equal to a lower threshold and less than the upper threshold; and to initiate a user intervention request if the matching score is less than the lower threshold.
[0006] Thirdly, embodiments of the present invention provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the web page automation method described in the first aspect.
[0007] Fourthly, embodiments of the present invention provide a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the web page automation method described in the first aspect.
[0008] This invention provides a webpage automation method, apparatus, computer device, and readable storage medium. The method acquires the element features of all webpage elements; acquires the element tags of each webpage element and performs semantic encoding on the element tags to obtain a business meaning vector; constructs corresponding semantic anchors based on the structured features, visual features, and business meaning vectors corresponding to each webpage element; receives the description text of the current task, performs task parsing and verification on the description text to obtain at least one DSL instruction; parses the DSL instruction according to the current semantic features to obtain the target element; calculates the matching score between the current semantic features and the anchor features corresponding to the target element; if the matching score is greater than or equal to a preset upper threshold, the DSL instruction is executed according to the anchor features of the current intent element; if the matching score is greater than or equal to a lower threshold and less than the upper threshold, the possible candidate elements of the current intent element in the webpage element tree are output; if the matching score is less than the lower threshold, a user intervention request is initiated. This method constructs corresponding semantic anchors using the structured features, visual features, and business meaning vectors corresponding to each webpage element, using the semantic anchors as a positioning dimension to unify business semantics and execution operations, significantly improving positioning robustness.
Attached Figure Description
[0009] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0010] Figure 1 is a flowchart illustrating an embodiment of the web page automation method provided by the present invention.
Figure 2 is a schematic block diagram of a web page automation device provided in an embodiment of the present invention.
Detailed Implementation
[0011] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0012] It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
[0013] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0014] It should also be further understood that the term "and/or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0015] Refer to Figure 1, which is a flowchart illustrating a webpage automation method provided in an embodiment of the present invention. The method includes steps S110 to S173.
[0016] Step S110: Obtain the element features of all web page elements in the current web page;
Step S120: Obtain the element tags of each web page element, and perform semantic encoding on the element tags to obtain the business meaning vector;
Step S130: Construct corresponding semantic anchors based on the structural features, visual features, and business meaning vectors corresponding to each web page element.
In this embodiment, a domain-specific language (DSL) is constructed for the target domain, serving as a unified semantic carrier for natural language and execution operations. This is achieved by building an atomic operation instruction set, semantic anchors (semantic anchor descriptors), and DSL verification rules. The atomic operation instruction set covers all web page interaction scenarios and is compatible with the A11Y operation specification, for example:
FILL_FORM(field="post date", value="2023-10-27", a11y_label="post date input box")
CLICK_ELEMENT(semantic anchor="search button", a11y_role="button")
EXTRACT_DATA(target="announcement list", range="current page", a11y_tree_path="body>main>section[role=list]")
WAIT_FOR_CONDITION(condition="loading animation disappeared", a11y_state="busy:false")
In addition, the element features of all web page elements in the current web page are collected. The element features include structural features and visual features, and a business meaning vector is generated for each element:
(1) Structural features: A11Y Tree (accessibility tree) paths (with node identifiers such as aria-label, role, data-testid) are collected as the first choice, and DOM (Document Object Model, XPath / CSS) paths as the second choice, together forming the basic positioning dimension;
(2) Visual features: a CV model (CLIP / YOLO) is called to extract the feature vectors of web page elements from the screenshot, and the spatial position, color, shape, icon features, etc. are recorded;
(3) Business meaning vector: an LLM is called to perform semantic encoding on the A11Y tags and context text of web page elements to generate business meaning vectors.
Semantic anchors are stored in structured form as semantic anchor ID--feature type--feature vector, and the semantic anchors of all web page elements are stored in the anchor library. Each semantic anchor contains the element ID and anchor features of the corresponding web page element. Anchor features include anchor structural features, anchor visual features, and anchor business features: the anchor structural features are the structural features of the web page element, the anchor visual features are its visual features, and the anchor business features are its business meaning vector.
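The anchor layout described above (anchor ID, then feature type, then feature value) can be pictured with a small in-memory store. This is only a sketch under stated assumptions: the class names `SemanticAnchor` and `AnchorLibrary`, the field names, and the placeholder vectors are hypothetical and do not come from the patent, which does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SemanticAnchor:
    """One anchor: an element ID plus the three anchor feature groups."""
    element_id: str
    # Structural features: A11Y tree path preferred, DOM path as second choice.
    a11y_path: str
    dom_path: str
    # Visual features: an embedding plus simple recorded attributes.
    visual_vector: List[float]
    visual_attrs: Dict[str, str] = field(default_factory=dict)
    # Business feature: embedding of the A11Y label and context text.
    business_vector: List[float] = field(default_factory=list)


class AnchorLibrary:
    """Hypothetical anchor library keyed by element ID."""

    def __init__(self) -> None:
        self._anchors: Dict[str, SemanticAnchor] = {}

    def add(self, anchor: SemanticAnchor) -> None:
        self._anchors[anchor.element_id] = anchor

    def get(self, element_id: str) -> Optional[SemanticAnchor]:
        return self._anchors.get(element_id)

    def features(self, element_id: str) -> Dict[str, object]:
        """Return the anchor as feature type -> feature value."""
        a = self._anchors[element_id]
        return {
            "structural": {"a11y_path": a.a11y_path, "dom_path": a.dom_path},
            "visual": {"vector": a.visual_vector, "attrs": a.visual_attrs},
            "business": a.business_vector,
        }


library = AnchorLibrary()
library.add(SemanticAnchor(
    element_id="search-button",
    a11y_path="body>main>form>button[role=button]",
    dom_path="/html/body/main/form/button[1]",
    visual_vector=[0.12, 0.80, 0.05],      # placeholder values, not model output
    visual_attrs={"color": "#165DFF", "shape": "rectangle"},
    business_vector=[0.70, 0.10, 0.20],    # placeholder values, not model output
))
```

In a real system the two vectors would come from the CV model and the LLM encoder mentioned in the text; here they are fixed numbers so the sketch stays self-contained.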
[0017] Step S140: Receive the description text of the current task, parse and verify the description text to obtain at least one DSL instruction; In this embodiment, the system receives a description text of a business task input by the user in natural language, performs intent recognition, domain matching, and key information extraction; it then uses a planning agent to break down complex tasks into an ordered sequence of subtasks, determining execution dependencies and preconditions; and generates an execution flowchart to ensure the subtasks are ordered correctly and without logical conflicts. The LLM is then invoked to generate executable DSL instructions one by one based on the subtasks, DSL syntax rules, and semantic anchors. Each DSL instruction carries the current intent element and its corresponding current semantic features, including the current structural features, current visual features, and current business features of the current intent element.
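The "parse and verify" part of step S140 can be illustrated with a small checker for the call-style instructions shown in the embodiment. This is a minimal sketch, not the patent's parser: the function name `parse_instruction` and the required-argument table are assumptions, and only the four operation names come from the description.

```python
import re
from typing import Dict, Tuple

# Operation names are from the description; which arguments are mandatory
# is an assumption made for this sketch.
REQUIRED_KEYS = {
    "FILL_FORM": {"field", "value"},
    "CLICK_ELEMENT": {"semantic anchor"},
    "EXTRACT_DATA": {"target"},
    "WAIT_FOR_CONDITION": {"condition"},
}

_CALL = re.compile(r'^\s*([A-Z_]+)\((.*)\)\s*$', re.S)
# Keys may contain spaces (e.g. `semantic anchor`); values are double-quoted.
_ARG = re.compile(r'([A-Za-z_][A-Za-z0-9_ ]*?)\s*=\s*"([^"]*)"')


def parse_instruction(text: str) -> Tuple[str, Dict[str, str]]:
    """Parse one DSL call and verify operation name and required arguments."""
    m = _CALL.match(text)
    if m is None:
        raise ValueError(f"not a DSL call: {text!r}")
    op, arg_text = m.group(1), m.group(2)
    if op not in REQUIRED_KEYS:
        raise ValueError(f"unknown operation: {op}")
    args = {key.strip(): value for key, value in _ARG.findall(arg_text)}
    missing = REQUIRED_KEYS[op] - set(args)
    if missing:
        raise ValueError(f"{op} is missing {sorted(missing)}")
    return op, args
```

For example, `parse_instruction('CLICK_ELEMENT(semantic anchor="search button", a11y_role="button")')` yields the operation name and a dictionary of its two arguments, while an instruction without a `semantic anchor` argument is rejected before execution.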
[0018] In one embodiment, the user's description text regarding "downloading PDF attachments" is obtained, and the parsed structured text is as follows:
{
  "Business Intent": "Download PDF Attachment",
  "a11y characteristics": { "role": "link", "label": "contains the text 'download'" },
  "Visual Features": "Contains a download icon (SVG / PNG), RGB color value #165DFF, located on the right side of the page within 10%-20% of the page width, with a height of 80%-120% of the page line height.",
  "Context": "Immediately following the PDF filename text node, located within the same data row",
  "Functional Features": "Clicking triggers a PDF file download; the response type is application/pdf"
}
The system may find two matching nodes:
Node 1: Download the PDF attachment of the authorization notice
{
  "Business Intent": "Download PDF Attachment",
  "a11y characteristics": { "role": "link", "label": "Download Authorization Notice.pdf" },
  "Visual Features": "Contains a download icon (SVG), RGB color value #165DFF, located at 15% of the page width on the right side, with a height equal to 100% of the page line height.",
  "Context": "Immediately following the 'Authorization Announcement.pdf' text node, located within the same data row",
  "Functional Features": "Clicking triggers a file download; the response format is application/pdf"
}
Node 2: Download the PDF attachment of the instruction manual
{
  "Business Intent": "Download PDF Attachment",
  "a11y characteristics": { "role": "link", "label": "Download Instructions.pdf", "style": "margin-top: 2px;" },
  "Visual Features": "Contains a download icon (PNG), RGB color value #165DFF, located at 16% of the page width on the right side, with a height of 110% of the page line height.",
  "Context": "Immediately following the 'Instruction Manual.pdf' text node, located within the same data row",
  "Functional Features": "Clicking triggers a file download; the response format is application/pdf"
}
Step S150: Parse the DSL instruction according to the current semantic features to obtain the target element;
Step S160: Calculate the matching score between the current semantic features and the anchor features corresponding to the target element;
Step S171: If the matching score is greater than or equal to the preset upper threshold, execute the DSL instruction according to the anchor features of the target element;
Step S172: If the matching score is greater than or equal to the lower threshold and less than the upper threshold, output the possible candidate elements of the current intent element in the web page element tree;
Step S173: If the matching score is less than the lower threshold, initiate a user intervention request.
[0019] In this embodiment, the DSL instruction is first executed according to the current semantic features described in the description text. If the execution succeeds, the current semantic features of the current intent element are recorded and the confidence level is marked (default is 1). If it fails, the matching score S between the current semantic features and the anchor features corresponding to the current intent element is calculated according to the following formula (symbol names reconstructed from the definitions that follow):
$$S = \alpha \cdot \mathrm{Sim}_s(F_s, A_s) + \beta \cdot \mathrm{Sim}_v(F_v, A_v) + \gamma \cdot \mathrm{Sim}_b(F_b, A_b)$$
In the formula, $\alpha$, $\beta$ and $\gamma$ are the weights, which can be dynamically adjusted according to the actual situation; $\mathrm{Sim}_s$ is the similarity of structural features; $\mathrm{Sim}_v$ is the similarity of visual features; $\mathrm{Sim}_b$ is the similarity of business meaning vectors; $F_s$, $F_v$ and $F_b$ are the current structural, visual and business features; $A_s$, $A_v$ and $A_b$ are the anchor structural, visual and business features.
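The weighted combination of structural, visual and business similarity described above can be sketched in a few lines. The patent does not say which similarity measures are used, so the choices here are assumptions: cosine similarity for the two vectors, a `difflib` string ratio for the structural paths, and illustrative default weights of 0.4 / 0.3 / 0.3.

```python
import math
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def path_similarity(a: str, b: str) -> float:
    """Assumed structural similarity: character-level ratio of two paths."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(
    cur_path: str, anchor_path: str,
    cur_visual: Sequence[float], anchor_visual: Sequence[float],
    cur_business: Sequence[float], anchor_business: Sequence[float],
    weights: Tuple[float, float, float] = (0.4, 0.3, 0.3),  # illustrative only
) -> float:
    """Weighted sum of structural, visual and business similarity."""
    w_s, w_v, w_b = weights
    return (
        w_s * path_similarity(cur_path, anchor_path)
        + w_v * cosine(cur_visual, anchor_visual)
        + w_b * cosine(cur_business, anchor_business)
    )
```

With weights that sum to 1 and non-negative vectors, the score stays in the range 0 to 1, which makes it directly comparable with thresholds such as 0.6 and 0.9.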
[0020] In one embodiment, if the matching score is greater than or equal to a preset upper threshold (e.g., S≥0.9), then the DSL instruction is executed according to the anchor features of the current intent element; if the matching score is greater than or equal to the lower threshold and less than the upper threshold (e.g., 0.6≤S<0.9), then the possible candidate elements of the current intent element in the web page element tree are output, and the element features of each candidate element are marked; if the matching score is less than the lower threshold (e.g., S<0.6), then a user intervention request is initiated to avoid invalid operations.
[0021] In one embodiment, when the possible candidate elements of the current intent element in the web page element tree have been output, the user selects the correct web page element from the candidates or manually specifies the web page element. If there is a logical error in the DSL instruction, the user corrects the instruction sequence. Each successful execution of a DSL instruction updates the semantic anchor of the web page element in real time. For example, if the user corrects the positioning of a web page element, the structural features, visual features, and business meaning vector of that web page element are updated; if the user corrects a DSL instruction, the "scenario--error instruction--correct instruction" mapping relationship is recorded.
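The three-way decision in steps S171 to S173 reduces to two comparisons. The sketch below uses the example thresholds from the text (0.6 and 0.9) as defaults; the names `Decision` and `decide` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"                  # S171: score >= upper threshold
    SHOW_CANDIDATES = "show_candidates"  # S172: lower <= score < upper
    ASK_USER = "ask_user"                # S173: score < lower threshold


def decide(score: float, lower: float = 0.6, upper: float = 0.9) -> Decision:
    """Map a matching score to one of the three outcomes."""
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if score >= upper:
        return Decision.EXECUTE
    if score >= lower:
        return Decision.SHOW_CANDIDATES
    return Decision.ASK_USER
```

Because the middle band is half-open, a score of exactly 0.9 executes and a score of exactly 0.6 shows candidates, so every score falls into exactly one branch.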
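The two kinds of user feedback described above (a corrected element position and a corrected instruction) can be recorded with a small store. This is a sketch under assumptions: the class `FeedbackStore`, its method names, and the lookup helper `known_fix` are hypothetical, and the patent does not describe how the recorded mappings are later consulted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FeedbackStore:
    # element ID -> latest anchor features after a user correction
    anchors: Dict[str, Dict[str, object]] = field(default_factory=dict)
    # recorded (scenario, error instruction, correct instruction) triples
    corrections: List[Tuple[str, str, str]] = field(default_factory=list)

    def correct_element(
        self,
        element_id: str,
        structural: Dict[str, str],
        visual: List[float],
        business: List[float],
    ) -> None:
        """Overwrite all three feature groups of the corrected element."""
        self.anchors[element_id] = {
            "structural": structural,
            "visual": visual,
            "business": business,
        }

    def correct_instruction(self, scenario: str, wrong: str, right: str) -> None:
        """Record the scenario--error instruction--correct instruction mapping."""
        self.corrections.append((scenario, wrong, right))

    def known_fix(self, scenario: str, instruction: str) -> Optional[str]:
        """Return the most recent correction for this scenario and instruction."""
        for s, wrong, right in reversed(self.corrections):
            if s == scenario and wrong == instruction:
                return right
        return None
```

Searching the corrections newest-first means a later user correction for the same scenario and instruction takes precedence over an earlier one.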
[0022] This method acquires the element features of all web page elements in the current webpage; obtains the element tags of each webpage element and performs semantic encoding on the element tags to obtain a business meaning vector; constructs corresponding semantic anchors based on the structured features, visual features, and business meaning vectors corresponding to each webpage element; receives the description text of the current task, performs task parsing and verification on the description text to obtain at least one DSL instruction; parses the DSL instruction according to the current semantic features to obtain the target element; calculates the matching score between the current semantic features and the anchor features corresponding to the target element; if the matching score is greater than or equal to a preset upper threshold, the DSL instruction is executed according to the anchor features of the current intent element; if the matching score is greater than or equal to a lower threshold and less than the upper threshold, the possible candidate elements of the current intent element in the webpage element tree are output; if the matching score is less than the lower threshold, a user intervention request is initiated. This method constructs corresponding semantic anchors using the structured features, visual features, and business meaning vectors corresponding to each webpage element, and uses the semantic anchors as a positioning dimension to unify business semantics and execution operations, significantly improving positioning robustness.
[0023] This invention also provides a webpage automation device for executing any of the aforementioned webpage automation methods. Specifically, refer to Figure 2, which is a schematic block diagram of a web page automation device provided in an embodiment of the present invention. The web page automation device 100 can be configured in a server.
[0024] As shown in Figure 2, the web page automation device 100 includes a feature acquisition module 110, an encoding module 120, an anchor point construction module 130, a task parsing module 140, a monitoring module 150, a calculation module 160, and a decision-making module 170.
[0025] The feature acquisition module 110 is used to acquire the element features of all web page elements in the current web page. The element features include structural features and visual features. The structural features are the A11Y Tree path and DOM path of the web page element. The visual features include at least the spatial position, color, shape and icon features of the web page element.
The encoding module 120 is used to obtain the element tags of each web page element and perform semantic encoding on the element tags to obtain a business meaning vector.
The anchor point construction module 130 is used to construct corresponding semantic anchors based on the structural features, visual features and business meaning vectors corresponding to each web page element. Each semantic anchor includes the element ID and anchor features of the corresponding web page element. The anchor features include anchor structural features, anchor visual features and anchor business features.
The task parsing module 140 is used to receive the description text of the current task, and to parse and verify the description text to obtain at least one DSL instruction. The DSL instruction carries the current intent element and the corresponding current semantic features. The current semantic features include the current structural features, current visual features and current business features of the current intent element.
The monitoring module 150 is used to parse the DSL instruction based on the current semantic features to obtain the target element.
The calculation module 160 is used to calculate the matching score between the current semantic features and the anchor features corresponding to the target element.
The decision module 170 is used to execute the DSL instruction according to the anchor features of the current intent element if the matching score is greater than or equal to a preset upper threshold; to output the possible candidate elements of the current intent element in the web page element tree if the matching score is greater than or equal to a lower threshold and less than the upper threshold; and to initiate a user intervention request if the matching score is less than the lower threshold.
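One way to picture how modules 110 to 170 fit together is a class that receives each module as a callable and passes data from one to the next. This is only an architectural sketch: the class name, the callable signatures, and the choice to score a missing target as 0.0 are assumptions, not details given in the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Element = Dict[str, Any]
Anchor = Dict[str, Any]
Instruction = Dict[str, Any]


@dataclass
class WebAutomationDevice:
    acquire_features: Callable[[str], List[Element]]                        # module 110
    encode: Callable[[Element], List[float]]                                # module 120
    build_anchor: Callable[[Element, List[float]], Anchor]                  # module 130
    parse_task: Callable[[str], List[Instruction]]                          # module 140
    resolve_target: Callable[[Instruction, List[Anchor]], Optional[Anchor]] # module 150
    score: Callable[[Instruction, Anchor], float]                           # module 160
    decide: Callable[[float], str]                                          # module 170

    def run(self, page: str, task_text: str) -> List[str]:
        """Build anchors for the page, then decide an outcome per instruction."""
        elements = self.acquire_features(page)
        anchors = [self.build_anchor(e, self.encode(e)) for e in elements]
        outcomes: List[str] = []
        for instruction in self.parse_task(task_text):
            target = self.resolve_target(instruction, anchors)
            # Assumption: a missing target is scored 0.0 and falls to the lowest band.
            s = self.score(instruction, target) if target is not None else 0.0
            outcomes.append(self.decide(s))
        return outcomes
```

Injecting the modules as callables keeps each one replaceable, which matches the later statement that the division into units is only a logical functional division.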
[0026] This invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the web page automation method described above.
[0027] In another embodiment of the invention, a computer-readable storage medium is provided. This computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the web page automation method as described above.
[0028] Those skilled in the art will readily understand that, for the sake of convenience and brevity, the specific working processes of the devices, apparatuses, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in terms of function in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention.
[0029] In the embodiments provided by this invention, it should be understood that the disclosed devices, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Units with the same function may be grouped into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, or it may be an electrical, mechanical, or other form of connection.
[0030] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of the present invention, depending on actual needs.
[0031] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0032] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), magnetic disks, or optical disks.
[0033] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A web page automation method, characterized by comprising:
Obtaining the element features of all web page elements in the current web page, where the element features include structural features and visual features, the structural features are the A11Y Tree path and DOM path of the web page element, and the visual features include at least the spatial position, color, shape and icon features of the web page element;
Obtaining the element tags of each web page element, and performing semantic encoding on the element tags to obtain a business meaning vector;
Constructing corresponding semantic anchors based on the structural features, visual features, and business meaning vectors corresponding to each web page element, where each semantic anchor contains the element ID and anchor features of the corresponding web page element, and the anchor features include anchor structural features, anchor visual features, and anchor business features;
Receiving the description text of the current task, and parsing and verifying the description text to obtain at least one DSL instruction, where the DSL instruction carries the current intent element and the corresponding current semantic features, and the current semantic features include the current structural features, current visual features and current business features of the current intent element;
Parsing the DSL instruction based on the current semantic features to obtain the target element;
Calculating the matching score between the current semantic features and the anchor features corresponding to the target element;
If the matching score is greater than or equal to a preset upper threshold, executing the DSL instruction according to the anchor features of the target element;
If the matching score is greater than or equal to a lower threshold and less than the upper threshold, outputting the possible candidate elements of the current intent element in the web page element tree;
If the matching score is less than the lower threshold, initiating a user intervention request.
2. A web page automation apparatus, characterized by comprising:
A feature acquisition module, used to acquire the element features of all web page elements in the current web page, where the element features include structural features and visual features, the structural features are the A11Y Tree path and DOM path of the web page element, and the visual features include at least the spatial position, color, shape and icon features of the web page element;
An encoding module, used to obtain the element tags of each web page element and perform semantic encoding on the element tags to obtain a business meaning vector;
An anchor point construction module, used to construct corresponding semantic anchors based on the structural features, visual features, and business meaning vectors of each web page element, where each semantic anchor contains the element ID and anchor features of the corresponding web page element, and the anchor features include anchor structural features, anchor visual features, and anchor business features;
A task parsing module, used to receive the description text of the current task, and to parse and verify the description text to obtain at least one DSL instruction, where the DSL instruction carries the current intent element and the corresponding current semantic features, and the current semantic features include the current structural features, current visual features and current business features of the current intent element;
A monitoring module, used to parse the DSL instruction based on the current semantic features to obtain the target element;
A calculation module, used to calculate, if the target element exists, the matching score between the current semantic features and the anchor features corresponding to the target element;
A decision module, used to execute the DSL instruction based on the anchor features of the target element if the matching score is greater than or equal to a preset upper threshold; to output the possible candidate elements of the current intent element in the web page element tree if the matching score is greater than or equal to a lower threshold and less than the upper threshold; and to initiate a user intervention request if the matching score is less than the lower threshold.
3. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the web page automation method as described in claim 1.
4. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the web page automation method as described in claim 1.