A browser intelligent form filling method and system based on multi-modal understanding
By combining webpage visual and code information, monitoring user actions to identify dynamic form items and verifying them in real time, the accuracy and adaptability issues of form filling in existing technologies are solved, achieving a more efficient form filling effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN HONGYANG CENTURY TECHNOLOGY CO LTD
- Filing Date
- 2026-04-20
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application relates to the field of semantic processing technology, and in particular to a browser-based intelligent form filling method and system based on multimodal understanding.
Background Technology
[0002] With the continuous expansion of internet services, web forms have become the main form of interaction for users to register, log in, and submit information. Form autofill technology can effectively improve the efficiency of users filling out forms and has become an important function of browsers and various clients.
[0003] Traditional form filling techniques primarily rely on parsing fixed field names, ID attributes, or tag text in HyperText Markup Language (HTML) code to identify form items through keyword matching, and then directly calling pre-stored local data to complete the filling. However, this method, which relies solely on keyword matching within HTML code, has a single basis for form item identification and is easily affected by differences in webpage layout and inconsistent field naming, leading to insufficient accuracy in form item identification. Furthermore, this method cannot adapt to dynamic changes in webpage forms and struggles to handle the identification and associated filling of dynamic form items triggered by user actions, thus limiting its applicability.
[0004] In summary, existing form filling methods suffer from low recognition accuracy and poor adaptability.
Summary of the Invention
[0005] This application provides a browser-based intelligent form filling method and system based on multimodal understanding to improve the accuracy and adaptability of web page form filling.
[0006] According to one aspect of this application, a browser-based smart form filling method based on multimodal understanding is provided, comprising: obtaining a visual screenshot of a web page form and Hypertext Markup Language code;
[0007] Based on the visual screenshot and the Hypertext Markup Language code, determine the purpose type of the inherent form items in the web page form, and identify the trigger condition items from the inherent form items based on the purpose type;
[0008] Monitor user operations on the trigger condition items and identify new form items that appear as a result of those operations. Based on the trigger condition items, the new form items, the inherent form items, and the dynamic appearance sequence of the new form items, determine the dynamic dependency relationship between the inherent form items and the new form items.
[0009] Based on the purpose type of the inherent form items and the dynamic dependency relationship, match the corresponding fill data from the user's pre-stored information database, and fill the form according to the dynamic appearance sequence.
[0010] During the filling process, the page feedback information triggered by the filling operation of the web form is captured in real time, and the filling result is verified based on the page feedback information;
[0011] If the validation passes, the form is confirmed to be filled in; otherwise, adjust the filled data according to the page feedback information, and re-execute the filling and validation until the validation passes.
[0012] Optionally, determining the purpose type of the inherent form items in the web page form based on the visual screenshot and the Hypertext Markup Language code includes:
[0013] Identify the inherent form items and their text labels in the visual screenshot, and calculate the visual position features of the inherent form items; wherein, the visual position features include the relative position, alignment relationship and spatial distance between the inherent form items and the corresponding text labels;
[0014] Extract the code semantic features of the inherent form items from the hypertext markup language code; wherein, the code semantic features include the name attribute and prompt text of the inherent form items;
[0015] The semantic features of the code are parsed using a semantic model, and the purpose type of the inherent form item is determined based on the visual location features and the parsed semantic features of the code.
[0016] According to another aspect of this application, a browser-based intelligent form filling system based on multimodal understanding is provided, comprising:
[0017] The data acquisition module is used to acquire a visual screenshot and the Hypertext Markup Language (HTML) code of the web page form;
[0018] The form item recognition module is used to determine the purpose type of the inherent form items in the web page form based on the visual screenshot and the hypertext markup language code, and to identify the trigger condition items from the inherent form items based on the purpose type;
[0019] The dependency relationship building module is used to monitor the user's operations on the trigger condition items, identify the new form items that appear as a result of those operations, and determine the dynamic dependency relationship between the inherent form items and the new form items based on the trigger condition items, the new form items, the inherent form items, and the dynamic appearance sequence of the new form items.
[0020] The data matching and filling module is used to match the corresponding fill data from the user's pre-stored information database based on the purpose type of the inherent form items and the dynamic dependency relationship, and to complete the form filling according to the dynamic appearance sequence.
[0021] The validation and correction module is used to capture the page feedback information triggered by the form filling operation in real time during the filling process, and to validate the filling result based on the page feedback information; if the validation passes, the form filling is confirmed to be complete; otherwise, the filling data is adjusted according to the page feedback information, and the filling and validation are re-executed until the validation passes.
[0022] The technical solution of this application achieves multimodal collaborative understanding by integrating webpage visual information and code structure, and combines dynamic form monitoring, effectively solving the problems of inaccurate form item recognition and inability to adapt to dynamically changing forms caused by existing technologies relying solely on single code matching. Specifically, this application combines bimodal information from webpage visual screenshots and HTML code for comprehensive judgment, determining the purpose of form items from multiple dimensions such as visual layout, text association, and code semantics, avoiding recognition errors caused by non-standard field naming and page layout differences, and improving the accuracy of form item recognition. Simultaneously, this application accurately identifies dynamically added form items by monitoring user operations on trigger condition items, and analyzes the temporal sequence and dependency logic between existing and newly added form items, completing data filling according to the associated order, solving the problem of chaotic dynamic form association filling, and broadening the applicable scenarios of the webpage form filling solution. Furthermore, this application also captures and verifies page feedback information in real time during the filling process, forming a closed-loop error correction mechanism, significantly improving the stability and success rate of form filling. Therefore, the technical solution of this application can significantly improve the accuracy, adaptability, and reliability of webpage form filling.
[0023] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.
Attached Figure Description
[0024] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0025] Figure 1 is a flowchart of a browser-based intelligent form filling method based on multimodal understanding provided by an embodiment of this application;
[0026] Figure 2 is a flowchart of another browser-based intelligent form filling method based on multimodal understanding provided by an embodiment of this application;
[0027] Figure 3 is a flowchart of yet another browser-based intelligent form filling method based on multimodal understanding provided by an embodiment of this application;
[0028] Figure 4 is a schematic structural diagram of a browser-based intelligent form filling system based on multimodal understanding provided by an embodiment of this application.
Detailed Implementation
[0029] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.
[0030] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0031] Figure 1 is a flowchart of a browser-based intelligent form filling method based on multimodal understanding provided by an embodiment of this application. This embodiment is applicable to the intelligent filling of web page forms. The method can be executed by a browser-based intelligent form filling system based on multimodal understanding, which can be implemented in hardware and/or software. As shown in Figure 1, the method includes:
[0032] S110. Obtain a visual screenshot and Hypertext Markup Language code of the web page form.
[0033] Specifically, a visual screenshot refers to a visual image obtained by capturing the area of a webpage form. Visual screenshots can be used to extract visual information such as the position, layout, label text, and arrangement of form items. Hypertext Markup Language (HTML) code contains underlying information such as form input boxes, labels, attributes, and structure, used to obtain semantic features such as field names, attributes, and business meanings.
[0034] Optionally, obtaining a visual screenshot and Hypertext Markup Language (HTML) code of the web page form includes: taking a screenshot of the page area where the web page form is located to obtain the visual screenshot; and obtaining, through the browser interface, the document object model source code corresponding to the web page form to obtain the HTML code.
[0035] Specifically, the browser interface refers to the application programming interface (API) provided by the browser that can be called by programs to implement functions such as page manipulation, element acquisition, screenshotting, document object model (DOM) reading, and event listening. The Document Object Model (DOM) source code refers to the tree-like node structure formed by the browser after parsing HTML code. The DOM source code includes all form elements, attributes, hierarchical relationships, and rendering logic.
[0036] In this embodiment of the application, by synchronously collecting the visual features and code structure information of the web page form, dual-modal information fusion is achieved, which can improve the comprehensiveness and accuracy of form item recognition and avoid recognition errors caused by the lack or insufficiency of single-dimensional information.
[0037] S120. Based on the visual screenshot and Hypertext Markup Language code, determine the purpose type of the inherent form items in the web page form, and identify the trigger condition items from the inherent form items according to the purpose type.
[0038] Specifically, inherent form items refer to form items that exist when the webpage form is initially loaded and can be displayed without any operation. For example, inherent form items may include username, mobile phone number, and ID card number. Purpose type refers to the actual business meaning of the inherent form item, such as name, address, ID card number, occupation, and tax ID. Purpose type is determined through a combination of visual information and code semantics. Triggering condition items refer to fields in the inherent form items that can trigger subsequent form changes. For example, selecting "Married" will bring up "Spouse Information"; entering "ID Card Number" will bring up "Do You Live Together?".
[0039] For example, by analyzing the label text, input box positions, and layout relationships in the visual screenshot, and combining this with a comprehensive judgment based on the attribute names, identifier fields, and business semantics in the Hypertext Markup Language code, the actual purpose of each inherent form item in the initial page loading state is determined, such as name, mobile phone number, ID card number, marital status, and region. Based on this, according to the purpose type and business semantics of each form item, form items that can trigger subsequent dynamic changes to the page, the display of related fields, or business logic switching are further filtered out and identified as trigger condition items.
[0040] In this embodiment of the application, by comprehensively judging the purpose of form items through bimodal information, it is possible to accurately distinguish between regular form items and form items that can trigger dynamic behaviors, thus laying the foundation for the subsequent construction of dynamic dependencies.
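The filtering of trigger condition items by purpose type can be illustrated with a minimal sketch. The `FormItem` structure, the identifier strings, and the `TRIGGER_PURPOSE_TYPES` set are hypothetical names introduced here for illustration; the application does not specify how trigger-capable purpose types are represented.

```python
from dataclasses import dataclass

# Assumption: purpose types treated as capable of triggering dynamic changes.
# The two entries mirror the "marital status" and "ID card number" examples.
TRIGGER_PURPOSE_TYPES = {"marital_status", "id_card_number"}


@dataclass(frozen=True)
class FormItem:
    element_id: str    # hypothetical identifier of the form control
    purpose_type: str  # purpose type determined from screenshot and HTML code


def find_trigger_items(inherent_items):
    """Return the inherent form items whose purpose type can trigger changes."""
    return [item for item in inherent_items
            if item.purpose_type in TRIGGER_PURPOSE_TYPES]
```

In this sketch the decision depends only on the purpose type; a fuller implementation could also weigh business semantics, as the paragraph above describes.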
[0041] S130. Monitor user operations on the trigger condition items and identify new form items that appear as a result of those operations. Based on the trigger condition items, the new form items, the inherent form items, and the dynamic appearance sequence of the new form items, determine the dynamic dependency relationship between the inherent form items and the new form items.
[0042] Specifically, a newly added form item refers to a form item that dynamically appears on the page after a user's action triggers a condition; this item did not exist before. For example, a newly added form item can be a dynamically popped-up option, an additional input box, or a related field. The dynamic appearance sequence refers to the order in which the newly added form items appear. Dynamic dependency refers to the causal relationship and filling constraints between existing form items and newly added form items. For example, after entering an ID number, a "Do you live together?" option pops up, forming a dynamic dependency between the two.
[0043] In this embodiment, by binding event listeners to trigger condition items, the system monitors user input, selection, or focus switching operations performed on these trigger condition items in real time. When the page dynamically changes due to these operations, newly rendered form items are identified. Combining the business relationships between trigger condition items, newly added form items, and existing form items, as well as the dynamic appearance sequence of the newly added form items, the system analyzes and determines the trigger constraints and sequential relationships between existing and newly added form items, thus obtaining dynamic dependencies. By monitoring user operations and constructing dynamic dependencies between form items, the system can effectively adapt to dynamic interactive forms, avoiding issues such as missing fields, incorrect fields, or disordered order, thereby improving the applicability and stability of the form filling method.
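One possible way to turn monitored appearances into dynamic dependencies is sketched below. The `Appearance` record and its fields are assumptions made for this example; the sketch reduces the dependency relationship to "trigger item, then the new items it caused, in order of appearance" and omits the business-relationship analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appearance:
    item_id: str       # newly added form item (hypothetical identifier)
    trigger_id: str    # trigger condition item operated on just before
    timestamp_ms: int  # time at which the new item was observed


def build_dependencies(appearances):
    """Return (dependencies, sequence).

    dependencies maps each trigger item to the new items it caused, and
    sequence lists all new items in their dynamic appearance order.
    """
    ordered = sorted(appearances, key=lambda a: a.timestamp_ms)
    dependencies = {}
    for appearance in ordered:
        dependencies.setdefault(appearance.trigger_id, []).append(appearance.item_id)
    sequence = [appearance.item_id for appearance in ordered]
    return dependencies, sequence
```

In a browser, the appearance records would be produced by event listeners bound to the trigger condition items; here they are passed in as plain data so the ordering logic can be read in isolation.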
[0044] S140. Based on the inherent form item's purpose type and dynamic dependency relationship, match the corresponding fill data from the user's pre-stored information database and complete the form filling according to the dynamic appearance sequence.
[0045] Specifically, the user pre-stored information database refers to a collection of personal information that users have pre-entered, authorized, or historically submitted and that is securely stored by the system. This information is used for semantic matching and data retrieval during form autofill. For example, the user pre-stored information database may include name, mobile phone number, ID card number, address, email address, bank card information, and employer information. The user pre-stored information database can be manually filled in and saved by the user, extracted and organized from historical form entry records, or synchronized from a trusted data source with user authorization. The filled data refers to the specific content matched from the user pre-stored information database based on the purpose type of the form item and used to fill in the corresponding form item. For example, if the purpose type is "name," the corresponding filled data would be "Zhang San."
[0046] In this embodiment, semantic matching is performed in the user's pre-stored information database based on the purpose type of the inherent form items and the established dynamic dependencies, and the corresponding fill data is retrieved. The inherent and newly added form items are then filled automatically one after another, strictly following the dynamic appearance sequence of the new form items, so that the entire form is filled in order. Matching data by purpose type and filling in the order of the dynamic sequence ensures that the filled content is consistent with the form's business meaning and that the filling order conforms to the page logic. This effectively improves the correctness, completeness, and adaptability of form filling and avoids filling failures caused by missing relationships or incorrect order.
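The matching and ordered filling described above could be planned as in the following sketch. It assumes the dependencies are available as a mapping from a trigger item to its new items in appearance order, and that the pre-stored information database is a simple dictionary keyed by purpose type; all names and values are illustrative, and the dependencies are assumed to contain no cycles.

```python
def plan_fill(inherent_ids, dependencies, purposes, profile):
    """Return (item_id, value) pairs in the order they should be filled.

    inherent_ids: inherent form items in page order
    dependencies: trigger item -> new items in dynamic appearance order
    purposes:     item id -> purpose type
    profile:      purpose type -> pre-stored value
    """
    plan = []

    def visit(item_id):
        value = profile.get(purposes.get(item_id, ""))
        if value is None:
            # No stored data: skip this item and the items that depend on it.
            return
        plan.append((item_id, value))
        # Dependent items only exist after their trigger has been filled.
        for dependent_id in dependencies.get(item_id, []):
            visit(dependent_id)

    for item_id in inherent_ids:
        visit(item_id)
    return plan
```

Filling a trigger item before its dependents reflects the constraint that the new items are not present on the page until the trigger has been operated.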
[0047] S150. During the filling process, capture the page feedback information triggered by the filling operation of the web form in real time, and verify the filling result based on the page feedback information.
[0048] Specifically, page feedback information refers to the prompts given on the page after the form is filled in, including prompts such as format error, invalid content, missing required fields, and validation passed.
[0049] In this embodiment, by monitoring the validation events, DOM style changes, and API return information of the web form during the form filling process, the system captures real-time page feedback information such as prompt text, style tags, and error reminders generated by the form filling operation. Based on this feedback information, the system verifies and judges whether the currently filled content is legal and compliant. For example, after filling in the ID number form field, the page immediately displays a red prompt text "ID number verification error," or adds an error style class to the input box. This prompt and style change constitute the page feedback information captured in this step. The browser's intelligent form filling system determines that the filling result is unsuccessful based on this feedback information, thus validating the filled content.
[0050] The embodiments of this application can promptly detect problems such as format errors, invalid content, and non-compliance with form rules in the data filling process, avoiding direct submission of erroneous data, improving the reliability of the filling results, and reducing form submission failures caused by filling errors.
[0051] S160. If the validation passes, the form is confirmed to be filled in; otherwise, adjust the filled data according to the page feedback information and re-execute the filling and validation until the validation passes.
[0052] For example, if the page displays "Incorrect mobile number format" after filling in the mobile number field, the browser's smart form filling system will automatically correct the filled data to a valid mobile number format based on the error message, fill in and verify it again, until the page displays "Correct format", thus completing the filling of the field.
[0053] In this embodiment of the application, a closed-loop error correction mechanism is formed by automatically adjusting the filled data and performing cyclic verification. Filling errors can be automatically corrected without manual intervention, ensuring that the final submitted form data complies with the page verification rules.
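The closed-loop verification and correction can be sketched as a simple retry loop. The functions `validate_phone` and `adjust` are hypothetical stand-ins for the page feedback and for the correction strategy, neither of which is specified in detail here. The `max_attempts` bound is an added assumption so that the loop terminates when no correction succeeds.

```python
import re


def validate_phone(value):
    """Stand-in for page feedback: an error message, or None if accepted."""
    if not re.fullmatch(r"\d{11}", value):
        return "Incorrect mobile number format"
    return None


def adjust(value, feedback):
    """Hypothetical correction: on a format error, strip non-digit characters."""
    if "format" in feedback:
        return re.sub(r"\D", "", value)
    return value


def fill_with_retry(value, validate, adjust, max_attempts=3):
    """Fill, verify against feedback, and adjust until verification passes."""
    for _ in range(max_attempts):
        feedback = validate(value)
        if feedback is None:
            return value  # verification passed, filling is complete
        adjusted = adjust(value, feedback)
        if adjusted == value:
            break  # no further correction is available
        value = adjusted
    raise ValueError("could not produce a value that passes verification")
```

For instance, `fill_with_retry("138-0013-8000", validate_phone, adjust)` fails the first check, is adjusted to an 11-digit string, and passes on the second check.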
[0054] The technical solution of this application embodiment achieves multimodal collaborative understanding by integrating webpage visual information and code structure, and combines dynamic form monitoring, effectively solving the problems of inaccurate form item recognition and inability to adapt to dynamically changing forms caused by relying solely on single code matching in existing technologies. Specifically, this application embodiment combines webpage visual screenshots and HTML code bimodal information for comprehensive judgment, determining the purpose of form items from multiple dimensions such as visual layout, text association, and code semantics, avoiding recognition errors caused by non-standard field naming and page layout differences, and improving the accuracy of form item recognition. Simultaneously, this application embodiment accurately identifies dynamically added form items by monitoring user operations on trigger condition items, and sorts out the temporal sequence and dependency logic between existing form items and newly added form items, completing data filling according to the association order, solving the problem of chaotic dynamic form association filling, and broadening the applicable scenarios of the webpage filling solution. Furthermore, this application embodiment also captures and verifies page feedback information in real time during the filling process, forming a closed-loop error correction mechanism, significantly improving the stability and success rate of form filling. Therefore, this application embodiment can significantly improve the accuracy, adaptability, and reliability of webpage form filling.
[0055] Figure 2 is a flowchart of another browser-based intelligent form filling method based on multimodal understanding provided by an embodiment of this application. Building on the above embodiments, as shown in Figure 2, the method optionally includes:
[0056] S210. Obtain a visual screenshot and Hypertext Markup Language code of the web page form.
[0057] S220. Identify the inherent form items and their text labels in the visual screenshot, and calculate the visual positional features of the inherent form items. The visual positional features include the relative position, alignment, and spatial distance between the inherent form items and their corresponding text labels.
[0058] Specifically, text labels refer to the prompt text displayed in a web form that accompanies the input field and explains the meaning of filling in that form item. For example, text labels may include text content such as "name," "phone number," and "ID number." Visual position features refer to features extracted from visual screenshots that describe the spatial layout relationship of form items. Relative position refers to the vertical and horizontal orientation of the inherent form item and its text label on the page; for example, the label may be to the left of the input field. Alignment refers to whether the inherent form item and the text label are aligned horizontally or vertically. For example, alignment may include top alignment, center alignment, and bottom alignment. Spatial distance refers to the pixel distance between the inherent form item and its corresponding text label.
[0059] For example, in a registration form, the browser's smart form fill system identifies the "phone number" text label and the input box to its right from the visual screenshot, calculates that the text label is located to the left of the input box, the two are horizontally centered and the pixel spacing is 10 pixels, thus forming the visual position feature of the form item.
[0060] In this embodiment of the application, by identifying form items and their text labels and extracting visual position features, a visual basis can be provided for subsequent judgment of the correspondence between form items and text labels and for accurately determining the type of use of form items.
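As an illustration, the three visual position features can be computed from the bounding boxes of a text label and its form item. The `Box` type, the `tolerance` parameter, and the returned category names are assumptions of this sketch; how the boxes are detected in the screenshot is outside its scope.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge in pixels
    y: float       # top edge in pixels
    width: float
    height: float


def visual_position_features(label, field, tolerance=2.0):
    """Relative position, alignment, and spatial distance of label vs. field."""
    if label.x + label.width <= field.x:
        relative = "left"
    elif label.y + label.height <= field.y:
        relative = "above"
    else:
        relative = "other"

    if abs(label.y - field.y) <= tolerance:
        alignment = "top"
    elif abs((label.y + label.height / 2) - (field.y + field.height / 2)) <= tolerance:
        alignment = "center"
    elif abs((label.y + label.height) - (field.y + field.height)) <= tolerance:
        alignment = "bottom"
    else:
        alignment = "none"

    # Gap between the nearest edges of the two boxes (0 if they overlap).
    dx = max(field.x - (label.x + label.width), label.x - (field.x + field.width), 0)
    dy = max(field.y - (label.y + label.height), label.y - (field.y + field.height), 0)
    return {"relative": relative, "alignment": alignment, "distance": math.hypot(dx, dy)}
```

With a label box of `Box(0, 10, 90, 20)` and an input box of `Box(100, 5, 200, 30)`, the sketch reports the label as left of the input, center-aligned, at a distance of 10 pixels, which corresponds to the registration form example above.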
[0061] S230. Extract the code semantic features of inherent form items from the Hypertext Markup Language code. The code semantic features include the name attribute and prompt text of the inherent form items.
[0062] Specifically, code semantic features refer to the feature information extracted from HTML code that directly reflects the business meaning and filling requirements of form items, used for subsequent semantic retrieval and semantic processing. Name attributes refer to the name attribute value of form items in HTML code, used to identify the purpose of the field, such as userName, phone, and idCard. Prompt text refers to the placeholder text of form items, that is, the default prompt text displayed in the input box, such as "Please enter your mobile number".
[0063] For example, the name attribute of an input field is extracted from the HTML code as "userPhone" and the prompt text is "Please enter an 11-digit mobile phone number". This gives us the semantic features of the form field and clarifies that its purpose is to input a mobile phone number.
[0064] In this embodiment of the application, by parsing the HTML code of the webpage, two key types of information, namely name attributes and prompt text, can be extracted from the form input elements. This allows us to obtain the underlying code semantics of the form items and provide a code-dimensional basis for determining the purpose of the form items.
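A minimal sketch of this extraction, using Python's standard `html.parser` module, is shown below. The class and function names are illustrative, and restricting the scan to `input`, `select`, and `textarea` tags is an assumption of the example.

```python
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Collects the name attribute and placeholder text of form controls."""

    FORM_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in self.FORM_TAGS:
            attributes = dict(attrs)
            self.fields.append({
                "name": attributes.get("name"),
                "placeholder": attributes.get("placeholder"),
            })


def extract_code_semantic_features(html):
    """Return one {name, placeholder} record per form control in the HTML."""
    extractor = FormFieldExtractor()
    extractor.feed(html)
    return extractor.fields
```

Applied to an input element whose name attribute is "userPhone", the function returns that name together with the placeholder text, ready to be passed to the semantic model.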
[0065] S240. Parse the code semantic features using a semantic model, and determine the purpose type of the inherent form items based on the visual position features and the parsed code semantic features.
[0066] Specifically, a semantic model refers to a pre-trained language model used to understand the meaning of text. A semantic model can perform semantic encoding and analysis on name attributes, prompt text, etc., to determine the true business meaning of a field. In this embodiment, the semantic model is trained by collecting a large amount of text sample data from the form domain. The sample data includes name attributes, placeholder text, label text, and the corresponding purpose type annotations for various form items. This sample data is input into a deep learning network for iterative training, enabling the model to learn the mapping between form text features and purpose types, ultimately resulting in a semantic model that can be used to identify the semantics of form items.
[0067] Optionally, determining the purpose type of the inherent form item based on the visual position features and the parsed code semantic features includes: if there is no conflict between the visual position features and the parsed code semantic features, generating a semantic vector based on the visual position features and the parsed code semantic features; if there is a conflict between the visual position features and the parsed code semantic features, using a cross-attention mechanism to perform a weighted fusion of the visual position features and the parsed code semantic features to generate a semantic vector; and matching the semantic vector with predefined purpose type features to determine the purpose type of the inherent form item.
[0068] Specifically, semantic vectors refer to numerical feature vectors obtained by fusing and encoding visual position features with code semantic features. These vectors represent the comprehensive semantic information of inherent form items, facilitating subsequent purpose type matching. Predefined purpose types are pre-built, categorized, and stored by the intelligent form filling system based on common form business scenarios. By summarizing and organizing commonly used form items from various registration forms, authentication forms, and application forms, a standard purpose type set is formed, including names, mobile phone numbers, ID card numbers, email addresses, contact addresses, bank card numbers, etc.
[0069] For example, an intelligent form filling system identifies the text label as "phone number" from the visual position features and parses the name attribute as "userphone" from the code semantic features, while the placeholder text is "Please enter your ID number," indicating an inconsistency and conflict. In this case, a cross-attention mechanism can be used to adaptively weight and fuse the visual and code semantic features, automatically strengthening high-confidence features and weakening low-confidence features. The fused features generate a unified semantic vector, which is then matched with the predefined purpose type features to accurately determine the purpose type of the form item.
[0070] In this embodiment of the application, by integrating visual information and code semantic information to identify the purpose of form items, the intelligent form filling system can still accurately identify the purpose type of inherent form items even when there are errors or inconsistencies in single modal information.
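The weighting and matching idea can be illustrated with a deliberately simplified sketch. It replaces the cross-attention mechanism with a softmax-weighted average over per-source feature vectors and uses cosine similarity for matching. The two-dimensional vectors, confidence scores, and purpose type names are hypothetical toy values, not outputs of a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fuse(features, confidences):
    """Softmax-weighted average: high-confidence sources dominate the result."""
    exps = [math.exp(c) for c in confidences]
    total = sum(exps)
    weights = [e / total for e in exps]
    dimension = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dimension)]


def match_purpose(vector, prototypes):
    """Return the predefined purpose type whose feature vector is most similar."""
    return max(prototypes, key=lambda name: cosine(vector, prototypes[name]))


# Toy prototypes for two predefined purpose types (assumed values).
PROTOTYPES = {"mobile_phone": [1.0, 0.0], "id_card_number": [0.0, 1.0]}
```

In the conflicting example above, the label and name attribute both point to a phone number while the placeholder points to an ID number. Giving the first two sources higher confidence pulls the fused vector toward the "mobile_phone" prototype.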
[0071] S250. Monitor the user's operation on the trigger condition item, identify the newly added web page element after the operation, and determine the newly added web page element as a new form item.
[0072] Specifically, newly added form items refer to HTML elements that are dynamically generated and displayed on the webpage after a user performs an action on a trigger condition. These HTML elements do not exist or are hidden when the page initially loads; they are only loaded and displayed due to user interaction. For example, after a user selects "Married" in the trigger condition "Marital Status," the webpage dynamically renders and displays elements such as the "Spouse's Name" and "Spouse's ID Number" input boxes. These elements, which were not displayed before the user's action but appeared afterward, are considered newly added webpage elements.
[0073] For example, when a user operates the trigger condition item "Marital Status" and selects "Married", the intelligent form filling system monitors the page in real time and detects that two new input box elements, "Spouse's Name" and "Spouse's ID Number", have been inserted into the webpage. After determining that they meet the preset form item characteristics, the system identifies these two elements as the new form items corresponding to this operation.
[0074] Optionally, identifying the newly added web page elements after the operation and determining them as new form items includes: identifying web page elements that are newly inserted after the operation or that switch from a hidden state to a visible state, and determining those web page elements that meet the preset form item characteristics as new form items.
[0075] Specifically, preset form item features refer to a pre-defined set of features used to determine whether a webpage element is a form input control. These features mainly include element type, control attributes, and interaction styles. The preset form item features are derived by summarizing common form controls, covering typical features of common form items such as input boxes, dropdown selection boxes, radio buttons, checkboxes, and text fields.
[0076] In this embodiment of the application, after the user performs an operation on the trigger condition item, the changes of page elements are monitored, and newly inserted and loaded web page elements after the user operation and web page elements that switch from hidden state to display state are identified. These elements are then filtered according to preset form item characteristics, and web page elements that meet the characteristics of form controls such as input boxes and selection boxes are identified as dynamically added form items, thereby achieving accurate capture and differentiation of dynamic form content.
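As a non-limiting illustration, the identification of newly added form items can be sketched by comparing two snapshots of the page taken before and after the user operation. The snapshot format (a dictionary mapping an element identifier to its tag and hidden state), the `FORM_TAGS` set standing in for the preset form item characteristics, and the function names are assumptions made for this sketch. In a real browser the snapshots would come from observing the page; that part is not shown here.

```python
# Assumed stand-in for the preset form item characteristics.
FORM_TAGS = {"input", "select", "textarea"}


def is_form_control(element):
    """Check an element against the assumed form item characteristics."""
    return element["tag"] in FORM_TAGS


def find_new_form_items(before, after):
    """Return ids of form controls that were inserted or became visible.

    before/after map an element id to {"tag": str, "hidden": bool}.
    """
    new_items = []
    for element_id, element in after.items():
        if element.get("hidden", False):
            continue  # still not displayed after the operation
        previous = before.get(element_id)
        inserted = previous is None
        revealed = previous is not None and previous.get("hidden", False)
        if (inserted or revealed) and is_form_control(element):
            new_items.append(element_id)
    return new_items


before = {
    "marital_status": {"tag": "select", "hidden": False},
    "spouse_id": {"tag": "input", "hidden": True},  # present but hidden
}
after = {
    "marital_status": {"tag": "select", "hidden": False},
    "spouse_name": {"tag": "input", "hidden": False},  # newly inserted
    "spouse_id": {"tag": "input", "hidden": False},    # hidden -> visible
    "hint_text": {"tag": "div", "hidden": False},      # not a form control
}

new_form_items = find_new_form_items(before, after)
print(new_form_items)  # ['spouse_name', 'spouse_id']
```

The `hint_text` element is also new, but it is filtered out because it does not match the assumed form item characteristics.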
[0077] S260. Record the timestamp of each newly added form item to obtain the dynamic appearance sequence of the newly added form items.
[0078] For example, after a user selects "Married" for the inherent form field "Marital Status," the webpage first displays the "Spouse's Name" input box, and then displays the "Spouse's ID Number" input box after a 100-millisecond interval. The intelligent form filling system records the timestamps of the appearance of "Spouse's Name" as 16:20:30.100 and "Spouse's ID Number" as 16:20:30.200. Based on the order of the timestamps, the dynamic appearance sequence of the newly added form fields is: "Spouse's Name" appears first, followed by "Spouse's ID Number."
[0079] In this embodiment, the dynamic appearance sequence provides a clear execution order for subsequent form filling, ensuring that the intelligent form filling system can fill in the form items according to their actual display order.
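As a non-limiting illustration, the dynamic appearance sequence can be derived by sorting the recorded timestamps. The sketch assumes the timestamps are stored as `HH:MM:SS.mmm` strings as in the example above; the function name `appearance_sequence` is hypothetical.

```python
from datetime import datetime


def appearance_sequence(timestamps):
    """Return form item names ordered by the time they appeared."""
    parsed = {
        name: datetime.strptime(stamp, "%H:%M:%S.%f")
        for name, stamp in timestamps.items()
    }
    return sorted(parsed, key=parsed.get)


# Recorded in arbitrary order; values taken from the example above.
recorded = {
    "Spouse's ID Number": "16:20:30.200",
    "Spouse's Name": "16:20:30.100",
}

sequence = appearance_sequence(recorded)
print(sequence)  # ["Spouse's Name", "Spouse's ID Number"]
```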
[0080] S270. Construct a directed dependency graph with inherent form items and newly added form items as nodes, and triggering conditions and dynamic appearance order as edges. The directed dependency graph reflects the dynamic dependency relationship between inherent form items and newly added form items.
[0081] Specifically, a directed dependency graph is a directed graph structure used to describe the dependencies and execution order between form items. A directed dependency graph uses nodes to represent form items and directed edges to represent the triggering relationships and sequence of events between form items.
[0082] In this embodiment, the directed dependency graph can transform the complex dependencies of dynamic forms into a structured representation, thereby providing a clear execution path for form filling, ensuring that the form filling order conforms to the page logic, and avoiding filling errors. For example, the inherent form item "Marital Status" and the newly added form items "Spouse's Name" and "Spouse's ID Number" are respectively used as nodes in the directed dependency graph. Based on the temporal relationship where "Marital Status" triggers the display of the subsequent two items, and "Spouse's Name" appears before "Spouse's ID Number," directed edges are constructed sequentially in the direction of "Marital Status → Spouse's Name → Spouse's ID Number," ultimately forming a complete directed dependency graph, thus clearly expressing the triggering relationship and corresponding filling order between each form item.
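As a non-limiting illustration, the chain "Marital Status → Spouse's Name → Spouse's ID Number" can be represented as a mapping from each node to its predecessors, and a filling order can be read off with a topological sort. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the helper name `build_dependency_graph` and the chain-shaped edge layout are assumptions made for this sketch.

```python
from graphlib import TopologicalSorter


def build_dependency_graph(trigger, new_items_in_order):
    """Return {node: set of predecessors} for trigger -> item1 -> item2 ...

    The first edge encodes the triggering relationship; the following
    edges encode the dynamic appearance sequence.
    """
    graph = {trigger: set()}
    previous = trigger
    for item in new_items_in_order:
        graph[item] = {previous}
        previous = item
    return graph


graph = build_dependency_graph(
    "Marital Status", ["Spouse's Name", "Spouse's ID Number"]
)

# A topological order never visits a node before its predecessors,
# so it can serve as the filling order.
fill_order = list(TopologicalSorter(graph).static_order())
print(fill_order)
# ['Marital Status', "Spouse's Name", "Spouse's ID Number"]
```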
[0083] S280. Based on the usage type and dynamic dependency relationship of the inherent form items, match the corresponding fill data from the user's pre-stored information database and complete the form filling according to the dynamic appearance sequence.
[0084] S290. During the filling process, capture the page feedback information triggered by the filling operation of the web form in real time, and verify the filling result based on the page feedback information.
[0085] S200. If the validation passes, the form is confirmed to be filled in; otherwise, adjust the filled data according to the page feedback information and re-execute the filling and validation until the validation passes.
[0086] The technical solution of this application embodiment, by fusing visual screenshot features and code semantic features for multimodal recognition, can strengthen highly reliable features when single-modal information is incorrect or conflicting, accurately determine the purpose type of inherent form items, and improve the accuracy and robustness of form recognition. Simultaneously, by monitoring user operations to dynamically identify newly added form items, recording the time sequence and constructing a directed dependency graph, it clearly represents the triggering relationships and display order between form items, achieving complete structural analysis of dynamic forms. In summary, the technical solution provided by this application embodiment can effectively adapt to various static and dynamic web page forms, accurately identify the purpose and dependency relationships of form items, and significantly improve the adaptability and recognition efficiency of intelligent form recognition systems for complex form scenarios.
[0087] Figure 3 is a flowchart of yet another browser-based intelligent form filling method based on multimodal understanding, provided as an embodiment of this application. Based on the above embodiments, as shown in Figure 3, the method optionally includes:
[0088] S310. Obtain a visual screenshot and Hypertext Markup Language code of a web page form.
[0089] S320. Based on the visual screenshot and Hypertext Markup Language code, determine the purpose type of the inherent form items in the web page form, and identify the trigger condition items from the inherent form items according to the purpose type.
[0090] S330. Monitor user operations on the trigger condition items and identify new form items that appear due to the operations. Based on the trigger condition items, the new form items, and the dynamic occurrence sequence of the inherent form items and the new form items, determine the dynamic dependency relationship between the inherent form items and the new form items.
[0091] S340. Based on the usage type of the inherent form item, search for data that matches the usage type in the user's pre-stored information database.
[0092] For example, when the purpose type of a certain form item is identified as a mobile phone number, the intelligent form filling system searches for and retrieves the corresponding mobile phone number data from a pre-stored user information database containing user name, mobile phone number, ID card number, email address, etc., based on the purpose type, so as to automatically fill in the corresponding form item later.
[0093] In this embodiment, by accurately matching the form item usage type with the user's pre-stored information database, data consistent with the business requirements of the form item can be quickly located and obtained, avoiding interference from irrelevant data, ensuring the accuracy and relevance of the filled data, and providing a reliable data source for subsequent automatic form filling.
[0094] S350. Based on dynamic dependencies, filter the matched data to obtain the populated data; according to the dynamic occurrence sequence, populate each form item with data in sequence.
[0095] For example, the intelligent form filling system determines that the inherent form item "Marital Status" is set to "Married" and matches multiple related data entries from the user's pre-stored information database, including the spouse's name, the spouse's ID number, and children's information. The system then filters this data based on the dynamic dependency relationship, retaining only the spouse's name and spouse's ID number, whose form items are actually triggered for display in the current form, and excluding the children's information, whose form item has not been triggered for display. This yields the final fill data. Finally, the system fills the data into the corresponding form items in the dynamic appearance sequence, "Spouse's Name" first and then "Spouse's ID Number," completing the ordered filling process.
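As a non-limiting illustration, the match, filter, and order steps of S340 and S350 can be sketched as follows. The dictionary layout of the pre-stored information, the usage type keys, and the function name `select_fill_data` are hypothetical, and the stored values are placeholders rather than real data.

```python
def select_fill_data(user_data, usage_types, displayed_items, appearance_order):
    """Match by usage type, filter by displayed items, order by appearance.

    user_data:        usage type -> stored value
    usage_types:      form item  -> usage type
    displayed_items:  form items that the dependency analysis found displayed
    appearance_order: form items in their dynamic appearance sequence
    """
    # Step 1: match stored data to form items by usage type.
    matched = {
        item: user_data[usage_type]
        for item, usage_type in usage_types.items()
        if usage_type in user_data
    }
    # Step 2: drop data whose form item was not triggered for display.
    filtered = {i: v for i, v in matched.items() if i in displayed_items}
    # Step 3: emit (item, value) pairs in appearance order.
    return [(i, filtered[i]) for i in appearance_order if i in filtered]


user_data = {
    "spouse_name": "<stored spouse name>",
    "spouse_id_number": "<stored spouse ID number>",
    "children_info": "<stored children information>",
}
usage_types = {
    "Spouse's Name": "spouse_name",
    "Spouse's ID Number": "spouse_id_number",
    "Children's Information": "children_info",
}
displayed = {"Spouse's Name", "Spouse's ID Number"}
order = ["Spouse's Name", "Spouse's ID Number"]

fill_plan = select_fill_data(user_data, usage_types, displayed, order)
for item, value in fill_plan:
    print(item, "<-", value)
```

The children's information is matched in step 1 but removed in step 2, so it never reaches the fill plan.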
[0096] In this embodiment of the application, by obtaining page feedback information in real time when filling in form items one by one, it is possible to determine in a timely manner whether the filled content conforms to the web page validation rules, providing a basis for subsequent automatic error correction and refilling.
[0097] S360. While filling in each form item of the web page form, obtain in real time the page feedback information triggered after each form item is filled in. The form items include both inherent form items and newly added form items, and the page feedback information includes format error messages, content validity messages, and required field messages.
[0098] Specifically, format error messages refer to the information displayed on a webpage when it detects that the entered content does not conform to the format specifications. The purpose of format error messages is to remind users that the current content does not meet the form requirements so that they can correct it. Content validity messages refer to the information displayed on a webpage after verifying the authenticity and compliance of the entered content. Their purpose is to verify whether the filled-in content is valid and usable, preventing the submission of invalid information. Required field messages refer to the information displayed on a webpage for required form fields that are not filled in or not filled in correctly. Their purpose is to remind users of any missing key form fields, ensuring the completeness of the form information.
[0099] In this embodiment, by obtaining real-time feedback information from the page during item-by-item filling, it is possible to promptly detect whether the filled content conforms to the webpage validation rules, thereby achieving simultaneous filling and validation. This allows for the early detection of issues such as format errors, invalid content, and missing required fields, providing real-time basis for subsequent automatic error correction and preventing overall errors when the form is finally submitted, thus effectively improving the accuracy and success rate of filling.
[0100] S370. Based on the format error message, verify whether the format of the filled data is compliant; based on the content validity message, verify whether the content of the filled data is valid; based on the required field message, verify whether the filled data meets the requirements for filling the required fields.
[0101] In this embodiment, the filled data is validated item by item using three types of page feedback information. This accurately identifies issues such as non-compliant formatting, invalid content, and missing required fields, enabling real-time validation and problem localization during the filling process. This ensures that the filled data conforms to the form validation rules, providing a reliable basis for subsequent automatic error correction and compliant filling. For example, after the mobile phone number form item is filled in, a format error message appears on the page. From this message, the intelligent form filling system determines that the filled data consists only of letters and therefore does not conform to the mobile phone number format. After the ID card number form item is filled in, a content validity message appears on the page. From this message, the system determines that the number has too few digits and does not conform to the rules for a valid ID card number. When the form is submitted without the name filled in, a required field message appears on the page. From this message, the system determines that this form item has not been filled in and does not meet the required field filling requirements.
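As a non-limiting illustration, the three kinds of page feedback can be distinguished by classifying the text of each captured message. The keyword lists, the `Feedback` enumeration, and the function names are assumptions made for this sketch; a real page may signal validation results in other ways, and how the messages are captured is not shown.

```python
from enum import Enum


class Feedback(Enum):
    FORMAT_ERROR = "format_error"
    CONTENT_INVALID = "content_invalid"
    REQUIRED_MISSING = "required_missing"


# Hypothetical keyword lists; real pages word their messages differently.
KEYWORDS = {
    Feedback.REQUIRED_MISSING: ("required", "cannot be empty"),
    Feedback.FORMAT_ERROR: ("format",),
    Feedback.CONTENT_INVALID: ("invalid", "does not exist"),
}


def classify_feedback(message):
    """Map one page message to a feedback kind, or None if unrecognized."""
    text = message.lower()
    for kind, words in KEYWORDS.items():
        if any(word in text for word in words):
            return kind
    return None


def validate(messages):
    """Return the set of detected problems; an empty set means it passed."""
    kinds = (classify_feedback(message) for message in messages)
    return {kind for kind in kinds if kind is not None}


problems = validate([
    "Phone number format is incorrect",
    "This ID number is invalid",
    "Name is required",
])
print(sorted(kind.value for kind in problems))
# ['content_invalid', 'format_error', 'required_missing']
```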
[0102] S380. If the validation passes, the form is confirmed to be filled in; otherwise, adjust the filled data according to the page feedback information and re-execute the filling and validation until the validation passes.
[0103] Optionally, adjusting the fill data based on the page feedback information includes: if the format of the fill data is determined to be non-compliant based on the format error message, adjusting the format of the fill data; if the content of the fill data is determined to be invalid based on the content validity message, replacing it with valid fill data; if the required field message indicates that a required field has not been filled in, supplementing the corresponding fill data.
[0104] In this embodiment, based on real-time page feedback, fill data that is incorrectly formatted, invalid in content, or missing for a required field is corrected or supplemented in a targeted manner. This ensures that the fill data meets the validation rules of the web page form, thereby effectively avoiding form submission failure and improving the accuracy and automation of intelligent filling.
[0105] For example, if a format error message appears on the webpage after a mobile phone number is filled in, the original non-numeric format is adjusted to a numeric format that conforms to mobile phone number specifications. If a content validity message appears on the webpage after an ID card number is filled in, the invalid number is replaced with a valid ID card number from the user's pre-stored information database. If a required field message appears on the webpage, the missing required field is identified, and the corresponding data is retrieved from the user's pre-stored information database to fill it in, thus automatically adjusting the fill data.
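As a non-limiting illustration, the fill, validate, and adjust cycle can be sketched as a bounded loop. The function `page_validate` is a stand-in for the web page's own validation and is entirely hypothetical, including its 11-digit rule for mobile phone numbers. The `max_attempts` limit is also an assumption added so that the loop terminates when no stored data can satisfy the page. The phone number shown is a placeholder.

```python
import re


def page_validate(field, value):
    """Hypothetical stand-in for page validation; returns a feedback kind."""
    if not value:
        return "required"
    if field == "mobile_phone_number":
        if not value.isdigit():
            return "format"
        if len(value) != 11:  # assumed rule for this sketch
            return "content"
    return None  # validation passed


def adjust(field, value, feedback, stored):
    """Adjust the fill data according to the kind of page feedback."""
    if feedback == "format":
        return re.sub(r"\D", "", value)  # strip non-digit characters
    # Invalid content or missing required field: use the stored data.
    return stored.get(field, "")


def fill_with_correction(field, value, stored, max_attempts=3):
    """Re-fill and re-validate until validation passes or attempts run out."""
    for _ in range(max_attempts):
        feedback = page_validate(field, value)
        if feedback is None:
            return value
        value = adjust(field, value, feedback, stored)
    raise ValueError(f"could not produce a valid value for {field}")


stored = {"mobile_phone_number": "13800000000"}  # placeholder number

print(fill_with_correction("mobile_phone_number", "138-0000-0000", stored))
# 13800000000  (format adjusted)
print(fill_with_correction("mobile_phone_number", "12345", stored))
# 13800000000  (invalid content replaced with stored data)
print(fill_with_correction("mobile_phone_number", "", stored))
# 13800000000  (missing required field supplemented)
```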
[0106] The technical solution of this application embodiment, by combining the form item usage type and dynamic dependency relationship for data matching and filtering, can accurately obtain fill data that matches the actual content displayed in the current form, and fill it in an orderly manner according to the dynamic appearance sequence, effectively avoiding problems such as redundant fill content and disordered order, and improving fill adaptability and accuracy. At the same time, during the fill process, page feedback information is captured in real time, and the data format, content validity and required fields are checked item by item. Based on the check results, the data format is automatically adjusted, invalid data is replaced or required content is supplemented, realizing instant error correction, which significantly improves the success rate, stability and reliability of intelligent form fill.
[0107] Figure 4 is a schematic structural diagram of a browser-based intelligent form filling system based on multimodal understanding, provided as an embodiment of this application. As shown in Figure 4, this browser-based intelligent form filling system based on multimodal understanding includes:
[0108] The data acquisition module 410 is used to acquire visual screenshots and Hypertext Markup Language (HTML) codes of web page forms.
[0109] The form item recognition module 420 is used to determine the purpose type of the inherent form items in the web page form based on the visual screenshot and the hypertext markup language code, and to identify the trigger condition items from the inherent form items based on the purpose type.
[0110] The dependency construction module 430 is used to monitor user operations on the trigger condition items and identify new form items that appear due to the operations, and to determine the dynamic dependency relationship between the inherent form items and the new form items based on the trigger condition items, the new form items, and the dynamic occurrence sequence of the inherent form items and the new form items.
[0111] The data matching and filling module 440 is used to match the corresponding filling data from the user's pre-stored information database based on the usage type and dynamic dependency relationship of the inherent form items, and to complete the form filling according to the dynamic appearance sequence.
[0112] The validation and correction module 450 is used to capture the page feedback information triggered by the form filling operation in real time during the filling process, and to validate the filling result based on the page feedback information; if the validation passes, the form filling is confirmed to be complete; otherwise, the filling data is adjusted according to the page feedback information, and the filling and validation are re-executed until the validation passes.
[0113] The browser-based smart form filling system based on multimodal understanding provided in this application can execute the browser-based smart form filling method based on multimodal understanding provided in any embodiment of this application, and has the corresponding functional modules and beneficial effects of the execution method.
Claims
1. A browser-based intelligent form filling method based on multimodal understanding, characterized in that it comprises: obtaining a visual screenshot and Hypertext Markup Language (HTML) code of a web page form; based on the visual screenshot and the Hypertext Markup Language code, determining the purpose type of the inherent form items in the web page form, and identifying the trigger condition items from the inherent form items based on the purpose type; monitoring user operations on the trigger condition items and identifying new form items that appear due to the operations, and, based on the trigger condition items, the new form items, and the dynamic occurrence sequence of the inherent form items and the new form items, determining the dynamic dependency relationship between the inherent form items and the new form items; based on the usage type of the inherent form items and the dynamic dependency relationship, matching the corresponding fill data from the user's pre-stored information database, and completing the form filling according to the dynamic appearance sequence; during the filling process, capturing in real time the page feedback information triggered by the filling operation on the web page form, and verifying the filling result based on the page feedback information; if the validation passes, confirming that the form filling is complete; otherwise, adjusting the fill data according to the page feedback information, and re-executing the filling and validation until the validation passes.
2. The browser-based intelligent form filling method based on multimodal understanding according to claim 1, characterized in that, obtaining the visual screenshot and the Hypertext Markup Language code of the web page form includes: taking a screenshot of the page area where the web page form is located to obtain the visual screenshot; and obtaining, through the browser interface, the source code of the Document Object Model corresponding to the web page form, thereby obtaining the Hypertext Markup Language code.
3. The multi-modal understanding based browser intelligent form filling method of claim 1, wherein, The step of determining the purpose type of the inherent form items in the web page form based on the visual screenshot and the Hypertext Markup Language code includes: Identify the inherent form items and their text labels in the visual screenshot, and calculate the visual position features of the inherent form items; wherein, the visual position features include the relative position, alignment relationship and spatial distance between the inherent form items and the corresponding text labels; Extract the code semantic features of the inherent form items from the hypertext markup language code; wherein, the code semantic features include the name attribute and prompt text of the inherent form items; The semantic features of the code are parsed using a semantic model, and the purpose type of the inherent form item is determined based on the visual location features and the parsed semantic features of the code.
4. The multi-modal understanding based browser intelligent form filling method of claim 3, wherein, Determining the purpose type of the inherent form item based on the visual location features and the parsed code semantic features includes: If the visual position features do not conflict with the parsed code semantic features, then a semantic vector is generated based on the visual position features and the parsed code semantic features; If the visual position features conflict with the parsed code semantic features, a cross-attention mechanism is used to weight and fuse the visual position features and the parsed code semantic features to generate the semantic vector. The semantic vector is matched with predefined usage type features to determine the usage type of the inherent form item.
5. The browser-based intelligent form filling method based on multimodal understanding according to claim 1, characterized in that, monitoring user operations on the trigger condition items and identifying new form items that appear due to the operations, and determining the dynamic dependency relationship between the inherent form items and the new form items based on the trigger condition items, the new form items, and the dynamic occurrence sequence of the inherent form items and the new form items, includes: monitoring the user's operation on the trigger condition item, identifying the newly added web page element after the operation, and determining the newly added web page element as the newly added form item; recording the timestamp of each newly added form item to obtain the dynamic occurrence sequence of the newly added form items; constructing a directed dependency graph with the inherent form items and the newly added form items as nodes, and the triggering condition and the dynamic occurrence sequence as edges; wherein the directed dependency graph is used to reflect the dynamic dependency relationship between the inherent form items and the newly added form items.
6. The browser-based intelligent form filling method based on multimodal understanding according to claim 5, characterized in that, The step of identifying the newly added webpage element after the operation and determining the newly added webpage element as the newly added form item includes: The web page element newly inserted after the operation or the web page element that switches from a hidden state to a visible state is identified, and the web page element that meets the preset form item characteristics is determined as the new form item.
7. The browser-based intelligent form filling method based on multimodal understanding according to claim 1, characterized in that, The process of matching corresponding fill data from the user's pre-stored information database based on the inherent form item's usage type and the dynamic dependency relationship, and completing form filling according to the dynamic appearance sequence, includes: Based on the usage type of the inherent form item, search for data matching the usage type from the user's pre-stored information database; Based on the dynamic dependency relationship, the matched data is filtered to obtain the fill data; according to the dynamic occurrence sequence, the data is filled into each form item in sequence.
8. The browser-based intelligent form filling method based on multimodal understanding according to claim 1, characterized in that, During the filling process, the page feedback information triggered by the form filling operation is captured in real time, and the filling result is verified based on the page feedback information, including: During the process of filling in each form item of the web page form, the page feedback information triggered after each form item is filled in real time is obtained; wherein, the form item includes the inherent form item and the newly added form item, and the page feedback information includes format error prompts, content validity prompts, and required field prompts; Based on the format error message, verify whether the format of the filled data is compliant; based on the content validity message, verify whether the content of the filled data is valid; based on the required field message, verify whether the filled data meets the requirements for filling the required fields.
9. The browser-based intelligent form filling method based on multimodal understanding according to claim 8, characterized in that, Adjusting the fill data based on the page feedback information includes: If the format of the fill data is determined to be non-compliant based on the format error message, then the format of the fill data shall be adjusted. If the content of the fill data is determined to be invalid based on the content validity prompt, then it is replaced with valid fill data; If, based on the required field prompts, it is determined that there are any unfilled required fields, then the corresponding data should be filled in.
10. A browser-based intelligent form filling system based on multimodal understanding, characterized in that it comprises: a data acquisition module, used to acquire a visual screenshot and Hypertext Markup Language (HTML) code of a web page form; a form item recognition module, used to determine the purpose type of the inherent form items in the web page form based on the visual screenshot and the Hypertext Markup Language code, and to identify the trigger condition items from the inherent form items based on the purpose type; a dependency relationship building module, used to monitor the user's operation on the trigger condition item, identify the new form item that appears due to the operation, and determine the dynamic dependency relationship between the inherent form item and the new form item based on the trigger condition item, the new form item, and the dynamic occurrence sequence of the inherent form item and the new form item; a data matching and filling module, used to match the corresponding fill data from the user's pre-stored information database based on the usage type of the inherent form items and the dynamic dependency relationship, and to complete the form filling according to the dynamic appearance sequence; and a verification and correction module, used to capture in real time, during the filling process, the page feedback information triggered by the filling operation on the web page form, and to verify the filling result based on the page feedback information; if the validation passes, the form filling is confirmed to be complete; otherwise, the fill data is adjusted according to the page feedback information, and the filling and validation are re-executed until the validation passes.