Training method of GUI agent, task processing method, device and storage medium
By evaluating the contribution of actions during the training process of GUI agents using an action evaluation model, the problems of high training data annotation costs and sparse supervision information are solved, and efficient and stable GUI agent training is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU ALIBABA INT INTERNET IND CO LTD
- Filing Date
- 2026-02-13
- Publication Date
- 2026-06-12
AI Technical Summary
Existing GUI agent training methods suffer from high training data labeling costs, sparse supervision information, and sensitivity to environmental fluctuations, resulting in low training efficiency and poor performance.
The contribution of each action is evaluated using a pre-trained action evaluation model. By acquiring action sequences generated by GUI interaction, a multimodal language model is used to output candidate actions. The GUI agent is trained based on the score differences of the candidate actions to maximize the cumulative amount of advantage values.
It reduces the need for high-quality interactive trajectory data, overcomes the problem of sparse supervised information, avoids the impact of environmental fluctuations, and improves training efficiency and effectiveness.
Figure CN122197938A_ABST
Abstract
Description
Technical Field
[0001] This specification relates to the field of artificial intelligence technology, and in particular to training methods, task processing methods, devices, storage media and program products for GUI intelligent agents.
Background Technology
[0002] With the development of artificial intelligence technology, GUI (Graphical User Interface) agents, as intelligent models capable of simulating human interaction with graphical user interfaces, are widely used in various scenarios. GUI agents can accurately understand user-inputted task commands, identify various interactive elements in the GUI (such as buttons, input boxes, drop-down menus, etc.), generate reasonable sequences of interactive actions, and complete the target task, thereby replacing manual labor in performing tedious and repetitive GUI interaction operations and improving work efficiency.
[0003] The industry commonly uses multimodal language models as the basic model for GUI agents. By training multimodal language models, they are endowed with the ability to understand user commands, recognize interface elements, and generate interactive actions, thus obtaining a GUI agent. However, the training methods for GUI agents in related technologies suffer from problems such as high cost of training data annotation, sparse supervision information (feedback information is only obtained after a long sequence of actions is executed), and the optimization process being easily affected by environmental information. These issues lead to low training efficiency and unsatisfactory training results for GUI agents.
Summary of the Invention
[0004] In view of the above, this specification provides one or more embodiments of a GUI intelligent agent training method, task processing method, device, storage medium, and program product.
[0005] To achieve the above objectives, one or more embodiments of this specification provide the following technical solutions:
According to a first aspect of one or more embodiments of this specification, a method for training a graphical user interface (GUI) agent is provided, the GUI agent being used to interact with the GUI to complete a user-input task, the method comprising:
obtaining a first action sequence formed by interaction with the GUI;
inputting the state information of a single-step action in the first action sequence into a multimodal language model to predict multiple candidate actions under the state information, wherein the state information of each single-step action includes the task information of the task associated with the first action sequence, the GUI screenshot before the single-step action is executed, and the relevant information of the historical actions executed before the single-step action;
for each candidate action, inputting the state information and the candidate action into a pre-trained action evaluation model to predict the score of the candidate action, wherein the score characterizes the degree to which performing the candidate action under the state information contributes to the success of task execution;
determining the mean of the scores of the multiple candidate actions, and determining the advantage value of each candidate action based on the difference between the score of that candidate action and the mean, wherein the advantage value is positively correlated with the difference; and
training the multimodal language model with the goal of maximizing the cumulative advantage value of the multiple candidate actions, and using the trained multimodal language model as the GUI agent.
[0006] According to a second aspect of the embodiments of this specification, an electronic device is provided, comprising: a processor; and a memory used to store processor-executable instructions; wherein the processor, when executing the executable instructions, implements the method described in the first aspect.
[0007] According to a third aspect of the embodiments of this specification, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps of the method described in the first aspect.
[0008] According to a fourth aspect of the embodiments of this specification, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the method described in the first aspect.
[0009] The technical solutions provided in the embodiments of this specification may include the following beneficial effects: In the embodiments of this specification, to address the problem of sparse supervision information during training, an action evaluation model is pre-trained. This model evaluates the score of each action in a given state, representing the contribution of the action to the success of task execution in that state. During the training of the GUI agent, action sequences generated through interaction with the GUI are acquired. The state information of single-step actions in the action sequence is input into a multimodal language model, which outputs multiple candidate actions under that state information. The action evaluation model is then used to determine the score of each candidate action under that state information. Based on the difference between the score of each candidate action and the average score of the multiple candidate actions, the advantage value of each candidate action is determined; the larger the difference, the greater the advantage value of the candidate action. The multimodal language model is then trained with the goal of maximizing the cumulative advantage value of the multiple candidate actions to obtain the GUI agent.
[0010] The training method described above eliminates the need to acquire large amounts of high-quality interaction trajectory data, reducing annotation costs. Furthermore, since an action evaluation model can be used to determine the score of each action as supervision information, the problem of sparse supervision information can be overcome. Moreover, this training method does not require the model to perform actions and obtain feedback information in real time in the actual GUI environment, thus avoiding the impact of environmental fluctuations on model training and greatly improving the training efficiency and effectiveness of the model.
[0011] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this specification.
Description of the Drawings
[0012] Figure 1 is a schematic diagram illustrating the task of verifying certificate information through interaction with a GUI, provided in an exemplary embodiment.
[0013] Figure 2 is a flowchart of a training method for a GUI agent provided in an exemplary embodiment.
[0014] Figure 3 is a schematic diagram of a training method for a GUI agent provided in an exemplary embodiment.
[0015] Figure 4 is a flowchart of collecting training data provided in an exemplary embodiment.
[0016] Figure 5 is a flowchart of collecting training data provided in another exemplary embodiment.
[0017] Figure 6 is a schematic diagram of training an action evaluation model and a GUI agent based on a supervised fine-tuned multimodal language model, provided in another exemplary embodiment.
[0018] Figure 7 is a schematic diagram of the structure of an electronic device provided in an exemplary embodiment.
Detailed Description
[0019] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification as detailed in the appended claims.
[0020] It should be noted that the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification in other embodiments. In some other embodiments, the methods may include more or fewer steps than described in this specification. Furthermore, a single step described in this specification may be broken down into multiple steps in other embodiments; and multiple steps described in this specification may be combined into a single step in other embodiments.
[0021] The user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this specification are all information and data authorized by the user or fully authorized by all parties. The collection, use and processing of related data shall comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals shall be provided for users to choose to authorize or refuse.
[0022] GUI agents can simulate human interaction with a GUI. Their core function is to accurately understand user-inputted task commands, identify interactive elements in the GUI (such as buttons, input boxes, drop-down menus, etc.), generate a reasonable sequence of step-by-step interactive actions, and ultimately autonomously complete the target task. This effectively replaces manual labor in performing tedious, repetitive, and standardized GUI interactions, significantly improving work efficiency and reducing manual operating costs. Currently, GUI agents have been applied to various task scenarios, such as verifying merchant licenses, querying customs clearance status information throughout the entire process, and multilingual inspections by Account Executives (AEs).
[0023] Taking the verification of business licenses as an example, as shown in Figure 1, a user who performs this task manually might need to perform the following interactive operations: Step 1, enter the URL of the official website of the license verification department to access the official homepage; Step 2, manually enter the merchant name, unified social credit code, and other information; Step 3, click the query button to query the merchant's license information; Step 4, enter the URL to log in to the merchant's backend system; Step 5, click the view button to view the merchant's license information; Step 6, verify whether key fields such as the license validity period and business scope are consistent, record the verification results, and approve or reject the merchant's license review application. The entire process requires multiple interactive operations between the user and the GUI, such as entering the URL, scrolling down the page, clicking buttons, and entering the merchant name. When a GUI agent performs this task, it carries out these GUI interactions (such as clicking buttons and entering merchant information) on behalf of the user, which can greatly improve work efficiency and reduce errors.
[0024] Currently, the industry generally adopts multimodal language models as the basic model for GUI intelligent agents. By training multimodal language models, they are given the ability to understand user commands, recognize interface elements, and generate interactive actions, thus obtaining GUI intelligent agents.
[0025] The mainstream methods for training GUI agents include the following: (1) Imitation-based training methods, in which a multimodal language model imitates high-quality interaction trajectories. For example, a large amount of high-quality interaction trajectory data can be collected in advance as training samples. The training samples include the state information of each step in the interaction trajectory (e.g., GUI screenshots, task instructions, etc.), the interaction actions of each step (e.g., clicking, inputting, scrolling, etc.), and the reasoning process text. Then, the state information of each step is used as input, and the model predicts the corresponding action sequence and reasoning process text based on the state information. The training objective is to make the action sequence predicted by the model as consistent as possible with the actual action sequence, and the reasoning process text output by the model as consistent as possible with the actual reasoning process text. The model is trained in this way to obtain the GUI agent.
[0026] To endow multimodal language models with powerful GUI navigation capabilities, a large amount of high-quality interaction trajectory data is typically required for training. Unlike traditional image-text pairing data, GUI interaction trajectory data not only needs to fully capture the state information of each step, but also needs to record the corresponding reasoning process and specific interaction actions in detail to meet the needs of model training. The construction of such high-quality interaction trajectory data heavily relies on the participation of domain experts, who need to manually complete the interaction operations, record the reasoning process, and label the state information. The entire data organization, collection, and labeling process is time-consuming and labor-intensive, resulting in extremely high training data acquisition costs and making it difficult to acquire at scale, thus limiting the training effect and generalization ability of GUI agents.
[0027] (2) Reinforcement learning-based training methods. This involves using task interaction trajectory data and task execution results as training samples, and then using these results as supervisory information to iteratively optimize the model's parameters. However, in a GUI environment, completing a user task typically requires multiple rounds of continuous interaction actions, forming a long sequence of interaction trajectories. The feedback information (i.e., whether the task was successful or failed) can only be obtained after the entire action sequence has been completed, resulting in extremely sparse and delayed step-by-step supervisory information. This makes it difficult for the model to obtain effective supervision and guidance for each intermediate single-step action in the trajectory during training. It also makes it difficult to accurately determine the contribution of each single-step action to the final success of the task, thus hindering the optimization of the action selection logic for intermediate steps. This can easily lead to unreasonable intermediate actions, ineffective interactions, and other problems, affecting the task completion rate of the GUI agent.
[0028] Furthermore, the aforementioned training methods must be tightly coupled with the actual execution environment. That is, the model needs to execute actions and obtain feedback in real time within the actual GUI environment in order to update parameters. However, the actual GUI execution environment is usually non-static, and the environment state may change over time and with different operating scenarios. This causes the optimization process of reinforcement learning to be affected by environmental fluctuations, resulting in not only low training efficiency but also problems such as training oscillations and parameter non-convergence, making it difficult to stably train high-performance GUI agents.
[0029] Based on this, embodiments of this specification provide a training method for a GUI agent. To address the problem of sparse supervision information during training, embodiments of this specification pre-train an action evaluation model. This action evaluation model can be used to evaluate the score of each action in a given state, and the score can characterize the contribution of the action to the success of task execution in a given state. When training the GUI agent, action sequences generated through interaction with the GUI can be acquired. The state information of single-step actions in the action sequence is input into a multimodal language model, and the multimodal language model outputs multiple candidate actions under the given state information. Then, the action evaluation model is used to determine the score of each candidate action under the given state information, and the advantage value of each candidate action is determined based on the difference between the score of each candidate action and the average score of multiple candidate actions. The larger the difference, the greater the advantage value of the candidate action. Then, the multimodal language model can be trained with the goal of maximizing the cumulative advantage values of the multiple candidate actions to obtain the GUI agent. That is, the training objective of the model is to maximize the difference in scores between the candidate actions predicted by the model, so that the optimal action can be selected from them.
[0030] The training method described above eliminates the need to acquire large amounts of high-quality interaction trajectory data, reducing annotation costs. Furthermore, since an action evaluation model can be used to determine the score of each action as supervision information, the problem of sparse supervision information can be overcome. Moreover, this training method does not require the model to perform actions and obtain feedback information in real time in the actual GUI environment, thus avoiding the impact of environmental fluctuations on model training and greatly improving the training efficiency and effectiveness of the model.
[0031] The GUI agent training method described in this specification can be executed by various electronic devices, including but not limited to physical servers, server clusters, cloud servers, smartphones / mobile phones, tablet computers, personal digital assistants (PDAs), laptops, and desktop computers.
[0032] The GUI agent in the embodiments of this specification refers to an agent that can understand user instructions, recognize GUI interface interaction elements, and execute GUI interface interaction actions to complete the tasks assigned by the user.
[0033] The GUI agent training method of the embodiments of this specification is described below with reference to Figure 2 and Figure 3. As shown in Figure 2, the method may include the following steps:
S202. Obtain the first action sequence formed by interaction with the GUI.
In step S202, the first action sequence can be a sequence of interactive actions with the GUI during the execution of a target task. It can be formed by user interaction with the GUI or by interaction between a multimodal language model and the GUI. It records, in order, all consecutive single-step interactive actions with the GUI performed while completing a task, with each action corresponding to one interaction with a GUI element (such as clicking a button, entering text, or scrolling the page). For example, the action sequence could be: enter a URL - click button A - scroll the page - check option B.
[0034] The first action sequence can be single or multiple. For example, in a scenario where there are multiple first action sequences, each action sequence can correspond to a target task. This action sequence is the action sequence formed by interacting with the GUI during the execution of the target task.
[0035] S204. Input the state information of the single-step action in the first action sequence into the multimodal language model to predict multiple candidate actions under the state information. The state information of each single-step action includes the task information of the task associated with the first action sequence, the GUI screenshot before the single-step action is executed, and the relevant information of the historical actions executed before the single-step action.
In step S204, each action in the first action sequence is associated with state information, which describes the state of the GUI environment when the action is executed. The state information may include the task information of the task associated with the first action sequence, which defines the core objective of the task and provides a benchmark for assessing the value of actions; examples include the task type (e.g., "query the validity of a merchant's license"), the URL corresponding to the task, and the user's requirements for the task (e.g., the output format). The state information may also include a GUI screenshot taken before the action is executed. This screenshot is a complete, pixel-level image of the interface that presents the visual information of the current interaction scene, so that the model can recognize interactive elements. The state information may further include relevant information about previously executed historical actions, such as the type of each historical action, the reasoning process text corresponding to it, the GUI screenshot corresponding to it, and the position in the GUI of the interactive element it acted on, so that the complete interaction context is available.
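By way of non-limiting illustration, the state information described above could be held in a simple record. The sketch below is only one possible layout; the class names `StepState` and `HistoricalAction`, all field names, and the file names are hypothetical and do not appear in the embodiments.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class HistoricalAction:
    """One previously executed action (field names are illustrative)."""
    action_type: str                                   # e.g. "click", "input", "scroll"
    reasoning_text: str = ""                           # reasoning text recorded for the action
    screenshot_path: Optional[str] = None              # GUI screenshot tied to the action
    element_position: Optional[Tuple[int, int]] = None # position of the interactive element


@dataclass
class StepState:
    """State information associated with a single-step action."""
    task_info: str                                     # task type, URL, output requirements
    screenshot_path: str                               # screenshot before the action is executed
    history: List[HistoricalAction] = field(default_factory=list)


# Hypothetical example values.
state = StepState(
    task_info="query the validity of a merchant's license",
    screenshot_path="step_003.png",
    history=[
        HistoricalAction("input", "enter the official site URL"),
        HistoricalAction("click", "open the query page", element_position=(120, 48)),
    ],
)
print(len(state.history), state.history[-1].action_type)
```

Keeping the three components as separate fields mirrors the description: task information, the pre-action screenshot, and the historical-action information can each be included or truncated independently.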
[0036] Then, one or more single-step actions can be extracted from the first action sequence. The state information corresponding to each single-step action can be used as a training sample for the multimodal language model. For example, as shown in Figure 3, the state information of each single-step action can be input into the multimodal language model. The multimodal language model combines its visual understanding ability (analyzing GUI screenshots) and its text parsing ability (understanding task information, historical actions, etc.) to predict and generate multiple candidate actions (e.g., candidate action 1 to candidate action K) based on the state information. The candidate actions cover multiple reasonable interactive actions under the state information.
[0037] S206. For each candidate action, the state information and the candidate action are input into a pre-trained action evaluation model to predict the score of the candidate action. The score is used to characterize the degree of contribution of performing the candidate action under the state information to the success of the task.
In step S206, an action evaluation model can be trained in advance using historical interaction trajectory data (such as state-action sequences) generated through GUI interaction. Based on the input single-step state information and a candidate action, and in light of the task objective, the action evaluation model quantifies the contribution of the candidate action to the final success of the task in the current scenario and outputs a score representing this contribution. For example, the higher the score, the greater the contribution. As shown in Figure 3, for each candidate action, the state information and the candidate action can be input into the pre-trained action evaluation model to predict the score of the candidate action. During evaluation, the action evaluation model can use the task information to determine whether the candidate action fits the task objective, use the GUI screenshot to determine whether the candidate action is executable, and use the historical action information to determine whether the candidate action conforms to the interaction logic, and then combine these judgments into a score. In some scenarios, the basis for the score can also be recorded to facilitate subsequent optimization and adjustment.
[0038] By providing precise value quantification standards for each candidate action, the contribution of each step to task success is clearly defined. This allows for supervision and guidance of intermediate steps without relying on the final task feedback, thus solving the problem of sparse supervision information.
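As a rough sketch of the scoring step in S206, each candidate action can be passed, together with the shared state information, to a scoring function. The real action evaluation model is a trained multimodal model; here `toy_evaluator` is a purely hypothetical stand-in (simple word overlap) used only so the example runs, and the function names and dictionary keys are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (state information, candidate action) -> score.
Evaluator = Callable[[Dict[str, str], str], float]


def score_candidates(state: Dict[str, str], candidates: List[str],
                     evaluator: Evaluator) -> List[float]:
    """Score every candidate action under the same state information."""
    return [evaluator(state, action) for action in candidates]


def toy_evaluator(state: Dict[str, str], action: str) -> float:
    """Stand-in scorer: 1.0 if the action shares a word with the task text."""
    task_words = set(state["task_info"].lower().split())
    return 1.0 if task_words & set(action.lower().split()) else 0.0


state = {"task_info": "query license"}
candidates = ["click query button", "scroll down", "close window"]
scores = score_candidates(state, candidates, toy_evaluator)
print(list(zip(candidates, scores)))
```

Because every candidate receives its own score, supervision is available at each step rather than only at the end of the trajectory.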
[0039] In some embodiments, after obtaining scores for multiple candidate actions using an action evaluation model, each candidate action and its score can be recorded in a log file for easy subsequent analysis and backtracking.
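One possible way to record candidate actions and their scores is one JSON object per line. This is only an illustrative sketch: the record keys (`step`, `action`, `score`), the function name `log_scores`, and the JSON Lines layout are assumptions, and an in-memory buffer stands in for the log file.

```python
import io
import json
from typing import List, TextIO


def log_scores(log: TextIO, step_id: int, candidates: List[str],
               scores: List[float]) -> None:
    """Append one JSON line per (candidate action, score) pair."""
    for action, score in zip(candidates, scores):
        record = {"step": step_id, "action": action, "score": score}
        log.write(json.dumps(record) + "\n")


# An in-memory buffer is used here; a real log would be an opened file.
buffer = io.StringIO()
log_scores(buffer, 3, ["click query button", "scroll down"], [0.9, 0.2])
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```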
[0040] S208. Determine the mean of the scores of the plurality of candidate actions, and determine the advantage value of each candidate action based on the difference between the score of each candidate action and the mean score, wherein the advantage value is positively correlated with the difference.
In step S208, after obtaining the scores of multiple candidate actions, the multimodal language model can be trained based on these scores. The training objective is to widen the gap between the scores of the candidate actions predicted by the model, so that the optimal action can be selected. Specifically, the mean of the scores of all candidate actions under the same single-step state information can be calculated, and then the difference between the score of each candidate action and the mean can be calculated. Based on this difference, the advantage value of each candidate action is obtained, where the larger the difference, the larger the advantage value.
[0041] In some embodiments, in order to eliminate the influence of differences in scoring scales under different tasks and states and to ensure the comparability of advantage values, the differences can be normalized to obtain the advantage value. For example, the ratio of the above-mentioned difference of each candidate action to the standard deviation of the scores of the group of candidate actions can be used as the advantage value of each candidate action.
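The mean-difference computation of S208 and the standard-deviation normalization described above can be sketched as follows. The function name `compute_advantages`, the use of the population standard deviation, and the small `eps` term guarding against division by zero are assumptions made for this example; the scores are made up.

```python
from statistics import mean, pstdev
from typing import List


def compute_advantages(scores: List[float], normalize: bool = True,
                       eps: float = 1e-8) -> List[float]:
    """Advantage of each candidate relative to the group mean score."""
    mu = mean(scores)
    diffs = [s - mu for s in scores]            # score minus group mean
    if not normalize:
        return diffs
    sigma = pstdev(scores)                      # population std (assumption)
    return [d / (sigma + eps) for d in diffs]   # eps avoids division by zero


scores = [1.0, 0.0, 0.5, 0.5]                   # made-up evaluator scores
raw = compute_advantages(scores, normalize=False)
normalized = compute_advantages(scores)
print(raw)
print(normalized)
```

Candidates scoring above the group mean receive positive advantage values and those below receive negative ones; when all candidates score the same, every advantage value is zero and the step contributes no training signal.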
[0042] By quantifying the advantage value, we can accurately capture the value difference of candidate actions in a single step, providing an optimization reference for subsequent model training.
[0043] S210. The multimodal language model is trained with at least maximizing the cumulative advantage value of each of the multiple candidate actions as the training objective, and the trained multimodal language model is used as the GUI agent.
[0044] In step S210, the multimodal language model is trained with the goal of maximizing the cumulative advantage value of the multiple candidate actions. For example, the cumulative value can be obtained by directly summing the advantage values of the candidate actions. Alternatively, in actual model training, a different weight can be assigned to the advantage value of each candidate action, and the cumulative value is then the weighted sum of the advantage values of the multiple candidate actions.
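The embodiments do not spell out the exact loss function. One common way to turn "maximize the cumulative advantage value" into a differentiable objective is an advantage-weighted log-likelihood, as used in policy-gradient methods; the sketch below assumes that form, so it illustrates one possible choice rather than the method of this specification. The function name, the per-candidate weights, and all numbers are hypothetical.

```python
import math
from typing import List, Optional


def surrogate_objective(log_probs: List[float], advantages: List[float],
                        weights: Optional[List[float]] = None) -> float:
    """Weighted sum of advantage * log-probability over the candidates.

    log_probs[i] is the model's log-probability of candidate i under the
    state information; weights default to 1.0 (plain summation).
    """
    if weights is None:
        weights = [1.0] * len(advantages)
    return sum(w * a * lp for w, a, lp in zip(weights, advantages, log_probs))


log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]  # made-up model outputs
advantages = [1.0, 0.0, -1.0]                              # made-up advantage values
objective = surrogate_objective(log_probs, advantages)
loss = -objective   # a gradient-descent optimizer would minimize this
print(objective, loss)
```

Under this assumed objective, raising the probability of candidates with positive advantage values and lowering it for those with negative values increases the objective.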
[0045] In some embodiments, the above-mentioned action evaluation model can be trained in the following way: First, a second action sequence formed by interaction with the GUI can be obtained. The second action sequence can include multiple sequences, each corresponding to a task. This second action sequence can originate from the interaction trajectory between the multimodal language model and the GUI, or from the interaction trajectory generated by the user's interaction with the GUI. It can completely record all single-step interaction actions in the process of completing a task and is associated with the final execution result (success or failure) of the task. Each single-step action and its corresponding state information can be extracted one by one from the second action sequence. The state information includes the task information of the task associated with the second action sequence, the GUI screenshot before executing the single-step action, and relevant information of historical actions executed before executing the single-step action. The extracted single-step actions and their corresponding state information are used as input data for the multimodal language model. At the same time, the true score of the single-step action can be determined based on the final execution result of the task associated with the second action sequence as supervision information. The multimodal language model is then trained under targeted supervision. By iteratively optimizing the model parameters, the model can accurately learn the correlation between single-step actions and task contribution (i.e., score), thus obtaining an action evaluation model with accurate scoring capabilities.
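The extraction of single-step actions and their state information from a second action sequence can be sketched as below. The function name `extract_steps`, the dictionary keys, and the assumption that one screenshot is stored per step are illustrative only.

```python
from typing import Dict, List


def extract_steps(task_info: str, actions: List[str],
                  screenshots: List[str]) -> List[Dict]:
    """Pair every single-step action with the state observed before it."""
    samples = []
    for i, action in enumerate(actions):
        samples.append({
            "task_info": task_info,
            "screenshot": screenshots[i],   # screenshot before step i is executed
            "history": actions[:i],         # actions executed before step i
            "action": action,
        })
    return samples


# Hypothetical trajectory for one task.
actions = ["enter URL", "click button A", "scroll page"]
screenshots = ["s0.png", "s1.png", "s2.png"]
samples = extract_steps("verify license", actions, screenshots)
print(samples[2]["history"])
```

Each resulting sample is then paired with a true score derived from the final execution result of the task, which serves as the supervision information.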
[0046] In determining the true score of a single-step action, the true score of each single-step action in the second action sequence can be derived backward from the final execution result of the task associated with the second action sequence. For example, if the task is ultimately executed successfully, single-step actions that propel the task forward and conform to the task objective are assigned higher true scores, while single-step actions that have no practical effect or hinder task progress are assigned lower true scores. If the task ultimately fails, the true score of each single-step action is determined backward based on the specific reasons for the failure, assigning low scores to key single-step actions that led to the failure and medium scores to single-step actions that have no significant impact. Of course, the rules for assigning the true scores of single-step actions can be flexibly set based on actual needs, and the embodiments in this specification do not impose any limitations.
[0047] By using the final execution result of the task to back-calibrate the true score of each step action, the cost of obtaining supervision information and the workload of annotation are reduced, effectively solving the pain point of difficulty in obtaining supervision information. At the same time, the true score is strongly correlated with the actual execution effect of the task, which improves the accuracy of supervision information and ensures that the trained action evaluation model can accurately quantify the contribution of each candidate action to the success of task execution. This provides reliable scoring support for the calculation of advantage value and model optimization in the training process of GUI agent, further improving the feasibility, efficiency and stability of the entire GUI agent training scheme.
[0048] When back-calibrating the true scores of individual actions in the second action sequence based on the task execution results, it is considered that analyzing the specific impact of each action on task progress and back-judging the value of actions in conjunction with the reasons for task failure would not only be cumbersome and inefficient, but also prone to inconsistencies in scoring due to inconsistent analysis standards and subjective judgment biases, thus affecting the reliability of supervision information and the training efficiency of the action evaluation model. Therefore, in some embodiments, a binary classification scoring mechanism can be adopted. If the execution result of the task associated with the second action sequence is successful, then all individual actions in the trajectory are deemed to have a positive supporting effect on task success; therefore, the true score of each action in the second action sequence is set to a preset first score. If the execution result of the task associated with the second action sequence is unsuccessful, then all individual actions in the trajectory as a whole have failed to achieve the task objective; therefore, the true score of each action in the second action sequence is set to a preset second score, where the first score is greater than the second score. For example, if the task is successful, the true score of each individual action in the second action sequence is set to 1; if the task fails, the true score of each individual action in the second action sequence is set to 0.
[0049] The above method simplifies the calibration of the true scores of single-step actions, and it has been verified that an action evaluation model trained with this calibration method has superior performance.
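As a minimal illustrative sketch of the binary scoring mechanism described above, the labeling step could look as follows. The `Step` container, the function name `label_trajectory`, and the default scores of 1 and 0 are assumptions made for illustration; the specification only requires that the first score be greater than the second score.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One single-step action with its state information (hypothetical container)."""
    state: dict   # task information, screenshot reference, recent history
    action: str   # description of the single-step action


def label_trajectory(steps: List[Step], task_succeeded: bool,
                     first_score: float = 1.0,
                     second_score: float = 0.0) -> List[Tuple[Step, float]]:
    """Give every step the same true score, decided only by the task outcome."""
    if first_score <= second_score:
        raise ValueError("first_score must be greater than second_score")
    score = first_score if task_succeeded else second_score
    return [(step, score) for step in steps]
```

Each `(step, score)` pair can then serve as one supervised training example for the action evaluation model.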
[0050] Considering that the state information of a single step in the second action sequence includes information about all previous historical interactions, and that completing a task in a GUI interaction scenario often requires multiple rounds of continuous interaction, resulting in excessively long historical interaction information, the action evaluation model, when processing this state information, struggles to focus on the core context of the current interaction and is easily interfered with by redundant early historical interaction information, leading to inaccurate predicted scores. To address this issue, some embodiments provide a state information optimization scheme based on a sliding window strategy. When training the action evaluation model, the state information of a single step includes the task information of the task associated with the second action sequence, a GUI screenshot before executing the action, and information about the previous N (N is an integer) historical actions of the action—that is, information about the most recent N historical actions of the action, rather than all historical interaction information.
[0051] To address the difficulty in focusing model context caused by excessively long interaction history, a sliding window strategy is specifically adopted. This strategy limits the historical interaction content contained in a single-step state information to the most recent few steps. N can be flexibly adjusted according to task complexity and model processing capabilities, for example, set to the most recent 2 or 5 steps. The sliding window is dynamically updated according to the action execution order, retaining only the key historical interaction information closest to the current action, while simultaneously removing redundant early historical interaction content. This ensures that the state information contains necessary context references while avoiding information redundancy.
[0052] By using a sliding window strategy to eliminate redundant historical information, the burden of context processing on the model is greatly reduced, enabling the action evaluation model to accurately capture the core context and key information of the current interaction and avoid evaluation bias caused by redundant information interference.
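The sliding window strategy can be sketched as follows. This is only an illustrative sketch: the function name `build_state`, the dictionary layout, and the representation of screenshots and actions as strings are assumptions, not part of the specification.

```python
from typing import Dict, List


def build_state(task: str, screenshot: str, history: List[str], n: int = 2) -> Dict:
    """Build single-step state information that keeps only the latest n historical actions."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    # history[-0:] would return the whole list, so n == 0 is handled explicitly.
    recent = history[-n:] if n > 0 else []
    return {"task": task, "screenshot": screenshot, "history": recent}
```

Calling `build_state` again after each executed action moves the window forward, so earlier actions drop out of the state information automatically.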
[0053] In some embodiments, to train the action evaluation model and the GUI agent, a training dataset can be pre-constructed. For example, a multimodal language model can be used to execute a specified GUI interaction task. During the task execution, a complete action sequence formed by the interaction with the GUI is recorded, along with the state information corresponding to each single action in the action sequence (task information of the task associated with the action sequence, a screenshot of the GUI before executing the single action, and relevant information about the historical interaction actions before the action), the reasoning process text corresponding to each single action, and the final execution result of the task (success or failure). This serves as a training data point. The reasoning process text records the model's logical thinking process when executing the action (e.g., "To advance the task execution, the query button in the current GUI interface needs to be clicked to obtain relevant data"), which helps the model learn reasonable interaction logic.
[0054] The first action sequence and / or the second action sequence can be obtained from the constructed training dataset. By using a multimodal language model to interact with the GUI to construct the training dataset, the process of obtaining training data can be simplified, thereby obtaining a rich training dataset.
[0055] In related technologies, the construction of datasets required for training GUI agents commonly faces the problem of unstable operating environments. Influenced by factors such as network load limitations and system configuration differences, the collection of interaction trajectory data is prone to various environmental interference problems, including human-machine verification, website inaccessibility, and browser crashes. Such environmental interference introduces a large amount of task-irrelevant noise into the collected training data, failing to accurately reflect the interactive capabilities of the multimodal language model and hindering the model from capturing task-specific interaction patterns. Furthermore, task execution failures caused by environmental interference are usually directly attributed to insufficient model performance, interfering with the judgment of model training effectiveness and thus affecting the training quality of the GUI agent and action evaluation model.
[0056] To address this issue, some embodiments propose a training data collection scheme based on a "detect-rerun" strategy, which can effectively eliminate the negative impact of environmental interference. For example, as shown in Figure 4, for any GUI-related task, prompts can be constructed based on the task information (such as task objectives and execution requirements). These prompts guide the multimodal language model to execute the task. During task execution, the model synchronously outputs the complete action sequence formed by interaction with the GUI, the state information corresponding to each single action in the action sequence, and the reasoning process text for each action, and finally outputs the execution result (success or failure) of the specified task, ensuring the completeness of the core information required for each training data point. Then, a pre-trained environmental interference judgment model (which can be a pre-trained expert model with accurate environmental interference recognition capabilities) can be called. The state information of each action in the output action sequence, the reasoning process text for each action, and the execution result of the task are used as input. The environmental interference judgment model determines whether the multimodal language model was interfered with by task-irrelevant factors such as the network environment and system operation during the execution of the task. If the judgment result is that there is no interference, the output interaction trajectory (state-action sequence) truly reflects the model's interaction capability and contains no redundant noise. The action sequence, the corresponding single-step state information, the reasoning process text for each action, and the task execution result are then integrated into one structured training data point and stored in the training dataset. If the judgment result is that there is interference, the output data contains noise and cannot be used as valid training data. In this case, the multimodal language model is used to execute the task again (for example, the prompts can be regenerated or a different execution time can be chosen, so that the rerun is not affected by the earlier interference factors) until valid data without interference is obtained and stored in the training dataset.
[0057] By employing the "detect-rerun" strategy, the negative impact of environmental interference can be effectively eliminated, ensuring the validity and purity of training data, improving the training accuracy and stability of GUI agents and action evaluation models, and reducing the impact of environmental factors on the overall training process.
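The detect-rerun loop can be sketched as follows. This is a minimal sketch under stated assumptions: `run_task` stands in for one execution of the task by the multimodal language model, `is_interfered` stands in for the environmental interference judgment model, and `max_attempts` is a hypothetical safety cap that the specification does not prescribe (it describes rerunning until valid data is obtained).

```python
from typing import Callable, Optional


def collect_with_rerun(run_task: Callable[[], dict],
                       is_interfered: Callable[[dict], bool],
                       max_attempts: int = 3) -> Optional[dict]:
    """Run a task and rerun it while the judgment model reports interference."""
    for _ in range(max_attempts):
        record = run_task()            # action sequence, states, reasoning text, result
        if not is_interfered(record):  # clean trajectory: keep it as training data
            return record
    return None                        # every attempt was affected by the environment
```

A returned record would be stored in the training dataset; a `None` result means the task produced no usable data within the allowed number of attempts.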
[0058] In related technologies, when constructing training datasets, trajectory data generated by multimodal language models performing tasks are often directly included in the training set without screening for task difficulty or the training value of the trajectories. This easily leads to the inclusion of trajectories from extremely easy or extremely difficult tasks. Trajectories from such extreme tasks (all successful or all unsuccessful) show highly consistent results in repeated executions and fail to provide the model with effective directions for interactive exploration. They contribute very little to enhancing the model's exploration capabilities and optimizing the model's policy, while also consuming training resources, interfering with the model's training direction, and causing unstable model optimization.
[0059] To address the aforementioned issues, some embodiments provide a data filtering scheme based on hierarchical data training. As shown in Figure 5, for any GUI-related task, the multimodal language model is controlled to execute the task repeatedly (in practice, the number of repetitions can be set to n=8, which can be flexibly adjusted according to task complexity and training requirements). During each execution, the action sequence formed by the interaction between the model and the GUI is recorded synchronously, along with the state information corresponding to each single-step action in the action sequence, the reasoning process text for each action, and the execution result of the task. After all repetitions are completed, the ratio of the number of successful executions of the task to the total number of executions (i.e., the success rate) is calculated. A reasonable lower limit and upper limit of the success rate can be preset to define task trajectories with effective training value. The lower limit is used to eliminate extremely difficult tasks (the success rate is too low, with almost all executions failing), and the upper limit is used to eliminate extremely easy tasks (the success rate is too high, with almost all executions succeeding). If the calculated success rate is higher than the preset lower limit but lower than the preset upper limit, the task has moderate difficulty and a certain exploratory value, and can provide effective support for model optimization. In this case, the complete data formed in one execution of the task (including the action sequence of interaction with the GUI, the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task) is selected, integrated into one training data point, and stored in the training dataset.
[0060] For example, the repeated execution results of a single task can be divided into different levels (e.g., when n=8 times, Level 1 corresponds to 1 successful completion of the task out of 8 executions). The trajectories of Level 0 (8 failures, extremely difficult task) and Level 8 (8 successes, extremely easy task) have completely identical execution results, which makes it impossible for the model to learn the value differences of different interaction actions and contributes very little to enhancing the model's exploration ability. Therefore, they are removed from the training set by proportion selection, and only the task trajectories of intermediate difficulty are retained.
[0061] The above solution effectively solves the problems of extreme difficulty task trajectories interfering with model training, low value of training data, and unstable model optimization. It achieves precise stratification and screening of training data, eliminates extreme trajectories with no exploration value, retains effective data with moderate difficulty and training value, and improves the overall quality of the training dataset.
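The success-rate screening can be sketched as follows. The function name `keep_task` and the default limits of 0 and 1 are assumptions for illustration; with these defaults, only the all-failure and all-success cases are removed, and stricter limits can be supplied as needed.

```python
from typing import List


def keep_task(results: List[bool], lower: float = 0.0, upper: float = 1.0) -> bool:
    """Keep a task only if its success rate lies strictly between the two limits."""
    if not results:
        raise ValueError("at least one execution result is required")
    success_rate = sum(results) / len(results)
    return lower < success_rate < upper
```

For n=8 repetitions, `keep_task([True] + [False] * 7)` keeps a Level 1 task, while a task with eight failures or eight successes is discarded.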
[0062] In some embodiments, as shown in Figure 6, the base model of both the action evaluation model and the GUI agent is a supervised fine-tuned multimodal language model. For example, multiple sample data points can be acquired. Each sample data point contains a standard action sequence formed through GUI interaction (annotated or verified by domain experts or expert models to ensure the rationality and effectiveness of the interaction logic), state information corresponding to each single-step action in the standard action sequence (fitting the GUI interaction scenario and providing scenario reference for the model), and standard reasoning process text corresponding to each action (annotated by domain experts or expert models, clearly recording the logical basis of action execution and helping the model learn reasonable interaction reasoning methods). Then, the state information of each action in each sample data point is used as input to the multimodal language model. The model outputs the corresponding predicted action sequence and, at the same time, the predicted reasoning process text for each action in the predicted action sequence. Based on the differences between the predicted action sequence and the standard action sequence, and the differences between the predicted reasoning process text and the standard reasoning process text, a loss function can be constructed and backpropagated to adjust the model parameters of the multimodal language model. Iterative optimization continues until the model's prediction results stabilize, completing the supervised fine-tuning of the multimodal language model. The supervised fine-tuned multimodal language model possesses certain web navigation capabilities and can serve as the base model for the action evaluation model and the GUI agent. Then, the supervised fine-tuned multimodal language model can be trained using the aforementioned second action sequence to obtain the action evaluation model. Finally, the supervised fine-tuned multimodal language model can be trained using the aforementioned first action sequence and the supervision information output by the action evaluation model to obtain the GUI agent.
[0063] For a given piece of state information, if the multiple candidate actions predicted by the multimodal language model based on that state information are completely the same (lacking diversity), or the scores of the candidate actions are highly concentrated (extremely low standard deviation), the state information cannot provide the model with effective action comparison or an optimization direction, and is equivalent to an invalid training sample. This not only wastes training resources and reduces training efficiency, but also interferes with the optimization process, making it difficult for the model to learn the value differences of different actions. To address these problems, in some embodiments, an action-level reward (i.e., score) filtering strategy is introduced, and a state information validity judgment is added before training. The state information of a single action is used for subsequent model training only when it is determined to be valid state information. The state information is determined to be valid when both of the following conditions are met: (1) there are at least two different candidate actions among the generated candidate actions, that is, the candidate actions have a certain degree of diversity and can provide the model with a basis for action comparison and optimization; (2) the standard deviation of the scores of the candidate actions is greater than a preset threshold, that is, the value differences among the candidate actions are significant rather than highly concentrated, which allows the model to clearly capture the relative advantages and disadvantages of different actions. If either of the two conditions is not met, i.e., all candidate actions are exactly the same, or the scores of the actions are highly concentrated (standard deviation less than or equal to the preset threshold), the state information is determined to be invalid state information and the sample is filtered out.
[0064] By using action-level reward filtering strategies and state information validity determination, we can avoid the problems of wasting training resources, interfering with model optimization, and causing low training efficiency due to invalid state information.
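The validity judgment can be sketched as follows. This is an illustrative sketch only: the function name `is_valid_state`, the threshold value of 0.05, the use of the population standard deviation, and the comparison of candidate actions as strings are all assumptions not fixed by the specification.

```python
import statistics
from typing import List


def is_valid_state(candidate_actions: List[str], scores: List[float],
                   std_threshold: float = 0.05) -> bool:
    """Return True only if the candidates are diverse and their scores are spread out."""
    if len(candidate_actions) != len(scores) or len(scores) < 2:
        raise ValueError("need one score per candidate and at least two candidates")
    diverse = len(set(candidate_actions)) >= 2            # condition (1)
    spread = statistics.pstdev(scores) > std_threshold    # condition (2)
    return diverse and spread
```

State information for which `is_valid_state` returns `False` would be filtered out before training.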
[0065] In some embodiments, when training a multimodal language model with the training objective of maximizing the cumulative advantage values of multiple candidate actions, considering the differences in difficulty between different tasks, applying uniform training weights to all tasks would lead to unreasonable resource allocation during training, resulting in insufficient generalization performance of the model. Therefore, the difficulty coefficient of the task associated with the first action sequence can be determined first. This difficulty coefficient is negatively correlated with the average score of multiple candidate actions; that is, the lower the average score of multiple candidate actions, the more difficult it is to generate high-quality actions in the current task, the higher the task difficulty, and the larger the corresponding difficulty coefficient. Conversely, the higher the average score, the lower the task difficulty, and the smaller the difficulty coefficient. This setting can accurately quantify the differences in task difficulty. Then, based on the determined difficulty coefficient and the cumulative advantage value, a target loss can be constructed and determined. The target loss is positively correlated with the difficulty coefficient and negatively correlated with the cumulative advantage value. That is, the higher the difficulty coefficient (the more difficult the task), the higher the target loss weight; the larger the cumulative advantage value (the more obvious the action advantage), the lower the target loss, and the smaller the adjustment range of the model parameters, ensuring that the optimization direction aligns with the core objective. Based on the constructed target loss, the model parameters of the multimodal language model can be iteratively adjusted through the backpropagation algorithm to continuously optimize the model's action generation logic and complete the targeted training of the multimodal language model.
[0066] By accurately quantifying the difficulty of tasks with a difficulty coefficient and combining it with an adaptive weighting strategy, training resources are allocated reasonably, allowing the model to focus on more difficult tasks and improve the quality of action generation under difficult tasks.
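As a small illustration of a difficulty coefficient that is negatively correlated with the average score, the sketch below uses one minus the mean score. This particular mapping, the assumption that scores lie in [0, 1], and the function name `difficulty_coefficient` are hypothetical choices; any mapping that decreases as the average score increases would fit the description above.

```python
from statistics import mean
from typing import List


def difficulty_coefficient(scores: List[float]) -> float:
    """Hypothetical mapping: lower average score means a larger coefficient.

    Assumes the scores of the candidate actions lie in [0, 1].
    """
    if not scores:
        raise ValueError("at least one candidate score is required")
    return 1.0 - mean(scores)
```

Under this sketch, a task whose candidate actions score 0.2 and 0.4 receives a larger coefficient than a task whose candidate actions score 0.8 and 1.0, so the harder task contributes more to the target loss.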
[0067] In some embodiments, to avoid training oscillations and overly aggressive model parameter updates when training a GUI agent, an importance sampling ratio can be introduced as a constraint in the training objective. For example, the multimodal language model, while predicting and generating multiple candidate actions, simultaneously outputs the reasoning process text corresponding to each candidate action. For each candidate action, a weight for its advantage value can be determined based on the importance sampling ratio of each token in the policy text corresponding to that candidate action. The policy text corresponding to each candidate action includes the description text of the candidate action and the reasoning process text corresponding to that candidate action. The importance sampling ratio is essentially the ratio of the probability with which the multimodal language model generates a token of the policy text in the current training round to the probability with which the old policy from the previous training round generated the same token, for the same state-action pair. For example, during the Nth training round, the probability of generating the "click" token as the j-th token of the i-th candidate action is 30%. During the (N+1)th training round, this probability increases to 60%. Therefore, the importance sampling ratio of the j-th token in the i-th candidate action is 60% / 30% = 2. For each candidate action, the cumulative sum of the importance sampling ratios of all tokens in the policy text of that candidate action can be used as the weight of the candidate action's advantage value. In some embodiments, to ensure the stability of the training process, the importance sampling ratio of each token can first be clipped to limit its fluctuation range, and then the cumulative sum of the clipped importance sampling ratios can be used as the weight.
[0068] In some embodiments, the target loss can be determined by simultaneously combining the advantage value of each candidate action, the task difficulty coefficient, and the importance sampling ratio. The multimodal language model is then trained based on this target loss to obtain the GUI agent. The target loss is positively correlated with the task difficulty coefficient and negatively correlated with the advantage value and the importance sampling ratio.
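The ratio computation in the example above, together with the clipping step, can be sketched as follows. The function names and the clipping range of 0.2 are assumptions chosen for illustration; the probabilities are passed in directly rather than read from a model.

```python
from typing import List


def importance_ratio(p_new: float, p_old: float) -> float:
    """Ratio of the current-round token probability to the previous-round one."""
    return p_new / p_old


def clip_ratio(ratio: float, eps: float = 0.2) -> float:
    """Limit a ratio to the interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))


def advantage_weight(new_probs: List[float], old_probs: List[float],
                     eps: float = 0.2) -> float:
    """Sum of the clipped per-token ratios over one candidate action's policy text."""
    return sum(clip_ratio(importance_ratio(pn, po), eps)
               for pn, po in zip(new_probs, old_probs))
```

With the numbers from the example, `importance_ratio(0.6, 0.3)` gives a ratio of 2, which `clip_ratio` would limit to 1.2 under the assumed range of 0.2.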
[0069] In some embodiments, the parameters of the multimodal language model can be optimized to train the GUI agent by maximizing a function of the following form: $$J(\theta)=\mathbb{E}\left[w(s)\cdot\frac{1}{K}\sum_{i=1}^{K}\frac{1}{|t_i|+|a_i|}\sum_{j=1}^{|t_i|+|a_i|}\min\left(\rho_{i,j}(\theta)\,\hat{A}_{i,j},\ \operatorname{clip}\left(\rho_{i,j}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,j}\right)\right]$$ where $\mathbb{E}[\cdot]$ represents the expectation; $w(s)$ represents the difficulty coefficient of the task in state $s$, which is derived from the reciprocal of the average score of the multiple candidate actions output by the multimodal language model, so that it is negatively correlated with that average score; $K$ represents the total number of candidate actions predicted and generated by the multimodal language model under a single piece of state information; $i$ represents the i-th candidate action, with a value ranging from 1 to $K$; $|t_i|$ represents the total number of tokens in the reasoning process text $t_i$ corresponding to the i-th candidate action; $|a_i|$ represents the total number of tokens in the description text $a_i$ corresponding to the i-th candidate action; $j$ represents the j-th token in the policy text of the i-th candidate action, with a value ranging from 1 to $|t_i|+|a_i|$; $\rho_{i,j}(\theta)$ represents the importance sampling ratio of the j-th token in the policy text of the i-th candidate action, where $\theta$ is the current parameter of the multimodal language model; $\hat{A}_{i,j}$ represents the advantage value of the j-th token in the policy text of the i-th candidate action; and $\operatorname{clip}(\cdot)$ represents the clipping function, which limits the importance sampling ratio of each token to the interval $[1-\epsilon,\,1+\epsilon]$, preventing extreme sampling ratios of individual tokens from causing oscillations in model training.
[0070] ϵ represents a preset constant (hyperparameter) used to control the clipping range of the clip function, adjust the fluctuation range of the importance sampling ratio, and ensure the stability of the training process.
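One common way to combine the quantities defined above is a clipped surrogate in the style of PPO and GRPO. The sketch below is offered under that assumption: the `min` form, the averaging over tokens and over candidate actions, the function name `clipped_objective`, and the default clipping range of 0.2 are illustrative choices rather than details confirmed by the specification.

```python
from typing import List


def clipped_objective(ratios: List[List[float]], advantages: List[List[float]],
                      difficulty: float, eps: float = 0.2) -> float:
    """Difficulty-weighted clipped surrogate for one piece of state information.

    ratios[i][j] and advantages[i][j] refer to token j of candidate action i.
    """
    total = 0.0
    for ratio_row, adv_row in zip(ratios, advantages):
        per_token = []
        for r, a in zip(ratio_row, adv_row):
            clipped = max(1.0 - eps, min(1.0 + eps, r))
            per_token.append(min(r * a, clipped * a))   # pessimistic of the two terms
        total += sum(per_token) / len(per_token)         # average over tokens
    return difficulty * total / len(ratios)              # average over candidates
```

Training would then adjust the model parameters so that this value, averaged over sampled states, increases.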
[0071] The following describes the training method for GUI agents provided in this specification, using a specific embodiment as an example.
[0072] Multimodal large language models (MLLMs) have attracted widespread attention due to their powerful perception and reasoning capabilities, which lay the foundation for developing autonomous intelligent agents that can interact with graphical user interfaces (GUIs) to accomplish complex real-world tasks.
[0073] Building GUI agents still faces three major challenges: expensive data preparation, sparse step-level supervision signals, and unstable optimization processes. First, endowing MLLMs with competent GUI navigation capabilities typically requires a large number of high-quality interaction trajectories. Unlike conventional text and image data, these trajectories need to capture not only the GUI environment state (e.g., GUI screenshots) but also detailed reasoning processes and the corresponding actions. They rely heavily on expert participation, leading to high costs for data curation and annotation. Second, tasks in the GUI environment usually involve multiple rounds of interaction, forming long-horizon trajectories, while feedback is provided only at the final step. The resulting supervision information is sparse and delayed, making it difficult to effectively guide the optimization of intermediate steps. Furthermore, in related technologies, GUI agent training strategies often employ reinforcement learning with verifiable rewards (RLVR) frameworks. Because the supervision information is delayed, policy optimization must be tightly coupled with the GUI agent's execution environment, which is often non-stationary, leading to low training efficiency and an unstable process.
[0074] Based on this, this embodiment provides a training framework for GUI agents that enables controllable data processing and supports stable and efficient optimization. Unlike commonly used RLVR frameworks, the core idea of this embodiment is to train an action evaluation model to assess the value of each action in the state-action trajectory (i.e., the contribution of that action to task success). On the one hand, this alleviates the problem of sparse supervision in the state-action trajectory; on the other hand, policy optimization is decoupled from the environment, mitigating the impact of environmental non-stationarity on the training process. This framework contains two key components: (1) Action Evaluation Model: The action evaluation model is used to evaluate the value of a single action in a given state within a GUI environment. To do this, a set of self-generated state-action trajectories can be collected, and the final supervision signal is propagated back along the trajectory to each intermediate step. These step-level annotations are then used to train the action evaluation model in a binary classification manner. For example, if the task is executed successfully, the contribution value of each action in the trajectory is 1; if the task fails, the contribution value of each action in the trajectory is 0.
[0075] (2) GUI Agent: With the help of the action evaluation model, fine-grained policy optimization can be performed using the step-level trajectories generated by the policy itself. Specifically, at each step, the action evaluation model evaluates the value of the action taken by the current policy in the corresponding state, and the policy is optimized accordingly through reinforcement learning. In terms of the algorithm, a step-level variant of a critic-free method can be adopted, combined with action-level reward filtering and an adaptive group weighting mechanism to improve training stability.
[0076] The feasibility of action evaluation model estimation stems from the natural interaction patterns of the GUI agent environment, which is fundamentally different from other reasoning and decision-making tasks (such as mathematical solving or code generation). Specifically, the GUI agent operates in a multi-turn interactive environment where state transitions and actions are well-defined and observable. Each state directly reflects the visual feedback of the environment (i.e., webpage layout), forming a well-structured and semantically rich observation space. Furthermore, the action space is finite and task-specific, typically containing only a few atomic operations such as clicks, scrolling, and input, naturally supporting feasible action evaluation model estimation. Notably, this framework has two major advantages: (i) all training state-action trajectories can be self-generated by a multimodal language model through interaction with the GUI environment, eliminating the need for expensive expert annotation; (ii) since the action evaluation model provides intermediate supervision at each decision step, policy updates do not depend on delayed trajectory-level feedback, thus completely decoupling from the non-stationary execution environment and avoiding training instability and inefficiency.
[0077] Empirically, to evaluate performance in real-world scenarios, a real-time environment was constructed that enables the policy to interact with real websites. Public benchmarks (e.g., WebVoyager and Online-Mind2Web), both of which contain tasks from real websites, were used. Furthermore, ScreenSpot was incorporated to evaluate the basic grounding (element localization) capabilities of the GUI agent. State-action trajectories can be generated by interacting with the websites in a benchmark (e.g., WebVoyager) using an existing open-source multimodal language model (e.g., Ovis2.5-9B) as the base model. Overall, experimental results show that the model trained with this framework performs well, achieving state-of-the-art performance among similarly sized models (e.g., Qwen3VL8B and UI-TARS1.5-7B). The model also performs well on public benchmarks such as WebVoyager and Online-Mind2Web, demonstrating strong generalization ability.
[0078] In this regard, GUI navigation can be modeled as a Markov Decision Process (MDP) $\mathcal{M}=(S, A, P, R)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition function, and $R$ is the reward oracle. The undiscounted case (discount factor $\gamma=1$) is considered here, and $\gamma$ is omitted for simplicity. Specifically, in the i-th round, the agent $\pi_\theta$ with parameter $\theta$ receives a state $s_i$ from the state space $S$ and, based on its own reasoning and planning $t_i$, selects an action $a_i$. Subsequently, the environment updates the state to $s_{i+1}$ according to the action $a_i$, so that the agent $\pi_\theta$ can choose an action $a_{i+1}$ for the next round. This process is repeated until the agent considers the task complete or a termination state is reached. Finally, the agent receives a reward $r$ from $R$, indicating whether the task was successfully completed. In this framework: (i) the state $s_i$ includes the task query, user commands, historical interactions, and the screenshot of the i-th round; (ii) the action $a_i$ is selected from the discrete finite space $A$, covering common web page operations (such as left-click and scrolling), see Appendix B for details; (iii) the reasoning text $t_i$ is the text generated by $\pi_\theta$ that summarizes the reasoning and planning behind the action $a_i$.
[0079] The i-th step of the iterative process is defined as a tuple $(s_i, t_i, a_i)$ of state, reasoning text, and action. The complete trajectory with a total of $T$ rounds is represented as $\tau=\{(s_i, t_i, a_i)\}_{i=1}^{T}$. Note that only a termination reward $r_T$ (indicating whether the task is successfully completed) is available, and intermediate step rewards are not, so $r_i=0$ for all $i<T$. Under this setting, the return at the i-th step simplifies to $G_i=\sum_{k=i}^{T} r_k=r_T$. Therefore, its expectation measures the probability of task success given the first i steps.
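The simplification of the return under a terminal-only reward can be checked with a short sketch. The function name `undiscounted_returns` and the list representation of rewards are assumptions for illustration.

```python
from typing import List


def undiscounted_returns(rewards: List[float]) -> List[float]:
    """Return at each step: the sum of rewards from that step to the end (gamma = 1)."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        returns.append(running)
    return returns[::-1]
```

For a successful four-step trajectory with rewards `[0, 0, 0, 1]`, every step has the same return as the final reward, which is why the expected return at a step can be read as the probability of eventual task success.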
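[0080] Proximal Policy Optimization (PPO): Reinforcement learning (RL) is a fundamental paradigm for improving an agent through interaction with its environment. For large language models, a widely used RL algorithm is PPO, which iteratively updates the policy using an actor-critic architecture. Formally, PPO aims to maximize the following objective: $$J_{\mathrm{PPO}}(\theta)=\mathbb{E}_{x\sim q(\cdot),\,y\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{1}{|y|}\sum_{t=1}^{|y|}\min\left(\rho_t(\theta)A_t,\ \operatorname{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)A_t\right)\right],\qquad \rho_t(\theta)=\frac{\pi_\theta(y_t\mid x,\,y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t\mid x,\,y_{<t})}$$ where $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ represent the current and old policies respectively, and $x\sim q(\cdot)$ and $y\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)$ represent the prompt and its corresponding response. The importance sampling ratio $\rho_t(\theta)$ is used for unbiased estimation, and the clipping hyperparameter $\epsilon$ is used to stabilize the optimization. The advantage $A_t$ measures the relative improvement in the expected return of $y_t$ with respect to a baseline estimated by a dedicated critic model. Typically, the critic model is of comparable size to $\pi_\theta$, resulting in significant memory and computational overhead, especially in the context of large language models.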
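[0081] Critic-Free Methods: Recently, critic-free methods have emerged as efficient alternatives to PPO because they do not require an independent critic model. Well-known methods include Group Relative Policy Optimization (GRPO), which calculates the baseline for $A_t$ using the average reward of a group of responses generated from the same prompt. Mathematically, GRPO aims to optimize: $J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim q(\cdot),\, \{y_i\}_{i=1}^{K} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}\left[\frac{1}{K}\sum_{i=1}^{K}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \min\left(\rho_{i,t}(\theta) A_{i,t},\ \mathrm{clip}\left(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\right) A_{i,t}\right)\right]$, where $K$ is the group size, and the advantage $A_{i,t}$ is calculated as $A_{i,t} = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{K})}{\mathrm{std}(\{r_j\}_{j=1}^{K})}$, where $r_i$ is the reward for the $i$-th response within the group. Similarly, REINFORCE Leave-One-Out (RLOO) calculates the advantage as $A_{i,t} = r_i - \frac{1}{K-1}\sum_{j \neq i} r_j$ to ensure unbiased estimation.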
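The two critic-free baselines can be compared on a small group of rewards. This is a sketch under stated assumptions: the population standard deviation is used and a small constant `eps` is added to the denominator for numerical safety, both of which are implementation choices not specified in the text.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: (r_i - group mean) / group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards: Sequence[float]) -> List[float]:
    """Leave-one-out advantage: r_i minus the mean reward of the other K - 1 responses."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # K = 4 responses to the same prompt
print(grpo_advantages(group))
print(rloo_advantages(group))
```

In both cases the baseline is derived from sibling responses to the same prompt, so no separate critic has to be trained or kept in memory.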
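[0082] REINFORCE++ proposes using global batch statistics for normalization: $A_{i,t} = \frac{r_i - \mathrm{mean}(\{r_j\}_{j \in \mathcal{B}})}{\mathrm{std}(\{r_j\}_{j \in \mathcal{B}})}$, where $\mathcal{B}$ denotes the global training batch. The construction of training data, cold-start training, and the training of the action evaluation model and the GUI agent are described in sequence below.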
[0083] (1) Construction of training data. Detect-rerun strategy: The operating environment of GUI agents is often unstable, which poses significant challenges to data collection and performance evaluation. For example, due to load limitations and system configuration, collecting web browsing trajectories may run into environmental problems such as human verification, inaccessible websites, and browser crashes. In such cases, the training data may unintentionally include task-irrelevant noise that hinders the model from capturing task-specific patterns; moreover, during evaluation, such failed tasks cannot be directly attributed to insufficient model performance. To address this, a simple and effective "detect-rerun" strategy can be adopted: an expert model evaluates whether a given trajectory is affected by environmental problems; if interference is detected, the corresponding task is rerun and the trajectory is collected again. In practice, this strategy can eliminate almost all environmental problems.
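The detect-rerun strategy can be sketched as a small retry loop. The callables `run_task` and `has_env_issue` stand in for the agent rollout and the expert detection model respectively; both names, the dictionary-based trajectory format, and the `max_attempts` cap are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional

Trajectory = List[Dict[str, str]]


def collect_with_rerun(run_task: Callable[[str], Trajectory],
                       has_env_issue: Callable[[Trajectory], bool],
                       task: str,
                       max_attempts: int = 3) -> Optional[Trajectory]:
    """Rerun the task until a trajectory free of environmental interference is obtained."""
    for _ in range(max_attempts):
        trajectory = run_task(task)
        if not has_env_issue(trajectory):
            return trajectory
    return None  # every attempt was disturbed; the task is skipped


# Toy stand-ins: the first attempt hits a human-verification page.
attempts = {"n": 0}


def fake_run(task: str) -> Trajectory:
    attempts["n"] += 1
    page = "captcha" if attempts["n"] == 1 else "results"
    return [{"task": task, "page": page}]


def fake_detector(trajectory: Trajectory) -> bool:
    return any(step["page"] == "captcha" for step in trajectory)


clean = collect_with_rerun(fake_run, fake_detector, "find a flight")
print(clean, attempts["n"])
```

Capping the number of attempts keeps a permanently unreachable website from blocking data collection.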
[0084] Data tiering strategy: To ensure stable and effective optimization of the agent model, the tasks in the training set can be stratified by difficulty. Specifically, each task is executed on the web page n = 8 times, and the task is assigned to Level $l \in \{0, 1, ..., 8\}$ according to its number of successful trajectories, where Level $l$ denotes a task that is successfully completed in $l$ of the 8 runs. It should be noted that Level 0 (extremely difficult) and Level 8 (extremely easy) tasks produce identical results across the repeated trajectories (all failures or all successes, respectively) and therefore contribute very little to exploration during reinforcement learning; they are removed from the training set.
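The tiering rule reduces to counting successes per task and dropping the two extreme levels. The sketch below assumes the outcomes of the repeated runs are available as a list of booleans per task; the function name and input layout are hypothetical.

```python
from typing import Dict, List


def stratify_tasks(outcomes: Dict[str, List[bool]], n_runs: int = 8) -> Dict[str, int]:
    """Map each task to its level (number of successful runs), dropping Level 0 and Level n_runs."""
    kept: Dict[str, int] = {}
    for task, results in outcomes.items():
        if len(results) != n_runs:
            raise ValueError(f"task {task!r} has {len(results)} runs, expected {n_runs}")
        level = sum(results)  # True counts as 1
        if 0 < level < n_runs:
            kept[task] = level
    return kept


outcomes = {
    "book hotel": [True] * 8,    # Level 8: always solved, removed
    "find price": [True, False, True, False, False, False, True, False],  # Level 3
    "login": [False] * 8,        # Level 0: never solved, removed
}
print(stratify_tasks(outcomes))
```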
[0085] Process supervision annotation: In state-action trajectories, only the final step carries a supervision signal; intermediate steps lack process-level guidance, which makes it difficult to train the action evaluation model. A natural approach is to use a stronger model to label the intermediate steps, but this is costly and may not reflect the true contribution of each step to the final result. To address the scarcity of process supervision, this specification proposes propagating the final result back to each intermediate step, where it serves as the process label for training the action evaluation model. Formally, for a trajectory $\tau = \{(s_i, t_i, a_i)\}_{i=1}^{T}$, the return corresponding to the first $i$ steps is $G_i = r_T$, which is an unbiased estimate of $\mathbb{E}[G_i]$; the action evaluation model can therefore be optimized directly by empirical risk minimization.
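A minimal sketch of this labeling scheme follows. It copies the final outcome onto every step and then averages the labels of identical (state, action) pairs, which illustrates why the propagated label behaves like a Monte Carlo estimate of the success probability. The tuple layout and both function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]             # (state, reasoning text, action)
Sample = Tuple[str, str, str, float]    # step plus propagated label


def propagate_labels(trajectories: List[Tuple[List[Step], float]]) -> List[Sample]:
    """Attach the final reward r_T of each trajectory to every one of its steps."""
    samples: List[Sample] = []
    for steps, final_reward in trajectories:
        for state, thought, action in steps:
            samples.append((state, thought, action, final_reward))
    return samples


def empirical_success_rate(samples: List[Sample]) -> Dict[Tuple[str, str], float]:
    """Average the propagated labels of identical (state, action) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for state, _thought, action, label in samples:
        totals[(state, action)][0] += label
        totals[(state, action)][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


trajectories = [
    ([("home", "look for search box", "click(search)"),
      ("results", "pick first item", "click(item_1)")], 1.0),
    ([("home", "look for search box", "click(search)"),
      ("results", "scroll for more", "scroll(down)")], 0.0),
]
samples = propagate_labels(trajectories)
rates = empirical_success_rate(samples)
print(rates)
```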
[0086] To give the model $\pi_\theta$ basic GUI navigation capabilities, it is fine-tuned on localization data and expert web page navigation trajectories, where the trajectory data records only browsing behavior without task-completion supervision. Supervised Fine-Tuning (SFT) is employed for this cold start. For example, given web page navigation trajectory data $\tau = \{(s_i, t_i, a_i)\}_{i=1}^{T}$, the objective of SFT is to minimize $\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{\tau \sim D_{\mathrm{SFT}}}\left[\sum_{i=1}^{T} \log \pi_\theta(t_i, a_i \mid s_i)\right]$, where $D_{\mathrm{SFT}}$ is the SFT trajectory dataset. After the cold start, the MLLM-based agent acquires basic GUI navigation capabilities (such as understanding real web pages and generating valid actions). Starting from this initial model $\pi_{\theta_{\mathrm{SFT}}}$, an action evaluation model can be trained, and reinforcement learning can be used to further improve the GUI navigation capabilities.
[0087] Training of the action evaluation model: Training of the action evaluation model begins with data collection. The initial model $\pi_{\theta_{\mathrm{SFT}}}$ generates its own state-action trajectories, and process supervision is propagated back from the final result. The entire trajectory is then decomposed into step-level samples rather than treated as a single sample. Specifically, every step $(s_i, t_i, a_i)$ in a trajectory $\tau$ is treated as an independent training instance. The goal of the action evaluation model, denoted here as $V_\phi$, is to predict the return $G_i$ corresponding to the step $(s_i, t_i, a_i)$. Because $G_i$ is binary, a natural choice is to minimize the cross-entropy loss: $\mathcal{L}(\phi) = -\mathbb{E}_{(s_i, t_i, a_i, G_i) \sim D}\left[G_i \log V_\phi(s_i, t_i, a_i) + (1 - G_i)\log\left(1 - V_\phi(s_i, t_i, a_i)\right)\right]$, where $D$ is the dataset of stepwise samples. Owing to the properties of the cross-entropy loss, the action evaluation model with optimal parameters accurately captures the reward distribution.
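The property referred to above can be illustrated numerically: among constant predictions, binary cross-entropy is minimized when the prediction equals the empirical success rate of the labels. The sketch below assumes predictions are probabilities in (0, 1) and clamps them with a small `eps` to avoid taking the logarithm of zero; the function name is illustrative.

```python
import math
from typing import Sequence


def bce_loss(predictions: Sequence[float], labels: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted success probabilities and 0/1 labels."""
    total = 0.0
    for p, g in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(labels)


# Four visits to the same step, three of which ended in task success.
labels = [1.0, 1.0, 0.0, 1.0]
for guess in (0.5, 0.75, 0.9):
    print(guess, round(bce_loss([guess] * 4, labels), 4))
```

The loss is lowest at 0.75, the observed success rate, which is why a well-trained evaluator can be read as an estimate of the probability of task success.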
[0088] To ensure good generalization of the action evaluation model and to support stable policy optimization, two techniques are introduced: sliding window and action focus. Specifically, each step state $s_i$ encodes all historical interactions, which can be problematic: when the interaction history is too long, the action evaluation model may struggle to focus on the current context, leading to unstable policy optimization in subsequent steps. To address this, a sliding window strategy can be employed to limit the historical interactions in the input sample $s_i$ to the most recent few steps. Furthermore, since each step sample contains both reasoning and action, an action-focus strategy can be applied: the reasoning portion is masked during training, which prompts the action evaluation model to focus on the actions themselves and reduces potential interference from the reasoning.
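Both techniques amount to preprocessing of the evaluator input. The following is a minimal sketch, assuming the input sample is assembled as a dictionary and that masking is realized by simply omitting the reasoning text; the field names, the default window size of 3, and the function name are all hypothetical.

```python
from typing import Dict, List, Optional


def build_evaluator_input(task: str,
                          history: List[str],
                          screenshot_id: str,
                          thought: str,
                          action: str,
                          window: int = 3,
                          focus_on_action: bool = True) -> Dict[str, object]:
    """Assemble one evaluator sample with a truncated history and optionally masked reasoning."""
    # Sliding window: keep only the most recent `window` interactions.
    # (history[-0:] would return the whole list, so window == 0 is handled explicitly.)
    recent = history[-window:] if window > 0 else []
    # Action focus: drop the reasoning text so the score depends on the action itself.
    masked_thought: Optional[str] = None if focus_on_action else thought
    return {
        "task": task,
        "history": recent,
        "screenshot": screenshot_id,
        "thought": masked_thought,
        "action": action,
    }


sample = build_evaluator_input(
    "buy shoes", ["a1", "a2", "a3", "a4", "a5"], "shot_5.png",
    "I should click the buy button", "click(buy)",
)
print(sample)
```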
[0089] Training of the GUI agent: Based on the initial model $\pi_{\theta_{\mathrm{SFT}}}$ and the self-generated stepwise trajectories $D$, and to further enhance the model's GUI navigation capabilities, reinforcement learning is performed with a critic-free stepwise policy optimization method that uses the learned action evaluation model. Compared with existing methods (such as GRPO), the following improvements are made: (1) Instead of the final result, the predictions of the action evaluation model $V_\phi$ are used as supervision to optimize the policy. This makes it easy to use stepwise samples and decouples the GUI execution environment from policy optimization, improving efficiency. (2) An action-level reward filtering strategy is introduced: given a state $s$, the model $\pi_\theta$ generates $K$ candidate actions and the action evaluation model $V_\phi$ predicts the reward of each action; if all actions are identical (lack of diversity) or the $K$ rewards are highly concentrated (low standard deviation), the sample $s$ is filtered out. (3) Task difficulty is explicitly considered through adaptive group weighting: the inverse of the average reward within a group can be used as a weight to encourage attention to more difficult tasks.
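Improvements (2) and (3) can be sketched as two small helper functions. The threshold `min_std`, the use of the population standard deviation, and the constant `eps` guarding the division are assumptions; the specification only states that low-variance or non-diverse groups are filtered and that the inverse of the average reward can serve as a weight.

```python
from statistics import mean, pstdev
from typing import Sequence


def keep_state(actions: Sequence[str], scores: Sequence[float], min_std: float = 0.05) -> bool:
    """Action-level reward filtering: keep a state only if its candidates are informative."""
    if len(set(actions)) < 2:        # all candidate actions identical
        return False
    return pstdev(scores) > min_std  # predicted rewards must be spread out


def difficulty_weight(scores: Sequence[float], eps: float = 1e-6) -> float:
    """Adaptive group weight: inverse of the average predicted reward of the group."""
    return 1.0 / (mean(scores) + eps)


print(keep_state(["click(3)", "click(3)"], [0.1, 0.9]))      # no diversity
print(keep_state(["click(3)", "scroll(down)"], [0.5, 0.5]))  # no score spread
print(keep_state(["click(3)", "scroll(down)"], [0.2, 0.8]))  # informative
print(difficulty_weight([0.2, 0.2]), difficulty_weight([0.8, 0.8]))
```

States whose candidates all receive the same score would yield zero advantages, so filtering them avoids spending updates on samples that carry no learning signal.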
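[0090] The above design can be naturally integrated into existing critic-free policy optimization methods. Taking GRPO as an example, the objective of its stepwise variant (S-GRPO) can be expressed as: $J_{\mathrm{S\text{-}GRPO}}(\theta) = \mathbb{E}_{s \sim D_f,\, \{(t_i, a_i)\}_{i=1}^{K} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)}\left[w(s)\,\frac{1}{K}\sum_{i=1}^{K}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \min\left(\rho_{i,t}(\theta) A_i,\ \mathrm{clip}\left(\rho_{i,t}(\theta), 1-\epsilon, 1+\epsilon\right) A_i\right)\right]$, where the state $s$ is sampled from the filtered set $D_f$, $y_i$ is the token sequence of the $i$-th candidate $(t_i, a_i)$, and the expectation is computed over $s$ and the $K$ candidates $(t_i, a_i)$.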
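[0091] The advantage $A_i$ and the task weight $w(s)$ are calculated as follows: $A_i = \frac{\hat{r}_i - \bar{r}(s)}{\mathrm{std}(\{\hat{r}_j\}_{j=1}^{K})}$ and $w(s) = \frac{1}{\bar{r}(s)}$, where $\hat{r}_i$ is the return prediction of the action evaluation model for the step $(s, t_i, a_i)$, and $\bar{r}(s) = \frac{1}{K}\sum_{j=1}^{K} \hat{r}_j$ is the average return of $s$. Similarly, stepwise variants of RLOO and REINFORCE++ can be derived, denoted as S-RLOO and S-RF++, respectively.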
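Putting the pieces together, the contribution of one state to a stepwise GRPO-style objective can be sketched as follows. This is an illustrative simplification, not the specification's exact formula: it uses one sequence-level importance ratio per candidate instead of per-token ratios, normalizes by the population standard deviation, and adds a small `eps` to the denominators. The function name and arguments are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def stepwise_state_objective(scores: Sequence[float],
                             logp_new: Sequence[float],
                             logp_old: Sequence[float],
                             clip_eps: float = 0.2,
                             eps: float = 1e-8) -> float:
    """Weighted clipped surrogate for one state with K scored candidate actions."""
    mu, sigma = mean(scores), pstdev(scores)
    weight = 1.0 / (mu + eps)  # harder states (low average score) weigh more
    total = 0.0
    for score, ln, lo in zip(scores, logp_new, logp_old):
        adv = (score - mu) / (sigma + eps)  # score relative to the group mean
        ratio = math.exp(ln - lo)           # sequence-level ratio (simplification)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return weight * total / len(scores)


# The policy has shifted probability towards the better-scored candidate.
value = stepwise_state_objective(scores=[0.9, 0.1],
                                 logp_new=[-0.9, -1.1],
                                 logp_old=[-1.0, -1.0])
print(value)
```

Because the scores come from the evaluator rather than from executing the actions, this quantity can be computed without touching the GUI environment.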
[0092] Furthermore, this specification also provides a task processing method that can utilize a trained GUI agent to process GUI-related tasks. Specifically, during the execution of a target task, current state information can be obtained, including task information of the target task, a current GUI screenshot, and historical interaction information with the GUI. The current state information can then be input into a pre-trained GUI agent to predict and execute the next interaction action. The GUI agent is trained using the method mentioned in any of the above embodiments.
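The task processing method above corresponds to a simple observe-predict-execute loop. The sketch below uses a toy environment and a scripted agent in place of a real browser and a trained model; the `ToyEnv` class, the `"stop"` action, and the `max_steps` cap are assumptions introduced only to make the example runnable.

```python
from typing import Callable, Dict, List


class ToyEnv:
    """Stand-in for a GUI environment with two pages."""

    def __init__(self) -> None:
        self.page = "home"

    def screenshot(self) -> str:
        return f"<screenshot of {self.page}>"

    def execute(self, action: str) -> None:
        if action == "click(search)":
            self.page = "results"


def run_task(agent: Callable[[Dict[str, object]], str],
             env: ToyEnv,
             task: str,
             max_steps: int = 20) -> List[str]:
    """Repeatedly build the current state, ask the agent for an action, and execute it."""
    history: List[str] = []
    for _ in range(max_steps):
        state = {
            "task": task,                    # task information
            "screenshot": env.screenshot(),  # current GUI screenshot
            "history": list(history),        # historical interactions
        }
        action = agent(state)
        if action == "stop":                 # the agent considers the task complete
            break
        env.execute(action)
        history.append(action)
    return history


def scripted_agent(state: Dict[str, object]) -> str:
    return "click(search)" if "home" in str(state["screenshot"]) else "stop"


executed = run_task(scripted_agent, ToyEnv(), "search for shoes")
print(executed)
```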
[0093] The various technical features in the above embodiments can be combined arbitrarily, as long as there is no conflict or contradiction between the combinations of features. However, due to space limitations, they are not described one by one. Therefore, the arbitrary combination of various technical features in the above embodiments is also within the scope of this specification.
[0094] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.
[0095] In some embodiments, this specification also provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor implements the method described in any one of the above embodiments by executing the executable instructions.
[0096] Figure 7 is a schematic structural diagram of a device provided in an exemplary embodiment. Referring to Figure 7, at the hardware level, the device includes a processor 702, an internal bus 704, a network interface 706, memory 708, and non-volatile memory 710, and may also include other hardware required for its functions. One or more embodiments of this specification can be implemented in software; for example, the processor 702 reads the corresponding computer program from the non-volatile memory 710 into memory 708 and then runs it. Of course, in addition to software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the following processing flow is not limited to logic units, but can also be hardware or logic devices.
[0097] In some embodiments, the GUI agent training device can be applied to the device shown in Figure 7 to implement the technical solution of this specification. The device may include: an acquisition module, used to acquire the first action sequence formed by interaction with the GUI; an action prediction module, used to input the state information of a single-step action in the first action sequence into a multimodal language model to predict multiple candidate actions under the state information, where the state information of each action includes the task information of the task associated with the first action sequence, the GUI screenshot before the action is executed, and the relevant information of the historical actions executed before the action; an evaluation module, used to input, for each candidate action, the state information and the candidate action into a pre-trained action evaluation model to predict the score of the candidate action, where the score characterizes the degree to which performing the candidate action under the state information contributes to the success of task execution; and a training module, used to determine the mean of the scores of the plurality of candidate actions, determine the advantage value of each candidate action based on the difference between the score of that candidate action and the mean, the advantage value being positively correlated with the difference, train the multimodal language model with at least maximizing the cumulative amount of the advantage values of the plurality of candidate actions as the training objective, and use the trained multimodal language model as the GUI agent.
[0098] The specific implementation process of the functions and roles of each module in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0099] Based on the same concept as the methods described above, this specification also provides a computer-readable storage medium having computer instructions stored thereon that, when executed by a processor, implement the steps of the methods as described in any of the above embodiments.
[0100] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0101] Based on the same concept as the methods described above, this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the methods as described in any of the above embodiments.
[0102] The above description is merely a preferred embodiment of one or more embodiments of this specification and is not intended to limit the scope of one or more embodiments of this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of one or more embodiments of this specification should be included within the protection scope of one or more embodiments of this specification.
Claims
1. A method for training a graphical user interface (GUI) agent, the GUI agent being used to interact with the GUI to complete tasks input by the user, the method comprising: obtaining a first action sequence formed by interaction with the GUI; inputting the state information of a single-step action in the first action sequence into a multimodal language model to predict multiple candidate actions under the state information, wherein the state information of each single-step action includes the task information of the task associated with the first action sequence, the GUI screenshot before the single-step action is executed, and the relevant information of the historical actions executed before the single-step action; for each candidate action, inputting the state information and the candidate action into a pre-trained action evaluation model to predict the score of the candidate action, wherein the score is used to characterize the degree of contribution of performing the candidate action under the state information to the success of task execution; determining the mean of the scores of the plurality of candidate actions, and determining the advantage value of each candidate action based on the difference between the score of that candidate action and the mean, wherein the advantage value is positively correlated with the difference; and training the multimodal language model with the goal of maximizing the cumulative advantage value of the multiple candidate actions, and using the trained multimodal language model as the GUI agent.
2. The method according to claim 1, wherein the action evaluation model is trained in the following manner: obtaining a second action sequence formed by interaction with the GUI; and training the multimodal language model using a single-step action from the second action sequence and the state information of that single-step action as input, and the actual score of that single-step action as supervision information, to obtain the action evaluation model; wherein the actual score of each single-step action in the second action sequence is determined based on the execution result of the task associated with the second action sequence.
3. The method according to claim 2, wherein, if the execution result of the task associated with the second action sequence is successful, the actual score of each single-step action in the second action sequence is a preset first score; and if the execution result of the task associated with the second action sequence is unsuccessful, the actual score of each single-step action in the second action sequence is a preset second score, wherein the first score is greater than the second score.
4. The method according to claim 2, wherein the state information of each single-step action in the second action sequence includes: the task information of the task associated with the second action sequence, the GUI screenshot before the single-step action is executed, and the relevant information of the N historical actions preceding the single-step action, where N is an integer.
5. The method according to claim 2, wherein the first action sequence and / or the second action sequence are obtained from a pre-constructed training dataset, the training dataset including multiple training data, each training data including an action sequence formed by interaction with the GUI during the execution of a specified task using a multimodal language model, the state information of each action in the action sequence, the inference process text corresponding to each action, and the execution result of the specified task.
6. The method according to claim 5, wherein the training dataset is constructed based on the following method: for any GUI-related task, constructing a prompt based on the task information, using the prompt to guide the multimodal language model to execute the task, and outputting the action sequence formed by the interaction with the GUI during the execution of the task, the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task; using an environmental interference determination model to determine whether the execution of the task is affected by network environmental factors, based on the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task; if not, using the action sequence, the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task as one training data in the training dataset; and if so, executing the task again using the multimodal language model.
7. The method according to claim 5, wherein the training dataset is constructed based on the following method: for any GUI-related task, executing the task multiple times using a multimodal language model, and recording, for each execution, the action sequence formed by the interaction with the GUI, the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task; calculating the percentage of successful executions of the task relative to the total number of executions; and if the percentage is higher than a preset lower limit and lower than a preset upper limit, using the action sequence formed by the task in one execution, the state information of each action in the action sequence, the reasoning process text of each action, and the execution result of the task as one training data.
8. The method according to claim 1, wherein the multimodal language model is obtained through supervised fine-tuning, and the supervised fine-tuning process is as follows: acquiring multiple sample data, each of which includes a standard action sequence formed by interaction with the GUI, the state information of each action in the standard action sequence, and the standard reasoning process text of each action; using the state information of each action as the input of the multimodal language model, and using the multimodal language model to output a predicted action sequence and the predicted reasoning process text of each action in the predicted action sequence; and adjusting the model parameters of the multimodal language model based on the differences between the predicted action sequence and the standard action sequence, as well as the differences between the predicted reasoning process text and the standard reasoning process text, so as to perform supervised fine-tuning of the multimodal language model.
9. The method according to claim 1, wherein, before training the multimodal language model with at least maximizing the cumulative amount of the advantage values of the plurality of candidate actions as the training objective, the method further comprises: determining whether the state information of the single-step action is valid state information; and if the state information of the single-step action is determined to be valid state information, training the multimodal language model with at least maximizing the cumulative amount of the advantage values of the plurality of candidate actions as the training objective; wherein, if there are at least two different candidate actions among the plurality of candidate actions, and the standard deviation of the scores of the plurality of candidate actions is greater than a preset threshold, the state information is determined to be valid state information.
10. The method according to claim 1, wherein training the multimodal language model with at least maximizing the cumulative amount of the advantage values of the plurality of candidate actions as the training objective comprises: determining the difficulty coefficient of the task associated with the first action sequence, wherein the difficulty coefficient is negatively correlated with the average score of the plurality of candidate actions; determining the target loss based on the difficulty coefficient and the cumulative amount, wherein the target loss is positively correlated with the difficulty coefficient and negatively correlated with the cumulative amount; and adjusting the model parameters of the multimodal language model based on the target loss to train the multimodal language model.
11. The method according to claim 1 or 10, wherein the multimodal language model is further configured to output the reasoning process text corresponding to each candidate action, the cumulative amount is the weighted cumulative amount of the advantage values of the plurality of candidate actions, the weight of the advantage value of each candidate action is determined based on the importance sampling ratio of each word in the strategy text corresponding to the candidate action, and the strategy text includes the description text of the candidate action and the reasoning process text of the candidate action; and / or the method further includes: after using the action evaluation model to predict the scores of the multiple candidate actions, recording each candidate action and its corresponding score in a log file.
12. A task processing method, the method comprising: during the execution of a target task, obtaining current state information, including the task information of the target task, the current GUI screenshot, and the historical interaction information with the GUI; and inputting the current state information into a pre-trained GUI agent to predict and execute the next interaction action; wherein the GUI agent is trained by the method described in any one of claims 1-11.
13. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor implements the steps of the method as described in any one of claims 1 to 12 by executing the executable instructions.
14. A computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 12.
15. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method as claimed in any one of claims 1 to 12.