A GUI agent evaluation method, device, product, equipment and medium

CN122197934A · Pending · Publication Date: 2026-06-12 · BEIJING XIAOMI MOBILE SOFTWARE CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING XIAOMI MOBILE SOFTWARE CO LTD
Filing Date: 2026-01-23
Publication Date: 2026-06-12

Application Information

IPC: G06N3/006; G06F9/451

AI Tagging

Application Domain

Biological models; Execution for user interfaces

Abstract

The present disclosure provides a GUI agent evaluation method, device, product, equipment and medium. The method comprises: controlling a GUI agent to be evaluated to perform a target task, and obtaining the generated interaction trajectory data; and evaluating the interaction trajectory data, based on a task success condition predefined for the target task, to determine whether the GUI agent successfully performs the target task; wherein the task success condition is determined based on the interaction elements and / or interaction actions required to successfully perform the target task. Because the task success condition is determined based on the interaction elements and / or interaction actions triggered by the successful execution of the target task, the method can objectively evaluate whether the GUI agent completes the necessary interactions, avoids misjudging the GUI agent's task execution due to differences in interaction paths, and thus improves the accuracy of evaluating the task execution capability of the GUI agent.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a method, apparatus, product, device and medium for evaluating GUI agents.

Background Technology

[0002] With the rapid development of Large Language Models (LLMs) and multimodal technologies, the application of Graphical User Interface (GUI) agents is becoming increasingly widespread. GUI agents can automatically complete various tasks by interacting with the graphical user interface and simulating human operational logic.

[0003] To measure the task execution capability of GUI agents, related evaluation schemes typically rely on preset fixed interaction trajectory templates, and a task is considered successfully executed only if the GUI agent's interaction trajectory is completely consistent with the template. Alternatively, such schemes judge success solely from the final page display. Both approaches can easily lead to misjudgments and fail to truly reflect the GUI agent's task execution capability.

Summary of the Invention

[0004] To overcome the problems existing in related technologies, this disclosure provides a method, apparatus, product, device and medium for evaluating GUI intelligent agents.

[0005] The first aspect of this disclosure provides a method for evaluating a GUI agent, the method comprising: controlling the GUI agent to be evaluated to perform a target task and acquiring the generated interaction trajectory data; and evaluating the interaction trajectory data, based on task success conditions predefined for the target task, to determine whether the GUI agent has successfully executed the target task; wherein the task success conditions are determined based on the interactive elements and / or interactive actions required to successfully execute the target task.

[0006] Optionally, the target task is executed in the interactive environment of the terminal device.

[0007] Optionally, the method further includes: controlling the GUI agent to execute a reset task, the reset task being used to reset the terminal device to its initial state.

[0008] Optionally, the reset task includes at least one of the following: a task-level reset task, used to reset the application state corresponding to each target task; and an application-level reset task, used to reset the application state shared by multiple target tasks in the interactive environment.

[0009] Optionally, for the target task that triggers the persistent change, the task success condition is determined based on the interaction elements and / or interaction actions associated with the application state that has been persistently changed.

[0010] Optionally, controlling the GUI agent to be evaluated to perform the target task includes: inputting the application state of the interactive environment to the GUI agent, so that the GUI agent determines a control command based on the application state; executing the control command in the interactive environment to update the application state; and repeating the above steps of state input, command determination, and state update until the trajectory termination condition of the target task is met.

[0011] Optionally, each target task corresponds to at least one task success condition, and evaluating the interaction trajectory data based on the task success conditions predefined for the target task to determine whether the GUI agent has successfully executed the target task includes: in response to the interaction trajectory data satisfying any one of the task success conditions, determining that the GUI agent has successfully executed the target task; and in response to the interaction trajectory data satisfying none of the task success conditions, determining that the GUI agent has failed to execute the target task.

[0012] Optionally, the task success condition includes at least one of the following sub-conditions: a GUI element in the interaction trajectory data matches the attribute information of the interaction element; the position at which the interaction action is performed on the GUI element in the interaction trajectory data is within a preset range.

[0013] A second aspect of this disclosure provides an evaluation apparatus for a GUI agent, the apparatus comprising: an interaction module, used to control the GUI agent to be evaluated to perform the target task and to acquire the generated interaction trajectory data; and an evaluation module, used to evaluate the interaction trajectory data based on task success conditions predefined for the target task, so as to determine whether the GUI agent has successfully executed the target task; wherein the task success conditions are determined based on the interactive elements and / or interactive actions required to successfully execute the target task.

[0014] Optionally, the target task is executed in the interactive environment of the terminal device.

[0015] Optionally, the reset task includes at least one of the following: a task-level reset task, used to reset the application state corresponding to each target task; and an application-level reset task, used to reset the application state shared by multiple target tasks in the interactive environment.

[0016] Optionally, the apparatus further includes: a reset module, used to control the GUI agent to execute a reset task, the reset task being used to reset the terminal device to its initial state.

[0017] Optionally, for the target task that triggers the persistent change, the task success condition is determined based on the interaction elements and / or interaction actions associated with the application state that has been persistently changed.

[0018] Optionally, the interaction module includes: an input module, used to input the application state of the interactive environment to the GUI agent, so that the GUI agent determines a control command based on the application state; an update module, used to execute the control command in the interactive environment to update the application state; and a repeat execution module, used to repeat the above steps of state input, command determination, and state update until the trajectory termination condition of the target task is met.

[0019] Optionally, each target task corresponds to at least one task success condition, and the evaluation module includes: a first evaluation module, used to determine that the GUI agent has successfully executed the target task in response to the interaction trajectory data satisfying any one of the task success conditions; and a second evaluation module, used to determine that the GUI agent has failed to execute the target task in response to the interaction trajectory data satisfying none of the task success conditions.

[0020] Optionally, the task success condition includes at least one of the following sub-conditions: a GUI element in the interaction trajectory data matches the attribute information of the interaction element; the position at which the interaction action is performed on the GUI element in the interaction trajectory data is within a preset range.

[0021] A third aspect of this disclosure provides a computer program product including a computer program / instructions that, when executed by a processor, implement the method described in the first aspect.

[0022] A fourth aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method as described in the first aspect.

[0023] The fifth aspect of this disclosure provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described in the first aspect.

[0024] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects: This embodiment controls a GUI agent to be evaluated to execute a target task and acquires the generated interaction trajectory data. Based on predefined task success conditions for the target task, the interaction trajectory data is evaluated to determine whether the GUI agent has successfully executed the target task. Since the task success conditions are determined based on the interactive elements and / or interactive actions triggered by the successful execution of the target task, it is possible to objectively evaluate whether the GUI agent has completed the necessary interactions, avoiding misjudgments of the task execution effect of the GUI agent due to differences in interaction paths, thereby improving the accuracy of evaluating the task execution capability of the GUI agent.

[0025] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Attached Figure Description

[0026] The accompanying drawings, which are incorporated in and form part of this disclosure, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.

[0027] Figure 1 is a flowchart of an evaluation method for a GUI agent according to some exemplary embodiments.
Figure 2 is a schematic diagram of two interaction paths for a target task according to some exemplary embodiments.
Figure 3 is a flowchart of another evaluation method for a GUI agent according to some exemplary embodiments.
Figure 4 is a schematic diagram of an evaluation process for a GUI agent according to some exemplary embodiments.
Figure 5 is a block diagram of an evaluation device for a GUI agent according to some exemplary embodiments.
Figure 6 is a hardware structure diagram of an electronic device according to some exemplary embodiments.

Detailed Implementation

[0028] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.

[0029] In related technologies, when evaluating a GUI agent, a standard GUI operation trajectory is pre-recorded or defined for the task. During evaluation, the interaction trajectory generated by the agent must be consistent with this standard trajectory for the task to be considered successfully executed.

[0030] However, this evaluation method overlooks the fact that in real-world application scenarios, there are often multiple reasonable and effective interaction paths to complete the same task. For example, a user can locate a target through the search function or reach it step by step through the navigation menu. If the agent completes the task using a logically correct path that differs from the standard trajectory, the result is still judged a failure, so the task execution capability of the GUI agent is not accurately assessed.

[0031] In view of the above, this disclosure proposes an evaluation method, apparatus, product, device and medium for GUI intelligent agents, and the embodiments of this disclosure will be described in detail below.

[0032] The first aspect of this disclosure provides a method for evaluating GUI intelligent agents, which can be applied, exemplarily, to terminal devices such as smartphones, tablets, laptops, desktop computers, in-vehicle intelligent terminals, smart home devices, etc., and can also be applied to test platforms, test servers, etc. This method can also be executed by other devices in different application scenarios, and the embodiments of this disclosure do not limit this.

[0033] Referring to Figure 1, which is a flowchart of an evaluation method for a GUI agent according to some exemplary embodiments, the evaluation method may include the following steps.

Step S101: Control the GUI agent to be evaluated to execute the target task and obtain the generated interaction trajectory data.

[0034] In this step, the task description of the target task is input into the GUI agent to be evaluated through test platform scheduling, remote command issuance, local process invocation, etc., and the GUI agent is controlled to enter the execution environment corresponding to the target task and start the execution process of the target task.

[0035] During the execution of the target task by the GUI agent, the current environmental status information can be input into the GUI agent by collecting structured data such as screenshots of the terminal device interface, element text and position of GUI elements, and operation logs in real time.

[0036] Based on the input state information and built-in task decision-making logic, the GUI agent generates corresponding control commands such as clicks, inputs, selections, and switches, and sends them to the terminal device for execution via protocol transmission and inter-process communication. After the control commands are executed, the environmental state changes after the action are obtained by collecting the interface state and execution results, and this change is fed back to the GUI agent to trigger the next round of interaction decisions.

[0037] The above interaction process will continue to loop until the preset trajectory termination conditions are met, such as triggering the task completion flag, reaching the preset maximum number of interaction steps, or the occurrence of an abnormal task execution state, at which point the task execution will terminate and complete interaction trajectory data will be generated.
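The loop described in [0034] to [0037] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `ToyEnv`, `toy_agent`, the `"DONE"` completion flag, and the use of `RuntimeError` to signal an abnormal state are all hypothetical stand-ins for a real terminal device and agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    state: dict   # application state the agent observed before acting
    command: str  # control command the agent chose


class ToyEnv:
    """Hypothetical stand-in for a terminal device: a counter raised by taps."""

    def reset(self) -> dict:
        self.count = 0
        return {"count": self.count}

    def step(self, command: str) -> dict:
        if command != "tap":
            # Assumption: an unsupported command models an abnormal execution state.
            raise RuntimeError(f"unsupported command: {command}")
        self.count += 1
        return {"count": self.count}


def toy_agent(state: dict) -> str:
    """Hypothetical agent: taps until the counter reaches 3, then reports completion."""
    return "DONE" if state["count"] >= 3 else "tap"


def run_episode(agent: Callable[[dict], str], env: ToyEnv,
                max_steps: int = 20) -> Tuple[List[Step], str]:
    """Repeat state input, command determination and state update until termination."""
    trajectory: List[Step] = []
    state = env.reset()
    for _ in range(max_steps):
        command = agent(state)            # state input and command determination
        if command == "DONE":             # task completion flag
            return trajectory, "task_complete"
        trajectory.append(Step(state, command))
        try:
            state = env.step(command)     # state update
        except RuntimeError:              # abnormal task execution state
            return trajectory, "abnormal_state"
    return trajectory, "max_steps"        # maximum number of interaction steps


trajectory, reason = run_episode(toy_agent, ToyEnv())
```

The returned termination reason is kept separate from the success judgment, since the evaluation in step S102 is performed on the trajectory afterwards.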

[0038] The target task is a preset interactive task for evaluating the interactive execution capability of the GUI agent to be evaluated. It may include information such as task description, task objective and execution environment, and is applicable to interactive scenarios such as application function operation, interface element interaction and cross-page linkage of the GUI agent.

[0039] Interaction trajectory data is a record of the entire task execution process. It can include attribute information such as element text, resource identifier, class name, package name, bounding box position, and status information of GUI elements involved in the task execution; information such as action type, execution sequence, and associated elements for each interaction action; and environmental state change data such as the update records of the interface screenshots corresponding to each interaction action. GUI elements can be graphical elements such as buttons, icons, labels, and menus.
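One possible way to hold the information listed in [0039] is a small set of record types, sketched below. The class and field names are illustrative assumptions; the disclosure does not prescribe a storage format.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GuiElement:
    text: str
    resource_id: str
    class_name: str
    package_name: str
    bounds: Tuple[int, int, int, int]   # left, top, right, bottom, in pixels
    checked: Optional[bool] = None      # status information, if the element has any


@dataclass
class InteractionAction:
    action_type: str                    # e.g. "click", "input", "select"
    order: int                          # position in the execution sequence
    target: GuiElement                  # element associated with the action
    position: Optional[Tuple[int, int]] = None


@dataclass
class TrajectoryStep:
    action: InteractionAction
    screenshot_path: str                # interface screenshot recorded for this action


@dataclass
class Trajectory:
    task_description: str
    steps: List[TrajectoryStep] = field(default_factory=list)


# Hypothetical example values for the "enable application notifications" task.
button = GuiElement(
    text="Notification settings",
    resource_id="com.example.app:id/notification_settings",
    class_name="android.widget.Button",
    package_name="com.example.app",
    bounds=(40, 600, 1040, 720),
)
trajectory = Trajectory(
    task_description="Enable application notifications",
    steps=[TrajectoryStep(InteractionAction("click", 0, button, (540, 660)), "step_000.png")],
)
record = asdict(trajectory)  # nested dict, convenient for logging or serialization
```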

[0040] Step S102: Based on the predefined task success conditions for the target task, evaluate the interaction trajectory data to determine whether the GUI agent has successfully executed the target task; wherein, the task success conditions are determined based on the interaction elements and / or interaction actions required to successfully execute the target task.

[0041] In this step, the interactive elements and / or interactive actions necessary for successfully executing the target task can be determined in advance by analyzing interaction trajectories that successfully execute the target task and breaking down the execution logic of task completion. Only the core elements directly related to completing the task are retained, and the task success conditions of the target task are defined from them. For example, for the task of enabling application notifications, the retained core elements are the notification settings button element and the action of turning the switch on.

[0042] When evaluating the interaction trajectory data, the core elements can be extracted first, that is, the attribute information and interaction action information of the GUI elements relevant to the task success conditions are selected from the trajectory. Then, using methods such as rule-based validation, feature matching, and temporal consistency verification, the extracted data is compared and analyzed against the predefined task success conditions.

[0043] The result of the GUI agent's execution of the target task is determined based on the comparative analysis results. If the core elements extracted from the interaction trajectory data meet the requirements for task success, the GUI agent is deemed to have successfully executed the target task; otherwise, the GUI agent is deemed to have failed to execute the target task.
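The judgment in [0043] can be illustrated with a path-independent check: the condition looks only for the necessary interaction, so two different routes to the same switch both pass. This is a sketch under assumptions; the dictionary-based step format, the `clicked` helper, and the resource identifiers are hypothetical.

```python
from typing import Callable, List

Trajectory = List[dict]
Condition = Callable[[Trajectory], bool]


def clicked(resource_id: str) -> Condition:
    """Build a condition that holds if any step clicks the given element."""
    def check(trajectory: Trajectory) -> bool:
        return any(
            step["action"] == "click" and step["resource_id"] == resource_id
            for step in trajectory
        )
    return check


def task_succeeded(trajectory: Trajectory, conditions: List[Condition]) -> bool:
    """Success if the trajectory satisfies at least one task success condition."""
    return any(condition(trajectory) for condition in conditions)


conditions = [clicked("notification_switch")]

# Two different interaction paths that both perform the necessary interaction.
via_search = [
    {"action": "click", "resource_id": "search_box"},
    {"action": "input", "resource_id": "search_box"},
    {"action": "click", "resource_id": "notification_switch"},
]
via_menu = [
    {"action": "click", "resource_id": "settings_menu"},
    {"action": "click", "resource_id": "notifications_entry"},
    {"action": "click", "resource_id": "notification_switch"},
]
# A trajectory that never reaches the switch.
abandoned = [{"action": "click", "resource_id": "settings_menu"}]
```

A template-matching evaluator would accept at most one of `via_search` and `via_menu`; here both succeed and `abandoned` fails.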

[0044] During the evaluation process, the evaluation result indicating whether the GUI agent has successfully executed the target task can also be stored together with the interaction trajectory data, the target task, and the task success conditions, so as to provide a reference for subsequent performance optimization of the GUI agent.

[0045] In one embodiment, the task success condition may include at least one of the following sub-conditions: the GUI element in the interaction trajectory data matches the attribute information of the interaction element; the position of the interaction action performed on the GUI element in the interaction trajectory data is within a preset range.

[0046] The attribute information of interactive elements can include text content, resource identifiers, bounding box coordinates, element type, and status indicators. The attribute information of GUI elements in the interaction trajectory data can be obtained by parsing structured interface data, performing element detection and attribute extraction on interface screenshots, and reading application control attribute logs.

[0047] During matching and validation, methods such as regular expression matching, feature vector similarity calculation, and attribute field validation can be used to determine whether the attribute information of GUI elements in the interaction trajectory data matches the attribute information of the predefined interaction elements. For example, it can be verified whether the text attribute of the elderly mode switch element in the trajectory is "elderly mode" and whether its control ID matches the preset value. If the attribute information of the GUI elements in the interaction trajectory data matches the attribute information of the interaction elements, the sub-condition is satisfied.
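The regular-expression variant of attribute matching mentioned in [0047] might look like the sketch below. The `element_matches` helper, the case-insensitive full-match policy, and the resource identifier are assumptions made for illustration.

```python
import re
from typing import Dict


def element_matches(element: Dict[str, str], expected: Dict[str, str]) -> bool:
    """Return True if every expected attribute pattern fully matches the element.

    `expected` maps attribute names to regular expressions. A missing attribute
    counts as a mismatch. Matching is case-insensitive (an assumption).
    """
    for name, pattern in expected.items():
        value = element.get(name)
        if value is None:
            return False
        if re.fullmatch(pattern, value, flags=re.IGNORECASE) is None:
            return False
    return True


# Hypothetical predefined interaction element for the elderly mode switch.
expected_switch = {
    "text": r"elderly mode",
    "resource_id": r"com\.example\.app:id/elder_switch",
}

observed = {"text": "Elderly Mode", "resource_id": "com.example.app:id/elder_switch"}
wrong_id = {"text": "Elderly Mode", "resource_id": "com.example.app:id/font_switch"}
```

`re.fullmatch` is used rather than `re.search` so that a longer label containing the expected text does not satisfy the sub-condition by accident.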

[0048] The preset range can be defined based on attributes such as the bounding box coordinates and element size of the GUI element. For example, the bounding box coverage area and triggerable area of the GUI element can be set as the preset range. The position information of the interactive action can be extracted from the interaction trajectory data by recording the touch coordinates of the terminal device and the position parameters in the interaction command.

[0049] During location verification, methods such as coordinate range comparison and region containment detection can be used to verify whether the execution location of the interactive action is within the preset range. For example, it can be verified whether the coordinates of a click on the "Confirm Order" button fall within the button's bounding box, or whether the start and end positions of a slide gesture cover the slider's sliding area. If the location of the interactive action performed on the GUI element in the interaction trajectory data is within the preset range, the sub-condition is satisfied.
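Both location checks in [0049] reduce to simple coordinate comparisons, sketched here. The `(left, top, right, bottom)` box convention, the inclusive edges, the horizontal-slider assumption, and the `tolerance` parameter are illustrative choices, not details taken from the disclosure.

```python
from typing import Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # left, top, right, bottom


def point_in_bounds(point: Point, bounds: Box) -> bool:
    """Region containment: is the click inside the element's bounding box?"""
    x, y = point
    left, top, right, bottom = bounds
    return left <= x <= right and top <= y <= bottom


def swipe_covers_track(start: Point, end: Point, track: Box, tolerance: int = 10) -> bool:
    """Check that a horizontal swipe spans the slider track from edge to edge.

    Both endpoints must lie within the track's vertical extent, and the swipe
    must reach within `tolerance` pixels of the left and right edges.
    """
    left, top, right, bottom = track
    low_x, high_x = sorted((start[0], end[0]))
    within_height = top <= start[1] <= bottom and top <= end[1] <= bottom
    return within_height and low_x <= left + tolerance and high_x >= right - tolerance


confirm_button: Box = (400, 1750, 680, 1850)   # hypothetical "Confirm Order" button
slider_track: Box = (100, 480, 900, 520)       # hypothetical slider area
```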

[0050] The above sub-conditions can be used individually or in combination according to the evaluation requirements of the target task. For example, for button-type interactive tasks, attribute information matching and position range verification can be used together. For display tasks that only require element recognition, attribute information matching alone can be used, thereby adapting to agent evaluation scenarios for different types of target tasks.

[0051] As mentioned above, by defining the task success condition in terms of the sub-conditions of GUI element attribute matching and / or interactive action position verification, the interaction trajectory can be verified along two dimensions, element recognition accuracy and action execution effectiveness, thereby improving the accuracy of task execution evaluation results.

[0052] The GUI agent evaluation method in this embodiment controls the GUI agent to be evaluated to execute a target task and acquires the generated interaction trajectory data. Based on the predefined task success conditions for the target task, the interaction trajectory data is evaluated to determine whether the GUI agent has successfully executed the target task. Since the task success conditions are determined based on the interactive elements and / or interactive actions triggered by the successful execution of the target task, it can objectively evaluate whether the GUI agent has completed the necessary interaction, avoid misjudgment of the task execution effect of the GUI agent due to differences in the interaction path, and thus improve the accuracy of evaluating the task execution capability of the GUI agent.

[0053] Furthermore, the task description of the target task does not need to provide cumbersome step-by-step instructions. A concise and clear user instruction is sufficient to evaluate the GUI agent's ability to autonomously locate the function entry point and explore the interface interaction path to complete the task. This more closely reflects how real users operate and improves the authenticity and reliability of the evaluation results.

[0054] In the foregoing embodiments, it was described how to control a GUI agent to execute a target task and acquire interaction trajectory data, and then evaluate the GUI agent based on the interaction elements and / or actions required to successfully execute the target task, thereby accurately assessing the GUI agent's task execution capability. In the following embodiments, the evaluation process of the GUI agent will be described in more detail, and can be applied to any of the above embodiments.

[0055] In one embodiment, the target task is executed within the interactive environment of the terminal device. The terminal device is an electronic device with graphical user interface (GUI) interaction capabilities, such as a smartphone, tablet, laptop, desktop computer, smart TV, or in-vehicle smart terminal. The interactive environment of the terminal device is the environment provided by the terminal device for application execution and human-computer interaction, and may include the graphical user interface of the applications installed and running on the terminal device, as well as system resources supporting interaction.

[0056] By calling test interfaces or test tools, the GUI agent to be evaluated can be associated with the interactive environment of the terminal device. This allows the GUI agent to send interactive action commands to the interactive environment and obtain interface state information from the environment. The interactive environment of the terminal device can support the execution of various interactive actions such as touch clicks and text input, adapting to the operational requirements of different types of target tasks.

[0057] The interactive environment is a realistic application runtime environment, capable of fully replicating the execution logic of the target task in actual usage scenarios. For example, in tasks involving network interaction, the interactive environment can achieve data transmission with the server through network connections; in tasks involving local data storage, the interactive environment can support application read and write operations on the terminal's local storage, thereby ensuring that the execution process of the target task is consistent with the actual application scenario.

[0058] As mentioned above, by executing the target task in the real interactive environment of the terminal device, the task execution logic of the GUI agent in the actual user scenario can be restored, avoiding the deviation between the simulated running environment and the real application scenario, ensuring that the evaluation results match the actual running effect of the application, thereby improving the reliability of the evaluation results.

[0059] In one embodiment, to keep the environment consistent across multiple rounds of evaluation and to prevent the execution of a preceding target task from interfering with the execution and evaluation of subsequent tasks, the GUI agent can be controlled to execute a reset task after the target task is completed.

[0060] The initial state refers to the state of the terminal device before starting the target task. It may include the default configuration state of the application, the basic interaction state of the terminal system, and the state of the pre-requisite environment related to the target task.

[0061] A reset task is used to reset a terminal device to its initial state. A reset task can be generated by analyzing the execution logic of the target task, identifying the stages during task execution that change the state of the terminal device, and extracting the interactive actions required to restore the state.

[0062] For example, the GUI agent can be controlled to perform corresponding recovery actions to execute the reset task according to the requirements of the reset task, such as turning off enabled function switches, deleting local data generated during task execution, and returning to the application's initial interface.

[0063] Taking the target task of enabling the elderly mode of the application as an example, the corresponding reset task is the fine-grained inverse task of the target task, which restores the running state of the application from the elderly mode enabled state to the default running mode before the target task was executed, that is, restores it to the initial state.

[0064] For example, the application can be launched first, triggering the GUI agent to click the personal center interactive element in the lower right corner of the application's homepage, which will then navigate to the personal center interface. After entering the personal center interface, the GUI agent will inspect the interface elements and determine whether there is an interactive element with the text attribute set to "close large font" in the upper right corner of the interface.

[0065] If this interactive element is detected, it means that the application is currently in the elderly mode and needs to continue with the subsequent reset operation; if this interactive element is not detected, it means that the application is already in the initial default running mode and does not need to perform a reset action.

[0066] When it is determined that the application has elderly mode enabled, the GUI agent clicks the "close large font" interactive element, which navigates to the elderly mode settings interface. In the settings interface, it locates the interactive element whose text attribute is "disable elderly mode" and clicks it, triggering the elderly mode disabling process.

[0067] After the elderly mode is turned off, the GUI agent returns to the application's home screen, completing the entire reset process and ensuring that the application's running status and interface display status are restored to the initial state before the elderly mode task was executed.
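The reset flow in [0064] to [0067] is conditional: the disabling steps run only when the "close large font" element is detected. A minimal sketch follows, with `FakeApp` as a hypothetical in-memory model of the application. Returning to the home screen even when no reset is needed is an assumption added here so that the application always ends on the same interface.

```python
from typing import List


class FakeApp:
    """Hypothetical app model tracking the current screen and the elderly mode flag."""

    def __init__(self, elderly_mode: bool) -> None:
        self.elderly_mode = elderly_mode
        self.screen = "home"

    def visible_texts(self) -> List[str]:
        if self.screen == "personal_center" and self.elderly_mode:
            return ["close large font"]
        if self.screen == "elderly_settings":
            return ["disable elderly mode"]
        return []

    def click(self, target: str) -> None:
        if target == "personal center":
            self.screen = "personal_center"
        elif target == "close large font":
            self.screen = "elderly_settings"
        elif target == "disable elderly mode":
            self.elderly_mode = False
        elif target == "home":
            self.screen = "home"


def reset_elderly_mode(app: FakeApp) -> List[str]:
    """Restore the default running mode and return the clicks that were performed."""
    clicks: List[str] = []

    def click(target: str) -> None:
        app.click(target)
        clicks.append(target)

    click("personal center")
    if "close large font" in app.visible_texts():      # elderly mode is enabled
        click("close large font")
        if "disable elderly mode" in app.visible_texts():
            click("disable elderly mode")
    click("home")                                       # return to the home screen
    return clicks
```

Because the check precedes the action, running the reset on an application that is already in its default mode performs only the navigation clicks, which keeps the reset safe to call before every evaluation round.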

[0068] As described above, by controlling the GUI agent to execute a reset task, the terminal device can be restored to its initial state before the target task was executed. This prevents state changes made during the execution of the previous target task from interfering with the execution and evaluation of subsequent tasks, thereby improving the stability and repeatability of task execution evaluation.

[0069] In one embodiment, the reset task may include at least one of the following: a task-level reset task and an application-level reset task. The task-level reset task is used to reset the application state corresponding to each target task, which can pinpoint the scope of influence of a single task's state. It is suitable for scenarios where the states of different target tasks are independent in multiple rounds of continuous evaluation, thus avoiding excessive resets.

[0070] The application state corresponding to each target task is state information that changes independently during the execution of the target task and only affects the execution logic of the current task. Examples include temporary functional modules started by a single task, temporary operation records generated during task execution, and temporary interface states. Task-level resets can be achieved by executing reverse task operations or calling state rollback interfaces.

[0071] Application-level reset tasks are used to reset the application state shared by multiple target tasks in an interactive environment. They are suitable for scenarios where the shared state of multiple target tasks has been changed during multiple rounds of evaluation and needs to be uniformly restored to the initial state.

[0072] Shared application state refers to state information that affects the execution logic of multiple target tasks. Examples include global configuration parameters such as application font size and theme mode, as well as basic data shared by multiple tasks, such as shopping cart item data and user preference settings. Application-level task resets can be achieved by overriding global default configurations, clearing shared data caches, or restarting the application.

[0073] In practical applications, based on data such as the task type and state change content of the preceding target task, a task-level reset task may be executed alone, an application-level reset task may be executed alone, or the two types of reset tasks may be combined.

[0074] For example, if the preceding target task only changes the temporary state, only the task-level reset task needs to be executed; if the preceding target task only changes the global shared configuration, only the application-level reset task needs to be executed; if the preceding target task changes both the temporary state and the global shared configuration, both the task-level reset task and the application-level reset task can be executed.
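As a minimal sketch of the selection logic described above, the following Python maps what the preceding target task changed to the reset task or tasks to run. The names `ResetLevel` and `choose_reset` are hypothetical and are not part of this disclosure; how a state change is detected is assumed to be available as two booleans.

```python
from enum import Flag


class ResetLevel(Flag):
    """Hypothetical flags naming the two reset levels."""
    NONE = 0
    TASK = 1         # task-level reset: state private to one target task
    APPLICATION = 2  # application-level reset: state shared by several tasks


def choose_reset(changed_temporary_state: bool,
                 changed_shared_state: bool) -> ResetLevel:
    """Pick the reset task(s) from what the preceding target task changed."""
    level = ResetLevel.NONE
    if changed_temporary_state:
        level |= ResetLevel.TASK
    if changed_shared_state:
        level |= ResetLevel.APPLICATION
    return level
```

Using a flag type lets the combined case be expressed as `ResetLevel.TASK | ResetLevel.APPLICATION` without a separate enumeration member.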

[0075] As mentioned above, by dividing reset tasks into task-level reset tasks and application-level reset tasks, the appropriate reset method can be determined based on the scope of impact of state changes, and different levels of state interference can be eliminated in a targeted manner, thereby improving the accuracy and efficiency of reset operations.

[0076] In one embodiment, for a target task that triggers a persistent change, the success condition is determined based on the interactive elements and / or interactive actions associated with the application state that has undergone a persistent change. The target task that triggers the persistent change is a task whose execution has an irreversible impact on the application's running state. Examples include server-related interactive tasks such as account login, account registration, and order payment, as well as irreversible operations such as deleting account data and canceling user accounts.

[0077] Interactive elements associated with persistently changed application states can include identifiers that indicate the persistent state is in effect, such as the username, user avatar icon, and "login" status text displayed on the user profile page. Interactive actions associated with persistently changed application states can include operations that trigger the persistent state change, such as clicking the login button or confirming the login after submitting the account and password.

[0078] Take the login scenario as an example of a target task that triggers persistent changes: the login state is non-resettable. After the agent completes the login operation, the server persistently stores the user session information, and the login states of the terminal device and the server are bound together. Reset methods such as clearing the local cache and restarting the application cannot clear the session records on the server, and therefore cannot reset the interaction environment to the initial state before login. If validation still relied only on interactive elements and / or interactive actions that reflect temporary feedback, misjudgments would be likely to occur.

[0079] During the evaluation process, the success of the GUI agent in executing the target task can be determined by verifying whether the interaction trajectory data involves the aforementioned related interaction elements and / or actions. For example, for a persistent change task such as login, the evaluation can verify whether the agent clicked the login button and whether the "login" text was detected.

[0080] As mentioned above, for target tasks that trigger persistent changes, determining task success conditions based on interactive elements and / or interactive actions associated with the application state that has undergone persistent changes can avoid misjudgments caused by relying solely on temporary feedback and ensure the accuracy of evaluation results for tasks involving server-side interactions or irreversible operations.

[0081] In one embodiment, when controlling the GUI agent to be evaluated to execute the target task, the application state of the interactive environment can be input to the GUI agent first, so that the GUI agent can determine the control command based on the application state; then, the control command is executed in the interactive environment to update the application state; the above-mentioned state input, command determination and state update steps are repeated until the trajectory termination condition of the target task is reached.

[0082] For example, the application state of the interactive environment can be input into the GUI agent to be evaluated first. The application state is a snapshot of the interactive environment in its current state, which can be obtained through methods such as capturing screenshots of the interface, analyzing structured data of GUI elements, and extracting application logs. The application state can include the visual content of the current interface, the attribute information of all GUI elements within the interface, and the application's running status.

[0083] The GUI agent can determine control commands such as clicks, swipes, and text input that match the current task execution stage based on the input application state, combined with built-in task decision logic and historical interaction trajectory data.

[0084] Subsequently, the control commands generated by the GUI agent are sent to and executed in the interactive environment to update the application state within it. After execution, the interactive environment undergoes corresponding state changes, such as the interface switching to a new page or the state of GUI elements changing. The interactive environment can record the execution results and state change details, providing a data foundation for the state input in the next loop.

[0085] The steps of state input, instruction determination, and state update are repeated to form a continuous dynamic interaction loop. Each loop is based on the application state updated in the previous loop, ensuring that the GUI agent's instruction decisions can adapt to changes in the interaction environment in real time, until the preset trajectory termination condition is reached.

[0086] The trajectory termination condition can be determined by methods such as task completion status recognition and abnormal status monitoring. For example, the GUI agent triggers the task completion flag element, or the interactive environment experiences an abnormal state such as application crash, which prevents the task from continuing.

[0087] When the trajectory termination condition of the target task is triggered, the interactive loop can be stopped, and the application status, control commands and interactive timing data recorded throughout the loop can be integrated to generate complete interactive trajectory data.
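The loop of state input, command determination, and state update can be sketched as follows. This is an illustrative sketch only: `run_episode`, `Trajectory`, the callback signatures, the termination labels, and the toy three-page environment are all assumptions introduced here, and a real agent and interactive environment would replace the lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Recorded (state, command) pairs plus the reason the loop stopped."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    termination: str = ""


def run_episode(get_state: Callable[[], str],
                agent: Callable[[str, List[Tuple[str, str]]], str],
                execute: Callable[[str], None],
                is_done: Callable[[str], bool],
                is_abnormal: Callable[[str], bool],
                max_steps: int = 20) -> Trajectory:
    traj = Trajectory()
    for _ in range(max_steps):
        state = get_state()                  # state input
        if is_done(state):                   # task completion flag reached
            traj.termination = "completed"
            return traj
        if is_abnormal(state):               # e.g. application crash
            traj.termination = "abnormal"
            return traj
        command = agent(state, traj.steps)   # command determination
        execute(command)                     # state update in the environment
        traj.steps.append((state, command))
    traj.termination = "timeout"             # step budget exhausted
    return traj


# Toy environment: three pages reached by successive "click" commands.
pages = ["home", "profile", "done"]
cursor = {"i": 0}


def toy_execute(command: str) -> None:
    if command == "click":
        cursor["i"] += 1


trajectory = run_episode(
    get_state=lambda: pages[cursor["i"]],
    agent=lambda state, history: "click",
    execute=toy_execute,
    is_done=lambda s: s == "done",
    is_abnormal=lambda s: False,
)
```

In this toy run the loop stops when the "done" page is observed, and the recorded steps form the interaction trajectory data that a later evaluation step would consume.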

[0088] As described above, by adopting an interactive loop of state input, instruction determination, and state update, the GUI agent is controlled to execute the target task. This allows the GUI agent to dynamically adjust its decision-making logic based on the dynamic changes in the interactive environment, ensuring that the generated interactive trajectory data can more completely reflect the agent's task execution capabilities.

[0089] In one embodiment, to adapt to the diversity of paths for GUI agents to execute target tasks in an interactive environment, each target task corresponds to at least one task success condition. Each task success condition corresponds to an interactive path for successfully executing the target task, and each task success condition differs in the combination of interactive elements and / or interactive actions required for successful task execution, as well as the order in which elements are triggered.

[0090] When evaluating the interaction trajectory data based on task success conditions to determine the evaluation result, if the interaction trajectory data meets any task success condition, it is determined that the GUI agent has successfully executed the target task; if the interaction trajectory data does not meet any task success condition, it is determined that the GUI agent has not successfully executed the target task.

[0091] For example, information such as attribute information of GUI elements related to task success conditions, execution content and timing of interactive actions, and changes in application state can be extracted from the interaction trajectory data. The extracted information is then compared with each task success condition corresponding to the target task to verify whether the trajectory data meets the constraints of a single task success condition.

[0092] If the interaction trajectory data satisfies any of the task success conditions, the GUI agent is determined to have successfully executed the target task; if the interaction trajectory data does not satisfy any of the task success conditions corresponding to the target task, the GUI agent is determined to have failed to execute the target task.

[0093] Please refer to Figure 2, which illustrates two interaction paths for a single target task. Taking the target task of changing the user's avatar in application A as an example, Figure 2 shows two valid interaction paths for successfully executing the task: the direct image replacement path at the top of the figure and the data editing and image replacement path at the bottom.

[0094] For example, the interaction logic corresponding to the upper path is as follows: the GUI agent clicks application A on the desktop page of the terminal device to enter the home page of application A, clicks the "My" button in the bottom navigation bar to enter the personal page, clicks the user avatar to enter the user space page, clicks the user avatar again to enter the avatar editing page, and then clicks the "Change Image" button and performs the image saving action to complete the avatar change.

[0095] The interactive elements that define the success conditions for the task corresponding to this path may include a "Change Image" button and a "Save Image" button; the interactive actions may include clicking the "Change Image" button and clicking the "Save Image" button.

[0096] The interaction logic corresponding to the lower path is as follows: the GUI agent clicks application A on the desktop page of the terminal device to enter the home page of application A, clicks the "My" button in the bottom navigation bar to enter the personal page, clicks the "User Space" entry to enter the user space page, clicks the "Edit Profile" button to enter the profile editing page, clicks the user avatar to trigger the image selection pop-up window, and selects the random option to complete the avatar change.

[0097] The interactive elements that define the success conditions for the task corresponding to this path may include the user's avatar and the random option; the interactive actions may include clicking the user's avatar and selecting the random option.

[0098] During evaluation, core information such as the attribute information of GUI elements and the execution content and timing of interactive actions can be extracted from the interaction trajectory data and then compared with the two task success conditions mentioned above. If the interaction trajectory data meets either task success condition, the GUI agent is determined to have successfully executed the target task; if the trajectory data meets neither task success condition, for example because the user's avatar was clicked without triggering an image change action, the GUI agent is determined to have failed to execute the target task.

[0099] As described above, by assigning at least one success condition to each target task and determining that the task is successfully executed if any condition is met, it can adapt to multiple effective interaction paths for the GUI agent to complete the task, avoid misjudgment caused by relying on a single fixed trajectory template, and thus more realistically reflect the generalization ability and decision-making ability of the GUI agent.

[0100] To further explain the evaluation process of GUI agents, Figure 3 shows a flowchart of another method for evaluating GUI agents. This method may include the following steps: Step S301: Input the application state of the interactive environment into the GUI agent.

[0101] Please refer to Figure 4, which shows a schematic diagram of an evaluation process for a GUI agent. By collecting screenshots of the interface and parsing the Extensible Markup Language (XML) structured documents of the interface elements, multimodal data such as the visual content of the current interface and the attribute information of the GUI elements are obtained to determine the application state of the interactive environment.
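To illustrate how element attributes might be read from such an XML document, the following sketch parses a small hand-written UI dump with Python's standard `xml.etree.ElementTree`. The document layout, the attribute names (`text`, `resource-id`, `package`, `bounds`), the `[left,top][right,bottom]` bounds format, and the helper `parse_elements` are assumptions made for illustration; the disclosure does not fix a concrete schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical UI dump; attribute names and the bounds format are assumptions.
UI_XML = """
<hierarchy>
  <node text="My" resource-id="com.example.a:id/tab_my"
        package="com.example.a" bounds="[0,1800][270,1920]"/>
  <node text="Change Image" resource-id="com.example.a:id/btn_change"
        package="com.example.a" bounds="[100,900][980,1020]"/>
</hierarchy>
"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def parse_elements(xml_text):
    """Return one dict per GUI element with text, id, package and bounding box."""
    elements = []
    for node in ET.fromstring(xml_text).iter("node"):
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        bbox = tuple(int(v) for v in match.groups()) if match else None
        elements.append({
            "text": node.get("text", ""),
            "resource_id": node.get("resource-id", ""),
            "package": node.get("package", ""),
            "bbox": bbox,  # (left, top, right, bottom), or None if unparsable
        })
    return elements


elements = parse_elements(UI_XML)
```

The resulting list of dictionaries is one possible unified representation of the element attribute part of the application state; the screenshot part is not covered by this sketch.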

[0102] The application state is formatted to obtain input data in a unified format. This input data, along with historical data from the GUI agent, is then fed into the GUI agent, enabling it to generate output data based on the current environmental state and historical decision-making experience. The output data is then formatted to generate control instructions adapted to the current task stage.

[0103] Step S302: In the interactive environment of the terminal device, execute the control commands output by the GUI agent.

[0104] In this step, the generated control commands are sent to the interactive environment of the terminal device and executed within the application. After the commands are executed, the application state of the interactive environment changes accordingly, such as the interface switching to a new page or the state of GUI elements changing. The interactive environment then converts the new application state to obtain the input data for the next round.

[0105] Steps S301 and S302 above are repeated, performing the above state input, instruction determination, and state update steps, until the trajectory termination condition of the target task is reached.

[0106] Step S303: Obtain the generated interaction trajectory data.

[0107] In this step, based on the application status and control command data of each round of interaction, interaction trajectory data of the target task execution process is generated.

[0108] Step S304: Evaluate the interaction trajectory data based on the task success conditions.

[0109] In this step, to ensure the standardization and scalability of the task success conditions, rule-based languages such as XML Path Language (XPath) can be used to define the interactive elements and / or interactive actions required to successfully execute the target task.

[0110] For example, attribute information such as the text, resource ID, package name, and bounding box of each GUI element can be extracted from the XML structured document to construct sub-conditions for task success. For instance, the rule `contains(@package, "application package name")` confirms that the current page belongs to the target application, and `text="interactive element text"` verifies that the page contains the interactive elements required for the task, such as the user avatar and the image change button in the avatar changing task of Figure 2.
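The two rules above can be approximated with Python's standard library, as in the following sketch. Note that `xml.etree.ElementTree` implements only a limited XPath subset that does not include the `contains()` function, so the package rule is emulated with a plain substring test, while the text rule uses an attribute predicate. The sample document and the helper names `page_belongs_to` and `page_has_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page dump used only for illustration.
PAGE_XML = """
<hierarchy>
  <node text="Change Image" package="com.example.a" />
  <node text="Save Image" package="com.example.a" />
</hierarchy>
"""


def page_belongs_to(root, package_fragment):
    """Emulates contains(@package, "..."): some node's package has the fragment."""
    return any(package_fragment in node.get("package", "")
               for node in root.iter("node"))


def page_has_text(root, text):
    """Attribute predicate equivalent to @text="..." (text must not contain ')."""
    return root.find(f".//node[@text='{text}']") is not None


root = ET.fromstring(PAGE_XML)
```

A full XPath engine would allow the rules to be stored and evaluated as XPath strings directly; the substring emulation here only keeps the sketch dependency-free.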

[0111] In addition, it can be verified that the execution location of an interactive action is within the bounding box of the target element. For example, in Figure 2, the click coordinates on the user's avatar must be within the bounding box of the avatar element, and the click coordinates on the random option must be within the bounding box of the random option.
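The position check amounts to a point-in-rectangle test. In the sketch below, the `(left, top, right, bottom)` tuple layout, the inclusive treatment of the edges, and the name `click_hits` are assumptions.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom), assumed layout


def click_hits(point: Tuple[int, int], bbox: BBox) -> bool:
    """True if the click coordinates fall inside the box, edges included."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom
```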

[0112] Each task success condition consists of a combination of several sub-conditions, and each target task corresponds to at least one task success condition. For example, the avatar changing task in Figure 2 corresponds to two interaction paths, namely directly changing the image and changing the image through data editing, which correspond to two task success conditions.

[0113] For the direct image replacement path, the task success condition can be defined as: Sub-condition 1: the text contains the image change label, such as "Change Image"; Sub-condition 2: the click action location is within the bounding box of the image change element; Sub-condition 3: the target application package name is matched.

[0114] For the data editing and image replacement path, the task success condition can be defined as: Sub-condition 1: the text contains "random"; Sub-condition 2: the click action location is within the bounding box of the random option element; Sub-condition 3: the target application package name is matched.

[0115] During evaluation, attribute information of GUI elements and interactive actions performed on GUI elements can be extracted from the interaction trajectory data and compared with the sub-conditions of each task success condition. If the interaction trajectory data satisfies all sub-conditions of any one task success condition, the target task is determined to have been successfully executed; if no task success condition has all of its sub-conditions satisfied by the interaction trajectory data, the target task is determined to have failed.
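The "all sub-conditions of any one success condition" rule can be sketched as follows. This sketch assumes that each recorded step is a dictionary holding the clicked element's text, package, bounding box, and the click point, and that the three sub-conditions of a success condition are checked against the same step; the disclosure does not prescribe either choice. The names `make_condition`, `condition_met`, and `task_succeeded`, as well as the sample values, are hypothetical.

```python
from typing import Callable, Dict, List

Step = Dict[str, object]              # one recorded interaction step (assumed shape)
SubCondition = Callable[[Step], bool]


def make_condition(text_fragment: str, package: str) -> List[SubCondition]:
    """Build one success condition made of three sub-conditions."""
    def text_matches(step):
        return text_fragment in step["text"]

    def click_in_bbox(step):
        x, y = step["point"]
        left, top, right, bottom = step["bbox"]
        return left <= x <= right and top <= y <= bottom

    def package_matches(step):
        return package in step["package"]

    return [text_matches, click_in_bbox, package_matches]


def condition_met(trajectory: List[Step], subs: List[SubCondition]) -> bool:
    # Assumption: some single step must satisfy every sub-condition.
    return any(all(sub(step) for sub in subs) for step in trajectory)


def task_succeeded(trajectory: List[Step],
                   conditions: List[List[SubCondition]]) -> bool:
    # Success if ANY success condition has ALL of its sub-conditions satisfied.
    return any(condition_met(trajectory, subs) for subs in conditions)


CONDITIONS = [
    make_condition("Change Image", "com.example.a"),  # direct replacement path
    make_condition("Random", "com.example.a"),        # data editing path
]

sample_trajectory = [
    {"package": "com.example.a", "text": "My",
     "point": (100, 1850), "bbox": (0, 1800, 270, 1920)},
    {"package": "com.example.a", "text": "Random",
     "point": (500, 700), "bbox": (400, 650, 680, 760)},
]
```

With the sample data, `task_succeeded(sample_trajectory, CONDITIONS)` holds because the second step satisfies every sub-condition of the second success condition, even though the first success condition is never met.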

[0116] After determining whether the target task was executed successfully or failed, a final evaluation result can be generated by also considering the interaction termination reason, such as the triggering of the task completion marker element or a timeout. For example, if the interaction trajectory data meets any task success condition and the interaction termination reason is that the task completion marker element was triggered, the final evaluation result can be determined as task execution success.

[0117] When the interactive environment experiences abnormal states such as application crashes or UI freezes, preventing the task from continuing, the reason for the interaction termination is an abnormal interruption of task execution, and the final evaluation result can be determined as premature termination. When the number of interaction loops reaches the preset maximum step threshold, but the task completion flag or abnormal state is still not triggered, the reason for the interaction termination is a task execution timeout, and the final evaluation result can be determined as timeout termination.

[0118] When the interaction trajectory data does not meet any of the task success conditions and does not trigger early termination or timeout termination, the reasons for interaction termination may include missing interaction elements, deviation of interaction action position, failure of interaction action, etc., and the final evaluation result can be determined as task execution failure.
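The four final evaluation results can be summarized in a small decision function. This is a sketch under stated assumptions: the termination labels `"completed"`, `"abnormal"`, and `"timeout"` and the function name `final_result` are hypothetical, and checking the termination reason before the success condition is one possible ordering that the text above does not mandate.

```python
def final_result(condition_met: bool, termination: str) -> str:
    """Combine the success-condition check with the termination reason."""
    if termination == "abnormal":   # crash or UI freeze interrupted the task
        return "premature termination"
    if termination == "timeout":    # maximum step threshold reached
        return "timeout termination"
    if condition_met and termination == "completed":
        return "task execution success"
    # No success condition met and no early or timeout termination.
    return "task execution failure"
```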

[0119] Step S305: Control the GUI agent to execute the reset task.

[0120] In this step, to avoid the state changes of preceding tasks interfering with the execution and evaluation of subsequent tasks, a corresponding reset task can be determined and executed based on the execution result of the target task and the reason for the termination of the interaction. For example, for a state change specific to a single task, the initial state can be restored by executing a task-level reset task that performs the reverse task operations; for a state change shared by multiple tasks, the initial state can be restored by executing an application-level reset task through methods such as restarting the application or restoring default configurations.

[0121] The execution logic of the reset task is consistent with that of the target task. Through an interactive loop of status input, instruction generation and status update, the interactive environment of the terminal device is restored to the initial state before the target task is executed, ensuring that subsequent multi-round task evaluations are carried out in a consistent interactive environment.

[0122] Through the coordinated execution of the above steps, the GUI agent evaluation method of this scheme has been tested and verified in practice. For example, Table 1 shows the evaluation results for different GUI agents. Overall tasks comprise all target tasks, representing the comprehensive performance of the GUI agent across all scenarios and reflecting its overall task execution capability. High-frequency tasks are target tasks in scenarios frequently used by users, reflecting the GUI agent's task execution capability in popular scenarios. Low-frequency tasks are target tasks in scenarios that are used less frequently, i.e., long-tail tasks, reflecting the GUI agent's generalization ability and its ability to handle less popular scenarios.

[0123] Agent 1, evaluated using the proposed method, exhibits the best performance in terms of task success rate, sub-condition success rate, and step size ratio. The closer the step size ratio is to 1, the higher the task execution efficiency, indicating that the agent is less likely to get stuck in loops or meaningless exploration and that its generated interaction trajectory is closer to the standard path. This reflects the ability of the proposed evaluation method to capture the true task execution capabilities of the GUI agent.

[0124] Although Agent 2 can generate appropriate task intentions, its task success rate is less than 5% due to insufficient element localization ability. Although Agent 3 has weak element localization ability, it can perceive state changes through multiple image inputs, which helps it correct errors and recover from infinite-loop states; this method can also accurately identify the impact of this characteristic on the task success rate. Agent 4 has a significantly improved sub-condition success rate, but its overall success rate is limited by its inability to judge when the task should terminate. This further demonstrates the ability of this evaluation method to distinguish the performance of agents.

[0125] For example, Table 2 shows the consistency verification results between the evaluation method of this scheme and manual evaluation, where the quantity corresponds to the number of samples for each type of judgment result. True Positive (TP) is the number of samples judged as successful by both automatic evaluation and manual evaluation. False Positive (FP) is the number of samples judged as successful by automatic evaluation but unsuccessful by manual evaluation. False Negative (FN) is the number of samples judged as unsuccessful by automatic evaluation but successful by manual evaluation. True Negative (TN) is the number of samples judged as unsuccessful by both automatic evaluation and manual evaluation.

[0126] Automatic evaluation refers to the percentage of successful task execution calculated using the evaluation framework of this scheme, while manual evaluation refers to the percentage of successful task execution verified by humans. The accuracy rate of automatic evaluation is the percentage of automatic evaluation results that are consistent with manual evaluation results.
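As a worked illustration of the accuracy rate, the sketch below computes agreement between automatic and manual evaluation from the four confusion-matrix counts, using the standard convention that a false positive is a sample judged successful automatically but unsuccessful manually. The function names are hypothetical, and the counts in the usage note are invented for illustration only; they are not the values of Table 2.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of samples where automatic and manual evaluation agree."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of manually failed samples that were automatically marked successful."""
    return fp / (fp + tn) if (fp + tn) else 0.0
```

For instance, with made-up counts TP = 6, FP = 0, FN = 2, TN = 12, `accuracy(6, 0, 2, 12)` is 18 / 20 = 0.9 and `false_positive_rate(0, 12)` is 0.0.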

[0127] The overall accuracy of the evaluation framework in this scheme is approximately 95%. The false positive rate is 0, indicating that the scheme did not misclassify interactions deemed failures by humans as successes, demonstrating high reliability. The occasional false negatives suggest that the automatic evaluation framework's estimation of agent performance is conservative, further ensuring the rigor of the evaluation results. These validation results demonstrate that the automatic evaluation framework of this scheme can accurately label and evaluate GUI interaction trajectories, thereby supporting large-scale and efficient evaluation of GUI agents.

[0128] To verify the effectiveness of the reset task, the execution results were verified through both automated and manual evaluation. Table 3, for example, shows the manual evaluation results of the reset task mechanism: In 230 fine-grained reset tasks, the success rate of manual evaluation reached 94.07%, and the success rate of automatic evaluation was close to that of manual evaluation. This result verifies that the hierarchical design of task-level reset tasks and application-level reset tasks in this solution can effectively restore the terminal device to its initial state, avoid interference from the state changes of previous tasks on subsequent evaluations, and improve the reliability of multi-round evaluation results.

[0129] In summary, the GUI agent evaluation method of this disclosure, through rule-based definition of task success conditions, evaluation logic compatible with multiple interaction paths, and a precise task reset mechanism, can not only objectively and accurately evaluate the task execution capability of GUI agents, but also support large-scale, multi-round agent performance verification, providing a foundation for the algorithm optimization and iteration of GUI agents.

[0130] Corresponding to the embodiments of the foregoing methods, this disclosure also provides embodiments of the apparatus and the terminal to which it is applied.

[0131] A second aspect of this disclosure provides an evaluation apparatus for a GUI agent. Referring to Figure 5, the device includes: an interaction module 501, used to control the GUI agent to be evaluated to perform the target task and to acquire the generated interaction trajectory data; and an evaluation module 502, used to evaluate the interaction trajectory data based on predefined task success conditions for the target task, so as to determine whether the GUI agent has successfully executed the target task; wherein the task success conditions are determined based on the interactive elements and / or interactive actions required to successfully execute the target task.

[0132] Optionally, the target task is executed in the interactive environment of the terminal device.

[0133] Optionally, the reset task includes at least one of the following: a task-level reset task, used to reset the application state corresponding to each target task; and an application-level reset task, used to reset the application state shared by multiple target tasks in the interactive environment.

[0134] Optionally, the device further includes: a reset module, used to control the GUI agent to execute a reset task, where the reset task is used to reset the terminal device to its initial state.

[0135] Optionally, for the target task that triggers the persistent change, the task success condition is determined based on the interaction elements and / or interaction actions associated with the application state that has been persistently changed.

[0136] Optionally, the interaction module 501 includes: an input module, used to input the application state of the interactive environment to the GUI agent, so that the GUI agent can determine control commands based on the application state; an update module, used to execute the control commands in the interactive environment to update the application state; and a repeat execution module, used to repeatedly execute the above-mentioned state input, command determination, and state update steps until the trajectory termination condition of the target task is met.

[0137] Optionally, each target task corresponds to at least one task success condition; the evaluation module 502 includes: a first evaluation module, used to determine that the GUI agent has successfully executed the target task in response to the interaction trajectory data satisfying any task success condition; and a second evaluation module, used to determine that the GUI agent has failed to execute the target task in response to the interaction trajectory data satisfying none of the task success conditions.

[0138] Optionally, the task success condition includes at least one of the following sub-conditions: a GUI element in the interaction trajectory data matches the attribute information of the interaction element; the location in the interaction trajectory data at which the interactive action is performed on the GUI element is within a preset range.

[0139] The specific implementation process of the functions and roles of each module in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0140] A third aspect of this disclosure provides a computer program product including a computer program / instructions that, when executed by a processor, implement the method described in the first aspect.

[0141] For the device embodiments and computer program product embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. Furthermore, the device embodiments described above are merely illustrative; the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, i.e., they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this disclosure according to actual needs. Those skilled in the art can understand and implement this without any inventive effort.

[0142] In a fourth aspect, embodiments of the GUI agent evaluation apparatus provided in this disclosure can be applied to electronic devices. Referring to Figure 6, which illustrates a hardware schematic of an electronic device, device 600 may be, for example, a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.

[0143] Device 600 may include one or more of the following components: processing component 601, memory 602, power supply component 603, multimedia component 604, audio component 605, input / output (I / O) interface 606, sensor component 607, and communication component 608.

[0144] Processing component 601 typically controls the overall operation of device 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 601 may include one or more processors 609 to execute instructions to perform all or part of the steps of the methods described above. Furthermore, processing component 601 may include one or more modules to facilitate interaction between processing component 601 and other components. For example, processing component 601 may include a multimedia module to facilitate interaction between multimedia component 604 and processing component 601.

[0145] Memory 602 is configured to store various types of data to support the operation of device 600. Examples of this data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, etc. Memory 602 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0146] Power supply component 603 provides power to various components of device 600. Power supply component 603 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 600.

[0147] Multimedia component 604 includes a screen that provides an output interface between the device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 604 includes a front-facing camera and / or a rear-facing camera. When the device 600 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

[0148] Audio component 605 is configured to output and / or input audio signals. For example, audio component 605 includes a microphone (MIC) configured to receive external audio signals when device 600 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 602 or transmitted via communication component 608. In some embodiments, audio component 605 also includes a speaker for outputting audio signals.

[0149] Input / output (I / O) interface 606 provides an interface between processing component 601 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.

[0150] Sensor assembly 607 includes one or more sensors for providing state assessments of various aspects of device 600. For example, sensor assembly 607 may detect the on / off state of device 600, the relative positioning of components such as the display and keypad of device 600, changes in the position of device 600 or a component of device 600, the presence or absence of user contact with device 600, the orientation or acceleration / deceleration of device 600, and temperature changes of device 600. Sensor assembly 607 may also include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 607 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.

[0151] Communication component 608 is configured to facilitate wired or wireless communication between device 600 and other devices. Device 600 can access wireless networks based on communication standards, such as WiFi, 2G or 3G, 4G or 5G, or combinations thereof. In one exemplary embodiment, communication component 608 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 608 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0152] In an exemplary embodiment, device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the GUI agent evaluation method described above.

[0153] Fifthly, in exemplary embodiments, this disclosure also provides a non-transitory computer-readable storage medium including instructions, such as memory 602 including instructions, which can be executed by processor 609 of device 600 to perform the GUI agent evaluation method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

[0154] The foregoing has described specific embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0155] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow its general principles and include common knowledge or customary techniques in the art that are not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.

[0156] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.

[0157] The above description is merely a preferred embodiment of this disclosure and is not intended to limit this disclosure. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A method for evaluating a graphical user interface (GUI) agent, characterized in that the method comprises: controlling a GUI agent to be evaluated to perform a target task, and acquiring the generated interaction trajectory data; and evaluating the interaction trajectory data based on a task success condition predefined for the target task, to determine whether the GUI agent has successfully performed the target task; wherein the task success condition is determined based on the interaction elements and / or interaction actions required to successfully perform the target task.

2. The method according to claim 1, characterized in that the target task is executed in an interactive environment of a terminal device.

3. The method according to claim 2, characterized in that the method further comprises: controlling the GUI agent to execute a reset task, the reset task being used to reset the terminal device to an initial state.

4. The method according to claim 3, characterized in that the reset task includes at least one of the following: a task-level reset task, used to reset the application state corresponding to each target task; an application-level reset task, used to reset the application state shared by multiple target tasks in the interactive environment.

5. The method according to claim 2, characterized in that, for a target task that triggers a persistent change, the task success condition is determined based on the interaction elements and / or interaction actions associated with the persistently changed application state.

6. The method according to claim 2, characterized in that controlling the GUI agent to be evaluated to perform the target task comprises: inputting the application state of the interactive environment to the GUI agent, so that the GUI agent determines a control instruction based on the application state; executing the control instruction in the interactive environment to update the application state; and repeating the above steps of state input, instruction determination, and state update until a trajectory termination condition of the target task is met.

7. The method according to claim 1, characterized in that each target task corresponds to at least one task success condition; and evaluating the interaction trajectory data based on the task success condition predefined for the target task, to determine whether the GUI agent has successfully performed the target task, comprises: in response to the interaction trajectory data satisfying any task success condition, determining that the GUI agent has successfully performed the target task; in response to the interaction trajectory data satisfying none of the task success conditions, determining that the GUI agent has failed to perform the target task.

8. The method according to claim 1, characterized in that the task success condition includes at least one of the following sub-conditions: a GUI element in the interaction trajectory data matches the attribute information of the interaction element; the position in the interaction trajectory data at which the interaction action is performed on the GUI element is within a preset range.

9. A device for evaluating a GUI agent, characterized in that the device comprises: an interaction module, used to control a GUI agent to be evaluated to perform a target task and to acquire the generated interaction trajectory data; and an evaluation module, used to evaluate the interaction trajectory data based on a task success condition predefined for the target task, so as to determine whether the GUI agent has successfully performed the target task; wherein the task success condition is determined based on the interaction elements and / or interaction actions required to successfully perform the target task.

10. The device according to claim 9, characterized in that the target task is executed in an interactive environment of a terminal device.

11. The device according to claim 10, characterized in that the device further comprises: a reset module, used to control the GUI agent to execute a reset task, the reset task being used to reset the terminal device to an initial state.

12. The device according to claim 10, characterized in that the interaction module comprises: an input module, used to input the application state of the interactive environment to the GUI agent, so that the GUI agent determines a control instruction based on the application state; an update module, used to execute the control instruction in the interactive environment to update the application state; and a repeated-execution module, used to repeat the above state input, instruction determination, and state update steps until a trajectory termination condition of the target task is met.

13. A computer program product having a computer program / instructions stored thereon, characterized in that the computer program / instructions, when executed by a processor, implement the method according to any one of claims 1-8.

14. An electronic device, characterized in that it comprises: a processor; and a memory used to store processor-executable instructions; wherein the processor implements the method according to any one of claims 1-8 by executing the executable instructions.

15. A computer-readable storage medium having computer instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the method according to any one of claims 1-8.
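
Illustrative note (not part of the claims): the interaction loop of claim 6 and the success-condition check of claims 7 and 8 might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The names `Step`, `SuccessCondition`, `run_episode`, and `evaluate`, the dictionary encoding of element attributes, the rectangular "preset range", and the step-budget termination condition are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded interaction in the trajectory (hypothetical format)."""
    action: str        # e.g. "tap"
    element: dict      # attributes of the GUI element acted on
    position: tuple    # (x, y) screen coordinates of the action


@dataclass
class SuccessCondition:
    """Hypothetical encoding of one predefined task success condition."""
    action: str
    element_attrs: dict              # attributes the element must carry
    region: Optional[tuple] = None   # (left, top, right, bottom), optional

    def matched_by(self, step: Step) -> bool:
        if step.action != self.action:
            return False
        # Sub-condition: element attributes match the required ones.
        if any(step.element.get(k) != v for k, v in self.element_attrs.items()):
            return False
        # Sub-condition: action position lies within the preset range.
        if self.region is not None:
            left, top, right, bottom = self.region
            x, y = step.position
            if not (left <= x <= right and top <= y <= bottom):
                return False
        return True


def run_episode(agent, execute, state, max_steps=20):
    """Feed state to the agent, execute its command, update state, repeat."""
    trajectory = []
    for _ in range(max_steps):        # assumed termination: step budget
        step = agent(state)
        if step is None:              # assumed termination: agent stops
            break
        state = execute(state, step)
        trajectory.append(step)
    return trajectory


def evaluate(trajectory, conditions):
    """Success if any step satisfies any of the task's success conditions."""
    return any(c.matched_by(s) for c in conditions for s in trajectory)


# Toy usage: a scripted agent taps "Save" once, then stops.
def scripted_agent(state):
    if state["saved"]:
        return None
    return Step("tap", {"text": "Save", "class": "Button"}, (540, 1800))


def execute(state, step):
    if step.action == "tap" and step.element.get("text") == "Save":
        return {**state, "saved": True}
    return state


trajectory = run_episode(scripted_agent, execute, {"saved": False})
conditions = [SuccessCondition("tap", {"text": "Save"}, region=(400, 1700, 700, 1900))]
print(evaluate(trajectory, conditions))
```

Because the check looks only at which element was acted on and where, and not at the full sequence of steps, two agents that reach the "Save" button by different routes would be judged the same way in this sketch.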