A task difficulty assessment method, device, equipment, medium and program product

By combining a multi-dimensional evaluation model with real-time computing power data, the application addresses the lack of GUI task difficulty assessment in existing technologies, aiming at efficient and stable task execution with an improved success rate.

CN122221131A · Pending · Publication Date: 2026-06-16 · FANXING INTELLIGENT COMPUTING TECHNOLOGY (BEIJING) CO LTD +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
FANXING INTELLIGENT COMPUTING TECHNOLOGY (BEIJING) CO LTD
Filing Date
2026-02-04
Publication Date
2026-06-16

Abstract

Embodiments of the present application provide a task difficulty evaluation method, device, equipment, medium and program product, belonging to the technical field of computers. The method comprises: acquiring multi-modal data related to a task and standardizing the multi-modal data to obtain task description parameter data; extracting a task instruction based on the task description parameter data and calculating basic parameters of multi-dimensional indicators, the dimensions including decision complexity, application span, operation complexity, interface variability and intent ambiguity; calculating the difficulty value of each dimension through a pre-constructed multi-dimensional evaluation model based on the basic indicator parameters of the corresponding dimension; and evaluating task difficulty from the difficulty values of all dimensions using a preset fusion model to obtain a task difficulty evaluation result for the task instruction. The technical solution derives a comprehensive evaluation result from the difficulty of the five dimensions, provides a basis for subsequent resource scheduling, and improves the task execution success rate of the GUI intelligent agent.

Description

Technical Field

[0001] This application belongs to the field of computer technology, and in particular relates to a method, apparatus, device, medium and program product for assessing task difficulty.

Background Technology

[0002] With the rapid development of artificial intelligence technology, how to achieve autonomous task execution through intelligent agents is a technical issue of great concern to those skilled in the art.

[0003] Existing technologies have conducted in-depth research on the automatic task execution of intelligent agents. For example, some solutions improve the agent's decision-making strategy by refining reinforcement learning algorithms, thereby enhancing its adaptive execution capability in different environments. Other solutions optimize the construction of intelligent agent models to improve their adaptability in multi-version application scenarios. While these technologies have improved the automatic task execution effect to some extent in specific scenarios, they still have significant limitations. For instance, most solutions lack a difficulty assessment system for tasks that can support automated execution, leading to insufficient allocation of execution resources and ineffective task execution. This results in problems such as execution failure or excessive latency for complex tasks. Furthermore, some solutions focus on model adaptation during the training phase and lack an online scheduling mechanism based on task difficulty during the runtime phase. This makes it difficult to cope with interface differences across devices and versions, resulting in significant deficiencies in the stability and adaptability of automatic agent execution, failing to meet the automated interaction needs in practical applications.

Summary of the Invention

[0004] This application provides a task difficulty assessment method, apparatus, device, medium, and program product. It aims to solve the technical problem that existing solutions lack a systematic difficulty assessment system adapted to the characteristics of GUI tasks, so that online resource scheduling based on task difficulty cannot be performed during the runtime phase, which leads to a low task execution success rate and high latency for intelligent agents. It also aims to solve the technical problem of insufficient adaptability in cross-device and cross-version scenarios.

[0005] In a first aspect, embodiments of this application provide a method for assessing task difficulty, the method comprising:
acquiring multimodal data related to the task, and standardizing the multimodal data to obtain task description parameter data;
extracting task instructions based on the task description parameter data, and calculating basic parameters of multi-dimensional indicators, wherein the dimensions include decision complexity, application span, operation complexity, interface variability, and intent ambiguity;
calculating the difficulty value of each dimension through a pre-built multi-dimensional evaluation model, based on the basic indicator parameters of the corresponding dimension; and
evaluating the task difficulty from the difficulty values of all dimensions using a preset fusion model, to obtain the task difficulty assessment result of the task instruction.

[0006] In some feasible embodiments, after obtaining the task difficulty assessment result of the task instruction, the method further includes: Obtain real-time computing power data from the task execution device; Based on the task difficulty assessment results and the real-time computing power data, the computing power data of the task instructions is dynamically configured.

[0007] In some feasible embodiments: The basic parameters of the decision complexity index include one or more of the following: depth of the task action graph, average branch factor, number of explicit condition judgments, and historical failure rollback ratio. The basic parameters of the application span index include one or more of the following: number of applications involved, number of application switches, and number of cross-application data transfer events. The basic parameters of the operation complexity index include one or more of the following: number of operation steps, number of input characters, operation type weight, and single-step error rate. The basic parameters of the interface variability index include one or more of the following: similarity of the interactive interface tree structure, pixel similarity, and average displacement of elements in the interface. The basic parameters of the intent ambiguity index include one or more of the following: the maximum similarity between the instruction and the preset instruction template, the number of candidate intents, and the perplexity of the language model.

[0008] In some feasible embodiments, the multidimensional evaluation model includes a decision complexity sub-model; The process of calculating the decision complexity difficulty value through the decision complexity sub-model includes: Obtain the basic parameters for determining decision complexity; The depth, average branch factor, and number of explicit conditional judgments of the task action graph are normalized to obtain the normalized results of each index of decision complexity. The normalized results of each indicator of decision complexity are input into the decision complexity sub-model. The decision complexity sub-model performs a weighted summation on the normalized results of each indicator of decision complexity to obtain a weighted summation result. The weighted summation result is then compressed to obtain the decision complexity difficulty value.
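
The normalize, weight, and compress sequence described for the decision complexity sub-model can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the preset maxima, the weights, and the choice of a sigmoid as the compression function are all assumptions, since the text gives no values and does not name the compression function.

```python
import math

# Hypothetical preset maxima and weights; the source text gives no values.
PRESET_MAX = {"depth": 30.0, "branch_factor": 5.0, "conditions": 10.0}
WEIGHTS = {"depth": 0.4, "branch_factor": 0.3, "conditions": 0.3}


def normalize(value: float, max_value: float) -> float:
    """Scale a raw indicator to [0, 1] by a preset maximum, clipping overflow."""
    return min(max(value / max_value, 0.0), 1.0)


def decision_complexity(depth: float, branch_factor: float, conditions: float) -> float:
    """Weighted sum of normalized indicators, compressed into (0, 1)."""
    raw = {"depth": depth, "branch_factor": branch_factor, "conditions": conditions}
    weighted_sum = sum(WEIGHTS[k] * normalize(raw[k], PRESET_MAX[k]) for k in raw)
    # Compression step: a sigmoid centred at 0.5 with slope 6 (assumed choice).
    return 1.0 / (1.0 + math.exp(-6.0 * (weighted_sum - 0.5)))


# A short linear task versus a deep, heavily branched one.
easy = decision_complexity(depth=3, branch_factor=1.2, conditions=0)
hard = decision_complexity(depth=25, branch_factor=3.5, conditions=8)
print(f"easy={easy:.3f} hard={hard:.3f}")
```

Centring the sigmoid at 0.5 keeps a mid-range weighted sum at a mid-range difficulty value; a trained sub-model would learn its own weights instead of using fixed ones.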

[0009] In some feasible embodiments, the multidimensional evaluation model includes an application span sub-model; The process of calculating the application span difficulty value through the application span sub-model includes: Obtain the basic parameters of the application span index; The number of applications involved, the number of application switches, and the number of cross-application data transfer events are normalized to obtain the normalized results of each indicator of application span. The normalized results of each indicator of the application span are input into the application span sub-model to obtain a preliminary score; The preliminary score is then compressed and range-limited in sequence to obtain the application span difficulty value.

[0010] In some feasible embodiments, the multidimensional evaluation model includes an operation complexity sub-model; The process of calculating the operation complexity difficulty value through the operation complexity sub-model includes: Obtain the basic parameters of the operation complexity index; The number of operation steps, the number of input characters, and the operation type weight are normalized according to preset maximum values to obtain the normalized results of each indicator of operation complexity. The normalized results of each indicator of the operation complexity and the single-step error rate are used as input data and input into the operation complexity sub-model to obtain the dependency relationship in the operation time sequence. The normalized results of each indicator of the operation complexity are weighted and fused according to preset weights to obtain the weighted fusion result. The weighted fusion result is compressed, and the operation complexity difficulty value is output.

[0011] In some feasible embodiments, the multidimensional evaluation model includes an interface variability sub-model; The process of calculating the interface variability difficulty value through the interface variability sub-model includes: Obtain the basic parameters for the interface variability index; Based on the similarity of the interactive interface tree structure and the pixel similarity, structural difference parameters and pixel difference parameters are calculated. The structural difference parameters, the pixel difference parameters, and the average element displacement are used as input data and input to the interface variability sub-model. The input data is fused using the attention mechanism of the interface variability sub-model, and then weighted and summed according to preset weights to obtain the interface variability weighted sum result. The weighted summation result of the interface variability is processed by range limitation, and the interface variability difficulty value is output.
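
The interface variability steps can be sketched as follows. This is a rough illustration under stated assumptions: the difference parameters are taken as 1 minus the corresponding similarity, the preset weights are hypothetical, and a plain softmax re-weighting stands in for the sub-model's attention mechanism, which in a trained model would be learned.

```python
import math

# Hypothetical preset weights for (structure, pixel, displacement).
PRESET_WEIGHTS = (0.4, 0.3, 0.3)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def interface_variability(tree_similarity, pixel_similarity, mean_displacement):
    """Fuse difference features and clip the result to [0, 1]."""
    # Assumed definition: difference parameter = 1 - similarity.
    features = [1.0 - tree_similarity, 1.0 - pixel_similarity, mean_displacement]
    # Stand-in for learned attention: softmax scores emphasise large changes.
    attention = softmax(features)
    # Scale by the feature count so uniform attention leaves features unchanged.
    attended = [len(features) * a * f for a, f in zip(attention, features)]
    weighted_sum = sum(w * f for w, f in zip(PRESET_WEIGHTS, attended))
    # Range limitation.
    return min(max(weighted_sum, 0.0), 1.0)


stable = interface_variability(1.0, 1.0, 0.0)    # identical consecutive screens
changed = interface_variability(0.2, 0.3, 0.5)   # heavily redesigned screen
print(stable, round(changed, 3))
```

The final clip matters here because the attention re-weighting can push the weighted sum above 1 when one feature dominates.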

[0012] In some feasible embodiments, the average displacement of the elements is calculated using the following formula: [formula not recoverable from the source text]; where the formula ranges over the set of interface elements and uses the coordinates of each element u at time t and at time t+1, W is the screen width, and H is the screen height.
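
The formula itself did not survive in the text; only its variables did (the element set, the coordinates of element u at times t and t+1, and the screen width W and height H). One plausible reading, offered purely as an assumption, is the mean Euclidean displacement of the elements normalized by the screen diagonal. The function name and the dictionary layout below are hypothetical.

```python
import math


def mean_element_displacement(pos_t, pos_t1, width, height):
    """Mean displacement of elements between two frames, in [0, 1].

    pos_t and pos_t1 map an element id to its (x, y) pixel coordinates at
    times t and t+1. Normalizing by the screen diagonal is an assumption;
    the source formula may normalize differently.
    """
    shared = pos_t.keys() & pos_t1.keys()  # elements present in both frames
    if not shared:
        return 0.0
    diagonal = math.hypot(width, height)
    total = sum(math.dist(pos_t[u], pos_t1[u]) for u in shared)
    return total / (len(shared) * diagonal)


before = {"search_box": (0, 0), "submit": (100, 100)}
after = {"search_box": (300, 400), "submit": (100, 100)}
displacement = mean_element_displacement(before, after, width=1080, height=1920)
print(round(displacement, 4))
```

Dividing by the diagonal makes the value comparable across devices with different resolutions, which fits the cross-device goal stated in the application.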

[0013] In some feasible embodiments, the multidimensional evaluation model includes an intent ambiguity sub-model; The process of calculating the intent ambiguity difficulty value through the intent ambiguity sub-model includes: Obtain the basic parameters of the intent ambiguity index; The number of candidate intents and the perplexity of the language model are each normalized. The graph difference parameter is used as input data and input into the intent ambiguity sub-model. The input data is weighted and summed according to preset weights to obtain the intent ambiguity weighted summation result. The intent ambiguity weighted summation result is processed by range limitation, and the intent ambiguity difficulty value is output.

[0014] In some feasible embodiments, the task difficulty is assessed based on a preset fusion model for each dimension of difficulty value to obtain the task difficulty assessment result of the task instruction, including: The difficulty levels of each dimension are constructed as input data and input into a preset fusion model to obtain the task difficulty assessment result of the task instruction output by the preset fusion model.

[0015] In some feasible embodiments, the difficulty level of each dimension is determined based on the difficulty value of each dimension and a preset mapping relationship, including: Set a first difficulty threshold and a second difficulty threshold, wherein the first difficulty threshold is less than the second difficulty threshold; The difficulty level of each dimension is determined based on the comparison results between the difficulty value of each dimension and the first difficulty threshold and the second difficulty threshold.
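
The two-threshold mapping can be sketched as below. The threshold values 0.33 and 0.66 are hypothetical, as is the choice of three integer levels (0 easy, 1 medium, 2 difficult) with a one-hot encoding for later fusion; treating a value equal to a threshold as belonging to the higher level is also an assumption.

```python
def difficulty_level(value, first_threshold=0.33, second_threshold=0.66):
    """Map a difficulty value in [0, 1] to level 0, 1, or 2."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be less than second threshold")
    if value < first_threshold:
        return 0  # easy
    if value < second_threshold:
        return 1  # medium
    return 2      # difficult


def one_hot(level, num_levels=3):
    """Encode a level as a one-hot vector, e.g. level 1 -> [0.0, 1.0, 0.0]."""
    return [1.0 if i == level else 0.0 for i in range(num_levels)]


# Five per-dimension difficulty values mapped to levels.
values = [0.85, 0.60, 0.70, 0.30, 0.20]
levels = [difficulty_level(v) for v in values]
print(levels)
```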

[0016] In some feasible embodiments, the multimodal data includes: task description data, user interaction logs, user interface data, and device operating context data; The standardization process includes at least one of the following methods: using a key-value pair format to unify the data structure, filtering illegal characters, sorting in chronological order, and removing invalid data.

[0017] In a second aspect, embodiments of this application provide a task difficulty assessment apparatus, the apparatus comprising:
a data acquisition module, used to acquire multimodal data related to the task, and to standardize the multimodal data to obtain task description parameter data;
a basic parameter calculation module, used to extract task instructions based on the task description parameter data and calculate basic parameters of multi-dimensional indicators, wherein the dimensions include decision complexity, application span, operation complexity, interface variability, and intent ambiguity;
a per-dimension difficulty value determination module, used to calculate the difficulty value of each dimension through a pre-built multi-dimensional evaluation model, based on the basic indicator parameters of the corresponding dimension; and
a task difficulty assessment module, used to evaluate the task difficulty from the difficulty values of all dimensions using a preset fusion model, and obtain the task difficulty assessment result of the task instruction.

[0018] In a third aspect, embodiments of this application provide a task difficulty assessment device, the device comprising: a processor, and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the task difficulty assessment method described above.

[0019] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer program instructions, which, when executed by a processor, implement the task difficulty assessment method described above.

[0020] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the task difficulty assessment method described above.

[0021] The task difficulty assessment method, apparatus, device, medium, and program product of this application collect and standardize multimodal data, extract multidimensional core difficulty indicators, calculate the difficulty of each dimension, and then output the overall assessment result through a fusion model, thereby assessing GUI task difficulty. The result can support adaptive scheduling in the training and inference stages and dynamically match resource configuration and execution strategies, which improves the success rate of GUI agent task execution in cross-device and cross-version scenarios and reduces the number of actions and the latency of automatic task execution.

Attached Figure Description

[0022] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a flowchart illustrating a task difficulty assessment method provided in an embodiment of this application; Figure 2 is a schematic diagram of a task difficulty assessment process provided in an embodiment of this application; Figure 3 is a schematic diagram of a task difficulty assessment apparatus provided in an embodiment of this application; Figure 4 is a schematic diagram of the structure of a task difficulty assessment device provided in an embodiment of this application.

Detailed Implementation

[0024] The features and exemplary embodiments of various aspects of this application will be described in detail below. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only intended to explain this application and not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is merely to provide a better understanding of this application by illustrating examples.

[0025] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.

[0026] To address the problems in existing technologies, embodiments of this application provide a method, apparatus, device, medium, and program product for task difficulty assessment. The technical solution provided in this application automatically performs task identification and processing through a GUI agent. First, it collects and standardizes multimodal data such as the task description, interaction logs, UI interface data, and device context. Then, it extracts the basic parameters of the five difficulty dimensions, including decision complexity and application span. Lightweight sub-models such as graph attention networks (GAT) and LightGBM are used to quantify and classify the difficulty of each dimension. An MLP fusion model is then used to obtain the overall difficulty assessment result. Finally, a phased strategy is adopted during the training phase, and during the inference phase the inference engine, number of steps, tool activation, and fault tolerance mechanism are dynamically adjusted according to the difficulty, with the aim of efficient and stable task execution.

[0027] The task difficulty assessment method provided in the embodiments of this application will be introduced first below.

[0028] Figure 1 is a flowchart illustrating a task difficulty assessment method provided in an embodiment of this application. As shown in Figure 1, the method may include the following steps:

S101, acquire multimodal data related to the task, and perform standardization processing on the multimodal data to obtain task description parameter data.

Multimodal data can be a collection of heterogeneous information in different formats, including text, logs, images, and device environment, generated throughout the entire task execution process. This includes natural language commands input by the user, interaction logs generated during operation, screenshots and structure files of the application interface, and context configuration data of the running device. For example, in a task of planning a 3-day trip from point A to point B, the user's input command "including round-trip airfare + four-star hotel + reservations for 3 popular attractions + 2 reservations for specialty restaurants", the click and swipe trajectory logs, the UI tree structure of the airfare booking page, and the screen resolution information of the Android device are all multimodal data.

[0029] This solution can actively collect or passively receive task-related data from local storage on the mobile terminal, associated servers, or third-party applications through various means such as sensors on the mobile terminal, application programming interfaces, and system log reading tools. This ensures that the collected data can fully cover the key information of the entire task process from instruction input to execution completion.

[0030] Standardization processing can be a series of combined operations, such as data format conversion, illegal information filtering, time-series logic organization, and invalid data removal, to transform heterogeneous data from different sources and in different formats into a unified and standardized structure. This includes at least one processing method such as using JSON format or key-value pair format to unify the data structure, filtering illegal characters such as date format errors using regular expressions, sorting interaction logs according to timestamps, and removing duplicate or incomplete data that does not record operation types.
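
The four standardization operations named above (unified key-value structure, regex filtering, timestamp sorting, and removal of duplicate or incomplete records) can be sketched as follows. The field names `timestamp`, `op_type`, and `target`, the timestamp format, and the rule for what counts as an illegal character are all assumptions made for illustration.

```python
import re
from typing import Any

# Hypothetical rule: timestamps must look like "YYYY-MM-DD HH:MM:SS".
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")
# Hypothetical rule: ASCII control characters are treated as illegal.
CONTROL_CHARS_RE = re.compile(r"[\x00-\x1f\x7f]")


def standardize_logs(raw_logs: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Return cleaned, de-duplicated log records sorted by timestamp."""
    seen: set[tuple[str, str, str]] = set()
    cleaned: list[dict[str, str]] = []
    for record in raw_logs:
        ts = str(record.get("timestamp", "")).strip()
        op_type = str(record.get("op_type", "")).strip()
        target = CONTROL_CHARS_RE.sub("", str(record.get("target", ""))).strip()
        # Drop incomplete records (no operation type) and malformed dates.
        if not op_type or not TIMESTAMP_RE.match(ts):
            continue
        key = (ts, op_type, target)
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        # Unified key-value structure for every surviving record.
        cleaned.append({"timestamp": ts, "op_type": op_type, "target": target})
    # Fixed-width timestamps sort chronologically as plain strings.
    return sorted(cleaned, key=lambda r: r["timestamp"])


raw = [
    {"timestamp": "2026-02-04 10:00:05", "op_type": "click", "target": "search\x00"},
    {"timestamp": "2026-02-04 10:00:01", "op_type": "input", "target": "city"},
    {"timestamp": "2026/02/04 10:00:07", "op_type": "swipe", "target": "list"},
    {"timestamp": "2026-02-04 10:00:09", "target": "pay"},
    {"timestamp": "2026-02-04 10:00:01", "op_type": "input", "target": "city"},
]
standardized = standardize_logs(raw)
for row in standardized:
    print(row)
```

In this sample, the record with the slash-formatted date, the record without an operation type, and the repeated input record are all dropped, and the two remaining records come out in chronological order.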

[0031] Task description parameter data can be a set of data that, after standardization, clearly and structurally presents core information such as task objectives, operation steps, involved applications, interface element attributes, and device operating environment.

[0032] S102, extract task instructions based on the task description parameter data, and calculate basic parameters of multi-dimensional indicators, wherein the dimensions include: decision complexity, application span, operation complexity, interface variability, and intent ambiguity.

Task instructions can be key information extracted from standardized task description parameter data, which clarifies the core objectives and execution requirements of the task. They can be natural language instructions directly input by the user or standardized instructions obtained by parsing structured task templates, such as "Plan a 1-day tour of Chengdu, including 2 cultural attractions + 1 local food reservation".

[0033] Extraction can be the process of applying an instruction-fine-tuned large language model to perform natural language processing (NLP) on the structured task description, in order to identify and extract key information such as the core requirements, action types, and applications involved in the task.

[0034] The basic parameters of the multi-dimensional indicators can be a set of basic data that can be quantified and obtained through statistical analysis, structural calculation and semantic parsing, reflecting the difficulty characteristics of each dimension, corresponding to five dimensions: decision complexity, application span, operation complexity, interface variability and intent ambiguity.

[0035] Decision complexity can refer to the complexity of the branch decision logic involved in the execution of a task. Specifically, it is a comprehensive reflection of characteristics such as the depth of the task action graph, branch density, number of condition judgments, and historical execution rollback.

[0036] Application span can refer to the degree of cross-application collaboration involved in the task execution process, specifically reflected in the number of applications involved in the interaction chain, the frequency of switching between applications, and the frequency of cross-application data transfer.

[0037] Operational complexity refers to the overall complexity of a task, including the number of operational steps, types of operations, the amount of input, and the tolerance for errors during execution. It reflects the cumbersome nature of the operation and the difficulty of success during task execution.

[0038] Interface variability refers to the degree of change in the structure, logic, visual presentation, and element positions of the interactive interface during task execution, reflecting the impact of dynamic interface changes on task execution.

[0039] Intent ambiguity refers to the clarity of the user's intent expressed by the task instruction, and is affected by various factors such as semantic ambiguity of the instruction text, the number of candidate intents, and the difficulty of language model prediction.

[0040] This solution uses techniques including statistical analysis, image recognition, structural comparison, and semantic parsing to extract relevant information from the task description parameter data and convert it into quantitative indicators. For example, it counts the number of operation steps in interaction logs and calculates structural similarity for UI interface data. Specifically, it uses an instruction-fine-tuned large language model to perform natural language processing on the task description parameter data, extracting features such as the task instruction, intent category, action types, and the number of applications involved. It also performs statistical analysis on interaction logs, calculating parameters such as the number of operation steps and the number of input characters. Finally, it uses an instruction-fine-tuned multimodal large model to perform image recognition and structural analysis on UI interface data, calculating parameters such as UI tree structure similarity and pixel similarity. In this way, it obtains the basic parameters corresponding to decision complexity, application span, operation complexity, interface variability, and intent ambiguity.
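
The statistical part of this step can be illustrated with a short sketch that derives a few basic parameters from a standardized interaction log. The record fields `app`, `op_type`, and `text` are hypothetical, and the model-based parts (intent extraction, UI similarity) are deliberately left out.

```python
def log_statistics(logs):
    """Count basic parameters from a chronologically ordered interaction log."""
    apps = [record["app"] for record in logs]
    # A switch is any pair of consecutive operations in different apps.
    switches = sum(1 for prev, cur in zip(apps, apps[1:]) if prev != cur)
    input_chars = sum(
        len(record.get("text", "")) for record in logs if record["op_type"] == "input"
    )
    return {
        "operation_steps": len(logs),
        "input_characters": input_chars,
        "apps_involved": len(set(apps)),
        "app_switches": switches,
    }


logs = [
    {"app": "flights", "op_type": "click"},
    {"app": "flights", "op_type": "input", "text": "Chengdu"},
    {"app": "hotels", "op_type": "input", "text": "four-star"},
    {"app": "flights", "op_type": "swipe"},
]
stats = log_statistics(logs)
print(stats)
```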

[0041] S103, calculate the difficulty value of each dimension through a pre-built multi-dimensional evaluation model, based on the basic indicator parameters of the corresponding dimension.

A multi-dimensional evaluation model can be a set of dedicated sub-models designed for the five difficulty dimensions. Each sub-model adopts a technical architecture adapted to the characteristics of the corresponding dimension. The set includes the decision complexity sub-model, application span sub-model, operation complexity sub-model, interface variability sub-model, and intent ambiguity sub-model.

[0042] Pre-construction can refer to the process of training sub-models for each dimension based on a certain number of GUI task sample data, combined with technologies adapted to mobile devices such as graph networks, gradient boosting trees, temporal convolutional networks, and lightweight Transformers, and determining the model structure, network parameters, and weight coefficients.

[0043] The difficulty value for each dimension can be a value normalized to the [0,1] range after the basic parameters of the indicator are processed by the corresponding sub-model. The larger the value, the higher the difficulty of that dimension.

[0044] In this solution, the basic indicator parameters of each dimension are input into the corresponding sub-model according to a preset format. The sub-model's dedicated algorithm logic then performs data processing, feature fusion, and numerical calculation, and outputs the difficulty value for that dimension.

[0045] Specifically, the basic parameters of decision complexity are input into the decision complexity sub-model, the basic parameters of application span are input into the application span sub-model, the basic parameters of operation complexity are input into the operation complexity sub-model, the basic parameters of interface variability are input into the interface variability sub-model, and the basic parameters of intent ambiguity are input into the intent ambiguity sub-model. Each sub-model is calculated using its own algorithm and outputs the corresponding decision complexity difficulty value, application span difficulty value, operation complexity difficulty value, interface variability difficulty value, and intent ambiguity difficulty value.

[0046] S104, evaluate the task difficulty from the difficulty values of all dimensions using a preset fusion model, to obtain the task difficulty assessment result of the task instruction.

[0047] The preset fusion model can refer to a model that has been pre-trained on a large number of GUI task samples with manually labeled difficulty levels. It is built as a multi-layer perceptron (MLP) with a dropout rate of 0.2 and includes a shared trunk and two output branches, one for classification and one for regression. It integrates the difficulty information of the individual dimensions and outputs an overall evaluation result.

[0048] The task difficulty assessment result can include an overall difficulty level, such as level 0 for easy, level 1 for medium, and level 2 for difficult, and can also include the confidence of that level. The result therefore reflects the overall difficulty of the task and indicates how reliable the assessment is.

[0049] This scheme can construct input data by combining the difficulty value of each dimension with the corresponding one-hot encoded vector of its difficulty level, input the data into the preset fusion model, calculate the overall difficulty level probabilities through the classification branch of the model, calculate the overall continuous difficulty score through the regression branch, and then determine the overall difficulty level in combination with the preset threshold.
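
The input construction and the two-branch forward pass can be sketched in plain Python. This is an untrained toy with random weights, meant only to show the data flow: the hidden size, the ReLU activation, and the sigmoid on the regression output are assumptions, dropout is omitted because it applies only during training, and the level is picked by argmax where the text combines the outputs with a preset threshold.

```python
import math
import random

LEVELS = 3                               # easy / medium / difficult
DIMENSIONS = 5                           # the five difficulty dimensions
INPUT_SIZE = DIMENSIONS * (1 + LEVELS)   # value + one-hot level per dimension
HIDDEN_SIZE = 8                          # hypothetical trunk width


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases; a real model would be trained."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_input(values, levels):
    """Concatenate each difficulty value with the one-hot vector of its level."""
    features = []
    for value, level in zip(values, levels):
        features.append(value)
        features.extend(1.0 if i == level else 0.0 for i in range(LEVELS))
    return features


def fuse(values, levels, params):
    x = build_input(values, levels)
    hidden = [max(0.0, h) for h in linear(x, params["trunk"])]  # shared trunk
    probs = softmax(linear(hidden, params["cls"]))              # classification branch
    logit = linear(hidden, params["reg"])[0]                    # regression branch
    score = 1.0 / (1.0 + math.exp(-logit))
    level = max(range(LEVELS), key=probs.__getitem__)
    return {"level": level, "confidence": probs[level], "score": score}


rng = random.Random(0)
params = {
    "trunk": init_layer(rng, INPUT_SIZE, HIDDEN_SIZE),
    "cls": init_layer(rng, HIDDEN_SIZE, LEVELS),
    "reg": init_layer(rng, HIDDEN_SIZE, 1),
}
result = fuse([0.85, 0.60, 0.70, 0.30, 0.20], [2, 1, 2, 0, 0], params)
print(result)
```

With five dimensions and three levels the input vector has 20 entries. Because the weights are random, the printed level and score carry no meaning beyond showing the output shape.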

[0050] The technical solution provided in this embodiment collects and standardizes task-related multimodal data, extracts indicators from five dimensions such as decision complexity, calculates the difficulty of each dimension, and then outputs a comprehensive evaluation result through a fusion model. This enables the evaluation of the difficulty of GUI tasks, provides a reliable basis for subsequent resource scheduling, effectively supports the accurate matching of task execution resources, and improves the task execution success rate of GUI agents in cross-device and cross-version scenarios, while reducing the number of actions and latency.

[0051] In one feasible embodiment, after obtaining the task difficulty assessment result of the task instruction, the method further includes: Obtain real-time computing power data from the task execution device; Based on the task difficulty assessment results and the real-time computing power data, the computing power data of the task instructions is dynamically configured.

[0052] A task execution device can refer to any mobile terminal device used to execute the current GUI task, including smartphones, tablets, and smart cockpit terminals, for example a smartphone or tablet with a resolution of 1920×1080.

[0053] Real-time computing power data can refer to the computing resource occupancy status and data processing capabilities of a task execution device at a given moment, including processor utilization, memory occupancy, GPU computing speed, and computing power supply capacity corresponding to remaining power.

[0054] This solution utilizes system interfaces and hardware monitoring tools provided by the device's operating system to collect real-time computing power data from the task execution device, ensuring that the data accurately reflects the device's current operating status and computing power level. Based on the task difficulty assessment results and the real-time computing power data, the computing power data for the task instructions is dynamically configured.

[0055] Dynamic configuration can refer to the process of flexibly adjusting computing resource allocation, inference engine type, maximum inference steps, single-step latency threshold, tool activation status, and other computing power-related parameters for the current task based on the computing power requirements corresponding to the task difficulty level and the real-time computing power status of the device.

[0056] Computing power data can refer to a set of parameters related to various computing resources allocated to task execution, including the selection of lightweight / full mode for the inference engine, the maximum number of inference steps limit, the latency limit for single-step operation, and the activation of auxiliary tools such as optical character recognition.

[0057] This technical solution achieves dynamic matching between task computing power requirements and device real-time computing power, avoiding resource waste or insufficiency. While ensuring efficient task execution, it also ensures device operational stability and improves the task execution success rate and latency control effect of the GUI intelligent agent.
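The dynamic configuration described above can be illustrated with a minimal sketch. The class names, field names, and every numeric limit below are hypothetical placeholders chosen for illustration; the text does not specify them, and GPU computing speed is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    cpu_utilization: float   # fraction in [0, 1]
    memory_occupancy: float  # fraction in [0, 1]
    battery_level: float     # fraction in [0, 1]


@dataclass
class ComputeConfig:
    engine_mode: str          # "lightweight" or "full"
    max_inference_steps: int
    step_latency_ms: int
    ocr_enabled: bool


def configure_compute(difficulty_level: int, status: DeviceStatus) -> ComputeConfig:
    """Pick task parameters from a difficulty level (0, 1, 2) and device load.

    All thresholds and budgets are hypothetical, not values from the text.
    """
    # Treat the device as constrained when it is busy or low on power.
    constrained = (
        status.cpu_utilization > 0.8
        or status.memory_occupancy > 0.8
        or status.battery_level < 0.2
    )
    # Harder tasks get a larger inference-step budget.
    max_steps = {0: 10, 1: 25, 2: 50}[difficulty_level]
    # Use the full engine only for non-easy tasks on an unconstrained device.
    engine = "full" if difficulty_level >= 1 and not constrained else "lightweight"
    # Allow a looser single-step latency limit when the device is constrained.
    latency = 800 if constrained else 400
    # Enable the OCR auxiliary tool only for the hardest level.
    return ComputeConfig(engine, max_steps, latency, difficulty_level == 2)
```

For instance, `configure_compute(2, DeviceStatus(0.95, 0.4, 0.9))` falls back to the lightweight engine because processor utilization exceeds the assumed 0.8 limit.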

[0058] In one feasible embodiment, the basic parameters for the decision complexity metrics include one or more of the following: depth of the task action graph, average branch factor, number of explicit conditional judgments, and historical failure rollback ratio. The basic parameters for the application span metrics include one or more of the following: number of cross-applications, number of application switching times, and number of cross-application data transfer events; The basic parameters of the operation complexity index include one or more of the following: number of operation steps, number of input characters, operation type weight, and single-step error rate; The basic parameters of the interface variability index include one or more of the following: similarity of the interactive interface tree structure, pixel similarity, and average displacement of elements in the interface. The basic parameters of the intent ambiguity index include one or more of the following: the maximum similarity between the instruction and the preset instruction template, the number of candidate intents, and the perplexity of the language model.

[0059] The depth of a task action graph can refer to the number of steps in the longest or shortest path from the starting node to the target node in a task action graph (G=(V,E)) consisting of action-page nodes (V) and reachable transition edges (E) that breaks down the task. For example, the depth of the action graph for planning a 3-day trip to location B may reach 25 steps.

[0060] The average branch factor can be the arithmetic mean of the number of branches contained in all nodes of the task action graph. It can reflect the density of branch decisions during task execution. The higher the branch factor, the more complex the decision selection.
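As a minimal sketch of the two graph indicators above, the task action graph can be held as an adjacency dictionary. Two assumptions are made here: depth is taken as the shortest path (the text allows either longest or shortest), and the number of branches of a node is read as its count of outgoing edges. The graph contents and function names are hypothetical.

```python
from collections import deque


def shortest_depth(graph: dict, start: str, target: str) -> int:
    """Number of edges on the shortest path from start to target (BFS)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    raise ValueError("target is not reachable from start")


def average_branch_factor(graph: dict) -> float:
    """Arithmetic mean of outgoing-edge counts over all nodes."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    return sum(len(graph.get(n, [])) for n in nodes) / len(nodes)


# Hypothetical five-page task: two branches at "home", then a linear flow.
toy_graph = {
    "home": ["search", "settings"],
    "search": ["results"],
    "settings": [],
    "results": ["booking"],
    "booking": [],
}
```

On `toy_graph`, the depth from `"home"` to `"booking"` is 3 and the average branch factor is 4 edges over 5 nodes, that is 0.8.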

[0061] The number of explicit conditional judgments can refer to the number of nodes in a task action graph that contain if-else logic, that is, nodes that require a conditional judgment such as whether the airfare meets the target price, whether the hotel rating meets the requirements, or whether the attraction has visitor limits.

[0062] The historical failure rollback ratio refers to the ratio of the number of times a task needs to be rolled back to previous steps due to decision-making errors or operational failures during its historical execution to the total number of times the task has been executed. It reflects the fault tolerance level of the task's decision-making and execution.

[0063] The number of cross-applications can refer to the number of different applications involved in the interaction chain of task execution. For example, a travel planning task may involve flight booking apps, hotel booking apps, attraction reservation apps, and map navigation apps, and the number of cross-applications may be 4.

[0064] The number of application switching times can refer to the total number of times execution switches between different applications during the task. A higher number indicates more frequent cross-application collaboration.

[0065] The number of cross-application data transfer events can refer to the number of events in which data is synchronized or shared between different applications during task execution, such as the total number of events where flight booking information is synchronized to a travel planning app, hotel addresses are synchronized to a map navigation app, and attraction reservation records are synchronized to a calendar app.

[0066] The number of operation steps can refer to the total number of all operation steps required to complete the task. It includes the steps in the complete process from opening the initial interface and inputting parameters to submitting the final task. For example, checking the airfare may require 5 steps, while a complex hotel booking may require 12 steps.

[0067] The number of input characters can refer to the total number of text characters that the user or agent needs to manually input during the task execution process, including the total number of characters of all input content such as departure point, destination, date, contact information, and remarks.

[0068] The sum of operation type weights can refer to the value obtained by weighting each operation type in the task according to preset weights. The preset weights are set according to the complexity of the operation, for example, click = 1, long press = 1.5, swipe = 2, text input = 1.5, drag = 2. The more complex the operation type, the higher the weight.
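The weighted count above can be sketched as follows, using the example weights quoted in the text. The operation names used as dictionary keys and the function name are hypothetical.

```python
# Example weights quoted in the text; a deployment would preset its own.
OPERATION_WEIGHTS = {
    "click": 1.0,
    "long_press": 1.5,
    "swipe": 2.0,
    "text_input": 1.5,
    "drag": 2.0,
}


def operation_type_weight_sum(operations: list) -> float:
    """Sum the preset weight of every operation in the sequence."""
    return sum(OPERATION_WEIGHTS[op] for op in operations)
```

A sequence of two clicks, one text input, and one swipe gives 1 + 1 + 1.5 + 2 = 5.5.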

[0069] The single-step error rate can be defined as the ratio of the number of times a specific operation step fails during the historical execution of a task to the total number of times that step is executed. A higher error rate indicates that the operation step is more difficult.

[0070] The similarity of the interactive interface tree structure can refer to the degree of similarity of the UI tree structure at different times calculated based on the graph edit distance (GED). The closer the value is to 1, the smaller the structural difference.

[0071] Pixel similarity can refer to the degree of pixel matching between interface images at different times, calculated based on the Structural Similarity Index Measure (SSIM). The value lies in the range [0,1], and the higher the value, the more consistent the visual presentation.

[0072] The average displacement of elements in the interface can refer to the average distance of the position change of all interactive elements in the set of interface elements at different times. It is calculated by Euclidean distance and then normalized according to the screen diagonal length to eliminate the influence of device screen size differences.

[0073] The maximum similarity between the instruction and the preset instruction template can be defined as the maximum similarity value obtained by calculating the cosine similarity between the task instruction text and all templates in the preset instruction template library. It reflects the degree of fit between the instruction and the standard template, and the higher the value, the clearer the intent.

[0074] The number of candidate intents can refer to the number of intents corresponding to preset instruction templates with a similarity of ≥0.6 to the task instruction. For example, "plan a fun trip" may correspond to three candidate intents: natural scenery tour, cultural and historical site tour, and theme park tour.
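The two template-matching indicators can be sketched together. This sketch assumes the instruction and each template are already embedded as non-zero numeric vectors and that each template corresponds to one intent; the embedding step, the vectors, and the function names are hypothetical. The 0.6 threshold is the one given in the text.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def template_match(instruction_vec: list, template_vecs: dict, threshold: float = 0.6):
    """Return (maximum similarity, number of templates at or above threshold)."""
    sims = [cosine_similarity(instruction_vec, t) for t in template_vecs.values()]
    return max(sims), sum(1 for s in sims if s >= threshold)


# Hypothetical 2-D embeddings, for illustration only.
templates = {"nature_tour": [1.0, 0.0], "culture_tour": [1.0, 1.0], "theme_park": [0.0, 1.0]}
max_sim, n_candidates = template_match([1.0, 0.0], templates)
```

Here the similarities are 1.0, about 0.707, and 0.0, so the maximum similarity is 1.0 and two templates count as candidate intents.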

[0075] Language model perplexity refers to the difficulty of a pre-trained language model in semantically predicting the text of a task instruction. A higher perplexity indicates that the instruction is more ambiguous. Normalization can be performed here to facilitate subsequent calculations.
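Under the standard definition, perplexity is the exponential of the mean negative log-probability that the model assigns to the tokens. The sketch below assumes the per-token probabilities are already available from some language model. The normalization shown (division by a preset ceiling `ppl_max`, capped at 1) is a hypothetical choice, since the text does not state which normalization is used.

```python
import math


def perplexity(token_probs: list) -> float:
    """exp of the mean negative log-probability of the tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def normalize_perplexity(ppl: float, ppl_max: float = 100.0) -> float:
    """Hypothetical normalization: divide by a preset ceiling and cap at 1."""
    return min(ppl / ppl_max, 1.0)
```

If every token is assigned probability 0.25, the perplexity is 4, which normalizes to 0.04 under the assumed ceiling of 100.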

[0076] This technical solution clarifies the core quantitative indicators for difficulty assessment in each dimension, covers the key influencing factors of GUI task difficulty, and provides specific and calculable input parameters for each dimension's sub-model, ensuring the relevance and accuracy of difficulty assessment and avoiding assessment bias caused by ambiguous indicators.

[0077] In one feasible embodiment, the multidimensional evaluation model includes a decision complexity sub-model; The process of calculating the decision complexity difficulty value through the decision complexity sub-model includes: Obtain the basic parameters for determining decision complexity; The depth, average branch factor, and number of explicit conditional judgments of the task action graph are normalized to obtain the normalized results of each index of decision complexity. The normalized results of each indicator of decision complexity are input into the decision complexity sub-model. The decision complexity sub-model performs a weighted summation on the normalized results of each indicator of decision complexity to obtain a weighted summation result. The weighted summation result is then compressed to obtain the decision complexity difficulty value.

[0078] Normalization can refer to the standardization process of mapping the original values of each indicator to the [0,1] range according to the maximum value set by the industry scenario. The purpose is to eliminate the dimensional differences between different indicators.

[0079] The normalized results of the decision complexity indicators can refer to the standardized values of the depth, average branch factor, and number of explicit conditional judgments of the task action graph within the [0,1] interval, obtained after the above normalization process. The normalized results of the decision complexity indicators are input into the decision complexity sub-model. The decision complexity sub-model performs a weighted summation on the normalized results of the decision complexity indicators to obtain a weighted summation result. This weighted summation result is then compressed to obtain the decision complexity difficulty value. The decision complexity sub-model can be a model specifically designed for calculating the decision complexity difficulty value, constructed using a shallow layered graph attention network (GAT). Its attention weights are optimized through training with GUI task samples, enabling it to accurately focus on high-complexity branch decision logic.

[0080] Different attention weights can be assigned to the indicators according to their importance in the decision complexity assessment, and the normalized results of the indicators are then summed using these weights. The weighted summation result is a raw value with an unrestricted range that is positively correlated with the decision complexity.

[0081] Compression processing can refer to the process of using the Sigmoid function to compress the weighted summation result to the [0,1] interval, ensuring that the output difficulty value has a uniform quantitative standard.

[0082] The decision complexity difficulty value can refer to a quantitative value that reflects the degree of decision complexity after compression processing. The value ranges from [0,1]. The larger the value, the more complex the decision logic of the task.

[0083] This technical solution uses graph attention networks to accurately capture the complexity of task branch decision logic, and combines normalization and compression processing to ensure the standardization of difficulty values, thereby achieving accurate calculation of decision complexity and providing accurate dimensional support for overall difficulty assessment. This solves the problem of one-sided assessment caused by simply counting the number of steps in traditional methods.
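The normalize, weight, and compress steps of this embodiment can be sketched as follows. The sketch replaces the trained graph attention network with fixed weights, so it illustrates only the arithmetic, not the sub-model itself. The maxima and weights are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def normalize(value: float, max_value: float) -> float:
    """Map a raw indicator to [0, 1] by a preset maximum, capping at the ends."""
    return min(max(value / max_value, 0.0), 1.0)


def decision_complexity(depth, branch_factor, n_conditions,
                        maxima=(30.0, 5.0, 10.0),    # hypothetical preset maxima
                        weights=(0.4, 0.3, 0.3)):    # hypothetical fixed weights
    """Weighted sum of normalized indicators, compressed with Sigmoid."""
    features = [
        normalize(depth, maxima[0]),
        normalize(branch_factor, maxima[1]),
        normalize(n_conditions, maxima[2]),
    ]
    weighted_sum = sum(w * f for w, f in zip(weights, features))
    return sigmoid(weighted_sum)
```

With non-negative weights that sum to 1, the weighted sum in this sketch stays in [0,1], so its Sigmoid output lies between 0.5 and about 0.73; for example, `decision_complexity(15, 2.5, 5)` evaluates Sigmoid at 0.5.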

[0084] In one feasible embodiment, the multidimensional evaluation model includes an application span sub-model; The process of calculating the application span difficulty value through the application span sub-model includes: Obtain the basic parameters of the application span metrics; The number of cross-applications, the number of application switching times, and the number of cross-application data transfer events are normalized to obtain the normalized results of each indicator of application span. The normalized results of each indicator of the application span are input into the application span sub-model to obtain an initial score; The initial score is then compressed and range-limited in sequence to obtain the application span difficulty value.

[0085] The number of cross-applications, the number of application switching times, and the number of cross-application data transfer events are normalized to obtain the normalized results of each indicator of application span. The normalized results of the various indicators of application span can refer to the standardized values of the number of cross-applications, the number of application switching times, and the number of cross-application data transfer events obtained after the above normalization process and falling within the range of [0,1].

[0086] The application span sub-model can refer to a model built using the LightGBM gradient boosting tree model, which is specifically designed to mine the criticality of cross-application interactions and can capture the non-linear impact of key features such as cross-application data transfer on difficulty.

[0087] The initial score can refer to the raw score output by the model after performing feature extraction and nonlinear operations on the normalized input indicators, without being subject to range restrictions. Its value reflects the complexity of the application span.

[0088] Compression processing can refer to the process of using the Sigmoid function to nonlinearly compress the initial score so that the value falls within the [0,1] interval.

[0089] Range limiting processing can refer to the process of using the clip function to strictly limit the compressed score to the range [0,1], ensuring that outliers outside this range are corrected to 0 or 1.

[0090] The application span difficulty value can be a quantitative value that reflects the complexity of the application span after a series of processing. The value ranges from [0,1]. The larger the value, the higher the difficulty of cross-application collaboration.

[0091] This technical solution uses a gradient boosting tree model to explore the nonlinear effects of key features such as cross-application data transfer, breaking through the limitations of traditional methods that only count the number of applications. It achieves accurate quantification of application span, provides reliable support for assessing the difficulty of cross-application collaborative tasks, and ensures that scheduling strategies can be specifically adapted to cross-application interaction needs.
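The score, compress, and clip sequence of this embodiment can be sketched with a pluggable scorer. The function `toy_scorer` below is a hypothetical stand-in for a trained gradient boosting tree model and does not call any LightGBM API. The maxima and coefficients are likewise assumptions.

```python
import math


def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def application_span_difficulty(n_apps, n_switches, n_transfers, scorer,
                                maxima=(5.0, 10.0, 5.0)):  # hypothetical maxima
    """Normalize, score with the supplied model, then Sigmoid and clip."""
    raw = (n_apps, n_switches, n_transfers)
    features = [clip(v / m) for v, m in zip(raw, maxima)]
    initial_score = scorer(features)                     # unbounded raw score
    compressed = 1.0 / (1.0 + math.exp(-initial_score))  # Sigmoid compression
    return clip(compressed)                              # range limiting


def toy_scorer(features):
    """Hypothetical stand-in for a trained gradient boosting tree model."""
    apps, switches, transfers = features
    # The product term mimics a nonlinear interaction: data transfer
    # weighs more when more applications are involved.
    return 2.0 * apps + 1.0 * switches + 3.0 * apps * transfers - 2.0
```

Under these assumptions, a task spanning four applications with six switches and three data transfers scores higher than a single-application task with none.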

[0092] In one feasible embodiment, the multi-dimensional evaluation model includes an operational complexity sub-model; The process of calculating the difficulty value of operation complexity through the operation complexity sub-model includes: Obtain the basic parameters for the metrics of operational complexity; The number of operation steps, the number of input characters, and the complexity of operation type are normalized according to preset maximum values to obtain the normalized results of each index of operation complexity. The normalized results of each indicator of the operation complexity and the single-step error rate are used as input data and input into the operation complexity sub-model to obtain the dependency relationship in the operation time sequence. The normalized results of each indicator of the operation complexity are weighted and fused according to the preset weight to obtain the weighted fusion result. The weighted fusion result is compressed, and the operation complexity difficulty value is output.

[0093] The number of operation steps, the number of input characters, and the complexity of operation type are normalized according to preset maximum values to obtain the normalized results of each index of operation complexity. The preset maximum value can refer to the reasonable upper limit value of each indicator determined through a large number of sample statistical analysis based on the actual execution scenario of mobile GUI tasks. For example, the maximum number of operation steps K_max=50, the maximum number of input characters L_max=200, and the maximum operation type complexity G_max=100.

[0094] Normalization can refer to the process of dividing the number of operation steps, the number of input characters, and the complexity of operation type by their respective preset maximum values to map the original values to the interval [0,1]. The formulas are K_hat=K / K_max, L_hat=L_char / L_max, and G_hat=G / G_max, respectively.

[0095] The normalized results of the various indicators of operational complexity can refer to the standardized values of the number of operation steps, the number of input characters, and the complexity of operation type, which are in the range [0,1] after the above normalization process.

[0096] Input data can refer to the time sequence data of operations, which is composed of the normalized results of various indicators of operation complexity and the original single-step error rate (∈[0,1]) arranged in chronological order. It can fully reflect the order of operation execution and the characteristics of each step.

[0097] Operational complexity sub-models can refer to models built using shallow temporal convolutional networks specifically designed to capture temporal dependencies in operations. By employing dilated convolution techniques, they can effectively capture long-distance operational logical connections.

[0098] Dependencies in an operation sequence can refer to the logical connections and mutual influences between task operation steps. For example, a successful flight booking is a prerequisite for a hotel booking, and a failed attraction booking will lead to a subsequent re-selection operation.

[0099] Preset weights can refer to the importance weights of each indicator determined after training and verification based on a large number of GUI task samples, such as β1=0.25, β2=0.2, β3=0.35, β4=0.2. The weight values reflect the degree of contribution of the corresponding indicator to the operational complexity.

[0100] Weighted fusion can refer to the process of weighting and accumulating various indicator data according to preset weights, integrating features from multiple dimensions into a comprehensive value, and fully reflecting the complexity of the operation.

[0101] The weighted fusion result can refer to the original value obtained after weighted summation, without compression processing. Its size is positively correlated with the complexity of the operation.

[0102] Compression processing can refer to the process of using the Sigmoid function to nonlinearly compress the weighted fusion result, so that the value is strictly mapped to the [0,1] interval.

[0103] The operation complexity difficulty value can refer to a quantitative value that reflects the degree of operation complexity after compression processing. The value range is [0,1]. The larger the value, the higher the complexity and difficulty of the operation.

[0104] This technical solution captures the long-term dependencies of operation sequences through temporal convolutional networks, and combines the differences in operation type weights and single-step error rates to achieve a comprehensive quantification of operation complexity. It breaks through the limitations of traditional methods that only count the number of operation steps, and provides accurate support for the difficulty assessment of multi-step and complex operation tasks, ensuring that the scheduling strategy can adapt to the operation execution requirements.
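The normalization, weighted fusion, and compression of this embodiment can be sketched with the example maxima and weights quoted in the text. Two assumptions are made: β1 to β4 are applied in the order steps, characters, operation type complexity, single-step error rate, and values above the preset maximum are capped at 1. The temporal convolutional network is omitted, so the sketch covers only the fusion arithmetic.

```python
import math

K_MAX, L_MAX, G_MAX = 50, 200, 100   # example maxima from the text
BETAS = (0.25, 0.2, 0.35, 0.2)       # example weights from the text; order assumed


def operation_complexity(steps, chars, type_complexity, step_error_rate):
    """Weighted fusion of normalized indicators, compressed with Sigmoid."""
    k_hat = min(steps / K_MAX, 1.0)
    l_hat = min(chars / L_MAX, 1.0)
    g_hat = min(type_complexity / G_MAX, 1.0)
    fused = (BETAS[0] * k_hat + BETAS[1] * l_hat
             + BETAS[2] * g_hat + BETAS[3] * step_error_rate)
    return 1.0 / (1.0 + math.exp(-fused))
```

For a task with 25 steps, 100 input characters, an operation type complexity of 50, and a single-step error rate of 0.5, every normalized input is 0.5, so the fused value is 0.5 before compression.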

[0105] In one feasible embodiment, the multidimensional evaluation model includes an interface variability sub-model; The process of calculating the interface variability difficulty value through the interface variability sub-model includes: Obtain the basic parameters for the interface variability index; Based on the similarity of the interactive interface tree structure and the pixel similarity, structural difference parameters and pixel difference parameters are calculated. The structural difference parameters, the pixel difference parameters, and the average element displacement are used as input data and input to the interface variability sub-model. The input data is fused using the attention mechanism of the interface variability sub-model, and then weighted and summed according to preset weights to obtain the interface variability weighted sum result. The weighted summation result of the interface variability is processed by range limitation, and the interface variability difficulty value is output.

[0106] The structural difference parameter can be a quantitative parameter that reflects the degree of difference in the structure of the interactive interface tree. It is obtained by performing a reverse calculation on the similarity of the interactive interface tree structure. The larger the value, the more significant the structural difference.

[0107] Pixel difference parameters can be quantitative parameters that reflect the degree of visual difference between pixels on the interface. They are obtained by performing inverse calculations on pixel similarity. The larger the value, the more obvious the visual difference.

[0108] This solution can convert similarity metrics into difference metrics by performing a simple subtraction operation on the similarity of the interactive interface tree structure and pixel similarity, which can directly reflect the degree of variation of the interface.

[0109] The structural difference parameters, the pixel difference parameters, and the average element displacement are used as input data and input to the interface variability sub-model. The input data is fused using the attention mechanism of the interface variability sub-model, and then weighted and summed according to preset weights to obtain the interface variability weighted sum result. Input data can refer to a multimodal feature data set consisting of structural difference parameters, pixel difference parameters, and average element displacement, which can comprehensively cover the variation features of the interface in three dimensions: structure, vision, and element position.

[0110] The interface variability sub-model can refer to a model built using a lightweight visual Transformer model, specifically designed to fuse multimodal interface features, and adapted to the low-computing-power deployment requirements of mobile devices.

[0111] Attention mechanism can refer to the mechanism by which a model can automatically focus on key features that have a significant impact on interface variation, assign different attention weights to different features, and thus accurately integrate the features of the three modalities of structure, vision, and displacement.

[0112] This scheme can integrate the high-dimensional feature information of three types of input data into a unified feature representation vector through an attention mechanism, thereby achieving the process of complementarity and enhancement of multi-dimensional information.

[0113] Preset weights refer to the feature weight coefficients determined through sample training based on the actual needs of interface variation assessment. For example, γ1=0.4, γ2=0.3, and γ3=0.3 correspond to structural difference parameters, pixel difference parameters, and average element displacement, respectively. The process of weighting and accumulating the fused unified feature vector according to the preset weights yields a numerical value that comprehensively reflects the degree of interface variation.

[0114] The weighted summation result of interface variability can refer to the original value obtained after weighted accumulation operation without range restrictions, and its magnitude is positively correlated with the degree of interface variability.

[0115] Range restriction processing can refer to the process of using the clip function to strictly limit the weighted summation result of interface variability to the range [0,1], so as to ensure the standardization and comparability of difficulty values.

[0116] The interface variability difficulty value can be a quantitative value that reflects the degree of interface variability after range restriction processing. The value range is [0,1]. The larger the value, the more significant the interface variability and the greater the interference with task execution.

[0117] This technical solution fuses multimodal interface features within the sub-model, which solves the misjudgment problem caused by relying solely on pixel similarity. It achieves accurate quantification of interface variability, provides reliable support for difficulty assessment of cross-device and cross-version interface interaction tasks, and ensures that the GUI intelligent agent can adapt to dynamic changes in the interface.
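The difference parameters and the weighted summation of this embodiment can be sketched as follows, using the example weights γ1=0.4, γ2=0.3, γ3=0.3 from the text. The sketch assumes the reverse calculation is 1 minus the similarity, and it omits the attention-based fusion of the visual Transformer, so it shows only the final arithmetic.

```python
GAMMAS = (0.4, 0.3, 0.3)  # example weights from the text


def interface_variability(tree_similarity, pixel_similarity, avg_displacement):
    """Weighted sum of difference parameters, limited to [0, 1]."""
    structural_diff = 1.0 - tree_similarity   # assumed form of the reverse calculation
    pixel_diff = 1.0 - pixel_similarity
    weighted = (GAMMAS[0] * structural_diff
                + GAMMAS[1] * pixel_diff
                + GAMMAS[2] * avg_displacement)
    return max(0.0, min(1.0, weighted))       # clip to [0, 1]
```

With a tree similarity of 0.9, a pixel similarity of 0.8, and an average displacement of 0.1, the result is 0.4 × 0.1 + 0.3 × 0.2 + 0.3 × 0.1 = 0.13.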

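[0118] In one feasible embodiment, the average displacement of the elements is calculated using the following formula: $$\bar{d} = \frac{1}{|U|} \sum_{u \in U} \frac{\sqrt{(x_u^{t+1} - x_u^{t})^2 + (y_u^{t+1} - y_u^{t})^2}}{\sqrt{W^2 + H^2}}$$ where $U$ is the collection of interface elements, $(x_u^{t}, y_u^{t})$ are the coordinates of element u at time t, $(x_u^{t+1}, y_u^{t+1})$ are the coordinates of element u at time t+1, W is the screen width, and H is the screen height.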

[0119] A collection of interface elements can refer to the collection of all interactive or displayable visual elements in an interactive interface, including various interface components such as buttons, input boxes, images, text labels, and drop-down menus.

[0120] The coordinates of element u at time t can refer to the specific position coordinates of the interface element u in the screen coordinate system at time t. They are usually represented with the upper left corner of the screen as the origin and in the form of horizontal and vertical coordinates. For example, (100,200) represents a horizontal coordinate of 100 pixels and a vertical coordinate of 200 pixels.

[0121] The coordinates of element u at time t+1 can refer to the specific position coordinates of the interface element u in the screen coordinate system at time t+1, such as after the interface is refreshed or the page is switched.

[0122] Screen width can refer to the horizontal pixel length of the screen of the device performing the task, such as 1920 pixels, 2560 pixels, etc., which is determined by the device's hardware parameters.

[0123] Screen height can refer to the vertical pixel length of the screen of the device performing the task, such as 1080 pixels, 1440 pixels, etc., which is also determined by the device's hardware parameters.

[0124] This scheme determines the displacement of element u based on the Euclidean distance between its coordinates at time t and time t+1. The displacement is calculated as the square root of the sum of the squares of the differences in the horizontal and vertical coordinates. After summing the Euclidean distances of all elements in the interface element set, the average displacement is obtained by dividing by the total number of elements in the set.

[0125] This technical solution provides a method for calculating the average displacement of elements. By calculating the Euclidean distance and normalizing the screen diagonal, it ensures the uniformity and comparability of the displacement quantification results, avoids evaluation deviations caused by differences in device screen size, and provides a reliable quantitative basis for the accurate evaluation of interface variability.
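The calculation described in this embodiment, the mean Euclidean displacement divided by the screen diagonal, can be sketched as follows. The sketch assumes that only elements present in both snapshots are compared; the dictionary layout and function name are hypothetical.

```python
import math


def average_displacement(coords_t: dict, coords_t1: dict, width: float, height: float) -> float:
    """Mean Euclidean displacement of shared elements, over the screen diagonal."""
    shared = coords_t.keys() & coords_t1.keys()
    if not shared:
        return 0.0
    diagonal = math.hypot(width, height)
    total = sum(
        math.hypot(coords_t1[u][0] - coords_t[u][0],
                   coords_t1[u][1] - coords_t[u][1])
        for u in shared
    )
    return total / len(shared) / diagonal
```

On a 1920×1080 screen, an element that moves from (100, 200) to (100, 500) contributes 300 pixels of displacement, which is about 0.136 of the diagonal before averaging over all elements.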

[0126] In one feasible embodiment, the multi-dimensional evaluation model includes an intent ambiguity sub-model; The process of calculating the intent ambiguity difficulty value through the intent ambiguity sub-model includes: Obtain the basic parameters of the intent ambiguity index; An intent difference parameter is calculated based on the maximum similarity between the instruction and the preset instruction template, and the number of candidate intents and the perplexity of the language model are each normalized. The intent difference parameter, the normalized number of candidate intents, and the normalized language model perplexity are used as input data and input into the intent ambiguity sub-model. The input data is weighted and summed according to preset weights to obtain the weighted summation result of the intent ambiguity. The weighted summation result of the intent ambiguity is processed by range limitation, and the intent ambiguity difficulty value is output.

[0127] Normalization can refer to the process of mapping the original values ​​of candidate intent number and language model perplexity to the [0,1] interval.

[0128] The intent difference parameter can be a quantitative parameter that reflects the degree of difference between the task instruction and the preset template. It is obtained by performing a reverse calculation on the maximum similarity between the instruction and the preset instruction template.

[0129] Input data can refer to a set of semantic feature data consisting of intent difference parameters, normalized candidate intent numbers, and normalized language model perplexity, which can comprehensively reflect the ambiguity of instruction semantics.

[0130] The intent ambiguity sub-model can refer to a model that uses a lightweight language model, such as BERT-tiny, to encode semantic ambiguity. It can deeply mine the deep semantic features of instruction text and adapt to the deployment requirements of mobile devices.

[0131] Preset weights can refer to the feature weight coefficients set based on the actual needs of semantic ambiguity assessment, such as η1=0.4, η2=0.3, and η3=0.3, which correspond to the intent difference parameter, the number of normalized candidate intents, and the perplexity of the normalized language model, respectively.

[0132] Weighted summation can refer to the process of performing a weighted summation operation on three input semantic feature data according to preset weights to obtain a numerical value that comprehensively reflects the degree of ambiguity of the intent.

[0133] The weighted summation result of intent ambiguity can refer to the original value obtained after weighted accumulation operation without range restriction, and its magnitude is positively correlated with the degree of intent ambiguity.

[0134] Range restriction processing can refer to the process of using the clip function to strictly limit the weighted sum of intent ambiguity results to the range of [0,1], ensuring the standardization and comparability of difficulty values.

[0135] The intent ambiguity difficulty value can refer to a quantitative value that reflects the degree of ambiguity of intent after range restriction processing. The value range is [0,1]. The larger the value, the more ambiguous the user's intent is, and the higher the difficulty for the agent to understand and execute it.

[0136] This technical solution uses a lightweight language model to deeply mine the semantic ambiguity of instruction text. By combining the number of candidate intents and language perplexity, it achieves accurate quantification of intent ambiguity, breaking through the limitations of traditional methods that rely solely on text matching. It provides reliable support for assessing the difficulty of tasks with unclear intents and ensures that the GUI intelligent agent can accurately understand user needs.
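The weighted summation and range limitation of this embodiment can be sketched with the example weights η1=0.4, η2=0.3, η3=0.3 from the text. Three assumptions are made: the intent difference parameter is 1 minus the maximum template similarity, the two counts are normalized by division by preset maxima, and those maxima take the hypothetical values shown. The language-model encoding step is omitted.

```python
ETAS = (0.4, 0.3, 0.3)  # example weights from the text


def intent_ambiguity(max_template_similarity, n_candidates, perplexity,
                     n_candidates_max=5, perplexity_max=100.0):  # hypothetical maxima
    """Weighted sum of the three semantic features, limited to [0, 1]."""
    intent_diff = 1.0 - max_template_similarity  # assumed reverse calculation
    n_hat = min(n_candidates / n_candidates_max, 1.0)
    ppl_hat = min(perplexity / perplexity_max, 1.0)
    weighted = ETAS[0] * intent_diff + ETAS[1] * n_hat + ETAS[2] * ppl_hat
    return max(0.0, min(1.0, weighted))          # clip to [0, 1]
```

An instruction with a maximum template similarity of 0.5, three candidate intents, and a perplexity of 40 gives 0.4 × 0.5 + 0.3 × 0.6 + 0.3 × 0.4 = 0.5 under these assumptions.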

[0137] In one feasible embodiment, performing task difficulty evaluation on the difficulty values of each dimension based on the preset fusion model to obtain the task difficulty assessment result of the task instruction includes: constructing the difficulty levels of each dimension as input data and inputting them into the preset fusion model to obtain the task difficulty assessment result of the task instruction output by the preset fusion model.

[0138] The difficulty level of each dimension can refer to the discrete level determined by comparing the difficulty value of each dimension with the preset first difficulty threshold (0.33) and second difficulty threshold (0.66), specifically divided into three levels: level 0 (easy), level 1 (medium), and level 2 (difficult).

[0139] The preset fusion model extracts features from the input data, performs deep fusion and computation on them, and outputs an overall difficulty level probability and an overall continuous difficulty score. These outputs are then combined with the preset threshold judgment rules to obtain the comprehensive evaluation result.

[0140] This technical solution integrates difficulty level information and quantitative information from various dimensions, and outputs evaluation results based on the fusion of multi-dimensional features, providing a data foundation for the formulation of subsequent scheduling strategies.

[0141] In one feasible embodiment, determining the difficulty level of each dimension based on the difficulty value of each dimension and a preset mapping relationship includes: setting a first difficulty threshold and a second difficulty threshold, wherein the first difficulty threshold is less than the second difficulty threshold; and determining the difficulty level of each dimension based on the results of comparing the difficulty value of each dimension with the first difficulty threshold and the second difficulty threshold.

[0142] The first difficulty threshold can refer to the critical value used to distinguish between easy and medium difficulty. It is set based on the user experience goals and resource adaptation requirements of mobile GUI tasks, and the specific value is 0.33.

[0143] The second difficulty threshold can refer to the critical value used to distinguish between medium and hard difficulty. It is also set based on the actual scenario requirements of mobile GUI tasks, and the specific value is 0.66.

[0144] This solution can predefine the threshold values for difficulty level division based on statistical analysis of the execution results of a large number of GUI tasks, user experience surveys, and resource configuration threshold analysis, ensuring that the level division is reasonable and practical.

[0145] The comparison result can refer to the logical relationship obtained by comparing the difficulty value of each dimension with the first difficulty threshold and the second difficulty threshold respectively. Specifically, it includes three cases: dimension value < first difficulty threshold, first difficulty threshold ≤ dimension value < second difficulty threshold, and dimension value ≥ second difficulty threshold.

[0146] The difficulty level of each dimension can refer to the discrete level determined based on the above comparison results, where a dimension value < 0.33 corresponds to level 0 (easy), 0.33 ≤ dimension value < 0.66 corresponds to level 1 (medium), and a dimension value ≥ 0.66 corresponds to level 2 (difficult).

[0147] This technical solution establishes standardized difficulty level classification rules that map the continuous difficulty values of every dimension to discrete levels in a uniform way, ensuring consistent level determination across dimensions and scenarios and providing a unified level basis for subsequent fusion model input and scheduling strategy formulation.
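The threshold comparison described in this embodiment can be sketched as follows. The thresholds 0.33 and 0.66 and the three levels come from the text above; the function and constant names are hypothetical.

```python
FIRST_THRESHOLD = 0.33   # boundary between easy and medium
SECOND_THRESHOLD = 0.66  # boundary between medium and difficult


def difficulty_level(value, t1=FIRST_THRESHOLD, t2=SECOND_THRESHOLD):
    """Map a continuous difficulty value to level 0, 1 or 2.

    value < t1        -> 0 (easy)
    t1 <= value < t2  -> 1 (medium)
    value >= t2       -> 2 (difficult)
    """
    if not t1 < t2:
        raise ValueError("first threshold must be less than second threshold")
    if value < t1:
        return 0
    if value < t2:
        return 1
    return 2


if __name__ == "__main__":
    for v in (0.10, 0.33, 0.58, 0.66, 0.72):
        print(v, difficulty_level(v))
```

Both boundaries are inclusive on the upper level, so a value exactly equal to 0.33 is medium and a value exactly equal to 0.66 is difficult.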

[0148] In one feasible embodiment, the multimodal data includes: task description data, user interaction logs, user interface data, and device operating context data; The standardization process includes at least one of the following methods: using a key-value pair format to unify the data structure, filtering illegal characters, sorting in chronological order, and removing invalid data.

[0149] Task description data can refer to various types of information input by users to clarify task requirements. It can be instruction text in natural language form or structured task template data, such as a natural language instruction like "Plan a 3-day tour from location A to location B, including round-trip airfare + four-star hotel + reservations for 3 popular attractions + 2 reservations for specialty restaurants", or itinerary planning template data containing fixed fields.

[0150] User interaction logs refer to the operational behavior data automatically recorded by the system during the execution of tasks by users or intelligent agents, including the coordinates and time of click events, the path and speed of swipe trajectories, the reasons and number of operation failures, the duration of page dwell, and other detailed information.

[0151] User interface data can refer to all application interface-related data involved in the task execution process, including screenshots, UI tree structure files, the types and attributes of interface elements, element position coordinates, layout rules, etc., such as screenshots of flight booking pages, UI tree structures of hotel lists, and button and input box attribute data of attraction reservation pages.

[0152] Device operating context data can refer to data related to the hardware configuration and software runtime environment of the device executing the task, including device model, operating system version, screen resolution, theme style (such as dark theme or light theme), network status, remaining battery power, etc., such as an Android 14 device with a 1920×1080 resolution or a tablet device in a 5G network environment.

[0153] Using a key-value pair format to unify the data structure can mean storing and organizing different types of data, such as task description data and device context data, in the form of key-value pairs of attribute name and attribute value. This ensures the structural consistency of data from different sources. For example, travel itinerary nodes and flight / hotel booking parameters can be encapsulated into key-value pairs using fixed fields such as "departure point - location A", "destination point - location B", and "hotel star rating - four-star".

[0154] Filtering illegal characters can refer to using techniques such as regular expression matching and character encoding verification to remove characters in data that do not meet the preset format requirements, such as special symbols in travel dates, garbled characters in text instructions, and illegal Unicode encoded characters.

[0155] Time-series sorting refers to the process of arranging data with time-series characteristics, such as user interaction logs, in chronological order based on the timestamp information contained in the data, ensuring a clear logical sequence of operations.

[0156] Invalid data removal can refer to removing redundant data that is repeatedly recorded, incomplete data that lacks key fields, and empty data that has no practical meaning through methods such as data integrity verification and duplicate detection. Examples include interaction logs that do not record operation types and coordinates, and incomplete attribute data of UI elements.

[0157] This technical solution ensures that the collected data comprehensively covers key information for task execution by using specific types of multimodal data and standardized processing methods. At the same time, standardization processing eliminates data format differences and interference from invalid information, providing standardized basic data for subsequent feature extraction and difficulty assessment, and ensuring the accuracy and efficiency of the entire assessment process.
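The four standardization methods above can be sketched for interaction logs as follows. This is an illustrative sketch under stated assumptions: the field names (timestamp, op_type, x, y), the definition of "illegal characters" as control characters plus the Unicode replacement character, and the duplicate criterion (all required fields equal) are choices made for the example and are not specified by this application.

```python
import re

# Assumption: control characters and U+FFFD count as "illegal characters".
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")

# Hypothetical fixed fields of a cleaned interaction log record.
REQUIRED_FIELDS = ("timestamp", "op_type", "x", "y")


def clean_text(text):
    """Filter illegal characters from a text field."""
    return ILLEGAL_CHARS.sub("", text).strip()


def standardize_logs(records):
    """Unify records as key-value pairs, drop invalid ones, sort by time."""
    seen = set()
    cleaned = []
    for rec in records:
        # Remove incomplete records that lack a key field.
        if any(rec.get(field) is None for field in REQUIRED_FIELDS):
            continue
        key = tuple(rec[field] for field in REQUIRED_FIELDS)
        # Remove records that were recorded more than once.
        if key in seen:
            continue
        seen.add(key)
        # Keep only the fixed fields so every record has the same structure.
        cleaned.append({field: rec[field] for field in REQUIRED_FIELDS})
    # Sort chronologically by timestamp.
    cleaned.sort(key=lambda r: r["timestamp"])
    return cleaned


if __name__ == "__main__":
    logs = [
        {"timestamp": 3, "op_type": "click", "x": 10, "y": 20},
        {"timestamp": 1, "op_type": "swipe", "x": 0, "y": 0},
        {"timestamp": 3, "op_type": "click", "x": 10, "y": 20},  # duplicate
        {"timestamp": 2, "x": 5, "y": 5},                        # no op_type
    ]
    print(standardize_logs(logs))
    print(clean_text("2026-02-04\x00\ufffd"))
```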

[0158] To enable those skilled in the art to better understand this solution, this application also provides a preferred embodiment. Figure 2 is a schematic diagram of a task difficulty assessment process provided in an embodiment of this application. As shown in Figure 2, the task difficulty assessment process includes:

Step S1: Multi-source data acquisition and standardization. Task-related multimodal data is obtained from mobile terminals and related systems to provide foundational data for subsequent difficulty assessment. The data includes: task description data (user-input natural language commands or structured task templates, such as "Plan a 3-day trip from location A to location B, including round-trip airfare + four-star hotel + reservations for 3 popular attractions + 2 reservations for specialty restaurants"); user interaction logs (click events, swipe trajectories, operation failure records, etc.); UI data (application interface screenshots and UI tree structure files, such as screenshots of the flight booking page, the UI tree structure of the hotel list, element information of the attraction reservation page, and layout data of the Dianping restaurant reservation interface); and device context data (device model, operating system version, screen resolution, and theme style, such as an Android 14 device with a 1920×1080 resolution and a dark theme).

The system uses JSON format to unify the task description and UI tree data structures (e.g., encapsulating travel itinerary nodes, flight / hotel booking parameters, and attraction filtering conditions as fixed fields); filters illegal characters in the data using regular expressions (e.g., invalid characters in the itinerary date format); sorts interaction logs chronologically by timestamp and removes duplicate, incomplete, and invalid data (e.g., logs that do not record operation types); and stores device context data as key-value pairs to keep the data format consistent across devices.

The standardized multi-source dataset includes structured task descriptions (including task objectives, operation steps, and identifiers of the applications and UI elements involved), cleaned interaction logs (including fields such as timestamps, operation types, and coordinate locations), parsable UI data (including element types, hierarchical relationships, text content, and location coordinates), and normalized device context information.

[0159] Step S2: Extraction of task features and five-dimensional difficulty indicators. Based on the standardized data, core task features are extracted and the basic parameters of the five-dimensional difficulty index are calculated to provide a data basis for difficulty classification. Specifically: A large language model fine-tuned with specific instructions performs natural language processing on the task description to extract intent categories (e.g., "multi-day cross-city travel planning"), action types (e.g., "open application", "enter text", "click button"), and the number of applications involved. Statistical analysis is performed on the interaction logs to calculate the number of operation steps, the number of input characters, the number of application switches, and the number of cross-application data transfer events. A multimodal large model, also fine-tuned with specific instructions, performs image recognition and structural analysis on the UI data to calculate the frequency of UI element changes (e.g., the real-time update frequency of promotional labels in a travel app), layout structure differences (based on graph edit distance, GED), and adjacent-frame pixel similarity (based on the structural similarity index, SSIM). For decision complexity, a task action graph G=(V,E) is constructed (V represents action / page nodes, E represents reachable transition edges), and the graph depth d, average branch factor b, number of explicit conditional judgments m, and historical failure rollback ratio r are calculated. For application span, the number of applications involved N_app, the number of switches S, and the number of cross-application data transfer events H are counted.
For operation complexity, the complexity of operation types is accumulated by weight (click weight 1, long press weight 2, drag weight 3, text input weight 4), and basic parameters are calculated by combining the number of steps K, the number of input characters L_char, and the single-step error rate e. For interface variability, the UI tree structure similarity S_struct is calculated based on graph edit distance, pixel similarity is calculated based on structural similarity index, and average displacement Δ is calculated based on element coordinates. For intent ambiguity, the maximum similarity s_max between the instruction text and the template library, the number of candidate intents M, and the language model perplexity PPL are calculated. The basic parameter set of the five-dimensional difficulty index includes decision complexity parameters (d, b, m, r), application span parameters (N_app, S, H), operation complexity parameters (K, L_char, operation type weight sum, e), interface variability parameters (S_struct, SSIM, Δ), and intent ambiguity parameters (s_max, M, PPL).
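As an illustration of the decision complexity parameters, the sketch below derives d, b and m from a small task action graph. It is a minimal example under assumptions that this application does not state: the graph is acyclic and given as a successor dictionary, the depth d is the number of edges on the longest path from the start node, the average branch factor b is the mean out-degree over nodes that have at least one successor, and conditional nodes are supplied as an explicit set. The historical rollback ratio r comes from execution history and is not computed here.

```python
from functools import lru_cache


def graph_parameters(edges, start, conditional_nodes=frozenset()):
    """Return (d, b, m) for a task action graph.

    edges             : dict mapping each node to a list of successor nodes
                        (assumed to describe a directed acyclic graph)
    start             : node where the task begins
    conditional_nodes : nodes that carry an explicit condition judgment
    """
    nodes = set(edges)
    for successors in edges.values():
        nodes.update(successors)

    @lru_cache(maxsize=None)
    def longest_from(node):
        successors = edges.get(node, [])
        if not successors:
            return 0
        return 1 + max(longest_from(nxt) for nxt in successors)

    d = longest_from(start)
    out_degrees = [len(edges[n]) for n in nodes if edges.get(n)]
    b = sum(out_degrees) / len(out_degrees) if out_degrees else 0.0
    m = sum(1 for n in nodes if n in conditional_nodes)
    return d, b, m


if __name__ == "__main__":
    # Hypothetical flight-booking fragment with one price check.
    graph = {
        "home": ["search"],
        "search": ["price_ok", "price_too_high"],
        "price_ok": ["pay"],
        "price_too_high": ["search_again"],
        "search_again": ["pay"],
    }
    print(graph_parameters(graph, "home", conditional_nodes={"search"}))
```

For real interaction data the action graph may contain cycles (for example, returning to a search page), in which case the longest-path definition would need a different treatment, such as bounding revisits.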

[0160] Step S3: Five-dimensional difficulty model assessment and grading. Based on the extracted basic parameters of the five dimensions (decision complexity, application span, operation complexity, interface variability, and intent ambiguity), the difficulty value of each dimension is calculated through five innovative sub-models. A unified threshold is then used to determine the discrete difficulty level, ultimately outputting a per-dimension, interpretable difficulty assessment result. The specific process is as follows:

1. Sub-model calculation. To address the five core difficulty factors affecting mobile GUI tasks, a sub-model architecture combining "scenario adaptation and technological innovation" is adopted. This breaks through the limitations of traditional single-dimensional, static evaluation and achieves precise quantification of difficulty in each dimension.

(1) Decision complexity sub-model: GAT graph network modeling of branch decision logic. Unlike the traditional evaluation method of "only counting the number of steps", the decision complexity sub-model adopts a shallow graph attention network (GAT) and decomposes the task into a task action graph composed of "action-page" nodes (V) and "reachable transition" edges (E). Through the attention mechanism, it focuses on high-complexity branches (such as the conditional nodes "whether the airfare price meets the standard" and "whether the hotel rating meets the requirements" in travel planning), accurately capturing the impact of decision depth and branch density on difficulty. The extracted parameters are the action graph depth d (number of steps on the longest path), the average branch factor b (average number of branches per node), the number of explicit condition judgments m (number of nodes containing if-else logic), and the historical failure rollback ratio r (proportion of rollbacks for this type of task).
First, d, b, and m are normalized to [0,1] using the maximum values of the industry scenario (e.g., d_max=30, b_max=10, m_max=15) and input into the GAT network along with the original r value (already in [0,1]). After weighted summation (the attention weights are optimized by training on 100,000+ GUI tasks), the result is compressed by the Sigmoid function to output the normalized difficulty value C_dec∈[0,1]. The calculation is given by formulas (1) and (2).

(2) Application span sub-model: LightGBM captures the criticality of cross-application interactions. The application span sub-model overcomes the limitation of existing technologies that only count the number of applications. It adopts the LightGBM gradient boosting tree model to explore the non-linear effects of key features such as "cross-application data transfer" and "switching timing" (e.g., the impact of a "document to calendar" data transfer on difficulty is much greater than that of "application switching without data transfer"). The extracted parameters are the number of applications involved N_app (number of applications in the interaction chain), the number of switches S (number of application jumps), and the number of cross-application data transfer events H (such as the number of "synchronize flight information to itinerary" and "synchronize hotel address to map navigation" events). N_app is normalized according to the "mobile cross-application limit" (N_max=8) using Nhat=(N_app-1) / (N_max-1), which eliminates the baseline offset of single-application tasks; S (S_max=20) and H (H_max=10) are normalized directly; and the normalized values are input into LightGBM. The preliminary score is output through the leaf node weights. After Sigmoid compression and clip(·,0,1) range restriction, C_app∈[0,1] is obtained.
The calculation is given by formulas (3) and (4).

(3) Operation complexity sub-model: TCN models the temporal dependencies of operations. The operation complexity sub-model breaks through the limitation of the traditional approach of "only counting operation steps" and adopts a shallow temporal convolutional network (TCN). It captures the long-term dependencies of operation sequences through dilated convolutions (such as the sequential logic of "flight booking → hotel booking → attraction booking" and "the impact of a previous booking failure on subsequent re-selection"). At the same time, it incorporates the weight differences between operation types (for example, drag complexity > click complexity). The extracted parameters are the number of operation steps K, the number of input characters L_char, the operation type complexity G (weighted sum with "click=1, long press=1.5, swipe=2, text input=1.5, drag=2"), and the single-step error rate e (proportion of failures for this step in historical executions). After K (K_max=50), L_char (L_max=200), and G (G_max=100) are normalized, they are combined with the original e value (already in [0,1]) to form an operation time sequence. This sequence is input into the TCN network to capture dependencies. After weighted fusion (β1=0.25, β2=0.2, β3=0.35, β4=0.2), the Sigmoid function outputs C_op∈[0,1]. The calculation is given by formulas (5) and (6).

(4) Interface variability sub-model: MobileViT-Tiny integrates multimodal interface features. Unlike existing technologies that rely solely on pixel similarity, the interface variability sub-model employs a lightweight visual Transformer model (such as MobileViT-Tiny) and integrates three modal features: "UI structural logic," "pixel vision," and "element displacement." This addresses the misjudgments caused by "same structure but large visual variations" and "visual similarity but large structural differences."
The input parameters are the UI structure similarity S_struct (1 − normalized GED) extracted in step S2, the pixel similarity SSIM, and the average element displacement Δ (normalized by the screen diagonal). S_struct, SSIM, and Δ are input into MobileViT-Tiny, and the three features "structural difference (1-S_struct)," "pixel difference (1-SSIM)," and "displacement Δ" are fused through an attention mechanism. After weighted summation with weights γ1=0.4, γ2=0.3, and γ3=0.3 and range restriction by clip(·,0,1), the output C_ui∈[0,1] is obtained (the larger the value, the more significant the interface variation). The calculation is given by formulas (7) to (10), where formula (7) gives the structural similarity, formula (8) the pixel similarity, and formula (9) the element displacement.

(5) Intent ambiguity sub-model: BERT-tiny encodes semantic ambiguity. The intent ambiguity sub-model breaks through the limitations of traditional "text-only matching" by using a lightweight language model (such as BERT-tiny) to deeply mine the semantic ambiguity of the instruction text (e.g., "plan a fun trip" can correspond to multiple intents such as "natural scenery tour / cultural and historical site tour / theme park tour"; "plan a cost-effective intercity tour" can correspond to multiple combinations such as "budget hotel + low-priced airfare" and "mid-range accommodation + discounted attraction tickets"). At the same time, it combines the number of candidate intents and language perplexity to improve the comprehensiveness of the evaluation. The extracted parameters are the instruction-template maximum similarity s_max (cosine similarity), the number of candidate intents M (number of templates with similarity ≥ 0.6), and the language model perplexity PPL (log PPL normalized to [0,1]). Calculation process: "intent difference (1-s_max)", "normalized candidate intent number Mhat", and "normalized perplexity Phat" are input into the output layer of BERT-tiny. After weighted summation with weights η1=0.4, η2=0.3, and η3=0.3 and range restriction by clip(·,0,1), C_intent∈[0,1] is output.
The calculation is given by formulas (11) and (12).

2. Level determination. Based on the user experience goals and resource adaptation requirements of mobile GUI tasks, a unified threshold rule is established to map the continuous difficulty value of each dimension to a discrete level (0=easy, 1=medium, 2=difficult), ensuring consistent evaluation across dimensions and scenarios.

Level 0 (easy): dimension value < 0.33, corresponding to tasks with "single application, few steps, no branches, clear intent, and stable interface" (such as "querying the price of tomorrow's airfare from location A to location B" or "booking a specific hotel in location B for 1 night").

Level 1 (medium): 0.33 ≤ dimension value < 0.66, corresponding to tasks with "2-3 applications, 10-20 steps, a few branches, relatively clear intent, and minor interface variations" (such as "planning a 1-day trip to Chengdu, including 2 cultural attractions + 1 local food reservation" or "booking round-trip airfare from location A to Guangzhou + 1 three-star hotel").

Level 2 (difficult): dimension value ≥ 0.66, corresponding to tasks with "≥ 3 applications, > 20 steps, multiple branches, ambiguous intent, and significant interface changes" (e.g., "Plan a 3-day trip to location B, including round-trip low-priced airfare (≤ 1500 yuan), a four-star hotel (close to the attraction, rating ≥ 4.8), reservations for 3 popular attractions (including attractions with limited capacity), and reservations for 2 specialty restaurants (avoiding peak seasons)").
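Two of the closed-form pieces in the sub-model descriptions above can be written out directly: the application count normalization Nhat=(N_app-1) / (N_max-1) and the weighted, clipped fusion of the three interface variability features. The sketch below covers only these arithmetic steps and omits the LightGBM and MobileViT-Tiny models themselves. The function names are hypothetical, and clipping Nhat to [0,1] for N_app above N_max is an added assumption.

```python
def clip(x, lo=0.0, hi=1.0):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def normalize_app_count(n_app, n_max=8):
    """Nhat = (N_app - 1) / (N_max - 1); a single-app task maps to 0.

    Clipping is an assumption added here so that N_app > N_max stays at 1.
    """
    return clip((n_app - 1) / (n_max - 1))


def interface_variability(s_struct, ssim, delta, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of structural, pixel and displacement differences.

    s_struct : UI structure similarity (1 - normalized GED), in [0, 1]
    ssim     : pixel similarity between adjacent frames, in [0, 1]
    delta    : average element displacement, normalized by screen diagonal
    """
    g1, g2, g3 = weights
    raw = g1 * (1.0 - s_struct) + g2 * (1.0 - ssim) + g3 * delta
    return clip(raw)


if __name__ == "__main__":
    print(normalize_app_count(1))   # single application
    print(normalize_app_count(4))   # four applications
    print(interface_variability(s_struct=0.9, ssim=0.8, delta=0.05))
```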

[0161] 3. Output results; The output is a five-dimensional difficulty assessment vector, containing "continuous difficulty value + discrete level" for each dimension. This not only reflects the subtle differences in difficulty across dimensions but also provides an interpretable basis for subsequent fusion models and scheduling strategies. An example is shown below: -Task Example: "Plan a 3-day trip to location B → Book round-trip low-cost airfare → Book a 4-star hotel → Reservations for 3 popular attractions with limited capacity → Reservations for 2 specialty restaurants → Sync the itinerary to your phone's calendar" - Evaluation results: [C_dec=0.58 (s_dec=1), C_app=0.72 (s_app=2), C_op=0.63 (s_op=2), C_ui=0.35 (s_ui=1), C_intent=0.42 (s_intent=1)] Interpretation of Results: The core difficulty of this travel planning task stems from "cross-application collaboration (C_app=0.72, Level 2)" and "operational complexity (C_op=0.63, Level 2)". It requires the linkage of multiple travel apps to complete data transfer and multi-step operations. The decision-making and intent difficulty is moderate, and the interface is relatively stable. Subsequent scheduling needs to focus on ensuring resource support for cross-application data synchronization (such as using full tools to extract order information from each app) and fault tolerance mechanisms for multi-step operations (such as re-recommendation after failed attraction reservations).

[0162] Step S4: Fusion model inference and overall difficulty assessment. By combining the five-dimensional difficulty assessment results with device context features, a fusion model is used to calculate the overall task difficulty, providing a core basis for scheduling strategies. Specifically:

Input construction: the five-dimensional continuous difficulty value vector c=[C_dec,C_app,C_op,C_ui,C_intent] and the one-hot encoding vector of the discrete levels, onehot(s), are concatenated to form the input vector x=[c,onehot(s)].

Fusion model computation: a multilayer perceptron (MLP) with Dropout (0.2) is used as the fusion model. The network is divided into a shared trunk and two output branches: the classification branch outputs the overall difficulty level probability p through the Softmax function (p∈Δ², where Δ² is the probability simplex over the three levels), and the regression branch outputs the overall continuous difficulty score D∈[0,1] through the Sigmoid function. The model is trained with the loss function L_fuse = α·CE(p, L_gt) + (1−α)·Huber(D, E), where α∈[0.5,0.7], L_gt is the manually annotated difficulty level, and E is the standardized observed effort, calculated as a weighted combination of the number of steps, the time delay, and the number of rollbacks.

Level determination: the overall difficulty level L is determined from the continuous score D: level 0 (easy) when D < 0.33, level 1 (medium) when 0.33 ≤ D < 0.66, and level 2 (difficult) when D ≥ 0.66. The level confidence (the maximum probability value of the classification branch) is also output.

Output results: the overall difficulty assessment of the task, including the overall difficulty level L (0 / 1 / 2), the continuous score D, and the level confidence. For example, the overall difficulty assessment result of the travel planning task is L=2 (difficult), D=0.71, and confidence=0.94.
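The input construction and the training loss of step S4 can be sketched in plain Python as follows. This illustrates only the arithmetic, not the MLP itself. Several details are assumptions not given in the text: the one-hot blocks are laid out dimension by dimension after the five continuous values, the Huber threshold is 1.0, and α is set to 0.6 (inside the stated range [0.5,0.7]). All function names are hypothetical.

```python
import math


def build_input(c, levels, n_levels=3):
    """Concatenate continuous values c with the one-hot encoding of levels."""
    x = list(c)
    for s in levels:
        x.extend(1.0 if k == s else 0.0 for k in range(n_levels))
    return x


def cross_entropy(p, label, eps=1e-12):
    """CE between predicted level probabilities p and the annotated level."""
    return -math.log(max(p[label], eps))


def huber(pred, target, delta=1.0):
    """Huber loss; the threshold delta=1.0 is an assumption."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def fusion_loss(p, d, label, effort, alpha=0.6):
    """L_fuse = alpha * CE(p, L_gt) + (1 - alpha) * Huber(D, E)."""
    return alpha * cross_entropy(p, label) + (1.0 - alpha) * huber(d, effort)


if __name__ == "__main__":
    c = [0.20, 0.70, 0.40, 0.35, 0.42]   # illustrative values only
    s = [0, 2, 1, 1, 1]                   # levels from the 0.33 / 0.66 rule
    x = build_input(c, s)
    print(len(x), x)
    print(fusion_loss(p=[0.1, 0.2, 0.7], d=0.68, label=2, effort=0.75))
```

With five dimensions and three levels the input vector has 5 + 5×3 = 20 entries.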

[0163] Step S5: Difficulty-based adaptive scheduling execution. Based on the overall difficulty assessment result and the real-time computing power of the device, the resource allocation and execution strategies of the training and inference phases are dynamically adjusted to ensure efficient task completion. Specifically:

Training: a phased "from easy to difficult" training strategy is adopted. In the initial stage, the model is trained only with level 0 (easy) task data. After the model's success rate on level 0 tasks reaches ≥95%, level 1 (medium) task data is introduced for incremental training. Finally, level 2 (difficult) task data is added to optimize the model's generalization ability. During training, the sampling weight of high-difficulty tasks (L=2) is increased (e.g., to 1.5 times that of easy tasks) to improve the model's adaptability to complex tasks.

Inference: inference parameters are dynamically configured by difficulty level according to the following rules:

When L=0 or D<0.33 (easy task): enable the lightweight inference engine (Light), limit the maximum number of inference steps to ≤20, limit the single-step latency to ≤800ms, and disable OCR tools to save resources, for example for the task "querying the price of tomorrow's airfare from location A to location B".

When L=1 or 0.33≤D<0.66 (medium task): enable the medium-sized inference engine, limit the maximum number of inference steps to ≤40 and the single-step latency to ≤1200ms, selectively enable the OCR tool (recognizing only key text elements), and configure a heuristic rollback strategy (after 1 failure, roll back 3 steps and re-execute), for example for the "Chengdu 1-day tour planning" task.

When L=2 or D≥0.66 (difficult task): enable the full inference engine, limit the maximum number of inference steps to ≤60 and the single-step latency to ≤2000ms, enable all tools (OCR, text recognition, element localization), increase the search depth, and allow one re-solution of the plan, for example for the multi-application collaborative planning task "3-day trip to location B (low-priced airfare + high-rated hotel + attractions + specialty restaurants)".

If the number of failures / rollbacks during task execution exceeds 2, the system automatically switches to a conservative strategy (reduce the maximum number of inference steps, extend the single-step latency limit, and enable more accurate tools).

The outputs of this step are the training plan and inference configuration adapted to the difficulty of the task, together with the resource scheduling instructions used during task execution, which ensure training efficiency and inference stability.

[0164] This proposal, through the above steps, realizes a complete technical solution from data collection to dynamic optimization, which can be widely applied to scenarios such as intelligent assistants, mobile office, and automated operation. It enables edge deployment without manual annotation or private permission dependencies, significantly improving the task processing capabilities of mobile GUI agents.
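The difficulty-based inference configuration of step S5 can be sketched as a lookup table plus a fallback rule. The step limits, latency limits and tool sets per level are taken from step S5. The engine and tool identifiers, the class and function names, and the concrete factors used by the conservative strategy (halving the step limit, doubling the latency limit, enabling the full tool set) are assumptions made for this example; the text only states the direction of each adjustment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InferenceConfig:
    engine: str
    max_steps: int
    step_latency_ms: int
    tools: tuple


# Per-level settings as listed in step S5 (identifiers are hypothetical).
CONFIGS = {
    0: InferenceConfig("light", 20, 800, ()),
    1: InferenceConfig("medium", 40, 1200, ("ocr_key_text",)),
    2: InferenceConfig("full", 60, 2000,
                       ("ocr", "text_recognition", "element_localization")),
}


def select_config(level, failures=0):
    """Pick the configuration for a level; go conservative after >2 failures."""
    cfg = CONFIGS[level]
    if failures > 2:
        # Assumed factors: halve the steps, double the latency limit,
        # and enable the full tool set.
        cfg = replace(
            cfg,
            max_steps=max(1, cfg.max_steps // 2),
            step_latency_ms=cfg.step_latency_ms * 2,
            tools=CONFIGS[2].tools,
        )
    return cfg


if __name__ == "__main__":
    print(select_config(0))
    print(select_config(2))
    print(select_config(1, failures=3))
```

Keeping the configurations immutable and deriving the conservative variant with `replace` means the base table is never modified during execution.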

[0165] Figure 3 is a schematic diagram of a task difficulty assessment device provided in an embodiment of this application. As shown in Figure 3, the device may include: The data acquisition module 310 is used to acquire multimodal data related to the task, and to standardize the multimodal data to obtain task description parameter data. The basic parameter calculation module 320 is used to extract task instructions based on the task description parameter data and calculate multi-dimensional indicator basic parameters, wherein the dimensions include: decision complexity, application span, operation complexity, interface variability and intent ambiguity. The dimension difficulty value determination module 330 is used to calculate the difficulty value of each dimension from the indicator basic parameters of the corresponding dimension through a pre-built multi-dimensional evaluation model. The task difficulty assessment module 340 is used to assess the task difficulty based on the difficulty values of each dimension using a preset fusion model, and obtain the task difficulty assessment result of the task instruction.

[0166] The task difficulty assessment device provided in this embodiment has the same functional modules and beneficial effects as the task difficulty assessment method described above. To avoid repetition, it will not be described in detail here.

[0167] Figure 4 is a schematic diagram of the structure of a task difficulty assessment device provided in an embodiment of this application. As shown in Figure 4, the task difficulty assessment device may include a processor 401 and a memory 402 storing computer program instructions.

[0168] Specifically, the processor 401 may include a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0169] Memory 402 may include mass storage for data or instructions. By way of example and not limitation, memory 402 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. In one instance, memory 402 may include removable or non-removable (or fixed) media, or memory 402 may be non-volatile solid-state memory. Memory 402 may be internal or external to the task difficulty assessment device.

[0170] In one instance, memory 402 may be read-only memory (ROM). In one instance, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.

[0171] Memory 402 may include read-only memory (ROM), random access memory (RAM), disk storage media device, optical storage media device, flash memory device, electrical, optical, or other physical / tangible memory storage device. Therefore, generally, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to one aspect of this disclosure.

[0172] The processor 401 implements the task difficulty assessment method in the above embodiments by reading and executing computer program instructions stored in the memory 402.

[0173] In one example, the task difficulty assessment device may also include a communication interface 403 and a bus 404. As shown in Figure 4, the processor 401, memory 402, and communication interface 403 are connected through bus 404 and communicate with each other over it.

[0174] The communication interface 403 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.

[0175] Bus 404 includes hardware, software, or both, that couples components of a task difficulty assessment device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 404 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, this application contemplates any suitable bus or interconnect.

[0176] The task difficulty assessment device can execute the task difficulty assessment method in the embodiments of this application, thereby realizing the task difficulty assessment method described in the above embodiments.

[0177] Furthermore, in conjunction with the task difficulty assessment method in the above embodiments, an embodiment of this application can provide a computer storage medium for its implementation. The computer storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement any of the task difficulty assessment methods in the above embodiments.

[0178] This application also provides a computer program product, including a computer program that, when executed by a processor, implements any of the task difficulty assessment methods described in the above embodiments.

[0179] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0180] The functional blocks shown in the above-described block diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, read-only memory (ROM), flash memory, erasable read-only memory (EROM), floppy disks, compact disc read-only memory (CD-ROM), optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.

[0181] It should also be noted that the exemplary embodiments mentioned in this application describe methods or systems based on a series of steps or apparatus. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0182] The aspects of this disclosure have been described above with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block in the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to create a machine such that these instructions, executable via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions / actions specified in one or more blocks of the flowchart illustrations and / or block diagrams. Such a processor can be, but is not limited to, a general-purpose processor, a special-purpose processor, a special application processor, or a field-programmable logic circuit. It is also understood that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can also be implemented by special-purpose hardware performing the specified functions or actions, or can be implemented by a combination of special-purpose hardware and computer instructions.

[0183] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, modules, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and all such modifications or substitutions fall within the protection scope of this application.

Claims

1. A method for assessing task difficulty, characterized in that the method includes: acquiring multimodal data related to a task, and standardizing the multimodal data to obtain task description parameter data; extracting a task instruction based on the task description parameter data, and calculating indicator basic parameters for multiple dimensions, wherein the dimensions include decision complexity, application span, operation complexity, interface variability, and intent ambiguity; calculating a difficulty value for each dimension, through a pre-built multi-dimensional evaluation model, based on the indicator basic parameters of the corresponding dimension; and performing task difficulty evaluation on the difficulty values of the dimensions based on a preset fusion model to obtain a task difficulty assessment result for the task instruction.

2. The method according to claim 1, characterized in that, after obtaining the task difficulty assessment result of the task instruction, the method further includes: obtaining real-time computing power data from the task execution device; and dynamically configuring the computing power for the task instruction based on the task difficulty assessment result and the real-time computing power data.
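The configuration step of claim 2 can be pictured with the minimal sketch below. The claim does not state how the difficulty result and the real-time computing power data are combined, so the function name `allocate_compute`, the `[0, 1]` difficulty scale, the `min_share` floor, and the proportional rule are all assumptions made purely for illustration.

```python
def allocate_compute(difficulty: float, available_units: float,
                     min_share: float = 0.1) -> float:
    """Return the compute units to reserve for one task instruction.

    Hypothetical rule: every task receives at least `min_share` of the
    currently available compute, and the remainder scales linearly with
    the assessed difficulty.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty is assumed to lie in [0, 1]")
    if available_units < 0:
        raise ValueError("available_units must be non-negative")
    share = min_share + (1.0 - min_share) * difficulty
    return available_units * share


# Example: a hard task on a device reporting 100 free compute units.
reserved = allocate_compute(difficulty=0.8, available_units=100.0)
```

Under this assumed rule, a harder task never receives less compute than an easier one on the same device, which is the behavior the claim appears to aim for.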

3. The method according to claim 1 or 2, characterized in that: the indicator basic parameters of decision complexity include one or more of the following: depth of the task action graph, average branch factor, number of explicit condition judgments, and historical failure rollback ratio; the indicator basic parameters of application span include one or more of the following: number of cross-applications, number of application switching times, and number of cross-application data transfer events; the indicator basic parameters of operation complexity include one or more of the following: number of operation steps, number of input characters, operation type weight, and single-step error rate; the indicator basic parameters of interface variability include one or more of the following: similarity of the interactive interface tree structure, pixel similarity, and average displacement of elements in the interface; and the indicator basic parameters of intent ambiguity include one or more of the following: the maximum similarity between the instruction and a preset instruction template, the number of candidate intents, and the perplexity of the language model.

4. The method according to claim 3, characterized in that the multi-dimensional evaluation model includes a decision complexity sub-model, and calculating the decision complexity difficulty value through the decision complexity sub-model includes: obtaining the indicator basic parameters of decision complexity; normalizing the depth of the task action graph, the average branch factor, and the number of explicit condition judgments to obtain a normalized result for each indicator of decision complexity; inputting the normalized results into the decision complexity sub-model, which performs a weighted summation on them to obtain a weighted summation result; and compressing the weighted summation result to obtain the decision complexity difficulty value.
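The normalize, weight, and compress sequence of claim 4 can be sketched as follows. The claim fixes neither the normalization bounds, the weights, nor the compression function; the values in `MAX_VALUES` and `WEIGHTS` and the use of `tanh` as the compression step are assumptions chosen only to make the sketch concrete.

```python
import math

# Hypothetical preset maxima and weights; the claim gives no values.
MAX_VALUES = {"depth": 20.0, "branch_factor": 5.0, "conditions": 10.0}
WEIGHTS = {"depth": 0.4, "branch_factor": 0.3, "conditions": 0.3}


def normalize(value: float, max_value: float) -> float:
    """Scale a raw indicator into [0, 1], clipping at the preset maximum."""
    return min(max(value / max_value, 0.0), 1.0)


def decision_complexity(depth: float, branch_factor: float,
                        conditions: float) -> float:
    """Weighted sum of normalized indicators, then compression."""
    raw = {"depth": depth, "branch_factor": branch_factor,
           "conditions": conditions}
    weighted_sum = sum(WEIGHTS[k] * normalize(raw[k], MAX_VALUES[k])
                       for k in raw)
    # Compression step: tanh is an assumed choice of squashing function.
    return math.tanh(weighted_sum)
```

Because the normalized indicators are clipped at their preset maxima, the weighted sum stays in `[0, 1]` and the compressed value stays below 1 regardless of how extreme the raw inputs are.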

5. The method according to claim 3, characterized in that the multi-dimensional evaluation model includes an application span sub-model, and calculating the application span difficulty value through the application span sub-model includes: obtaining the indicator basic parameters of application span; normalizing the number of cross-applications, the number of application switching times, and the number of cross-application data transfer events to obtain a normalized result for each indicator of application span; inputting the normalized results into the application span sub-model to obtain a preliminary score; and applying compression and then range limitation to the preliminary score to obtain the application span difficulty value.
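Claim 5 differs from claim 4 in that compression is followed by an explicit range limitation. A minimal sketch of that two-stage post-processing is given below; the preset maxima, the unweighted mean standing in for the sub-model, and the saturating exponential used for compression are all assumptions, since the claim does not specify them.

```python
import math


def application_span(num_apps: int, num_switches: int, num_transfers: int,
                     lower: float = 0.0, upper: float = 1.0) -> float:
    """Normalize, score, compress, then clamp into [lower, upper]."""
    # Hypothetical preset maxima used for normalization.
    normalized = [
        min(num_apps / 5.0, 1.0),
        min(num_switches / 10.0, 1.0),
        min(num_transfers / 10.0, 1.0),
    ]
    # Stand-in for the sub-model: an unweighted mean as the preliminary score.
    preliminary = sum(normalized) / len(normalized)
    # Compression (assumed form): saturating exponential.
    compressed = 1.0 - math.exp(-3.0 * preliminary)
    # Range limitation: clamp the compressed score.
    return min(max(compressed, lower), upper)
```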

6. The method according to claim 3, characterized in that the multi-dimensional evaluation model includes an operation complexity sub-model, and calculating the operation complexity difficulty value through the operation complexity sub-model includes: obtaining the indicator basic parameters of operation complexity; normalizing the number of operation steps, the number of input characters, and the complexity of operation type according to preset maximum values to obtain a normalized result for each indicator of operation complexity; inputting the normalized results and the single-step error rate into the operation complexity sub-model as input data to obtain the dependency relationships in the operation time sequence; performing weighted fusion of the normalized results according to preset weights to obtain a weighted fusion result; and compressing the weighted fusion result to output the operation complexity difficulty value.

7. The method according to claim 3, characterized in that the multi-dimensional evaluation model includes an interface variability sub-model, and calculating the interface variability difficulty value through the interface variability sub-model includes: obtaining the indicator basic parameters of interface variability; calculating a structural difference parameter and a pixel difference parameter based on the similarity of the interactive interface tree structure and the pixel similarity; inputting the structural difference parameter, the pixel difference parameter, and the average element displacement into the interface variability sub-model as input data; fusing the input data using the attention mechanism of the interface variability sub-model, and then performing a weighted summation according to preset weights to obtain an interface variability weighted summation result; and applying range limitation to the interface variability weighted summation result to output the interface variability difficulty value.
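The flow of claim 7 can be illustrated with the sketch below. Several points are assumptions rather than statements of the claimed sub-model: each difference parameter is taken as one minus the corresponding similarity, the attention mechanism is replaced by a parameter-free softmax over the inputs (a trained sub-model would learn its attention weights), and the preset weights are placeholders.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def interface_variability(tree_similarity: float, pixel_similarity: float,
                          avg_displacement: float,
                          weights=(0.4, 0.3, 0.3)) -> float:
    # Difference parameters, assumed to be 1 - similarity.
    features = [1.0 - tree_similarity, 1.0 - pixel_similarity,
                avg_displacement]
    # Stand-in attention: larger differences receive more emphasis.
    attention = softmax(features)
    n = len(features)
    # Scale by n so that uniform attention leaves the features unchanged.
    attended = [n * a * x for a, x in zip(attention, features)]
    # Weighted summation with preset (hypothetical) weights.
    weighted_sum = sum(w * x for w, x in zip(weights, attended))
    # Range limitation: clamp into [0, 1].
    return min(max(weighted_sum, 0.0), 1.0)
```

Two identical screens (both similarities equal to 1 and no displacement) yield a difficulty of 0 under this sketch, and the clamp keeps every output within `[0, 1]`.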

8. The method according to claim 7, characterized in that the average element displacement is calculated using a formula [formula not reproduced in the source text] defined in terms of the set of interface elements, the coordinates of element u at time t, the coordinates of element u at time t+1, the screen width W, and the screen height H.
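The formula of claim 8 does not appear in the text, so the sketch below shows only one plausible reading of the listed quantities: the mean, over interface elements, of the Euclidean distance each element moves between time t and time t+1, with horizontal movement divided by the screen width W and vertical movement divided by the screen height H. That form, the function name `average_displacement`, and the decision to consider only elements present at both times are assumptions, not the claimed formula.

```python
import math


def average_displacement(before: dict, after: dict,
                         width: float, height: float) -> float:
    """Mean screen-normalized displacement of interface elements.

    `before` and `after` map an element identifier to its (x, y)
    coordinates at time t and time t+1 respectively.
    """
    # Assumption: only elements present at both times contribute.
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    total = 0.0
    for u in common:
        dx = (after[u][0] - before[u][0]) / width
        dy = (after[u][1] - before[u][1]) / height
        total += math.hypot(dx, dy)
    return total / len(common)


# Example: one element moves, one stays put, on a 1000 x 1000 screen.
moved = average_displacement(
    {"button": (0, 0), "title": (100, 100)},
    {"button": (300, 400), "title": (100, 100)},
    width=1000, height=1000,
)
```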

9. The method according to claim 3, characterized in that the multi-dimensional evaluation model includes an intent ambiguity sub-model, and calculating the intent ambiguity difficulty value through the intent ambiguity sub-model includes: obtaining the indicator basic parameters of intent ambiguity; normalizing the number of candidate intents and the perplexity of the language model respectively; inputting the graph difference parameter into the intent ambiguity sub-model as input data; performing a weighted summation of the input data according to preset weights to obtain an intent ambiguity weighted summation result; and applying range limitation to the intent ambiguity weighted summation result to output the intent ambiguity difficulty value.

10. The method according to claim 1 or 2, characterized in that performing task difficulty evaluation on the difficulty values of the dimensions based on the preset fusion model to obtain the task difficulty assessment result of the task instruction includes: constructing input data from the difficulty levels of the dimensions and inputting it into the preset fusion model to obtain the task difficulty assessment result of the task instruction output by the preset fusion model.

11. The method according to claim 1 or 2, characterized in that determining the difficulty level of each dimension based on the difficulty value of each dimension and a preset mapping relationship includes: setting a first difficulty threshold and a second difficulty threshold, wherein the first difficulty threshold is less than the second difficulty threshold; and determining the difficulty level of each dimension based on the results of comparing the difficulty value of that dimension with the first difficulty threshold and the second difficulty threshold.
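The two-threshold mapping of claim 11 can be sketched as below. The claim requires only that the first threshold be smaller than the second; the default threshold values, the level names "low", "medium", and "high", and the handling of values exactly equal to a threshold are assumptions.

```python
def difficulty_level(value: float, first_threshold: float = 0.33,
                     second_threshold: float = 0.66) -> str:
    """Map a per-dimension difficulty value to a discrete level."""
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be less than second threshold")
    if value < first_threshold:
        return "low"
    if value < second_threshold:
        return "medium"
    return "high"


# Example: levels for the five dimensions of one task instruction.
values = {
    "decision_complexity": 0.72,
    "application_span": 0.10,
    "operation_complexity": 0.45,
    "interface_variability": 0.66,
    "intent_ambiguity": 0.30,
}
levels = {name: difficulty_level(v) for name, v in values.items()}
```

The resulting per-dimension levels are the kind of input that claim 10 describes passing to the preset fusion model.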

12. The method according to claim 1 or 2, characterized in that the multimodal data includes task description data, user interaction logs, interaction interface data, and device operating context data; and the standardization includes at least one of the following: unifying the data structure using a key-value pair format, filtering illegal characters, sorting in chronological order, and removing invalid data.
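The four standardization operations listed in claim 12 can be combined as in the following sketch. The record keys `timestamp`, `source`, and `payload`, the definition of illegal characters as ASCII control characters, and the rule that a record without a timestamp or text payload is invalid are all assumptions introduced for illustration.

```python
import re

# Assumed definition of "illegal characters": ASCII control characters
# other than tab, newline, and carriage return.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def standardize(records):
    """Return cleaned key-value records sorted in chronological order."""
    cleaned = []
    for rec in records:
        payload = rec.get("payload")
        timestamp = rec.get("timestamp")
        # Remove invalid data: missing timestamp or non-text payload.
        if not isinstance(payload, str) or timestamp is None:
            continue
        # Filter illegal characters.
        payload = _ILLEGAL.sub("", payload).strip()
        if not payload:
            continue
        # Unify the structure as a fixed set of key-value pairs.
        cleaned.append({
            "timestamp": timestamp,
            "source": rec.get("source", "unknown"),
            "payload": payload,
        })
    # Sort in chronological order.
    cleaned.sort(key=lambda r: r["timestamp"])
    return cleaned


# Example input mixing a task description, a log entry, and bad records.
sample = [
    {"timestamp": 2, "source": "interaction_log", "payload": "tap\x00 OK"},
    {"timestamp": 1, "source": "task_description", "payload": "open mail"},
    {"timestamp": 3, "source": "interface", "payload": None},
    {"source": "context", "payload": "no timestamp"},
]
result = standardize(sample)
```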

13. A task difficulty assessment device, characterized in that the device includes: a data acquisition module, used to acquire multimodal data related to a task and to standardize the multimodal data to obtain task description parameter data; a basic parameter calculation module, used to extract a task instruction based on the task description parameter data and to calculate indicator basic parameters for multiple dimensions, wherein the dimensions include decision complexity, application span, operation complexity, interface variability, and intent ambiguity; a dimension difficulty value determination module, used to calculate a difficulty value for each dimension, through a pre-built multi-dimensional evaluation model, based on the indicator basic parameters of the corresponding dimension; and a task difficulty assessment module, used to perform task difficulty evaluation on the difficulty values of the dimensions based on a preset fusion model to obtain a task difficulty assessment result for the task instruction.

14. A task difficulty assessment device, characterized in that the device includes a processor and a memory storing computer program instructions, wherein the processor reads and executes the computer program instructions to implement the task difficulty assessment method as described in any one of claims 1-12.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions which, when executed by a processor, implement the task difficulty assessment method as described in any one of claims 1-12.

16. A computer program product, characterized in that it includes a computer program that, when executed by a processor, implements the task difficulty assessment method as described in any one of claims 1-12.