Intelligent workflow dynamic construction method and device based on multi-modal demonstration learning
By using a multimodal demonstration learning method, intelligent workflows are automatically built, which solves the problems of fragility of traditional RPA tools and high barriers to entry for low-code platforms. It achieves stable execution and cross-platform collaboration in dynamic environments, improving the stability and reliability of workflows.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN YOUYOU INTERNET TECH CO LTD
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
Figure CN122241614A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of office automation technology, specifically relating to a method and apparatus for dynamically constructing intelligent workflows based on multimodal demonstration learning.
Background Technology
[0002] Businesses and individual users face numerous repetitive, cross-system tasks, such as data entry, report generation, and cross-platform information synchronization. Automating these tedious manual operations is a core need in the field of current office efficiency. Various automation and workflow management technologies have been researched to address this, but each has significant technical limitations.
[0003] For example, traditional RPA (Robotic Process Automation) tools automate based on screen coordinates, image matching, or fixed paths of UI elements. However, any slight adjustment to the target software interface can cause the process to crash, and they are completely unable to handle unexpected pop-ups, CAPTCHAs, or dynamic verifications that occur during the process, making them extremely vulnerable. In addition, traditional RPA tools require users to build processes programmatically, requiring programming mindset or specialized training, which excludes the vast majority of ordinary office workers.
[0004] No-code / low-code workflow platforms excel at connecting to standardized cloud services with existing APIs, but they often fall short when it comes to complex interactions in legacy desktop software, internal enterprise systems, or web pages that do not have open APIs; and users still need to build processes using abstract concepts such as flowcharts, triggers, and actions.
[0005] In addition, macros and shortcuts built into operating systems or applications are usually limited to a single operating system or application, making it impossible to achieve cross-platform and cross-application collaboration. They can only handle linear and deterministic processes and lack the ability to handle branches, exceptions, and dynamic changes.
[0006] Furthermore, although agents driven by large language models (LLMs) can understand complex instructions, they behave non-deterministically when performing specific software operations (the hallucination problem), which may lead to catastrophic errors. They lack the ability to stably perceive the state of the software interface and may take a different path on each execution. Moreover, frequently calling large models is costly and slow, which does not meet the accuracy and repeatability requirements of office automation.
[0007] Patent CN113391871A discloses a method and system for intelligent element fusion picking in RPA. It combines deep neural network-based CV element picking technology with traditional element picking techniques. In the RPA software backend, it automatically selects a more accurate and suitable element picking method based on the user's desired software interface (it also supports manual switching of picking methods by the user). During operation, it achieves automatic and seamless switching of picking methods. Furthermore, based on the positioning of each element, it provides content parsing capabilities for interface elements, enabling the orderly output of element categories, attributes, positions, and hierarchies, thereby supporting more diverse element manipulation capabilities. These functions provide users with a smoother user experience, improve the usability and support scope of RPA software, reduce user costs, and shorten the time users spend editing and developing RPA workflows. However, this method only parses static interfaces and does not simultaneously collect dynamic signals such as keyboard, mouse, voice, and system-level events, making it unable to automatically infer user intent and generate semantic workflows. While its positioning mechanism can automatically switch between traditional and CV picking methods, it still lacks dynamic environment adaptability.
[0008] Therefore, how to enable the operating system to automatically build intelligent workflows that can be stably and repeatedly executed in dynamic and ever-changing real digital environments is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0009] To address the shortcomings of existing technologies, this invention provides a method and apparatus for dynamically constructing intelligent workflows based on multimodal demonstration learning. The method includes: acquiring multimodal signals corresponding to each time frame in demonstration data; performing semantic analysis on the multimodal signals to determine intent steps and provide an initial workflow; identifying target interface controls based on the analysis of each intent step; forming location identifiers based on the control description information of the target interface controls; updating the initial workflow by fusing the intent steps and location identifiers to form a running workflow; identifying interface anomalies by scanning the screen in conjunction with the running workflow; invoking anomaly handling strategies and optimizing the running workflow based on the interface anomaly identification results; and verifying the optimized running workflow to construct a workflow based on multimodal demonstration learning. Employing the core paradigm of "demonstration as programming," the workflow is automatically constructed, and through semantic fingerprint multidimensional positioning and a perception-decision-adaptation intelligent closed-loop execution strategy, it proactively responds to dynamic interference, improving the stability and reliability of the workflow in real-world environments.
[0010] In a first aspect, the present invention provides a method for dynamically constructing intelligent workflows based on multimodal demonstration learning, characterized by comprising the following steps: acquiring the multimodal signals corresponding to each time frame in the demonstration data; performing semantic analysis on the multimodal signals to determine the intent steps and provide an initial workflow; identifying the target interface controls based on the analysis of each intent step; generating location identifiers based on the control description information of the target interface controls; updating the initial workflow by fusing the intent steps and the location identifiers to form a running workflow; identifying interface anomalies by scanning the screen in conjunction with the running workflow; invoking anomaly handling strategies and optimizing the running workflow based on the interface anomaly identification results; and verifying the optimized running workflow to construct a workflow based on multimodal demonstration learning.
[0011] Furthermore, semantic analysis is performed on the multimodal signals to determine the intended steps and provide an initial workflow, which specifically includes the following steps: Timestamp synchronization of multimodal signals; Perform format conversion on each modal signal, determine each initial event, and provide a list of initial events; Call the filtering function to filter each initial event and form a filtered event list; Based on the correlation of the operation sequence, the aggregation function is called to merge the filtered event list, determine the working events, and form the event list; Infer the intent of each work event in the event list, determine the intent steps of the corresponding event list, and provide an initial workflow.
[0012] Furthermore, intent inference is performed on each work event in the event list to determine the intent steps corresponding to the event list, and an initial workflow is provided, specifically including the following steps: Based on the multimodal intent inference function, intent inference is performed for each working event to generate an intent description; By combining the event list and intent description, determine the intent steps for each work event in the corresponding event list and provide an initial workflow.
[0013] Furthermore, the control description information includes semantic fingerprints, and the location identifier includes multiple location identifiers; Based on the control description information of the target interface control, a positioning identifier is generated, which specifically includes the following steps: Obtain the target modal signals of the corresponding target interface controls; Each target modal signal undergoes feature transformation to form corresponding target semantic features; Based on the weight allocation sub-rules, determine the weight coefficients corresponding to each target semantic feature; According to the matching threshold sub-rule, a corresponding matching threshold is set for each target semantic feature; The semantic features of each target, along with their corresponding weight coefficients and matching thresholds, are structurally encapsulated to form a multi-location identifier.
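The structural encapsulation described above can be pictured with a small data structure. The following is only a minimal sketch: the class names `SemanticFeature` and `LocationIdentifier`, the feature names, and the scoring rule (weighted share of features whose similarity reaches their threshold) are illustrative assumptions, since the text does not fix a concrete format or scoring formula.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticFeature:
    name: str         # hypothetical dimension name, e.g. "visual", "text"
    value: object     # feature value captured during the demonstration
    weight: float     # weight coefficient (from the weight allocation sub-rule)
    threshold: float  # matching threshold (from the matching threshold sub-rule)


@dataclass
class LocationIdentifier:
    """Multi-location identifier: several weighted features for one control."""
    features: list = field(default_factory=list)

    def score(self, similarities: dict) -> float:
        """Weighted share of features whose similarity reaches its threshold.

        `similarities` maps a feature name to a similarity in [0, 1] that was
        measured against a candidate control on the current screen.
        """
        total = sum(f.weight for f in self.features)
        if total == 0:
            return 0.0
        hit = sum(
            f.weight
            for f in self.features
            if similarities.get(f.name, 0.0) >= f.threshold
        )
        return hit / total


identifier = LocationIdentifier([
    SemanticFeature("visual", "a3f0", weight=0.4, threshold=0.85),
    SemanticFeature("text", "Save", weight=0.4, threshold=0.90),
    SemanticFeature("position", "top-right", weight=0.2, threshold=1.0),
])
# The control moved (position mismatch) but still looks and reads the same.
candidate_score = identifier.score({"visual": 0.9, "text": 0.95, "position": 0.0})
```

Because each dimension carries its own weight and threshold, a candidate control can still be accepted when one dimension (here, position) no longer matches.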
[0014] Furthermore, feature transformation is performed on each target modal signal to form corresponding target semantic features, specifically including: Based on the initial feature transformation strategy, the first target modal signal is transformed to form initial descriptive text features and initial target semantic features in multiple dimensions. The initial feature transformation strategy includes visual feature transformation, control type transformation, structural position transformation and descriptive text transformation. The initial target semantic features include initial descriptive text semantic features. The first target modal signal includes a local region image, which is a region image formed by expanding a preset pixel range around the click coordinates of the target interface control. Semantic relevance determination is performed on the second target modal signal, the association level with the target interface control is given, and the enhancement signals of each target are determined. The second target modal signal includes the target speech signal, and the timestamp of the target speech signal is aligned with the operation time of the target interface control. Based on the association level, the enhanced signals of each target and the initial descriptive text features are fused to form the corresponding enhanced descriptive text features; By combining initial target semantic features from multiple dimensions with enhanced descriptive text features, the target semantic features are presented.
[0015] Furthermore, based on the initial feature transformation strategy, feature transformation is performed on the first target modal signal to form initial descriptive text features and multiple dimensions of initial target semantic features, specifically including: The local region image is scaled to a preset size, converted to grayscale, and its low-frequency coefficients are extracted using a two-dimensional discrete cosine transform and binarized to generate a visual hash value. The average RGB channel values of the local region image are then calculated to obtain the region's average color features. The visual hash value and the region's average color features are fused to provide the target's visual semantic features; and / or, Calculate the aspect ratio and height of the local region image, compare them with preset control type discrimination rules, and determine the semantic features of the target control type; and / or, Calculate the proportional coordinates of the center point of the target interface control relative to the screen resolution, and based on a preset partition threshold, map them to relative position descriptors to form semantic features of the target structural position; and / or, Extract the text content, associated label text, and / or placeholder tooltip text displayed on the surface of the target interface control, provide character recognition results, and form initial descriptive text features.
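As an illustration of the visual and structural-position transformations described above, the sketch below computes a DCT-based visual hash, an average color, and a relative position descriptor using only the Python standard library. The preset size (32), the 8 x 8 low-frequency block, binarization against the median, and the one-third partition thresholds are assumptions chosen for the example; the text leaves these presets open. A real implementation would more likely use OpenCV or Pillow for scaling and the transform.

```python
import math


def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D list of RGB tuples to size x size."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def to_gray(rgb):
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def dct_lowfreq(gray, k):
    """Top-left k x k block of the (unnormalised) 2-D DCT-II of a square image."""
    n = len(gray)
    cos = [[math.cos(math.pi * (2 * i + 1) * u / (2 * n)) for i in range(n)]
           for u in range(k)]
    rows = [[sum(gray[y][x] * cos[v][x] for x in range(n)) for v in range(k)]
            for y in range(n)]
    return [[sum(rows[y][v] * cos[u][y] for y in range(n)) for v in range(k)]
            for u in range(k)]


def visual_hash(img_rgb, size=32, k=8):
    """Scale, convert to grayscale, take low-frequency DCT terms, binarize."""
    small = resize_nearest(img_rgb, size)
    gray = [[to_gray(p) for p in row] for row in small]
    coeffs = [c for row in dct_lowfreq(gray, k) for c in row]
    ac = sorted(coeffs[1:])          # ignore the DC term for the threshold
    median = ac[len(ac) // 2]
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


def mean_color(img_rgb):
    """Average value of the R, G and B channels over the region."""
    pixels = [p for row in img_rgb for p in row]
    return tuple(sum(p[i] for p in pixels) / len(pixels) for i in range(3))


def position_descriptor(cx, cy, screen_w, screen_h, low=1 / 3, high=2 / 3):
    """Map the control centre to a coarse label such as 'top-right'."""
    rx, ry = cx / screen_w, cy / screen_h
    horiz = "left" if rx < low else "right" if rx >= high else "center"
    vert = "top" if ry < low else "bottom" if ry >= high else "middle"
    return f"{vert}-{horiz}"
```

Two hashes are compared by Hamming distance, so small rendering differences change only a few bits, while the position descriptor stays stable across screen resolutions because it is computed from proportional coordinates.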
[0016] Furthermore, in conjunction with the operational workflow, interface anomalies are identified through screen scanning, specifically including the following steps: Based on the running workflow, determine the baseline screenshot for the current time frame; Get a real-time screenshot of the current frame by scanning the screen; Perform a first detection on the real-time screenshot of the current frame to identify abnormal interface elements; By combining the baseline screenshot of the current frame, a second detection is performed on the real-time screenshot of the current frame to identify abnormal interface states; By integrating abnormal interface elements and abnormal interface states, the system provides interface anomaly identification results that include the interface anomaly type.
[0017] Furthermore, a first detection is performed on the real-time screenshot of the current frame to identify abnormal interface elements, specifically including: Perform edge detection on the real-time screenshot of the current frame to determine the edge contour; Based on the geometric features of the edge contour, feature filtering and analysis are performed to identify abnormal interface elements.
[0018] Furthermore, performing the second detection on the real-time screenshot of the current frame in combination with the baseline screenshot of the current frame to identify abnormal interface states specifically includes: performing pixel analysis on the baseline screenshot and the real-time screenshot of the current frame, respectively, to obtain baseline pixel data and real-time pixel data; obtaining the proportion of pixels of a specific color in the real-time pixel data and identifying abnormal interface states from that proportion; and / or, comparing the baseline pixel data with the real-time pixel data and identifying abnormal interface states based on the pixel change ratio or the dynamically changing area between time frames.
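The two pixel-level checks can be sketched as follows. This is a simplified illustration that treats screenshots as nested lists of RGB tuples; the "dim overlay" color test, the tolerance, the thresholds, and the state labels are all assumed values for the example rather than values given in the text.

```python
def color_ratio(pixels, predicate):
    """Proportion of pixels in the screenshot that satisfy a color predicate."""
    flat = [p for row in pixels for p in row]
    return sum(1 for p in flat if predicate(p)) / len(flat)


def change_ratio(baseline, live, tol=16):
    """Proportion of pixels whose largest channel difference exceeds `tol`."""
    changed = total = 0
    for row_b, row_l in zip(baseline, live):
        for pb, pl in zip(row_b, row_l):
            total += 1
            if max(abs(a - b) for a, b in zip(pb, pl)) > tol:
                changed += 1
    return changed / total


def is_dim(pixel):
    """Assumed test for the dark mask that modal dialogs often draw."""
    return max(pixel) < 60


def detect_state(baseline, live, mask_thresh=0.6, change_thresh=0.3):
    """Return an assumed anomaly label, or None when the state looks normal."""
    if color_ratio(live, is_dim) >= mask_thresh:
        return "modal_overlay"
    if change_ratio(baseline, live) >= change_thresh:
        return "layout_changed"
    return None


white = (255, 255, 255)
baseline = [[white] * 4 for _ in range(4)]
dimmed = [[(30, 30, 30)] * 4 for _ in range(4)]
shifted = [[(200, 0, 0)] * 4 for _ in range(2)] + [[white] * 4 for _ in range(2)]
```

Here `detect_state(baseline, dimmed)` flags the overlay through the color proportion alone, while `detect_state(baseline, shifted)` is caught only by comparison against the baseline, which is why the text keeps both checks.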
[0019] Furthermore, based on the interface anomaly identification results, the anomaly handling strategy is invoked, and the execution workflow is optimized, specifically including the following steps: Based on the interface anomaly identification results, matching processing strategies are retrieved from the hierarchical strategy library and identified as candidate processing strategies to form a set of candidate processing strategies. Based on preset triggering conditions and priority rules, the candidate processing strategy set is filtered and sorted to determine the target processing strategy and provide the target processing strategy set. Execute each target processing strategy in the target processing strategy set in sequence and monitor the strategy execution results; Based on the strategy execution results, optimize the running workflow and update the hierarchical strategy library.
[0020] Furthermore, matching processing strategies are retrieved from the hierarchical strategy library, specifically including: Using interface exception types as search elements, a full-level parallel search mechanism is adopted to traverse each level of the hierarchical strategy library and provide the processing strategy that matches the search element in each level.
[0021] Furthermore, based on preset triggering conditions and priority rules, the candidate processing strategy set is filtered and sorted to determine the target processing strategy, specifically including the following steps: Based on preset triggering conditions, a full trigger matching verification is performed on each candidate processing strategy in the candidate processing strategy set, filtering out candidate processing strategies that do not fully meet the triggering conditions, and providing an initial target processing strategy. By combining the adjustment factors corresponding to each level, the processing strategies for each initial target are weighted and calculated to give a comprehensive priority score; Based on the overall priority score, the target processing strategy is determined.
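The retrieval, trigger filtering, and priority scoring steps can be sketched as below. The level names, adjustment factors, trigger format, and the score formula (base priority multiplied by the level's adjustment factor) are assumptions made for illustration; the text only states that each level has an adjustment factor and that a comprehensive priority score is computed. The traversal here is sequential for simplicity, whereas the text describes a parallel search over all levels.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    level: str            # hypothetical level names: "workflow", "local", "shared"
    anomaly_type: str     # interface anomaly type this strategy handles
    triggers: dict        # conditions that must ALL hold in the current context
    base_priority: float


# Assumed adjustment factor per level of the hierarchical strategy library.
LEVEL_FACTOR = {"workflow": 1.5, "local": 1.2, "shared": 1.0}


def retrieve(library, anomaly_type):
    """Traverse every level and collect strategies matching the anomaly type."""
    return [s for level in library.values() for s in level
            if s.anomaly_type == anomaly_type]


def select(candidates, context):
    """Drop candidates whose triggers are not fully met, then rank by score."""
    matched = [s for s in candidates
               if all(context.get(k) == v for k, v in s.triggers.items())]
    return sorted(matched,
                  key=lambda s: s.base_priority * LEVEL_FACTOR[s.level],
                  reverse=True)


library = {
    "workflow": [Strategy("click_close_x", "workflow", "popup",
                          {"has_close_button": True}, 0.6)],
    "local": [Strategy("press_escape", "local", "popup", {}, 0.7)],
    "shared": [Strategy("wait_and_retry", "shared", "popup",
                        {"blocking": False}, 0.9),
               Strategy("ask_user", "shared", "captcha", {}, 1.0)],
}
context = {"has_close_button": True, "blocking": True}
targets = select(retrieve(library, "popup"), context)
```

In this example `wait_and_retry` is filtered out because its trigger is not met, and the workflow-level strategy outranks the local one despite a lower base priority because of its larger adjustment factor.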
[0022] In a second aspect, the present invention also provides an intelligent workflow dynamic construction device based on multimodal demonstration learning, which employs the above-mentioned intelligent workflow dynamic construction method based on multimodal demonstration learning, specifically including: The demonstration learning module is used to acquire multimodal signals corresponding to each time frame in the demonstration data; perform semantic analysis on the multimodal signals to determine the intent steps and provide an initial workflow; and identify target interface controls based on the analysis of each intent step. The workflow building module is used to generate location identifiers based on the control description information of the target interface controls; and to update the initial workflow by integrating the intent steps and location identifiers to form a running workflow. The perception optimization module is used to combine the running workflow, identify interface anomalies through screen scanning; based on the interface anomaly identification results, invoke anomaly handling strategies and optimize the running workflow; and verify the optimized running workflow to construct a workflow based on multimodal demonstration learning.
[0023] The intelligent workflow dynamic construction method and apparatus based on multimodal demonstration learning provided by this invention have at least the following beneficial effects: (1) This invention forms a complete closed loop of “demonstration-construction-execution-verification-evolution” from multimodal signal acquisition, semantic analysis, interface control recognition, location identifier generation, workflow fusion, screen scanning, exception handling strategy invocation to execution verification, so that the intelligent workflow based on multimodal demonstration learning can truly have the ability to operate stably and reliably in dynamic and ever-changing real digital environments. The constructed workflow has semantic, executable, adaptive, evolvable and auditable characteristics.
[0024] (2) Users only need to demonstrate the operation once, and the workflow can be automatically built through multimodal recording and semantic transcription without writing code or configuring parameters, which fundamentally reduces the threshold for using automation tools.
[0025] (3) By generating multiple positioning identifiers through semantic fingerprints to replace fixed coordinate positioning, combined with screen scanning and exception handling strategy calls, it can automatically search for functionally equivalent replacement elements when the target interface changes, and actively respond to dynamic interference such as pop-ups, UI drift, and verification codes, which greatly improves the stability and reliability of the workflow in the real environment.
[0026] (4) Based on screen visual perception and event capture combined with keyboard and mouse, rather than relying on specific API interfaces, this invention can control any visible software interface, such as legacy desktop programs, internal systems without open APIs and web applications, and realize true linkage across operating systems and applications, solving the platform silo problem of existing tools.
[0027] (5) The present invention designs a hierarchical strategy library based on anomaly classification and an intelligent closed-loop execution mechanism of "perception-decision-adaptation", which enables the workflow execution to have real-time perception and autonomous response capabilities to the dynamic environment, rather than the mechanical replay mode of the prior art.
Attached Figure Description
[0028] Figure 1 is a schematic diagram of an intelligent workflow construction device based on multimodal demonstration learning provided by the present invention;
Figure 2 is a flowchart illustrating an intelligent workflow construction method based on multimodal demonstration learning provided by the present invention;
Figure 3 is a schematic diagram illustrating the process of interface anomaly identification according to one embodiment of the present invention;
Figure 4 is a schematic diagram of the "perception-decision-execution" closed-loop mechanism of the intelligent workflow construction method based on multimodal demonstration learning provided by the present invention;
Figure 5 is a simplified example diagram of the intelligent workflow execution interface based on multimodal demonstration learning provided by the present invention.
Detailed Implementation
[0029] To better understand the above technical solutions, a detailed description of the solutions will be provided below in conjunction with the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0030] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms “a,” “an,” and “the” as used in the embodiments of this invention and the appended claims are also intended to include the plural forms, and “multiple” generally includes at least two unless the context clearly indicates otherwise.
[0031] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such an article or device. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the article or device that includes said element.
[0032] The goal is to enable the operating system to automatically build stable and repeatable intelligent workflows capable of operating in dynamic and ever-changing real-world digital environments, and to automatically learn and evolve from failures. Specifically, this can be achieved through three methods: First, automatically translating user actions into semantic, executable workflow descriptions without requiring users to write code or configure parameters; second, sensing environmental changes in real time during workflow execution (such as interface drift, pop-ups, CAPTCHAs, etc.) and automatically selecting appropriate strategies to ensure the robustness of the process in dynamic environments; and third, extracting new experiences from execution failures or manual interventions to dynamically update system strategies, achieving continuous self-evolution of the workflow.
[0033] Based on this, the present invention provides a method and apparatus for dynamically constructing intelligent workflows based on multimodal demonstration learning. With "demonstration as programming" as the core paradigm, it enables ordinary users without programming skills to automatically construct intelligent workflows that can be stably executed in dynamic and ever-changing real digital environments through just one intuitive operation demonstration.
[0034] The method and apparatus for dynamically constructing intelligent workflows based on multimodal demonstration learning are deployed and implemented in a client-server collaborative architecture. The client side runs on the user's local operating system (Windows / macOS / Linux), and uses system-level APIs and hook mechanisms to achieve frame-by-frame screen capture, keyboard and mouse operation event monitoring, microphone audio stream acquisition, and real-time monitoring of clipboard and window events. The server side is deployed in a private or public cloud environment and is responsible for distributing cloud-shared strategies, deduplicating locally anonymized strategies, and aggregating and analyzing strategies across the user group.
[0035] The core implementation is developed in Python. It relies on the OpenCV and Pillow libraries for image processing and perceptual hash calculation, uses Pytesseract or cloud-based OCR services for optical character recognition, and employs the SpeechRecognition library together with VAD (Voice Activity Detection) for real-time transcription and noise filtering of the voice narration. Local data persistence uses an SQLite database as the primary storage engine, supplemented by daily audit log backups in JSONL text files; data consistency under concurrent access to the strategy library is ensured through an exponential backoff retry mechanism.
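The exponential backoff retry mentioned above might look like the following minimal sketch built on the standard `sqlite3` module. The helper name `with_backoff`, the retry count, the base delay, the added random jitter, and the `strategy` table layout are assumptions for illustration and are not specified in the text.

```python
import random
import sqlite3
import time


def with_backoff(operation, retries=5, base_delay=0.05, sleep=time.sleep):
    """Run `operation`, retrying on sqlite3.OperationalError (e.g. a locked
    database) with a delay that doubles after every failed attempt."""
    for attempt in range(retries):
        try:
            return operation()
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            # Doubling delay plus a small random jitter (jitter is an assumption).
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Usage with an in-memory database and a hypothetical strategy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE strategy (name TEXT PRIMARY KEY, hits INTEGER)")
with_backoff(lambda: conn.execute(
    "INSERT INTO strategy (name, hits) VALUES (?, ?)", ("press_escape", 0)))
conn.commit()
```

The `sleep` parameter is injectable so that the retry schedule can be checked without actually waiting.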
[0036] During the execution phase, screen scanning, location matching, and interface anomaly identification all run in a local sandbox environment. Sensitive operations (such as file deletion and payment confirmation) force users to confirm twice via system pop-ups. All automated operations and decision-making processes are recorded in structured audit logs, supporting multi-dimensional retrieval and compliance auditing based on workflow identifiers, step identifiers, and timestamps. Through this hardware and software collaborative architecture, end-to-end automation capabilities are achieved, from multimodal signal acquisition and semantic workflow construction to intelligent closed-loop execution.
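A minimal sketch of the structured audit log and the confirmation gate for sensitive operations is given below. The record fields, the action names in `SENSITIVE`, and the `confirm` callback (standing in for the system pop-up) are hypothetical; the text only requires that records be retrievable by workflow identifier, step identifier, and timestamp.

```python
import io
import json
import time

# Hypothetical names for operations that require user confirmation.
SENSITIVE = {"delete_file", "confirm_payment"}


def guarded(action, confirm):
    """Allow non-sensitive actions; sensitive ones need the user's approval."""
    return action not in SENSITIVE or confirm(action)


def audit(stream, workflow_id, step_id, action, decision, clock=time.time):
    """Append one JSON record per line (JSONL) to the audit stream."""
    record = {"ts": clock(), "workflow_id": workflow_id, "step_id": step_id,
              "action": action, "decision": decision}
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def query(stream, **criteria):
    """Return all records whose fields equal the given criteria."""
    stream.seek(0)
    rows = (json.loads(line) for line in stream if line.strip())
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


log = io.StringIO()  # a file opened in append mode would be used in practice
allowed = guarded("delete_file", confirm=lambda action: False)
audit(log, "wf-1", "s3", "delete_file", "executed" if allowed else "blocked")
audit(log, "wf-2", "s1", "click", "executed")
blocked = query(log, workflow_id="wf-1", decision="blocked")
```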
[0037] As shown in Figure 1, the intelligent workflow dynamic construction device based on multimodal demonstration learning includes a demonstration learning module, a workflow construction module, and a perception optimization module. The demonstration learning module is used to acquire multimodal signals corresponding to each time frame in the demonstration data; perform semantic analysis on the multimodal signals to determine the intent steps and provide an initial workflow; and, based on the analysis of each intent step, identify the target interface controls. The workflow construction module is used to generate location identifiers based on the control description information of the target interface controls; and to update the initial workflow by integrating the intent steps and location identifiers to form a running workflow. The perception optimization module is used to combine the running workflow, identify interface anomalies through screen scanning; based on the interface anomaly identification results, invoke anomaly handling strategies and optimize the running workflow; and verify the optimized running workflow to construct a workflow based on multimodal demonstration learning.
[0038] Specifically, the demonstration learning module automatically transforms a user's single operation demonstration into semantic intent steps and identifies the target interface controls actually operated by the user. The workflow construction module generates stable location identifiers for the identified target interface controls and integrates the intent steps with the location identifiers to encapsulate a runtime workflow with semantic description and addressing capabilities. The perception optimization module drives the closed-loop execution of the runtime workflow in a dynamically changing digital environment. It detects interface anomalies in real time through screen scanning, calls anomaly handling strategies for adaptive adjustments, verifies the execution results, and precipitates the execution experience into reusable anomaly handling strategies, achieving continuous optimization of workflow construction capabilities. Additionally, a security control module may be included to provide compliance assurance for the entire intelligent workflow construction process, including sandbox isolation, sensitive operation confirmation, and operation auditing.
[0039] During the training phase, the demonstration learning module synchronously collects multimodal signals during the user demonstration (i.e., recording), transforms them into readable intent steps through semantic analysis, and identifies the target interface controls; then, the workflow construction module generates stable location identifiers for the target interface controls, and integrates the intent steps and location identifiers to form an independently runnable workflow.
[0040] During the execution phase, the perception optimization module does not mechanically replay the recorded actions. Instead, it scans the current screen environment before executing each step, identifies interface anomalies in real time, and dynamically optimizes and adaptively adjusts the running workflow by calling the corresponding anomaly handling strategies from the hierarchical strategy library. After execution verification, the closed loop of the workflow is completed.
[0041] The security control module runs through the entire training and execution process; the perception optimization module solidifies the mapping relationship between abnormal scenarios and response strategies into the running workflow during the execution and verification process, thereby achieving continuous optimization of workflow execution capabilities.
[0042] Therefore, the intelligent workflow dynamic construction device based on multimodal demonstration learning fundamentally solves the technical defects of the prior art: the fragility of traditional RPA, which crashes easily because it relies on fixed coordinates; the high barrier to entry of low-code platforms; and the execution uncertainty of agents driven by large language models.
[0043] As shown in Figure 2, the intelligent workflow dynamic construction method based on multimodal demonstration learning specifically includes the following steps:
S1, acquire the multimodal signals corresponding to each time frame in the demonstration data.
[0044] Workflow refers to fixed business processes in office operations, and demonstration data consists of data from users demonstrating workflow operations. Specifically, after a user begins a demonstration operation, multimodal recording is initiated, synchronously capturing signals corresponding to each time frame during the operation to form multimodal signals. Multimodal signals refer to multi-source heterogeneous signals synchronously collected frame-by-frame during the user's workflow demonstration operation. Multimodal signals can include screen visual signals, mouse operation signals, keyboard operation signals, voice narration, system-level events, etc.
[0045] Specifically, for screen visual signals, the system captures screen images frame by frame in the form of pixel-level change streams, recording complete visual information of the interface state before and after the operation; for mouse or keyboard operation signals, it records mouse movement trajectories, click events (including click coordinates, key types, etc.), and keyboard input sequences (including key values, input timing, etc.) in real time; for voice narration, it collects the user's spoken operation intentions through the microphone, such as the user saying "Now I want to save the file" when clicking the "Save" button, and transcribes the speech in real time using the SpeechRecognition library; for system-level events, it captures operating system behaviors such as clipboard content changes, window focus switching, and window title changes.
[0046] Based on the unified monotonic clock provided by the operating system, an absolute timestamp is recorded for the acquisition time of each modal signal: for mouse and keyboard operation signals and system-level events, the timestamp is directly provided by the operating system message queue; for screen visual signals (such as screen images), the timestamp of the frame is the time when the frame capture is completed; for voice narration, the timestamp is the time when the audio sampling block is received.
[0047] A multimodal synchronous acquisition mechanism utilizes screen visual signals to provide objective evidence of the interface state, mouse or keyboard operation signals to record operation trajectories, voice narration to convey the user's subjective intent, and system-level events to supplement the contextual information of cross-application interactions. These various modal signals corroborate each other, providing a complete data foundation for subsequent semantic analysis.
[0048] S2: Perform semantic analysis on the multimodal signals to determine the intent steps and provide an initial workflow. Specifically, this includes the following steps: synchronize the timestamps of the multimodal signals; perform format conversion on each modal signal, determine each initial event, and provide an initial event list; call the filtering function to filter each initial event and form a filtered event list; based on the correlation of the operation sequence, call the aggregation function to merge the filtered event list, determine the work events, and form the event list; infer the intent of each work event in the event list, determine the intent steps corresponding to the event list, and provide an initial workflow.
[0049] Understandably, the multimodal signals acquired from the demonstration data are raw, heterogeneous signals arranged in a time series from different acquisition channels. These raw, heterogeneous signals differ significantly in data format, timing precision, and semantic hierarchy, and cannot be directly used for workflow construction.
[0050] Through a five-stage processing model of "synchronization-conversion-filtering-aggregation-inference", the original heterogeneous signals of various modalities are gradually refined and integrated into a high-level semantic, real-time initial workflow.
[0051] Since screen images, keyboard or mouse operation signals, voice narration, and system-level events are captured by different acquisition modules, and the local clocks, sampling frequencies, and reporting delays of each acquisition module differ, it is necessary to first synchronize the timestamps of the multimodal signals. During the timestamp synchronization process, the original heterogeneous signals of each modality are aligned to a unified timeline according to the order of the timestamps. This synchronization mechanism ensures that subsequent analysis can accurately reconstruct the complete context of the user's "what they see (screen) - what they do (keyboard and mouse) - what they say (voice) - what they touch (system-level event)" at a given moment, avoiding errors in understanding intent due to misaligned timing. For example, if a user clicks the "Save" button while simultaneously saying "Save file now," only after precise timestamp alignment can the voice content be associated with the click event, enhancing the accuracy of intent description and facilitating subsequent intent inference based on work time.
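As a non-limiting illustration, the alignment of per-modality streams onto one timeline can be sketched as follows. The `Signal` record, the `synchronize` helper, and the sample payloads are hypothetical names and values introduced only for this sketch; it assumes each acquisition module already delivers its signals in time order, stamped from the shared monotonic clock.

```python
import heapq
from typing import Any, Iterable, NamedTuple


class Signal(NamedTuple):
    timestamp: float  # seconds on the shared monotonic clock
    modality: str     # "screen", "mouse", "keyboard", "voice" or "system"
    payload: Any


def synchronize(*streams: Iterable[Signal]) -> list:
    """Merge per-modality streams, each already time-ordered, into one timeline."""
    return list(heapq.merge(*streams, key=lambda s: s.timestamp))


# Hypothetical sample data: a click on "Save" accompanied by narration.
screen = [Signal(9.95, "screen", "frame_17"), Signal(10.05, "screen", "frame_18")]
mouse = [Signal(10.00, "mouse", {"event": "click", "x": 450, "y": 300})]
voice = [Signal(9.80, "voice", "Save file now")]

timeline = synchronize(screen, mouse, voice)
for s in timeline:
    print(f"{s.timestamp:6.2f}  {s.modality:7s} {s.payload}")
```

Once the signals share one ordered timeline, the narration and the click that occur close together sit next to each other and can be associated in later stages.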
[0052] After timestamp synchronization is complete, a format conversion is performed to transform the multimodal signals into a unified format, thereby determining each initial event and generating a list of initial events. Initial events refer to atomic-level operation records formed after the multimodal format conversion. An atomic-level operation record represents the minimum single interactive action of the corresponding user.
[0053] It is understandable that screen visual signals are image frame sequences, keyboard and mouse operation signals are structured messages (including coordinates, key codes, etc.), voice narration is audio waveform data, and system-level events are operating system-level notifications. Through standardized encapsulation, the above multimodal data is uniformly mapped to initial events with the same field structure. Each initial event can include: event type identifier (such as mouse movement, mouse click, keyboard press, voice segment, clipboard change, etc.), event parameters (including operation coordinates, key codes, audio sampling identifiers, clipboard text, etc.), corresponding screenshot index, and a unified timestamp.
[0054] Taking a typical form filling operation as an example, the initial event list generated after format conversion might contain the following items: mouse movement event (x=450, y=300), mouse click event (x=450, y=300, button=left), continuous keyboard press event (key=h, e, l, l, o), enter key press event (key=enter), etc. At this point, the initial event list completely records all the user's atomic-level operations.
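For illustration only, the uniform field structure of an initial event and the form-filling list above might be represented as in the sketch below. The class name `InitialEvent`, its field names, and the timestamps and screenshot indices are assumptions made for this example and are not prescribed by the method.

```python
from dataclasses import dataclass, field


@dataclass
class InitialEvent:
    event_type: str                              # e.g. "mouse_move", "mouse_click", "key_press"
    params: dict = field(default_factory=dict)   # coordinates, key codes, clipboard text, ...
    screenshot_index: int = 0                    # index of the matching screen frame
    timestamp: float = 0.0                       # unified timestamp in seconds


def form_filling_example() -> list:
    """Build the atomic events of the form-filling example (timings are made up)."""
    events = [
        InitialEvent("mouse_move", {"x": 450, "y": 300}, 0, 0.00),
        InitialEvent("mouse_click", {"x": 450, "y": 300, "button": "left"}, 1, 0.20),
    ]
    for i, key in enumerate("hello"):
        events.append(InitialEvent("key_press", {"key": key}, 2, 0.50 + 0.1 * i))
    events.append(InitialEvent("key_press", {"key": "enter"}, 3, 1.20))
    return events


initial_events = form_filling_example()
for ev in initial_events:
    print(ev.event_type, ev.params)
```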
[0055] However, the initial event list contains a lot of redundancy and noise and does not yet have a semantic structure that can be directly understood. Then, a preset filtering function is called to filter the noise of each initial event in the initial event list, forming a filtered event list.
[0056] Here, the filtering function refers to the mathematical formula used to identify and remove noisy events from the initial event list. The filtering function makes filtering decisions based on event type characteristics and spatiotemporal proximity analysis. The filtering function can be pre-constructed based on historical data; its specific form is not further limited here. For example, filtering functions include mouse movement filtering sub-functions based on speed thresholds, hover filtering sub-functions based on dwell time, and keyboard bounce filtering sub-functions based on key press frequency.
[0057] In this example, the function "_filter_noise_events()" is called to filter the initial events, which includes removing mouse movement events that are not followed by a click action.
[0058] Specifically, this involves filtering mouse movement events. In user demonstrations, the mouse's movement trajectory on the screen typically includes a large number of movement processes without any click significance (such as moving from one side of the screen to a target button). These pure movement events only reflect the cursor path and do not correspond to any substantial interface interaction. Retaining all of them would severely interfere with subsequent event aggregation and intent inference. Therefore, the filtering function removes mouse movement events without subsequent click actions, retaining only movement events directly associated with substantial operations such as clicking, dragging, and scrolling.
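A minimal sketch of such a mouse-movement filter is given below. It is not the actual "_filter_noise_events()" implementation, whose form the text leaves open; the function name `filter_noise_events`, the dictionary event layout, and the simplifying rule that a movement is kept only when the very next event is a click, drag, or scroll are all assumptions of this example.

```python
# Event types treated as substantial interactions (assumed naming).
INTERACTIVE = {"mouse_click", "mouse_drag", "mouse_scroll"}


def filter_noise_events(events):
    """Drop mouse movements that are not immediately followed by an interaction."""
    kept = []
    for i, ev in enumerate(events):
        if ev["type"] == "mouse_move":
            nxt = events[i + 1] if i + 1 < len(events) else None
            if nxt is None or nxt["type"] not in INTERACTIVE:
                continue  # pure cursor travel, no interface interaction
        kept.append(ev)
    return kept


raw = [
    {"type": "mouse_move", "x": 100, "y": 100},
    {"type": "mouse_move", "x": 300, "y": 200},
    {"type": "mouse_move", "x": 450, "y": 300},
    {"type": "mouse_click", "x": 450, "y": 300, "button": "left"},
    {"type": "key_press", "key": "h"},
    {"type": "mouse_move", "x": 10, "y": 10},
]
filtered = filter_noise_events(raw)
print([ev["type"] for ev in filtered])
```

A speed-threshold or dwell-time sub-function, as mentioned above, could be added as further conditions in the same loop.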
[0059] After the above filtering process, noisy events in the initial event list are effectively removed, and only valid events with clear interactive significance are retained, thus providing a clean data foundation for subsequent aggregation processing.
[0060] Based on the filtered event list and the correlation of the operation sequence, the aggregation function is called to merge the events in the filtered event list into work events with complete semantics, thus forming an event list.
[0061] In this context, an aggregation function is a mathematical expression that merges multiple temporally related atomic initial events into a single semantic work event. Aggregation functions identify consecutive events with strong temporal correlation and merge them into higher-level semantic units, which are defined as work events. Aggregation functions can be pre-constructed based on historical data; their specific form is not further limited here. Aggregation functions can include continuous backspace deletion aggregation sub-functions, modifier key combination aggregation sub-functions, mouse drag aggregation sub-functions, etc. For example, the "_merge_consecutive_keys()" function can be used.
[0062] Taking the merging of consecutive keyboard inputs as a typical aggregation scenario as an example: when multiple consecutive, printable character key presses are detected (e.g., pressing h, e, l, l, o in sequence), the aggregation function merges them into a single text input event, whose data content is the complete input string "hello". The merged event not only retains the information of the original key sequence, but more importantly, it is presented in the semantic form of "input text 'hello'", which greatly reduces the processing complexity of subsequent intent inference.
[0063] In addition to merging keyboard input, aggregation functions also handle other related operation sequences. For example, consecutive backspace key presses can be aggregated into a "delete text" event; Ctrl+C can be aggregated into a "copy" event; and Ctrl+V can be aggregated into a "paste" event. During aggregation, the system uses temporal continuity (the interval between adjacent events is less than a preset threshold) and logical relevance (the event types belong to the same semantic cluster of operations) as merging criteria.
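The merging criteria can be illustrated with the following sketch, which is a hypothetical stand-in for "_merge_consecutive_keys()" rather than its actual form. The event layout, the helper `to_work_event`, the one-second `MAX_GAP`, and the `SHORTCUTS` table are assumptions chosen for the example.

```python
MAX_GAP = 1.0  # assumed temporal-continuity threshold, in seconds
SHORTCUTS = {(("ctrl",), "c"): "copy", (("ctrl",), "v"): "paste"}
MERGEABLE = {"text_input", "delete_text"}  # types that may absorb a neighbour


def to_work_event(ev):
    """Map one atomic event to a single-element work event."""
    t = ev["t"]
    if ev["type"] != "key_press":
        return {**ev, "t_end": t}
    key, mods = ev["key"], tuple(ev.get("mods", ()))
    if (mods, key) in SHORTCUTS:
        return {"type": SHORTCUTS[(mods, key)], "t": t, "t_end": t}
    if key == "backspace":
        return {"type": "delete_text", "count": 1, "t": t, "t_end": t}
    if not mods and len(key) == 1 and key.isprintable():
        return {"type": "text_input", "text": key, "keys": [key], "t": t, "t_end": t}
    return {"type": "key_press", "key": key, "t": t, "t_end": t}


def merge_consecutive_keys(events, max_gap=MAX_GAP):
    """Merge adjacent events of the same mergeable type that are close in time."""
    work = []
    for ev in events:
        cur = to_work_event(ev)
        prev = work[-1] if work else None
        if (prev is not None and cur["type"] == prev["type"]
                and cur["type"] in MERGEABLE
                and cur["t"] - prev["t_end"] <= max_gap):
            if cur["type"] == "text_input":
                prev["text"] += cur["text"]
                prev["keys"] += cur["keys"]  # original key sequence is retained
            else:
                prev["count"] += 1
            prev["t_end"] = cur["t_end"]
        else:
            work.append(cur)
    return work


events = [{"type": "key_press", "key": k, "t": 0.1 * i} for i, k in enumerate("hello")]
events += [
    {"type": "key_press", "key": "enter", "t": 0.6},
    {"type": "key_press", "key": "backspace", "t": 0.8},
    {"type": "key_press", "key": "backspace", "t": 0.9},
    {"type": "key_press", "key": "c", "mods": ["ctrl"], "t": 1.2},
]
work_events = merge_consecutive_keys(events)
print([w["type"] for w in work_events])
```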
[0064] After aggregation, a large number of events in the filtered event list are merged into fewer, more semantically clear work events, forming the event list.
[0065] Here, a work event is the smallest unit representing a user's intent. The event list provides an appropriate processing granularity for subsequent intent inference.
[0066] Next, intent inference is performed on each work event in the event list, generating a corresponding high-level semantic description for each work event. This determines the intent steps for the corresponding event list, and finally, the initial workflow is generated by combining them in sequence. Intent inference refers to transforming work events into readable high-level semantic descriptions that describe the purpose of the operation.
[0067] Furthermore, performing intent inference on each work event in the event list to determine the intent steps corresponding to the event list and provide an initial workflow specifically includes the following steps: based on the multimodal intent inference function, perform intent inference for each work event to generate an intent description; combine the event list and the intent descriptions to determine the intent step for each work event in the event list and provide an initial workflow.
[0068] Understandably, although each work event in the event list already has certain semantic aggregation features (such as consecutive character keystrokes merged into a text input event), it is still expressed as machine-recognizable operation parameters (such as coordinate values and key values) and has not yet been transformed into a readable, high-level semantic description of the operation's purpose.
[0069] The multimodal intent inference function is a mathematical relation that maps the multimodal signals associated with a work event to a natural-language intent description of that work event's operational purpose.
[0070] The multimodal intent inference function is a semantic mapping function. It can be pre-constructed based on historical data, and its specific form is not further limited here. Multimodal intent inference functions can include rule-based template mapping sub-functions, control semantic parsing sub-functions based on optical character recognition (OCR), and speech-first intent extraction sub-functions. For example, in this example, the "_infer_intent_description()" function is used to perform differentiated intent inference for different types of work events.
[0071] Specifically, for mouse click events, the multimodal intent inference function first loads a screenshot of the screen at the time of the event and analyzes the spatial location and functional area affiliation of the click coordinates within the interface. The click coordinates are then compared with preset interface area division rules to determine whether the coordinates are located in typical functional areas such as the top menu bar, bottom status bar, sidebar, or main content area. Simultaneously, based on the horizontal and vertical ratios relative to the screen resolution, location descriptors (such as "left," "center," "right," "top," "middle," and "bottom") are generated.
[0072] Building upon this, the system further expands the interface area image outwards from the click coordinates by a preset pixel range. Visual features within this area are analyzed to identify the type of the target control (such as a button, input box, or link) and its associated text labels (such as "Save," "Login," or "Submit"). Combining the spatial location description and control semantic information, intent descriptions such as "Click on the top left of the screen," "Click the 'filename' input box," and "Click the 'Save' button" are generated.
[0073] For text input event types, since their operation purpose has been clearly defined through the aggregation process as entering text content into a specific interface element, the multimodal intent inference function directly extracts the data content of the event (i.e., the input text content) and generates an intent description in the form of "input text [specific content]". For example, for a text input event with data content "hello", the generated intent description is "input text [hello]".
[0074] For keyboard key press events (including single key presses and key combinations), the multimodal intent inference function calls a pre-defined key name mapping table to convert key value codes (such as "enter", "tab", "Ctrl+C", etc.) into readable key name descriptions (such as "Enter key", "Tab key", "copy", etc.), thereby generating intent descriptions such as "press Enter key" and "execute copy operation". The key name mapping table covers normalized mappings of commonly used function keys, modifier key combinations, and multilingual keyboard layouts, ensuring consistent semantic expression for key presses across different keyboard environments.
[0075] It should also be noted that if narration is captured during recording (demonstration), and the timestamp of the narration is within the same time window as the timestamp of the current work event, the multimodal intent inference function will prioritize using the transcribed content of the narration as the primary source or enhancement content for the intent description. For example, if a user clicks the "Save" button while saying "Now I want to save the file," the transcribed text obtained through speech recognition can be directly used as the intent description for the corresponding click work event, or it can be fused with the description generated by visual analysis to form a more accurate and richer expression of intent. By utilizing the user's subjective intent information carried in the narration, the semantic ambiguity that may occur when inferring intent solely from screen visuals and keyboard and mouse parameters is effectively mitigated.
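By way of a hedged example, the differentiated inference described above (template mapping for text input, a key name mapping table for key presses, and narration priority within a time window) might look like the sketch below. It is not the actual "_infer_intent_description()" function; the event layout, the `KEY_NAMES` table, the one-second window, and the choice to let narration replace rather than fuse with the generated description are assumptions of this sketch.

```python
# Assumed key name mapping table (a small subset for illustration).
KEY_NAMES = {
    "enter": "press Enter key",
    "tab": "press Tab key",
    "ctrl+c": "execute copy operation",
    "ctrl+v": "execute paste operation",
}


def infer_intent_description(event, narrations=(), window=1.0):
    """Return a readable intent description for one work event."""
    # Narration takes priority when it falls in the same time window.
    for spoken_at, transcript in narrations:
        if abs(spoken_at - event["t"]) <= window:
            return transcript
    kind = event["type"]
    if kind == "text_input":
        return f"input text [{event['text']}]"
    if kind == "key_press":
        return KEY_NAMES.get(event["key"], f"press {event['key']} key")
    if kind == "mouse_click":
        if event.get("label"):
            return f"click the '{event['label']}' {event.get('control', 'control')}"
        return f"click on the {event.get('position', 'screen')}"
    return kind


click = {"type": "mouse_click", "t": 10.0, "label": "Save", "control": "button"}
print(infer_intent_description(click))
print(infer_intent_description(click, [(9.8, "Now I want to save the file")]))
print(infer_intent_description({"type": "text_input", "t": 12.0, "text": "hello"}))
```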
[0076] After the above processing, each work event in the event list is assigned a high-level, readable natural language intent description, achieving a precise mapping from machine operation parameters to intent semantics. Then, combining the original temporal order of the event list with the intent descriptions corresponding to each work event, the intent steps for each work event in the event list are determined, and then combined sequentially to provide the initial workflow.
[0077] An intent step refers to an independent operation semantic in a workflow, which can include the following information elements: step number (arranged in chronological order according to the event list), action type (such as click, input, key press, etc., inherited from the type identifier of the work event), original operation parameters (such as click coordinates, input text, key value code, etc., reserved for subsequent execution location), and intent description.
[0078] Arrange the intent steps according to their step numbers to form a complete initial workflow. The initial workflow, in a structured format of "[Step Number, Intent Description, Action Type, Original Parameters]", fully reproduces the semantic logic of the user's operation demonstration. For example, a typical "click the input box → enter username → press Enter" operation demonstration, after the above processing, can form an initial workflow containing the following intent steps: Step 1: The action type is click, the intent description is "click the 'username' input box", and the original operation parameter is the click coordinates; Step 2: The action type is input, the intent description is "input text [username]", and the original operation parameter is the input content; Step 3: The action type is key press, the intent description is "press Enter key", and the original operation parameter is key value code.
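The structured format "[Step Number, Intent Description, Action Type, Original Parameters]" can be illustrated with the following sketch. The `IntentStep` class, the `build_initial_workflow` helper, and the sample coordinates and timestamps are hypothetical and serve only to show how the steps of the example above could be assembled in chronological order.

```python
from dataclasses import dataclass


@dataclass
class IntentStep:
    number: int   # chronological position in the event list
    intent: str   # readable intent description
    action: str   # action type inherited from the work event
    params: dict  # original operation parameters kept for execution


def build_initial_workflow(work_events, descriptions):
    """Pair each work event with its description and number the steps by time."""
    ordered = sorted(zip(work_events, descriptions), key=lambda pair: pair[0]["t"])
    steps = []
    for number, (event, intent) in enumerate(ordered, start=1):
        params = {k: v for k, v in event.items() if k not in ("type", "t")}
        steps.append(IntentStep(number, intent, event["type"], params))
    return steps


work_events = [
    {"type": "click", "t": 0.2, "x": 450, "y": 300},
    {"type": "input", "t": 0.9, "text": "username"},
    {"type": "key_press", "t": 1.2, "key": "enter"},
]
descriptions = [
    "click the 'username' input box",
    "input text [username]",
    "press Enter key",
]
workflow = build_initial_workflow(work_events, descriptions)
for step in workflow:
    print([step.number, step.intent, step.action, step.params])
```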
[0079] Thus, the initial workflow fully reproduces the user's operation demonstration process in a high-level semantic manner, allowing users to understand the execution logic without reading the code. This facilitates verification that the automatically constructed workflow conforms to the user's operational intent and provides a structured semantic foundation for subsequent screen scanning, exception handling, and other tasks. The action type and original operation parameters in each intent step provide the necessary target information for generating semantic fingerprints and execution localization; the intent description provides a readable semantic reference for execution verification and audit logs.
[0080] After semantic analysis, an initial workflow consisting of several intent steps arranged in sequence is provided. The initial workflow has basic execution logic, that is, it has a clear step sequence and semantic description, but it has not yet been bound to a stable interface element positioning mechanism. Therefore, it may still fail when the interface changes, and can be further processed by subsequent steps.
[0081] S3 identifies target interface controls based on the analysis of each intent step.
[0082] As can be understood, target UI controls refer to the UI element entities that users actually interact with during the workflow demonstration process. Target UI controls can include buttons, input boxes, links, checkboxes, etc., and are the specific objects of action of the intended steps at the UI level.
[0083] Specifically, based on the screen area corresponding to each intent step in the initial workflow, the original operation parameters are extracted, the analysis area is quickly located, and the geometric attributes (aspect ratio, height), visual features (color, texture) and text content of the analysis area image are analyzed to infer the control type and function. By utilizing the temporal relationship of the intent steps, the input operation is associated with the target control of the preceding click operation to ensure the continuity of the operation chain, and finally the recognition of the target interface control is completed.
[0084] For example, for an intent step involving a click, a preset pixel range (e.g., ±60 pixels horizontally, ±20 pixels vertically) is expanded outwards from the recorded coordinates to form a region to be identified. Image segmentation and feature extraction are then performed on the interface elements within this region to identify the target interface control actually operated by the user. The target interface control can be identified using the following information: the control's type (e.g., button, input box, link), the control's precise position on the screen (boundary box coordinates), and the text label associated with the control (e.g., surface text obtained through optical character recognition or system accessibility interfaces).
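As a small illustration of forming the region to be identified, the sketch below expands the example range of ±60 pixels horizontally and ±20 pixels vertically around the recorded click coordinates. The function name `expand_region` and the clamping of the region to the screen bounds are assumptions added for this example.

```python
def expand_region(x, y, screen_w, screen_h, dx=60, dy=20):
    """Return (left, top, right, bottom) around a click, clamped to the screen."""
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(screen_w, x + dx)
    bottom = min(screen_h, y + dy)
    return left, top, right, bottom


print(expand_region(450, 300, 1920, 1080))  # click well inside the screen
print(expand_region(10, 5, 1920, 1080))     # click near the top-left corner
```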
[0085] For intent steps involving input, the system locates the currently focused input control by combining the controls identified in the preceding click step (i.e., the input box or text field that received focus) or by using a screenshot of the screen at the time the input step occurs. The recognition results include the input box's type identifier, boundary position, and associated label or placeholder text.
[0086] For key-press intent steps (such as pressing the Enter key or a key combination), it is usually not necessary to independently identify new target interface controls. Instead, the controls identified in the previous steps are used as the operation targets, or the control where the focus is located is inferred based on the key function (such as the Tab key switching focus).
[0087] By mapping abstract semantic operations to concrete control objects, the workflow is upgraded from a purely semantic description to a semi-execution scheme with a clear addressing target, and provides a clear object for the subsequent generation of location identifiers.
[0088] S4: Based on the control description information of the target interface control, a location identifier is formed. By extracting the control description information of the target interface control and converting it into the corresponding location identifier, stable addressing capability is provided for the execution phase.
[0089] Depending on the application environment, two positioning modes are designed: the visual fingerprint (semantic fingerprint) mode and the DOM selector mode. Correspondingly, two different positioning identifiers are provided: the multi-positioning identifier and the DOM path identifier. The positioning identifier refers to the addressing identifier generated by converting control description information. Control description information refers to information that provides multi-dimensional feature descriptions of the target interface controls, encompassing semantic fingerprints based on screen visual analysis and structural description information based on the document object model.
[0090] When the application environment is a browser webpage and the system can access the Document Object Model (DOM), the DOM selector mode is used. In this case, the control description information can be DOM structure information, which is a structured path description formed by parsing the logical path, tag attributes, and hierarchical relationship of the target UI control in the DOM tree. Forming a positioning identifier based on the control description information of the target UI control specifically includes: obtaining the DOM structure information of the target UI control; extracting DOM structure features from the DOM structure information; and encapsulating the DOM structure features into a DOM path identifier.
[0091] Specifically, the system first obtains the structured information of the target control in the DOM tree through browser accessibility interfaces, injected scripts, or browser extension mechanisms. Then, it extracts the following DOM structure features: control tag name (e.g., button, input, div), CSS selector path (constructed based on attributes such as id, class, and name), XPath path (constructed based on the DOM tree hierarchy), control text content, attribute dictionary (id, class, name, etc.), position coordinates in the DOM tree, and parent element chain (the sequence of tags tracing upwards from the current node to the root node). Finally, the DOM structure features are encapsulated into a DOM path identifier, containing at least the CSS selector path and / or XPath path, serving as the basis for accurately hitting the target DOM node in the browser rendering engine during the execution phase.
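A minimal sketch of the encapsulation step is shown below. It assumes the tag name, attribute dictionary, and root-to-target node chain have already been obtained from the browser; the helper names and the dictionary layout of the identifier are hypothetical, and the CSS construction deliberately ignores escaping of special characters.

```python
def css_selector(tag, attrs):
    """Build a simple CSS selector from id, class and name attributes."""
    if attrs.get("id"):
        return f"{tag}#{attrs['id']}"
    selector = tag
    for cls in attrs.get("class", "").split():
        selector += f".{cls}"
    name = attrs.get("name")
    if name:
        selector += f'[name="{name}"]'
    return selector


def xpath(chain):
    """chain: (tag, 1-based index among same-tag siblings) from root to target."""
    return "".join(f"/{tag}[{index}]" for tag, index in chain)


def dom_path_identifier(tag, attrs, chain, text=""):
    """Encapsulate the DOM structure features into one identifier record."""
    return {
        "tag": tag,
        "css": css_selector(tag, attrs),
        "xpath": xpath(chain),
        "text": text,
        "attrs": dict(attrs),
        # Parent element chain, traced upwards from the target to the root.
        "parents": [t for t, _ in chain[:-1]][::-1],
    }


chain = [("html", 1), ("body", 1), ("form", 1), ("input", 2)]
identifier = dom_path_identifier(
    "input", {"class": "form-control wide", "name": "username"}, chain
)
print(identifier["css"])
print(identifier["xpath"])
```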
[0092] The DOM path identifier describes the logical position of a control in the document tree. Even if the screen coordinates of an element change due to page scrolling or window scaling, the identifier can still be accurately located as long as the DOM structure remains stable.
[0093] When the application environment is a desktop application, a program without a DOM environment, or when DOM information is unavailable in the browser, the semantic fingerprint (visual fingerprint) mode is adopted. In this case, the control description information is the semantic fingerprint, and the positioning identifier is the multi-positioning identifier. The semantic fingerprint is a multi-dimensional feature extracted from multimodal signals, including visual hash, average color of the area, control type, structural position, descriptive text, etc. Each dimension describes the identifiable attributes of the target interface control from a different perspective.
[0094] Furthermore, forming a positioning identifier based on the control description information of the target interface control specifically includes the following steps: obtain the target modal signals corresponding to the target interface control; perform feature transformation on each target modal signal to form the corresponding target semantic features; based on the weight allocation sub-rule, determine the weight coefficient corresponding to each target semantic feature; based on the matching threshold sub-rule, set the corresponding matching threshold for each target semantic feature; structurally encapsulate the target semantic features, together with their corresponding weight coefficients and matching thresholds, to form a multi-positioning identifier.
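The structural encapsulation can be pictured with the sketch below. The weight and threshold values are placeholders, since the weight allocation and matching threshold sub-rules are not specified here; the function name and the renormalization of weights over the features that are actually present are likewise assumptions of this example.

```python
import json

# Placeholder values; the text leaves the weight and threshold sub-rules open.
WEIGHTS = {"visual_hash": 0.30, "avg_color": 0.10, "control_type": 0.15,
           "position": 0.15, "text": 0.30}
THRESHOLDS = {"visual_hash": 0.85, "avg_color": 0.80, "control_type": 1.00,
              "position": 0.70, "text": 0.80}


def build_multi_positioning_identifier(features):
    """Attach a weight and a matching threshold to every available feature."""
    present = {k: v for k, v in features.items() if v is not None}
    total = sum(WEIGHTS[k] for k in present)
    return {
        name: {
            "value": value,
            "weight": WEIGHTS[name] / total,  # renormalized over present features
            "threshold": THRESHOLDS[name],
        }
        for name, value in present.items()
    }


identifier = build_multi_positioning_identifier({
    "visual_hash": "9f3c",
    "avg_color": (52, 120, 246),
    "control_type": "button",
    "position": ("top", "left"),
    "text": "Save",
})
print(json.dumps(identifier, indent=2))
```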
[0095] For each identified target interface control, the multimodal signals in the corresponding sample data are acquired and identified as the respective target modal signals. Specifically, a screenshot of the screen at the moment the user interacts with the interface control is used as a visual sample. The screenshot fully records the appearance, color distribution, boundary shape, and surrounding context of the interface control. The operation parameters of the user's interaction with the interface control (including mouse click coordinates, click button type, etc.) are used as the central reference for determining the precise position of the interface control on the screen and the center of the captured analysis area. System-level attribute information of the interface control is obtained from the accessibility interface (API) provided by the operating system or application, including structured information such as control type declaration, associated text labels, placeholder hints, and parent container context. The acquisition of the above multimodal signals uses the moment the interface control is operated as the time anchor point to ensure that each target modal signal strictly corresponds to the same interface state in time sequence, avoiding signal misalignment due to dynamic changes in the interface.
[0096] Next, feature transformation is performed on each acquired target modal signal to extract discriminative and stable high-level features, forming the corresponding target semantic features. The feature transformation employs differentiated transformation strategies for different target modal signals. Performing feature transformation on each target modal signal to form the corresponding target semantic features specifically includes:
Based on the initial feature transformation strategy, transform the first target modal signal to form initial descriptive text features and initial target semantic features of multiple dimensions. The initial feature transformation strategy refers to semantic structuring, mapping reconstruction, and feature purification of the first target modal signal through multiple transformation methods, giving a unified set of semantic features of multiple dimensions. The initial feature transformation strategy includes visual feature transformation, control type transformation, structural position transformation, and descriptive text transformation. The first target modal signal includes a local region image, which is a region image formed by expanding a preset pixel range around the click coordinates of the target interface control;
Perform semantic relevance determination on the second target modal signal, give the association level with the target interface control, and determine each target enhancement signal. The second target modal signal includes the target speech signal, and the timestamp of the target speech signal is aligned with the operation time of the target interface control;
Based on the association level, fuse each target enhancement signal with the initial descriptive text features to form the corresponding enhanced descriptive text features;
Combine the initial target semantic features of multiple dimensions with the enhanced descriptive text features to give the target semantic features.
[0097] Furthermore, performing feature transformation on the first target modal signal to form initial descriptive text features and initial target semantic features of multiple dimensions specifically includes:
Scale the local region image to a preset size, convert it to grayscale, extract its low-frequency coefficients using a two-dimensional discrete cosine transform, and binarize them to generate a visual hash value; calculate the average RGB channel values of the local region image to obtain the regional average color features; and fuse the visual hash value and the regional average color features to give the target visual semantic features; and / or,
Calculate the aspect ratio and height of the local region image, compare them with preset control type discrimination rules, and determine the semantic features of the target control type; and / or,
Calculate the proportional coordinates of the center point of the target interface control relative to the screen resolution, and map them to relative position descriptors based on a preset partition threshold to form the semantic features of the target structural position; and / or,
Extract the text content, associated label text, and / or placeholder tooltip text displayed on the surface of the target interface control, provide character recognition results, and form the initial descriptive text features.
[0098] Specifically, image features are extracted from the local area image containing the target interface control in the screen visual signal to form target visual semantic features, including visual hash values and regional average color features. In practice, a preset pixel range is expanded outward from the click coordinates in the keyboard and mouse operation signals to capture the local area image containing the target interface control, which is then analyzed as the first target modal signal. Perceptual hashing (pHash) is performed on the local area image: the image is scaled to a preset size (e.g., 8×8 pixels), converted to grayscale, and subjected to a two-dimensional discrete cosine transform (DCT); the transform coefficients in the low-frequency region are extracted, their median is calculated, and the coefficients are binarized against that median, ultimately encoding a fixed-length hexadecimal hash string. The visual hash value is robust to slight scaling, brightness changes, and compression distortion, maintaining effective similarity matching even with minor visual changes to the interface. Simultaneously, the channel average values of the local area image in the RGB color space are calculated to obtain regional average color features, which help distinguish controls with significant color differences.
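The hashing steps can be sketched in pure Python as follows. The sketch assumes the local area image has already been scaled to 8×8 pixels (scaling is omitted because it needs an image library), and it assumes a 4×4 low-frequency block, which yields a 16-bit hash written as 4 hexadecimal characters; the function names, the grayscale weights, and the synthetic test patch are choices made for this example only.

```python
import math
from statistics import median


def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to luminance (ITU-R BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in pixels]


def dct_2d(m):
    """Orthonormal 2-D DCT-II of a square matrix given as a list of rows."""
    n = len(m)
    scale = [math.sqrt(1 / n)] + [math.sqrt(2 / n)] * (n - 1)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[scale[u] * scale[v] * sum(m[x][y] * cos[u][x] * cos[v][y]
                                       for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]


def visual_hash(gray, low=4):
    """Binarize the low-frequency DCT block against its median; return hex."""
    coeffs = dct_2d(gray)
    block = [coeffs[u][v] for u in range(low) for v in range(low)]
    mid = median(block)
    bits = "".join("1" if c > mid else "0" for c in block)
    width = low * low // 4  # four bits per hexadecimal character
    return format(int(bits, 2), f"0{width}x")


def hamming(a, b):
    """Number of differing bits between two hexadecimal hash strings."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def average_color(pixels):
    """Mean of each RGB channel over the whole region."""
    flat = [p for row in pixels for p in row]
    return tuple(sum(p[c] for p in flat) / len(flat) for c in range(3))


# Synthetic 8x8 RGB patch standing in for a scaled control image.
patch = [[((x * x * 7 + y * 13) % 256, (x * y * 13 + 17) % 256, (y * y * y * 3) % 256)
          for y in range(8)] for x in range(8)]
gray = to_gray(patch)
brighter = [[v + 20 for v in row] for row in gray]  # uniform brightness shift

print(visual_hash(gray), visual_hash(brighter), average_color(patch))
```

A uniform brightness shift changes only the lowest-frequency (DC) coefficient, which is why the two hashes in this example are expected to agree; during matching, a small Hamming distance between hashes would be treated as visual similarity.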
[0099] The geometric and visual attributes of a local area image are analyzed to infer the type identifier of the control, forming a semantic feature of the target control type. Specifically, the aspect ratio and height of the local area image are calculated and compared with preset control type discrimination rules. For example, when the aspect ratio is between 2.0 and 8.0 and the height is between 20 and 60 pixels, combined with the text content features detected within the local area image, the type of interface control can be inferred to be a button; when the aspect ratio is greater than a preset threshold and the shape is elongated, it is inferred to be an input box. In addition, if the system-level attribute information provides a clear control type declaration, this declaration is preferentially used as the semantic feature of the target control type.
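The discrimination rules in this example can be written down as the short sketch below. The button bounds come from the example above; the input-box threshold of 8.0, the function name, and the "unknown" fallback are assumptions, since the preset threshold is not given.

```python
def infer_control_type(width, height, has_text, declared=None, input_min_ratio=8.0):
    """Infer a control type from geometry unless the system already declares one."""
    if declared:
        return declared  # a system-level type declaration takes priority
    if height <= 0:
        return "unknown"
    ratio = width / height
    if 2.0 <= ratio <= 8.0 and 20 <= height <= 60 and has_text:
        return "button"
    if ratio > input_min_ratio:  # elongated shape
        return "input"
    return "unknown"


print(infer_control_type(120, 40, has_text=True))    # aspect ratio 3.0
print(infer_control_type(400, 30, has_text=False))   # aspect ratio about 13.3
print(infer_control_type(120, 40, True, declared="link"))
```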
[0100] Spatial relationship analysis is performed on the click coordinates in keyboard and mouse operation signals to form semantic features of the target structure's location. Specifically, the proportional coordinates of the target interface control's center point relative to the screen resolution are calculated (e.g., horizontal 450 / 1920≈0.234, vertical 300 / 1080≈0.278), and based on a preset interface partition threshold (e.g., one-third of the screen's width and height as the boundary), the proportional coordinates are mapped to relative position descriptors. Simultaneously, the hierarchical context information of the target control (such as its parent container, adjacent sibling elements, etc.) is recorded. If structural hierarchy data is provided in the system-level attribute information, it is also included in the semantic features of the target structure's location.
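The mapping from click coordinates to a relative position descriptor can be sketched as below, using the one-third partition from the example. The descriptor vocabulary ("top-left", "middle-center", and so on) and the function name are illustrative assumptions.

```python
def relative_position(cx, cy, screen_w, screen_h, boundary=1 / 3):
    """Return proportional coordinates and a coarse position descriptor."""
    rx, ry = cx / screen_w, cy / screen_h

    def band(r, names):
        # Split the axis into three bands at boundary and 1 - boundary.
        if r < boundary:
            return names[0]
        if r < 1 - boundary:
            return names[1]
        return names[2]

    descriptor = (band(ry, ("top", "middle", "bottom")) + "-"
                  + band(rx, ("left", "center", "right")))
    return round(rx, 3), round(ry, 3), descriptor

# The example from the text: a control centered at (450, 300) on a 1920x1080 screen.
print(relative_position(450, 300, 1920, 1080))  # (0.234, 0.278, 'top-left')
```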
[0101] Optical Character Recognition (OCR) is performed on the local area image to extract at least one of the following: text content displayed on the surface of the target interface control, associated label text, and placeholder tooltip text, forming the initial descriptive text features. If the system-level attribute information already provides the control's text label or placeholder attributes, these are directly used and added to the initial descriptive text features. Descriptive text features are the strongest semantic identifiers for distinguishing functionally equivalent controls. For example, for controls of the same button type, the "Save" button and the "Cancel" button can be clearly distinguished through descriptive text features.
[0102] It should also be noted that, in some embodiments, when processing the initial description text features of a target interface control, the timestamp of the moment the target interface control is operated is used as the anchor point. Voice signals within a preset time tolerance window (e.g., 1 second before or after) are retrieved from the voice narration as target voice signals. Semantic relevance determination is then performed on the target voice signals to confirm whether there is a semantic relationship between the target voice signal and the current operation of the target interface control, rather than irrelevant voice narration or environmental noise during the operation interval. Semantic relevance determination refers to determining the degree of semantic binding between the target voice signal and the target interface control. Semantic relevance determination may include: keyword trigger determination based on a preset functional keyword library (if the target voice signal contains at least one functional keyword, a preliminary relevance determination is passed); matching determination based on control text (comparing the target voice signal with the extracted character recognition results or the control surface text for text similarity); and consistency determination based on operation type (based on the action type of the current work event, the semantic tendency of the target voice signal is checked for consistency; if the semantic tendency of the target voice signal is significantly contradictory to the action type, it is determined to be a negative association), etc.
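A minimal sketch of the time-window retrieval and the three relevance checks follows. The keyword library, the table of contradicting words, the similarity cutoff of 0.5, and the way the three checks are combined are all assumptions; the text lists the checks as options without fixing how they interact.

```python
from difflib import SequenceMatcher

# Hypothetical functional keyword library and contradiction table.
FUNCTION_KEYWORDS = {"save", "submit", "cancel", "delete", "search", "confirm"}
CONFLICTING_WORDS = {"click": {"type", "drag"}, "type": {"drag"}}

def segments_in_window(segments, op_timestamp, tolerance=1.0):
    """Keep narration segments within +/- tolerance seconds of the operation."""
    return [s for s in segments if abs(s["ts"] - op_timestamp) <= tolerance]

def is_relevant(transcript, control_text, action_type, min_similarity=0.5):
    """Combine keyword trigger, control-text matching, and action consistency."""
    words = set(transcript.lower().split())
    keyword_hit = bool(words & FUNCTION_KEYWORDS)
    similarity = (SequenceMatcher(None, transcript.lower(),
                                  control_text.lower()).ratio()
                  if control_text else 0.0)
    contradicts = bool(words & CONFLICTING_WORDS.get(action_type, set()))
    return (keyword_hit or similarity >= min_similarity) and not contradicts

narration = [{"ts": 9.2, "text": "now save the report"},
             {"ts": 14.0, "text": "nice weather today"}]
nearby = segments_in_window(narration, op_timestamp=10.0)
```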
[0103] The target speech signal determined through semantic relevance analysis is identified as the target enhancement signal. The relevance level is determined based on the strength of the relevance: if the target enhancement signal is semantically complete and the initial descriptive text features are empty or have extremely low confidence, the target enhancement signal is determined to be at the source level; if the target enhancement signal is semantically consistent with or complementary to the initial descriptive text features, the target enhancement signal is determined to be at the enhancement level; if the target enhancement signal conflicts with the initial descriptive text features but has a higher confidence level, the target enhancement signal is determined to be at the calibration level.
[0104] Next, based on the association level, the target enhancement signal is fused with the initial descriptive text features. If the association level is the source level, the target enhancement signal directly replaces the initial descriptive text features to form the enhanced descriptive text features; if the association level is the enhancement level, the initial descriptive text features and the target enhancement signal are concatenated and fused to make the initial descriptive text features more complete, forming the enhanced descriptive text features; if the association level is the calibration level, the target enhancement signal covers the initial descriptive text features and serves as the enhanced descriptive text features.
[0105] It should be noted that replacing the initial descriptive text features with the target enhancement signal is a hard feature replacement, directly discarding the initial descriptive text features; while covering the initial descriptive text features with the target enhancement signal is a form of inclusion, suppression, or masking. Replacement methods can be achieved through feature channel substitution, branch selection, etc., while coverage methods can be achieved through semantic weights, hierarchical feature suppression, etc. The specific methods of replacement and coverage will not be further limited here.
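The level determination and fusion described in the preceding paragraphs can be sketched as below. This is only one possible reading: the confidence cutoff of 0.3, the externally supplied `conflicts` flag, and the representation of "coverage" as keeping the suppressed text alongside the dominant one are assumptions.

```python
def association_level(initial_text, initial_conf, voice_conf, conflicts,
                      low_conf=0.3):
    """Classify the target enhancement signal as source, enhancement, or calibration."""
    if not initial_text or initial_conf < low_conf:
        return "source"
    if not conflicts:
        return "enhancement"
    if voice_conf > initial_conf:
        return "calibration"
    return None  # conflicting but less confident: the voice signal is not used

def fuse(initial_text, voice_text, level):
    """Fuse the voice-derived text with the initial descriptive text."""
    if level == "source":
        # Hard replacement: the initial text is discarded.
        return {"text": voice_text, "suppressed": None}
    if level == "enhancement":
        # Concatenation: both sources contribute.
        return {"text": (initial_text + " " + voice_text).strip(), "suppressed": None}
    if level == "calibration":
        # Coverage: the voice text dominates, the initial text is kept but masked.
        return {"text": voice_text, "suppressed": initial_text}
    return {"text": initial_text, "suppressed": None}
```

For example, an OCR result of "Sawe" with confidence 0.4 and a conflicting narration "Save" with confidence 0.9 would be classified as calibration, and the fused text would be "Save".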
[0106] After fusion, the enhanced descriptive text features, together with the target visual semantic features, target control type semantic features, and target structural location semantic features, constitute a complete set of target semantic features, which serve as the basic data for forming multi-location identifiers.
[0107] Through a multi-dimensional feature transformation and fusion process that enhances the descriptive text features containing speech signals, multi-dimensional semantic features such as visual, type, location, and text are extracted from the original multimodal signals. By making full use of the direct expression of the user's subjective intention through voice narration, the limitations of single-source recognition in optical character recognition are effectively overcome, and the completeness and accuracy of the descriptive text dimension in semantic fingerprints are significantly improved. This provides a reliable semantic foundation for intelligent addressing based on multi-dimensional feature weighted matching in the execution stage.
[0108] Subsequently, based on the preset weight allocation sub-rules, differentiated weight coefficients are assigned to each target semantic feature to reflect its relative importance in the comprehensive matching decision during the execution phase.
[0109] The weighting sub-rule refers to the constraint standard for configuring differentiated weight coefficients for the target semantic features of each dimension. For example, the enhanced descriptive text feature carries the functional semantics of the target interface control and is assigned the first level weight; the target visual semantic features are relatively robust to changes in interface style and are assigned the second level weight; the target control type semantic features are stable but have limited distinguishability and are assigned the third level weight; the target structural position semantic features are most easily changed by interface redesign and are assigned the fourth level weight.
[0110] At the same time, based on the preset matching threshold sub-rules, differentiated matching thresholds are set for each target semantic feature to provide a quantitative boundary for determining whether the candidate control is hit during the execution phase.
[0111] The matching threshold sub-rules are the constraint standards for configuring differentiated matching thresholds for the target semantic features of each dimension. For example, the matching threshold for the enhanced descriptive text features includes an exact match threshold and an inclusion match threshold; the visual hash matching threshold in the target visual semantic features is set based on Hamming distance; and the matching threshold for the target structural position semantic features is set based on relative positional distance.
[0112] Finally, the aforementioned target semantic features, corresponding weight coefficients, and matching thresholds are structurally encapsulated according to a preset encoding standard to form a multi-location identifier. This structural encapsulation includes at least: assigning identifier fields and data types to each target semantic feature, generating unique identifiers for controls, binding associated context information (such as source screenshot paths and generated timestamps), and serialization encoding.
[0113] By embedding weight coefficients and matching thresholds in the multi-location identifiers, the workflow has independent intelligent addressing capabilities during the execution phase, without relying on external rule bases or configuration files, thus improving the workflow's cross-environment mobility and execution autonomy.
[0114] In a specific example, firstly, standardized identifier field names and data types are assigned to each target semantic feature; then, a globally unique identifier (such as a UUID) is generated for the current target interface control as a permanent identity code, which is convenient for reference in subsequent workflows and for audit log traceability; then, the associated data in the semantic fingerprint is bound to the identifier object, including the source screenshot path, generation timestamp, and confidence scores of each feature extraction; finally, the above field data is encoded into a storable and transmittable byte stream or structured text according to a preset data exchange format (such as JSON, Protocol Buffers, or binary serialization) to form the final multi-location identifier.
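The encapsulation in this example can be sketched with JSON as the exchange format. The field names and the sample values below are hypothetical; only the overall structure (typed feature fields, a UUID, bound context data, serialization) follows the text.

```python
import json
import uuid
from datetime import datetime, timezone

def build_identifier(features, weights, thresholds, screenshot_path, confidences):
    """Encapsulate features, weights, and thresholds as a multi-location identifier."""
    return {
        "control_id": str(uuid.uuid4()),   # permanent identity code
        "features": features,
        "weights": weights,
        "thresholds": thresholds,
        "context": {
            "source_screenshot": screenshot_path,
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "confidences": confidences,
        },
    }

identifier = build_identifier(
    features={"text": "Save", "visual_hash": "a5f0", "type": "button",
              "position": [0.234, 0.278]},
    weights={"text": 0.4, "visual": 0.3, "type": 0.2, "position": 0.1},
    thresholds={"max_hamming": 5, "max_position_distance": 0.2},
    screenshot_path="screenshots/step_03.png",  # hypothetical path
    confidences={"text": 0.95, "visual": 1.0},
)
serialized = json.dumps(identifier, ensure_ascii=False)
```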
[0115] The resulting multi-location identifier is not simply a set of feature values, but a comprehensive location description that integrates multi-dimensional features. It carries multi-dimensional semantic features and embeds weight coefficients and matching thresholds to guide the location execution, providing stable semantic anchors for re-addressing target elements in dynamically changing interfaces. During the execution phase, the multi-location identifier can be directly parsed, a weighted matching score can be calculated, and functionally equivalent target elements can be re-searched and identified in dynamically changing interfaces without needing to query an external rule base.
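A minimal sketch of how an execution engine might parse such an identifier and compute a weighted matching score is given below. The weight values, the score assigned to an inclusion match, the normalization formulas, and the overall hit threshold of 0.7 are assumptions chosen for illustration; the text fixes only the ranking of the dimensions and the kinds of thresholds.

```python
import math

def text_score(target, candidate, inclusion_score):
    """1.0 for an exact match, a reduced score for an inclusion match."""
    t, c = target.strip().lower(), candidate.strip().lower()
    if not t or not c:
        return 0.0
    if t == c:
        return 1.0
    return inclusion_score if (t in c or c in t) else 0.0

def hash_score(hex_a, hex_b, max_hamming):
    """Decreases with Hamming distance; zero beyond the distance threshold."""
    d = bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")
    return max(0.0, 1.0 - d / (max_hamming + 1))

def position_score(p, q, max_distance):
    """Decreases with the distance between proportional coordinates."""
    return max(0.0, 1.0 - math.hypot(p[0] - q[0], p[1] - q[1]) / max_distance)

def weighted_match(identifier, cand):
    f, w, t = identifier["features"], identifier["weights"], identifier["thresholds"]
    scores = {
        "text": text_score(f["text"], cand["text"], t["inclusion_score"]),
        "visual": hash_score(f["visual_hash"], cand["visual_hash"], t["max_hamming"]),
        "type": 1.0 if f["type"] == cand["type"] else 0.0,
        "position": position_score(f["position"], cand["position"],
                                   t["max_position_distance"]),
    }
    return sum(w[k] * scores[k] for k in scores) / sum(w.values())

def locate(identifier, candidates):
    """Return the best-scoring candidate, or None if no candidate reaches the hit threshold."""
    best = max(candidates, key=lambda c: weighted_match(identifier, c), default=None)
    if best is None or weighted_match(identifier, best) < identifier["thresholds"]["hit"]:
        return None
    return best

identifier = {
    "features": {"text": "Save", "visual_hash": "a5f0", "type": "button",
                 "position": (0.234, 0.278)},
    "weights": {"text": 0.4, "visual": 0.3, "type": 0.2, "position": 0.1},
    "thresholds": {"inclusion_score": 0.8, "max_hamming": 5,
                   "max_position_distance": 0.2, "hit": 0.7},
}
save_btn = {"text": "Save", "visual_hash": "a5f1", "type": "button",
            "position": (0.24, 0.28)}
cancel_btn = {"text": "Cancel", "visual_hash": "0000", "type": "button",
              "position": (0.30, 0.278)}
```

Because the weights and thresholds travel inside the identifier, `locate` needs no external rule base, which is the property this paragraph emphasizes.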
[0116] Furthermore, when the application environment is a browser webpage and DOM information can be obtained, the DOM selector pattern is used to extract DOM structure information such as control tag names, CSS selector paths, XPath paths, attribute dictionaries, and parent element chains to form a DOM path identifier. This identifier achieves precise targeting through the browser rendering engine's logical path, forming a positioning system covering both desktop and browser environments together with the multi-location identifier. This eliminates the need to build separate workflows for different environments, achieving unified automation capabilities across platforms and application types. Only when the multi-location identifier completely fails does it revert to the original operation coordinates as the final reference.
[0117] S5: Integrate the intent steps and location identifiers to update the initial workflow and form the running workflow.
[0118] By fusing and binding intent steps with location identifiers, the initial workflow is structurally updated to form an operational workflow with autonomous addressing capabilities.
[0119] When the application environment is a browser webpage and the DOM selector pattern is used, the corresponding positioning identifier is the DOM path identifier. Binding the DOM path identifier to the corresponding intent step allows that intent step to directly hit the target DOM node through the browser's rendering engine's logical path during execution.
[0120] In semantic fingerprint mode, each intent step in the initial workflow is traversed first. Based on the mapping relationship established in the control recognition stage, each intent step is bound to the multi-location identifier of its corresponding target interface control. This allows the intent step to directly call the weighted matching mechanism of multi-dimensional features for target addressing during the execution stage.
[0121] After completing the above bindings, the initial workflow is updated to a running workflow. In the running workflow, each intent step can contain the following fields: step number, action type, intent description, original operation parameters, and location identifier (multi-location identifier or DOM path identifier).
[0122] The above has transformed user sample data into a semantic, executable workflow with preliminary resistance to UI drift. The positioning identifiers in the workflow use the semantic features of the target UI controls or the DOM structure path as the addressing basis, rather than volatile absolute pixel coordinates. Even if software upgrades cause button position shifts, style adjustments, or slight changes in the webpage DOM, the target UI controls can still be re-locked through fuzzy matching or path correction, significantly improving the stability of the workflow in real dynamic environments.
[0123] S6: In conjunction with the running workflow, identify interface anomalies through screen scanning. Specifically, this includes the following steps: determining the baseline screenshot for the current time frame based on the running workflow; obtaining a real-time screenshot of the current frame by scanning the screen; performing a first detection on the real-time screenshot of the current frame to identify abnormal interface elements; performing, in combination with the baseline screenshot of the current frame, a second detection on the real-time screenshot of the current frame to identify abnormal interface states; and integrating the abnormal interface elements and abnormal interface states to give an interface anomaly identification result that includes the interface anomaly type.
[0124] It is important to note that during the execution phase, the workflow steps are not mechanically replayed in sequence. Instead, the current screen state is proactively sensed before each step: interface anomalies that might interfere with the normal execution of the workflow are identified through screen scanning, the corresponding anomaly handling strategies are invoked and executed, and the outcome is verified, forming a closed-loop mechanism of "perception-decision-execution," as shown in Figure 4.
[0125] First, a screenshot from the sample data corresponding to the current time frame is identified as the baseline screenshot for that frame. This baseline screenshot records the expected normal interface state before the current workflow step is executed, providing a reference point for subsequent pixel-level comparative analysis. Of course, as the current time frame changes, the baseline screenshot will be redefined. Therefore, the baseline screenshot for the current time frame always reflects the latest expected interface state when the workflow is on track. By setting the baseline screenshot for the current time frame, the degree of change of the current screen interface relative to the expected state can be quantitatively determined, thereby distinguishing between normal and abnormal screen interface states.
[0126] Then, a real-time screenshot of the current frame is captured through screen scanning. The real-time screenshot reflects the actual interface state of the current frame when the workflow is executed, and is the direct object of analysis for identifying interface anomalies.
[0127] Next, anomaly detection (i.e., the first detection) based on contour feature extraction is performed on the real-time screenshot of the current frame to identify newly appearing anomalous interface elements with clear geometric boundaries, such as obstructive pop-ups and CAPTCHA areas. In other words, anomalous interface elements refer to newly appearing objects with clear geometric boundaries on the screen.
[0128] Performing the first detection on the real-time screenshot of the current frame to identify abnormal interface elements specifically includes: performing edge detection on the real-time screenshot of the current frame to determine the edge contours; and performing feature filtering and analysis based on the geometric features of the edge contours to identify abnormal interface elements.
[0129] Specifically, edge detection is first performed on the real-time screenshot of the current frame (e.g., using the Canny operator) to extract the edge contour information of interface elements; then, a contour lookup algorithm is used to detect closed contour regions; based on the geometric features of the closed contour regions, feature filtering and analysis are performed. These geometric features can include preset parameters such as contour size range, aspect ratio, and compactness. Closed contour regions that conform to specific geometric feature rules are identified as abnormal interface elements. These specific geometric feature rules can be preset based on historical data. For example, contours with sizes within the typical range of pop-ups and reasonable aspect ratios are identified as obstructive pop-ups; small contour groups with a quantity meeting a threshold and compact size are identified as CAPTCHA character regions.
[0130] In a specific example, a pop-up detection algorithm is invoked to identify whether a modal dialog box that blocks process execution exists on the screen. Specifically, Canny edge detection is first performed on the real-time screenshot of the current frame to extract the edge contour information of interface elements; then, a contour lookup algorithm is used to detect closed contour regions, which are filtered according to geometric feature rules (the contour width is greater than 200 pixels and less than 80% of the screen width, the height is greater than 100 pixels and less than 80% of the screen height, and the aspect ratio is between 0.5 and 3.0). Closed contour regions that meet these size and proportion characteristics are identified as pop-ups. If a template library of common pop-up types is preset, a template matching algorithm is further used to compare the detected pop-up against the templates and confirm the pop-up type (such as agreement pop-ups, confirmation pop-ups, error pop-ups, etc.).
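The geometric filtering rule of this example can be sketched as follows. The sketch assumes that the bounding boxes of closed contours have already been obtained (in practice this could be done with an image library such as OpenCV, which is not used here); the function names and the `(x, y, w, h)` box format are assumptions.

```python
def is_popup_candidate(box, screen_w, screen_h):
    """Apply the size and aspect-ratio rule to one contour bounding box (x, y, w, h)."""
    _, _, w, h = box
    if h == 0:
        return False
    return (200 < w < 0.8 * screen_w
            and 100 < h < 0.8 * screen_h
            and 0.5 <= w / h <= 3.0)

def find_popups(boxes, screen_w, screen_h):
    """Keep only the bounding boxes that look like obstructive pop-ups."""
    return [b for b in boxes if is_popup_candidate(b, screen_w, screen_h)]

# Hypothetical bounding boxes on a 1920x1080 screen.
boxes = [
    (600, 300, 720, 480),   # dialog-sized region
    (0, 0, 1920, 1080),     # the whole screen
    (10, 10, 150, 40),      # a small button
    (100, 100, 900, 120),   # a wide toolbar
]
popups = find_popups(boxes, 1920, 1080)
```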
[0131] By combining a baseline screenshot of the current frame with a pixel-based statistical analysis of the real-time screenshot of the current frame (i.e., the second detection), abnormal interface state detection is performed to identify pixel-level state deviations in the overall or local areas of the interface, such as UI layout drift, error messages, loading status, etc. In other words, abnormal interface state refers to the form of pixel-level deviation that occurs in the overall or local areas of the interface.
[0132] Performing the second detection on the real-time screenshot of the current frame, in combination with the baseline screenshot of the current frame, to identify abnormal interface states specifically includes: performing pixel analysis on the baseline screenshot and the real-time screenshot of the current frame, respectively, to give baseline pixel data and real-time pixel data; obtaining the proportion of pixels of a specific color in the real-time pixel data to identify abnormal interface states; and/or comparing and analyzing the baseline pixel data and the real-time pixel data to identify abnormal interface states based on the pixel change ratio or the dynamic change area between time frames.
[0133] Specifically, in a particular example, the baseline screenshot and the real-time screenshot of the current frame are decomposed into pixel-level color information. The pixel-level difference between the two is calculated, and after grayscale conversion and thresholding, the similarity of the changing colors is determined, and the proportion of these colors in the total pixels is counted. If the proportion exceeds a preset threshold, it is determined that a UI layout shift or an overall interface change has occurred. The specific pixels of a particular color may vary in different examples; no further limitations are made here.
[0134] In another specific example, a UI change detection algorithm is invoked to determine whether the overall layout or style of the interface has changed by comparing a real-time screenshot with a baseline screenshot. Specifically, a pixel-level difference algorithm is used to calculate the real-time difference map between the current frame's real-time screenshot and the previous frame's real-time screenshot, as well as the baseline difference map between the current frame's baseline screenshot and the previous frame's baseline screenshot. Then, the real-time difference map and the baseline difference map are converted to grayscale, and the real-time difference value is compared with the baseline difference value against a threshold to mark the changed pixels. The proportion of changed pixels to the total number of pixels on the screen is counted to obtain the interface change ratio, which indicates whether the interface state is abnormal. For example, if the pixels for which the real-time difference value deviates from the baseline difference value by more than a preset threshold (e.g., 5%) account for more than 10% of all pixels on the screen, UI drift is determined to exist.
[0135] Therefore, the second detection uses pixel analysis to quantify the degree of change or color distribution abnormality of the interface, thereby identifying abnormal interface states.
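The pixel change ratio used by the second detection can be sketched in its simplest form, a direct comparison of a baseline screenshot with a real-time screenshot. The sketch assumes both screenshots are already grayscale and of equal size, represented as nested lists of 0-255 values; the 5% per-pixel threshold and the 10% ratio threshold are borrowed from the example above and are otherwise assumptions.

```python
def change_ratio(baseline, live, pixel_threshold=0.05):
    """Fraction of pixels whose grayscale value differs by more than the threshold."""
    total = changed = 0
    for base_row, live_row in zip(baseline, live):
        for b, l in zip(base_row, live_row):
            total += 1
            if abs(b - l) / 255 > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0

def has_ui_drift(baseline, live, ratio_threshold=0.10):
    """Flag an abnormal interface state when too many pixels have changed."""
    return change_ratio(baseline, live) > ratio_threshold

# Hypothetical 10x10 screenshots: two rows of the live frame turn black.
baseline = [[200] * 10 for _ in range(10)]
live = [[0] * 10 if y < 2 else [200] * 10 for y in range(10)]
```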
[0136] Finally, the outputs of the first and second detections are comprehensively processed, and abnormal interface elements and abnormal interface states are merged to give a structured interface anomaly identification result.
[0137] The interface anomaly identification results can include: summarizing the detected interface anomaly markers from two categories, constructing an enumeration list of interface anomaly types (interface anomaly types refer to categories of unexpected performance issues that occur during dynamic interface display; each type of abnormal interface element or abnormal interface state is considered a different interface anomaly type, such as pop-ups and UI drift, which would include two types of interface anomaly in the list), and integrating the detailed features of various interface anomaly types (including contour geometric attributes, pixel change ratios, color area ratios, confidence scores, etc.) to form a unified data structure. The interface anomaly identification results provide standardized decision input for the subsequent retrieval and invocation of anomaly handling strategies, enabling the selection of the most appropriate anomaly handling strategy based on the interface anomaly type and confidence level.
[0138] By scanning the screen before each step, the system proactively senses the current screen environment and identifies potential interference factors before execution, thus gaining decision-making time for subsequent strategy calls and fundamentally avoiding the passive process of "blind execution → failure → crash." Through a categorized detection mechanism, it simultaneously covers the most common abnormal scenarios in office automation, such as blocking pop-ups, UI layout changes, CAPTCHA challenges, error messages, and loading statuses, ensuring that abnormalities of different natures can be effectively identified and significantly improving the workflow's adaptability in complex real-world environments.
[0139] Understandably, identifying interface anomalies is only the initial detection step; without a targeted response mechanism, the workflow will still stall or fail. Therefore, a hierarchical strategy library is provided, which matches and invokes different anomaly handling strategies based on the interface anomaly identification results, thereby optimizing the workflow.
[0140] The hierarchical strategy library is a collection of multi-level handling strategies for various interface exception types, built based on the source channel and execution priority of the handling strategy. The hierarchical strategy library can be pre-built based on historical data, and no further restrictions are placed on its form and type here.
[0141] In a specific example, the hierarchical policy library adopts a three-tiered architecture, managing policy objects according to policy source and priority. The first tier consists of cloud-shared policies, the second tier consists of local policies, and the third tier consists of system-built-in policies.
[0142] The cloud-shared strategy originates from a collective intelligence sharing mechanism, aggregating high-quality processing strategies uploaded by multiple user instances after anonymization and desensitization. This level enjoys a fixed bonus in priority calculation (e.g., a +50 priority bonus), giving it priority over basic strategies with the same score in strategy selection. Through the cloud-shared strategy, a solution found by one instance benefits all instances: when a user instance successfully handles a new type of interface anomaly and the handling is recorded as a processing strategy, other user instances can synchronously obtain this processing strategy.
[0143] Local policies are formed autonomously by current users during their usage, including processing strategies automatically extracted from successful user interventions and processing strategies manually configured by the user. The priority of this level is dynamically assessed based on historical execution success rates, and the value range can be adjusted within a preset range (e.g., 0 to 100). Local policies have strong user personalization characteristics and can adapt to the software environment, operating habits, and special interface specifications commonly used by specific users within an enterprise.
[0144] The system's built-in policies are a set of preset default policies that cover the most common interface anomaly types in office automation scenarios. This level has a basic priority (e.g., 60 to 90) as a safety net. When neither the cloud policy nor the local policy covers the current interface anomaly type, the system's built-in policy will be used as a fallback.
[0145] Each processing strategy in the hierarchical strategy library is stored as a structured strategy object, which may include: a unique strategy identifier, strategy name, applicable exception types, trigger condition dictionary, priority score, historical success rate, and action sequence. The trigger condition dictionary defines the contextual conditions that must be met for the processing strategy to be activated (such as pop-up type, keyword matching, CAPTCHA type, etc.); the action sequence details the specific operational steps when the processing strategy is executed (such as scrolling, clicking, searching, dragging, notification, etc.).
[0146] S7: Based on the interface anomaly identification results, invoke the anomaly handling strategy and optimize the running workflow. Specifically, this includes the following steps: retrieving matching processing strategies from the hierarchical strategy library based on the interface anomaly identification results and identifying them as candidate processing strategies to form a candidate processing strategy set; filtering and sorting the candidate processing strategy set based on preset triggering conditions and priority rules to determine the target processing strategies and give the target processing strategy set; executing each target processing strategy in the target processing strategy set in sequence and monitoring the strategy execution results; and optimizing the running workflow and updating the hierarchical strategy library based on the strategy execution results.
[0147] Furthermore, retrieving matching processing strategies from the hierarchical strategy library specifically includes: using the interface anomaly type as the search element, adopting a full-level parallel search mechanism to traverse each level of the hierarchical strategy library, and giving the processing strategies in each level that match the search element.
[0148] Furthermore, filtering and sorting the candidate processing strategy set based on preset triggering conditions and priority rules to determine the target processing strategy specifically includes the following steps: performing full trigger-matching verification on each candidate processing strategy in the candidate processing strategy set based on the preset triggering conditions, filtering out candidate processing strategies that do not fully meet the triggering conditions, and giving the initial target processing strategies; performing a weighted calculation on each initial target processing strategy in combination with the adjustment factor corresponding to its level to give a comprehensive priority score; and determining the target processing strategy based on the comprehensive priority score.
[0149] During the execution phase, once the interface anomaly identification results are available, the system does not blindly retry or simply report an error. Instead, based on the structured interface anomaly identification results, it uses a closed-loop mechanism of "retrieval, filtering and sorting, sequential execution, optimization and update" to respond precisely to each interface anomaly type and dynamically optimize the workflow.
[0150] Specifically, the system first takes the interface anomaly identification results as input and analyzes the interface anomaly type (such as obstructive pop-ups, UI drift, CAPTCHA challenges, error messages, loading status, etc.) and contextual features (such as keywords in the pop-up, CAPTCHA type, UI change ratio, error message color area, etc.). Using the interface anomaly type as the search element, a full-level parallel search mechanism is initiated, simultaneously sending search requests to each level of the hierarchical strategy library. Each level independently searches its own stored processing strategies, filters out processing strategies that match the search element (interface anomaly type), and returns them to the search results aggregation module. For example, when the anomaly type is "obstructive pop-up," the system's built-in strategy layer returns preset pop-up handling strategies (such as automatic agreement in agreement pop-ups, automatic confirmation in confirmation pop-ups), the local strategy layer returns effective pop-up response strategies accumulated by the user in the past, and the cloud-shared strategy layer returns pop-up handling strategies uploaded by other users and verified by the group. The strategies returned by each level together constitute a candidate processing strategy set, ensuring the breadth of coverage and diversity of sources in the candidate set.
[0151] The candidate processing strategy set is filtered based on preset triggering conditions. Triggering conditions refer to the set of pre-defined constraint rules used to filter and activate processing strategies. These rules can include: interface anomaly type (e.g., agreement pop-ups, confirmation pop-ups, error pop-ups), keywords (e.g., pop-up text containing "agree," "agreement," "terms," etc.), CAPTCHA type classification (e.g., slider, text, puzzle), confidence interval matching, etc. Each strategy in the candidate processing strategy set is individually verified to ensure that its triggering condition dictionary completely matches the contextual features in the current interface anomaly identification result. A candidate processing strategy passes verification only if the current interface anomaly identification result satisfies all the constraint rules of its triggering conditions; otherwise, it is filtered out. This retains the initial target processing strategies that are highly compatible with the current interface anomaly type.
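The full trigger-matching verification can be sketched as below. The key names (`popup_type`, `keywords`, `min_confidence`, `text`, `confidence`) and the strategy names are hypothetical; the text only requires that every rule in the triggering condition dictionary be satisfied.

```python
def triggers_match(trigger_conditions, context):
    """Return True only if every rule in the trigger dictionary is satisfied."""
    for key, expected in trigger_conditions.items():
        if key == "keywords":
            text = context.get("text", "").lower()
            if not any(k.lower() in text for k in expected):
                return False
        elif key == "min_confidence":
            if context.get("confidence", 0.0) < expected:
                return False
        elif context.get(key) != expected:
            return False
    return True

def filter_candidates(candidates, context):
    """Keep the candidate strategies whose triggering conditions fully match."""
    return [s for s in candidates if triggers_match(s["triggers"], context)]

context = {"popup_type": "agreement", "confidence": 0.92,
           "text": "Please read and agree to the Terms of Service"}
candidates = [
    {"name": "auto_agree",
     "triggers": {"popup_type": "agreement", "keywords": ["agree", "terms"]}},
    {"name": "auto_confirm", "triggers": {"popup_type": "confirmation"}},
    {"name": "strict_agree",
     "triggers": {"popup_type": "agreement", "min_confidence": 0.99}},
]
initial_targets = filter_candidates(candidates, context)
```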
[0152] Furthermore, for each initial target processing strategy in the initial target processing strategy set, a weighted calculation is performed based on the adjustment factor corresponding to its source level to give a comprehensive priority score.
[0153] The adjustment factor is a preset weighting parameter that reflects the characteristics of a strategy's source level; each level has a different adjustment factor. Cloud-shared strategies receive a preset priority bonus, local strategies are adjusted dynamically based on their historical execution success rates, and system built-in strategies carry a basic priority value. Combining the source level, the basic priority value, and the historical execution success rate, a comprehensive priority score is calculated for each initial target processing strategy. The strategies are then sorted in descending order of comprehensive priority score, giving a target processing strategy set ordered by execution priority.
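The specification does not give the scoring formula, so the following is only one possible sketch of the weighting in [0153]. The bonus value, the local scaling rule, and the field names `source`, `base_priority`, and `success_rate` are all assumptions introduced for illustration.

```python
CLOUD_BONUS = 10.0  # assumed preset priority bonus for cloud-shared strategies


def priority_score(strategy):
    """Combine the basic priority value with a source-level adjustment (assumed form)."""
    base = strategy["base_priority"]
    if strategy["source"] == "cloud":
        return base + CLOUD_BONUS
    if strategy["source"] == "local":
        # Assumed rule: scale the base value by the historical success rate.
        return base * (0.5 + strategy["success_rate"])
    return base  # built-in strategies keep their basic priority value


def rank(strategies):
    """Sort strategies in descending order of comprehensive priority score."""
    return sorted(strategies, key=priority_score, reverse=True)
```

With these assumed values, a local strategy with a high success rate can outrank a cloud-shared one, while a local strategy that often fails drops below the built-in default.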
[0154] Next, the target processing strategies in the target processing strategy set are parsed and executed sequentially according to their execution priority. For each target processing strategy, its action sequence is parsed and each action step is executed in order. The action steps vary with the anomaly type. For example: for an agreement pop-up, an action sequence of simulating reading and scrolling and then clicking "agree" can be executed; for UI drift, an action sequence of expanding the search range based on semantic fingerprints can be executed; for a CAPTCHA, an action sequence of simulating dragging or requesting manual assistance can be executed; for an error pop-up, an action sequence of capturing the text, closing the pop-up, and notifying the user can be executed.
[0155] During execution, the results of the above target processing strategies are monitored in real time to determine whether the current interface anomaly type has been successfully eliminated and whether the workflow has returned to the expected state. If a target processing strategy is executed successfully, subsequent strategy attempts are stopped, and the workflow optimization and hierarchical strategy library update phase begins. If the current target processing strategy fails, the next target processing strategy in the target processing strategy set is tried until the target processing strategy set is exhausted. If the target processing strategy set is exhausted and the interface anomaly type is still not eliminated, a degradation mechanism is triggered.
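The try-in-order loop with a degradation fallback described in [0155] can be sketched as below. The strategy structure (a `name` plus a list of callable `actions`) and the `anomaly_cleared` callback are hypothetical; in a real system the callback would re-scan the screen.

```python
def run_strategies(ordered_strategies, anomaly_cleared):
    """Execute strategies by priority; stop at the first one that clears the anomaly."""
    tried = []
    for strategy in ordered_strategies:
        tried.append(strategy["name"])
        for action in strategy["actions"]:
            action()  # execute each action step in sequence
        if anomaly_cleared():
            return {"status": "resolved", "strategy": strategy["name"], "tried": tried}
    # Strategy set exhausted without success: trigger the degradation mechanism.
    return {"status": "degraded", "strategy": None, "tried": tried}
```

Returning the list of attempted strategies gives the later statistics update a record of both the failures and the eventual success.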
[0156] In a specific example, the interface anomaly type is UI drift, so a target UI control cannot be found at its original location. During the demonstration learning phase, each target UI control is associated with multiple location identifiers. The action sequence of the target processing strategy determined here employs a hierarchical expanding search mechanism: the first round takes the original location as the center and performs precise matching based on semantic fingerprints within a preset small range (e.g., ±50 pixels); if no match is found, the second round expands to ±100 pixels, the third round to ±200 pixels, and the fourth round to ±300 pixels. Each round of search uses a sliding-window traversal with the step size set to a preset value (e.g., 50 pixels), using perceptual hash similarity (Hamming distance threshold set to 20) as the primary matching criterion. If a functionally equivalent target UI control is found in a given round, the location parameters for that step in the workflow are updated, and the newly discovered location becomes the benchmark for subsequent execution. Through this hierarchical search, the target processing strategy achieves intelligent addressing "from near to far, from precise to coarse," controlling computational overhead while ensuring positioning accuracy.
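The expanding search in [0156] can be sketched as follows, using the example radii, step size, and Hamming threshold from the text. The `hash_at` callback, which would return the perceptual hash of the screen patch at a given position, is a hypothetical placeholder. Treating a distance of at most 20 as a match, and picking the closest match within a round, are assumptions.

```python
RADII = (50, 100, 200, 300)  # search range per round, in pixels
STEP = 50                    # sliding-window step size, in pixels
HAMMING_THRESHOLD = 20       # assumed: distance <= threshold counts as a match


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def expanding_search(origin, target_hash, hash_at):
    """Search outward from origin round by round; return (position, radius) or (None, None)."""
    ox, oy = origin
    visited = set()
    for radius in RADII:
        best = None
        for dx in range(-radius, radius + 1, STEP):
            for dy in range(-radius, radius + 1, STEP):
                pos = (ox + dx, oy + dy)
                if pos in visited:
                    continue  # already checked in a smaller round
                visited.add(pos)
                distance = hamming(target_hash, hash_at(*pos))
                if distance <= HAMMING_THRESHOLD and (best is None or distance < best[0]):
                    best = (distance, pos)
        if best is not None:
            return best[1], radius
    return None, None
```

Skipping positions already visited in a smaller round keeps the later, wider rounds from repeating work.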
[0157] If a target processing strategy in the target processing strategy set is successfully executed, the workflow is optimized based on the strategy execution result: the mapping relationship between the "abnormal scenario and response strategy" is solidified into the abnormal response path of the corresponding step in the workflow, so that the workflow can directly reuse the verified response experience when it is executed repeatedly in the future without having to search and try again, which significantly improves the efficiency and stability of subsequent execution.
[0158] Simultaneously, the hierarchical strategy library is updated, specifically including: updating historical success rates based on an exponential moving average mechanism, and updating execution statistics for target processing strategies. Execution statistics include priority ranking. The execution statistics of new target processing strategies can promptly reflect their actual effectiveness in the current environment. For example, if a target processing strategy is a newly generated or optimized strategy in the local strategy layer, and its success rate reaches a preset threshold, it can be stored or uploaded to the cloud-based shared strategy layer after user authorization and anonymization / desensitization, enabling the group sharing of high-quality experience and the continuous evolution of the hierarchical strategy library.
[0159] Furthermore, updating the historical success rate based on the exponential moving average mechanism specifically includes: obtaining the historical success rate of the target processing strategy and the result of the current execution; applying weighted decay to the historical success rate based on a preset decay coefficient; and applying a weighted update with the current execution result based on a preset update coefficient to give the new success rate.
[0160] The exponential moving average mechanism is specifically expressed as follows:
new_rate = k1 × old_rate + k2 × result
[0161] Where new_rate is the new success rate; old_rate is the historical success rate of the target processing strategy; result is the binary representation of the execution result, with 0 indicating execution failure and 1 indicating execution success; k1 is the decay coefficient, such as 0.9; and k2 is the update coefficient, such as 0.1.
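The update in [0159] to [0161] can be sketched as a short function. It assumes the weighted-sum form implied by the description, in which the historical rate is scaled by k1 and the current binary result by k2. The function name and the default values (taken from the examples in the text) are illustrative.

```python
def update_success_rate(old_rate, succeeded, k1=0.9, k2=0.1):
    """Exponential moving average: decay the old rate, then add the weighted new result."""
    result = 1 if succeeded else 0
    return k1 * old_rate + k2 * result


rate = 0.5
for outcome in (True, True, False):
    rate = update_success_rate(rate, outcome)
```

With k1 + k2 = 1, the rate stays within [0, 1] and recent executions weigh more than older ones.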
[0162] Through a closed-loop mechanism of "retrieval-filtering-sorting-execution-solidification", the system achieves accurate classification of interface anomaly types and matching of anomaly handling strategies. Priority sorting enables the reuse of high-quality experiences. Continuous updating of execution statistics enables dynamic evaluation of the effectiveness of handling strategies, giving the workflow the ability to self-optimize as it is used.
[0163] By constructing a complete interface exception handling system covering multiple levels and scenarios, and combining trigger condition filtering and priority sorting mechanisms, we can achieve accurate matching and intelligent sorting of handling strategies. We can also solidify exception handling experience into workflow capabilities, enabling adaptive evolution at the execution level and significantly improving the efficiency and stability of workflow execution.
[0164] S8: Verify the optimized running workflow and construct the workflow based on multimodal demonstration learning.
[0165] After executing actions and applying processing strategies, the operating system verifies the optimized workflow to confirm whether the expected results of the current step have occurred, thereby constructing a workflow based on multimodal demonstration learning.
[0166] In a specific example, a screenshot of the post-execution state is captured and compared with the expected interface state of the current step to verify whether the optimized workflow has brought the process back on track. During the training phase, each intent step is associated with screenshots taken before and after the operation. The execution phase uses the post-operation screenshot recorded during the training phase as the template for the expected state after the step is completed. For workflow steps optimized by a strategy, the operating system infers the expected result features from the type of strategy applied. For example: after a pop-up handling strategy is executed, the pop-up is expected to disappear; after a UI drift handling strategy is executed, the target element is expected to be successfully located and operated; after a CAPTCHA handling strategy is executed, verification is expected to pass and the page is expected to enter the next state.
[0167] Furthermore, based on the location identifier and action type of the current step, expected result features can be dynamically generated. For example, for click-type operations, the expected result is a change in the state of the clicked control (such as button highlighting or page navigation); for input-type operations, the expected result is the appearance of the target text in the input box; for button-type operations, the expected result is form submission or dialog box closing, etc.
[0168] Verifying the optimized workflow includes the following steps: comparing the post-execution screenshot with the expected state template using pixel difference analysis, calculating the position and proportion of the changed area, and determining whether the expected change has occurred; and / or, in the post-execution screenshot, re-searching for the target UI control based on the location identifier of the current intent step to verify whether it is in the expected state (such as pop-up disappearance, new page loading, successful text input, etc.); and / or, extracting visual features from key areas of the post-execution screenshot (such as the original pop-up area and the input box area), matching them against the expected result features, and confirming the operation result.
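The pixel difference analysis mentioned in [0168] can be sketched with plain nested lists standing in for grayscale screenshots. The per-pixel tolerance and the function name are assumptions; a production system would more likely operate on image arrays.

```python
def pixel_diff(template, screenshot, tolerance=10):
    """Return (changed proportion, bounding box of changed pixels) for two equal-size images."""
    changed = [
        (x, y)
        for y, (row_t, row_s) in enumerate(zip(template, screenshot))
        for x, (t, s) in enumerate(zip(row_t, row_s))
        if abs(t - s) > tolerance  # assumed per-pixel tolerance for noise
    ]
    total = len(template) * len(template[0])
    if not changed:
        return 0.0, None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return len(changed) / total, (min(xs), min(ys), max(xs), max(ys))


expected = [[0] * 4 for _ in range(4)]
actual = [row[:] for row in expected]
for y in (1, 2):
    for x in (1, 2):
        actual[y][x] = 255  # simulate a 2x2 region that differs from the template
ratio, box = pixel_diff(expected, actual)
```

The proportion indicates how much of the screen differs from the expected state, and the bounding box indicates where, which can then be checked against the key area of the current step.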
[0169] If the screen status after optimization meets expectations, the current step is deemed to have passed verification, the process is confirmed to be on track, and the execution result of the current step and the applied strategy (if any) are solidified into the running workflow.
[0170] If the screen state after optimization does not meet expectations (e.g., the pop-up still exists, the target element is not found, the CAPTCHA is not passed, etc.), the current step is deemed to have failed verification. The failure information is recorded, and degradation processing or anomaly marking is triggered.
[0171] Among them, degradation processing refers to: when all processing strategies have been tried and failed, the degradation mechanism is triggered, and a notification is pushed to the user requesting manual assistance; anomaly marking refers to: marking the current interface anomaly type as a sample to be learned, so that the learning and evolution module can extract experience, generate new processing strategies, and supplement the hierarchical strategy library.
[0172] It should also be noted that during the verification and full-process execution, all operations and decision-making basis can be recorded to a structured audit log to ensure the traceability and compliance of the operations.
[0173] In one example, the structured audit logs employ a dual-storage architecture, using an SQLite database as the primary storage for efficient querying and retrieval; and a JSONL text file as a daily backup for easy export and long-term archiving. The audit log data structure includes fields such as: unique log identifier, workflow identifier, step identifier, timestamp, operation type, detailed information (JSON format), execution result, error message, screen resolution, and active window title.
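The dual-storage audit log in [0173] can be sketched with the Python standard library. The table name, column names, and function names are hypothetical renderings of the listed fields. For brevity this sketch appends to the JSONL file on every write, whereas the text describes the JSONL file as a daily backup.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

FIELDS = (
    "log_id", "workflow_id", "step_id", "timestamp", "operation_type",
    "details", "result", "error_message", "screen_resolution", "active_window_title",
)


def open_log(db_path):
    """Open (or create) the SQLite primary store with one text column per field."""
    conn = sqlite3.connect(db_path)
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS audit_log ({columns})")
    return conn


def record(conn, jsonl_path, **entry):
    """Write one audit entry to SQLite and append the same entry to the JSONL file."""
    row = {name: entry.get(name) for name in FIELDS}
    row["log_id"] = str(uuid.uuid4())
    row["timestamp"] = datetime.now(timezone.utc).isoformat()
    row["details"] = json.dumps(entry.get("details", {}), ensure_ascii=False)
    placeholders = ", ".join("?" for _ in FIELDS)
    conn.execute(
        f"INSERT INTO audit_log ({', '.join(FIELDS)}) VALUES ({placeholders})",
        [row[name] for name in FIELDS],
    )
    conn.commit()
    with open(jsonl_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row["log_id"]
```

Storing the detailed information as a JSON string keeps the table schema fixed while allowing different operation types to carry different details.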
[0174] Specific recording scenarios include workflow step execution records, exception event records, and sensitive operation confirmation records. Workflow step execution records refer to recording the pre-execution state (intent description, active window, screen resolution, etc.) and post-execution state (success / failure, strategy used, screenshot paths before and after execution, etc.) for each step. Exception event records record the detected exception type, description, applied strategy identifier, and whether it has been resolved. Sensitive operation confirmation records, for sensitive operations such as file deletion and payment confirmation, record the confirmation request and confirmation result to ensure that sensitive behaviors are traceable.
[0175] Once all steps in the workflow have been executed sequentially and verified, the operating system completes the construction of the workflow based on multimodal demonstration learning. The resulting workflow uses intent steps as basic units and presents the operation process in natural language descriptions, ensuring human readability. Each intent step is bound to a location identifier (semantic fingerprint or DOM path identifier), ensuring machine executability. The workflow embeds exception handling paths and strategy execution statistics for each step, providing adaptability to dynamic environments. The workflow's execution records and exception samples provide training material for the learning and evolution module, supporting continuous improvement of the operating system's capabilities. All operations and decision-making processes are recorded in structured audit logs, meeting enterprise compliance and security audit requirements. In short, the workflow is semantic, executable, adaptive, evolvable, and auditable.
[0176] As shown in Figure 5, the constructed workflow, taking "Auto-fill Enterprise Expense Report" as an example, includes 12 semantic intent steps arranged in sequence: Step 1 "Double-click icon", Step 2 "Wait for loading", Step 3 "Enter employee number", Step 4 "Enter password", Step 5 "Click login", Step 6 "Click create expense report", Step 7 "Select amount category", Step 8 "Enter amount", Step 9 "Enter description", Step 10 "Submit for approval", Step 11 "Confirmation pop-up", and Step 12 "Wait for completion". The bottom of the interface features an execution control area with manual intervention options such as pause, resume, stop, and refresh; the execution status display area shows the current execution progress in real time, such as "Executing [Auto-fill Enterprise Expense Report] Step 3 / 12".
[0177] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various alterations and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations of the invention fall within the scope of the claims and their equivalents, the invention is also intended to include them.
Claims
1. A method for dynamically constructing intelligent workflows based on multimodal demonstration learning, characterized in that it comprises the following steps: acquiring the multimodal signals corresponding to each time frame in the demonstration data, where the demonstration data is the data of the user's workflow demonstration operation, and the multimodal signals are multi-source heterogeneous signals collected synchronously by operation time frame during the user's workflow demonstration operation; performing semantic analysis on the multimodal signals to determine intent steps and give an initial workflow, where an intent step is an independent operation semantic in the workflow; identifying the target interface control based on analysis of each intent step; forming a location identifier based on the control description information of the target interface control, where the location identifier is an addressing identifier generated by converting the control description information; updating the initial workflow by integrating the intent steps and location identifiers to form the running workflow; identifying interface anomalies by screen scanning in conjunction with the running workflow; invoking an anomaly handling strategy based on the interface anomaly identification results and optimizing the running workflow; and verifying the optimized running workflow to construct a workflow based on multimodal demonstration learning.
2. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 1, characterized in that performing semantic analysis on the multimodal signals to determine the intent steps and give an initial workflow includes the following steps: synchronizing the timestamps of the multimodal signals; performing format conversion on each modal signal, determining each initial event, and giving a list of initial events; calling the filtering function to filter each initial event and form a filtered event list; based on the correlation of the operation sequence, calling the aggregation function to merge the filtered event list, determine the work events, and form the event list; and inferring the intent of each work event in the event list, determining the intent steps corresponding to the event list, and giving an initial workflow.
3. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 2, characterized in that, For each work event in the event list, perform intent inference to determine the corresponding intent steps in the event list and provide an initial workflow, which includes the following steps: Based on the multimodal intent inference function, intent inference is performed for each working event to generate an intent description; By combining the event list and intent description, determine the intent steps for each work event in the corresponding event list and provide an initial workflow.
4. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 1, characterized in that, The control description information includes semantic fingerprints, and the location identifiers include multiple location identifiers; Based on the control description information of the target interface control, a positioning identifier is generated, which specifically includes the following steps: Obtain the target modal signals of the corresponding target interface controls; Each target modal signal undergoes feature transformation to form corresponding target semantic features; Based on the weight allocation sub-rules, determine the weight coefficients corresponding to each target semantic feature; According to the matching threshold sub-rule, a corresponding matching threshold is set for each target semantic feature; The semantic features of each target, along with their corresponding weight coefficients and matching thresholds, are structurally encapsulated to form a multi-location identifier.
5. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 4, characterized in that, For each target modal signal, feature transformation is performed to form the corresponding target semantic features, specifically including: Based on the initial feature transformation strategy, the first target modal signal is transformed to form initial descriptive text features and initial target semantic features in multiple dimensions. The initial feature transformation strategy includes visual feature transformation, control type transformation, structural position transformation and descriptive text transformation. The first target modal signal includes a local region image, which is a region image formed by expanding a preset pixel range around the click coordinates of the target interface control. Semantic relevance determination is performed on the second target modal signal, the association level with the target interface control is given, and the enhancement signals of each target are determined. The second target modal signal includes the target speech signal, and the timestamp of the target speech signal is aligned with the operation time of the target interface control. Based on the association level, the enhanced signals of each target and the initial descriptive text features are fused to form the corresponding enhanced descriptive text features; By combining initial target semantic features from multiple dimensions with enhanced descriptive text features, the target semantic features are presented.
6. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 4, characterized in that, In conjunction with the workflow, screen scanning is used to identify interface anomalies, specifically including the following steps: Based on the running workflow, determine the baseline screenshot for the current time frame; Get a real-time screenshot of the current frame by scanning the screen; Perform a first detection on the real-time screenshot of the current frame to identify abnormal interface elements; By combining the baseline screenshot of the current frame, a second detection is performed on the real-time screenshot of the current frame to identify abnormal interface states; By integrating abnormal interface elements and abnormal interface states, the system provides interface anomaly identification results that include the interface anomaly type.
7. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 6, characterized in that performing a second detection on the real-time screenshot of the current frame in combination with the baseline screenshot of the current frame to identify abnormal interface states specifically includes: performing pixel analysis on the baseline screenshot and the real-time screenshot of the current frame, respectively, to give baseline pixel data and real-time pixel data; obtaining the proportion of specific color pixels in the real-time pixel data and identifying abnormal interface states; and / or, comparing and analyzing the baseline pixel data and the real-time pixel data, and identifying abnormal interface states based on the pixel change ratio or the dynamic change area between time frames.
8. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 6, characterized in that invoking the anomaly handling strategy based on the interface anomaly identification results and optimizing the running workflow specifically includes the following steps: based on the interface anomaly identification results, retrieving matching processing strategies from the hierarchical strategy library and identifying them as candidate processing strategies to form a candidate processing strategy set; based on preset triggering conditions and priority rules, filtering and sorting the candidate processing strategy set to determine the target processing strategies and give the target processing strategy set; executing each target processing strategy in the target processing strategy set in sequence and monitoring the strategy execution results; and, based on the strategy execution results, optimizing the running workflow and updating the hierarchical strategy library.
9. The intelligent workflow dynamic construction method based on multimodal demonstration learning as described in claim 8, characterized in that, Retrieve matching processing strategies from the hierarchical strategy library, specifically including: Using interface exception types as search elements, a full-level parallel search mechanism is adopted to traverse each level of the hierarchical strategy library and provide the processing strategy that matches the search element in each level.
10. A device for dynamically constructing intelligent workflows based on multimodal demonstration learning, employing the method for dynamically constructing intelligent workflows based on multimodal demonstration learning as described in any one of claims 1-9, specifically comprising: The demonstration learning module is used to acquire the multimodal signals corresponding to each time frame in the demonstration data; Semantic analysis is performed on multimodal signals to determine intent steps and provide an initial workflow; and, based on the analysis of each intent step, target interface controls are identified. The workflow building module is used to generate location identifiers based on the control description information of the target interface controls; and to update the initial workflow by integrating the intent steps and location identifiers to form a running workflow. The perception optimization module is used to identify interface anomalies by scanning the screen in conjunction with the running workflow. Based on the interface anomaly identification results, the anomaly handling strategy is invoked and the running workflow is optimized; and the optimized running workflow is verified to construct a workflow based on multimodal demonstration learning.