Multimodal Awareness-Based Web Application Job Processing Methods, Devices, and Storage Media
By collecting multimodal contextual data from web application interfaces using multimodal perception technology, and combining this data with vector databases and large language models to generate job decision instructions, the problems of single interaction dimensions and ambiguous semantic references are solved, enabling precise job execution and resource optimization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA ELECTRONICS CLOUD DIGITAL INTELLIGENCE TECH CO LTD
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing web interaction technologies suffer from limited interaction dimensions, ambiguous semantic references, hallucinated tool calls by large models, and redundant context, which makes them poorly suited to the efficient interaction that complex enterprise applications require.
Multimodal context data is collected through a dual-mode trigger perception mechanism comprising a passive response mode and an active perception mode. This data is combined with a vector database and a large language model to generate job decision instructions, and the host web application's native functions are called to execute the job.
The approach achieves high-dimensional, accurate semantic resolution, suppresses reasoning hallucinations, optimizes resource consumption, and provides non-intrusive, robust job execution.
Figure CN122309086A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer interaction technology, specifically to a method, apparatus, device, and computer-readable storage medium for processing Web application jobs based on multimodal perception.
Background Technology
[0002] With the continuous iteration of Web technologies and the deepening of enterprise digital transformation, the application scenarios of enterprise-level applications are constantly expanding, covering multiple core areas such as cloud-native governance, digital office, and professional design tools. Their interaction logic is also becoming increasingly complex, making them a key carrier supporting efficient enterprise operations. These enterprise-level applications generally have three typical characteristics: First, extremely high parameter dimensions, with a single page often containing dozens of controlled form items, requiring the handling of multi-dimensional data input and validation; second, deep coupling of state logic, with UI presentation highly bound to complex front-end state machines, requiring coordinated responses from multiple modules for state changes; and third, extremely rich functional entry points, with the backend providing services through hundreds or thousands of OpenAPI interfaces, forming a vast functional service system.
[0003] Currently, many complex web applications are incorporating AI-assisted technologies to improve the interactive experience. However, existing web interaction and AI-assistance models still have significant shortcomings, chiefly ambiguous semantic reference and limited interaction dimensions. At the semantic interaction level, existing technologies largely rely on plain-text commands. When users use vague referring expressions such as "here" or "that button," the system cannot accurately locate the specific object of the operation, which easily leads to intent recognition failure. This, in turn, causes hallucinated tool calls by the large model and redundant context, reducing interaction efficiency and accuracy.
[0004] At the tool invocation level, when faced with a large set of functions exposed as OpenAPI interfaces, existing technologies often input all tool definitions into the large model at once. This not only makes the context window redundant and increases token consumption, but also lets irrelevant tools interfere with the model's inference, easily leading to a "tool overflow" hallucination, in which the model fabricates non-existent functions or parameters. This severely reduces the reliability of AI assistance and makes it difficult to meet the efficient interaction requirements of complex enterprise-level applications. It is therefore urgent to improve existing interaction and AI-assistance models to address these pain points and provide more accurate and efficient interaction support for complex web applications.
Summary of the Invention
[0005] This application provides a web application job processing method, apparatus, device, and computer-readable storage medium based on multimodal perception, which can solve the technical problems in the prior art of a single interaction dimension, ambiguous semantic reference, hallucinated tool calls by large models, and context redundancy.
[0006] In a first aspect, embodiments of this application provide a web application job processing method based on multimodal awareness, comprising:
receiving a user instruction and collecting multimodal context data of the current web application interface through a dual-mode trigger perception mechanism, wherein the dual-mode trigger perception mechanism includes a passive response mode and an active perception mode, and the multimodal context data includes at least text information, image information, and page structure information;
determining the task intent based on the user instruction, retrieving a general toolset from the vector database based on the task intent, and obtaining the native toolset actively injected by the host web application through a predefined protocol interface;
merging the general toolset with the native toolset to construct a dynamic toolset, and inputting the dynamic toolset and the multimodal context data into a large language model for reasoning to generate a task decision instruction, wherein the task decision instruction includes the target function to be called and its parameters;
calling, through the predefined protocol interface, the native function corresponding to the target function in the host web application to execute the job, and receiving the execution result.
[0007] In conjunction with the first aspect, in one implementation, the step of collecting multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism includes:
Passive response mode: in response to a user annotation operation, collecting image features and absolute screen coordinates of a specific area on the user interface, wherein the annotation operation includes drawing lines, box selection, or text annotation;
Active perception mode: in response to a perception command issued by the intent determination agent, collecting a semantic HTML snapshot of the current viewport, wherein the semantic HTML snapshot contains predefined business identification attributes.
[0008] In conjunction with the first aspect, in one implementation, in the active perception mode, the method further includes: sending the perception command to the front end via a two-way communication protocol, so that the front end silently collects cropped page area data according to the perception command and extracts DOM node information containing the data-testid attribute from the page area data as the page structure information.
[0009] In conjunction with the first aspect, in one implementation, obtaining the set of native tools actively injected by the host web application through a predefined protocol interface includes: When the host web application is detected to be in the initialization phase or when the page state changes, the native tool schema is registered through the standard interface of the window object. The native tool schema includes API call specifications and UI state reporting function definitions. The server-side scheduling layer selects a subset of native tools related to the current intent from the registered native tool schema based on the current page state, and uses this subset as the native tool set.
[0010] In conjunction with the first aspect, in one implementation, the step of calling the native function corresponding to the target function in the host web application through the predefined protocol interface to execute the job includes: The AI assistant plugin acts as a trusted proxy, executing the native functions directly within the context of the host web application, rather than simulating user interface interaction events. If the job decision instruction involves sensitive operations, an execution preview card will pop up before execution, and the API request will be suspended until a user confirmation instruction is received.
[0011] In conjunction with the first aspect, in one implementation, after receiving the execution result, the method further includes: Parse the execution result to obtain the task ID or status information; Based on the task ID or status information, the page automatically redirects to the results display page, forming a closed loop for the operation.
[0012] In conjunction with the first aspect, in one implementation, before determining the job intent based on the user instruction and retrieving a general toolset from the vector database based on the job intent, the method further includes: obtaining text knowledge vector data and tool function vector data, and storing them in the vector database, so that the vector database provides unified vectorized indexing of heterogeneous dual-source data. Obtaining the text knowledge vector data includes: obtaining the operation documents and FAQs of the Web platform, and performing text cleaning and semantic segmentation on them to generate the text knowledge vector data. Obtaining the tool function vector data includes: obtaining the OpenAPI interface specification document of the web application and performing text cleaning on it; and converting the business function data of each API in the cleaned OpenAPI interface specification document into function data vectors, which are stored in association with the corresponding FunctionSchema to generate the tool function vector data.
[0013] In a second aspect, embodiments of this application provide a web application job processing apparatus based on multimodal awareness, comprising:
a receiving and acquisition module, used to receive user instructions and acquire multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism, wherein the dual-mode trigger perception mechanism includes a passive response mode and an active perception mode, and the multimodal context data includes at least text information, image information, and page structure information;
a retrieval and acquisition module, used to determine the task intent according to the user instruction, retrieve a general toolset from the vector database based on the task intent, and obtain a native toolset actively injected by the host web application through a predefined protocol interface;
a generation module, used to merge the general toolset with the native toolset to construct a dynamic toolset, and input the dynamic toolset and the multimodal context data into the large language model for inference to generate a task decision instruction, wherein the task decision instruction includes the target function to be called and its parameters;
an execution module, used to call, through the predefined protocol interface, the native function corresponding to the target function in the host web application to execute the job, and to receive the execution result.
[0014] In a third aspect, embodiments of this application provide a Web application job processing device based on multimodal perception. The device includes a processor, a memory, and a Web application job processing program based on multimodal perception that is stored in the memory and executable by the processor. When the program is executed by the processor, it implements the steps of the Web application job processing method based on multimodal perception as described above.
[0015] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a multimodal-aware web application job processing program, wherein when the program is executed by a processor, it implements the steps of the multimodal-aware web application job processing method described above.
[0016] The beneficial effects of the technical solutions provided in this application include the following. The system receives user commands and collects multimodal context data of the current web application interface through a dual-mode trigger perception mechanism (including a passive response mode and an active perception mode), where the multimodal context data includes at least text information, image information, and page structure information. It determines the job intent from the user command, retrieves a general toolset from a vector database based on the job intent, and obtains a native toolset actively injected by the host web application through a predefined protocol interface. The general toolset and the native toolset are then merged to construct a dynamic toolset, which, along with the multimodal context data, is input into a large language model for inference to generate a job decision command that includes the target function to be called and its parameters. Finally, the system calls the native function corresponding to the target function in the host web application through the predefined protocol interface to execute the job and receives the execution result. This resolves the technical problems in related technologies of a single interaction dimension, ambiguous semantic reference, hallucinated tool calls, and context redundancy. It achieves high-dimensional, accurate semantic resolution, suppresses inference hallucinations, optimizes resource consumption, and provides non-intrusive, robust job execution.
Attached Figure Description
[0017] Figure 1 is a flowchart illustrating the first embodiment of the Web application job processing method based on multimodal awareness according to this application; Figure 2 is a schematic diagram of the functional modules of an embodiment of the Web application job processing device based on multimodal perception according to this application; Figure 3 is a schematic diagram of the hardware structure of a Web application job processing device based on multimodal perception involved in the embodiments of this application.
Detailed Implementation
[0018] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present application.
[0019] First, some of the technical terms used in this application will be explained to help those skilled in the art understand this application.
[0020] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.
[0021] In a first aspect, embodiments of this application provide a Web application job processing method based on multimodal awareness.
[0022] In one embodiment, referring to Figure 1, which is a flowchart of the first embodiment of the Web application job processing method based on multimodal awareness according to this application, the method includes:
Step S10: Receive user instructions and collect multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism. The dual-mode trigger perception mechanism includes a passive response mode and an active perception mode. The multimodal context data includes at least text information, image information, and page structure information.
As an example, upon receiving a command sent by the user, the dual-mode trigger perception mechanism collects the multimodal context data of the current web application interface. For instance, the passive response mode collects the text information, image information, and page structure information of the current web application interface, and the active perception mode collects the same kinds of information.
[0023] Specifically, the multimodal context data of the current web application interface is collected through the dual-mode trigger perception mechanism, including: passive response mode: in response to the user's annotation operation, image features and absolute screen coordinates of a specific area on the user interface are collected, and the annotation operation includes drawing lines, selecting boxes or text annotations; active perception mode: in response to the perception command issued by the intent determination agent, semantic HTML snapshots within the current viewport are collected, and the semantic HTML snapshots contain predefined business identification attributes.
[0024] As an example, the user manually triggers the "screenshot annotation" function in the front-end AI assistant interface. The front end activates a Canvas overlay, allowing the user to draw lines, select boxes, or add text annotations on specific areas within the viewport (such as an error message shown in red or a specific component). The front end collects the image features after annotation and records the absolute screen coordinates (x, y) of the annotated area, encapsulating them into a multimodal input package actively provided by the user.
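The packaging step in the passive response mode can be sketched as follows. This is a minimal illustration, not the application's implementation: the type names, field names, and the viewport-to-screen offset are all assumptions, and the image is represented only as a data URL string of the kind a browser canvas produces.

```typescript
// Hypothetical shapes: the source names none of these types or fields.
type OverlayAnnotationKind = "line" | "box" | "text";

interface OverlayPoint {
  x: number; // viewport (client) coordinates, as drawn on the overlay
  y: number;
}

interface OverlayAnnotation {
  kind: OverlayAnnotationKind;
  points: OverlayPoint[];
  note: string | null; // text typed by the user, if any
}

interface ViewportScreenOffset {
  // Assumed position of the viewport's top-left corner on the screen.
  left: number;
  top: number;
}

interface MultimodalInputPackage {
  userText: string;
  imageDataUrl: string; // e.g. the result of canvas.toDataURL() in a browser
  annotationKind: OverlayAnnotationKind;
  note: string | null;
  screenRegion: { x: number; y: number; width: number; height: number };
}

function buildInputPackage(
  userText: string,
  imageDataUrl: string,
  annotation: OverlayAnnotation,
  offset: ViewportScreenOffset
): MultimodalInputPackage {
  if (annotation.points.length === 0) {
    throw new Error("annotation contains no points");
  }
  const xs = annotation.points.map((p) => p.x);
  const ys = annotation.points.map((p) => p.y);
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  const maxX = Math.max(...xs);
  const maxY = Math.max(...ys);
  return {
    userText,
    imageDataUrl,
    annotationKind: annotation.kind,
    note: annotation.note,
    // Shift the annotation's bounding box from viewport space into screen space.
    screenRegion: {
      x: minX + offset.left,
      y: minY + offset.top,
      width: maxX - minX,
      height: maxY - minY,
    },
  };
}
```

Reducing every annotation kind to a bounding box keeps the package small; a fuller version might also keep the raw stroke points for line annotations.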
[0025] In the AI-driven active perception mode, the intent determination agent receives user text such as "How do I configure this?" and identifies that the intent contains an ambiguous reference or depends on context, which triggers collection of the multimodal context data of the current web application interface. The system sends a perception command to the front end via WebSocket, and the front end silently collects a cropped semantic HTML snapshot and extracts business tags such as data-testid as the multimodal context data of the current web application interface.
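The snapshot cropping described above can be sketched as a tree walk that keeps only nodes carrying a data-testid attribute and overlapping the viewport. This is a sketch under stated assumptions: `UiNode` is a hypothetical stand-in for a DOM element so that the example runs outside a browser, and the vertical-overlap test is one possible reading of "cropped".

```typescript
// Hypothetical stand-in for a DOM element; all names are illustrative.
interface UiNode {
  tag: string;
  attrs: { [key: string]: string };
  text: string;
  box: { top: number; bottom: number }; // vertical extent in page coordinates
  children: UiNode[];
}

interface SnapshotEntry {
  testId: string;
  tag: string;
  text: string;
}

// Collect nodes that carry data-testid and overlap the visible range.
function collectSemanticSnapshot(
  root: UiNode,
  viewTop: number,
  viewBottom: number
): SnapshotEntry[] {
  const entries: SnapshotEntry[] = [];
  const visit = (node: UiNode): void => {
    const visible = node.box.bottom > viewTop && node.box.top < viewBottom;
    const testId = node.attrs["data-testid"];
    if (visible && testId !== undefined) {
      entries.push({ testId: testId, tag: node.tag, text: node.text.trim() });
    }
    // Children are always visited: a tall container may start off-screen.
    node.children.forEach(visit);
  };
  visit(root);
  return entries;
}
```

In a real page the same selection could be made with `document.querySelectorAll("[data-testid]")` and `getBoundingClientRect()`, and the resulting entries serialized and returned over the WebSocket connection.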
[0026] Step S20: Determine the task intent based on the user instruction, retrieve a general toolset from the vector database based on the task intent, and obtain the native toolset actively injected by the host web application through a predefined protocol interface. As an example, the task intent is determined from the user's command, and a general toolset is retrieved from the vector database accordingly. Upon detecting that the host web application is in its initialization phase or that the page state has changed, a native tool schema is registered via the standard interface of the window object. This native tool schema includes API call specifications and UI state reporting function definitions. The server-side scheduling layer then selects, based on the current page state, the subset of the registered native tool schemas relevant to the current intent to form the native toolset. For instance, the host web application proactively injects a Function Schema containing API call specifications and UI state reporting functions into the AI assistant plugin via a standard protocol interface (such as `Window.registerAIAgentTools`). Upon receiving this schema, the AI assistant stores it in a local temporary buffer.
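A minimal sketch of the registration buffer and the page-state filter might look like the following. Only the name `registerAIAgentTools` comes from the text above; the schema fields, the `pageStates` tag, and the replace-on-same-name rule are assumptions made for illustration, and a plain object stands in for the browser `window`.

```typescript
// Hypothetical schema shape; the source does not specify the fields.
interface NativeToolSchema {
  toolName: string;
  description: string;
  kind: "api_call" | "ui_state_report";
  parameters: { [param: string]: { type: string; required: boolean } };
  pageStates: string[]; // assumed tag: page states in which the tool applies
}

// Stand-in for the methods the plugin would attach to `window`.
interface AgentToolHost {
  registerAIAgentTools: (schemas: NativeToolSchema[]) => void;
  bufferedTools: () => NativeToolSchema[];
}

function createAgentToolHost(): AgentToolHost {
  let buffer: NativeToolSchema[] = []; // the "local temporary buffer"
  return {
    registerAIAgentTools: (schemas) => {
      // Assumption: a later registration replaces an earlier one by name,
      // so re-registering on a page state change does not duplicate tools.
      const incoming = schemas.map((s) => s.toolName);
      buffer = buffer
        .filter((t) => incoming.indexOf(t.toolName) === -1)
        .concat(schemas);
    },
    bufferedTools: () => buffer.slice(),
  };
}

// Scheduling-layer filter: keep tools tagged for the current page state.
function selectNativeToolset(
  registered: NativeToolSchema[],
  pageState: string
): NativeToolSchema[] {
  return registered.filter((t) => t.pageStates.indexOf(pageState) !== -1);
}
```

This filter looks only at page state; matching against the current intent, which the text also mentions, would need an additional relevance check that is not sketched here.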
[0027] Before determining the task intent based on user instructions and retrieving a general toolset from the vector database based on the task intent, the process further includes: acquiring text knowledge vector data and tool function vector data, and storing them in the vector database, so that the vector database provides unified vectorized indexing of heterogeneous dual-source data. Acquiring the text knowledge vector data includes: obtaining the operation documentation and FAQs of the Web platform, and performing text cleaning and semantic segmentation on them to generate the text knowledge vector data. Acquiring the tool function vector data includes: obtaining the OpenAPI interface specification document of the Web application and performing text cleaning on it; and converting the business function data of each API in the cleaned OpenAPI interface specification document into function data vectors, which are stored in association with the corresponding FunctionSchema to generate the tool function vector data.
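The dual-source index can be illustrated with a toy in-memory version. This sketch makes strong simplifications that are not in the source: term-count vectors stand in for a real embedding model, an array stands in for the vector database, and the `OpenApiOperation` and `FunctionSchemaRecord` shapes are hypothetical reductions of an OpenAPI operation and a function schema.

```typescript
// All names are illustrative; a real system would call an embedding model
// and a vector database instead of the toy term-count vectors below.
interface OpenApiOperation {
  operationId: string;
  summary: string;
  parameters: { name: string; required: boolean; schemaType: string }[];
}

interface FunctionSchemaRecord {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: { [param: string]: { type: string } };
    required: string[];
  };
}

type TermVector = { [term: string]: number };

interface IndexedEntry {
  source: "knowledge" | "tool";
  text: string;
  vector: TermVector;
  schema: FunctionSchemaRecord | null; // set only for tool entries
}

function toFunctionSchema(op: OpenApiOperation): FunctionSchemaRecord {
  const properties: { [param: string]: { type: string } } = {};
  op.parameters.forEach((p) => {
    properties[p.name] = { type: p.schemaType };
  });
  return {
    name: op.operationId,
    description: op.summary,
    parameters: {
      type: "object",
      properties,
      required: op.parameters.filter((p) => p.required).map((p) => p.name),
    },
  };
}

function embedAsTermCounts(text: string): TermVector {
  const vector: TermVector = Object.create(null);
  text.toLowerCase().split(/[^a-z0-9]+/).forEach((term) => {
    if (term.length > 0) {
      vector[term] = (vector[term] || 0) + 1;
    }
  });
  return vector;
}

function cosineSimilarity(a: TermVector, b: TermVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  Object.keys(a).forEach((term) => {
    const av = a[term] || 0;
    normA += av * av;
    dot += av * (b[term] || 0);
  });
  Object.keys(b).forEach((term) => {
    const bv = b[term] || 0;
    normB += bv * bv;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class DualSourceIndex {
  private entries: IndexedEntry[] = [];

  addKnowledgeChunk(chunk: string): void {
    this.entries.push({ source: "knowledge", text: chunk, vector: embedAsTermCounts(chunk), schema: null });
  }

  addTool(op: OpenApiOperation): void {
    const text = op.operationId + " " + op.summary;
    this.entries.push({ source: "tool", text, vector: embedAsTermCounts(text), schema: toFunctionSchema(op) });
  }

  // Return the topK entries most similar to the query, from either source.
  search(query: string, topK: number): IndexedEntry[] {
    const q = embedAsTermCounts(query);
    return this.entries
      .map((entry) => ({ entry, score: cosineSimilarity(q, entry.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK)
      .map((scored) => scored.entry);
  }
}
```

Because documentation chunks and tool descriptions share one index, a single query can surface either a help passage or a callable function schema, which is the point of indexing the two sources together.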
[0028] Step S30: Merge the general toolset with the native toolset to construct a dynamic toolset, and input the dynamic toolset and the multimodal context data into the large language model for reasoning to generate a task decision instruction, wherein the task decision instruction includes the target function to be called and its parameters. As an example, when a user asks a question, the "native tools" are merged with the "general tools" retrieved from the vector database. The server-side scheduling layer dynamically selects the most relevant native tool schemas based on the current page state and injects them into the large language model context. The server receives the mixed multimodal input (user text, user-marked screenshots, annotation coordinates (x, y), and semantic snapshots) and uses the large language model for comprehensive reasoning. If the user's question involves a user-marked area, such as pointing to a selected "graphics card model" and asking "What is this?", the agent uses the image features and coordinates (x, y) to locate the corresponding DOM node ID on the front end and capture its business meaning. It then fills the corresponding slots, for example by automatically identifying the resource pool ID the user is referring to. Based on the injected toolset, the large language model determines whether to call an API. If required slots are missing, the server maintains a session state machine and issues follow-up questions until the conditions for the API call are met.
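The merge and the missing-slot check in step S30 can be sketched as below. The rule that a native definition overrides a general one with the same name, the rejection of tool names outside the dynamic toolset, and all type names are assumptions for illustration; the source does not state how conflicts or unknown names are handled.

```typescript
// Hypothetical shapes for a tool definition and a model-proposed call.
interface ToolDefinition {
  name: string;
  description: string;
  requiredParams: string[];
  origin: "general" | "native";
}

interface ProposedCall {
  toolName: string;
  args: { [param: string]: string };
}

type NextStep =
  | { kind: "ask_user"; missing: string[] }
  | { kind: "call_tool"; call: ProposedCall };

// Build the dynamic toolset. Assumption: on a name clash the native
// definition replaces the general one, since it reflects the live page.
function mergeToolsets(
  general: ToolDefinition[],
  native: ToolDefinition[]
): ToolDefinition[] {
  const merged: ToolDefinition[] = [];
  general.concat(native).forEach((tool) => {
    const existing = merged.map((t) => t.name).indexOf(tool.name);
    if (existing === -1) {
      merged.push(tool);
    } else {
      merged[existing] = tool;
    }
  });
  return merged;
}

// Decide whether the proposed call can run or needs a follow-up question.
function decideNextStep(call: ProposedCall, tools: ToolDefinition[]): NextStep {
  let tool: ToolDefinition | null = null;
  for (const candidate of tools) {
    if (candidate.name === call.toolName) {
      tool = candidate;
    }
  }
  if (tool === null) {
    // A name outside the dynamic toolset is treated as a fabricated tool.
    throw new Error("tool is not in the dynamic toolset: " + call.toolName);
  }
  const missing = tool.requiredParams.filter((param) => {
    const value = call.args[param];
    return value === undefined || value === "";
  });
  if (missing.length > 0) {
    return { kind: "ask_user", missing: missing };
  }
  return { kind: "call_tool", call: call };
}
```

A session state machine would call `decideNextStep` after each user reply, merging newly supplied values into `args` until the result is `call_tool`.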
[0029] Step S40: Through the predefined protocol interface, call the native function corresponding to the target function in the host web application to execute the job, and receive the execution result.
[0030] As an example, after the large model makes a decision, it issues `tool_calls`, specifying the host function name and parameters to be invoked. The AI assistant plugin, acting as a trusted proxy, executes the corresponding native function within the context of the host application. Once the host function completes execution, such as when the form is filled out or the API call is successful, the execution result is fed back to the AI assistant in real time, which then synchronizes the task progress with the user.
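The proxy dispatch described above can be sketched as a lookup from function name to a host-provided function. The message shape below is a hypothetical simplification (a name plus JSON-encoded arguments), and the error handling is an assumption; the source says only that `tool_calls` specify the host function name and parameters.

```typescript
// Hypothetical, simplified shape of one entry in the model's tool_calls.
interface ToolCallMessage {
  id: string;
  functionName: string;
  argumentsJson: string; // arguments encoded as a JSON object
}

type HostFunction = (args: { [param: string]: unknown }) => unknown;

interface ToolCallOutcome {
  id: string;
  ok: boolean;
  output: unknown;
  error: string | null;
}

// Run each call against functions the host application has exposed.
// Nothing here simulates clicks or keystrokes; functions are invoked directly.
function dispatchToolCalls(
  calls: ToolCallMessage[],
  hostFunctions: { [name: string]: HostFunction }
): ToolCallOutcome[] {
  return calls.map((call): ToolCallOutcome => {
    const known = Object.prototype.hasOwnProperty.call(hostFunctions, call.functionName);
    const fn = known ? hostFunctions[call.functionName] : undefined;
    if (fn === undefined) {
      return { id: call.id, ok: false, output: null, error: "unknown host function: " + call.functionName };
    }
    try {
      const args = JSON.parse(call.argumentsJson);
      return { id: call.id, ok: true, output: fn(args), error: null };
    } catch (err) {
      // Covers both malformed JSON and exceptions thrown by the host function.
      const message = err instanceof Error ? err.message : String(err);
      return { id: call.id, ok: false, output: null, error: message };
    }
  });
}
```

Returning a per-call outcome, rather than throwing, lets the assistant report partial progress to the user and feed failures back to the model.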
[0031] Specifically, after receiving the execution result, the process further includes: parsing the execution result to obtain the task ID or status information; and automatically redirecting the page to the result display page based on the task ID or status information, thus forming a closed loop in the job.
[0032] As an example, for sensitive operations such as resource changes, a preview card pops up on the front end. The system suspends the API request and waits for the user to perform a final manual check and click to confirm. After the API call succeeds, the system parses the returned task ID or status and drives the page to jump to the results display page, such as the billing details page or the training task monitoring page, closing the job loop.
[0033] In this embodiment, user instructions are received and multimodal context data of the current web application interface is collected through a dual-mode trigger perception mechanism, wherein the dual-mode trigger perception mechanism includes a passive response mode and an active perception mode, and the multimodal context data includes at least text information, image information, and page structure information. The task intent is determined based on the user instructions, a general toolset is retrieved from a vector database based on the task intent, and a native toolset actively injected by the host web application is obtained through a predefined protocol interface. The general toolset and the native toolset are merged to construct a dynamic toolset, and the dynamic toolset and the multimodal context data are input into a large language model for reasoning to generate a task decision instruction, wherein the task decision instruction includes the target function to be called and its parameters. The native function corresponding to the target function in the host web application is then called through the predefined protocol interface to execute the task, and the execution result is received. This method solves the technical problems in related technologies of a single interaction dimension, ambiguous semantic reference, hallucinated tool calls by large models, and context redundancy, achieving high-dimensional, accurate semantic resolution, suppressing reasoning hallucinations, optimizing resource consumption, and providing non-intrusive, robust task execution.
[0034] In a second aspect, embodiments of this application also provide a Web application job processing device based on multimodal perception.
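The suspend-until-confirmed behavior and the redirect can be sketched with a small gate object. Everything below is an assumption made for illustration: the `sensitive` flag, the outcome shape, and the `/tasks/...` paths are hypothetical, and a real front end would render the preview card and perform the navigation itself.

```typescript
// Hypothetical shapes; the source does not define these.
interface PendingOperation {
  functionName: string;
  args: { [param: string]: string };
  sensitive: boolean;
}

interface ExecutionOutcome {
  taskId: string | null;
  status: string | null;
}

type SubmitResult =
  | { state: "awaiting_confirmation" }
  | { state: "executed"; outcome: ExecutionOutcome };

class ConfirmationGate {
  private pending: PendingOperation | null = null;
  private runOperation: (op: PendingOperation) => ExecutionOutcome;

  constructor(runOperation: (op: PendingOperation) => ExecutionOutcome) {
    this.runOperation = runOperation;
  }

  // Sensitive operations are held back; others run immediately.
  submit(op: PendingOperation): SubmitResult {
    if (op.sensitive) {
      this.pending = op; // the front end would show the preview card here
      return { state: "awaiting_confirmation" };
    }
    return { state: "executed", outcome: this.runOperation(op) };
  }

  // Called when the user clicks confirm on the preview card.
  confirm(): ExecutionOutcome {
    const op = this.pending;
    if (op === null) {
      throw new Error("no operation is awaiting confirmation");
    }
    this.pending = null;
    return this.runOperation(op);
  }

  cancel(): void {
    this.pending = null;
  }
}

// Map an execution result to a results page. The paths are placeholders.
function resultPagePath(outcome: ExecutionOutcome): string | null {
  if (outcome.taskId !== null) {
    return "/tasks/" + encodeURIComponent(outcome.taskId);
  }
  if (outcome.status !== null) {
    return "/tasks?status=" + encodeURIComponent(outcome.status);
  }
  return null;
}
```

Holding the operation inside the gate, rather than sending it and rolling back, matches the description that the API request itself is suspended until the user confirms.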
[0035] In one embodiment, referring to Figure 2, which is a functional module diagram of an embodiment of the Web application job processing device based on multimodal perception according to this application, the device includes:
The receiving and acquisition module 10 is used to receive user instructions and acquire multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism. The dual-mode trigger perception mechanism includes a passive response mode and an active perception mode. The multimodal context data includes at least text information, image information, and page structure information.
The retrieval and acquisition module 20 is used to determine the task intent according to the user instruction, retrieve a general toolset from the vector database based on the task intent, and obtain a native toolset actively injected by the host web application through a predefined protocol interface.
The generation module 30 is used to merge the general toolset with the native toolset to construct a dynamic toolset, and input the dynamic toolset and the multimodal context data into the large language model for reasoning to generate a task decision instruction, wherein the task decision instruction includes the target function to be called and its parameters.
The execution module 40 is used to call, through the predefined protocol interface, the native function corresponding to the target function in the host web application to execute the job, and to receive the execution result.
[0036] Furthermore, in one embodiment, the receiving and acquisition module 10 is used for: Passive response mode: In response to user annotation operations, it collects image features and absolute screen coordinates of a specific area on the user interface. The annotation operations include drawing lines, selecting boxes, or text annotations. Active perception mode: In response to the perception command issued by the intent determination agent, a semantic HTML snapshot of the current viewport is collected. The semantic HTML snapshot contains predefined business identification attributes.
[0037] Furthermore, in one embodiment, the receiving and acquisition module 10 is used for: sending the perception command to the front end via a two-way communication protocol, so that the front end silently collects cropped page area data according to the perception command and extracts DOM node information containing the data-testid attribute from the page area data as the page structure information.
[0038] Furthermore, in one embodiment, the retrieval and acquisition module 20 is used for: When the host web application is detected to be in the initialization phase or when the page state changes, the native tool schema is registered through the standard interface of the window object. The native tool schema includes API call specifications and UI state reporting function definitions. The server-side scheduling layer selects a subset of native tools related to the current intent from the registered native tool schema based on the current page state, and uses this subset as the native tool set.
[0039] Furthermore, in one embodiment, the execution module 40 is used to: The AI assistant plugin acts as a trusted proxy, executing the native functions directly within the context of the host web application, rather than simulating user interface interaction events. If the job decision instruction involves sensitive operations, an execution preview card will pop up before execution, and the API request will be suspended until a user confirmation instruction is received.
[0040] Furthermore, in one embodiment, the multimodal awareness-based web application job processing apparatus further includes an additional module for: parsing the execution result to obtain the task ID or status information; and, based on the task ID or status information, automatically redirecting the page to the results display page, forming a closed loop for the operation.
[0041] Furthermore, in one embodiment, the multimodal awareness-based web application job processing apparatus further includes an additional module for: obtaining text knowledge vector data and tool function vector data, and storing them in a vector database, so that the vector database provides unified vectorized indexing of heterogeneous dual-source data. Obtaining the text knowledge vector data includes: obtaining the operation documents and FAQs of the Web platform, and performing text cleaning and semantic segmentation on them to generate the text knowledge vector data. Obtaining the tool function vector data includes: obtaining the OpenAPI interface specification document of the web application and performing text cleaning on it; and converting the business function data of each API in the cleaned OpenAPI interface specification document into function data vectors, which are stored in association with the corresponding FunctionSchema to generate the tool function vector data.
[0042] The functions of each module in the above-mentioned multimodal perception-based Web application job processing device correspond to the steps in the above-mentioned multimodal perception-based Web application job processing method embodiment, and their functions and implementation processes will not be described in detail here.
[0043] In a third aspect, embodiments of this application provide a Web application job processing device based on multimodal perception. The device can be a personal computer (PC), a laptop computer, a server, or another device with data processing capabilities.
[0044] Referring to Figure 3, which is a schematic diagram of the hardware structure of a web application job processing device based on multimodal perception as described in an embodiment of this application: in this embodiment, the device may include a processor, a memory, a communication interface, and a communication bus.
[0045] The communication bus can be of any type and is used to interconnect the processor, memory, and communication interface.
[0046] The communication interface includes input/output (I/O) interfaces, physical interfaces, and logical interfaces used for interconnecting components within the Web application job processing device based on multimodal perception, as well as interfaces used for interconnecting the device with other devices (such as other computing devices or user equipment). Physical interfaces can be Ethernet interfaces, fiber optic interfaces, ATM interfaces, etc.; user equipment can be displays, keyboards, etc.
[0047] Memory can be various types of storage media, such as random access memory (RAM), read-only memory (ROM), non-volatile RAM (NVRAM), flash memory, optical storage, hard disk, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), etc.
[0048] The processor can be a general-purpose processor, which can call the Web application job processing program based on multimodal perception stored in the memory and execute the Web application job processing method based on multimodal perception provided in the embodiments of this application. For example, the general-purpose processor can be a central processing unit (CPU). For the method executed when the program is called, refer to the embodiments of the Web application job processing method based on multimodal perception of this application; the details are not repeated here.
[0049] Those skilled in the art will understand that the hardware structure shown in Figure 3 does not constitute a limitation of this application; the device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
[0050] Fourthly, embodiments of this application also provide a computer-readable storage medium.
[0051] The computer-readable storage medium of this application stores a Web application job processing program based on multimodal perception, wherein the program, when executed by a processor, implements the steps of the Web application job processing method based on multimodal perception described above.
[0052] For the method implemented when the Web application job processing program based on multimodal perception is executed, refer to the embodiments of the Web application job processing method based on multimodal perception of this application; the details are not repeated here.
[0053] It should be noted that the sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0054] The terms "comprising" and "having," and any variations thereof, in the specification, claims, and accompanying drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to such process, method, system, product, or apparatus. The terms "first," "second," "third," and the like are used to distinguish different objects and do not indicate a sequence, nor do they require the "first," "second," and "third" objects to be of different types.
[0055] In the description of the embodiments of this application, terms such as "exemplary," "for example," or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design described as "exemplary," "for example," or "for instance" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of terms such as "exemplary," "for example," or "for instance" is intended to present the relevant concepts in a concrete manner.
[0056] In the description of the embodiments of this application, unless otherwise stated, " / " means "or". For example, A / B can mean A or B. The "and / or" in the text is merely a description of the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of this application, "multiple" means two or more.
[0057] Some processes described in the embodiments of this application include multiple operations or steps presented in a specific order. However, it should be understood that these operations or steps need not be executed in the order in which they appear in the embodiments of this application, and may be executed in parallel. The sequence numbers of the operations are used only to distinguish different operations and do not themselves represent any execution order. In addition, these processes may include more or fewer operations, and these operations or steps may be executed sequentially or in parallel, or may be combined.
[0058] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform. Of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device to execute the methods described in the various embodiments of this application.
[0059] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A Web application job processing method based on multimodal perception, characterized in that the method comprises:
receiving a user instruction, and collecting multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism, wherein the dual-mode trigger perception mechanism includes a passive response mode and an active perception mode, and the multimodal context data includes at least text information, image information, and page structure information;
determining a task intent based on the user instruction, retrieving a general toolset from a vector database based on the task intent, and obtaining a native toolset actively injected by a host Web application through a predefined protocol interface;
merging the general toolset with the native toolset to construct a dynamic toolset, and inputting the dynamic toolset and the multimodal context data into a large language model for reasoning to generate a job decision instruction, wherein the job decision instruction includes a target function to be called and its parameters; and
calling, through the predefined protocol interface, the native function corresponding to the target function in the host Web application to execute the job, and receiving an execution result.
2. The Web application job processing method based on multimodal perception according to claim 1, characterized in that collecting the multimodal context data of the current Web application interface through the dual-mode trigger perception mechanism comprises:
in the passive response mode, in response to a user annotation operation, collecting image features and absolute screen coordinates of a specific area of the user interface, wherein the annotation operation includes drawing lines, box selection, or text annotation; and
in the active perception mode, in response to a perception instruction issued by an intent determination agent, collecting a semantic HTML snapshot of the current viewport, wherein the semantic HTML snapshot contains predefined business identification attributes.
3. The Web application job processing method based on multimodal perception according to claim 2, characterized in that, in the active perception mode, the method further comprises:
sending the perception instruction to the front end via a two-way communication protocol, so that the front end silently collects cropped page area data according to the perception instruction and extracts, from the page area data, DOM node information containing the data-testid attribute as the page structure information.
4. The Web application job processing method based on multimodal perception according to claim 1, characterized in that obtaining the native toolset actively injected by the host Web application through the predefined protocol interface comprises:
when the host Web application is detected to be in its initialization phase or its page state changes, registering a native tool schema through a standard interface of the window object, wherein the native tool schema includes API call specifications and UI state reporting function definitions; and
selecting, by a server-side scheduling layer and based on the current page state, a subset of native tools related to the current intent from the registered native tool schema, and using the subset as the native toolset.
5. The Web application job processing method based on multimodal perception according to claim 1, characterized in that calling, through the predefined protocol interface, the native function corresponding to the target function in the host Web application to execute the job comprises:
executing, by an AI assistant plugin acting as a trusted proxy, the native function directly within the context of the host Web application rather than simulating user interface interaction events; and
if the job decision instruction involves a sensitive operation, displaying an execution preview card before execution and suspending the API request until a user confirmation instruction is received.
6. The Web application job processing method based on multimodal perception according to claim 1, characterized in that, after receiving the execution result, the method further comprises:
parsing the execution result to obtain a task ID or status information; and
automatically redirecting, based on the task ID or status information, to a result display page, thereby closing the loop of the job.
7. The Web application job processing method based on multimodal perception according to claim 1, characterized in that, before determining the task intent based on the user instruction and retrieving the general toolset from the vector database based on the task intent, the method further comprises:
storing acquired text knowledge vector data and tool function vector data in the vector database, so that the vector database provides unified vectorized indexing of heterogeneous dual-source data;
wherein acquiring the text knowledge vector data comprises: obtaining the operation documents and FAQs of the Web platform, and performing text cleaning and semantic segmentation on them to generate the text knowledge vector data; and
wherein acquiring the tool function vector data comprises: obtaining the OpenAPI interface specification document of the Web application and performing text cleaning on it; and converting the business function data of each API in the cleaned OpenAPI interface specification document into a function data vector, which is stored in association with the corresponding Function Schema to generate the tool function vector data.
8. A Web application job processing device based on multimodal perception, characterized in that the device comprises:
a receiving and acquisition module, configured to receive a user instruction and collect multimodal context data of the current Web application interface through a dual-mode trigger perception mechanism, wherein the dual-mode trigger perception mechanism includes a passive response mode and an active perception mode, and the multimodal context data includes at least text information, image information, and page structure information;
a retrieval and acquisition module, configured to determine a task intent based on the user instruction, retrieve a general toolset from a vector database based on the task intent, and obtain a native toolset actively injected by a host Web application through a predefined protocol interface;
a generation module, configured to merge the general toolset with the native toolset to construct a dynamic toolset, and input the dynamic toolset and the multimodal context data into a large language model for reasoning to generate a job decision instruction, wherein the job decision instruction includes a target function to be called and its parameters; and
an execution module, configured to call, through the predefined protocol interface, the native function corresponding to the target function in the host Web application to execute the job, and to receive an execution result.
9. A Web application job processing device based on multimodal perception, characterized in that the device comprises a processor, a memory, and a Web application job processing program based on multimodal perception that is stored in the memory and executable by the processor, wherein the program, when executed by the processor, implements the steps of the Web application job processing method based on multimodal perception according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a Web application job processing program based on multimodal perception, wherein the program, when executed by a processor, implements the steps of the Web application job processing method based on multimodal perception according to any one of claims 1 to 7.
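For illustration only, the toolset merging of claim 1 and the confirmation gate of claim 5 can be read together as a small control flow, sketched below in TypeScript. Nothing here is defined by the claims: the names ToolSchema, NativeFunction, JobDecision, ExecutionOutcome, mergeToolsets, and executeDecision are hypothetical, the rule that a native tool overrides a general tool of the same name is an assumption, and the host Web application's native functions are modeled as a plain Map rather than a real protocol interface. The large language model is left out; the sketch starts from a decision it would have produced.

```typescript
// Hypothetical description of one tool as presented to the large language model.
interface ToolSchema {
  name: string;
  description: string;
  source: "general" | "native";
  sensitive: boolean;
}

// Hypothetical native function exposed by the host Web application.
type NativeFunction = (params: Record<string, unknown>) => unknown;

// Hypothetical shape of a job decision instruction: target function plus parameters.
interface JobDecision {
  targetFunction: string;
  parameters: Record<string, unknown>;
}

type ExecutionOutcome =
  | { status: "executed"; result: unknown }
  | { status: "pending_confirmation"; preview: string }
  | { status: "rejected"; reason: string };

// Merge the retrieved general toolset with the injected native toolset.
// Assumption: on a name clash the native entry wins, because it reflects the live page.
function mergeToolsets(general: ToolSchema[], native: ToolSchema[]): ToolSchema[] {
  const byName = new Map<string, ToolSchema>();
  for (const tool of general) byName.set(tool.name, tool);
  for (const tool of native) byName.set(tool.name, tool);
  return Array.from(byName.values());
}

// Execute a decision by calling the host's native function directly,
// holding sensitive operations until the user has confirmed them.
function executeDecision(
  toolset: ToolSchema[],
  hostFunctions: Map<string, NativeFunction>,
  decision: JobDecision,
  userConfirmed: boolean = false
): ExecutionOutcome {
  const schema = toolset.find((tool) => tool.name === decision.targetFunction);
  if (schema === undefined) {
    // The model named a function that was never offered to it.
    return { status: "rejected", reason: "target function is not in the dynamic toolset" };
  }
  const nativeFn = hostFunctions.get(decision.targetFunction);
  if (nativeFn === undefined) {
    return { status: "rejected", reason: "host exposes no matching native function" };
  }
  if (schema.sensitive && !userConfirmed) {
    // Stand-in for the execution preview card: describe the call, do not run it.
    const preview = `${schema.name}(${JSON.stringify(decision.parameters)})`;
    return { status: "pending_confirmation", preview };
  }
  return { status: "executed", result: nativeFn(decision.parameters) };
}
```

Under these assumptions, a decision targeting a sensitive tool first comes back as pending_confirmation with a preview string, and only a second call with userConfirmed set to true reaches the native function. Checking the target against the dynamic toolset before any call is one simple way to keep a mistaken function name from reaching the host.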