A method, system, storage medium, and electronic device for edge intent recognition and atomized task planning based on multimodal fusion.
By employing a multimodal fusion-based edge intent recognition method, the problems of low intent recognition accuracy and cumbersome operation of intelligent interactive devices in educational classroom scenarios have been solved. This method enables low-latency real-time interaction and efficient automated task execution, thereby improving teaching efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU LANGO ELECTRONICS TECH CO LTD
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-30
Figure CN122308609A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of human-computer interaction technology, and more particularly to an edge intent recognition and atomized task planning method, system, storage medium, and electronic device based on multimodal fusion.
Background Technology
[0002] In the field of intelligent interactive devices, user intent recognition and task execution are the core links to achieve natural interaction. Currently, there are three main technical solutions in the industry, each adapted to different interaction scenario requirements.
[0003] The first type is the single-modal voice interaction solution, typically represented by smart speakers and voice assistants. This solution relies solely on voice input to recognize user intent: the user speaks a command, and the device interprets it through speech recognition and natural language understanding and executes the corresponding action. Typical examples include Amazon Alexa, Apple Siri, and Baidu Xiaodu, which are widely used in lightweight smart interaction scenarios. The second type is the single-modal visual interaction solution, centered on gesture recognition and eye tracking. The device uses a camera to collect visual information such as the user's gestures, gaze, and facial expressions, then recognizes the user's intent and executes the corresponding action. Typical applications include Leap Motion gesture control and Tobii eye tracking, which suit scenarios where voice input is inconvenient. The third type is the rule-based task execution solution, which most existing smart interaction devices use. Its core is a fixed one-to-one mapping between user commands and device operations, established through pre-programmed rules or preset command mapping tables. One command can trigger only one predefined operation, so the solution cannot automatically plan and execute complex, multi-step tasks.
[0004] The aforementioned existing technical solutions have several shortcomings when applied in multimodal interaction scenarios such as educational classrooms, making it difficult to meet the actual interaction needs of these scenarios. First, single-modal interaction information is incomplete, resulting in low intent recognition accuracy. In actual interaction, users often express their intent through multiple modalities at once. For example, a teacher might say "open the courseware" while pointing to a specific area on the screen and showing a focused facial expression. Single-modal interaction solutions capture only one modality of information, which easily leads to misinterpretation of intent. For instance, pure voice interaction solutions cannot reliably distinguish whether a teacher is speaking to the whole class or issuing an operation command to the interactive flat panel display (IFPD); pure visual interaction solutions cannot understand the semantic content of the user's speech, making it difficult to capture the user's core intent. The fundamental reason is that existing solutions cannot fuse and analyze multimodal data, and therefore cannot integrate information from different modalities to recognize intent accurately.
[0005] Secondly, the lack of automated task breakdown capabilities means complex operations require multiple manual input steps. Educational classroom scenarios present numerous complex operational needs, such as "projecting the third page of a math lesson onto students' screens and highlighting key points." This operation requires sequentially opening the lesson, navigating to the third page, calling the projection interface, and launching the annotation tool, among other steps. Current solutions require users to input the corresponding instructions for each step, resulting in a cumbersome process that not only increases the user's workload but also severely impacts teaching efficiency, failing to meet the fast-paced demands of classroom teaching. The root cause lies in the current solutions' lack of task planning capabilities to automatically break down complex user intentions into an ordered sequence of executable instructions.
[0006] Thirdly, edge computing power is limited, and large model inference suffers from high latency. Existing large multimodal models, such as GPT-4V and Gemini, have a huge number of parameters, making real-time operation impossible on edge smart interactive devices. Relying on the cloud for inference brings problems such as high network latency, unavailability during network outages, and data privacy leaks. Simple rule engines and small models, limited by their own performance, cannot handle complex multimodal fusion inference tasks, making it difficult to balance inference accuracy and real-time performance. The root cause is the lack of lightweight multimodal inference models that are optimized for edge devices and adapted to the computing power available at the edge.
[0007] Finally, interference from invalid interaction signals is severe. In educational classroom environments, numerous invalid interaction signals exist, such as teachers' verbal interjections (e.g., "um," "that," "I mean," etc.), unintentional gestures (e.g., wiping sweat, adjusting clothing), and unintentional gaze deviations (e.g., looking at students instead of the screen). These invalid signals are easily misinterpreted as valid interaction commands by existing technologies, leading to erroneous device triggering and disrupting teaching order and the user experience. The root cause lies in the lack of invalid interaction filtering capabilities based on multimodal information in existing solutions, making it impossible to effectively distinguish between valid interaction signals and invalid interference signals.
[0008] In summary, existing user intent recognition and task execution technologies for intelligent interactive devices suffer from problems such as low intent recognition accuracy, cumbersome complex operations, high edge inference latency, and severe interference from invalid signals in multimodal interaction scenarios such as educational classrooms. There is an urgent need for a technology that can adapt to multimodal interaction scenarios and balance accuracy and real-time performance to address the shortcomings of the existing technologies.
Summary of the Invention
[0009] The purpose of this invention is to overcome the aforementioned shortcomings of the prior art and provide an edge-end intent recognition and atomized task planning method, system, storage medium, and electronic device based on multimodal fusion, to solve the problems of incomplete single-modal interaction information, lack of automatic task decomposition capability, high edge-side inference latency, and severe interference from invalid interaction signals. The technical solution adopted by this invention is as follows: In a first aspect of the present invention, an edge intent recognition and atomized task planning method based on multimodal fusion is provided, comprising the following steps:
S1. Simultaneously collect multimodal data from the user, including visual modal data, voice modal data, and perceptual modal data;
S2. Perform time alignment and feature-level fusion on the collected multimodal data to generate a unified multimodal feature representation;
S3. Filter based on the multimodal feature representation to remove invalid interaction signals;
S4. Input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent;
S5. Match the identified user operation intent with a predefined atomic instruction library, and break down the complex intent into an ordered sequence of atomic instructions.
[0010] Preferably, the visual modal data includes eye-tracking data, gesture recognition data, and emotion recognition data.
[0011] Preferably, the perceptual modal data includes spatial positioning data and millimeter-wave radar gesture data.
[0012] Preferably, the multimodal data fusion step includes: performing frame-level time alignment based on the timestamps of each modality; extracting modal feature vectors for each modality; calculating the correlation weights between each modality through a cross-modal attention mechanism; and performing weighted fusion based on the weights.
[0013] Preferably, the multimodal large model is obtained from a general large model through knowledge distillation technology, and the number of parameters after distillation does not exceed 0.5B. The model includes a visual encoder, a speech encoder, a perceptual encoder, a cross-modal fusion layer, and an intent decoder.
[0014] Preferably, the filtering steps for invalid interaction signals include rule engine filtering based on preset rules and model filtering based on trained classification models.
[0015] Preferably, each atomic instruction in the atomic instruction library includes an instruction identifier, an instruction name, an instruction parameter list, a precondition list, and a postcondition description.
[0016] Preferably, the specific steps of decomposing complex intents into ordered atomic instruction sequences include: semantic matching with an atomic instruction library to determine the required set of atomic instructions; constructing a directed acyclic graph based on instruction dependencies; and performing topological sorting on the directed acyclic graph to generate an ordered atomic instruction sequence.
[0017] Preferably, all inference computations of the method are performed on the edge device, and critical computations are offloaded to the NPU for execution, with an end-to-end latency of no more than 500ms.
[0018] In a second aspect of the present invention, an edge-end intent recognition and atomized task planning system based on multimodal fusion is provided for implementing the aforementioned edge-end intent recognition and atomized task planning method based on multimodal fusion, comprising the following modules:
a multimodal data acquisition module, used to collect multimodal data from users, including visual modal data, voice modal data, and perceptual modal data;
a multimodal data fusion module, used to perform time alignment and feature-level fusion on the collected multimodal data to generate a unified multimodal feature representation;
an invalid interaction filtering module, used to filter and remove invalid interaction signals based on the multimodal feature representation;
an intent recognition module, used to input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent;
an atomic task planning and scheduling module, used to match the identified user operation intent with a predefined atomic instruction library, decompose complex intents into an ordered sequence of atomic instructions, and schedule the execution of each atomic instruction in the order of the atomic instruction sequence.
[0019] In a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, the computer program being executed by a processor to implement the steps of the edge intent recognition and atomized task planning method based on multimodal fusion described above.
[0020] In a fourth aspect of the invention, an electronic device is provided, comprising: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the edge intent recognition and atomized task planning method based on multimodal fusion as described above.
[0021] Compared with the prior art, the present invention has the following significant advantages: 1. Multimodal fusion improves the accuracy of intent recognition: By using feature-level fusion of visual, speech and perception data and cross-modal attention mechanism, the intent recognition accuracy is improved by more than 30% compared with single-modal solution, effectively solving the misjudgment problem caused by incomplete single-modal information.
[0022] 2. Automatic atomized task decomposition, triggering multiple steps with a single interaction: Complex user intentions are automatically decomposed into an ordered sequence of atomic instructions. A single natural user interaction can trigger multiple automated steps, significantly reducing user operation steps and improving teaching efficiency.
[0023] 3. Lightweight multimodal model on the edge with low latency and availability even when offline: The general large model is compressed to 0.5B parameters through knowledge distillation, and all inference is completed at the edge. The end-to-end latency is no more than 500ms, with no network dependency, ensuring data privacy and availability even when offline.
[0024] 4. Multimodal invalid interaction filtering to reduce the false trigger rate: Through the two-layer filtering mechanism of a rule engine and a lightweight classification model, invalid interaction signals such as interjections and unintentional gestures are effectively filtered, reducing the false trigger rate by more than 60%.
Description of the Drawings
[0025] Figure 1 is a flowchart of the edge intent recognition and atomized task planning method based on multimodal fusion in a specific embodiment of the present invention;
Figure 2 is a diagram of the multimodal fusion architecture in a specific embodiment of the present invention;
Figure 3 is a structural diagram of the fusion model in a specific embodiment of the present invention;
Figure 4 is a structural diagram of the multimodal data fusion model in a specific embodiment of the present invention;
Figure 5 is a flowchart of atomic task planning in a specific embodiment of the present invention.
Detailed Implementation
[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0027] This embodiment takes multimodal fusion intent recognition and atomized task planning for an educational IFPD as the implementation scenario. As shown in Figure 1, in a first aspect of the present invention, an edge-end intent recognition and atomized task planning method based on multimodal fusion is provided, the specific implementation steps of which are as follows:
S1. Simultaneously collect multimodal data from the user, including visual modal data, speech modal data, and perceptual modal data. The three modalities are acquired in parallel in real time. The visual modal data includes gaze tracking data, gesture recognition data, and emotion recognition data, while the perceptual modal data includes spatial positioning data and millimeter-wave radar gesture data.
Visual modality: Visual data is acquired through the IFPD camera (30 fps, 1080p) and processed by three parallel visual algorithms: a gaze tracking algorithm (based on pupil detection and gaze point prediction), a gesture recognition algorithm (based on MediaPipe Hands), and an emotion recognition algorithm (based on a lightweight facial expression classification model that identifies 6 basic emotions).
[0028] Speech modality: Audio data is collected through a 6-channel microphone array, processed by an acoustic front end, and then sent to a streaming speech recognition engine to output a sequence of text segments.
[0029] Perception modality: Gesture spatial information (distance, speed, angle) is collected via millimeter-wave radar (60GHz FMCW), and the spatial positioning module (fusion of vision, radar and IMU) outputs the user's three-dimensional coordinates.
[0030] All modal data are accompanied by high-precision timestamps for subsequent time alignment.
[0031] S2. Perform time alignment and feature-level fusion on the acquired multimodal data to generate a unified multimodal feature representation. As shown in Figure 2, the process involves frame-level time alignment based on the timestamps of each modality's data; extraction of modal feature vectors for each modality's data; calculation of correlation weights between modalities using a cross-modal attention mechanism; and weighted fusion based on these weights. The multimodal large model is obtained from a general large model through knowledge distillation, with the number of parameters after distillation not exceeding 0.5B, and includes a visual encoder, a speech encoder, a perceptual encoder, a cross-modal fusion layer, and an intent decoder. The steps are detailed below:
S2.1 Time Alignment: Using the speech modality as the primary time axis, the visual and perceptual modality data are aligned to the speech time axis. For each frame of speech data, the nearest visual frame and the nearest perceptual data point within 100 ms before or after it are found, forming a multimodal data triplet.
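The nearest-neighbour alignment of S2.1 can be sketched as follows. This is a minimal illustration, assuming millisecond timestamps that are already sorted per modality. The function names and the choice to record `None` when no sample falls inside the 100 ms window are assumptions, since the description does not say how a missing modality is handled.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

WINDOW_MS = 100  # +/- 100 ms search window from S2.1


def nearest_within(timestamps: List[int], t: int, window: int = WINDOW_MS) -> Optional[int]:
    """Return the timestamp nearest to t within +/- window, or None."""
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    return timestamps[best] if abs(timestamps[best] - t) <= window else None


def align(speech_ts: List[int], visual_ts: List[int],
          percept_ts: List[int]) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Build (speech, visual, perceptual) triplets on the speech time axis."""
    return [(t, nearest_within(visual_ts, t), nearest_within(percept_ts, t))
            for t in speech_ts]


triplets = align([0, 40, 500], [0, 33, 67, 100], [10, 60])
```

Here the speech frame at 500 ms has no visual or perceptual sample within 100 ms, so its triplet carries `None` for both; a real pipeline would have to decide whether to drop or impute such frames.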
[0032] S2.2 Modal Feature Extraction: Gaze data is extracted as a 4-dimensional vector (horizontal angle, vertical angle, gaze duration, pupil diameter change rate); gesture data is extracted as a 63-dimensional vector (21 key points × 3-dimensional coordinates); emotion data is extracted as a 6-dimensional one-hot encoding. Speech features are extracted as a 256-dimensional semantic vector. Radar gestures are extracted as a 32-dimensional vector; spatial location is extracted as 3-dimensional coordinates and 3-dimensional velocity.
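As a rough illustration of the feature layout in S2.2, the sketch below records the stated dimensions and builds the two vectors whose construction the text spells out (21 key points × 3 coordinates, and a 6-way one-hot emotion code). The dictionary keys and helper names are illustrative assumptions, and the emotion categories are referred to by index because the description does not name them.

```python
from typing import Dict, List, Sequence, Tuple

# Dimensions as stated in S2.2; "spatial" is 3-D coordinates plus 3-D velocity.
FEATURE_DIMS: Dict[str, int] = {
    "gaze": 4, "gesture": 63, "emotion": 6,
    "speech": 256, "radar": 32, "spatial": 6,
}


def flatten_keypoints(keypoints: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Flatten 21 hand key points (x, y, z) into a 63-dimensional vector."""
    if len(keypoints) != 21:
        raise ValueError("expected 21 hand key points")
    return [coord for point in keypoints for coord in point]


def one_hot(index: int, size: int = 6) -> List[float]:
    """Encode an emotion class index as a one-hot vector."""
    if not 0 <= index < size:
        raise ValueError("emotion index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def validate(features: Dict[str, List[float]]) -> None:
    """Check that every modality vector has the dimension stated in S2.2."""
    for name, dim in FEATURE_DIMS.items():
        if len(features[name]) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(features[name])}")


gesture_vector = flatten_keypoints([(0.1, 0.2, 0.3)] * 21)
emotion_vector = one_hot(2)
```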
[0033] S2.3 Cross-modal attention fusion: A multi-head cross-attention mechanism (4 heads, 256 hidden dimensions) is adopted. Speech semantic features are used as the query, and visual and perceptual features are concatenated as the key and value to calculate cross-modal attention weights. The weighted output is residually connected with the speech features and layer normalized to generate a unified 256-dimensional multimodal feature representation.
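The fusion step in S2.3 can be sketched in plain Python as below. Two simplifications are assumptions rather than part of the description: the visual and perceptual features are taken to be already projected into 256-dimensional tokens, and the learned query, key, value, and output projections of a real multi-head attention layer are replaced by identity mappings so that only the head splitting, scaled dot-product weighting, residual connection, and layer normalization remain visible.

```python
import math
import random
from typing import List, Tuple

D_MODEL = 256                 # hidden dimension from S2.3
N_HEADS = 4                   # number of attention heads from S2.3
D_HEAD = D_MODEL // N_HEADS


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def cross_attention(query: List[float],
                    context: List[List[float]]) -> Tuple[List[float], List[List[float]]]:
    """Speech vector attends over context tokens; returns (fused vector, per-head weights)."""
    attended: List[float] = []
    weights_per_head: List[List[float]] = []
    for h in range(N_HEADS):
        lo, hi = h * D_HEAD, (h + 1) * D_HEAD
        q = query[lo:hi]
        # Scaled dot-product score between the speech query and each context token.
        scores = [sum(a * b for a, b in zip(q, tok[lo:hi])) / math.sqrt(D_HEAD)
                  for tok in context]
        w = softmax(scores)
        weights_per_head.append(w)
        attended.extend(sum(w[i] * context[i][lo + j] for i in range(len(context)))
                        for j in range(D_HEAD))
    # Residual connection with the speech features, then layer normalization.
    fused = layer_norm([a + b for a, b in zip(query, attended)])
    return fused, weights_per_head


random.seed(0)
speech = [random.uniform(-1, 1) for _ in range(D_MODEL)]
tokens = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(2)]  # visual, perceptual
fused, weights = cross_attention(speech, tokens)
```

Each head produces one weight per context token, and those weights are the "correlation weights between modalities" in the sense that they express how much the visual and perceptual tokens contribute to the fused 256-dimensional output.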
[0034] S3: Filtering based on the multimodal feature representation to remove invalid interaction signals, specifically including: The filtering steps for invalid interaction signals include rule engine filtering based on preset rules and model filtering based on trained classification models.
[0035] As shown in Figure 3, a two-layer filtering mechanism is set up. The first-layer rule engine uses a preset list of 50 interjections: when the voice text contains only interjections and no command verbs, it is marked as invalid. The rule engine also presets invalid gesture patterns, such as continuous small shaking or quickly moving off the screen; when one of these is matched, the visual modality is marked as invalid.
[0036] The second-layer classification model is based on a multimodal binary classification model (3-layer Transformer encoder, approximately 20M parameters). The input is multimodal feature representation, and the output is the probability of valid interactions. The training data contains 50,000 valid interaction samples and 30,000 invalid interaction samples. Interaction events with a valid probability below a threshold of 0.6 are filtered out.
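A minimal sketch of how the two filtering layers could be chained is shown below. The interjection and command-verb sets are small illustrative stand-ins (the 50-entry list is not reproduced in the description), gesture-pattern rules are omitted, and the second-layer classifier is represented only by the probability it would output; all names are hypothetical.

```python
from typing import List

INTERJECTIONS = {"um", "uh", "that"}                      # illustrative subset, not the preset list
COMMAND_VERBS = {"open", "project", "highlight", "close"}  # hypothetical verb list
VALID_THRESHOLD = 0.6                                     # threshold stated for the second layer


def tokenize(text: str) -> List[str]:
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [t for t in tokens if t]


def speech_is_filler(text: str) -> bool:
    """First layer: text made only of interjections, with no command verb."""
    tokens = tokenize(text)
    if not tokens:
        return False  # assumption: empty speech is left to the second layer
    only_fillers = all(t in INTERJECTIONS for t in tokens)
    has_verb = any(t in COMMAND_VERBS for t in tokens)
    return only_fillers and not has_verb


def is_valid_interaction(text: str, valid_probability: float) -> bool:
    """Apply the rule layer first, then the classifier threshold."""
    if speech_is_filler(text):
        return False
    return valid_probability >= VALID_THRESHOLD
```

Running the cheap rule layer first means the classifier only has to score events that the rules could not reject outright.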
[0037] S4: Input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent. Specifically, the filtered multimodal feature representation is input into the multimodal large model on the edge side for fusion inference. As shown in Figure 4, the multimodal large model includes a visual encoder, a speech encoder, a perceptual encoder, a cross-modal fusion layer, and an intent decoder.
[0038] The visual encoder consists of 3 layers of ViT, with a 256-dimensional output; the speech encoder consists of 3 layers of Transformer, with a 256-dimensional output; the perceptual encoder consists of 2 layers, one fully connected and one attention-based, with a 128-dimensional output; the cross-modal fusion layer consists of 6 layers of cross-modal Transformer; and the intent decoder consists of a linear classification head, which outputs the probability distribution of 100 intent categories for educational scenarios.
[0039] The multimodal large model has a total parameter count of 0.5B, obtained from a general multimodal large model through knowledge distillation. The distillation dataset contains 100,000 multimodal interaction samples. The model is deployed on an MTK Genio 520 NPU after INT8 quantization, with a single inference latency of no more than 300ms.
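To put the deployment figures in perspective, the arithmetic below estimates the weight storage of a 0.5B-parameter model at FP32 and INT8 precision and the time left in the 500 ms end-to-end budget after a 300 ms inference. It is a back-of-the-envelope sketch: it counts weights only (no activations, quantization scales, or runtime overhead) and treats both latency figures as upper bounds.

```python
PARAMS = 0.5e9               # parameter count upper bound
BYTES_FP32 = 4               # bytes per weight before quantization
BYTES_INT8 = 1               # bytes per weight after INT8 quantization

fp32_gb = PARAMS * BYTES_FP32 / 1e9   # weight storage in GB at FP32
int8_gb = PARAMS * BYTES_INT8 / 1e9   # weight storage in GB at INT8

END_TO_END_BUDGET_MS = 500   # end-to-end latency bound
INFERENCE_BUDGET_MS = 300    # single-inference latency bound
remaining_ms = END_TO_END_BUDGET_MS - INFERENCE_BUDGET_MS

print(f"FP32 weights: {fp32_gb:.1f} GB, INT8 weights: {int8_gb:.1f} GB")
print(f"Budget left for acquisition, fusion, filtering and planning: {remaining_ms} ms")
```

Under these assumptions, INT8 quantization cuts weight storage from about 2 GB to about 0.5 GB, and the stages other than model inference share roughly 200 ms.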
[0040] S5: Match the identified user operation intent with a predefined atomic instruction library, break down the complex intent into an ordered sequence of atomic instructions, and schedule the execution of each atomic instruction according to the order of the atomic instruction sequence, specifically including: The atomic instruction library predefines 200 standardized atomic instructions. Each instruction includes: instruction ID, name, parameter list, preconditions, postconditions, and execution timeout. These are divided into system-level (SYS_HOME, SYS_SWITCH_APP, etc.) and application-level (APP_OPEN, APP_PAGE_JUMP, APP_SCREEN_CAST, APP_ANNOTATE, etc.).
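One possible in-memory shape for an atomic instruction record is sketched below. The field set follows the list above (ID, name, parameters, preconditions, postconditions, execution timeout) plus the system/application split; the class names, condition strings, and timeout values are hypothetical placeholders, not values from the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Level(Enum):
    SYSTEM = "system"
    APPLICATION = "application"


@dataclass(frozen=True)
class AtomicInstruction:
    instruction_id: str
    name: str
    level: Level
    parameters: Dict[str, str] = field(default_factory=dict)   # parameter name -> type
    preconditions: List[str] = field(default_factory=list)
    postconditions: List[str] = field(default_factory=list)
    timeout_ms: int = 1000                                     # hypothetical default


# Two entries of a hypothetical library, keyed by instruction ID.
LIBRARY: Dict[str, AtomicInstruction] = {
    inst.instruction_id: inst
    for inst in [
        AtomicInstruction("APP_OPEN", "Open courseware", Level.APPLICATION,
                          parameters={"courseware": "str"},
                          postconditions=["courseware_open"]),
        AtomicInstruction("APP_PAGE_JUMP", "Jump to page", Level.APPLICATION,
                          parameters={"page": "int"},
                          preconditions=["courseware_open"],
                          postconditions=["page_shown"]),
    ]
}
```

Keeping preconditions and postconditions as explicit lists is what lets the planner derive dependencies between instructions without hand-written ordering rules.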
[0041] As shown in Figure 5, the task planning process is as follows:
S5.1 Semantic Matching: Semantically match the intent with the atomic instruction library to determine the required subset of atomic instructions. The matching uses a 100×200 intent-instruction association matrix.
S5.2 Dependency Analysis: Construct a DAG based on instruction preconditions and postconditions.
S5.3 Topological Sorting: Generate an ordered sequence of atomic instructions, and mark instructions with no dependencies on one another as parallel execution groups.
S5.4 Parameter Population: Extract entity parameters from the intent recognition result and populate them into the instruction parameter list.
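Steps S5.1 and S5.4 can be illustrated with a toy lookup. The sketch assumes the association matrix holds binary entries (1 meaning the instruction is associated with the intent), which the description does not specify, and it shrinks the 100×200 matrix to 2×4. The intent labels, slot names, and function names are hypothetical.

```python
from typing import Any, Dict, List

INSTRUCTION_IDS = ["APP_OPEN", "APP_PAGE_JUMP", "APP_SCREEN_CAST", "APP_ANNOTATE"]
INTENT_IDS = ["open_courseware", "cast_and_highlight"]   # hypothetical intent labels

# Toy 2x4 stand-in for the 100x200 intent-instruction association matrix.
ASSOCIATION = [
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]

# Hypothetical parameter slots for each instruction.
PARAMETER_SLOTS: Dict[str, List[str]] = {
    "APP_OPEN": ["courseware"],
    "APP_PAGE_JUMP": ["page"],
    "APP_SCREEN_CAST": ["target"],
    "APP_ANNOTATE": ["mode"],
}


def match_instructions(intent: str) -> List[str]:
    """S5.1: read the intent's matrix row and keep the associated instructions."""
    row = ASSOCIATION[INTENT_IDS.index(intent)]
    return [iid for iid, flag in zip(INSTRUCTION_IDS, row) if flag]


def populate(instructions: List[str], entities: Dict[str, Any]) -> List[Dict[str, Any]]:
    """S5.4: copy recognized entity values into each instruction's parameter slots."""
    return [
        {"id": iid,
         "args": {slot: entities[slot] for slot in PARAMETER_SLOTS[iid] if slot in entities}}
        for iid in instructions
    ]


matched = match_instructions("cast_and_highlight")
filled = populate(matched, {"courseware": "mathematics", "page": 3,
                            "target": "student screen", "mode": "highlighting"})
```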
[0042] In one feasible implementation, the teacher says, "Project page 3 of the math presentation onto the students' screens and highlight the key points," while pointing to the presentation area. The intent recognized through multimodal fusion is "presentation projection + highlighting," which is automatically broken down into a sequence of atomic instructions: ①APP_OPEN(courseware="mathematics") → ②APP_PAGE_JUMP(page number=3) → ③APP_SCREEN_CAST(target="student screen") → ④APP_ANNOTATE(mode="highlighting").
[0043] ② and ③ can be executed in parallel.
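The dependency analysis and topological sorting behind this example can be sketched with Python's standard `graphlib` module (Python 3.9+). The precondition and postcondition strings are hypothetical, chosen so that ② and ③ both depend only on ① and ④ depends on both; the description does not list the actual conditions.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pre/postconditions for the four instructions in the example.
INSTRUCTIONS: Dict[str, Dict[str, Set[str]]] = {
    "APP_OPEN":        {"pre": set(),                    "post": {"courseware_open"}},
    "APP_PAGE_JUMP":   {"pre": {"courseware_open"},      "post": {"page_shown"}},
    "APP_SCREEN_CAST": {"pre": {"courseware_open"},      "post": {"casting"}},
    "APP_ANNOTATE":    {"pre": {"page_shown", "casting"}, "post": {"annotating"}},
}


def build_dag(instructions: Dict[str, Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """Map each instruction to the instructions that produce its preconditions."""
    producers = {cond: name
                 for name, spec in instructions.items()
                 for cond in spec["post"]}
    return {name: {producers[c] for c in spec["pre"] if c in producers}
            for name, spec in instructions.items()}


def plan(instructions: Dict[str, Dict[str, Set[str]]]) -> List[List[str]]:
    """Topologically sort the DAG into groups that may run in parallel."""
    sorter = TopologicalSorter(build_dag(instructions))
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    groups: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        groups.append(ready)
        sorter.done(*ready)
    return groups


groups = plan(INSTRUCTIONS)
# groups == [["APP_OPEN"], ["APP_PAGE_JUMP", "APP_SCREEN_CAST"], ["APP_ANNOTATE"]]
```

Each inner list is a parallel execution group: the second group contains the page jump and the screen cast, matching the statement that ② and ③ can run concurrently.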
[0044] In one feasible implementation, the DAG construction and topological sorting in the atomized task planning step are optimized through reinforcement learning. A state space (the set of currently executed instructions and the system state), an action space (the next executable atomic instruction), and a reward function (task completion speed + execution success rate) are defined, and a policy network is trained to select the optimal execution path. The policy network has approximately 5M parameters and is trained on 100,000 historical execution trajectories, resulting in a 10% improvement in task planning accuracy.
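The action space and reward described here can be made concrete with the sketch below. It covers only the environment side (which instructions are executable in a given state, and how a finished episode might be scored); the policy network and training loop are out of scope. The dependency table, the equal weighting of the two reward terms, and the 500 ms reference time used to normalize speed are all assumptions.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical dependency table: instruction -> instructions it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "APP_OPEN": set(),
    "APP_PAGE_JUMP": {"APP_OPEN"},
    "APP_SCREEN_CAST": {"APP_OPEN"},
    "APP_ANNOTATE": {"APP_PAGE_JUMP", "APP_SCREEN_CAST"},
}


def executable_actions(executed: FrozenSet[str],
                       dependencies: Dict[str, Set[str]]) -> Set[str]:
    """Action space: instructions not yet run whose dependencies have all run."""
    return {name for name, deps in dependencies.items()
            if name not in executed and deps <= executed}


def reward(elapsed_ms: float, succeeded: bool, reference_ms: float = 500.0,
           w_speed: float = 0.5, w_success: float = 0.5) -> float:
    """Reward = weighted completion speed + weighted execution success."""
    speed = max(0.0, 1.0 - elapsed_ms / reference_ms)  # 1.0 is instant, 0.0 is too slow
    return w_speed * speed + w_success * (1.0 if succeeded else 0.0)


next_actions = executable_actions(frozenset({"APP_OPEN"}), DEPENDENCIES)
episode_reward = reward(elapsed_ms=250.0, succeeded=True)
```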
[0045] In a second aspect of the present invention, an edge-end intent recognition and atomized task planning system based on multimodal fusion is provided for implementing the aforementioned edge-end intent recognition and atomized task planning method based on multimodal fusion, comprising the following modules:
a multimodal data acquisition module, used to collect multimodal data from users, including visual modal data, voice modal data, and perceptual modal data;
a multimodal data fusion module, used to perform time alignment and feature-level fusion on the collected multimodal data to generate a unified multimodal feature representation;
an invalid interaction filtering module, used to filter and remove invalid interaction signals based on the multimodal feature representation;
an intent recognition module, used to input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent;
an atomic task planning and scheduling module, used to match the identified user operation intent with a predefined atomic instruction library, decompose complex intents into an ordered sequence of atomic instructions, and schedule the execution of each atomic instruction in the order of the atomic instruction sequence.
[0046] In a third aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, the computer program being executed by a processor to implement the steps of the edge intent recognition and atomized task planning method based on multimodal fusion described above.
[0047] In a fourth aspect of the invention, an electronic device is provided, comprising: one or more processors; and a memory for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the edge intent recognition and atomized task planning method based on multimodal fusion as described above.
[0048] The above embodiments are merely descriptions of preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.
Claims
1. An edge-end intent recognition and atomized task planning method based on multimodal fusion, characterized in that it includes the following steps:
S1. Simultaneously collect multimodal data from the user, including visual modal data, voice modal data, and perceptual modal data;
S2. Perform time alignment and feature-level fusion on the collected multimodal data to generate a unified multimodal feature representation;
S3. Filter based on the multimodal feature representation to remove invalid interaction signals;
S4. Input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent;
S5. Match the identified user operation intent with a predefined atomic instruction library, decompose the complex intent into an ordered sequence of atomic instructions, and schedule the execution of each atomic instruction according to the order of the atomic instruction sequence.
2. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the visual modal data includes eye-tracking data, gesture recognition data, and emotion recognition data.
3. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the perceptual modal data includes spatial positioning data and millimeter-wave radar gesture data.
4. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the multimodal data fusion step includes: performing frame-level time alignment based on the timestamps of each modality; extracting modal feature vectors for each modality; calculating the correlation weights between the modalities through a cross-modal attention mechanism; and performing weighted fusion based on the weights.
5. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the multimodal large model is obtained from a general large model through knowledge distillation, the number of parameters after distillation does not exceed 0.5B, and the model includes a visual encoder, a speech encoder, a perceptual encoder, a cross-modal fusion layer, and an intent decoder.
6. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the filtering of invalid interaction signals includes rule engine filtering based on preset rules and model filtering based on a trained classification model.
7. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that each atomic instruction in the atomic instruction library includes an instruction identifier, an instruction name, an instruction parameter list, a precondition list, and a postcondition description.
8. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that the specific steps for decomposing complex intents into an ordered sequence of atomic instructions include: semantically matching the intent with the atomic instruction library to determine the required set of atomic instructions; constructing a directed acyclic graph based on instruction dependencies; and performing topological sorting on the directed acyclic graph to generate an ordered sequence of atomic instructions.
9. The edge-end intent recognition and atomized task planning method based on multimodal fusion according to claim 1, characterized in that all inference computations of the method are completed on the edge device, and critical computations are offloaded to the NPU for execution, with an end-to-end latency of no more than 500 ms.
10. An edge-end intent recognition and atomized task planning system based on multimodal fusion, used to implement the edge-end intent recognition and atomized task planning method based on multimodal fusion as described in any one of claims 1-9, characterized in that it includes the following modules:
a multimodal data acquisition module, used to collect multimodal data from users, including visual modal data, voice modal data, and perceptual modal data;
a multimodal data fusion module, used to perform time alignment and feature-level fusion on the collected multimodal data to generate a unified multimodal feature representation;
an invalid interaction filtering module, used to filter and remove invalid interaction signals based on the multimodal feature representation;
an intent recognition module, used to input the filtered multimodal feature representation into the multimodal large model deployed on the edge for fusion inference to identify the user's operation intent;
an atomic task planning and scheduling module, used to match the identified user operation intent with a predefined atomic instruction library, decompose complex intents into an ordered sequence of atomic instructions, and schedule the execution of each atomic instruction in the order of the atomic instruction sequence.
11. A computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement the steps of the edge intent recognition and atomized task planning method based on multimodal fusion as described in any one of claims 1 to 9.
12. An electronic device, characterized in that it includes: one or more processors; and a memory, used to store one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the edge intent recognition and atomized task planning method based on multimodal fusion as described in any one of claims 1-9.