Robot multi-modal perception data real-time fusion method based on NPU acceleration

By using a real-time fusion method of multimodal perception data based on NPU, the problems of high latency in robot multimodal perception data processing and loose system integration are solved. This enables the robot to provide immediate feedback to the dynamic environment and operate stably in complex environments, thereby improving system reliability and deployment flexibility.

CN122196859A · Pending · Publication Date: 2026-06-12 · ZHUANGSI FEI (SHANGHAI) MANAGEMENT CONSULTING CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHUANGSI FEI (SHANGHAI) MANAGEMENT CONSULTING CO LTD
Filing Date
2025-12-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, robot multimodal perception data processing relies on CPU/GPU, resulting in high latency, inability to achieve deep feature association, loose system integration, difficulty in deployment and upgrade, and insufficient communication reliability in complex environments.

Method used

A real-time fusion method for multimodal sensing data based on NPU is adopted. By constructing modules for data acquisition, preprocessing and feature extraction, spatiotemporal alignment and feature-level fusion, decision and execution, and utilizing the parallel computing capabilities of NPU, real-time synchronous processing and deep fusion of multimodal data are achieved.

Benefits of technology

By reducing the end-to-end latency of perception, cognition, and decision-making, robots can provide real-time feedback to dynamic environments, improving system reliability and deployment flexibility, ensuring continuous and stable operation in complex environments, and supporting independent algorithm updates and modular hardware replacement.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122196859A_ABST
Patent Text Reader

Abstract

The application discloses a real-time fusion method for robot multimodal perception data based on NPU acceleration, in the interdisciplinary field of robotics and artificial intelligence. The method comprises the following steps: a data acquisition module is constructed, which synchronously collects heterogeneous perception data streams, including visual data, audio data, spatial positioning data and wireless perception data, in real time through a multimodal sensor array carried by the robot body. The method uses a dedicated NPU for hardware acceleration of multimodal feature extraction and fusion, and moves the deep fusion stage forward to the feature layer, so that the end-to-end latency of perception, cognition and decision-making is reduced from hundreds of milliseconds in traditional schemes to tens of milliseconds. This enables the robot to respond almost instantly and naturally to a dynamic environment (such as a sudden question or a moving crowd), addressing the core pain points of laggy and stiff interaction.

Description

Technical Field

[0001] This invention relates to the field of robotics and artificial intelligence, specifically to a real-time fusion method for robot multimodal perception data based on NPU acceleration.

Background Technology

[0002] The integration of artificial intelligence and robotics has driven the application of intelligent service robots in exhibition halls, commercial spaces, and public areas. These applications require robots to possess environmental perception, autonomous interaction, and precise navigation capabilities, the core of which lies in the real-time fusion and understanding of information from multiple sensor sources.

[0003] Current mainstream solutions typically transmit raw data from sensors such as cameras and microphones to an onboard central processing unit (CPU) or general-purpose graphics processing unit (GPU) for processing. Perceptual fusion algorithms run on this general-purpose computing unit to achieve basic functions such as visual positioning and speech recognition.

[0004] The core of an intelligent service robot therefore lies in the real-time fusion of multi-source perception information, and existing solutions fall short in several respects. Most rely on CPUs / GPUs to process multimodal data, and these general-purpose computing units incur high latency when running complex neural network inference, resulting in slow robot interaction response and a clunky user experience. At the same time, existing fusion methods mostly remain at the data stitching level and fail to achieve deep feature association across modalities, making it difficult for robots to understand the context of complex scenarios. Furthermore, the systems are loosely integrated, making deployment and upgrades difficult, and reliance on a single network makes communication prone to interruption in complex environments, resulting in insufficient reliability.

Summary of the Invention

[0005] The purpose of this invention is to provide a real-time fusion method for robot multimodal perception data based on NPU acceleration, so as to overcome the shortcomings of the prior art.

[0006] To achieve the above objective, the present invention provides the following technical solution: a real-time fusion method for robot multimodal perception data based on NPU acceleration, comprising the following steps: constructing a data acquisition module, which acquires heterogeneous perception data streams, including visual data, audio data, spatial positioning data, and wireless perception data, synchronously and in real time through a multimodal sensor array mounted on the robot body; constructing a data preprocessing and NPU-accelerated feature extraction module, which inputs the heterogeneous perception data streams to the NPU and uses the parallel computing capability of the NPU to run multiple dedicated neural network models simultaneously, performing target detection and feature extraction on the visual data, voice endpoint detection and sound source localization on the audio data, and fusion filtering on the positioning data; constructing a spatiotemporal alignment and feature-level fusion module, which, within the NPU and based on a unified timestamp and spatial coordinate system, aligns the extracted multimodal feature vectors and inputs them into a lightweight multimodal fusion network to generate a unified scene-aware feature map rich in contextual information; and constructing a decision-making and execution module, which inputs the scene-aware feature map into a task decision model to generate control commands that drive the robot to complete tasks including autonomous navigation, active interaction, content playback, or behavior imitation.

[0007] Preferably, the visual data in the data acquisition module is acquired by at least the following cameras: a miniature front camera for speaker detection and orientation recognition, a top camera for panoramic environment perception, and a high frame rate global exposure camera for visual positioning under high-speed motion; the audio data is acquired by a linear microphone array with customized increased spacing, and echo cancellation and far-field enhancement processing are performed based on integrated speech recognition services; the spatial positioning data is jointly provided by a high-precision BeiDou positioning module, a UWB sensing system, and a BLE beacon network to achieve seamless indoor and outdoor positioning; and the wireless sensing data is used to assist in environmental perception.

[0008] Preferably, it also includes an edge computing module, which is the robot's synesthetic computing smart backpack or a built-in PC3 module, integrating a high-performance NPU, ARM64 CPU, 5G communication module, multi-channel digital power amplifier and power management unit, and communicating with the robot's main control system through built-in NAT service.

[0009] Preferably, the lightweight multimodal fusion network is a neural network based on an attention mechanism or a Transformer architecture, which is optimized and deployed on the NPU to calculate the correlation weights between visual features, audio features, and spatial location features in real time, thereby achieving dynamic feature weighted fusion. Preferably, the task decision model is a model built based on reinforcement learning or a vertical domain large language model / VLM, which generates human-like interaction strategies and action sequences in real time based on the scene perception feature map and a pre-set scene knowledge base.

[0010] Preferably, real-time synchronous acquisition is achieved through a hardware synchronization signal or a high-precision software clock service, ensuring that the timestamp deviation between the visual data, audio data and spatial positioning data is within a preset millisecond threshold, so as to meet the requirements of dynamic interaction for timing consistency.

[0011] Preferably, the installation position of the miniature front camera and the acoustic center of the microphone array are jointly calibrated so that the sound source location result can be directly mapped into the visual image coordinate system, which is used to assist in the detection of the interlocutor and the focusing of visual attention.

[0012] Preferably, the NPU built into the intelligent edge module communicates with the ARM64 CPU via a high-speed on-chip bus or shared memory. The NPU is responsible for intensive parallel computing, and the ARM64 CPU is responsible for task scheduling, network communication, and protocol conversion with the robot body controller. Preferably, it further includes: selectively uploading the multimodal joint embedding vector or compressed raw sensing data to a cloud-based intelligent service platform via the 5G module; and receiving enhanced analysis results, model parameter updates, or cross-robot collaborative instructions from the platform to optimize local decision-making and execution.

[0013] Preferably, the method includes a robot system comprising: a robot body having a humanoid or mobile chassis structure and multiple built-in actuators; a multimodal perception kit integrated on the robot body, including the aforementioned camera array, microphone array, BeiDou / UWB / BLE positioning module, and environmental sensors; a syn-sensing intelligent edge module detachably mounted on the back or inside of the robot body, including the aforementioned high-performance NPU, CPU, and communication unit, for executing the real-time data fusion method; and a cloud-based intelligent service platform connected to the syn-sensing intelligent edge module via a 5G network, for model updates, big data analysis, cross-robot collaborative scheduling, and complex task offloading calculations.

[0014] In the above technical solution, the real-time fusion method for robot multimodal perception data based on NPU acceleration provided by the present invention has the following beneficial effects: 1. By using a dedicated NPU to accelerate multimodal feature extraction and fusion in hardware, and by moving the deep fusion process forward to the feature layer, the end-to-end latency of perception, cognition, and decision-making is reduced from hundreds of milliseconds in traditional solutions to tens of milliseconds. This enables the robot to make near-instantaneous and natural responses to dynamic environments (such as sudden questions or moving crowds), completely solving the core pain points of sluggish and stiff interaction.

[0015] 2. Through feature-level spatiotemporal alignment and attention-based fusion, robots can establish intrinsic connections between cross-modal information, forming a unified and context-sensitive understanding of the scene. This enables robots to perform complex tasks requiring refined spatiotemporal correlation and intent understanding, such as recognizing and walking towards specific audience members who have raised their hands and answering their questions, thus upgrading them from functional executors to scene understanders. 3. Modular design (especially the Tonggan Computing Intelligent Edge Module) highly integrates high-performance computing, multi-mode communication, and precise positioning, which not only improves system reliability and deployment flexibility (plug and play), but also ensures continuous and stable operation and precise positioning in complex environments such as poor WiFi signal or indoor-outdoor transition through the integration of 5G and Beidou / UWB technologies, thus breaking through the bottleneck of environmental adaptability.

[0016] 4. The collaborative architecture of real-time edge NPU fusion combined with cloud-based model training / complex inference ensures both the real-time performance and privacy of core interactions, while leveraging the unlimited computing power of the cloud for model iteration and handling of ultra-complex tasks. Clearly defined functional modules support independent algorithm updates and modular hardware replacement, significantly reducing the cost and complexity of long-term maintenance and feature upgrades.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this invention. For those skilled in the art, other drawings can be obtained based on these drawings.

[0018] Figure 1 is a flowchart of an embodiment of the real-time fusion method for robot multimodal perception data based on NPU acceleration provided by the present invention.

Detailed Implementation

[0019] To enable those skilled in the art to better understand the technical solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings.

[0020] Referring to Figure 1, the present invention provides a real-time fusion method for robot multimodal perception data based on NPU acceleration, comprising the following steps. First, a data acquisition module is constructed, which synchronously acquires heterogeneous perception data streams, including visual data, audio data, spatial positioning data, and wireless perception data, in real time through a multimodal sensor array mounted on the robot body. Second, a data preprocessing and NPU-accelerated feature extraction module is constructed, which inputs the heterogeneous perception data streams to the NPU and uses the parallel computing capabilities of the NPU to run multiple dedicated neural network models simultaneously, performing target detection and feature extraction on visual data, voice endpoint detection and sound source localization on audio data, and fusion filtering on positioning data. Third, a spatiotemporal alignment and feature-level fusion module is constructed, which aligns the extracted multimodal feature vectors within the NPU based on a unified timestamp and spatial coordinate system and inputs them into a lightweight multimodal fusion network to generate a unified scene perception feature map rich in contextual information. Fourth, a decision-making and execution module is constructed, which inputs the scene perception feature map into a task decision model to generate control commands that drive the robot to complete tasks including autonomous navigation, active interaction, content broadcasting, or behavior imitation.

[0021] The data acquisition module, serving as the system's sensory layer, has the core task of achieving high-precision synchronization and data aggregation. It doesn't simply connect various sensors (such as front, top, and high-speed cameras, a 4-microphone linear array, and BeiDou / UWB / BLE positioning units) to the system. Instead, it forces all sensors to start sampling under a unified time reference through hardware trigger signals or software scheduling based on a precision clock. This ensures that visual frames, audio frames, and positioning data packets from different physical locations have a strictly aligned unified timestamp at the time of generation. All timestamped raw data streams are aggregated in real time, forming a spatiotemporally correlated raw sensory data packet, laying the foundation for subsequent deep fusion and avoiding information mismatch problems caused by asynchronous acquisition.
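The aggregation of timestamped samples described above can be illustrated with a minimal sketch. It assumes every sample already carries a timestamp on the shared clock; the names `Sample`, `build_packet`, and the 5 ms `max_skew_ms` default are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    modality: str   # "visual", "audio", or "position"
    t_ms: float     # timestamp on the shared clock, in milliseconds
    payload: object


def nearest(samples, t_ref):
    """Return the sample whose timestamp is closest to t_ref."""
    return min(samples, key=lambda s: abs(s.t_ms - t_ref))


def build_packet(visual, audio, position, t_ref, max_skew_ms=5.0):
    """Bundle one sample per modality around t_ref.

    Returns None if any stream is empty or if its nearest sample deviates
    from t_ref by more than max_skew_ms (an assumed threshold).
    """
    packet = {}
    for stream in (visual, audio, position):
        if not stream:
            return None
        s = nearest(stream, t_ref)
        if abs(s.t_ms - t_ref) > max_skew_ms:
            return None
        packet[s.modality] = s
    return packet


visual = [Sample("visual", 100.0, "frame0"), Sample("visual", 133.3, "frame1")]
audio = [Sample("audio", 98.0, "chunk0"), Sample("audio", 108.0, "chunk1")]
position = [Sample("position", 102.0, (1.0, 2.0))]

ok = build_packet(visual, audio, position, t_ref=100.0)
late = build_packet(visual, audio, position, t_ref=133.3)
```

Rejecting a packet when any modality is too far from the reference time is one simple policy; interpolating the positioning data to the reference time would be an alternative.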

[0022] Furthermore, the data preprocessing and NPU-accelerated feature extraction module serves as the system's feature factory, offloading heterogeneous computing tasks to a dedicated NPU for parallel pipelined processing. Traditional CPU serial processing cannot meet real-time requirements. This module directly inputs the aggregated raw data stream into the NPU, leveraging its powerful parallel computing core and instruction set optimized for neural networks to synchronously and concurrently execute multiple lightweight, specialized neural network models: In the visual processing pipeline, object detection networks (such as YOLO variants) and feature extraction networks run in parallel, outlining people and objects from images in real time and generating visual feature vectors. In the audio processing pipeline, a speech activity detection (VAD) model and beamforming algorithm run in parallel, separating valid human voices and calculating their spatial azimuth. In the localization processing pipeline, sensor fusion filtering algorithms (such as Kalman filters) are run to denoise and optimize multi-source localization data. This combination of pipeline parallelism and data parallelism allows multiple heavy computational tasks that would otherwise have to be executed sequentially to be completed simultaneously, significantly reducing latency in the feature extraction stage.
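As an illustration of the fusion filtering mentioned for the localization pipeline, the following is a minimal scalar Kalman filter sketch. It assumes a static one-dimensional position model and measurement variances chosen for illustration; `kalman_update`, `fuse_position`, and the process noise `q` are hypothetical names and values, and the sketch runs on a CPU rather than an NPU.

```python
def kalman_update(x, p, z, r):
    """One scalar Kalman measurement update.

    x, p: prior estimate and its variance.
    z, r: measurement and its variance.
    """
    k = p / (p + r)            # Kalman gain
    x_new = x + k * (z - x)    # pull the estimate toward the measurement
    p_new = (1.0 - k) * p      # variance shrinks after the update
    return x_new, p_new


def fuse_position(x, p, measurements, q=0.01):
    """Predict with process noise q, then fold in each (value, variance)
    measurement from the available positioning sources."""
    p = p + q
    for z, r in measurements:
        x, p = kalman_update(x, p, z, r)
    return x, p


# Example: a coarse source (variance 1.0) and a finer source (variance 0.5).
x, p = fuse_position(0.0, 1.0, [(1.0, 1.0), (1.0, 0.5)], q=0.0)
```

A source with lower variance receives a larger gain, so the fused estimate leans toward whichever positioning source is currently more trustworthy.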

[0023] Furthermore, the spatiotemporal alignment and feature-level fusion module is the cognitive hub of the system. Its key lies in deep fusion at the feature layer rather than splicing at the decision layer. Within the NPU, the module performs the following operations: using pre-calibrated sensor extrinsic matrix and unified timestamps, it transforms the visual target pixel coordinates, sound source azimuth, and robot pose extracted in the previous step into the same world coordinate system; based on geometric relationships, it performs logical judgments (such as determining whether a visually detected human body exists within the cone-shaped area in the direction of the sound source); it binds multimodal features belonging to the same entity (such as a person's appearance, voice, and location); and inputs the bound feature vectors into a lightweight cross-modal attention network (such as a simplified Transformer encoding layer) also deployed on the NPU. This network dynamically evaluates the importance and relevance of each entity and modal information in the current scene through a self-attention mechanism, ultimately outputting a fixed-dimensional, globally semantically rich multimodal joint embedding vector (i.e., a scene-aware feature map). This achieves a cognitive leap from seeing a person, hearing a sound, and knowing their location to understanding that the audience asking questions three meters to the left is the current service focus.
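The geometric check described above (whether a visually detected person lies inside the cone around the sound-source direction) can be sketched in two dimensions. The sketch assumes the camera and microphone array share the robot's origin, so the sensor extrinsics reduce to the robot pose; the function names and the 15 degree half-angle are hypothetical.

```python
import math


def to_world(pose, point_robot):
    """Rotate and translate a 2D point from the robot frame to the world frame.

    pose = (x, y, yaw) of the robot in the world frame, yaw in radians.
    """
    x, y, yaw = pose
    px, py = point_robot
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py)


def in_sound_cone(pose, azimuth_robot, person_robot, half_angle=math.radians(15)):
    """True if the person lies within the cone around the sound direction.

    azimuth_robot: sound-source bearing in the robot frame, in radians.
    """
    x, y, yaw = pose
    wx, wy = to_world(pose, person_robot)
    bearing_to_person = math.atan2(wy - y, wx - x)
    sound_bearing = yaw + azimuth_robot
    delta = bearing_to_person - sound_bearing
    diff = math.atan2(math.sin(delta), math.cos(delta))  # wrap to [-pi, pi]
    return abs(diff) <= half_angle


def bind_entities(pose, azimuth_robot, people_robot):
    """Return the indices of detected people that match the sound direction."""
    return [i for i, p in enumerate(people_robot)
            if in_sound_cone(pose, azimuth_robot, p)]


# Sound arrives from 90 degrees to the left; only the first person is there.
matches = bind_entities((0.0, 0.0, 0.0), math.radians(90), [(0.0, 3.0), (3.0, 0.0)])
```

The matched indices identify which visual detections should be bound to the audio features before the bound vectors are passed to the fusion network.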

[0024] Furthermore, the decision-making and execution module serves as the system's decision-maker and executor, characterized by decision-making based on a unified high-dimensional scene representation. The task decision model (which can be a rule-based engine, a lightweight reinforcement learning policy network, or a large cloud model accessed via 5G) receives the aforementioned scene-aware feature map as input. Since this feature map integrates refined information from all modalities, the decision model no longer needs to process and understand the original visual and audio signals separately, greatly simplifying the decision-making logic and improving decision-making speed and accuracy. Based on the scene semantics parsed from the feature map, the model generates structured control commands (such as moving to coordinates (X,Y), performing a handshake action, or broadcasting the content of exhibit A), and sends them to the robot's underlying motion, voice, and behavior controllers to complete the closed-loop execution of the intelligent task.
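The rule-based variant of the task decision model can be sketched as a function that turns scene semantics into structured, prioritized commands. The sketch assumes the scene-aware feature map has already been parsed into a small dictionary; the keys `obstacle_close` and `active_speaker`, the `Command` class, and the action names are all hypothetical.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(order=True)
class Command:
    priority: int                        # lower value is executed first
    action: str = field(compare=False)   # e.g. "move_to", "speak"
    params: dict = field(compare=False, default_factory=dict)


def decide(scene):
    """Map a parsed scene summary to commands ordered by priority."""
    queue = []
    if scene.get("obstacle_close"):
        heapq.heappush(queue, Command(0, "stop"))
    speaker = scene.get("active_speaker")
    if speaker is not None:
        heapq.heappush(queue, Command(1, "move_to", {"x": speaker[0], "y": speaker[1]}))
        heapq.heappush(queue, Command(2, "speak", {"text": "How can I help?"}))
    return [heapq.heappop(queue) for _ in range(len(queue))]


commands = decide({"active_speaker": (-3.0, 0.5), "obstacle_close": True})
actions = [c.action for c in commands]
```

Only `priority` takes part in comparisons, so the heap orders commands without needing to compare their parameter dictionaries.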

[0025] Based on the above, a specific set of action instructions with a priority sequence is ultimately generated. These instructions are sent to the robot's underlying motion controller, speech synthesis module, and behavior engine, driving it to execute corresponding movements, broadcasts, gestures, or light feedback, thus forming a real-time closed loop of perception, fusion, decision-making, and execution. Furthermore, hardware acceleration of multimodal feature extraction and fusion is achieved through a dedicated NPU, and the deep fusion process is moved forward to the feature layer, reducing the end-to-end latency of perception, cognition, and decision-making from hundreds of milliseconds in traditional solutions to tens of milliseconds. This enables the robot to provide near-instantaneous and natural responses to dynamic environments (such as sudden questions or moving crowds), completely solving the core pain points of sluggish and stiff interaction. Through feature-level spatiotemporal alignment and attention-based fusion, the robot can establish intrinsic connections between cross-modal information, forming a unified and context-sensitive understanding of the scene. This allows the robot to perform complex tasks requiring refined spatiotemporal correlation and intent understanding, such as recognizing and walking towards specific audience members who have raised their hands and answering their questions, upgrading it from a functional executor to a scene understander. Modular design (especially the Tonggan Computing intelligent edge module) highly integrates high-performance computing, multi-mode communication, and precise positioning. 
This not only improves system reliability and deployment flexibility (plug-and-play), but also ensures continuous and stable operation and precise positioning in complex environments such as poor WiFi signal or indoor-outdoor transitions through the integration of 5G and technologies like BeiDou / UWB, overcoming environmental adaptability bottlenecks. Simultaneously, the edge NPU's real-time fusion combined with a collaborative architecture for cloud-based model training / complex inference ensures both the real-time nature and privacy of core interactions, while also leveraging the unlimited computing power of the cloud for model iteration and handling ultra-complex tasks. Clearly defined functional modules support independent algorithm updates and modular hardware replacement, significantly reducing the cost and complexity of long-term operation and maintenance and functional upgrades.

[0026] In the formal description, the sensor set comprises the visual, audio, positioning, and wireless sensing sensors, respectively. The raw data is the original multimodal data packet at time t. The feature extraction function, parallelized on the NPU, comprises dedicated neural network models for vision, audio, and localization, respectively.

[0027] Synchronous acquisition and time alignment: the time reference is a hardware-synchronized or high-precision software clock, and a timestamp alignment operation is applied to the raw data. NPU parallel feature extraction:

[0028] The above calculations are performed in parallel on multiple computing cores of the NPU, satisfying:

[0029] Spatiotemporal alignment and feature fusion: a coordinate transformation function, using the pre-calibrated sensor extrinsic parameter matrix, brings the extracted features into a common coordinate system. The output of the fusion network is a unified multimodal joint embedding vector.

[0030] Decision-making and execution: the decision input includes the robot's internal state (such as battery level and current task), and the output is the action commands issued to the actuators. The visual data in the data acquisition module is collected by at least the following cameras: a miniature front camera for speaker detection and orientation recognition, a top camera for panoramic environment perception, and a high frame rate global exposure camera for visual positioning under high-speed motion; audio data is collected by a linear microphone array with customized increased spacing, and echo cancellation and far-field enhancement processing are performed based on integrated speech recognition services; spatial positioning data is jointly provided by a high-precision BeiDou positioning module, a UWB sensing system, and a BLE beacon network to achieve seamless indoor and outdoor positioning; wireless sensing data is used to assist in environmental perception.
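The formal description in paragraphs [0026] to [0030] can be summarized compactly in one possible notation. Every symbol below is an illustrative assumption rather than the application's own notation: $S$ for the sensor set, $D_t$ for the raw packet at time $t$, $\tau$ for the shared clock, $\mathcal{A}$ for timestamp alignment, $F_m$ for the per-modality feature models, $C$ for the coordinate transformation with extrinsic matrix $E$, $G$ for the fusion network, $h_t$ for the robot's internal state, and $\pi$ for the task decision model.

```latex
\begin{align*}
S &= \{s_v,\ s_a,\ s_p,\ s_w\}
  && \text{visual, audio, positioning, wireless sensors} \\
D_t &= \{d_t^{v},\ d_t^{a},\ d_t^{p},\ d_t^{w}\}
  && \text{raw multimodal packet at time } t \\
\tilde{D}_t &= \mathcal{A}(D_t;\ \tau)
  && \text{timestamp alignment against clock } \tau \\
f_t^{m} &= F_m\!\left(\tilde{d}_t^{m}\right), \quad m \in \{v, a, p\}
  && \text{parallel feature extraction on the NPU} \\
\hat{f}_t^{m} &= C\!\left(f_t^{m};\ E\right)
  && \text{transformation into a common coordinate system} \\
z_t &= G\!\left(\hat{f}_t^{v},\ \hat{f}_t^{a},\ \hat{f}_t^{p}\right)
  && \text{multimodal joint embedding vector} \\
a_t &= \pi\!\left(z_t,\ h_t\right)
  && \text{action commands issued to the actuators}
\end{align*}
```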

[0031] It also includes an edge computing module, which is the robot's synesthetic computing smart backpack or the built-in PC3 module. It integrates a high-performance NPU, ARM64 CPU, 5G communication module, multi-channel digital power amplifier and power management unit, and communicates with the robot's main control system (PC1, PC2) through built-in NAT service.

[0032] The lightweight multimodal fusion network is a neural network based on attention mechanisms or the Transformer architecture. After optimization, it is deployed on the NPU to calculate the correlation weights between visual features, audio features, and spatial location features in real time, thereby achieving dynamic feature weighted fusion. Among them, the task decision model is a model built based on reinforcement learning or vertical domain large language model / VLM. It generates human-like interaction strategies and action sequences in real time based on scene perception feature maps and a pre-built scene knowledge base.
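The idea of computing correlation weights between modality features and using them for weighted fusion can be illustrated with scaled dot-product self-attention. This is a minimal pure-Python sketch that assumes identity projections (no learned weights) and mean pooling; it is not the network described in the application, and `attention_fuse` is a hypothetical name.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(features):
    """Self-attention over modality feature vectors, then mean pooling.

    features: list of equal-length vectors, one per modality.
    Returns (fused_vector, weights), where weights[i][j] is how strongly
    modality i attends to modality j.
    """
    d = len(features[0])
    weights = [softmax([dot(q, k) / math.sqrt(d) for k in features])
               for q in features]
    attended = [[sum(w * f[j] for w, f in zip(row, features)) for j in range(d)]
                for row in weights]
    fused = [sum(tok[j] for tok in attended) / len(attended) for j in range(d)]
    return fused, weights


# Toy visual, audio, and location feature vectors.
fused, weights = attention_fuse([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a trained network the queries, keys, and values would come from learned projection matrices, and the pooling step could be replaced by a dedicated summary token.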

[0033] Real-time synchronous acquisition is achieved through a hardware synchronization signal or a high-precision software clock service, ensuring that the timestamp deviation between visual data, audio data and spatial positioning data is within a preset millisecond threshold to meet the timing consistency requirements of dynamic interaction.

[0034] The installation position of the miniature front camera and the acoustic center of the microphone array are jointly calibrated so that the sound source location results can be directly mapped to the visual image coordinate system, which is used to assist in the detection of the person in the conversation and the focusing of visual attention.
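The mapping from a sound-source direction to the image coordinate system can be sketched with a pinhole camera model. The sketch assumes a far-field source, so the offset between the array's acoustic center and the camera is reduced to a single calibrated yaw angle; `azimuth_to_pixel_column`, `fx`, `cx`, and `yaw_offset_rad` are hypothetical names.

```python
import math


def azimuth_to_pixel_column(azimuth_rad, fx, cx, yaw_offset_rad=0.0, width=None):
    """Map a sound-source azimuth to an image column.

    azimuth_rad: bearing in the microphone-array frame, positive to the right.
    fx, cx: horizontal focal length and principal point, in pixels.
    yaw_offset_rad: bearing of the camera's optical axis in the array frame.
    width: optional image width; columns outside [0, width) return None.
    """
    theta = azimuth_rad - yaw_offset_rad
    if abs(theta) >= math.pi / 2:
        return None                      # source is beside or behind the camera
    u = cx + fx * math.tan(theta)
    if width is not None and not (0 <= u < width):
        return None                      # outside the field of view
    return u


center = azimuth_to_pixel_column(0.0, fx=500.0, cx=320.0)
right = azimuth_to_pixel_column(math.atan(0.2), fx=500.0, cx=320.0)
```

The returned column can seed a region of interest for detecting the interlocutor; for nearby sources the parallax between the two sensors would need a full extrinsic transform instead of a single angle.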

[0035] Among them, the NPU built into the Tonggan Computing Intelligent Edge Module communicates with the ARM64 CPU through a high-speed on-chip bus or shared memory. The NPU is responsible for intensive parallel computing, while the ARM64 CPU is responsible for task scheduling, network communication, and protocol conversion with the robot body controller.

[0036] This also includes: selectively uploading multimodal joint embedding vectors or compressed raw perception data to a cloud-based intelligent service platform via a 5G module; and receiving enhanced analysis results, model parameter updates, or cross-robot collaborative instructions from the platform to optimize local decision-making and execution.

[0037] Based on the above, the method is applied in the following scenarios. At the entrance of the exhibition hall, the robot actively identifies and greets visitors through face detection and sound source localization, simultaneously broadcasting a welcome message and making guiding gestures. During the guided tour, it integrates real-time visual SLAM information, infrared monitoring data of crowd flow, and the locations of predetermined explanation points to dynamically plan the optimal movement path and avoid congestion. At the explanation points, it draws on the knowledge base to provide targeted explanations based on the audience's orientation (visual) and questions (audio), and can provide feedback through gestures or nods. In outdoor or large venues, it uses the 5G network to upload high-load sensing data or complex queries to the intelligent service cloud platform in real time for auxiliary computation and receives the returned instructions.

[0038] This includes a robot system for implementing the method. The robot system comprises: a robot body with a humanoid or mobile chassis structure and multiple built-in actuators; a multimodal perception kit integrated on the robot body, including a camera array, a microphone array, a BeiDou / UWB / BLE positioning module, and environmental sensors; a detachable intelligent edge module for sensing and computing, mounted on the back or inside of the robot body, containing a high-performance NPU, CPU, and communication unit, used to execute the real-time data fusion method; and a cloud-based intelligent service platform connected to the intelligent edge module via a 5G network for model updates, big data analysis, cross-robot collaborative scheduling, and complex task offloading computation.

[0039] The foregoing has only described certain exemplary embodiments of the present invention by way of illustration. Undoubtedly, those skilled in the art can modify the described embodiments in various ways without departing from the spirit and scope of the present invention. Therefore, the foregoing drawings and descriptions are illustrative in nature and should not be construed as limiting the scope of protection of the claims of the present invention.

Claims

1. A real-time fusion method for robot multimodal perception data based on NPU acceleration, characterized in that, Includes the following steps: A data acquisition module is constructed, which uses a multimodal sensor array mounted on the robot body to collect heterogeneous sensing data streams, including visual data, audio data, spatial positioning data, and wireless sensing data, in real time and synchronously. A data preprocessing and NPU-accelerated feature extraction module is constructed, which inputs the heterogeneous perception data stream to the NPU, utilizes the parallel computing capability of the NPU to run multiple dedicated neural network models simultaneously, and performs target detection and feature extraction on visual data, speech endpoint detection and sound source location on audio data, and fusion filtering on location data. Spatiotemporal alignment and feature-level fusion are constructed within the NPU. Based on a unified timestamp and spatial coordinate system, the multimodal feature vectors after feature extraction are aligned and input into a lightweight multimodal fusion network to generate a unified scene-aware feature map rich in contextual information. A decision-making and execution module is constructed, which inputs the scene perception feature map into the task decision model, generates control commands, and drives the robot to complete tasks including autonomous navigation, active interaction, content broadcasting, or behavior imitation.

2. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that, The visual data in the data acquisition module is acquired by at least the following cameras: a miniature front camera for speaker detection and orientation recognition, a top camera for panoramic environment perception, and a high frame rate global exposure camera for visual positioning under high-speed motion; the audio data is acquired by a linear microphone array with customized increased spacing, and echo cancellation and far-field enhancement processing are performed based on an integrated speech recognition service. The spatial positioning data is jointly provided by a high-precision BeiDou positioning module, a UWB sensing system, and a BLE beacon network to achieve seamless indoor and outdoor positioning; the wireless sensing data is used to assist in environmental perception.

3. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that the method further employs an edge computing module, which is either a smart backpack worn by the robot or a built-in PC3 module; the edge computing module integrates a high-performance NPU, an ARM64 CPU, a 5G communication module, a multi-channel digital power amplifier, and a power management unit, and communicates with the robot's control system through a built-in NAT service.

4. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that the lightweight multimodal fusion network is a neural network based on an attention mechanism or a Transformer architecture which, after optimization, is deployed on the NPU to calculate in real time the correlation weights among the visual features, audio features, and spatial location features, thereby achieving dynamically weighted feature fusion.
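
The dynamic weighting described in this claim can be sketched with scaled dot-product attention over one feature vector per modality. This is a minimal sketch under stated assumptions, not the claimed network: the `query` context vector, the function name `fuse_modalities`, and the use of a single attention head in plain Python are all hypothetical, and a deployed model would be a trained network compiled for the NPU.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(query, features):
    """Weight each modality by its scaled dot-product score against `query`.

    query: list of floats (hypothetical task/context vector).
    features: dict mapping modality name -> feature vector of the same length.
    Returns (weights_by_modality, fused_vector).
    """
    names = list(features)
    scale = math.sqrt(len(query))
    scores = [
        sum(q * f for q, f in zip(query, features[name])) / scale
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality features.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


weights, fused = fuse_modalities(
    [1.0, 0.0],
    {"visual": [2.0, 0.0], "audio": [0.0, 2.0], "position": [0.0, 0.0]},
)
```

In this toy input the visual feature is the only one correlated with the query, so it receives the largest weight; the weights always sum to one.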

5. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that the task decision model is built on reinforcement learning or on a vertical-domain large language model or vision-language model (VLM), and generates human-like interaction strategies and action sequences in real time based on the scene perception feature map and a preset scene knowledge base.

6. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that the real-time synchronous acquisition is achieved through a hardware synchronization signal or a high-precision software clock service, such that the timestamp deviation among the visual data, audio data, and spatial positioning data stays within a preset millisecond-level threshold, thereby meeting the timing-consistency requirements of dynamic interaction.
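
The timestamp-deviation condition in this claim can be checked in software as follows. This is a minimal sketch, assuming each modality exposes a non-empty, sorted list of sample timestamps in milliseconds; the names `nearest`, `align_frame`, and `max_skew_ms` are hypothetical, and the claim does not specify a threshold value.

```python
from bisect import bisect_left


def nearest(timestamps, t):
    """Return the timestamp closest to t from a non-empty sorted list."""
    i = bisect_left(timestamps, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = timestamps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda ts: abs(ts - t))


def align_frame(streams, t_ref, max_skew_ms):
    """Pick the sample nearest to t_ref from each modality.

    streams: dict mapping modality name -> sorted list of timestamps (ms).
    Returns (picked, skew). `picked` is None when the spread between the
    earliest and latest picked timestamps exceeds max_skew_ms.
    """
    picked = {name: nearest(ts, t_ref) for name, ts in streams.items()}
    skew = max(picked.values()) - min(picked.values())
    if skew > max_skew_ms:
        return None, skew
    return picked, skew


# Hypothetical sample clocks: ~30 Hz camera, 100 Hz audio frames, 20 Hz positioning.
streams = {
    "visual": [0, 33, 66, 100],
    "audio": list(range(0, 101, 10)),
    "position": [0, 50, 100],
}
picked, skew = align_frame(streams, t_ref=33, max_skew_ms=25)
```

A frame set that fails the check is dropped here; a real system might instead interpolate the slower stream, which the claims leave open.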

7. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 2, characterized in that the installation position of the miniature front camera and the acoustic center of the microphone array are jointly calibrated, so that the sound source localization result can be mapped directly into the visual image coordinate system to assist interlocutor detection and visual attention focusing.
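
One way to map a sound source direction into image coordinates is through a pinhole camera model. The sketch below rests on several assumptions not stated in the claims: the microphone array and camera are close enough that only a yaw offset from the joint calibration matters (far-field source), positive azimuth points to the right of the optical axis, and the parameter names `fx`, `cx`, and `yaw_offset_deg` are hypothetical.

```python
import math


def azimuth_to_pixel_column(azimuth_deg, fx, cx, image_width, yaw_offset_deg=0.0):
    """Project a sound-source azimuth onto an image column (pinhole model).

    azimuth_deg: source direction from the microphone array, in degrees.
    fx, cx: horizontal focal length and principal point, in pixels.
    yaw_offset_deg: yaw of the array relative to the camera axis, taken
        from the joint calibration (hypothetical parameter).
    Returns the pixel column as a float, or None if the source lies
    outside the camera's horizontal field of view.
    """
    angle = math.radians(azimuth_deg - yaw_offset_deg)
    if abs(angle) >= math.pi / 2:
        return None  # source is beside or behind the camera
    u = cx + fx * math.tan(angle)
    return u if 0 <= u < image_width else None


# Hypothetical 640-pixel-wide image with a 90-degree horizontal field of view.
column = azimuth_to_pixel_column(30.0, fx=320.0, cx=320.0, image_width=640)
```

The returned column can then be used to select which detected face or person box the speech is attributed to.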

8. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 3, characterized in that the NPU built into the edge computing module communicates with the ARM64 CPU via a high-speed on-chip bus or shared memory, the NPU being responsible for intensive parallel computing and the ARM64 CPU being responsible for task scheduling, network communication, and protocol conversion with the robot body controller.

9. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 4, characterized in that the method further comprises: selectively uploading, via a 5G module, a multimodal joint embedding vector or compressed raw sensing data to a cloud-based intelligent service platform; and receiving enhanced analysis results, model parameter updates, or cross-robot collaborative instructions from the platform to optimize local decision-making and execution.

10. The real-time fusion method for robot multimodal perception data based on NPU acceleration according to claim 1, characterized in that the method is carried out by a robotic system for implementing the method of any one of claims 1 to 9, the robotic system comprising:
a robot body having a humanoid or mobile chassis structure and a variety of built-in actuators;
a multimodal perception kit integrated on the robot body and including the aforementioned camera array, microphone array, BeiDou / UWB / BLE positioning module, and environmental sensors;
a sensory computing ("Tonggan Computing") intelligent edge module detachably mounted on the back of or inside the robot body and including the high-computing-power NPU, a CPU, and a communication unit, for executing the real-time data fusion method; and
a cloud-based intelligent service platform connected to the intelligent edge module via a 5G network and used for model updates, big data analysis, cross-robot collaborative scheduling, and offloaded computation of complex tasks.