A method for processing multi-modal data and related apparatus

By converting image and point cloud data into feature sequences in the same format and processing them using the same feature extraction network, the problem of wasted storage resources and latency caused by multiple encoders is solved, and efficient modal data fusion is achieved.

CN117077073B · Active · Publication Date: 2026-06-19 · HUAWEI TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2023-07-11
Publication Date: 2026-06-19

Application Information

Patent Timeline

11 Jul 2023: Application
19 Jun 2026: Publication

CN117077073B

IPC: G06F18/25; G06N3/04

AI Tagging

Application Domain

Neural architectures

AI Technical Summary

Technical Problem

In existing technologies, the fusion of image and point cloud data requires the separate use of image encoders and point cloud encoders, resulting in the need to deploy multiple encoders on the device, which leads to high storage resource overhead and extended inference time.

Method used

Image and point cloud data are converted into feature sequences in the same format and processed through the same feature extraction network to achieve feature extraction and fusion of data from different modalities.

Benefits of technology

It saves equipment storage resources, improves data processing efficiency, reduces modal data fusion time, and ensures the accuracy of fused features.

✦ Generated by Eureka AI based on patent content.

Figure CN117077073B_ABST

Abstract

A method for processing multimodal data is proposed and applied in the field of artificial intelligence. In this method, data from different modalities (i.e., images and point cloud data) are converted into feature sequences in the same format, and the same feature extraction network processes these feature sequences. This ensures that feature extraction from different modalities can be achieved using a single feature extraction network, eliminating the need to deploy multiple encoders for different modalities and saving device storage resources. Furthermore, processing the feature sequences from different modalities using the same feature extraction network enables AI hardware to perform parallel processing of different modalities, improving data processing efficiency and effectively reducing the fusion time of different modalities.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence (AI) technology, and in particular to a method and apparatus for processing multimodal data.

Background Technology

[0002] In reliable autonomous driving systems, perceiving the physical world in three-dimensional (3D) space is crucial. As sensors in the field of autonomous driving become more advanced, it is necessary to integrate complementary signals captured from different sensors, such as cameras and LiDAR, in a unified manner.

[0003] Data acquired from multi-sensor systems is fundamentally represented in different modalities: for example, cameras capture semantically rich images, while LiDAR acquires point clouds with precise geometric information in 3D space. Integrating these complementary sensor signals is an ideal solution for achieving robust 3D perception. However, developing effective fusion methods is difficult because the original data representations differ significantly.

[0004] Currently, related technologies fuse image and point cloud data as follows: first, image features and point cloud features are extracted separately by an image encoder and a point cloud encoder; then, the features of the two modalities are transformed into a unified space (such as bird's-eye-view (BEV) space) and fused to obtain the fused features.

[0005] However, the related technologies require modality-specific encoders to extract features from data of different modalities separately. That is, an image encoder and a point cloud encoder each perform feature extraction. This requires multiple encoders to be deployed on the device at the same time, and the AI hardware on the device cannot process data of different modalities in parallel, resulting in large storage resource overhead and long inference latency.

Summary of the Invention

[0006] This application provides a method for processing multimodal data, which can save device storage resources and improve data processing efficiency, effectively reducing the fusion time of different modal data.

[0007] The first aspect of this application provides a method for processing multimodal data, applied to the fusion of image and point cloud data of different modalities. The method includes: firstly, acquiring a first image and point cloud data. The first image and point cloud data may be acquired in the same scene. For example, both the first image and point cloud data are acquired on the same autonomous vehicle in an autonomous driving scenario.

[0008] Then, the first image is converted into an image feature sequence, and the point cloud data is converted into a point cloud feature sequence. Both the image feature sequence and the point cloud feature sequence include multiple vectors, and the vectors in the two sequences have the same dimension, so that the feature sequences corresponding to different modalities can subsequently be processed by the same network model. Specifically, converting the first image into an image feature sequence means representing the content of the first image in the form of a feature sequence, while converting the point cloud data into a point cloud feature sequence means representing the sparsely distributed points in the point cloud data in the form of a feature sequence.

[0009] Next, the image feature sequence and the point cloud feature sequence are processed separately by a feature extraction network to obtain a first feature sequence corresponding to the image feature sequence and a second feature sequence corresponding to the point cloud feature sequence. In other words, because both the image feature sequence and the point cloud feature sequence are composed of multiple vectors of the same dimension, a single feature extraction network can process the feature sequences of both modalities and output a feature sequence for each modality.

[0010] Finally, the first feature sequence and the second feature sequence are fused to obtain fused features, which are used to perform environmental perception tasks, such as object detection tasks and semantic segmentation tasks.

[0011] In this solution, by converting data of different modalities (i.e., image and point cloud data) into feature sequences of the same format, and using the same feature extraction network to process the feature sequences corresponding to different modalities, it is possible to ensure that feature extraction of different modalities can be achieved based on a single feature extraction network. This eliminates the need to deploy multiple encoders corresponding to different modalities, saving device storage resources. Furthermore, processing the feature sequences corresponding to different modalities based on the same feature extraction network enables AI hardware to perform parallel processing of different modalities, improving data processing efficiency and effectively reducing the fusion time of different modalities.
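As an illustrative sketch only (not part of the patent text), the idea of one network serving both modalities can be shown with a toy encoder. The single linear layer with ReLU is a stand-in assumed here for the feature extraction network, and the names `make_shared_encoder`, `image_seq`, and `point_seq` are hypothetical.

```python
import random

def make_shared_encoder(dim, seed=0):
    """Build a toy encoder (one linear layer + ReLU) with a single set of weights.

    Assumption: this stands in for the shared feature extraction network;
    the source does not specify the network at this point.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def encode(sequence):
        # The same weights are applied to every vector, whatever its modality.
        return [
            [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]
            for vec in sequence
        ]

    return encode

dim = 4
encode = make_shared_encoder(dim)

# Hypothetical inputs: both sequences hold vectors of the same dimension.
image_seq = [[0.1 * (i + j) for j in range(dim)] for i in range(6)]  # 6 image vectors
point_seq = [[0.2 * (i - j) for j in range(dim)] for i in range(3)]  # 3 point cloud vectors

first_seq = encode(image_seq)    # first feature sequence
second_seq = encode(point_seq)   # second feature sequence

# Because the format is shared, both sequences can also go through in one pass.
batched = encode(image_seq + point_seq)
```

Only one set of weights is stored, and the concatenated pass gives the same result as the two separate passes, which is what makes batching both modalities on the same hardware possible in this sketch.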

[0012] In one possible implementation, fusing the first feature sequence and the second feature sequence to obtain the fused features includes: fusing vectors from the second feature sequence into the first feature sequence based on the projection positions of points in the point cloud data in the image space, resulting in a first fused sequence, where the vectors in the second feature sequence correspond to points in the point cloud data; fusing vectors from the first feature sequence into the second feature sequence based on the mapping positions of image patches in the first image in the point cloud space, resulting in a second fused sequence, where the vectors in the first feature sequence correspond to image patches in the first image; and fusing the first fused sequence and the second fused sequence to obtain the fused features. That is, the first fused sequence is a feature sequence obtained by fusing in the image space, while the second fused sequence is a feature sequence obtained by fusing in the point cloud space.

[0013] In this scheme, by fusing the first feature sequence and the second feature sequence in the image space and the point cloud space respectively, and then fusing the fused sequence obtained in the image space and the point cloud space in a unified space, the fusion of image features and point cloud features can be well realized, ensuring the accuracy of the final fused features.

[0014] In one possible implementation, based on the projection positions of points in the point cloud data into the image space, the vectors in the second feature sequence are fused into the first feature sequence. This includes: projecting points in the point cloud data into the image space to obtain the projection positions of the points in the point cloud data in the first image; adjusting the first vector in the first feature sequence to be the fusion result of the first vector and the second vector in the second feature sequence; wherein the projection position of the point corresponding to the second vector is located in the image patch corresponding to the first vector. In other words, the first vector in the first feature sequence and the second vector in the second feature sequence are vectors with a corresponding relationship determined by projecting the point cloud data into the image space.

[0015] In this scheme, by projecting points in point cloud data onto image space, the position of the points in image space can be determined, and then the vectors in the second feature sequence that correspond to the first feature sequence can be determined. By fusing the corresponding vectors based on the first feature sequence, the first feature sequence and the second feature sequence are fused in image space, so that the point cloud features can interact with their neighboring image features, ensuring the smooth fusion of image and point cloud data.
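The image-space fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a pinhole camera model with intrinsics `(fx, fy, cx, cy)`, points already expressed in the camera frame, one representative point per point cloud vector, and element-wise addition as the fusion operation are all assumptions, and every function name is hypothetical.

```python
def project_point(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy

def fuse_in_image_space(image_tokens, point_tokens, points,
                        patch, width, height, intrinsics):
    """Add each point cloud vector to the image vector whose patch contains
    the point's projection position. Returns a new (first fused) sequence."""
    patches_per_row = width // patch
    fused = [list(token) for token in image_tokens]
    for token, point in zip(point_tokens, points):
        if point[2] <= 0:
            continue  # point is behind the camera, no valid projection
        u, v = project_point(point, *intrinsics)
        if not (0 <= u < width and 0 <= v < height):
            continue  # projection falls outside the image
        index = int(v // patch) * patches_per_row + int(u // patch)
        fused[index] = [a + b for a, b in zip(fused[index], token)]
    return fused

# Hypothetical 4x4 image split into four 2x2 patches (row-major order).
image_tokens = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
point_tokens = [[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]]
points = [(1.0, 1.0, 1.0), (-1.0, -1.0, 1.0), (0.0, 0.0, -2.0)]

first_fused = fuse_in_image_space(
    image_tokens, point_tokens, points,
    patch=2, width=4, height=4, intrinsics=(1.0, 1.0, 2.0, 2.0),
)
```

In this example the first point projects to pixel (3, 3) and lands in the bottom-right patch, the second projects to (1, 1) and lands in the top-left patch, and the third is skipped because it lies behind the camera.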

[0016] In one possible implementation, fusing the vectors in the first feature sequence into the second feature sequence based on the mapping positions of image patches in the first image in the point cloud space includes: mapping the image patches in the first image to the point cloud space to obtain the mapping positions of the image patches in the point cloud space; and adjusting the third vector in the second feature sequence to be the fusion result of the third vector and the fourth vector in the first feature sequence, wherein the mapping position of the image patch corresponding to the fourth vector is located in the cube space where the point corresponding to the third vector is located.

[0017] In this scheme, by mapping the image patches in the first image to the point cloud space, the positions of the image patches in the point cloud space can be determined, and then the vectors in the first feature sequence that correspond to vectors in the second feature sequence can be determined. By fusing the corresponding vectors into the second feature sequence, the first feature sequence and the second feature sequence are fused in the point cloud space, so that the image features can interact with their neighboring point cloud features, ensuring the smooth fusion of image and point cloud data.
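A minimal sketch of the point-cloud-space fusion follows. It assumes that the mapped 3D position of each image patch is already known, that cube spaces are axis-aligned cells of side `voxel_size`, and that fusion is element-wise addition. These choices and all names are assumptions for illustration.

```python
import math

def voxel_index(position, voxel_size):
    """Integer index of the cube space containing a 3D position."""
    return tuple(math.floor(c / voxel_size) for c in position)

def fuse_in_point_cloud_space(point_tokens, image_tokens, patch_positions, voxel_size):
    """Add each image vector to the point cloud vector whose cube space
    contains the patch's mapped position. Returns a new (second fused) mapping.

    point_tokens: dict mapping cube-space index -> vector.
    patch_positions: mapped 3D position per image vector, or None if unknown.
    """
    fused = {key: list(vec) for key, vec in point_tokens.items()}
    for token, position in zip(image_tokens, patch_positions):
        if position is None:
            continue  # patch could not be mapped to the point cloud space
        key = voxel_index(position, voxel_size)
        if key in fused:  # only occupied cube spaces carry a vector
            fused[key] = [a + b for a, b in zip(fused[key], token)]
    return fused

# Hypothetical data: two occupied cube spaces and three image patches.
point_tokens = {(0, 0, 0): [1.0, 1.0], (1, 0, 0): [2.0, 2.0]}
image_tokens = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
patch_positions = [(0.5, 0.5, 0.5), (1.5, 0.2, 0.9), (5.0, 5.0, 5.0)]

second_fused = fuse_in_point_cloud_space(
    point_tokens, image_tokens, patch_positions, voxel_size=1.0
)
```

The third patch maps into an empty cube space, so it contributes nothing in this sketch.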

[0018] In one possible implementation, mapping an image patch in the first image to a point cloud space includes: determining at least one projection position closest to the first image patch in the first image based on the projection positions of points in the point cloud data in the first image. Here, the first image patch can be any image patch in the first image. Then, based on the depth of the point corresponding to the at least one projection position, the mapped position of the first image patch in the point cloud space is determined. That is, based on the depth of the point corresponding to the at least one projection position, the depth of the first image patch in the point cloud space can be determined; thus, based on the two-dimensional coordinates of the first image patch in the image space and the transformation relationship from image space to point cloud space, the mapped position of the first image patch in the point cloud space can be determined.

[0019] In summary, after the points in the point cloud data are projected onto the first image, one or more projection positions closest to each image patch in the first image can be found. The depth of the image patch in the point cloud space can then be determined based on the depths of the points at these projection positions, thereby determining the mapping position of the image patch in the point cloud space.

[0020] In this scheme, the points in the point cloud data are first projected into the image space, and then the depth of the image patch in the point cloud space is determined by finding the nearest projection position on the image. This ensures that the image patch in the image can be accurately mapped to the point cloud space, thus improving the feasibility of the scheme.
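The depth lookup and back-projection described above might look like the sketch below. The pinhole intrinsics, the use of the patch center as the patch's 2D coordinate, and averaging the depths of the `k` nearest projections are assumptions; the source only states that the depth is determined from the nearest projection positions. The function name is hypothetical.

```python
def map_patch_to_point_cloud_space(patch_center, projections, depths, intrinsics, k=1):
    """Estimate the camera-frame 3D position of an image patch.

    projections: (u, v) projection positions of the point cloud points.
    depths: depth of the point behind each projection position.
    Returns None when no projection is available.
    """
    if not projections:
        return None
    fx, fy, cx, cy = intrinsics
    u0, v0 = patch_center
    # Rank projection positions by squared pixel distance to the patch center.
    order = sorted(
        range(len(projections)),
        key=lambda i: (projections[i][0] - u0) ** 2 + (projections[i][1] - v0) ** 2,
    )
    nearest = order[:k]
    # Assumption: the patch depth is the mean depth of the k nearest points.
    depth = sum(depths[i] for i in nearest) / len(nearest)
    # Invert the pinhole projection at the estimated depth.
    return ((u0 - cx) * depth / fx, (v0 - cy) * depth / fy, depth)

intrinsics = (1.0, 1.0, 2.0, 2.0)
projections = [(3.2, 3.1), (0.5, 0.5)]
depths = [4.0, 10.0]

nearest_only = map_patch_to_point_cloud_space((3.0, 3.0), projections, depths, intrinsics, k=1)
two_nearest = map_patch_to_point_cloud_space((3.0, 3.0), projections, depths, intrinsics, k=2)
```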

[0021] In one possible implementation, the first fused sequence and the second fused sequence are fused to obtain fused features, including: converting the first fused sequence and the second fused sequence to the BEV space respectively to obtain first BEV features and second BEV features; and fusing the first BEV features and the second BEV features to obtain fused features. Specifically, in the process of fusing the first BEV features and the second BEV features, vectors located at the same position in the first BEV features and the second BEV features may be fused (e.g., the vectors are summed) to obtain fused features.

[0022] In this scheme, by fusing the first feature sequence and the second feature sequence in the image space and the point cloud space respectively, and then fusing the fused sequence obtained in the image space and the point cloud space in a unified space, the fusion of image features and point cloud features can be well realized, ensuring the accuracy of the final fused features.
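A small sketch of the BEV step follows, assuming each vector has a known 3D position and that conversion to BEV simply drops the height coordinate and bins (x, y) into a square grid. The source does not specify the conversion, so this binning is an assumption; the summation of vectors at the same position follows the example given in the text. All names are hypothetical.

```python
def to_bev(tokens, positions, grid_size, cell, dim):
    """Scatter vectors onto a grid_size x grid_size BEV grid using their (x, y)."""
    bev = [[[0.0] * dim for _ in range(grid_size)] for _ in range(grid_size)]
    for token, (x, y, _z) in zip(tokens, positions):
        col, row = int(x // cell), int(y // cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            bev[row][col] = [a + b for a, b in zip(bev[row][col], token)]
    return bev

def fuse_bev(bev_a, bev_b):
    """Sum the vectors located at the same BEV position."""
    return [
        [[a + b for a, b in zip(vec_a, vec_b)] for vec_a, vec_b in zip(row_a, row_b)]
        for row_a, row_b in zip(bev_a, bev_b)
    ]

# Hypothetical fused sequences with the 3D position of each vector.
first_bev = to_bev([[1.0, 2.0]], [(0.5, 1.5, 0.0)], grid_size=2, cell=1.0, dim=2)
second_bev = to_bev(
    [[10.0, 20.0], [3.0, 3.0]],
    [(0.2, 1.9, 1.0), (1.5, 0.5, 0.0)],
    grid_size=2, cell=1.0, dim=2,
)
fused_features = fuse_bev(first_bev, second_bev)
```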

[0023] In one possible implementation, converting the first image into an image feature sequence includes: dividing the first image into multiple image blocks (image patches); and converting each image block into a vector using a first feature converter to obtain an image feature sequence composed of multiple vectors corresponding to the multiple image blocks. The first feature converter can be, for example, a linear neural network.

[0024] In this scheme, the first image is divided into multiple image blocks, and each image block is converted into a vector of a specific dimension by the first feature converter, thereby obtaining an image feature sequence composed of multiple vectors. This ensures that feature sequences of the same format can be obtained for images of different formats, thus ensuring the feasibility and compatibility of the scheme.
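The image conversion can be sketched as below, assuming a single-channel image stored as a list of rows, non-overlapping square blocks, and a linear layer (`weights`, `bias`) as the first feature converter. The function name and the example values are hypothetical.

```python
def image_to_sequence(image, patch, weights, bias):
    """Split an H x W image into patch x patch blocks and map each block
    to a vector with one linear layer (the first feature converter)."""
    height, width = len(image), len(image[0])
    sequence = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            # Flatten the block row by row.
            flat = [image[top + i][left + j] for i in range(patch) for j in range(patch)]
            sequence.append(
                [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(weights, bias)]
            )
    return sequence

# 4x4 image with pixel values 0..15, split into four 2x2 blocks.
image = [[4 * r + c for c in range(4)] for r in range(4)]
weights = [[1, 1, 1, 1],   # output 0: sum of the block's pixels
           [1, 0, 0, 0]]   # output 1: the block's top-left pixel
bias = [0, 0]

image_feature_sequence = image_to_sequence(image, patch=2, weights=weights, bias=bias)
```

Whatever the image resolution, every block becomes a vector whose length is set by the number of rows in `weights`, which is what keeps the sequence format fixed.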

[0025] In one possible implementation, converting point cloud data into a point cloud feature sequence includes: dividing the point cloud data into multiple cube spaces, each cube space including one or more points from the point cloud data; and converting each cube space into a vector using a second feature converter based on the points included in each cube space, resulting in a point cloud feature sequence composed of multiple vectors corresponding to the multiple cube spaces. The second feature converter can be, for example, a linear neural network.

[0026] In this scheme, point cloud data is divided into multiple cube spaces, and the points in each cube space are converted into a vector of a specific dimension using a second feature converter, thereby obtaining a point cloud feature sequence composed of multiple vectors. This ensures that a feature sequence of the same format can be obtained for point cloud data with different distributions, thus ensuring the feasibility and compatibility of the scheme.
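A comparable sketch for the point cloud conversion follows. It assumes axis-aligned cube spaces of side `voxel_size`, keeps only occupied cube spaces, and summarizes the points of each cube space by their centroid before applying a linear layer as the second feature converter. The centroid step is an assumption, since the source does not say how the points within a cube space are combined, and all names are hypothetical.

```python
import math

def point_cloud_to_sequence(points, voxel_size, weights, bias):
    """Group points into cube spaces and map each occupied cube space to a vector."""
    voxels = {}
    for point in points:
        key = tuple(math.floor(c / voxel_size) for c in point)
        voxels.setdefault(key, []).append(point)
    keys = sorted(voxels)  # fixed order so vectors and cube spaces line up
    sequence = []
    for key in keys:
        members = voxels[key]
        # Assumption: the centroid summarizes the points in the cube space.
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        sequence.append(
            [sum(w * x for w, x in zip(row, centroid)) + b for row, b in zip(weights, bias)]
        )
    return keys, sequence

points = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25), (1.5, 0.5, 0.5)]
weights = [[1, 0, 0],   # output 0: x of the centroid
           [0, 1, 1]]   # output 1: y + z of the centroid
bias = [0, 0]

voxel_keys, point_cloud_feature_sequence = point_cloud_to_sequence(
    points, voxel_size=1.0, weights=weights, bias=bias
)
```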

[0027] In one possible implementation, the first image and point cloud data are acquired in any of the following scenarios: autonomous driving scenario, robot driving scenario, and intelligent inspection scenario.

[0028] A second aspect of this application provides a multimodal data processing apparatus, comprising: an acquisition module for acquiring a first image and point cloud data; a processing module for converting the first image into an image feature sequence and converting the point cloud data into a point cloud feature sequence, wherein both the image feature sequence and the point cloud feature sequence include multiple vectors, and the vectors included in the image feature sequence and the point cloud feature sequence have the same dimension; the processing module is further configured to process the image feature sequence and the point cloud feature sequence respectively through a feature extraction network to obtain a first feature sequence corresponding to the image feature sequence and a second feature sequence corresponding to the point cloud feature sequence; and the processing module is further configured to perform a fusion processing on the first feature sequence and the second feature sequence to obtain fused features, wherein the fused features are used to perform an environmental perception task.

[0029] In one possible implementation, the processing module is further configured to: fuse vectors in the second feature sequence into the first feature sequence based on the projection positions of points in the point cloud data in the image space to obtain a first fused sequence, wherein the vectors in the second feature sequence have corresponding points in the point cloud data; fuse vectors in the first feature sequence into the second feature sequence based on the mapping positions of image patches in the first image in the point cloud space to obtain a second fused sequence, wherein the vectors in the first feature sequence have corresponding image patches in the first image; and perform fusion processing on the first fused sequence and the second fused sequence to obtain fused features.

[0030] In one possible implementation, the processing module is further configured to: project points in the point cloud data onto an image space to obtain the projection positions of the points in the point cloud data in the first image; adjust the first vector in the first feature sequence to a fusion result of the first vector and the second vector in the second feature sequence; wherein the projection position of the point corresponding to the second vector is located in the image patch corresponding to the first vector.

[0031] In one possible implementation, the processing module is further configured to: map image patches in the first image to the point cloud space to obtain the mapping positions of the image patches in the point cloud space; and adjust the third vector in the second feature sequence to be the fusion result of the third vector and the fourth vector in the first feature sequence, wherein the mapping position of the image patch corresponding to the fourth vector is located in the cube space where the point corresponding to the third vector is located.

[0032] In one possible implementation, the processing module is further configured to: determine at least one projection position closest to the first image block in the first image based on the projection position of points in the point cloud data in the first image; and determine the mapping position of the first image block in the point cloud space based on the depth of the point corresponding to the at least one projection position.

[0033] In one possible implementation, the processing module is further configured to: convert the first fused sequence and the second fused sequence to the BEV space respectively to obtain first BEV features and second BEV features; and fuse the first BEV features and the second BEV features to obtain the fused features.

[0034] In one possible implementation, the processing module is further configured to: divide the first image into multiple image blocks; and convert each image block in the multiple image blocks into a vector using a first feature converter to obtain an image feature sequence composed of multiple vectors corresponding to the multiple image blocks.

[0035] In one possible implementation, the processing module is further configured to: divide the point cloud data into multiple cube spaces, each cube space including one or more points in the point cloud data; and convert each cube space into a vector using a second feature converter based on the points included in each cube space, thereby obtaining a point cloud feature sequence composed of multiple vectors arranged according to the multiple cube spaces.

[0036] In one possible implementation, the first image and point cloud data are acquired in the same scene.

[0037] In one possible implementation, the first image and point cloud data are acquired in any of the following scenarios: autonomous driving scenario, robot driving scenario, and intelligent inspection scenario.

[0038] A third aspect of this application provides a multimodal data processing apparatus, which may include a processor and a memory coupled together. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in the first aspect or any implementation thereof is implemented. For details regarding the steps of the various possible implementations of the first aspect executed by the processor, please refer to the first aspect; further details will not be repeated here.

[0039] The fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method of any implementation of the first aspect described above.

[0040] The fifth aspect of this application provides a circuit system including a processing circuit configured to perform the method of any implementation of the first aspect described above.

[0041] The sixth aspect of this application provides a computer program product that, when run on a computer, causes the computer to perform the method of any implementation of the first aspect described above.

[0042] A seventh aspect of this application provides a chip system including a processor for supporting a server or threshold value acquisition device in implementing the functions involved in any implementation of the first aspect described above, such as transmitting or processing data and / or information involved in the methods described above. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the server or communication device. This chip system may be composed of chips or may include chips and other discrete devices.

[0043] For the beneficial effects of the second to seventh aspects, refer to the description of the first aspect above; they are not repeated here.

Attached Figure Description

[0044] Figure 1A is a schematic diagram of an autonomous vehicle driving on a road, provided in an embodiment of this application;

[0045] Figure 1B is a schematic diagram of an inspection robot for the power transmission industry, provided in an embodiment of this application;

[0046] Figure 2 is a schematic diagram of the structure of a vehicle 100 provided in an embodiment of this application;

[0047] Figure 3 is a schematic diagram of the structure of a computer system 101 in a vehicle provided in an embodiment of this application;

[0048] Figure 4 is a schematic diagram of the structure of a robot 400 provided in an embodiment of this application;

[0049] Figure 5 is a flowchart of a method for processing multimodal data provided in an embodiment of this application;

[0050] Figure 6 is a schematic diagram of a process for fusing a first image and point cloud data in an autonomous driving scenario, provided in an embodiment of this application;

[0051] Figure 7 is a schematic diagram of feature transformation performed on a first image and point cloud data, provided in an embodiment of this application;

[0052] Figure 8 is a schematic diagram of a fusion process for a first feature sequence and a second feature sequence, provided in an embodiment of this application;

[0053] Figure 9 is a schematic diagram of the fusion of a first feature sequence and a second feature sequence in image space, provided in an embodiment of this application;

[0054] Figure 10 is a schematic diagram of the fusion of a first feature sequence and a second feature sequence in point cloud space, provided in an embodiment of this application;

[0055] Figure 11A is a schematic diagram of a process for fusing data from different modalities, provided in an embodiment of this application;

[0056] Figure 11B is a comparative schematic diagram of different multimodal data processing methods, provided in an embodiment of this application;

[0057] Figure 12 is a schematic diagram of the structure of a multimodal data processing device provided in an embodiment of this application;

[0058] Figure 13 is a schematic diagram of the structure of an execution device provided in an embodiment of this application;

[0059] Figure 14 is a schematic diagram of the structure of a chip provided in an embodiment of this application;

[0060] Figure 15 is a schematic diagram of the structure of a computer-readable storage medium provided in an embodiment of this application.

Detailed Implementation

[0061] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some, and not all, of the embodiments of this application. Those skilled in the art will understand that, with the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0062] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such descriptions can be used interchangeably where appropriate to allow embodiments to be implemented in a sequence other than that illustrated or described in this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps appearing in this application does not imply that the steps in the method flow must be performed in the chronological or logical order indicated by the naming or numbering. The execution order of named or numbered process steps can be changed according to the desired technical purpose, as long as the same or similar technical effect is achieved. The division of units in this application is a logical division. In practical applications, there may be other division methods. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be through some interface, and the indirect coupling or communication connection between units may be electrical or other similar forms, none of which are limited in this application. Furthermore, the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed among multiple circuit units. Some or all of the units can be selected to achieve the purpose of the solution in this application according to actual needs.

[0063] For ease of understanding, some technical terms involved in the embodiments of this application will be introduced below.

[0064] (1) Neural Network

[0065] A neural network can be composed of neural units, specifically understood as a neural network with input layers, hidden layers, and output layers. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. Neural networks with many hidden layers are called deep neural networks (DNNs). The function of each layer in a neural network can be expressed mathematically. To describe it physically, each layer in a neural network can be understood as transforming the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space. These five operations are: 1. Dimensionality increase / decrease; 2. Magnification / scaling; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are... Operation 4 is completed using "+b", and operation 5 is implemented using "a()". The term "space" is used here because the objects being classified are not individual things, but a class of things; space refers to the set of all individuals within this class of things. Here, W is the weight matrix of each layer of the neural network, where each value represents the weight of a neuron in that layer. This matrix W determines the spatial transformation from the input space to the output space, as described above; that is, the W of each layer of the neural network controls how the space is transformed. The purpose of training the neural network is to ultimately obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of a neural network is essentially learning how to control spatial transformation, more specifically, learning the weight matrix.
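The expression for a layer is missing from the text above. The symbols it mentions (W, "+b", and "a()") are consistent with the common form y = a(W·x + b), and the sketch below assumes that form; this reading is an assumption, not a quotation from the source. ReLU is used as one example of the "bending" function a(), and the names are hypothetical.

```python
def dense_layer(x, W, b, activation):
    """Compute y = a(W x + b) for one layer (assumed form, see lead-in)."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]

def relu(value):
    """One possible choice of a(): keeps positive values, zeroes the rest."""
    return max(0.0, value)

x = [1.0, 2.0]                 # 2-dimensional input
W = [[1.0, 0.0],
     [0.0, -1.0],
     [1.0, 1.0]]               # 3 x 2 weight matrix: raises the dimension to 3
b = [0.5, 0.5, -1.0]           # translation term "+b"

y = dense_layer(x, W, b, relu)
```

Here the product with W changes the dimension and scales or rotates the input, adding b translates it, and the activation bends the result, matching the five operations listed above under the assumed form.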

[0066] (2) Attention Network

[0067] Attention networks are network models that utilize attention mechanisms to improve model training speed. Currently, typical attention networks include the Transformer network. Models applying attention mechanisms can assign different weights to each part of the input sequence, thereby extracting more important feature information from the input sequence and resulting in a more accurate output.

[0068] In deep learning, attention mechanisms can be implemented using weight vectors that describe importance: when predicting or inferring an element, the weight vectors determine the correlation between that element and other elements. For example, for a pixel in an image or a word in a sentence, attention vectors can be used to quantitatively estimate the correlation between the target element and other elements, and the weighted sum of the attention vectors serves as an approximation of the target value.
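As a hedged illustration of weights that describe importance, the sketch below uses scaled dot-product attention with a softmax, which is the formulation commonly associated with Transformer networks. The source does not state which attention variant is used, so this choice is an assumption, and the names are hypothetical.

```python
import math

def attention(query, keys, values):
    """Return (weights, output) for one query over a set of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted sum of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

weights, output = attention(query, keys, values)
```

The first key matches the query more closely, so it receives the larger weight and the output lies nearer to the first value.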

[0069] The attention mechanism in deep learning simulates the human brain's attention mechanism. For example, when a human views a painting, although the eyes can see the entire picture, upon closer inspection, the eyes actually focus on only a portion of the image. At this point, the brain primarily focuses on this smaller area. In other words, when humans carefully observe an image, the brain's attention to the entire image is not uniform but rather differentiated by weight; this is the core idea of the attention mechanism.

[0070] In simple terms, the human visual processing system often selectively focuses on certain parts of an image while ignoring other irrelevant information, thus aiding the brain's perception. Similarly, in the attention mechanism of deep learning, in problems involving language, speech, or vision, certain parts of the input may be more relevant than others. Therefore, through the attention mechanism in attention models, different processing can be applied to different parts of the input data, allowing the attention model to dynamically focus only on task-relevant data.

[0071] (3) Loss Function

[0072] During neural network training, the output of the network should be as close as possible to the desired target value. To achieve this, the network's current prediction is compared with the target value, and the weight matrix of each layer is updated based on the difference between the two (there is usually an initialization step before the first update, in which parameters are pre-configured for each layer). For example, if the network's prediction is too high, the weight matrices are adjusted to lower it, and the adjustment continues until the network accurately predicts the target value. It is therefore necessary to predefine how the difference between the predicted value and the target value is measured. This is the role of the loss function or objective function, which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a greater difference, so training the neural network becomes a process of reducing this loss as much as possible.
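A minimal sketch of one possible loss function is shown below. Mean squared error is chosen only for illustration; the text does not say which loss function is used, and the name `mse_loss` and the sample values are hypothetical.

```python
def mse_loss(predictions, targets):
    """Mean squared error: larger values mean predictions are further from targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = [1.0, 0.0, 2.0]
close = mse_loss([0.9, 0.1, 2.1], targets)   # predictions near the targets
far = mse_loss([3.0, -2.0, 0.0], targets)    # predictions far from the targets
print(close, far)
```

Training then tries to drive this value down by adjusting the weights.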

[0073] (4) Backpropagation algorithm

[0074] During the training of a neural network, the backpropagation (BP) algorithm can be used to correct the parameters in the initial neural network model, thereby reducing the reconstruction error loss of the model. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters in the initial neural network model are updated by propagating this error loss information backward, so that the error loss converges. The backpropagation algorithm is a backward pass driven by the error loss, aimed at obtaining the optimal parameters of the neural network model, such as the weight matrices.
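The forward and backward passes can be sketched for a deliberately tiny network. The example assumes one tanh hidden unit, a linear output, and a squared-error loss; none of these choices come from the text, and all names and values are hypothetical. The analytic gradients are checked against finite differences.

```python
import math


def forward(params, x):
    """Forward propagation: input -> hidden activation -> output."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y


def loss(params, x, target):
    _, y = forward(params, x)
    return (y - target) ** 2


def backward(params, x, target):
    """Propagate the error loss back to every parameter using the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = 2.0 * (y - target)      # dL/dy at the output
    dw2, db2 = dy * h, dy        # output-layer gradients
    dh = dy * w2                 # error passed back to the hidden unit
    dz = dh * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dw1, db1 = dz * x, dz        # hidden-layer gradients
    return [dw1, db1, dw2, db2]


def numeric_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads


params = [0.5, -0.2, 1.5, 0.1]
analytic = backward(params, x=0.8, target=1.0)
numeric = numeric_gradient(params, x=0.8, target=1.0)
print(analytic)
```

The gradients returned by `backward` are what an optimizer such as gradient descent would use to update the parameters.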

[0075] (5) Gradient descent method

[0076] Gradient descent is a first-order optimization algorithm commonly used in machine learning to iteratively approach a model with minimum deviation. To find a local minimum of a function with gradient descent, the search iteratively moves from the current point by a specified step size in the direction opposite to the gradient (or approximate gradient) of the function at that point. Gradient descent is one of the most commonly used methods for solving for the model parameters of machine learning algorithms, i.e., for unconstrained optimization problems.

[0077] Specifically, when finding the minimum value of the loss function, gradient descent can be used to iteratively solve the problem step by step, obtaining the minimized loss function value and the predicted model parameter values. Conversely, if we need to find the maximum value of the loss function, we need to use gradient ascent iteratively.
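The iteration can be sketched on a one-dimensional function. The step size, iteration count, and the quadratic test functions below are assumptions chosen for illustration; gradient ascent is expressed as descent on the negated gradient.

```python
def gradient_descent(grad, start, step=0.1, iterations=100):
    """Repeatedly move against the gradient by a fixed step size."""
    x = start
    for _ in range(iterations):
        x = x - step * grad(x)
    return x


def gradient_ascent(grad, start, step=0.1, iterations=100):
    """Maximize by moving along the gradient (descent on the negated gradient)."""
    return gradient_descent(lambda x: -grad(x), start, step, iterations)


# f(x) = (x - 3)^2 + 1 has gradient 2(x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), start=0.0)

# g(x) = -(x + 1)^2 has gradient -2(x + 1) and its maximum at x = -1.
maximum = gradient_ascent(lambda x: -2.0 * (x + 1.0), start=4.0)
print(minimum, maximum)
```

In neural network training, `x` would be the full set of weights and `grad` would come from backpropagation of the loss.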

[0078] (6) Point cloud data

[0079] Point cloud data refers to a collection of vectors in a three-dimensional coordinate system. Typically, point cloud data is obtained by scanning the environment using a laser scanner (such as LiDAR). Furthermore, point cloud data is recorded in the form of points, each containing three-dimensional coordinates. In some cases, the points in the point cloud data may contain color information or reflection intensity information. The reflection intensity information refers to the echo intensity collected by the laser scanner's receiving device; this reflection intensity information is related to the target's surface material, roughness, incident angle, and the instrument's emission energy and laser wavelength.
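One way to represent such records is sketched below, with mandatory three-dimensional coordinates and optional reflection intensity and color. The class name `Point`, the field names, and the helper `bounding_box` are hypothetical and not taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Point:
    """One point of a point cloud: coordinates plus optional attributes."""
    x: float
    y: float
    z: float
    intensity: Optional[float] = None            # echo intensity, if recorded
    color: Optional[Tuple[int, int, int]] = None  # RGB color, if recorded


def bounding_box(cloud: List[Point]):
    """Return the (min corner, max corner) of the axis-aligned box around the cloud."""
    xs = [p.x for p in cloud]
    ys = [p.y for p in cloud]
    zs = [p.z for p in cloud]
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


cloud = [
    Point(1.0, 2.0, 0.5, intensity=0.8),
    Point(-1.0, 0.0, 1.5),
    Point(0.0, 4.0, 0.0, color=(255, 0, 0)),
]
lo, hi = bounding_box(cloud)
print(lo, hi)
```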

[0080] (7) LiDAR

[0081] LiDAR (Light Detection and Ranging) is a radar system that uses emitted laser beams to detect the position, velocity, and other characteristics of targets. The working principle of LiDAR is to emit a detection signal (laser beam) towards the target, then compare and process the received signal reflected back from the target (target echo) with the emitted signal to obtain relevant information about the target, such as its distance, azimuth, altitude, velocity, attitude, and even shape. This allows for the detection of targets on the ground, such as roads or obstacles. Generally, LiDAR consists of a laser transmitter, an optical receiver, and an information processing system. The laser transmitter converts electrical pulses into light pulses and emits them. The optical receiver then converts the light pulses reflected from the target back into electrical pulses, which are then sent to the information processing system for processing.
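As a rough illustration of how comparing the emitted and received signals yields a distance, the sketch below assumes a pulsed time-of-flight measurement, which is one common ranging principle but is not stated in the text. The function names and the axis convention (azimuth measured in the horizontal plane, elevation above it) are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def range_from_round_trip(delay_s):
    """Distance to the target from the emit-to-echo delay (light travels out and back)."""
    return SPEED_OF_LIGHT * delay_s / 2.0


def to_cartesian(distance, azimuth_deg, elevation_deg):
    """Convert one range/angle measurement into an (x, y, z) point."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    horizontal = distance * math.cos(el)
    return (horizontal * math.cos(az), horizontal * math.sin(az), distance * math.sin(el))


d = range_from_round_trip(200e-9)  # an echo received 200 ns after emission
point = to_cartesian(d, azimuth_deg=90.0, elevation_deg=0.0)
print(d, point)
```

Repeating this conversion for every beam direction produces the points that make up the point cloud data.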

[0082] (8) Encoder

[0083] In the embodiments of this application, the encoder is essentially a neural network model that can convert data such as text or images into vectors in an encoding space (that is, convert text or images into text features or image features).

[0084] The multimodal data processing method provided in this embodiment can be applied to electronic devices that need to fuse image and point cloud data, such as autonomous vehicles in the field of autonomous driving, robots with situation and environment detection capabilities, or servers that can acquire image and point cloud data to be fused.

[0085] Specifically, autonomous vehicles are equipped with image acquisition devices and LiDAR. During operation, the image acquisition devices capture images of the road environment, while the LiDAR collects point cloud data of the surrounding environment. This allows the autonomous vehicle to determine the driving environment and, consequently, its driving strategy. For example, refer to Figure 1A, which is a schematic diagram of an autonomous vehicle driving on a road provided in an embodiment of this application.

[0086] Examples of robots include inspection robots in the power transmission industry (such as inspection robots inside substations), handling or inspection robots in the coal mining industry, and handling or underground exploration robots in the logistics industry. Similarly, these robots are equipped with image acquisition devices and LiDAR, enabling them to capture environmental images during movement and to collect point cloud data of the surrounding environment using LiDAR. This allows the robot to perform obstacle avoidance and other driving maneuvers based on the environmental images and point cloud data. For example, refer to Figure 1B, which is a schematic diagram of an inspection robot for the power transmission industry provided in an embodiment of this application.

[0087] For example, a server can serve as a centralized processing server in inspection scenarios, acquiring and processing image and point cloud data from different devices. In the power transmission industry, for instance, cameras fixed to transmission towers or ground-based inspection robots can acquire images of the towers and transmit them to the server; similarly, drones equipped with LiDAR can collect point cloud data of the surrounding environment and transmit it to the server. In this case, the server can determine whether the transmission tower is operating normally and unaffected by foreign objects by fusing and processing the acquired images and point cloud data.

[0088] To facilitate understanding of this solution, the structure of the vehicle provided in the embodiments of this application is described below with reference to Figure 2. Figure 2 is a schematic diagram of a vehicle 100 provided in an embodiment of this application.

[0089] As shown in Figure 2, in one embodiment, vehicle 100 can be configured in a fully or partially autonomous driving mode. For example, vehicle 100 can control itself while in autonomous driving mode, and can determine the current state of the vehicle and its surrounding environment through human intervention, determine the possible behaviors of at least one other vehicle in the surrounding environment, determine the confidence level corresponding to the probability of that other vehicle performing the possible behavior, and control vehicle 100 based on the determined information. When vehicle 100 is in autonomous driving mode, vehicle 100 can be set to operate without human interaction.

[0090] Vehicle 100 may include various subsystems, such as a propulsion system 102, a sensor system 104, a control system 106, one or more peripheral devices 108, a power source 110, a computer system 101, and a user interface 116. Optionally, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple components, such as multiple electronic control units (ECUs) per subsystem. Furthermore, each subsystem and component of vehicle 100 may be interconnected via wired or wireless means.

[0091] The propulsion system 102 may include components that provide powered motion to the vehicle 100. In one embodiment, the propulsion system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or a combination of engine types, such as a hybrid engine consisting of a gasoline engine and an electric motor, or a hybrid engine consisting of an internal combustion engine and an air compression engine. The engine 118 converts the energy source 119 into mechanical energy.

[0092] Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electricity. Energy source 119 may also provide energy to other systems of vehicle 100.

[0093] The transmission 120 can transmit mechanical power from the engine 118 to the wheels 121. The transmission 120 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 120 may also include other components, such as a clutch. The drive shaft may include one or more axles that can be coupled to one or more wheels 121.

[0094] Sensor system 104 may include several sensors that sense information about the environment surrounding vehicle 100. For example, sensor system 104 may include a positioning system 122 (which may be a GPS system, a BeiDou system, or another positioning system), an inertial measurement unit (IMU) 124, a radar 126, a laser rangefinder 128, and a camera 130. Sensor system 104 may also include sensors that monitor the internal systems of vehicle 100 (e.g., an in-vehicle air quality monitor, fuel gauge, oil temperature gauge, etc.). Sensor data from one or more of these sensors can be used to detect objects and their corresponding characteristics (position, shape, orientation, speed, etc.). This detection and identification is a critical function for the safe operation of vehicle 100.

[0095] The positioning system 122 can be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense changes in the position and orientation of the vehicle 100 based on inertial acceleration. In one embodiment, the IMU 124 can be a combination of an accelerometer and a gyroscope.

[0096] Radar 126 can use radio signals to sense objects in the surrounding environment of vehicle 100. In some embodiments, in addition to sensing objects, radar 126 can also be used to sense the speed and/or direction of travel of objects.

[0097] The laser rangefinder 128 can use lasers to sense objects in the environment in which the vehicle 100 is located. In some embodiments, the laser rangefinder 128 may include one or more laser sources, a laser scanner, and one or more processing modules, as well as other system components.

[0098] Camera 130 can be used to capture multiple images of the surrounding environment of vehicle 100. Camera 130 can be a still camera or a video camera.

[0099] The control system 106 controls the operation of the vehicle 100 and its components. The control system 106 may include various elements, including a steering system 132, a throttle 134, a braking unit 136, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.

[0100] The steering system 132 is operable to adjust the forward direction of the vehicle 100. For example, in one embodiment, it may be a steering wheel system.

[0101] Throttle 134 is used to control the operating speed of engine 118 and thus the speed of vehicle 100.

[0102] Braking unit 136 is used to control the deceleration of vehicle 100. Braking unit 136 can use friction to slow down wheel 121. In other embodiments, braking unit 136 can convert the kinetic energy of wheel 121 into electric current. Braking unit 136 may also take other forms to slow down the rotational speed of wheel 121 to control the speed of vehicle 100.

[0103] The computer vision system 140 is operable to process and analyze images captured by the camera 130 to identify objects and/or features in the environment surrounding the vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. The computer vision system 140 may use object recognition algorithms, structure from motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map the environment, track objects, estimate object velocities, and so on.

[0104] The route control system 142 is used to determine the driving route of the vehicle 100. In some embodiments, the route control system 142 may combine data from the positioning system 122 and one or more predetermined maps to determine the driving route of the vehicle 100.

[0105] The obstacle avoidance system 144 is used to identify, assess and avoid or otherwise traverse potential obstacles in the environment of the vehicle 100.

[0106] Of course, in one example, the control system 106 may additionally or alternatively include components other than those shown and described, or may omit some of the components shown above.

[0107] Vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users via peripheral devices 108. Peripheral devices 108 may include a wireless communication system 146, an on-board computer 148, a microphone 150, and/or a speaker 152.

[0108] In some embodiments, peripheral device 108 provides a means for a user of vehicle 100 to interact with user interface 116. For example, on-board computer 148 may provide information to a user of vehicle 100. User interface 116 may also operate on-board computer 148 to receive user input. On-board computer 148 may be operated via a touchscreen. In other cases, peripheral device 108 may provide a means for vehicle 100 to communicate with other devices located within the vehicle. For example, microphone 150 may receive audio (e.g., voice commands or other audio input) from a user of vehicle 100. Similarly, speaker 152 may output audio to a user of vehicle 100.

[0109] The wireless communication system 146 can communicate wirelessly with one or more devices directly or via a communication network. For example, the wireless communication system 146 can use 3G cellular communication, such as CDMA, EVDO, or GSM/GPRS, 4G cellular communication, such as LTE, or 5G cellular communication. The wireless communication system 146 can communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 146 can communicate directly with devices using an infrared link, Bluetooth, or ZigBee. Other wireless protocols, such as various vehicle communication systems, are also possible. For example, the wireless communication system 146 may include one or more dedicated short-range communications (DSRC) devices that can enable public and/or private data communication between vehicles and/or roadside stations.

[0110] Power source 110 can provide power to various components of vehicle 100. In one embodiment, power source 110 can be a rechargeable lithium-ion or lead-acid battery. One or more such battery packs can be configured to provide power to various components of vehicle 100. In some embodiments, power source 110 and energy source 119 can be implemented together, as is the case in some fully electric vehicles.

[0111] Some or all of the functions of vehicle 100 are controlled by computer system 101. Computer system 101 may include at least one processor 113, which executes instructions 115 stored in a non-transitory computer-readable medium such as data storage device 114. Computer system 101 may also be multiple computing devices that control individual components or subsystems of vehicle 100 in a distributed manner.

[0112] Processor 113 can be any conventional processor, such as a commercially available CPU. Alternatively, the processor can be a special-purpose device such as an ASIC or other hardware-based processor. Although Figure 2 functionally illustrates the processor, memory, and other components of computer system 101 in the same block, those skilled in the art will understand that the processor, computer, or memory may actually include multiple processors, computers, or memories that may or may not be housed in the same physical enclosure.

[0113] For example, the memory can be a hard disk drive or other storage medium located in a housing different from that of computer system 101. Therefore, references to a processor or computer will be understood to include references to a collection of processors, computers, or memories that may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering assembly and deceleration assembly, may each have their own processor that performs only the calculations related to that component's specific function.

[0114] In the various aspects described herein, the processor may be located remotely from the vehicle and communicate wirelessly with the vehicle. In other aspects, some of the processes described herein are executed on a processor located within the vehicle, while others are executed by a remote processor, including taking the necessary steps to perform a single operation.

[0115] In some embodiments, memory 114 may contain instructions 115 (e.g., program logic) that can be executed by processor 113 to perform various functions of vehicle 100, including those described above. Memory 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the propulsion system 102, sensor system 104, control system 106, and peripheral devices 108.

[0116] In addition to instructions 115, memory 114 may also store data such as road maps, route information, the vehicle's position, direction, and speed, and other such vehicle data, as well as other information. This information can be used by vehicle 100 and computer system 101 during operation of vehicle 100 in autonomous, semi-autonomous, and/or manual modes.

[0117] User interface 116 is used to provide information to or receive information from users of vehicle 100. Optionally, user interface 116 may include one or more input/output devices within the set of peripheral devices 108, such as wireless communication system 146, on-board computer 148, microphone 150, and speaker 152.

[0118] Computer system 101 can control the functions of vehicle 100 based on input received from various subsystems (e.g., propulsion system 102, sensor system 104, and control system 106) and from user interface 116. For example, computer system 101 can use input from control system 106 to control steering system 132 to avoid obstacles detected by sensor system 104 and obstacle avoidance system 144. In some embodiments, computer system 101 is operable to provide control over many aspects of vehicle 100 and its subsystems.

[0119] Alternatively, one or more of these components may be installed separately from or associated with vehicle 100. For example, memory 114 may exist partially or completely separately from vehicle 100. The components may be communicatively coupled together in a wired and/or wireless manner.

[0120] Optionally, the components described above are merely examples. In actual applications, components in each of the above modules may be added or removed as needed, and Figure 2 should not be construed as a limitation on the embodiments of this application.

[0121] Autonomous vehicles traveling on roads, such as vehicle 100 above, can identify objects in their surrounding environment to determine adjustments to their current speed. These objects can be other vehicles, traffic control equipment, or other types of objects. In some examples, each identified object can be considered independently, and based on the object's individual characteristics, such as its current speed, acceleration, and distance from the vehicle, the speed adjustment to be made by the autonomous vehicle can be determined.

[0122] Optionally, vehicle 100 or computing devices associated with vehicle 100 (such as computer system 101, computer vision system 140, and memory 114 shown in Figure 2) can predict the behavior of the identified objects based on the characteristics of the identified objects and the state of the surrounding environment (e.g., traffic, rain, ice on the road, etc.). Optionally, because the identified objects depend on one another's behavior, all identified objects can also be considered together to predict the behavior of a single identified object. Vehicle 100 can adjust its speed based on the predicted behavior of the identified objects.

[0123] In other words, autonomous vehicles can determine what steady state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of objects. In this process, other factors can also be considered in determining the vehicle's speed, such as its lateral position on the road, the road's curvature, and the proximity of static and dynamic objects.

[0124] In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device can also provide instructions to modify the steering angle of the vehicle 100 so that the autonomous vehicle follows a given trajectory and/or maintains a safe lateral and longitudinal distance from objects near the autonomous vehicle (e.g., cars in adjacent lanes on the road).

[0125] The aforementioned vehicle 100 can be a car, truck, motorcycle, bus, recreational vehicle, amusement park vehicle, construction equipment, tram, golf cart, or the like, and the embodiments of this application impose no particular limitation in this regard.

[0126] The vehicle 100 shown in Figure 2 may be equipped with an advanced driver assistance system (ADAS) for achieving autonomous driving functions. This ADAS contains numerous parameters that require calibration. Specifically, the calibration process for the vehicle's ADAS mainly includes the calibration of parameters for each subsystem in the execution layer, perception layer, and functional layer. The execution layer involves calibration of the powertrain, braking system, steering system, four-wheel alignment parameters, and suspension system. The perception layer involves calibration of the GNSS (global navigation satellite system) and INS (inertial navigation system), camera calibration, LiDAR calibration, millimeter-wave radar calibration, and ultrasonic radar calibration. GNSS includes GPS (Global Positioning System), GLONASS (Global Navigation Satellite System), Galileo (Galileo navigation satellite system), and BDS (BeiDou Navigation Satellite System). The functional layer involves calibration of the vehicle's longitudinal control module and lateral control module, basic ADAS function calibration, and ADAS driving style calibration. Longitudinal control primarily controls speed, which is achieved by controlling the brakes, accelerator, and gear shift. Lateral control primarily controls heading, by adjusting steering wheel torque or angle to guide the vehicle in the desired direction. Basic ADAS functions include ACC (Adaptive Cruise Control), LCC (Lane Center Control), and ALC (Auto Lane Change). Driving style refers to a driver's manner or habitual way of driving, including the choice of driving speed and following distance. Driving styles include aggressive, moderate, and cautious driving styles.

[0127] Please refer to Figure 3, which is a schematic diagram of the structure of a computer system 101 in a vehicle provided in an embodiment of this application. The computer system 101 includes a processor 103, which is coupled to a system bus 105. The processor 103 can be used to implement... Figure 2. The processor 103 can be one or more processors, each of which can include one or more processor cores. A video adapter 107 drives a display 109, which is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus via a bus bridge 111. An I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with various I/O devices, such as input devices 117 (e.g., keyboard, mouse, touchscreen), a media tray 121 (e.g., CD-ROM, multimedia interface), a transceiver 123 (capable of sending and/or receiving radio communication signals), a camera 155 (capable of capturing dynamic digital video images), and an external USB port 125. Optionally, the interface connected to the I/O interface 115 can be a USB port.

[0128] The processor 103 can be any conventional processor, including a Reduced Instruction Set Computing (“RISC”) processor, a Complex Instruction Set Computing (“CISC”) processor, or a combination thereof. Optionally, the processor can be a special-purpose device such as an Application-Specific Integrated Circuit (“ASIC”). Optionally, the processor 103 can be a neural network processor or a combination of a neural network processor and the conventional processors described above.

[0129] Optionally, in the various embodiments described herein, the computer system 101 may be located remotely from the autonomous vehicle and may communicate wirelessly with the autonomous vehicle. In other aspects, some of the processes described herein are executed on a processor located within the autonomous vehicle, while others are executed by a remote processor, including taking actions necessary to perform a single manipulation.

[0130] Computer system 101 can communicate with software deployment server 149 via network interface 129. Network interface 129 is a hardware network interface, such as a network interface card (NIC). Network 127 can be an external network, such as the Internet, or an internal network, such as Ethernet or a Virtual Private Network (VPN). Optionally, network 127 can also be a wireless network, such as a WiFi network or a cellular network.

[0131] The hard disk drive interface is coupled to the system bus 105 and is connected to the hard disk drive. The system memory 135 is coupled to the system bus 105. The data running in the system memory 135 may include the operating system 137 and applications 143 of the computer system 101.

[0132] The operating system consists of a shell 139 and a kernel 141. The shell 139 is an interface between the user and the operating system kernel. The shell is the outermost layer of the operating system. It manages the interaction between the user and the operating system: waiting for user input, interpreting user input for the operating system, and handling various types of operating system output.

[0133] The kernel 141 consists of the parts of the operating system used to manage memory, files, peripherals, and system resources. Interacting directly with the hardware, the operating system kernel typically runs processes and provides inter-process communication, CPU time-slice management, interrupts, memory management, I / O management, and so on.

[0134] Application 143 includes a road surface detection program 147 and a program related to controlling the vehicle's autonomous driving. The road surface detection program 147 processes the capacitance signals transmitted by the processing module in the system. By executing the road surface detection program 147, the computer system 101 can implement the road surface detection function.

[0135] The programs controlling autonomous driving may include, for example, programs that manage the interaction between the autonomous vehicle and obstacles on the road, programs that control the autonomous vehicle's route or speed, and programs that control the interaction between the autonomous vehicle and other autonomous vehicles on the road. Application 143 also exists on the system of software deployment server 149.

[0136] Sensor 153 is associated with computer system 101 and is used to detect the environment around computer system 101. For example, sensor 153 can detect animals, cars, obstacles, and pedestrian crossings. Furthermore, the sensor can also detect the environment around such animals, cars, obstacles, and pedestrian crossings, for example, other animals near an animal, weather conditions, and ambient light levels.

[0137] Please refer to Figure 4, which is a schematic structural diagram of a robot 400 provided in an embodiment of this application. As shown in Figure 4, the robot 400 may include an image acquisition module 401, a sensor 402, a processor 403, a memory 404, and a communication module 405.

[0138] The processor 403 may include one or more processing units, such as an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural network processing unit (NPU). These different processing units may be independent devices or integrated into one or more processors.

[0139] The controller can serve as the nerve center and command center of the robot 400. It can generate operational control signals based on instruction opcodes and timing signals to control the fetching and execution of instructions.

[0140] The processor 403 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor is a cache memory. This memory can store instructions or data that the processor has just used or that are used repeatedly. If the processor needs to use the instruction or data again, it can retrieve it directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 403, and thus improves the efficiency of the system.

[0141] In some embodiments, the processor 403 may include one or more interfaces. Interfaces may include an inter-integrated circuit (I2C) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.

[0142] The I2C interface is a bidirectional synchronous serial bus, including a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor may include multiple I2C buses. The processor can be coupled to sensors, cameras, etc., through different I2C bus interfaces.

[0143] The UART interface is a universal serial data bus used for asynchronous communication. This bus can be a bidirectional communication bus. It converts the data to be transmitted between serial and parallel communication. In some embodiments, the UART interface is typically used to connect the processor 403 and the communication module 405. For example, the processor 403 communicates with the Bluetooth module in the wireless communication module via the UART interface to implement Bluetooth functionality.

[0144] The MIPI interface can be used to connect the processor to peripheral devices such as cameras. The MIPI interface includes the camera serial interface (CSI). In some embodiments, the processor and camera communicate via the CSI interface to enable the robot 400 to capture images.

[0145] The GPIO interface is configurable via software. It can be configured as a control signal or a data signal. In some embodiments, the GPIO interface can be used to connect a processor to a camera, wireless communication module, sensor module, etc. The GPIO interface can also be configured as an I2C interface, UART interface, MIPI interface, etc.

[0146] The USB interface conforms to the USB standard specification and can be a Mini USB interface, Micro USB interface, USB Type-C interface, etc. The USB interface can be used to connect a charger to charge the robot 400, and it can also be used for data transfer between the robot 400 and peripheral devices. This interface can also be used to connect other robots 400.

[0147] The image acquisition module 401 can acquire image information around the robot 400, such as taking photos or videos. The robot 400 can achieve image acquisition through an ISP, camera, video codec, GPU, and application processor.

[0148] An ISP (Image Signal Processor) processes data fed back from the camera. For example, when taking a picture, the shutter is opened, and light is transmitted through the lens to the camera's image sensor. The light signal is converted into an electrical signal, which is then transmitted to the ISP for processing, transforming it into a visible image. The ISP can also perform algorithmic optimizations on image noise, brightness, and other parameters. It can also optimize parameters such as exposure and color temperature for the shooting scene. In some embodiments, the ISP can be integrated into the camera itself.

[0149] A camera is used to capture still images or videos. An object generates an optical image through the lens, and the optical image is projected onto a photosensitive element. The photosensitive element can be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal, which is then passed to the ISP (image signal processor) for conversion into a digital image signal. The ISP outputs the digital image signal to the DSP (digital signal processor) for processing. The DSP converts the digital image signal into image signals in standard formats such as RGB and YUV. In some embodiments, the robot 400 may include one or N cameras, where N is a positive integer greater than 1.

[0150] Sensor 402 can acquire information such as the robot 400's moving speed, moving direction, and distance to surrounding objects. For example, sensor 402 may include a gyroscope sensor, a speed sensor, an accelerometer, a distance sensor, etc.

[0151] The gyroscope sensor can be used to determine the motion posture of the robot 400. In some embodiments, the gyroscope sensor can determine the angular velocity of the robot 400 around three axes (i.e., the x, y, and z axes). The gyroscope sensor can be used for image stabilization. For example, when the robot 400 is acquiring images, the gyroscope sensor detects the angle of the robot 400's shaking, calculates the distance that the lens module needs to compensate based on the angle, and allows the lens to counteract the shaking of the robot 400 through reverse movement, thus achieving image stabilization. The gyroscope sensor can also be used for navigation or calculating the unevenness of the road surface, determining whether the robot 400 is stuck, etc.

[0152] A speed sensor is used to measure movement speed. In some embodiments, the robot 400 measures its current movement speed using the speed sensor, and can combine this with a distance sensor to predict the environment in which the robot 400 will be in the next moment based on the current environment.

[0153] The accelerometer can detect the magnitude of the robot 400's acceleration in various directions (typically three axes). When the robot 400 is stationary, the magnitude and direction of gravity can be detected.

[0154] A distance sensor is used to measure distance. The robot 400 can measure distance using infrared or laser. In some embodiments, during scene capture, the robot 400 can utilize the distance sensor to measure distance for rapid focusing.

[0155] The memory 404 may include external memory and internal memory. The external memory interface can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the robot 400. The external memory card communicates with the processor through the external memory interface to perform data storage functions. For example, sample information files can be stored on the external memory card.

[0156] Internal memory can be used to store computer executable program code, which includes instructions. The processor executes various functional applications and data processing of the robot 400 by running the instructions stored in the internal memory. The internal memory may include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function. The data storage area may store data created during the use of the robot 400. Furthermore, the internal memory may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory, universal flash storage (UFS), etc.

[0157] The wireless communication function of robot 400 can be implemented through communication module 405. For example, through communication module 405, robot 400 can communicate with other devices, such as with a server. As an example, communication module 405 may include antenna 1, antenna 2, mobile communication module, wireless communication module, modem processor, and baseband processor, etc.

[0158] Antennas 1 and 2 are used to transmit and receive electromagnetic wave signals. Each antenna in robot 400 can be used to cover one or more communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example, antenna 1 can be reused as a diversity antenna for a wireless local area network. In some other embodiments, the antennas can be used in conjunction with tuning switches.

[0159] The mobile communication module can provide wireless communication solutions, including 2G / 3G / 4G / 5G, for applications on the robot 400. The mobile communication module may include at least one filter, switch, power amplifier, low-noise amplifier (LNA), etc. The mobile communication module can receive electromagnetic waves via antenna 1, and perform filtering, amplification, and other processing on the received electromagnetic waves before transmitting them to a modem processor for demodulation. The mobile communication module can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation via antenna 1. In some embodiments, at least some functional modules of the mobile communication module may be housed within the processor. In some embodiments, at least some functional modules of the mobile communication module and at least some modules of the processor may be housed in the same device.

[0160] The wireless communication module can provide wireless communication solutions for applications on the robot 400, including wireless local area network (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), and infrared (IR) technologies. The wireless communication module can be one or more devices integrating at least one communication processing module. The wireless communication module receives electromagnetic waves via antenna 2, frequency-modulates and filters the electromagnetic wave signals, and sends the processed signal to the processor. The wireless communication module can also receive signals to be transmitted from the processor, frequency-modulate and amplify them, and then convert them into electromagnetic waves for radiation via antenna 2.

[0161] In some embodiments, antenna 1 of robot 400 is coupled to a mobile communication module, and antenna 2 is coupled to a wireless communication module, enabling robot 400 to communicate with servers and other devices via wireless communication technology. The wireless communication technology may include Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time-Division Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies. The GNSS may include Global Positioning System (GPS), Global Navigation Satellite System (GLONASS), BeiDou Navigation Satellite System (BDS), Quasi-Zenith Satellite System (QZSS), and/or Satellite Based Augmentation Systems (SBAS).

[0162] It is understood that the structure illustrated in this embodiment does not constitute a specific limitation on robot 400. In other embodiments, robot 400 may include more or fewer components than illustrated, or combine some components, or separate some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

[0163] The above describes the scenarios and devices in which the multimodal data processing method provided in this application is applied. The following will describe in detail the multimodal data processing method provided in this application.

[0164] Please refer to Figure 5, which is a flowchart illustrating a method for processing multimodal data provided in an embodiment of this application. As shown in Figure 5, the method for processing multimodal data includes the following steps 501-504.

[0165] Step 501: Obtain the first image and point cloud data.

[0166] In this embodiment, the electronic device can acquire a first image and point cloud data that need to be fused. The first image and point cloud data can be obtained through different modules on the electronic device. For example, if the electronic device is an autonomous vehicle, the first image can be captured by a camera on the autonomous vehicle, and the point cloud data can be obtained by a LiDAR scanner on the autonomous vehicle. The electronic device can also receive the first image and point cloud data from other devices. For example, if the electronic device is a server, the server can receive the first image transmitted by a smart camera and the point cloud data obtained by a drone through LiDAR scanning.

[0167] To ensure the fusion of the first image and point cloud data, both are acquired in the same scenario. For example, both the first image and point cloud data may be acquired in an autonomous driving scenario; or both may be acquired in an indoor robot driving environment. This embodiment does not specifically limit the acquisition scenario for the first image and point cloud data. Furthermore, in both autonomous driving and robot scenarios, the acquisition time points for the first image and point cloud data may be the same.

[0168] In this way, by acquiring the first image and point cloud data in the same scene, it can be ensured that the environment and objects corresponding to the first image and point cloud data are the same, that is, to ensure that the first image and point cloud data are different modal representations of the same object, thus ensuring the fusion of the first image and point cloud data.

[0169] Step 502: Convert the first image into an image feature sequence and convert the point cloud data into a point cloud feature sequence. Both the image feature sequence and the point cloud feature sequence include multiple vectors, and the vectors included in the image feature sequence and the point cloud feature sequence have the same dimension.

[0170] In this embodiment, converting the first image into an image feature sequence essentially represents the content of the first image as a feature sequence, while converting the point cloud data into a point cloud feature sequence represents the sparsely distributed points in the point cloud data as a feature sequence. Furthermore, both the resulting image feature sequence and point cloud feature sequence are composed of multiple vectors, and each vector in both sequences has the same dimension. In other words, this step converts data from different modalities (i.e., the first image and point cloud data) into the same sequence format, facilitating subsequent processing of the feature sequences corresponding to different modalities using the same network model.

[0171] The number of vectors included in the image feature sequence can be the same as or different from the number of vectors included in the point cloud feature sequence, depending on the amount of data of the first image and point cloud data actually acquired. This embodiment does not make a specific limitation on this.

[0172] Specifically, in this embodiment, an image feature mapping relationship and a point cloud feature mapping relationship may be pre-established. Based on the image feature mapping relationship, an image can be converted into a uniquely corresponding image feature sequence; based on the point cloud feature mapping relationship, point cloud data can be converted into a uniquely corresponding point cloud feature sequence.

[0173] Optionally, in the process of converting the first image into an image feature sequence, the first image can be first divided into multiple image blocks, and each image block in the multiple image blocks can be of the same size; then, each image block in the multiple image blocks is converted into a vector by a first feature converter, resulting in an image feature sequence composed of multiple vectors corresponding to the multiple image blocks. That is, the multiple vectors in the image feature sequence are obtained by converting each image block in the first image by the first feature converter, meaning that the number of multiple vectors in the image feature sequence is the same as the number of multiple image blocks obtained based on the division of the first image.

[0174] For example, when the size of the first image is 256*256 and the size of an image patch is 16*16, the first image can be divided into a 16*16 grid of image patches (256 patches in total), with each patch being 16*16 pixels. For the image patches obtained from the first image, a first feature converter can convert each patch into a corresponding vector in left-to-right, top-to-bottom order, resulting in 256 vectors. These 256 vectors are then arranged in the order of conversion to obtain the image feature sequence. The first feature converter can be, for example, a linear neural network capable of converting each image patch into a vector of a specific dimension.

[0175] By dividing the first image into multiple image blocks and converting each image block into a vector of a specific dimension based on the first feature converter, an image feature sequence composed of multiple vectors is obtained. This ensures that the same feature sequence can be obtained for images of different formats, thus ensuring the feasibility and compatibility of the solution.
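The block division and per-block conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the helper names `split_into_patches`, `linear_map`, and `image_to_sequence` are hypothetical, the image is treated as single-channel, the first feature converter is modeled as one linear layer with random weights, and the small sizes are chosen only to keep the example short.

```python
import random

def split_into_patches(image, patch):
    """Split an H x W image (a list of rows) into flattened patch vectors,
    ordered left to right, then top to bottom."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def linear_map(vector, weights, bias):
    """One linear layer: out[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
            for j in range(len(bias))]

def image_to_sequence(image, patch, weights, bias):
    """Convert every patch into one vector of the shared dimension."""
    return [linear_map(p, weights, bias)
            for p in split_into_patches(image, patch)]

# Hypothetical sizes: a 12 x 16 image, 4 x 4 patches, 8-dimensional vectors.
rng = random.Random(0)
PATCH, DIM = 4, 8
image = [[rng.random() for _ in range(16)] for _ in range(12)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
           for _ in range(PATCH * PATCH)]
bias = [0.0] * DIM

image_sequence = image_to_sequence(image, PATCH, weights, bias)
print(len(image_sequence), len(image_sequence[0]))  # 12 8
```

The sequence length equals the number of image blocks, and every vector has the same dimension regardless of the image size.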

[0176] Optionally, in the process of converting point cloud data into a point cloud feature sequence, multiple cubic spaces can first be obtained based on the point cloud data, where each cubic space includes one or more points from the point cloud data. Since point cloud data actually includes a large number of points in three-dimensional space, and each point has unique three-dimensional coordinates, the three-dimensional space in which the point cloud data resides can be divided into multiple smaller cubic spaces that are closely arranged. Each cubic space divided from the point cloud data can include one or more points. For example, the size of each cubic space obtained based on the point cloud data can be 0.1 m * 0.1 m * 0.1 m.

[0177] Then, based on the points included in each cube space, a second feature converter transforms each cube space into a vector, resulting in a point cloud feature sequence composed of multiple vectors corresponding to the cube spaces. The order of the vectors in the point cloud feature sequence is also related to the arrangement order of the multiple cube spaces. In other words, the multiple vectors in the point cloud feature sequence are obtained by transforming each cube space corresponding to the point cloud data through the second feature converter, and the values of the points distributed in the cube spaces are related to the values of the vectors ultimately obtained by the second feature converter.

[0178] The second feature converter can be, for example, a linear neural network, capable of converting the points in each cubic space into a vector of a specific dimension. By dividing the point cloud data into multiple cubic spaces and converting the points in each cubic space into vectors of a specific dimension using the second feature converter, a point cloud feature sequence composed of multiple vectors is obtained. This ensures that the same format of feature sequence can be obtained for point cloud data with different distributions, thus ensuring the feasibility and compatibility of the solution.
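The spatial division and per-cube conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, each cube is summarized by the mean offset of its points from the cube corner (the application does not specify how the points in a cube are combined), the second feature converter is modeled as one linear layer with random weights, and only non-empty cubes produce vectors.

```python
import math
import random

def group_points_by_cube(points, cube_size):
    """Map integer cube indices (ix, iy, iz) to the points inside that cube."""
    cubes = {}
    for point in points:
        key = tuple(math.floor(coord / cube_size) for coord in point)
        cubes.setdefault(key, []).append(point)
    return cubes

def cube_to_vector(key, members, cube_size, weights, bias):
    """Assumed pooling: mean offset of the points from the cube corner,
    followed by one linear layer to the shared dimension."""
    count = len(members)
    mean_offset = [sum(p[axis] for p in members) / count - key[axis] * cube_size
                   for axis in range(3)]
    return [sum(m * weights[i][j] for i, m in enumerate(mean_offset)) + bias[j]
            for j in range(len(bias))]

def point_cloud_to_sequence(points, cube_size, weights, bias):
    """Return the cube indices in a fixed order and one vector per cube."""
    cubes = group_points_by_cube(points, cube_size)
    keys = sorted(cubes)  # fixed arrangement order of the cubic spaces
    vectors = [cube_to_vector(k, cubes[k], cube_size, weights, bias)
               for k in keys]
    return keys, vectors

# Hypothetical data: three points, 0.1 m cubes, 8-dimensional vectors.
rng = random.Random(0)
DIM = 8
weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(3)]
bias = [0.0] * DIM
points = [(0.05, 0.05, 0.05), (0.07, 0.02, 0.01), (0.25, 0.05, 0.05)]

cube_keys, point_sequence = point_cloud_to_sequence(points, 0.1, weights, bias)
print(cube_keys)  # [(0, 0, 0), (2, 0, 0)]
```

Keeping the cube indices alongside the vectors preserves the link between each vector and its cubic space, which the later fusion steps rely on.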

[0179] It should be noted that the structure of the linear neural network used as the first feature converter can be different from the structure of the linear neural network used as the second feature converter. Furthermore, the first and second feature converters can be viewed as neural networks storing pre-established mapping relationships, capable of converting images to sequences and point cloud data to sequences. Since the first and second feature converters merely convert images or point cloud data into vectors of a specific dimension, the capacity of the linear neural networks used as the first or second feature converters is not large, and the computational overhead is also relatively small. That is, the computational overhead of converting the first image into an image feature sequence and the point cloud data into a point cloud feature sequence is small, and the computational latency is low.

[0180] Step 503: The image feature sequence and the point cloud feature sequence are processed by the feature extraction network to obtain the first feature sequence corresponding to the image feature sequence and the second feature sequence corresponding to the point cloud feature sequence.

[0181] In this embodiment, the image feature sequence is processed by a feature extraction network to obtain a first feature sequence, and the point cloud feature sequence is processed by the same feature extraction network to obtain a second feature sequence. That is, because both the image feature sequence and the point cloud feature sequence are composed of multiple vectors of the same dimension, a single feature extraction network can process the feature sequences corresponding to data of different modalities and output a feature sequence for each modality. Specifically, the feature extraction network extracts features from the input feature sequence to obtain the output feature sequence. In other words, after the first image has been converted into an image feature sequence, the feature extraction network extracts features from the image feature sequence to obtain the first feature sequence corresponding to the first image; after the point cloud data has been converted into a point cloud feature sequence, the feature extraction network extracts features from the point cloud feature sequence to obtain the second feature sequence corresponding to the point cloud data.

[0182] The feature extraction network can be an attention network, such as the Transformer model. Furthermore, the feature sequence output by the feature extraction network has the same length as the feature sequence input to the feature extraction network; that is, the image feature sequence has the same length as the first feature sequence, and the point cloud feature sequence has the same length as the second feature sequence.
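As a rough illustration of one network serving both modalities, the sketch below applies a single set of weights to two sequences of different lengths. It is a simplified stand-in and not the network of this application: the class name `SharedEncoder` is hypothetical, and it implements only one single-head self-attention layer with a residual connection, whereas a full Transformer adds multiple heads, feed-forward layers, and normalization.

```python
import math
import random

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

class SharedEncoder:
    """One single-head self-attention layer with a residual connection."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv = init(), init(), init()

    def __call__(self, sequence):
        q = matmul(sequence, self.wq)
        k = matmul(sequence, self.wk)
        v = matmul(sequence, self.wv)
        scale = math.sqrt(self.dim)
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
                  for qi in q]
        attended = matmul([softmax(r) for r in scores], v)
        # Residual connection; the output has one vector per input vector.
        return [[x + m for x, m in zip(x_row, m_row)]
                for x_row, m_row in zip(sequence, attended)]

# Hypothetical inputs: 5 image vectors and 3 point cloud vectors, same dimension.
rng = random.Random(1)
DIM = 4
image_sequence = [[rng.random() for _ in range(DIM)] for _ in range(5)]
point_sequence = [[rng.random() for _ in range(DIM)] for _ in range(3)]

encoder = SharedEncoder(DIM)                 # one set of weights
first_features = encoder(image_sequence)     # first feature sequence
second_features = encoder(point_sequence)    # second feature sequence
print(len(first_features), len(second_features))  # 5 3
```

Because the layer only requires that every vector has dimension `DIM`, the same weights accept sequences of any length, and each output sequence has the same length as its input.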

[0183] Step 504: The first feature sequence and the second feature sequence are fused to obtain fused features, which are used to perform environmental perception tasks.

[0184] After obtaining the first feature sequence corresponding to the first image and the second feature sequence corresponding to the point cloud data, the first and second feature sequences can be fused to obtain fused features. These fused features can be used by electronic devices to perform subsequent environmental perception tasks, such as object detection and semantic segmentation, ensuring that the electronic device can perceive its surrounding environment based on the first image and point cloud data. For example, in autonomous driving scenarios, autonomous vehicles can perform environmental perception tasks based on the fused features, thereby detecting various moving and stationary obstacles (such as vehicles, pedestrians, and buildings) and collecting various information on the road (such as drivable areas, lane lines, traffic signs, and traffic lights), ensuring that the autonomous vehicle can determine the appropriate driving strategy based on the perceived environmental information.

[0185] For example, please refer to Figure 6, which is a schematic diagram illustrating a process for fusing a first image and point cloud data in an autonomous driving scenario, as provided in an embodiment of this application. As shown in Figure 6, in an autonomous driving scenario, the autonomous vehicle acquires a first image via a camera and point cloud data via LiDAR scanning during its journey. Then, the autonomous vehicle performs feature transformation on the first image to obtain an image feature sequence; it also performs feature transformation on the point cloud data to obtain a point cloud feature sequence. Next, the autonomous vehicle inputs the image feature sequence and the point cloud feature sequence into a feature extraction network, which performs feature extraction processing on both sequences in parallel, resulting in a first feature sequence for the first image and a second feature sequence for the point cloud data. Finally, the autonomous vehicle fuses the first and second feature sequences to obtain a fused feature. Based on this fused feature, the autonomous vehicle can continue to perform subsequent target detection tasks, thereby detecting target objects in the surrounding environment during its journey, enabling it to determine its driving strategy based on the detected target objects.

[0186] Please refer to Figure 7, which is a schematic diagram illustrating feature transformation performed on a first image and point cloud data, as provided in an embodiment of this application. As shown in Figure 7, when the autonomous vehicle performs feature transformation on the first image, the first image is first divided into blocks to obtain a block division result, which includes multiple image blocks. Then, the first feature converter sequentially converts the image blocks in the first image into vectors, thereby obtaining an image feature sequence composed of multiple vectors.

[0187] When the autonomous vehicle performs feature transformation on the point cloud data, the three-dimensional space in which the point cloud data is located is first partitioned to obtain a spatial partitioning result, which includes multiple cubic spaces. Then, the second feature converter sequentially converts the cubic spaces in the spatial partitioning result into vectors, thereby obtaining a point cloud feature sequence composed of multiple vectors.

[0188] It should be noted that the above describes the process of converting the first image into an image feature sequence, further extracting the first feature sequence through a feature extraction network, and then fusing the first feature sequence with the second feature sequence corresponding to the point cloud data. In practical applications, autonomous vehicles and other devices may acquire multiple images simultaneously. Therefore, for each of the multiple images, the above steps can be used to perform feature conversion and feature extraction, and finally, the obtained feature sequence is fused with the feature sequence corresponding to the point cloud data.

[0189] In this solution, by converting data of different modalities (i.e., image and point cloud data) into feature sequences of the same format, and using the same feature extraction network to process the feature sequences corresponding to different modalities, it is possible to ensure that feature extraction of different modalities can be achieved based on a single feature extraction network. This eliminates the need to deploy multiple encoders corresponding to different modalities, saving device storage resources. Furthermore, processing the feature sequences corresponding to different modalities based on the same feature extraction network enables AI hardware to perform parallel processing of different modalities, improving data processing efficiency and effectively reducing the fusion time of different modalities.

[0190] Specifically, based on the characteristics of current AI hardware (such as graphics processing units (GPUs) or neural processing units (NPUs)), AI hardware can process multiple sets of input data in parallel when running the same neural network. This means it can simultaneously extract image features and point cloud data features, thereby improving data processing efficiency. However, in related technologies that use different encoders to process data of different modalities, parallel processing of different modalities is not possible, resulting in higher latency when processing different modalities.

[0191] Optionally, step 504 above may specifically include the following steps 5041-5043.

[0192] Step 5041: Based on the projection position of points in the point cloud data in the image space, the vectors in the second feature sequence are fused into the first feature sequence to obtain the first fused sequence, wherein the vectors in the second feature sequence have corresponding points in the point cloud data.

[0193] For example, see Figure 8, which is a schematic diagram illustrating the fusion processing of a first feature sequence and a second feature sequence, as provided in an embodiment of this application. As shown in Figure 8, step 5041 actually involves fusing the first feature sequence and the second feature sequence in the image space. Since each point in the point cloud data has a corresponding three-dimensional coordinate, based on the transformation relationship between the point cloud space and the image space, the points in the point cloud data can be projected into the image space, thereby obtaining the projection position of each point in the point cloud data in the image space, that is, the projection position of each point on the first image.

[0194] Furthermore, since each vector in the first feature sequence is obtained by performing feature transformation and feature extraction on an image patch in the first image, a corresponding image patch in the first image can be found for each vector in the first feature sequence. Similarly, since each vector in the second feature sequence is obtained by performing feature transformation and feature extraction on a cubic space, a corresponding cubic space in the three-dimensional space of the point cloud data can be found for each vector in the second feature sequence, thereby determining the points corresponding to each vector.

[0195] In this way, for each vector in the second feature sequence, a corresponding point can be found in the point cloud data; and for each vector in the first feature sequence, a corresponding image patch can be found in the first image. Therefore, after the projection position of a point in the point cloud data in the first image is determined, the vector corresponding to that point can be matched with the vector corresponding to the image patch in which the projection position is located. Thus, taking the first feature sequence as the basis, the corresponding vectors in the second feature sequence are fused into the first feature sequence, achieving the fusion of the first and second feature sequences in the image space.

[0196] For example, please refer to Figure 9, which is a schematic diagram illustrating the fusion of a first feature sequence and a second feature sequence in image space, as provided in an embodiment of this application. As shown in Figure 9, by projecting points from the point cloud data onto the image space, the projected positions of these points in the first image can be obtained. With the pre-obtained block division results of the first image and the spatial partitioning results of the point cloud data, the image block in which the projected position of a point lies within the first image can be determined. For example, from the projection result of the point cloud data onto the first image shown in Figure 9, the projection position of each point on the first image can be found, thereby determining the image block corresponding to that point in the image space.
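The projection of a point onto the first image and the lookup of its image block can be sketched as follows. This is an illustration under stated assumptions: the application only refers to a transformation relationship between the point cloud space and the image space, so the pinhole camera model, the intrinsic parameters `fx`, `fy`, `cx`, `cy`, and the helper names `project_point` and `patch_index` are all hypothetical, and the point is assumed to be expressed already in the camera coordinate frame.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (u, v).
    Returns None for points at or behind the camera plane."""
    x, y, z = point
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy

def patch_index(pixel, width, height, patch):
    """Row-major index of the image block containing the pixel,
    or None if the projection falls outside the image."""
    if pixel is None:
        return None
    u, v = pixel
    if not (0 <= u < width and 0 <= v < height):
        return None
    blocks_per_row = width // patch
    return int(v // patch) * blocks_per_row + int(u // patch)

# Hypothetical camera: 64 x 64 image, 16 x 16 blocks, focal length 32 pixels.
WIDTH = HEIGHT = 64
PATCH = 16
FX = FY = 32.0
CX = CY = 32.0

pixel = project_point((0.5, -0.25, 1.0), FX, FY, CX, CY)
print(pixel)                                     # (48.0, 24.0)
print(patch_index(pixel, WIDTH, HEIGHT, PATCH))  # 7
```

The block index uses the same left-to-right, top-to-bottom order as the image feature sequence, so it can be used directly as a position in the first feature sequence. Points that project outside the image have no corresponding block.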

[0197] Then, the first vector in the first feature sequence is adjusted to be the fusion result of the first vector and the second vector in the second feature sequence, where the projection position of the point corresponding to the second vector is located in the image patch corresponding to the first vector. In other words, the first vector in the first feature sequence and the second vector in the second feature sequence are vectors with a corresponding relationship, determined by projecting point cloud data onto the image space. Furthermore, the fusion result of the first vector and the second vector can be the sum of the first and second vectors. Alternatively, the fusion result of the first vector and the second vector can be the second vector itself; that is, adjusting the first vector is actually replacing the first vector with the second vector.

[0198] The above describes the process of adjusting vectors in the first feature sequence using the first and second vectors that have a corresponding relationship. In practical applications, all vectors in the first feature sequence that have a corresponding relationship with vectors in the second feature sequence can be adjusted in the same way, thereby achieving the adjustment of vectors in the first feature sequence. Thus, based on the first feature sequence, by adjusting the vectors in the first feature sequence that have a corresponding relationship with vectors in the second feature sequence, it is possible to adjust some vectors in the first feature sequence while keeping the length of the first feature sequence unchanged, thereby obtaining the first mixed sequence.
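The adjustment of corresponding vectors can be sketched as follows, covering both fusion options mentioned above (sum and replacement). This is a minimal illustration: the function name `fuse_in_image_space` and the list `patch_of_vector`, which gives for each vector of the second feature sequence the position of its corresponding vector in the first feature sequence (or `None` if there is none), are hypothetical. How several second vectors that fall into the same image patch are combined is not specified in the text; here sums accumulate and the last replacement wins, which is an assumption.

```python
def fuse_in_image_space(first_seq, second_seq, patch_of_vector, mode="sum"):
    """Return the first mixed sequence: a copy of first_seq in which every
    vector that has a corresponding second vector is adjusted."""
    mixed = [list(vector) for vector in first_seq]
    for index, patch in enumerate(patch_of_vector):
        if patch is None:  # no correspondence: first vector stays as-is
            continue
        if mode == "sum":
            mixed[patch] = [a + b
                            for a, b in zip(mixed[patch], second_seq[index])]
        elif mode == "replace":
            mixed[patch] = list(second_seq[index])
        else:
            raise ValueError("mode must be 'sum' or 'replace'")
    return mixed

first_seq = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
second_seq = [[10.0, 20.0], [30.0, 40.0]]
patch_of_vector = [2, None]  # second vector 0 projects into image patch 2

print(fuse_in_image_space(first_seq, second_seq, patch_of_vector, "sum"))
# [[1.0, 1.0], [2.0, 2.0], [13.0, 23.0]]
print(fuse_in_image_space(first_seq, second_seq, patch_of_vector, "replace"))
# [[1.0, 1.0], [2.0, 2.0], [10.0, 20.0]]
```

In both modes the mixed sequence keeps the length of the first feature sequence, and vectors without a correspondence are left unchanged.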

[0199] Furthermore, after obtaining the first mixed sequence, it can be input into a first attention network, such as a Transformer network. The first attention network further fuses the vectors in the first mixed sequence, enabling the point cloud features to interact with their neighboring image features. Finally, after inputting the first mixed sequence into the first attention network, the first fused sequence output by the first attention network can be obtained.
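To illustrate how an attention network lets every vector of the mixed sequence interact with the others, the following is a minimal single-head scaled dot-product self-attention in plain Python. It is a sketch only: the learned query, key, and value projections, multiple heads, residual connections, and normalization of a real Transformer network are omitted (identity projections are assumed), and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq: Sequence[Vector]) -> List[Vector]:
    """Each output vector is a weighted average of all input vectors.

    Assumption: queries, keys, and values are the input vectors themselves.
    """
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in seq
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * vec[i] for w, vec in zip(weights, seq)) for i in range(dim)]
        )
    return out
```

The output has the same length as the input, matching the description that the first fused sequence is obtained from the first mixed sequence without changing the number of vectors.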

[0200] In summary, by projecting points from point cloud data onto image space, the position of the points in image space can be determined, and then the vectors in the second feature sequence that correspond to the first feature sequence can be determined. By fusing the corresponding vectors based on the first feature sequence, the first and second feature sequences can be fused in image space, enabling point cloud features to interact with their neighboring image features and ensuring the smooth fusion of image and point cloud data.

[0201] Step 5042: Based on the mapping position of the image patch in the first image in the point cloud space, the vectors in the first feature sequence are fused into the second feature sequence to obtain the second fused sequence. The vectors in the first feature sequence have corresponding image patches in the first image.

[0202] For example, as shown in Figure 8, step 5042 actually involves fusing the first feature sequence and the second feature sequence in the point cloud space. Since each pixel in the image has a unique two-dimensional coordinate in the image space, the pixels in the image can be transformed into the point cloud space to obtain the three-dimensional coordinates of the pixels in the image in the point cloud space, thereby determining the cube space corresponding to the pixels in the image in the point cloud space. Since each image patch in the first image includes multiple pixels, the mapping position of the image patch in the point cloud space can be determined based on the three-dimensional coordinates of the multiple pixels in the image patch in the point cloud space.

[0203] For each vector in the second feature sequence, a corresponding point can be found in the point cloud data; for each vector in the first feature sequence, a corresponding image patch can also be found in the first image. Therefore, after determining the mapping position of the image patch in the first image within the point cloud space, we can determine the correspondence between the vector corresponding to the image patch and the vector corresponding to the mapping position in the cube space, based on the cube space where the image patch's mapping position is located. Thus, based on the second feature sequence, we can fuse the corresponding vectors in the second feature sequence and the first feature sequence, achieving the fusion of the first and second feature sequences in the point cloud space.

[0204] For example, refer to Figure 10, which is a schematic diagram illustrating the fusion of a first feature sequence and a second feature sequence in a point cloud space, as provided in an embodiment of this application. As shown in Figure 10, after mapping image patches from the first image to the point cloud space, the mapped positions of the image patches within the point cloud space can be obtained. Given the pre-obtained block division results of the first image and the spatial partitioning results of the point cloud data, the cube space in which the mapped position of each image patch lies can be determined. For example, from the mapping result of the first image onto the point cloud data shown in Figure 10, the mapping position of each image patch of the first image in the point cloud space can be found, thereby determining the cube space corresponding to that image patch in the point cloud space.

[0205] Then, the third vector in the second feature sequence is adjusted to be the fusion result of the third vector and the fourth vector in the first feature sequence. The mapping position of the image patch corresponding to the fourth vector is located in the cube space where the point corresponding to the third vector is located. That is, the third vector in the second feature sequence and the fourth vector in the first feature sequence are vectors with a corresponding relationship, determined by mapping the first image onto the point cloud space. Furthermore, the fusion result of the third and fourth vectors can be the sum of the third and fourth vectors. Alternatively, the fusion result of the third and fourth vectors can be the fourth vector itself; that is, adjusting the third vector is actually replacing the third vector with the fourth vector.

[0206] The above describes the process of adjusting vectors in the second feature sequence using the corresponding third and fourth vectors. In practical applications, all vectors in the second feature sequence that correspond to vectors in the first feature sequence can be adjusted in the same way, thus achieving the adjustment of vectors in the second feature sequence. In this way, by adjusting the vectors in the second feature sequence that correspond to vectors in the first feature sequence, it is possible to adjust some vectors in the second feature sequence while keeping its length unchanged, thereby obtaining the second mixed sequence.
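The point-cloud-space counterpart can be sketched in the same way. The code below is an illustration under assumptions that are not taken from this application: cube spaces are axis-aligned cubes of a single edge length `voxel_size`, each vector of the second feature sequence is identified by the integer key of its cube space, and image patches whose mapping position is unknown are passed as `None`. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
Point3 = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxel_key(position: Point3, voxel_size: float) -> VoxelKey:
    """Integer grid coordinates of the cube space containing `position`."""
    x, y, z = position
    return (
        math.floor(x / voxel_size),
        math.floor(y / voxel_size),
        math.floor(z / voxel_size),
    )


def fuse_into_point_sequence(
    point_seq: Sequence[Vector],
    voxel_keys: Sequence[VoxelKey],
    image_seq: Sequence[Vector],
    patch_positions: Sequence[Optional[Point3]],
    voxel_size: float,
    mode: str = "sum",
) -> List[Vector]:
    """Build the second mixed sequence; its length equals len(point_seq)."""
    slot: Dict[VoxelKey, int] = {key: i for i, key in enumerate(voxel_keys)}
    mixed = [list(vec) for vec in point_seq]
    for image_vec, position in zip(image_seq, patch_positions):
        if position is None:
            continue  # no mapping position could be determined for this patch
        idx = slot.get(voxel_key(position, voxel_size))
        if idx is None:
            continue  # the patch maps into a cube space that holds no points
        if mode == "sum":
            mixed[idx] = [a + b for a, b in zip(mixed[idx], image_vec)]
        else:  # "replace": the point cloud vector is replaced by the image vector
            mixed[idx] = list(image_vec)
    return mixed
```

Image patches that map into an empty cube space are simply skipped in this sketch, so only vectors with a corresponding relationship are adjusted.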

[0207] Furthermore, after obtaining the second mixed sequence, it can be input into a second attention network, such as a Transformer network. The second attention network further fuses the vectors in the second mixed sequence, enabling image features to interact with their neighboring point cloud features. Finally, by inputting the second mixed sequence into the second attention network, the second fused sequence output by the second attention network can be obtained.

[0208] In summary, by mapping image patches from the first image onto the point cloud space, the position of the image patches in the point cloud space can be determined, and then the vectors in the first feature sequence that correspond to vectors in the second feature sequence can be determined. By fusing the corresponding vectors based on the second feature sequence, the first and second feature sequences can be fused in the point cloud space, enabling image features to interact with their neighboring point cloud features and ensuring the smooth fusion of image and point cloud data.

[0209] Since pixels in an image have only two-dimensional coordinates, while points in a point cloud have three-dimensional coordinates (that is, points in a point cloud carry depth information that pixels in an image lack), mapping image patches from the first image to the point cloud space often requires first determining the depth of each image patch in the point cloud space. This embodiment provides a corresponding implementation method to achieve this mapping.

[0210] For example, in the process of mapping an image patch in the first image to the point cloud space, the points in the point cloud data can first be projected into the first image, and based on the projection positions of the points in the point cloud data in the first image, at least one projection position closest to the first image patch in the first image can be determined. Here, the first image patch can be any image patch in the first image.

[0211] Then, based on the depth of the point corresponding to at least one projection position, the mapping position of the first image patch in the point cloud space is determined. That is, based on the depth of the point corresponding to at least one projection position, the depth of the first image patch in the point cloud space can be determined; thus, based on the two-dimensional coordinates of the first image patch in the image space and the transformation relationship from the image space to the point cloud space, the mapping position of the first image patch in the point cloud space can be determined.

[0212] For example, the projection position nearest to the first image patch can be determined, and the depth of the point corresponding to that projection position can be used as the depth of the first image patch in the point cloud space, thus determining the mapping position of the first image patch in the point cloud space. Alternatively, the multiple projection positions lying within a circle of radius r centered on the first image patch can be determined, and the average depth of the points corresponding to these projection positions can be used as the depth of the first image patch, thus determining the mapping position of the first image patch in the point cloud space. In practical applications, the choice between the single nearest projection position and multiple projection positions can be made based on specific requirements. Generally, using the single nearest projection position requires less computation and is more efficient, while using multiple nearby projection positions offers higher accuracy.
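Both depth strategies can be illustrated with a short Python sketch. The following rests on assumptions that are not stated in this application: the image patch is represented by its center pixel, distances are Euclidean pixel distances, and the transformation from image space is a plain pinhole back-projection with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` that yields camera-frame coordinates (the camera-to-LiDAR extrinsic transform is omitted). Function names are illustrative.

```python
import math
from typing import Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def patch_depth(
    center: Pixel,
    projections: Sequence[Pixel],
    depths: Sequence[float],
    radius: Optional[float] = None,
) -> Optional[float]:
    """Depth of an image patch estimated from projected point cloud points.

    radius=None: use the depth of the single nearest projection position.
    radius=r:    average the depths of all projections within r pixels.
    """
    if not projections:
        return None
    dists = [math.hypot(u - center[0], v - center[1]) for u, v in projections]
    if radius is None:
        nearest = min(range(len(dists)), key=dists.__getitem__)
        return depths[nearest]
    inside = [depth for dist, depth in zip(dists, depths) if dist <= radius]
    if not inside:
        return None  # no projection position falls within the radius
    return sum(inside) / len(inside)


def back_project(
    center: Pixel, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Pinhole back-projection of a pixel with known depth (camera frame)."""
    u, v = center
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
```

The nearest-position variant does a single minimum search, whereas the radius variant averages several depths, which reflects the trade-off between computation and accuracy noted above.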

[0213] In summary, after projecting the points in the point cloud data onto the first image, for each image patch in the first image, one or more projection positions closest to that image patch can be found. Thus, the depth of the image patch in the point cloud space can be determined based on the depths of the points at these projection positions, thereby determining the mapping position of the image patch in the point cloud space.

[0214] In this scheme, the points in the point cloud data are first projected into the image space, and then the depth of the image patch in the point cloud space is determined by finding the nearest projection position on the image. This ensures that the image patch in the image can be accurately mapped to the point cloud space, thus improving the feasibility of the scheme.

[0215] Step 5043: Perform fusion processing on the first fusion sequence and the second fusion sequence to obtain fusion features.

[0216] For example, the first fusion sequence and the second fusion sequence can each be converted to BEV space to obtain the first BEV feature and the second BEV feature. Specifically, BEV space is a two-dimensional space seen from a bird's-eye view, so it can be divided into multiple grids. After determining the position of a feature in three-dimensional space (i.e., point cloud space), the BEV grid in which the feature is located can be determined. Since each vector in the first fusion sequence corresponds to an image patch, the grid in which each vector in the first fusion sequence is located in BEV space (i.e., the position of the vector in BEV space) can be determined based on the mapping position of the image patch in the point cloud space. Labeling each vector in the first fusion sequence with its position information in BEV space converts the first fusion sequence to BEV space and yields the first BEV feature. Similarly, since each vector in the second fusion sequence corresponds to a cube space in the point cloud space, the grid in which each vector in the second fusion sequence is located in BEV space can be determined based on the BEV grid corresponding to each cube space. Labeling each vector in the second fusion sequence with its position information in BEV space converts the second fusion sequence to BEV space and yields the second BEV feature.

[0217] Then, the first BEV feature and the second BEV feature are fused to obtain the fused feature. Specifically, in the process of fusing the first BEV feature and the second BEV feature, vectors located at the same position in the first BEV feature and the second BEV feature can be fused (for example, the vectors are summed) to obtain the fused feature.
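The conversion to BEV space and the subsequent summation can be sketched as follows. This is an illustrative sketch with assumptions of its own: the BEV grid is a uniform square grid over the x-y plane (z is treated as height and dropped), each BEV feature is stored as a sparse dictionary keyed by grid cell, and vectors that land in the same cell are summed. The names `bev_cell`, `scatter_to_bev`, and `fuse_bev` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Cell = Tuple[int, int]


def bev_cell(x: float, y: float, x_min: float, y_min: float, cell: float) -> Cell:
    """Grid cell of a 3D position seen from above (height is ignored)."""
    return (math.floor((x - x_min) / cell), math.floor((y - y_min) / cell))


def _add(a: Sequence[float], b: Sequence[float]) -> Vector:
    return [p + q for p, q in zip(a, b)]


def scatter_to_bev(
    seq: Sequence[Vector],
    positions: Sequence[Tuple[float, float, float]],
    x_min: float,
    y_min: float,
    cell: float,
) -> Dict[Cell, Vector]:
    """Label each vector with its BEV cell; vectors sharing a cell are summed."""
    grid: Dict[Cell, Vector] = {}
    for vec, (x, y, _z) in zip(seq, positions):
        key = bev_cell(x, y, x_min, y_min, cell)
        grid[key] = _add(grid[key], vec) if key in grid else list(vec)
    return grid


def fuse_bev(first: Dict[Cell, Vector], second: Dict[Cell, Vector]) -> Dict[Cell, Vector]:
    """Sum vectors located at the same BEV position; keep the rest unchanged."""
    fused = {key: list(vec) for key, vec in first.items()}
    for key, vec in second.items():
        fused[key] = _add(fused[key], vec) if key in fused else list(vec)
    return fused
```

For the first fusion sequence the positions would be the mapping positions of the image patches; for the second fusion sequence they would be representative positions of the cube spaces, for example their centers.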

[0218] In this scheme, by fusing the first feature sequence and the second feature sequence in the image space and the point cloud space respectively, and then fusing the fused sequence obtained in the image space and the point cloud space in a unified space, the fusion of image features and point cloud features can be well realized, ensuring the accuracy of the final fused features.

[0219] Refer to Figure 11A, which is a schematic diagram illustrating the fusion processing of different modal data provided in an embodiment of this application. As shown in Figure 11A, in an autonomous driving scenario, the autonomous vehicle acquires a first image and point cloud data through a camera and a LiDAR, respectively. For the first image, a first feature converter can be used to perform feature transformation on the first image to obtain an image feature sequence; for the point cloud data, a second feature converter can be used to perform feature transformation on the point cloud data to obtain a point cloud feature sequence. The first feature converter and the second feature converter can be different Transformer networks.

[0220] Then, the image feature sequence and the point cloud feature sequence are processed in parallel using the same feature extraction network to obtain the first feature sequence and the second feature sequence. The feature extraction network can be, for example, a Transformer network.
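The key point of this step is that one set of parameters serves both modalities. The sketch below shows that idea with a deliberately simple stand-in: a single linear layer replaces the Transformer network, and the class name `SharedExtractor` and its parameters are hypothetical. Because both sequences contain vectors of the same dimension, the same object can process either of them.

```python
from typing import List, Sequence

Vector = List[float]


class SharedExtractor:
    """One set of weights applied to sequences from any modality.

    Stand-in for the shared feature extraction network; a real system would
    use a Transformer network here rather than a single linear layer.
    """

    def __init__(self, weight: Sequence[Sequence[float]], bias: Sequence[float]):
        self.weight = [list(row) for row in weight]  # shape: d_out x d_in
        self.bias = list(bias)                       # shape: d_out

    def __call__(self, seq: Sequence[Vector]) -> List[Vector]:
        return [
            [
                sum(w * x for w, x in zip(row, vec)) + b
                for row, b in zip(self.weight, self.bias)
            ]
            for vec in seq
        ]


# A single extractor instance is reused for both feature sequences.
extractor = SharedExtractor(weight=[[1.0, 0.0], [1.0, 1.0]], bias=[0.0, 1.0])
image_feature_sequence = [[2.0, 3.0], [0.0, 1.0]]
point_cloud_feature_sequence = [[1.0, 1.0]]
first_feature_sequence = extractor(image_feature_sequence)
second_feature_sequence = extractor(point_cloud_feature_sequence)
```

Only one copy of the weights needs to be stored and trained, which is the storage and training benefit the scheme attributes to using the same feature extraction network.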

[0221] Next, the first feature sequence and the second feature sequence are fused in the image space using an image space fusion network to obtain a first fused sequence. The image space fusion network includes the first attention network described above. Then, the first feature sequence and the second feature sequence are fused in the point cloud space using a point cloud space fusion network to obtain a second fused sequence. The point cloud space fusion network includes the second attention network described above.

[0222] Finally, the first fusion sequence and the second fusion sequence are fused in the BEV space to obtain fusion features, which are used to perform subsequent environmental perception tasks.

[0223] Refer to Figure 11B, which illustrates a comparison of different multimodal data processing methods provided in embodiments of this application. As shown in Figure 11B, this embodiment compares the performance of 17 existing solutions with that of the proposed solution on 3D detection tasks in autonomous driving scenarios. The proposed solution outperforms all previous LiDAR-camera fusion methods while achieving a lower inference latency (89.7 ms). It achieves state-of-the-art combined detection accuracy (NDS) of 73.5 and 74.5 on the validation and test sets, respectively, which is 2.1 and 1.6 points higher than the previous best methods. Furthermore, the proposed network model can be readily accelerated using an optimized deployment tool (NVIDIA TensorRT), reducing the inference latency to 50.2 ms.

[0224] The above describes a method for processing multimodal data provided in this embodiment. In practical applications, the training of the network involved in the above-described multimodal data processing method can be combined with a specific environmental perception task. Specifically, during training, the image and point cloud data used as training data are processed according to the above-described multimodal data processing method to obtain fused features. These fused features are then input into the network corresponding to the environmental perception task, such as an object detection network or a semantic segmentation network, to obtain the final output result. In this way, a loss function can be constructed based on the final output result, and then the network involved in the above-described multimodal data processing method (e.g., a feature extraction network) can be trained based on the loss function.

[0225] Since this scheme uses the same feature extraction network to process feature sequences corresponding to different modal data, unified training of the network can be achieved during the training process. Unlike related technologies, it is not necessary to train the encoders corresponding to each modal data separately, which can effectively improve the training efficiency of the network.

[0226] The methods provided in the embodiments of this application have been described in detail above. Next, the device for performing the above methods provided in the embodiments of this application will be described.

[0227] Refer to Figure 12, which is a schematic diagram of the structure of a multimodal data processing device provided in an embodiment of this application. As shown in Figure 12, the multimodal data processing device includes: an acquisition module 1201 for acquiring a first image and point cloud data; and a processing module 1202 for converting the first image into an image feature sequence and the point cloud data into a point cloud feature sequence, wherein both the image feature sequence and the point cloud feature sequence include multiple vectors, and the vectors included in the image feature sequence and the point cloud feature sequence have the same dimension. The processing module 1202 is further used to process the image feature sequence and the point cloud feature sequence respectively through a feature extraction network to obtain a first feature sequence corresponding to the image feature sequence and a second feature sequence corresponding to the point cloud feature sequence. The processing module 1202 is further used to perform fusion processing on the first feature sequence and the second feature sequence to obtain fused features, which are used to perform environmental perception tasks.

[0228] In one possible implementation, the processing module 1202 is further configured to: fuse vectors in the second feature sequence into the first feature sequence based on the projection positions of points in the point cloud data in the image space to obtain a first fused sequence, wherein the vectors in the second feature sequence have corresponding points in the point cloud data; fuse vectors in the first feature sequence into the second feature sequence based on the mapping positions of image patches in the first image in the point cloud space to obtain a second fused sequence, wherein the vectors in the first feature sequence have corresponding image patches in the first image; and perform fusion processing on the first fused sequence and the second fused sequence to obtain fused features.

[0229] In one possible implementation, the processing module 1202 is further configured to: project points in the point cloud data onto the image space to obtain the projection positions of the points in the point cloud data in the first image; adjust the first vector in the first feature sequence to a fusion result of the first vector and the second vector in the second feature sequence; wherein the projection position of the point corresponding to the second vector is located in the image block corresponding to the first vector.

[0230] In one possible implementation, the processing module 1202 is further configured to: map the image patch in the first image to the point cloud space to obtain the mapping position of the image patch in the first image in the point cloud space; adjust the third vector in the second feature sequence to the fusion result of the third vector and the fourth vector in the first feature sequence; wherein the mapping position of the image patch corresponding to the fourth vector is located in the cube space where the point corresponding to the third vector is located.

[0231] In one possible implementation, the processing module 1202 is further configured to: determine at least one projection position closest to the first image block in the first image based on the projection position of points in the point cloud data in the first image; and determine the mapping position of the first image block in the point cloud space based on the depth of the point corresponding to the at least one projection position.

[0232] In one possible implementation, the processing module 1202 is further configured to: convert the first fusion sequence and the second fusion sequence to the BEV space respectively to obtain the first BEV feature and the second BEV feature; and fuse the first BEV feature and the second BEV feature to obtain the fusion feature.

[0233] In one possible implementation, the processing module 1202 is further configured to: divide the first image into multiple image blocks; and convert each image block in the multiple image blocks into a vector using a first feature converter to obtain an image feature sequence composed of multiple vectors corresponding to the multiple image blocks.

[0234] In one possible implementation, the processing module 1202 is further configured to: divide the point cloud data into multiple cube spaces, each cube space including one or more points in the point cloud data; based on the points included in each cube space, convert each cube space into a vector through a second feature converter to obtain a point cloud feature sequence composed of multiple vectors corresponding to the multiple cube spaces.

[0235] In one possible implementation, the first image and point cloud data are acquired in the same scene.

[0236] In one possible implementation, the first image and point cloud data are acquired in any of the following scenarios: autonomous driving scenario, robot driving scenario, and intelligent inspection scenario.

[0237] Refer to Figure 13, which is a schematic diagram of an execution device provided in an embodiment of this application. The execution device 1300 can specifically be an autonomous vehicle, a robot, a server, etc., and is not limited thereto. Specifically, the execution device 1300 includes: a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (the execution device 1300 may have one or more processors 1303; Figure 13 takes one processor as an example). The processor 1303 may include an application processor 13031 and a communication processor 13032. In some embodiments of this application, the receiver 1301, transmitter 1302, processor 1303, and memory 1304 may be connected via a bus or by other means.

[0238] Memory 1304 may include read-only memory and random access memory, and provides instructions and data to processor 1303. A portion of memory 1304 may also include non-volatile random access memory (NVRAM). Memory 1304 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0239] Processor 1303 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses are referred to as the bus system in the diagram.

[0240] The methods disclosed in the embodiments of this application described above can be applied to processor 1303, or implemented by processor 1303. Processor 1303 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit in the hardware of processor 1303 or by instructions in the form of software. The processor 1303 described above can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0241] The processor 1303 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory 1304. The processor 1303 reads information from memory 1304 and, in conjunction with its hardware, completes the steps of the above methods.

[0242] Receiver 1301 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 1302 can be used to output digital or character information through the first interface; transmitter 1302 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 1302 may also include a display device such as a display screen.

[0243] The electronic device provided in this application embodiment can specifically be a chip, which includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in the storage unit to cause the chip in the execution device to execute the multimodal data processing method described in the above embodiments, or to cause the chip in the training device to execute the multimodal data processing method described in the above embodiments. Optionally, the storage unit can be an internal storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be an external storage unit located within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).

[0244] For details, refer to Figure 14, which is a schematic diagram of a chip provided in an embodiment of this application. The chip can be represented as a neural network processing unit (NPU) 1400. The NPU 1400 is attached to the host CPU as a coprocessor, and tasks are assigned to it by the host CPU. The core part of the NPU is the arithmetic circuit 1403, which is controlled by the controller 1404 to extract matrix data from the memory and perform multiplication operations.

[0245] In some implementations, the arithmetic circuit 1403 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1403 is a two-dimensional systolic array. The arithmetic circuit 1403 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1403 is a general-purpose matrix processor.

[0246] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 1402 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 1401 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 1408.
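The role of the accumulator can be illustrated in software. The sketch below is only an analogy to the dataflow described above, not a model of the NPU hardware: it computes C = A x B by processing the shared dimension in slices and adding each partial result into an accumulator matrix. The function name `matmul_accumulate` and the `tile` parameter are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul_accumulate(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]], tile: int = 2
) -> Matrix:
    """Compute C = A x B by accumulating partial products slice by slice."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc: Matrix = [[0.0] * cols for _ in range(rows)]  # plays the accumulator role
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        # Partial result for this slice of the shared dimension.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc
```

After the last slice has been processed, the accumulator holds the final result; before that it holds a partial result, which matches the statement that either may be stored in the accumulator 1408.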

[0247] Unified memory 1406 is used to store input and output data. Weight data is directly transferred to weight memory 1402 via Direct Memory Access Controller (DMAC) 1405. Input data is also transferred to unified memory 1406 via DMAC.

[0248] The bus interface unit (BIU) 1410 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1409.

[0249] The BIU 1410 is used by the instruction fetch buffer 1409 to fetch instructions from external memory, and by the direct memory access controller 1405 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0250] The DMAC is mainly used to move input data from the external memory (DDR) to the unified memory 1406, to move weight data to the weight memory 1402, or to move input data to the input memory 1401.

[0251] The vector computation unit 1407 includes multiple arithmetic processing units that, when needed, further process the output of the arithmetic circuit 1403, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, or magnitude comparisons. It is mainly used for computation in non-convolutional / fully connected layers of neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.

[0252] In some implementations, the vector computation unit 1407 can store the processed output vector in the unified memory 1406. For example, the vector computation unit 1407 can apply a linear or nonlinear function to the output of the arithmetic circuit 1403, such as performing linear interpolation on feature planes extracted from a convolutional layer, or accumulating a vector of values to generate activation values. In some implementations, the vector computation unit 1407 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as activation input to the arithmetic circuit 1403, for example, for use in subsequent layers of the neural network.

[0253] The instruction fetch buffer 1409, connected to the controller 1404, is used to store the instructions used by the controller 1404.

[0254] The unified memory 1406, input memory 1401, weight memory 1402, and instruction fetch buffer 1409 are all on-chip memories. The external memory is private to this NPU hardware architecture.

[0255] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0256] Refer to Figure 15, which is a schematic diagram of a computer-readable storage medium provided in an embodiment of this application. This application also provides a computer-readable storage medium in some embodiments, wherein the method disclosed in Figure 5 above can be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.

[0257] Figure 15 schematically illustrates a conceptual partial view of an example computer-readable storage medium arranged according to at least some of the embodiments shown herein. The example computer-readable storage medium includes a computer program for executing computer processes on a computing device.

[0258] In one embodiment, the computer-readable storage medium 1500 is provided using a signal-bearing medium 1501. The signal-bearing medium 1501 may include one or more program instructions 1502, which, when executed by one or more processors, can provide the functions, or part of the functions, described above with respect to Figure 5.

[0259] In some examples, the signal-bearing medium 1501 may include a computer-readable medium 1503, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital magnetic tape, a memory, ROM, or RAM.

[0260] In some embodiments, the signal-bearing medium 1501 may include a computer-recordable medium 1504, such as, but not limited to, a memory, a read/write (R/W) CD, or an R/W DVD. In some embodiments, the signal-bearing medium 1501 may include a communication medium 1505, such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communication links, wireless communication links, etc.). Thus, for example, the signal-bearing medium 1501 may be conveyed by a wireless form of the communication medium 1505 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or another transmission protocol).

[0261] The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, the computing device may be configured to provide various operations, functions, or actions in response to the one or more program instructions 1502 conveyed to the computing device via the computer-readable medium 1503, the computer-recordable medium 1504, and/or the communication medium 1505.

[0262] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the accompanying drawings of the device embodiments provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0263] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods of the various embodiments of this application.

[0264] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0265] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).
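As one illustration of a software implementation, the following minimal sketch converts an image and a point cloud into vector sequences of the same dimension and passes both through one set of network weights. Everything concrete in it is an assumption for illustration: the function names, the linear projections standing in for the feature converters, the mean of the points as the per-cell descriptor, the two-layer MLP standing in for the feature extraction network, and the dimension `D = 16`.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared vector dimension (assumed value)

def image_to_sequence(image, patch, proj):
    """Split an H x W x C image into patch x patch blocks, one vector per block."""
    h, w, c = image.shape  # H and W are assumed divisible by `patch`
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return blocks @ proj  # shape (num_blocks, D)

def points_to_sequence(points, cell, proj):
    """Group N x 3 points into cubic cells, one vector per non-empty cell."""
    cells = np.floor(points / cell).astype(np.int64)
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)  # accumulate points per cell
    counts = np.bincount(inverse, minlength=len(uniq))[:, None]
    return (sums / counts) @ proj  # mean point per cell, shape (num_cells, D)

def shared_network(seq, w1, w2):
    """The same weights process both modalities (a two-layer MLP as a stand-in)."""
    return np.maximum(seq @ w1, 0.0) @ w2

image = rng.random((8, 8, 3))
points = rng.random((50, 3)) * 4.0
img_proj = rng.standard_normal((4 * 4 * 3, D))
pt_proj = rng.standard_normal((3, D))
w1 = rng.standard_normal((D, 32))
w2 = rng.standard_normal((32, D))

image_seq = image_to_sequence(image, 4, img_proj)
point_seq = points_to_sequence(points, 2.0, pt_proj)
first_features = shared_network(image_seq, w1, w2)
second_features = shared_network(point_seq, w1, w2)
```

Because both sequences consist of `D`-dimensional vectors, a single set of weights (`w1`, `w2`) serves both modalities, so only one network needs to be stored on the device.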

Claims

1. A method for processing multi-modal data, the method comprising: acquiring a first image and point cloud data; converting the first image into an image feature sequence and converting the point cloud data into a point cloud feature sequence, wherein the image feature sequence and the point cloud feature sequence each include multiple vectors, and the vectors included in the image feature sequence and the vectors included in the point cloud feature sequence have the same dimension; processing the image feature sequence and the point cloud feature sequence through the same feature extraction network to obtain a first feature sequence corresponding to the image feature sequence and a second feature sequence corresponding to the point cloud feature sequence; and fusing the first feature sequence and the second feature sequence to obtain fused features, wherein the fused features are used to perform environmental perception tasks.

2. The method of claim 1, wherein fusing the first feature sequence and the second feature sequence to obtain the fused features includes: fusing, based on the projection positions in the image space of points in the point cloud data, vectors in the second feature sequence into the first feature sequence to obtain a first fused sequence, wherein the vectors in the second feature sequence have corresponding points in the point cloud data; fusing, based on the mapping positions in the point cloud space of image blocks in the first image, vectors in the first feature sequence into the second feature sequence to obtain a second fused sequence, wherein the vectors in the first feature sequence have corresponding image blocks in the first image; and fusing the first fused sequence and the second fused sequence to obtain the fused features.

3. The method of claim 2, wherein fusing the vectors in the second feature sequence into the first feature sequence based on the projection positions in the image space of points in the point cloud data includes: projecting the points in the point cloud data onto the image space to obtain the projection positions of the points in the point cloud data in the first image; and adjusting a first vector in the first feature sequence to be the result of fusing the first vector with a second vector in the second feature sequence, wherein the projection position of the point corresponding to the second vector is located in the image block corresponding to the first vector.

4. The method according to claim 2 or 3, wherein fusing the vectors in the first feature sequence into the second feature sequence based on the mapping positions in the point cloud space of image blocks in the first image includes: mapping the image blocks in the first image to the point cloud space to obtain the mapping positions of the image blocks in the first image in the point cloud space; and adjusting a third vector in the second feature sequence to be the result of fusing the third vector with a fourth vector in the first feature sequence, wherein the mapping position of the image block corresponding to the third vector is located in the cubic space where the point corresponding to the fourth vector is located.

5. The method of claim 4, wherein mapping the image blocks in the first image to the point cloud space includes: determining, based on the projection positions of the points in the point cloud data in the first image, at least one projection position that is closest to a first image block in the first image; and determining, based on the depth of the point corresponding to the at least one projection position, the mapping position of the first image block in the point cloud space.

6. The method according to claim 2 or 3, wherein fusing the first fused sequence and the second fused sequence to obtain the fused features includes: converting the first fused sequence and the second fused sequence respectively to a bird's-eye view (BEV) space to obtain a first BEV feature and a second BEV feature; and fusing the first BEV feature and the second BEV feature to obtain the fused features.

7. The method according to any one of claims 1 to 3, wherein converting the first image into the image feature sequence includes: dividing the first image into multiple image blocks; and converting, by a first feature converter, each of the multiple image blocks into a vector, thereby obtaining the image feature sequence formed by arranging the multiple vectors corresponding to the multiple image blocks.

8. The method according to any one of claims 1 to 3, wherein converting the point cloud data into the point cloud feature sequence includes: obtaining multiple cubic spaces based on the point cloud data, wherein each cubic space includes one or more points from the point cloud data; and converting, by a second feature converter and based on the points included in each cubic space, each cubic space into a vector, thereby obtaining the point cloud feature sequence formed by the multiple vectors corresponding to the multiple cubic spaces.

9. The method according to any one of claims 1 to 3, wherein the first image and the point cloud data are acquired in the same scene.

10. The method according to any one of claims 1 to 3, wherein the first image and the point cloud data are acquired in any of the following scenarios: an autonomous driving scenario, a robot driving scenario, and an intelligent inspection scenario.

11. An apparatus for processing multi-modal data, the apparatus comprising: an acquisition module configured to acquire a first image and point cloud data; and a processing module configured to convert the first image into an image feature sequence and convert the point cloud data into a point cloud feature sequence, wherein the image feature sequence and the point cloud feature sequence each include multiple vectors, and the vectors included in the image feature sequence and the vectors included in the point cloud feature sequence have the same dimension; wherein the processing module is further configured to process the image feature sequence and the point cloud feature sequence respectively through the same feature extraction network to obtain a first feature sequence corresponding to the image feature sequence and a second feature sequence corresponding to the point cloud feature sequence; and the processing module is further configured to fuse the first feature sequence and the second feature sequence to obtain fused features, wherein the fused features are used to perform environmental perception tasks.

12. The apparatus of claim 11, wherein the processing module is further configured to: fuse, based on the projection positions in the image space of points in the point cloud data, vectors in the second feature sequence into the first feature sequence to obtain a first fused sequence, wherein the vectors in the second feature sequence have corresponding points in the point cloud data; fuse, based on the mapping positions in the point cloud space of image blocks in the first image, vectors in the first feature sequence into the second feature sequence to obtain a second fused sequence, wherein the vectors in the first feature sequence have corresponding image blocks in the first image; and fuse the first fused sequence and the second fused sequence to obtain the fused features.

13. The apparatus of claim 12, wherein the processing module is further configured to: project the points in the point cloud data onto the image space to obtain the projection positions of the points in the point cloud data in the first image; and adjust a first vector in the first feature sequence to be the result of fusing the first vector with a second vector in the second feature sequence, wherein the projection position of the point corresponding to the second vector is located in the image block corresponding to the first vector.

14. The apparatus according to claim 12 or 13, wherein the processing module is further configured to: map the image blocks in the first image to the point cloud space to obtain the mapping positions of the image blocks in the first image in the point cloud space; and adjust a third vector in the second feature sequence to be the result of fusing the third vector with a fourth vector in the first feature sequence, wherein the mapping position of the image block corresponding to the third vector is located in the cubic space where the point corresponding to the fourth vector is located.

15. The apparatus of claim 14, wherein the processing module is further configured to: determine, based on the projection positions of the points in the point cloud data in the first image, at least one projection position that is closest to a first image block in the first image; and determine, based on the depth of the point corresponding to the at least one projection position, the mapping position of the first image block in the point cloud space.

16. The apparatus of claim 12 or 13, wherein the processing module is further configured to: convert the first fused sequence and the second fused sequence respectively to a bird's-eye view (BEV) space to obtain a first BEV feature and a second BEV feature; and fuse the first BEV feature and the second BEV feature to obtain the fused features.

17. The apparatus of any one of claims 11 to 13, wherein the processing module is further configured to: divide the first image into multiple image blocks; and convert, by a first feature converter, each of the multiple image blocks into a vector, thereby obtaining the image feature sequence formed by arranging the multiple vectors corresponding to the multiple image blocks.

18. The apparatus of any one of claims 11 to 13, wherein the processing module is further configured to: obtain multiple cubic spaces based on the point cloud data, wherein each cubic space includes one or more points from the point cloud data; and convert, by a second feature converter and based on the points included in each cubic space, each cubic space into a vector, thereby obtaining the point cloud feature sequence formed by the multiple vectors corresponding to the multiple cubic spaces.

19. An apparatus for processing multi-modal data, the apparatus comprising a memory and a processor, wherein the memory stores code and the processor is configured to execute the code, and when the code is executed, the apparatus performs the method according to any one of claims 1 to 10.

20. A computer storage medium, wherein the computer storage medium stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 10.

21. A computer program product, wherein the computer program product stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 10.
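For illustration only, the projection-based fusion recited in claim 3 can be sketched as follows. The sketch forms no part of the claims and rests on several assumptions: a pinhole camera with intrinsic matrix `K`, one representative point per point-cloud vector already expressed in camera coordinates, image vectors stored in row-major block order, and element-wise addition as the fusion operation. The function name `fuse_points_into_image` is hypothetical.

```python
import numpy as np

def fuse_points_into_image(image_seq, point_seq, centers, K, patch, grid_w):
    """Add each point-cloud vector to the image vector whose block contains its projection.

    centers: (M, 3) representative points in camera coordinates (assumed input).
    K:       3 x 3 pinhole intrinsic matrix (assumed camera model).
    patch:   side length of one image block in pixels.
    grid_w:  number of image blocks per row.
    """
    fused = image_seq.copy()
    grid_h = image_seq.shape[0] // grid_w
    for vec, p in zip(point_seq, centers):
        if p[2] <= 0:                 # behind the camera: no valid projection
            continue
        u, v, _ = (K @ p) / p[2]      # projection position in pixels
        col, row = int(u // patch), int(v // patch)
        if 0 <= row < grid_h and 0 <= col < grid_w:
            fused[row * grid_w + col] += vec   # addition as the assumed fusion
    return fused

K = np.array([[4.0, 0.0, 4.0],
              [0.0, 4.0, 4.0],
              [0.0, 0.0, 1.0]])
image_seq = np.zeros((4, 2))          # 2 x 2 grid of 4-pixel blocks, dimension 2
point_seq = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0, 1.0], [-0.5, -0.5, 1.0], [0.0, 0.0, -1.0]])
fused = fuse_points_into_image(image_seq, point_seq, centers, K, patch=4, grid_w=2)
```

In this toy input the first point projects to pixel (4, 4) and lands in the last block, the second projects to pixel (2, 2) and lands in the first block, and the third lies behind the camera and is skipped.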