Information processing method and apparatus, and autonomous vehicle
By fusing sensor data through a four-dimensional spatiotemporal representation method, the disclosed scheme addresses the high storage and computing resource consumption of existing technologies and their difficulty in capturing dynamic interactions. It generates a general representation applicable to multiple tasks, saving resources and capturing dynamic correlations accurately.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING JINGDONG QIANSHI TECHNOLOGY CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
In existing autonomous driving technologies, the data processing method of 3D data + time series results in excessive consumption of storage and computing resources, difficulty in capturing dynamic interactions of time series information, poor flexibility, and the need to redesign feature extraction models to adapt to different tasks.
It adopts a spatiotemporal information integrated modeling logic, integrates continuous frame sensing data through a four-dimensional spatiotemporal representation method, and uses cross attention and self/graph attention networks for feature extraction and fusion to generate a general representation without the need for an additional temporal alignment module.
It achieves efficient storage and computing resource saving, accurately captures instantaneous correlations in dynamic scenes, and generates representations applicable to various downstream tasks without the need to redesign the feature extraction model.
Figure CN122241592A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computers, particularly to the field of autonomous driving, and especially to an information processing method and apparatus and an autonomous vehicle.
Background Technology
[0002] Autonomous vehicles collect environmental information through various sensors installed on the vehicle, such as cameras and radar, and perform intelligent calculations and reasoning through various intelligent task models, such as target detection models, to achieve automatic driving control.
[0003] The performance improvement of environmental perception and decision-making tasks in autonomous driving technology relies on efficient data representation methods. Currently, environmental perception and scene task modeling in the field of autonomous driving mainly rely on 3D data or "3D data + time series".
[0004] The "3D data + time series" data processing method treats the time dimension as a parameter independent of the 3D spatial dimension. This involves first collecting multiple frames of independent 3D data, and then using time series processing algorithms such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) to post-process and align the multiple frames of 3D data. This enables the extraction of temporal features from dynamic scenes, thereby supporting the autonomous driving system in judging the motion state and predicting the trajectory of dynamic targets.
[0005] This "3D data + time series" data processing method has revealed many technical defects in practical applications. For example, it requires storing multiple frames of independent 3D data, resulting in information redundancy and consuming a great deal of storage and computing resources; the time series information is aligned through post-processing, making it difficult to capture dynamic interactions; and for different autonomous driving tasks, corresponding feature extraction models need to be redesigned, resulting in poor flexibility.
Summary of the Invention
[0006] The information processing scheme of this disclosure adopts a spatiotemporal information integrated modeling logic to achieve native association and deep fusion of spatiotemporal information. It does not require an additional time alignment module. Through the spatiotemporal fusion representation method, it can accurately capture instantaneous and dynamic associations in dynamic scenes, fundamentally solving the problems of loss of instantaneous association features and inaccurate capture of dynamic associations caused by spatiotemporal separation modeling. Furthermore, it compresses high-dimensional redundant four-dimensional spatiotemporal data (such as point clouds and images of continuous frames) into a compact four-dimensional spatiotemporal representation of a preset encoding size, saving storage and computing resources. Moreover, the generated four-dimensional spatiotemporal representation is a general representation that is independent of downstream tasks, and can support various downstream tasks without redesigning the feature extraction model.
[0007] This disclosure proposes an information processing method in some embodiments, including: acquiring a first data sequence, including multiple consecutive frames of first sensing data collected by a first sensor of an autonomous vehicle; extracting a first spatial feature of each frame; fusing the first spatial features of the multiple consecutive frames according to the correlation weight between frames to obtain a first feature representation after spatiotemporal fusion; using a preset four-dimensional spatiotemporal encoding as a query in cross-attention, performing cross-attention processing on the first feature representation to obtain a four-dimensional spatiotemporal representation of the first data sequence, so that the downstream task of the autonomous vehicle performs task processing based on the four-dimensional spatiotemporal representation, wherein each element in the four-dimensional spatiotemporal encoding corresponds to a four-dimensional spatiotemporal unit.
[0008] In some embodiments, the first sensor is a camera, the first sensing data is image data, and extracting the first spatial feature of each frame includes: performing three-dimensional convolution processing on each frame of image data to obtain the first spatial feature of each frame of image data.
[0009] In some embodiments, fusing the first spatial features of the consecutive multiple frames according to the inter-frame association weights to obtain the spatiotemporally fused first feature representation includes: determining the inter-frame association weights through self-attention; performing weighted fusion of the first spatial features of the consecutive multiple frames according to the association weights between each frame and other frames to obtain the first spatiotemporal features of each frame; and combining the first spatiotemporal features of the consecutive multiple frames to obtain the first feature representation.
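The inter-frame weighting described in [0009] can be sketched as plain scaled dot-product self-attention over per-frame features. This is a minimal illustration, not the disclosed network: it assumes each frame has already been reduced to a single feature vector, and it omits the learned query, key, and value projections a trained model would use. The names `softmax` and `fuse_frames` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_frames(frame_features):
    """frame_features: (T, D), one spatial feature vector per frame (assumed).

    Returns (fused, weights): fused is (T, D), weights is (T, T).
    """
    d = frame_features.shape[1]
    # Row i holds the association weights between frame i and every frame.
    weights = softmax(frame_features @ frame_features.T / np.sqrt(d), axis=-1)
    # Each frame's spatiotemporal feature is a weighted sum over all frames.
    fused = weights @ frame_features
    return fused, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 64))   # 10 consecutive frames, 64-dim features
fused, weights = fuse_frames(features)
```

Stacking the rows of `fused` corresponds to combining the per-frame spatiotemporal features into the first feature representation.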
[0010] In some embodiments, the first sensor is a radar, the first sensing data is point cloud data, and extracting the first spatial features of each frame includes: performing sparse voxelization processing on each frame of point cloud data, performing three-dimensional sparse convolution processing on each obtained sparse voxel, obtaining the first spatial features of each sparse voxel, and forming the first spatial features of each frame of point cloud data.
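The sparse voxelization step in [0010] can be illustrated as follows. The sketch keeps only occupied voxels and uses the centroid of the points in each voxel as a stand-in feature; the three-dimensional sparse convolution that the disclosure applies afterwards is not shown. The function name and the 0.5 m voxel size are assumptions for illustration.

```python
import numpy as np

def sparse_voxelize(points, voxel_size):
    """points: (N, 3) xyz coordinates in metres.

    Returns (coords, centroids): coords is (V, 3) integer indices of the
    occupied voxels only; centroids is (V, 3) mean point position per voxel.
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Unique rows are the occupied voxels; inverse maps each point to its voxel.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)          # accumulate points per voxel
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None]

points = np.array([[0.1, 0.1, 0.1], [0.3, 0.1, 0.1], [1.2, 0.1, 0.1]])
coords, centroids = sparse_voxelize(points, voxel_size=0.5)
```

Empty voxels never appear in `coords`, which is what keeps the representation sparse.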
[0011] In some embodiments, fusing the first spatial features of the consecutive multiple frames according to the inter-frame association weights to obtain the spatiotemporally fused first feature representation includes: determining the association weights of sparse voxels between frames through graph attention, wherein the nodes of the graph are the first spatial features of the sparse voxels, and the edges of the graph are the associations of the sparse voxels between frames; performing weighted fusion of the first spatial features of the sparse voxels of the consecutive multiple frames according to the association weights between the sparse voxels of each frame and the sparse voxels of other frames to obtain the first spatiotemporal features of the sparse voxels of each frame; and combining the first spatiotemporal features of the sparse voxels of the consecutive multiple frames to obtain the first feature representation.
[0012] In some embodiments, using a preset four-dimensional spatiotemporal code as a query in cross-attention, and performing cross-attention processing on the first feature representation to obtain a four-dimensional spatiotemporal representation of the first data sequence includes: determining the association weight between the first element feature of each element in the four-dimensional spatiotemporal code and each first feature in the first feature representation through cross-attention; performing weighted fusion on each first feature in the first feature representation according to the corresponding association weight of each first feature in the first feature representation, as the second element feature of each element; and combining the second element features of all elements in the four-dimensional spatiotemporal code to obtain a four-dimensional spatiotemporal representation of the first data sequence.
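The cross-attention step in [0012] can be sketched with the four-dimensional spatiotemporal code acting as a fixed-size set of queries. The sketch assumes queries and features share one feature dimension and omits learned projections; `cross_attend` and the array sizes are hypothetical. Its purpose is to show that the output size is set by the number of code elements, not by the amount of input data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(code, features):
    """code: (N, D) first element features, one per 4D spatiotemporal unit.
    features: (M, D) first features of the fused first feature representation.

    Returns (N, D) second element features.
    """
    d = code.shape[1]
    # weights[i, j]: association between code element i and feature j.
    weights = softmax(code @ features.T / np.sqrt(d), axis=-1)
    return weights @ features

rng = np.random.default_rng(0)
code = rng.normal(size=(512, 64))                 # 512 elements, as in the highway example
short_seq = rng.normal(size=(1000, 64))
long_seq = rng.normal(size=(5000, 64))
rep_short = cross_attend(code, short_seq)
rep_long = cross_attend(code, long_seq)
```

Both inputs yield a 512×64 result, which illustrates how a preset encoding size compresses sequences of different lengths into a representation of the same size.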
[0013] In some embodiments, the method further includes: acquiring a second data sequence, including multiple consecutive frames of second sensing data collected by a second sensor of an autonomous vehicle; extracting a second spatial feature for each frame; fusing the second spatial features of the multiple consecutive frames according to the correlation weight between frames to obtain a spatiotemporally fused second feature representation; aligning the first feature representation and the second feature representation; using a preset four-dimensional spatiotemporal code as a query in cross-attention, performing cross-attention processing on the first feature representation and the second feature representation to obtain a four-dimensional spatiotemporal representation of the first data sequence and the second data sequence.
[0014] In some embodiments, using a preset four-dimensional spatiotemporal code as a query in cross-attention, and performing cross-attention processing on the first feature representation and the second feature representation to obtain the four-dimensional spatiotemporal representation of the first data sequence and the second data sequence includes: determining the association weight between the first element feature of each element in the four-dimensional spatiotemporal code and each first feature in the first feature representation through cross-attention; performing weighted fusion on each first feature in the first feature representation according to the corresponding association weight, as the second element feature of each element; determining the association weight between the first element feature of each element in the four-dimensional spatiotemporal code and each second feature in the second feature representation through cross-attention; performing weighted fusion on each second feature in the second feature representation according to the corresponding association weight, as the third element feature of each element; fusing the second element feature and the third element feature of each element to obtain the fourth element feature of each element; and combining the fourth element features of all elements in the four-dimensional spatiotemporal code to obtain the four-dimensional spatiotemporal representation of the first data sequence and the second data sequence.
[0015] In some embodiments, aligning the first feature representation and the second feature representation includes: aligning positional information in the first feature representation and the second feature representation through coordinate transformation; and aligning semantic information in the first feature representation and the second feature representation through upsampling or downsampling.
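The two alignment steps in [0015] can be illustrated under stated assumptions: positional alignment is shown as a rigid transform with a 4×4 homogeneous extrinsic matrix, and the upsampling or downsampling is interpreted here as nearest-neighbour resampling of one feature map to the grid size of the other. Both interpretations, and the names `transform_points` and `resize_nearest`, are assumptions rather than details given in the disclosure.

```python
import numpy as np

def transform_points(points, extrinsic):
    """Map (N, 3) points from a sensor frame into a common frame.

    extrinsic: assumed 4x4 homogeneous sensor-to-vehicle matrix.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (homo @ extrinsic.T)[:, :3]

def resize_nearest(grid, out_hw):
    """Resample an (H, W, C) feature map to (out_h, out_w, C)."""
    h, w = grid.shape[:2]
    rows = np.arange(out_hw[0]) * h // out_hw[0]   # source row per output row
    cols = np.arange(out_hw[1]) * w // out_hw[1]   # source column per output column
    return grid[rows][:, cols]

extrinsic = np.eye(4)
extrinsic[:3, 3] = [1.0, 0.0, 0.5]                 # pure translation for the example
moved = transform_points(np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]), extrinsic)

feature_map = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
upsampled = resize_nearest(feature_map, (8, 8))
downsampled = resize_nearest(feature_map, (2, 2))
```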
[0016] In some embodiments, the first data sequence and the second data sequence are respectively continuous multi-frame image data collected by the camera of the autonomous vehicle and continuous multi-frame point cloud data collected by the radar.
[0017] In some embodiments, the number of elements in the four-dimensional spatiotemporal encoding is set or adjusted according to the level of detail of the four-dimensional spatiotemporal representation.
[0018] In some embodiments, the resolution of the four-dimensional spatiotemporal representation is set or adjusted based on at least one of the graphics processor's memory performance, scene requirements, and computing load of the autonomous vehicle.
[0019] In some embodiments, the method further includes: determining the attention weights of all elements in the four-dimensional spatiotemporal encoding based on the second element features of all elements in the four-dimensional spatiotemporal encoding; retaining the second element features of the first element if the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets a preset condition; and setting the second element features of the second element to zero if the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset condition.
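The ranking-based retention in [0019] can be sketched as a top-k mask. The disclosure does not specify how the attention weights are computed or what the preset condition is; the sketch takes the weights as a given input and assumes the condition is "ranked within the top k". The name `keep_top_k` is hypothetical.

```python
import numpy as np

def keep_top_k(element_features, attention_weights, k):
    """element_features: (N, D); attention_weights: (N,), one per element.

    Keeps the features of the k highest-weighted elements and zeroes the rest.
    """
    order = np.argsort(attention_weights)[::-1]     # highest weight first
    keep = np.zeros(len(attention_weights), dtype=bool)
    keep[order[:k]] = True
    return np.where(keep[:, None], element_features, 0.0), keep

features = np.ones((4, 2))
weights = np.array([0.1, 0.4, 0.2, 0.3])
sparse, keep = keep_top_k(features, weights, k=2)
```

The output keeps its original shape, so downstream consumers see the same layout with low-ranked elements zeroed out.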
[0020] In some embodiments, the method further includes: determining the attention weights of all elements in the four-dimensional spatiotemporal encoding based on the fourth element features of all elements in the four-dimensional spatiotemporal encoding; retaining the fourth element features of the first element if the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets a preset condition; and setting the fourth element features of the second element to zero if the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset condition.
[0021] In some embodiments, the downstream task includes an object detection task, and the task processing based on the four-dimensional spatiotemporal representation includes: compressing the four-dimensional spatiotemporal representation in the temporal dimension to obtain a three-dimensional spatial representation; performing three-dimensional convolutional decoding on the three-dimensional spatial representation to obtain a predicted feature map; determining candidate object boxes based on the confidence of the predicted feature map; and determining object boxes from the candidate object boxes based on non-maximum suppression.
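The last two steps of [0021], confidence-based candidate selection followed by non-maximum suppression, can be sketched as below. The sketch assumes axis-aligned 2D boxes in `(x1, y1, x2, y2)` form and thresholds of 0.5; the disclosure does not fix the box format or the thresholds, and the temporal compression and convolutional decoding are not shown.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union of one box against (K, 4) boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def detect(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of the boxes kept after thresholding and NMS."""
    cand = np.flatnonzero(scores >= score_thresh)       # candidate boxes
    cand = cand[np.argsort(scores[cand])[::-1]]         # best score first
    kept = []
    while cand.size:
        best, rest = cand[0], cand[1:]
        kept.append(int(best))
        # Drop candidates that overlap the kept box too strongly.
        cand = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return kept

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
kept = detect(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the fourth box falls below the confidence threshold.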
[0022] In some embodiments, the downstream task includes a trajectory prediction task, and the task processing based on the four-dimensional spatiotemporal representation includes: decoding the velocity and acceleration of the target in the four-dimensional spatiotemporal representation using a long short-term memory neural network; and predicting the trajectory of the target based on the velocity and acceleration of the target.
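The second step of [0022], predicting a trajectory from decoded velocity and acceleration, can be illustrated with a constant-acceleration rollout. This kinematic model is an assumption chosen for illustration; the disclosure does not state which motion model is used, and the long short-term memory decoder that produces the velocity and acceleration is not shown.

```python
import numpy as np

def rollout(position, velocity, acceleration, horizon_s, dt):
    """Future positions under constant acceleration.

    Returns an array of shape (steps, dims), one row per future time step.
    """
    position = np.asarray(position, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    acceleration = np.asarray(acceleration, dtype=float)
    steps = int(round(horizon_s / dt))
    t = dt * np.arange(1, steps + 1)[:, None]        # (steps, 1) elapsed time
    return position + velocity * t + 0.5 * acceleration * t ** 2

# A target at the origin moving at 10 m/s along x and accelerating at 2 m/s^2.
trajectory = rollout((0.0, 0.0), (10.0, 0.0), (2.0, 0.0), horizon_s=2.0, dt=1.0)
```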
[0023] In some embodiments, the downstream task includes a drivable region segmentation task. The task processing based on the four-dimensional spatiotemporal representation includes: compressing the four-dimensional spatiotemporal representation in the temporal and height dimensions to obtain a two-dimensional spatial representation; performing two-dimensional deconvolution processing on the two-dimensional spatial representation to obtain a predicted feature map; performing activation processing on the feature values of the predicted feature map to obtain a probability map; and segmenting the drivable region according to the probability of each pixel in the probability map and the determination threshold.
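The compression, activation, and thresholding steps of [0023] can be sketched on a toy single-channel grid. The sketch assumes mean pooling over the temporal and height axes, a sigmoid activation, and a threshold of 0.5; none of these choices is specified in the disclosure, and the two-dimensional deconvolution decoder is omitted.

```python
import numpy as np

def segment_drivable(logits_4d, threshold=0.5):
    """logits_4d: (T, X, Y, Z) toy single-channel representation.

    Returns (prob, mask), both of shape (X, Y).
    """
    bev = logits_4d.mean(axis=(0, 3))          # collapse time and height
    prob = 1.0 / (1.0 + np.exp(-bev))          # sigmoid activation
    return prob, prob >= threshold             # per-pixel decision

logits = np.full((2, 4, 4, 3), -5.0)
logits[:, :2, :, :] = 5.0                      # first two rows look drivable
prob, mask = segment_drivable(logits)
```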
[0024] In some embodiments, the method further includes: performing end-to-end training on a spatial feature extraction network for extracting spatial features, an inter-frame attention network for determining the correlation weights between frames and fusing spatial features, and a cross-attention network for performing cross-attention processing, based on the training data sequence and a preset loss function, wherein the loss function is constructed based on reconstruction loss, contrastive loss, and prediction loss.
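One possible way to combine the three loss terms named in [0024] is a weighted sum. The sketch below is an assumption throughout: it uses mean squared error for the reconstruction and prediction terms, an InfoNCE-style term for the contrastive loss, and equal weights. The disclosure names the three components but not their form or weighting.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive term: row i of anchors should match row i of positives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def total_loss(recon, target, anchors, positives, pred, future, w=(1.0, 1.0, 1.0)):
    """Weighted sum of reconstruction, contrastive, and prediction terms."""
    return (w[0] * mse(recon, target)
            + w[1] * info_nce(anchors, positives)
            + w[2] * mse(pred, future))

emb = np.eye(4)
matched = info_nce(emb, emb)                          # correct pairs
mismatched = info_nce(emb, np.roll(emb, 1, axis=0))   # shifted pairs
loss = total_loss(np.zeros(3), np.zeros(3), emb, emb, np.ones(3), np.ones(3))
```

The contrastive term is small when each anchor is paired with its own positive and large when the pairs are shifted, which is the behaviour such a term is meant to reward.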
[0025] This disclosure provides an information processing apparatus comprising one or more modules that perform an information processing method.
[0026] Some embodiments of this disclosure provide an information processing apparatus, including: a memory; and a processor coupled to the memory, the processor being configured to execute an information processing method based on instructions stored in the memory.
[0027] Some embodiments of this disclosure propose an autonomous vehicle, including an information processing device configured to perform an information processing method.
[0028] Some embodiments of this disclosure provide a computer-readable storage medium having computer instructions stored thereon, which, when executed by a processor, implement an information processing method.
[0029] Some embodiments of this disclosure provide a computer program product including computer instructions that, when executed by a processor, implement an information processing method.
Attached Figure Description
[0030] The accompanying drawings used in the description of the embodiments or related technologies will be briefly introduced below. This disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings.
[0031] Obviously, the accompanying drawings described below are merely some embodiments of this disclosure. Those skilled in the art can obtain other drawings based on these drawings without any creative effort.
[0032] Figure 1 shows a schematic diagram of the electrical architecture of an autonomous vehicle according to some embodiments of the present disclosure.
[0033] Figure 2 shows the external structure of an autonomous vehicle according to some embodiments of the present disclosure.
[0034] Figure 3 shows a schematic diagram of a four-dimensional spatiotemporal representation network according to some embodiments of the present disclosure.
[0035] Figure 4 shows a schematic diagram of the training process of a four-dimensional spatiotemporal representation network according to some embodiments of this disclosure.
[0036] Figure 5 shows a schematic diagram of a single-modal information processing method according to some embodiments of the present disclosure.
[0037] Figure 6 shows a schematic diagram of a multimodal information processing method according to some embodiments of the present disclosure.
[0038] Figure 7 shows a schematic diagram of an information processing apparatus according to some embodiments of the present disclosure.
[0039] Figure 8 shows a schematic diagram of an information processing apparatus according to other embodiments of the present disclosure.
Detailed Implementation
[0040] It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of this disclosure.
[0041] Those skilled in the art will understand that the terms "first," "second," etc., in the embodiments of this disclosure are only used to distinguish different steps, devices, or modules, and do not represent any specific technical meaning, nor do they indicate a necessary logical order between them.
[0042] It should also be understood that in the embodiments disclosed herein, "a plurality of" may refer to two or more, and "at least one" may refer to one, two or more.
[0043] It should also be understood that any component, data or structure mentioned in the embodiments of this disclosure can generally be understood as one or more unless expressly defined or given to the contrary in the context.
[0044] Furthermore, the term "and/or" in this disclosure merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character "/" in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.
[0045] It should also be understood that the description of the various embodiments in this disclosure emphasizes the differences between them; for their identical or similar parts, the embodiments may be referred to one another. For the sake of brevity, these parts are not described again.
[0046] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.
[0047] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.
[0048] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.
[0049] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.
[0050] Furthermore, to avoid obscuring this disclosure with unnecessary details, the accompanying drawings show only the processing steps and/or device structures closely related to the scheme of this disclosure, and omit other details that are not closely related to it.
[0051] Figure 1 illustrates the electrical architecture of an autonomous vehicle according to some embodiments of this disclosure. Autonomous vehicles may be, for example, driverless cars, unmanned delivery vehicles, unmanned vending vehicles, etc.
[0052] As shown in Figure 1, the autonomous driving vehicle 100 in this embodiment includes, for example, an autonomous driving module 110 and a chassis module 120. Depending on the needs, it may also include a remote monitoring and streaming module 130 and a cargo box module 140. For example, vehicles requiring remote monitoring are equipped with the remote monitoring and streaming module 130; vehicles without such requirements may omit it. Similarly, vehicles that carry cargo (such as trucks) are equipped with the cargo box module 140; vehicles without such requirements (such as passenger cars) may omit it.
[0053] The autonomous driving module 110 may include, as needed, one or more of the following: a central processing unit (Orin or Xavier module) 111, a traffic light recognition camera 112, a front camera 1131, a rear camera 1132, a left camera 1133, a right camera 1134, a LiDAR 114, a front blind spot radar 1151, a rear blind spot radar 1152, a left blind spot radar 1153, and a right blind spot radar 1154, a positioning module (such as BeiDou, GPS, etc.) 116, an inertial navigation unit 117, and a switch 118. Each camera can communicate with the autonomous driving module. To improve transmission speed and reduce wiring, GMSL (Gigabit Multimedia Serial Links) links can be used for communication. The central processing unit 111 can be implemented using a general-purpose central processing unit, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates, or transistors, etc. The central processing unit 111 can be configured to perform autonomous driving control.
[0054] The chassis module 120 may include, as needed, one or more of the following: battery 121, power management device 122, chassis controller 123, motor driver 124, drive motor 125, and communication module 126. Battery 121 provides power to the entire autonomous vehicle system. Battery 121 includes a main battery 1211 and a standby battery 1212. When the autonomous vehicle is running, the main battery 1211 powers each module of the autonomous vehicle. When the autonomous vehicle is in standby mode, the standby battery 1212 powers the central processing unit 111 and the communication module 126. The power management device 122 converts the output of battery 121 into different voltage levels usable by each module and controls power-on and power-off. The chassis controller 123 receives motion commands from the autonomous driving module 110 and controls the autonomous vehicle's steering, forward, reverse, and braking. The communication module 126 communicates with a backend server, enabling remote control of the autonomous vehicle by backend operators. The communication module 126 includes a cellular wireless communication device 1261 and a radio frequency communication device 1262. Cellular wireless communication device 1261 communicates using cellular wireless communication technology, such as 2G (second generation), 3G (third generation), 4G (fourth generation), or 5G (fifth generation) cellular wireless communication technology. Radio frequency communication device 1262 communicates using radio frequency communication technology.
[0055] The remote monitoring streaming module 130 may include, as needed, one or more of the following: a front monitoring camera 1311, a rear monitoring camera 1312, a left monitoring camera 1313, a right monitoring camera 1314, and a streaming module 132. The streaming module 132 transmits the video data captured by the monitoring cameras 1311-1314 to the backend server for viewing by backend operators.
[0056] The cargo box module 140 may include a cargo box 141 as needed, which is a cargo-carrying device for the autonomous vehicle. The cargo box module 140 also includes a display and interaction module 142 for interaction between the autonomous vehicle and the user. Users can perform operations such as picking up items, storing goods, and purchasing goods through the display and interaction module 142. The type of cargo box 141 can be changed according to actual needs. For example, in a logistics scenario, the cargo box may include multiple sub-boxes of different sizes, which can be used to load goods for delivery. In a retail scenario, the cargo box can be set as a transparent box so that users can see the products for sale.
[0057] Figure 2 shows the external structure of an autonomous vehicle according to some embodiments of the present disclosure, specifically an autonomous vehicle capable of carrying cargo. Autonomous vehicles with different functions can have different external structures; for example, the external structure of a passenger-carrying autonomous vehicle can follow that of a car. As shown in Figure 2, from the current perspective, the chassis 210, cargo box 141, display and interaction module 142, right-side camera 1134, lidar 114, rear blind spot radar 1152, left-side blind spot radar 1153, right-side blind spot radar 1154 and other equipment of the autonomous vehicle 200 can be seen.
[0058] Figure 3 illustrates a four-dimensional spatiotemporal representation network according to some embodiments of this disclosure. The four-dimensional spatiotemporal representation network can be applied to autonomous driving scenarios.
[0059] As shown in Figure 3, the four-dimensional spatiotemporal representation network includes a spatial feature extraction network for extracting spatial features, an inter-frame attention network for determining the correlation weights between frames and fusing spatial features, and a cross-attention network for performing cross-attention processing. Depending on the modality of the sensing data to be processed, such as image sensing data or point cloud sensing data, the spatial feature extraction network may include, for example, an image feature extraction network and/or a point cloud feature extraction network. The inter-frame attention network may include, for example, a self-attention network for processing image features and/or a graph attention network for processing point cloud features. The image feature extraction network and the self-attention network together form an image spatiotemporal fusion module, used to convert the image data sequence into a spatiotemporally fused image feature representation. The point cloud feature extraction network and the graph attention network together form a point cloud spatiotemporal fusion module, used to convert the point cloud data sequence into a spatiotemporally fused point cloud feature representation.
[0060] The networks included in the four-dimensional spatiotemporal representation network are built on neural networks, for example Transformer-based networks, although they are not limited to this example. For instance, an inter-frame attention network can be built on a Transformer encoder (4 layers, 8 heads). Neural networks are typically trained before inference, and the same applies to the four-dimensional spatiotemporal representation network. The spatial feature extraction network, inter-frame attention network, and cross-attention network are trained end-to-end based on the training data sequence and a predefined loss function. The loss function can be constructed, for example, based on reconstruction loss, contrastive loss, and prediction loss. The training data sequence may include, for example, image training data sequences and/or point cloud training data sequences.
[0061] Before training begins, the resolution of the four-dimensional spatiotemporal representation and the number of elements in the four-dimensional spatiotemporal encoding are set. Each element is a learnable vector. Each element corresponds to a four-dimensional spatiotemporal unit, and a four-dimensional spatiotemporal unit corresponds to a time and a three-dimensional spatial location.
[0062] The number of elements in four-dimensional spatiotemporal coding is set or adjusted according to the level of detail required for the four-dimensional spatiotemporal representation. The more elements there are, the more detailed the four-dimensional spatiotemporal representation becomes. For example, in simple scenarios with low detail requirements, such as highways, the number of elements in the four-dimensional spatiotemporal coding is set to 512, covering a range of approximately 200 meters; in complex scenarios with high detail requirements, such as urban intersections, the number of elements in the four-dimensional spatiotemporal coding is set to 2048, covering a range of approximately 50 meters.
[0063] The resolution of a four-dimensional spatiotemporal representation is (T×X×Y×Z), where T represents the time dimension and X×Y×Z represents the three-dimensional spatial feature dimensions. For example, 10×32×32×32 represents a 10-second time window, with each time step being 1 second. If the original area of a city road is 1km×1km×50m, it is compressed into a 32×32×32 feature grid, with each grid corresponding to approximately 30m×30m×1.5m of physical space.
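The arithmetic in [0063] can be checked directly. The helper name `cell_size` is hypothetical; the extents and grid sizes are the ones given in the example.

```python
def cell_size(extent_m, grid):
    """Physical size in metres of one grid cell along each spatial axis."""
    return tuple(e / g for e, g in zip(extent_m, grid))

resolution = (10, 32, 32, 32)                      # T x X x Y x Z from the example
size = cell_size((1000.0, 1000.0, 50.0), resolution[1:])
print(size)  # (31.25, 31.25, 1.5625)
```

The exact values are 31.25 m × 31.25 m × 1.5625 m per cell, which the text rounds to approximately 30 m × 30 m × 1.5 m.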
[0064] The resolution of the four-dimensional spatiotemporal representation is set or adjusted, for example, based on at least one of the following: the memory performance of the graphics processing unit (GPU) of the autonomous vehicle, scene requirements, and computing load.
[0065] For example, the larger the scene (such as a highway), the larger the spatial dimension (such as 64×64×32) and the longer the time window (such as 15 seconds); the smaller the scene (such as a parking lot), the smaller the spatial dimension (such as 16×16×16) and the shorter the time window (such as 5 seconds).
[0066] For example, the smaller the video memory (e.g., 8GB), the smaller the space dimension (e.g., T×16×16×16); the larger the video memory (e.g., 24GB), the larger the space dimension (e.g., T×32×32×32).
[0067] For example, if the computing load of an autonomous vehicle is too high, such as >80%, and the computing power is insufficient, the resolution of the four-dimensional spatiotemporal representation will be automatically reduced, such as from T×32×32×32 to T×16×16×16.
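A minimal sketch of such load-based adjustment is given below. The function name `adjust_resolution`, the halving rule, and the lower bound `min_side` are assumptions made for illustration; the disclosure only gives the example of reducing T×32×32×32 to T×16×16×16 when the load exceeds 80%.

```python
def adjust_resolution(resolution, load, threshold=0.8, min_side=16):
    """Halve the spatial dimensions when the computing load exceeds the threshold.

    resolution: (T, X, Y, Z); load: fraction in [0, 1].
    The halving rule and min_side are illustrative assumptions.
    """
    t, x, y, z = resolution
    if load > threshold:
        x, y, z = (max(min_side, d // 2) for d in (x, y, z))
    return (t, x, y, z)
```

The time dimension T is left unchanged in this sketch, matching the example in which only the spatial dimensions are reduced.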
[0068] Preprocess the training data. If the training data includes both image training data and point cloud training data, time-align the image training data and point cloud training data using hardware synchronization. The time deviation of the hardware synchronization should be in the millisecond range, such as less than 10ms. Perform data augmentation on the training data. For example, randomly rotate (e.g., ±5°), translate (e.g., ±0.5m), or occlude (e.g., randomly discard 20% of the point cloud) the training data.
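The augmentation of a point cloud frame might look like the following sketch. It assumes that rotation is applied about the vertical (z) axis and translation in the horizontal plane, which the disclosure does not specify; the function name `augment_point_cloud` and its parameters are hypothetical.

```python
import math
import random

def augment_point_cloud(points, max_angle_deg=5.0, max_shift_m=0.5,
                        drop_ratio=0.2, rng=None):
    """Randomly rotate (about z), translate (in x/y), and subsample a point cloud.

    points: list of (x, y, z, r) tuples; z and the reflection intensity r
    are left unchanged (an assumption of this sketch).
    """
    rng = rng or random.Random()
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    shift_x = rng.uniform(-max_shift_m, max_shift_m)
    shift_y = rng.uniform(-max_shift_m, max_shift_m)

    # Randomly discard drop_ratio of the points to simulate occlusion.
    n_keep = len(points) - int(drop_ratio * len(points))
    keep = sorted(rng.sample(range(len(points)), k=n_keep))

    augmented = []
    for i in keep:
        x, y, z, r = points[i]
        augmented.append((cos_a * x - sin_a * y + shift_x,
                          sin_a * x + cos_a * y + shift_y,
                          z, r))
    return augmented
```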
[0069] Figure 4 shows a schematic diagram of the training process of a four-dimensional spatiotemporal representation network according to some embodiments of this disclosure. As shown in Figure 4, the training process of the four-dimensional spatiotemporal representation network includes: step 410, inputting training data; step 420, forward propagation; step 430, loss calculation; step 440, backpropagation; and step 450, iterative optimization until the four-dimensional spatiotemporal representation network converges.
[0070] Step 410, Training Data Input: Input unimodal or multimodal training data into the four-dimensional spatiotemporal representation network. Training data may include, for example, image training data and / or point cloud training data. Point clouds may be acquired, for example, by LiDAR or millimeter-wave radar. Images may be acquired, for example, by a camera.
[0071] Step 420, Forward Propagation: The training data is processed sequentially through a spatial feature extraction network, an inter-frame attention network, and a cross-attention network to output a four-dimensional spatiotemporal representation. Specifically, the training data for the image modality is processed sequentially through an image feature extraction network, a self-attention network, and a cross-attention network; the training data for the point cloud modality is processed sequentially through a point cloud feature extraction network, a graph attention network, and a cross-attention network. If the multimodal training data for both the image and point cloud modalities are input together, a cross-attention network is also needed for multimodal fusion to output a multimodal four-dimensional spatiotemporal representation.
[0072] Step 430, Loss Calculation: Construct a loss function based on reconstruction loss, contrastive loss, and prediction loss. Reconstruction Loss requires that the original data can be reconstructed after decoding the four-dimensional spatiotemporal representation (e.g., L2 loss). Contrastive Loss requires that positive samples (different viewpoints of the same scene) have close representation distances, while negative samples (different scenes) have large representation distances. Future Prediction Loss: Randomly mask several future frames (e.g., 3 frames), and predict the point cloud or image of the masked frames based on the four-dimensional spatiotemporal representation (cross-entropy loss).
[0073] Assume a batch size of B, where each sample contains a time series of length T (including past observation frames and future frames). Scene and view indices: different views of the same scene are denoted v, v′; different scenes are denoted s, s′. The four-dimensional spatiotemporal representation network is denoted f(·); its decoding and prediction heads are denoted g(·) and h(·), respectively. Point cloud data is denoted $P_{b,t}$ and image data is denoted $I_{b,t}$, for sample b and frame t. $\|\cdot\|_2$ is the L2 norm, $\langle\cdot,\cdot\rangle$ is the vector inner product, τ>0 is the contrastive temperature, M is the mask indicator (1 indicates that the frame is masked), and CE is the cross-entropy.
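[0074] Reconstruction loss (L2 loss): For each frame t, the four-dimensional spatiotemporal representation $z_{b,t}$ is decoded back to the original modality. The reconstruction loss function is, for example:
$$\mathcal{L}_{rec} = \frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\left(\lambda_{pc}\left\|g_{pc}(z_{b,t}) - P_{b,t}\right\|_2^2 + \lambda_{img}\left\|g_{img}(z_{b,t}) - I_{b,t}\right\|_2^2\right)$$
[0075] where $g_{pc}$ and $g_{img}$ are the decoders for point clouds and images, respectively; $\lambda_{pc}$ and $\lambda_{img}$ are weighting factors for balancing the point cloud and image terms; $z_{b,t}$ is the four-dimensional spatiotemporal representation of frame t of batch b; $P_{b,t}$ is the point cloud data of frame t of batch b; and $I_{b,t}$ is the image data of frame t of batch b.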
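[0076] Contrastive loss (InfoNCE loss): The four-dimensional spatiotemporal representation $z_{b,t}$ of frame t is taken as the anchor, the representation $z_{b,t}^{+}$ of the same scene from another view is taken as the positive sample, and the representations of all frames in different scenes, $\mathrm{Neg} = \{z_{k}^{-}\}$, are taken as negative samples. The contrastive loss function is, for example:
$$\mathcal{L}_{con} = -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log\frac{\exp\left(\langle z_{b,t}, z_{b,t}^{+}\rangle/\tau\right)}{\exp\left(\langle z_{b,t}, z_{b,t}^{+}\rangle/\tau\right) + \sum_{z_{k}^{-}\in\mathrm{Neg}}\exp\left(\langle z_{b,t}, z_{k}^{-}\rangle/\tau\right)}$$
[0077] where $z_{b,t}$ is the four-dimensional spatiotemporal representation of frame t of batch b, $z_{b,t}^{+}$ is the four-dimensional spatiotemporal representation of the positive sample for frame t of batch b, and $z_{k}^{-}$ is the k-th negative sample.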
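[0078] Future prediction loss (cross-entropy loss): Frames are masked at random, for example the next 3 consecutive frames, with masked index set $\mathcal{M} = \{t+1, t+2, t+3\}$, and the four-dimensional spatiotemporal representation $z_{b,t}$ of frame t is used to predict the masked point clouds and images. The prediction loss function is, for example:
$$\mathcal{L}_{pred} = \frac{1}{B\,|\mathcal{M}|}\sum_{b=1}^{B}\sum_{t'\in\mathcal{M}}\left(\mathrm{CE}\left(h_{pc}(z_{b,t}), P_{b,t'}\right) + \mathrm{CE}\left(h_{img}(z_{b,t}), I_{b,t'}\right)\right)$$
[0079] where $h_{pc}$ and $h_{img}$ are the prediction heads for point clouds and images, respectively; $z_{b,t}$ is the four-dimensional spatiotemporal representation of frame t of batch b; $P_{b,t'}$ is the point cloud data of masked frame t′ of batch b; $I_{b,t'}$ is the image data of masked frame t′ of batch b; and CE is the cross-entropy.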
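[0080] Total loss function: $\mathcal{L} = \mathcal{L}_{rec} + \alpha\,\mathcal{L}_{con} + \beta\,\mathcal{L}_{pred}$, where $\mathcal{L}_{rec}$, $\mathcal{L}_{con}$, and $\mathcal{L}_{pred}$ represent the reconstruction loss, contrastive loss, and future prediction loss, respectively, and $\alpha$ and $\beta$ are weighting coefficients, which are hyperparameters.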
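The loss terms of step 430 can be illustrated with a small sketch. It covers an L2 reconstruction term, an InfoNCE contrastive term for a single anchor, and the weighted total; the function names, the default temperature, and the choice to leave the reconstruction term unweighted in the total are assumptions for illustration rather than the disclosed formulas.

```python
import math

def l2_reconstruction_loss(decoded, target):
    """Mean squared L2 distance between decoded and original vectors."""
    return sum(
        sum((d - t) ** 2 for d, t in zip(dec, tgt))
        for dec, tgt in zip(decoded, target)
    ) / len(decoded)

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor, one positive, and several negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(anchor, positive) / tau]
    logits += [dot(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_denominator - logits[0]

def total_loss(l_rec, l_con, l_pred, w_con=1.0, w_pred=1.0):
    """Weighted sum of the three losses; the weights are hyperparameters."""
    return l_rec + w_con * l_con + w_pred * l_pred
```

The contrastive term is small when the anchor is closer to its positive sample than to the negatives, and grows when a negative sample is more similar to the anchor than the positive is.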
[0081] Step 440, Backpropagation: Calculate the gradient based on the loss, and update the parameters of the spatial feature extraction network, inter-frame attention network, and cross-attention network based on the gradient. Network parameter updates can, for example, use the Adam optimizer, with a learning rate of, for example, 0.0001 and a batch size of, for example, 8.
[0082] Step 450, Iterative optimization: Repeat the process of forward propagation in step 420, loss calculation in step 430, and backpropagation in step 440 until the four-dimensional spatiotemporal representation network converges.
[0083] Based on the trained four-dimensional spatiotemporal representation network, inference of the four-dimensional spatiotemporal representation of a data sequence can be performed. The data sequence can be unimodal or multimodal.
[0084] Figure 5 shows a schematic diagram of a single-modal information processing method according to some embodiments of this disclosure. As shown in Figure 5, the information processing method in the single-modal case of this embodiment includes the following steps 510-540, and step 550 may also be added according to task requirements.
[0085] Step 510: Obtain the first data sequence, including multiple consecutive frames of first sensing data collected by the first sensor of the autonomous vehicle.
[0086] If the first sensor is a camera, the first sensing data is image data, and the first data sequence is a series of multiple frames of image data captured by the camera of the autonomous vehicle, assuming there are N frames. If the image is an RGB image, R / G / B represent the red, green, and blue color channels respectively, and each frame of the RGB image is represented as H×W×3, where H / W / 3 represent the height, width, and three color channels of the image respectively.
[0087] If the first sensor is radar, the first sensing data is point cloud data, and the first data sequence is a series of multiple frames of point cloud data collected by the radar of the autonomous vehicle, assuming there are N frames. Each point in a frame of point cloud is represented as (x,y,z,r), that is, the three-dimensional spatial coordinates (x,y,z) and the reflection intensity (r).
[0088] Assuming the length of the first data sequence is N frames, each time a new data frame is received, the new frame is combined with the previously received N-1 frames to form the first data sequence of N frames, and the information processing method of this embodiment is executed.
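The sliding-window behavior described here can be sketched with a bounded queue. The class name `FrameWindow` is hypothetical; the sketch only shows how each new frame is combined with the previous N-1 frames to form the N-frame sequence.

```python
from collections import deque

class FrameWindow:
    """Keeps the most recent N frames and returns a full sequence once available."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def push(self, frame):
        """Add a new frame; return the N-frame sequence, or None if not yet full."""
        self.frames.append(frame)  # the oldest frame is dropped automatically
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None
```

Once the window is full, every call to `push` yields a sequence on which the information processing method can be executed.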
[0089] Step 520: Extract the first spatial features of each frame.
[0090] If the first data sequence is a series of consecutive frames of image data collected by the camera of an autonomous vehicle, then step 520 specifically includes: performing three-dimensional convolution processing on each frame of image data to obtain the first spatial feature of each frame of image data.
[0091] For example, an image feature extraction network can be used to extract the first spatial features of a single frame of image data using a three-dimensional convolutional kernel (e.g., 3×3×3), outputting a feature map (e.g., H / 16×W / 16×256). H / 16×W / 16 represents the spatial resolution of the feature map (16x downsampling), and 256 represents the number of feature channels (fixed dimension). The original image is H×W×3, with high spatial dimension and few channels; the feature map H / 16×W / 16×256 compresses the spatial dimension and increases the number of channels, extracting high-dimensional semantic features.
[0092] If the first data sequence is a series of point cloud data collected by the radar of an autonomous vehicle, then step 520 specifically includes: performing sparse voxelization processing on each frame of point cloud data, with a voxel size of, for example, 0.1m×0.1m×0.1m; and using a point cloud feature extraction network to perform three-dimensional sparse convolution processing on each obtained sparse voxel to obtain the first spatial features of each sparse voxel, which together form the first spatial features of each frame of point cloud data (such as X×Y×Z×128). Here X×Y×Z represents the spatial dimensions of the sparse voxel grid, corresponding to the x / y / z axes of the physical space respectively, and 128 represents the number of feature channels (high-dimensional semantic feature dimension) of each sparse voxel.
[0093] Sparse voxelization at 0.1m×0.1m×0.1m enables centimeter-level physical spatial alignment, accurately preserving the fine-grained geometric features of the point cloud and meeting the high-precision modeling requirements of the four-dimensional representation. Sparse convolution performs convolution operations only on non-empty voxels, significantly reducing computational load and memory usage. The X×Y×Z×128 first spatial feature output is an ordered grid structure, which not only models the three-dimensional spatial relationships of the point cloud but is also consistent with the grid structure of the image features, facilitating subsequent cross-modal fusion.
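Sparse voxelization can be sketched as grouping points by integer voxel index so that only non-empty voxels are stored. In the sketch below, the per-voxel mean is a simple stand-in for the learned sparse-convolution features, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def voxelize(points, voxel_size=0.1):
    """Group (x, y, z, r) points by voxel index; only non-empty voxels are stored."""
    voxels = defaultdict(list)
    for x, y, z, r in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z, r))
    return dict(voxels)

def voxel_mean_features(voxels):
    """Mean of the points in each voxel, used here as a placeholder feature."""
    return {
        key: tuple(sum(c) / len(pts) for c in zip(*pts))
        for key, pts in voxels.items()
    }
```

Because the dictionary holds entries only for occupied voxels, the memory cost scales with the number of non-empty voxels rather than with the full X×Y×Z grid.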
[0094] Step 530: Based on the correlation weights between frames, fuse the first spatial features of multiple consecutive frames to obtain the first feature representation after spatiotemporal fusion.
[0095] If the first data sequence is a series of consecutive frames of image data collected by the camera of an autonomous vehicle, then step 530 can be implemented, for example, based on a self-attention network. Specifically, this includes: determining the correlation weights (such as attention coefficients) between frames through self-attention; performing weighted fusion (such as weighted summation) on the first spatial features of the consecutive frames according to the correlation weights between each frame and the other frames to obtain the first spatiotemporal features of each frame; and combining (such as splicing) the first spatiotemporal features of the consecutive frames to obtain the first feature representation (such as T×H / 16×W / 16×256). Here T represents the number of time steps of the consecutive frames, for example T=N; T=10, for example, represents 10 frames of a 10-second time window. The meanings of the other symbols are as described above.
[0096] Thus, the dynamic spatiotemporal dependencies between frames are adaptively captured. For example, the "vehicle" feature in frame t is associated with the "brake light" feature in frame t-1 with high weight. This not only preserves the details of the spatial features of a single frame, but also strengthens the temporal correlation of multiple frames. The resulting spatiotemporal feature representation has both spatial accuracy and temporal consistency.
[0097] In some embodiments, determining the inter-frame association weights through self-attention includes: flattening the first spatial features of multiple consecutive frames into corresponding frame feature sequences; inputting the frame feature sequences into three independent linear transformation layers to generate a query matrix Q, a key matrix K, and a value matrix V; and determining the inter-frame association weights based on the dot product of Q and the transpose of K.
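The inter-frame weight computation can be sketched as follows. Each frame is assumed to be already flattened to a single D-dimensional vector, and the softmax normalization and the scaling by the square root of the feature dimension follow common attention practice rather than anything stated in the disclosure; the function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def inter_frame_attention(frames, w_q, w_k, w_v):
    """frames: T flattened frame features (T x D); w_q, w_k, w_v: D x D projections.

    Returns (weights, fused): T x T association weights and T x D fused features.
    """
    q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
    scale = math.sqrt(len(q[0]))  # sqrt(d) scaling is an assumption of this sketch
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # dot product of Q and the transpose of K
    weights = [softmax([s / scale for s in row]) for row in scores]
    fused = matmul(weights, v)  # weighted sum of the value vectors for each frame
    return weights, fused
```

Each row of `weights` sums to 1 and expresses how strongly one frame attends to every frame in the window, including itself.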
[0098] If the first data sequence is continuous multi-frame point cloud data collected by the radar of an autonomous vehicle, then step 530 can be implemented, for example, based on a graph attention network. Specifically, this includes: determining the association weights (such as attention coefficients, representing the similarity between voxels) of sparse voxels between frames through graph attention, wherein the nodes of the graph are the first spatial features of the sparse voxels, the edges of the graph are the associations of sparse voxels between frames, and the inter-frame association weights of the sparse voxels are generated by a multilayer perceptron (MLP); performing weighted fusion (such as weighted summation) on the first spatial features of the sparse voxels of the consecutive frames according to the association weights between the sparse voxels of each frame and the sparse voxels of the other frames to obtain the first spatiotemporal features of the sparse voxels of each frame; and combining (such as splicing) the first spatiotemporal features of the sparse voxels of the consecutive frames to obtain the first feature representation (such as T×X×Y×Z×128). Here T represents the number of time steps of the consecutive frames, for example T=N; T=10, for example, represents 10 frames of a 10-second time window. The meanings of the other symbols are as described above.
[0099] Thus, by using graph attention mechanism to simultaneously model and capture the spatial and temporal dependencies in temporal point cloud data, dynamic spatiotemporal correlation modeling between frames is achieved at the sparse voxel granularity. This preserves the three-dimensional geometric sparsity of the point cloud while accurately capturing the temporal dependencies at the voxel level (such as the motion trajectory of vehicles / pedestrians). The resulting feature representation combines sparse efficiency with fine-grained spatiotemporal correlation.
[0100] Step 540: Use the preset four-dimensional spatiotemporal code as the query in the cross-attention process, and perform cross-attention processing on the first feature representation to obtain the four-dimensional spatiotemporal representation of the first data sequence, wherein each element in the four-dimensional spatiotemporal code corresponds to a four-dimensional spatiotemporal unit.
[0101] In some embodiments, step 540 is implemented, for example, by a cross-attention network, guided by four-dimensional spatiotemporal coding, to perform spatiotemporal information weighted aggregation on the first feature representation, specifically including: (1) determining the association weight (such as attention coefficient) between the first element feature of each element in the four-dimensional spatiotemporal coding and each first feature in the first feature representation through cross-attention; (2) performing weighted fusion (such as weighted summation) on each first feature in the first feature representation according to the corresponding association weight of each first feature in the first feature representation, as the second element feature of each element; (3) combining (such as splicing) the second element features of all elements in the four-dimensional spatiotemporal coding to obtain the four-dimensional spatiotemporal representation of the first data sequence.
[0102] Steps (1) and (2) can be executed iteratively once or multiple times. In the first iteration, the first element feature of an element is the initial element feature of that element. In subsequent iterations, the first element feature of an element is the second element feature of that element determined in the previous iteration.
[0103] For example, the number of elements in the four-dimensional spatiotemporal coding is assumed to be 1024, and the semantic dimension of each element is assumed to be 128. Assume that the first feature representation is the first feature representation of the point cloud (T×X×Y×Z×128), including T×X×Y×Z first features, and the semantic dimension of each first feature is 128. For each element, let it be element a, calculate the association weight between the first element feature of element a and each first feature. Based on the association weight of each first feature, perform a weighted sum of T×X×Y×Z first features, which is taken as the second element feature of element a. In this way, the second element feature of each element can be obtained. By combining the second element features of all elements, the four-dimensional spatiotemporal representation of the first data sequence can be obtained.
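The iteration over elements described above can be sketched as follows. The learned query, key, and value projections of a real cross-attention network are omitted for brevity, softmax normalization is assumed, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, features, iterations=1):
    """queries: element features (E x C); features: flattened first features (M x C).

    Each iteration replaces every element feature with the attention-weighted
    sum of the first features, as in steps (1) and (2).
    """
    scale = math.sqrt(len(features[0]))
    for _ in range(iterations):
        updated = []
        for q in queries:
            weights = softmax([dot(q, f) / scale for f in features])
            updated.append([
                sum(w * f[c] for w, f in zip(weights, features))
                for c in range(len(q))
            ])
        queries = updated
    return queries  # E x C: second element features of all elements, step (3)
```

The output size depends only on the number of elements E, not on T×X×Y×Z, which is how a fixed-size representation is obtained from an input of arbitrary extent.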
[0104] Furthermore, based on the second-element features of all elements in the four-dimensional spatiotemporal encoding, the attention weights of all elements in the four-dimensional spatiotemporal encoding are determined. For example, the attention weights of elements can be obtained by processing the second-element features of elements using a neural network. If the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets a preset condition, the second-element features of the first element are retained. If the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset condition, the second-element features of the second element are set to zero. For example, the second-element features of the top 10% of elements in terms of attention weight are retained, and the second-element features of the remaining elements are set to zero.
[0105] Thus, by focusing on core spatiotemporal information and filtering out redundant / noise features, we can improve computational efficiency and enhance the expression of key features, making subsequent four-dimensional spatiotemporal representations more accurate and efficient.
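The retention rule can be sketched as a ranking followed by zeroing. The attention weights are taken as given inputs here (the disclosure obtains them from a neural network), and keeping at least one element is an assumption of the sketch; the function name is hypothetical.

```python
def retain_top_elements(element_features, attention_weights, keep_ratio=0.1):
    """Zero out element features whose attention weight ranks outside the top fraction."""
    n_keep = max(1, int(len(element_features) * keep_ratio))
    ranked = sorted(range(len(attention_weights)),
                    key=lambda i: attention_weights[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [
        list(feat) if i in keep else [0.0] * len(feat)
        for i, feat in enumerate(element_features)
    ]
```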
[0106] Step 550: Various downstream tasks of autonomous vehicles perform task processing based on four-dimensional spatiotemporal representation.
[0107] Four-dimensional spatiotemporal representation serves as a universal input for various downstream tasks, which can be completed with only a lightweight decoder. These downstream tasks for autonomous vehicles include, but are not limited to, detection and prediction tasks such as object detection, object tracking, trajectory prediction, and scene segmentation.
[0108] Traditional methods require designing anchor boxes for detection tasks and LSTM (Long Short-Term Memory) heads for prediction tasks, so the models cannot be reused across tasks.
[0109] For example, downstream tasks include object detection tasks. The task processing based on four-dimensional spatiotemporal representation includes: (1) compressing the four-dimensional spatiotemporal representation in the temporal dimension (e.g., pooling) to obtain a three-dimensional spatial representation; (2) performing three-dimensional convolution decoding on the three-dimensional spatial representation to obtain a predicted feature map; (3) determining candidate object boxes based on the confidence of the predicted feature map, wherein the predicted feature map contains two types of information: the confidence score represents the probability of an object at each position (e.g., 0~1), and the box regression parameters represent the coordinates / size / angle of the object box, etc. Based on the set confidence threshold (e.g., 0.5), the boxes corresponding to the positions with confidence ≥ the threshold are retained and used as candidate object boxes; (4) determining object boxes from the candidate object boxes based on non-maximum suppression, for example: sorting the candidate boxes from high to low confidence; selecting the box with the highest confidence as the "base box" and calculating its overlap with the other boxes, such as IoU (Intersection over Union); removing overlapping boxes with IoU ≥ a threshold (e.g., 0.7); and repeating the above steps for the remaining boxes to obtain non-overlapping, high-confidence object boxes.
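Sub-steps (3) and (4) can be sketched as confidence filtering followed by greedy non-maximum suppression. For brevity, the sketch assumes axis-aligned two-dimensional boxes (x1, y1, x2, y2) instead of the oriented three-dimensional boxes implied by the regression parameters; the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detect(boxes, scores, conf_threshold=0.5, iou_threshold=0.7):
    """Return indices of boxes kept after thresholding and greedy NMS."""
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= conf_threshold),
        key=lambda i: scores[i], reverse=True,
    )
    kept = []
    while candidates:
        base = candidates.pop(0)  # highest-confidence remaining box ("base box")
        kept.append(base)
        # Drop every remaining box that overlaps the base box too strongly.
        candidates = [i for i in candidates
                      if iou(boxes[base], boxes[i]) < iou_threshold]
    return kept
```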
[0110] For example, downstream tasks include trajectory prediction tasks. Task processing based on four-dimensional spatiotemporal representation includes: (1) using a long short-term memory neural network to decode the velocity and acceleration of the target in the four-dimensional spatiotemporal representation; (2) predicting the future trajectory of the target based on the velocity and acceleration of the target.
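Sub-step (2) can be illustrated with a constant-acceleration extrapolation. This kinematic model is an assumption chosen for the sketch, since the disclosure does not state how the trajectory is derived from velocity and acceleration, and the LSTM decoding of sub-step (1) is not shown.

```python
def predict_trajectory(position, velocity, acceleration, horizon_s=3.0, step_s=1.0):
    """Extrapolate future positions under a constant-acceleration assumption.

    position, velocity, acceleration: per-axis tuples of equal length.
    Returns one predicted position per time step up to the horizon.
    """
    steps = int(round(horizon_s / step_s))
    trajectory = []
    for k in range(1, steps + 1):
        t = k * step_s
        trajectory.append(tuple(
            p + v * t + 0.5 * a * t * t
            for p, v, a in zip(position, velocity, acceleration)
        ))
    return trajectory
```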
[0111] For example, downstream tasks include drivable area segmentation tasks. The task processing based on four-dimensional spatiotemporal representation includes: (1) compressing the four-dimensional spatiotemporal representation in the temporal and height dimensions (e.g., pooling) to obtain a two-dimensional spatial representation; (2) performing two-dimensional deconvolution on the two-dimensional spatial representation to obtain a predicted feature map; (3) performing activation processing on the feature values of the predicted feature map to obtain a probability map. The probability of each pixel in the probability map represents the confidence level (e.g., 0~1) that the physical location corresponding to the pixel belongs to the drivable area; (4) segmenting the drivable area and the non-drivable area according to the probability of each pixel in the probability map and the judgment threshold (e.g., 0.5).
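The pooling, activation, and thresholding of this task can be sketched as below. The sketch assumes one scalar per four-dimensional unit, uses mean pooling and a sigmoid activation, and omits the learned two-dimensional deconvolution of sub-step (2); these simplifications and the function name are assumptions for illustration.

```python
import math

def drivable_mask(rep, threshold=0.5):
    """rep: nested list indexed [t][x][y][z] holding one scalar per 4D unit.

    Mean-pools over time and height, applies a sigmoid, and thresholds the
    resulting probability map into drivable (True) and non-drivable (False).
    """
    n_t, n_x, n_y, n_z = len(rep), len(rep[0]), len(rep[0][0]), len(rep[0][0][0])
    mask = []
    for x in range(n_x):
        row = []
        for y in range(n_y):
            pooled = sum(rep[t][x][y][z]
                         for t in range(n_t) for z in range(n_z)) / (n_t * n_z)
            prob = 1.0 / (1.0 + math.exp(-pooled))  # sigmoid activation
            row.append(prob >= threshold)
        mask.append(row)
    return mask
```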
[0112] This embodiment employs an integrated spatiotemporal information modeling logic to achieve native correlation and deep fusion of spatiotemporal information without the need for an additional time-series alignment module. Through spatiotemporal fusion representation, it can accurately capture instantaneous and dynamic correlations in dynamic scenes, fundamentally solving the problems of lost instantaneous correlation features and inaccurate capture of dynamic correlations caused by spatiotemporal separation modeling. Furthermore, it compresses high-dimensional redundant single-modal four-dimensional spatiotemporal data (such as point clouds or images of continuous frames) into a compact four-dimensional spatiotemporal representation of a preset encoding size, saving storage and computing resources. Moreover, the generated four-dimensional spatiotemporal representation is a universal representation independent of downstream tasks, and can support various downstream tasks without redesigning the feature extraction model.
[0113] Figure 6 shows a schematic diagram of a multimodal information processing method according to some embodiments of this disclosure. As shown in Figure 6, the information processing method in the multimodal case of this embodiment includes the following steps 610-680, and step 690 may also be added according to task requirements.
[0114] Step 610: Obtain the first data sequence, including multiple consecutive frames of first sensing data collected by the first sensor of the autonomous vehicle.
[0115] Step 620: Extract the first spatial features of each frame.
[0116] Step 630: Based on the correlation weights between frames, fuse the first spatial features of multiple consecutive frames to obtain the first feature representation after spatiotemporal fusion.
[0117] Step 640: Obtain the second data sequence, including multiple consecutive frames of second sensing data collected by the second sensor of the autonomous vehicle.
[0118] The first data sequence and the second data sequence are data sequences of different modalities, such as continuous multi-frame image data collected by the camera of an autonomous vehicle and continuous multi-frame point cloud data collected by radar, respectively.
[0119] Step 650: Extract the second spatial features of each frame.
[0120] Step 660: Based on the correlation weights between frames, fuse the second spatial features of multiple consecutive frames to obtain the second feature representation after spatiotemporal fusion.
[0121] It should be noted that steps 610-630 and steps 640-660 are each a process of determining a spatiotemporally fused feature representation for a single modality. For the specific implementation, reference may be made to steps 510-530 of the aforementioned embodiment, which will not be repeated here.
[0122] Step 670: Align the first feature representation with the second feature representation.
[0123] In some embodiments, positional information in the first feature representation and the second feature representation is aligned through coordinate transformation.
[0124] For example, the first feature representation of an image includes positional information in the image coordinate system, and the second feature representation of a point cloud includes positional information in the point cloud coordinate system. A transformation matrix from image coordinate system to point cloud coordinate system is used to transform the positional information in the first feature representation of the image from the image coordinate system to the point cloud coordinate system. Alternatively, a transformation matrix from image coordinate system to vehicle coordinate system is used to transform the positional information in the first feature representation of the image from the image coordinate system to the vehicle coordinate system, and a transformation matrix from point cloud coordinate system to vehicle coordinate system is used to transform the positional information in the second feature representation of the point cloud from the point cloud coordinate system to the vehicle coordinate system.
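Positional alignment through a transformation matrix can be sketched with a 4×4 homogeneous transform applied to a three-dimensional position. The matrix values below are hypothetical extrinsics chosen for illustration, and the sketch assumes the positions are already expressed as three-dimensional points in the sensor coordinate system.

```python
def transform_point(matrix, point):
    """Apply a 4x4 homogeneous transformation matrix to a 3D point."""
    x, y, z = point
    vec = (x, y, z, 1.0)
    out = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
    return (out[0] / out[3], out[1] / out[3], out[2] / out[3])

# Hypothetical extrinsics: sensor mounted 1.5 m ahead of and 1.2 m above the vehicle origin
sensor_to_vehicle = [
    [1.0, 0.0, 0.0, 1.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.2],
    [0.0, 0.0, 0.0, 1.0],
]
p_vehicle = transform_point(sensor_to_vehicle, (10.0, 2.0, 0.0))
```

Applying one such matrix per sensor brings the positions of both feature representations into the common vehicle coordinate system.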
[0125] In some embodiments, semantic information in the first feature representation and the second feature representation is aligned by upsampling or downsampling.
[0126] For example, the image feature channels in the first feature representation are 256-dimensional (high-dimensional semantics, such as texture and color); the point cloud feature channels in the second feature representation are 128-dimensional (high-dimensional semantics, such as reflection intensity). Upsampling (dimensionality increase) of the point cloud features increases the 128-dimensional channels of the point cloud features to 256-dimensional; or downsampling (dimensionality reduction) of the image features reduces the 256-dimensional image features to 128-dimensional.
[0127] Thus, the positional and semantic information of the first and second feature representations are aligned to enable multimodal spatiotemporal feature fusion.
[0128] Step 680: Use the preset four-dimensional spatiotemporal code as the query in the cross-attention process, and perform cross-attention processing on the first feature representation and the second feature representation to obtain the four-dimensional spatiotemporal representation of the first data sequence and the second data sequence. Each element in the four-dimensional spatiotemporal code corresponds to a four-dimensional spatiotemporal unit.
[0129] In some embodiments, step 680 specifically includes the following steps (1) to (6): (1) determining, through cross-attention, the association weight (such as attention coefficient) between the first element feature of each element in the four-dimensional spatiotemporal coding and each first feature in the first feature representation; (2) performing weighted fusion (such as weighted summation) on each first feature in the first feature representation according to the corresponding association weight of each first feature, as the second element feature of each element; (3) determining, through cross-attention, the association weight (such as attention coefficient) between the first element feature of each element in the four-dimensional spatiotemporal coding and each second feature in the second feature representation; (4) performing weighted fusion (such as weighted summation) on each second feature in the second feature representation according to the corresponding association weight of each second feature, as the third element feature of each element; (5) combining (such as weighted summation) the second and third element features of each element to obtain the fourth element feature of each element; (6) combining (such as splicing) the fourth element features of all elements in the four-dimensional spatiotemporal coding to obtain the four-dimensional spatiotemporal representation of the first data sequence and the second data sequence.
[0130] Steps (1) to (5) can be executed iteratively once or multiple times. In the first iteration, the first element feature of an element is the initial element feature of that element. In subsequent iterations, the first element feature of an element is the fourth element feature of that element determined in the previous iteration.
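Steps (1) to (6) can be sketched as two cross-attention passes per element followed by a weighted blend. The sketch assumes both feature representations have already been aligned to the same channel dimension, uses softmax attention without learned projections, and blends with a fixed coefficient `alpha`; these choices and the function names are assumptions for illustration.

```python
import math

def attend(query, features):
    """Attention-weighted sum of `features` for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * feat[c] for e, feat in zip(exps, features))
            for c in range(len(query))]

def fuse_modalities(queries, first_feats, second_feats, alpha=0.5, iterations=1):
    """Attend to each modality, then blend the two results per element."""
    for _ in range(iterations):
        fused = []
        for q in queries:
            second_elem = attend(q, first_feats)   # steps (1)-(2)
            third_elem = attend(q, second_feats)   # steps (3)-(4)
            fused.append([alpha * a + (1 - alpha) * b  # step (5)
                          for a, b in zip(second_elem, third_elem)])
        queries = fused  # fourth element features feed the next iteration
    return queries  # step (6): fourth element features of all elements
```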
[0131] Furthermore, based on the fourth-element features of all elements in the four-dimensional spatiotemporal encoding, the attention weights of all elements in the four-dimensional spatiotemporal encoding are determined. For example, the attention weights of elements can be obtained by processing the fourth-element features of elements using a neural network. If the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets a preset condition, the fourth-element features of the first element are retained. If the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset condition, the fourth-element features of the second element are set to zero. For example, the fourth-element features of the top 10% of elements in terms of attention weight are retained, and the fourth-element features of the remaining elements are set to zero.
[0132] Thus, by focusing on core spatiotemporal information and filtering out redundant / noise features, we can improve computational efficiency and enhance the expression of key features, making subsequent four-dimensional spatiotemporal representations more accurate and efficient.
[0133] Step 690: Various downstream tasks of autonomous vehicles perform task processing based on four-dimensional spatiotemporal representation.
[0134] The four-dimensional spatiotemporal representation serves as a universal input for various downstream tasks, which can be processed using only a lightweight decoder. Details regarding downstream tasks and their processing can be found in step 550, and will not be repeated here.
[0135] This embodiment employs an integrated spatiotemporal information modeling logic to achieve native correlation and deep fusion of spatiotemporal information without the need for an additional time-series alignment module. Through spatiotemporal fusion representation, it can accurately capture instantaneous and dynamic correlations in dynamic scenes, fundamentally solving the problems of lost instantaneous correlation features and inaccurate capture of dynamic correlations caused by spatiotemporal separation modeling. Furthermore, it compresses high-dimensional redundant multimodal four-dimensional spatiotemporal data (such as point clouds and images of continuous frames) into a compact four-dimensional spatiotemporal representation of a preset encoding size, saving storage and computing resources. Moreover, the generated four-dimensional spatiotemporal representation is a universal representation independent of downstream tasks, and can support various downstream tasks without redesigning the feature extraction model.
[0136] Figure 7 shows a schematic diagram of an information processing apparatus according to some embodiments of the present disclosure. As shown in Figure 7, the information processing apparatus 700 of this embodiment includes modules 710 to 740, and may also include at least one of modules 750 to 770.
[0137] The data acquisition module 710 is configured to acquire a first data sequence, including multiple consecutive frames of first sensing data collected by the first sensor of the autonomous vehicle. The spatial feature extraction module 720 is configured to extract the first spatial feature of each frame. The spatiotemporal fusion module 730 is configured to fuse the first spatial features of multiple consecutive frames according to the inter-frame correlation weights to obtain the first feature representation after spatiotemporal fusion. The cross-attention module 740 is configured to use a preset four-dimensional spatiotemporal code as a query in the cross-attention process, perform cross-attention processing on the first feature representation, and obtain a four-dimensional spatiotemporal representation of the first data sequence, so that the downstream tasks of the autonomous vehicle can perform task processing based on the four-dimensional spatiotemporal representation, wherein each element in the four-dimensional spatiotemporal code corresponds to a four-dimensional spatiotemporal unit.
[0138] In some embodiments, the first sensor is a camera, the first sensing data is image data, and the spatial feature extraction module 720 includes an image feature extraction unit 721, which is configured to perform three-dimensional convolution processing on each frame of image data to obtain the first spatial features of each frame of image data.
[0139] In some embodiments, the spatiotemporal fusion module 730 includes: a self-attention unit 731 configured to determine the correlation weights between frames through self-attention; to perform weighted fusion of the first spatial features of multiple consecutive frames according to the correlation weights between each frame and other frames to obtain the first spatiotemporal features of each frame; and to combine the first spatiotemporal features of multiple consecutive frames to obtain a first feature representation.
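The inter-frame weighting performed by the self-attention unit can be illustrated with a small sketch. It assumes one feature vector per frame, scaled dot-product scores, and no learned query/key/value projections. It also lets each frame attend to itself, as in standard self-attention. These choices and the name `fuse_frames` are illustrative assumptions, not details taken from the disclosure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frames(frame_feats):
    """Return (fused features, weight matrix) for a list of per-frame vectors."""
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    fused, all_weights = [], []
    for query in frame_feats:
        # Correlation weights between this frame and every frame in the window.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frame_feats]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted fusion of the spatial features of all frames.
        fused.append([sum(w[j] * frame_feats[j][d] for j in range(len(frame_feats)))
                      for d in range(dim)])
    return fused, all_weights


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse_frames(frames)
```

Stacking the rows of `fused` corresponds to combining the per-frame spatiotemporal features into the first feature representation.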
[0140] In some embodiments, the first sensor is a radar, the first sensing data is point cloud data, and the spatial feature extraction module 720 includes a point cloud feature extraction unit 722, which is configured to perform sparse voxelization processing on each frame of point cloud data, perform three-dimensional sparse convolution processing on each obtained sparse voxel, and obtain the first spatial features of each sparse voxel to form the first spatial features of each frame of point cloud data.
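As a rough illustration of sparse voxelization, the sketch below stores only the voxels that actually contain points and uses the mean point position as a placeholder per-voxel feature. The three-dimensional sparse convolution itself is not shown, and the function name `sparse_voxelize` and the `voxel_size` parameter are assumptions made for this example.

```python
import math
from collections import defaultdict


def sparse_voxelize(points, voxel_size):
    """Map (x, y, z) points to occupied voxels and average the points in each.

    Returns a dict from integer voxel index (i, j, k) to the mean point,
    so empty voxels consume no storage.
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        buckets[key].append((x, y, z))
    return {key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for key, pts in buckets.items()}


points = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0), (-0.1, 0.0, 0.0)]
voxels = sparse_voxelize(points, voxel_size=0.5)
```

Here four points fall into three occupied voxels. A dense grid covering the same extent would have to allocate every cell, including the empty ones.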
[0141] In some embodiments, the spatiotemporal fusion module 730 includes: a graph attention unit 732, configured to determine the association weights of sparse voxels between frames through graph attention, wherein the nodes of the graph are the first spatial features of the sparse voxels, and the edges of the graph are the associations of the sparse voxels between frames; weighted fusion of the first spatial features of the sparse voxels of multiple consecutive frames according to the association weights between the sparse voxels of each frame and the sparse voxels of other frames to obtain the first spatiotemporal features of the sparse voxels of each frame; and combining the first spatiotemporal features of the sparse voxels of multiple consecutive frames to obtain a first feature representation.
[0142] In some embodiments, the cross-attention module 740 is configured to perform the following operations.
[0143] By using cross-attention, the association weights between the first element feature of each element in the four-dimensional spatiotemporal encoding and each first feature in the first feature representation are determined; based on the corresponding association weight of each first feature in the first feature representation, the first features in the first feature representation are weighted and fused to serve as the second element feature of each element; and by combining the second element features of all elements in the four-dimensional spatiotemporal encoding, a four-dimensional spatiotemporal representation of the first data sequence is obtained.
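A minimal sketch of this query-driven cross-attention follows. It treats the first element features of the four-dimensional spatiotemporal encoding as queries and the entries of the first feature representation as both keys and values, with scaled dot-product scoring and no learned projections. Those simplifications, and the name `cross_attend`, are assumptions for illustration only.

```python
import math


def cross_attend(queries, feats):
    """Compute one second element feature per query element.

    queries: first element features, one vector per four-dimensional unit.
    feats:   first features of the spatiotemporally fused representation.
    """
    dim = len(feats[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        # Association weights between this element and every first feature.
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in feats]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        # Weighted fusion of the first features gives the second element feature.
        out.append([sum(w[j] * feats[j][d] for j in range(len(feats)))
                    for d in range(dim)])
    return out


queries = [[10.0, 0.0], [0.0, 10.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
representation = cross_attend(queries, feats)
```

The output always has one vector per query, regardless of how many first features are supplied. This is how a preset encoding size yields a compact representation.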
[0144] In a multimodal scenario, the data acquisition module 710 is configured to acquire a second data sequence, including multiple consecutive frames of second sensing data collected by the second sensor of the autonomous vehicle; the spatial feature extraction module 720 is configured to extract the second spatial features of each frame; the spatiotemporal fusion module 730 is configured to fuse the second spatial features of multiple consecutive frames according to the correlation weight between frames to obtain a spatiotemporally fused second feature representation; the alignment module 750 is configured to align the first feature representation and the second feature representation; and the cross-attention module 740 is configured to use a preset four-dimensional spatiotemporal code as a query in the cross-attention process to perform cross-attention processing on the first feature representation and the second feature representation to obtain a four-dimensional spatiotemporal representation of the first data sequence and the second data sequence.
[0145] In a multimodal scenario, the cross-attention module 740 is configured to: determine the association weight between the first element feature of each element in the four-dimensional spatiotemporal encoding and each first feature in the first feature representation through cross-attention; perform weighted fusion of each first feature in the first feature representation according to the corresponding association weight of each first feature in the first feature representation, as the second element feature of each element; determine the association weight between the first element feature of each element in the four-dimensional spatiotemporal encoding and each second feature in the second feature representation through cross-attention; perform weighted fusion of each second feature in the second feature representation according to the corresponding association weight of each second feature in the second feature representation, as the third element feature of each element; fuse the second element feature and the third element feature of each element to obtain the fourth element feature of each element; combine the fourth element features of all elements in the four-dimensional spatiotemporal encoding to obtain the four-dimensional spatiotemporal representation of the first data sequence and the second data sequence.
[0146] Alignment module 750 is configured to align positional information in the first feature representation and the second feature representation through coordinate transformation; and to align semantic information in the first feature representation and the second feature representation through upsampling or downsampling.
[0147] The first data sequence and the second data sequence are respectively multi-frame image data collected by the camera of the autonomous vehicle and multi-frame point cloud data collected by the radar.
[0148] The number of elements in the four-dimensional spatiotemporal encoding is set or adjusted according to the level of detail of the four-dimensional spatiotemporal representation.
[0149] The resolution of the four-dimensional spatiotemporal representation is set or adjusted based on at least one of the following: the graphics processor's memory performance, scene requirements, and computing load of the autonomous vehicle.
[0150] In some embodiments, the element optimization module 760 is configured to: determine the attention weights of all elements in the four-dimensional spatiotemporal encoding based on the second element features of all elements in the four-dimensional spatiotemporal encoding; retain the second element features of the first element if the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets the preset conditions; and set the second element features of the second element to zero if the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset conditions.
[0151] In some embodiments, the element optimization module 760 is configured to: determine the attention weights of all elements in the four-dimensional spatiotemporal encoding based on the fourth element features of all elements in the four-dimensional spatiotemporal encoding; retain the fourth element features of the first element if the ranking of the attention weights of any first element in the four-dimensional spatiotemporal encoding meets the preset conditions; and set the fourth element features of the second element to zero if the ranking of the attention weights of any second element in the four-dimensional spatiotemporal encoding does not meet the preset conditions.
[0152] Task module 770 is configured to perform task processing based on four-dimensional spatiotemporal representation.
[0153] For example, downstream tasks include object detection tasks. Task module 770 is configured to: compress the four-dimensional spatiotemporal representation in the temporal dimension to obtain a three-dimensional spatial representation; perform three-dimensional convolutional decoding on the three-dimensional spatial representation to obtain a predicted feature map; determine candidate object boxes based on the confidence of the predicted feature map; and determine object boxes from the candidate object boxes based on non-maximum suppression.
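The last two steps, confidence filtering followed by non-maximum suppression, can be sketched as below. For brevity the example uses two-dimensional axis-aligned boxes `(x1, y1, x2, y2)` rather than three-dimensional boxes, and the thresholds `score_thresh` and `iou_thresh` are assumed values, not parameters given in the disclosure.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of boxes kept after confidence filtering and NMS."""
    # Candidate boxes: confidence at or above the threshold, best first.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = select_boxes(boxes, scores)
```

Box 1 overlaps box 0 with an IoU of about 0.68 and is suppressed, while box 3 is dropped earlier for low confidence.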
[0154] For example, downstream tasks include trajectory prediction tasks. Task module 770 is configured to: decode the velocity and acceleration of a target in a four-dimensional spatiotemporal representation using a long short-term memory neural network; and predict the trajectory of the target based on the target's velocity and acceleration.
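The disclosure does not spell out how the trajectory is derived from the decoded velocity and acceleration. One simple possibility, shown here purely as an assumption, is a constant-acceleration rollout, p(t) = p0 + v·t + ½·a·t². The long short-term memory decoder is not modeled, and the names `predict_trajectory`, `horizon_s`, and `step_s` are hypothetical.

```python
def predict_trajectory(position, velocity, acceleration, horizon_s, step_s):
    """Roll out future positions under a constant-acceleration assumption.

    position, velocity, acceleration: tuples with one entry per axis.
    horizon_s: prediction horizon in seconds; step_s: time between waypoints.
    """
    steps = int(round(horizon_s / step_s))
    trajectory = []
    for k in range(1, steps + 1):
        t = k * step_s
        trajectory.append(tuple(p + v * t + 0.5 * a * t * t
                                for p, v, a in zip(position, velocity, acceleration)))
    return trajectory


# A target at the origin moving at 10 m/s along x and accelerating at 2 m/s^2.
waypoints = predict_trajectory((0.0, 0.0), (10.0, 0.0), (2.0, 0.0),
                               horizon_s=2.0, step_s=1.0)
```

A learned decoder could replace the closed-form rollout, for example when acceleration is expected to vary over the horizon.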
[0155] For example, downstream tasks include drivable region segmentation tasks. Task module 770 is configured to: compress the four-dimensional spatiotemporal representation in the temporal and height dimensions to obtain a two-dimensional spatial representation; perform two-dimensional deconvolution processing on the two-dimensional spatial representation to obtain a predicted feature map; perform activation processing on the feature values of the predicted feature map to obtain a probability map; and segment the drivable region according to the probability of each pixel in the probability map and the decision threshold.
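The compression, activation, and thresholding steps can be sketched as follows. The sketch assumes the representation is a nested list indexed `[t][z][y][x]` holding one scalar logit per unit, uses mean pooling for the compression and a sigmoid for the activation, and omits the two-dimensional deconvolution. All of these are illustrative assumptions rather than details stated in the disclosure.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def segment_drivable(rep, threshold=0.5):
    """Return a 2D boolean mask (True = drivable) from a [t][z][y][x] grid."""
    n_t, n_z = len(rep), len(rep[0])
    n_y, n_x = len(rep[0][0]), len(rep[0][0][0])
    mask = []
    for y in range(n_y):
        row = []
        for x in range(n_x):
            # Compress the temporal and height dimensions by averaging.
            mean = sum(rep[t][z][y][x]
                       for t in range(n_t) for z in range(n_z)) / (n_t * n_z)
            # Activation gives a probability; compare with the decision threshold.
            row.append(sigmoid(mean) >= threshold)
        mask.append(row)
    return mask


rep = [[[[2.0, -2.0]]], [[[4.0, -4.0]]]]  # shape: t=2, z=1, y=1, x=2
mask = segment_drivable(rep, threshold=0.5)
```

Raising `threshold` makes the mask more conservative, marking fewer cells as drivable.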
[0156] Training module 770 is configured to perform end-to-end training on a spatial feature extraction network for extracting spatial features, an inter-frame attention network for determining the correlation weights between frames and fusing spatial features, and a cross-attention network for performing cross-attention processing, based on training data sequences and a preset loss function. The loss function is constructed based on reconstruction loss, contrastive loss, and prediction loss.
[0157] Figure 8 shows a schematic diagram of an information processing apparatus according to other embodiments of this disclosure. As shown in Figure 8, the information processing apparatus 800 of this embodiment includes a memory 810 and a processor 820 coupled to the memory 810. The processor 820 is configured to execute the information processing methods of any of the foregoing embodiments based on instructions stored in the memory 810. The memory 810 and the processor 820 can be connected, for example, via a bus 830.
[0158] The memory 810 may include, for example, system memory, fixed non-volatile storage media, etc. The system memory may store, for example, the operating system, application programs, boot loader, and other programs.
[0159] The processor 820 can be implemented using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates, or transistors, or other discrete hardware components.
[0160] The bus 830 can use any of the various bus architectures. For example, bus architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
[0161] This disclosure provides an autonomous driving vehicle, including information processing devices 700 and 800, configured to execute the information processing methods described in any of the foregoing embodiments. The information processing devices 700 and 800 may, for example, be located in the autonomous driving module of the autonomous driving vehicle and are electrically connected to various sensors and a central processing unit.
[0162] This disclosure provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the information processing method described in any of the foregoing embodiments. The storage medium may be, for example, a non-transitory computer-readable storage medium.
[0163] This disclosure provides a computer program product including computer instructions that, when executed by a processor, implement the information processing method described in any of the foregoing embodiments.
[0164] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more (non-transitory) computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, cloud storage, etc.) containing computer program code. A computer program product should be understood as a software product that primarily implements its solution through a computer program.
[0165] This disclosure is described with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of this disclosure. It should be understood that each process and/or block of the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
[0166] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
[0167] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Claims
1. An information processing method, comprising: acquiring a first data sequence, including multiple consecutive frames of first sensing data collected by a first sensor of an autonomous vehicle; extracting a first spatial feature of each frame; fusing the first spatial features of the consecutive multiple frames based on inter-frame correlation weights to obtain a spatiotemporally fused first feature representation; and using a preset four-dimensional spatiotemporal code as the query in cross-attention processing, performing cross-attention processing on the first feature representation to obtain a four-dimensional spatiotemporal representation of the first data sequence, so that downstream tasks of the autonomous vehicle can perform task processing based on the four-dimensional spatiotemporal representation, wherein each element in the four-dimensional spatiotemporal code corresponds to a four-dimensional spatiotemporal unit.
2. The method according to claim 1, wherein, The first sensor is a camera, and the first sensing data is image data. Extracting the first spatial features of each frame includes: Each frame of image data is subjected to three-dimensional convolution processing to obtain the first spatial feature of each frame of image data.
3. The method according to claim 2, wherein, Based on the inter-frame correlation weights, the first spatial features of the consecutive multiple frames are fused to obtain the first feature representation after spatiotemporal fusion, including: The association weights between frames are determined through self-attention; Based on the correlation weights between each frame and other frames, the first spatial features of the consecutive frames are weighted and fused to obtain the first spatiotemporal features of each frame. The first spatiotemporal features of the consecutive multiple frames are combined to obtain the first feature representation.
4. The method according to claim 1, wherein the first sensor is a radar, and the first sensing data is point cloud data; and extracting the first spatial features of each frame includes: each frame of point cloud data is sparsely voxelized, and each obtained sparse voxel is subjected to three-dimensional sparse convolution to obtain the first spatial features of each sparse voxel, which together form the first spatial features of each frame of point cloud data.
5. The method according to claim 4, wherein, Based on the inter-frame correlation weights, the first spatial features of the consecutive multiple frames are fused to obtain the first feature representation after spatiotemporal fusion, including: The association weights of inter-frame sparse voxels are determined by graph attention, where the nodes of the graph are the first spatial features of the sparse voxels, and the edges of the graph are the associations of inter-frame sparse voxels. Based on the correlation weight between the sparse voxels of each frame and the sparse voxels of other frames, the first spatial features of the sparse voxels of the consecutive frames are weighted and fused to obtain the first spatiotemporal features of the sparse voxels of each frame. The first feature representation is obtained by combining the first spatiotemporal features of the sparse voxels of the consecutive multiple frames.
6. The method according to claim 1, wherein, Using a preset four-dimensional spatiotemporal code as the query in the cross-attention process, the first feature representation is subjected to cross-attention processing to obtain the four-dimensional spatiotemporal representation of the first data sequence, including: By using cross attention, the association weight between the first element feature of each element in the four-dimensional spatiotemporal coding and each first feature in the first feature representation is determined; Based on the corresponding association weights of each first feature in the first feature representation, the first features in the first feature representation are weighted and fused to serve as the second element feature of each element. By combining the second element features of all elements in the four-dimensional spatiotemporal encoding, a four-dimensional spatiotemporal representation of the first data sequence is obtained.
7. The method according to claim 1, further comprising: Acquire a second data sequence, including multiple consecutive frames of second sensing data collected by the second sensor of the autonomous vehicle; Extract the second spatial features of each frame; Based on the correlation weights between frames, the second spatial features of the consecutive multiple frames are fused to obtain the spatiotemporally fused second feature representation. Align the first feature representation and the second feature representation; Using a preset four-dimensional spatiotemporal code as a query in cross-attention, cross-attention processing is performed on the first feature representation and the second feature representation to obtain the four-dimensional spatiotemporal representations of the first data sequence and the second data sequence.
8. The method according to claim 7, wherein, Using a preset four-dimensional spatiotemporal code as the query in the cross-attention process, cross-attention processing is performed on the first feature representation and the second feature representation to obtain the four-dimensional spatiotemporal representations of the first data sequence and the second data sequence, including: By using cross attention, the association weight between the first element feature of each element in the four-dimensional spatiotemporal coding and each first feature in the first feature representation is determined; Based on the corresponding association weights of each first feature in the first feature representation, the first features in the first feature representation are weighted and fused to serve as the second element feature of each element. By using cross attention, the association weights between the first element feature of each element in the four-dimensional spatiotemporal coding and the respective second features in the second feature representation are determined; Based on the corresponding association weights of each second feature in the second feature representation, the second features in the second feature representation are weighted and fused to form the third element feature of each element; By fusing the second and third element features of each element, a fourth element feature of each element is obtained; By combining the fourth element features of all elements in the four-dimensional spatiotemporal encoding, four-dimensional spatiotemporal representations of the first data sequence and the second data sequence are obtained.
9. The method according to claim 7, wherein, Aligning the first feature representation and the second feature representation includes: By transforming coordinates, the positional information in the first feature representation and the second feature representation is aligned; Semantic information in the first feature representation and the second feature representation is aligned by upsampling or downsampling.
10. The method according to claim 7, wherein, The first data sequence and the second data sequence are continuous multi-frame image data collected by the camera of the autonomous vehicle and continuous multi-frame point cloud data collected by the radar, respectively.
11. The method according to claim 1, wherein, The number of elements in the four-dimensional spatiotemporal encoding is set or adjusted according to the level of detail of the four-dimensional spatiotemporal representation; or The resolution of the four-dimensional spatiotemporal representation is set or adjusted based on at least one of the following: the graphics processor's memory performance, scene requirements, and computing load of the autonomous vehicle.
12. The method of claim 6, further comprising: Based on the second element features of all elements in the four-dimensional spatiotemporal coding, determine the attention weights of all elements in the four-dimensional spatiotemporal coding; If the ranking of the attention weights of any first element of the four-dimensional spatiotemporal encoding meets the preset conditions, the second element feature of the first element is retained; if the ranking of the attention weights of any second element of the four-dimensional spatiotemporal encoding does not meet the preset conditions, the second element feature of the second element is set to zero.
13. The method of claim 8, further comprising: Based on the fourth element features of all elements in the four-dimensional spatiotemporal coding, determine the attention weights of all elements in the four-dimensional spatiotemporal coding. If the ranking of the attention weights of any first element of the four-dimensional spatiotemporal encoding meets the preset conditions, the fourth element feature of the first element is retained; if the ranking of the attention weights of any second element of the four-dimensional spatiotemporal encoding does not meet the preset conditions, the fourth element feature of the second element is set to zero.
14. The method according to claim 1, wherein the downstream tasks include object detection tasks, and the task processing based on the four-dimensional spatiotemporal representation includes: compressing the four-dimensional spatiotemporal representation in the temporal dimension to obtain a three-dimensional spatial representation; performing three-dimensional convolutional decoding on the three-dimensional spatial representation to obtain a predicted feature map; determining candidate object boxes based on the confidence of the predicted feature map; and determining object boxes from the candidate object boxes based on non-maximum suppression.
15. The method according to claim 1, wherein, The downstream tasks include trajectory prediction tasks, and the task processing based on the four-dimensional spatiotemporal representation includes: The velocity and acceleration of the target in the four-dimensional spatiotemporal representation are decoded using a long short-term memory neural network. Predict the trajectory of the target based on its velocity and acceleration.
16. The method according to claim 1, wherein, The downstream tasks include drivable region segmentation tasks, and the task processing based on the four-dimensional spatiotemporal representation includes: The four-dimensional spatiotemporal representation is compressed in the temporal and height dimensions to obtain a two-dimensional spatial representation. The two-dimensional spatial representation is subjected to two-dimensional deconvolution to obtain a predicted feature map; Activation processing is performed on the feature values of the predicted feature map to obtain the probability map; The drivable area is segmented based on the probability of each pixel in the probability map and the judgment threshold.
17. The method according to claim 1, further comprising: based on a training data sequence and a preset loss function, performing end-to-end training on the spatial feature extraction network for extracting spatial features, the inter-frame attention network for determining the correlation weights between frames and fusing spatial features, and the cross-attention network for performing cross-attention processing, wherein the loss function is constructed based on reconstruction loss, contrastive loss, and prediction loss.
18. An information processing apparatus, comprising: One or more modules that perform the information processing method according to any one of claims 1-17.
19. An information processing apparatus, comprising: Memory; And a processor coupled to the memory, the processor being configured to execute the information processing method of any one of claims 1-17 based on instructions stored in the memory.
20. An autonomous vehicle, including an information processing unit configured to perform the information processing method according to any one of claims 1-17.
21. A computer-readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the information processing method according to any one of claims 1-17.
22. A computer program product comprising computer instructions that, when executed by a processor, implement the information processing method according to any one of claims 1-17.