3D occupancy grid detection method based on three-factor attention weight
By employing a three-factor attention weighted 3D occupancy grid detection method, combined with a multi-view camera array, LiDAR, and 4D millimeter-wave radar, multimodal scene features are generated and feature fusion is performed. This solves the accuracy and precision issues of 3D occupancy grid detection under adverse weather conditions, thereby improving the safety and reliability of autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- YANTAI PORT GRP CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
In adverse weather conditions such as rain, fog, and snow, the accuracy and precision of LiDAR-based 3D occupancy grid detection are significantly reduced, failing to meet the practical requirements of autonomous driving. In particular, the accuracy of dynamic object detection and the problem of trajectory jitter have not been effectively resolved.
A 3D occupancy grid detection method based on three-factor attention weights is adopted. Environmental data is acquired through a multi-view camera array, LiDAR, and 4D millimeter-wave radar to generate multimodal scene features. An uncertainty heatmap and a voxel entropy map are generated using memory entries and historical memory confidence. The three-factor attention weights are then calculated to fuse features and generate a 3D occupancy probability grid.
The method reduces the false negative rate of dynamic objects by about 25% and speeds up confidence recovery for occluded reconstructed objects by about 0.2 seconds, improving the accuracy and precision of 3D occupancy grid detection in dynamic scenes and enhancing the safety and reliability of autonomous driving.
Figure CN122244832A_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of autonomous driving technology, and in particular to a 3D occupancy grid detection method based on three-factor attention weights.
Background Technology
[0002] In autonomous driving technology, accurately perceiving the vehicle's surroundings and detecting drivable areas is crucial. 3D occupancy grid detection can effectively represent the distribution of obstacles in the environment, providing a basis for determining the drivable area for the vehicle.
[0003] Currently, 3D occupancy grid detection is achieved using point cloud data collected by LiDAR. However, LiDAR-based 3D occupancy grid detection adapts poorly to the environment. In rainy or foggy weather, reflections from raindrops and fog droplets generate false points, which increases point cloud noise; at the same time, the effective detection distance is shortened and the point cloud density decreases. In snowy weather, snow cover obscures road edges and potholes and makes the ground appear uneven. Snow also alters the shape of objects, blurring target outlines, and it absorbs laser light, resulting in abnormal reflections. Therefore, the detection accuracy and precision of 3D occupancy grids are significantly reduced under rainy, foggy, and snowy weather conditions, failing to meet the practical application requirements of autonomous driving.
[0004] 3D occupancy grid detection based on multimodal data is mainly applied to weather conditions such as rain, fog, and snow. The detection of dynamic objects under these weather conditions is crucial for driving safety. Therefore, how to improve the accuracy of dynamic object detection and avoid trajectory jitter while implementing 3D occupancy grid detection based on multimodal data is an urgent problem to be solved.
Summary of the Invention
[0005] In view of this, this disclosure provides a 3D occupancy grid detection method based on three-factor attention weights.
[0006] According to a first aspect of this disclosure, a 3D occupancy grid detection method based on three-factor attention weights is provided, the 3D occupancy grid detection method comprising:
[0007] Acquiring current frame point cloud data, current frame multi-view images, and current frame 4D radar data of the environment surrounding the vehicle from a multi-view camera array, LiDAR, and 4D millimeter-wave radar mounted on the vehicle;
Performing multimodal voxelization on the current frame point cloud data, the current frame multi-view images, and the current frame 4D radar data to obtain multimodal scene features;
Generating a current frame memory entry based on the multimodal scene features, wherein the memory entry includes a memory key and a memory value, and the memory value contains memory features;
Using the memory key in the current frame memory entry to retrieve relevant memory segments from a memory bank;
Obtaining a historical memory confidence, and using the multimodal scene features and the historical memory confidence to generate a current frame uncertainty heatmap and a current frame voxel entropy map;
Determining the three-factor attention weight of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and fusing the memory features in all relevant memory segments and the current frame memory entry through the three-factor attention weight of each relevant memory segment to obtain fused features;
Obtaining a 3D occupancy probability grid using the fused features.
[0008] In some embodiments of the first aspect of this disclosure, each of the relevant memory segments includes a timestamp; the step of determining the three-factor attention weight of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map includes:
Determining a three-factor adjustment coefficient, which is the product of a time decay factor, a noise suppression factor, and a motion compensation factor, wherein the time decay factor is calculated based on the timestamps in the relevant memory segments, the noise suppression factor is obtained based on a current uncertainty value that is itself obtained from the current frame uncertainty heatmap and the current frame voxel entropy map, and the motion compensation factor is obtained based on the dynamic features within the memory features of the current frame memory entry and the dynamic segments in the relevant memory segments;
Calculating the spatial similarity between the current frame memory entry and the relevant memory segment;
Calculating the product of the three-factor adjustment coefficient and the spatial similarity to obtain the three-factor attention weight of the relevant memory segment.
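The weight computation described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the paragraph fixes only the two products, so the exponential form of the time decay, the hypothetical `decay_rate` parameter, the use of cosine similarity as the spatial similarity, and all function names are assumptions. The noise suppression and motion compensation factors are passed in as precomputed numbers.

```python
import math


def time_decay_factor(current_ts, segment_ts, decay_rate=0.5):
    """ASSUMED form: exponential decay in the age (seconds) of a memory segment."""
    age = max(0.0, current_ts - segment_ts)
    return math.exp(-decay_rate * age)


def cosine_similarity(a, b):
    """ASSUMED spatial similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def three_factor_weight(time_decay, noise_suppression, motion_compensation,
                        spatial_similarity):
    """Weight = (time decay * noise suppression * motion compensation) * similarity."""
    adjustment = time_decay * noise_suppression * motion_compensation
    return adjustment * spatial_similarity
```

Because the three factors multiply, any single factor close to zero (an old segment, a noisy frame, or a motion mismatch) suppresses that memory segment regardless of how similar it is spatially.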
[0009] In some embodiments of the first aspect of this disclosure, the relevant memory segments include static segments, dynamic segments, and uncertain segments; the step of fusing memory features from all relevant memory segments and the current frame memory entry through the three-factor attention weights of each relevant memory segment to obtain fused features includes:
Obtaining a static aggregate feature by weighting and summing the static segments of each relevant memory segment using the three-factor attention weight of each relevant memory segment;
Obtaining a dynamic aggregate feature by weighting and summing the dynamic segments of each relevant memory segment using the three-factor attention weight of each relevant memory segment;
Obtaining historical uncertainty features by weighting and summing the uncertain segments of each relevant memory segment using the three-factor attention weight of each relevant memory segment;
Fusing the current frame uncertainty heatmap and the current frame voxel entropy map to obtain a current uncertainty comprehensive map;
Concatenating the current uncertainty comprehensive map and the historical uncertainty features in the channel dimension to obtain a gating feature map;
Processing the gating feature map with a gating generation network to obtain a first gating signal, a second gating signal, and a third gating signal;
Fusing the memory features in the current frame memory entry, the static aggregate feature, and the dynamic aggregate feature using the first gating signal, the second gating signal, and the third gating signal to obtain the fused features.
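The aggregation and gating steps of this paragraph can be illustrated with a short NumPy sketch. The weighted sum over segments follows the text; the fusion rule itself (a gate-weighted sum of the three feature vectors) is an assumption, since the paragraph does not state how the gating signals are applied. The gating generation network is not modeled here: the three gates are simply passed in. The names `aggregate_segments` and `gated_fusion` are hypothetical.

```python
import numpy as np


def aggregate_segments(segments, weights):
    """Weighted sum of per-segment features.

    segments: array of shape (K, F), one feature vector per relevant memory segment.
    weights:  array of shape (K,), the three-factor attention weight of each segment.
    """
    segments = np.asarray(segments, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    return weights @ segments


def gated_fusion(current, static_agg, dynamic_agg, gates):
    """ASSUMED fusion rule: gate-weighted sum of the three feature vectors."""
    g1, g2, g3 = gates
    return (g1 * np.asarray(current, dtype=np.float64)
            + g2 * np.asarray(static_agg, dtype=np.float64)
            + g3 * np.asarray(dynamic_agg, dtype=np.float64))
```

The same `aggregate_segments` call can be reused for the static, dynamic, and uncertain segments, since all three are described as weighted sums with the same attention weights.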
[0010] In some embodiments of the first aspect of this disclosure, generating a current frame uncertainty heatmap and a current frame voxel entropy map using the multimodal scene features and the historical memory confidence level includes: processing the multimodal scene features and the historical memory confidence level using a pre-trained dynamic uncertainty estimation model to obtain the current frame uncertainty heatmap.
[0011] In some embodiments of the first aspect of this disclosure, the current frame voxel entropy map includes the entropy value of each voxel; the step of generating the current frame uncertainty heatmap and the current frame voxel entropy map using the multimodal scene features and the historical memory confidence includes: The entropy value of each voxel is calculated as follows: the feature value of each voxel in the multimodal scene features is normalized to map to the range [0, 1], the normalized feature value is quantized into N intervals, the interval index of each feature value is calculated, the number of feature values in each interval is counted and the ratio of the number of feature values in each interval to the total number of feature values is calculated to obtain the probability of each interval, and the entropy value of each voxel is calculated based on the probability of each interval, where N is a preset fixed value.
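The per-voxel entropy computation described above can be sketched in plain Python. The steps (normalize, quantize into N intervals, count, convert to probabilities, compute entropy) follow the paragraph; min-max normalization, the base-2 logarithm, and the default `n_bins=16` are assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def voxel_entropy(features, n_bins=16, eps=1e-12):
    """Shannon entropy (in bits) of one voxel's feature values."""
    lo, hi = min(features), max(features)
    span = hi - lo
    if span < eps:
        # All values are identical: a single occupied interval, zero entropy.
        return 0.0
    # Min-max normalize to [0, 1] (ASSUMED), then map to interval indices 0..n_bins-1.
    indices = [min(int((v - lo) / span * n_bins), n_bins - 1) for v in features]
    counts = Counter(indices)
    total = len(features)
    # Probability of each interval = count / total; empty intervals contribute nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A voxel whose feature values spread evenly over the intervals yields high entropy, while a voxel whose values cluster in one interval yields entropy near zero.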
[0012] In some embodiments of the first aspect of this disclosure, the 3D occupancy grid detection method is implemented by a pre-trained 3D occupancy grid detection model, wherein the loss function used during the training of the 3D occupancy grid detection model is determined based on static consistency loss and motion smoothing loss.
[0013] In some embodiments of the first aspect of this disclosure, the motion smoothing loss is obtained by: performing dynamic voxel detection using a 3D occupancy probability grid of consecutive frames to determine the current frame velocity, the previous frame velocity, and the current frame acceleration; calculating a velocity continuity loss based on the current frame velocity and the previous frame velocity; calculating an acceleration constraint loss based on the current frame acceleration; and determining the motion smoothing loss based on the velocity continuity loss and the acceleration constraint loss.
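One way to read this paragraph is sketched below. The paragraph names the two component losses but not their formulas, so everything concrete here is an assumption: the mean squared difference for velocity continuity, the squared hinge on acceleration magnitude above a hypothetical limit `a_max`, and the weighted sum with hypothetical weights `w_vel` and `w_acc`.

```python
import math


def velocity_continuity_loss(v_curr, v_prev):
    """ASSUMED form: mean squared difference between consecutive velocities."""
    return sum((c - p) ** 2 for c, p in zip(v_curr, v_prev)) / len(v_curr)


def acceleration_constraint_loss(a_curr, a_max=3.0):
    """ASSUMED form: squared hinge on acceleration magnitude above a_max (m/s^2)."""
    magnitude = math.sqrt(sum(a * a for a in a_curr))
    return max(0.0, magnitude - a_max) ** 2


def motion_smoothing_loss(v_curr, v_prev, a_curr, w_vel=1.0, w_acc=1.0, a_max=3.0):
    """ASSUMED combination: weighted sum of the two component losses."""
    return (w_vel * velocity_continuity_loss(v_curr, v_prev)
            + w_acc * acceleration_constraint_loss(a_curr, a_max))
```

Under these assumptions the loss is zero for an object moving at constant velocity and grows when the estimated velocity jumps between frames or the implied acceleration exceeds the limit.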
[0014] According to a second aspect of this disclosure, a 3D occupancy grid detection device based on three-factor attention weights is provided, the device comprising:
A data acquisition unit, used to acquire current frame point cloud data, current frame multi-view images, and current frame 4D radar data of the environment around the vehicle from the multi-view camera array, LiDAR, and 4D millimeter-wave radar mounted on the vehicle;
A multimodal voxelization unit, used to multimodally voxelize the current frame point cloud data, the current frame multi-view images, and the current frame 4D radar data to obtain multimodal scene features;
A memory encoding unit, used to generate a current frame memory entry based on the multimodal scene features, wherein the memory entry includes a memory key and a memory value, and the memory value contains memory features;
A memory retrieval unit, used to retrieve the relevant memory segments using the memory key in the current frame memory entry;
An uncertainty quantization unit, used to obtain the historical memory confidence and to generate a current frame uncertainty heatmap and a current frame voxel entropy map using the multimodal scene features and the historical memory confidence;
A feature fusion unit, used to determine the three-factor attention weight of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and to fuse the memory features in all relevant memory segments and the current frame memory entry using the three-factor attention weight of each relevant memory segment to obtain fused features;
A multi-task unit, used to obtain a 3D occupancy probability grid using the fused features.
[0015] According to a third aspect of this disclosure, an electronic device is provided, comprising: a processor and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to perform the methods described above.
[0016] According to a fourth aspect of this disclosure, a computer-readable storage medium storing a program is provided, the program including instructions that, when executed by a processor, cause the processor to perform the methods described above.
[0017] As can be seen from the above technical solutions, the embodiments of this disclosure, under the premise of realizing 3D occupancy grid detection based on multimodal data, dynamically calculate the three-factor attention weights and use three-factor attention perception to achieve feature fusion, which solves the perception jitter caused by time inconsistency in related technologies, effectively reduces the missed detection rate of dynamic objects, and significantly improves the confidence recovery speed of occluded reconstructed objects. It can effectively improve the accuracy and detection precision of 3D occupancy grid detection in dynamic scenes, thereby meeting the safety and reliability requirements of autonomous driving.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the system architecture to which the embodiments of this disclosure apply;
Figure 2 is a flowchart of the 3D occupancy grid detection method based on three-factor attention weights provided in an embodiment of this disclosure;
Figure 3 is a schematic structural diagram of a 3D occupancy grid detection device based on three-factor attention weights provided in an embodiment of this disclosure;
Figure 4 is a schematic block diagram of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0020] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0021] The terminology used in the embodiments of this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of this disclosure. The singular forms “a,” “said,” and “the” as used in the embodiments of this disclosure and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.
[0022] It should be understood that the term "and / or" used in this disclosure merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.
[0023] Depending on the context, the word "if" as used here can be interpreted as "when," "upon," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" can be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)."
[0024] Figure 1 shows a schematic diagram of the architecture of a system to which embodiments of this disclosure apply. Referring to Figure 1, the system to which this disclosure applies may include an electronic device and peripheral sensor components connected to the electronic device. The peripheral sensor components include LiDAR, 4D millimeter-wave radar, and a multi-view camera array, and the electronic device is used to perform the 3D occupancy grid detection method described in this disclosure.
[0025] LiDAR can be used to acquire point cloud data of the environment surrounding a vehicle. A multi-view camera array can be used to acquire multi-view images of the vehicle's environment, which can cover the entire surrounding area.
[0026] 4D millimeter-wave radar can be used to acquire 4D radar data of the environment surrounding a vehicle. 4D radar data can include, but is not limited to, the target's spatial position, velocity, reflection intensity, and signal-to-noise ratio (SNR). Here, "target" refers to a physical entity detected by the 4D millimeter-wave radar; these physical entities can be, but are not limited to, dynamic or static objects.
[0027] The spatial position of a target can include its range, azimuth, and elevation in three-dimensional space. Range refers to the straight-line distance between the 4D millimeter-wave radar and the target, determined by calculating the time difference between signal transmission and reception. Azimuth represents the horizontal angle of the target relative to the 4D millimeter-wave radar, and elevation represents the vertical angle of the target relative to the 4D millimeter-wave radar.
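The range and angle definitions above can be made concrete with a short sketch. The range formula follows from the two-way travel of the signal; the axis convention (x forward, y left, z up, azimuth measured in the horizontal plane) and the function names are assumptions, since the disclosure does not define a coordinate frame.

```python
import math

SPEED_OF_LIGHT_MPS = 299_792_458.0


def range_from_round_trip(delay_s):
    """Range from the delay between transmission and reception (two-way path)."""
    return SPEED_OF_LIGHT_MPS * delay_s / 2.0


def spherical_to_cartesian(range_m, azimuth_rad, elevation_rad):
    """ASSUMED axes: x forward, y left, z up; azimuth measured in the x-y plane."""
    horizontal = range_m * math.cos(elevation_rad)
    x = horizontal * math.cos(azimuth_rad)
    y = horizontal * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return x, y, z
```

For example, a round-trip delay of 2 microseconds corresponds to a target roughly 300 m away.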
[0028] The target's velocity refers to its radial velocity relative to the 4D millimeter-wave radar, which can be measured using the Doppler effect. Velocity can be used to distinguish between stationary and moving objects.
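The Doppler relationship mentioned here can be written out as a small example. For a monostatic radar the radial velocity is v = f_d · c / (2 · f_c). The 77 GHz carrier default, the sign convention, the `threshold_mps` value, and the function names are assumptions for illustration. The simple threshold test is only meaningful when the radar itself is stationary or ego-motion has already been compensated.

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0


def radial_velocity(doppler_shift_hz, carrier_hz=77e9):
    """Radial velocity from Doppler shift: v = f_d * c / (2 * f_c).

    ASSUMED sign convention: positive shift means the target is approaching.
    """
    return doppler_shift_hz * SPEED_OF_LIGHT_MPS / (2.0 * carrier_hz)


def is_moving(radial_velocity_mps, threshold_mps=0.5):
    """ASSUMED rule: a target is moving if its radial speed exceeds a threshold."""
    return abs(radial_velocity_mps) > threshold_mps
```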
[0029] The reflection intensity of a target refers to the intensity of electromagnetic wave energy reflected back to a 4D millimeter-wave radar. It is related to the target's material, surface roughness, geometry, and the 4D millimeter-wave radar cross-section. Metallic objects, such as vehicles, have high reflection intensity, while non-metallic objects, such as pedestrians, have low reflection intensity. Reflection intensity can assist in target classification and enhance the accuracy of environmental perception.
[0030] Signal-to-noise ratio (SNR) refers to the ratio of the useful signal to the background noise in a 4D millimeter-wave radar signal, reflecting signal quality. A high SNR indicates reliable target detection and high data confidence; a low SNR indicates potential false detections or missed detections due to noise interference. SNR can be used for data filtering to improve system robustness.
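SNR-based data filtering can be illustrated as follows. This is a minimal sketch: the `RadarDetection` record, its field names, and the 10 dB default threshold are hypothetical and are not taken from the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class RadarDetection:
    """Hypothetical container for one 4D radar detection."""
    range_m: float
    azimuth_rad: float
    elevation_rad: float
    radial_velocity_mps: float
    intensity: float
    snr_db: float


def snr_db(signal_power, noise_power):
    """SNR in decibels from linear signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)


def filter_by_snr(detections, min_snr_db=10.0):
    """Keep only detections whose SNR reaches the (ASSUMED) threshold."""
    return [d for d in detections if d.snr_db >= min_snr_db]
```

In practice the threshold trades false detections against missed detections, so it would typically be tuned per sensor and per weather condition.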
[0031] In this embodiment of the disclosure, moving physical entities such as cars, trucks, motorcycles, bicycles, and pedestrians are dynamic objects, while stationary physical entities such as roadblocks, curbs, stationary vehicles, traffic signs, streetlights, bridges, tunnels, ground fixed facilities, manhole covers, potholes, road gaps, cables, and tree branches are static objects.
[0032] LiDAR can be deployed as a LiDAR group or a single LiDAR unit, and 4D millimeter-wave radar can be deployed as a 4D millimeter-wave radar group or a single 4D millimeter-wave radar unit. For example, a 4D millimeter-wave radar group can include four 4D millimeter-wave radar units: one forward-facing unit, one left side-facing unit, one right side-facing unit, and one rearward-facing unit. As another example, a 4D millimeter-wave radar group can include six 4D millimeter-wave radar units: two forward-facing units, one left side-facing unit, one right side-facing unit, and two rearward-facing units.
[0033] The multi-view camera array can be implemented as, but is not limited to, a six-view camera array, which includes a front-view camera, a rear-view camera, a left-view camera, a right-view camera, an upper-view camera, and a lower-view camera.
[0034] The specific deployment methods of lidar, 4D millimeter-wave radar and multi-view camera arrays are not limited in the embodiments disclosed herein.
[0035] Electronic devices can be implemented as, but are not limited to, domain controllers or other similar devices. LiDAR, 4D millimeter-wave radar, and multi-view camera arrays can be connected to electronic devices via communication or wired connections, respectively.
[0036] This disclosure can be applied to, but is not limited to, intelligent control of various devices such as multiple wheeled mobile robots, wheeled mobile robots, mobile robots, vehicles, aircraft, ships, Autonomous Rail Rapid Transit (ART) systems, and industrial automation equipment. Vehicles can be, but are not limited to, passenger cars, commercial vehicles (e.g., trucks, buses, vans), special-purpose vehicles (e.g., ambulances, fire trucks, engineering vehicles, rescue vehicles), agricultural and industrial vehicles (e.g., harvesters, forklifts), transportation and logistics vehicles (e.g., container trucks, refrigerated trucks), new energy vehicles (e.g., electric vehicles, hybrid vehicles), and special-purpose vehicles (e.g., garbage trucks, water trucks). In other words, the term "vehicle" in this disclosure is equivalent to the aforementioned various devices.
[0037] The embodiments disclosed herein can be applied to various scenarios such as urban transportation, highways, ports, mines, farms, closed parks, and industrial production. They are applicable to many aspects such as passenger travel, public transportation, logistics distribution, unmanned transportation, last-mile delivery, automated agricultural operations, and automated sanitation. This disclosure does not limit the application scenarios and applicable fields of the embodiments disclosed herein.
[0038] The systems applicable to the embodiments of this disclosure may include, but are not limited to, autonomous driving systems, intelligent driver assistance systems, etc. Referring to Figure 1, the system provided in this disclosure can be installed in a vehicle and used as, but is not limited to, an intelligent driver assistance system or an autonomous driving system.
[0039] Furthermore, those skilled in the art should understand that the systems to which the embodiments of this disclosure apply are not limited to the architecture shown in Figure 1, nor to the above-mentioned application scenarios.
[0040] The embodiments disclosed herein are applicable to various environments, including but not limited to sunny days, rainy and foggy weather, snowy days, nighttime environments without light, strong light environments, and backlight environments, and can exhibit high robustness and reliability in various environments.
[0041] Figure 2 shows a flowchart of the 3D occupancy grid detection method based on three-factor attention weights provided in this disclosure. Referring to Figure 2, the 3D occupancy grid detection method based on three-factor attention weights in this embodiment may include:
Step 201: Obtain current frame point cloud data, current frame multi-view images, and current frame 4D radar data of the environment around the vehicle from the multi-view camera array, LiDAR, and 4D millimeter-wave radar installed on the vehicle;
Step 202: Multimodally voxelize the current frame point cloud data, the current frame multi-view images, and the current frame 4D radar data to obtain multimodal scene features;
Step 203: Generate the current frame memory entry based on the multimodal scene features. The memory entry includes a memory key and a memory value, and the memory value contains memory features;
Step 204: Use the memory key in the current frame memory entry to search the memory bank and obtain the relevant memory segments;
Step 205: Obtain the historical memory confidence, and use the multimodal scene features and the historical memory confidence to generate the current frame uncertainty heatmap and the current frame voxel entropy map;
Step 206: Determine the three-factor attention weight of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and fuse the memory features of all relevant memory segments and the current frame memory entry through the three-factor attention weight of each relevant memory segment to obtain fused features;
Step 207: Obtain the 3D occupancy probability grid using the fused features.
[0042] This embodiment uses multimodal scene features and historical memory confidence to generate the current frame uncertainty heatmap and the current frame voxel entropy map. The three-factor attention weight of each relevant memory segment is determined from this heatmap and entropy map. The memory features of the current frame memory entry and of the relevant memory segments are then fused using these three-factor attention weights to obtain fused features, and a 3D occupancy probability grid is obtained from the fused features. Therefore, while achieving 3D occupancy grid detection based on multimodal data, this embodiment dynamically calculates the three-factor attention weights and uses three-factor attention perception to achieve feature fusion. This resolves the perception jitter caused by temporal inconsistencies in related technologies, reduces missed detections of dynamic objects by approximately 25%, and speeds up confidence recovery for occluded reconstructed objects by approximately 0.2 seconds. It effectively improves the accuracy and precision of 3D occupancy grid detection in dynamic scenes, thereby enhancing the safety and reliability of autonomous driving.
[0043] In step 202, feature extraction and voxelization are performed on the current frame point cloud data, the current frame multi-view image, and the current frame 4D radar data to obtain image feature voxels, point cloud feature voxels, and radar feature voxels. The image feature voxels, point cloud feature voxels, and radar feature voxels are then fused to obtain multimodal scene features.
[0044] Specifically, feature maps of the multi-view images can be extracted using neural networks such as 2D CNNs or other similar methods. These feature maps are then projected onto a predefined 3D voxel grid to obtain image feature voxels. The current frame's point cloud data is voxelized using the 3D voxel grid to obtain point cloud feature voxels. The current frame's 4D radar data is voxelized using the 3D voxel grid to obtain radar feature voxels. The image feature voxels, point cloud feature voxels, and radar feature voxels are then concatenated to obtain concatenated feature voxels. Finally, a 3D convolutional network is used to process the concatenated feature voxels to obtain multimodal scene features.
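The voxelization and concatenation steps can be sketched with NumPy. This is a simplified illustration under assumptions: point cloud voxelization is reduced to counting points per cell, the grid origin `grid_min`, cell size `voxel_size`, and `grid_shape` are hypothetical parameters, and the image projection and 3D convolution stages are not modeled.

```python
import numpy as np


def voxelize_points(points, grid_min, voxel_size, grid_shape):
    """Count points per voxel; points outside the grid are discarded.

    points: (N, 3) array of x, y, z coordinates.
    Returns an integer array of shape grid_shape.
    """
    points = np.asarray(points, dtype=np.float64)
    idx = np.floor((points - np.asarray(grid_min, dtype=np.float64)) / voxel_size)
    idx = idx.astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    counts = np.zeros(grid_shape, dtype=np.int64)
    # np.add.at accumulates correctly when several points fall in the same voxel.
    np.add.at(counts, tuple(idx[inside].T), 1)
    return counts


def concat_modalities(image_voxels, point_voxels, radar_voxels):
    """Concatenate per-modality feature voxels (C_i, D, H, W) along the channel axis."""
    return np.concatenate([image_voxels, point_voxels, radar_voxels], axis=0)
```

All three modalities must share the same voxel grid so that their feature volumes agree in the D, H, and W dimensions before concatenation.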
[0045] Image feature voxels can include semantic features such as texture and color, point cloud feature voxels can include geometric features such as contours, and radar feature voxels can include motion features such as position and velocity.
[0046] Before multimodal voxelization, the current frame point cloud data, the current frame multi-view image, and the current frame 4D radar data can be aligned in both temporal and spatial dimensions to further improve accuracy. In specific applications, the resolution of the predefined 3D voxel grid can be flexibly configured as needed. This disclosure does not limit the specific details of the 3D voxel grid.
[0047] Memory keys can be used for fast scene matching. They are low-dimensional features, easy to compute quickly, and have high discriminative power, distinguishing different scene types. Memory values can include memory features and metadata. Metadata can include, but is not limited to, timestamps, locations, and scene labels. Memory features are low-dimensional, dense vector representations of multimodal scene features obtained after passing them through a feature extraction network. Memory features can capture key information from the original data while removing redundancy and noise.
[0048] In some implementations, spatial pyramid pooling, dynamic scene encoding, and high-dimensional feature projection can be performed sequentially on multimodal scene features to generate memory keys.
[0049] Specifically, an exemplary process for generating a memory key may include the following steps a1 to a4: Step a1 involves extracting multi-scale spatial information from multimodal scene features using Spatial Pyramid Pooling (SPP) and transforming the multi-scale spatial information into spatial pyramid features, which are fixed-length vectors.
[0050] Specifically, global-scale pooling is performed on multimodal scene features to capture global statistical features. Average pooling is performed on the multimodal scene features along the three dimensions D, H, and W: for each channel of the feature map, the values of that channel are summed across all spatial locations and divided by the total number of spatial locations (D × H × W) to obtain a scalar value. One scalar value is obtained for each channel, resulting in a total of C scalar values. This captures the global statistical features of the entire scene.
[0051] Medium-scale pooling is performed on multimodal scene features to capture coarse-grained local features of the scene. Specifically, the D dimension is evenly divided into 2 parts, the H dimension is evenly divided into 2 parts, and the W dimension is evenly divided into 2 parts, resulting in a total of 2×2×2=8 sub-regions. Pooling for each sub-region: average pooling is performed on each sub-region within its own spatial range, and an average value is calculated for each channel in each sub-region. The scalar values of each channel in the 8 sub-regions are output, resulting in a total of 8×C scalars. Thus, coarse-grained local features of the scene are captured.
[0052] Fine-scale pooling (4×4×4 grid) is applied to multimodal scene features to capture fine-grained local features. Specifically, the D dimension, H dimension, and W dimension are uniformly divided into 4 parts, resulting in a total of 4×4×4=64 sub-regions. Average pooling is performed on each sub-region within its own spatial range, and an average value is calculated for each channel within each sub-region. The scalar values for each channel across the 64 sub-regions are output, resulting in a total of 64×C scalars. This captures fine-grained local features of the scene, allowing for the understanding of details in small regions.
[0053] The aforementioned global statistical features, coarse-grained local features of the scene, and fine-grained local features of the scene are concatenated to obtain a vector of fixed length.
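The three pooling scales and the final concatenation can be sketched with NumPy as follows. The 1, 2, and 4 partitions per axis come from the text; the function names and the use of `np.array_split` to divide each axis are assumptions. The sketch assumes every spatial dimension has at least 4 cells, so that no sub-region is empty.

```python
import numpy as np


def pool_grid(features, parts):
    """Average-pool a (C, D, H, W) volume over a parts x parts x parts grid.

    Returns a vector of length parts**3 * C (one mean per channel per sub-region).
    """
    pooled = []
    for d_block in np.array_split(features, parts, axis=1):
        for h_block in np.array_split(d_block, parts, axis=2):
            for w_block in np.array_split(h_block, parts, axis=3):
                pooled.append(w_block.mean(axis=(1, 2, 3)))
    return np.concatenate(pooled)


def spatial_pyramid_features(features, levels=(1, 2, 4)):
    """Concatenate global (1), medium (2x2x2) and fine (4x4x4) pooled features."""
    return np.concatenate([pool_grid(features, parts) for parts in levels])
```

The output length is C × (1 + 8 + 64) = 73C regardless of D, H, and W, which is what makes the vector fixed-length.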
[0054] Step a2: Perform dynamic scene encoding on the multimodal scene features to obtain dynamic encoded features.
[0055] Specifically, motion features are extracted from multimodal scene features and then concatenated to form a dynamic encoding vector.
[0056] Step a3: The spatial pyramid features and dynamic coding features are fused and then projected into higher dimensions to generate memory keys for multimodal scene features.
[0057] Specifically, the spatial pyramid features and dynamic coding features can be concatenated to obtain the total features. The total features are then projected onto a high-dimensional space through a fully connected layer to obtain high-dimensional features. The high-dimensional features are then normalized (e.g., L2 normalization) to obtain the memory keys of the multimodal scene features.
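The concatenate, project, and normalize sequence can be sketched as below. The `weight` and `bias` arguments stand in for the parameters of a learned fully connected layer; in the usage shown in the comment they are random placeholders. The function name and the `eps` guard against division by zero are assumptions.

```python
import numpy as np


def make_memory_key(pyramid_features, dynamic_features, weight, bias, eps=1e-12):
    """Concatenate, apply a fully connected projection, then L2-normalize.

    weight: (K, P + Q) matrix and bias: (K,) vector of the projection layer,
    where P and Q are the lengths of the two input feature vectors.
    """
    total = np.concatenate([pyramid_features, dynamic_features])
    projected = weight @ total + bias
    return projected / max(np.linalg.norm(projected), eps)


# Usage with random placeholder parameters (a trained layer would supply these):
# rng = np.random.default_rng(0)
# key = make_memory_key(rng.normal(size=146), rng.normal(size=8),
#                       rng.normal(size=(256, 154)), np.zeros(256))
```

Because the keys are L2-normalized, the dot product of two keys equals their cosine similarity, which is convenient for nearest-neighbor retrieval from a memory bank.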
[0058] Based on the above, a memory key with good discriminative power and containing rich spatial and dynamic information can be extracted from the complex multimodal scene features. This memory key can be used for retrieval and storage in the memory bank.
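As an illustration of step a3, the sketch below concatenates the two feature vectors, applies one fully connected layer, and L2-normalizes the result. The name `make_memory_key` and the parameters `weight` and `bias` are hypothetical; in practice the layer parameters would be learned, and the output dimension is an assumption.

```python
import numpy as np

def make_memory_key(pyramid_feat, dynamic_feat, weight, bias):
    """Fuse spatial pyramid and dynamic encoding features into a unit-length key."""
    total = np.concatenate([pyramid_feat, dynamic_feat])  # total features
    projected = weight @ total + bias                     # fully connected projection
    norm = np.linalg.norm(projected)
    return projected / max(norm, 1e-12)                   # L2 normalization
```

Because the key has unit length, the dot product of two keys equals their cosine similarity, which is convenient for retrieval.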
[0059] In some implementations, the memory features include static features, dynamic features, and an uncertainty map. The memory features are obtained as follows: the static features are obtained by compressing the multimodal scene features through CNN distillation; the dynamic features are obtained by extracting the target state of the current frame from the multimodal scene features; and the uncertainty map is obtained by calculating the current entropy and confidence from the multimodal scene features. The static features may include the geometric structure features and semantic attribute features of obstacles in the current scene, while the dynamic features may include motion parameters such as the velocity, acceleration, velocity distribution pattern, and trajectory curvature of obstacles in the current scene.
[0060] Multimodal scene features can be compressed into static features through CNN distillation (compression ratio of 4:1). Specifically, the multimodal scene features are ranked by channel importance, and the M most important channels are retained to obtain simplified scene features. Lightweight convolutions (e.g., 1x1x1 convolutions) are used to re-encode the simplified scene features to obtain compressed features. The compressed features are then flattened and further reduced to a preset fixed dimension to obtain static features.
[0061] The channel importance ranking can be achieved by calculating an importance score for each channel. For example, the score of a channel can be the mean of the absolute activation values of that channel in the multimodal scene features, i.e., global average pooling applied to the absolute activations. The channels are sorted from high to low importance score, the top M channels are retained, and the other channels are deleted to obtain the simplified scene features. Here, M can be pre-configured. For example, M can be set to half the original number of channels or determined based on the compression ratio.
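The channel selection can be sketched as below, assuming a feature volume of shape (C, D, H, W). The function name `select_top_channels` is hypothetical, and keeping the retained channels in their original order is a design choice of this sketch rather than something stated in the disclosure.

```python
import numpy as np

def select_top_channels(features: np.ndarray, m: int):
    """Keep the m channels with the largest mean absolute activation."""
    scores = np.abs(features).mean(axis=(1, 2, 3))  # one importance score per channel
    order = np.argsort(-scores, kind="stable")      # high to low importance
    keep = np.sort(order[:m])                       # retained channel indices
    return features[keep], keep
```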
[0062] Multimodal scene features can include motion state information such as the velocity, acceleration, and position of obstacles in the current frame. By extracting the motion state information such as the velocity, acceleration, and position of each obstacle in the multimodal scene features, dynamic features can be formed.
[0063] The uncertainty map is used to quantify the uncertainty of the memory values and includes confidence levels. A confidence level represents the reliability of the memory features of the corresponding voxel.
[0064] In this embodiment of the disclosure, the memory bank stores memory segments, which are obtained by compressing historical frame memory entries. These historical frame memory entries are derived from historical frame point cloud data, historical frame multi-view images, and historical frame 4D radar data. Each historical frame memory entry includes multiple memory entries, each obtained from synchronously acquired past frame point cloud data, past frame multi-view images, and past frame 4D radar data. The acquisition method for each memory entry in the historical frame memory entries is the same as the acquisition method for the aforementioned current frame memory entries, and will not be repeated here.
[0065] The memory segment includes a segment retrieval key, a static segment, a dynamic segment, an uncertain segment, and a timestamp. The segment retrieval key is obtained from the memory features in the historical frame memory entries. The static segment is obtained by fusing the static features of the historical frame memory entries. The dynamic segment is obtained by polynomial fitting of the motion trajectory from the dynamic features of the historical frame memory entries. The uncertain segment is obtained by statistical analysis of the uncertainty maps of the historical frame memory entries. The timestamp indicates the most recent update time of the memory segment. Thus, scene memory storage can be achieved through memory segments.
[0066] Static segments represent the static structure of a scene. Static segments can include features of static objects such as backgrounds, buildings, and road structures. For example, a static segment can include a segment representing a "crossroads" or a segment representing a "gas station".
[0067] A dynamic segment refers to a continuous time interval in which the trajectory of a target is parameterized by a mathematical function (a polynomial). Each segment in a dynamic segment can correspond to a target, and each segment may include, but is not limited to, start and end times, polynomial coefficients, motion pattern (e.g., uniform linear motion, uniformly accelerated turning), initial state (e.g., position, velocity, acceleration), ending state (e.g., position, velocity, acceleration), and target category (e.g., vehicle, pedestrian). Dynamic segments have the advantages of temporal continuity, spatial continuity, motion consistency, and compact parameters, allowing complex object motion to be described with a small number of parameters.
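A polynomial trajectory segment of this kind can be sketched with `numpy.polyfit`. The dictionary layout, the function names, and the choice of a second-degree polynomial per coordinate are assumptions made for illustration; the disclosure does not fix the degree or the storage format.

```python
import numpy as np

def fit_dynamic_segment(times: np.ndarray, positions: np.ndarray, degree: int = 2) -> dict:
    """Fit one polynomial per coordinate to a target trajectory.

    times: (T,) timestamps; positions: (T, K) coordinates of one target.
    """
    dt = times - times[0]  # time relative to the segment start
    coeffs = np.stack([np.polyfit(dt, positions[:, k], degree)
                       for k in range(positions.shape[1])])
    return {"t_start": float(times[0]), "t_end": float(times[-1]), "coeffs": coeffs}

def evaluate_segment(segment: dict, t: float) -> np.ndarray:
    """Evaluate the fitted trajectory at time t (highest power first, as in polyfit)."""
    dt = t - segment["t_start"]
    return np.array([np.polyval(c, dt) for c in segment["coeffs"]])
```

For uniformly accelerated motion, the three coefficients per coordinate correspond to half the acceleration, the initial velocity, and the initial position, so a whole trajectory is stored in a handful of numbers.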
[0068] The uncertain segment can be obtained by calculating and quantizing confidence statistics (such as the mean and variance) of the uncertainty maps of the historical frame memory entries over a continuous period of time.
[0069] The segment retrieval key can be obtained as follows: arrange the memory features in the historical frame memory entries in chronological order to form a feature sequence, encode the feature sequence through a time series model to obtain fused time series information, and perform spatial pyramid pooling, dynamic scene encoding and up-dimensional feature projection on the fused time series information in sequence to generate the segment retrieval key.
[0070] The method in this embodiment may further include: updating the memory segments in the memory bank. Specifically, if the current frame memory entry matches the static segment of an existing memory segment, it is merged into that static segment, the feature representation of the static segment is updated (e.g., by updating the feature vector through a weighted average), and the confidence level is updated. If there is no match, a new static segment can be created. Typically, static segments have a long lifespan and a low update frequency. If the dynamic features in the current frame memory entry match an existing dynamic segment of an obstacle, the dynamic features are added to that dynamic segment, and the polynomial is refitted to update the parameters of the dynamic segment. If no dynamic segment matches the dynamic features in the current frame memory entry, a new dynamic segment is created using those dynamic features. Further, a statistical measure (such as the mean or variance) of the confidence levels in the uncertainty map of the current frame memory entry can be calculated and quantized, and the quantized statistical value can be averaged with the existing uncertain segment to update the uncertain segment.
[0071] In step 204, the memory retrieval may include: calculating the similarity between the memory key in the current frame memory entry and the segment retrieval key of each memory segment in the memory bank; and, among the memory segments whose segment retrieval keys have a similarity exceeding a predetermined similarity threshold, selecting the N memory segments with the highest similarity as the relevant memory segments, where N is a preset value and N is an integer greater than 1.
[0072] Furthermore, in step 204, the candidate memory segments, i.e., those whose segment retrieval keys have a similarity exceeding the predetermined similarity threshold, can be filtered further: the similarity between the static features in the current frame memory entry and the static segment of each candidate is calculated, and the candidates are filtered based on this similarity. This reduces the number of relevant memory segments while improving fusion accuracy.
[0073] Similarity calculation can employ Euclidean distance, cosine similarity, etc. This disclosure does not impose any limitations on this method.
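The threshold-then-top-N retrieval of step 204 can be sketched as follows, here using cosine similarity. The function name `retrieve_segments` and the array layout (one segment retrieval key per row) are assumptions of this sketch.

```python
import numpy as np

def retrieve_segments(query_key: np.ndarray, segment_keys: np.ndarray,
                      threshold: float, n: int):
    """Return indices of up to n segments whose cosine similarity exceeds threshold."""
    q = query_key / np.linalg.norm(query_key)
    k = segment_keys / np.linalg.norm(segment_keys, axis=1, keepdims=True)
    sims = k @ q                                   # cosine similarity per segment
    candidates = np.flatnonzero(sims > threshold)  # apply the similarity threshold
    ranked = candidates[np.argsort(-sims[candidates], kind="stable")]
    return ranked[:n], sims
```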
[0074] In step 205, the historical memory confidence score can be read from the processing result of the previous frame. The historical memory confidence score includes the historical memory confidence score for each voxel, and the historical memory confidence score for each voxel represents the reliability of the historical features (i.e., the relevant memory segments) of that voxel. In this embodiment of the present disclosure, the historical memory confidence score can also be updated after step 207 for use in the 3D occupancy grid detection of the next frame.
[0075] In step 205, a current frame uncertainty heatmap can be obtained by processing multimodal scene features and historical memory confidence using a pre-trained dynamic uncertainty estimation model. The current frame uncertainty heatmap includes the uncertainty of the current features (i.e., the memory features of the current frame memory entries) of each voxel. Here, the dynamic uncertainty estimation model can be a neural network or a probabilistic model, obtained through pre-training.
[0076] In step 205, the current frame voxel entropy map includes the entropy value of each voxel. The entropy value represents a quantitative measure of uncertainty. A low entropy value of a voxel indicates that the voxel's state is certain, with little information and low uncertainty; a high entropy value of a voxel indicates that the voxel's state is uncertain, with a lot of information and high uncertainty.
[0077] The entropy value of each voxel is calculated as follows: The feature values of each voxel in the multimodal scene features are normalized to map to the range [0, 1]. The normalized feature values are quantized into N intervals. For each feature value, the interval index is calculated. The number of feature values in each interval is counted, and the ratio of the number of feature values in each interval to the total number of feature values is calculated to obtain the probability of each interval. The entropy value of each voxel is calculated based on the probability of each interval. N is a preset fixed value and is an integer greater than 1.
[0078] The entropy value $H$ of a voxel can be calculated using the following formula:
[0079] $$H = -\sum_{i=1}^{K} p_i \log_2 p_i$$
[0080] where $p_i$ represents the probability that a feature value of the current voxel falls in the i-th quantization interval in the multimodal scene features, i.e., the probability mass function of the feature value distribution, which describes the frequency with which different values occur in the multimodal scene features; $K$ is the preset total number of quantization intervals, for example, K=256; and $H$ represents the entropy value of the voxel. When $p_i = 0$, the term $p_i \log_2 p_i$ is defined as 0.
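The entropy calculation (normalize, quantize into intervals, count, convert counts to probabilities, then take the Shannon entropy with base-2 logarithm) can be sketched as below. The function name `voxel_entropy`, the (V, C) layout with one row of feature values per voxel, and the use of a single global min-max normalization are assumptions of this sketch.

```python
import numpy as np

def voxel_entropy(features: np.ndarray, k: int = 256) -> np.ndarray:
    """features: (V, C) array, one row of C feature values per voxel."""
    lo, hi = features.min(), features.max()
    scale = hi - lo if hi > lo else 1.0
    normalized = (features - lo) / scale                   # map to [0, 1]
    idx = np.minimum((normalized * k).astype(int), k - 1)  # interval index per value
    entropy = np.zeros(features.shape[0])
    for v, row in enumerate(idx):
        p = np.bincount(row, minlength=k) / row.size       # probability of each interval
        nz = p > 0                                          # 0 * log2(0) is defined as 0
        entropy[v] = -np.sum(p[nz] * np.log2(p[nz]))
    return entropy
```

A voxel whose feature values all fall in one interval has entropy 0, while values spread evenly over four intervals give an entropy of 2 bits.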
[0081] In step 206, the three-factor attention weights for each relevant memory segment can be determined through the following steps b1 to b3: Step b1: Determine the three-factor adjustment coefficients. The three-factor adjustment coefficients are the product of the time decay factor, the noise suppression factor, and the motion compensation factor. The time decay factor is calculated based on the timestamps in the relevant memory segments. The noise suppression factor is obtained based on the current uncertainty value. The current uncertainty value is obtained based on the current frame uncertainty heatmap and the current frame voxel entropy value map. The motion compensation factor is obtained based on the dynamic features within the memory features of the current frame memory entries and the dynamic segments in the relevant memory segments. Step b2: Calculate the spatial similarity between the current frame memory entry and related memory segments; Step b3: Calculate the product of the three-factor adjustment coefficient and spatial similarity to obtain the three-factor attention weights of the relevant memory segments.
[0082] In some examples, the three-factor adjustment coefficients are calculated using the following formula:
[0083] where $\alpha_i$ represents the time decay coefficient of the i-th relevant memory segment, i = 1, 2, ..., N, and N represents the total number of relevant memory segments; $\lambda$ represents the predetermined time decay factor; $\Delta t_i$ represents the difference between the timestamp of the current frame memory entry and the timestamp of the i-th relevant memory segment; $\beta_i$ represents the noise suppression coefficient of the i-th relevant memory segment; $\sigma$ represents the uncertainty value, which is determined based on the current frame uncertainty heatmap and the current frame voxel entropy map; $\gamma_i$ represents the motion compensation coefficient of the i-th relevant memory segment; $c$ represents the pre-calibrated motion compensation denominator coefficient; $v_i$ represents the vehicle speed corresponding to the i-th relevant memory segment; $A_i = \alpha_i \cdot \beta_i \cdot \gamma_i$ represents the three-factor adjustment coefficient of the i-th relevant memory segment; and j represents the timestamp number of the i-th relevant memory segment.
[0084] In some examples, the uncertainty value can be obtained by the following formula:
[0085] where U represents the current frame uncertainty heatmap, E represents the current frame voxel entropy map, $w$ represents the weighting coefficient of the current frame uncertainty heatmap, and $\sigma$ represents the uncertainty value.
[0086] In some examples, the weighting coefficient of the current frame uncertainty heatmap can be, but is not limited to, 0.2. The time decay factor can be set to a constant value, for example 0.1 or another value.
[0087] In some examples, the spatial similarity between the i-th relevant memory segment and the current frame memory entry can be calculated as follows: the static features within the memory features of the current frame memory entry are used as the query vector, the static segment in the i-th relevant memory segment is used as the key, and the similarity between the two is calculated. Here, the similarity can be calculated using, but is not limited to, cosine similarity, dot product similarity, etc.
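Steps b1 to b3 can be sketched as below. Only the overall structure comes from the text: the adjustment coefficient is the product of three factors, and the attention weight is that product multiplied by the spatial similarity. The functional form of each factor (exponential time decay, `1 / (1 + sigma)` noise suppression, `1 + speed / c` motion compensation) is an assumption chosen for illustration, as are the function name and the default value of `c`; the default `lam=0.1` follows the example value given for the time decay factor.

```python
import numpy as np

def three_factor_weights(query_static, seg_static, dt, sigma, speed, lam=0.1, c=5.0):
    """Attention weight per relevant memory segment.

    query_static: (F,) static features of the current frame memory entry.
    seg_static:   (N, F) static segments of the N relevant memory segments.
    dt, speed:    (N,) timestamp differences and speeds; sigma: uncertainty value.
    """
    time_decay = np.exp(-lam * dt)     # ASSUMED form: older segments count less
    noise_supp = 1.0 / (1.0 + sigma)   # ASSUMED form: higher uncertainty counts less
    motion_comp = 1.0 + speed / c      # ASSUMED form: faster targets count more
    adjust = time_decay * noise_supp * motion_comp  # step b1: product of the factors
    q = query_static / np.linalg.norm(query_static)
    k = seg_static / np.linalg.norm(seg_static, axis=1, keepdims=True)
    spatial_sim = k @ q                # step b2: cosine similarity
    return adjust * spatial_sim        # step b3: three-factor attention weight
```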
[0088] Step 206 may include the following steps c1 to c4: Step c1: use the three-factor attention weights of each relevant memory segment to perform a weighted summation of the static segments of each relevant memory segment to obtain the static aggregated features; Step c2: use the three-factor attention weights of each relevant memory segment to perform a weighted summation of the dynamic segments of each relevant memory segment to obtain the dynamic aggregated features; Step c3: use the three-factor attention weights of each relevant memory segment to perform a weighted summation of the uncertain segments of each relevant memory segment to obtain the historical uncertainty features; fuse the current frame uncertainty heatmap and the current frame voxel entropy map to obtain the current uncertainty comprehensive map; concatenate the current uncertainty comprehensive map and the historical uncertainty features in the channel dimension to obtain the gating feature map; and process the gating feature map through the gating generation network to obtain the first gating signal, the second gating signal, and the third gating signal; Step c4: use the first gating signal, the second gating signal, and the third gating signal to fuse the memory features in the current frame memory entry, the static aggregated features, and the dynamic aggregated features to obtain the fused features.
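The weighted aggregation and gated fusion of steps c1 to c4 can be sketched in simplified form. Several things here are assumptions rather than details from the disclosure: all features are flattened vectors of the same length, the gating generation network is replaced by a single linear layer followed by a softmax (`gate_w`, `gate_b`), and each gating signal is a scalar.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(current, static_segs, dynamic_segs, weights, gate_input, gate_w, gate_b):
    """Fuse current features with attention-weighted static and dynamic segments."""
    static_agg = weights @ static_segs    # step c1: weighted sum of static segments
    dynamic_agg = weights @ dynamic_segs  # step c2: weighted sum of dynamic segments
    # Stand-in for the gating generation network (ASSUMED: linear layer + softmax).
    g1, g2, g3 = softmax(gate_w @ gate_input + gate_b)
    return g1 * current + g2 * static_agg + g3 * dynamic_agg  # step c4
```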
[0089] Therefore, the current frame memory entries and related memory segments can be fused through a three-factor perception attention mechanism. This three-factor perception attention mechanism can better track dynamic objects and improve the accuracy of dynamic object-related features in the fused features.
[0090] In step 207, the fused features can be processed using a 3D deconvolutional network and a 3D occupancy and semantic prediction head to obtain a 3D occupancy probability grid. This embodiment of the disclosure does not limit the specific method for obtaining the 3D occupancy probability grid.
[0091] The 3D occupancy probability grid includes multiple channels: one channel for occupancy probability, representing the probability that the voxel is occupied by any object; another channel for motion information, representing the motion state of the object occupying the voxel; and the remaining channels for semantic category probabilities, representing the probability that the voxel belongs to each of several predefined categories when it is occupied. Motion information may include, but is not limited to, velocity, acceleration, and motion patterns.
[0092] The 3D occupancy grid detection method of this disclosure can be implemented using a pre-trained 3D occupancy grid detection model. The loss function used during the training of the 3D occupancy grid detection model is determined based on static consistency loss and motion smoothing loss. Here, the 3D occupancy grid detection model can be used to implement the aforementioned steps 202 to 207.
[0093] In some implementations, the static consistency loss can be obtained as follows: estimate the 3D flow field using the 3D occupancy probability grid of the current frame and the 3D occupancy probability grid of the historical frame; use the 3D flow field to warp the 3D occupancy probability grid of the historical frame into the coordinate system of the 3D occupancy probability grid of the current frame, to obtain the warped 3D occupancy probability grid of the historical frame; and calculate the static consistency loss using the warped 3D occupancy probability grid of the historical frame and the 3D occupancy probability grid of the current frame.
[0094] The static consistency loss can be calculated using the following formula:
$$L_{static} = \left\| O_t - \mathcal{W}\left(O_{t-1}, F_{t-1 \to t}\right) \right\|_2$$
[0095] where $L_{static}$ represents the static consistency loss; $O_t$ represents the 3D occupancy probability grid of the current frame t; $O_{t-1}$ represents the 3D occupancy probability grid of the previous frame t-1; $F_{t-1 \to t}$ represents the 3D flow field from the previous frame t-1 to the current frame t; $\mathcal{W}(\cdot)$ represents a differentiable spatial warping operation; and $\|\cdot\|_2$ represents the L2 norm.
[0096] In some implementations, the motion smoothing loss is obtained by: performing dynamic voxel detection using the 3D occupancy probability grids of multiple consecutive frames to determine the current frame velocity, the previous frame velocity, and the current frame acceleration; calculating the velocity continuity loss based on the current frame velocity and the previous frame velocity; calculating the acceleration constraint loss based on the current frame acceleration; and determining the motion smoothing loss based on the velocity continuity loss and the acceleration constraint loss.
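The warp-then-compare structure of the static consistency loss can be sketched in NumPy as below. This is a simplification for illustration: the warp uses nearest-neighbour lookup with the flow rounded to whole voxels, so unlike the differentiable warping operation used for training it has no useful gradient, and the flow is sampled at the target voxel. Ignoring voxels whose source falls outside the grid is also an assumption of this sketch, as are the function names.

```python
import numpy as np

def warp_nearest(prev_grid: np.ndarray, flow: np.ndarray):
    """prev_grid: (D, H, W); flow: (3, D, H, W) displacement in voxels, t-1 to t."""
    d, h, w = prev_grid.shape
    size = np.array([d, h, w]).reshape(3, 1, 1, 1)
    zi, yi, xi = np.meshgrid(np.arange(d), np.arange(h), np.arange(w), indexing="ij")
    # Backward lookup: the content at p in frame t is taken from p - flow in frame t-1.
    src = np.stack([zi, yi, xi]) - np.rint(flow).astype(int)
    valid = np.all((src >= 0) & (src < size), axis=0)
    src = np.clip(src, 0, size - 1)
    warped = prev_grid[src[0], src[1], src[2]]
    return np.where(valid, warped, 0.0), valid

def static_consistency_loss(curr_grid, prev_grid, flow) -> float:
    """L2 norm of the difference between the current grid and the warped previous grid."""
    warped, valid = warp_nearest(prev_grid, flow)
    return float(np.linalg.norm((curr_grid - warped)[valid]))
```

If an occupied voxel moves by exactly the displacement given in the flow field, the warped previous grid coincides with the current grid and the loss is zero.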
[0097] The velocity continuity loss can be calculated using the following formula:
$$L_{vel} = \sum_{i} \mathbb{1}_{dyn}(i) \left\| \mathbf{v}_i^{t} - \mathbf{v}_i^{t-1} \right\|_2$$
[0098] where $L_{vel}$ represents the velocity continuity loss; $\mathbb{1}_{dyn}(i)$ indicates that the velocity continuity loss is calculated only for dynamic objects; $\mathbf{v}_i^{t}$ represents the velocity vector of object i in the current frame t; $\mathbf{v}_i^{t-1}$ represents the velocity vector of object i in the previous frame t-1; and $\|\mathbf{v}_i^{t} - \mathbf{v}_i^{t-1}\|_2$ represents the L2 norm of the difference between $\mathbf{v}_i^{t}$ and $\mathbf{v}_i^{t-1}$.
[0099] Typically, the 3D occupancy probability grid contains object velocities: the current-frame velocity $\mathbf{v}_i^{t}$ can be directly extracted from the 3D occupancy probability grid of the current frame t, and the previous-frame velocity $\mathbf{v}_i^{t-1}$ can be directly extracted from the 3D occupancy probability grid of the previous frame t-1.
[0100] The acceleration constraint loss can be calculated using the following formula:
[0101] where $L_{acc}$ represents the acceleration constraint loss, $a_{max}$ represents the preset maximum allowable acceleration, and $\mathbf{a}_i^{t}$ represents the acceleration vector of object i in the current frame t. Here, the acceleration vector can also be extracted from the 3D occupancy probability grid.
[0102] The motion smoothing loss can be calculated using the following formula:
$$L_{motion} = \lambda_{vel} L_{vel} + \lambda_{acc} L_{acc}$$
[0103] where $\lambda_{vel}$ represents the weighting coefficient of the velocity continuity loss $L_{vel}$, $\lambda_{acc}$ represents the weighting coefficient of the acceleration constraint loss $L_{acc}$, and $L_{motion}$ represents the motion smoothing loss. The values of $\lambda_{vel}$ and $\lambda_{acc}$ can be pre-configured or taken from empirical values.
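[0104] The loss function used during the training of the 3D occupancy grid detection model is expressed as follows:
$$L_{total} = \lambda_{static} L_{static} + \lambda_{motion} L_{motion}$$
[0105] where $L_{total}$ represents the total loss, $\lambda_{static}$ represents the weighting coefficient of the static consistency loss $L_{static}$, and $\lambda_{motion}$ represents the weighting coefficient of the motion smoothing loss $L_{motion}$. The values of $\lambda_{static}$ and $\lambda_{motion}$ can be pre-configured or taken from empirical values.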
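The way the loss terms combine can be sketched as follows. The weighted-sum structure follows the text; the hinge form of the acceleration constraint (penalizing only the part of the acceleration magnitude above the preset maximum) is an assumption of this sketch, and the function names and default weights are hypothetical.

```python
import numpy as np

def motion_smoothing_loss(v_curr, v_prev, a_curr, is_dynamic, a_max,
                          w_vel=1.0, w_acc=1.0) -> float:
    """v_curr, v_prev, a_curr: (N, 3) per-object vectors; is_dynamic: (N,) 0/1 mask."""
    # Velocity continuity: L2 norm of the velocity change, dynamic objects only.
    vel_loss = np.sum(is_dynamic * np.linalg.norm(v_curr - v_prev, axis=1))
    # Acceleration constraint (ASSUMED hinge form): penalize magnitude above a_max.
    acc_loss = np.sum(np.maximum(0.0, np.linalg.norm(a_curr, axis=1) - a_max))
    return float(w_vel * vel_loss + w_acc * acc_loss)

def total_loss(static_loss, motion_loss, w_static=1.0, w_motion=1.0) -> float:
    """Weighted sum of the static consistency loss and the motion smoothing loss."""
    return float(w_static * static_loss + w_motion * motion_loss)
```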
[0106] By using the aforementioned model training method that combines the static consistency loss and the motion smoothing loss, the temporal inconsistency of 3D occupancy grid detection can be reduced by about 30%, and the mIoU fluctuation range of static objects can be narrowed to about ±3%.
[0107] Furthermore, the method in this embodiment may further include: updating a dynamic object trajectory table and / or a drivable area mask based on a 3D occupancy probability grid of consecutive frames. The dynamic object trajectory table can indicate the motion trajectory of dynamic objects, and the drivable area mask can indicate a 2D or 3D area that the vehicle can safely pass through. This better meets the needs of subsequent path planning in autonomous driving.
[0108] The method of this disclosure increases the weight of high-speed objects, i.e., objects with a speed greater than 5 m/s, by 3 times. This can effectively solve the perception jitter caused by temporal inconsistency and the inaccurate detection of dynamic objects in related technologies, and effectively improve the accuracy and precision of 3D occupancy grid detection in dynamic scenes, thereby improving the safety and reliability of autonomous driving.
[0109] Figure 3 shows a schematic diagram of the structure of a 3D occupancy grid detection device based on three-factor attention weights provided in an embodiment of this disclosure. Referring to Figure 3, the 3D occupancy grid detection device 300 based on three-factor attention weights in this embodiment may include: a data acquisition unit 301, used to acquire current frame point cloud data, a current frame multi-view image, and current frame 4D radar data of the environment around the vehicle from the multi-view camera array, lidar, and 4D millimeter-wave radar mounted on the vehicle; a multimodal voxelization unit 302, used to multimodally voxelize the current frame point cloud data, the current frame multi-view image, and the current frame 4D radar data to obtain multimodal scene features; a memory encoding unit 303, used to generate a current frame memory entry based on the multimodal scene features, where the memory entry includes a memory key and a memory value, and the memory value contains memory features; a memory retrieval unit 304, used to retrieve the relevant memory segments using the memory key in the current frame memory entry; an uncertainty quantization unit 305, used to obtain the historical memory confidence and to generate the current frame uncertainty heatmap and the current frame voxel entropy map using the multimodal scene features and the historical memory confidence; a feature fusion unit 306, used to determine the three-factor attention weight of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and to fuse the memory features in all relevant memory segments and the current frame memory entry through the three-factor attention weight of each relevant memory segment to obtain fused features; and a multitasking unit 307, used to obtain a 3D occupancy probability grid using the fused features.
[0110] Furthermore, the feature fusion unit 306 can be specifically used to: determine the three-factor adjustment coefficient, which is the product of the time decay factor, the noise suppression factor, and the motion compensation factor. The time decay factor is calculated based on the timestamps in the relevant memory segments, the noise suppression factor is obtained based on the current uncertainty value, the current uncertainty value is obtained based on the current frame uncertainty heatmap and the current frame voxel entropy value map, and the motion compensation factor is obtained based on the dynamic features within the memory features of the current frame memory entries and the dynamic segments in the relevant memory segments; calculate the spatial similarity between the current frame memory entries and the relevant memory segments; and calculate the product of the three-factor adjustment coefficient and the spatial similarity to obtain the three-factor attention weights of the relevant memory segments.
[0111] Furthermore, the feature fusion unit 306 can be specifically used to: use the three-factor attention weights of each relevant memory segment to weighted sum the static segments of each relevant memory segment to obtain static aggregated features; use the three-factor attention weights of each relevant memory segment to weighted sum the dynamic segments of each relevant memory segment to obtain dynamic aggregated features; use the three-factor attention weights of each relevant memory segment to weighted sum the uncertain segments of each relevant memory segment to obtain historical uncertain features; fuse the current frame uncertainty heatmap and the current frame voxel entropy map to obtain a current uncertainty comprehensive map; concatenate the current uncertainty comprehensive map and historical uncertain features in the channel dimension to obtain a gated feature map; process the gated feature map through a gating generation network to obtain a first gating signal, a second gating signal, and a third gating signal; and fuse the memory features, static aggregated features, and dynamic aggregated features in the current frame memory entries using the first gating signal, the second gating signal, and the third gating signal to obtain fused features.
[0112] Furthermore, the uncertainty quantization unit 305 can be specifically used to: process multimodal scene features and historical memory confidence through a pre-trained dynamic uncertainty estimation model to obtain the current frame uncertainty heatmap.
[0113] Furthermore, the uncertainty quantization unit 305 can be specifically used to: calculate the entropy value of each voxel in the following way: normalize the feature value of each voxel in the multimodal scene features to map to the range of [0, 1], quantize the normalized feature value into N intervals, calculate the interval index to which each feature value belongs, count the number of feature values in each interval and calculate the ratio of the number of feature values in each interval to the total number of feature values to obtain the probability of each interval, and calculate the entropy value of each voxel based on the probability of each interval, where N is a preset fixed value.
[0114] In specific applications, the 3D occupancy grid detection device 300 can be implemented through software, hardware, or a combination of both. For example, the 3D occupancy grid detection device 300 can be implemented as software running in the electronic device 400 described below.
[0115] Figure 4 shows a schematic structural diagram of an electronic device provided according to an embodiment of this disclosure. Referring to Figure 4, the electronic device 400 provided in this embodiment may include: one or more processors 401 and a memory 402. The memory 402 stores a computer program, which, when run by the one or more processors 401, causes the processors 401 to perform the aforementioned 3D occupancy grid detection method.
[0116] Processor 401 may be, but is not limited to, a central processing unit (CPU), a graphics processing unit (GPU), or other processing units with data processing capabilities and / or instruction execution capabilities.
[0117] Memory 402 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and / or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor may execute the program instructions to implement the methods described above and / or other desired functions.
[0118] Depending on the specific application, the electronic device 400 may also include any other suitable components.
[0119] In addition to the methods and devices described above, embodiments of this disclosure may also be computer program products, including computer program instructions that, when executed by a processor, cause the processor to perform the steps in the above-described 3D occupancy grid detection method based on three-factor attention weights.
[0120] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this disclosure. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on a user's computing device, partially on a user's computing device, as a standalone software package, partially on a user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.
[0121] Furthermore, embodiments of this disclosure also provide a computer-readable storage medium having a computer program stored thereon, which, when run by a processor, causes the processor to perform the steps in the 3D occupancy grid detection method described above.
[0122] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.
[0123] The technical solutions provided in this disclosure have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this disclosure. The descriptions of the embodiments above are only for the purpose of helping to understand the methods and core ideas of this disclosure. Furthermore, those skilled in the art will recognize that, based on the ideas of this disclosure, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this disclosure.
[0124] The above description is merely a preferred embodiment of this disclosure and is not intended to limit this disclosure. Any modifications or equivalent substitutions made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A 3D occupancy grid detection method based on three-factor attention weights, characterized in that the method comprises:
acquiring current frame point cloud data, current frame multi-view images, and current frame 4D radar data of the environment surrounding a vehicle from a multi-view camera array, a lidar, and a 4D millimeter-wave radar mounted on the vehicle;
performing multimodal voxelization on the current frame point cloud data, the current frame multi-view images, and the current frame 4D radar data to obtain multimodal scene features;
generating a current frame memory entry based on the multimodal scene features, wherein the memory entry includes a memory key and a memory value, and the memory value contains memory features;
retrieving relevant memory segments from a memory bank using the memory key in the current frame memory entry;
obtaining a historical memory confidence, and generating a current frame uncertainty heatmap and a current frame voxel entropy map using the multimodal scene features and the historical memory confidence;
determining the three-factor attention weights of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and fusing the memory features in all relevant memory segments and the current frame memory entry through the three-factor attention weights of each relevant memory segment to obtain fused features; and
obtaining a 3D occupancy probability grid using the fused features.
2. The method of claim 1, wherein each relevant memory segment includes a timestamp, and determining the three-factor attention weights of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map comprises:
determining a three-factor adjustment coefficient, which is the product of a time decay factor, a noise suppression factor, and a motion compensation factor, wherein the time decay factor is calculated based on the timestamps in the relevant memory segments, the noise suppression factor is obtained based on a current uncertainty value that is obtained from the current frame uncertainty heatmap and the current frame voxel entropy map, and the motion compensation factor is obtained based on the dynamic features within the memory features of the current frame memory entry and the dynamic segments in the relevant memory segments;
calculating the spatial similarity between the current frame memory entry and the relevant memory segment; and
calculating the product of the three-factor adjustment coefficient and the spatial similarity to obtain the three-factor attention weights of the relevant memory segment.
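The weight computation recited in claim 2 can be illustrated with a minimal Python sketch. Only the product structure (time decay × noise suppression × motion compensation, then multiplied by spatial similarity) comes from the claim. The concrete forms below are assumptions made for illustration and are not stated in the claim: an exponential time decay with a hypothetical `decay_rate`, a noise suppression factor of one minus a scalar uncertainty in [0, 1], a motion compensation factor derived from cosine similarity, and cosine similarity between memory keys as the spatial similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def three_factor_weight(t_now, t_seg, uncertainty,
                        dyn_now, dyn_seg, key_now, key_seg,
                        decay_rate=0.5):
    """Attention weight of one relevant memory segment.

    All functional forms are illustrative assumptions; the claim only
    fixes that the weight is (product of three factors) * spatial similarity.
    """
    # Time decay factor: older timestamps contribute less (assumed exponential).
    time_decay = math.exp(-decay_rate * max(t_now - t_seg, 0.0))
    # Noise suppression factor: high current uncertainty damps the weight
    # (assumed 1 - u, with u clamped to [0, 1]).
    noise_suppression = 1.0 - min(max(uncertainty, 0.0), 1.0)
    # Motion compensation factor: agreement of dynamic features, mapped
    # from [-1, 1] to [0, 1] (assumed form).
    motion_compensation = 0.5 * (1.0 + cosine(dyn_now, dyn_seg))
    coefficient = time_decay * noise_suppression * motion_compensation
    # Spatial similarity between current entry and segment (assumed cosine).
    spatial_similarity = cosine(key_now, key_seg)
    return coefficient * spatial_similarity
```

Under these assumptions, a segment with the same timestamp, zero uncertainty, and identical dynamic features and keys receives a weight of 1, and the weight shrinks as the segment ages or as the current uncertainty grows.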
3. The method of claim 1, wherein the relevant memory segments include static segments, dynamic segments, and uncertain segments, and fusing the memory features in all relevant memory segments and the current frame memory entry through the three-factor attention weights of each relevant memory segment to obtain fused features comprises:
weighting the static segments of each relevant memory segment using the three-factor attention weights of each relevant memory segment to obtain a static aggregate feature;
weighting the dynamic segments of each relevant memory segment using the three-factor attention weights of each relevant memory segment to obtain a dynamic aggregate feature;
weighting and summing the uncertain segments of each relevant memory segment using the three-factor attention weights of each relevant memory segment to obtain historical uncertainty features;
fusing the current frame uncertainty heatmap and the current frame voxel entropy map to obtain a current uncertainty comprehensive map;
concatenating the current uncertainty comprehensive map and the historical uncertainty features in the channel dimension to obtain a gating feature map;
processing the gating feature map with a gating generation network to obtain a first gating signal, a second gating signal, and a third gating signal; and
fusing the memory features in the current frame memory entry, the static aggregate feature, and the dynamic aggregate feature using the first gating signal, the second gating signal, and the third gating signal to obtain the fused features.
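The aggregation and gated fusion of claim 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed network: features are flat Python lists, the aggregation is an unnormalized weighted sum, and the three gating signals are scalars passed in by the caller as a stand-in for the output of the gating generation network, which is not modeled here. The function names `weighted_sum` and `gated_fusion` are hypothetical.

```python
def weighted_sum(weights, segments):
    """Aggregate per-segment feature vectors with their attention weights.

    Assumption: a plain weighted sum without normalizing the weights.
    """
    dim = len(segments[0])
    return [sum(w * seg[i] for w, seg in zip(weights, segments))
            for i in range(dim)]


def gated_fusion(current, static_agg, dynamic_agg, gates):
    """Fuse current, static, and dynamic features with three gating signals.

    Assumption: each gate is a scalar that scales one branch, and the
    scaled branches are added element-wise.
    """
    g1, g2, g3 = gates
    return [g1 * c + g2 * s + g3 * d
            for c, s, d in zip(current, static_agg, dynamic_agg)]


# Usage with two relevant memory segments and hypothetical values.
attention = [0.5, 0.5]
static_agg = weighted_sum(attention, [[2.0, 4.0], [4.0, 8.0]])
dynamic_agg = weighted_sum(attention, [[0.0, 2.0], [0.0, 2.0]])
fused = gated_fusion([1.0, 1.0], static_agg, dynamic_agg, (1.0, 0.5, 0.25))
```

Keeping the gates as inputs separates the aggregation arithmetic from the learned gating network, so the former can be checked on its own.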
4. The method of claim 1, wherein generating the current frame uncertainty heatmap and the current frame voxel entropy map using the multimodal scene features and the historical memory confidence comprises: processing the multimodal scene features and the historical memory confidence using a pre-trained dynamic uncertainty estimation model to obtain the current frame uncertainty heatmap.
5. The method of claim 1, wherein the current frame voxel entropy map includes the entropy value of each voxel, and generating the current frame uncertainty heatmap and the current frame voxel entropy map using the multimodal scene features and the historical memory confidence comprises calculating the entropy value of each voxel as follows:
normalizing the feature values of each voxel in the multimodal scene features to map them to the range [0, 1];
quantizing the normalized feature values into N intervals and calculating the interval index of each feature value, where N is a preset fixed value;
counting the number of feature values in each interval and calculating the ratio of that number to the total number of feature values to obtain the probability of each interval; and
calculating the entropy value of each voxel based on the probability of each interval.
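The per-voxel entropy steps of claim 5 (normalize, quantize into N intervals, count, convert to probabilities, compute entropy) can be sketched in a few lines of Python. The claim does not name the normalization or the entropy formula, so the sketch assumes min-max normalization and Shannon entropy with the natural logarithm; the default `n_bins=8` is a hypothetical value for N.

```python
import math


def voxel_entropy(features, n_bins=8):
    """Entropy of one voxel's feature values via an N-bin histogram.

    Assumptions: min-max normalization to [0, 1] and Shannon entropy
    in nats; neither is specified in the claim.
    """
    if not features:
        return 0.0
    lo, hi = min(features), max(features)
    span = hi - lo
    # Normalize to [0, 1]; a constant feature vector maps to all zeros.
    normalized = [(v - lo) / span if span > 0 else 0.0 for v in features]
    # Quantize: interval index of each value (1.0 goes into the last bin).
    counts = [0] * n_bins
    for v in normalized:
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    # Probability of each interval, then entropy over non-empty intervals.
    total = len(features)
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy
```

With these choices, a voxel whose feature values are all identical has entropy 0, and the entropy reaches its maximum of ln N when the values spread evenly across all N intervals.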
6. The method of claim 1, wherein the 3D occupancy grid detection method is implemented through a pre-trained 3D occupancy grid detection model, and the loss function used during the training of the 3D occupancy grid detection model is determined based on a static consistency loss and a motion smoothness loss.
7. The method of claim 6, wherein the motion smoothness loss is obtained in the following way: dynamic voxel detection is performed using the 3D occupancy probability grids of multiple consecutive frames to determine the current frame velocity, the previous frame velocity, and the current frame acceleration; a velocity continuity loss is calculated based on the current frame velocity and the previous frame velocity; an acceleration constraint loss is calculated based on the current frame acceleration; and the motion smoothness loss is determined based on the velocity continuity loss and the acceleration constraint loss.
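The motion loss of claim 7 can be illustrated with a small Python sketch. The claim names the two components (velocity continuity and acceleration constraint) but not their formulas, so everything concrete here is an assumption: the positions stand for a hypothetical centroid of detected dynamic voxels in three consecutive frames, velocities and acceleration come from finite differences, the velocity continuity loss is a squared difference, the acceleration constraint is a squared hinge above a hypothetical limit `a_max`, and the two are combined as a weighted sum.

```python
import math


def motion_smoothness_loss(p_prev2, p_prev, p_now,
                           dt=0.1, a_max=5.0, w_vel=1.0, w_acc=1.0):
    """Motion loss from three consecutive positions of a dynamic object.

    Every formula below is an illustrative assumption, not the claimed one.
    """
    # Finite-difference velocities for the previous and current frames.
    v_prev = [(b - a) / dt for a, b in zip(p_prev2, p_prev)]
    v_now = [(b - a) / dt for a, b in zip(p_prev, p_now)]
    # Finite-difference acceleration for the current frame.
    acc = [(b - a) / dt for a, b in zip(v_prev, v_now)]
    # Velocity continuity: penalize changes between consecutive velocities.
    velocity_continuity = sum((b - a) ** 2 for a, b in zip(v_prev, v_now))
    # Acceleration constraint: penalize only the magnitude above a_max.
    acc_norm = math.sqrt(sum(a * a for a in acc))
    acceleration_constraint = max(0.0, acc_norm - a_max) ** 2
    return w_vel * velocity_continuity + w_acc * acceleration_constraint
```

Under these assumptions, an object moving at constant velocity incurs zero loss, while a sudden jump in position between frames is penalized through both terms, which is the kind of trajectory jitter the loss is meant to discourage.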
8. A 3D occupancy grid detection device based on three-factor attention weights, characterized in that the device comprises:
a data acquisition unit, used to acquire current frame point cloud data, current frame multi-view images, and current frame 4D radar data of the environment around a vehicle from a multi-view camera array, a lidar, and a 4D millimeter-wave radar mounted on the vehicle;
a multimodal voxelization unit, used to perform multimodal voxelization on the current frame point cloud data, the current frame multi-view images, and the current frame 4D radar data to obtain multimodal scene features;
a memory encoding unit, used to generate a current frame memory entry based on the multimodal scene features, wherein the memory entry includes a memory key and a memory value, and the memory value contains memory features;
a memory retrieval unit, used to retrieve relevant memory segments using the memory key in the current frame memory entry;
an uncertainty quantization unit, used to obtain a historical memory confidence and to generate a current frame uncertainty heatmap and a current frame voxel entropy map using the multimodal scene features and the historical memory confidence;
a feature fusion unit, used to determine the three-factor attention weights of each relevant memory segment using the current frame uncertainty heatmap and the current frame voxel entropy map, and to fuse the memory features in all relevant memory segments and the current frame memory entry through the three-factor attention weights of each relevant memory segment to obtain fused features; and
a multi-task unit, used to obtain a 3D occupancy probability grid using the fused features.
9. An electronic device, comprising: a processor and a memory storing a program, the program comprising instructions that, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a program, the program comprising instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 7.