Four-dimensional occupancy grid panoramic perception method and device based on visual information

By using a visual information-based four-dimensional occupancy grid panoramic perception method, end-to-end panoramic segmentation and target tracking are performed using multi-view images and a four-dimensional panoramic perception model. This solves the problems of limited semantic richness and high cost in existing technologies, and achieves low-cost, semantically rich four-dimensional panoramic perception and accurate target location perception.

CN119810367B · Active · Publication Date: 2026-06-19 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date: 2024-12-10
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies suffer from limited semantic richness and high cost in four-dimensional panoramic perception, especially in time-consistent panoramic segmentation methods based on LiDAR.

Method used

A four-dimensional occupancy grid panoramic perception method based on visual information is adopted. Multi-view images are input into a pre-constructed four-dimensional panoramic perception model. The encoder module extracts three-dimensional volume features, the query vector propagation module updates the four-dimensional query vector, and the decoder realizes the interaction between the three-dimensional volume features and the four-dimensional query vector. End-to-end panoramic segmentation and target tracking are performed by combining the volume cross-attention mechanism with the localization perception loss function.

Benefits of technology

It achieves low-cost, semantically rich four-dimensional panoramic perception, eliminates the need for post-processing, improves the performance of temporally consistent panoramic segmentation, and enhances the accuracy of target location perception.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a method and apparatus for four-dimensional occupancy grid panoramic perception based on visual information. The method acquires multi-view images, inputs these images into a pre-constructed four-dimensional panoramic perception model based on visual information, and outputs a four-dimensional panoramic occupancy grid prediction result. The four-dimensional panoramic perception model based on visual information includes an encoder module, a decoder module, and a query vector propagation module. The encoder module extracts three-dimensional volume features, the query vector propagation module updates the four-dimensional query vector, and the decoder enables the interaction between the three-dimensional volume features and the four-dimensional query vector. Compared with existing technologies, this invention has advantages such as generating the final four-dimensional panoramic occupancy grid prediction result in an end-to-end, streaming manner, eliminating the need for extensive post-processing.

Description

Technical Field

[0001] This invention relates to the field of machine learning technology, and in particular to a method and apparatus for four-dimensional occupancy grid panoramic perception based on visual information.

Background Technology

[0002] In dynamic environments, the perception systems of robots and autonomous vehicles need to estimate the geometry, semantic information, and identity of the current scene in a spatially continuous and temporally consistent manner in order to interact with complex and changing three-dimensional (i.e., 3D) environments.

[0003] In existing technologies, the mainstream approach for perception systems in robots and autonomous vehicles to address the problem of four-dimensional (4D) panoramic perception is a time-consistent panoramic segmentation method based on LiDAR. This method divides the entire LiDAR sequence into small windows of data frames, overlays the point clouds within each window, performs 3D panoramic segmentation, and then associates objects between windows based on their degree of overlap. Finally, the labeled point clouds are split back into the original data frames to achieve panoramic perception. However, because the segmentation itself is performed in 3D, this way of achieving 4D panoramic perception inevitably requires post-processing, has limited semantic richness, and is costly to implement.

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of the existing technology by providing a four-dimensional occupancy grid panoramic perception method and device based on visual information. It extracts three-dimensional volume features from multi-view images and uses a query vector propagation module to provide four-dimensional query vectors for panoramic segmentation and target tracking. A decoder realizes the interaction between the three-dimensional volume features and the four-dimensional query vector, so that the final four-dimensional panoramic occupancy grid prediction result is generated in an end-to-end, streaming manner, eliminating the need for extensive post-processing.

[0005] The objective of this invention can be achieved through the following technical solutions:

[0006] According to a first aspect of the present invention, a four-dimensional occupancy grid panoramic perception method based on visual information is provided, comprising the following steps: acquiring multi-view images; inputting the multi-view images into a pre-constructed four-dimensional panoramic perception model based on visual information, and outputting a four-dimensional panoramic occupancy grid prediction result; wherein, the four-dimensional panoramic perception model based on visual information includes an encoder module, a decoder module, and a query vector propagation module, the encoder module is used to extract three-dimensional volume features, the query vector propagation module is used to update a four-dimensional query vector, and the decoder is used to realize the interaction between the three-dimensional volume features and the four-dimensional query vector.

[0007] As a preferred technical solution, the encoder module includes an image feature extractor and a two-dimensional to three-dimensional conversion submodule. The image feature extractor is used to capture multi-scale features from the multi-view image, and the two-dimensional to three-dimensional conversion submodule is used to convert the multi-scale features into three-dimensional volume features.

[0008] As a preferred technical solution, the four-dimensional query vector is a concatenation of a new query vector and a tracking query vector. The new query vector is used to determine newly detected objects, and the tracking query vector is used to determine previously detected tracked objects.

[0009] As a preferred technical solution, the decoder outputs an updated four-dimensional query vector, which is then returned to the decoder through the query vector propagation module for processing the next frame sample. Specifically, this includes: during the training process of the four-dimensional panoramic perception model, converting the query vectors that match countable objects in the updated new query vectors into tracking query vectors, and removing query vectors that have not been matched for multiple consecutive frames from the updated tracking query vectors.

[0010] As a preferred technical solution, during the testing process of the four-dimensional panoramic perception model, newly detected objects and disappeared tracked objects are determined by the predicted classification scores. In the updated new query vectors, query vectors with classification scores higher than a first preset score threshold and matching countable objects are retained. In the updated tracking query vectors, query vectors with scores lower than a second preset score threshold for multiple consecutive frames are removed.

[0011] As a preferred technical solution, the interaction between the three-dimensional volume features and the four-dimensional query vector is achieved through a volume cross-attention mechanism, specifically including: each new query uses a multilayer perceptron layer to predict the corresponding 3D reference point; each tracking query maintains the corresponding reference point in time; the three-dimensional volume features are sampled around each reference point; the sampled features are weighted and summed to output the interaction result.

[0012] As a preferred technical solution, the volume cross-attention mechanism includes a volume cross-attention mechanism based on deformable attention.

[0013] As a preferred technical solution, during the training process, the four-dimensional panoramic perception model based on visual information is trained under supervision using a total loss function. The total loss function is the sum of the mask classification loss function and the localization perception loss function. The mask classification loss is used to evaluate the quality of the predicted mask, and the localization perception loss is used to evaluate the quality of the predicted query location.

[0014] As a preferred technical solution, the expression for the localization-aware loss function is:

[0015] $\mathcal{L}_{loc} = L_1(P', P)$

[0016] In the formula, $\mathcal{L}_{loc}$ represents the localization perception loss, $P'$ represents the center point of each object in the prediction result, $P$ represents the center point of the corresponding object in the true value, and $L_1$ represents the Manhattan distance.

[0017] According to a second aspect of the present invention, a four-dimensional occupancy grid panoramic perception device based on visual information is provided, comprising a memory, a processor, and a program stored in the memory, wherein the processor executes the program to implement the method described therein.

[0018] Compared with the prior art, the present invention has the following beneficial effects:

[0019] 1. This invention introduces a query vector propagation module to realize a query mechanism for four-dimensional panoramic occupancy grid perception. This mechanism can realize panoramic perception of three-dimensional grids with consistent time sequence. Compared with the time sequence consistent panoramic perception based on LiDAR, it can use cheaper and richer image information, and realize a perception method that is more in line with actual use scenarios in a streaming and end-to-end manner. It can eliminate the need for a lot of post-processing and reduce the cost of use while having rich semantic information.

[0020] 2. This invention proposes a localization perception loss function, which can guide a four-dimensional panoramic perception model based on visual information to perceive the location information of the tracked target more accurately, thereby improving the performance of the method for temporally consistent panoramic segmentation.

Attached Figure Description

[0021] Figure 1 is a schematic diagram of the framework flow of the method provided by the present invention;

[0022] In the figure, the operator symbol denotes the element-wise product.

Detailed Implementation

[0023] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.

[0024] Embodiment 1:

[0025] Occupancy is a 3D visual representation in the field of autonomous driving perception. This representation divides the entire environment into a cubic grid, with each grid assigned a semantic label. The method proposed in this invention incorporates a temporal dimension into the occupancy grid to achieve end-to-end 4D occupancy grid panoramic perception.

[0026] In the 4D panoramic perception problem, mainstream LiDAR-based temporally consistent panoramic segmentation methods achieve 4D panoramic perception through 3D panoramic perception models plus post-processing, which is costly and has limited semantic richness. Currently, there is no corresponding method to achieve end-to-end dynamic environment temporally consistent panoramic perception based on pure vision.

[0027] Specifically, in practical applications, machine perception systems need to rely on images captured by inexpensive multi-view cameras as input, then assign temporally consistent semantic labels to the surrounding dynamic scene, and assign temporally consistent identity labels to each countable object. Furthermore, the entire reasoning process needs to be end-to-end, without requiring post-processing; otherwise, deployment in real-world scenarios is impossible. If these requirements are met, machine perception systems will be able to achieve a higher level of understanding of surrounding dynamic objects and scenes, thereby enabling more sophisticated downstream tasks such as relative position determination, obstacle avoidance path planning, and reliable interaction with the dynamic environment.

[0028] Based on this, this embodiment provides a visual information-based four-dimensional occupancy grid panoramic perception method (TrackOcc method), which is a tracker method based on end-to-end learning, aiming to achieve 4D panoramic occupancy tracking based on visual information and solve the problems of occupancy panoramic segmentation and target tracking. This method uses a mask classification method to achieve panoramic occupancy prediction and introduces a 4D query vector into the prediction framework. The main implementation is as follows: after acquiring multi-view images, the multi-view images are input into a pre-constructed visual information-based four-dimensional panoramic perception model, and the four-dimensional panoramic occupancy grid prediction result is output. The visual information-based four-dimensional panoramic perception model includes an encoder module, a decoder module, and a query vector propagation module. The encoder module is used to extract three-dimensional volume features, the query vector propagation module is used to provide four-dimensional query vectors, and the decoder is used to realize the interaction between three-dimensional volume features and four-dimensional query vectors. Its framework flow is shown in Figure 1, and the specific process of the method is as follows:

[0029] Step S1: a multi-view camera is used to acquire video streams, i.e., multi-view images, which can be represented as:

[0030]

[0031] In the formula, M represents the number of surrounding cameras, K represents the sequence length, and t represents time.

[0032] Given a predetermined set of C semantic categories, encoded as $\mathcal{C} := \{0, \dots, C-1\}$, the 4D panoramic occupancy tracking task requires the neural network to map each voxel $i$ in the grid to a pair $(c_i, z_i) \in \mathcal{C} \times \mathbb{N}$, where $c_i$ represents the semantic category of voxel $i$ and $z_i$ represents the object ID of voxel $i$. The semantic label set includes thing categories (countable objects with clearly defined boundaries) and stuff categories (uncountable regions in the environment without clearly defined boundaries). When a voxel is labeled as stuff, its object ID $z_i$ is irrelevant. Voxels in free space are assigned a special `free` label. The object ID $z_i$ groups voxels of the same category into distinct instances, and these IDs should remain unchanged throughout the sequence. The grid dimensions are X×Y×Z, representing the height, width, and depth of the grid, respectively. Here, a voxel is a small cube, i.e., a unit volume data element within a regular grid in three-dimensional space.
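As an illustration of the label format described above, the following minimal sketch stores one (semantic category, object ID) pair per voxel and groups the voxels of countable objects by instance. The class names, the grid size, and the helpers `make_grid` and `instance_voxels` are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical label set: index 0 is the special free label.
CLASSES = ["free", "road", "car", "pedestrian"]
THING_CLASSES = {"car", "pedestrian"}  # countable; "road" is stuff


def make_grid(x, y, z):
    """Create an X x Y x Z grid where every voxel is (free, object ID 0)."""
    return [[[(0, 0) for _ in range(z)] for _ in range(y)] for _ in range(x)]


def instance_voxels(grid):
    """Group voxel coordinates by (category, object ID), thing classes only."""
    groups = defaultdict(list)
    for i, plane in enumerate(grid):
        for j, row in enumerate(plane):
            for k, (c, obj_id) in enumerate(row):
                if CLASSES[c] in THING_CLASSES:
                    groups[(c, obj_id)].append((i, j, k))
    return dict(groups)


grid = make_grid(4, 4, 2)
grid[0][0][0] = (1, 0)  # road: stuff, so the object ID is ignored
grid[1][1][0] = (2, 7)  # car with object ID 7
grid[1][2][0] = (2, 7)  # same car, second voxel
grid[3][3][1] = (3, 9)  # pedestrian with object ID 9
groups = instance_voxels(grid)
```

For temporal consistency, the same object would be expected to keep the same key, here `(2, 7)` for the car, in every frame of the sequence.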

[0033] Step S2 involves inputting multi-view images into a pre-constructed four-dimensional panoramic perception model based on visual information, and outputting a four-dimensional panoramic occupancy grid prediction result. The prediction process specifically includes three parts:

[0034] In the first part, the multi-view images are processed by the encoder module to obtain three-dimensional volume features.

[0035] The encoder module includes an image feature extractor and a two-dimensional to three-dimensional (i.e., 2D-to-3D) conversion submodule. The image feature extractor extracts multi-scale features from multi-view images, and the 2D-to-3D conversion submodule converts these multi-scale features into downsampled 3D volumetric features. The 3D volumetric feature V is represented as:

[0036]

[0037] In the formula, X, Y, and Z represent the height, width, and depth of the grid, respectively, the remaining dimension is the number of feature channels, and $\mathbb{R}$ represents the set of real numbers.

[0038] In this embodiment, the COTR encoder is used as the 2D-3D conversion submodule. In some other embodiments, the 2D-3D conversion submodule may employ other methods such as LSS, BevDepth, BEVFormer, and FB-Occ.

[0039] In the second part, the four-dimensional query vector is updated through the query vector propagation module.

[0040] The TrackOcc method proposed in this embodiment introduces a four-dimensional query vector, which interacts with query points in the three-dimensional volume features through a volume cross-attention mechanism. Specifically, newly detected objects are bound to newly generated query vectors $Q_{em}$, while objects detected in previous frames keep using the same query vectors, named the tracking query vectors $Q_{tr}$. The new query vectors and the tracking query vectors are concatenated to form the four-dimensional query vector.

[0041] The new query vectors are used for stuff and for detecting newly appearing things, while the tracking query vectors are responsible for predicting all existing tracked objects. For simplicity, stuff is assigned to the new query vectors because it does not require tracking. The set of tracking query vectors is dynamically updated, and its size changes over time. In the first frame, there are no tracking query vectors; only fixed-length, learnable new query vectors are input into the decoder. In subsequent frames, the concatenated four-dimensional query vectors for frame t are input into the decoder. These queries interact with the 3D volumetric features in the decoder to generate updated 4D query vectors. The updated four-dimensional query vectors are used both to generate the final 4D panoramic occupancy grid prediction result and to feed the query vector propagation module, which generates the tracking queries for the next frame.

[0042] A query vector propagation module is introduced to handle objects that may appear or disappear in intermediate frames. The updated four-dimensional query vectors are returned through the query vector propagation module for processing the next frame sample. Specifically:

[0043] During training, among the updated new query vectors, only those corresponding to countable objects are converted into tracking query vectors; among the updated tracking query vectors, those that remain unmatched for $T_f$ consecutive frames are removed.

[0044] During test inference, the predicted classification scores are used to determine newly detected objects and disappeared tracked objects. Among the updated new query vectors, only those whose scores are higher than the first preset score threshold τ1 and that correspond to countable objects are retained; among the updated tracking query vectors, those whose scores stay below the second preset score threshold τ2 for $T_f$ consecutive frames are removed.
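The test-time propagation rule can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Query` record, its field names, and the function `propagate` are hypothetical, and the default values for τ1, τ2, and the frame count are the ones reported in the experiments of this embodiment (0.3, 0.25, and 3).

```python
from dataclasses import dataclass


@dataclass
class Query:
    score: float         # predicted classification score
    is_thing: bool       # True if the query predicts a countable object
    low_frames: int = 0  # consecutive frames with a score below tau2


def propagate(new_queries, track_queries, tau1=0.3, tau2=0.25, t_f=3):
    """Return the tracking queries handed to the next frame (test-time rule)."""
    kept = []
    for q in track_queries:
        # Count consecutive low-score frames; reset once the score recovers.
        q.low_frames = q.low_frames + 1 if q.score < tau2 else 0
        if q.low_frames < t_f:
            kept.append(q)
    # New queries become tracks only if they are confident and countable.
    born = [q for q in new_queries if q.is_thing and q.score > tau1]
    return kept + born


tracks = [Query(0.9, True), Query(0.1, True, low_frames=2)]
news = [Query(0.8, True), Query(0.9, False), Query(0.2, True)]
next_tracks = propagate(news, tracks)
```

In this toy frame the second track is dropped because it has now been below τ2 for three consecutive frames, the stuff query and the low-score new query are not promoted, and two tracking queries are passed on.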

[0045] The third part describes the interaction between the three-dimensional volume features and the four-dimensional query vector in the decoder module.

[0046] In the decoder, the 3D volumetric features and the 4D query vector interact through a volumetric cross-attention (VCA) mechanism. The detailed process is as follows:

[0047] First, the query vectors interact with each other, enhancing features through a self-attention layer. Then, they search for and aggregate volumetric features through a Volume Cross-Attention (VCA) layer. Since the input size of volumetric features is large, ordinary attention is computationally expensive. Therefore, a deformable attention-based VCA layer can be used, a resource-efficient attention mechanism where each query interacts only with its region of interest within the volume.

[0048] Specifically, each new query $q \in Q_{em}$ predicts a 3D reference point using a multilayer perceptron (MLP) layer, and each tracking query $q \in Q_{tr}$ maintains its corresponding reference point over time. The volumetric features V are sampled around these reference points, and the sampled features are then weighted and summed to obtain the output.

[0049] The VCA process can be represented as:

[0050]

[0051] In the formula, j is the sampling point index; the remaining symbols denote the attention weight of each sampling point and the learnable weights and biases.
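The sample-then-weight step of VCA can be illustrated with a small sketch. It is a simplification under stated assumptions: the sampling offsets and attention logits are passed in directly instead of being predicted from the query, sampling uses a nearest-voxel lookup rather than interpolation, and the learnable projection is omitted. The function names `sample` and `volume_cross_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Normalize logits into weights that sum to 1."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def sample(volume, point):
    """Nearest-voxel lookup with coordinates clamped to the grid."""
    dims = (len(volume), len(volume[0]), len(volume[0][0]))
    x, y, z = (min(max(int(round(p)), 0), n - 1) for p, n in zip(point, dims))
    return volume[x][y][z]


def volume_cross_attention(volume, ref_point, offsets, logits):
    """Weighted sum of features sampled around one query's reference point."""
    weights = softmax(logits)
    dim = len(volume[0][0][0])
    out = [0.0] * dim
    for w, off in zip(weights, offsets):
        feat = sample(volume, [r + o for r, o in zip(ref_point, off)])
        for d in range(dim):
            out[d] += w * feat[d]
    return out


# Toy 2 x 2 x 2 volume whose feature at (x, y, z) is simply [x, y, z].
volume = [[[[float(x), float(y), float(z)] for z in range(2)]
           for y in range(2)] for x in range(2)]
out = volume_cross_attention(volume, (0, 0, 0), [(0, 0, 0), (1, 0, 0)], [0.0, 0.0])
```

Because each query touches only a handful of sampled voxels instead of the whole X×Y×Z volume, the cost per query is independent of the grid size, which is the motivation given above for the deformable variant.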

[0052] Combining the second and third parts, the four-dimensional panoramic occupancy grid prediction result is obtained by element-wise multiplication of the updated four-dimensional query vectors with the corresponding three-dimensional volume features.
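A minimal sketch of this final step is given below. It assumes, as is common in mask-classification models, that the element-wise product between a query and a voxel feature is summed over the channel dimension to give one mask logit per (query, voxel) pair, and that each voxel is then assigned to the query with the highest logit. The patent text does not spell out these details, and the names `mask_logits` and `assign_voxels` are hypothetical.

```python
def mask_logits(queries, voxels):
    """queries: N x D, voxels: V x D -> N x V logits (product summed over D)."""
    return [[sum(qd * vd for qd, vd in zip(q, v)) for v in voxels]
            for q in queries]


def assign_voxels(logits):
    """For each voxel, pick the index of the query with the highest logit."""
    n_vox = len(logits[0])
    return [max(range(len(logits)), key=lambda n: logits[n][v])
            for v in range(n_vox)]


queries = [[1.0, 0.0], [0.0, 1.0]]             # two updated query vectors
voxels = [[2.0, 0.5], [0.1, 3.0], [1.0, 0.2]]  # three flattened voxel features
logits = mask_logits(queries, voxels)
owners = assign_voxels(logits)
```

Since a tracking query keeps its identity across frames, the voxels it owns inherit a temporally consistent object ID without a separate association step.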

[0053] Furthermore, for countable objects in the prediction results, supervised training is performed using a loss function. In this embodiment, the total loss function is the sum of the commonly used mask classification loss function and the localization-aware loss function proposed in this method.

[0054] The optimization objectives of the TrackOcc method are mainly focused on two aspects: mask prediction and query position prediction.

[0055] Mask classification loss

[0056] Different types of queries are matched with ground truth labels using different strategies. Tracking query vectors persist throughout the stream, ensuring that a ground truth label is assigned only once per query vector across all time steps. Once created by inheriting from a new query vector, a tracking query vector is bound to its corresponding ground truth label and remains unchanged.

[0057] The new query vectors may correspond to different objects in different time frames and have no explicit assignment at each time step. Following Mask2Former, a correspondence between the ground truth labels (4D panorama labels) and the predicted 4D panorama results is established through bipartite matching; this problem is solved using the Hungarian algorithm. Once the matching relationship is established, the mask classification loss related to each query is calculated, which includes a binary mask loss and a multi-class cross-entropy loss.
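The matching step can be illustrated on a toy cost matrix. The sketch below is not the Hungarian algorithm itself: it is a brute-force search over assignments that returns the same minimum-cost matching for small inputs and is used here only to keep the example free of dependencies. The cost values and the function name `match` are hypothetical.

```python
from itertools import permutations


def match(cost):
    """cost[i][j]: cost of assigning prediction i to ground-truth object j.

    Returns the minimum-cost list of (prediction, ground-truth) pairs.
    Assumes there are at least as many predictions as ground-truth objects.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return [(p, g) for g, p in enumerate(best)]


# Three new queries, two ground-truth objects.
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
pairs = match(cost)
```

In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force loop. Only new queries go through this matching; a tracking query keeps the ground-truth object it was bound to when it was created.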

[0058] Localization-aware loss

[0059] Accurately locating and tracking each query position is both important and challenging. Previous work relying on mask classification for 3D panorama prediction, such as SparseOcc, has limited localization capabilities. Therefore, this method proposes a localization-aware loss function. The localization-aware loss function supervises training using the Manhattan distance (i.e., L1 distance) between the center point $P'$ of each object in the prediction result and the center point $P$ of the corresponding object in the ground truth.

[0060] The expression for the localization-aware loss function is:

[0061] $\mathcal{L}_{loc} = L_1(P', P)$, where $\mathcal{L}_{loc}$ denotes the localization-aware loss.

[0062] The localization-aware loss function is used to improve the accuracy of the 3D locations predicted by the queries, as these locations are used as reference points for attention in the VCA module. It is important to note that the center point $P'$ of each object in the prediction results only comes from queries bound to the thing class, because stuff regions lack suitable and consistent ground-truth points; $P$ can be obtained by simply calculating the centroid of each object.
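A minimal sketch of this loss is shown below. It computes each ground-truth center as the centroid of the object's voxels and takes the Manhattan distance to the predicted center. Averaging over objects is an assumption, since the text does not state whether the per-object distances are summed or averaged, and the names `centroid` and `localization_loss` are hypothetical.

```python
def centroid(voxels):
    """Mean coordinate of an object's voxels, used as ground-truth center P."""
    n = len(voxels)
    return tuple(sum(v[a] for v in voxels) / n for a in range(3))


def localization_loss(pred_centers, gt_voxels_per_object):
    """Mean L1 distance between predicted centers P' and centroids P."""
    total = 0.0
    for obj_id, pred in pred_centers.items():
        gt = centroid(gt_voxels_per_object[obj_id])
        total += sum(abs(p - g) for p, g in zip(pred, gt))
    return total / len(pred_centers)


# Two countable objects, keyed by object ID; stuff regions are not included.
gt = {7: [(1, 1, 0), (1, 3, 0)], 9: [(3, 3, 1)]}
pred = {7: (1.0, 2.5, 0.0), 9: (2.0, 3.0, 1.0)}
loss = localization_loss(pred, gt)
```

Here object 7 contributes a distance of 0.5 and object 9 a distance of 1.0, so the averaged loss is 0.75.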

[0063] Therefore, the final total loss function is:

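[0064] $\mathcal{L}_{total} = \mathcal{L}_{mask\text{-}cls} + \mathcal{L}_{loc}$, where $\mathcal{L}_{mask\text{-}cls}$ denotes the mask classification loss and $\mathcal{L}_{loc}$ denotes the localization-aware loss.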

[0065] The effectiveness of the TrackOcc method proposed in this embodiment will be verified through experiments:

[0066] According to the aforementioned methodology, the model sets 200 new query vectors in each frame, with each new query vector having a dimension of 256. The values of τ1, τ2, and $T_f$ were 0.3, 0.25, and 3, respectively. The model was trained for 24 epochs on the training set with a learning rate of $2 \times 10^{-4}$, using the AdamW optimizer. The method was then tested on the test set as described above. The comparison results of the TrackOcc method with other methods are shown in the table below. It can be seen that, compared to similar methods, this method can achieve an end-to-end model and significantly outperforms other solutions in terms of evaluation metrics.

[0067]

[0068]

[0069] Among them, MinVIS and CTVIS are inter-frame association methods based on query vectors; AB3DMOT is a tracking-by-detection method; 4D-LCA follows the approach used to handle the association problem in the LiDAR modality, achieving four-dimensional segmentation by performing three-dimensional segmentation on temporally stacked data.

[0070] The sources for each comparison scheme are as follows:

[0071] [1] D.-A. Huang, Z. Yu, and A. Anandkumar, "MinVIS: A minimal video instance segmentation framework without video-based training," Advances in Neural Information Processing Systems, vol. 35, pp. 31265–31277, 2022.

[0072] [2] K. Ying, Q. Zhong, W. Mao, Z. Wang, H. Chen, L. Y. Wu, Y. Liu, C. Fan, Y. Zhuge, and C. Shen, "CTVIS: Consistent training for online video instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 899–908.

[0073] [3] X. Weng, J. Wang, D. Held, and K. Kitani, "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics," IROS, 2020.

[0074] [4] M. Aygun, A. Osep, M. Weber, M. Maximov, C. Stachniss, J. Behley, and L. Leal-Taixé, "4D panoptic LiDAR segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5527–5537.

[0075] Embodiment 2:

[0076] This embodiment provides a four-dimensional occupancy grid panoramic perception device based on visual information, including a memory, a processor, and a program stored in the memory. When the processor executes the program, it implements one or more steps of the method in Embodiment 1. The device processor includes a central processing unit (CPU), which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) or loaded from the storage unit into random access memory (RAM). Various programs and data required for device operation can also be stored in the RAM. The CPU, ROM, and RAM are interconnected via a bus. Input/output (I/O) interfaces are also connected to the bus. Multiple components in the device are connected to the I/O interfaces, including: input units, such as a keyboard, mouse, etc.; output units, such as various types of displays, speakers, etc.; storage units, such as disks, optical disks, etc.; and communication units, such as network interface cards, modems, wireless transceivers, etc. The communication units allow the device to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks. The processing unit performs the various methods and processes described above, such as one or more steps of the method in Embodiment 1. For example, in some embodiments, one or more steps of the method in Embodiment 1 may be implemented as a computer software program tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed on the device via ROM and/or a communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of the method in Embodiment 1 described above may be performed.
Alternatively, in other embodiments, the CPU may be configured to perform one or more steps of the method in Embodiment 1 by any other suitable means (e.g., by means of firmware). The functionality described above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and the like.

[0077] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.

Claims

1. A four-dimensional occupancy grid panoramic perception method based on visual information, characterized in that it comprises the following steps: acquiring multi-view images; inputting the multi-view images into a pre-constructed four-dimensional panoramic perception model based on visual information, and outputting a four-dimensional panoramic occupancy grid prediction result; wherein the four-dimensional panoramic perception model based on visual information includes an encoder module, a decoder module, and a query vector propagation module; the encoder module is used to extract three-dimensional volume features, the query vector propagation module is used to update the four-dimensional query vector, and the decoder module is used to realize the interaction between the three-dimensional volume features and the four-dimensional query vector; the four-dimensional query vector is a concatenation of a new query vector and a tracking query vector; the decoder module outputs an updated four-dimensional query vector, which is then returned to the decoder module via the query vector propagation module for processing the next frame sample, specifically including: during the training process of the four-dimensional panoramic perception model, the query vectors that match countable objects in the updated new query vectors are converted into tracking query vectors, and the query vectors that are not matched for multiple consecutive frames in the updated tracking query vectors are removed;
the interaction between the three-dimensional volume features and the four-dimensional query vector is achieved through a volume cross-attention mechanism, specifically including: each new query vector uses a multilayer perceptron layer to predict the corresponding 3D reference point; each tracking query vector maintains a corresponding reference point over time; the three-dimensional volume features are sampled around each reference point; and the sampled features are weighted and summed to output the interaction result.

2. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 1, characterized in that the encoder module includes an image feature extractor and a 2D-3D conversion submodule; the image feature extractor is used to capture multi-scale features from the multi-view image, and the 2D-3D conversion submodule is used to convert the multi-scale features into 3D volumetric features.

3. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 1, characterized in that the new query vector is used to identify newly detected objects, and the tracking query vector is used to identify previously detected tracked objects.

4. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 1, characterized in that, during the testing of the four-dimensional panoramic perception model, newly detected objects and disappeared tracked objects are determined using the predicted classification scores; in the updated new query vectors, query vectors with classification scores higher than a first preset score threshold and matching countable objects are retained; in the updated tracking query vectors, query vectors with scores lower than a second preset score threshold for multiple consecutive frames are removed.

5. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 1, characterized in that the volume cross-attention mechanism includes a volume cross-attention mechanism based on deformable attention.

6. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 1, characterized in that, during the training process of the four-dimensional panoramic perception model, the visual information-based four-dimensional panoramic perception model is trained under supervision using a total loss function; the total loss function is the sum of the mask classification loss function and the localization perception loss function; the mask classification loss is used to evaluate the quality of the predicted mask, and the localization perception loss is used to evaluate the quality of the predicted query location.

7. The four-dimensional occupancy grid panoramic perception method based on visual information according to claim 6, characterized in that the expression for the localization-aware loss function is: $\mathcal{L}_{loc} = L_1(P', P)$; in the formula, $\mathcal{L}_{loc}$ represents the localization perception loss, $P'$ represents the center point of each object in the prediction result, $P$ represents the center point of the corresponding object in the true value, and $L_1$ represents the Manhattan distance.

8. A four-dimensional occupancy grid panoramic perception device based on visual information, comprising a memory, a processor, and a program stored in the memory, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1-7.