Object detection with multiple distances and resolutions
By using a multi-branch structure of a deep neural network to process Cartesian grid data with different spatial dimensions and resolutions, the problems of memory requirements and computational costs in grid-based perception systems are solved, and more efficient object detection is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- APTIV TECHNOLOGIES AG
- Filing Date
- 2022-01-30
- Publication Date
- 2026-06-23
Figure CN114839628B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to apparatus, methods, and computer programs for object detection in a vehicle's surrounding environment using deep neural networks. The apparatus can be installed in a vehicle to detect objects for the vehicle.
Background Technology
[0002] Automotive perception systems typically employ multiple sensors, such as camera-based sensors, radar-based sensors, and lidar-based sensors, or a combination thereof. The data generated by these sensors often exhibits significant differences. For example, data from camera-based sensors is typically formatted as a sequence of images in a perspective view, while data from radar-based and lidar-based sensors is typically formatted as point clouds and grids in a bird's-eye view (BEV), i.e., an overhead view of an object. In recent years, radar-based and lidar-based sensors have become increasingly essential components in autonomous driving systems due to their superior ability to perceive target shape and / or distance, and their lower susceptibility to weather or other environmental conditions.
[0003] Thanks to recent advances in deep neural networks, perception systems with radar-based and / or lidar-based sensors have achieved great success. In these systems, data is typically represented as point clouds or grids. Although point cloud processing is receiving increasing attention, grid-based systems still hold an advantage in available products due to their simplicity of design and similarity to image processing, which has been extensively studied.
Summary of the Invention
[0004] Technical issues
[0005] In grid-based perception systems, given a spatial distance and a resolution within that distance, data is typically represented as a 2D or 3D grid in a world Cartesian coordinate system centered on the ego vehicle (e.g., an autonomous vehicle or robot), i.e., in a BEV. A drawback of this data structure is that its size (e.g., the number of nodes in the grid) increases quadratically (2D) or cubically (3D) with increasing spatial distance or resolution. For example, for a 2D grid, if the resolution is fixed, doubling the distance (e.g., from 40m to 80m) makes the 2D grid four times larger. Similarly, if the spatial distance is fixed, doubling the resolution also makes the 2D grid four times larger. This quadratic growth increases memory consumption and computational costs, which is often prohibitive, for example, because of the need to make real-time driving decisions for the ego vehicle.
[0006] Data from radar-based and / or lidar-based sensors is typically stored in a polar grid with fixed spatial distance and angular resolution. Polar grids are nevertheless often converted to Cartesian grids for better compatibility with deep neural networks, and the spatial resolution of a Cartesian grid differs from that of a polar grid. For example, at close range, one cell in a Cartesian grid corresponds to multiple cells in a polar grid, while at long range, multiple cells in a Cartesian grid correspond to one cell in a polar grid.
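The scaling described in paragraph [0005] can be checked with a few lines of arithmetic. The following is a minimal sketch; the helper name `grid_cells` and the 0.5m cell size are illustrative assumptions, not values from the disclosure.

```python
def grid_cells(extent_m: float, resolution_m: float, dims: int = 2) -> int:
    """Number of cells in a square (dims=2) or cubic (dims=3) grid."""
    cells_per_axis = round(extent_m / resolution_m)
    return cells_per_axis ** dims


base = grid_cells(40.0, 0.5)                      # 80 x 80 cells
double_range = grid_cells(80.0, 0.5)              # distance doubled, resolution fixed
double_res = grid_cells(40.0, 0.25)               # distance fixed, cell size halved
base_3d = grid_cells(40.0, 0.5, dims=3)
double_range_3d = grid_cells(80.0, 0.5, dims=3)

print(double_range / base)        # 4.0 (quadratic growth in 2D)
print(double_res / base)          # 4.0
print(double_range_3d / base_3d)  # 8.0 (cubic growth in 3D)
```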
[0007] While there are studies on various spatial distances in the literature, they haven't addressed the technical challenges of effective data representation. For example, Wang et al.'s "Range adaptation for 3D object detection in lidar" (2019 IEEE / CVF International Conference on Computer Vision Workshop (ICCVW)) adapts features extracted from distant locations to those from nearby locations to obtain more uniform features, thus improving performance for detecting more distant targets. Here, feature alignment is performed in the feature space without considering the spatial correspondence of overlapping regions, thus simply aligning feature maps from different distances into a unified feature space. This is done in a network with a special loss function to measure the similarity of feature maps from different distances.
[0008] In addition, Engels et al.'s "3D Object Detection from LiDAR data using distance-dependent feature extraction" (Computer Vision and Pattern Recognition, 2020) trained separate neural networks for short range and long range with the same grid resolution, and then performed late fusion to combine their outputs.
[0009] Solution
[0010] According to a first aspect, a computer-implemented method for object detection in a vehicle surrounding environment using a deep neural network is provided, the method comprising: inputting a first set of sensor-based data for a first Cartesian grid having a first spatial dimension and a first spatial resolution into a first branch of the deep neural network; inputting a second set of sensor-based data for a second Cartesian grid having a second spatial dimension and a second spatial resolution into a second branch of the deep neural network; providing an interaction between the first branch of the deep neural network and the second branch of the deep neural network at an intermediate level of the deep neural network to consider features identified for overlapping spatial regions of the first and second spatial dimensions, respectively, in processing of subsequent layers of the deep neural network; and fusing a first output of the first branch of the deep neural network and a second output of the second branch of the deep neural network to detect objects in the vehicle surrounding environment.
[0011] According to the second aspect, the first spatial dimension is different from the second spatial dimension, and the first spatial resolution is different from the second spatial resolution.
[0012] According to the third aspect, interaction is also provided by resampling the first intermediate output of the first branch at the intermediate level and by resampling the second intermediate output of the second branch at the intermediate level.
[0013] According to the fourth aspect, further interaction is provided by merging the first intermediate output with the resampled second intermediate output and by merging the second intermediate output with the resampled first intermediate output.
[0014] According to the fifth aspect, the merging includes generating a first concatenation of the first intermediate output and the resampled second intermediate output, and generating a second concatenation of the second intermediate output and the resampled first intermediate output.
[0015] According to the sixth aspect, the merging further includes reducing the first concatenation to generate a first reduced output, and reducing the second concatenation to generate a second reduced output, wherein the first reduced output and the second reduced output are used for processing in subsequent layers of the deep neural network.
[0016] According to the seventh aspect, the first reduced output or the second reduced output is used to replace the corresponding portion of the first intermediate output or the second intermediate output.
[0017] According to the eighth aspect, the fusion includes filtering out one or more overlapping bounding boxes.
[0018] According to the ninth aspect, the fusion includes prioritizing information from the first output or the second output by using distance information.
[0019] According to the tenth aspect, the first output and the second output include information relating to one or more of the following: target category, bounding box, object position, object size, object orientation, and object velocity.
[0020] According to the eleventh aspect, a computer program includes instructions that, when executed by a computer, cause the computer to perform the method of any one of the first to tenth aspects.
[0021] According to a twelfth aspect, an apparatus is provided for object detection in a vehicle's surrounding environment using a deep neural network, wherein the apparatus includes: an acquisition unit configured to acquire sensor-based data about each of one or more radar antennas or lasers; and a determination unit configured to: input a first set of sensor-based data for a first Cartesian grid having a first spatial dimension and a first spatial resolution into a first branch of the deep neural network; input a second set of sensor-based data for a second Cartesian grid having a second spatial dimension and a second spatial resolution into a second branch of the deep neural network; provide an interaction between the first branch and the second branch of the deep neural network at an intermediate level of the deep neural network to consider features identified for overlapping spatial regions of the first and second spatial dimensions, respectively, in processing of subsequent layers of the deep neural network; and fuse a first output of the first branch of the deep neural network and a second output of the second branch of the deep neural network to detect objects in the vehicle's surrounding environment.
[0022] According to the thirteenth aspect, the apparatus further includes one or more radar antennas and / or lasers.
[0023] According to the fourteenth aspect, the one or more radar antennas and / or lasers are configured to transmit signals and detect return signals; and the acquisition unit is configured to acquire the sensor-based data based on the return signals.
[0024] According to the fifteenth aspect, a vehicle has one or more apparatuses according to any one of the twelfth to fourteenth aspects.
Attached Figure Description
[0025] Figure 1 shows an apparatus according to an embodiment of the present disclosure.
[0026] Figure 2 shows an apparatus for detecting objects in the environment surrounding a vehicle, according to a preferred embodiment of the present disclosure.
[0027] Figure 3 shows a flowchart of a method according to an embodiment of the present disclosure.
[0028] Figure 4 shows a flowchart of a method according to another embodiment of this disclosure.
[0029] Figure 5 shows a flowchart of a method according to another embodiment of this disclosure (long-distance and short-distance fusion).
[0030] Figure 6 shows a flowchart of a method according to another embodiment of this disclosure (long-distance and short-distance fusion).
[0031] Figure 7 shows a computer according to a preferred embodiment.
Detailed Implementation
[0032] Embodiments of the present disclosure will now be described with reference to the accompanying drawings. Numerous specific details are set forth in the following detailed description. These specific details are intended only to provide a thorough understanding of the various described embodiments. Furthermore, although the terms first, second, etc., may be used to describe various elements, these elements are not limited by these terms. These terms are used only to distinguish one element from another.
[0033] Based on the concepts of this disclosure, a unified neural network architecture is proposed to process sensor-based data in a bird's-eye view (BEV). The sensor-based data is represented at different spatial resolutions, each processed by a different branch (or head, in the terminology of deep neural networks). These branches interact with each other at intermediate levels of the deep neural network. The final detection output is a fusion of the outputs of all branches.
[0034] Figure 1 shows an apparatus 100 for object detection in a vehicle's surrounding environment using a deep neural network, according to an embodiment of the present disclosure. As shown in Figure 2, the device 100 can be provided on a vehicle 200 and can preferably be mounted on the vehicle 200 facing the direction of travel. Those skilled in the art will understand that device 100 does not need to face the direction of travel; device 100 can also face laterally or rearward. Device 100 can be a radar sensor, radar module, part of a radar system, etc. Device 100 can also be a light detection and ranging (LiDAR) type sensor, LiDAR type module, or part of a LiDAR type system that uses laser pulses (especially infrared laser pulses) instead of radio waves.
[0035] Vehicle 200 can be any land vehicle that moves by mechanical power. Such vehicle 200 can also be associated with railway tracks, levitation, underwater, or airborne systems. The accompanying drawings illustrate vehicle 200 as a car equipped with device 100. However, this disclosure is not limited thereto. Therefore, device 100 can also be installed on, for example, trucks, lorries, agricultural vehicles, motorcycles, trains, buses, airplanes, drones, boats, ships, robots, etc.
[0036] The device 100 may have multiple detection areas, for example, oriented such that it has the forward detection region 111, the left detection region 111L, and / or the right detection region 111R shown in Figure 2. Additionally, the extent of the detection regions (e.g., near-field detection region, far-field detection region) can differ.
[0037] As shown in Figure 1, the device 100 includes an acquisition unit 120 and a determination unit 130, and may also include one or more antennas or lasers 110, but the one or more antennas or lasers may also be provided separately from the device 100.
[0038] The one or more antennas 110 may be radar antennas. Here, the one or more antennas 110 may be configured to transmit radar signals, preferably modulated radar signals, such as chirped signals. The signals may be acquired or detected at one or more antennas 110 and are generally referred to hereinafter as return signals. Here, the return signals may be generated by reflections of the transmitted radar signals from obstacles or objects in the vehicle's surrounding environment (such as pedestrians, other vehicles such as buses or cars), but may also include noise signals generated by other electronic devices, other sources of electromagnetic interference, thermal noise, etc.
[0039] The one or more antennas may be provided individually or as an antenna array, wherein at least one of the one or more antennas 110 transmits radar signals, and at least one of the one or more antennas 110 detects returned signals. The detected or acquired returned signals represent the change in amplitude / energy of the electromagnetic field over time.
[0040] Acquisition unit 120 is configured to acquire radar data for each of the one or more radar antennas 110, the acquired radar data including range data and range change rate data. Acquisition unit 120 can acquire returned signals detected at the one or more antennas and can apply analog-to-digital conversion (A / D) to them. Acquisition unit 120 can convert the delay between the transmitted radar signal and the detected returned signal into range data. The delay can be acquired by correlating the returned signal with the transmitted radar signal, thereby acquiring the range data. Acquisition unit 120 can calculate the Doppler frequency shift or range change rate as range change rate data based on the frequency shift or phase shift of the detected returned signal compared to the transmitted radar signal. The frequency shift or phase shift can be acquired by performing a frequency transformation on the returned signal and comparing its spectrum with the frequency of the transmitted radar signal, thereby acquiring the range change rate data. For example, range data and range change rate / Doppler data can be determined based on return signals detected at one or more antennas, as described in US 7,639,171 or US 9,470,777 or EP 3454079.
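The two conversions in paragraph [0040] follow the standard radar relations: range is half the round-trip delay times the speed of light, and range change rate is proportional to the Doppler shift. The sketch below is only an illustration of those relations, not the processing of the cited patents. The function names, the 77 GHz carrier frequency, and the sign convention (a positive Doppler shift means an approaching target, hence a negative range change rate) are assumptions.

```python
C = 299_792_458.0  # speed of light in m/s


def range_from_delay(delay_s: float) -> float:
    """Range in metres from the round-trip delay of the return signal."""
    return C * delay_s / 2.0


def range_rate_from_doppler(doppler_hz: float, carrier_hz: float) -> float:
    """Range change rate in m/s from the Doppler shift.

    Assumed convention: a positive Doppler shift corresponds to an
    approaching target, i.e. a decreasing range (negative rate).
    """
    wavelength_m = C / carrier_hz
    return -doppler_hz * wavelength_m / 2.0


r = range_from_delay(1e-6)                 # about 149.9 m for a 1 microsecond delay
v = range_rate_from_doppler(1000.0, 77e9)  # about -1.95 m/s at an assumed 77 GHz
```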
[0041] While the above describes an example of acquiring sensor-based data in the form of radar data, this disclosure is not limited thereto, and the acquisition unit 120 can also acquire sensor data based on lidar.
[0042] The acquisition unit 120 can acquire sensor-based data within a data cube, such as distance and angle values in polar coordinates, each for multiple distance rate of change (Doppler) values. In this case, the acquisition unit 120 (or the determination unit 130 described below) can also be configured to perform a transformation of the (distance, angle) data values from polar coordinates to Cartesian coordinates, i.e., a transformation from (distance, angle) data values to (X, Y) data values. Advantageously, the transformation is performed in a manner that generates multiple Cartesian grids with different spatial resolutions and spatial dimensions, for example, a near-range (X, Y) grid with a spatial dimension of 80m × 80m and a spatial resolution of 0.5m / bin, and a far-range (X, Y) grid with a spatial dimension of 160m × 160m and a spatial resolution of 1m / bin. Those skilled in the art will recognize that this is a more efficient data representation (e.g., regarding memory requirements) compared to generating a single (X, Y) grid with a spatial dimension of 160m × 160m and a spatial resolution of 0.5m / bin.
[0043] In other words, given sensor data in a BEV acquired from, for example, lidar or radar point clouds, the point cloud can first be converted into multiple grids in a world Cartesian coordinate system centered on the ego vehicle (e.g., an autonomous vehicle or robot). In this process, two parameters can be defined: spatial distance and resolution. Typically, longer distances and higher resolutions are needed to detect more targets and better describe their shapes. However, longer distances and higher resolutions lead to higher memory consumption and higher computational costs. To address this technical problem, this disclosure represents sensor-based data using multiple Cartesian grids with different spatial distances and resolutions. Thus, the acquired sensor data points can be converted into short-range Cartesian grids and long-range grids with correspondingly different resolutions. Here, the short-range grids have higher spatial resolution, while the long-range grids have lower spatial resolution. For example, if the long and short distances cover 100m and 50m respectively, then the grid size can be 100×100 for both distances. In this case, the spatial resolutions for the long and short distances are 1m / grid (1m / bin) and 0.5m / grid (0.5m / bin), respectively. In practice, the grid (bin) size can vary for different distances. As mentioned above, data at long distances is sparser than data at short distances, so high resolution is unnecessary.
[0044] The determination unit 130 is configured to detect objects in the environment surrounding the vehicle 200 using a set of sensor-based data (e.g., radar-based or lidar-based data) from multiple Cartesian grids with different spatial resolutions and dimensions in a deep neural network. This is explained further below with reference to the flowchart of Figure 3, which shows a computer-implemented method according to an embodiment of this disclosure.
[0045] Figure 3 shows a flowchart illustrating a computer-implemented method according to an embodiment of the present disclosure. According to step S110 in Figure 3, a first set of sensor-based data (e.g., radar-based data or lidar-based data) for a first Cartesian grid having a first spatial dimension and a first spatial resolution is input into the first branch (also called the head) of the deep neural network. Furthermore, according to step S120 in Figure 3, a second set of sensor-based data for a second Cartesian grid with a second spatial dimension and a second spatial resolution is input into the second branch of the deep neural network. Here, the first set of sensor-based data and the second set of sensor-based data are sensor-based data of the vehicle's surrounding environment.
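One possible way to perform the transformation of paragraph [0042] is a nearest-neighbour lookup from each Cartesian cell back into the polar map. The sketch below is an assumption-laden illustration, not the disclosed implementation: the function name `polar_to_cartesian`, the 120m maximum range, the 180° field of view centred on the +X axis, and the nearest-neighbour scheme are all hypothetical choices. Only the 108 × 150 polar bin counts and the two grid definitions come from the text.

```python
import numpy as np


def polar_to_cartesian(polar, max_range_m, fov_rad, extent_m, res_m):
    """Resample a (range, angle) map onto a square (X, Y) grid centred on the sensor.

    Nearest-neighbour lookup; cells outside the polar coverage stay zero.
    """
    n_range, n_angle = polar.shape
    n = int(round(extent_m / res_m))
    # Cell-centre coordinates, e.g. -39.75 ... 39.75 for 80 m at 0.5 m / bin.
    axis = (np.arange(n) + 0.5) * res_m - extent_m / 2.0
    x, y = np.meshgrid(axis, axis, indexing="ij")
    rng = np.hypot(x, y)
    ang = np.arctan2(y, x)  # 0 rad points along +X (assumed boresight)
    r_idx = np.floor(rng / max_range_m * n_range).astype(int)
    a_idx = np.floor((ang + fov_rad / 2.0) / fov_rad * n_angle).astype(int)
    valid = (r_idx >= 0) & (r_idx < n_range) & (a_idx >= 0) & (a_idx < n_angle)
    grid = np.zeros((n, n), dtype=polar.dtype)
    grid[valid] = polar[r_idx[valid], a_idx[valid]]
    return grid


# Random stand-in for one Doppler slice of the data cube: 108 range x 150 angle bins.
polar = np.random.default_rng(0).random((108, 150))
near = polar_to_cartesian(polar, 120.0, np.pi, extent_m=80.0, res_m=0.5)
far = polar_to_cartesian(polar, 120.0, np.pi, extent_m=160.0, res_m=1.0)

two_grids = near.size + far.size  # 2 x 160 x 160 cells
single_fine_grid = 320 * 320      # one 160 m x 160 m grid at 0.5 m / bin
```

Under these assumptions the two grids together hold half as many cells as the single fine grid, which is the memory saving the paragraph refers to.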
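The 100m / 50m example in paragraph [0043] can be sketched by binning the same point cloud twice. This is a minimal illustration assuming a simple per-cell point count as the grid value; the helper name `rasterize` and the sample points are hypothetical.

```python
import numpy as np


def rasterize(points_xy, half_extent_m, n_bins):
    """Count points per cell of a square grid centred on the ego vehicle."""
    edges = np.linspace(-half_extent_m, half_extent_m, n_bins + 1)
    counts, _, _ = np.histogram2d(
        points_xy[:, 0], points_xy[:, 1], bins=(edges, edges)
    )
    return counts


# Hypothetical (X, Y) points in metres, ego vehicle at the origin.
points = np.array([[3.2, -1.4], [20.0, 10.0], [45.0, 5.0], [-30.0, 30.0]])

short_grid = rasterize(points, 25.0, 100)  # covers 50 m at 0.5 m / bin
long_grid = rasterize(points, 50.0, 100)   # covers 100 m at 1 m / bin
```

Both grids are 100×100, but the short-range grid only receives the two points within 25m of the vehicle along each axis, while the long-range grid receives all four at coarser resolution.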
[0046] As described above, the first spatial dimension is different from the second spatial dimension, and the first spatial resolution is different from the second spatial resolution. For different spatial resolutions and dimensions, the first set of sensor-based data and the second set of sensor-based data are generated based on a set of acquired sensor data regarding the vehicle's surrounding environment. Furthermore, since the first set of sensor-based data and the second set of sensor-based data are generated based on the same acquired sensor data, they include an overlapping region in Cartesian space. In the above examples with a near-range (X, Y) Cartesian grid having a spatial dimension of 80m × 80m and a spatial resolution of 0.5m / bin, and a far-range (X, Y) Cartesian grid having a spatial dimension of 160m × 160m and a spatial resolution of 1m / bin, the overlapping region can be a spatial region of [-40m, 40m] distance × [-40m, 40m] distance around the vehicle at position (X, Y) = (0, 0) in the BEV.
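For the 80m / 160m example in paragraph [0046], the cells of the far-range grid that fall inside the overlapping region can be located by index arithmetic. The helper name `overlap_slice` is hypothetical, and the sketch assumes both grids are centred on the vehicle.

```python
def overlap_slice(outer_half_m: float, outer_res_m: float, inner_half_m: float) -> slice:
    """Index range of outer-grid cells covering [-inner_half_m, inner_half_m]."""
    start = round((outer_half_m - inner_half_m) / outer_res_m)
    stop = round((outer_half_m + inner_half_m) / outer_res_m)
    return slice(start, stop)


# Far grid: 160 m x 160 m at 1 m / bin; near grid: 80 m x 80 m.
s = overlap_slice(80.0, 1.0, 40.0)
print(s.start, s.stop)  # 40 120
```

The 80 far-range cells per axis selected by this slice cover the same area as the 160 near-range cells per axis at 0.5m / bin.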
[0047] Here, the deep neural network employs an artificial neural network structure with multiple layers of corresponding network nodes to progressively extract higher-level features from the input sensor-based data. The deep neural network can be a convolutional neural network that uses convolution operations in one or more layers instead of general matrix multiplication, and can have self-determining (self-learning) kernel functions or filtering functions. It can be trained based on datasets such as the Waymo dataset, the nuScenes dataset, the Oxford RobotCar dataset, etc., and can be used for both LiDAR-based and radar-based datasets. When using publicly available datasets in the form of point clouds, the point clouds are converted into Cartesian grids with different spatial distances and resolutions.
[0048] Alternatively, the training data can be multiple sequences of data cubes recorded from road scenes and manually labeled targets (also known as ground truth). These sequences can be cut into small blocks of fixed length. Thus, the training data can be formatted as a tensor of size N×T×S×R×A×D, where N is the number of training samples (e.g., 50k), where each training sample can include a set of bounding boxes, T is the block length (e.g., 12 timestamps), S is the number of sensors (e.g., 4), R is the number of distance bins (e.g., 108), A is the number of angle bins (e.g., 150), and D is the number of Doppler bins (e.g., 20). The neural network can take a certain number of training samples (also known as the batch size, e.g., 1, 4, or 16, depending on GPU memory availability, etc.), compute the output and loss (i.e., the difference between the ground truth and the detection results) with respect to the ground truth labels, update the network parameters through backpropagation of the loss, and iterate this process until all N samples have been used. This process is called an epoch (i.e., one cycle throughout the entire training dataset). A neural network can be trained using multiple epochs (e.g., 10 epochs) to minimize error and maximize accuracy. The specific values mentioned above are examples of the process used to train a neural network.
[0049] According to step S130 in Figure 3, an interaction is provided at an intermediate level in the deep neural network between the first branch and the second branch of the deep neural network. Here, the intermediate level can be provided after a predetermined number of layers in the deep neural network. Specifically, the intermediate level can be provided after a predetermined number of layers in the deep neural network have independently or separately processed the first set of sensor-based data and the second set of sensor-based data in the first and second branches, respectively. The interaction at the intermediate level of the deep neural network is provided so that the features identified separately, up to the intermediate level, for the overlapping spatial region of the first and second spatial dimensions are considered in the further processing of subsequent layers of the deep neural network (i.e., processing of the deep neural network after the intermediate level). The overlapping region is the common spatial region of the first and second Cartesian grids, and can be the central spatial region around the vehicle, for example, a spatial region of [-40m, 40m] distance × [-40m, 40m] distance around the vehicle, as described above.
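To make the example figures in paragraph [0048] concrete, the sketch below computes the number of values per training sample and counts parameter updates over 10 epochs at a batch size of 16. The loop body is a placeholder for the forward pass, loss computation, and backpropagation; no network is actually trained, and the helper name `iterate_batches` is hypothetical.

```python
import math

# Example dimensions from the text: N x T x S x R x A x D.
shape = {"N": 50_000, "T": 12, "S": 4, "R": 108, "A": 150, "D": 20}
values_per_sample = math.prod(v for k, v in shape.items() if k != "N")


def iterate_batches(n_samples: int, batch_size: int):
    """Yield index ranges that together cover every sample once (one epoch)."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))


updates = 0
for epoch in range(10):
    for batch in iterate_batches(shape["N"], 16):
        updates += 1  # placeholder: forward pass, loss, backpropagation

print(values_per_sample)  # 15552000
print(updates)            # 31250
```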
[0050] That is, in the further processing of the second branch of the deep neural network, the first feature identified by the first branch of the deep neural network with respect to the first set of sensor-based data is considered at an intermediate level, and simultaneously, in the further processing of the first branch of the deep neural network, the second feature identified by the second branch of the deep neural network with respect to the second set of sensor-based data is considered at an intermediate level. Therefore, the various branches of the deep neural network interact to combine features identified for different spatial dimensions and resolutions in the further processing of the deep neural network.
[0051] According to step S140 in Figure 3, the first output of the first branch of the deep neural network and the second output of the second branch of the deep neural network are then fused (output fusion) to detect objects in the vehicle's surrounding environment, particularly the characteristics of objects in the vehicle's surrounding environment. That is, although the corresponding first and second branches of the deep neural network can be independently trained based on publicly available training data to generate independent outputs that include the identifying features of objects in the vehicle's surrounding environment, the fusion of the identifying features of the first and second outputs collects object identifying feature information from multiple spatial scales or resolutions, thereby improving the accuracy of the final object detection.
[0052] In the above embodiments, two sets of sensor-based data for Cartesian grids with different spatial resolutions and dimensions are used. However, this disclosure is not limited in this respect. In particular, three or more sets of sensor-based data for Cartesian grids with different spatial resolutions and dimensions can be generated from the same acquired sensor data, and thus they can be fed into three or more independent branches of a deep neural network. Subsequently, the above-described interactions can be performed at one or more intermediate levels between every two branches of the three or more branches (i.e., between branch 1 and branch 2, between branch 1 and branch 3, and between branch 2 and branch 3). Those skilled in the art will understand that such embodiments involve more than one spatially overlapping region. After further processing in the three or more independent branches of the deep neural network, the outputs of the respective branches are fused to detect one or more objects in the environment surrounding the vehicle.
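The pairwise interactions for three branches listed in paragraph [0052] can be enumerated as follows. The sketch assumes concentric square grids, so the overlapping region of any pair is the smaller of the two grids; the third grid and its 160m half-extent are hypothetical.

```python
from itertools import combinations

# Half-extent of each branch's grid in metres (third value is an assumption).
half_extent_m = {"branch 1": 40.0, "branch 2": 80.0, "branch 3": 160.0}

overlaps = {
    (a, b): min(half_extent_m[a], half_extent_m[b])
    for a, b in combinations(half_extent_m, 2)
}

for (a, b), half in overlaps.items():
    print(f"{a} <-> {b}: overlap [-{half:g}m, {half:g}m]")
```

With three branches this gives three pairs and, under the concentric assumption, two distinct overlapping regions.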
[0053] Figure 4 illustrates a preferred embodiment. Here, the sensor-based data is radar data that can be acquired in a data cube representing angle and distance data in polar coordinates and Doppler (range rate of change) data. For example, for each of 20 Doppler (range rate of change) values (bins), the data cube can have 150 angle values (bins) and 108 distance values (bins) in polar coordinates.
[0054] According to this preferred embodiment, the acquisition unit 120 (or the determination unit 130 described below) can also be configured to perform a conversion of (distance, angle) data values from polar coordinates to Cartesian coordinates, i.e., a conversion of (distance, angle) data values to Cartesian (X, Y) data values (bins). Specifically, this involves generating near-field and far-field Cartesian (X, Y) grids, the near-field Cartesian (X, Y) grid having a spatial dimension of 80m × 80m and a spatial resolution of 0.5m / bin, and the far-field Cartesian (X, Y) grid having a spatial dimension of 160m × 160m and a spatial resolution of 1m / bin, each for a plurality of Doppler (distance rate of change) values (bins). Although the number of values (bins) in the near-field and far-field Cartesian grids is the same in this embodiment, this is not limiting, and the number can also be different.
[0055] According to this preferred embodiment, as shown in Figure 4, data values based on long-range radar and data values based on short-range radar are input into individual branches of a convolutional neural network with multiple layers. Then, in the intermediate level of the convolutional neural network, an interaction is provided between the first and second branches of the convolutional neural network to provide a fusion or combination of features identified by the various branches of the convolutional neural network at different spatial dimensions and resolutions for the overlapping regions of the short-range and long-range grids.
[0056] For example, a feature can be considered as the feature value at the output of each grid cell in the intermediate level. That is, each output of a convolutional neural network can be considered as a 2D feature map, used here for each of the near and far distances, and for multiple channels (e.g., each of the 64 channels in Figure 4). Such a feature map thus involves multiple 2D grids, each of which encodes one aspect of the data. For example, a color image with RGB channels can be considered as a feature map with 3 channels. In this implementation, feature fusion means concatenating feature maps from different branches and reducing the number of channels, so that the identifying features of the first branch (regarding the far-distance grid) are subsequently considered in the second branch of the convolutional network (regarding the near-distance grid) and vice versa.
[0057] As further shown in Figure 4, the first and second outputs of the convolutional neural network can include detected object attributes, such as object category, bounding box, object size, object position, object orientation, and / or object velocity (based on Doppler data). Note that the object's position and size can be derived from the bounding box. As shown in Figure 4, the first output of the convolutional neural network can identify the bounding boxes of cars, pedestrians, and buses, while the second output can identify pedestrians and buses (but not cars, because cars are not located in the nearby grid). Subsequent fusion of the first and second outputs of the convolutional neural network can be done, for example, based on clustered bounding boxes, to verify whether any object (e.g., a bus) is detected in both outputs, and / or statistical analysis can be applied to the first and second outputs to provide a final output regarding the detected objects, while a statistical measure of the difference between the first and second outputs can be optimized to improve the accuracy of object detection.
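One simple way to fuse the two outputs in the car / pedestrian / bus example is to filter overlapping bounding boxes of the same class while giving the near-range output priority. The sketch below uses greedy intersection-over-union filtering (a non-maximum-suppression style rule) as a stand-in for the clustering and statistical analysis mentioned in the text. It is not the disclosed fusion method, and the box format, scores, threshold, and function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(near_dets, far_dets, iou_thr=0.5):
    """Fuse detections given as (label, box, score) tuples.

    Near-range detections are visited first, so inside the overlapping
    region they take priority over far-range duplicates (assumed rule).
    """
    def by_score(det):
        return det[2]

    ordered = sorted(near_dets, key=by_score, reverse=True)
    ordered += sorted(far_dets, key=by_score, reverse=True)
    kept = []
    for det in ordered:
        if all(det[0] != k[0] or iou(det[1], k[1]) < iou_thr for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections in metres, ego vehicle at the origin.
far_dets = [
    ("car", (60.0, -2.0, 64.5, 0.0), 0.90),
    ("pedestrian", (10.0, 5.0, 10.8, 5.8), 0.60),
    ("bus", (20.0, -10.0, 32.0, -7.0), 0.80),
]
near_dets = [
    ("pedestrian", (10.1, 5.1, 10.8, 5.8), 0.80),
    ("bus", (20.5, -10.0, 32.0, -7.0), 0.85),
]
fused = fuse(near_dets, far_dets)
```

Here the bus and pedestrian seen by both branches are each kept once, from the near-range output, and the car seen only by the far-range branch is retained.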
[0058] According to another embodiment, the interaction S130 can also be provided by resampling the first intermediate output (e.g., the first feature map) of the first branch at an intermediate level and by resampling the second intermediate output (e.g., the second feature map) of the second branch at an intermediate level. Here, each intermediate output is the output after a predetermined number of layers in a deep or convolutional neural network. Resampling can be performed by matching the spatial resolution of the first intermediate output with that of the second intermediate output. This resampling improves the interaction because it enables a more efficient and simpler fusion of the individual outputs.
[0059] Figure 5 An implementation of resampling the first intermediate output is illustrated. Here, the intermediate output of the first branch of the far-distance grid has 160×160 feature values in a spatial distance of [-160m, 160m]. For example, the overlapping spatial region [-80m, 80m] can be cropped to 80×80 feature values in the overlapping spatial region and then upsampled to 160×160 feature values in the overlapping region. This upsampled first intermediate output can be easily merged or fused with the second intermediate output of the second branch (by linking and / or reducing, as described below).
[0060] Figure 6 illustrates an implementation of resampling the second intermediate output. The intermediate output of the second branch, for the near-range grid, has 160×160 feature values over a spatial range of [-80m, 80m] and is downsampled to 80×80 feature values, so that its spatial resolution matches that of the first intermediate output. The downsampled second intermediate output can then be readily merged or fused with the first intermediate output of the first branch, in particular with the cropped portion of the first intermediate output that corresponds to the overlapping region (by concatenation and / or reduction, as described below).
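As a non-limiting sketch, the downsampling of Figure 6 could be realised by averaging non-overlapping 2×2 blocks of cells. Average pooling is an assumption made here for illustration (the disclosure does not prescribe a particular downsampling operator), and `downsample_mean` is a hypothetical helper name.

```python
import numpy as np


def downsample_mean(feature_map, factor=2):
    """Downsample a (C, H, W) map by averaging non-overlapping factor x factor blocks."""
    c, h, w = feature_map.shape
    if h % factor or w % factor:
        raise ValueError("spatial size must be divisible by the factor")
    # Split each spatial axis into (blocks, cells-per-block) and average per block.
    blocks = feature_map.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


# Near-range intermediate output: 64 channels, 160x160 cells over [-80 m, 80 m].
rng = np.random.default_rng(0)
near = rng.random((64, 160, 160))
near_down = downsample_mean(near)  # 80x80 cells, same cell size as the far-range map
print(near_down.shape)             # (64, 80, 80)
```

The result covers the same [-80m, 80m] region as before, but with the coarser cell size of the first branch, so it aligns with the 80×80 central crop of the first intermediate output.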
[0061] Thus, as shown in Figure 5 and Figure 6, the interaction at the intermediate level of the neural network can further be performed by merging the first intermediate output with the resampled second intermediate output and by merging the second intermediate output with the resampled first intermediate output. Here, the merging can create a first concatenation of the first intermediate output with the resampled second intermediate output, and a second concatenation of the second intermediate output with the resampled first intermediate output.
[0062] Here, in the implementation of Figure 5 and Figure 6, the concatenation can combine each channel of the first intermediate output (e.g., 64 channels) with the resampled second intermediate output (e.g., 64 channels) to obtain a combined grid (e.g., 64 + 64 = 128 channels), and combine the second intermediate output (e.g., 64 channels) with the resampled first intermediate output (e.g., 64 channels) to obtain a combined grid (e.g., 64 + 64 = 128 channels), so that all channels are stacked in the concatenation.
[0063] The merging may also include reducing the first concatenation to produce a first reduced output (e.g., as further illustrated in Figure 6) and reducing the second concatenation to produce a second reduced output, wherein the first reduced output and the second reduced output are used for processing in subsequent layers of the deep neural network. For example, the reduction may employ kernels (e.g., 1×1 kernels) to reduce the number of channels to the number of channels that serve as the output of the intermediate level. In the example of Figure 6, a 1×1 kernel of the convolutional neural network is used to reduce the number of channels from 128 to 64.
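Purely as an illustrative sketch, stacking two 64-channel maps along the channel axis and reducing the 128 stacked channels back to 64 with a 1×1 kernel could look as follows. A 1×1 convolution is the same linear map over channels applied independently at every cell, so it is written here as a tensor contraction. The random weights are placeholders for parameters that would be learned during training, and the name `concat_and_reduce` is hypothetical.

```python
import numpy as np


def concat_and_reduce(own_map, resampled_other_map, weights, bias):
    """Stack two (C, H, W) maps along channels, then apply a 1x1 convolution.

    weights has shape (C_out, 2 * C) and bias has shape (C_out,).
    """
    stacked = np.concatenate([own_map, resampled_other_map], axis=0)  # (2C, H, W)
    # Contract the channel axis: the same linear map is applied at every cell.
    reduced = np.tensordot(weights, stacked, axes=([1], [0]))         # (C_out, H, W)
    return reduced + bias[:, None, None]


rng = np.random.default_rng(0)
near_feat = rng.standard_normal((64, 160, 160))  # second intermediate output
far_up = rng.standard_normal((64, 160, 160))     # resampled first intermediate output
weights = rng.standard_normal((64, 128)) * 0.01  # placeholder for a learned 1x1 kernel
bias = np.zeros(64)

reduced = concat_and_reduce(near_feat, far_up, weights, bias)
print(reduced.shape)  # (64, 160, 160)
```

Because the reduced output has the same channel count as each branch's own intermediate output, subsequent layers of that branch can consume it without any change to their input shape.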
[0064] Then, the first reduced output or the second reduced output can be used to replace the corresponding portion of the first intermediate output or the second intermediate output. Those skilled in the art will recognize that the corresponding portion preferably refers to the overlapping region. In the example of Figure 6, the overlapping region in the first intermediate output (for the far-range Cartesian grid) is replaced by the reduced output of the interaction with the near-range Cartesian grid.
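For illustration, writing a reduced output back into the overlapping region of the far-range intermediate output might be sketched as below. The sketch assumes that the overlapping region is the centred block of the far-range map and that the reduced output already has the far-range cell size; `replace_overlap` is a hypothetical helper name.

```python
import numpy as np


def replace_overlap(far_map, reduced_overlap):
    """Return a copy of far_map whose central block is replaced by reduced_overlap.

    far_map: (C, H, W); reduced_overlap: (C, h, w) with h <= H and w <= W.
    Assumes the overlapping region is centred in the far-range grid.
    """
    _, h, w = far_map.shape
    _, oh, ow = reduced_overlap.shape
    top, left = (h - oh) // 2, (w - ow) // 2
    out = far_map.copy()  # leave the original intermediate output untouched
    out[:, top:top + oh, left:left + ow] = reduced_overlap
    return out


far_map = np.zeros((64, 160, 160))        # far-range intermediate output
reduced_overlap = np.ones((64, 80, 80))   # reduced output for [-80 m, 80 m]
updated = replace_overlap(far_map, reduced_overlap)
print(updated[:, 40:120, 40:120].min(), updated[:, :40, :].max())  # 1.0 0.0
```

Cells outside the overlapping region keep the features computed by the far-range branch alone, while cells inside it now carry information from both branches.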
[0065] Therefore, the interactions at intermediate levels introduce features independently identified by the first and second branches of the neural network into the corresponding other branch of the neural network, thereby providing the ability to consider feature detection at different spatial resolutions in all subsequent layers of the neural network in the first and second branches. In this sense, although the first and second branches also perform independent data processing after the intermediate levels to produce independent outputs, there is a mixture of feature detections with different spatial dimensions and resolutions at the intermediate levels of the neural network. This not only provides more efficient data representation (as described above) but also improves the accuracy of object detection.
[0066] This can also be called feature fusion. That is, when multiple grids of different spatial dimensions and resolutions are processed separately by a convolutional neural network or a deep neural network, intermediate feature maps are generated at intermediate levels. Without loss of generality, these feature maps have the same spatial resolution as the respective input grid. As shown in Figure 5 and Figure 6, feature maps for different ranges have an overlapping region (possibly the central region). Multiple feature maps are available within this overlapping region and can therefore be fused. For example, the central portion of a far-range feature map can be upsampled and merged with a near-range feature map. At the same time, a near-range feature map can be downsampled and merged with the central portion of a far-range feature map. This merging operation can be achieved through concatenation and reduction, which are common in deep neural networks.
[0067] Returning to Figure 3, after further independent processing in each branch of the neural network following the intermediate level, each grid yields a set of final outputs, such as the bounding boxes and class labels of targets, a segmentation mask of the scene, etc.
[0068] These per-grid final outputs need to be fused to obtain the overall final output. Here, the fusion S140 of the first output (final output) of the first branch of the deep neural network and the second output (final output) of the second branch of the deep neural network can further include filtering out one or more overlapping bounding boxes. For the bounding box output, the overlapping bounding boxes can be filtered out using conventional non-maximum suppression (NMS). Advantageously, this avoids ambiguous object detections in which the same object (e.g., a bus) is identified by multiple overlapping bounding boxes.
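A minimal sketch of greedy non-maximum suppression over the pooled detections of both branches is given below for illustration. It assumes axis-aligned boxes (x1, y1, x2, y2) in metres in a common vehicle frame, a class-agnostic comparison, and an IoU threshold of 0.5; the box format, the threshold, and the example detections are assumptions and are not taken from this disclosure.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical detections: the bus is reported by both branches.
far_dets = [
    {"box": (10.0, 20.0, 22.0, 23.0), "score": 0.70, "label": "bus"},
    {"box": (100.0, 100.0, 104.0, 102.0), "score": 0.80, "label": "car"},
]
near_dets = [
    {"box": (10.5, 20.0, 22.5, 23.0), "score": 0.90, "label": "bus"},
    {"box": (5.0, 5.0, 5.6, 5.6), "score": 0.60, "label": "pedestrian"},
]

fused = nms(far_dets + near_dets)
print([d["label"] for d in fused])  # ['bus', 'car', 'pedestrian']
```

In this example the two bus boxes have an IoU of 0.92, so only the higher-scoring one is retained, while the car and the pedestrian, each reported by a single branch, pass through unchanged.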
[0069] Furthermore, the fusion S140 of the first output (final output) of the first branch of the deep neural network and the second output (final output) of the second branch of the deep neural network can also include prioritizing information from the first or second output by using distance information. For example, small objects such as pedestrians are detected more accurately at short range, so a higher confidence can be associated with pedestrian bounding boxes from the near-range output. For segmentation outputs, the segmentation masks can be resampled to the same resolution and then added. Similarly, distance priors can be used to perform a weighted summation. That is, since a higher confidence is likely to be associated with obstacles or other objects (and their corresponding bounding boxes) in the near-range Cartesian grid, the corresponding information in the final output of the neural network can be more accurate. Therefore, the statistical analysis can give greater statistical weight to the output information of the near-range Cartesian grid than to the output information of the far-range Cartesian grid.
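The weighted summation of segmentation masks could, as one possible sketch, be implemented as follows. The sketch assumes single-channel masks holding per-cell scores in [0, 1], nearest-neighbour upsampling of the far-range mask to the near-range cell size, and a fixed near-range weight of 0.7 standing in for a distance prior; the weight value and the name `fuse_masks` are hypothetical.

```python
import numpy as np


def fuse_masks(far_mask, near_mask, near_weight=0.7, factor=2):
    """Fuse a far-range and a near-range segmentation mask on the finer grid.

    far_mask:  (H, W) scores over the far range at the coarse resolution.
    near_mask: (h, w) scores over the centred overlap at the fine resolution.
    near_weight: assumed distance prior favouring the near-range output.
    """
    if not 0.0 <= near_weight <= 1.0:
        raise ValueError("near_weight must lie in [0, 1]")
    # Resample the far-range mask to the near-range resolution.
    fused = far_mask.repeat(factor, axis=0).repeat(factor, axis=1)
    h, w = near_mask.shape
    top, left = (fused.shape[0] - h) // 2, (fused.shape[1] - w) // 2
    overlap = fused[top:top + h, left:left + w]
    # Weighted sum inside the overlap; far-range scores are kept elsewhere.
    fused[top:top + h, left:left + w] = (
        near_weight * near_mask + (1.0 - near_weight) * overlap
    )
    return fused


far_mask = np.full((160, 160), 0.2)   # [-160 m, 160 m] at the coarse resolution
near_mask = np.full((160, 160), 1.0)  # [-80 m, 80 m] at the fine resolution
fused_mask = fuse_masks(far_mask, near_mask)
print(fused_mask.shape)  # (320, 320)
```

With these example values, cells inside the overlap evaluate to 0.7 × 1.0 + 0.3 × 0.2 = 0.76, while cells outside it retain the far-range score of 0.2.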
[0070] The above-described computer-implemented method can be stored as a computer program in the memory 410 of the computer 400 and can be executed by the processor 420 of the computer 400. The computer 400 can be a vehicle-mounted computer, device, radar sensor, radar system, lidar sensor, lidar system, etc., as shown in Figure 7.
[0071] It will be apparent to those skilled in the art that various modifications and changes can be made to the physical and methodological aspects of the invention and its construction without departing from the scope or spirit of the invention.
[0072] This disclosure has been described with respect to specific embodiments, which are intended in all respects to be illustrative and not restrictive. Those skilled in the art will appreciate that many different combinations of hardware, software, and / or firmware will be suitable for implementing this invention.
[0073] Furthermore, other implementations of this disclosure will be apparent to those skilled in the art upon consideration of the specification disclosed herein and the implementations thereof. The specification and embodiments are merely exemplary. Therefore, it should be understood that the inventive aspect does not lie in all features of any single implementation or configuration of the foregoing disclosure. Accordingly, the true scope and spirit of this disclosure are indicated by the appended claims.
Claims
1. A computer-implemented method for object detection in the surrounding environment of a vehicle (200) using a deep neural network, the computer-implemented method comprising: inputting (S110) a first set of sensor-based data for a first Cartesian grid having a first spatial dimension and a first spatial resolution into a first branch of the deep neural network; inputting (S120) a second set of sensor-based data for a second Cartesian grid having a second spatial dimension and a second spatial resolution into a second branch of the deep neural network; combining, at an intermediate level of the deep neural network, a first intermediate output from the first branch to the second branch of the deep neural network with a second intermediate output from the second branch to the first branch of the deep neural network, the combination being used in subsequent layers of the respective first and second branches of the deep neural network, wherein the first intermediate output and the second intermediate output represent features identified at the intermediate level of the deep neural network in the overlapping spatial region of the first spatial dimension and the second spatial dimension; and fusing (S140) a first output of the first branch of the deep neural network and a second output of the second branch of the deep neural network to detect objects in the surrounding environment of the vehicle.
2. The computer-implemented method according to claim 1, wherein the first spatial dimension is different from the second spatial dimension, and the first spatial resolution is different from the second spatial resolution.
3. The computer-implemented method according to claim 1 or 2, wherein the combining includes resampling the spatial resolution of the first intermediate output of the first branch at the intermediate level and resampling the spatial resolution of the second intermediate output of the second branch at the intermediate level.
4. The computer-implemented method according to claim 3, wherein the combining includes merging the first intermediate output with the resampled second intermediate output and merging the second intermediate output with the resampled first intermediate output.
5. The computer-implemented method according to claim 4, wherein the merging includes generating a first concatenation of the first intermediate output with the resampled second intermediate output, and generating a second concatenation of the second intermediate output with the resampled first intermediate output.
6. The computer-implemented method according to claim 5, wherein the merging also includes reducing the first concatenation to generate a first reduced output, and reducing the second concatenation to generate a second reduced output, wherein the first reduced output and the second reduced output are used for processing in subsequent layers of the deep neural network.
7. The computer-implemented method according to claim 6, wherein the first reduced output or the second reduced output is used to replace the corresponding part of the first intermediate output or the second intermediate output.
8. The computer-implemented method according to claim 1, wherein the fusion includes filtering out one or more overlapping bounding boxes.
9. The computer-implemented method according to claim 1, wherein the fusion includes prioritizing information from the first output or the second output using distance information.
10. The computer-implemented method according to claim 1, wherein the first output and the second output include information related to one or more of the following: target category, bounding box, object size, object position, object orientation, and object velocity.
11. A computer-readable storage medium including instructions that, when executed by a computer (400), cause the computer (400) to perform the method according to any one of claims 1 to 10.
12. An apparatus (100) for object detection in the surrounding environment of a vehicle using a deep neural network, the apparatus (100) comprising: an acquisition unit (120) configured to acquire sensor-based data about each of one or more radar antennas and / or lasers (110); and a determining unit (130) configured to: input a first set of sensor-based data for a first Cartesian grid having a first spatial dimension and a first spatial resolution into a first branch of the deep neural network; input a second set of sensor-based data for a second Cartesian grid having a second spatial dimension and a second spatial resolution into a second branch of the deep neural network; combine, at an intermediate level of the deep neural network, a first intermediate output from the first branch to the second branch of the deep neural network with a second intermediate output from the second branch to the first branch, the combination being used in subsequent layers of the respective first and second branches of the deep neural network, wherein the first and second intermediate outputs represent features identified at the intermediate level of the deep neural network in the overlapping spatial region of the first and second spatial dimensions; and fuse a first output of the first branch of the deep neural network and a second output of the second branch of the deep neural network to detect objects in the surrounding environment of the vehicle.
13. The apparatus (100) according to claim 12, the apparatus (100) further comprising the one or more radar antennas and / or lasers (110).
14. The apparatus (100) according to claim 12 or 13, wherein the one or more radar antennas and / or lasers (110) are configured to transmit signals and detect return signals; and the acquisition unit (120) is configured to acquire the sensor-based data based on the return signals.
15. A vehicle (200) having one or more apparatuses (100) according to any one of claims 12 to 14.