Method and system for three-dimensional object detection for autonomous driving

The Master-Slave Transformer architecture with an adaptive controller enhances 3D object detection in autonomous vehicles by efficiently processing and integrating multimodal sensor data, addressing real-time adaptability and accuracy challenges.

WO2026142306A1 · PCT designated stage · Publication Date: 2026-07-02 · CHANGWON NATIONAL UNIVERSITY INDUSTRY ACADEMY COOPERATION CORPS

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHANGWON NATIONAL UNIVERSITY INDUSTRY ACADEMY COOPERATION CORPS
Filing Date
2025-12-24
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing 3D object detection systems for autonomous vehicles face limitations in accurately perceiving the environment due to the lack of depth information and inefficiencies in combining data from multiple sensors, leading to challenges in real-time processing and adaptability to dynamic environments.

Method used

A Master-Slave Transformer architecture is employed, utilizing multiple slave transformers to process sensor data individually and a master transformer to integrate them, enhanced by an adaptive controller for dynamic weight updates and real-time adaptation, leveraging multimodal fusion techniques to improve computational efficiency and accuracy.

Benefits of technology

The proposed system achieves superior performance in 3D object detection, demonstrating higher accuracy and adaptability to changing environments, particularly with the inclusion of an adaptive controller, effectively fusing multimodal sensor data for enhanced reliability in autonomous driving.


Abstract

Disclosed are a method and system for three-dimensional object detection for autonomous driving. An object detection method according to one embodiment may comprise the steps of: generating a plurality of feature maps by processing features generated on the basis of multimodal sensor data received from a plurality of sensors different from each other using a multi-slave transformer; dynamically updating weights by inputting a control signal to the plurality of feature maps using an adaptive controller; and generating a global feature map by integrating the feature maps in which the weights are dynamically updated using a master transformer.

Description

3D object detection method and system for autonomous driving

[0001] This invention is the result of research conducted as part of the Phase 3 Industry-Academic Cooperation Leading University Development Project (LINC 3.0), funded by the Ministry of Education and the National Research Foundation of Korea.

[0002] The following description relates to a 3D object detection method and system for autonomous driving.

[0003] Autonomous driving technology is a core technology capable of revolutionizing future transportation systems, and ensuring system stability and reliability is essential for its practical application. For autonomous vehicles to drive safely on the road, the ability to accurately perceive the surrounding environment is critical; to achieve this, objects must be detected and analyzed based on data collected from various sensors. In particular, 3D object detection is a vital element that helps autonomous vehicles precisely identify objects such as obstacles or pedestrians in space and safely plan driving paths. While 2D object detection primarily relies on image data to detect the shape and location of objects, it faces limitations due to a lack of depth information. In contrast, 3D object detection provides depth information, allowing for a more accurate determination of object size, location, and shape, which is essential for establishing precise driving strategies for autonomous vehicles. 3D object detection technology primarily utilizes data collected from sensors such as cameras, LiDAR, and radar; however, since each sensor possesses both unique strengths and limitations, effectively combining them is crucial.

[0004] Recently, multimodal fusion technology has been attracting attention as a method to further enhance the perception performance of autonomous vehicles. Multimodal fusion focuses on combining various data modalities, such as images, text, and sensor data, to create a single, integrated data representation, enabling more sophisticated and comprehensive environmental perception. This provides a foundation for autonomous vehicles to safely respond to more complex road situations and plays a crucial role in the advancement of autonomous driving technology.

[0005] A 3D object detection method and system for autonomous driving are provided.

[0006] An object detection method for an object detection system implemented by at least one computer device, wherein the object detection system comprises a multiple slave transformer, a master transformer, and an adaptive controller, and the at least one computer device comprises at least one processor, and the object detection method comprises: a step of generating a plurality of feature maps by processing features generated based on multimodal sensor data received from a plurality of different sensors using the multiple slave transformer by the at least one processor; a step of dynamically updating weights by inputting control signals to the plurality of feature maps using the adaptive controller by the at least one processor; and a step of generating a global feature map by integrating the feature maps with dynamically updated weights using the master transformer by the at least one processor.

[0007] According to one aspect, the step of processing the features using the multiple slave transformers may be characterized by dividing the features into a plurality of scene regions and combining the features of the plurality of sensors included in each scene region using the cross attention of the slave transformer corresponding to each of the plurality of scene regions.

[0008] According to another aspect, the step of dynamically updating the weights may be characterized by adjusting the weights by inputting the control signal into the plurality of feature maps based on Dynamic Reinforcement Learning (Dynamic RL) according to road conditions.

[0009] According to another aspect, the step of dynamically updating the weights may be characterized by calculating the control signal based on a non-linear activation function, weights, and biases, and adjusting the plurality of feature maps through finite-time synchronization based on the control signal.

[0010] According to another aspect, the object detection method may further include the step of, by the at least one processor, predicting the location of an object from the global feature map using the master transformer and generating a bounding box and a label.

[0011] According to another aspect, the object detection method may further include the step of visualizing the multimodal sensor data, including the bounding box and the label, by the at least one processor.

[0012] According to another aspect, the multimodal sensor data may be characterized by including a LiDAR point cloud collected through LiDAR, an image input through a camera, a high-definition (HD) map obtained through a location measured based on a Global Positioning System (GPS), and a radar point cloud collected through radar.

[0013] According to another aspect, the features generated based on the multimodal sensor data may be characterized by including a first feature extracted from the LiDAR point cloud using a 3D CNN (Convolutional Neural Network), a second feature extracted from the image using EfficientDet, a third feature extracted from the HD map using a ViT (Vision Transformer), and a fourth feature extracted from the radar point cloud using TransCAR.

[0014] A computer program stored on a computer-readable recording medium is provided to be combined with a computer device to execute the above method on the computer device.

[0015] A computer-readable recording medium is provided on which a computer program for executing the above method is recorded on a computer device.

[0016] An object detection system implemented by at least one computer device, wherein the object detection system comprises a multiple slave transformer, a master transformer, and an adaptive controller, and the at least one computer device comprises at least one processor, wherein the at least one processor processes features generated based on multimodal sensor data received from a plurality of different sensors using the multiple slave transformer to generate a plurality of feature maps, inputs control signals to the plurality of feature maps using the adaptive controller to dynamically update weights, and integrates the feature maps with dynamically updated weights using the master transformer to generate a global feature map.

[0017] A 3D object detection method and system for autonomous driving can be provided.

[0018] FIG. 1 is a drawing illustrating an example of a master-slave transformer architecture in one embodiment of the present invention.

[0019] FIG. 2 is a diagram illustrating an example of generating a feature map for a multi-slave transformer through a multimodal sensor in an embodiment of the present invention.

[0020] FIG. 3 is a diagram illustrating an example of a predicted image from a lidar, radar, and model in one experimental example of the present invention.

[0021] FIG. 4 is a diagram illustrating an example of a vehicle sensor configuration in one experimental example of the present invention.

[0022] FIG. 5 is a diagram illustrating an example of a simulation result of multimodal sensor fusion in an experimental example of the present invention.

[0023] FIG. 6 is a diagram illustrating an example of lidar data and predicted radar data in one experimental example of the present invention.

[0024] FIG. 7 is a diagram showing loss values according to the presence or absence of an adaptive controller in a master-slave transformer architecture in one experimental example of the present invention.

[0025] FIG. 8 is a diagram illustrating an example of the classification accuracy of a master-slave transformer architecture in one experimental example of the present invention.

[0026] FIG. 9 is a flowchart illustrating an example of an object detection method according to an embodiment of the present invention.

[0027] FIG. 10 is a block diagram illustrating an example of a computer device according to an embodiment of the present invention.

[0028] Hereinafter, embodiments will be described in detail with reference to the attached drawings.

[0029] Embodiments of the present invention provide a 3D object detection method and system built on a multimodal-fusion Master-Slave Transformer architecture to improve 3D object recognition in an autonomous driving system. The proposed Master-Slave Transformer architecture can generate a global feature map by distributing the processing of multiple sensor data across multiple slave transformers and integrating the results in a master transformer. Additionally, by introducing an Adaptive Controller into the Master-Slave Transformer architecture, the stability and reliability of object recognition can be improved by responding dynamically to environments that change in real time. Experimental results showed that the Master-Slave Transformer architecture demonstrated superior performance compared to existing models, and achieved even higher accuracy when the Adaptive Controller was applied.

[0030] Object recognition research, currently underway due to the need to enhance the safety and reliability of autonomous driving technology, is broadly divided into single-sensor-based and multi-sensor-based object recognition. Single-sensor-based object recognition offers lower hardware costs because only one sensor is used, and simpler data processing and analysis because only a single data source is handled. However, the single-sensor approach inherits the sensor's weaknesses along with its strengths: depending on the sensor, it may be limited under specific environmental conditions. For example, cameras are sensitive to changes in light, and LiDAR can be affected by weather. Multimodal fusion techniques were devised to overcome these drawbacks; by fusing data from multiple sensors, more accurate and stable object recognition becomes possible.

[0031] Multimodal fusion techniques can be broadly divided into deep learning-based and transformer-based approaches. Deep learning-based approaches primarily utilize Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) to learn complex data patterns and model interactions between sensors to improve recognition performance. However, since deep learning-based approaches typically process each modality's data independently before combining them at the end, they are limited in capturing complex interactions between modalities and in handling long-term dependencies arising from continuous scenes, such as long data sequences or video data. Transformer-based approaches can compensate for both limitations. Because they use attention mechanisms to learn interactions between all elements of the input data, they minimize information loss between modalities and effectively handle long-term dependencies.

[0032] The attention mechanism plays a core role in Transformer models by calculating the relationships between all elements of each input data and assigning weights to important parts, thereby enabling the learning of how each part of the input data relates to others. Attention mechanisms can be divided into self-attention and cross-attention depending on the number of input sequences. Self-attention is a method that assigns weights by considering the relationship between each element within a single sequence and all other elements. This mechanism allows for the learning of the relationships between all elements within the same sequence, ensuring that all parts of the input sequence are learned as being interrelated. The cross-attention mechanism assigns weights by considering the interrelationships between two different sequences. This is primarily used in multimodal Transformers that fuse information between different modalities, where one sequence possesses features updated through information from another sequence. However, these Transformer-based multimodal fusion techniques have the disadvantage of being difficult to process in real-time due to the large memory requirements associated with calculating the interactions between all data.
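The difference between self-attention and cross-attention described above can be illustrated with a minimal sketch. It is written in plain Python rather than Torch so that it runs without dependencies; the toy token values and the names `camera` and `lidar` are hypothetical, and the learned query, key, and value projections of a real transformer are omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, values))
         for j in range(len(values[0]))]
        for w_row in weights
    ]
    return outputs, weights

# Hypothetical toy tokens: 2 camera tokens and 3 LiDAR tokens, feature size 2.
camera = [[1.0, 0.0], [0.0, 1.0]]
lidar = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attention(camera, camera, camera)

# Cross-attention: camera tokens query the LiDAR sequence, so each camera
# token is updated with information drawn from the LiDAR tokens.
cross_out, cross_w = attention(camera, lidar, lidar)
```

The weight matrix has one row per query token and one column per key token, which is why memory use grows with the product of the two sequence lengths, the cost noted at the end of the paragraph above.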

[0033] Accordingly, the embodiments of the present invention aim to further improve the effective fusion of multimodal sensor data and adaptation in dynamic environments using a transformer-based approach, and to enhance the stability and reliability of object recognition in autonomous driving by supplementing real-time processing capabilities. By applying a master-slave structure utilizing cross-attention to transformer-based multimodal fusion, computational efficiency is increased, and an adaptive controller can be used to enhance dynamic adaptation suitable for various road environments. This model is designed to enable more precise object recognition by using individual transformers tailored to the characteristics of each sensor as slaves and integrating them with a master transformer, and to flexibly respond to environmental changes in real time through an adaptive controller.

[0034] Multimodal Fusion for Improving Object Recognition Performance

[0035] The master transformer can integrate the feature maps received from each slave transformer to ultimately generate a global feature map. Global attention is used during the process of generating the global feature map; unlike slave transformers that handle interactions between two sequences using cross-attention, the master transformer performs the role of generating the global feature map by processing the information between multiple modalities integrated by the slave transformers once again.

[0036] FIG. 1 is a diagram illustrating an example of a master-slave transformer architecture in an embodiment of the present invention. The master-slave transformer architecture (100) according to the present embodiment may include a master transformer (110), multiple slave transformers (120), and an adaptive controller (130). This master-slave transformer architecture (100) uses a master-slave structure and may utilize multiple slave transformers to improve computational efficiency and accuracy. A feature map received from a sensor is divided into a specific number of scene regions, and primary data processing is performed in the slave transformers.

[0037] In the embodiment of FIG. 1, four slave transformers are shown as the multiple slave transformers (120), each processing one region (A1, A2, B1, B2) of the feature map. At this time, each slave transformer can extract multimodal sensor data from the feature map of its designated region and process the extracted multimodal sensor data into a common dimension tensor for interaction and integration between the data. The models used for feature extraction from each multimodal sensor are EfficientDet, ViT (Vision Transformer), 3D CNN (Convolutional Neural Network), and TransCAR, which extract features from camera, high-definition map (HD map), LiDAR, and radar data, respectively, and enable more accurate recognition through the integration of information. Subsequently, the processed multimodal sensor data can be integrated complementarily within each slave transformer using a cross-attention mechanism.

[0038] Therefore, in this process, the performance of data processing can be improved by dividing the feature map processing among the multiple slave transformers (120) and by using a cross-attention mechanism that clarifies the interrelationships between data so that the information of each sensor can be harmoniously combined.

[0039] The adaptive controller (130) can extract the feature of the last time step of the global feature map generated by the master transformer (110) and generate a control signal that enables the autonomous driving system to adapt dynamically. This signal enables dynamic adjustment of the feature maps between the multiple slave transformers (120) and the master transformer (110). The adaptive controller (130) can input control signals to the feature maps transmitted from each slave transformer to dynamically update the weights and adjust the feature maps according to current road conditions. When the existing global feature map is denoted as G, the control signal obtained for the last-time-step feature as y, the weight as w, and the bias as b, the control signal can be applied as shown in the following Equation 1. Here, x is a vector for the time step dimension, and the addition in Equation 1 can mean broadcasted addition.
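As an illustration of the idea in this paragraph, the following sketch computes a control signal from the last time step of a feature map and adds it to every time step. The exact form of Equation 1 is not reproduced in this text, so the form used here, y = ReLU(w * x + b) applied element-wise followed by a broadcast addition to G, is an assumption based on the symbols defined above and on the ReLU activation mentioned elsewhere in the description. All numeric values are hypothetical.

```python
def relu(v):
    """Rectified linear unit for a single float."""
    return max(0.0, v)

def control_signal(last_step, w, b):
    """Assumed form: y = ReLU(w * x + b), element-wise over channels."""
    return [relu(wi * xi + bi) for xi, wi, bi in zip(last_step, w, b)]

def apply_control(G, y):
    """Broadcast-add y (one value per channel) to every time step of G."""
    return [[g + yi for g, yi in zip(step, y)] for step in G]

# Hypothetical global feature map: 3 time steps x 2 channels.
G = [[0.1, 0.2], [0.3, 0.4], [0.5, -0.6]]
w = [2.0, 1.0]  # hypothetical weights
b = [0.0, 0.1]  # hypothetical biases

y = control_signal(G[-1], w, b)  # uses only the last time step
G_adjusted = apply_control(G, y)
```

Because y has one entry per channel while G has one row per time step, the addition repeats y along the time dimension, which is what broadcasting means here.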

[0040]

[0041] Additionally, the adaptive controller (130) can adjust the feature map through finite-time synchronization using a non-linear activation function, the ReLU function, so that the system converges quickly to a specific location in state space corresponding to an optimal state. This method improves the adaptability and responsiveness of the autonomous driving system to dynamic changes, thereby increasing reliability in an autonomous driving environment.

[0042] In a more specific embodiment, a multimodal master-slave transformer architecture (100) equipped with an adaptive controller (130) for object detection of an autonomous vehicle can process data from various sensors such as cameras, HD maps, lidar, and radar as previously described. It can use a Kalman filter for time alignment, with the prediction of Equation 2 below and the correction of Equation 3 below, can apply a sliding window strategy for missing or delayed data, and can use extrinsic calibration to position sensors within the vehicle frame.

[0043]

[0044]

[0045] Here, x̂(t) can represent the predicted state at time t, A represents the state transition matrix, B represents the control input matrix, L represents the Kalman gain, and y(t) represents the sensor reading. Then, features can be extracted using ViT for HD maps, TransCAR for radar, 3D CNN for lidar, and EfficientDet for cameras. The multiple slave transformers (120) can combine features from all sensors using cross-attention. This mechanism can be represented as shown in Equation 4.
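Returning to the Kalman-filter time alignment, the prediction and correction steps can be sketched in their standard textbook form using the symbols A, B, L, and y(t) defined above. Equations 2 and 3 themselves are not reproduced in this text, so this is an assumed form rather than the patent's exact formulation. The sketch is scalar, uses a fixed gain L, and introduces an observation factor `C` and control input `u` that do not appear in the source; all numeric values are hypothetical.

```python
def predict(x_est, u, A, B):
    """Prediction: propagate the previous estimate through the motion model."""
    return A * x_est + B * u

def correct(x_pred, y, L, C=1.0):
    """Correction: pull the prediction toward the sensor reading y.

    C is an assumed observation factor mapping state to measurement.
    """
    return x_pred + L * (y - C * x_pred)

# Hypothetical scalar example: position along the lane with a constant
# commanded velocity. B acts as the time step when u is a velocity.
A, B, L = 1.0, 0.1, 0.5
x_est = 0.0
readings = [1.2, 1.8, 3.1]  # hypothetical sensor readings y(t)
u = 10.0                    # hypothetical control input

history = []
for y in readings:
    x_pred = predict(x_est, u, A, B)
    x_est = correct(x_pred, y, L)
    history.append((x_pred, x_est))
```

In a time-alignment setting, the prediction step would be used to carry each sensor's state forward to a common timestamp before the correction step folds in the next available reading.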

[0046]

[0047] Q, K, and V are query, key, and value matrices derived from sensor features. The adaptive controller (130) may use Dynamic Reinforcement Learning (Dynamic RL)-based weight adjustment to modify the feature map of each slave transformer according to road conditions, and may use a finite-time stabilization technique such as Equation 5 to ensure that the system quickly converges to an ideal state. The gain update of the adaptive controller (130) can be represented as Equation 6.

[0048]

[0049] Here, W_i(t) can represent the weight of sensor i at time t, e_i(t) can represent the error or uncertainty of sensor i, and α can represent the learning rate for weight adjustment.
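To make the roles of W_i(t), e_i(t), and α concrete, the following sketch shows one plausible per-sensor weight update. The update equation itself is not reproduced in this text, so the rule used here (subtract α times the error, clip at zero, renormalize) is purely an assumption for illustration and is not the reinforcement-learning procedure of the embodiment. The sensor names and error values are hypothetical.

```python
def update_weights(weights, errors, alpha):
    """Lower the weight of sensors with larger error, then renormalize."""
    raw = [max(0.0, w - alpha * e) for w, e in zip(weights, errors)]
    total = sum(raw)
    if total == 0.0:
        # All sensors fully distrusted: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

sensors = ["camera", "hd_map", "lidar", "radar"]
weights = [0.25, 0.25, 0.25, 0.25]
errors = [0.8, 0.1, 0.2, 0.3]  # hypothetical: camera degraded, e.g. at night
new_weights = update_weights(weights, errors, alpha=0.1)
```

With these values the camera ends up with the smallest weight and the HD map with the largest, which matches the intent of adjusting each sensor's contribution to road conditions.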

[0050]

[0051] Here, kx(t) can represent the influence of the current sensor, and t^p can allow the system to be dynamically adjusted over time. The master transformer (110) can combine and fuse the feature maps of the multiple slave transformers (120) to produce 3D bounding boxes and labels as output.

[0052] FIG. 2 is a diagram illustrating an example of generating a feature map for a multi-slave transformer through a multimodal sensor in an embodiment of the present invention.

[0053] For the LiDAR point cloud (211) collected through the LiDAR (210), a voxel grid (212) is generated through preprocessing, and a feature 1 can be extracted from the generated voxel grid (212) using a 3D CNN (213).

[0054] For an image (221) collected through a camera (220), after performing preprocessing of image resizing (222), feature 2 can be extracted from the resized image using EfficientDet (223).

[0055] For a high-definition (HD) map (231) obtained based on a location measured via GPS (Global Positioning System, 230), map parsing (232) is performed as preprocessing, and then feature 3 can be extracted using ViT (233).

[0056] For the radar point cloud (241) collected through the radar (240), a radar tensor (242) can be generated through preprocessing, and feature 4 can be extracted from the generated radar tensor (242) using TransCAR (243).
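The LiDAR preprocessing step in [0053], turning a point cloud into a voxel grid, can be sketched as follows. This is a minimal occupancy-count voxelization under stated assumptions: the voxel size, the origin, and the sample points are hypothetical, and the source does not specify which per-voxel features are fed to the 3D CNN.

```python
from collections import defaultdict

def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Map each (x, y, z) point to an integer voxel index and count
    how many points fall into each voxel."""
    grid = defaultdict(int)
    for p in points:
        idx = tuple(int((c - o) // voxel_size) for c, o in zip(p, origin))
        grid[idx] += 1
    return dict(grid)

# Hypothetical points in metres, relative to the sensor.
points = [(0.1, 0.2, 0.0), (0.4, 0.3, 0.1), (1.2, 0.1, 0.0), (-0.3, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=0.5)
```

Floor division keeps negative coordinates in their own voxels instead of folding them into index 0, which matters for points behind or to the side of the sensor.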

[0057] The generated features 1, 2, 3, and 4 can be divided by scene zones as previously described in FIG. 1 and input into the multiple slave transformer (120). At this time, each of the scene zones (A1, A2, B1, B2) may include features of multiple multimodal sensors for the corresponding zone, and each individual slave transformer of the multiple slave transformer (120) may fuse and process the multimodal data. In this case, the master transformer (110) can generate a global feature map by integrating the distributed processing results of the multiple slave transformer (120).
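The division into scene zones can be illustrated with a small sketch that splits a 2D grid into four quadrants. The assignment of A1 and A2 to the top half and B1 and B2 to the bottom half is a hypothetical layout chosen for illustration; the source names the zones but does not state their geometry in this text.

```python
def split_into_zones(feature_map):
    """Split a 2D grid (list of rows) into four quadrants."""
    h, w = len(feature_map), len(feature_map[0])
    mid_r, mid_c = h // 2, w // 2
    return {
        "A1": [row[:mid_c] for row in feature_map[:mid_r]],
        "A2": [row[mid_c:] for row in feature_map[:mid_r]],
        "B1": [row[:mid_c] for row in feature_map[mid_r:]],
        "B2": [row[mid_c:] for row in feature_map[mid_r:]],
    }

# Hypothetical 4x4 grid whose cells are numbered 0..15 row by row.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
zones = split_into_zones(grid)
```

Each zone would then be handed to its own slave transformer, so each attention computation covers a quarter of the cells rather than the whole grid.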

[0058] 3D CNN (213) is a model widely used for 3D data-related tasks such as LiDAR point cloud processing, video analysis, and medical image analysis.

[0059] EfficientDet (223) achieved state-of-the-art accuracy and efficiency in object detection tasks from camera data by combining a new scalable feature network and a compound scaling method.

[0060] ViT (233) divides an image, here the HD map data, into patches and processes them as a sequence, and can produce competitive results in benchmarks.

[0061] TransCAR (243) integrates a convolution layer and a transformer encoder to provide reliable and efficient feature extraction from radar data.

[0062] Experiment and Results

[0063] In this experimental example, an experiment was conducted to compare performance in multimodal fusion using the 'nuScenes mini version' dataset, which provides various sensor data to the master-slave transformer architecture (100) proposed earlier.

[0064] The 'nuScenes' dataset is a comprehensive and popular dataset for autonomous driving research. It contains data from camera, lidar, and radar sensors, and 23 object classes, including various vehicle types and pedestrians, are annotated in each keyframe. In this experimental example, the 'nuScenes' dataset was used because its varied data realistically depicts real-world driving problems, it is available free of charge for research and development, and it enables comparison with various autonomous driving algorithms using established evaluation metrics. For training and testing in this experimental example, however, the 'nuScenes mini version' dataset (v1.0-mini.tgz) was used. It contains a total of 404 scenarios, of which 324 are for training and 80 are for testing.

[0065] The resources used in this experimental example were 8 GB of RAM (Random Access Memory) and one T4 GPU (Graphics Processing Unit) in Google Colab, and Pyquaternion, Torch, EfficientDet, Transformers, Nuscenes-Devkit, and Torchvision were installed. Several models were used to extract and process the multimodal sensor data: EfficientDet for cameras, ViT for HD maps, TransCAR for radar, and 3D CNN for LiDAR. Four slave transformers were utilized to perform cross-attention between the feature maps of each modality pair. After processing, the features were combined, modified, and merged to generate a single feature map that could be used for object detection. All feature maps were converted into a common dimension tensor of shape [1, 16, 256]. Before the weights were supplied to the master transformer, the adaptive controller (130) dynamically synchronized and updated them over a period of time. The resulting tensor of shape [1, 16, 256] is the final integrated feature map. To obtain bounding boxes using the nuScenes dataset, the bounding box coordinates are normalized according to the image size, the categories are mapped to indices, and the labels and boxes are converted into tensors. To predict bounding boxes, labels, and confidence scores, a neural network model and loss functions for the bounding boxes and labels are defined. The model is trained for a set number of epochs, and the loss and mAP are calculated for each epoch. The model is then run to obtain predictions, which are filtered by a confidence threshold. The bounding boxes and labels are drawn on the image and displayed using Matplotlib and OpenCV (Open Source Computer Vision Library). FIG. 3 is a diagram illustrating examples of predicted images from a lidar, radar, and model in an experimental example of the present invention. The predicted images of FIG. 3 include bounding boxes and labels.
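The annotation handling and post-processing steps described in [0065] (normalizing box coordinates by image size, mapping labels to indices, and filtering predictions by confidence) can be sketched as below. The helper names, the sample box, the scores, and the 0.5 threshold are hypothetical; the 1600 x 900 image size matches nuScenes camera images, and the two category strings are nuScenes category names. Plain Python is used here in place of the Torch tensors used in the experiment.

```python
def normalize_box(box, image_width, image_height):
    """Scale pixel coordinates (x_min, y_min, x_max, y_max) into [0, 1]."""
    x_min, y_min, x_max, y_max = box
    return (x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height)

def build_label_index(class_names):
    """Map each distinct class name to a stable integer index."""
    return {name: i for i, name in enumerate(sorted(set(class_names)))}

def filter_predictions(predictions, threshold):
    """Keep predictions whose confidence score meets the threshold."""
    return [p for p in predictions if p["score"] >= threshold]

label_index = build_label_index(
    ["vehicle.car", "human.pedestrian.adult", "vehicle.car"])
box = normalize_box((400, 225, 800, 450), image_width=1600, image_height=900)

# Hypothetical model outputs for one image.
predictions = [
    {"label": label_index["vehicle.car"], "box": box, "score": 0.91},
    {"label": label_index["human.pedestrian.adult"], "box": box, "score": 0.32},
]
kept = filter_predictions(predictions, threshold=0.5)
```

Normalizing by image size makes the regression targets independent of resolution, so the same loss scale applies if images are resized during preprocessing.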

[0066] FIG. 4 is a diagram illustrating an example of a vehicle sensor configuration in an experimental example of the present invention. The vehicle sensors were configured according to the camera, radar, lidar, and HD map data included in the 'nuScenes mini version' dataset. The metric for the experimental results was the Mean Average Precision (mAP), which is mainly used to evaluate the performance of a model in computer vision tasks such as object recognition.
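For readers unfamiliar with the metric, the following sketch shows one common way to compute average precision per class and then average it into mAP. It assumes detections have already been matched to ground truth and flagged as true or false positives, and it uses the non-interpolated variant of AP; the source does not state which matching rule or AP variant was used in the experiment, and the sample detections are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """AP as the sum of precision at each true-positive rank,
    divided by the number of ground-truth objects.

    detections: list of (score, is_true_positive) for one class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth

def mean_average_precision(per_class):
    """mAP: unweighted mean of the per-class AP values."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)

# Hypothetical detections: (confidence, matched a ground-truth object?).
per_class = {
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "pedestrian": ([(0.6, True), (0.4, False)], 2),
}
m_ap = mean_average_precision(per_class)
```

Here the car class scores 5/6 and the pedestrian class 1/2, so the mAP is 2/3. Missed ground-truth objects lower AP because the sum is divided by the number of ground-truth objects rather than the number of detections.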

[0067] In order to verify the object recognition of the proposed master-slave transformer architecture (100) prior to performance comparison, experiments were conducted using the 'nuScenes mini version' dataset.

[0068] FIG. 5 is a diagram illustrating an example of a simulation result of multimodal sensor fusion in an experimental example of the present invention, and FIG. 6 is a diagram illustrating an example of lidar data and predicted radar data in an experimental example of the present invention. In FIG. 5, (a) shows a sample of camera data from the 'nuScenes mini version' dataset as an input image, and (b) shows a predicted image to which the master-slave transformer architecture (100) is applied. When comparing the two images, it can be seen that information regarding bounding boxes and labels has been added to the image of FIG. 5 (b) that has passed through the master-slave transformer architecture (100). In addition, it was confirmed that information regarding bounding boxes and labels has been added in the same way to the image using lidar data from the 'nuScenes mini version' dataset as in FIG. 6 (a) and to the image using radar data as in FIG. 6 (b).

[0069] Therefore, it can be seen that object recognition in an image is successfully achieved through the addition of bounding boxes and labels in a prediction image that has passed through the master-slave transformer architecture (100) according to the embodiments of the present invention.

[0070] FIG. 7 is a diagram showing the loss values with and without an adaptive controller in a master-slave transformer architecture in an experimental example of the present invention. In the graph of FIG. 7, the x-axis represents epochs and the y-axis represents loss. The graph shows the relationship between loss and the number of epochs for two model configurations: the blue model has no adaptive controller (e.g., adaptive controller (130)), and the red model has one. The sharp decrease in the red line indicates that the adaptive controller helps achieve lower loss values by increasing the learning speed. The adaptive controller makes loss reduction more efficient by dynamically adjusting parameters to help the model better capture underlying patterns. As such, the graph of FIG. 7 shows faster convergence of loss values and more effective loss reduction in the structure with the adaptive controller than in the structure without it. This indicates that the adaptive controller enhances convergence through faster learning of the dynamic environment, i.e., increased adaptability. Table 1 below shows the results of comparing the performance of various models on the 'nuScenes mini version' dataset.

[0071] Table 1

Model                                                                      mAP
PointPillars                                                               22.96
SECOND (Sparsely Embedded Convolutional Detection)                         38.07
Complex-YOLO                                                               23.96
PointRCNN                                                                  27.11
Multi-modal Master-Slave Transformer system without Adaptive Controller    37.72
Multi-modal Master-Slave Transformer system with Adaptive Controller       38.46

[0072] The master-slave transformer architecture (100) with an adaptive controller achieved a mean average precision (mAP) of 38.46, an improvement of 0.39 over the best existing model (SECOND, 38.07). In addition, a comparison of the proposed model with and without the adaptive controller showed that the configuration with the adaptive controller scored 0.74 higher than the configuration without it (38.46 versus 37.72). This suggests that the adaptive controller contributes to improving object recognition performance.
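The two differences quoted above follow directly from the mAP values in Table 1, as the short check below shows. The dictionary keys are shortened labels chosen for this sketch; the numbers are those reported in the table.

```python
table_1 = {
    "PointPillars": 22.96,
    "SECOND": 38.07,
    "Complex-YOLO": 23.96,
    "PointRCNN": 27.11,
    "Master-Slave without Adaptive Controller": 37.72,
    "Master-Slave with Adaptive Controller": 38.46,
}
baselines = ["PointPillars", "SECOND", "Complex-YOLO", "PointRCNN"]

# Strongest existing model among the baselines.
best_baseline = max(baselines, key=table_1.get)

with_ac = table_1["Master-Slave with Adaptive Controller"]
without_ac = table_1["Master-Slave without Adaptive Controller"]

# Rounded to two decimals to match the precision reported in the table.
gain_over_best = round(with_ac - table_1[best_baseline], 2)
gain_from_controller = round(with_ac - without_ac, 2)
```

The 0.39 figure is therefore measured against SECOND, the strongest baseline, and the 0.74 figure isolates the contribution of the adaptive controller.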

[0073] FIG. 8 illustrates an example of classification accuracy of a master-slave transformer architecture in an experimental example of the present invention. In the graph of FIG. 8, the x-axis represents epochs and the y-axis represents classification accuracy. The graph of FIG. 8 shows that classification accuracy improves significantly over the epochs, which indicates that the model benefits from more extensive training.

[0074] As such, embodiments of the present invention provide a multimodal master-slave transformer architecture utilizing an adaptive controller for efficient three-dimensional object recognition in autonomous driving. The master-slave transformer architecture may be based on a structure of multiple slave transformers that process each sensor data individually, and a master transformer that integrates them to generate a global feature map. Additionally, by introducing an adaptive controller, it can dynamically respond to changing road environments. Experimental results showed that the proposed architecture demonstrated superior performance compared to existing three-dimensional object recognition models, and achieved higher accuracy, particularly when including an adaptive controller. Through this, it was confirmed that the master-slave transformer architecture effectively fuses multimodal sensor data and operates reliably in dynamic environments.

[0075] FIG. 9 is a flowchart illustrating an example of an object detection method according to an embodiment of the present invention. The object detection method according to the present embodiment may be performed by an object detection system implemented by at least one computer device that includes the master-slave transformer architecture (100) described above. At this time, at least one processor included in the at least one computer device may be implemented to execute a control instruction according to the code of an operating system included in memory or the code of at least one computer program. Here, the at least one processor may operate according to a control instruction provided by the code stored in the at least one computer device to control the object detection system implemented by the at least one computer device so that the object detection system performs steps (910 to 950) included in the method of FIG. 9.

[0076] In step (910), the object detection system can generate multiple feature maps by processing features generated based on multimodal sensor data received from multiple different sensors using multiple slave transformers. For example, the object detection system can divide the features into multiple scene regions and combine the features of multiple sensors included in each scene region by using the cross attention of a slave transformer corresponding to each scene region.
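The per-region fusion of step (910) can be illustrated with a minimal sketch of scaled dot-product cross attention. This is not the implementation of the embodiment: it omits learned query/key/value projections, and the choice of the lidar feature as the query, the dictionary layout, and the names `cross_attention` and `fuse_regions` are assumptions made only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, keys, values):
    """Attend from one query vector over several key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


def fuse_regions(region_features, query_sensor="lidar"):
    """Fuse sensor features region by region (one 'slave' pass per region).

    region_features: {region_id: {sensor_name: feature_vector}} (assumed layout).
    The query sensor attends over the features of the other sensors.
    """
    fused_maps = {}
    for region_id, per_sensor in region_features.items():
        query = per_sensor[query_sensor]
        others = [v for name, v in per_sensor.items() if name != query_sensor]
        fused_maps[region_id], _ = cross_attention(query, others, others)
    return fused_maps


regions = {
    "front": {"lidar": [1.0, 0.0], "camera": [0.5, 0.5], "radar": [0.5, 0.5]},
    "rear": {"lidar": [0.0, 1.0], "camera": [1.0, 0.0], "radar": [0.0, 1.0]},
}
print(fuse_regions(regions))
```

Each region is processed independently, which mirrors the description of one slave transformer per scene region; a real system would batch these passes and use multi-head attention with trained weights.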

[0077] At this time, multimodal sensor data may include a lidar point cloud collected through lidar, an image input through a camera, a high-resolution map obtained through a location measured based on GPS, and a radar point cloud collected through radar. In this case, features generated based on multimodal sensor data may include a first feature extracted from the lidar point cloud using 3D CNN, a second feature extracted from the image using EfficientDet, a third feature extracted from the high-resolution map using ViT, and a fourth feature extracted from the radar point cloud using TransCAR.

[0078] In step (920), the object detection system can dynamically update weights by inputting control signals to multiple feature maps using an adaptive controller. For example, the object detection system can adjust the weights by inputting control signals to the multiple feature maps using Dynamic Reinforcement Learning (Dynamic RL) according to road conditions. As a more specific example, the object detection system can calculate control signals based on a non-linear activation function, weights, and biases, and can adjust multiple feature maps through finite-time synchronization based on the control signals.
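The control-signal computation of step (920) can be sketched as follows, assuming the form u = tanh(W h + b), where h is the feature vector of the last time step of the existing global feature map (as recited in the claims). The choice of tanh as the non-linear activation, the additive coupling in `apply_control`, and the `gain` parameter are assumptions for illustration only; finite-time synchronization and Dynamic RL are not modeled here.

```python
import math


def last_step_vector(global_feature_map):
    """Take the feature vector of the last time step (rows are time steps)."""
    return global_feature_map[-1]


def control_signal(h_last, W, b):
    """Compute u = tanh(W h + b); tanh is an assumed activation."""
    return [
        math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h_last)) + b_i)
        for row, b_i in zip(W, b)
    ]


def apply_control(feature_maps, u, gain=0.1):
    """Input the control signal to every token of every feature map.

    The additive, gain-scaled coupling is a hypothetical choice.
    """
    return [
        [[x + gain * u_k for x, u_k in zip(token, u)] for token in fmap]
        for fmap in feature_maps
    ]


# Toy example: a global feature map with two time steps of dimension 2.
global_map = [[0.0, 0.0], [1.0, -1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical controller weights
b = [0.0, 0.0]                # hypothetical controller biases

u = control_signal(last_step_vector(global_map), W, b)
slave_maps = [[[1.0, 1.0], [2.0, 2.0]], [[0.0, 0.0]]]  # two feature maps
print(apply_control(slave_maps, u))
```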

[0079] In step (930), the object detection system can generate a global feature map by integrating feature maps with dynamically updated weights using a master transformer. A feature map output by a single slave transformer may be a combination of features from multiple sensors included in a single scene region. In this case, the master transformer can generate a global feature map by integrating the feature maps from multiple scene regions.
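The integration of step (930) can be pictured as self-attention over one token per scene region, so that every region's output reflects all other regions. The sketch below is a simplification under stated assumptions: a single head, no learned projections (queries, keys, and values are the tokens themselves), and one feature vector per region. The name `integrate_regions` is hypothetical.

```python
import math


def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with Q = K = V = tokens."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = _softmax(scores)
        out.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(len(q))]
        )
    return out


def integrate_regions(region_maps):
    """Build a global feature map from per-region feature vectors.

    region_maps: {region_id: feature_vector} (assumed layout).
    Returns {region_id: context-aware feature_vector}.
    """
    order = sorted(region_maps)
    tokens = [region_maps[r] for r in order]
    return dict(zip(order, self_attention(tokens)))


print(integrate_regions({"left": [1.0, 0.0], "right": [0.0, 1.0]}))
```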

[0080] In step (940), the object detection system can use the master transformer to predict the location of an object from the global feature map and generate bounding boxes and labels. Bounding boxes can be generated as three-dimensional shapes for 3D object detection, and a label can be assigned to each bounding box.
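As an illustration of what a labeled three-dimensional bounding box may look like in code, the sketch below uses a seven-parameter box (center, size, and yaw about the vertical axis), which is a common convention in LiDAR-based detectors. The specification does not state which parameterization the embodiment uses, so the `Box3D` class and its fields are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3D box: center (cx, cy, cz), size, yaw in radians, label."""
    cx: float
    cy: float
    cz: float
    length: float
    width: float
    height: float
    yaw: float
    label: str

    def corners(self):
        """Return the 8 corner points, rotated by yaw about the z-axis."""
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        points = []
        for dx in (-0.5, 0.5):
            for dy in (-0.5, 0.5):
                for dz in (-0.5, 0.5):
                    lx, ly = dx * self.length, dy * self.width
                    points.append((
                        self.cx + c * lx - s * ly,
                        self.cy + s * lx + c * ly,
                        self.cz + dz * self.height,
                    ))
        return points


box = Box3D(0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, "car")
for corner in box.corners():
    print(box.label, corner)
```

The corner list is the form typically needed when projecting a box onto a camera image or drawing it over a point cloud for the visualization of step (950).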

[0081] In step (950), the object detection system can visualize multimodal sensor data including bounding boxes and labels. Examples of multimodal sensor data including bounding boxes and labels were previously described through FIGS. 3, 5, and 6.

[0082] Thus, according to embodiments of the present invention, a three-dimensional object detection method and system for autonomous driving can be provided.

[0083] FIG. 10 is a block diagram illustrating an example of a computer device according to an embodiment of the present invention. For example, an object detection system based on the master-slave transformer architecture (100) described above may be implemented by at least one computer device, and each of the at least one computer device may correspond to the computer device (1000) of FIG. 10. As illustrated in FIG. 10, the computer device (1000) may include memory (1010), a processor (1020), a communication interface (1030), and an input / output interface (1040). The memory (1010) is a computer-readable recording medium and may include RAM (random access memory), ROM (read only memory), and a permanent mass storage device such as a disk drive. Here, permanent mass storage devices such as ROM and disk drives may be included in the computer device (1000) as separate permanent storage devices distinct from memory (1010). Additionally, an operating system and at least one program code may be stored in memory (1010). These software components may be loaded into memory (1010) from a computer-readable recording medium separate from memory (1010). This separate computer-readable recording medium may include computer-readable recording media such as floppy drives, disks, tapes, DVD / CD-ROM drives, and memory cards. In another embodiment, software components may be loaded into memory (1010) via the communication interface (1030) rather than a computer-readable recording medium. For example, software components can be loaded into the memory (1010) of the computer device (1000) based on a computer program installed by files received through a network (1060).

[0084] The processor (1020) may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input / output operations. Instructions may be provided to the processor (1020) via memory (1010) or a communication interface (1030). For example, the processor (1020) may be configured to execute instructions received according to program code stored in a recording device such as memory (1010).

[0085] The communication interface (1030) may provide a function for the computer device (1000) to communicate with other devices through a network (1060). For example, requests, commands, data, files, etc. generated by the processor (1020) of the computer device (1000) according to program code stored in a recording device such as memory (1010) may be transmitted to other devices through the network (1060) under the control of the communication interface (1030). Conversely, signals, commands, data, files, etc. from other devices may be received by the computer device (1000) through the communication interface (1030) of the computer device (1000) via the network (1060). Signals, commands, data, etc. received through the communication interface (1030) may be transmitted to the processor (1020) or memory (1010), and files, etc. may be stored in a storage medium (the permanent storage device described above) that the computer device (1000) may further include.

[0086] The input / output interface (1040) may be a means for interfacing with an input / output device (I / O device, 1050). For example, the input device may include a device such as a microphone, keyboard, or mouse, and the output device may include a device such as a display or speaker. As another example, the input / output interface (1040) may be a means for interfacing with a device in which the functions for input and output are integrated into one, such as a touchscreen. The input / output device (1050) may also be configured as a single device together with the computer device (1000).

[0087] Additionally, in other embodiments, the computer device (1000) may include fewer or more components than the components of FIG. 10. However, most prior-art components need not be explicitly illustrated. For example, the computer device (1000) may be implemented to include at least some of the input / output devices (1050) described above, or may include other components such as a transceiver, a database, etc.

[0088] The system or device described above may be implemented as a hardware component, or a combination of a hardware component and a software component. For example, the device and component described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and one or more software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include multiple processing elements and / or multiple types of processing elements. For example, the processing unit may include multiple processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.

[0089] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or instruct the processing unit independently or collectively. Software and / or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device so as to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and may be stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

[0090] The method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc., either individually or in combination. The medium may continuously store a program executable by a computer, or temporarily store it for execution or download. Furthermore, the medium may be various recording or storage means in the form of a single or multiple hardware components, and is not limited to a medium directly connected to a computer system, but may also exist distributed over a network. Examples of media may include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and media configured to store program instructions, including ROM, RAM, and flash memory. Additionally, other examples of media may include recording or storage media managed by app stores that distribute applications or sites and servers that supply or distribute various other software. Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.

[0091] Although the embodiments have been described above with reference to limited examples and drawings, those skilled in the art can make various modifications and variations from the description above. For example, suitable results can be achieved even if the described techniques are performed in a different order than described, and / or the components of the described system, structure, device, circuit, etc. are combined or assembled in a form different from described, or replaced or substituted by other components or equivalents.

[0092] Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims set forth below.

Claims

1. An object detection method of an object detection system implemented by at least one computer device, wherein the object detection system includes multiple slave transformers, a master transformer, and an adaptive controller, and the at least one computer device includes at least one processor, the object detection method comprising: generating, by the at least one processor, a plurality of feature maps by processing features generated based on multimodal sensor data received from a plurality of different sensors using the multiple slave transformers; extracting, by the at least one processor using the adaptive controller, a feature vector of the last time step of an existing global feature map generated by the master transformer, generating a control signal by applying a non-linear activation function, weights, and biases to the feature vector, and dynamically updating weights by inputting the control signal to the plurality of feature maps; and generating, by the at least one processor, a global feature map by integrating the feature maps with dynamically updated weights using the master transformer.

2. The object detection method of claim 1, wherein processing the features using the multiple slave transformers comprises dividing the features into a plurality of scene regions and combining the features of the plurality of sensors included in each scene region using the cross attention of a slave transformer corresponding to each of the plurality of scene regions.

3. The object detection method of claim 1, wherein dynamically updating the weights comprises inputting the control signal into the plurality of feature maps and adjusting the weights based on Dynamic Reinforcement Learning (Dynamic RL) according to road conditions.

4. The object detection method of claim 1, further comprising predicting, by the at least one processor using the master transformer, the location of an object using the global feature map, and generating a bounding box and a label.

5. The object detection method of claim 4, further comprising visualizing, by the at least one processor, the multimodal sensor data including the bounding box and the label.

6. The object detection method of claim 1, wherein the multimodal sensor data includes a LiDAR point cloud collected through LiDAR, an image input through a camera, a high-resolution map obtained through a position measured based on a Global Positioning System (GPS), and a radar point cloud collected through radar.

7. The object detection method of claim 6, wherein the features generated based on the multimodal sensor data include a first feature extracted from the LiDAR point cloud using a 3D CNN (Convolutional Neural Network), a second feature extracted from the image using EfficientDet, a third feature extracted from the high-resolution map using ViT (Vision Transformer), and a fourth feature extracted from the radar point cloud using TransCAR.

8. A computer program stored on a computer-readable recording medium for executing, in combination with a computer device, an object detection method of an object detection system on the computer device, wherein the object detection system includes multiple slave transformers, a master transformer, and an adaptive controller, the object detection method comprising: generating a plurality of feature maps by processing features generated based on multimodal sensor data received from a plurality of different sensors using the multiple slave transformers; extracting, using the adaptive controller, a feature vector of the last time step of an existing global feature map generated by the master transformer, generating a control signal by applying a non-linear activation function, weights, and biases to the feature vector, and dynamically updating weights by inputting the control signal to the plurality of feature maps; and generating a global feature map by integrating the feature maps with dynamically updated weights using the master transformer.

9. The computer program of claim 8, wherein processing the features using the multiple slave transformers comprises dividing the features into a plurality of scene regions and combining the features of the plurality of sensors included in each scene region using the cross attention of a slave transformer corresponding to each of the plurality of scene regions.

10. The computer program of claim 8, wherein dynamically updating the weights comprises adjusting the weights by inputting the control signal to the plurality of feature maps based on Dynamic Reinforcement Learning (Dynamic RL) according to road conditions.

11. The computer program of claim 8, wherein the object detection method further comprises: predicting, using the master transformer, the location of an object using the global feature map, and generating a bounding box and a label; and visualizing the multimodal sensor data including the bounding box and the label.

12. The computer program of claim 8, wherein the multimodal sensor data includes a LiDAR point cloud collected through LiDAR, an image input through a camera, a high-resolution map obtained through a location measured based on a Global Positioning System (GPS), and a radar point cloud collected through radar, and wherein the features generated based on the multimodal sensor data include a first feature extracted from the LiDAR point cloud using a 3D CNN (Convolutional Neural Network), a second feature extracted from the image using EfficientDet, a third feature extracted from the high-resolution map using ViT (Vision Transformer), and a fourth feature extracted from the radar point cloud using TransCAR.

13. An object detection system implemented by at least one computer device, wherein the object detection system includes multiple slave transformers, a master transformer, and an adaptive controller, and the at least one computer device includes at least one processor, and wherein the at least one processor: processes features generated based on multimodal sensor data received from a plurality of different sensors using the multiple slave transformers to generate a plurality of feature maps; extracts, using the adaptive controller, a feature vector of the last time step of an existing global feature map generated by the master transformer, applies a non-linear activation function, weights, and biases to the feature vector to generate a control signal, and inputs the control signal to the plurality of feature maps to dynamically update weights; and generates a global feature map by integrating the feature maps with dynamically updated weights using the master transformer.

14. The object detection system of claim 13, wherein, to process the features, the at least one processor divides the features into a plurality of scene regions and combines the features of the plurality of sensors included in each scene region using the cross attention of a slave transformer corresponding to each of the plurality of scene regions.

15. The object detection system of claim 13, wherein, to dynamically update the weights, the at least one processor adjusts the weights by inputting the control signal to the plurality of feature maps based on Dynamic Reinforcement Learning (Dynamic RL) according to road conditions.

16. The object detection system of claim 13, wherein the at least one processor predicts, using the master transformer, the position of an object using the global feature map, generates a bounding box and a label, and visualizes the multimodal sensor data including the bounding box and the label.

17. The object detection system of claim 13, wherein the multimodal sensor data includes a LiDAR point cloud collected through LiDAR, an image input through a camera, a high-resolution map obtained through a location measured based on a Global Positioning System (GPS), and a radar point cloud collected through radar, and wherein the features generated based on the multimodal sensor data include a first feature extracted from the LiDAR point cloud using a 3D CNN (Convolutional Neural Network), a second feature extracted from the image using EfficientDet, a third feature extracted from the high-resolution map using ViT (Vision Transformer), and a fourth feature extracted from the radar point cloud using TransCAR.