Multi-modal 3D object detection method based on knowledge distillation category self-adaptive fusion
By designing a category-aware adaptive fusion network and a knowledge distillation framework, the problems of poor fusion and high computational complexity in multimodal 3D object detection are solved, achieving high-precision and efficient target detection for autonomous driving.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TAIYUAN UNIVERSITY OF SCIENCE AND TECHNOLOGY
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
AI Technical Summary
Existing point cloud-based 3D object detection frameworks suffer from problems such as poor multimodal information fusion, high computational complexity, long training time, and difficulty in model deployment in autonomous driving, and thus cannot meet practical needs.
We design a category-aware adaptive fusion network, which adaptively adjusts the grid by using density-aware angle partitioning and distance-aware radial partitioning weight prediction modules, and combines them with a point weight prediction module to achieve feature aggregation. We construct a teacher-student knowledge distillation framework for lightweight processing, and use logits distillation, feature distillation and label distillation for multimodal feature fusion.
It improves the accuracy and robustness of multimodal 3D target detection, achieves a balance between detection efficiency and performance, and reduces model complexity and computational overhead.
Abstract figure: CN122244835A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the technical field of 3D object detection, and in particular relates to a multimodal 3D object detection method based on knowledge distillation and category-adaptive fusion.
Background Technology
[0002] With the rapid development of the autonomous driving industry, vehicle perception systems, a key technology in autonomous driving, have received increasing attention. Ensuring safety during autonomous driving places higher requirements on object detection technology. However, current point cloud-based 3D object detection frameworks do not meet practical needs well, and a series of problems remain in real-world applications.
[0003] First, current point cloud-based 3D object detection frameworks cannot adequately meet the practical application needs of autonomous driving. To achieve higher-precision 3D object detection and overcome the limitations of pure point cloud 3D information representation, researchers have gradually chosen to introduce new modal information to supplement point cloud data, thus giving rise to the technical path of multi-sensor complementary information fusion. However, due to the introduction of new modal information, if the relationships between multimodal information are not handled well, the newly introduced modal information will become noise. This not only fails to improve the accuracy of 3D object detection but also negatively impacts the original pure point cloud 3D object detection framework. Therefore, achieving good multimodal data alignment and fusion remains a major challenge for this technical path.
[0004] Second, because existing multimodal 3D object detection frameworks process features from both images and point clouds, they often have a very large number of model parameters, which significantly increases computational complexity. Furthermore, the vast amount of point cloud data in autonomous driving makes model training lengthy and computationally expensive and makes the models difficult to deploy, which hinders practical application.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention proposes a multimodal 3D object detection method based on knowledge distillation and category-adaptive fusion. The method designs a category-aware adaptive fusion network, with independent adaptive adjustment networks for different categories, further improving the quality of cross-modal feature fusion and achieving high-precision 3D object detection. Furthermore, it constructs a student-teacher knowledge distillation framework by combining logits distillation, feature distillation, and label distillation, achieving lightweight processing while ensuring the detection performance of the student model, thus achieving a balance between detection efficiency and performance.
[0006] The technical solution protected by this invention is: a multimodal 3D target detection method based on knowledge distillation category adaptive fusion, specifically carried out according to the following steps:
[0007] Step S1: Construct a multimodal 3D object detection network based on category-aware adaptive fusion (CAAF-DET3D). The network includes two feature fusion branches: a cylindrical coordinate branch and a bird's-eye view branch.
[0008] For images and point cloud data with different target categories, the cylindrical coordinate branch uses adaptive grids and adaptive weights to perform multimodal feature fusion in the cylindrical coordinate system;
[0009] The bird's-eye view branch uses bilinear interpolation to perform a second fusion of multimodal features, and then inputs the fused features into a conventional two-stage 3D object detection network to finally achieve 3D object detection;
[0010] Step S2: Using the multimodal 3D object detection network constructed in step S1, a high-precision teacher model is trained on a large public dataset;
[0011] Step S3: Construct a knowledge distillation framework for the teacher model and the student model, perform lightweight processing on the student model, and use logits distillation, feature distillation, and label distillation to guide the training of the student model;
[0012] Step S4: Perform object detection using the trained student model.
[0013] Furthermore, the specific process of constructing a multimodal 3D object detection network based on category-aware adaptive fusion in step S1 is as follows:
[0014] Step S11: Project the original point cloud from the Cartesian coordinate system to the cylindrical coordinate system. For any point $p_i = (x_i, y_i, z_i)$, $i = 1, \dots, N$, in the original point cloud, perform the coordinate transformation of formula (1):
[0015] $\rho_i = \sqrt{x_i^2 + y_i^2}, \quad \theta_i = \operatorname{atan2}(y_i, x_i), \quad z_i = z_i$ (1)
[0016] where $(x_i, y_i, z_i)$ are the coordinates of point $i$ in the Cartesian coordinate system, $\rho_i$ is its radial coordinate in the cylindrical coordinate system, $N$ is the number of points in a point cloud scene, $\theta_i$ is its angular coordinate in the cylindrical coordinate system, and $z_i$ is its height coordinate in the cylindrical coordinate system;
[0017] Step S12: Based on the target category labels in the dataset, predefine for each category a set of initial grids that match that category's characteristics.
[0018] Step S13: Use the density-aware-angle partitioning weight prediction module and the distance-aware-radial partitioning weight prediction module to predict the weights of points of different categories in the point cloud space, and adaptively adjust the initial grid according to the predicted weights.
[0019] Step S14: Use the point weight prediction module to predict the contribution weight of the point-level features and generate weighted point-level features.
[0020] Step S15: Aggregate the point-level features generated in step S14 onto the adjusted fusion grid:
[0021] (9)
[0022] where the symbols in formula (9) denote, in order, the aggregated grid features and the max aggregation over each group of points;
[0023] Step S16: After the category-aware adaptive fusion grid has been applied in the cylindrical coordinate branch, the fused and enhanced point cloud features are input into the bird's-eye view branch, where bilinear interpolation performs the second multimodal feature fusion. The multimodal features are then converted into point-wise features for multi-view fusion. Finally, the complete 3D object detection task is carried out using the Region Proposal Network (RPN), RoI pooling, and anchor-based detection head of the conventional two-stage 3D object detection network PV-RCNN.
[0024] Furthermore, the design process of the density-aware angle partitioning weight prediction module and the distance-aware radial partitioning weight prediction module in step S13 is as follows:
[0025] First, the design of the density-aware angle partitioning weight prediction module is expressed as follows:
[0026] (2)
[0027] (3)
[0028] where the symbols in formulas (2) and (3) denote, in order: the output weights of the density-aware module; the output weights of the angle weight prediction module; the sigmoid function; a two-layer MLP; the scene context features produced by the scene context encoder; and the softmax function;
[0029] Second, the design of the distance-aware radial partitioning weight prediction module is expressed as follows:
[0030] (4)
[0031] (5)
[0032] where the symbols in formulas (4) and (5) denote, in order, the output weights of the distance-aware module and the output weights of the radial weight prediction module;
[0033] Finally, the fusion grid is adaptively adjusted: the prediction weights generated by the two modules are input into a multilayer perceptron, which outputs a scaling factor; the scaling factor is then constrained to a preset range by the sigmoid function; and finally, the scaling factor output by the sigmoid function is multiplied by the initial grid to adjust the grid size adaptively while avoiding excessive adjustment, specifically as follows:
[0034] (6)
[0035] where the symbols in formula (6) denote the adjusted grid and the output scaling factors in the angular and radial directions.
[0036] Furthermore, the design process of the point weight prediction module in step S14 is as follows:
[0037] (7)
[0038] where the symbols in formula (7) denote, in order: the predicted contribution weight of point i; a three-layer MLP; and the contextual features of point i;
[0039] The weighted generation of point-level features is expressed as follows:
[0040] (8)
[0041] where the symbols in formula (8) denote, in order, the generated point-level features and the features obtained by point-wise weighting.
[0042] Furthermore, the specific process of constructing the knowledge distillation framework for the teacher and student models in step S3 is as follows:
[0043] The knowledge distillation framework for the constructed teacher and student models includes a three-part distillation process: logits distillation, feature distillation, and label distillation.
[0044] Step S31, logits distillation: while the student model is being trained, the parameters of the teacher model are frozen. Logits distillation enables the student model to learn from the teacher model's regression and classification predictions, specifically as follows:
[0045] (10)
[0046] (11)
[0047] (12)
[0048] where the symbols in formulas (10) to (12) denote, in order: the classification and bounding-box regression logits output by the teacher model; the classification and bounding-box regression logits output by the student model; the sigmoid function; the weights of the classification distillation loss and the regression distillation loss; the foreground mask; a very small constant that serves as a numerical stability term; the normalization weight for positive samples; the SmoothL1 loss function; the encoding transformation applied after taking the sine difference on the angle dimension; and element-wise multiplication;
[0049] Step S32, feature distillation: during feature learning, the teacher model parameters are frozen. For intermediate features, a feature distillation loss drives the student model's feature maps as close as possible to those of the teacher network. The specific design of feature distillation is as follows:
[0050] (13)
[0051] where the symbols in formula (13) denote, in order: the feature distillation loss weight; the number of RoIs; the number of feature channels after alignment; the feature map height; the feature map width; and the teacher and student features at the i-th RoI, c-th channel, and position (h, w).
[0052] Step S33, label distillation: given a point cloud x and its corresponding set of ground-truth boxes y, label distillation obtains the teacher predictions and their confidence scores from the pre-trained teacher model. Filtering the predictions by confidence score with a threshold of 0.6 yields high-quality teacher predictions, which are combined with the ground-truth box set to form an extended target set. This set is used to assign labels for the student network, after which the regression and classification losses are calculated.
[0053] The total distillation loss of the framework can be expressed as:
[0054] (14)
[0055] Furthermore, the total training loss of the overall framework is expressed as:
[0056] (15)
[0057] where the symbols in formula (15) denote, in order, the loss of the RPN stage of 3D object detection and the loss of the RCNN stage of 3D object detection.
[0058] The present invention has the following advantages compared with the prior art.
[0059] 1. This invention proposes a multimodal 3D object detection network based on category-aware adaptive fusion. Independent adaptive grids are created for different categories in the detection network and are adjusted adaptively by a density-aware angle partitioning weight prediction module and a distance-aware radial partitioning weight prediction module. Feature aggregation is then performed in combination with a point weight prediction module. This refines the multimodal feature fusion process, yields higher-quality multimodal feature fusion, and significantly improves the detection accuracy and robustness of the multimodal 3D object detection framework.
[0060] 2. This invention designs a density-aware angle partitioning weight prediction module and a distance-aware radial partitioning weight prediction module. These two modules adaptively adjust the grid size by predicting the weights of radial and angular features in cylindrical coordinates, overcoming the limitation that fixed grid partitioning in existing technologies places on the quality of multimodal feature fusion.
[0061] 3. This invention constructs a knowledge distillation framework that combines logits distillation, feature distillation, and label distillation. The student model gains detection efficiency from its lightweight structure and learns the feature representation ability of the teacher model through the knowledge distillation framework, which preserves its detection accuracy. This balances performance and efficiency in the multimodal 3D object detection framework.
Attached Figure Description
[0062] The present invention will now be described in further detail with reference to the accompanying drawings.
[0063] Figure 1 is a schematic diagram of the overall framework of the detection method of the present invention.
[0064] Figure 2 is a schematic diagram of the detailed framework of the detection method of the present invention.
[0065] Figure 3 is a comparison of the present invention with existing 3D object detection frameworks.
[0066] Figure 4 shows experimental data on the model complexity of the multimodal 3D object detection framework of the present invention.
[0067] Figure 5 shows the ablation experiments of the present invention.
[0068] Figure 6 is a visualization comparing the detection results of the present invention with those of the baseline model.
Detailed Implementation
[0069] To make the objectives, features, and advantages of the present invention readily apparent, the specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
[0070] As shown in Figure 1, the multimodal 3D object detection method based on knowledge distillation and category-adaptive fusion is carried out according to the following steps:
[0071] Step S1: Construct a Category-Aware Adaptive Fusion Multimodal 3D Object Detection Framework (CAAF-DET3D). This network includes two feature fusion branches: a cylindrical coordinate branch and a bird's-eye view branch. The cylindrical coordinate branch performs multimodal feature fusion in cylindrical coordinates using adaptive grids and adaptive weights for images and point cloud data with different object categories. The bird's-eye view branch uses bilinear interpolation for a second fusion of multimodal features, and then inputs the fused features into a conventional two-stage 3D object detection network to ultimately achieve 3D object detection.
[0072] The specific process of constructing the CAAF-DET3D network is as follows:
[0073] Step S11: Project the original point cloud from the Cartesian coordinate system to the cylindrical coordinate system. For any point $p_i = (x_i, y_i, z_i)$, $i = 1, \dots, N$, in the original point cloud, perform the coordinate transformation of formula (1):
[0074] $\rho_i = \sqrt{x_i^2 + y_i^2}, \quad \theta_i = \operatorname{atan2}(y_i, x_i), \quad z_i = z_i$ (1)
[0075] where $(x_i, y_i, z_i)$ are the coordinates of point $i$ in the Cartesian coordinate system, $\rho_i$ is its radial coordinate in the cylindrical coordinate system, $N$ is the number of points in a point cloud scene, $\theta_i$ is its angular coordinate in the cylindrical coordinate system, and $z_i$ is its height coordinate in the cylindrical coordinate system.
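The coordinate transformation of step S11 can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Cartesian-to-cylindrical conversion; the function name `cartesian_to_cylindrical` and the output layout `(rho, theta, z)` are chosen here for illustration and do not come from the patent text.

```python
import math

def cartesian_to_cylindrical(points):
    """Convert (x, y, z) points to (rho, theta, z) tuples.

    rho   = sqrt(x^2 + y^2)  radial distance from the z-axis
    theta = atan2(y, x)      azimuth angle in [-pi, pi]
    z     = z                height, unchanged
    """
    converted = []
    for x, y, z in points:
        rho = math.hypot(x, y)
        theta = math.atan2(y, x)
        converted.append((rho, theta, z))
    return converted

# Two example points (illustrative values, not from the patent).
cloud = [(3.0, 4.0, 1.5), (0.0, 2.0, -0.5)]
cylindrical = cartesian_to_cylindrical(cloud)
```

Using `atan2` rather than `atan(y / x)` keeps the angle well defined in all four quadrants and when `x` is zero.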
[0076] Step S12: Predefine an initial grid that matches the characteristics of the target category. Specifically, based on the target category labels in the dataset, for each category, predefine a set of initial grids that match the category characteristics.
[0077] For example, when the dataset contains four categories (car, pedestrian, bicycle, and background), the initial grids take into account the differences in size, shape, and distribution density among the categories. A relatively large initial grid is used for the car category, to cover its larger and more regularly shaped spatial extent; a finer-grained, small initial grid is used for the pedestrian category, to capture the distribution of its small and numerous targets; a medium initial grid, between the two, is used for the bicycle category; and a default medium initial grid is used for the background category. This category-aware initial grid configuration lets the features of each category be aggregated and encoded at a more suitable spatial resolution, thereby improving overall detection performance.
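A category-aware initial grid can be pictured as a lookup table of cell sizes. The sketch below is only an illustration: the numeric cell sizes are hypothetical placeholders (the text gives only the ordering car > bicycle and background > pedestrian), and the names `INITIAL_GRIDS` and `cell_index` are introduced here.

```python
import math

# Hypothetical cell sizes per category: (radial step in metres, angular step in radians).
# Only the relative ordering (car largest, pedestrian smallest) follows the description.
INITIAL_GRIDS = {
    "car": (0.40, math.radians(2.0)),
    "bicycle": (0.20, math.radians(1.0)),
    "background": (0.20, math.radians(1.0)),
    "pedestrian": (0.10, math.radians(0.5)),
}

def cell_index(rho, theta, category):
    """Return the (radial, angular) cell index of a cylindrical point."""
    radial_step, angular_step = INITIAL_GRIDS[category]
    radial_bin = int(rho // radial_step)
    # Shift theta from [-pi, pi] to [0, 2*pi] so that bins are non-negative.
    angular_bin = int((theta + math.pi) // angular_step)
    return radial_bin, angular_bin

car_cell = cell_index(5.15, 0.3, "car")
pedestrian_cell = cell_index(5.15, 0.3, "pedestrian")
```

The same point falls into a coarser cell under the car grid and a finer cell under the pedestrian grid, which is the resolution difference the example in the text describes.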
[0078] Step S13: Design a density-aware angle partitioning weight prediction module and a distance-aware radial partitioning weight prediction module. Use these two modules to predict the weights of different categories of points in the point cloud space, and adaptively adjust the initial grid according to the predicted weights.
[0079] First, the design of the density-aware angle partitioning weight prediction module can be expressed as:
[0080] (2)
[0081] (3)
[0082] where the symbols in formulas (2) and (3) denote, in order: the output weights of the density-aware module; the output weights of the angle weight prediction module; the sigmoid function; a two-layer MLP; the scene context features produced by the scene context encoder; and the softmax function.
[0083] Secondly, the design process of the distance-aware radial partitioning weight prediction module can be expressed as follows:
[0084] (4)
[0085] (5)
[0086] where the symbols in formulas (4) and (5) denote, in order, the output weights of the distance-aware module and the output weights of the radial weight prediction module.
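The exact formulas for the two weight prediction modules are not reproduced in this text, so the following is only one possible reading of the listed components: a two-layer MLP over the scene context features, a sigmoid that yields awareness weights, and a softmax that normalizes them into partition weights. The wiring, the ReLU activation, and all names (`two_layer_mlp`, `partition_weights`) are assumptions made for illustration.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def two_layer_mlp(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear; each weight matrix is a list of rows."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def partition_weights(context, w1, b1, w2, b2):
    """Return (awareness weights, normalized partition weights)."""
    logits = two_layer_mlp(context, w1, b1, w2, b2)
    awareness = [sigmoid(v) for v in logits]  # each value lies in (0, 1)
    partition = softmax(awareness)            # sums to 1 over the bins
    return awareness, partition

# Toy example: 2-dim scene context, 2 hidden units, 3 angular (or radial) bins.
context = [0.5, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
awareness, partition = partition_weights(context, w1, b1, w2, b2)
```

The same function could serve either module: with angular bins for the density-aware module, or with radial bins for the distance-aware module.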
[0087] Finally, the fusion grid is adaptively adjusted. The prediction weights generated by the two modules are input into a multilayer perceptron, which outputs a scaling factor. The scaling factor is then constrained to a preset range by the sigmoid function. Finally, the scaling factor output by the sigmoid function is multiplied by the initial grid to adjust the grid size adaptively while avoiding excessive adjustment, which can be specifically expressed as:
[0088] (6)
[0089] where the symbols in formula (6) denote the adjusted grid and the output scaling factors in the angular and radial directions.
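The bounded scaling step can be illustrated as follows. This sketch assumes the preset range is an interval `[s_min, s_max]` into which the sigmoid output is mapped linearly; the bounds 0.5 and 1.5 and the name `adjust_grid` are hypothetical, since the text only speaks of "a preset range".

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def adjust_grid(initial_size, raw_scale, s_min=0.5, s_max=1.5):
    """Scale one grid dimension by a sigmoid-bounded factor.

    raw_scale stands in for the unbounded MLP output. The sigmoid maps it
    to (0, 1), which is then stretched to the assumed range (s_min, s_max),
    so the grid can never shrink or grow beyond those limits.
    """
    scale = s_min + (s_max - s_min) * sigmoid(raw_scale)
    return initial_size * scale

# Apply separate factors in the radial and angular directions.
radial_size = adjust_grid(0.40, raw_scale=0.0)    # factor 1.0, size unchanged
angular_size = adjust_grid(0.035, raw_scale=2.0)  # factor above 1, coarser cells
```

Bounding the factor is what prevents over-adjustment: however large the MLP output becomes, the adjusted cell size stays within a fixed multiple of the initial one.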
[0090] Step S14: Design a point weight prediction module to predict the contribution weight of point-level features and generate weighted point-level features.
[0091] The point weight prediction module is designed as follows:
[0092] (7)
[0093] where the symbols in formula (7) denote, in order: the predicted contribution weight of point i; a three-layer MLP; and the contextual features of point i.
[0094] The weighted generation of point-level features can be expressed as:
[0095] (8)
[0096] where the symbols in formula (8) denote, in order, the generated point-level features and the features obtained by point-wise weighting.
[0097] Step S15: Aggregate the point-level features onto the adjusted fusion grid:
[0098] (9)
[0099] where the symbols in formula (9) denote, in order, the aggregated grid features and the max aggregation over each group of points.
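Steps S14 and S15 together amount to weighting each point feature and then max-pooling the weighted features per grid cell. The sketch below assumes the contribution weight is a scalar per point and that the maximum is taken channel by channel; the function name `aggregate_max` and the toy values are illustrative only.

```python
def aggregate_max(features, weights, cell_ids):
    """Weight each point feature, then max-pool weighted features per cell.

    features: list of equal-length feature vectors, one per point
    weights:  list of scalar contribution weights, one per point
    cell_ids: list of hashable grid-cell indices, one per point
    Returns a dict mapping each occupied cell to its aggregated feature.
    """
    cells = {}
    for feature, weight, cell in zip(features, weights, cell_ids):
        weighted = [weight * value for value in feature]
        if cell not in cells:
            cells[cell] = weighted
        else:
            # Channel-wise maximum over all points that fall into this cell.
            cells[cell] = [max(a, b) for a, b in zip(cells[cell], weighted)]
    return cells

features = [[1.0, 4.0], [3.0, 2.0], [5.0, 5.0]]
weights = [1.0, 0.5, 0.2]
cell_ids = [(0, 0), (0, 0), (1, 2)]
grid_features = aggregate_max(features, weights, cell_ids)
```

Here the first two points share cell `(0, 0)`, so its feature is the channel-wise maximum of `[1.0, 4.0]` and `[1.5, 1.0]`.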
[0100] Step S16: After the category-aware adaptive fusion grid has been applied in the cylindrical coordinate branch, the enhanced point cloud features are input into the bird's-eye view branch, where bilinear interpolation performs a second multimodal feature fusion. The multimodal features are then converted into point-wise features for multi-view fusion. Finally, the complete 3D object detection task is carried out using the Region Proposal Network (RPN), RoI pooling, and anchor-based detection head of the conventional two-stage 3D object detection network PV-RCNN.
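Bilinear interpolation, as used in the bird's-eye view branch, samples a feature map at a continuous location by blending the four surrounding cells. The following is a generic single-channel sketch, not the patent's implementation; how points are projected onto the feature map is outside its scope, and `bilinear_sample` is a name introduced here.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a 2D feature map (list of rows) at continuous (x, y).

    x indexes columns and y indexes rows; coordinates are clamped so that
    samples outside the map take the nearest border value.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature_map = [[0.0, 10.0],
               [20.0, 30.0]]
center_value = bilinear_sample(feature_map, 0.5, 0.5)
```

In a PyTorch pipeline such as the one named in the experimental setup, the same operation would normally be done on whole tensors rather than one location at a time.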
[0101] Step S2: Using the constructed CAAF-DET3D framework, a high-precision teacher model is trained on a large public dataset.
[0102] Step S3: Construct a knowledge distillation framework for the teacher model and the student model, perform lightweight processing on the student model, and use logits distillation, feature distillation, and label distillation to guide the training of the student model.
[0103] The knowledge distillation framework for the constructed teacher and student models consists of three distillation processes: logits distillation, feature distillation, and label distillation.
[0104] Step S31, logits distillation: while the student model is being trained, the parameters of the teacher model are frozen. Logits distillation enables the student model to learn from the teacher model's regression and classification predictions, which can be specifically expressed as:
[0105] (10)
[0106] (11)
[0107] (12)
[0108] where the symbols in formulas (10) to (12) denote, in order: the classification and bounding-box regression logits output by the teacher model; the classification and bounding-box regression logits output by the student model; the sigmoid function; the weights of the classification distillation loss and the regression distillation loss; the foreground mask; a very small constant that serves as a numerical stability term; the normalization weight for positive samples; the SmoothL1 loss function; the encoding transformation applied after taking the sine difference on the angle dimension; and element-wise multiplication.
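Because formulas (10) to (12) are not reproduced in this text, the sketch below only shows how the listed ingredients could fit together. It assumes a squared difference between sigmoid-activated classification logits, a SmoothL1 penalty on box residuals with the heading angle compared through `sin(student - teacher)`, a binary foreground mask, and normalization by the number of foreground anchors plus a small constant. The exact loss form, the default weights, and all names are assumptions.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def smooth_l1(diff, beta=1.0):
    magnitude = abs(diff)
    if magnitude < beta:
        return 0.5 * magnitude * magnitude / beta
    return magnitude - 0.5 * beta

def logits_distill_loss(t_cls, s_cls, t_box, s_box, fg_mask,
                        w_cls=1.0, w_reg=1.0, eps=1e-6):
    """t_cls, s_cls: per-anchor classification logits (teacher, student)
    t_box, s_box: per-anchor box vectors whose last entry is the heading angle
    fg_mask:      1 for foreground anchors, 0 otherwise
    """
    norm = sum(fg_mask) + eps  # eps keeps the division stable with no foreground
    cls_loss = sum(m * (sigmoid(s) - sigmoid(t)) ** 2
                   for t, s, m in zip(t_cls, s_cls, fg_mask)) / norm
    reg_loss = 0.0
    for t_vec, s_vec, m in zip(t_box, s_box, fg_mask):
        residuals = [s - t for s, t in zip(s_vec[:-1], t_vec[:-1])]
        residuals.append(math.sin(s_vec[-1] - t_vec[-1]))  # sine difference on angle
        reg_loss += m * sum(smooth_l1(r) for r in residuals)
    reg_loss /= norm
    return w_cls * cls_loss + w_reg * reg_loss
```

Multiplying by the foreground mask means that anchors the teacher treats as background contribute nothing, so the student is not pushed to imitate uninformative predictions.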
[0109] Step S32, feature distillation: during feature learning, the teacher model parameters are frozen. For intermediate features, a feature distillation loss drives the student model's feature maps as close as possible to those of the teacher network. The specific design of feature distillation is as follows:
[0110] (13)
[0111] where the symbols in formula (13) denote, in order: the feature distillation loss weight; the number of RoIs; the number of feature channels after alignment; the feature map height; the feature map width; and the teacher and student features at the i-th RoI, c-th channel, and position (h, w).
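A feature distillation loss of the kind described, normalized over RoIs, channels, height, and width, might look like the sketch below. It assumes a squared-error distance (the text does not state which distance formula (13) uses) and assumes the student features have already been aligned to the teacher's channel count; `feature_distill_loss` is a name introduced here.

```python
def feature_distill_loss(teacher, student, weight=1.0):
    """Weighted mean squared difference between aligned feature tensors.

    teacher, student: nested lists indexed as [roi][channel][row][column],
    with identical shapes after channel alignment.
    """
    total, count = 0.0, 0
    for t_roi, s_roi in zip(teacher, student):
        for t_channel, s_channel in zip(t_roi, s_roi):
            for t_row, s_row in zip(t_channel, s_channel):
                for t_value, s_value in zip(t_row, s_row):
                    total += (t_value - s_value) ** 2
                    count += 1
    # count equals N_roi * C * H * W, the normalization named in the text.
    return weight * total / count

teacher_features = [[[[1.0, 2.0], [3.0, 4.0]]]]  # 1 RoI, 1 channel, 2 x 2 map
student_features = [[[[1.0, 2.0], [3.0, 2.0]]]]
loss = feature_distill_loss(teacher_features, student_features)
```

Since the student described later has half the feature channels, some projection layer would be needed before this comparison; that layer is not specified in the text and is not modeled here.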
[0112] Step S33, label distillation, aims to expand the anchor boxes in each point cloud frame. In the label assignment stage of the student model, it uses not only the original ground-truth boxes but also high-confidence prediction boxes generated by the teacher model. Specifically, given a point cloud x and its corresponding set of ground-truth boxes y, label distillation obtains the teacher predictions and their confidence scores from the pre-trained teacher model. Filtering the predictions by confidence score with a threshold of 0.6 yields high-quality teacher predictions, which are combined with the ground-truth box set to form an extended target set. This set is used to assign labels for the student network, after which the regression and classification losses are calculated.
[0113] The total distillation loss of the framework can be expressed as:
[0114] (14)
[0115] Ultimately, the total training loss of the entire framework is expressed as:
[0116] (15)
[0117] where the symbols in formula (15) denote, in order, the loss of the RPN stage of 3D object detection and the loss of the RCNN stage of 3D object detection.
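The label distillation step reduces to a filter followed by a union. The sketch below uses the 0.6 threshold from the text; whether the comparison is strict and whether duplicates of ground-truth boxes are removed are not stated, so the strict `>` and the plain concatenation are assumptions, as is the name `build_student_targets`.

```python
def build_student_targets(gt_boxes, teacher_boxes, teacher_scores, threshold=0.6):
    """Extend the ground-truth boxes with confident teacher predictions.

    gt_boxes:       list of ground-truth boxes
    teacher_boxes:  list of boxes predicted by the frozen teacher
    teacher_scores: confidence score of each teacher box
    """
    confident = [box for box, score in zip(teacher_boxes, teacher_scores)
                 if score > threshold]
    return list(gt_boxes) + confident

# Boxes are shown as (x, y, z) centres only, to keep the example short.
gt_boxes = [(0.0, 0.0, 0.0)]
teacher_boxes = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
teacher_scores = [0.9, 0.4]
targets = build_student_targets(gt_boxes, teacher_boxes, teacher_scores)
```

The resulting target set is then used for label assignment exactly as ground-truth boxes would be, before the regression and classification losses are computed.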
[0118] Step S4: Perform object detection using the trained student model.
[0119] The multimodal 3D target detection method based on knowledge distillation and adaptive category fusion of the present invention has been described in detail above. The following section presents a simulation experiment of the multimodal 3D target detection method based on knowledge distillation and adaptive category fusion of the present invention.
[0120] Experimental setup
[0121] The method of this invention was evaluated on the large public dataset KITTI, based on the open-source platform PyTorch, using a 64-bit Ubuntu operating system and an Nvidia RTX K8000 graphics card with a maximum video memory of 48GB.
[0122] The teacher model, CAAF-DET3D, designed in this invention, was trained for 80 epochs on the KITTI dataset, and its parameters were then frozen to guide the student model. The student model is a lightweight version of CAAF-DET3D in which the number of feature channels is halved. This embodiment uses the Adam optimizer, with an initial learning rate of 0.001 and a batch size of 4.
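The stated training setup can be captured in a small configuration object. The hyperparameter values below are the ones given in the text; the class and field names, and the example teacher channel widths, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    dataset: str = "KITTI"
    teacher_epochs: int = 80      # teacher training length stated in the text
    optimizer: str = "adam"
    learning_rate: float = 1e-3   # initial learning rate
    batch_size: int = 4

def student_channels(teacher_channels):
    """Halve every feature-channel width to obtain the lightweight student."""
    return [max(1, width // 2) for width in teacher_channels]

config = TrainConfig()
# Hypothetical teacher widths, used only to show the halving.
student_widths = student_channels([64, 128, 256])
```

The number of epochs used for the student is not stated in the text, so the configuration records only the teacher's.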
[0123] Figure 2 shows the detailed modules of the category-aware adaptive fusion multimodal 3D object detection framework (CAAF-DET3D) designed in this invention, including the design details of the density-aware angle partitioning weight prediction module and the distance-aware radial partitioning weight prediction module.
[0124] Figure 3 compares the 3D object detection accuracy of the model of this invention with that of other state-of-the-art models on the KITTI dataset. The experimental results show that the model designed in this invention achieves the highest or second-highest average 3D detection accuracy in 6 of the 9 metrics, and the highest average 3D detection accuracy (76.13%) across all 9 metrics. The method of this invention also surpasses the baseline network in all 9 metrics. These results demonstrate the effectiveness of the proposed category-aware adaptive network framework design.
[0125] Figure 4 demonstrates the effectiveness of the teacher-student knowledge distillation framework designed in this invention, which combines feature distillation, logits distillation, and label distillation. The experimental results in the figure show that this invention obtains a significant improvement in detection efficiency at the cost of a small loss in detection accuracy, and that its detection accuracy remains superior to that of other advanced multimodal 3D object detection models. The number of parameters is close to that of the baseline pure point cloud network, and the model complexity and number of activations are greatly reduced, which significantly improves detection efficiency and balances detection performance against detection efficiency in the multimodal 3D object detection framework.
[0126] Figure 5 shows visualization results of the framework designed in this invention on the KITTI dataset, illustrating the differences between the model of this invention and the baseline model. As shown in the figure, the model of this invention exhibits excellent detection performance in long-distance detection, detection of occluded objects, and complex road conditions.
[0127] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A multimodal 3D object detection method based on knowledge distillation and category-adaptive fusion, characterized in that it is carried out according to the following steps: Step S1: Construct a multimodal 3D object detection network CAAF-DET3D based on category-aware adaptive fusion. This network includes two feature fusion branches: a cylindrical coordinate branch and a bird's-eye view branch. For images and point cloud data with different target categories, the cylindrical coordinate branch uses adaptive grids and adaptive weights to perform multimodal feature fusion in the cylindrical coordinate system; the bird's-eye view branch uses bilinear interpolation to perform a second fusion of multimodal features, and then inputs the fused features into a conventional two-stage 3D object detection network to finally achieve 3D object detection; Step S2: Using the multimodal 3D object detection network constructed in step S1, a high-precision teacher model is trained on a large public dataset; Step S3: Construct a knowledge distillation framework for the teacher model and the student model, perform lightweight processing on the student model, and use logits distillation, feature distillation, and label distillation to guide the training of the student model; Step S4: Use the trained student model to perform object detection.
2. The multimodal 3D target detection method based on knowledge distillation category adaptive fusion according to claim 1, characterized in that the specific process of constructing the multimodal 3D object detection network based on category-aware adaptive fusion in step S1 is as follows:
Step S11: Project the original point cloud from the Cartesian coordinate system to the cylindrical coordinate system. For any point $p_i$ in the original point cloud, perform the coordinate transformation of formula (1):
$$\rho_i = \sqrt{x_i^2 + y_i^2}, \qquad \theta_i = \arctan\frac{y_i}{x_i}, \qquad z_i = z_i \qquad (1)$$
where $(x_i, y_i, z_i)$ are the coordinates of point $p_i$ in the Cartesian coordinate system, $i = 1, \dots, N$, $N$ is the number of points in a point cloud scene, $\rho_i$ is the radial coordinate in the cylindrical coordinate system, $\theta_i$ is the angular coordinate in the cylindrical coordinate system, and $z_i$ is the elevation coordinate in the cylindrical coordinate system;
Step S12: Based on the target category labels in the dataset, predefine a set of initial grids that match the characteristics of each target category;
Step S13: Use the density-aware angle partitioning weight prediction module and the distance-aware radial partitioning weight prediction module to predict the weights of points of different categories in the point cloud space, and adaptively adjust the initial grid according to the predicted weights;
Step S14: Use the point weight prediction module to predict the contribution weight of the point-level features and generate weighted point-level features;
Step S15: Aggregate the point-level features generated in step S14 onto the adjusted fusion grid according to formula (9), in which the aggregated grid features are obtained by taking the maximum value over the points grouped in each grid cell;
Step S16: After the category-aware adaptive fusion grid has been applied in the cylindrical coordinate branch, the fused and enhanced point cloud features are input into the bird's-eye view branch, where bilinear interpolation performs the second multimodal feature fusion. The multimodal features are then converted into point-wise features for multi-view fusion. Finally, the complete 3D object detection task is carried out using the Region Proposal Network (RPN), RoI pooling, and anchor-based detection head of the conventional two-stage 3D object detection network PV-RCNN.
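Steps S11 and S15 of claim 2 can be illustrated with a minimal sketch. It assumes a uniform cylindrical grid with fixed cell sizes, whereas the claimed method adjusts the grid adaptively per category; the function names, the use of atan2 to resolve the quadrant, and the cell-indexing scheme are illustrative assumptions and are not taken from the patent.

```python
import math

def cartesian_to_cylindrical(x, y, z):
    """Map a Cartesian point to (rho, theta, z).

    atan2 is used instead of a plain arctan(y / x) so that the quadrant is
    resolved and x == 0 is handled (an implementation assumption).
    """
    rho = math.hypot(x, y)
    theta = math.atan2(y, x)
    return rho, theta, z

def max_aggregate(points, features, d_rho, d_theta):
    """Element-wise max of point features inside each (radial, angular) cell.

    The uniform cell sizes d_rho and d_theta are hypothetical stand-ins for
    the adaptively adjusted grid described in the claim.
    """
    cells = {}
    for (x, y, z), feat in zip(points, features):
        rho, theta, _ = cartesian_to_cylindrical(x, y, z)
        theta = theta % (2 * math.pi)  # wrap the angle into [0, 2*pi)
        key = (int(rho // d_rho), int(theta // d_theta))
        if key in cells:
            cells[key] = [max(a, b) for a, b in zip(cells[key], feat)]
        else:
            cells[key] = list(feat)
    return cells

if __name__ == "__main__":
    pts = [(1.0, 0.0, 0.0), (1.2, 0.1, 0.0), (1.0, 3.0, 0.0)]
    feats = [[1.0, 5.0], [3.0, 2.0], [7.0, 7.0]]
    print(max_aggregate(pts, feats, d_rho=2.0, d_theta=math.pi / 2))
```

The first two points fall into the same cell, so their features are merged by taking the per-channel maximum; the third point lies farther from the origin and occupies its own cell.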
3. The multimodal 3D target detection method based on knowledge distillation category adaptive fusion according to claim 2, characterized in that the design of the density-aware angle partitioning weight prediction module and the distance-aware radial partitioning weight prediction module in step S13 is as follows:
First, the density-aware angle partitioning weight prediction module is defined by formulas (2) and (3), whose terms denote, in order: the output weights of the density-aware module; the output weights of the angle weight prediction module; the sigmoid function; two MLP layers; the scene context features obtained from the scene context encoder; and the softmax function;
Second, the distance-aware radial partitioning weight prediction module is defined by formulas (4) and (5), whose additional terms denote the output weights of the distance-aware module and the output weights of the radial weight prediction module;
Finally, the fusion grid is adaptively adjusted: the prediction weights generated by the two modules are input into a multilayer perceptron, which outputs scaling factors; the scaling factors are constrained to a preset range by the sigmoid function; and the constrained scaling factors are multiplied by the initial grid, which adapts the grid size while avoiding over-adjustment, as given by formula (6), whose terms denote the adjusted grid and the scaling factors output for the angular and radial directions.
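The final adjustment step of claim 3 (a sigmoid-bounded scaling factor multiplied by the initial grid) can be sketched as follows. The preset range [0.5, 1.5], the function names, and the treatment of the MLP outputs as plain input numbers are assumptions made for illustration; the patent text does not state the range or the exact form of formula (6).

```python
import math

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def bounded_scale(raw, lo=0.5, hi=1.5):
    """Squash an unbounded MLP output into [lo, hi] (hypothetical range)."""
    return lo + (hi - lo) * sigmoid(raw)

def adjust_grid(initial_size, raw_angle, raw_radial, lo=0.5, hi=1.5):
    """Scale the initial (angular, radial) cell size by bounded factors.

    raw_angle and raw_radial stand in for the scaling outputs of the
    multilayer perceptron, which is not modelled here.
    """
    d_theta0, d_rho0 = initial_size
    return (d_theta0 * bounded_scale(raw_angle, lo, hi),
            d_rho0 * bounded_scale(raw_radial, lo, hi))

if __name__ == "__main__":
    # A raw output of 0 maps to the midpoint of the range, i.e. a factor of 1.
    print(adjust_grid((0.1, 2.0), 0.0, 0.0))
    # Extreme raw outputs cannot push the cell size outside the preset range.
    print(adjust_grid((0.1, 2.0), 50.0, -50.0))
```

Because the sigmoid saturates, an arbitrarily large or small raw output changes the cell size by at most the preset bounds, which is one way to read the claim's "avoiding over-adjustment".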
4. The multimodal 3D target detection method based on knowledge distillation category adaptive fusion according to claim 3, characterized in that the design of the point weight prediction module in step S14 is as follows:
The contribution weight is predicted according to formula (7), whose terms denote the predicted contribution weight of point i, a three-layer MLP, and the contextual features of point i;
The weighted point-level features are generated according to formula (8), whose terms denote the generated point-level features and the features obtained by point-wise weighting.
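The point weight prediction of claim 4 can be pictured with a toy three-layer MLP followed by point-wise weighting. The ReLU activations, the final sigmoid that keeps each weight in (0, 1), and the hand-set layer parameters are assumptions for illustration only; formulas (7) and (8) are not reproduced in the text, so this is not the patented definition.

```python
import math

def linear(vec, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def point_weight(context, layers):
    """Three-layer MLP mapping a point's context features to one weight.

    `layers` is a list of three (weights, bias) pairs; the last layer has a
    single output, squashed by a sigmoid (an assumption, see lead-in).
    """
    hidden = context
    for weights, bias in layers[:-1]:
        hidden = relu(linear(hidden, weights, bias))
    weights, bias = layers[-1]
    logit = linear(hidden, weights, bias)[0]
    return 1.0 / (1.0 + math.exp(-logit))

def weight_features(features, contexts, layers):
    """Multiply each point's feature vector by its predicted weight."""
    return [[point_weight(ctx, layers) * f for f in feat]
            for feat, ctx in zip(features, contexts)]

if __name__ == "__main__":
    identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    layers = [identity, identity, ([[1.0, 1.0]], [0.0])]  # hypothetical parameters
    print(weight_features([[2.0, 4.0]], [[1.0, -1.0]], layers))
```

In a trained network the layer parameters would be learned, so that points judged more informative for a given category contribute more to the aggregated grid features.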
5. The multimodal 3D target detection method based on knowledge distillation category adaptive fusion according to claim 4, characterized in that the specific process of constructing the knowledge distillation framework for the teacher model and the student model in step S3 is as follows:
The knowledge distillation framework comprises three distillation processes: logits distillation, feature distillation, and label distillation.
Step S31, logits distillation: While the student model is learning, the parameters of the teacher model are frozen. Logits distillation enables the student model to learn from the teacher model's regression and classification predictions, as given by formulas (10), (11), and (12), whose terms denote: the logits outputs of the teacher model, namely its classification output and its bounding box regression output; the logits outputs of the student model, namely its classification output and its bounding box regression output; the sigmoid function; the weights of the classification distillation loss and of the regression distillation loss; the foreground mask; a very small constant that serves as a numerical stability term; the normalized weights for positive samples; the SmoothL1 loss function; the encoding transformation applied after taking the sine difference in the angular dimension; and element-wise multiplication;
Step S32, feature distillation: During feature learning, the parameters of the teacher model are frozen. For intermediate features, a feature distillation loss makes the student model's feature map as close as possible to that of the teacher network, as given by formula (13), whose terms denote: the feature distillation loss weight; the number of RoIs; the number of feature channels after alignment; the feature map height; the feature map width; and the teacher and student features at the i-th RoI, the c-th channel, and position (h, w);
Step S33, label distillation: Given a point cloud x and its corresponding set of ground-truth boxes y, label distillation obtains the teacher predictions and their confidence scores from the pre-trained teacher model. Filtering by confidence score with a threshold of 0.6 yields high-quality teacher predictions, which are combined with the ground-truth box set to form the label set used to assign labels to the student network and, finally, to calculate the regression and classification losses. The total distillation loss of the framework is given by formula (14).
6. The multimodal 3D target detection method based on knowledge distillation category adaptive fusion according to claim 5, characterized in that the total training loss of the overall framework is given by formula (15), whose terms include the loss of the RPN stage of 3D object detection and the loss of the RCNN stage of 3D object detection.
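Two parts of the distillation framework in claim 5 lend themselves to a short sketch: the confidence filtering of label distillation (step S33) and the normalization of the feature distillation loss (step S32). Only the threshold of 0.6 comes from the text. The box tuple layout, the strict ">" comparison, and the squared-error distance in the feature loss are assumptions, since formula (13) is not reproduced in the text.

```python
def build_distillation_labels(gt_boxes, teacher_boxes, teacher_scores,
                              threshold=0.6):
    """Merge ground-truth boxes with confident teacher predictions.

    Each box is a tuple such as (x, y, z, l, w, h, yaw); this layout is a
    hypothetical encoding. Teacher boxes scoring above `threshold` are kept.
    """
    kept = [box for box, score in zip(teacher_boxes, teacher_scores)
            if score > threshold]
    return list(gt_boxes) + kept

def feature_distillation_loss(teacher, student, weight=1.0):
    """Weighted mean squared difference over RoI, channel, height, width.

    `teacher` and `student` are nested lists indexed [roi][channel][h][w].
    The squared-error distance is an assumption, not the patented formula.
    """
    total, count = 0.0, 0
    for t_roi, s_roi in zip(teacher, student):
        for t_ch, s_ch in zip(t_roi, s_roi):
            for t_row, s_row in zip(t_ch, s_ch):
                for t, s in zip(t_row, s_row):
                    total += (t - s) ** 2
                    count += 1
    return weight * total / count

if __name__ == "__main__":
    gt = [(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, 0.0)]
    preds = [(10.0, 1.0, 0.0, 4.1, 1.9, 1.5, 0.1),
             (30.0, 5.0, 0.0, 0.8, 0.6, 1.7, 0.0)]
    print(build_distillation_labels(gt, preds, [0.9, 0.3]))
    print(feature_distillation_loss([[[[1.0, 2.0]]]], [[[[0.0, 0.0]]]]))
```

The merged label set is what the student network would be assigned targets against; the low-confidence teacher box is discarded so that it does not introduce noisy supervision.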