A computer-implemented method for optimizing a neural network model for 3D object detection
A layer-wise sparsity allocation framework using Hessian-based rate-distortion analysis efficiently prunes 3D object detection models, reducing computational requirements while preserving accuracy, addressing the limitations of existing methods in real-time applications.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- AGENCY FOR SCI TECH & RES
- Filing Date
- 2025-12-12
- Publication Date
- 2026-06-18
AI Technical Summary
Existing 3D object detection models in neural networks require substantial computational resources and memory, limiting real-time deployment in latency-critical applications like autonomous driving, and existing weight pruning techniques are sub-optimal in maintaining detection accuracy and scalability.
A layer-wise sparsity allocation framework using second-order Hessian-based rate-distortion analysis to minimize distortion in detection outputs, allowing for efficient weight pruning of 3D object detection models, which can be applied post-training without full retraining.
The framework achieves significant reductions in computational complexity while maintaining or improving detection accuracy, enabling efficient deployment in latency-sensitive applications such as autonomous driving, augmented reality, and robotics.
Abstract
Description
[0001] A Computer-Implemented Method for Optimizing a Neural Network Model for 3D Object Detection
[0002] Technical Field
[0003] The present invention relates generally to machine learning and computer vision, and more particularly to methods and systems for optimizing three-dimensional (3D) object detection using layer-wise pruning of deep neural networks.
[0004] Background
[0005] This background is provided to present the general context of the disclosure. The contents of this background section are neither expressly nor impliedly admitted as prior art against the present disclosure.
[0006] Three-dimensional (3D) deep learning has gained increasing attention in both research and industry due to its wide range of applications, such as autonomous driving, robotics, and augmented or virtual reality (AR / VR). In particular, 3D object detection constitutes a fundamental perception task for autonomous driving systems to accurately understand the driving environment and to support downstream decision-making processes.
[0007] Recent advances in LiDAR-based 3D object detection have achieved high detection accuracy by leveraging complex neural network architectures that process large and unstructured 3D point cloud data. However, these models generally require substantial computational and memory resources to achieve acceptable accuracy. Such high computational cost poses challenges for real-time deployment in latency-critical applications, such as autonomous driving, where fast inference is essential to ensure timely decision-making and safe navigation. To address these challenges, prior works have focused on reducing the computational cost of 3D object detection models. Some approaches exploit the spatial sparsity of 3D point cloud data to skip unnecessary computations, while others remove unimportant points or voxels from raw LiDAR data to reduce memory footprint. Although these methods achieve certain levels of acceleration by leveraging input sparsity, their optimizations are often agnostic to maintaining detection precision and are limited by the inherent redundancy reduction achievable at the data level.
[0008] In parallel, redundancy in the model weights of 3D object detection networks remains a significant but underexplored source of inefficiency. Existing weight pruning or sparsification techniques for 3D neural networks, such as those developed for segmentation tasks, typically remove convolutional connections based on heuristic metrics, such as neighbouring point access rates. However, such approaches tend to be sub-optimal compared to magnitude-based or Taylor-based pruning schemes, which have demonstrated superior performance in two-dimensional (2D) image recognition benchmarks. Furthermore, existing layer-wise pruning strategies often rely on greedy or exhaustive search procedures that require extensive real data collection and retraining, thereby limiting their scalability and practical applicability.
[0009] Accordingly, there remains a need for an efficient and generalized framework for pruning and compressing 3D object detection models that can effectively balance computational efficiency and detection accuracy.
[0010] Summary
[0011] According to a first aspect of the present invention, there is provided a computer-implemented method for optimizing a neural network model for 3D object detection, comprising:
[0012] receiving a pretrained 3D object detection model with multiple neural network layers; computing a layer-wise sparsity allocation across the detection model based on a predefined computational constraint;
[0013] transforming the layer-wise sparsity allocation into a layer-wise pruning ratio for each layer of the model, using second-order Hessian-based rate-distortion analysis, wherein the pruning ratio minimizes distortion in detection outputs;
[0014] applying the computed pruning ratios to remove redundant weights in each layer of the model; and
[0015] outputting a pruned, pre-trained 3D object detection model.
[0016] According to a second aspect of the present invention, there is provided a system for optimizing a neural network model for 3D object detection, the system comprising:
[0017] at least one memory; and
[0018] at least one processor communicatively coupled to the at least one memory and configured to:
[0019] receive a pretrained 3D object detection model with multiple neural network layers;
[0020] compute a layer-wise sparsity allocation across the detection model based on a predefined computational constraint; transform the layer-wise sparsity allocation into a layer-wise pruning ratio for each layer of the model using second-order Hessian-based rate-distortion analysis, wherein the pruning ratio minimizes distortion in detection outputs;
[0021] apply the computed pruning ratios to remove redundant weights from each layer of the model; and
[0022] output a pruned, pretrained 3D object detection model.
[0023] Brief description of the drawings
[0024] Embodiments of the present invention will now be described, by way of non-limiting example, with reference to the drawings in which: Figure 1 is a schematic illustration of a computer-implemented method for optimizing a neural network model for three-dimensional (3D) object detection, in accordance with various embodiments of the present invention.
[0025] Figure 2 is a flow chart diagram illustrating the computer-implemented method for optimizing a neural network model for 3D object detection, in accordance with various embodiments of the present invention.
[0026] Figure 3 depicts qualitative results of the optimized 3D object detection model on LiDAR data, in accordance with various embodiments of the present invention.
[0027] Figure 4 depicts the optimized layer-wise sparsity allocations for multiple 3D object detection networks under different FLOPS reduction targets, in accordance with various embodiments of the present invention.
[0028] Figure 5 shows layer-wise sparsity allocation results obtained using the present teachings under varying FLOPS constraint levels.
[0029] Detailed description
[0030] The present invention provides systems and methods for optimizing or accelerating three-dimensional (3D) object detection models through an efficient weight pruning framework. The proposed framework aims to reduce the computational complexity of 3D detection networks, measured for example by the number of floating-point operations (FLOPs), while maintaining or improving detection accuracy.
[0031] Various embodiments of the present method can be applied to 3D point cloud processing to achieve high performance in applications such as augmented or virtual reality (AR / VR), autonomous driving, and robotics. The method involves a layer-wise weight pruning scheme for 3D object detection that operates independently of existing point cloud sparsification methods. The proposed scheme identifies redundant parameters within a pretrained model whose removal results in minimal distortion to the detection output, where the distortion includes both localization (bounding box) distortion and classification confidence distortion. The framework is designed as a universal and model-agnostic pruning module that may be applied to arbitrary 3D detection architectures.
[0032] The pruning framework seeks to minimize detection distortion of the network output while preserving detection precision. At the outset, a 3D object detection model is pretrained. The pretrained model has multiple neural network layers; with a single layer, layer-wise sparsification and pruning to meet a computational objective would be almost trivial, because there would be no effect of pruning earlier layers on the feature identification accuracy of later layers to take into account. The present framework is then applied to the pretrained model to sparsify the network weights layer by layer, reducing computational complexity under a distortion-minimizing formulation. In some embodiments, a minimized-distortion condition is detected by transforming the layer-wise sparsity allocation computed for the pretrained model into a layer-wise pruning ratio for each layer of the model. The layer-wise pruning ratios are determined by formulating the pruning problem as a Pareto-optimization problem. Thus, pruning proceeds until a Pareto-optimal allocation is reached, in which the computational saving obtained from one layer cannot be improved without worsening the outcome for another layer. In other embodiments, scalarization-based, metric-based, gradient-based and other pruning schemes may be used in place of Pareto-optimization. The pruning ratios can then be applied to remove redundant weights and output the pruned model.
[0033] The proposed pruning framework can be implemented as a plug-and-play post-training module that operates on a pretrained 3D object detection model without requiring full retraining. In addition, the framework can be used in complementary combination with spatial pruning methods, such as point-wise or voxel-wise pruning, to further enhance model compression and inference efficiency. Through the integration of these techniques, the present framework achieves substantial acceleration and compression of 3D object detection models while preserving model performance and detection reliability.
[0034] Experimental evaluations conducted on benchmark datasets such as KITTI, NuScenes, and ONCE demonstrate that the proposed approach is capable of maintaining, and in some cases improving, detection precision while achieving substantial reductions in computational cost. For example, in certain implementations, the disclosed framework achieves an approximately 3.89x reduction in FLOPS for a CenterPoint model and a 3.72x reduction in FLOPS for a PVRCNN model, without measurable degradation in mean average precision (mAP).
[0035] The pretrained 3D object detection model is a neural network-based model comprising multiple layers and a detector. In particular, the model includes a neural network feature extractor $f$ comprising $l$ layers with the parameter set $W^{(1:l)} = (W^{(1)}, \ldots, W^{(l)})$, where $W^{(i)}$ denotes the weight tensor in the $i$-th layer. On top of the neural network is a detection head configured to process the extracted features and generate detection outputs. The detection outputs comprise a bounding box around an object, as well as a confidence score for that object.
[0038] Given a 3D input $x$, such as a LiDAR point cloud or voxelized scene, the feature extractor $f(x; W^{(1:l)})$ produces a learned representation. That representation may encode both geometric and semantic attributes. The detection head then predicts a set of bounding boxes $p_b(f(x; W^{(1:l)})) \in \mathbb{R}^{N_s \times S}$, where $N_s$ denotes the number of predicted bounding boxes and $S$ represents the dimension of the bounding box coordinates, together with corresponding class confidence scores $p_c(f(x; W^{(1:l)})) \in \mathbb{R}^{N_s \times C}$, where $C$ stands for the number of classes or object categories. The overall detection output of the network can thus be represented as the concatenation of the bounding box predictions and classification confidences:
[0043] $$y = \big[\, p_b(f(x; W^{(1:l)})),\; p_c(f(x; W^{(1:l)})) \,\big] \tag{1}$$
[0044] Pruning the parameters of the feature extractor $f$ results in a new, sparsified parameter set $\tilde{W}^{(1:l)} = (\tilde{W}^{(1)}, \ldots, \tilde{W}^{(l)})$.
[0046] The effect of this pruning is quantified as the detection distortion, defined as the difference between the dense model's prediction $y$ and the pruned model's prediction $\tilde{y}$:
[0047] $$\tilde{y} - y = \big[\, p_b(f(x; \tilde{W}^{(1:l)})),\; p_c(f(x; \tilde{W}^{(1:l)})) \,\big] - \big[\, p_b(f(x; W^{(1:l)})),\; p_c(f(x; W^{(1:l)})) \,\big] \tag{2}$$
[0049] Different layers contribute differently to the overall performance of the model. Thus, the influence of pruning layer weights on the accuracy of bounding boxes and confidence scores for object detection varies between layers. This is particularly pronounced for information carried by active foreground points or voxels across the network or layers. A poor layer-wise sparsity level assignment may therefore achieve the desired reduction in computational complexity at too great a cost in performance or detection accuracy, whereas an alternative layer-wise sparsity level assignment could achieve a similar reduction in computational complexity without significant degradation of prediction performance.
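The concatenated output of Equation (1) and the distortion between dense and pruned predictions can be sketched in a few lines of NumPy. This is only an illustration: the array sizes, the function names `detection_output` and `detection_distortion`, and the use of the squared L2 norm as the scalar distortion measure are assumptions made here, not details taken from the specification.

```python
import numpy as np

def detection_output(p_b: np.ndarray, p_c: np.ndarray) -> np.ndarray:
    """Concatenate boxes (N_s, S) and confidences (N_s, C) into y of shape (N_s, S + C)."""
    return np.concatenate([p_b, p_c], axis=1)

def detection_distortion(y_dense: np.ndarray, y_pruned: np.ndarray) -> float:
    """Squared L2 norm of the difference between pruned and dense predictions."""
    diff = y_pruned - y_dense
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
n_boxes, box_dim, n_classes = 4, 7, 3          # hypothetical sizes
p_b = rng.normal(size=(n_boxes, box_dim))      # stand-in for dense box predictions
p_c = rng.uniform(size=(n_boxes, n_classes))   # stand-in for dense confidences

y = detection_output(p_b, p_c)
# Pretend pruning shifted every box coordinate by 0.01 and left confidences unchanged.
y_tilde = detection_output(p_b + 0.01, p_c)
distortion = detection_distortion(y, y_tilde)
```

With 4 boxes of 7 coordinates each, the distortion here is 28 times 0.01 squared, which makes it easy to check by hand.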
[0050] Presently, the pruning problem is formulated as a joint optimization problem across all layers. The problem thus determines a layer-wise sparsity allocation that minimizes both bounding box localization distortion and confidence score distortion, subject to a specified computation reduction target (e.g., a FLOPS constraint). In some embodiments, the pruning task is formulated as a Pareto-optimization problem according to:
[0051] $$\min_{\tilde{W}^{(1:l)}} \; \big\| p_b(f(x; \tilde{W}^{(1:l)})) - p_b(f(x; W^{(1:l)})) \big\|_2^2 + \lambda \, \big\| p_c(f(x; \tilde{W}^{(1:l)})) - p_c(f(x; W^{(1:l)})) \big\|_2^2 \quad \text{s.t.} \quad \frac{\mathrm{FLOPs}(f(\tilde{W}^{(1:l)}))}{\mathrm{FLOPs}(f(W^{(1:l)}))} \le R \tag{3}$$
[0053] This formulation jointly minimizes the detection distortions in both bounding box localization and classification confidence caused by pruning, while satisfying a specified computation reduction target $R$ (e.g., a FLOPS constraint). In this formulation, $\lambda$ denotes the Lagrangian multiplier that balances the trade-off between the bounding-box localization distortion and the classification confidence distortion.
[0054] To avoid intractability as the number of layers and the number of nodes per layer increase, the objective set forth above is re-expressed in terms of layer-wise pruning ratios. A layer-wise pruning ratio is a closed-form function of the optimization variable. For each layer, a parameter scoring method is given, such as an L1-norm score, in which the magnitude of a weight $|w|$ is used as its importance measure, or a Taylor-based score, in which the first-order term $|w \cdot g_w|$ derived from the weight $w$ and its gradient $g_w$ quantifies the sensitivity of the loss to pruning that weight. The corresponding pruning-induced error on the weights, $\Delta W$, is determined by ranking the weight parameters according to their respective scores and identifying the subset of parameters to be removed. Specifically, the parameters are ordered by their scores, and only the top-k highest-scoring parameters are retained, while the remaining lower-scored parameters are designated for pruning. The resulting difference between the dense weight tensor and the pruned weight tensor constitutes the pruning-induced error $\Delta W$.
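The scoring and top-k retention step described above can be sketched as follows. The function name `pruning_error`, the rounding of the number of pruned weights, and the sign convention (pruned minus dense) are assumptions chosen for this illustration.

```python
import numpy as np

def pruning_error(weights, pruning_ratio, grads=None):
    """Prune the lowest-scoring weights of one layer and return (pruned, delta_w).

    Score is |w| (magnitude) by default, or |w * g_w| (first-order Taylor)
    when gradients are supplied. delta_w is taken as pruned minus dense,
    which is an assumed sign convention.
    """
    w = weights.ravel()
    scores = np.abs(w) if grads is None else np.abs(w * grads.ravel())
    n_prune = int(round(pruning_ratio * w.size))
    order = np.argsort(scores, kind="stable")      # ascending: lowest scores first
    keep = np.ones(w.size, dtype=bool)
    keep[order[:n_prune]] = False                  # drop the n_prune lowest scores
    pruned = np.where(keep, w, 0.0).reshape(weights.shape)
    return pruned, pruned - weights

w = np.array([0.5, -0.1, 0.03, -2.0, 0.7, 0.2])
pruned, delta_w = pruning_error(w, pruning_ratio=0.5)
# Magnitude scoring removes the three smallest-magnitude weights: -0.1, 0.03 and 0.2.
```

Passing `grads` switches the same routine to the Taylor-based score without changing the ranking logic.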
[0055] The distortion between the dense model prediction, $y$, and the prediction of the pruned model, $\tilde{y}$, is then approximated using a second-order Taylor expansion:
[0056] $$\tilde{y} - y \approx \sum_{i=1}^{l} \Big( \nabla_{W^{(i)}}^{\top} y \, \Delta W^{(i)} + \tfrac{1}{2} \, \Delta W^{(i)\top} H_i \, \Delta W^{(i)} \Big) \tag{4}$$
[0057] where $\Delta W^{(i)} = \tilde{W}^{(i)} - W^{(i)}$ is the pruning-induced error of the $i$-th layer and $H_i$ is the Hessian matrix of the $i$-th layer weight. The present methods introduce a Hessian-based layer-wise pruning scheme to determine the optimal pruning ratio for each layer in a 3D detection network. By leveraging second-order gradient information, the proposed scheme efficiently estimates the sensitivity of each layer to pruning-induced distortion. The proposed method employs a second-order approximation of the detection distortion, which would generally be expensive to compute. Therefore, as discussed below, the present disclosure introduces a lightweight mechanism to efficiently approximate the Hessian information, thereby enabling practical application to large-scale models.
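As a sanity check on the second-order expansion, the sketch below evaluates the first- and second-order terms for a single layer on a toy quadratic output, for which the expansion is exact. The toy function, the scalar output, and the convention that the weight change is pruned minus dense are all assumptions of this illustration; a real detector has a vector-valued output and a much larger Hessian.

```python
import numpy as np

def taylor_output_change(grad, hessian, delta_w):
    """First- plus second-order estimate of y(w + delta_w) - y(w)."""
    return float(grad @ delta_w + 0.5 * delta_w @ hessian @ delta_w)

rng = np.random.default_rng(0)
dim = 6
a = rng.normal(size=dim)
b = rng.normal(size=(dim, dim))
A = b + b.T                                   # symmetric, plays the role of the Hessian

def toy_output(w):
    """Hypothetical scalar 'detection output': a quadratic in the weights."""
    return float(a @ w + 0.5 * w @ A @ w)

w = rng.normal(size=dim)
w_pruned = w.copy()
w_pruned[[1, 4]] = 0.0                        # prune two weights
delta_w = w_pruned - w                        # assumed sign convention

exact = toy_output(w_pruned) - toy_output(w)
approx = taylor_output_change(a + A @ w, A, delta_w)   # gradient at w is a + A w
```

For a quadratic the two values agree up to rounding; for a real network the second-order estimate is only an approximation.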
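[0058] Next, the expectation of the squared L2 norm in the objective of Equation (3) is considered, which can be rewritten in vector inner-product form:
[0059] $$\mathbb{E}\big( \| \tilde{y} - y \|_2^2 \big) \approx \sum_{i=1}^{l} \sum_{j=1}^{l} \mathbb{E}\Big[ \big( \nabla_{W^{(i)}}^{\top} y \, \Delta W^{(i)} + \tfrac{1}{2} \Delta W^{(i)\top} H_i \, \Delta W^{(i)} \big)^{\top} \big( \nabla_{W^{(j)}}^{\top} y \, \Delta W^{(j)} + \tfrac{1}{2} \Delta W^{(j)\top} H_j \, \Delta W^{(j)} \big) \Big] \tag{5}$$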
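[0062] Upon further expansion of the inner-product term, the cross-term corresponding to each layer pair $(i, j)$, where $1 \le i \ne j \le l$, is expressed as follows:
[0063] $$\mathbb{E}\big[ \Delta W^{(i)\top} \nabla_{W^{(i)}} y \, \nabla_{W^{(j)}}^{\top} y \, \Delta W^{(j)} \big] + \tfrac{1}{2} \mathbb{E}\big[ \Delta W^{(i)\top} \nabla_{W^{(i)}} y \; \Delta W^{(j)\top} H_j \, \Delta W^{(j)} \big] + \tfrac{1}{2} \mathbb{E}\big[ \Delta W^{(i)\top} H_i^{\top} \Delta W^{(i)} \; \nabla_{W^{(j)}}^{\top} y \, \Delta W^{(j)} \big] + \tfrac{1}{4} \mathbb{E}\big[ \Delta W^{(i)\top} H_i^{\top} \Delta W^{(i)} \; \Delta W^{(j)\top} H_j \, \Delta W^{(j)} \big] \tag{6}$$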
[0066] When considering the influence of the random variable $\Delta W$, the first-order and second-order derivatives $\nabla_{W^{(i)}} y$ and $H_i$ in Equation (6) may be treated as constants and can therefore be moved outside the expectation operation. Additionally, the vector transpose operation is invariant within the expectation. Accordingly, Equation (6) can be rewritten as follows:
[0067]
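[0068] Based on Assumption 2 of Xu et al., the four cross-terms described above are determined to be equal to zero, such that the expectation of the distortion can be derived as follows:
[0069] $$\mathbb{E}\big( \| \tilde{y} - y \|_2^2 \big) \approx \sum_{i=1}^{l} \mathbb{E}\Big( \big\| \nabla_{W^{(i)}}^{\top} y \, \Delta W^{(i)} + \tfrac{1}{2} \Delta W^{(i)\top} H_i \, \Delta W^{(i)} \big\|_2^2 \Big) \tag{8}$$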
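[0073] After the above relaxation, the original objective may be estimated as follows:
[0074] $$\min \; \sum_{i=1}^{l} \mathbb{E}\Big( \big\| \nabla_{W^{(i)}}^{\top} y \, \Delta W^{(i)} + \tfrac{1}{2} \Delta W^{(i)\top} H_i \, \Delta W^{(i)} \big\|_2^2 \Big) \quad \text{s.t.} \quad \frac{\mathrm{FLOPs}(f(\tilde{W}^{(1:l)}))}{\mathrm{FLOPs}(f(W^{(1:l)}))} \le R \tag{9}$$
[0076] Let $\alpha_{i,k}$ denote the pruning ratio at layer $i$ corresponding to $k$ weights, where $0 \le \alpha_{i,k} \le 1$ for all $i$ and $k$. In addressing the pruning problem defined by Equation (9), the present approach selects the optimal pruning ratios so as to minimize the distortion as expressed in Equation (9). Let $\delta_{i,k}$ represent the distortion incurred when pruning $k$ weights at layer $i$.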
[0077] Specifically, let $g$ denote a state function, in which $g_i^j$ represents the minimal distortion caused when pruning $j$ weights in the first $i$ layers. The search problem, or pruning optimization problem, may be addressed by decomposing the original problem into a sequence of sub-problems according to the following state translation rule:
[0078] $$g_i^j = \min_{k} \big\{ g_{i-1}^{\,j-k} + \delta_{i,k} \big\}, \quad \text{where } 1 \le k \le j. \tag{10}$$
[0081] The optimal pruning configuration may then be determined by applying a dynamic programming procedure in accordance with the defined state translation rule, as described in Algorithm 1, wherein the procedure exhibits a time complexity that scales linearly with the total number of model parameters.
[0082] Algorithm 1: Optimization via dynamic programming
Input: The total number $T$ of weights to be pruned; the distortion $\delta_{i,k}$ incurred when pruning $k$ weights in layer $i$, for $1 \le i \le l$ and each candidate $k$.
Output: The layer-wise pruning ratios $\alpha_i^*$ for $1 \le i \le l$.
for $i$ from 1 to $l$ do
    for $j$ from 0 to $T$ do
        $g_i^j \leftarrow \min_k \{ g_{i-1}^{\,j-k} + \delta_{i,k} \}$;
        $s_i^j \leftarrow \arg\min_k \{ g_{i-1}^{\,j-k} + \delta_{i,k} \}$.
    end for
end for
for $i$ from $l$ to 1 do
    The number of weights pruned in layer $i$ is $s_i^T$.
    The pruning ratio of layer $i$ is $\alpha_i^* = s_i^T / |W^{(i)}|$.
    Update $T \leftarrow T - s_i^T$.
end for
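The state translation rule of Equation (10) and the backtracking step can be sketched as a small dynamic program. This is a plain illustration rather than the procedure of Algorithm 1 itself: the function name `allocate_pruning` and the table layout are hypothetical, `k = 0` is allowed so that a layer may be left unpruned, and no attempt is made to reach the linear time complexity mentioned above.

```python
import math

def allocate_pruning(delta, total_pruned):
    """Choose how many weights to prune per layer to minimise summed distortion.

    delta[i][k] is the distortion of pruning k weights in layer i
    (with delta[i][0] == 0). Returns (minimal distortion, per-layer counts).
    """
    n_layers = len(delta)
    # g[i][j]: minimal distortion when pruning j weights within the first i layers.
    g = [[math.inf] * (total_pruned + 1) for _ in range(n_layers + 1)]
    # s[i][j]: number of weights pruned in layer i on the optimal path to g[i][j].
    s = [[0] * (total_pruned + 1) for _ in range(n_layers + 1)]
    g[0][0] = 0.0
    for i in range(1, n_layers + 1):
        max_k = len(delta[i - 1]) - 1
        for j in range(total_pruned + 1):
            for k in range(min(j, max_k) + 1):
                candidate = g[i - 1][j - k] + delta[i - 1][k]
                if candidate < g[i][j]:
                    g[i][j], s[i][j] = candidate, k
    if math.isinf(g[n_layers][total_pruned]):
        raise ValueError("total_pruned exceeds what the layers can provide")
    # Backtrack from the last layer to recover the per-layer counts.
    counts, remaining = [0] * n_layers, total_pruned
    for i in range(n_layers, 0, -1):
        counts[i - 1] = s[i][remaining]
        remaining -= s[i][remaining]
    return g[n_layers][total_pruned], counts

# Hypothetical distortion tables for three layers with up to three prunable weights each.
delta = [[0, 1, 4, 9], [0, 2, 3, 10], [0, 0.5, 5, 6]]
best, counts = allocate_pruning(delta, total_pruned=4)
```

For this toy table the cheapest way to prune four weights is one from the first layer, two from the second and one from the third, for a total distortion of 4.5. Dividing each count by the layer size would give the pruning ratio for that layer.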
[0104] To reduce computational overhead in practical implementations, the Hessian matrix $H_i$ corresponding to each layer of the neural network is approximated by an empirical Fisher matrix $\hat{F}$, following the approach described by Kurtic et al.:
[0105] $$H_i \approx \hat{F} = \kappa I_d + \frac{1}{N} \sum_{n=1}^{N} \nabla_{W^{(i)}} y_n \, \nabla_{W^{(i)}}^{\top} y_n \tag{11}$$
[0106] In this equation, a small dampening constant $\kappa > 0$ and an identity matrix $I_d$ are introduced to improve numerical stability during computation, and $y_n$ denotes the detection output for the $n$-th sample of the calibration set. A direct computation of the distortion term $\delta_{i,k}$ over a calibration set of size $N$ may require iterating through multiple pruning ratios $\alpha_{i,k}$ to determine the corresponding $\delta_{i,k}$ values. Even when the Hessian is approximated by the empirical Fisher matrix, the process would still be computationally intensive, at a complexity of $O(N K D_i^4)$, where $K$ denotes the number of possible pruning ratios and $D_i = |W^{(i)}|$ represents the number of neurons or parameters in the $i$-th layer. To address this, the present disclosure observes that the gradient term $\nabla_{W^{(i)}} y$ remains substantially constant across different pruning ratios. Accordingly, the same Hessian approximation can be reused for all pruning ratios, thereby reducing the complexity to $O((N + K) D_i^2 + K D_i^4)$.
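The reuse of a single Hessian approximation across pruning ratios can be illustrated with a damped empirical Fisher matrix built from per-sample gradients. The sketch assumes the common form of the damped empirical Fisher (a scaled identity plus the averaged outer products of per-sample gradients), a scalar output per sample, and hypothetical sizes; none of the names below come from the specification.

```python
import numpy as np

def empirical_fisher(per_sample_grads, damp=1e-4):
    """Damped empirical Fisher: damp * I + (1/N) * sum_n g_n g_n^T."""
    n_samples, dim = per_sample_grads.shape
    return damp * np.eye(dim) + per_sample_grads.T @ per_sample_grads / n_samples

def fisher_quadratic(per_sample_grads, delta_w, damp=1e-4):
    """delta_w^T F delta_w computed without materialising the dim x dim matrix."""
    proj = per_sample_grads @ delta_w                       # shape (N,)
    return float(damp * (delta_w @ delta_w) + (proj @ proj) / per_sample_grads.shape[0])

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 50))     # hypothetical: N = 8 calibration samples, D_i = 50
weights = rng.normal(size=50)
fisher = empirical_fisher(grads)     # built once per layer ...

order = np.argsort(np.abs(weights), kind="stable")
quad_terms = []
for ratio in (0.1, 0.2, 0.3):        # ... and reused for every candidate pruning ratio
    delta_w = np.zeros(50)
    pruned_idx = order[: int(round(ratio * 50))]
    delta_w[pruned_idx] = -weights[pruned_idx]              # pruned minus dense
    quad_terms.append(float(delta_w @ fisher @ delta_w))

low_rank_check = fisher_quadratic(grads, delta_w)           # same value as quad_terms[-1]
```

The second helper shows one possible design choice: because the Fisher matrix is a damped sum of outer products, the quadratic form can be evaluated from the gradients directly, which avoids storing a full matrix for large layers.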
[0107] Furthermore, as the pruning ratio increases incrementally, only a small subset of neurons is newly identified for further removal from the weight tensor that has already undergone pruning. This enables the definition of a subvector $\delta W^{(i)}(\alpha_{i,k}) = \Delta W^{(i)}(\alpha_{i,k}) - \Delta W^{(i)}(\alpha_{i,k-1})$, representing the weights newly pruned between two successive pruning ratios. When the pruning ratio increases from $\alpha_{i,k-1}$ to $\alpha_{i,k}$, the distortion value $\delta_{i,k}$ can then be updated recursively from $\delta_{i,k-1}$ using this subvector, as defined by Equation (12):
[0112]
[0113] (12)
[0114] Referring to Figure 1, a schematic diagram is shown illustrating an example of the reduced-dimension computation process for updating the distortion value $\delta_{i,k}$ using the subvector update rule defined in Equation (12). As shown, when the pruning ratio increases from $\alpha_{i,k-1}$ to $\alpha_{i,k}$, a subset of weights is newly identified for removal from the weight tensor of the $i$-th layer. These newly pruned weights are represented as a subvector $\delta W^{(i)}(\alpha_{i,k}) = \Delta W^{(i)}(\alpha_{i,k}) - \Delta W^{(i)}(\alpha_{i,k-1})$.
[0115] The dimension of this subvector, denoted $d_{i,k}$, is substantially smaller than the original number of parameters $D_i$ in the layer, since the subvector corresponds only to the neurons newly pruned at the current pruning step. The computation of the distortion update is thereby confined to this reduced subspace, allowing the matrix operations to be performed at a significantly lower computational cost. For example, consider a convolutional layer of a 3D object detection model having 10,000 weight parameters ($D_i = 10{,}000$). When the pruning ratio increases from $\alpha_{i,k-1} = 0.20$ to $\alpha_{i,k} = 0.25$, approximately 500 additional parameters are marked for removal. These parameters form a subvector $\delta W^{(i)}(\alpha_{i,k})$ of dimension $d_{i,k} = 500$. Instead of recomputing with the full Hessian matrix of size 10,000 × 10,000, the update of the distortion term $\delta_{i,k}$ is performed only within the 500-dimensional subspace defined by $\delta W^{(i)}(\alpha_{i,k})$. This results in a substantial reduction in computation time and memory usage, while preserving the accuracy of the distortion estimation.
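The worked example above can be reproduced numerically. The sketch below first counts the weights that are newly pruned when the ratio moves from 0.20 to 0.25 in a 10,000-weight layer, and then checks, on a smaller hypothetical layer, that the first- and second-order terms of the subvector give the same value whether they are evaluated in the full space or only on the sub-block of newly pruned indices. It is not the update rule of Equation (12); the helper name, the random data and the magnitude-based ranking are assumptions.

```python
import numpy as np

def pruned_index_sets(scores, ratio_prev, ratio_next):
    """Return (indices pruned at ratio_prev, indices newly pruned at ratio_next)."""
    order = np.argsort(scores, kind="stable")          # lowest scores are pruned first
    n_prev = int(round(ratio_prev * scores.size))
    n_next = int(round(ratio_next * scores.size))
    return order[:n_prev], order[n_prev:n_next]

rng = np.random.default_rng(0)

# Index bookkeeping for the 10,000-weight example: 0.20 -> 0.25 adds 500 weights.
_, new_10k = pruned_index_sets(rng.uniform(size=10_000), 0.20, 0.25)
n_new = new_10k.size

# Numerical check on a smaller hypothetical layer (D_i = 400).
dim = 400
w = rng.normal(size=dim)
g = rng.normal(size=dim)                               # stand-in gradient
m = rng.normal(size=(dim, dim))
H = m @ m.T / dim                                      # stand-in symmetric Hessian
_, new = pruned_index_sets(np.abs(w), 0.20, 0.25)

step = np.zeros(dim)
step[new] = -w[new]                                    # subvector: only newly pruned weights change

full = g @ step + 0.5 * step @ H @ step                # evaluated in the full space
reduced = g[new] @ step[new] + 0.5 * step[new] @ H[np.ix_(new, new)] @ step[new]
```

Because the subvector is zero outside the newly pruned indices, the reduced computation touches only a 20 by 20 block here instead of the 400 by 400 matrix.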
[0116] Through this reduced-dimension computation, the proposed framework efficiently updates the distortion term $\delta_{i,k}$ incrementally, eliminating redundant recalculation of the full Hessian matrix. As a result, the overall empirical complexity for evaluating layer-wise distortion values across all layers is greatly reduced, while maintaining high fidelity in the distortion estimation.
[0117] Algorithm 2: Distortion-Minimized Pruning of 3D Object Detection Model
Input: Training dataset $\mathcal{D}_t$, calibration dataset $\mathcal{D}_c$, pretrained 3D detection model $f$ with $l$ layers, number of possible pruning ratios for each layer $K$, fine-tuning epochs $E$.
Output: The pruned 3D detection model $\tilde{f}$.
Run inference with $f$ on $\mathcal{D}_c$ to get the output detections $Y = \{ f(x) : x \in \mathcal{D}_c \}$.
Perform back-propagation on $Y$.
Get a list of averaged gradients of each layer: $G = \{ \nabla_{W^{(i)}} y : 1 \le i \le l \}$.
for $i$ from 1 to $l$ do
    $\delta_{i,0} \leftarrow 0$.
    $\Delta W^{(i)}(\alpha_{i,0}) \leftarrow 0$.
    for $k$ from 1 to $K$ do
        Prune $W^{(i)}$ to get $\tilde{W}^{(i)}$ given $\alpha_{i,k}$: $\tilde{W}^{(i)} \leftarrow W^{(i)} \odot M(\alpha_{i,k})$, where $M(\alpha)$ is the binary mask retaining the highest-scoring weights at pruning ratio $\alpha$.
        Calculate the pruning error matrix: $\Delta W^{(i)}(\alpha_{i,k}) \leftarrow \tilde{W}^{(i)} - W^{(i)}$.
        Calculate $\delta_{i,k}$ following Equation (12).
    end for
end for
Obtain the layer-wise pruning ratios $\alpha_i^*$ using $\delta_{i,k}$ from Algorithm 1.
for $i$ from 1 to $l$ do
    Prune $W^{(i)}$ given $\alpha_i^*$: $\tilde{W}^{(i)} \leftarrow W^{(i)} \odot M(\alpha_i^*)$.
end for
for $e$ from 1 to $E$ do
    Fine-tune $\tilde{f}$ on $\mathcal{D}_t$.
end for
[0148] Denote the dimension of the subvector $\delta W^{(i)}(\alpha_{i,k})$ as $d_{i,k}$, which is much smaller than the total number of parameters $D_i$ in the corresponding layer. The value $d_{i,k}$ represents the number of neurons newly pruned between the weight differences $\Delta W^{(i)}(\alpha_{i,k-1})$ and $\Delta W^{(i)}(\alpha_{i,k})$ as the pruning ratio increases from $\alpha_{i,k-1}$ to $\alpha_{i,k}$. Consequently, the multiplication in Equation (12) can be carried out in a reduced subspace, where the gradient $\nabla_{W^{(i)}}^{\top} y \in \mathbb{R}^{d_{i,k}}$ and the Hessian $H_i \in \mathbb{R}^{d_{i,k} \times d_{i,k}}$ are the subvector and submatrix indexed from the original ones. The reduced subspace of dimension $d_{i,k}$ corresponds to a lower-dimensional portion of the layer's overall parameter space of dimension $D_i$ and is represented by the subvector $\delta W^{(i)}(\alpha_{i,k})$ associated with the current pruning ratio $\alpha_{i,k}$. By confining the computation of the distortion terms $\delta_{i,k}$ to this reduced subspace, the system avoids recomputing with the full Hessian matrix for all parameters in the layer, thereby significantly reducing the computational complexity and memory overhead while maintaining the accuracy of the distortion estimation.
[0151] To further eliminate any potential confusion, the update rule of Equation (12) is illustrated in Figure 1. Given that $\alpha_{i,0} = 0$, indicating a condition where no pruning has yet been applied, the corresponding distortion is $\delta_{i,0} = 0$. As pruning progresses, the distortion value $\delta_{i,k}$ is incrementally updated at each step.
[0152] Therefore, the complexity of computing the distortion values of a layer becomes the sum of the costs of the $K - 1$ incremental updates. Since $d_{i,k}$ increases linearly, the complexity for layer $i$ is around $O\big(\tfrac{1}{2} D_i^2\big)$. Hence the total computational complexity for calculating the distortion $\delta_{i,k}$ across all $l$ layers is around $O\big(\tfrac{1}{2} \sum_{i=1}^{l} D_i^2\big)$, significantly lower than the original complexity.
[0161] The present invention provides a weight pruning framework that is formulated to minimize distortion in 3D object detection. Algorithm 2 describes the holistic pruning procedure of the proposed method.
[0162] In certain embodiments, a computer-implemented method for optimizing a neural network model for three-dimensional (3D) object detection is disclosed. The method utilizes a Hessian-based pruning framework to achieve efficient model compression while maintaining detection accuracy. The method may be executed by one or more processors executing instructions stored on a non-transitory computer-readable medium. Referring to Figure 2, a flowchart illustrates an example process flow for optimizing a pretrained 3D object detection model in accordance with the present disclosure.
[0163] The method starts with receiving a pretrained 3D object detection model (step 102). The model comprises multiple neural network layers, which may include both 3D and 2D backbones, feature extraction blocks, and detection heads trained on LiDAR-based datasets such as KITTI, NuScenes, or ONCE. Each layer contains a set of trainable weights that govern the network's response to spatial and semantic features in the input point cloud. Receiving the pretrained model enables the optimization process to be applied post-training, without the need to reinitialize or retrain the model from scratch.
[0164] Next, in step 104, the system computes a layer-wise sparsity allocation for the 3D object detection model. This allocation defines, for each layer, the proportion of weights that may be safely pruned under a predefined computational constraint, such as a target floating-point operation (FLOPS) limit, memory usage or inference latency. This computation is performed in accordance with the rate-distortion formulation described in Equations (3)-(9) of the specification, wherein the expected distortion resulting from pruning in each layer is estimated using a second-order Taylor expansion of the detection outputs, including both bounding box localization and classification confidence. By evaluating the layer-specific contributions to overall detection distortion, the system identifies candidate sparsity levels that balance computational efficiency with preservation of detection performance.
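A candidate layer-wise sparsity allocation can be checked against a FLOPS target with a few lines of arithmetic. The sketch assumes that the FLOPS of a layer scale linearly with the fraction of weights it retains, which is a simplification (realised savings depend on the sparse kernels used), and the layer costs and ratios are made-up numbers.

```python
def flops_ratio(dense_flops, pruning_ratios):
    """Remaining FLOPS divided by dense FLOPS, assuming linear scaling per layer."""
    if len(dense_flops) != len(pruning_ratios):
        raise ValueError("one pruning ratio is required per layer")
    remaining = sum(f * (1.0 - a) for f, a in zip(dense_flops, pruning_ratios))
    return remaining / sum(dense_flops)

dense = [4.0e9, 2.0e9, 2.0e9]      # hypothetical per-layer FLOPS of the dense model
ratios = [0.75, 0.5, 0.0]          # hypothetical layer-wise pruning ratios
ratio = flops_ratio(dense, ratios)
meets_target = ratio <= 0.5        # target: keep at most half of the dense FLOPS
```

Under this convention, a 3.89x FLOPS reduction corresponds to a remaining-FLOPS ratio of roughly 0.26.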
[0165] In step 106, the system transforms the computed layer-wise sparsity allocation determined in step 104 into corresponding pruning ratios for each layer of the model. This transformation is performed using a second-order Hessian-based rate-distortion analysis, as described in Equations (4)-(11) of the specification, wherein the Hessian matrix or empirical Fisher information matrix approximates the second-order curvature of the loss landscape with respect to the layer weights. The pruning ratio for each layer is selected to minimize the predicted distortion in detection outputs, including both bounding-box localization and classification confidence. The system may employ a dynamic programming procedure in accordance with the state translation rule described in Equation (10) and Algorithm 1, thereby efficiently identifying a globally optimal set of layer-wise pruning ratios in polynomial time. In step 108, the system applies the pruning ratios determined in step 106 to remove redundant or low-importance weights from each layer of the model. The pruning is performed incrementally in accordance with the update rule described in Equation (12) of the specification, wherein, for each layer, only a subset of weights newly identified for removal is processed, and the corresponding distortion term $\delta_{i,k}$ is updated using precomputed Hessian or gradient information. By performing the pruning in this incremental manner, the system avoids recomputation of the full second-order matrices, thereby substantially reducing computational complexity and memory usage while maintaining high fidelity in the estimation of detection distortion. The output of this step is a pruned, pretrained 3D object detection model suitable for deployment in latency-sensitive applications.
[0166] Finally, in step 110, the system outputs the optimized, pruned 3D object detection model generated in step 108. The resulting model exhibits reduced FLOPS and parameter count while maintaining substantially equivalent detection accuracy relative to the original, dense pretrained model. The pruned model preserves both bounding-box localization and classification confidence, demonstrating effectiveness in real-world 3D object detection tasks. The pruned model may be deployed directly in latency-sensitive applications, including autonomous driving, robotics, and augmented or virtual reality systems, thereby enabling efficient, real-time 3D perception.
[0167] In certain embodiments, the performance of the disclosed pruning framework was evaluated using three representative 3D object detection benchmarks, namely KITTI, NuScenes, and ONCE. The KITTI dataset includes 3,712 training samples, 3,769 validation samples, and 7,518 test samples. Detection targets are categorized into three classes: Car, Pedestrian, and Cyclist, with ground truth bounding boxes divided into "Easy," "Moderate," and "Hard" difficulty levels. Detection performance is evaluated using average precision (AP) for each category, with an intersection-over-union (IoU) threshold of 0.7 for cars and 0.5 for pedestrians and cyclists. The NuScenes dataset is a large-scale autonomous driving dataset containing 1,000 driving sequences captured with multiple modalities, including LiDAR and cameras. The dataset is partitioned according to the default split, comprising 700 training scenes and 150 validation scenes. The ONCE dataset is a large-scale LiDAR-based dataset designed for autonomous driving, containing approximately one million scenes, of which 16,000 are fully annotated for 3D object detection. Model performance on this dataset is evaluated using mean average precision (mAP).
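The role of the per-class IoU thresholds can be illustrated with a simplified overlap test. The sketch below uses axis-aligned 3D boxes for brevity; benchmark evaluation of LiDAR detections generally works with rotated boxes and a full precision-recall computation, so this is only an assumption-laden illustration of the thresholding step, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x_min, y_min, z_min, x_max, y_max, z_max)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def iou_3d_axis_aligned(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0                       # boxes do not overlap along this axis
        intersection *= overlap
    return intersection / (box_volume(a) + box_volume(b) - intersection)

# Thresholds as stated for the KITTI evaluation.
IOU_THRESHOLDS = {"Car": 0.7, "Pedestrian": 0.5, "Cyclist": 0.5}

def is_true_positive(pred_box, gt_box, category):
    """A prediction matches a ground-truth box if the IoU reaches the class threshold."""
    return iou_3d_axis_aligned(pred_box, gt_box) >= IOU_THRESHOLDS[category]

gt = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0)
pred = (1.0, 0.0, 0.0, 5.0, 2.0, 2.0)       # same size, shifted by 1 along x
iou = iou_3d_axis_aligned(pred, gt)
```

Here the two boxes overlap with an IoU of 0.6, which would count as a match under the 0.5 threshold but not under the 0.7 threshold used for cars.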
[0168] Post-training pruning was applied to each detection model, followed by a single fine-tuning phase to recover or enhance detection performance.
[0169] For fair comparison, the number of floating-point operations (FLOPS) was calculated across both three-dimensional (3D) and two-dimensional (2D) backbones of the detection model, since the disclosed method prunes parameters in both components. For baseline methods that perform voxel-level sparsification only in 3D backbones, FLOPS reduction was recalculated with respect to both the 3D and 2D components. Except for the results discussed in Table 4, which specifically examine the pruning of the detection head, all reported FLOPS values correspond to the entire network across all three test scenarios.
[0170] Extensive evaluations were conducted to assess the performance of the disclosed pruning framework on multiple 3D object detection benchmarks, including NuScenes, ONCE, and KITTI datasets.
[0171] As shown in Table 1, on the ONCE validation dataset, the disclosed framework (referred to as DM3D) achieved higher detection precision than existing sparse baseline methods under equivalent FLOPS reduction levels across all three evaluated detector architectures, namely PVRCNN, SECOND, and CenterPoint. For the PVRCNN and SECOND models, the proposed approach outperformed the voxel-based pruning baseline across all evaluated precision metrics for the Car, Pedestrian, and Cyclist classes. In particular, the PVRCNN model exhibited a mean average precision (mAP) improvement of approximately 2% over its dense baseline. On the CenterPoint model, the disclosed framework also exceeded the performance of the current state-of-the-art method under comparable computational constraints.
[0172] As summarized in Table 2, the evaluation on the NuScenes dataset further demonstrated that the proposed pruning method results in smaller performance degradation at similar levels of FLOPS reduction relative to the baselines, as measured by both mAP and nuScenes Detection Score (NDS). For example, when applied to the recent VoxelNeXT network, the disclosed framework achieved a 0.41 mAP increase over the dense model, whereas the Ada3D baseline exhibited a 0.75 mAP reduction at the same FLOPS level.
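For reference, the NDS combines mAP with five mean true-positive error terms (translation, scale, orientation, velocity, and attribute), each clipped at 1. A minimal sketch of this standard definition follows; the function name and the example error values are hypothetical and are not results from the disclosure.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """NDS = (5 * mAP + sum(1 - min(1, err))) / 10.

    mean_ap: mAP as a fraction in [0, 1].
    tp_errors: mapping with the five mean true-positive errors
               (mATE, mASE, mAOE, mAVE, mAAE).
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five true-positive error terms")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical values for illustration only.
example_errors = {"mATE": 0.3, "mASE": 0.3, "mAOE": 0.3, "mAVE": 0.3, "mAAE": 0.3}
example_nds = nuscenes_detection_score(0.6, example_errors)
```

Because mAP carries half of the total weight, a pruning method can lose NDS either through lower mAP or through degraded localization, scale, orientation, velocity, or attribute estimates.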
[0173] Results for the KITTI dataset are presented in Table 3. The proposed pruning method maintained comparable or superior performance to baseline approaches across different detection models, particularly with respect to the Car AP (Moderate) metric. For the Voxel R-CNN architecture, the SPSS-Conv baseline yielded a 0.28 gain in the Car AP (Easy) metric compared to the dense model, while the disclosed pruning framework achieved a slightly smaller but still positive improvement, demonstrating consistent performance retention and computational efficiency.
[0174] Table 1: Performance comparison of the present invention on the ONCE val set. Gray background indicates dense model results. For baseline sparse detection results, the performance drop is listed relative to the corresponding dense results reported in their original papers.
Method | FLOPS (%) | mAP (drop) | Vehicle (IoU=0.7) | 0-30 | 30-50 | 50-Inf | Pedestrian (IoU=0.5) | 0-30 | 30-50 | 50-Inf | Cyclist (IoU=0.5) | 0-30 | 30-50 | 50-Inf
PointRCNN [0017] | / | 28.74 | 52.09 | - | - | - | 4.28 | - | - | - | 29.84 | - | - | -
PointPillar [3] | / | 44.34 | 68.57 | - | - | - | 17.63 | - | - | -
SECOND [5] | / | 51.89 | 71.16 | - | - | - | 26.44 | - | - | - | 58.04 | - | - | -
PVRCNN [4] | / | 52.44 21.91 20.89 18.18 69.8 54.16 72.29 57.22
Multi [0016] | 60.61 | - -2.85 -5.89 -6.76 -10.73 -4.42 -0.58 -8.12 -4.77 -2.81
Proposed Approach | 60.61 | +0.1 +11.61 +5.41 +0.31 +2 -2.07 +0.41 -0.25 -2.55 -0.91
SECOND [5] | / | 51.43 83.28 26.65 22.88 15.58 68.69 52.06 67.13 33.3 49.82
Multi [0016] | 52.54 | - -1.49 -7.84 -6.24 -2.98 -13.32 -5.35 -8.83 -4.69 -4.03
Proposed Approach | 52.54 | -1.6 0.45 -2.44 -3.91 -1.54 -1.53 -1.75 -2.55 0.0 -0.18
CenterPoint [6] | - | 64.01 | 76.09 | - | - | - | 49.37 | - | - | - | 66.58 | - | - | -
Ada3D [9] | 26.82 | -1.31 | -2.26 | - | - | - | -0.71 | - | - | - | -0.95 | - | - | -
Proposed Approach | 26.82 | -0.7 | -0.71 | - | - | - | -0.48 | - | - | - | -0.94 | - | - | -
[0194] Table 2: Performance comparison of the present invention on the NuScenes val set.
Method | FLOPS (%) | mAP (drop) | NDS (drop)
PointPillar [3] | / | 44.63 | 58.23
SECOND [5] | / | 50.59 | 62.29
CenterPoint-Pillar [6] | / | 50.03 | 60.70
CenterPoint (voxel=0.1) [6] | / | 55.43 | 64.63
Ada3D [9] (voxel=0.1) | 33.24 | 54.8 (-0.63) | 63.53 (-1.1)
Proposed Approach (voxel=0.1) | 33.24 | 55.32 (-0.11) | 64.36 (-0.27)
VoxelNeXT [2] | / | 60.5 | 66.6
Ada3D [9] | 85.12 | 59.75 (-0.75) | 65.84 (-0.76)
Proposed Approach | 85.12 | 60.91 (+0.41) | 66.91 (+0.31)
[0200] Table 3: Performance comparison of the present invention on the KITTI val set for the Car class.
Method | Voxel R-CNN [6]: FLOPS (%) | Easy (drop) | Mod. (drop) | Hard (drop) | SECOND [0034]: FLOPS (%) | Easy (drop) | Mod. (drop) | Hard (drop)
Dense | / | 89.44 | 79.2 | 78.43 | / | 88.08 | 77.77 | 75.89
SPSS-Conv [8] | 73.0 | +0.28 | +0.05 | -0.04 | 88.31 | +0.21 | -0.11 | -0.15
Proposed Approach | 74.36 | +0.04 | +0.06 | +0.11 | 78.38 | +0.10 | +0.11 | -0.03
[0213] An ablation study was performed to evaluate the contribution of different network components to the overall pruning performance. In particular, the effects of selectively pruning the three-dimensional (3D) backbone, the two-dimensional (2D) backbone, and the detection head of the model were analyzed using the SECOND detector on the KITTI validation dataset. The corresponding results are summarized in Table 4.
[0214] Table 4: Ablation study when pruning only certain parts of the SECOND model on the KITTI val dataset.
3D (%) | 2D (%) | Head (%) | FLOPS (%) | Car AP: Easy | Mod. | Hard | Ped. AP: Easy | Mod. | Hard | Cyc. AP: Easy | Mod. | Hard
/ | / | / | / | 88.09 | 77.77 | 75.91 | 53.43 | 48.63 | 44.2 | 81.8 | 66.04 | 62.47
47.62 | 100 | 100 | 93.2 | +0.14 | +0.17 | +0.06 | -0.42 | -0.77 | -0.39 | -0.2 | -0.33 | -0.03
47.62 | 67.57 | 100 | 79.22 | -0.41 | -0.33 | -0.52 | +0.5 | -0.22 | -0.12 | -1.65 | -1.61 | -1.22
47.62 | 67.57 | 67.37 | 64.9 | -0.43 | -0.1 | -0.37 | +0.15 | +0.16 | -0.62 | -1.78 | -1.45 | -1.22
[0226] The study revealed that pruning either the 3D or 2D backbone individually leads to a substantial reduction in floating-point operations (FLOPS), while maintaining or improving detection performance across most object classes and difficulty levels. When pruning was applied to both backbones simultaneously, a balanced trade-off was achieved between computational efficiency and detection precision.
[0227] Further analysis indicated that the pruning of the detection head contributed additional computation reduction but tended to have a relatively larger effect on detection precision, particularly for smaller or more difficult objects. Nonetheless, even in such cases, the proposed pruning framework maintained competitive accuracy across the Car, Pedestrian, and Cyclist categories under varying difficulty levels ("Easy," "Moderate," and "Hard").
[0228] These results demonstrate the adaptability of the disclosed pruning framework, which can be flexibly applied to specific network components according to computational or application constraints while preserving overall model robustness.
[0229] Table 5: Comparison of the proposed Hessian-based pruning scheme with pruning using the distortion $\delta$ computed from the actual network output.
Method | FLOPS (%) | Car AP drop: Easy | Mod. | Hard | Ped. AP drop: Easy | Mod. | Hard | Cyc. AP drop: Easy | Mod. | Hard
SECOND [5] | / | 88.09 | 77.77 | | 53.43 | 48.63 | 44.2 | 81.8 | 66.04 | 62.47
Hessian (Proposed Approach) | 87.85 | +0.14 | +0.17 | +0.06 | -0.42 | -0.77 | -0.39 | -0.2 | -0.33 | -0.03
Actual Dist. | 87.85 | -0.26 | +0.05 | +0.1 | +0.51 | -0.9 | -0.12 | -0.78 | +0.48 | -0.35
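By way of non-limiting illustration, the two distortion measures compared in Table 5 may be contrasted on a toy model as sketched below. The sketch assumes a single-output model y = tanh(w · x) and measures distortion as the mean squared change in output; for that measure the Hessian at zero perturbation equals twice the mean outer product of the output gradients, so the second-order estimate reduces to the mean of (J · Δw)². All function names and numbers are hypothetical and do not reproduce the disclosed implementation.

```python
import math


def output(w, x):
    """Toy single-output model: y = tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def jacobian(w, x):
    """Gradient of the output with respect to the weights."""
    y = output(w, x)
    return [(1.0 - y * y) * xi for xi in x]


def second_order_distortion(w, delta_w, inputs):
    """0.5 * dw^T H dw with H = 2 * mean(J J^T), i.e. mean((J . dw)^2)."""
    total = 0.0
    for x in inputs:
        jd = sum(j * d for j, d in zip(jacobian(w, x), delta_w))
        total += jd * jd
    return total / len(inputs)


def actual_distortion(w, delta_w, inputs):
    """Mean squared output change measured by re-running the model."""
    pruned = [wi + di for wi, di in zip(w, delta_w)]
    return sum((output(pruned, x) - output(w, x)) ** 2 for x in inputs) / len(inputs)


def prune_delta(w, indices):
    """Perturbation that sets the selected weights to zero."""
    return [-wi if i in indices else 0.0 for i, wi in enumerate(w)]


# Hypothetical weights and calibration inputs.
weights = [0.5, -0.02, 0.3]
calibration = [[1.0, 0.5, -1.0], [0.2, -1.0, 0.4], [-0.7, 0.3, 0.9]]
delta = prune_delta(weights, {1})
estimated = second_order_distortion(weights, delta, calibration)
measured = actual_distortion(weights, delta, calibration)
```

The estimate needs only gradients evaluated at the dense weights, whereas the measured value requires a forward pass for every candidate set of pruned weights; the two agree closely when the pruned weights are small.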
[0240] Figure 3 illustrates the comparative detection performance of the disclosed pruning framework (DM3D) against baseline pruning approaches, including the Multi-Dimensional Pruned Sparse Convolution method proposed by Li et al. and Ada3D by Zhao et al., across multiple 3D object detection networks and different levels of floating-point operations (FLOPS) reduction. The evaluations were conducted on the ONCE validation dataset using three representative detection architectures: CenterPoint, SECOND, and PVRCNN.
[0241] As shown in Figure 3, the DM3D framework consistently achieves superior or comparable detection precision while significantly reducing computational cost. In particular, the CenterPoint model pruned using the DM3D scheme achieves approximately a 3.89x reduction in FLOPS compared to the original dense model, while maintaining higher mean average precision (mAP) than the Ada3D baseline, which is the only prior method reporting comparable mAP performance. For the SECOND and PVRCNN models, the disclosed approach achieves around 2x speedup relative to the baseline methods, demonstrating improved detection accuracy across various object categories as the FLOPS are gradually reduced.
[0242] The curves in Figure 3 further illustrate that the proposed pruning framework establishes a more favourable Pareto frontier between detection accuracy and computational efficiency than existing methods. This confirms that the Hessian-based, layer-wise pruning scheme provides effective model compression with minimal or no loss in 3D detection performance.
[0243] Figure 4 illustrates qualitative examples of detection results generated by the disclosed pruning framework (DM3D) compared with corresponding dense (unpruned) models across multiple driving scenes from the KITTI dataset. In the figure, scenes labeled "A" and "B" correspond to detections performed using the PVRCNN architecture, while scenes labeled "C" and "D" correspond to detections performed using the SECOND architecture. For each scene, subfigure "1" (e.g., A-1, B-1, C-1, D-1) represents results obtained from the model pruned using the disclosed DM3D framework, and subfigure "2" (e.g., A-2, B-2, C-2, D-2) represents results obtained from the original dense model.
[0244] As shown, the detection outputs of the pruned models maintain high-quality bounding boxes and precise localization across major object categories in the LiDAR point cloud data, including vehicles, pedestrians, and cyclists. The consistency of bounding box placement and confidence levels between the pruned and dense models demonstrates that the proposed pruning scheme effectively preserves detection fidelity despite substantial model compression. Accordingly, Figure 4 visually shows that the proposed pruning framework preserves 3D object detection accuracy and localization quality under significant parameter sparsification.
[0245] Figure 5 illustrates the detailed layer-wise sparsity allocation results obtained using the disclosed pruning framework under varying FLOPS constraint levels. The figure presents the optimized sparsity distribution across network layers for three representative 3D object detection architectures evaluated on the ONCE dataset.
[0246] Because the proposed framework exploits redundancies within network weights, it jointly optimizes the sparsity levels of both the three-dimensional (3D) and two-dimensional (2D) backbones and automatically determines the optimal allocation for each layer. As shown, the resulting sparsity patterns differ across model architectures. In the PVRCNN network, the disclosed method yields higher sparsity in the 2D backbone compared to the 3D backbone, whereas in the SECOND network, the 2D backbone retains more weights than the 3D portion. This suggests that PVRCNN derives greater expressive capacity from its 3D feature extraction components, while SECOND relies more heavily on its 2D backbone for detection precision.
[0247] Across all networks, earlier layers in the 2D backbone tend to remain less pruned, indicating their importance in transferring information smoothly from the 3D domain to the 2D representation space. As the FLOPS target decreases, the overall sparsity distribution remains stable, with the most distortion-sensitive layers consistently maintaining lower sparsity levels.
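By way of non-limiting illustration, a layer-wise allocation under a FLOPS constraint may be sketched as follows. The sketch uses a simple greedy rule (repeatedly taking the pruning step that adds the least estimated distortion per FLOP saved), which is a stand-in chosen for brevity and is not asserted to be the disclosed optimization. The function name, layer names, costs, and distortion curves are all hypothetical.

```python
def allocate_sparsity(layers, flops_budget):
    """Pick one sparsity level per layer so total FLOPS fit the budget.

    layers: dict name -> {"flops": dense FLOPS,
                          "curve": [(sparsity, distortion), ...]}
            with each curve sorted by strictly increasing sparsity and
            starting at (0.0, 0.0).
    Assumption: layer cost scales with the fraction of weights kept.
    """
    level = {name: 0 for name in layers}

    def flops_at(name, idx):
        layer = layers[name]
        return layer["flops"] * (1.0 - layer["curve"][idx][0])

    total = sum(flops_at(name, 0) for name in layers)
    while total > flops_budget:
        best = None
        for name, layer in layers.items():
            idx = level[name]
            if idx + 1 >= len(layer["curve"]):
                continue  # layer already at its highest candidate sparsity
            saved = flops_at(name, idx) - flops_at(name, idx + 1)
            added = layer["curve"][idx + 1][1] - layer["curve"][idx][1]
            cost = added / saved  # distortion added per FLOP saved
            if best is None or cost < best[0]:
                best = (cost, name, saved)
        if best is None:
            raise ValueError("budget unreachable with the candidate sparsities")
        _, name, saved = best
        level[name] += 1
        total -= saved
    return {name: layers[name]["curve"][level[name]][0] for name in layers}


# Hypothetical two-layer network: FLOPS in arbitrary units.
example_layers = {
    "conv3d_1": {"flops": 4.0, "curve": [(0.0, 0.0), (0.5, 0.10), (0.75, 0.50)]},
    "conv2d_1": {"flops": 10.0, "curve": [(0.0, 0.0), (0.5, 0.05), (0.75, 0.20)]},
}
example_allocation = allocate_sparsity(example_layers, flops_budget=8.0)
```

In this sketch a layer whose distortion rises steeply with sparsity is pruned last, which mirrors the observation above that distortion-sensitive layers retain lower sparsity as the FLOPS target tightens.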
[0248] It will be appreciated that many further modifications and permutations of various aspects of the described embodiments are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.
[0249] Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
[0250] The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.
Claims
1. A computer-implemented method for optimizing a neural network model for 3D object detection, comprising:
receiving a pretrained 3D object detection model with multiple neural network layers;
computing a layer-wise sparsity allocation across the detection model based on a predefined computational constraint;
transforming the layer-wise sparsity allocation into a layer-wise pruning ratio for each layer of the model, using second-order Hessian-based rate-distortion analysis, wherein the pruning ratio minimizes distortion in detection outputs;
applying the computed pruning ratios to remove redundant weights in each layer of the model; and
outputting a pruned, pretrained 3D object detection model.
2. The method according to claim 1, wherein the pretrained 3D object detection model was trained on one or more LiDAR-based datasets selected from the group consisting of KITTI, NuScenes, and ONCE.
3. The method according to any one of claims 1 to 2, wherein the predefined computational constraint comprises a target reduction in one or more of floating-point operations (FLOPS) limit, memory usage, or inference latency.
4. The method according to any one of claims 1 to 3, wherein transforming the layer-wise sparsity allocation into the layer-wise pruning ratio comprises jointly minimizing distortion across all neural network layers.
5. The method according to claim 1, wherein applying the computed pruning ratios comprises incrementally removing weights from each layer in multiple iterations, and updating distortion terms $\delta_{i,k}$ after each iteration.
6. The method according to claim 5, wherein the distortion term $\delta_{i,k}$ represents the estimated distortion in detection outputs caused by pruning a subset of weights in the i-th neural network layer, and wherein the distortion term $\delta_{i,k}$ is computed over only the subset of weights newly pruned at a given pruning step, $k$.
7. The method according to claims 1 and 6, wherein the distortion term $\delta_{i,k}$ for the i-th layer is incrementally updated based on the distortion term computed in the prior pruning step, $\delta_{i,k-1}$.
8. The method according to claim 7, wherein updating the associated distortion term $\delta_{i,k}$ is performed within a lower-dimensional subspace defined by a subvector of weights corresponding to the subset of weights in the i-th layer newly identified for pruning.
9. The method according to claim 1, wherein the second-order Hessian-based rate-distortion analysis is approximated using previously computed gradient information or an empirical Fisher information matrix to reduce computational complexity.
10. The method according to claim 1, wherein the second-order Hessian-based rate-distortion analysis is computed once and reused across multiple pruning ratios, such that distortion terms for each subset of pruned weights are incrementally updated without recomputing the full analysis.
11. The method according to claim 1, wherein the layer-wise pruning ratio for each layer is selected such that the associated distortion term $\delta_{i,k}$, derived from the second-order Hessian-based rate-distortion analysis, is minimized with respect to detection outputs, including bounding box localization and classification confidence.
12. The method according to claim 1, wherein the distortion in detection outputs comprises a difference between the predictions of the pretrained dense 3D object detection model and the predictions of the pruned 3D object detection model.
13. A system for optimizing a neural network model for 3D object detection, the system comprising:
at least one memory; and
at least one processor communicatively coupled to the at least one memory and configured to:
receive a pretrained 3D object detection model with multiple neural network layers;
compute a layer-wise sparsity allocation across the detection model based on a predefined computational constraint;
transform the layer-wise sparsity allocation into a layer-wise pruning ratio for each layer of the model using second-order Hessian-based rate-distortion analysis, wherein the pruning ratio minimizes distortion in detection outputs;
apply the computed pruning ratios to remove redundant weights from each layer of the model; and
output a pruned, pretrained 3D object detection model.
14. The system according to claim 13, wherein the pretrained 3D object detection model was trained on one or more LiDAR-based datasets selected from the group consisting of KITTI, NuScenes, and ONCE.
15. The system according to any one of claims 13 to 14, wherein the predefined computational constraint comprises a target reduction in one or more of floating-point operations (FLOPS) limit, memory usage, or inference latency.
16. The system according to any one of claims 13 to 15, wherein transforming the layer-wise sparsity allocation into the layer-wise pruning ratio comprises jointly minimizing distortion across all neural network layers.
17. The system according to claim 13, wherein applying the computed pruning ratios comprises incrementally removing weights from each layer in multiple iterations, and updating distortion terms $\delta_{i,k}$ after each iteration.
18. The system according to claim 17, wherein the distortion term $\delta_{i,k}$ represents the estimated distortion in detection outputs caused by pruning a subset of weights in the i-th neural network layer, and wherein the distortion term $\delta_{i,k}$ is computed over only the subset of weights newly pruned at a given pruning step, $k$.
19. The system according to claims 13 and 18, wherein the distortion term $\delta_{i,k}$ for the i-th layer is incrementally updated based on the distortion term computed in the prior pruning step, $\delta_{i,k-1}$.
20. The system according to claim 18, wherein the associated distortion term $\delta_{i,k}$ is updated within a lower-dimensional subspace defined by a subvector of weights corresponding to the subset of weights in the i-th layer newly identified for pruning.
21. The system according to claim 13, wherein the second-order Hessian-based rate-distortion analysis is approximated using previously computed gradient information or an empirical Fisher information matrix to reduce computational complexity.
22. The system according to claim 13, wherein the second-order Hessian-based rate-distortion analysis is computed once and reused across multiple pruning ratios, such that distortion terms for each subset of pruned weights are incrementally updated without recomputing the full analysis.
23. The system according to claim 13, wherein the layer-wise pruning ratio for each layer is selected such that the associated distortion term $\delta_{i,k}$, derived from the second-order Hessian-based rate-distortion analysis, is minimized with respect to detection outputs, including bounding box localization and classification confidence.
24. The system according to claim 13, wherein the distortion in detection outputs comprises a difference between the predictions of the pretrained dense 3D object detection model and the predictions of the pruned 3D object detection model.