Point cloud three-dimensional target detection method and system based on local attention Mamba model

By using the hash table index of the local attention Mamba model and the bidirectional Mamba module, combined with grid-aware dynamic filtering and dual sorting sampling, the problem of balancing efficiency and accuracy in traditional methods is solved, and efficient 3D target detection is achieved.

CN122265626A · Pending · Publication Date: 2026-06-23 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2026-03-23
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing point-based 3D object detection methods struggle to balance efficiency and accuracy. Traditional architectures suffer from high computational overhead, and sampling strategies often fail to achieve both efficiency and fidelity, resulting in limited detection performance.

Method used

We employ the local attention Mamba model, generating local attention weights through hash table indexing and element-wise multiplication. We combine this with a bidirectional Mamba module to model global long-range dependencies, constructing an efficient point cloud 3D object detection architecture. We utilize a grid-aware dynamic filtering mechanism and a dual sorting sampling strategy to achieve lightweight feature extraction and global context capture.

Benefits of technology

It significantly reduces computational overhead, improves detection efficiency and accuracy, achieves efficient end-to-end processing, and enhances the performance and generalization ability of 3D target detection.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to a point cloud three-dimensional target detection method and system based on a local attention Mamba model, which comprises the following steps: obtaining a key point cloud subset from a point cloud; inputting the key point cloud subset into a local attention Mamba model composed of N hierarchical stacks, and sequentially performing the following steps on each level: for each point, obtaining a neighborhood point set through a hash table index, generating local attention weights based on element-by-element multiplication interaction, and weighting and aggregating neighborhood features to obtain local geometric features; then modeling global long-range dependencies of the local geometric features by using a bidirectional Mamba module, and outputting global features as the input of the next layer; and inputting the global features output by the last level into a bird's eye view backbone network to obtain a three-dimensional target detection result. The application forms a progressive feature learning mechanism of local perception and global enhancement, and significantly improves the expression ability and reasoning efficiency of a point cloud backbone network.

Description

Technical Field

[0001] This application relates to the field of target detection technology, specifically to a point cloud 3D target detection method and system based on the local attention Mamba model.

Background Technology

[0002] LiDAR 3D object detection is a cornerstone of fields such as autonomous driving, robotics, and augmented reality, and is crucial for precise environmental perception. Currently, mainstream methods for processing point cloud data follow two technical routes: voxel-based methods and point-based methods. Voxel-based methods transform sparse and irregular point cloud data into a structured 3D grid, which facilitates processing with 3D sparse convolutional neural networks (SpCNNs) or emerging Transformer architectures, as shown in Figure 3(b). In contrast, point-based methods operate directly on the original, unstructured point set. Their core advantage is that they fully preserve accurate geometric details and fine-grained information, avoiding the quantization errors and information loss that may occur during voxelization.

[0003] Although point-based methods, as shown in Figure 3(a), have great potential in terms of accuracy, their long-standing efficiency bottleneck severely limits their development and application. Existing technologies face the following prominent problems. Balancing feature extraction efficiency and expressive power is challenging: existing methods struggle to design backbone networks that efficiently capture fine-grained local geometry while simultaneously modeling wide-range global context. Traditional architectures often favor one aspect at the expense of the other, or suffer from excessive computational overhead due to complex neighborhood queries and long-range interactions.

[0004] Sampling strategies face a dilemma between efficiency and fidelity: to reduce computational burden, large-scale point clouds must be downsampled. However, farthest point sampling (FPS) has high computational complexity and is slow, while random sampling is prone to losing key structural information, which seriously affects detection performance.

[0005] Furthermore, a noteworthy phenomenon is the increasing shift in academic exploration of efficient architectures towards a voxel-based paradigm. It is understood that all recent attempts to improve efficiency using the Mamba architecture have employed voxel-based approaches, while related explorations in the point cloud field have been rarely reported. Considering the inherent advantages in accuracy of directly processing point cloud data, there is reason to believe that designing more efficient point-based 3D object detection architectures is equally important and has significant practical application value.

[0006] A search revealed that Chinese patent application No. 202510038073.8 discloses a method for analyzing 3D point clouds by integrating local features and global contextual information. That method improves the accuracy of 3D point cloud tasks by extracting local and global features separately, but it cannot guarantee that the overall method remains lightweight.

Summary of the Invention

[0007] To address one of the shortcomings of existing technologies, the purpose of this application is to provide a point cloud 3D target detection method and system based on the local attention Mamba model.

[0008] The first aspect of this application provides a point cloud 3D object detection method based on a local attention Mamba model, comprising: obtaining a key point cloud subset from the point cloud; inputting the key point cloud subset into a local attention Mamba model consisting of N stacked levels, where each level performs the following in sequence: first, for each point, obtaining the set of neighboring points through hash table indexing, generating local attention weights based on element-wise multiplication interaction, and weighting and aggregating the neighborhood features to obtain local geometric features; then using the bidirectional Mamba module to model the global long-range dependencies of the local geometric features, and outputting global features as the input of the next level; and inputting the global features output from the last level into the bird's-eye view backbone network to obtain the 3D target detection result.

[0009] Optionally, obtaining the key point cloud subset from the point cloud includes: dividing the point cloud into a uniform 3D grid and determining the grid index to which each point belongs; extracting and concatenating several geometric attributes of each point, and mapping them to a high-dimensional semantic space through a point cloud feature extraction network to obtain enhanced features; calculating the importance score of each point based on the channel mean of the enhanced features and sorting the points in global descending order; and, based on the global descending sort, the grid index of each point, and the preset maximum number k of points to be retained per grid, determining the final subset of points to be retained through a grid-aware dynamic filtering mechanism, so that at most the top k key points by importance are retained in each 3D grid cell.

[0010] Optionally, the grid-aware dynamic filtering mechanism includes: sorting all points in descending order of importance score and iterating through the sorted sequence; for the point currently being processed, determining its grid cell; and checking whether that grid cell has already retained a sufficient number of higher-importance points within a preset look-back distance in the sequence. If yes, the grid cell has reached the preset limit on the number of key points, and the current point is discarded; if no, the grid cell still has room, and the current point is retained. Through this point-by-point filtering, the final output is a sparse key point subset composed of the retained points, which maintains both the spatial distribution structure and the contextual semantic information of the original point cloud.

[0011] Optionally, for each point, obtaining the neighborhood point set through a hash table index, generating local attention weights based on element-wise multiplication interaction, and weighted aggregating neighborhood features to obtain local geometric features includes: For each point in the keypoint cloud subset, its feature vector is used as the query vector. The neighborhood point set is determined by using a hash table index, and the feature vector of each neighborhood point is used as the corresponding key vector and value vector. The query vector and each key vector are multiplied element-wise using the multiplication aggregation operator and weighted to obtain the interaction score; the interaction scores of all neighborhood positions are normalized along the neighborhood dimension using the Softmax function to generate local attention weights. The attention weights and value vectors are input into another multiplication aggregation operator for feature modulation to obtain a weighted term. Based on the weighted terms, the weighted terms of all neighborhoods are summed to obtain the output features of the point in the key point cloud subset, which are used as its local geometric features.

[0012] Optionally, the step of using a bidirectional Mamba module to model global long-range dependencies on local geometric features and outputting global features includes: describing the dynamic propagation of the local geometric features, taken as input signals, by a continuous state evolution equation, and discretizing that equation into a recursive form by the zero-order hold method to form the Mamba module, in which the state transition matrix A is an input-independent learnable parameter, while the time step and projection matrices are dynamically generated from the local geometric features of the current input; sorting the features of all key points within a single batch by X-axis coordinate in ascending order and concatenating them into a single long sequence, which is used as the input of the Mamba module to perform sequence modeling along the X-axis direction; reordering the features of all key points in the same batch by Y-axis coordinate and inputting them into the Mamba module to capture contextual information in the orthogonal direction; and restoring the outputs from both directions to the original point order and fusing them, yielding global features that carry global long-range dependencies.

[0013] Optionally, the step of inputting the global features output from the last level into the bird's-eye view backbone network to obtain the 3D object detection result includes: projecting the global features onto a bird's-eye view plane and generating a 2D feature map by z-axis max pooling; integrating the 2D feature maps into multiple feature maps of different scales through deconvolution and skip connections; inputting the feature map at each scale into the corresponding prediction branch of the detection head, where each branch independently regresses the parameters of the 3D detection box at that scale, including center coordinate offset, size, orientation angle, and class probability; and performing cross-scale non-maximum suppression on the candidate detection boxes output at all scales, merging redundant detections based on confidence scores, and selecting the optimal single detection box as the final pose output.

[0014] Optionally, during training, a differentiated loss function is used for supervised learning for different output branches of the detection head, and combined with mixed precision training, warm-up-decay learning rate scheduling, and combined data augmentation strategies to complete end-to-end training of the model.
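
By way of non-limiting illustration, a warm-up-decay learning rate schedule could take the following form. The linear warm-up, the cosine decay, the helper name `lr_at`, and all of its parameters are assumptions made for this sketch; the application does not specify the shape of the schedule.

```python
from math import cos, pi

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at a given training step (hypothetical helper).

    Assumed shape: linear warm-up to peak_lr, then cosine decay to min_lr.
    """
    if step < warmup_steps:
        # Warm-up phase: ramp linearly up to the peak value.
        return peak_lr * (step + 1) / warmup_steps
    # Decay phase: progress runs from 0 (end of warm-up) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + cos(pi * progress))

# Illustrative values only.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(101)]
```

The rate rises during the first ten steps, peaks at the end of warm-up, and then falls smoothly toward `min_lr`.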

[0015] A second aspect of this application provides a point cloud 3D object detection system based on a local attention Mamba model, comprising: The sampling module obtains a subset of key point clouds from the point cloud. The Local Attention Mamba module inputs the subset of keypoints into a Local Attention Mamba model consisting of N stacked levels. Each level includes, in sequence: The local feature extraction submodule first obtains the set of neighboring points for each point through a hash table index, generates local attention weights based on element-wise multiplication interaction, and aggregates the neighborhood features in a weighted manner to obtain local geometric features. The global feature extraction submodule then uses the bidirectional Mamba module to model the global long-range dependency of local geometric features, and outputs the global features as the input of the next layer. The detection result generation module inputs the global features output from the last level into the bird's-eye view backbone network to obtain the three-dimensional target detection results.

[0016] A third aspect of this application provides a terminal including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can be used to perform the method described therein, or to run the system described therein.

[0017] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can be used to perform the method described thereon or to run the system described thereon.

[0018] This application proposes a point cloud 3D object detection method based on a local attention Mamba model, constructing an efficient and expressive stacked network architecture. This architecture alternately integrates a local multiplication aggregation module and a bidirectional Mamba module at each layer: the former achieves lightweight neighborhood feature interaction through element-wise multiplication, combined with hash-accelerated neighborhood queries, significantly improving the efficiency of capturing fine-grained local structures; the latter uses the sequence modeling capability of the state-space model to efficiently capture long-range dependencies across regions, achieving wide-range context awareness. The two work together to form a progressive feature learning mechanism of local awareness and global enhancement, significantly reducing computational overhead while maintaining high expressiveness, and thereby addressing the technical challenge of balancing accuracy and efficiency in traditional methods.

[0019] Other technical effects resulting from the additional features will be further illustrated in the corresponding embodiments.

Attached Figure Description

[0020] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is a flowchart illustrating a point cloud 3D object detection method based on a local attention Mamba model according to an exemplary embodiment;
Figure 2 is a schematic diagram illustrating the model architecture and processing flow of an efficient point cloud detection framework based on dual sorting sampling and local-global feature collaboration, according to an exemplary embodiment;
Figure 3 is a schematic diagram comparing an efficient point cloud 3D object detection framework based on local attention Mamba with existing technical solutions according to an exemplary embodiment, wherein (a) shows the processing of unordered point sets in PointNet, (b) shows the processing of voxels in SpCNN / Transformer, and (c) shows the processing of the original point cloud in the method of this application;
Figure 4 is a schematic diagram showing a performance-speed-parameter comparison between the method of this application and mainstream 3D detection models according to an exemplary embodiment;
Figure 5 is a structural diagram of a point cloud 3D object detection system based on a local attention Mamba model, according to an exemplary embodiment.

Detailed Implementation

[0021] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Parts not described in detail in the following embodiments can be implemented using existing technology.

[0022] Currently, existing point-based methods generally face a backbone network design challenge when processing large-scale point cloud data: how to design a backbone network that efficiently captures both fine-grained local geometry and large-scale global context directly from point clouds remains an open problem. To address this, this application provides a point cloud 3D object detection method based on a local attention Mamba model.

[0023] Referring to Figure 1, Figure 2, and Figure 3(c), in one embodiment of this application, a point cloud 3D target detection method based on the local attention Mamba model includes:
S100: obtain the key point cloud subset from the point cloud;
S200: input the key point cloud subset into a local attention Mamba model consisting of N stacked layers, with each layer executing in sequence:
S201: for each point, obtain the neighborhood point set through hash table indexing, generate local attention weights based on element-wise multiplication interaction, and weight and aggregate the neighborhood features to obtain local geometric features;
S202: then use the bidirectional Mamba module to model the global long-range dependency of the local geometric features, and use the output global features as the input of the next layer;
S300: input the global features output from the last level into the bird's-eye view backbone network to obtain the 3D object detection results.

[0024] Specifically, S201 uses hash table indexing to achieve fast neighborhood point lookup, improving query efficiency; it uses element-wise multiplication to generate local attention weights, enhancing key point feature aggregation in a lightweight way and effectively capturing local geometric details.
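
As a non-limiting illustration, the hash-indexed neighborhood lookup of S201 might be sketched as follows. The voxel-style hash key, the 27-cell search, the radius test, and the names `build_hash` and `neighbors` are assumptions made for this sketch; the application does not specify the hash layout.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

def build_hash(points, cell):
    """Map each integer cell coordinate to the indices of the points inside it."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(floor(c / cell) for c in p)
        table[key].append(idx)
    return table

def neighbors(points, table, cell, i, radius):
    """Indices of points within `radius` of point i (assumes radius <= cell)."""
    cx, cy, cz = (floor(c / cell) for c in points[i])
    found = []
    # Only the 27 cells around the query cell are visited, so the cost per
    # query does not grow with the total number of points.
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for j in table.get((cx + dx, cy + dy, cz + dz), ()):
            if j != i and dist(points[i], points[j]) <= radius:
                found.append(j)
    return sorted(found)

# Illustrative data only.
points = [(0.1, 0.1, 0.1), (0.4, 0.1, 0.1), (1.05, 0.1, 0.1), (5.0, 5.0, 5.0)]
table = build_hash(points, cell=1.0)
near0 = neighbors(points, table, cell=1.0, i=0, radius=1.0)
```

Each query touches a bounded number of hash buckets, which is the source of the near-constant lookup time the embodiment relies on.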

[0025] Specifically, S202 introduces a bidirectional Mamba module to model long-range dependencies of point sequences, avoiding the high computational overhead of traditional self-attention and achieving efficient capture of a wide range of contextual information.

[0026] In the embodiments described above, local and global modules are stacked alternately, and local fine-grained features and global structure perception are collaboratively optimized at each layer, taking into account both model expressive power and computational efficiency. This supports efficient end-to-end processing of the original point cloud and provides high-quality feature representation for subsequent detection tasks.

[0027] In existing technologies, there is a challenge in balancing sampling efficiency and information fidelity. To address this issue, some specific embodiments of this application propose a dual-sorting sampling strategy using importance scores and grid indices. Therefore, in S100, obtaining a subset of key point clouds from the point cloud can be achieved through the following steps: S101, for any input point cloud, divide it into a uniform 3D grid and determine the grid index to which each point belongs.

[0028] Specifically, a grid index refers to a unique identifier assigned to each small square after dividing the 3D space into several uniform small squares (i.e., "voxel grids"). Each point in the point cloud is assigned to the corresponding grid based on its spatial location and inherits the grid's number; this number is the grid index of that point.
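
A minimal sketch of the grid index assignment is given below, assuming a row-major linearization of the cell coordinates and points that lie inside the grid bounds. The function name `grid_indices` and the parameters `origin`, `cell`, and `dims` are hypothetical and introduced only for illustration.

```python
from math import floor

def grid_indices(points, origin, cell, dims):
    """Return one linear grid index per point for a uniform 3D grid.

    origin: minimum corner of the grid; cell: edge length of one cell;
    dims: number of cells along x, y, z (assumed to cover every point).
    """
    _nx, ny, nz = dims
    out = []
    for p in points:
        ix, iy, iz = (floor((c - o) / cell) for c, o in zip(p, origin))
        # Row-major linearization: a unique identifier per cell.
        out.append((ix * ny + iy) * nz + iz)
    return out

# Illustrative data only: the first two points share a cell.
pts = [(0.2, 0.2, 0.2), (0.9, 0.1, 0.3), (1.5, 0.2, 0.2)]
idx = grid_indices(pts, origin=(0.0, 0.0, 0.0), cell=1.0, dims=(2, 2, 2))
```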

[0029] S102: extract three geometric attributes for each point within the 3D grid: the absolute coordinates $F_{coords}$, the local centroid offset vector $F_{cluster}$, and the grid center offset vector $F_{center}$. The three are concatenated to construct the initial point cloud feature vector, i.e., $F_{point} = \mathrm{Concat}(F_{coords}, F_{cluster}, F_{center})$. The point cloud feature extraction network, consisting of two linear layers, maps the initial low-dimensional point cloud feature vectors to a high-dimensional semantic space, resulting in the enhanced features $F'_{point}$; S103: for the enhanced features of each point, calculate its importance score from the mean over the feature channels, $S(p_i) = \mathrm{Avg}(F'_{point})$, and sort the points in global descending order; S104: based on the global descending sort and the grid index $g$, determine the final subset of points to be retained through the decision condition $g'_j \neq g'_{j-k}$.

[0030] Specifically, for the current position $j$, look at the position $k$ places before it, that is, the $(j-k)$-th point. If $g'_j = g'_{j-k}$, then all points from position $j-k$ to position $j$ belong to the same grid cell. Since the list is sorted in descending order of importance, at least $k$ points more important than the current point have already been selected from that cell, so the current point is discarded. If $g'_j \neq g'_{j-k}$, the current grid cell has not yet been filled within the preceding $k$ positions, and the current point is among the top $k$ points by importance within its cell, so the point at position $j$ is retained.

[0031] The retained points form a subset of the point cloud containing the original space and contextual information.
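
The filtering rule of S103 and S104 can be sketched as below. This is an illustration under stated assumptions: the second sort is taken to be a stable sort by grid index (so that points of one cell become contiguous while keeping their descending importance), and the names `importance` and `dual_sort_sample` are hypothetical.

```python
def importance(features):
    """Importance score S(p_i): mean over the feature channels."""
    return [sum(f) / len(f) for f in features]

def dual_sort_sample(scores, grid_idx, k):
    """Keep at most the k highest-scoring points in every grid cell."""
    # First sort: global descending importance.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Second sort (assumed): stable sort by grid index, so each cell's points
    # are contiguous and still ordered by descending importance.
    order.sort(key=lambda i: grid_idx[i])
    g = [grid_idx[i] for i in order]
    # Retain position j when the point k places earlier lies in another cell.
    return [order[j] for j in range(len(order)) if j < k or g[j] != g[j - k]]

# Illustrative data only: four points in cell 0, two points in cell 1.
features = [[1.0, 0.8], [0.2, 0.0], [0.6, 0.4], [0.8, 0.6], [0.4, 0.2], [0.9, 0.7]]
grid_idx = [0, 0, 0, 1, 1, 0]
kept = dual_sort_sample(importance(features), grid_idx, k=2)
```

With `k=2`, the two lowest-scoring points of cell 0 are dropped and both points of cell 1 survive. The whole selection costs two sorts and one linear pass, with no pairwise distance computation of the kind farthest point sampling requires.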

[0032] The above embodiments propose an efficient and adaptive point cloud sampling mechanism. Through local grid partitioning and a dual sorting strategy, the most spatially representative point set is retained dynamically, preserving the original spatial and contextual information. This avoids the high computational cost of traditional farthest point sampling (FPS) while significantly reducing the loss of structural information that occurs with random sampling.

[0033] It is worth noting that the above embodiments construct a lightweight pure point cloud detection architecture. The entire process requires no voxelization, avoiding quantization errors, and directly implements end-to-end processing on the original point cloud. Computational complexity increases linearly with the point cloud size, significantly reducing the number of parameters and inference latency. Through dual-sorting sampling, hash index neighborhood query, local multiplication aggregation, and bidirectional Mamba global modeling, local geometric features and contextual dependencies are fused layer by layer. The detection process is decomposed into four independently optimizable functional units: sampling initialization, local extraction, global modeling, and detection decoding. The structure is clear and facilitates collaborative optimization and expansion.

[0034] To capture fine local geometric features directly from sparse point clouds at low computational cost, this application introduces a local multiplication aggregation module. This module employs a lightweight local attention mechanism that achieves adaptive aggregation of local information and fine-grained reasoning about complex spatial patterns through efficient element-wise multiplication interactions between neighboring point sets. In some specific embodiments of this application, step S201 (for each point, obtaining the neighborhood point set through a hash table index, generating local attention weights based on element-wise multiplication interactions, and weighting and aggregating the neighborhood features to obtain local geometric features) can be achieved through the following steps: S2011: normalize each point in the key point subset and use its feature vector after a linear layer as the query vector $q_i$; determine the neighborhood point set using a hash table index, and use the feature vector of each neighborhood point after a linear layer as the corresponding key vector $k_j$ and value vector $v_j$; S2012: design a multiplicative aggregation operator $\Phi(f_i, f_j; \omega) = \omega(f_i \odot f_j)$ that performs element-wise multiplication between the features of each point and those of its neighboring point set, which enhances the synergy between channels, and use this operator to construct a local multiplicative aggregation module that completes local feature aggregation. The specific operation is as follows: S2012.1: from the query vector $q_i$ of the input point cloud features and the neighborhood key vector $k_j$, compute the interaction score $e_{ij} = \Phi(q_i, k_j; \omega_1)$ with the multiplicative aggregation operator, and generate the attention weights $\alpha_{ij}$ through a Softmax layer; S2012.2: modulate the attention weight $\alpha_{ij}$ and the value vector $v_j$ again with the multiplicative aggregation operator $\Phi(\cdot; \omega_2)$, and obtain the output feature by weighted summation, $f_i^{out} = \sum_j \Phi(\alpha_{ij}, v_j; \omega_2)$.
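
A minimal sketch of the local multiplicative aggregation for a single query point follows. It assumes that the first weighting reduces the element-wise product to one scalar score per neighbor and that the second weighting acts as a per-channel scale; both readings, and the names `local_aggregate`, `w1`, and `w2`, are assumptions for illustration rather than the form fixed by the application.

```python
from math import exp

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def local_aggregate(q, keys, values, w1, w2):
    """Aggregate neighbor values for one query point.

    score   e_j  = sum_c w1[c] * q[c] * keys[j][c]   (assumed scalar per neighbor)
    weight  a_j  = softmax over the neighborhood of e_j
    output  f[c] = sum_j w2[c] * a_j * values[j][c]  (assumed per-channel scale)
    """
    scores = [sum(w * qc * kc for w, qc, kc in zip(w1, q, key)) for key in keys]
    alphas = softmax(scores)
    out = [w * sum(a * v[c] for a, v in zip(alphas, values))
           for c, w in enumerate(w2)]
    return alphas, out

# Illustrative data only: two neighbors with identical keys get equal weight.
alphas, f_out = local_aggregate(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [0.0, 4.0]],
    w1=[1.0, 1.0],
    w2=[1.0, 1.0],
)
```

No query-key matrix product is formed: the interaction is an element-wise product followed by a cheap weighting, which keeps the cost linear in the neighborhood size and channel count.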

[0035] The embodiments described above in this application improve neighborhood retrieval efficiency to near constant time through hash indexing, significantly reducing the computational latency for processing sparse point clouds. High-order feature interactions are achieved through element-wise multiplication operators, enhancing channel synergy while maintaining lightweight design, enabling more adaptive aggregation of complex local geometric features. This embodiment solves the problems of high computational overhead and long processing time in traditional point cloud neighborhood queries, and the difficulty of standard feature aggregation methods (such as addition or concatenation) in capturing subtle local spatial patterns and deep interactions between channels.

[0036] To supplement locally aware point features with scene-level global context, this application directly applies a bidirectional Mamba layer to the point cloud structure. In some specific embodiments of this application, S202, modeling global long-range dependencies on local geometric features using the bidirectional Mamba module, can be achieved through the following steps: S2021, selective state-space modeling, uses the Mamba model to construct an input-adaptive long-range dependency modeling mechanism.

[0037] Specifically: First, establish a continuous state evolution model.

[0038] A state evolution model is established using differential equations: $h'(t) = A h(t) + B x(t)$, $y(t) = C h(t) + D x(t)$, where $A$ is an $N \times N$ state transition matrix, $B$ and $C$ are projection matrices, and $D$ is a direct feedthrough term; $t$ is the continuous time variable; $x(t)$ is the continuous input signal, i.e., the corresponding feature vector in the point cloud; $h(t)$ is the hidden state of dimension $N$, responsible for storing the historical information of the sequence; $h'(t)$ is the derivative of the hidden state with respect to time; and $y(t)$ is the output signal.

[0039] Then, the above continuous state evolution model is discretized.

[0040] The continuous system is discretized using the zero-order hold method: $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\cdot \Delta B$, and the discrete recurrence relations $h_k = \bar{A} h_{k-1} + \bar{B} x_k$ and $y_k = C h_k + D x_k$ are established on this basis, where $\Delta$ (Delta) is the time step (step factor), $I$ is the identity matrix, $\bar{A}$ is the discretized state transition matrix, $\bar{B}$ is the discretized input projection matrix, $k$ is the discrete time step index (the position of the corresponding point in the point cloud sequence), and $x_k$, $h_k$, $y_k$ are the input features, hidden state, and output features at the $k$-th step, respectively.
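
The discretization and recurrence can be sketched for the simplified case of a diagonal state matrix with nonzero entries and a scalar input channel. The diagonal restriction and the names `zoh` and `scan` are assumptions made to keep the sketch short; with a diagonal A, the matrix exponential and inverse reduce to per-state scalar operations.

```python
from math import exp

def zoh(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal state matrix A."""
    a_bar = [exp(delta * a) for a in a_diag]
    # (delta*A)^-1 (exp(delta*A) - I) delta*B reduces to (exp(delta*a) - 1) / a * b.
    b_bar = [(ab - 1.0) / a * bn for ab, a, bn in zip(a_bar, a_diag, b)]
    return a_bar, b_bar

def scan(xs, a_diag, b, c, d, delta):
    """Run h_k = A_bar h_{k-1} + B_bar x_k and y_k = C h_k + D x_k."""
    a_bar, b_bar = zoh(a_diag, b, delta)
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        h = [ab * hn + bb * x for ab, bb, hn in zip(a_bar, b_bar, h)]
        ys.append(sum(cn * hn for cn, hn in zip(c, h)) + d * x)
    return ys

# Illustrative check: for a constant input, zero-order hold reproduces the
# continuous solution h(t) = 1 - exp(-t) of h' = -h + 1 at the sample times.
ys = scan([1.0] * 5, a_diag=[-1.0], b=[1.0], c=[1.0], d=0.0, delta=0.1)
```

Because the input is held constant within each step, the discrete states coincide with the continuous trajectory at the sampling instants, which is the property that motivates this discretization.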

[0041] The aforementioned discrete model introduces a dynamic parameter mechanism to achieve content-aware modeling: the time step Δ is dynamically generated through a linear layer to control the granularity of state updates; the projection matrices B and C are generated in real time from the input features; and the state matrix A is an input-independent learnable parameter, which maintains system stability. This mechanism enables the model to dynamically adjust the state transition process based on point cloud features, adapting the memory update rhythm and information fusion weights to local geometric semantics (such as edges, planes, or isolated points).
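
The dynamic parameter mechanism might be sketched as follows for a scalar channel. The softplus applied to the time step (to keep it positive), the purely linear maps without bias, the diagonal A with nonzero entries, and the names `selective_scan`, `w_delta`, `w_b`, and `w_c` are all assumptions for illustration.

```python
from math import exp, log1p

def softplus(z):
    """Smooth positive mapping, assumed here to keep the time step above zero."""
    return log1p(exp(z))

def selective_scan(xs, a_diag, w_delta, w_b, w_c):
    """Scalar-channel scan in which the step size, B and C depend on each x_k."""
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)             # input-dependent time step
        b = [wb * x for wb in w_b]                # input-dependent B
        c = [wc * x for wc in w_c]                # input-dependent C
        a_bar = [exp(delta * a) for a in a_diag]  # A itself is not input-dependent
        b_bar = [(ab - 1.0) / a * bn for ab, a, bn in zip(a_bar, a_diag, b)]
        h = [ab * hn + bb * x for ab, bb, hn in zip(a_bar, b_bar, h)]
        ys.append(sum(cn * hn for cn, hn in zip(c, h)))
    return ys

# Illustrative values only.
out = selective_scan([1.0], a_diag=[-1.0], w_delta=0.0, w_b=[1.0], w_c=[1.0])
```

A larger time step shrinks the retained state and lets the current input dominate, while a smaller one preserves history, which is how the update rhythm adapts to the content of each point.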

[0042] S2022, based on the above discrete recursive relationship, adds one-dimensional convolution, linear projection layer and gating mechanism to construct a Mamba module.

[0043] S2023, Construct an axially ordered batch sequence input.

[0044] Specifically, to convert an unordered 3D point cloud into a one-dimensional sequence suitable for processing by the Mamba module, the following operations are performed: Within a single training batch, all keypoints are sorted in ascending order of their X-axis coordinates. The local geometric features corresponding to the sorted points are concatenated sequentially to form a long sequence, which serves as the input to the Mamba module.

[0045] This sorting method replaces complex space-filling strategies such as traditional Hilbert curves or Z-order encoding, significantly reducing sequence construction overhead while preserving spatial continuity along the main direction.

[0046] S2024 captures the global context through bidirectional scanning in orthogonal directions.

[0047] First scan: sort the points along the X-axis according to the method of S2023 and input the sequence into the first Mamba module to complete long-range dependency modeling in the horizontal direction. Second scan: reorder the same group of points by Y-axis coordinate in ascending order, concatenate them into a new sequence, and input it into a second Mamba module with the same structure to capture the context information in the vertical direction. The output features from both directions are restored to the original point order and then fused point by point (for example by addition or concatenation) to obtain the final global feature representation. The fused global features serve as the input to the next stacked layer.
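
The two orthogonal scans and the restoration of the original point order can be illustrated as below. A running prefix sum stands in for each Mamba module purely as a toy placeholder, fusion by addition is one of the options named above, and the names `scan_along`, `bidirectional`, and `running_sum` are hypothetical.

```python
def scan_along(features, coords, axis, seq_model):
    """Sort points by one coordinate, run seq_model, restore original order."""
    order = sorted(range(len(coords)), key=lambda i: coords[i][axis])
    out_sorted = seq_model([features[i] for i in order])
    restored = [None] * len(coords)
    for pos, i in enumerate(order):
        restored[i] = out_sorted[pos]  # undo the permutation
    return restored

def bidirectional(features, coords, model_x, model_y):
    """Fuse an X-ordered scan and a Y-ordered scan by point-wise addition."""
    fx = scan_along(features, coords, 0, model_x)
    fy = scan_along(features, coords, 1, model_y)
    return [a + b for a, b in zip(fx, fy)]

def running_sum(seq):
    """Toy stand-in for a Mamba module: causal prefix sum of scalar features."""
    total, out = 0.0, []
    for v in seq:
        total += v
        out.append(total)
    return out

# Illustrative data only: three points with scalar features.
coords = [(2.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
features = [1.0, 10.0, 100.0]
fused = bidirectional(features, coords, running_sum, running_sum)
```

Each point receives context from the points that precede it along X and from those that precede it along Y, and the only ordering cost is two coordinate sorts.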

[0048] During the aforementioned scanning process, the selective state-space mechanism dynamically adjusts the parameters Δ, B, and C (and, through Δ, the discretized matrices Ā and B̄) to achieve long-range dependency modeling that adapts to the input. Combined with the bidirectional scanning strategy, which captures the 3D spatial structure along orthogonal dimensions, this enables efficient global feature modeling with linear computational overhead.

[0049] The embodiments described above in this application achieve linear growth in computational cost with the size of the point cloud (O(N)), and the inference speed in large-scale scenarios far exceeds that of traditional Transformers. The bidirectional scanning mechanism ensures that the model can still accurately capture the spatial structure in orthogonal directions even with extremely low computing power, effectively solving the technical problem of easily losing small targets at long distances.

[0050] The above embodiments construct a feature backbone that integrates local geometry and global contextual information through stacked local attention Mamba modules. Subsequent detection and decoding are then performed on the global features output by this backbone network. In some specific embodiments of this application, step S300, in which the global features output from the last layer are input into the bird's-eye view backbone network to obtain the 3D object detection result, can be implemented through the following steps:

S301: In the feature decoding stage, a multi-scale fusion architecture is adopted, and the global features obtained from the last layer are input into the bird's-eye view backbone network.

[0051] Specifically, the network achieves multi-level feature aggregation based on the Feature Pyramid Network (FPN) structure. First, the 3D features are projected onto the bird's-eye view plane, and 2D feature maps are generated through max pooling along the z-axis. Then, feature maps at scales from 1/1 to 1/8 are integrated through deconvolution and skip connections.
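The projection step can be illustrated with a small NumPy sketch that scatters point features into a bird's-eye view grid and keeps the channel-wise maximum of all points falling into the same (x, y) cell, regardless of height. The function name `bev_max_pool`, the cell size, and the ranges are hypothetical, and empty cells are filled with zeros as an assumption.

```python
import numpy as np

def bev_max_pool(coords, feats, x_range, y_range, cell):
    """Project point features onto a BEV grid, max-pooling over the z-axis."""
    nx = int(np.ceil((x_range[1] - x_range[0]) / cell))
    ny = int(np.ceil((y_range[1] - y_range[0]) / cell))
    ix = np.floor((coords[:, 0] - x_range[0]) / cell).astype(int)
    iy = np.floor((coords[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop out-of-range points
    bev = np.full((ny, nx, feats.shape[1]), -np.inf, dtype=feats.dtype)
    # Unbuffered maximum: repeated cell indices are all taken into account.
    np.maximum.at(bev, (iy[keep], ix[keep]), feats[keep])
    bev[bev == -np.inf] = 0.0                             # empty cells become zero
    return bev

coords = np.array([[0.1, 0.1, 0.0], [0.2, 0.3, 2.0], [1.5, 0.5, 1.0]])
feats = np.array([[1.0, 5.0], [3.0, 2.0], [4.0, 4.0]])
bev = bev_max_pool(coords, feats, x_range=(0.0, 2.0), y_range=(0.0, 2.0), cell=1.0)
print(bev.shape)
```

The first two points share a cell, so that cell holds the per-channel maximum of their features.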

[0052] S302: Input the feature maps of each scale into the corresponding prediction branch of the detection head. Each branch independently regresses the parameters of the 3D detection box at its scale, including center coordinate offset, size, orientation angle, and class probability.

S303: Perform cross-scale non-maximum suppression on the candidate detection boxes output at all scales, merge redundant detection results based on confidence scores, and select the optimal single detection box as the final pose output.
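A minimal sketch of cross-scale non-maximum suppression is shown below: candidates from all scales are pooled, then greedily suppressed by confidence. As a simplifying assumption, boxes are axis-aligned BEV rectangles `(x1, y1, x2, y2)` rather than rotated 3D boxes, and the threshold `iou_thr=0.5` is a hypothetical value not given in the text.

```python
import numpy as np

def bev_iou(box, boxes):
    """Axis-aligned IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def cross_scale_nms(boxes_per_scale, scores_per_scale, iou_thr=0.5):
    """Pool candidates from every scale, then keep the best box per object."""
    boxes = np.concatenate(boxes_per_scale)
    scores = np.concatenate(scores_per_scale)
    order = np.argsort(-scores)              # highest confidence first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        ious = bev_iou(boxes[best], boxes[rest])
        order = rest[ious <= iou_thr]        # drop boxes overlapping the best one
    return boxes[keep], scores[keep]

scale_a = np.array([[0.0, 0.0, 2.0, 2.0]])
scale_b = np.array([[0.1, 0.0, 2.1, 2.0], [5.0, 5.0, 6.0, 6.0]])
kept_boxes, kept_scores = cross_scale_nms(
    [scale_a, scale_b], [np.array([0.9]), np.array([0.8, 0.7])])
print(kept_scores)
```

Here the two overlapping candidates from different scales collapse into the higher-scoring one, while the distant box survives.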

[0053] The above embodiments project the local-global fusion features (the global features output by the last layer) onto the bird's-eye view space, and combine the FPN multi-scale backbone network with cross-scale non-maximum suppression to achieve high-resolution feature recovery and multi-level target detection. While maintaining efficient inference, this significantly improves the accuracy of small target detection and the stability of pose estimation.

[0054] To improve the final prediction results, in some specific embodiments of this application, a different loss function supervises each output branch of the detection head during training. This is combined with mixed precision training, warm-up-decay learning rate scheduling, and a combination of data augmentation strategies to complete the end-to-end training of the model.

[0055] Specifically, the center coordinate offset (Δx, Δy, Δz) is supervised by Smooth L1 loss; the bounding box size (l, w, h) is optimized by a joint L1-IoU loss; the orientation prediction is discretized into 8 orientation bins (each of width π/4) and supervised by cross-entropy loss; and the target class probability is supervised by Focal Loss.
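Three of these loss components can be sketched in plain Python for scalar inputs. The bin origin (bins starting at angle 0), `beta=1.0`, `alpha=0.25`, and `gamma=2.0` are commonly used defaults assumed here for illustration; the text does not specify them, and the joint L1-IoU size loss is omitted.

```python
import math

NUM_BINS = 8
BIN_WIDTH = 2 * math.pi / NUM_BINS  # pi / 4, as stated in the text

def heading_to_bin(theta):
    """Map a heading angle in radians to one of 8 bins (assumed to start at 0)."""
    return int((theta % (2 * math.pi)) // BIN_WIDTH) % NUM_BINS

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def focal_loss(p, is_positive, alpha=0.25, gamma=2.0):
    """Binary focal loss for a predicted probability p in (0, 1)."""
    p_t = p if is_positive else 1.0 - p
    a_t = alpha if is_positive else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

print(heading_to_bin(math.pi / 2 + 0.1))  # falls into the third bin (index 2)
```

The bin index would serve as the class target of the cross-entropy loss on the orientation branch.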

[0056] Specifically, the training process employs a three-part optimization strategy to improve training efficiency: (1) Automatic mixed precision (AMP) training uses FP16 computation to accelerate forward propagation and maintains FP32-precision gradient accumulation during backpropagation.

[0057] (2) The learning rate follows a warm-up-decay schedule: over the first 3 epochs it increases linearly from 1e-4 to a peak of 3.5e-3, and over the subsequent 33 epochs (nuScenes) or 21 epochs (Waymo) it decays to 1e-5 following a cosine function.
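The schedule can be written as a small function of the (possibly fractional) epoch index. The numeric defaults are taken from the paragraph above; the function name and the choice to clamp the rate at its final value after the decay phase are assumptions.

```python
import math

def learning_rate(epoch, warmup_epochs=3, decay_epochs=33,
                  lr_start=1e-4, lr_peak=3.5e-3, lr_end=1e-5):
    """Linear warm-up followed by cosine decay. Use decay_epochs=21 for the
    Waymo setting described in the text."""
    if epoch < warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    progress = min((epoch - warmup_epochs) / decay_epochs, 1.0)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 3, 36):
    print(epoch, learning_rate(epoch))
```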

[0058] (3) The data augmentation combination includes five techniques: ground-truth box sampling, inserting 15 instances per frame, to mitigate class imbalance; random flipping (X/Y axis, probability 0.5) to enhance rotation invariance; random rotation within ±45° to improve directional robustness; scale transformation by a factor of 0.95 to 1.05 to enhance size adaptability; and random translation within ±0.2 m to improve positional fault tolerance.
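The four geometric augmentations can be sketched for a raw point array as follows. Several details are assumptions not stated in the text: the two flips are applied independently, the rotation is about the Z-axis, and the translation is drawn per axis. Ground-truth box sampling is omitted, and in practice the same transforms would also have to be applied to the box labels.

```python
import numpy as np

def augment(points, rng):
    """points: (N, 3) array of x, y, z coordinates. Returns a transformed copy."""
    pts = points.copy()
    if rng.random() < 0.5:                       # flip about the X-axis
        pts[:, 1] = -pts[:, 1]
    if rng.random() < 0.5:                       # flip about the Y-axis
        pts[:, 0] = -pts[:, 0]
    angle = rng.uniform(-np.pi / 4, np.pi / 4)   # rotation within +/-45 degrees
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    pts[:, :2] = pts[:, :2] @ rot.T              # rotate in the XY plane (about Z)
    pts *= rng.uniform(0.95, 1.05)               # global scale
    pts += rng.uniform(-0.2, 0.2, size=3)        # translation within +/-0.2 m
    return pts

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 3))
augmented = augment(points, rng)
print(augmented.shape)
```

Flips, rotation, and translation preserve distances between points, so only the scale factor changes pairwise distances, by at most 5%.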

[0059] The above embodiments, through multi-task loss coordination and system-level optimization, significantly improve the model convergence speed and generalization ability without increasing network parameters; effectively alleviate problems such as missed detection of small targets, misjudgment of direction and class imbalance, and enhance the stability and accuracy of 3D detection in complex real-world scenarios.

[0060] Based on the same technical concept, other embodiments of this application, as shown in Figure 5, provide a point cloud 3D target detection system 100 based on the local attention Mamba model, including: a sampling module 110, which obtains a key point cloud subset from the point cloud; a local attention Mamba module 120, which inputs the key point cloud subset into a local attention Mamba model consisting of N stacked levels, each level including, in sequence, a local feature extraction submodule 121, which obtains the neighborhood point set for each point through a hash table index, generates local attention weights based on element-wise multiplication interaction, and aggregates neighborhood features in a weighted manner to obtain local geometric features, and a global feature extraction submodule 122, which uses the bidirectional Mamba module to model the global long-range dependencies of the local geometric features and outputs global features as the input to the next level; and a detection result generation module 130, which inputs the global features output from the last level into the bird's-eye view backbone network to obtain the 3D target detection results.

[0061] For the specific implementation of each module / unit in the above examples of this application, refer to the steps of the point cloud 3D target detection method based on the local attention Mamba model in the above embodiments; they are not repeated here.

[0062] This embodiment constructs a new paradigm for 3D perception based on dynamic sampling, local enhancement, and global modeling: a dual-sorting sampling mechanism achieves key point selection with near-linear complexity while maintaining spatial distribution, significantly accelerating the process compared to traditional FPS; a local aggregation module based on multiplicative aggregation interaction leverages the feature synergy effect of multiplicative aggregation operators to improve fine-grained detection performance with the same computational overhead; and a novel batch serialization bidirectional Mamba module establishes global long-range dependencies with near-linear complexity, significantly improving the generalization performance and detection efficiency of 3D object detection in autonomous driving scenarios compared to existing methods.

[0063] The following examples and comparative examples will be used to further illustrate this application in order to better understand the above-mentioned technical solutions. It should be understood that the following are only some examples and are not intended to limit this application.

[0064] Experimental Example 1: The method of this application was experimentally validated on two autonomous driving 3D object detection datasets, nuScenes and Waymo, to verify its generalization performance in various scenarios. The results are shown in Tables 1 and 2. Table 1 (Comparison of the generalization performance of the method of this application with existing methods, trained and tested on the nuScenes dataset)

[0065] Table 2 (Comparison of the generalization performance of the method of this application with existing methods, trained and tested on the Waymo dataset)

[0066] Based on the experimental results in Tables 1 and 2, the method in this application achieves the best overall performance of 72.4 NDS and 70.59 L2 mAPH on the nuScenes and Waymo datasets, respectively, achieving the strongest generalization performance across scenes and categories.

[0067] Experimental Example 2: The method of this application was experimentally verified on the Waymo autonomous driving 3D object detection dataset, focusing on three model efficiency metrics: parameter count, computational cost, and inference latency, to demonstrate the efficiency of this application.

[0068] Table 3 (Comparison of the efficiency of the method of this application with existing methods on the Waymo dataset in terms of parameter count, computational cost, and inference latency)

[0069] Based on the experimental results in Table 3, and as illustrated in Figure 4, the proposed method demonstrates significant computational efficiency advantages on the Waymo dataset. Compared to the Mamba-based voxel method LION, it reduces model parameters by 50%, computational cost by 28%, and inference latency by 49% while maintaining superior detection performance. Notably, even compared to voxel-based paradigms with inherent efficiency advantages (such as DSVT-voxel), this application achieves a further latency reduction together with an improvement in accuracy, overcoming the efficiency bottleneck of point-based processing paradigms.

[0070] Experimental Example 3: The effectiveness of each module in the model backbone network was experimentally verified on the nuScenes autonomous driving 3D object detection dataset, thus verifying the rationality of the design of this application.

[0071] Table 4 (Ablation experiment results of the method in this application on the nuScenes dataset for each module in the backbone network of the model)

[0072] According to the experimental results in Table 4, the local multiplication aggregation module and the bidirectional Mamba module proposed in this application exhibit a significant synergistic gain effect. When the bidirectional Mamba module adopts a bidirectional scanning mechanism (joint X / Y axis modeling), it improves the performance by 1.2 mAP and 0.9 NDS compared to the unidirectional scanning mode, verifying the necessity of multidimensional spatial dependency modeling for point sequence feature learning. When the local multiplication aggregation module and the bidirectional Mamba module are used together, the detection performance shows a continuous cumulative gain trend, fully verifying the complementarity between local geometric feature aggregation and global long-range dependency modeling.

[0073] Experimental Example 4: The proposed method was compared and validated on the nuScenes dataset using multiple point serialization schemes to evaluate the impact of scanning order on 3D detection performance. Table 5 (Performance comparison of the proposed method using different Mamba point serialization strategies)

[0074] Based on the experimental results in Table 5, while the geometry-aware serialization schemes (Hilbert / Z-order) achieve superior performance, they require additional computational overhead to reorder and reassemble the points, contradicting the efficiency objective of this application; the random serialization scheme degrades performance because it disrupts spatial consistency; and the dynamic shuffle scheme (Shuffle / ...), although it introduces scanning order diversity, reduces modeling consistency and causes performance fluctuations. The X-axis batch serialization scheme adopted in this application achieves detection accuracy comparable to the geometry-aware serialization schemes with zero reassembly cost by merging the batch point clouds along the main axis, thus verifying its effectiveness.

[0075] Based on the same technical concept, in other embodiments of this application, a terminal is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it can be used to perform the above-described method or run the above-described system.

[0076] Based on the same technical concept, in other embodiments of this application, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, can be used to perform the above-described method or to run the above-described system.

[0077] Optionally, the memory is used to store programs. The memory may include volatile memory, such as random-access memory (RAM), for example static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, etc., and the aforementioned computer programs and computer instructions can be partitioned and stored in one or more memories. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by the processor.

[0078] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.

[0079] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.

[0080] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.

[0081] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0082] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0083] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0084] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0085] The foregoing has described some specific embodiments of this application. It should be understood that this application is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the substantive content of this application. The above-described preferred features can be used in any combination without conflict.

Claims

1. A point cloud 3D target detection method based on the local attention Mamba model, characterized in that it comprises: obtaining a key point cloud subset from the point cloud; inputting the key point cloud subset into a local attention Mamba model consisting of N stacked levels, each level sequentially performing the following: first, for each point, obtaining the set of neighboring points through hash table indexing, generating local attention weights based on element-wise multiplication interaction, and weighting and aggregating the neighborhood features to obtain local geometric features; then modeling the global long-range dependencies of the local geometric features using the bidirectional Mamba module, and outputting global features as the input for the next level; and inputting the global features output from the last level into the bird's-eye view backbone network to obtain the 3D target detection results.

2. The point cloud 3D target detection method based on the local attention Mamba model according to claim 1, characterized in that, The process of obtaining a key point cloud subset from the point cloud includes: Divide the point cloud into a uniform 3D grid and determine the grid index to which each point belongs; Several geometric attributes of each point are extracted and concatenated, and then mapped to a high-dimensional semantic space through a point cloud feature extraction network to obtain enhanced features; The importance score of each point is calculated based on the channel mean of the enhanced features and sorted in global descending order; Based on the global descending sort, according to the grid index to which each point belongs and the preset maximum number of points to be retained per grid k, the final subset of point clouds to be retained is determined through a grid-aware dynamic filtering mechanism, so that at most the top k key points in terms of importance are retained in each 3D grid.

3. The point cloud 3D target detection method based on the local attention Mamba model according to claim 2, characterized in that, The grid-aware dynamic filtering mechanism includes: Sort all points in descending order based on their importance scores, and then iterate through the sorted point sequence. For the data point currently being processed in the sequence, determine its corresponding grid cell; Check if the grid cell has retained a sufficient number of higher importance points within a previously set distance: if yes, it indicates that the grid has met the preset limit for the number of key points, and the current point is discarded; if no, it indicates that the current grid still has room, and the current point is retained. Through the above point-by-point judgment and filtering, the final output is a sparse keypoint subset composed of the retained points, which can simultaneously maintain the spatial distribution structure and contextual semantic information of the original point cloud.

4. The point cloud 3D target detection method based on the local attention Mamba model according to claim 1, characterized in that, For each point, the neighboring point set is obtained through a hash table index, local attention weights are generated based on element-wise multiplication, and the neighborhood features are weighted and aggregated to obtain local geometric features, including: For each point in the keypoint cloud subset, its feature vector is used as the query vector. The neighborhood point set is determined by using a hash table index, and the feature vector of each neighborhood point is used as the corresponding key vector and value vector. The query vector and each key vector are multiplied element-wise using the multiplication aggregation operator and weighted to obtain the interaction score; the interaction scores of all neighborhood positions are normalized along the neighborhood dimension using the Softmax function to generate local attention weights. The attention weights and value vectors are input into another multiplication aggregation operator for feature modulation to obtain a weighted term. Based on the weighted terms, the weighted terms of all neighborhoods are summed to obtain the output features of the point in the key point cloud subset, which are used as its local geometric features.

5. The point cloud 3D target detection method based on the local attention Mamba model according to claim 1, characterized in that modeling the global long-range dependencies of the local geometric features using the bidirectional Mamba module and outputting global features includes: describing the dynamic propagation of the local geometric features, taken as the input signal, by a continuous state evolution equation, and discretizing this equation into a recursive form by the zero-order hold method to serve as the Mamba module, wherein the state transition matrix A in this module is a fixed learnable parameter, and the time step and projection matrix are dynamically generated based on the local geometric features of the current input; sorting the features of all key points within a single batch by X-axis coordinate in ascending order and concatenating them into a single long sequence, which is used as the input of the Mamba module to perform sequence modeling along the X-axis direction; reordering the features of all key points in the same batch according to the Y-axis coordinate and inputting them into the Mamba module to capture contextual information in the orthogonal direction; and restoring the outputs from both directions to the original point order and fusing them to output global features with global long-range dependencies.

6. The point cloud 3D target detection method based on the local attention Mamba model according to claim 1, characterized in that inputting the global features output from the last level into the bird's-eye view backbone network to obtain the 3D target detection results includes: projecting the global features onto a bird's-eye view plane and generating a 2D feature map by max pooling along the z-axis; integrating the 2D feature map into multiple feature maps of different scales through deconvolution and skip connections; inputting the feature maps at each scale into the corresponding prediction branch of the detection head, each branch independently regressing the parameters of the 3D detection box at its scale, including center coordinate offset, size, orientation angle, and class probability; and performing cross-scale non-maximum suppression on the candidate detection boxes output at all scales, merging redundant detection results based on confidence scores, and selecting the optimal single detection box as the final pose output.

7. The point cloud 3D target detection method based on the local attention Mamba model according to claim 6, characterized in that, During training, differentiated loss functions are used for supervised learning of different output branches of the detection head, and combined with mixed precision training, warm-up-decay learning rate scheduling, and combined data augmentation strategies, the end-to-end training of the model is completed.

8. A point cloud 3D target detection system based on the local attention Mamba model, characterized in that it comprises: a sampling module, which obtains a key point cloud subset from the point cloud; a local attention Mamba module, which inputs the key point cloud subset into a local attention Mamba model consisting of N stacked levels, each level including, in sequence: a local feature extraction submodule, which first obtains the set of neighboring points for each point through a hash table index, generates local attention weights based on element-wise multiplication interaction, and aggregates the neighborhood features in a weighted manner to obtain local geometric features; and a global feature extraction submodule, which then uses the bidirectional Mamba module to model the global long-range dependencies of the local geometric features and outputs the global features as the input of the next level; and a detection result generation module, which inputs the global features output from the last level into the bird's-eye view backbone network to obtain the 3D target detection results.

9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it can be used to execute the method described in any one of claims 1-7, or to run the system described in claim 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the program can be used to perform the method described in any one of claims 1-7, or to run the system described in claim 8.