3D point cloud feature extraction method and system based on semantic injection state space model
By injecting semantic information into the selective state-space model, low-complexity and efficient processing of 3D point cloud data is achieved, which solves the shortcomings of existing models in capturing point cloud structural dependencies and improves the model's global understanding ability and processing speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NINGBO INST OF TECH ZHEJIANG UNIV ZHEJIANG
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-19
AI Technical Summary
Existing selective state-space models cannot effectively capture the dependencies between points and their future spatial neighbors when processing 3D point cloud data, resulting in biased and incomplete understanding of the point cloud structure, as well as high computational complexity.
A semantically injected state space model-based approach is adopted. Through dual-stream collaborative evolution operations, dynamic semantic information is injected into the output projection calculation of the selective state space model, thereby achieving the fusion of non-causal global context and improving the model's global understanding ability while maintaining linear complexity.
While maintaining low computational complexity, the model can better understand the structure of 3D point clouds, improving the processing speed and accuracy of large-scale point cloud data. It is suitable for various computer vision tasks such as 3D object classification, component segmentation, and scene semantic segmentation.
Figure CN122244459A_ABST
Abstract
Description
Technical Field
[0001] This disclosure belongs to the field of computer vision and 3D graphics data processing technology, and relates to the field of machine learning, particularly to a selective state-space model architecture method and system applicable to feature extraction of 3D point cloud data.
Background Technology
[0002] Feature engineering refers to the process of extracting, selecting, transforming, and constructing features from raw data using technical means. Its core purpose is to transform raw data into a form that machine learning models can understand and effectively utilize, thereby improving model performance, enhancing model interpretability, or reducing computational complexity.
[0003] 3D point clouds are a primary visual data representation of the 3D physical world, widely used in key fields such as autonomous driving, robotics, and augmented reality. Efficient and accurate analysis of these unstructured sparse point sets is crucial for intelligent agent perception and interaction. Feature extraction is a key technical task in computer vision and 3D image processing. In feature engineering of 3D point cloud data, the 3D point cloud is typically serialized (one-dimensionalized) into an ordered sequence of points. Depending on the serialization method, the order of the point sequence may itself carry some point cloud information. Feature extractors extract feature information from these point sequences based on different architectures. Currently, Transformer-based architectures have achieved leading performance in point cloud analysis due to their global context modeling capabilities, but the computational complexity of their self-attention mechanism is quadratic in the data size, O(N²), which creates a scalability bottleneck when dealing with large-scale scenes.
[0004] Modern architectures using discretized selective state-space models (SSMs), such as Mamba, have emerged as a promising alternative due to their ability to capture certain long-range dependencies while reducing computational complexity to linear (O(N)). However, Mamba and similar models are inherently designed for causal sequences (such as text and time series), where state evolution is strictly dependent on past elements in the sequence. In contrast, 3D point clouds are unordered collections in space, encoding information in non-causal spatial relationships. Even when point clouds are ordered into a one-dimensional sequence using serialization methods such as space-filling curves (e.g., Hilbert curves), a purely causal model cannot capture the crucial dependencies between a point and its spatial neighbors in the "future" positions of the sequence, leading to biased and incomplete understandings of point cloud structure.
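The causality constraint described above can be checked numerically. The following minimal sketch uses a scalar linear recurrence as a stand-in for a selective SSM; the function name `causal_scan` and the fixed coefficients `a`, `b`, `c` are illustrative assumptions, not part of Mamba or of this disclosure. It shows that changing a later element of the sequence leaves every earlier output untouched:

```python
import numpy as np

def causal_scan(x, a=0.9, b=1.0, c=1.0):
    """Scalar linear recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # state depends only on past and current inputs
        y[t] = c * h
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
x_perturbed = x.copy()
x_perturbed[3] = 100.0        # change only the last ("future") element

y = causal_scan(x)
y_perturbed = causal_scan(x_perturbed)
# Outputs at positions 0..2 are identical in both runs; only position 3 differs.
```

If a point's spatial neighbor happens to be serialized after it, a purely causal pass therefore gives that point no access to the neighbor.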
[0005] Existing methods for adapting Mamba to point clouds typically employ a bidirectional strategy in which independent forward and backward Mamba blocks process the sequence, followed by shallow fusion. In shallow fusion, the state evolution within each processing pass is isolated, and interaction occurs only in the final step. This fails to achieve deep, dynamic integration of global information at each stage of state expansion, thus limiting the model's ability to represent complex geometric structures.
Summary of the Invention
[0006] The purpose of this disclosure is to provide technical solutions that allow selective state space models used in feature engineering to attain a degree of structured, deep understanding of computer vision data such as 3D point clouds while maintaining low computational complexity. These include a 3D point cloud feature extraction method, device, and system based on a semantically injected state space model. The technical solutions aim to inject a non-causal modeling mechanism into the selective state space model without sacrificing its linear complexity and efficiency, so as to achieve a good global understanding of computer vision data such as 3D point clouds while keeping computational requirements low.
[0007] This disclosure first provides a method for feature extraction of 3D point clouds based on a semantically injected state space model. These methods include at least the following steps S100 to S400.
[0008] In S100, 3D point cloud data is acquired and serialized into point sequences using one or more serialization methods. In S200, these point sequences are feature-initialized to obtain initial sequence feature representations and initial global memory representations. These two feature representations are used to construct parallel and interactive sequence feature streams and global memory streams in subsequent processing. The sequence feature streams are used for sequential dependency feature extraction, and the global memory streams are used to accumulate and transmit global semantic information. In S300, the initial sequence feature representations and initial global memory representations are input into a backbone network consisting of multiple sequentially connected processing blocks for iterative evolution; wherein, in at least one processing block of the backbone network, a two-stream cooperative evolution operation is performed, including steps S310 to S340:
[0009] S310, receive the sequence feature representation and global memory representation input to the processing block, and use them as the current sequence feature representation and current global memory representation;
[0010] S320, generate dynamic semantic information based on the current global memory representation;
[0011] S330, using the dynamic semantic information, modulate the linear output computation of the selective state-space model that processes the current sequence feature representation by causal recursion, generating semantically modulated features; the modulation operation preserves the inherent linear computational complexity of the selective state-space model and allows the semantically modulated features to incorporate the non-causal global context carried by the dynamic semantic information; and,
[0012] S340, based on the semantically modulated features, obtain the updated global memory representation and the updated sequence feature representation output by the block after evolution.
[0013] In S400, deep feature representations characterizing the semantics of 3D point clouds are generated based on the sequence feature representations and / or global memory representations of the final output of the backbone network.
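The control flow of steps S300 to S400 can be sketched as two tensors threaded through a stack of blocks. This is only an illustration of the data flow: `toy_block`, `run_backbone`, and `readout` are hypothetical names, the arithmetic inside `toy_block` is a placeholder rather than the selective state-space computation, and giving both streams the same shape is an assumption.

```python
import numpy as np

def toy_block(seq, mem):
    """One dual-stream step with placeholder arithmetic (NOT the disclosed operators)."""
    semantic = mem.mean(axis=0, keepdims=True)  # S320: summary derived from the memory stream
    modulated = seq + semantic                  # S330 stand-in: inject the summary into the sequence features
    new_mem = 0.5 * mem + 0.5 * modulated       # S340: update the global memory stream
    new_seq = modulated                         # S340: update the sequence feature stream
    return new_seq, new_mem

def run_backbone(seq0, mem0, blocks):
    """S300: pass both streams through sequentially connected processing blocks."""
    seq, mem = seq0, mem0
    for block in blocks:                        # each block receives both streams (S310)
        seq, mem = block(seq, mem)
    return seq, mem

def readout(seq, mem):
    """S400 stand-in: pool both streams and concatenate them into one descriptor."""
    return np.concatenate([seq.mean(axis=0), mem.mean(axis=0)])

rng = np.random.default_rng(0)
seq0 = rng.normal(size=(16, 8))   # 16 serialized points, 8 feature channels
mem0 = np.zeros_like(seq0)        # initial global memory (shape assumed equal to seq0)
seq, mem = run_backbone(seq0, mem0, [toy_block] * 4)
feature = readout(seq, mem)
```

Because every block both reads from and writes to the memory stream, information gathered over the whole sequence in one block is available to every position in the next.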
[0014] One improvement to the above method is that the modulation operation in step S330 is achieved by integrating dynamic semantic information into the output projection calculation of the selective state-space model in a linear combination manner. In some further improvements, the linear combination is an additive operation, where dynamic semantic information is combined as an additive term with the projection parameters in the output projection calculation.
[0015] One improvement to the above method is that each processing block in the backbone network is configured to perform a two-stream cooperative evolution operation. In several improved implementations, at an appropriate point after all parallel interactions in the backbone network have ended, the sequential feature stream and the global memory stream are merged to obtain a feature representation of the 3D point cloud.
[0016] One improvement to the above method is that the update of the current global memory representation in step S340 is achieved through a gated fusion step. This gated fusion step is configured to: adjust the contribution of semantically modulated features to the update of the current global memory representation based on its state. In some further improvements, the gated fusion step achieves the update by subjecting the current global memory representation to a gating function before combining it with the semantically modulated features.
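One possible reading of the gated fusion step is sketched below. The gate is computed from the current global memory only and then decides how much of the semantically modulated features is written into the memory. The convex-combination form, the sigmoid gate, and the parameter names `W_g` and `b_g` are assumptions made for illustration; the text above does not fix them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_memory_update(mem, modulated, W_g, b_g):
    """
    mem:       (N, d) current global memory representation
    modulated: (N, d) semantically modulated features
    W_g, b_g:  (d, d) and (d,) gate parameters (hypothetical names)
    """
    gate = sigmoid(mem @ W_g + b_g)               # gate depends only on the memory state
    return (1.0 - gate) * mem + gate * modulated  # gate scales how much new content is written

rng = np.random.default_rng(0)
N, d = 6, 4
mem = rng.normal(size=(N, d))
modulated = rng.normal(size=(N, d))
W_g = rng.normal(size=(d, d)) * 0.1

# A strongly negative bias closes the gate (memory is kept);
# a strongly positive bias opens it (memory is overwritten).
closed = gated_memory_update(mem, modulated, W_g, b_g=np.full(d, -50.0))
opened = gated_memory_update(mem, modulated, W_g, b_g=np.full(d, 50.0))
```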
[0017] In some specific implementations, injecting dynamic semantic information into the computation process of the selective state-space model is achieved by modifying the output equation of the selective state-space model. The modified discrete-time output equation is as follows:
$$y_t = (C_t + V_t)\,h_t + D\,x_t$$
where $y_t$ is the output feature of the selective state-space model at time step $t$; $C_t$ is the parameter vector at time step $t$ of the learnable or input-dependent original output projection matrix $C$; $V_t$ is the parameter vector at time step $t$ of the dynamic semantic gate matrix $V$, which represents the dynamic semantic information; the matrix $V$ is generated from semantic information extracted from the global memory representation, has the same dimensions as the matrix $C$, and is used to inject a non-causal global context into the output projection; $h_t$ is the hidden state vector computed by the recursive equation of the model; $D$ is the learnable or input-dependent feedthrough (through-term) coefficient; and $x_t$ is the input feature of the current sequence feature representation at the corresponding time step.
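A minimal single-channel sketch of a scan with an additive semantic term in the output projection follows. It assumes the usual Mamba-style notation: hidden state `h`, per-step discretized parameters `A_bar` and `B_bar` with a diagonal transition, original output projection `C`, dynamic semantic gate `V` of the same shape as `C`, and feedthrough coefficient `D`. The recurrence shown is the standard discretized form and is assumed here, since the text above only specifies the output equation; the function name is hypothetical.

```python
import numpy as np

def injected_ssm_scan(x, A_bar, B_bar, C, V, D):
    """
    Single-channel selective scan with semantic injection in the output projection.
    x:     (N,)    input feature per step
    A_bar: (N, S)  discretized diagonal state transition per step
    B_bar: (N, S)  discretized input projection per step
    C:     (N, S)  original output projection per step
    V:     (N, S)  dynamic semantic gate per step (same shape as C)
    D:     float   feedthrough coefficient
    """
    N, S = C.shape
    h = np.zeros(S)
    y = np.empty(N)
    for t in range(N):                       # one pass over the sequence: O(N * S)
        h = A_bar[t] * h + B_bar[t] * x[t]   # causal recurrence, left unchanged
        y[t] = (C[t] + V[t]) @ h + D * x[t]  # injection happens only in the output projection
    return y

rng = np.random.default_rng(0)
N, S = 8, 4
x = rng.normal(size=N)
A_bar = rng.uniform(0.5, 0.99, size=(N, S))
B_bar = rng.normal(size=(N, S))
C = rng.normal(size=(N, S))
V = rng.normal(size=(N, S))

y_plain = injected_ssm_scan(x, A_bar, B_bar, C, np.zeros_like(V), D=0.5)  # V = 0: ordinary scan
y_injected = injected_ssm_scan(x, A_bar, B_bar, C, V, D=0.5)
```

Setting `V` to zero recovers the ordinary output equation, and the extra work per step is one vector addition, so the cost stays linear in the sequence length. The non-causal context enters through `V`, which is derived from a memory that has already summarized the whole sequence.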
[0018] In some specific implementations, the dynamic semantic gate matrix V is generated as follows: the current global memory representation is pooled and transformed to obtain a semantic pool containing k prototypes; simultaneously, routing weights are calculated based on the current sequence feature representation; the dynamic semantic gate matrix V is then obtained as a linear combination of the semantic pool weighted by the routing weights.
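The pool-and-route construction of the dynamic semantic gate matrix can be sketched as follows. Segment-wise average pooling, the softmax router, and the parameter names `W_p` and `W_r` are assumptions chosen for illustration; the text above only states that the memory is pooled and transformed into prototypes and that routing weights computed from the sequence features combine them linearly.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_gate(mem, seq, k, W_p, W_r):
    """
    mem: (N, d) current global memory representation
    seq: (N, d) current sequence feature representation
    W_p: (d, S) projects pooled memory to the SSM state size (hypothetical)
    W_r: (d, k) router producing one logit per prototype (hypothetical)
    Returns V (N, S), routing weights (N, k), and the prototype pool (k, S).
    """
    chunks = np.array_split(mem, k, axis=0)              # split the memory into k segments
    pooled = np.stack([c.mean(axis=0) for c in chunks])  # (k, d) average of each segment
    pool = pooled @ W_p                                  # (k, S) semantic prototypes
    routing = softmax(seq @ W_r, axis=-1)                # (N, k), each row sums to 1
    V = routing @ pool                                   # per-position linear combination of prototypes
    return V, routing, pool

rng = np.random.default_rng(0)
N, d, S, k = 12, 8, 4, 3
mem = rng.normal(size=(N, d))
seq = rng.normal(size=(N, d))
V, routing, pool = semantic_gate(mem, seq, k, rng.normal(size=(d, S)), rng.normal(size=(d, k)))
```

With a fixed number of prototypes k, the routing and combination cost O(N·k·S), which is linear in the number of points.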
[0019] Further improvements to the above method include: the serialization operation in step S100 is performed according to a dynamic strategy during the model training phase, which allows the rules used for serialization to be variable during training; or, step S320 includes: constructing a semantic information base based on the current global memory representation; determining the association between each point in the sequence and the semantic information base based on the current sequence feature representation; generating corresponding dynamic semantic information for each position in the point sequence according to the association; or, step S400 includes: obtaining the updated sequence feature representations output by multiple processing blocks in the backbone network; and performing cross-layer fusion on multiple updated sequence feature representations to obtain a deep feature representation.
[0020] Another aspect of several technical solutions provides a 3D point cloud processing device, including:
[0021] Data interface, used to acquire 3D point cloud data;
[0022] The processing circuit, connected to the data interface, includes:
[0023] The sequence initialization unit is used to serialize 3D point cloud data and convert it into an initial sequence feature representation and an initial global memory representation;
[0024] A two-stream co-evolutionary unit, used to execute or containing a neural network consisting of multiple sequentially connected processing blocks, the neural network being used to iteratively process initial sequence feature representations and initial global memory representations; and,
[0025] The feature output interface is used to output the deep feature representation of the semantics of the 3D point cloud obtained by the iterative processing evolution of the sequence feature representation and / or global memory representation based on the dual-stream cooperative evolution unit.
[0026] The neural network includes at least one semantic relay processing block, which includes:
[0027] Selective state-space model computation unit;
[0028] The semantic information generation unit is configured to generate dynamic semantic information based on the global memory representation of the input block;
[0029] The semantic injection unit, connected to the semantic information generation unit and the selective state space model computation unit, is configured to inject dynamic semantic information into the selective state space model computation unit's processing of the sequence feature representation of the input block; and,
[0030] The dual-stream update evolution unit, connected to the selective state-space model computation unit, is configured to: update the global memory representation of the input block based on the features output by the selective state-space model computation unit, to obtain the global memory representation output after the block's evolution; and obtain the sequence feature representation output after the block's evolution based on the features output by the selective state-space model computation unit.
[0031] In some specific implementations, the semantic injection unit is configured to generate modulated output projection parameters by performing an addition operation on the dynamic semantic information as an additive term and the output projection parameters inherent in the selective state space model computation unit.
[0032] In some specific implementations, the semantic information generation unit is configured to perform the following operations to generate dynamic semantic information: pooling and linearly transforming the global memory representation of the input block to obtain a semantic prototype pool containing k semantic prototypes; calculating the routing weights of the k semantic prototypes based on the sequence feature representation of the input block; and weighting and aggregating the prototypes in the semantic prototype pool according to the routing weights to output dynamic semantic information.
[0033] Another aspect of the technical solutions also provides a 3D point cloud processing system, including: a processor; and a memory storing computer program instructions thereon; the computer program instructions, when executed by the processor, cause the system to implement the methods described in any of the above aspects. Additionally, a computer-readable storage medium storing computer programs thereon, which, when executed by the processor, implement the methods described in any of the above aspects.
[0034] This disclosure proposes at least one neural network architecture based on a semantically injected state-space model for feature extraction from point cloud data, along with its specific technical implementation. These solutions, through a two-stream architecture and semantic injection mechanism, force the model to capture non-causal global information while maintaining linear complexity. Due to the low-cost operations of global semantic routing and pooling, these solutions, while maintaining O(N) linear complexity, endow the feature extraction neural network model with global awareness capabilities similar to Transformers, resolving the contradiction between "fast but inaccurate (traditional SSM)" and "accurate but slow (Transformer)". On challenging data types such as real-world scanned data and large-scale indoor scenes, the accuracy surpasses existing structures like PointMamba and Point Transformer V3. Since the architectures in this disclosure are not limited to specific tasks, they can serve as a general-purpose backbone, flexibly applied to various tasks such as 3D object classification, part segmentation, and scene semantic segmentation, demonstrating strong generalization capabilities. Since quadratic complexity calculations are eliminated in various implementations, 3D point cloud feature extraction projects using the techniques disclosed herein have lower memory usage and faster inference speeds when processing large-scale point clouds (such as VR / AR environment modeling).
Attached Figure Description
[0035] For those skilled in the art, the clear description of various embodiments in this disclosure is sufficient for them to understand the scope of the technical solutions claimed in this disclosure. The accompanying drawings described below are merely some exemplary specific embodiments and technical aspects of this disclosure. Other drawings can be obtained based on these drawings without any creative effort. The following is a brief introduction to the drawings used in the description of the specific embodiments. Obviously, since each drawing only describes one technical aspect, when its description is used in conjunction with other drawings or technical aspects to explain multiple aspects of the implementation in different specific embodiments, the content shown in the drawings has a distinguishing scope of reference when understood in context.
[0036] Figure 1 is a flowchart illustrating an embodiment of a method disclosed herein;
[0037] Figure 2 is a flowchart illustrating another embodiment of the method disclosed herein;
[0038] Figure 3 is a schematic diagram of the structure of an embodiment of the device disclosed herein;
[0039] Figure 4 is a schematic diagram of the structure of a dual-stream cooperative evolution unit in one embodiment of the device disclosed herein;
[0040] Figure 5 is a schematic diagram of the structure of the dual-stream cooperative evolution unit in another embodiment of the present disclosure;
[0041] Figure 6 is a schematic diagram of the structure of a semantic relay processing block in an embodiment of the apparatus disclosed herein;
[0042] Figure 7 is a schematic diagram of the structure of a 3D point cloud processing system in one embodiment of the present disclosure;
[0043] Figure 8 is a schematic diagram of the data flow of a 3D point cloud feature extraction system deployed on a cloud server in one embodiment of the present disclosure;
[0044] Figure 9 is a schematic diagram of the dual-stream collaborative evolution operation performed in a dual-stream collaborative evolution processing block, in one system embodiment of the present disclosure, in which a sequence feature representation processing stream and a global memory representation processing stream are executed;
[0045] Figure 10 is a schematic diagram illustrating, for one embodiment of the dual-stream collaborative evolution processing block of Figure 9, the operation of injecting dynamic semantic information into the computation process of the selective state-space model that processes the current sequence feature representation;
[0046] Figure 11 is pseudocode describing an embodiment of the dual-stream collaborative evolution processing block of Figure 9;
[0047] Figure 12 is pseudocode describing, for one embodiment, the operation of Figure 10 of injecting dynamic semantic information into the computation process of the selective state-space model;
[0048] Figure 13 is a schematic diagram of the 3D point cloud processing system structure in another system embodiment of this disclosure;
[0049] Figure 14 is a performance comparison table for one embodiment of the method disclosed herein on the ModelNet40 classification benchmark;
[0050] Figure 15 is a performance comparison table for one embodiment of the method disclosed herein on the ScanObjectNN (PB_T50_RS) classification benchmark;
[0051] Figure 16 is a comparison table of component segmentation results for one embodiment of the method disclosed herein on the ShapeNet dataset;
[0052] Figure 17 is a table showing the evaluation results on Area 5 of the S3DIS dataset for one embodiment of the method disclosed herein;
[0053] Figure 18 is a comparison table of FLOPs on ModelNet40 (N=1024) for an embodiment of the method disclosed herein;
[0054] Figure 19 is a table comparing ablation experiments with different design choices in point cloud classification for one embodiment of the method disclosed herein;
[0055] Figure 20 is a table comparing ablation experiments with different design choices in point cloud segmentation for one embodiment of the method disclosed herein.
Detailed Implementation
[0056] First, it should be noted that phrases such as "in one embodiment" or "in an embodiment" in this specification do not necessarily refer to the same embodiment, but rather provide specific technical aspects for combining with particular embodiments, wherein specific features, structures, or characteristics can be combined in any suitable manner consistent with this disclosure. The terms "comprising" and "including" are open-ended, as used in the claims, and do not exclude additional structures or steps. Consider the following cited claim: "A deep network comprising one or more processing blocks..." Such claims do not exclude the neural network from including additional structures or steps (e.g., backpropagation networks, parameter update algorithms, etc.). It should be further understood that when a feature is described as "comprising" or "including," it means that the corresponding feature can be implemented as the presented feature, information, data, step, operation, element, and / or component, but does not exclude implementation as other features, information, data, steps, operations, elements, components, and / or combinations thereof supported by the art. When describing multiple (two or more) items, if the relationship between the multiple items is not explicitly defined, the multiple items can refer to one, several, or all of the multiple items. For example, the description "parameter A includes A1, A2, A3" can be implemented as parameter A includes A1, A2, or A3, or it can be implemented as parameter A includes at least two of the three parameters A1, A2, and A3. Various actual processing or implementation steps of terminals, servers, or other devices can be described or stated as being "configured" to perform one or more tasks or task steps. 
In such a context, "configured" is used to imply the structure (e.g., circuitry) of the terminal / server / computer device, which includes structures (e.g., circuitry) that perform the one or more tasks and task steps during operation by its processing unit. Thus, a terminal / server / computer device can be said to be configured to perform the task or specific steps within the task even when the specified terminal / server / computer device is currently inoperable or not running (e.g., not connected). Terminal / server / computer devices described with the language “configured to” include hardware, such as circuits and memory storing program instructions executable to perform the operations. Furthermore, “configured to” can include general-purpose structures (e.g., general-purpose circuits) manipulated by software or firmware (e.g., a GPU or a general-purpose processor executing software) to operate in a manner capable of performing the one or more tasks at issue. “Configured to” can also include adapting manufacturing processes (e.g., semiconductor fabrication facilities) to manufacture devices (e.g., integrated circuits) suitable for implementing or performing one or more tasks. As used herein, indicative terms such as “first” and “second” act as labels for the nouns they precede and do not imply any type of ordering (e.g., spatial, temporal, logical, etc.). For example, a terminal / server configured to perform tasks may be described herein as executing a “first” task / step / algorithm and a “second” task / step / algorithm. The terms “first” and “second” do not necessarily imply that the first algorithm must be executed before the second algorithm. As used herein, the terms “based on,” “according to,” or “depending on” are used to describe one or more factors influencing a determination, and these terms do not exclude additional factors that may influence the determination. 
That is, the determination may be based solely on these factors or at least partially on these factors. Consider the phrase “determining A based on B,” in which B is a factor influencing the determination of A. Such a phrase does not exclude that the determination of A may also be based on C; in other instances, A may be determined based solely on B. When used in the claims, the term “or” is used as an inclusive or, not an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, and any combination thereof. Unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include plural forms. It should be understood that when one element is described as being "connected" or "coupled" to another element, the element can be directly connected or coupled to the other element, or the connection between the two elements can be established through an intermediate element. Specifically, for two or more elements specified as "connected sequentially," two adjacent elements can also be connected through an intermediate element. Those skilled in the art will understand that, unless otherwise specified, for ease of description, the examples of neural network sample data provided in this specification all use a single-sample form with the batch size dimension omitted.
[0057] In summary, this specification covers various aspects, including methods for improving the efficiency and performance of sequence feature extraction systems using semantically injected state-space models, and the apparatus and systems configured to implement these methods, enabling the computing devices in the system to handle computations of linear complexity while improving the system's global depth perception capability for sequence data such as 3D point clouds. It is easy to understand that existing artificial neural network techniques such as the Mamba model utilize selective state-space models to dynamically change model parameters and gating parameters with the input content, transforming the continuous linear time-invariant state-space model into a linear time-varying discretized model, thereby selectively memorizing, propagating, forgetting, and outputting information. However, because the state-space model is unidirectionally recursive, the output of each step in the discretized selective state-space model strictly depends on the input of that step and all historical states obtained recursively, constituting a unidirectional strict causal constraint. Therefore, its related neural network models cannot be directly used to process unordered three-dimensional spatial data.
[0058] In this specification, "point sequence" refers to the sequence obtained after point cloud data has undergone serialization. The points in the sequence have a specific one-dimensional order determined by the point cloud serialization method and the structure of the point cloud itself. It is an ordered sequence of points that does not depend on external device operations such as the acquisition strategy, transmission strategy, or storage strategy. A "processing block" is a basic, modular, and repeatedly stackable composite computational unit that constitutes a deep neural network. A processing block encapsulates a set of ordered, parameterized computational components, such as normalization layers, linear projection layers, core computational operators, activation functions, and residual connections. These computational components work together to complete a nonlinear transformation from input features to output features. Typical processing blocks in this field include the Transformer Block, Residual Block, and Mamba Block. In this specification, "backbone network" refers to the multi-layered core architecture of a feature extraction network, specifically designed to progressively extract and refine general, semantically rich deep feature representations from raw or preprocessed initial feature data. In some implementations, the backbone network serves as the central hub and foundation of the entire feature extraction process, and its output is a task-independent, general feature representation rather than a final task-specific result such as class labels or segmentation maps. In some system instances where the backbone network serves specific tasks, the deep feature representation output by the backbone network can serve as input to various downstream task heads such as classifiers, segmenters, and detectors. 
Large and complex neural network architectures employing multi-stream feature extraction (such as splitting based on position and color channels), multi-branch feature extraction, multimodal learning, or ensemble learning may also contain multiple independent backbone networks with identical or different functions. It is worth noting that the "backbone network" involved in each embodiment of this specification specifically refers to the multi-layered stacked neural network "composed of multiple sequentially connected processing blocks," which corresponds to the "backbone network" or "feature extractor" network known in the fields of computer vision and 3D perception. Examples include ResNet in image recognition, PointNet++ hierarchical encoders in point cloud processing, or Transformer encoder stacks in natural language processing. This hierarchical structure, composed of multiple sequentially connected processing blocks, allows feature data to be forward-propagated and iteratively transformed, enabling features to evolve from low-level geometric information to high-level semantic information. In the hierarchical backbone network implementations based on the semantic injection state space model in this disclosure, the depth (i.e., the number of processing blocks) and width (i.e., the feature dimension of each block) are configurable hyperparameters. Because of the multi-layered architecture of sequentially connected processing blocks, the L-th processing block of the backbone network in this specification is also called the L-th layer. In the backbone network, each "layer" refers to the position of a processing block, not necessarily to the execution step of a basic computational component such as a linear layer, convolutional layer, or normalization layer. As a preferred embodiment, the backbone network can contain 4, 8, or 12 sequentially connected processing blocks, forming a deep feature encoder structure. 
Those skilled in the art will understand that this backbone network can also serve as the encoder portion of a more complex layer-by-layer downsampling architecture, such as the multi-layer downsampling encoder in a typical U-Net-type encoder-decoder structure, as used in existing model architectures such as Point Cloud Mamba and Point Mamba. In some implementations, the input to a processing block performing the "two-stream collaborative evolution" in the backbone network is either the sequence feature representation and global memory representation produced by the initialization operation, or the sequence feature representation and global memory representation output after iterative processing by the preceding L-1 processing blocks, which may or may not perform two-stream collaboration. Its output is the sequence feature representation and global memory representation after L evolutions, including its own. For neural network systems that include task heads, in some implementations suitable for segmentation tasks, the neural network also includes an encoder-decoder structure corresponding to this disclosure, for example with downsampling connections between processing blocks forming successive downsampling stages, and layer-by-layer upsampling and fusion of the outputs; in some implementations suitable for classification tasks, the neural network may include only an encoder followed by global pooling and a classification head; in some preferred embodiments, the backbone network consists entirely of processing blocks that perform the dual-stream cooperative evolution operation as defined herein.
[0059] Several embodiments of the first aspect of this disclosure provide a 3D point cloud feature extraction method based on a semantically injected state space model. As shown in Figure 1, these methods, executed on a processor of a computing device, include at least steps S100 to S400.
[0060] In the serialization step S100, the processor acquires specific 3D point cloud data and serializes it into a point sequence. In several examples, the points are ordered according to inherent properties of the point cloud data. In examples that use space-filling curve sorting, such as Hilbert curves, Z-order curves, or Gray code curves, an encoded value is calculated for each point from its spatial coordinates and the points are sorted by that value. In some examples that sort by geometric properties of the point cloud, the sorting key can be the Euclidean distance from the point cloud centroid, the coordinate along the first principal component obtained by principal component analysis, or the distance to a specified bounding box plane. In some examples based on graph traversal, the sequence can be generated by depth-first or breadth-first search over the K-nearest neighbor graph of the points, or by sorting according to the iterated coordinates of the Fruchterman-Reingold force-directed layout algorithm. Other examples use random or mixed strategies, generating the sequence by random shuffling or by performing random walks on a graph structure constructed from the point cloud. In some training methods aimed at improving the robustness of the backbone network model, one of several space-filling curves is selected at random during training and used for sorting. In some implementations, the serialization step may include one or a combination of the above examples.
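The Z-order variant of the space-filling curve sorting described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the disclosure: the function names, the quantization to `bits` bits per axis, and the x-then-y-then-z bit interleaving order are all assumptions made for the example.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integers (Z-order / Morton code)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize_z_order(points, bits=10):
    """Return point indices sorted by the Morton code of their quantized coordinates."""
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    cells = (1 << bits) - 1  # largest integer coordinate on each axis

    def quantize(p):
        # Map each coordinate to an integer grid cell; a flat axis maps to cell 0.
        return tuple(
            int((p[a] - mins[a]) / (maxs[a] - mins[a]) * cells) if maxs[a] > mins[a] else 0
            for a in range(3)
        )

    codes = [morton_code(*quantize(p), bits=bits) for p in points]
    return sorted(range(len(points)), key=lambda i: codes[i])


points = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
order = serialize_z_order(points, bits=1)
point_sequence = [points[i] for i in order]
```

Swapping `morton_code` for a Hilbert or Gray code encoder would change only the sort key, which is why several curves can be alternated at random during training.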
[0061] In the feature initialization step S200, the processor initializes the point sequence to obtain its initial sequence feature representation and initial global memory representation. Each sequence feature representation, including the initial one, is a sequence of feature vectors with the same length as the point sequence and carries point-level features with sequential dependencies; its feature dimension, however, can be scaled by a linear mapping according to the requirements of the upstream and downstream processing blocks. Each global memory representation, including the initial one, is an updatable tensor that is independent of the sequence order and is used to accumulate and transmit global scene semantic information across the layers of the network. For computational convenience, in some implementations the sequence feature representation and the global memory representation of a given processing block have the same shape on the input side, and likewise on the output side. For example, in some applications involving downsampling, the feature dimension of both representations increases together with depth.
[0062] In one aspect, this disclosure provides some examples of feature initialization of the point sequence in S200 to obtain its initial sequence feature representation. One learnable example constructs a shared-weight multilayer perceptron (MLP) and applies it independently to each point of the serialized point cloud; this MLP takes the original features, such as the point's 3D coordinates, as input and outputs high-dimensional point features, and the set of features of all points constitutes the initial sequence feature representation. Another example uses a lightweight point network, namely a simplified PointNet architecture, as the initializer. This architecture applies a shared MLP followed by a symmetric aggregation function, such as max pooling, over local neighborhoods defined by K-nearest neighbors, generating for each point a feature that incorporates its local spatial context as the initial sequence feature representation. A low-computational-cost example maps the coordinates of each point directly to a high-dimensional feature space through a linear projection layer, such as a bias-free matrix multiplication or a fully connected layer with bias. In an example based on pre-trained feature extraction, the output of a lightweight point cloud feature extraction network pre-trained on a large point cloud dataset, such as a small PointNet++ encoder block, is used as the initial sequence feature representation to provide more discriminative starting features. In some GPU-based implementations, the pointwise computation of the aforementioned MLP or linear projection layer is mapped to a single CUDA kernel. Each GPU thread block processes one block of the point sequence, and the threads within the block execute the MLP or linear projection operations in parallel in single-instruction, multiple-thread (SIMT) fashion to fully utilize the GPU's computing power. 
At the same time, the feature vector of each point is stored directly in high-speed shared memory after computation for fast access in subsequent steps. In some vectorized implementations on a neural processing unit (NPU), the point sequence is organized on the NPU into a tensor of shape (N, C...), and the tensor computation units of the NPU convert the fully connected layer computation of this step into an efficient batched matrix multiplication that completes the feature transformation of all points at once, improving data throughput. In some examples using ASIC-customized data paths, a dedicated systolic array or multiply-accumulate data path is designed in the customized AI acceleration chip for the linear projection of points, so that the coordinate data flows through this fixed hardware path and the projection is completed in a single cycle, achieving low-latency, low-power initialization.
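The shared-weight MLP initializer described in this paragraph can be illustrated with a small plain-Python sketch. The layer sizes (3 to 8 to 16), the ReLU activation, and the random weight initialization are assumptions chosen for the example; the source does not fix them.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weights and zero bias for a fully connected layer (illustrative init)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    """Compute W x + b for a single feature vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def shared_mlp(points, layer1, layer2):
    """Apply the same two-layer MLP independently to every point of the sequence."""
    return [linear(relu(linear(p, layer1)), layer2) for p in points]


rng = random.Random(0)
layer1 = make_linear(3, 8, rng)    # 3D coordinates -> 8 hidden units (assumed sizes)
layer2 = make_linear(8, 16, rng)   # 8 hidden units -> 16-dimensional point feature
point_sequence = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
initial_seq_features = shared_mlp(point_sequence, layer1, layer2)
```

Because the same weights are applied to each point independently, the per-point computations have no mutual dependencies, which is the property that the GPU, NPU, and ASIC mappings above exploit.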
[0063] In one aspect, this disclosure provides some examples of feature initialization of the point sequence in S200 to obtain its initial global memory representation. In some implementations, the generated initial sequence feature representation is deep-copied, and the copy serves as the initial global memory representation. In some implementations that initialize through an independent transformation, the initial sequence feature representation is input into a separate small neural network module, such as a linear layer or a two-layer MLP, which maps it to another feature space, and its output serves as the initial global memory representation. This small module can either share parameters with the sequence feature initialization module or have independent parameters. In some examples that initialize from learnable parameters, so that the memory stream focuses on learning task-related general patterns, a set of learnable global memory vectors of shape (N,C) or (1,C) is defined and broadcast to all point positions at the start of training to form the initial global memory representation. In other implementations that initialize by feature aggregation, global average pooling or global max pooling is performed on the initial sequence feature representation to obtain a global vector of shape (1,C) summarizing the entire point cloud, and this vector is then broadcast to all points to form a uniform initial global memory representation. In some implementations that initialize through voxelization, the original point cloud is converted into a sparse voxel grid, several layers of 3D sparse convolution extract voxel-level features, and a back-projection method such as trilinear interpolation assigns the voxel features back to each point, explicitly introducing a regular spatial structure into the initial global memory representation. 
In some hardware-accelerated implementations, the broadcast operations involved in this step are realized on the GPU or NPU using hardware-supported broadcast addressing modes. For example, the global memory vector is stored in constant memory or L1 cache, so that the compute cores can read it repeatedly at essentially no additional cost, avoiding physical data copies and saving storage space and bandwidth in high-bandwidth memory. In some implementations that optimize on-chip memory pre-allocation and reuse at the chip design stage, a fixed-size on-chip static random access memory is reserved for the global memory representation. During initialization, the data is written directly to this dedicated storage area, whether it is produced by copying the initial sequence feature representation or by transforming it into a global memory representation, and this storage area is then reused as the physical storage of the global memory stream in all subsequent processing layers, avoiding the off-chip memory access overhead of inter-layer data transfer. In some ASIC implementations with a dedicated pipeline for aggregation-based initialization, a dedicated reduction pipeline computes the aggregate of the initial sequence feature representation. This pipeline works in parallel with the sequence feature generation pipeline: while point features are being generated, it completes the global sum or maximum through a tree structure and immediately triggers the broadcast logic, generating the initial global memory with almost no additional delay. In various aspects, the initialization steps of S200 are algorithmically flexible, and the corresponding hardware optimizations can be mapped efficiently onto modern AI computing hardware through parallel computing, dedicated data paths, intelligent memory layout, and other means.
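Two of the initialization options above, deep copy and aggregation followed by broadcast, can be sketched in a few lines. The function names and the nested-list layout of shape (N, C) are assumptions for illustration; note that this software sketch materializes the broadcast as copies, whereas the hardware variants above avoid physical copying.

```python
def init_memory_by_copy(seq_features):
    """Deep-copy the (N, C) sequence features to serve as the initial global memory."""
    return [list(row) for row in seq_features]


def init_memory_by_pooling(seq_features, mode="mean"):
    """Pool the (N, C) features to a (1, C) vector, then broadcast it to all N points."""
    n = len(seq_features)
    columns = list(zip(*seq_features))  # one tuple per channel
    if mode == "mean":
        pooled = [sum(col) / n for col in columns]
    elif mode == "max":
        pooled = [max(col) for col in columns]
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    # Broadcast: every point position receives the same global vector.
    return [list(pooled) for _ in range(n)]


seq_features = [[1.0, 4.0], [3.0, 0.0]]
memory_mean = init_memory_by_pooling(seq_features, mode="mean")
memory_max = init_memory_by_pooling(seq_features, mode="max")
memory_copy = init_memory_by_copy(seq_features)
```

The pooled variant makes the initial memory independent of the point order, while the copy variant keeps per-point detail at the cost of inheriting the serialization order.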
[0064] In the deep feature extraction step S300, the initial sequence feature representation and the initial global memory representation are iteratively processed in a backbone network composed of multiple sequentially connected processing blocks. These processing blocks include at least one processing block that performs the dual-stream cooperative evolution operation described in this disclosure. In some examples, as shown in Figure 2, all processing blocks in the L-layer architecture of the backbone network perform the dual-stream cooperative evolution operation described in this disclosure. In the implementations of the various aspects, among the multiple sequentially connected processing blocks that constitute all or part of the complete functional module of the backbone network, the output data of one processing block serves as the main or decisive input data of the next processing block, so that in this order the blocks form a unidirectional, sequential processing pipeline. For two adjacent processing blocks that satisfy this information transmission and dependency relationship, any conventional operations between them, such as data normalization, linear transformation, or nonlinear activation, are regarded as components on the connection path rather than as independent components of the backbone network that sever the sequential connection.
[0065] In the deep feature output step S400, a deep feature representation characterizing the semantics of the 3D point cloud is output. This deep feature representation is obtained based on the sequence feature representation and / or the global memory representation evolved by the backbone network. In various implementations, for a backbone network composed of L sequentially connected processing blocks, the initial sequence feature representation and the initial global memory representation are iterated through these L processing blocks in turn, yielding the evolved sequence feature representation and the evolved global memory representation of the backbone network. The deep feature representation of the semantics of the 3D point cloud obtained through the backbone network is based on these two evolved representations.
[0066] In one implementation of S400, an evolved feature representation is output directly as the deep feature representation characterizing the semantics of the 3D point cloud. These examples use one of the evolved feature representations directly, without complex fusion. In an example for point-level tasks, such as part segmentation and scene semantic segmentation, where a prediction must be generated for each point, the evolved sequence feature representation is used directly as the deep feature representation of each point and is input into a lightweight pointwise classification head, such as a shared MLP. In this example, each point feature has been injected with semantics layer by layer, fusing its local geometry with the associated global context, which enables accurate point-level prediction. In an example for sample-level tasks, such as object classification, global average pooling is performed on the evolved global memory representation, and the resulting pooled vector is used as the deep feature representation of the entire point cloud and input into the classifier. Because the global memory representation is the output of the evolution stream specifically designed to accumulate global semantics, its pooled vector is the most condensed and robust representation of the overall category of the sample.
[0067] In one aspect of the S400 implementation, the evolved feature representations are merged and output as the deep feature representation characterizing the semantics of the 3D point cloud. These examples fuse the features of the two streams before output, according to the optimization objective, in order to integrate information from different perspectives. In one example aimed at enhanced classification, the features of corresponding points in the evolved sequence feature representation and the evolved global memory representation are concatenated to form enhanced point features. Global max pooling is then applied to the concatenated features, and the result is output and fed to the classifier of the subsequent task head. Through the concatenation at the output stage of the backbone network, this example combines the detailed local geometry carried by the sequence feature representation with the stable global semantics carried by the global memory representation, providing richer cues for classification and helping to distinguish categories that are locally similar but globally different. In an example aimed at robust segmentation, global average pooling is performed on the evolved global memory representation, the resulting global average vector is broadcast and added to the feature of each point in the evolved sequence feature representation, and the sum is used as the deep feature representation of the backbone network and input into the segmentation head. This fusion is equivalent to uniformly injecting the strongest global prior into the features of every point, which improves the semantic consistency of predictions in occluded and noisy regions and reduces outlier errors.
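The two fusion variants in this paragraph can be written down compactly. The sketch below assumes both evolved representations are nested lists of shape (N, C); the function names are hypothetical and the task heads themselves are omitted.

```python
def concat_then_max_pool(seq_features, memory):
    """Classification variant: concatenate per point, then global max pooling."""
    fused = [x + m for x, m in zip(seq_features, memory)]  # (N, 2C) via list concat
    return [max(col) for col in zip(*fused)]               # (2C,) sample descriptor


def add_global_mean(seq_features, memory):
    """Segmentation variant: add the pooled memory vector to every point feature."""
    n = len(memory)
    global_mean = [sum(col) / n for col in zip(*memory)]   # (C,) global average
    return [[x + g for x, g in zip(row, global_mean)] for row in seq_features]


evolved_seq = [[1.0, 2.0], [3.0, 0.0]]
evolved_mem = [[0.0, 4.0], [2.0, 2.0]]
cls_feature = concat_then_max_pool(evolved_seq, evolved_mem)
seg_features = add_global_mean(evolved_seq, evolved_mem)
```

The first function returns one vector per sample, suited to a classifier, while the second keeps one vector per point, suited to a pointwise segmentation head.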
[0068] In one aspect of the S400 implementation, the evolved feature representation is merged with features extracted by other backbone networks or branches and output as the deep feature representation characterizing the semantics of the 3D point cloud. These examples fuse the feature representations evolved within the backbone network with feature representations from external sources or other branches in order to draw on richer information. In one example of merging with intermediate-layer features, sequence features are extracted from an intermediate layer of the deep backbone network and aligned in dimension with the final-layer representations through upsampling or interpolation; the features of corresponding points in the intermediate-layer sequence features, the evolved sequence feature representation, and the evolved global memory representation are then concatenated to form the final pointwise deep features. Because intermediate-layer features typically contain finer geometric details, this merging fuses multi-level features, balances detail and semantics, and is particularly beneficial for fine-grained boundary segmentation. In one example of merging with external pre-trained features, a pre-trained lightweight point cloud backbone network (such as a small version of PointNet++) processes the original point cloud and extracts a set of auxiliary point features. A weighted sum of the evolved features and the auxiliary point features is then computed, with the weights generated dynamically by a small gating network from the two sets of features, and the weighted sum is output as the deep feature representation. This gated fusion mechanism lets the model decide dynamically how strongly to rely on the internally evolved features or on the external auxiliary features. It can effectively introduce external knowledge (such as more robust local geometric descriptions) to strengthen or correct the features of the main path, thereby improving the model's generalization in cross-domain or few-sample scenarios.
[0069] In various aspects, in step S300, the processor, based on the configuration of at least one processing block of the backbone network, executes a dual-stream cooperative evolution operation comprising the following steps S310 to S340: a block feature input step S310, in which the sequence feature representation and global memory representation input to the processing block are used as the current sequence feature representation and current global memory representation of the block, respectively; a dynamic semantic generation step S320, in which dynamic semantic information is generated based on the current global memory representation; a semantic injection state space model processing step S330, in which the dynamic semantic information is injected into the computation of a selective state space model that processes the current sequence feature representation, so that the selective state space model outputs semantically modulated features that gain non-causal global context awareness while linear computational complexity is maintained; and a dual-stream update step S340, in which the current global memory representation is updated based on the semantically modulated features to obtain the global memory representation output by the block after evolution, and the sequence feature representation output by the block after evolution is obtained based on the semantically modulated features.
[0070] In the embodiments of the dual-stream cooperative evolution steps S310 to S340, "dual-stream" refers to processing streams that update feature representations and that include at least two parts: one for the sequence feature representation and one for the global memory representation. Each processing stream consists of several consecutive processing steps applied to a specific feature representation. Because each step of a processing stream operates on a specific sequence feature representation or global memory representation, in some examples the update or evolution of a feature representation within a processing stream is also regarded as an update or evolution of that stream. "Cooperative" means that, within a processing block that processes the dual streams, there are bidirectional data fusion steps between the current processing streams of the sequence feature representation and the global memory representation. "Evolution" means that these bidirectional data fusion steps aim to output new sequence feature representations and global memory representations carrying deeper feature information. Here, a sequence feature representation is a one-dimensional sequence of feature vectors in which the vector at each position carries the local geometric and semantic features of the corresponding point while the sequence order is maintained. In some examples, the length of the sequence feature representation remains equal to that of the point sequence output by S100 throughout the iterative processing of S200 and S300. In other examples, where clustered superpoints are generated during iteration, such as in deep feature extraction with multi-layer downsampling, the length of the sequence feature representation may decrease layer by layer, but the sequential order of the local geometric and semantic features of the corresponding points is preserved. 
The global memory representation refers to the storage unit used to aggregate, store, and transfer the global semantic context of the entire point cloud among multiple processing blocks. In some examples, the global memory representation corresponds to one or more iteratively updatable storage units, which can be a set of feature vectors of the same length as the sequence feature representation at the same layer. In other examples, the global memory representation is configured as a single global feature vector, referenced and updated sequentially by multiple processing blocks, but its shape remains unchanged.
[0071] In one aspect, several embodiments provide examples of the feature input of S310. As the most basic way of coupling the data streams, some examples pass the feature representations through directly: the sequence feature representation and global memory representation output by the previous layer are used as the current sequence feature representation and current global memory representation of this layer without modification. In some examples where processing blocks are coupled through normalization, layer normalization is applied separately to the input sequence feature representation and the input global memory representation before the core operations of this layer, and the normalized results are used as the current sequence feature representation and current global memory representation, in order to stabilize and optimize training. In some examples where processing blocks are coupled through linear projection, the input sequence feature representation and the input global memory representation are each transformed into a new feature space by an independent linear projection layer to obtain the current sequence feature representation and current global memory representation. This gives each layer the ability to adjust the feature distribution, as in the backbone networks of some multi-layer downsampling encoders. Because this technical solution processes the dual streams iteratively, some hardware optimization examples support memory pointer reuse: in the hardware implementation, when the current representations are identical to the input representations, no physical copy is typically made; instead, memory pointers or tensor views reference the storage addresses of the input representations, and new memory is allocated only when an update is needed, thereby saving memory bandwidth and storage overhead.
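The direct and normalization-coupled input modes can be sketched as below. The `mode` switch and function names are hypothetical, and the layer normalization shown has no learnable scale or shift, which is a simplification for the example.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def block_input(prev_seq, prev_mem, mode="direct"):
    """S310 sketch: derive the current representations from the previous layer's output."""
    if mode == "direct":
        # Pass the references through unchanged; nothing is copied.
        return prev_seq, prev_mem
    if mode == "layer_norm":
        return [layer_norm(r) for r in prev_seq], [layer_norm(r) for r in prev_mem]
    raise ValueError(f"unknown mode: {mode}")


prev_seq = [[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]]
prev_mem = [[2.0, 2.5, 3.0], [2.0, 2.5, 3.0]]
cur_seq, cur_mem = block_input(prev_seq, prev_mem, mode="direct")
norm_seq, norm_mem = block_input(prev_seq, prev_mem, mode="layer_norm")
```

In the direct mode the returned objects are the very same lists as the inputs, which loosely mirrors the pointer-reuse optimization described above: no storage is allocated until a later step produces new values.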
[0072] In one aspect, multiple embodiments provide examples of S320, in which information is extracted from the current global memory representation to generate the dynamic semantic information used to modulate sequence feature processing. In some examples based on prototype semantic pooling, global average pooling is performed on the current global memory representation to obtain a compact context vector, which a small MLP then maps to k learnable semantic prototype vectors that form a semantic pool; this semantic pool is the generated repository of dynamic semantic information. In some examples using attention-weighted semantic vectors, the current global memory representation serves as the keys and values and the current sequence feature representation serves as the query, and a cross-attention mechanism computes a sequence of context vectors with the same length as the sequence. This sequence is the dynamic semantic information, explicitly encoding the global memory content that each sequence position should attend to. In some examples using dynamically generated gating vectors, the current global memory representation is input into a lightweight network, such as a two-layer MLP, which directly outputs a dynamic gating vector or matrix with the same dimension as a parameter of the selective state space model, such as the output projection matrix C; this vector or matrix is then used as the dynamic semantic information to be injected. The generation mode varies across implementations: static generation, in which the semantic information is fixed at each layer and defined by learnable parameters; dynamic generation, in which the semantic information is generated in real time from the current global memory representation and / or the current sequence feature representation; or hierarchical generation, in which shallow layers generate semantic information focused on local geometry and deep layers generate semantic information focused on the overall structure.
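The attention-weighted variant of S320 can be illustrated with a stripped-down cross-attention. As a simplifying assumption, the sketch uses the raw features as queries, keys, and values, without the learned projection matrices a practical attention layer would normally include; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_semantics(seq_features, memory):
    """Queries: sequence features. Keys and values: rows of the global memory."""
    channels = len(memory[0])
    scale = math.sqrt(channels)
    context = []
    for query in seq_features:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
        weights = softmax(scores)
        context.append(
            [sum(w * value[c] for w, value in zip(weights, memory)) for c in range(channels)]
        )
    return context  # one context vector per sequence position


seq_features = [[0.3, -0.1], [5.0, 5.0], [0.0, 0.0]]
memory = [[1.0, 2.0], [3.0, 0.0]]
dynamic_semantics = attention_semantics(seq_features, memory)
```

Every position attends over all memory rows regardless of where it sits in the sequence, so the resulting context vectors are not restricted to earlier points. Note that this full attention costs on the order of N times the number of memory rows; the pooled-prototype and gating-vector variants described above are cheaper alternatives.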
[0073] In one aspect, several embodiments provide examples of how, in the semantic injection state space model processing of S330, dynamic semantic information is injected into the computation of the selective state space model. In some examples using additive injection into the output projection, the generated dynamic semantic gate matrix V is added to the original output projection matrix C of the selective state space model to form the modulated projection matrix (C+V), which is used in the discretized point-by-point computation of the output feature vector. This achieves non-causal context injection without altering the linearity of the recursive computation. In some examples using multiplicative gated state injection, the generated dynamic semantic information is used as a gating vector g that is multiplied element-wise with the hidden state of the selective state space model, and the result is then output through the original projection matrix C; here g directly modulates the strength of the historical memory. In some examples using concatenation-based injection at the input or output, the dynamic semantic information is concatenated at the feature level with the input or the hidden state of the selective state space model, and the result is fused and projected through a learnable linear layer to provide more flexible fusion. In some examples that inject semantics by modulating the time parameter of the discretized selective state space model, the dynamic semantic information is transformed into a scalar that modulates the discretization time step parameter Δ, thereby changing the scale at which the model depends on the input sequence and dynamically adjusting the receptive field. In some hardware implementations of semantic injection, a fused computation kernel is designed in custom hardware to perform the matrix addition or the multiplicative gating operation together with the subsequent matrix-vector multiplication. This fused design allows the dynamic semantic information V or g to be fed directly into a single computing core through a dedicated path, so that injection and computation are completed within a single cycle, eliminating intermediate buffers and reducing latency.
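To make the additive and multiplicative injection schemes concrete, the sketch below runs a one-channel selective scan with a diagonal state. It is an illustration under stated assumptions rather than the disclosed implementation: the softplus step size, the exponential discretization, the scalar input channel, and the helper `global_offset` (a stand-in for S320 that simply uses the sequence mean) are all hypothetical choices.

```python
import math


def selective_scan(xs, A, B, C, V=None, gate=None):
    """One-channel diagonal selective SSM with optional semantic injection.

    V:    additive offset on the output projection, y_t = (C + V) . h_t
    gate: multiplicative gate on the hidden state, y_t = C . (gate * h_t)
    """
    size = len(A)
    V = V if V is not None else [0.0] * size
    gate = gate if gate is not None else [1.0] * size
    h = [0.0] * size
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(x))        # input-dependent step size (softplus)
        for i in range(size):
            decay = math.exp(-delta * A[i])    # discretized state transition
            h[i] = decay * h[i] + delta * B[i] * x
        # Injection touches only the read-out; the recurrence above is unchanged.
        ys.append(sum((C[i] + V[i]) * gate[i] * h[i] for i in range(size)))
    return ys


def global_offset(xs, size):
    """Hypothetical stand-in for S320: an offset computed from the whole sequence."""
    mean = sum(xs) / len(xs)
    return [mean] * size


A, B, C = [1.0, 0.5], [1.0, 0.5], [0.3, -0.2]
xs_a = [1.0, 0.5, -0.2]
xs_b = [1.0, 0.5, 3.0]   # differs from xs_a only at the last position
plain_a = selective_scan(xs_a, A, B, C)
plain_b = selective_scan(xs_b, A, B, C)
injected_a = selective_scan(xs_a, A, B, C, V=global_offset(xs_a, 2))
injected_b = selective_scan(xs_b, A, B, C, V=global_offset(xs_b, 2))
```

Without injection, the first two outputs are identical for both sequences because the scan is causal. With the offset V computed from the whole sequence, even the first output changes when only the last point changes, while the scan itself remains a single linear-time pass.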
[0074] In one aspect, multiple embodiments provide examples in which the dual-stream update step S340 uses the semantically modulated features to update the two streams. In some examples using a residual update of the sequence feature representation and a gated update of the global memory representation, a basic and efficient strategy is adopted: the sequence feature representation is updated by adding the semantically modulated features to the current sequence feature representation, integrating the new features through a residual connection, while the global memory representation is updated by combining the current global memory representation with the semantically modulated features under a learnable gating function, for example one built from a sigmoid function and linear units, that controls how much of the global memory is retained and how much is written. In some examples that update the global memory representation through cross-attention, the global memory stream is updated with finer-grained, content-aware updates: for example, the current global memory representation is used as the query and the semantically modulated features as the keys and values, the attention mechanism aggregates the relevant information, and the output global memory representation is obtained after a residual connection and normalization. In some examples that update both streams using a projection-and-concatenation scheme, the semantically modulated features are projected into two separate features, one per stream; the output sequence feature representation is obtained from the current sequence feature representation and the first projected feature through a residual connection, and the output global memory representation is obtained by concatenating the current global memory representation with the second projected feature along the feature dimension and compressing the result back to the original dimension with a linear layer, which gives the two updates greater independence. In some hardware implementations optimized for the output of dual-stream cooperative evolution, a dual-port memory and an in-place update strategy are employed. Specifically, dual-port on-chip memory is configured for the global memory representation M, so that within a single clock cycle the current M can be read through one port for use by S320 and S330 while the newly computed M is written back to the same memory block through the other port, achieving efficient pipelining; for the processing stream of the sequence feature representation, pre-allocation and in-place updates can be adopted, alternating reads and writes within a fixed memory area to reduce dynamic allocation.
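The residual and gated updates can be sketched as follows. The exact gating formula is not recoverable from the text, so the convex combination used here (keep a fraction g of the old memory, write a fraction 1 - g of the new features, with g from a sigmoid over a linear map of both) is an assumed form, and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def residual_update(cur_seq, modulated):
    """Sequence stream: add the semantically modulated features to the current features."""
    return [[x + y for x, y in zip(xr, yr)] for xr, yr in zip(cur_seq, modulated)]


def gated_memory_update(cur_mem, modulated, gate_w, gate_b):
    """Memory stream (assumed form): M_new = g * M + (1 - g) * Y per point."""
    updated = []
    for m, y in zip(cur_mem, modulated):
        # One scalar gate per point, computed from the concatenation [M, Y].
        g = sigmoid(sum(w * v for w, v in zip(gate_w, m + y)) + gate_b)
        updated.append([g * mi + (1.0 - g) * yi for mi, yi in zip(m, y)])
    return updated


cur_seq = [[1.0, 2.0]]
cur_mem = [[4.0, 0.0]]
modulated = [[0.5, -1.0]]
new_seq = residual_update(cur_seq, modulated)
new_mem = gated_memory_update(cur_mem, modulated, gate_w=[0.0] * 4, gate_b=0.0)
kept_mem = gated_memory_update(cur_mem, modulated, gate_w=[0.0] * 4, gate_b=50.0)
```

With zero gate weights and zero bias the gate is 0.5, so the memory moves halfway toward the new features; a large positive bias drives the gate toward 1 and the memory is retained almost unchanged.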
[0075] Several embodiments of the second aspect of this disclosure provide a 3D point cloud processing apparatus for at least extracting deep feature representations of 3D point cloud data, comprising a data interface and processing circuitry. These embodiments at least show how the processing circuitry processes 3D point cloud data obtained from the data interface to obtain its deep feature representation, thereby disclosing the structure and function of the processing circuitry, which includes a sequence initialization unit, a dual-stream cooperative evolution unit, and a feature output interface.
[0076] A specific preferred embodiment provides an integrated LiDAR point cloud processing device 10 for autonomous driving. This device 10 is integrated as a standalone hardware module within the perception system of the autonomous vehicle. As shown in Figure 3, data interface 11 is a physical port conforming to the automotive Ethernet standard and receives in real time the raw 3D point cloud data stream generated by the vehicle-mounted LiDAR. Processing circuit 20 is a dedicated artificial intelligence processing chip or a bus-connected chipset, inside which the functional units and interfaces are implemented through fixed hardware logic and programmable computing units. Sequence initialization unit 21 is implemented by a dedicated preprocessing engine within the chip / chipset. This engine is configured to first use the Hilbert space-filling curve algorithm to map the input unordered point cloud into an ordered point sequence, and then, through a built-in linear projection calculation unit, convert the coordinates of each point into a high-dimensional feature vector, thereby generating the initial sequence feature representation; the initial global memory representation is generated by performing global average pooling on this sequence. The dual-stream cooperative evolution unit 22 consists of a series of configurable neural network processing cores within the chip / chipset that are connected sequentially, and each processing core can be configured as a semantic relay processing block 31. In some examples with lower requirements and a primary focus on computational efficiency, as shown in Figure 4, only one or more processing cores of the first layer or of intermediate layers are configured as semantic relay processing blocks 31; preferably, at least two consecutive layers of processing cores are configured as semantic relay processing blocks 31, which perform consecutive iterations on the dual streams, while the other layers perform dual-stream iteration without semantic injection. In other examples requiring large-scale, long-range depth alignment, as shown in Figure 5, all processing cores are configured as the semantic relay processing blocks 31 of this disclosure. Each semantic relay processing block 31 contains: a selective state space model computation unit 41, composed of an optimized matrix multiplier and a recursive state cache unit, which efficiently performs the recursive computation of the discretized state space model; a semantic information generation unit 42, composed of a pooler and a lightweight matrix operation unit, which extracts semantic prototypes from the input global memory representation and calculates routing weights; a semantic injection unit 43, composed of an adder and data path control logic, which is configured to add the dynamic semantic gate matrix V output by the semantic information generation unit 42 to the output projection parameter C inside the selective state space model computation unit 41 during the point-by-point iterative computation; and a dual-stream update evolution unit 44, containing gated fusion logic and a residual connection adder, which updates the global memory representation to obtain the global memory representation output by the block, and applies a residual connection and layer normalization to the features output by the selective state space model computation unit 41 to obtain the sequence feature representation output by the block. The dual-stream update of the semantic relay processing block 31 is cooperative: the sequence feature representation of the block is fused into the processing path of the global memory representation, and at the same time the input global memory representation is injected, through semantic injection, into the processing path of the sequence feature representation. 
The internal bus interface of the chip / chipset also includes a feature output interface 23, used to output the evolved sequence feature representation or the vector after global pooling of the last semantic relay processing block as a deep feature representation, and pass it to the downstream planning and control module. Preferably, the semantic injection unit 43 can be deployed in an adder array during hardware design. This array receives the dynamic semantic gate matrix V from the semantic information generation unit 2 and the original output projection matrix C stored internally in the selective state space model calculation unit 41, performing parallel addition operations on each corresponding element. The result (C+V) is written back to the selective state space model calculation unit 41 for subsequent output calculations.
[0077] In one aspect, in some examples using multiplicative gated injection, the semantic injection unit 43 is implemented in hardware as a multiplier array. It is configured as follows: a weight matrix of the same dimension as the output projection parameter C serves as the dynamic semantic information V and is multiplied element-wise with the original output projection parameter C; the resulting Hadamard product forms the modulated parameter, which is then used in the state-space model calculation. In some examples using affine transformation injection, the semantic injection unit 43 is implemented in hardware as an affine transformation circuit composed of adders and multipliers. It is configured to apply an affine operation with learnable scaling and bias scalars, which allows more complex modulation of the output projection. In some examples using parameter-selective switch injection, the semantic injection unit 43 is implemented in hardware as a multiplexer and associated control logic. It is configured as follows: the dynamic semantic information V serves as a routing index or attention weight, according to which one matrix is selected from a set of pre-stored output projection parameter matrices with different characteristics and used as the current parameter matrix.
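The injection variants described above can be compared side by side in a few lines. The following is a minimal NumPy sketch, not the patented circuit: all function names are hypothetical, and the affine form `gamma * (C + V) + beta` is an assumption, since the source states only that learnable scaling and bias scalars are involved.

```python
import numpy as np

def inject_additive(C, V):
    """Adder-array variant: C' = C + V."""
    return C + V

def inject_multiplicative(C, V):
    """Multiplier-array variant: C' = C (Hadamard product) V."""
    return C * V

def inject_affine(C, V, gamma, beta):
    """Affine variant. ASSUMED form: gamma * (C + V) + beta,
    with gamma and beta standing in for the learnable scalars."""
    return gamma * (C + V) + beta

def inject_select(bank, weights):
    """Multiplexer variant: pick one pre-stored projection matrix,
    using the semantic information as routing weights."""
    return bank[int(np.argmax(weights))]

C = np.array([[1.0, 2.0], [3.0, 4.0]])
V = np.array([[0.5, 0.5], [2.0, 0.0]])
bank = np.stack([C, 2 * C, 3 * C])        # three pre-stored candidates
chosen = inject_select(bank, np.array([0.1, 0.7, 0.2]))
```

The additive form keeps the output linear in the hidden state, which is why it maps onto a plain adder array; the selection form trades expressiveness for a fixed, small bank of matrices.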
[0078] In some examples in which the multi-layer processing cores perform continuous downsampling, the processing cores in processing circuit 20 are sequentially stacked to form a downsampling encoder. Each processing core, in addition to passing its output features to the downstream processing core, is also connected to the bus interface and transmits its output features to it. The feature output interface 23 of the bus interface is then also used to upsample, fuse, and decode the output features of the processing cores layer by layer, and to transmit the result as the deep feature representation to the downstream planning and control module. In various aspects, based on the hardware optimization examples involved in the embodiments of the methods, devices, and systems described herein, embodiments of the design, manufacture, or use of the processing circuit and its internal units and interfaces can be constructed by optimizing or implementing specific hardware modules or fixed logic in dedicated chips (ASICs, FPGAs, NPUs, etc.).
[0079] Several embodiments of the third aspect of this disclosure provide a 3D point cloud processing system. In an example implementing the method described herein, with reference to Figure 7, in a remote modeling application scenario, 3D point cloud data of a scene or object is acquired at the surveying site by the LiDAR and MU of the surveying terminal and uploaded to a cloud server via an AIoT network. The cloud server then performs digital twin modeling. The cloud server deploys and runs the 3D point cloud feature extraction system 1000 provided in this example to extract features from the 3D point cloud and pass them to downstream tasks. The system consists of software modules implementing the steps of this disclosure, mainly including: a serialization module 1100, a feature initialization module 1200, a deep feature extraction network 1300, and an output module 1400. The deep feature extraction network 1300 contains multiple semantic relay processing blocks 1310. The input of system 1000 is the raw 3D point cloud data, and the output is a deep feature representation characterizing the semantic information of the point cloud, used for downstream tasks such as classification or segmentation. With reference to Figure 8, the implementation of each step in system 1000 is described below.
[0080] Serialization step (S100)
[0081] In step S100, the serialization module 1100 receives the input 3D point cloud. This point cloud data is typically represented as a set containing N points, each of which contains at least its three-dimensional coordinates.
[0082] The serialization module 1100 is configured to transform the unordered point cloud set into an ordered point sequence S. This transformation aims to provide an input structure for the subsequent sequence model while preserving as much of the local correlation of the point cloud in 3D space as possible. In a preferred embodiment, the serialization module 1100 sorts the points along a space-filling curve, which can be, for example, a Hilbert curve or a Z-order curve. Specifically: first, the coordinates of the point cloud are normalized to a preset spatial bounding box, such as a voxel grid; then, the space-filling curve encoding value corresponding to each point in the bounding box is calculated; finally, all points are sorted according to their encoding values, for example in ascending order, to obtain the point sequence S. The choice of a space-filling curve ensures that points that are close to each other in three-dimensional space are, with high probability, also close to each other in the one-dimensional sequence S.
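The three serialization sub-steps (normalize, encode, sort) can be illustrated with the Z-order curve, which is the simpler of the two curves named above. This is a minimal sketch under assumptions: the function names, the `bits` grid resolution, and the choice of Z-order rather than Hilbert ordering are illustrative only.

```python
import numpy as np

def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates (Z-order)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize_points(points, bits=10):
    """Sort an (N, 3) point array along a Z-order curve.

    Returns the ordered sequence and the permutation that produced it.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0                      # guard against flat axes
    # Normalize into the bounding box, then quantize to a 2^bits grid.
    grid = ((points - lo) / span * (2 ** bits - 1)).astype(np.int64)
    codes = np.array([morton_code(int(x), int(y), int(z), bits)
                      for x, y, z in grid])
    order = np.argsort(codes, kind="stable")   # ascending encoding value
    return points[order], order

S, order = serialize_points([[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]], bits=1)
```

A Hilbert curve would replace only `morton_code`; the normalize and sort steps stay the same.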
[0083] Feature initialization step (S200)
[0084] In step S200, the feature initialization module 1200 is configured to receive the point sequence S and generate two parallel initial feature representations: an initial sequence feature representation $X_0$ and an initial global memory representation $M_0$. Both have the same feature dimensions. Specifically:
[0085] The initial sequence feature representation $X_0$ is obtained as point embeddings by applying a shared feature transformation, such as a multilayer perceptron, to each point in sequence S. This transformation maps the original features, such as the point coordinates, to a higher-dimensional, more expressive hidden space. The initial global memory representation $M_0$ is used to carry and pass global context information in subsequent steps; in this example it is set equal to the initial sequence feature representation, that is, $M_0 = X_0$. In other examples, it can also be obtained through an independent linear or nonlinear transformation, or initialized as a set of learnable parameter vectors and broadcast to all point locations.
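The initialization above amounts to a shared point-wise transformation followed by a copy. A minimal NumPy sketch follows, assuming a two-layer perceptron with ReLU activation and random stand-in weights; the function name, layer widths, and activation are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def init_features(S, W1, b1, W2, b2):
    """Apply a shared two-layer MLP to every point of the sequence S.

    S: (N, 3) ordered point coordinates.
    Returns the initial sequence features and the initial global memory,
    both of shape (N, D).
    """
    hidden = np.maximum(S @ W1 + b1, 0.0)   # ReLU is an assumed activation
    X0 = hidden @ W2 + b2                   # initial sequence features
    M0 = X0.copy()                          # memory set equal to the features
    return X0, M0

rng = np.random.default_rng(0)
N, D = 8, 16
S = rng.normal(size=(N, 3))
W1, b1 = rng.normal(size=(3, 32)) * 0.1, np.zeros(32)   # stand-in weights
W2, b2 = rng.normal(size=(32, D)) * 0.1, np.zeros(D)
X0, M0 = init_features(S, W1, b1, W2, b2)
```

Copying rather than aliasing keeps the two streams independent once the blocks begin updating them.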
[0086] Deep feature extraction step (S300)
[0087] The deep feature extraction network 1300 consists of L sequentially connected semantic relay processing blocks (SRM Blocks) 1310. Each processing block 1310 performs a dual-stream cooperative evolution operation, iteratively evolving the input sequence feature representation and global memory representation. Taking the $l$-th processing block (zero-based index) as an example and referring to Figure 9, the operation includes the following sub-steps:
[0088] S310: Feature Input
[0089] Processing block 1310 receives the sequence feature representation $X_l$ and the global memory representation $M_l$ output by the previous layer (for the first layer, the inputs are $X_0$ and $M_0$). Within this layer, they are referred to as the current sequence feature representation and the current global memory representation, respectively.
[0090] S320: Dynamic Semantic Information Generation
[0091] This step derives dynamic semantic information from the current global memory representation, constructing an accessible semantic information repository for the subsequent injection. In a specific implementation, with reference to Figure 10, this step is implemented by a semantic information generation unit that performs the following operations. Memory retrieval: the current global memory representation, or its self-updated version, is obtained. Semantic pooling: a global aggregation operation, such as adaptive pooling or max pooling, is applied to the current global memory representation or its updated version, generating a compact global context vector. Prototype generation: the global context vector is mapped to a semantic pool containing k learnable prototype vectors, which are designed to capture typical geometric semantic patterns in point clouds. At the same time, a routing weight calculation unit generates a routing weight matrix based on the current sequence feature representation or its preprocessed version. This matrix indicates the degree of correlation between each point in the sequence and each prototype in the semantic pool.
[0092] S330: Semantic Injection State Space Model Processing
[0093] This step is the core of achieving efficient non-causal modeling. Processing block 1310 contains a Semantic-Injected State-Space Model (SI-SSM) module.
[0094] First, the current sequence feature representation passes through a preprocessing unit that includes linear projection and causal convolution, yielding intermediate features. The intermediate features are then input into a selective state-space model. This selective state-space model has a standard linear recursive state transition equation, thus guaranteeing a linear computational complexity of O(N). The key improvement of this invention lies in modulating the output generation process of the selective state-space model.
[0095] Specifically, the discrete-time output equation of the selective state-space model is modified. The standard output equation is $y_t = C h_t$, where C is the output projection matrix and $h_t$ is the causal hidden state. This invention introduces a dynamic semantic gate V, modifying the output equation to $y_t = (C + V) h_t$. The dynamic semantic gate V is constructed from the semantic information generated in step S320. For example, V can be obtained as a linear combination of the prototypes in the semantic pool weighted by the routing weight matrix, i.e., as the product of the routing weight matrix and the semantic pool. Through this design, the dynamic semantic gate V directly injects non-causal semantic information originating from the global memory into the output calculation at every step of the selective state-space model. This allows any point in the sequence to access information modulated by the global context when generating its output, breaking the purely causal limitation of the standard selective state-space model without increasing the order of computational complexity. The output of the semantic injection selective state-space model module is referred to as the semantically modulated features.
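To make the modified output equation concrete, the sketch below runs a diagonal selective state-space recurrence and applies $y_t = (C + V) h_t$ at every step. It is an illustrative NumPy loop, not the patented kernel: the function name, the tensor shapes, and the simplified discretization (exponential for A, a first-order product for B) are assumptions modelled on common selective SSM implementations.

```python
import numpy as np

def si_ssm(x, delta, A, B, C, V):
    """Semantic-injected selective SSM, sequential reference version.

    x:     (N, D) preprocessed sequence features
    delta: (N, D) positive step sizes
    A:     (D, n) negative diagonal state matrix
    B, C:  (N, n) input-dependent input and output projections
    V:     (N, n) dynamic semantic gate; all zeros gives the standard output
    """
    N, D = x.shape
    h = np.zeros((D, A.shape[1]))
    y = np.empty((N, D))
    for t in range(N):                               # single pass, O(N)
        A_bar = np.exp(delta[t][:, None] * A)        # discretized A, (D, n)
        B_bar = delta[t][:, None] * B[t][None, :]    # discretized B, (D, n)
        h = A_bar * h + B_bar * x[t][:, None]        # causal state update
        y[t] = h @ (C[t] + V[t])                     # y_t = (C_t + V_t) h_t
    return y
```

Because V enters only the output projection, the state update is unchanged and the output splits into a causal term (from C) plus a global term (from V); the cost per step grows by one vector addition.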
[0096] In other implementations of various aspects, the injection of dynamic semantic information into the computation of the selective state-space model is not limited to modifying the output projection matrix C. In one example, the dynamic semantic information is used to modulate the state transition matrix A or the input projection matrix B. In another example, the dynamic semantic information acts on the hidden state or on the output in the form of gated multiplication. In another example, the dynamic semantic information is concatenated with the output projection matrix C and used after a linear transformation.
[0097] S340: Dual-stream update
[0098] Through this step, the semantic relay processing block 1310 completes the co-evolution between the sequence feature stream and the global memory stream. Sequence feature stream update: the semantically modulated features are added to the current sequence feature representation through a residual connection path and then normalized, yielding the updated sequence feature representation, which is passed to the next layer. Global memory stream update: the semantically modulated features are also used to update the current global memory representation. In one implementation, this is achieved through a gated update mechanism: the semantically modulated features are combined with the current global memory representation after the latter has been regulated by a self-gate, and residual connection and normalization are then applied to obtain the updated global memory representation, which is output to the next layer.
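The two updates can be written compactly once the semantically modulated features are available. The sketch below is one plausible reading of the gated update, assuming a sigmoid self-gate on the memory, additive combination with the modulated features, and a layer normalization without learnable scale or shift; the names `W_val` and `W_gate` are hypothetical.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_stream_update(X, M, Y, W_val, W_gate):
    """X, M, Y: (N, D) current features, current memory, modulated features."""
    X_next = layer_norm(X + Y)                     # sequence stream: residual + norm
    refined = sigmoid(M @ W_gate) * (M @ W_val)    # self-gated memory refinement
    M_next = layer_norm(M + refined + Y)           # write Y into memory, residual + norm
    return X_next, M_next
```

The self-gate lets the memory decide how much of its own state to carry forward before the new features are written into it.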
[0099] The aforementioned read-write update mechanism ensures that the global memory can continuously integrate the latest semantic information extracted from the sequence features as the network depth increases. As a more concrete example, Figure 11 provides pseudocode for a specific Semantic Relay Processing Block (SRM Block) 1310. As is readily understood, the pseudocode illustrates how the deep feature extraction network 1300, by sequentially stacking multiple SRM Blocks 1310, decouples the modeling of local geometry and global semantics into two interconnected processes. The semantic relay mechanism operates through strictly synchronized read-write cycles: the sequence path reads the global prototypes from the memory path to enrich its state (via the SI-SSM mechanism), while the memory path updates itself by writing the newly processed features back into its history. This ensures that the global context is not static but evolves layer by layer along with the local features. The computation of an SRM Block 1310 can therefore be described by tracing these two coupled paths: along the sequence path (X-path), the sequence feature representation of each layer is processed in its sequence context. The input to each layer first passes through a standard preprocessing stage consisting of a linear layer, a one-dimensional causal convolution, and SiLU activation, which projects the features onto the hidden dimension D. This representation is then fed into the core Semantic Injection State Space Model (SI-SSM) mechanism.
[0100] The semantically injected state-space model, as the core of the non-causal modeling approach in this disclosure, reformulates the SSM output equation by introducing a dynamic semantic gate V. Whereas the standard projection C aggregates local historical context from the causal state $h_t$, the dynamic semantic gate V acts as a non-causal bypass that directly injects the global context retrieved from the memory stream. The new output is $y_t = (C + V) h_t$. V is constructed through a dynamic routing process, while the causal state tensor h is computed from the output of the sequence-path preprocessing stage so as to maintain strict linearity. Non-causal context is introduced only through V, ensuring that global spatial dependencies are captured without sacrificing the O(N) efficiency of the original Mamba.
[0101] As a more specific example, the computation steps shown in Figure 12, the pseudocode of the semantic injection state space model calculation module of the semantic relay processing block 1310, can also be expressed as follows:
[0102] $$y_t = (C + V)\,h_t + D\,x_t$$
[0103] where $y_t$ is the output feature of the selective state-space model at time step $t$; $C$ is the original output projection matrix, which is learnable or input-dependent; $V$ is the dynamic semantic information expressed as a dynamic semantic gate matrix, which is generated from the semantic information extracted from the global memory representation, has the same dimension as $C$, and is used to inject a non-causal global context into the output projection; $h_t$ is the hidden state vector calculated from the recursive equation of the model; $D$ is the learnable or input-dependent feedthrough coefficient; and $x_t$ is the input feature of the current sequence feature representation at the corresponding time step.
[0104] As is readily understood, from the perspective of the X-path and M-path, this example sets up a "semantic relay" pipeline in the SI-SSM to dynamically synthesize the semantic gate V from the co-evolving memory stream. This pipeline includes three operations:
[0105] 1) Semantic pool creation: the evolved memory stream from the M-path is distilled, through an adaptive pooling operation followed by an MLP, into a small, shared semantic pool. The pool consists of k learnable "geometric prototype" vectors that are dimensionally aligned with the output parameter C of the SSM.
[0106] 2) Dynamic route generation: at the same time, the sequence stream from the X-path is passed through its own MLP and a routing function g (Gumbel-Softmax) to generate a routing weight tensor, which represents the affinity of each point with each "geometric prototype" in the semantic pool.
[0107] 3) Semantic gate selection: the dynamic semantic gate tensor V is generated by aggregating the relevant prototypes for each point. This enables point-by-point semantic injection into the sequence feature representation during the selective state-space model computation.
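The three operations above can be sketched end to end in NumPy. This is a hedged illustration: single linear maps (`W_pool`, `W_route`) stand in for the MLPs, adaptive pooling is taken to mean averaging the memory over k contiguous segments, and N is assumed to be at least k. None of these choices is confirmed by the source.

```python
import numpy as np

def adaptive_avg_pool(M, k):
    """Average an (N, D) memory over k contiguous segments -> (k, D)."""
    return np.stack([seg.mean(axis=0) for seg in np.array_split(M, k, axis=0)])

def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample per row, used here as soft routing weights."""
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_gate(X, M, W_pool, W_route, k, tau=1.0, rng=None):
    """X, M: (N, D). Returns gate V (N, n), routing R (N, k), pool P (k, n)."""
    if rng is None:
        rng = np.random.default_rng()
    P = adaptive_avg_pool(M, k) @ W_pool         # 1) semantic pool
    R = gumbel_softmax(X @ W_route, tau, rng)    # 2) routing weights
    V = R @ P                                    # 3) per-point semantic gate
    return V, R, P
```

Two points that are far apart in the sequence but route to the same prototype receive nearly the same row of V, which is how shared non-local context reaches both of them.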
[0108] The dynamic semantic gate tensor V thus encapsulates the most relevant global, non-causal semantic context for each point, so that two points on the X-path that are spatially distant but semantically related can be routed to the same prototype. The shared non-local context is injected into their final representations through the gate V, achieving non-causal modeling. The dynamic semantic gate tensor V is ultimately injected, as a bias on the parameter C, into the final output equation of the selective state-space model to generate the semantically modulated features, which are at the same time fed back to the M-path for the evolutionary update of the global memory representation.
[0109] The memory path (M-path), as a parallel information carrier, primarily serves to evolve layer by layer and accumulate the global context required to build the semantic prototype pool. Its update is driven by two sources: its own previous state and the new information written from the X-path. This implementation uses a gating mechanism to control the update: the input memory refines its own information through a gated linear path constructed from linear layers and a sigmoid activation function. The refined memory is then combined with the input from the X-path, by addition or in another manner. The combined representation and the original memory are passed through a residual addition and normalization layer to produce the new, updated memory state. This ensures that the memory to be used by the SI-SSM in the next block is enriched by the latest findings in the sequence stream. In the current semantic relay processing block, the output of the SI-SSM is used both to calculate the final output of the X-path through a residual connection and to participate, via the M-path, in the update of the global memory representation. When multiple semantic relay processing blocks are stacked, this tightly coupled read-write cycle is repeated layer by layer, achieving alternating co-evolution: the X-path obtains non-causal context by reading from the memory, while the M-path is enriched by writing the new findings of the X-path back into itself.
[0110] Deep feature output step (S400)
[0111] After iterative processing through all L processing blocks in the deep feature extraction network 1300, the final sequence feature representation $X_L$ and global memory representation $M_L$ are obtained. The output module 1400 generates the final deep feature representation from $X_L$ and/or $M_L$ according to the requirements of the downstream task. For classification tasks, the output module 1400 typically contains a global pooling layer (such as max pooling) and a classification head. It can perform global pooling on $X_L$ or $M_L$, or fuse the pooled features of both, before the classification head outputs the predicted category. For segmentation tasks, the output module 1400 is typically a pointwise feature decoding head. It can use $X_L$ directly as the per-point features, or fuse the features of corresponding points in $X_L$ and $M_L$, after which the decoder generates a semantic label prediction for each point.
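The two output options can be sketched as follows. This is an illustrative NumPy fragment under assumptions: fusion is taken to be concatenation, each head is a single linear layer, and the weight names are hypothetical.

```python
import numpy as np

def classify(X, M, W_cls, b_cls):
    """Max-pool both streams over the points, fuse, and score the classes."""
    pooled = np.concatenate([X.max(axis=0), M.max(axis=0)])   # (2D,)
    logits = pooled @ W_cls + b_cls                           # (num_classes,)
    return int(np.argmax(logits)), logits

def segment(X, M, W_seg, b_seg):
    """Fuse the two streams point by point and predict one label per point."""
    fused = np.concatenate([X, M], axis=1)                    # (N, 2D)
    logits = fused @ W_seg + b_seg                            # (N, num_labels)
    return logits.argmax(axis=1)
```

Max pooling is invariant to the order of the points, so the classification output does not depend on the serialization once the features are computed.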
[0112] In one aspect, as an example of the deep feature output step with multi-scale fusion output, Figure 13 shows the implementation of the proposed method in a panoptic segmentation system 2000 for autonomous driving LiDAR point clouds. The task scenario of the system involves LiDAR point clouds collected during autonomous driving, which are extremely sparse and non-uniform and arrive at a scale of hundreds of thousands of points per second. The scene includes dynamic objects such as vehicles and pedestrians, static facilities such as roads, guardrails, and traffic lights, and severe occlusion, such as vehicles obscuring pedestrians and buildings obscuring traffic signs. The system must therefore predict, with the lowest possible latency, both the semantic category and the instance ID of each point, i.e., what it is and which object it belongs to. This type of scenario requires handling long-range dependencies, severe occlusion, and semantic ambiguity, placing dual demands on global context awareness and computational efficiency. The system 2000 in this example includes an end-to-end neural network whose core is a backbone network with an encoder-decoder architecture. Multiple sequentially connected semantic relay processing blocks (SRM blocks) constitute the encoder of the backbone network, which completes the dual-stream collaborative evolution of S300 during continuous downsampling. On the X-path, the sequence feature representation is input into a semantically injected selective state-space model (SI-SSM) for sequence modeling, which outputs intermediate features. Along the X-path traversing the encoder, the lengths of the sequence feature representations obtained by the successive semantic relay processing blocks differ. In this case, an element of the sequence feature representation no longer corresponds to a single point in the point cloud data, but rather to a superpoint related to several preceding and following points. On the M-path, the global memory representation is updated from the current global memory representation and the intermediate features through a memory update unit; alternatively, a lightweight memory update unit such as attention pooling can be used. This memory summarizes the semantics of the entire scene "seen" up to the current layer. The semantic information generation unit uses the updated memory as its primary input to generate the dynamic semantic information, whose value reflects the global semantic prior of the current scene, such as "there are a large number of vehicle point clouds ahead". The semantic injection unit injects the dynamic semantic information into the computation of the selective state-space model, so that the processing of each point in the sequence is modulated by the semantics of the current global scene.
[0113] The multi-scale feature fusion module outputs a point-by-point fused deep feature representation to implement step S400. First, it upsamples and aligns the shallow and deep sequence feature representations output by each level of the encoder, as well as the voxel features obtained in S100, to the same spatial resolution as the input data, such as the point cloud scale N, through methods such as trilinear interpolation, 3D deconvolution, or nearest neighbor interpolation. Then, after concatenating the aligned feature representations along the channel dimension, it fuses the concatenated high-dimensional features and reduces their dimensionality through a 1x1 convolution, which is equivalent to a fully connected layer and learns the importance weights between different channels, and outputs the fused deep feature representation. In the subsequent multi-task decoding heads, the deep feature representation is decoded into a semantic segmentation map by the semantic segmentation head, a center offset map by the instance center head (MLP), and an instance embedding map by the instance feature head (MLP). The decoded feature maps are then used in the post-processing module to assign a semantic category and an instance ID to each point, thus realizing the functionality of the autonomous driving LiDAR point cloud panoptic segmentation system 2000.
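The align, concatenate, and fuse sequence can be sketched in NumPy. The sketch assumes nearest-neighbor upsampling along the serialized sequence index (a simplification of nearest-neighbor interpolation in 3D) and represents the 1x1 convolution as a per-point linear map; the names are hypothetical.

```python
import numpy as np

def upsample_nearest(F, N):
    """Repeat the rows of an (N_l, C_l) feature map to length N by index mapping."""
    idx = (np.arange(N) * F.shape[0]) // N
    return F[idx]

def fuse_multiscale(features, N, W_fuse, b_fuse):
    """Align every level to N points, concatenate channels, apply a 1x1 conv."""
    aligned = [upsample_nearest(F, N) for F in features]
    stacked = np.concatenate(aligned, axis=1)        # (N, sum of C_l)
    return stacked @ W_fuse + b_fuse                 # per-point linear map

rng = np.random.default_rng(4)
levels = [rng.normal(size=(8, 2)),                   # full resolution
          rng.normal(size=(4, 3)),                   # downsampled once
          rng.normal(size=(2, 5))]                   # downsampled twice
fused = fuse_multiscale(levels, 8, rng.normal(size=(10, 6)), np.zeros(6))
```

A 1x1 convolution over a point sequence touches one point at a time, so it reduces to the matrix product shown here.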
[0114] In various aspects, through the multiple implementations described above, the ability to model non-causal global context is integrated into a selective state-space model with linear computational complexity, thereby achieving efficient and accurate feature extraction from 3D point clouds. Those skilled in the art can further understand the effects of the implementations, or improve upon them, through the following application implementations based on public data. Each application implementation is trained and validated with a task-specific head on the same backbone network of four stacked processing blocks, referred to as SemSSM; each comparative implementation uses the same hardware platform.
[0115] In an implementation targeting the classification of real, noisy scanned objects, objects are classified on the real-world scanned dataset ScanObjectNN (PB_T50_RS). As shown in Figure 15, an overall accuracy (OA) of 87.9% is achieved, significantly outperforming the second-best comparison method. Specifically, compared with the baseline state-space model that employs only a shallow bidirectional fusion strategy (PointMamba, 84.9% OA), this method is 3.0 percentage points higher. This implementation demonstrates the significant advantage of the "deep semantic co-evolution" mechanism over a simple "bidirectional scanning" strategy in overcoming occlusion and background noise in real point cloud data.
[0116] In an implementation targeting a synthetic object classification task, classification is performed on the standard synthetic dataset ModelNet40. As shown in Figure 14, this method achieves an overall accuracy (OA) of 93.8%, comparable to the performance of Transformer-based models. Meanwhile, as shown in Figure 18, the computational cost of this method is 1.62G FLOPs, significantly lower than that of comparable Transformer models (such as Point Transformer, 6.90G FLOPs) and on par with the efficient baseline state-space model (PointMamba, 1.5G FLOPs). This embodiment demonstrates that the method is competitive in accuracy while retaining the linear complexity advantage of state-space models, achieving a balance between accuracy and efficiency.
[0117] In an implementation targeting a large-scale indoor scene semantic segmentation task, semantic segmentation is performed on the large indoor scene dataset S3DIS (Area 5). As shown in Figure 17, this method shows advantages in three metrics: overall accuracy (OA, 91.8%), class mean intersection over union (Cls. mIoU, 78.9%), and instance mean intersection over union (Ins. mIoU, 74.1%). This embodiment illustrates the method's scene-level semantic parsing capability and its practicality for large-scale scenes containing millions of points and complex structures.
[0118] In a series of comparisons involving multiple selective state-space models, comparative and ablation experiments were conducted on the ScanObjectNN and ShapeNet Part datasets. As shown in Figure 19 and Figure 20, replacing the complete SI-SSM module with the existing standard causal SSM reduced performance to 81.0% OA on ScanObjectNN, demonstrating that a purely causal model is ill-suited to point cloud data. Replacing the SI-SSM module with a "bidirectional Mamba block + shallow fusion" approach yielded 84.9% OA on ScanObjectNN, which, while better than the causal model, was still significantly lower than the complete model (87.9% OA). This implementation demonstrates that the "dual-stream collaborative evolution" step with "semantic injection" in this method offers a significant performance gain over existing shallow fusion strategies.
[0119] In a set of ablation comparisons that take the serialization method and the number of geometric prototypes as design parameters, different serialization methods or different numbers of geometric prototypes are used with the same backbone network. As shown in Figures 19 and 20, this method is only slightly affected by the choice among conventional serialization methods and achieves stable, strong performance within a reasonable parameter range (k=16). It is insensitive to parameter changes and is robust in design. The parameter k provides a means of adjusting the trade-off between model capacity and computational cost.
Claims
1. A 3D point cloud feature extraction method based on a semantic injection state space model, characterized in that it includes the following steps:
S100: acquiring 3D point cloud data and serializing it into a point sequence;
S200: performing feature initialization on the point sequence to obtain an initial sequence feature representation and an initial global memory representation;
S300: inputting the initial sequence feature representation and the initial global memory representation into a backbone network composed of multiple sequentially connected processing blocks for iterative evolution; wherein, in at least one processing block of the backbone network, a dual-stream cooperative evolution operation is performed, the operation including:
S310: receiving the sequence feature representation and global memory representation input to the processing block, and using them as the current sequence feature representation and current global memory representation;
S320: generating dynamic semantic information based on the current global memory representation;
S330: using the dynamic semantic information to modulate the linear output computation of a selective state-space model that processes the current sequence feature representation based on causal recursion, so as to generate semantically modulated features; the modulation operation maintains the inherent linear computational complexity of the selective state-space model and enables the semantically modulated features to incorporate the non-causal global context from the dynamic semantic information; and
S340: obtaining, based on the semantically modulated features, the updated global memory representation and the updated sequence feature representation of the block after evolution;
S400: generating, based on the sequence feature representation and / or global memory representation finally output by the backbone network, a deep feature representation characterizing the semantics of the 3D point cloud.
2. The 3D point cloud feature extraction method of claim 1, wherein the modulation operation in step S330 is achieved by integrating the dynamic semantic information into the output projection calculation of the selective state space model in a linear combination manner.
3. The 3D point cloud feature extraction method of claim 2, wherein the linear combination manner is an addition operation, and the dynamic semantic information is combined with the projection parameters in the output projection calculation as an additive term.
4. The 3D point cloud feature extraction method of claim 1, wherein each processing block in the backbone network is configured to perform the dual-stream cooperative evolution operation.
5. The 3D point cloud feature extraction method of claim 1, wherein the update of the current global memory representation in step S340 is achieved through a gated fusion step; the gated fusion step is configured to adjust the contribution of the semantically modulated features to the update of the current global memory representation based on the state of the current global memory representation.
6. The 3D point cloud feature extraction method of claim 5, wherein the gated fusion step achieves the update by subjecting the current global memory representation to a gating function and then combining it with the semantically modulated features.
7. The 3D point cloud feature extraction method according to claim 1, characterized by further comprising at least one of the following refinements:
the serialization in step S100 is performed according to a dynamic strategy during the model training phase, the dynamic strategy making the rules used for serialization variable during training;
step S320 comprises: constructing a semantic information database based on the current global memory representation; determining the association between each point in the sequence and the semantic information database based on the current sequence feature representation; and generating corresponding dynamic semantic information for each position in the point sequence according to the association; and
step S400 comprises: obtaining the updated sequence feature representations output by multiple processing blocks in the backbone network; and performing cross-layer fusion on the multiple updated sequence feature representations to obtain the deep feature representation.
8. A 3D point cloud processing device, characterized by comprising:
a data interface for acquiring 3D point cloud data; and
a processing circuit connected to the data interface and comprising:
a sequence initialization unit for serializing the 3D point cloud data and converting it into an initial sequence feature representation and an initial global memory representation;
a dual-stream collaborative evolution unit configured to execute or include a neural network consisting of multiple sequentially connected processing blocks, the neural network being used to iteratively process the initial sequence feature representation and the initial global memory representation; and
a feature output interface for outputting a deep feature representation characterizing the semantics of the 3D point cloud, the deep feature representation being obtained from the sequence feature representation and/or the global memory representation produced by the iterative processing of the dual-stream collaborative evolution unit;
wherein the neural network includes at least one semantic relay processing block comprising:
a selective state-space model calculation unit;
a semantic information generation unit configured to generate dynamic semantic information based on the global memory representation input to the block;
a semantic injection unit, connected to the semantic information generation unit and the selective state-space model calculation unit, configured to inject the dynamic semantic information into the processing, by the selective state-space model calculation unit, of the sequence feature representation input to the block; and
a dual-stream update evolution unit, connected to the selective state-space model calculation unit, configured to: update the global memory representation input to the block based on the features output by the selective state-space model calculation unit, to obtain the global memory representation output after the block's evolution; and obtain the sequence feature representation output after the block's evolution based on the features output by the selective state-space model calculation unit.
9. A 3D point cloud processing system, characterized by comprising: a processor; and a memory storing computer program instructions; wherein, when the computer program instructions are executed by the processor, the system implements the method of any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 7.
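For illustration only, the dual-stream collaborative evolution operation of claims 1 to 3, 5 and 6 might be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the application: the attention-style read from the memory (loosely following the database lookup of claim 7), the diagonal selective state-space recursion, the sigmoid gate, the mean pooling, and every name (`init_params`, `dual_stream_block`, `W_q`, `W_g`, and so on) are hypothetical choices made here.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d, s):
    """Hypothetical parameter set: d = feature width, s = SSM state size."""
    def w(*shape):
        return rng.standard_normal(shape) / np.sqrt(shape[0])
    return {
        "A": -np.exp(rng.standard_normal((d, s))),  # negative entries: decaying state
        "W_delta": w(d, d), "W_B": w(d, s), "W_C": w(d, s),
        "W_q": w(d, d), "W_k": w(d, d), "W_v": w(d, s),
        "W_g": w(d, d),
    }

def dual_stream_block(x, m, p, inject=True):
    """x: (N, d) sequence features; m: (K, d) global memory. Returns (x', m')."""
    n, d = x.shape
    # S320 (assumed form): each position reads a semantic vector from the
    # memory. Cost is O(N * K), i.e. linear in N for a fixed memory size K.
    attn = softmax((x @ p["W_q"]) @ (m @ p["W_k"]).T / np.sqrt(d))
    sem = attn @ (m @ p["W_v"]) if inject else np.zeros((n, p["W_C"].shape[1]))

    # Input-dependent (selective) SSM parameters.
    delta = softplus(x @ p["W_delta"])  # (N, d) step sizes
    B = x @ p["W_B"]                    # (N, s) input projection
    C = x @ p["W_C"]                    # (N, s) output projection

    h = np.zeros_like(p["A"])           # (d, s) hidden state
    y = np.empty_like(x)
    for t in range(n):                  # S330: causal recursion, one pass over N
        decay = np.exp(delta[t][:, None] * p["A"])
        h = decay * h + (delta[t] * x[t])[:, None] * B[t][None, :]
        # Additive injection into the output projection (claim 3, assumed form):
        # y_t = h_t (C_t + s_t) instead of y_t = h_t C_t.
        y[t] = h @ (C[t] + sem[t])

    # S340: residual update of the sequence stream, gated update of the memory.
    x_new = x + y
    g = sigmoid(m @ p["W_g"])           # gate computed from the current memory
    m_new = g * m + (1.0 - g) * y.mean(axis=0)
    return x_new, m_new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = init_params(rng, d=8, s=4)
    x = rng.standard_normal((16, 8))    # 16 serialized points
    m = rng.standard_normal((3, 8))     # 3 memory tokens
    x_out, m_out = dual_stream_block(x, m, p)
    print(x_out.shape, m_out.shape)
```

In this sketch the recursion itself stays causal. Information about later points reaches an earlier position only through the memory, which is pooled over the whole sequence in one block and read back in the next, so the effect appears once at least two blocks are stacked.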