Event-driven information processing method based on point cloud
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HONG KONG UNIV OF SCI & TECH (GUANGZHOU)
- Filing Date
- 2023-06-14
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies for motion recognition using event cameras suffer from high computational resource consumption, insufficient accuracy, and inability to be directly applied to spiking neural networks, making it particularly difficult to achieve efficient and low-power real-time motion recognition on edge devices.
The initial event cloud data is normalized into point cloud data using a data transformation method. It is then converted into a pulse stream through sliding window segmentation and pulse coding. Feature extraction is performed using a spiking neural network, including global and local feature extractors, and a residual feature module is used for efficient feature extraction.
It achieves efficient and low-power real-time action recognition on edge devices, improves the accuracy and performance of action recognition, solves the problem of high computing resource consumption in existing technologies, and maintains the spatiotemporal invariance of point cloud information.
Figure CN116824159B_ABST
Description
Technical Field
[0001] This invention relates to the field of neuromorphic computing, specifically to data transformation, event-driven information processing, feature extractors, and SpikePoint, a point cloud-based spiking neural network, which can achieve efficient, lightweight, low-power, and high-precision action recognition.
Background Technology
[0002] Unlike traditional frame-based image sensors (such as APS sensors), event cameras do not capture images at a fixed rate. Each pixel works independently and, when the light intensity it perceives changes beyond a certain threshold, outputs an ON event (increased light intensity) or an OFF event (decreased light intensity). For details, refer to prior art 1: EP3731516A1.
[0003] Event cameras generate events (collectively known as event streams or pulse streams) based on changes in light intensity. Their output carries only the polarity of the change (positive or negative), not the intensity itself. For artificial neural networks, this means they cannot analyze the causes of events, potentially providing incorrect feature information and affecting training results. Furthermore, existing methods typically just compress event data into images or voxel data for processing. This data format conversion not only consumes significant computational resources but also loses valuable information, thus impacting the accuracy and performance of action recognition.
[0004] Compared to traditional frame cameras, event cameras can efficiently capture details of moving objects, and when combined with advanced motion recognition technology, they have found widespread applications in fields such as robotics, autonomous vehicles, and virtual reality. Advanced motion recognition technology based on event cameras mainly includes the following two key technologies:
[0005] 1) Event-based Spiking Neural Network (SNN): This is a sensing and computing scheme based on a combination of event camera (or pulse sequence obtained by interpolating frame images) and spiking neural network (SNN). It can provide a low-power (down to milliwatt level) and high real-time (down to microsecond level) "integrated sensing and computing" solution, which can be applied to edge computing, Internet of Things and other terminal scenarios to achieve terminal intelligence without network connection.
[0006] However, while this method reduces computation and power consumption and improves speed by leveraging the characteristics of event cameras, the initial algorithm is relatively complex, and it is difficult to achieve accuracy comparable to that of an ANN or DNN. Furthermore, when a Spiking Convolutional Neural Network (SCNN) performs feature mapping, its feature map is still an image feature map, and the event coordinate information is treated as location information such as (x, y). Convolution must therefore still be performed over the neighborhood of each location, which indicates that there is still room for reducing power consumption.
[0007] 2) Point cloud processing network
[0008] The development of point-based methods, such as PointNet and its improvement PointNet++, has made it possible to use event data as input, as detailed in prior art 2.
[0009] Prior art 2: Qinyi Wang, Yexin Zhang, Junsong Yuan, and Yilong Lu. Space-time event clouds for gesture recognition: From RGB cameras to event cameras. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1826–1835. IEEE, 2019.
[0010] Existing point cloud processing networks are deep-learning-based artificial neural networks (DNNs). These networks typically consist of convolutional layers, pooling layers, and upsampling layers, and can automatically learn features to perform tasks such as object recognition, tracking, and pose estimation. Since event camera data is usually sparse event stream data, it needs to be converted into dense image data and compressed into image or voxel data formats for processing, which consumes significant computational resources and impacts performance. Furthermore, existing action recognition methods use large models, which poses a significant challenge for edge devices, whose computational capabilities are typically limited.
[0011] Furthermore, existing point cloud-based processing methods cannot be directly transferred to SNNs. Due to the binarization and dynamic characteristics of SNNs, they cannot implement the frequent high-dimensional data transformations and complex feature extraction operators, such as sampling and grouping, found in artificial neural networks.
[0012] In view of the above-mentioned problems in the existing technology, there is a need for a more efficient and lightweight action recognition method that is low in energy consumption and high in accuracy, and can realize real-time action recognition on edge devices.
Summary of the Invention
[0013] To solve or alleviate some or all of the above-mentioned technical problems, the present invention is achieved through the following technical solution:
[0014] This invention relates to a data conversion method, comprising the following steps:
[0015] The spatiotemporal information in the initial event cloud data is normalized to obtain a normalized point cloud dataset.
[0016] The normalized point cloud dataset is mapped from a global point cloud distribution to a local point cloud distribution;
[0017] Pulse coding is used to convert the information of each point in the local point cloud distribution into pulses, resulting in a converted pulse stream.
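The first and last of the three steps above (normalization and pulse coding) can be illustrated with a minimal sketch. The text does not fix a particular normalization or coding scheme at this point, so the min-max normalization and the Bernoulli rate coding used here are assumptions chosen for illustration, and the names `normalize_events` and `rate_encode` are hypothetical.

```python
import numpy as np

def normalize_events(events):
    """Min-max normalize each column (x, y, t) of an (N, 3) event array to [0, 1]."""
    events = np.asarray(events, dtype=np.float64)
    lo = events.min(axis=0)
    span = events.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (events - lo) / span

def rate_encode(values, num_steps, rng):
    """Bernoulli rate coding (assumed scheme): each value in [0, 1] is used
    as the per-time-step probability of emitting a spike."""
    values = np.asarray(values)
    return (rng.random((num_steps,) + values.shape) < values).astype(np.uint8)

rng = np.random.default_rng(0)
events = np.array([[10, 20, 1000], [30, 5, 1500], [64, 64, 2000]])  # (x, y, t)
points = normalize_events(events)
spikes = rate_encode(points, num_steps=8, rng=rng)
print(points.shape, spikes.shape)  # (3, 3) (8, 3, 3)
```

With this coding, a normalized value of 0 never spikes and a value of 1 spikes at every time step, so larger coordinates or later timestamps map to denser pulse trains.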
[0018] In one embodiment, all points in the point cloud dataset refer to event points;
[0019] The step of mapping the point cloud dataset from a global point cloud distribution to a local point cloud distribution includes:
[0020] The point cloud data is sampled to obtain at least one key point, and the points around each key point are selected to form one group.
[0021] In one embodiment, the sampling includes one of the following methods: I) selecting at least one key point based on random sampling; II) selecting S key points based on the farthest point sampling method;
[0022] When grouping, the points around the key point are based on one of the following methods:
[0023] I) Identify at least one neighboring point around any key point based on the K-nearest neighbor method;
[0024] II) Define a sphere in the spatial domain centered on any key point, and select at least 10 points uniformly or randomly within that sphere to form the group of points around the key point.
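As an illustration of the sampling and grouping options listed above, the following sketch combines farthest point sampling (sampling option II) with K-nearest-neighbor grouping (grouping option I). The function names are hypothetical, and expressing each group relative to its key point is an assumption about how the local distribution might be formed, not something the text prescribes.

```python
import numpy as np

def farthest_point_sampling(points, num_keypoints, start=0):
    """Return indices of num_keypoints points chosen by greedy farthest point sampling."""
    n = points.shape[0]
    chosen = [start]
    min_dist = np.full(n, np.inf)  # squared distance to the nearest chosen point
    for _ in range(num_keypoints - 1):
        diff = points - points[chosen[-1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen.append(int(np.argmax(min_dist)))
    return np.array(chosen)

def knn_group(points, keypoint_idx, k):
    """For each key point, return the indices of its k nearest points (itself included)."""
    centers = points[keypoint_idx]
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
cloud = rng.random((200, 3))                     # normalized (x, y, t) points
key_idx = farthest_point_sampling(cloud, num_keypoints=8)
groups = knn_group(cloud, key_idx, k=16)
# Assumption: describe each group relative to its key point (local coordinates).
local = cloud[groups] - cloud[key_idx][:, None, :]
print(local.shape)  # (8, 16, 3)
```

Farthest point sampling spreads the S key points over the whole cloud, whereas random sampling is cheaper but may leave sparse regions uncovered.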
[0025] In one embodiment, a sliding window is used to segment the initial event cloud data or the normalized point cloud dataset; within any sliding window, the normalized point cloud dataset within the sliding window is mapped from the global point cloud distribution to the local point cloud distribution, and pulse coding is used to convert the information of each point in the local point cloud distribution of the sliding window into pulses, thereby obtaining the converted pulse stream.
[0026] In one type of implementation, the size of the sliding window is adaptive.
[0027] In one type of embodiment, the size of the sliding window is proportional to the complexity of the information indicated in the dataset, or the size of the sliding window is set according to the speed of the action in the dataset, with faster actions using shorter windows.
[0028] In one type of embodiment, any sliding window may include only a single action.
[0029] In some embodiments, adjacent sliding windows may or may not overlap.
[0030] In one type of embodiment, if adjacent sliding windows overlap, the size of the overlap area between adjacent sliding windows is set based on one of the following factors:
[0031] I) Differences in data between adjacent sliding windows;
[0032] II) Number of sliding windows;
[0033] III) The total amount of data after being segmented by a sliding window.
[0034] This invention also relates to a first type of event-driven information processing method, which utilizes a spiking neural network to perform the following processing:
[0035] A global feature extractor is used to extract global features from the input pulse stream of the spiking neural network to increase the dimension of the input pulses;
[0036] Classification is performed based on the aforementioned global features;
[0037] The global feature extractor includes at least one pair of cascaded first-type convolutional kernels and residual feature blocks;
[0038] The residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled with spiking neurons before and after it. The residual feature block uses one of the following residual connection schemes:
[0039] I) The output of the starting spiking neuron is residually connected to the output of the ending spiking neuron;
[0040] II) The output of the starting spiking neuron is residually connected to the input of the ending spiking neuron;
[0041] III) The output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron.
[0042] In one embodiment, the first-type convolutional kernel is a dimension-increasing convolutional kernel, and the number of first-type convolutional kernels is equal to or one more than the number of residual feature blocks.
[0043] In one type of embodiment, the input of any residual module is coupled to the output of a first-type convolutional kernel;
[0044] The second type of convolutional kernel includes at least one convolutional kernel for dimensionality increase and at least one convolutional kernel for dimensionality decrease, wherein the output dimension of the residual feature block is the same as the input dimension.
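The residual feature block and its three connection options can be sketched roughly as follows. This is a simplified illustration under several assumptions: pointwise convolutions are written as matrix multiplications over the channel axis, the spiking neuron is a stateless threshold unit for a single time step, and the second-type kernels are ordered as decrease-then-increase (a bottleneck). The names `spike` and `residual_feature_block` are hypothetical.

```python
import numpy as np

def spike(x, threshold=1.0):
    """Stateless threshold neuron (assumed): emit 1 where the input reaches the threshold."""
    return (x >= threshold).astype(np.float64)

def residual_feature_block(x, w_down, w_up, mode="II"):
    """x: (C, N) input current; w_down: (C_mid, C); w_up: (C, C_mid)."""
    s_start = spike(x)                 # starting spiking neuron
    h = spike(w_down @ s_start)        # dimension-decreasing kernel + neuron
    z = w_up @ h                       # dimension-increasing kernel restores C
    if mode == "I":                    # start output added to end neuron output
        return spike(z) + s_start
    if mode == "II":                   # start output added to end neuron input
        return spike(z + s_start)
    return spike(z)                    # mode "III": no residual connection

rng = np.random.default_rng(0)
x = rng.random((8, 5)) * 2.0            # 8 channels, 5 points
w_down = rng.standard_normal((2, 8))    # 8 -> 2 channels
w_up = rng.standard_normal((8, 2))      # 2 -> 8 channels
for mode in ("I", "II", "III"):
    print(mode, residual_feature_block(x, w_down, w_up, mode).shape)
```

In this sketch the output always has the same channel count as the input. Option II keeps the output binary because the shortcut is added before the final neuron, whereas option I can produce the value 2 where both paths spike.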
[0045] In one type of embodiment, the process includes the following steps before performing global feature extraction:
[0046] Local features of the input pulse stream of the spiking neural network are extracted using a local feature extractor;
[0047] Local features are extracted by the local feature extraction unit, and then local features are merged.
[0048] In one embodiment, the global feature extractor includes at least one pair of cascaded first-type convolutional kernels and residual feature blocks.
[0049] In some embodiments, a pooling operation is included after local feature merging.
[0050] In one type of embodiment, prior to merging local features, the method further includes: size reshaping to reshape the sizes of the local features to be the same or consistent.
[0051] The present invention also relates to a second type of event-driven information processing device, comprising:
[0052] A local feature extractor, whose input receives the multiple pulse sequences input to the spiking neural network, and which is used to extract local features;
[0053] The merging module, whose input is coupled to the output of the local feature extractor, is used to merge the features extracted by the local feature extractor.
[0054] A global feature extractor, whose input is coupled to the output of the merging module, is used to extract global features;
[0055] The classification module performs classification based on the extracted global features to obtain the classification results.
[0056] In one type of embodiment, the event-driven information processing device includes:
[0057] One or more of the local feature extractor and the global feature extractor include at least one pair of cascaded first-type convolutional kernels and residual feature blocks, wherein the number of first-type convolutional kernels is equal to or one more than the number of residual feature blocks;
[0058] The residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled with spiking neurons before and after it. The residual feature block uses one of the following residual connection schemes:
[0059] I) The output of the starting spiking neuron is residually connected to the output of the ending spiking neuron;
[0060] II) The output of the starting spiking neuron is residually connected to the input of the ending spiking neuron;
[0061] III) The output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron.
[0062] In one type of embodiment, the input of any residual module is coupled to the output of a first-type convolutional kernel;
[0063] The second type of convolutional kernel includes at least one convolutional kernel for dimensionality increase and at least one convolutional kernel for dimensionality decrease, wherein the output dimension of the residual feature block is the same as the input dimension.
[0064] In one type of embodiment, before global feature extraction, a size reshaping process is included to reshape the sizes of the local features to be the same or consistent.
[0065] In one type of embodiment, a first pooling operation is included after local feature merging.
[0066] In one type of embodiment, a second pooling operation is included after global feature extraction.
[0067] This invention also relates to a feature extractor for use in the neuromorphic field, comprising:
[0068] At least one pair of cascaded first-type convolutional kernels and residual feature blocks;
[0069] The number of first-type convolutional kernels is equal to or one more than the number of residual feature blocks;
[0070] The residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled with spiking neurons before and after it. The residual feature block uses one of the following residual connection schemes:
[0071] I) The output of the starting spiking neuron is residually connected to the output of the ending spiking neuron;
[0072] II) The output of the starting spiking neuron is residually connected to the input of the ending spiking neuron;
[0073] III) The output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron.
[0074] In one type of embodiment, the input of any residual module is coupled to the output of a first-type convolutional kernel;
[0075] The output dimension of any residual module is the same as the input dimension.
[0076] In one type of embodiment, the second type of convolutional kernel includes at least one convolutional kernel for dimensionality increase and at least one convolutional kernel for dimensionality decrease;
[0077] The dimension increase from all increasing-dimensional convolution kernels is the same as the dimension reduction from all decreasing-dimensional convolution kernels.
[0078] In one embodiment, the spiking neuron is a LIF neuron.
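For readers unfamiliar with LIF (leaky integrate-and-fire) neurons, a minimal discrete-time sketch follows. The update rule, the hard reset, and the parameter values (`tau`, `v_threshold`, `v_reset`) are common textbook choices assumed here for illustration; the text does not specify them.

```python
import numpy as np

def lif_forward(currents, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """currents: (T, ...) input per time step -> binary spikes of the same shape."""
    currents = np.asarray(currents, dtype=np.float64)
    v = np.full(currents.shape[1:], v_reset)
    spikes = np.zeros_like(currents)
    for t in range(currents.shape[0]):
        v = v + (currents[t] - (v - v_reset)) / tau   # leaky integration
        fired = v >= v_threshold
        spikes[t] = fired
        v = np.where(fired, v_reset, v)               # hard reset after a spike
    return spikes

out = lif_forward(np.full((6, 1), 1.5))
print(out.ravel())  # [0. 1. 0. 1. 0. 1.]
```

With a constant input of 1.5, the membrane voltage reaches 0.75 after one step and 1.125 after two, so the neuron fires on every second step.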
[0079] The present invention also relates to an event-driven information processing method based on point clouds, wherein the point cloud-based spiking neural network preprocesses the event stream output by the event imaging device using the data conversion method described in the preceding claim to obtain a preprocessed spiking stream.
[0080] Furthermore, based on the preprocessed pulse stream, a classification result is obtained using a spiking neural network as described in the previous item.
[0081] In one type of embodiment, the following steps are included:
[0082] The initial event cloud is segmented using at least one sliding window to obtain the required point cloud data;
[0083] Sampling and grouping are performed within any sliding window to obtain S groups of local point cloud distributions, where S is a positive integer;
[0084] Based on pulse coding, the information of each point is converted into pulses to obtain at least one pulse stream;
[0085] For any S groups of pulse streams corresponding to a sliding window, local feature extraction is performed respectively;
[0086] The extracted S groups of local features are merged and then subjected to the first pooling to obtain a tensor representing the global feature dimension.
[0087] Global features are extracted using a global feature extractor and then a second pooling is performed.
[0088] Classification is performed based on the global features after the second pooling, resulting in an initial classification result corresponding to the sliding window.
[0089] In one type of implementation, for multiple sliding windows, there are multiple initial classification results, and a voting mechanism is further used to obtain the final classification result.
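The voting step over per-window results can be sketched as a simple majority vote. Majority voting is one plausible reading of "voting mechanism" and is an assumption here; the function name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(window_predictions):
    """Return the class label predicted by the largest number of sliding windows."""
    if not window_predictions:
        raise ValueError("at least one window prediction is required")
    return Counter(window_predictions).most_common(1)[0][0]

# Five sliding windows of one sample, each with its initial classification result.
print(majority_vote([3, 3, 7, 3, 1]))  # 3
```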
[0090] In one type of embodiment, S is adaptive, and the specific value of S in any sliding window can be the same or different.
[0091] In one type of embodiment, "segmenting the initial event cloud using at least one sliding window to obtain the required point cloud data" includes the following steps:
[0092] The initial event cloud is segmented using at least one sliding window;
[0093] Transform high-dimensional event cloud data into low-dimensional point cloud data.
[0094] In one embodiment, for any window including the S groups of local point cloud distributions, the corresponding S groups of pulse streams are obtained after pulse coding.
[0095] Some or all of the embodiments of the present invention have the following beneficial technical effects:
[0096] 1) This invention proposes a point cloud-based SNN architecture called SpikePoint. SpikePoint is good at processing raw event cloud data and transmits information through pulses. It has advantages such as high efficiency, low power consumption and parallel processing.
[0097] 2) This invention adopts an event-driven architecture based on point clouds, directly processing raw point cloud data without additional transformations, resulting in good real-time performance and low resource consumption. Simultaneously, it effectively extracts global and local features of the event cloud while maintaining the order of the point cloud, preserving valuable spatiotemporal information and improving the accuracy and performance of action recognition.
[0098] 3) Both the local and global feature extractors of this invention use residual feature modules. Based on residual connections, they efficiently extract features, solving problems such as overfitting and gradient explosion / vanishing. They are also easy to implement in low-power hardware, suitable for edge devices, and capable of real-time action recognition on edge devices. Furthermore, during feature extraction, a bottleneck method is used to achieve a lightweight design, enabling high real-time performance, high accuracy, and low power consumption with a small number of parameters.
[0099] Further beneficial effects will be described in the preferred embodiments.
[0100] The technical solutions / features disclosed above are intended to summarize the technical solutions and features described in the Detailed Embodiments section, and therefore the scope of the description may not be entirely the same. However, these new technical solutions disclosed in this section are also part of the numerous technical solutions disclosed in this invention document. The technical features disclosed in this section, together with the technical features disclosed in the subsequent Detailed Embodiments section and some contents in the drawings not explicitly described in the specification, disclose more technical solutions in a reasonable combination.
[0101] The technical solution formed by combining all the technical features disclosed at any position in this invention is used to support the summary of the technical solution, the modification of the patent document, and the disclosure of the technical solution.
Description of the Drawings
[0102] Figure 1 is a block diagram illustrating the principle of visual recognition based on SpikePoint, the point cloud-based spiking neural network of the present invention;
[0103] Figure 2 is a flowchart of the preprocessing process in the first preferred embodiment of the present invention;
[0104] Figure 3 is a preprocessing flowchart in the second preferred embodiment of the present invention;
[0105] Figure 4 is a flowchart of the spiking neural network processing in the embodiments of the present invention;
[0106] Figure 5 is a flowchart of the spiking neural network processing in another embodiment of the present invention;
[0107] Figure 6 is a flowchart of the spiking neural network processing in another embodiment of the present invention;
[0108] Figure 7 is a preprocessing block diagram in a preferred embodiment of the present invention;
[0109] Figure 8 is a block diagram of a spiking neural network in a preferred embodiment of the present invention;
[0110] Figure 9 is a block diagram of a spiking neural network in another embodiment of the present invention;
[0111] Figure 10 is a block diagram of a spiking neural network in another embodiment of the present invention;
[0112] Figure 11 is a block diagram of a local feature extraction unit in a preferred embodiment of the present invention;
[0113] Figure 12 is a block diagram of a global feature extractor in a preferred embodiment of the present invention;
[0114] Figure 13 is a block diagram of the ResF module in some preferred embodiments of the present invention;
[0115] Figure 14 is a flowchart of event-driven information processing based on point clouds in a preferred embodiment of the present invention.
Detailed Implementation
[0116] Since it is impossible to exhaustively describe all alternative solutions, the key points of the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Other technical solutions and details not disclosed in detail below generally belong to technical objectives or features that can be achieved by conventional means in the art, and due to space limitations, they will not be described in detail here.
[0117] Unless it refers to division, the " / " in any position in this invention represents logical "OR". The serial numbers "first", "second", etc., in any position in this invention are merely descriptive distinguishing marks and do not imply an absolute temporal or spatial order, nor do they imply that terms prefixed with such serial numbers necessarily refer to different things than the same terms prefixed with other modifiers.
[0118] This invention describes various key points used to combine into various specific embodiments, which will be incorporated into various methods and products. In this invention, even if a key point is described only when introducing a method / product solution, it means that the corresponding product / method solution also explicitly includes that technical feature.
[0119] The description of the existence or inclusion of a step, module, or feature at any location in this invention does not imply that such existence is exclusive or unique. Those skilled in the art can obtain other embodiments by supplementing the technical solutions disclosed in this invention with other technical means. The embodiments disclosed in this invention are generally for the purpose of disclosing preferred embodiments, but this does not imply that opposite embodiments of the preferred embodiments are excluded by this invention. As long as such opposite embodiments at least solve one of the technical problems of this invention, they are intended to be covered by this invention. Based on the key points described in the specific embodiments of this invention, those skilled in the art can substitute, delete, add, combine, or change the order of certain technical features to obtain a technical solution that still follows the concept of this invention. These solutions that do not depart from the technical concept of this invention are also within the protection scope of this invention.
[0120] Definitions:
[0121] Event imaging devices are a new type of biomimetic visual sensor, also known as neuromorphic sensors, such as event cameras, dynamic vision sensors (DVS, DAVIS), and event-based imaging fusion sensors. The following discussion will use event cameras as an example, but is not limited to them.
[0122] The event camera captures changes / motion information in the scene, and the output event stream is a time-series data recording changes in image spatial intensity in chronological order. Events in the event stream are based on Address Event Representation (AER) or similar methods. Each event includes the coordinates of the event and the timestamp t (typically accurate to µs / ns), as well as the polarity p of the light intensity change (i.e., brightening or darkening) and / or the photovoltage value of the pixel (i.e., grayscale value). Other representations including coordinate and time information can also be used. In some cases, polarity or grayscale value can be ignored. Furthermore, the coordinates of the event correspond to the sensor dimension. The following description uses a two-dimensional event imaging device and its corresponding action recognition dataset as an example, where the event coordinates are (x, y). However, the sensor can be one-dimensional or three-dimensional, and the dataset corresponding to the sensor can be any other dataset for any purpose; this invention does not limit this.
[0123] Spiking Neural Networks (SNNs), hailed as the third generation of neural networks, mimic the workings of the brain, exhibiting event-driven characteristics and rich spatiotemporal dynamics. SNNs are particularly well-suited for event-based visual applications, uniquely capable of handling this sparse and asynchronous information type by manipulating event-based data in real time without requiring additional processing or filtering.
[0124] Spikes: spike-based communication and the dynamic characteristics of SNNs constitute the most fundamental difference between SNNs and current ANNs or DNNs. SNNs typically use a single-spike mechanism, where each spike has a uniform, fixed unit amplitude. In some cases, however, SNNs use a "multi-spike" mechanism, in which "multiple" spikes can be understood as multiple unit-amplitude spikes superimposed within the same time step. The specific spike amplitude generated by the multi-spike mechanism can be determined based on the ratio of the spiking neuron's membrane voltage to a fixed value (such as a threshold), as described in WO2023284142A1, the relevant content of which is incorporated herein in full. The multi-spike mechanism is more training-friendly, improves training efficiency, and is more robust, which increases the accuracy advantage of SNNs.
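The ratio-based amplitude of the multi-spike mechanism can be illustrated with a short sketch. Rounding the ratio down to an integer number of unit spikes and subtracting the emitted amount from the membrane voltage (a soft reset) are assumptions made here for illustration; they are not taken from WO2023284142A1, and the name `multi_spike` is hypothetical.

```python
import numpy as np

def multi_spike(v, threshold=1.0):
    """Return (spike amplitude, remaining membrane voltage) for one time step."""
    v = np.asarray(v, dtype=np.float64)
    # Number of unit-amplitude spikes superimposed in this time step (assumed: floor).
    amplitude = np.floor(np.maximum(v, 0.0) / threshold)
    # Assumed soft reset: remove the voltage accounted for by the emitted spikes.
    return amplitude, v - amplitude * threshold

amp, v_next = multi_spike([0.4, 1.0, 2.7])
print(amp)  # [0. 1. 2.]
```

A single-spike neuron would emit at most 1 for the third input and discard the excess, whereas the multi-spike variant conveys that the voltage exceeded the threshold more than twice over.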
[0125] This invention presents a novel spiking neural network architecture based on point cloud processing, abbreviated as SpikePoint. It excels at processing raw event stream data, maintaining high efficiency through spike communication while keeping the design lightweight. SpikePoint can effectively extract global and / or local features from event clouds while preserving the invariant order of the point cloud. During feature extraction, a bottleneck method is used to effectively reduce model parameters and avoid overfitting, achieving high accuracy while reducing power consumption. Furthermore, based on residual connections, it efficiently extracts features, further addressing issues such as overfitting, gradient explosion / vanishing, and is easily implemented in low-power hardware.
[0126] Figure 1 is a block diagram illustrating the principle of visual recognition based on SpikePoint, the point cloud-based spiking neural network of this invention. The output of the event imaging device is also called the event stream, raw event cloud, initial event cloud, or raw point cloud, and is referred to as the initial event cloud in the following text.
[0127] SpikePoint comprises preprocessing and a spiking neural network (SNN). The initial event cloud is preprocessed to obtain the spike stream required by the SNN, wherein the spike stream consists of one or more spikes. The SNN then performs calculations based on the preprocessed spike stream to obtain the classification result.
[0128] Figure 2 is a flowchart of the preprocessing process in the first preferred embodiment of the present invention, wherein preprocessing, also known as data transformation, specifically includes the following steps:
[0129] S2001. Use a sliding window on the timeline to segment the initial event cloud, which can also be called clipping or slicing.
[0130] The initial event cloud comprises a large number of events output by the event imaging device; each event can be represented as $e_m = (x_m, y_m, t_m, p_m)$, where $m$ is the sequence number of the current event, $(x_m, y_m)$ are the coordinates of the current event, and $t_m$ and $p_m$ are the timestamp and polarity of the current event, respectively. In some cases, polarity is ignored. Although this invention uses a 4D spatiotemporal representation of the events output by the event imaging device as an example, it is not limited to this; any representation is sufficient as long as it includes the event's time information ($t_m$) and spatial information (coordinates).
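One way to hold such events in memory is a NumPy structured array, sketched below. The field names, integer widths, and the ±1 polarity encoding are assumptions for illustration, as is the helper `to_point_cloud`, which optionally drops polarity as the text allows.

```python
import numpy as np

# Hypothetical field layout for one event (x, y, t, p); names and dtypes are assumptions.
EVENT_DTYPE = np.dtype([("x", np.uint16), ("y", np.uint16), ("t", np.int64), ("p", np.int8)])

events = np.array(
    [(12, 40, 1_000, 1), (13, 40, 1_250, -1), (90, 7, 1_900, 1)],
    dtype=EVENT_DTYPE,
)

def to_point_cloud(events, use_polarity=False):
    """Stack event fields into an (N, 3) or (N, 4) float array of (x, y, t[, p])."""
    fields = ["x", "y", "t"] + (["p"] if use_polarity else [])
    return np.stack([events[f].astype(np.float64) for f in fields], axis=1)

print(to_point_cloud(events).shape)                     # (3, 3)
print(to_point_cloud(events, use_polarity=True).shape)  # (3, 4)
```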
[0131] In most datasets, actions may be repeated within a single sample. A sliding window can separate a single action from a set of repetitive actions throughout the sample. The size of the sliding window is proportional to the complexity of the actions in the action recognition dataset; the more complex the actions, the longer the length of each sample in the dataset, and the larger the sliding window, and vice versa. Preferably, a sliding window includes only a single action.
[0132] In some embodiments, adjacent sliding windows do not overlap.
[0133] In other embodiments, two consecutive windows have overlapping regions. The extent of the overlapping region can be set according to actual needs. Preferably, the size of the overlapping region is determined based on one or more of the following factors: the difference in data between adjacent sliding windows, the number of sliding windows, and the overall data volume after slicing by the sliding windows. Specifically, the larger the overlapping region, the smaller the difference in data between adjacent sliding windows; for the same action recognition dataset, the more sliding windows there are, the larger the overlapping region; furthermore, for any dataset, the more sliding windows there are, the larger the overall data volume after slicing.
[0134] In a preferred embodiment, the overlapping area of two adjacent sliding windows is set to at most half the window length. For example, if the length of the sliding window is 0.5 s or 1.5 s, the overlapping area of two consecutive windows is set to 0.25 s (one half) or 0.5 s (one third), respectively.
[0135] For example, an AR dataset used for action recognition can be represented as:
[0136] $$AR_{raw} = \{\, e_m = (x_m, y_m, t_m, p_m) \mid m = 1, \ldots, n \,\} \tag{1}$$
[0137] Here, $AR_{raw}$ denotes the set of event data contained in the dataset, $n$ is the number of events in the set, $m$ is the sequence number of the current event, $(x_m, y_m)$ are the coordinates of the current event, and $t_m$ and $p_m$ are its timestamp and polarity, respectively.
[0138] Let the length of the sliding window be $L$ and the number of sliding windows be $n_{win}$. The data within any sliding window after slicing, $AR_{clip}$, can be represented as:
[0139] $$AR_{clip} = clip_i\{e_{k \to l}\} \mid i \in (1, n_{win}) \mid t_l - t_k = L \tag{2}$$
[0140] Here, $clip_i\{e_{k \to l}\}$ indicates that the $i$-th sliding window includes the events numbered $k$ to $l$ ($e_k$ to $e_l$), $t_k$ and $t_l$ are the timestamps of events $e_k$ and $e_l$, respectively, and the length of the $i$-th sliding window is $L = t_l - t_k$.
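The slicing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `slice_events`, the integer microsecond timestamps, and the choice to keep only windows that fit entirely inside the recording are all assumptions made for the example.

```python
from typing import List, Tuple

# (x, y, t, p); t is assumed to be an integer timestamp in microseconds.
Event = Tuple[int, int, int, int]


def slice_events(events: List[Event], length: int, overlap: int = 0) -> List[List[Event]]:
    """Cut a time-sorted event list into windows of duration `length`.

    Consecutive windows start `length - overlap` apart, so overlap=0 yields
    disjoint windows. Only windows that fit inside the recording are kept.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    if not events:
        return []
    stride = length - overlap
    t_first, t_last = events[0][2], events[-1][2]
    clips = []
    start = t_first
    while start + length <= t_last:
        # Half-open interval [start, start + length).
        clips.append([e for e in events if start <= e[2] < start + length])
        start += stride
    return clips


# One event every 0.1 s over a 1 s recording.
events = [(0, 0, t, 1) for t in range(0, 1_000_001, 100_000)]
# 0.5 s windows with a 0.25 s overlap (half the window length).
clips = slice_events(events, length=500_000, overlap=250_000)
```

With the same recording, a larger overlap produces more windows and therefore more data after slicing, which matches the trade-off described for the overlapping region.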
[0141] In some embodiments, the length of the sliding window used for each sample in the same dataset is equal or approximately equal.
[0142] In other embodiments, because the lengths of different samples in the same dataset may differ, the length of the sliding window is adaptive and inversely related to the speed of the actions in the corresponding sample: the faster the action in a sample, the shorter the corresponding sliding window. Alternatively, the window length adapts within a sample: the faster the action in a given time period, the shorter the sliding window for that period, and the slower the action in another period of the same sample, the longer the sliding window for that period. Preferably, the sliding window size is set based on the number of events within the sliding window; for example, each sliding window contains an equal or approximately equal number of events.
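A sketch of the count-based variant is given below. It assumes that the event rate is a usable proxy for action speed (fast motion produces events more densely in time), so fixing the number of events per window makes the window duration shrink automatically during fast motion. The names `slice_by_event_count` and `duration` are illustrative.

```python
def slice_by_event_count(events, events_per_window):
    """Split a time-sorted event list into windows holding equal event counts.

    A trailing remainder shorter than `events_per_window` is dropped.
    """
    if events_per_window <= 0:
        raise ValueError("events_per_window must be positive")
    full = len(events) // events_per_window
    return [
        events[i * events_per_window:(i + 1) * events_per_window]
        for i in range(full)
    ]


def duration(clip):
    """Time span covered by one window (last timestamp minus first)."""
    return clip[-1][2] - clip[0][2]


# Dense events first (fast motion), sparse events later (slow motion).
timestamps = [0, 1, 2, 3, 10, 20, 30, 40]
events = [(0, 0, t) for t in timestamps]
clips = slice_by_event_count(events, 4)
durations = [duration(c) for c in clips]
```

In this toy input the first window spans 3 time units and the second spans 30, although both contain four events.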
[0143] S2002. Convert the high-dimensional initial event cloud data into low-dimensional point cloud data, and / or normalize the spatiotemporal information.
[0144] Event imaging devices output 4D spatiotemporal events. To apply point cloud methods, high-dimensional (e.g., 4D) spatiotemporal events need to be converted into low-dimensional (e.g., 3D) spatiotemporal data, for example, by removing polarity or grayscale values and retaining only temporal and spatial information. This invention uses 4D events as an example, but is not limited to this. If the initial event cloud itself contains only temporal and spatial information, such as $e_m = (x_m, y_m, t_m)$, the step of converting the high-dimensional initial event cloud data into low-dimensional point cloud data is skipped.
[0145] Meanwhile, since the length of each sample in the action recognition dataset is different, the sampling range is normalized within each sliding window, so that the coordinates $x_m$, $y_m$ and the timestamp $t_m$ are all normalized to [0,1]. $x_m$, $y_m$, and $t_m$ are each normalized separately, yielding the normalized point cloud dataset within each sliding window:
[0146] $$AR_{point} = \{\, e'_m = (x'_m, y'_m, t'_m) \mid m = 1, 2, \ldots, n \,\}$$
[0147]
[0148]
[0149]
[0150] Here, $AR_{point}$ denotes the point cloud dataset within the sliding window; $e_m = (x_m, y_m, t_m)$ and $e'_m = (x'_m, y'_m, t'_m)$ denote the current event and its corresponding normalized event, respectively; $t_m$ is the timestamp of the current event; and $t_n$ and $t_0$ are the end point and starting point of the current sliding window, respectively. The x-coordinate is normalized by the difference between the maximum and minimum x-coordinates of all events within the current sliding window, and the y-coordinate is normalized by the difference between the maximum and minimum y-coordinates of all events within the current sliding window.
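The normalization formulas themselves did not survive in this text, so the following is only a plausible reading of the definitions above, assuming ordinary min-max scaling: each of x, y, and t is shifted by its minimum within the window and divided by its range. The function names, and the use of the first and last event timestamps as the window start and end, are assumptions.

```python
def min_max(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    span = hi - lo
    # A degenerate range (all values equal) is mapped to 0.0 by convention.
    return [(v - lo) / span if span > 0 else 0.0 for v in values]


def normalize_clip(clip):
    """Normalize the x, y and t of every event in one window to [0, 1].

    `clip` is a list of (x, y, t) tuples; each axis is scaled independently.
    """
    if not clip:
        return []
    xs, ys, ts = zip(*clip)
    nx = min_max(xs, min(xs), max(xs))
    ny = min_max(ys, min(ys), max(ys))
    # Assumption: the window spans from its earliest to its latest event.
    nt = min_max(ts, min(ts), max(ts))
    return list(zip(nx, ny, nt))


clip = [(10, 20, 100), (20, 40, 150), (30, 30, 200)]
normalized = normalize_clip(clip)
```

Normalizing per window rather than over the whole recording keeps every window on the same [0,1] scale regardless of sample length.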
[0151] The steps of converting high-dimensional initial event cloud data into low-dimensional point cloud data, and / or normalizing spatiotemporal information to [0,1], can occur after using a sliding window to segment the initial event cloud, or the parameters of each dimension in the initial event cloud can be normalized before using a sliding window to segment.
[0152] In other embodiments, the execution order of the steps of using a sliding window to segment the initial event cloud, transforming the high-dimensional initial event cloud data into low-dimensional point cloud data, and normalizing the spatiotemporal information is not unique and can be executed in any reasonable order. Furthermore, none of these steps are mandatory and can be adjusted according to actual conditions. For example, normalization can be performed first, followed by dimensionality reduction, and then segmentation. Alternatively, dimensionality reduction and normalization can be performed without using sliding window segmentation. This invention does not limit the scope of these steps. When the implementation scheme does not include the sliding window segmentation step, the initial event cloud data is processed directly, and correspondingly, subsequent steps are no longer performed on the point cloud dataset within each sliding window. Moreover, for the sake of brevity, subsequent embodiments involving these steps have the same flexibility as this embodiment. The use of any of these steps and the order of the steps can be adjusted according to actual performance requirements, and will not be repeated hereafter.
[0153] However, it is preferable to use a sliding window to segment the initial event cloud and then perform normalization processing. This method can retain more valuable information and make the SNN perform better.
[0154] S2003: Map the point cloud dataset within the sliding window from the global point cloud distribution to the local point cloud distribution.
[0155] To map the point cloud dataset from a global point cloud distribution to a local point cloud distribution, a training or test set is further constructed using sampling and grouping. Step S2003 further includes the following two steps:
[0156] First, the point cloud data within the sliding window is sampled to obtain S key points;
[0157] In some embodiments, random sampling is used to select S key points. In some preferred embodiments, the farthest point sampling (FPS) method is used to select S key points (centroids), which differs from random sampling in that it places greater emphasis on ensuring the spatial uniformity or consistency of the sampling points.
[0158] Because the point cloud dataset obtained from each sliding window $AR_{clip}$ typically contains tens of thousands of points, in a preferred embodiment the dimension of the $AR_{clip}$ input data is reduced from [$N_{clip}$,3] to [N,3] in order to standardize the number of network parameters, where $N_{clip}$ is the number of points in the $AR_{clip}$ point cloud data and N is preferably a power of 2. For example, N is usually chosen from 256, 512, 1024, and 2048. The 3 denotes 3D point cloud data, i.e., $e_m = (x_m, y_m, t_m)$ or $e'_m = (x'_m, y'_m, t'_m)$.
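Farthest point sampling can be sketched in a few lines. This is the standard greedy formulation, shown only for illustration; the function names and the choice of the first point as the default starting centroid are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def farthest_point_sampling(points, s, start=0):
    """Return the indices of `s` key points chosen by greedy FPS.

    Each new key point is the point farthest from all key points chosen
    so far, which spreads the key points evenly over the cloud.
    """
    n = len(points)
    if not 0 < s <= n:
        raise ValueError("need 0 < s <= number of points")
    chosen = [start]
    # nearest[i] = squared distance from point i to its closest key point.
    nearest = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < s:
        nxt = max(range(n), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], sq_dist(p, points[nxt]))
    return chosen


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (10, 10, 10), (11, 10, 10)]
key_ids = farthest_point_sampling(points, 3)
```

Unlike random sampling, the greedy rule never picks two key points from the same tight cluster while a distant region is still unrepresented.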
[0159] Second, find the points around each key point that belong to the same group as that key point.
[0160] In some embodiments, the K-nearest neighbor method (KNN) is used to identify N' nearest neighbors around a keypoint.
[0161] Furthermore, Euclidean distance can be used as a distance metric, such as selecting the N' points in space that are closest to the keypoint. Note that when using Euclidean distance, each parameter / eigenvalue needs to be standardized or normalized.
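A brute-force version of the KNN grouping is sketched below, assuming the coordinates have already been normalized so that Euclidean distance is meaningful across x, y, and t. The name `knn_group` and the nested-list layout (one list of N' points per key point) are choices made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def knn_group(points, key_ids, k):
    """For each key point, collect its k nearest points (itself included).

    Returns a list with one group per key point; each group holds k points.
    """
    groups = []
    for c in key_ids:
        order = sorted(range(len(points)), key=lambda i: sq_dist(points[i], points[c]))
        groups.append([points[i] for i in order[:k]])
    return groups


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (10, 10, 10), (11, 10, 10)]
groups = knn_group(points, key_ids=[0, 4], k=2)
```

Sorting all points per key point costs O(N log N) per group; a spatial index would be preferable for clouds with thousands of points, but the result is the same.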
[0162] In some other preferred embodiments, a sphere of radius r is drawn in the spatial domain around the key point, and N' points are uniformly or randomly selected within that sphere to obtain the group of points around the key point.
[0163] Since the local features learned by SNN are closely related to geometric distance, the local information represented by the N' neighboring points around the key point is not the most complete or perfect. Therefore, in some preferred embodiments, N' points are uniformly or randomly selected within the sphere around the key point.
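The sphere-based selection can be sketched as follows. The text does not say what happens when fewer than N' points fall inside the sphere, so the padding by resampling used here is an assumption, as are the function name and the seeded random generator.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def ball_query(points, center, radius, k, seed=0):
    """Randomly pick k points lying within `radius` of `center`.

    If fewer than k points are inside the sphere, the group is padded by
    resampling the inside points (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    inside = [p for p in points if sq_dist(p, center) <= radius ** 2]
    if not inside:
        raise ValueError("no points inside the sphere")
    if len(inside) >= k:
        return rng.sample(inside, k)
    return inside + [rng.choice(inside) for _ in range(k - len(inside))]


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (10, 10, 10), (11, 10, 10)]
group = ball_query(points, center=(0, 0, 0), radius=2.5, k=2)
padded = ball_query(points, center=(0, 0, 0), radius=2.5, k=5)
```

Compared with KNN, every selected point is guaranteed to lie within a fixed geometric distance of the key point, which is the property the paragraph above relies on.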
[0164] After sampling and grouping, the input data of each $AR_{clip}$ is converted from [N,3] to [N',S,3], where S is the number of groups (key points) into which the point cloud is divided, N' is the number of points (events) in each group, which can be set according to actual needs, and 3 represents the three dimensions [x,y,t].
[0165] In a preferred embodiment, in order to improve the performance of the SNN and make the point cloud dataset obtained after sampling and grouping represent the local features more completely, the key point coordinates, such as [x0, y0, t0], are appended to each point, making the input of the SNN 6-dimensional, such as [x, y, t, x0, y0, t0]. In this way, the performance of the SNN can be greatly improved with only a negligible increase in the number of parameters.
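Appending the key point coordinates to every point in its group is a simple concatenation. The sketch below assumes groups are stored as lists of (x, y, t) tuples, one list per key point; the function name is illustrative.

```python
def add_key_point_features(groups, key_points):
    """Turn each (x, y, t) into (x, y, t, x0, y0, t0) using its group's key point."""
    if len(groups) != len(key_points):
        raise ValueError("one key point is required per group")
    return [
        [point + key for point in group]  # tuple concatenation: 3 + 3 = 6 values
        for group, key in zip(groups, key_points)
    ]


groups = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9), (7, 7, 7)]]
key_points = [(1, 2, 3), (7, 8, 9)]
six_d = add_key_point_features(groups, key_points)
```

Only the width of the first layer's input changes (3 to 6), which is why the added parameter count is small.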
[0166] S2004. Convert the information of each point into a spike through pulse coding.
[0167] Typically, the event information output by an event imaging device is a floating-point number, so pulse coding is required to quantize it and convert it into a binary pulse form.
[0168] Pulse coding includes various methods, such as rate coding, temporal coding, bursting coding, and population coding. To avoid loss or excessive compression of input information, in a preferred embodiment, rate coding is used to convert the information at each point into a pulse.
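One common way to realize rate coding, shown here as an assumption rather than as the method of this invention, is Bernoulli sampling: a normalized value in [0,1] is used as the probability of emitting a spike at each time step, so the average firing rate over many steps approximates the value.

```python
import random


def rate_encode(values, timesteps, seed=0):
    """Encode values in [0, 1] as binary spike trains.

    Returns `timesteps` rows; row t holds one 0/1 spike per input value.
    """
    rng = random.Random(seed)
    return [
        [1 if rng.random() < v else 0 for v in values]
        for _ in range(timesteps)
    ]


def firing_rate(spike_train, index):
    """Fraction of time steps at which input `index` emitted a spike."""
    return sum(row[index] for row in spike_train) / len(spike_train)


# A normalized point (x', y', t') encoded over 2000 time steps.
spikes = rate_encode([0.0, 0.5, 1.0], timesteps=2000)
```

More time steps give a finer approximation of the original floating-point value at the cost of more computation, which is the usual trade-off of rate coding.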
[0169] Pulse coding can be performed on some or all of the sliding window and / or S groups of point cloud data to obtain a preprocessed pulse stream. Furthermore, since the datasets of event imaging devices typically contain a large amount of noise during actual recording, in some embodiments, the initial event cloud can be an event stream obtained by noise reduction of the data acquired by the event imaging device. In other embodiments, statistical methods can also be used to select data from different datasets after a certain period or within a valid time period for the preprocessing operations described above.
[0170] Figure 3 is a preprocessing flowchart in a second preferred embodiment of the present invention, including the following steps:
[0171] S3001. Use a sliding window on the timeline to segment the initial event cloud.
[0172] S3002. Convert the high-dimensional initial event cloud data into low-dimensional point cloud data, and / or normalize the spatiotemporal information.
[0173] S3003. Sample the point cloud data within the sliding window to obtain S key points.
[0174] S3004. Convert the information of each point into a spike through pulse coding.
[0175] To save space, the specific implementation details of each step are as described above. Similarly, in other extensions of the second preferred embodiment, the initial event cloud segmentation, the transformation of high-dimensional initial event cloud data into low-dimensional point cloud data, and the normalization processing can be omitted or selected as needed, and these steps can be performed in any order. When the step of segmenting the initial event cloud using a sliding window is not included, subsequent steps such as S3003 directly ignore the consideration of the sliding window, directly sample based on the aforementioned (normalized) point cloud data to obtain S key points, and perform pulse coding on the S key points to obtain a preprocessed pulse stream.
[0176] Since this method does not map the global point cloud distribution to the local point cloud distribution, the preprocessing in this embodiment is not as complete in representing local features as in the first preferred embodiment. As a result, the SNN can obtain very little local information, and therefore, the overall performance is not as good as the preprocessing method in the first preferred embodiment.
[0177] Figure 4 is a flowchart of the spiking neural network (SNN) processing in a preferred embodiment of the present invention, showing the SNN performing local and global feature extraction followed by classification. The SNN processing includes the following steps:
[0178] S4101. Use a local feature extractor to extract local features of at least one input pulse stream, thereby increasing the point cloud dimension.
[0179] The local feature extractor includes at least one local feature extraction unit. In a preferred embodiment, the number of local feature extraction units corresponds to the number of groups S. For example, after processing by the local feature extractor, the point cloud data is raised from [N',S,D] to [N',S,D2], where the dimension D2 is greater than D.
[0180] S4102. Merge the local features extracted by the local feature extraction unit to obtain a tensor representing the global feature dimension.
[0181] In some embodiments, prior to merging local features, a size reshaping is further included to reshape the sizes of the local features to be the same or consistent.
[0182] In other embodiments, to further reduce the number of parameters, pooling is performed before or after local feature merging. In a preferred embodiment, the pooling is max pooling. In other embodiments, the pooling can also be average pooling, sum pooling, or other forms.
[0183] In a preferred embodiment, the local features extracted by the local feature extraction unit are reshaped and then merged together. Then, based on max pooling, a tensor [S,D2] representing the global feature dimension is obtained.
[0184] In addition, one or more of the size reshaping, merging, and pooling operation steps and their corresponding order can be selected according to actual needs.
[0185] S4103. Use the global feature extractor to extract global features. For example, after the global feature extractor, the data dimension increases from D2 to D4, and at this time, the tensor [S,D4] is obtained.
[0186] In a preferred embodiment, a pooling operation is further employed to abstract the features after sequentially passing through the local feature extractor and the global feature extractor into a tensor [1,D4], i.e., a single feature vector.
[0187] S4104. The classifier classifies based on a single feature vector to obtain the initial classification result.
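The two pooling operations in steps S4102 and S4103 can be illustrated on plain nested lists. The sketch assumes the features are stored as [S][N'][D2] (one list of per-point feature vectors per group) and uses max pooling as in the preferred embodiment; the function names are illustrative.

```python
def pool_over_points(group_features):
    """Max-pool each group over its points: [S][N'][D2] -> [S][D2]."""
    return [[max(column) for column in zip(*group)] for group in group_features]


def pool_over_groups(features):
    """Max-pool over all groups: [S][D] -> [D], a single feature vector."""
    return [max(column) for column in zip(*features)]


# S = 2 groups, N' = 2 points per group, D2 = 2 features per point.
local_features = [
    [[1, 5], [3, 2]],
    [[0, 0], [4, 1]],
]
per_group = pool_over_points(local_features)  # tensor of shape [S, D2]
single_vector = pool_over_groups(per_group)   # shape [1, D] once flattened
```

In the full flow a global feature extractor would raise the dimension from D2 to D4 between the two pooling calls; it is omitted here so that the shape changes are easy to follow.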
[0188] The input pulse stream of the SNN can be one or more pulse streams; this invention does not limit this. For brevity, the same applies elsewhere in this description, as long as the corresponding logic is satisfied. If the input of the SNN involves one or more pulse streams originating from multiple sliding windows, the initial classification result obtained by the SNN in Figure 4 is the initial classification result corresponding to the point cloud data in each sliding window. In this case, preprocessing is performed per sliding window to obtain S groups of local point cloud data and their corresponding pulse streams. The SNN first performs local feature extraction and then global feature extraction on the pulse streams corresponding to each group of point cloud data to obtain a tensor representing the global feature dimension. After pooling, the classifier obtains the initial classification result corresponding to the sliding window.
[0189] For multiple sliding windows corresponding to multiple initial classification results, the classifier further uses a voting mechanism to obtain the final classification result. The voting mechanism improves classification performance, as more samples determine the final output. In a preferred embodiment, the number of neurons in the last layer of the classifier is 5 times or more, preferably 10 times, the number of classification categories.
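A minimal sketch of the two-stage decision follows. It assumes that the output neurons are arranged in equal contiguous groups, one group per class (for example 10 neurons per class), and that the per-window result is the class whose group fired most; the text states only the neuron count and the use of voting, so this layout and both function names are assumptions.

```python
from collections import Counter


def decode_window(spike_counts, num_classes):
    """Pick the class whose group of output neurons fired the most.

    `spike_counts` holds one spike count per output neuron; neurons are
    assumed to be laid out as equal contiguous groups, one per class.
    """
    if len(spike_counts) % num_classes != 0:
        raise ValueError("neuron count must be a multiple of num_classes")
    size = len(spike_counts) // num_classes
    totals = [sum(spike_counts[c * size:(c + 1) * size]) for c in range(num_classes)]
    return max(range(num_classes), key=totals.__getitem__)


def vote(window_results):
    """Majority vote over the initial results of all sliding windows."""
    return Counter(window_results).most_common(1)[0][0]


# Three classes, two output neurons per class, one sliding window.
initial = decode_window([0, 1, 3, 4, 1, 0], num_classes=3)
# Initial results from five sliding windows of the same sample.
final = vote([2, 1, 2, 0, 2])
```

A single noisy window can be outvoted by the others, which is why the voting step tends to improve accuracy over per-window classification.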
[0190] Figure 5 is a flowchart of a spiking neural network (SNN) process in another embodiment of the present invention, where the SNN performs only local feature extraction before classification. This processing flow includes the following steps:
[0191] S5101. Use a local feature extractor to extract local features of at least one input pulse stream, thereby increasing the point cloud dimension.
[0192] S5102. Merge the local features extracted by the local feature extraction unit to obtain a tensor representing the global feature dimension, for example, to obtain a tensor [S,D'2] representing the global feature dimension. Similarly, as mentioned above, one or more of the size reshaping, merging, and pooling operation steps and their corresponding order can be selected according to actual needs.
[0193] Here, dimension D'2 can be greater than or equal to dimension D2 in the aforementioned steps.
[0194] In a preferred embodiment, a pooling operation is further employed to abstract the features after the local feature extractor into a single feature vector, such as [1, D'2].
[0195] S5103. The classifier classifies based on a single feature vector to obtain the initial classification result.
[0196] Furthermore, for multiple sliding windows corresponding to multiple initial classification results, the classifier further uses a voting mechanism to obtain the final classification result.
[0197] To save space, the specific implementation details and optional embodiments of each step are as described above.
[0198] Here, the SNN processing flows shown in Figure 4 and Figure 5 correspond to the preprocessing steps of the first preferred embodiment shown in Figure 2.
[0199] Figure 6 is a flowchart of a spiking neural network (SNN) process in another embodiment of the present invention, where the SNN only performs global feature extraction and classification. The processing flow includes the following steps:
[0200] S6101. Based on all input pulse streams, global features are extracted using a global feature extractor to increase the point cloud dimension.
[0201] For example, we obtain a tensor [S,D'4] representing the global feature dimension. Here, dimension D'4 can be greater than or equal to dimension D4 in the previous steps.
[0202] In a preferred embodiment, a pooling operation is further used to abstract the features after the global feature extractor into a single feature vector, such as [1,D'4].
[0203] S6102. The classifier classifies based on a single feature vector to obtain the initial classification result.
[0204] Furthermore, for multiple sliding windows corresponding to multiple initial classification results, the classifier further uses a voting mechanism to obtain the final classification result.
[0205] To save space, the specific implementation details and optional implementations for each step are as described above.
[0206] Here, the SNN processing flow shown in Figure 6 corresponds to the preprocessing steps of the second preferred embodiment shown in Figure 3.
[0207] Figure 7 is a preprocessing block diagram in a preferred embodiment of the present invention, including a segmentation module, a sampling and grouping module, and a pulse coding module coupled in sequence.
[0208] The segmentation module is used to segment the initial event cloud using a sliding window on the timeline.
[0209] Optionally, when segmenting the initial event cloud, the high-dimensional spatiotemporal events (such as [x,y,t,p]) are converted into point cloud data format (such as [x,y,z]).
[0210] Optionally, the segmentation module normalizes the parameters of each dimension before or after using a sliding window to segment the initial event cloud.
[0211] The sampling and grouping module is used to sample and group the point cloud data within each sliding window, thereby mapping the point cloud dataset within the sliding window from a global point cloud distribution to a local point cloud distribution, thus constructing a standardized training set or test set. In this case, the sampling and grouping module includes at least one sampling and grouping unit, each used to sample and group the point cloud data within a sliding window.
[0212] Preferably, the sampling unit is used to select S key points, where S is a positive integer, and the grouping unit, coupled to the sampling unit, finds the points belonging to the same group (e.g., N' points) around each key point.
[0213] Optionally, the sampling and grouping module selects the data of only some of the sliding windows for sampling and grouping, meaning the number of sampling and grouping units is smaller than the number of sliding windows. For example, sample data within a time period with low noise and high quality can be selected to maintain the accuracy of the SNN while reducing the number of parameters.
[0214] Optionally, the number of sampling and grouping units in the sampling and grouping module is the same as the number of sliding windows.
[0215] The pulse encoding module performs pulse encoding, converting the information of each point into a pulse.
[0216] In some embodiments, the segmentation module may be omitted. Instead, the sampling and grouping module samples and groups the (normalized) point cloud data corresponding to the initial event cloud, such as obtaining S groups, each group including N' point cloud data.
[0217] In other embodiments, the sampling and grouping module can be replaced by a sampling module, which is only used for sampling, and in this case, the operation of mapping the global point cloud distribution to the local point cloud distribution is not performed.
[0218] Figure 8 is a block diagram of a spiking neural network in a preferred embodiment of the present invention. The spiking neural network of the present invention includes multiple layers (at least two layers), each layer including at least one spiking neuron. The layers of the SNN can be partitioned according to actual needs to implement different functions. In the SpikePoint of the present invention, the layers of the SNN implement at least the functions shown in Figure 5: a local feature extractor, a global feature extractor, and a classifier, which perform feature extraction in a hierarchical manner.
[0219] The classic point cloud processing method PointNet utilizes a multilayer perceptron (MLP) to increase the dimensionality of point clouds. While this method is effective for summarizing information from point cloud datasets, it incurs very high network bandwidth and suffers from overfitting during training. Furthermore, given the problems of gradient explosion and vanishing gradients inherent in SNNs, as well as the integral characteristics of spiking neurons, a suitable spiking-based feature extractor is essential for point cloud processing methods.
[0220] The local feature extractor is coupled to the output of the pulse coding module. It includes multiple local feature extraction units, each extracting local features from one group of data and increasing the dimensionality of the point cloud data corresponding to that group. In a preferred embodiment, the number of local feature extraction units corresponds to the number of groups, S.
[0221] The merging module, whose input is coupled to the output of the local feature extractor, merges the local features extracted by the local feature extraction unit to obtain a tensor representing the global feature dimension.
[0222] Optionally, the SNN also includes a reshaping module for resizing the features extracted by the local feature extraction unit. The reshaping operation can be performed before or after the merging operation.
[0223] Optionally, a pooling operation is further employed to abstract the features after the local feature extractor into a single feature vector. Preferably, the pooling operation is located after the merging module and is used to perform pooling operations on the merged features.
[0224] Preferably, the reshaping module resizes the local features extracted by the local feature extractor, and the input of the merging module is coupled to the output of the reshaping module to merge the reshaped local features to obtain a tensor representing the global feature dimension.
[0225] A global feature extractor, whose input is coupled to the output of the merging module, is used to extract global features, and the data dimensionality is further increased based on the global feature extractor.
[0226] Classification module: Performs inference based on extracted global features to obtain classification results.
[0227] Optionally, if the input pulse stream of the SNN includes preprocessed data corresponding to multiple sliding windows, the classification module first processes the preprocessed data corresponding to each sliding window to obtain an initial classification result. Multiple sliding windows correspond to multiple initial classification results. The classifier further uses a voting mechanism to obtain the final classification result.
[0228] To save space, the specific implementation details and optional embodiments of each module are described in the aforementioned method flow steps.
[0229] Figure 9 is a block diagram of a spiking neural network (SNN) in another embodiment of the present invention. In this embodiment, the SNN only performs local feature extraction. It includes a local feature extractor, a merging module, and a classification module coupled in sequence. This implementation simplifies the processing flow, but due to the lack of global features, its performance is significantly inferior to that of the embodiment corresponding to Figure 8.
[0230] Figure 10 is a block diagram of a spiking neural network in another embodiment of the present invention. In this embodiment, the SNN only performs global feature extraction. It includes a global feature extractor and a classification module coupled in sequence.
[0231] Comparing the embodiments of Figures 8 to 10, the performance ranking is as follows: the embodiment of Figure 8 is the best, followed by the embodiment of Figure 10, and finally the embodiment of Figure 9.
[0232] Figure 11 is a block diagram of a local feature extraction unit in a preferred embodiment of the present invention, which includes at least one convolution to increase the input dimension. The dimension increase of each convolution can be set according to actual needs. Due to the special nature of point cloud networks, the dimension-raising operations typically employ one-dimensional convolution.
[0233] To avoid overfitting, gradient explosion, and vanishing gradients during training, this invention includes at least a Residual Feature (ResF) module. The ResF module is coupled to the corresponding convolution, for example, a ResF module (1111, 1112) is coupled after each convolution (1101, 1102), or ResF modules coupled or connected to the corresponding convolutions are added after any number of convolutions. In a preferred embodiment, the number of ResF modules is the same as the number of convolutions, with one ResF module coupled after each convolution.
[0234] As shown in Figure 11, the local feature extraction unit includes two convolutions 1101 and 1102 and ResF modules 1111 and 1112 coupled to them, respectively. After processing by this local feature extraction unit, the dimension changes from D to D2.
[0235] To maintain the advantages of the ResF module while ensuring a lightweight design and avoiding excessive parameter increases, this invention employs a bottleneck approach to ensure that the dimensionality of features remains unchanged before and after the residual connections. That is, the input and output dimensions of ResF are the same. For example, the dimensionality of the input is first reduced, and then increased back to its original dimension, or vice versa. Therefore, the ResF module includes at least one convolutional kernel for dimensionality reduction and one convolutional kernel for dimensionality increase.
[0236] Figure 12 is a block diagram of a global feature extractor in a preferred embodiment of the present invention, which includes at least one convolution to increase the input dimension and at least one ResF module, wherein a ResF module is coupled after any of the convolutions. Similarly, the dimension increase of each convolution can be set according to actual needs.
[0237] Optionally, each convolution is followed by a ResF module coupled to it, as shown in Figure 12(a).
[0238] Preferably, in order to further reduce the number of parameters and keep the network lightweight, the output of the last convolution is used directly as the output of the global feature extractor; that is, the ResF module coupled to the output of the last convolution is omitted.
[0239] Figure 13 shows block diagrams of the ResF module in some preferred embodiments of the present invention. Figure 13(a) and Figure 13(b) show ResF module circuit structures using different residual connection methods. Figure 13(c) shows a ResF module circuit structure that does not use a residual connection, or the case where the last convolution stage is no longer coupled to a ResF module.
[0240] In Figure 13(a) to Figure 13(c), the ResF module includes spiking neurons 1311, 1312, and 1313 and convolutional kernels 1301 and 1302. In the ResF module of Figure 13(a), each convolution has a spiking neuron coupled before and after it, and the output of the starting spiking neuron 1311 is residually connected (summed) to the output of the ending spiking neuron 1313. In the ResF module of Figure 13(b), each convolution likewise has a spiking neuron coupled before and after it, and the output of the starting spiking neuron 1311 is residually connected (summed) to the input of the ending spiking neuron 1313.
[0241] Furthermore, in the ResF module, the dimension changes after convolution 1301 and then returns to the original dimension after convolution 1302, so the dimension is unchanged overall. This is called the bottleneck method. For example, the dimension changes from N1 to N2 and then returns to N1. Preferably, N2 is half or twice N1.
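The two residual variants can be contrasted in a small sketch. It is one reading of the description of Figure 13(a) and Figure 13(b), under several simplifying assumptions: a single time step, a stateless threshold unit standing in for each spiking neuron, a kernel-size-1 convolution written as a matrix-vector product, and hand-picked weights with N1 = 4 and N2 = 2. None of these details come from the source.

```python
def spike(vec, threshold=1.0):
    """Stateless stand-in for a spiking neuron: fire when input >= threshold."""
    return [1.0 if v >= threshold else 0.0 for v in vec]


def pointwise_conv(vec, weight):
    """Kernel-size-1 convolution on one point: weight is [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def resf_a(x, w_1301, w_1302):
    """Figure 13(a) reading: sum the outputs of neurons 1311 and 1313."""
    s1 = spike(x)                               # neuron 1311
    s2 = spike(pointwise_conv(s1, w_1301))      # conv 1301 (N1 -> N2), neuron 1312
    s3 = spike(pointwise_conv(s2, w_1302))      # conv 1302 (N2 -> N1), neuron 1313
    return [a + b for a, b in zip(s1, s3)]      # residual sum at the output


def resf_b(x, w_1301, w_1302):
    """Figure 13(b) reading: add the output of 1311 to the input of 1313."""
    s1 = spike(x)
    s2 = spike(pointwise_conv(s1, w_1301))
    pre = pointwise_conv(s2, w_1302)
    return spike([a + b for a, b in zip(s1, pre)])


x = [1.2, 0.3, 2.0, 0.1]                                   # N1 = 4
w_1301 = [[0.6, 0.0, 0.6, 0.0], [0.2, 0.0, 0.2, 0.0]]      # 4 -> 2
w_1302 = [[1.0, 0.0], [0.5, 0.0], [1.5, 0.0], [0.2, 0.0]]  # 2 -> 4
out_a = resf_a(x, w_1301, w_1302)
out_b = resf_b(x, w_1301, w_1302)
```

In this toy example variant (a) can output values above 1 because two spike vectors are summed, whereas variant (b) stays binary because the sum passes through the final neuron; both keep the output dimension equal to the input dimension.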
[0242] The spiking neuron of the present invention can be any spiking neuron commonly used in the SNN field, such as an IAF neuron or an LIF neuron, and is preferably an LIF neuron, to further avoid overfitting.
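For reference, a discrete-time leaky integrate-and-fire (LIF) neuron can be simulated as below. The update rule, the time constant `tau`, the threshold, and the hard reset are a common textbook formulation chosen for illustration; the text does not specify the neuron parameters.

```python
def lif_run(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one LIF neuron over a sequence of input currents.

    At each step the membrane potential leaks toward v_reset while
    integrating the input; crossing v_threshold emits a spike and
    resets the potential.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        v = v + (x - (v - v_reset)) / tau  # leaky integration
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset                    # hard reset after firing
        else:
            spikes.append(0)
    return spikes


spike_train = lif_run([1.5, 1.5, 1.5, 1.5])  # fires on every second step
silent = lif_run([0.0, 0.0, 0.0])            # no input, no spikes
```

An IAF neuron corresponds to dropping the leak term, so that the potential only accumulates input between spikes.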
[0243] In summary, the ResF module of this invention includes at least one convolutional kernel for dimensionality increase and at least one convolutional kernel for dimensionality reduction. After dimensionality increase by at least one dimensionality-increasing convolutional kernel and dimensionality reduction by at least one dimensionality-reducing convolutional kernel, the output dimension of the ResF module is the same as its input dimension. Furthermore, tests have shown that, for the implementations of Figure 13(a) to Figure 13(c), the SNN performance ranks from best to worst as: Figure 13(a) > Figure 13(b) > Figure 13(c).
[0244] Figure 14 is a flowchart of point cloud-based event-driven information processing in a preferred embodiment of the present invention, wherein Figure 14(a) is the preprocessing flow, which includes the following steps:
[0245] Use a sliding window to segment the initial event cloud;
[0246] Transform high-dimensional event cloud data into low-dimensional point cloud data, and / or normalize it to obtain the required point cloud data;
[0247] Sampling and grouping are performed within any sliding window to obtain S groups of local point cloud distributions, where S is a positive integer and S is adaptive. The specific values of S in any sliding window can be the same or different.
[0248] Based on pulse coding, the information of each point is converted into pulses to obtain multiple pulse streams.
[0249] The number of pulse streams corresponds to the specific pulse coding method. In a certain embodiment, for any window including S groups of local point cloud distributions, S groups of pulse streams are obtained.
[0250] Figure 14(b) is the SNN processing flow, which generally includes the following steps:
[0251] For the S groups of pulse streams corresponding to any sliding window, local feature extraction is performed on each group separately;
[0252] The S groups of local features extracted above are merged and then subjected to the first pooling to obtain a tensor representing the global feature dimension.
[0253] Global features are extracted using a global feature extractor and then a second pooling is performed.
[0254] The classifier classifies based on a single feature vector, obtaining an initial classification result corresponding to the sliding window.
[0255] Multiple sliding windows correspond to multiple initial classification results, and the classifier further uses a voting mechanism to obtain the final classification result.
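The data flow of the steps above, including the final vote across sliding windows, can be sketched as follows. This is an illustrative sketch only: the feature extractors are reduced to a single pointwise weight matrix each, the spiking neuron is replaced by a stateless threshold, the first pooling is assumed to act over the points of each group and the second over the S groups, and the vote is assumed to be a simple majority. The function names and weight arguments are hypothetical.

```python
import numpy as np
from collections import Counter

def spike(x, v_th=1.0):
    """Stateless threshold used as a stand-in for a spiking neuron (assumption)."""
    return (x >= v_th).astype(float)

def classify_window(spikes, w_local, w_global, w_cls):
    """Classify one sliding window. spikes: (T, S, K, C) pulse streams."""
    local = spike(spikes @ w_local)        # local features per group: (T, S, K, D1)
    merged = local.max(axis=2)             # merge groups, first pooling over points: (T, S, D1)
    glob = spike(merged @ w_global)        # global features: (T, S, D2)
    pooled = glob.max(axis=1)              # second pooling over groups: (T, D2)
    vec = pooled.mean(axis=0)              # firing rate over time: single vector (D2,)
    return int(np.argmax(vec @ w_cls))     # initial classification result

def vote(labels):
    """Majority vote over per-window results; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]
```

For example, if five sliding windows yield the initial results [2, 1, 2, 0, 2], vote returns 2 as the final classification result.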
[0256] There are various alternative and extended solutions for adding / removing steps in the above steps, and specific alternatives can be found in the description above. Furthermore, although the present invention has been described with reference to specific features and embodiments, various modifications, combinations, and substitutions can be made without departing from the invention. The scope of protection of the present invention is not limited to the specific embodiments of the processes, machines, manufacturing processes, material compositions, apparatuses, methods, and steps described in the specification, and these methods and modules may also be implemented in one or more related, interdependent, cooperative, or upstream / downstream products or methods.
[0257] Therefore, the specification and drawings should be simply regarded as a description of some embodiments of the technical solutions defined by the appended claims, and thus the appended claims should be interpreted in accordance with the principle of the greatest reasonable interpretation, and are intended to cover as much as possible all modifications, variations, combinations or equivalents within the scope of the invention, while avoiding unreasonable interpretations.
[0258] To achieve better technical effects or for the needs of certain applications, those skilled in the art may make further improvements to the technical solution based on this invention. However, even if such improvements / designs are inventive and / or progressive, as long as they rely on the technical concept of this invention and cover the technical features defined in the claims, the technical solution should also fall within the protection scope of this invention.
[0259] The technical features mentioned in the appended claims may have alternative technical features, or the order of certain technical processes or material organization may be rearranged. Those skilled in the art, upon learning of this invention, will readily conceive of these alternative means, or alter the order of the technical processes or material organization, and then employ substantially the same means to solve substantially the same technical problems and achieve substantially the same technical effects. Therefore, even if the claims explicitly define the aforementioned means and / or order, these modifications, alterations, and substitutions should all fall within the scope of protection of the claims based on the principle of equivalents.
[0260] The method steps or modules described in the embodiments disclosed in this invention can be implemented in hardware, software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the steps and components of each embodiment have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application or design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered outside the scope of protection claimed by this invention.
Claims
1. A data conversion method, characterized by including the following steps: normalizing the spatiotemporal information in the initial event cloud data to obtain a normalized point cloud dataset, wherein the initial event cloud is the event stream output by an event imaging device; mapping the normalized point cloud dataset from a global point cloud distribution to a local point cloud distribution; and using pulse coding to convert the information of each point in the local point cloud distribution into pulses to obtain a converted pulse stream; wherein the pulse stream is used for the following processing by a spiking neural network: a global feature extractor extracts global features from the pulse stream input to the spiking neural network, increasing the dimension of the input pulses; and classification is performed based on the global features; the global feature extractor includes at least one pair of a cascaded first-type convolutional kernel and residual feature block; the residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled to a spiking neuron before it and a spiking neuron after it; the residual feature block uses one of the following residual connection methods: I) the output of the starting spiking neuron is residually connected to the output of the ending spiking neuron; II) the output of the starting spiking neuron is residually connected to the input of the ending spiking neuron; III) the output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron; the first-type convolutional kernel is a dimension-increasing convolutional kernel, and the second-type convolutional kernels include at least one dimension-increasing convolutional kernel and at least one dimension-reducing convolutional kernel.
2. The data conversion method according to claim 1, characterized in that: all points in the point cloud dataset are event points; and the step of mapping the point cloud dataset from a global point cloud distribution to a local point cloud distribution includes: sampling the point cloud data to obtain at least one key point, and selecting the points around each key point as one group.
3. The data conversion method according to claim 2, characterized in that: the sampling uses one of the following methods: I) selecting at least one key point based on random sampling; II) selecting S key points based on the farthest point sampling method; and the group of points around the key point is selected using one of the following methods: I) identifying at least one neighboring point around any key point based on the K-nearest neighbor method; II) drawing a sphere in the spatial domain around any key point and, within the sphere, selecting N' points uniformly or randomly to obtain the group of points around the key point.
4. The data conversion method according to any one of claims 1 to 3, characterized in that: the initial event cloud data or the normalized point cloud dataset is segmented using a sliding window; within any sliding window, the normalized point cloud dataset within the sliding window is mapped from the global point cloud distribution to the local point cloud distribution; and pulse coding is then used to convert the information of each point in the local point cloud distribution of the sliding window into pulses to obtain a converted pulse stream.
5. An event-driven information processing method, characterized in that the following processing is performed using a spiking neural network: a global feature extractor extracts global features from the pulse stream input to the spiking neural network, increasing the dimension of the input pulses, wherein the pulse stream is obtained by: normalizing the spatiotemporal information in the initial event cloud data to obtain a normalized point cloud dataset, where the initial event cloud is the event stream output by an event imaging device; mapping the normalized point cloud dataset from a global point cloud distribution to a local point cloud distribution; and using pulse coding to convert the information of each point in the local point cloud distribution into pulses to obtain a converted pulse stream; and classification is performed based on the global features; the global feature extractor includes at least one pair of a cascaded first-type convolutional kernel and residual feature block; the residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled to a spiking neuron before it and a spiking neuron after it; the residual feature block uses one of the following residual connection methods: I) the output of the starting spiking neuron is residually connected to the output of the ending spiking neuron; II) the output of the starting spiking neuron is residually connected to the input of the ending spiking neuron; III) the output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron; wherein the first-type convolutional kernel is a dimension-increasing convolutional kernel, and the second-type convolutional kernels include at least one dimension-increasing convolutional kernel and at least one dimension-reducing convolutional kernel; and the number of first-type convolutional kernels is equal to, or one more than, the number of residual feature blocks.
6. The event-driven information processing method according to claim 5, characterized in that: the input of any residual feature block is coupled to the output of a first-type convolutional kernel; and the output dimension of the residual feature block is the same as its input dimension.
7. An event-driven information processing apparatus, characterized by comprising: a local feature extractor, whose input receives multiple pulse streams input to a spiking neural network and which is used to extract local features, wherein the pulse streams are obtained by: normalizing the spatiotemporal information in the initial event cloud data to obtain a normalized point cloud dataset, where the initial event cloud is the event stream output by an event imaging device; mapping the normalized point cloud dataset from a global point cloud distribution to a local point cloud distribution; and using pulse coding to convert the information of each point in the local point cloud distribution into pulses to obtain a converted pulse stream; a merging module, whose input is coupled to the output of the local feature extractor and which is used to merge the features extracted by the local feature extractor; a global feature extractor, whose input is coupled to the output of the merging module and which is used to extract global features; and a classification module, which performs classification based on the extracted global features to obtain the classification results; wherein one or more of the local feature extractor and the global feature extractor include at least one pair of a cascaded first-type convolutional kernel and residual feature block, and the number of first-type convolutional kernels is equal to, or one more than, the number of residual feature blocks; the residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled to a spiking neuron before it and a spiking neuron after it; the residual feature block uses one of the following residual connection methods: I) the output of the starting spiking neuron is residually connected to the output of the ending spiking neuron; II) the output of the starting spiking neuron is residually connected to the input of the ending spiking neuron; III) the output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron; the first-type convolutional kernel is a dimension-increasing convolutional kernel, and the second-type convolutional kernels include at least one dimension-increasing convolutional kernel and at least one dimension-reducing convolutional kernel.
8. The event-driven information processing apparatus according to claim 7, characterized in that: the input of any residual feature block is coupled to the output of a first-type convolutional kernel; and the output dimension of the residual feature block is the same as its input dimension.
9. An event-driven information processing method based on point clouds, characterized in that: a point cloud-based spiking neural network preprocesses the event stream output by the event imaging device using the data conversion method according to any one of claims 1 to 4 to obtain a preprocessed pulse stream; alternatively, based on the preprocessed pulse stream, the classification result is obtained using the event-driven information processing method according to any one of claims 5 to 6.
10. An event-driven information processing method based on a point cloud, characterized by including the following steps: segmenting the initial event cloud using at least one sliding window to obtain the required point cloud data, wherein the initial event cloud is the event stream output by an event imaging device; performing sampling and grouping within any sliding window to obtain S groups of local point cloud distributions, where S is a positive integer; converting the information of each point into pulses based on pulse coding to obtain at least one pulse stream; performing local feature extraction separately on each of the S groups of pulse streams corresponding to any sliding window; merging the extracted S groups of local features and then applying a first pooling to obtain a tensor representing the global feature dimension; extracting global features using a global feature extractor and then applying a second pooling; and performing classification based on the global features after the second pooling to obtain the initial classification result corresponding to the sliding window; wherein the global feature extractor includes at least one pair of a cascaded first-type convolutional kernel and residual feature block; the residual feature block includes at least one second-type convolutional kernel, and each second-type convolutional kernel is coupled to a spiking neuron before it and a spiking neuron after it; the residual feature block uses one of the following residual connection methods: I) the output of the starting spiking neuron is residually connected to the output of the ending spiking neuron; II) the output of the starting spiking neuron is residually connected to the input of the ending spiking neuron; III) the output of the starting spiking neuron is not residually connected to the input or output of the ending spiking neuron; the first-type convolutional kernel is a dimension-increasing convolutional kernel, and the second-type convolutional kernels include at least one dimension-increasing convolutional kernel and at least one dimension-reducing convolutional kernel.