Event camera based pose estimation method, apparatus, device and storage medium

CN115731259B · Active · Publication Date: 2026-06-12 · SHENZHEN RUISHIZHIXIN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN RUISHIZHIXIN TECH CO LTD
Filing Date: 2022-11-30
Publication Date: 2026-06-12

IPC: G06T7/207; G06V10/80; G06V10/40; G06V10/62

AI Tagging

Application Domain

Image analysis Character and pattern recognition

AI Technical Summary

Technical Problem

In existing technologies, motion estimation based on the geometric characteristics of objects in images has low accuracy, especially under conditions such as occlusion or foggy/rainy weather.

Method Used

By combining event cameras and standard cameras to acquire image and event data respectively, pose regression is performed through feature extraction, filtering and fusion to estimate the system pose change.

Benefits of Technology

In high-speed and high-dynamic-range scenarios, the event data provides more motion information, improving the accuracy and stability of pose estimation.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides an event camera-based pose estimation method and device, equipment and a storage medium. The method comprises: acquiring an APS image pair corresponding to a starting time and an ending time of a pose estimation time interval, and acquiring first interval event data and second interval event data associated with the pose estimation time interval; generating a first event image group according to a plurality of sub-interval event data in the first interval event data, and generating a second event image group according to a plurality of sub-interval event data in the second interval event data; performing feature extraction based on the APS image pair to obtain APS image features, and performing feature extraction based on the first event image group and the second event image group to obtain event image features; performing feature filtering and fusion on the APS image features and the event image features to obtain fused features; and performing pose regression based on the fused features to obtain system pose change data. Thus, the accuracy and stability of pose estimation can be effectively improved.

Description

Technical Field

[0001] This application relates to the field of motion estimation technology, and in particular to a pose estimation method, apparatus, device, and storage medium based on an event camera.

Background Technology

[0002] Currently, visual odometry (VO) is widely used in navigation and positioning of devices such as autonomous vehicles, drones, and robots. The mainstream visual odometry implementation method mainly estimates the camera pose based on the geometric characteristics of objects in the image. Therefore, the image is required to contain a large number of stable texture features. Once there are occlusions in the scene or the scene is captured in foggy or rainy weather, the solution accuracy of the geometric method will be severely affected without other sensors (IMU, laser, radar, etc.), resulting in low accuracy of motion estimation.

Summary of the Invention

[0003] This application provides a pose estimation method, apparatus, device, and storage medium based on an event camera, which can at least solve the problem of low accuracy of estimation results caused by motion estimation based on the geometric characteristics of objects in images in related technologies.

[0004] The first aspect of this application provides a pose estimation method based on an event camera, comprising: acquiring two APS images corresponding to the start and end times of a pose estimation time interval to form an APS image pair; acquiring first interval event data and second interval event data associated with the pose estimation time interval; generating a first event image group based on multiple sub-interval event data in the first interval event data; generating a second event image group based on multiple sub-interval event data in the second interval event data; performing feature extraction based on the APS image pair to obtain APS image features; and performing feature extraction based on the first event image group and the second event image group to obtain event image features; performing feature filtering and fusion on the APS image features and the event image features to obtain fused features; and performing pose regression based on the fused features to obtain system pose change data.

[0005] A second aspect of this application provides a pose estimation device based on an event camera, comprising: an acquisition module, configured to acquire two APS images corresponding to the start and end times of a pose estimation time interval to form an APS image pair, and to acquire first interval event data and second interval event data associated with the pose estimation time interval; a generation module, configured to generate a first event image group based on multiple sub-interval event data in the first interval event data, and to generate a second event image group based on multiple sub-interval event data in the second interval event data; an extraction module, configured to perform feature extraction based on the APS image pair to obtain APS image features, and to perform feature extraction based on the first event image group and the second event image group to obtain event image features; a fusion module, configured to perform feature filtering and fusion on the APS image features and the event image features to obtain fused features; and an estimation module, configured to perform pose regression based on the fused features to obtain system pose change data.

[0006] A third aspect of this application provides an electronic device, including a memory and a processor, wherein the processor is configured to execute a computer program stored in the memory, and when the processor executes the computer program, it implements the steps in the pose estimation method provided in the first aspect of this application.

[0007] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon. When the computer program is executed by a processor, it implements the steps in the pose estimation method provided in the first aspect of this application.

[0008] As can be seen from the above, according to the event camera-based pose estimation method, apparatus, device, and storage medium provided in this application, two APS images corresponding to the start and end times of the pose estimation time interval are acquired to form an APS image pair, and first interval event data and second interval event data associated with the pose estimation time interval are acquired; a first event image group is generated based on multiple sub-interval event data in the first interval event data, and a second event image group is generated based on multiple sub-interval event data in the second interval event data; feature extraction is performed on the APS image pair to obtain APS image features, and feature extraction is performed on the first event image group and the second event image group to obtain event image features; feature filtering and fusion are performed on the APS image features and event image features to obtain fused features; pose regression is performed based on the fused features to obtain system pose change data. Through the implementation of this application, the pose change of the system during motion is estimated by combining the data collected by both the event camera and the standard camera. When the system is in a high-speed and high-dynamic-range scenario, more motion information can be provided to estimate the system's own pose change, effectively improving the accuracy and stability of pose estimation.

Attached Figure Description

[0009] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application;

[0010] Figure 2 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;

[0011] Figure 3 is a basic flowchart of a pose estimation method provided in an embodiment of this application;

[0012] Figure 4 is a schematic diagram of the structure of a pose estimation neural network provided in an embodiment of this application;

[0013] Figure 5 is a detailed flowchart of the pose estimation method provided in an embodiment of this application;

[0014] Figure 6 is a schematic diagram of the program modules of a pose estimation device provided in an embodiment of this application.

Detailed Implementation

[0015] To make the inventive objectives, features, and advantages of this application more apparent and understandable, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0016] In the description of the embodiments of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0017] The following describes in detail, with reference to the accompanying drawings, an embodiment of the present application of a pose estimation method, apparatus, device, and storage medium based on an event camera.

[0018] To address the issue of low accuracy in motion estimation based on the geometric characteristics of objects in images in related technologies, this application provides an embodiment of a pose estimation method based on an event camera, applicable to the application scenario shown in Figure 1. The scenario may include an APS (Active-Pixel Sensor) camera 101, an event camera 102, and an electronic device 103.

[0019] It is worth noting that an active pixel sensor is a commonly used image sensor in which each pixel sensor unit has a photodetector and at least one active transistor. In metal-oxide-semiconductor (MOS) active pixel sensors, MOS field-effect transistors (MOSFETs) are used as amplifiers. There are various types of APS, including the earlier NMOS type APS and the more common complementary MOS (CMOS) type APS.

[0020] The event-based vision sensor (EVS) configured in the event camera is a new type of sensor that simulates the human retina and responds to pixel pulses caused by changes in brightness due to motion. Therefore, it can capture changes in scene brightness (i.e., changes in light intensity) at an extremely high frame rate, record events at specific times and specific pixel locations, and form an event stream instead of a frame stream. This can solve the problems of information redundancy, large data storage and real-time processing volume of traditional cameras.

[0021] In addition, electronic device 103 is a variety of terminal devices with data processing capabilities, including but not limited to smartphones, tablets, laptops, desktop computers, vehicle terminals, and airborne terminals.

[0022] In the application scenario shown in Figure 1, APS images can be acquired by APS camera 101 and corresponding event data can be acquired synchronously by event camera 102. Both cameras then send their respective acquired data to electronic device 103. Electronic device 103 performs the following pose estimation method on the received APS images and event data: First, it acquires two APS images corresponding to the start and end times of the pose estimation time interval to form an APS image pair, and acquires first interval event data and second interval event data associated with the pose estimation time interval; then, it generates a first event image group based on multiple sub-interval event data in the first interval event data, and a second event image group based on multiple sub-interval event data in the second interval event data; next, it extracts features based on the APS image pair to obtain APS image features, and extracts features based on the first and second event image groups to obtain event image features; further, it performs feature filtering and fusion on the APS image features and event image features to obtain fused features; finally, it performs pose regression based on the fused features to obtain system pose change data.

[0023] Figure 2 is a schematic diagram of an electronic device according to an embodiment of this application. The electronic device mainly includes a memory 201 and a processor 202. The number of processors 202 can be one or more. The memory 201 stores a computer program 203 that can run on the processor 202. The memory 201 and the processor 202 are communicatively connected. When the processor 202 executes the computer program 203, it implements the aforementioned pose estimation method.

[0024] It should be noted that memory 201 can be high-speed random access memory (RAM) or non-volatile memory, such as disk storage. Memory 201 is used to store executable program code, and processor 202 is coupled to memory 201.

[0025] One embodiment of this application also provides a computer-readable storage medium, which may be disposed in the aforementioned electronic device. The computer-readable storage medium may be the memory in the embodiment shown in Figure 2 above.

[0026] The computer-readable storage medium stores a computer program that, when executed by a processor, implements the aforementioned pose estimation method. Furthermore, the computer-readable storage medium can also be a USB flash drive, external hard drive, read-only memory (ROM), RAM, magnetic disk, optical disk, or any other medium capable of storing program code.

[0027] Figure 3 is a basic flowchart of a pose estimation method provided in an embodiment of this application. The pose estimation method can be executed by the electronic device shown in Figure 1 or Figure 2 and includes the following steps:

[0028] Step 301: Obtain two APS images corresponding to the start and end times of the attitude estimation time interval to form an APS image pair, and obtain the first interval event data and the second interval event data associated with the attitude estimation time interval.

[0029] In this embodiment, the event camera and the APS camera acquire data synchronously. The event camera has a higher frame rate than the APS camera, so it records a significant amount of event data between the two APS images acquired by the APS camera. The event stream therefore contains inter-frame information that the APS camera cannot record. In practical applications, to estimate the camera's motion within a specific time period, it is first necessary to determine the APS images and event data to be used. In this embodiment, two APS images at the start and end times of the pose estimation time interval are acquired. Furthermore, the pose estimation time interval is divided into a first time interval and a second time interval in time order, and the corresponding event data for each time interval are acquired. The first interval event data corresponds to the APS image at the start time of the pose estimation time interval, and the second interval event data corresponds to the APS image at the end time.

[0030] In one optional implementation of this embodiment, the step of obtaining the first interval event data and the second interval event data associated with the pose estimation time interval includes: dividing the pose estimation time interval into a first time interval and a second time interval according to the midpoint of the pose estimation time interval; and obtaining the first interval event data corresponding to the first time interval and the second interval event data corresponding to the second time interval, respectively.

[0031] Specifically, in this embodiment, the pose estimation time interval is divided into a first time interval and a second time interval, and then the event data in the two time intervals are taken as the first interval event data and the second interval event data, respectively.

[0032] Step 302: Generate a first event image group based on multiple sub-interval event data in the first interval event data, and generate a second event image group based on multiple sub-interval event data in the second interval event data.

[0033] Specifically, in this embodiment, the first interval event data can be split into two sub-interval event data based on the midpoint of the first time interval, and the second interval event data can be split into two sub-interval event data based on the midpoint of the second time interval. To make fuller use of the event data within the pose estimation time interval, this embodiment can further divide the two interval event data corresponding to the two APS images into multiple sub-interval event data, and then generate corresponding event images. That is, one APS image corresponds to an event image group composed of multiple event images, thereby improving the accuracy of pose estimation. Preferably, this embodiment divides each interval event data into two parts, so that each event image group contains two event images, improving the accuracy of pose estimation while ensuring that each event image contains sufficient motion information.
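The midpoint splitting described in paragraphs [0030] to [0033] can be sketched as follows. This is only an illustration: the tuple layout `(x, y, t, p)`, the helper names `split_at_midpoint` and `build_sub_intervals`, and the half-open boundary handling at each midpoint are assumptions, not details stated in the application.

```python
def split_at_midpoint(events, t_start, t_end):
    """Split events with timestamps in [t_start, t_end] at the temporal midpoint.

    Assumption: an event exactly at the midpoint goes to the second half.
    """
    t_mid = (t_start + t_end) / 2.0
    first = [e for e in events if t_start <= e[2] < t_mid]
    second = [e for e in events if t_mid <= e[2] <= t_end]
    return first, second, t_mid


def build_sub_intervals(events, t_start, t_end):
    """Return two groups, each holding two sub-interval event lists."""
    first, second, t_mid = split_at_midpoint(events, t_start, t_end)
    first_group = list(split_at_midpoint(first, t_start, t_mid)[:2])
    second_group = list(split_at_midpoint(second, t_mid, t_end)[:2])
    return first_group, second_group


# Hypothetical events as (x, y, t, p) with t in seconds.
events = [
    (10, 20, 0.00, 1), (11, 20, 0.01, -1), (12, 21, 0.03, 1),
    (13, 21, 0.05, 1), (14, 22, 0.07, -1), (15, 22, 0.08, 1),
]
first_group, second_group = build_sub_intervals(events, 0.0, 0.08)
```

Each of the four resulting sub-interval lists would then be turned into one event image, giving two event images per APS image.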

[0034] In this embodiment, the event data is represented as $e_i = \{x_i, y_i, t_i, p_i\}$, $i \in \{0, 1, \ldots, n-1\}$, where $(x_i, y_i)$ represents the pixel coordinate position, $t_i$ represents the timestamp, and $p_i$ represents the event polarity. In practical applications, event data can be divided into multiple voxels based on time. Then, the events of one voxel can be converted into a two-dimensional event image. The event images in the first event image group are aligned temporally with the APS image at the start time, and the event images in the second event image group are aligned temporally with the APS image at the end time. It should be noted that this embodiment can convert event data into a two-dimensional image with a height of 260 and a width of 346.

[0035] In this embodiment, each voxel is converted into an event image based on a preset conversion formula, which is expressed as:

[0036]

[0037]

[0038] Where V(x,y,t) represents a voxel, and B represents the number of voxels.
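As a rough illustration of turning one sub-interval of events into a two-dimensional event image of height 260 and width 346, the sketch below sums event polarities per pixel. The per-pixel polarity sum is an assumed stand-in and is not the preset conversion formula of this embodiment; the function name `events_to_image` and the `(x, y, t, p)` tuple layout are likewise hypothetical.

```python
HEIGHT, WIDTH = 260, 346  # image size stated in paragraph [0034]


def events_to_image(events, height=HEIGHT, width=WIDTH):
    """Accumulate event polarities into a height x width image.

    Assumption: each event adds its polarity p (+1 or -1) to pixel (y, x).
    Events outside the sensor area are ignored.
    """
    image = [[0.0] * width for _ in range(height)]
    for x, y, _t, p in events:
        if 0 <= x < width and 0 <= y < height:
            image[y][x] += p
    return image


# Hypothetical sub-interval: two positive events on one pixel, one negative nearby.
sub_interval = [(10, 20, 0.000, 1), (10, 20, 0.010, 1), (11, 20, 0.015, -1)]
img = events_to_image(sub_interval)
```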

[0039] Step 303: Extract features based on APS image pairs to obtain APS image features, and extract features based on the first event image group and the second event image group to obtain event image features.

[0040] Figure 4 is a schematic diagram of a pose estimation neural network provided in this embodiment. The overall network includes a feature extraction network A, a feature fusion network B, and a pose estimation network C. In this embodiment, the APS image and event image serve as inputs to the overall network. After network processing, the final pose estimation result is output.

[0041] In this embodiment, the mean and variance of the APS image and the event image are first calculated, and then normalized and standardized. Each APS image and each event image is then copied three times and placed in a three-channel color matrix. The APS image and the event image are represented using a tensor of a preset dimension. The APS image and the event image are then merged along the channel dimension and used as input to a feature extraction network. The feature extraction network extracts features, specifically APS image features and event image features, through its APS image feature extraction network and event image feature extraction network, respectively. It should be noted that since this embodiment has two sets of event images, the event image feature extraction network has two branch networks corresponding to the first event image group and the second event image group, respectively. These branches extract features from the two sets of event images, and then fuse the extracted event image features to obtain the final event image features output by the event image feature extraction network.
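The preprocessing in paragraph [0041] can be sketched as below. It assumes that "normalized and standardized" means subtracting the image mean and dividing by the standard deviation, and that the three copies are stacked channel-first as `[3, H, W]`; both points, along with the helper names, are assumptions for illustration.

```python
import math


def standardize(image, eps=1e-8):
    """Shift an image to zero mean and scale it to (approximately) unit variance."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((v - mean) ** 2 for v in pixels) / len(pixels)
    std = math.sqrt(variance)
    return [[(v - mean) / (std + eps) for v in row] for row in image]


def to_three_channels(image):
    """Copy a single-channel image three times into a [3, H, W] nested list."""
    return [[row[:] for row in image] for _ in range(3)]


# Hypothetical 2 x 2 grayscale APS image.
aps = [[0.0, 2.0], [4.0, 6.0]]
tensor = to_three_channels(standardize(aps))
```

The same two helpers would apply unchanged to each event image before the APS and event tensors are merged along the channel dimension.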

[0042] Step 304: Perform feature filtering and fusion on the APS image features and event image features to obtain fused features.

[0043] Referring again to Figure 4, the feature fusion network in this embodiment may include a self-attention network. The self-attention network captures long-term dependencies and global channel correlations by comparing the similarity of a single channel with all channels. It focuses on key features from the same sensor, allowing for better selection of features collected by the APS camera and the event camera. This self-attention network includes two branches for feature filtering of APS image features and event image features, respectively. It should be understood that in practical applications, the image features output by the feature extraction network can first be transformed using a reshape network, i.e., transforming the feature map channels and/or row and column dimensions.

[0044] In one optional embodiment of this example, the steps of filtering and fusing APS image features and event image features to obtain fused features include: performing convolution processing on the APS image features to obtain corresponding first Query matrix, first Key matrix, and first Value matrix; inputting the first Query matrix, first Key matrix, and first Value matrix into a preset first feature filtering formula for calculation to obtain first coarse-filtered features; performing convolution processing on the event image features to obtain corresponding second Query matrix, second Key matrix, and second Value matrix; inputting the second Query matrix, second Key matrix, and second Value matrix into a preset second feature filtering formula for calculation to obtain second coarse-filtered features; and fusing features based on the first coarse-filtered features and the second coarse-filtered features to obtain fused features.

[0045] In this embodiment, the first feature filtering formula described above can be expressed as:

[0046]

[0047] Where $[Q_f, K_f, V_f] = W_{qkv} f$; $W_{qkv}$ represents the learnable weight matrix; $f_i$ represents the first coarse-filtered feature; $Q_f$, $K_f$, and $V_f$ represent the first Query matrix, the first Key matrix, and the first Value matrix, respectively; $d_k$ represents the variance of the matrix elements; and $T$ represents the transpose operation.

[0048] Furthermore, the second feature filtering formula mentioned above can be expressed as:

[0049]

[0050] Where $[Q_e, K_e, V_e] = W_{qkv} e$; $e_i$ represents the second coarse-filtered feature; and $Q_e$, $K_e$, and $V_e$ represent the second Query matrix, the second Key matrix, and the second Value matrix, respectively.

[0051] It should be noted that in this embodiment, the product of the Query matrix and the Key matrix is used as a mask for feature filtering, which can also be called weights, while the Value matrix is a matrix derived from the input APS image features or event image features.

[0052] Specifically, each self-attention network in this embodiment has three 1×1 convolutional layers, which are used to generate the Query matrix, Key matrix, and Value matrix respectively. The number of channels in the Query matrix and Key matrix is reduced to 1/8 of the original, while the number of channels in the Value matrix remains unchanged. The feature map size is H×W for all matrices. The Query matrix is transposed and multiplied by the Key matrix, then normalized by softmax to obtain an Attention Map of [H×W, H×W]. Next, the Attention Map is multiplied by the Value matrix to obtain a Feature Map of [H×W, C]. Finally, the output is reshaped to [H, W, C] to highlight more significant features and filter out interference and useless features.
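The shape bookkeeping in paragraph [0052] can be traced with a small sketch. It models each 1×1 convolution as a per-position matrix product over channels, stores features row-major as `[H×W, C]` (so the product `Q · Kᵀ` plays the role of "Query transposed times Key"), and uses random placeholder weights; the function names and weight values are assumptions, not the trained network of this embodiment.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def self_attention(features, w_q, w_k, w_v):
    """features: [H*W, C]; w_q and w_k: [C, C//8]; w_v: [C, C]."""
    q = matmul(features, w_q)  # [H*W, C//8]
    k = matmul(features, w_k)  # [H*W, C//8]
    v = matmul(features, w_v)  # [H*W, C]
    attention_map = softmax_rows(matmul(q, transpose(k)))  # [H*W, H*W]
    return attention_map, matmul(attention_map, v)  # feature map [H*W, C]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, W, C = 2, 3, 16
features = random_matrix(H * W, C, rng)
attn, filtered = self_attention(
    features,
    random_matrix(C, C // 8, rng),
    random_matrix(C, C // 8, rng),
    random_matrix(C, C, rng),
)
# Reshape the [H*W, C] feature map back to [H, W, C].
reshaped = [filtered[r * W:(r + 1) * W] for r in range(H)]
```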

[0053] In an optional embodiment of this example, the step of fusing features based on the first coarse filtering features and the second coarse filtering features to obtain fused features includes: performing convolution processing on the first coarse filtering features to obtain corresponding third Query matrix, third Key matrix, and third Value matrix; performing convolution processing on the second coarse filtering features to obtain corresponding fourth Query matrix, fourth Key matrix, and fourth Value matrix; inputting the third Query matrix, third Key matrix, third Value matrix, fourth Key matrix, and fourth Value matrix into a preset third feature filtering formula for calculation to obtain the first fine filtering features; inputting the third Key matrix, third Value matrix, fourth Query matrix, fourth Key matrix, and fourth Value matrix into a preset fourth feature filtering formula for calculation to obtain the second fine filtering features; and fusing the first fine filtering features and the second fine filtering features to obtain fused features.

[0054] Referring again to Figure 4, the feature fusion network in this embodiment also includes a cross-attention network. The cross-attention network aims to fuse complementary data from different sensors and pay more attention to the features of different sensors in different scenarios. In other words, it compares the features of event data with the features of APS images and retains the more useful features of both.

[0055] In this embodiment, the third feature filtering formula described above can be expressed as:

[0056]

[0057] Where $m_f$ represents the first fine-filtered feature; $Q_{f'}$, $K_{f'}$, and $V_{f'}$ represent the third Query matrix, the third Key matrix, and the third Value matrix, respectively; and $K_{e'}$ and $V_{e'}$ represent the fourth Key matrix and the fourth Value matrix, respectively.

[0058] Furthermore, the fourth feature filtering formula mentioned above can be expressed as:

[0059]

[0060] Where $m_e$ represents the second fine-filtered feature, and $Q_{e'}$ represents the fourth Query matrix.
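One possible reading of the cross-attention stage is sketched below, using only the inputs named for each formula: the query of one modality attends over the keys and values of both modalities. How the two key/value sets are combined (here, concatenation along the position axis), the `sqrt(d)` scaling, and the final channel-wise concatenation of $m_f$ and $m_e$ are all assumptions made for illustration; the third and fourth feature filtering formulas themselves may differ.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def cross_attention(q_own, k_own, v_own, k_other, v_other):
    """Attend from one modality's queries over both modalities' keys and values.

    Assumption: keys and values of the two modalities are stacked along the
    position axis, and scores are scaled by sqrt(d) of the query dimension.
    """
    keys = k_own + k_other      # [2N, d]
    values = v_own + v_other    # [2N, C]
    scale = math.sqrt(len(q_own[0]))
    scores = [[s / scale for s in row] for row in matmul(q_own, transpose(keys))]
    return matmul(softmax_rows(scores), values)  # [N, C]


# Hypothetical toy matrices: N = 2 positions, d = 2, C = 1.
q_f = [[0.0, 0.0], [1.0, 0.0]]
k_f = [[1.0, 0.0], [0.0, 1.0]]
v_f = [[2.0], [6.0]]
q_e = [[0.0, 0.0], [0.0, 1.0]]
k_e = [[0.0, 1.0], [1.0, 0.0]]
v_e = [[4.0], [8.0]]

m_f = cross_attention(q_f, k_f, v_f, k_e, v_e)  # first fine-filtered feature
m_e = cross_attention(q_e, k_e, v_e, k_f, v_f)  # second fine-filtered feature
# Assumed fusion: concatenate the two results along the channel axis.
fused = [row_f + row_e for row_f, row_e in zip(m_f, m_e)]
```

A zero query row scores every key equally, so its output is the plain average of all four value rows.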

[0061] Step 305: Perform pose regression based on fused features to obtain system pose change data.

[0062] Referring again to Figure 4, in this embodiment the fused features are input into a pose estimation network for pose regression. The pose estimation network includes Long Short-Term Memory (LSTM) units and fully connected layers. The LSTM part is used to estimate the frame-to-frame motion experienced by the camera. In practical applications, stacking two LSTMs is preferred for better performance. This embodiment first inputs the fused features into the LSTM units of the pose estimation network for temporal modeling to obtain estimated data. Then, the estimated data is input into the fully connected layers for processing, outputting system pose change data. This system pose change data includes x, y, and z values representing displacement and Euler angles representing rotation, thus giving the system pose change between adjacent time points.
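To show how a six-value pose change (x, y, z plus three Euler angles) might be consumed downstream, the sketch below converts each pose change to a 4×4 homogeneous transform and chains the transforms into a trajectory. The Z-Y-X (yaw, pitch, roll) Euler convention, the angle units (radians), and the function names are assumptions; the application does not specify them, and this chaining step is not part of the described network.

```python
import math


def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 transform with R = Rz(yaw) * Ry(pitch) * Rx(roll) (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp, cp * sr, cp * cr, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Matrix product of two 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def accumulate(pose_changes):
    """Chain relative pose changes and return the position after each step."""
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    trajectory = []
    for change in pose_changes:
        pose = compose(pose, pose_to_matrix(*change))
        trajectory.append((pose[0][3], pose[1][3], pose[2][3]))
    return trajectory


# Hypothetical output: move 1 unit forward and turn 90 degrees left, twice.
steps = [(1.0, 0.0, 0.0, 0.0, 0.0, math.pi / 2)] * 2
trajectory = accumulate(steps)
```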

[0063] Figure 5 is a detailed flowchart of the pose estimation method provided in an embodiment of this application. The implementation process of the pose estimation method includes the following steps:

[0064] Step 501: Obtain two APS images corresponding to the start and end times of the attitude estimation time interval to form an APS image pair, and obtain the first interval event data and the second interval event data associated with the attitude estimation time interval.

[0065] Step 502: Divide the first time interval event data into two sub-interval event data according to the midpoint of the first time interval, and divide the second time interval event data into two sub-interval event data according to the midpoint of the second time interval;

[0066] Step 503: Generate a first event image group based on multiple sub-interval event data in the first interval event data, and generate a second event image group based on multiple sub-interval event data in the second interval event data;

[0067] Step 504: Extract features based on APS image pairs to obtain APS image features, and extract features based on the first event image group and the second event image group to obtain event image features;

[0068] Step 505: Obtain the first coarse-filtered features based on the first Query matrix, the first Key matrix, and the first Value matrix corresponding to the APS image features; and obtain the second coarse-filtered features based on the second Query matrix, the second Key matrix, and the second Value matrix corresponding to the event image features.

[0069] Step 506: Obtain the third Query matrix, third Key matrix, and third Value matrix corresponding to the first coarse filtering feature, and obtain the fourth Query matrix, fourth Key matrix, and fourth Value matrix corresponding to the second coarse filtering feature;

[0070] Step 507: Obtain the first fine-filtering feature based on the third Query matrix, the third Key matrix, the third Value matrix, the fourth Key matrix, and the fourth Value matrix; and obtain the second fine-filtering feature based on the third Key matrix, the third Value matrix, the fourth Query matrix, the fourth Key matrix, and the fourth Value matrix.

[0071] Step 508: Perform feature fusion on the first fine-filter feature and the second fine-filter feature to obtain the fused feature;

[0072] Step 509: Perform pose regression based on the fused features to obtain system pose change data.

[0073] It should be understood that the sequence number of each step in this embodiment does not imply the order in which the steps are executed. The execution order of each step should be determined by its function and internal logic, and should not constitute a unique limitation on the implementation process of this application embodiment.

[0074] As shown in Figure 6, an embodiment of this application provides a pose estimation device based on an event camera. This pose estimation device can be used to implement the pose estimation method in the foregoing embodiments. The pose estimation device mainly includes:

[0075] The acquisition module 601 is used to acquire two APS images corresponding to the start and end times of the pose estimation time interval to form an APS image pair, and to acquire first interval event data and second interval event data associated with the pose estimation time interval;

[0076] The generation module 602 is used to generate a first event image group based on multiple sub-interval event data in the first interval event data, and to generate a second event image group based on multiple sub-interval event data in the second interval event data;

[0077] The extraction module 603 is used to extract features from the APS image pair to obtain APS image features, and to extract features from the first event image group and the second event image group to obtain event image features;

[0078] The fusion module 604 is used to perform feature filtering and fusion on the APS image features and the event image features to obtain fused features;

[0079] The estimation module 605 is used to perform pose regression based on the fused features to obtain system pose change data.
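The data flow between modules 601 to 605 can be sketched as a simple composition of five stages. This is a minimal sketch only: the class name `PosePipeline`, its field names, and the callable signatures are hypothetical and are not taken from this application, and each stage is supplied by the caller rather than implemented here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class PosePipeline:
    """Hypothetical wiring of modules 601-605; every stage is a caller-supplied callable."""
    acquire: Callable[[], Tuple[Any, Any, Any]]          # module 601
    generate: Callable[[Any, Any], Tuple[Any, Any]]      # module 602
    extract: Callable[[Any, Any, Any], Tuple[Any, Any]]  # module 603
    fuse: Callable[[Any, Any], Any]                      # module 604
    estimate: Callable[[Any], Any]                       # module 605

    def run(self) -> Any:
        # 601: APS image pair plus the two halves of the event data.
        aps_pair, first_events, second_events = self.acquire()
        # 602: one event image group per half of the event data.
        first_group, second_group = self.generate(first_events, second_events)
        # 603: APS image features and event image features.
        aps_features, event_features = self.extract(aps_pair, first_group, second_group)
        # 604: feature filtering and fusion.
        fused = self.fuse(aps_features, event_features)
        # 605: pose regression on the fused features.
        return self.estimate(fused)
```

Keeping the stages as independent callables mirrors the statement in the description that the module division is only a logical one, so stages can be merged or replaced without changing the overall flow.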

[0080] In some implementations of this embodiment, when the acquisition module performs the function of acquiring first interval event data and second interval event data associated with the pose estimation time interval, it is specifically used to: divide the pose estimation time interval into a first time interval and a second time interval according to the midpoint of the pose estimation time interval; and acquire the first interval event data corresponding to the first time interval and the second interval event data corresponding to the second time interval, respectively.
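The midpoint division described above can be illustrated with a short sketch. The `Event` layout, the function name `split_at_midpoint`, and the choice to assign an event that falls exactly on the midpoint to the second interval are assumptions made for illustration; this application does not specify them.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """One event-camera event (hypothetical layout)."""
    t: float       # timestamp
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # +1 for a brightness increase, -1 for a decrease


def split_at_midpoint(
    events: List[Event], t_start: float, t_end: float
) -> Tuple[List[Event], List[Event]]:
    """Split the events inside [t_start, t_end] into two halves at the interval midpoint."""
    if t_end <= t_start:
        raise ValueError("t_end must be later than t_start")
    t_mid = (t_start + t_end) / 2.0
    # Assumption: an event exactly at the midpoint belongs to the second interval.
    first = [e for e in events if t_start <= e.t < t_mid]
    second = [e for e in events if t_mid <= e.t <= t_end]
    return first, second
```

Events outside the pose estimation time interval are simply dropped, so the two returned lists correspond to the first interval event data and the second interval event data.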

[0081] In some implementations of this embodiment, the pose estimation device further includes a splitting module, configured to split the first interval event data into two sub-interval event data according to the midpoint of the first time interval, and to split the second interval event data into two sub-interval event data according to the midpoint of the second time interval.
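A sketch of how one interval's event data might be split at its midpoint and turned into an event image group of two event images follows. The way an event image is formed here (summing event polarities per pixel) is an assumption chosen for illustration, not a method stated in this passage, and the names `accumulate` and `event_image_group` as well as the tuple layout of an event are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical event layout: (timestamp, x, y, polarity).
Event = Tuple[float, int, int, int]


def accumulate(events: Sequence[Event], width: int, height: int) -> List[List[int]]:
    """Sum event polarities per pixel to form one event image (assumed representation)."""
    image = [[0] * width for _ in range(height)]
    for _, x, y, polarity in events:
        image[y][x] += polarity
    return image


def event_image_group(
    events: Sequence[Event], t_start: float, t_end: float, width: int, height: int
) -> List[List[List[int]]]:
    """Split [t_start, t_end] at its midpoint and build one event image per sub-interval."""
    t_mid = (t_start + t_end) / 2.0
    early = [e for e in events if t_start <= e[0] < t_mid]
    late = [e for e in events if t_mid <= e[0] <= t_end]
    return [accumulate(early, width, height), accumulate(late, width, height)]
```

Calling `event_image_group` once on the first interval event data and once on the second interval event data would yield the first and second event image groups, each holding two event images.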

[0082] In some implementations of this embodiment, the fusion module is specifically used to: perform convolution processing on the APS image features to obtain the corresponding first Query matrix, first Key matrix, and first Value matrix; input the first Query matrix, first Key matrix, and first Value matrix into a preset first feature filtering formula for calculation to obtain the first coarse-filtered feature; perform convolution processing on the event image features to obtain the corresponding second Query matrix, second Key matrix, and second Value matrix; input the second Query matrix, second Key matrix, and second Value matrix into a preset second feature filtering formula for calculation to obtain the second coarse-filtered feature; and perform feature fusion based on the first coarse-filtered feature and the second coarse-filtered feature to obtain the fused features.
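The coarse filtering step takes a Query, Key, and Value matrix from a single feature branch, which resembles self-attention. The sketch below uses the standard scaled dot-product form, softmax(Q Kᵀ / scale) V, as a stand-in: the feature filtering formulas themselves are not reproduced in this text, so the softmax normalization and the `scale` parameter are assumptions, and the convolutions that produce Q, K, and V are taken as already applied. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def coarse_filter(q: Matrix, k: Matrix, v: Matrix, scale: float) -> Matrix:
    """Assumed form: softmax(q @ k^T / scale) @ v. Not the formula from the source."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    scores = matmul(q, transpose(k))
    scaled = [[s / scale for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)
```

Under this assumption the same function would be applied twice: once to the first Query, Key, and Value matrices of the APS branch and once to the second set from the event branch. A larger `scale` flattens the attention weights, so the output moves toward the plain average of the Value rows.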

[0083] In some implementations of this embodiment, when the fusion module performs the function of feature fusion based on the first coarse-filtered feature and the second coarse-filtered feature to obtain the fused features, it is specifically used to: perform convolution processing on the first coarse-filtered feature to obtain the corresponding third Query matrix, third Key matrix, and third Value matrix, and perform convolution processing on the second coarse-filtered feature to obtain the corresponding fourth Query matrix, fourth Key matrix, and fourth Value matrix; input the third Query matrix, third Key matrix, third Value matrix, fourth Key matrix, and fourth Value matrix into a preset third feature filtering formula for calculation to obtain the first fine-filtered feature; input the third Key matrix, third Value matrix, fourth Query matrix, fourth Key matrix, and fourth Value matrix into a preset fourth feature filtering formula for calculation to obtain the second fine-filtered feature; and perform feature fusion on the first fine-filtered feature and the second fine-filtered feature to obtain the fused features.
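In the fine filtering step, each branch's Query matrix is combined with the Key and Value matrices of both branches, which resembles cross-modal attention. The sketch below is one possible reading, not the third or fourth feature filtering formula of this application (those formulas are not reproduced in this text): it assumes the Keys and Values of the two branches are concatenated row-wise and that softmax attention is applied. The names `attend` and `fine_filter` and the `scale` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def attend(q: Matrix, k: Matrix, v: Matrix, scale: float = 1.0) -> Matrix:
    """Row-wise softmax(q @ k^T / scale) @ v."""
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out


def fine_filter(
    q3: Matrix, k3: Matrix, v3: Matrix,
    q4: Matrix, k4: Matrix, v4: Matrix,
    scale: float = 1.0,
) -> Tuple[Matrix, Matrix]:
    """Assumed cross-branch attention over the Keys and Values of both branches."""
    keys = k3 + k4      # assumption: row-wise concatenation of both Key matrices
    values = v3 + v4    # assumption: row-wise concatenation of both Value matrices
    first_fine = attend(q3, keys, values, scale)   # uses Q3, K3, V3, K4, V4
    second_fine = attend(q4, keys, values, scale)  # uses Q4, K3, V3, K4, V4
    return first_fine, second_fine
```

The two outputs use exactly the matrix sets listed for the first and second fine-filtered features. The subsequent feature fusion of the two outputs is left out because its operation is not specified in this passage.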

[0084] It should be noted that the pose estimation methods in the foregoing embodiments can all be implemented based on the pose estimation device provided in this embodiment. Those skilled in the art will clearly understand that, for convenience and brevity, the specific working process of the pose estimation device described in this embodiment can be found in the corresponding process of the foregoing method embodiments and is not repeated here.

[0085] Based on the technical solution of the above embodiments of this application, two APS images corresponding to the start and end times of the pose estimation time interval are acquired to form an APS image pair, and first interval event data and second interval event data associated with the pose estimation time interval are acquired; a first event image group is generated based on multiple sub-interval event data in the first interval event data, and a second event image group is generated based on multiple sub-interval event data in the second interval event data; feature extraction is performed on the APS image pair to obtain APS image features, and feature extraction is performed on the first event image group and the second event image group to obtain event image features; feature filtering and fusion are performed on the APS image features and the event image features to obtain fused features; and pose regression is performed based on the fused features to obtain system pose change data. By implementing the solution of this application, the pose change of the system during motion is estimated by combining the data collected by both the event camera and the standard camera. In high-speed and high-dynamic-range scenarios, this provides more motion information for estimating the pose change of the system itself, effectively improving the accuracy and stability of pose estimation.

[0086] It should be noted that the apparatuses and methods disclosed in the several embodiments provided in this application can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of modules is only a logical functional division, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or of other forms.

[0087] The modules described as separate components may or may not be physically separate. Similarly, the components shown as modules may or may not be physical modules; they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0088] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules described above can be implemented in hardware or as software functional modules.

[0089] If the integrated module is implemented as a software functional module and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned readable storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, ROM, RAM, magnetic disks, or optical disks.

[0090] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0091] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0092] The above is a description of the pose estimation method, apparatus, device, and storage medium based on an event camera provided in this application. Those skilled in the art may, based on the ideas of the embodiments of this application, make changes to the specific implementation and scope of application. Therefore, the content of this specification should not be construed as limiting this application.

Claims

1. A pose estimation method based on an event camera, characterized in that it comprises:
acquiring two APS images corresponding to the start and end times of a pose estimation time interval to form an APS image pair, and acquiring first interval event data and second interval event data associated with the pose estimation time interval; wherein the first interval event data corresponds to the APS image at the start time, and the second interval event data corresponds to the APS image at the end time;
generating a first event image group based on multiple sub-interval event data in the first interval event data, and generating a second event image group based on multiple sub-interval event data in the second interval event data; wherein one APS image corresponds to an event image group composed of multiple event images;
performing feature extraction on the APS image pair to obtain APS image features, and performing feature extraction on the first event image group and the second event image group to obtain event image features;
performing feature filtering and fusion on the APS image features and the event image features to obtain fused features;
performing pose regression based on the fused features to obtain system pose change data;
wherein the step of acquiring the first interval event data and the second interval event data associated with the pose estimation time interval includes:
dividing the pose estimation time interval into a first time interval and a second time interval based on the midpoint of the pose estimation time interval;
acquiring the first interval event data corresponding to the first time interval and the second interval event data corresponding to the second time interval, respectively;
wherein the step of performing feature filtering and fusion on the APS image features and the event image features to obtain fused features includes:
performing convolution processing on the APS image features to obtain the corresponding first Query matrix, first Key matrix, and first Value matrix;
inputting the first Query matrix, first Key matrix, and first Value matrix into a preset first feature filtering formula for calculation to obtain the first coarse-filtered feature;
performing convolution processing on the event image features to obtain the corresponding second Query matrix, second Key matrix, and second Value matrix;
inputting the second Query matrix, second Key matrix, and second Value matrix into a preset second feature filtering formula for calculation to obtain the second coarse-filtered feature;
performing convolution processing on the first coarse-filtered feature to obtain the corresponding third Query matrix, third Key matrix, and third Value matrix, and performing convolution processing on the second coarse-filtered feature to obtain the corresponding fourth Query matrix, fourth Key matrix, and fourth Value matrix;
inputting the third Query matrix, third Key matrix, third Value matrix, fourth Key matrix, and fourth Value matrix into a preset third feature filtering formula for calculation to obtain the first fine-filtered feature, and inputting the third Key matrix, third Value matrix, fourth Query matrix, fourth Key matrix, and fourth Value matrix into a preset fourth feature filtering formula for calculation to obtain the second fine-filtered feature;
performing feature fusion on the first fine-filtered feature and the second fine-filtered feature to obtain the fused features;
wherein the first feature filtering formula [formula omitted in source] expresses the first coarse-filtered feature in terms of the first Query matrix, the first Key matrix, and the first Value matrix, together with the variance of the matrix elements and the transpose operation T;
the second feature filtering formula [formula omitted in source] expresses the second coarse-filtered feature in terms of the second Query matrix, the second Key matrix, and the second Value matrix;
the third feature filtering formula [formula omitted in source] expresses the first fine-filtered feature in terms of the third Query matrix, the third Key matrix, and the third Value matrix, together with the fourth Key matrix and the fourth Value matrix;
the fourth feature filtering formula [formula omitted in source] expresses the second fine-filtered feature, in which the fourth Query matrix is additionally used.

2. The pose estimation method according to claim 1, characterized in that, before the step of generating a first event image group based on multiple sub-interval event data in the first interval event data, the method further includes: splitting the first interval event data into two sub-interval event data based on the midpoint of the first time interval, and splitting the second interval event data into two sub-interval event data based on the midpoint of the second time interval.

3. A pose estimation device based on an event camera, characterized in that it comprises:
an acquisition module, used to acquire two APS images corresponding to the start and end times of a pose estimation time interval to form an APS image pair, and to acquire first interval event data and second interval event data associated with the pose estimation time interval; wherein the first interval event data corresponds to the APS image at the start time, and the second interval event data corresponds to the APS image at the end time;
a generation module, used to generate a first event image group based on multiple sub-interval event data in the first interval event data, and to generate a second event image group based on multiple sub-interval event data in the second interval event data; wherein one APS image corresponds to an event image group composed of multiple event images;
an extraction module, used to extract features from the APS image pair to obtain APS image features, and to extract features from the first event image group and the second event image group to obtain event image features;
a fusion module, used to perform feature filtering and fusion on the APS image features and the event image features to obtain fused features;
an estimation module, used to perform pose regression based on the fused features to obtain system pose change data;
wherein acquiring the first interval event data and the second interval event data associated with the pose estimation time interval includes:
dividing the pose estimation time interval into a first time interval and a second time interval based on the midpoint of the pose estimation time interval;
acquiring the first interval event data corresponding to the first time interval and the second interval event data corresponding to the second time interval, respectively;
wherein the fusion module is specifically used for:
performing convolution processing on the APS image features to obtain the corresponding first Query matrix, first Key matrix, and first Value matrix;
inputting the first Query matrix, first Key matrix, and first Value matrix into a preset first feature filtering formula for calculation to obtain the first coarse-filtered feature;
performing convolution processing on the event image features to obtain the corresponding second Query matrix, second Key matrix, and second Value matrix;
inputting the second Query matrix, second Key matrix, and second Value matrix into a preset second feature filtering formula for calculation to obtain the second coarse-filtered feature;
performing convolution processing on the first coarse-filtered feature to obtain the corresponding third Query matrix, third Key matrix, and third Value matrix, and performing convolution processing on the second coarse-filtered feature to obtain the corresponding fourth Query matrix, fourth Key matrix, and fourth Value matrix;
inputting the third Query matrix, third Key matrix, third Value matrix, fourth Key matrix, and fourth Value matrix into a preset third feature filtering formula for calculation to obtain the first fine-filtered feature, and inputting the third Key matrix, third Value matrix, fourth Query matrix, fourth Key matrix, and fourth Value matrix into a preset fourth feature filtering formula for calculation to obtain the second fine-filtered feature;
performing feature fusion on the first fine-filtered feature and the second fine-filtered feature to obtain the fused features;
wherein the first feature filtering formula [formula omitted in source] expresses the first coarse-filtered feature in terms of the first Query matrix, the first Key matrix, and the first Value matrix, together with the variance of the matrix elements and the transpose operation T;
the second feature filtering formula [formula omitted in source] expresses the second coarse-filtered feature in terms of the second Query matrix, the second Key matrix, and the second Value matrix;
the third feature filtering formula [formula omitted in source] expresses the first fine-filtered feature in terms of the third Query matrix, the third Key matrix, and the third Value matrix, together with the fourth Key matrix and the fourth Value matrix;
the fourth feature filtering formula [formula omitted in source] expresses the second fine-filtered feature, in which the fourth Query matrix is additionally used.

4. An electronic device, characterized in that it comprises a memory and a processor, wherein: the processor is used to execute a computer program stored in the memory; and when the processor executes the computer program, it implements the steps of the pose estimation method according to any one of claims 1 to 2.

5. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the pose estimation method according to any one of claims 1 to 2.