A multi-view image feature extraction method based on deep learning
By combining deep learning methods of visibility prediction and occlusion modeling in an autonomous driving environment, the problems of temporal consistency and spatial visibility in multi-view image feature extraction are solved, achieving stable fusion and high-precision alignment of multi-view image features, and improving the environmental perception capability of the autonomous driving system.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU SECOND LIFE TECH CO LTD
- Filing Date
- 2025-12-30
- Publication Date
- 2026-06-12
AI Technical Summary
Existing multi-view image feature extraction methods struggle to model the temporal consistency and spatial visibility between multiple views in autonomous driving and intelligent transportation. As a result, occlusion duration is estimated inaccurately and view transition information is lost, which degrades the stability and accuracy of the fused image features.
A deep learning-based approach is adopted, combining visibility prediction mechanism, occlusion duration modeling and improved Mamba selective state space recursive unit. Multi-camera image features are fused under a unified bird's-eye view coordinate system. Multi-channel sampling results are generated through geometric projection transformation and bilinear interpolation. Write gate and forget gate are used to control the update of hidden state, thereby achieving multi-scale feature enhancement.
It improves the spatial alignment accuracy and occlusion modeling capability of multi-view image features, ensures stable modeling of target visibility changes in complex traffic scenarios, and provides multi-view feature representation with strong continuity, high alignment accuracy and robustness.
[Figure: CN122200570A_ABST (abstract drawing)]
Description
Technical Field
[0001] This invention relates to the fields of computer vision and deep learning technology, and in particular to a multi-view image feature extraction method based on deep learning.
Background Technology
[0002] In the current fields of autonomous driving and intelligent transportation, vehicle perception systems typically rely on multi-view cameras to construct environmental perception models, enabling the identification and tracking of static and dynamic targets in the surrounding scene. Traditional methods often employ frame-by-frame processing, extracting features from images from different perspectives before fusing them. However, this approach often overlooks the geometric consistency and visibility occlusion relationships between multiple perspectives, leading to spatial alignment errors in the fused features. Image-level feature fusion schemes typically rely on fixed positional relationships for projection mapping, making them ill-suited to unstructured environments in complex road scenarios, such as frequent changes in vehicle steering, acceleration, and occlusion, thus interfering with downstream tasks.
[0003] In image feature extraction, while existing convolutional neural network models can effectively capture local texture features, they struggle to model temporal consistency and spatial visibility across multi-view images. Some methods attempt to introduce the Bird's-Eye-View (BEV) space to align image features and enhance spatial consistency, but issues such as inaccurate estimation of occlusion duration and missing viewpoint transition information persist. For the fusion of multi-frame image sequences, most existing methods employ simple temporal sliding windows or attention weighting, lacking a recursive mechanism for historical states, which can easily lead to the accumulation of redundant information or short-term visual bias.
[0004] Especially when dealing with scenes with frequent occlusion and drastic dynamic changes in viewpoint, existing methods struggle to establish stable spatial continuity modeling mechanisms. They lack explicit modeling paths for key semantic features such as occlusion duration and raster-level visibility state changes, resulting in fragmented, drifting, or blurred areas in the generated bird's-eye view. Furthermore, the lack of an effective selective memory mechanism during state modeling prevents weighted updates and dynamic control of visible information at different time scales, impacting the quality of the final multi-view fusion features.
[0005] Therefore, there is an urgent need for a multi-view image feature extraction method that combines visibility prediction, occlusion modeling, and a state recursion mechanism. Such a method should maintain spatial alignment accuracy, improve the modeling of target visibility changes in dynamic scenes, and offer temporal consistency control and multi-scale spatial feature enhancement, so as to provide a more stable and reliable image feature representation for perception tasks in autonomous driving environments.
Summary of the Invention
[0006] One objective of this invention is to propose a multi-view image feature extraction method based on deep learning. This invention fully integrates visibility prediction mechanism, occlusion duration modeling, geometric projection transformation and improved Mamba selective state space recursive unit, describes in detail the unified fusion process of multi-camera images in bird's-eye view coordinate system, and constructs an image feature extraction framework with temporal consistency control and multi-scale spatial enhancement capabilities. It has the advantages of high spatial alignment accuracy, strong occlusion modeling capability and high feature expression stability.
[0007] A deep learning-based multi-view image feature extraction method according to an embodiment of the present invention includes the following steps:
S1. Acquire raw image frames from multiple cameras around the vehicle and establish camera identifiers and time indexes;
S2. Preprocess the original image frames and output intermediate feature maps from multiple perspectives;
S3. Connect the visibility prediction branch to the intermediate feature map, and form a grid-level visibility mask based on the generated grid-level visibility probability map and occlusion duration map;
S4. Process the intermediate feature map, visibility probability map, and occlusion duration map by geometric projection transformation and bilinear interpolation to obtain a multi-channel sampling result set from the viewpoints to the bird's-eye view grid;
S5. Establish a space-filling curve index for the multi-channel sampling result set, input the generated gated input vector into the improved Mamba selective state-space recursive unit, and output and summarize the results to obtain the bird's-eye view features with enhanced consistency;
S6. Use the multi-view fusion feature tensor computed from the bird's-eye view features as the feature extraction result and cache it synchronously; the cached state is input into the recursive unit together with the gated input vector to drive the continuous update of the selective state-space recursive unit.
[0008] Optionally, S1 specifically includes:
S11. Equip the vehicle with multiple cameras fixed around its circumference, assign camera labels, and calibrate the camera intrinsic and extrinsic parameters;
S12. Based on the synchronous trigger signal issued during vehicle operation, acquire the original image frame of each camera under the same time index. Vehicles in autonomous driving systems are usually equipped with multiple cameras installed at different positions, such as the front, sides, and rear. To ensure that these cameras capture images at the same moment, the autonomous driving system uses the synchronous trigger signal to make all cameras start image acquisition simultaneously.
[0009] S13. Correspond the original image frames with the corresponding timestamps, camera identifiers, camera intrinsic parameters and camera extrinsic parameters to construct a time-view joint index table. S14. Based on the camera identifiers and imaging parameters recorded in the time-view joint index table, establish a data structure for the multi-camera image acquisition task at each moment.
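The time-view joint index table of S13 and S14 can be pictured as a lookup keyed by (time index, camera identifier). The following is a minimal sketch only; the class names `FrameRecord` and `TimeViewIndex`, the field names, and the use of an image reference string are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Matrix = Tuple[Tuple[float, ...], ...]


@dataclass(frozen=True)
class FrameRecord:
    """One row of the (hypothetical) time-view joint index table."""
    time_index: int
    camera_id: str
    timestamp: float
    intrinsics: Matrix   # assumed 3x3 camera matrix
    extrinsics: Matrix   # assumed 4x4 vehicle-to-camera transform
    image_ref: str       # assumed handle to the raw image frame


class TimeViewIndex:
    """Stores frame records keyed by (time_index, camera_id)."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[int, str], FrameRecord] = {}

    def add(self, record: FrameRecord) -> None:
        self._table[(record.time_index, record.camera_id)] = record

    def frames_at(self, time_index: int) -> List[FrameRecord]:
        """All camera records captured under one time index, sorted by camera."""
        rows = [r for (t, _), r in self._table.items() if t == time_index]
        return sorted(rows, key=lambda r: r.camera_id)
```

Calling `frames_at(t)` then yields the per-moment data structure that S14 describes, one record per camera.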
[0010] Optionally, S2 specifically includes S21. The preprocessing includes performing distortion correction operations on the original image frames respectively, correcting the geometric deformation of the image using camera intrinsic parameters, and outputting a geometrically corrected image. The geometrically corrected image is subjected to brightness normalization processing, which maps the pixel values to a uniform distribution range to obtain a normalized image frame; S22. Input the normalized image frame into the shared-weight convolutional feature encoder, and extract the intermediate layer feature representation using the preset convolutional layer structure; S23. Perform uniform dimensional adjustment and tensor standardization on the intermediate layer feature map corresponding to each viewpoint to form an intermediate feature map that matches the size of the input image; the intermediate feature map is a three-dimensional tensor structure, each spatial position corresponds to a channel vector, the channel vector is composed of feature values of multiple channel dimensions, and each feature value represents the representation intensity of the spatial position on the corresponding channel. S24. Store each intermediate feature map into the feature buffer according to the camera identifier to provide structured input data for subsequent visibility prediction and geometric projection steps.
[0011] Optionally, S3 specifically includes:
S31. Read the intermediate feature maps of each view from the feature buffer according to the camera identifier, and match them with the time-view joint index table;
S32. Establish a visibility prediction branch on each intermediate feature map. The branch performs convolution, normalization, nonlinear activation, and upsampling in sequence, and outputs a single-channel feature response map through a prediction convolutional layer. The response map is normalized by the Sigmoid function to obtain the visibility probability of each location, forming a grid-level initial visibility probability map with the same size as the intermediate feature map, which is used as the input for generating the occlusion duration map. In detail:
The intermediate feature map of the corresponding viewpoint is input into the first convolutional layer of the visibility prediction branch according to the preset number of channels. The first convolutional layer uses a 3×3 kernel, a stride of 1, and an edge padding of 1, and outputs the first convolutional feature map. Normalization and nonlinear activation are applied to the first convolutional feature map to obtain the second convolutional feature map.
The second convolutional feature map is input into the second convolutional layer, which uses a 3×3 kernel, a stride of 1, and an edge padding of 1, and outputs the third convolutional feature map. The third convolutional feature map is upsampled by bilinear interpolation with a preset upsampling factor so that its size is consistent with the spatial size of the intermediate feature map, yielding the upsampled feature map.
The upsampled feature map is input into the prediction convolutional layer, which uses a 1×1 kernel, a stride of 1, and an edge padding of 0, and outputs the single-channel feature response map. A probability mapping operation maps the response value of each spatial location to a probability value in the interval [0, 1]. Because each spatial location corresponds one-to-one with a grid cell of the intermediate feature map, the result is the grid-level initial visibility probability map.
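The layer settings above can be checked with the standard convolution output-size formula, which shows that both the 3×3 (stride 1, padding 1) and the 1×1 (stride 1, padding 0) layers preserve spatial size, so the probability map lines up cell for cell with the intermediate feature map. The sketch below is illustrative arithmetic only; the function names are assumptions and it does not implement the branch itself.

```python
import math


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a response to [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def response_to_probability(response_map):
    """Apply the Sigmoid cell by cell to a single-channel response map."""
    return [[sigmoid(v) for v in row] for row in response_map]
```

For example, a feature map of height 32 stays at height 32 after each of the three convolutional layers, and a response of 0 maps to a visibility probability of 0.5.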
[0012] S33. Based on the time index, read the grid-level initial visibility probability map of the previous time index from the visibility cache, and compare it cell by cell with the grid-level initial visibility probability map of the current time index. If the visibility probability of the current grid cell is higher than the threshold, reset the occlusion duration value of that cell to zero. If it is lower than the threshold, add one to the cell's occlusion duration value from the previous time index. In this way the occlusion duration map under the current time index is generated and updated.
S34. Binarize the grid-level initial visibility probability map under the current time index and apply connectivity filtering to remove isolated regions and noise, generating a spatially continuous and stable grid-level visibility probability map. Binarization maps the visibility probability value of each grid cell to a binary state: a cell whose probability meets the visibility condition is marked as visible, and any other cell is marked as invisible, yielding a binarized mask composed of visible and invisible cells. Connectivity filtering divides the grid cells of the binarized mask into connected regions according to grid adjacency, removes isolated connected regions whose area is smaller than a preset threshold, and retains only the regions that meet the connectivity requirement.
[0013] The threshold in S33 is used for occlusion time accumulation, and the threshold in S34 is used for generating a probability mask.
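Steps S33 and S34 can be sketched on plain nested lists as follows. This is a minimal illustration under stated assumptions: the default threshold of 0.5, the treatment of a probability exactly equal to the threshold as occluded, and 4-neighbour adjacency are all choices made here, since the text does not fix them.

```python
from collections import deque


def update_occlusion_duration(prob, prev_duration, threshold=0.5):
    """S33: reset to 0 where visible, otherwise previous duration + 1."""
    return [
        [0 if p > threshold else d + 1 for p, d in zip(p_row, d_row)]
        for p_row, d_row in zip(prob, prev_duration)
    ]


def binarize(prob, threshold=0.5):
    """S34 (first half): 1 = visible, 0 = invisible."""
    return [[1 if p > threshold else 0 for p in row] for row in prob]


def remove_small_regions(mask, min_area):
    """S34 (second half): drop visible regions smaller than min_area cells."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one 4-connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for y, x in region:
                    out[y][x] = 0
    return out
```

The two thresholds are kept as separate arguments, matching the note that the S33 threshold drives duration accumulation while the S34 threshold drives mask generation.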
[0014] Optionally, S4 specifically includes: S41. Based on the camera intrinsic and extrinsic parameters in the time-view joint index table, perform geometric projection transformation on the intermediate feature maps of each viewpoint and map them to the preset grid coordinate positions under the unified bird's-eye view reference coordinate system. S42. Perform bilinear interpolation sampling on the intermediate feature map after geometric projection transformation, and map the feature values to the corresponding grid cells through bilinear interpolation to generate a projected feature map with fixed resolution. S43. Perform the same geometric projection transformation and bilinear interpolation operation on the grid-level visibility probability map and the occlusion duration map to ensure that the grid-level visibility probability map and the occlusion duration map maintain a one-to-one correspondence with the projected feature map in the grid position under a unified bird's-eye view reference coordinate system. S44. Under a unified bird's-eye view reference coordinate system, the projection feature map, projection visibility probability map and projection occlusion duration map of each view are stitched together at the channel level to obtain multi-view to bird's-eye view grid sampling results containing multi-channel feature information. S45. Summarize the multi-channel sampling results of all cameras according to the camera identifier and time index to form a multi-view to bird's-eye view grid multi-channel sampling result set under the current time index.
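The projection and sampling of S41 and S42 can be illustrated for a single bird's-eye view grid cell. The sketch assumes an ideal pinhole camera (distortion already removed in S2), extrinsics given as a rotation matrix and translation vector from the vehicle frame to the camera frame, and a single-channel feature map; the function names and parameter layout are hypothetical.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point in the vehicle frame to pixel coordinates (u, v).

    Returns None when the point lies behind the camera.
    """
    xc, yc, zc = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        return None
    return fx * xc / zc + cx, fy * yc / zc + cy


def bilinear_sample(feature, u, v):
    """Bilinearly interpolate a single-channel map at fractional (u, v).

    Returns None when the sample falls outside the map.
    """
    h, w = len(feature), len(feature[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return None
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom
```

Applying the same two functions to the visibility probability map and the occlusion duration map, as S43 requires, keeps all three projected maps aligned on the same grid cells.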
[0015] Optionally, S5 specifically includes: S51. Establish a space-filling curve index for the multi-channel sampling result set from the multiple viewpoints to the bird's-eye view grid under a unified bird's-eye view reference coordinate system. The space-filling curve index arranges all grid cells sequentially along a preset grid scanning path, that is, a path that orders the two-dimensional grid cells of the unified bird's-eye view reference coordinate system into a one-dimensional sequence. The path is a Hilbert curve, which traverses the grid cells recursively so that local spatial adjacency is maintained.
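The Hilbert ordering of S51 can be sketched with the widely used iterative coordinate-to-index conversion. The sketch assumes a square grid whose side length `n` is a power of two; the function names are illustrative, and the patent does not state which Hilbert curve variant or orientation is used.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All grid cells of an n x n grid, sorted along the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda cell: hilbert_index(n, cell[0], cell[1]))
```

Consecutive cells in `hilbert_order(n)` are always direct neighbours in the grid, which is the local adjacency property the step relies on when the two-dimensional grid is fed to a one-dimensional recursive unit.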
[0016] S52. According to the space-filling curve index, concatenate, at each grid cell, the channel vector of the projected intermediate feature map, the probability value of the projected visibility probability map, and the duration value of the projected occlusion duration map to form a gated input vector. S53. Input the gated input vector into the improved Mamba selective state-space recursive unit, calculate the write gate and forget gate based on the current gating parameters, and output the gating decision. The selective state-space recursive unit is a deep neural network module based on a state-space model. It models the gated input vector recursively through parameterized state update and output equations, and during recursion it controls the writing and retention of the hidden state according to the gating decisions output by the write gate and forget gate. The state update equation passes and updates the hidden state between time indices. With the improvements of this invention, the update is no longer a simple linear pass but incorporates the gating decisions of the write and forget gates:
[0017] When the gating decision is to write, the state update equation receives the gating input vector and, in combination with the weights of the write gate output, performs conditional writing on the hidden state of the previous cycle. When the gating decision is forget, the state update equation combines the time decay parameter of the forget gate output to update the hidden state of the previous cycle with time decay.
[0018] The output equation maps the updated hidden state to the bird's-eye view feature at the current time index and aggregates the results of all raster cells to generate a bird's-eye view feature with enhanced consistency.
[0019] In this invention, the result of this step is used as the input for subsequent multi-scale upsampling and residual coupling, and is also cached to participate in the recursive modeling of the next time index.
[0020] The selective state-space recursion unit includes a write gate and a forget gate: The write gate is used to control the conditional writing of the gated input vector to the hidden state, and the forget gate is used to control the time decay update of the hidden state in the previous cycle. S54. When the gating decision is write, perform a conditional write operation on the hidden state of the corresponding grid cell in the previous cycle according to the gating input vector. The conditional write operation refers to the following: when the gating decision is to write, the gating input vector and the hidden state of the corresponding grid cell in the previous period are weighted and combined according to the weight coefficient of the write gate output, and the combination result is updated to the hidden state of the grid cell in the current period.
[0021] S55. When the gating decision is forget, a time decay update operation is performed on the hidden state of the corresponding grid cell in the previous cycle according to the time decay parameter; the time decay parameter is dynamically calculated by the forget gate and is used to control the decay amplitude of the hidden state. The time decay update operation refers to the following: when the gating decision is to forget, the hidden state of the corresponding grid cell in the previous period is weighted and calculated with the time decay parameter to reduce but not completely clear the influence of the historical hidden state, and the weighted result is used as the hidden state of the grid cell in the current period.
[0022] S56. Output and summarize the hidden states of all grid cells after processing by the selective state space recursion unit under the current time index to generate a consistent bird's-eye view feature.
[0023] Optionally, S6 specifically includes:
S61. Input the consistency-enhanced bird's-eye view features into a multi-scale upsampling module composed of parallel upsampling branches. Each branch upsamples the input features at a different magnification, yielding multi-scale bird's-eye view features:
First branch: perform ×2 bilinear interpolation upsampling on the input features, doubling the spatial size of the feature map, to obtain the first-scale bird's-eye view features;
Second branch: perform ×4 bilinear interpolation upsampling on the input features, expanding the spatial size of the feature map to four times the original, to obtain the second-scale bird's-eye view features;
Third branch: perform ×8 bilinear interpolation upsampling on the input features, expanding the spatial size of the feature map to eight times the original, to obtain the third-scale bird's-eye view features.
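The three parallel branches of S61 differ only in their magnification, so one bilinear upsampling routine called with factors 2, 4, and 8 covers them. The sketch below works on a single-channel grid and assumes that the corner cells of input and output coincide (often called the "align corners" convention); the patent does not specify the sampling convention, and the function name is hypothetical.

```python
def upsample_bilinear(grid, factor):
    """Upsample a single-channel 2D grid by an integer factor.

    Assumes corner-aligned sampling: the first and last cells of each axis
    map exactly onto the first and last cells of the output.
    """
    h, w = len(grid), len(grid[0])
    out_h, out_w = h * factor, w * factor

    def source_coord(i, n_in, n_out):
        return i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0

    out = []
    for i in range(out_h):
        v = source_coord(i, h, out_h)
        y0 = int(v)
        y1 = min(y0 + 1, h - 1)
        ay = v - y0
        row = []
        for j in range(out_w):
            u = source_coord(j, w, out_w)
            x0 = int(u)
            x1 = min(x0 + 1, w - 1)
            ax = u - x0
            top = (1 - ax) * grid[y0][x0] + ax * grid[y0][x1]
            bottom = (1 - ax) * grid[y1][x0] + ax * grid[y1][x1]
            row.append((1 - ay) * top + ay * bottom)
        out.append(row)
    return out
```

With a 2×2 input, the three branches produce 4×4, 8×8, and 16×16 outputs respectively.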
[0024] S62. The multi-scale bird's-eye view features are spliced and standardized in the channel dimension to form a multi-scale fusion feature tensor. The splicing and standardization operation refers to: aligning the bird's-eye view features of each scale pixel by pixel in the channel dimension and splicing them into a unified tensor, and performing a normalization operation on the unified tensor in the channel dimension to ensure the consistency of the numerical distribution of features at different scales and to form a multi-scale fusion feature tensor.
[0025] S63. Divide the multi-scale fusion feature tensor into two computation paths. The first path performs a convolution operation on the multi-scale fusion feature tensor. The second path keeps the original value of the multi-scale fusion feature tensor unchanged. The outputs of the two paths are added at the corresponding positions to obtain the residual coupling feature tensor. S64. Perform nonlinear activation and normalization operations on the residual coupled feature tensor to output a multi-view fused feature tensor under a unified bird's-eye view reference coordinate system. S65. The multi-view fusion feature tensor is used as the feature extraction result, and the hidden state corresponding to the current time index is kept in synchronous cache. S66. Under the next time index, the cached hidden state and the current gating input vector are jointly input into the recursive unit to drive the continuous update of the selective state space recursive unit.
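The residual coupling of S63 adds a convolved copy of the tensor to the unchanged tensor, position by position. The sketch below stands in a 1×1 (pointwise) convolution for the unspecified convolution so that the two paths keep the same shape, and uses ReLU as the nonlinear activation of S64; both are assumptions, and the normalization of S64 is omitted for brevity.

```python
def pointwise_conv(tensor, weights):
    """1x1 convolution on an H x W x C tensor; weights has shape C_out x C_in."""
    return [
        [[sum(w * v for w, v in zip(w_row, cell)) for w_row in weights] for cell in line]
        for line in tensor
    ]


def residual_couple(tensor, weights):
    """First path: convolution. Second path: identity. Output: element-wise sum."""
    conv = pointwise_conv(tensor, weights)
    return [
        [[a + b for a, b in zip(c_cell, t_cell)] for c_cell, t_cell in zip(c_line, t_line)]
        for c_line, t_line in zip(conv, tensor)
    ]


def relu(tensor):
    """Assumed nonlinear activation applied after the residual sum."""
    return [[[max(0.0, v) for v in cell] for cell in line] for line in tensor]
```

Because the second path passes the input through untouched, the sum is only defined when the convolution keeps the channel count, which is why `weights` must be square here.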
[0026] Optionally, driving the continuous update specifically includes: at the next time index, the cached hidden state and the current gated input vector are input together into the improved Mamba selective state-space recursive unit. The recursive unit first calculates the output values of the write gate and the forget gate and then generates the gating decision. When the gating decision is to write, the weight coefficient output by the write gate is used to combine the gated input vector with the hidden state of the previous cycle by weighting, which updates the hidden state of the current cycle. When the gating decision is to forget, the time decay parameter calculated by the forget gate is used to apply a time decay update to the hidden state of the previous cycle, which gives the hidden state of the current cycle. The updated hidden state is cached and used as the recursive input for the next time index.
[0027] Optionally, the improved Mamba selective state-space recursion unit includes an input convolutional feature encoder, a write gate, a forget gate, a state update equation, and an output equation, which are executed sequentially within the recursion cycle.
[0028] The input convolutional feature encoder (also referred to below as the input encoding module) receives a gated input vector from each grid cell. This gated input vector is formed by concatenating the channel vector of the projected intermediate feature map, the probability value of the projected visibility probability map, and the duration value of the projected occlusion duration map. The input encoding module performs linear transformations and nonlinear mappings on the gated input vector using a set of preset weight parameters and activation functions to obtain the encoded feature representation, which serves as the basic input for the subsequent gate computation.
[0029] The write gate calculates the write gate weight at the current time index based on the feature representation output by the input encoding module. This module contains a set of independent linear transformation parameters and sigmoid activation units, used to map the feature representation to weight coefficients in the interval [0, 1]. These weight coefficients represent the degree of writing to the current input state as a control factor.
[0030] The forget gate adopts the same structure and computational process as the write gate, receives the same feature representation input, and independently calculates the forget gate weight. The forget gate weight is used to control the retention strength of the hidden state of the previous cycle and serves as the basis for calculating the time decay factor.
[0031] The state update process receives the hidden state from the previous cycle, the gating input vector for the current cycle, and the write gate weights and forget gate weights. The gating input vector is weighted according to the write gate weights to form candidate write states; the hidden state from the previous cycle is time-decayed according to the forget gate weights to form decayed states; finally, the candidate write states and decayed states are weighted and added together to generate the updated hidden state for the current cycle.
[0032] The state update equation outputs the updated hidden state for the current period to the next module and caches this hidden state for reading in the next time index recursion period. The output module also retains the hidden states of all raster cells output in the current period as the basis for generating consistent enhanced bird's-eye view features under a unified bird's-eye view reference coordinate system.
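The gated state update described above can be sketched for a single grid cell. This is a simplified illustration, not the Mamba selective scan itself: it assumes the encoded input and the hidden state have the same dimension, that each gate is one linear layer followed by a sigmoid, and that the candidate write state and the decayed state are combined by a plain sum. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, bias, vec):
    """Dense layer: weights is out_dim x in_dim, bias has out_dim entries."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def gated_step(hidden, encoded, write_params, forget_params):
    """One recursion cycle for one grid cell.

    write_params and forget_params are (weights, bias) pairs; both gates read
    the same encoded feature representation but use independent parameters.
    """
    write_gate = [sigmoid(z) for z in linear(*write_params, encoded)]
    forget_gate = [sigmoid(z) for z in linear(*forget_params, encoded)]
    # Candidate write state plus time-decayed previous hidden state.
    return [w * x + f * h for w, x, f, h in zip(write_gate, encoded, forget_gate, hidden)]


def run_sequence(initial_hidden, encoded_sequence, write_params, forget_params):
    """Carry the cached hidden state across successive time indices."""
    hidden = list(initial_hidden)
    for encoded in encoded_sequence:
        hidden = gated_step(hidden, encoded, write_params, forget_params)
    return hidden
```

A forget gate weight near 1 with a write gate weight near 0 leaves the cached state almost unchanged, which corresponds to retaining history for a cell that is currently occluded; the reverse setting overwrites the state with the current observation.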
[0033] The beneficial effects of this invention are as follows. This invention establishes a space-filling curve index under a unified bird's-eye view reference coordinate system and constructs a gated input vector from the channel vector of the projected intermediate feature map, the probability value of the projected visibility probability map, and the duration value of the projected occlusion duration map. This achieves precise spatial alignment of features from multiple cameras and avoids the geometric mismatch that arises in multi-view feature fusion in traditional methods.
[0034] This invention introduces write and forget gates into the improved Mamba selective state space recursive unit, enabling the interaction between the gated input vector and the hidden state to have conditional write and time decay update capabilities. This effectively controls the update path of the hidden state under multiple time indices, making the cross-temporal feature extraction process more stable.
[0035] This invention combines grid-level visibility probability and grid-level occlusion duration dynamic modeling in the feature processing process, which can continuously characterize occlusion in complex traffic scenarios and ensure stable transmission and fusion of multi-view image features when the vehicle turns, accelerates or the environment changes.
[0036] This invention transforms consistent bird's-eye view features into multi-view fusion feature tensors through multi-scale upsampling, stitching, residual coupling, and normalization. Combined with a hidden state caching mechanism, it achieves recursive updates across time indexes, thereby providing multi-view feature representations with strong continuity, high alignment accuracy, and excellent robustness for environmental perception in autonomous driving and intelligent transportation scenarios.
Attached Figure Description
[0037] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0038] Figure 1 is an overall flowchart of the deep learning-based multi-view image feature extraction method proposed in this invention;
Figure 2 is a schematic diagram of the improved Mamba algorithm structure of the method;
Figure 3 is a schematic diagram of the selective state-space recursive unit of the method;
Figure 4 is a flowchart illustrating the generation of the gated input vector of the method.
Detailed Implementation
[0039] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0040] Referring to Figures 1-4, a deep learning-based multi-view image feature extraction method includes the following steps:
S1. Acquire raw image frames from multiple cameras around the vehicle and establish camera identifiers and time indexes;
S2. Preprocess the original image frames and output intermediate feature maps from multiple perspectives;
S3. Connect the visibility prediction branch to the intermediate feature map, and form a grid-level visibility mask based on the generated grid-level visibility probability map and occlusion duration map;
S4. Process the intermediate feature map, visibility probability map, and occlusion duration map by geometric projection transformation and bilinear interpolation to obtain a multi-channel sampling result set from the viewpoints to the bird's-eye view grid;
S5. Establish a space-filling curve index for the multi-channel sampling result set, input the generated gated input vector into the selective state-space recursive unit of the improved Mamba algorithm, and output and summarize the results to obtain the bird's-eye view features with enhanced consistency;
S6. Use the multi-view fusion feature tensor computed from the bird's-eye view features as the feature extraction result and cache it synchronously; the cached state is input into the recursive unit together with the gated input vector to drive the continuous update of the selective state-space recursive unit.
[0041] This invention proposes a deep learning-based multi-view image feature extraction method. It acquires raw image frames from multiple cameras around a vehicle and establishes camera identifiers and time indices. Preprocessing is then performed on the images to obtain intermediate feature maps from multiple perspectives. A visibility prediction branch is then added to these intermediate feature maps to generate a raster-level visibility probability map and an occlusion duration map, forming a raster-level visibility mask. These features are mapped to a unified bird's-eye view coordinate system using geometric projection transformation and bilinear interpolation to obtain a multi-channel sampling result set. A gated input vector is constructed based on the space-filling curve index and input into an improved Mamba selective state-space recursive unit to output consistent bird's-eye view features. Multi-scale computation generates and caches a multi-view fusion feature tensor, and the gated input vector drives the recursive unit to achieve continuous updates of the time index.
[0042] In this embodiment, S1 specifically includes: S11. Equip the vehicle with multiple cameras fixed at circumferential positions, set camera labels, and complete the calibration of the camera intrinsic and extrinsic parameters; S12. Based on the synchronous trigger signal during vehicle movement, acquire the original image frames corresponding to each camera under the same time index; S13. Associate the original image frames with the corresponding timestamps, camera identifiers, camera intrinsic parameters and camera extrinsic parameters to construct a time-view joint index table; S14. Based on the camera identifiers and imaging parameters recorded in the time-view joint index table, establish a data structure for the multi-camera image acquisition task at each moment.
[0043] This invention configures multiple cameras fixedly installed in the circumferential position of a vehicle, completes the camera identification setting and calibration of camera intrinsic and extrinsic parameters, and uses synchronous trigger signals to collect original image frames from various perspectives under the same time index during vehicle operation. The collected image frames are linked with timestamps, camera identification and imaging parameters to establish a time-view joint index table, and a structured data foundation is built for the multi-camera image acquisition task at each moment based on this index table.
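One possible layout for the time-view joint index table is sketched below. It is a minimal illustration only: the class names, field names, and the choice of a dictionary keyed by (time index, camera identifier) are assumptions and are not specified in this description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Matrix = List[List[float]]


@dataclass(frozen=True)
class FrameRecord:
    """One row of the time-view joint index table (field names are illustrative)."""
    time_index: int
    camera_id: str
    timestamp_s: float
    intrinsics: Matrix  # 3x3 camera intrinsic matrix
    extrinsics: Matrix  # 4x4 camera extrinsic transform
    image_path: str     # hypothetical location of the raw frame


@dataclass
class JointIndexTable:
    """Rows keyed by (time index, camera identifier)."""
    rows: Dict[Tuple[int, str], FrameRecord] = field(default_factory=dict)

    def add(self, record: FrameRecord) -> None:
        self.rows[(record.time_index, record.camera_id)] = record

    def frames_at(self, time_index: int) -> List[FrameRecord]:
        """All camera frames recorded under one time index, ordered by camera."""
        hits = [r for (t, _), r in self.rows.items() if t == time_index]
        return sorted(hits, key=lambda r: r.camera_id)


IDENTITY3 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
IDENTITY4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

table = JointIndexTable()
for cam in ("rear", "front"):
    # Placeholder calibration and file names, used only to populate the table.
    table.add(FrameRecord(0, cam, 0.0, IDENTITY3, IDENTITY4, f"{cam}_000000.png"))
```

Calling `table.frames_at(0)` then returns the records of every camera triggered at time index 0, which corresponds to the per-moment acquisition structure described in S14.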
[0044] In this embodiment, S2 specifically includes S21. The preprocessing includes performing distortion correction operations on the original image frames respectively, correcting the geometric deformation of the image using camera intrinsic parameters, and outputting a geometrically corrected image. The geometrically corrected image is subjected to brightness normalization processing, which maps the pixel values to a uniform distribution range to obtain a normalized image frame; S22. Input the normalized image frame into the shared-weight convolutional feature encoder, and extract the intermediate layer feature representation using the preset convolutional layer structure; S23. Perform uniform dimensional adjustment and tensor standardization on the intermediate layer feature map corresponding to each viewpoint to form an intermediate feature map that matches the size of the input image; the intermediate feature map is a three-dimensional tensor structure, each spatial position corresponds to a channel vector, the channel vector is composed of feature values of multiple channel dimensions, and each feature value represents the representation intensity of the spatial position on the corresponding channel. S24. Store each intermediate feature map into the feature cache area according to the camera identifier.
[0045] This invention performs distortion correction and brightness normalization preprocessing on the original image frame to generate a normalized image frame, and inputs it into a shared-weight convolutional feature encoder to extract intermediate layer feature representations. It then performs uniform dimensional adjustment and tensor standardization on the intermediate layer feature maps of each viewpoint to form an intermediate feature map with a three-dimensional tensor structure composed of channel vectors and feature values. Finally, the intermediate feature map is stored in the feature cache area according to the camera identifier to provide input data for subsequent feature processing.
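The brightness normalization step can be illustrated with a short sketch. The description only states that pixel values are mapped to a uniform range; min-max scaling to [0, 1] is assumed here, and the function name and the cache dictionary are hypothetical. Distortion correction and the convolutional encoder are omitted.

```python
from typing import Dict, List

Image = List[List[float]]


def normalize_brightness(image: Image) -> Image:
    """Min-max scale pixel values to [0, 1]; a constant image maps to all zeros."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / span for p in row] for row in image]


# Hypothetical cache keyed by camera identifier. S24 stores feature maps this way;
# normalized frames are stored here only to keep the sketch short.
frame_cache: Dict[str, Image] = {}
frame_cache["front"] = normalize_brightness([[50.0, 100.0], [150.0, 250.0]])
```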
[0046] In this embodiment, S3 specifically includes: S31. Read the intermediate feature maps of each view from the feature buffer according to the camera identifier, and match them with the time-view joint index table; S32. A visibility prediction branch is established on each intermediate feature map. The visibility prediction branch performs convolution, normalization, nonlinear activation and upsampling operations in sequence. A single-channel feature response map is output through the prediction convolution layer. A probability mapping operation is performed on the single-channel feature response map. The single-channel feature response map is normalized by the Sigmoid function to obtain the corresponding visibility probability, forming a grid-level initial visibility probability map with the same size as the intermediate feature map, which is used as the input for generating the occlusion duration map. S33. Based on the time index, read the initial probability map of the raster-level visibility under the previous time index from the visibility cache. Compare the initial probability map of the raster-level visibility under the current time index with the result of the previous time index raster by raster. If the visibility probability of the current raster is higher than the threshold, reset the occlusion duration value of the corresponding raster to zero. If the visibility probability of the current raster is lower than the threshold, add one to the occlusion duration value of the previous time index, generate and update the occlusion duration map under the current time index. S34. Perform binarization on the initial raster-level visibility probability map under the current time index, and combine connectivity filtering to remove isolated regions and noise, generating a spatially continuous and stable raster-level visibility probability map. 
This invention reads intermediate feature maps from the feature buffer according to the camera identifier and matches them with the time-view joint index table. A visibility prediction branch is established on the intermediate feature map to generate a raster-level visibility initial probability map. The occlusion duration map is updated by comparing the results of the previous cycle with the time index. Then, the initial probability map is binarized and connectivity filtered to obtain a spatially continuous and stable raster-level visibility probability map. Finally, a raster-level visibility mask is generated based on the visibility probability map and the occlusion duration map and stored in the visibility buffer.
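The per-cell occlusion duration rule of S33 and the binarization of S34 can be sketched as follows. The threshold value of 0.5 and the treatment of a probability exactly equal to the threshold (counted as occluded) are assumptions, since the description does not fix them; connectivity filtering is omitted.

```python
from typing import List

ProbMap = List[List[float]]
CountMap = List[List[int]]


def update_occlusion_duration(prob: ProbMap, prev_duration: CountMap,
                              threshold: float = 0.5) -> CountMap:
    """Per cell: reset to 0 when visible, otherwise previous duration plus one."""
    return [
        [0 if p > threshold else d + 1 for p, d in zip(prob_row, dur_row)]
        for prob_row, dur_row in zip(prob, prev_duration)
    ]


def binarize(prob: ProbMap, threshold: float = 0.5) -> CountMap:
    """1 where the cell is predicted visible, 0 where it is predicted occluded."""
    return [[1 if p > threshold else 0 for p in row] for row in prob]


current_prob = [[0.9, 0.2], [0.4, 0.7]]
previous_duration = [[3, 1], [0, 5]]
duration = update_occlusion_duration(current_prob, previous_duration)
mask = binarize(current_prob)
```

In this example the two visible cells are reset to zero regardless of their history, while the two occluded cells each advance by one time index.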
[0047] In this embodiment, S4 specifically includes: S41. Based on the camera intrinsic and extrinsic parameters in the time-view joint index table, perform geometric projection transformation on the intermediate feature maps of each viewpoint and map them to the preset grid coordinate positions under the unified bird's-eye view reference coordinate system. S42. Perform bilinear interpolation sampling on the intermediate feature map after geometric projection transformation, and map the feature values to the corresponding grid cells through bilinear interpolation to generate a projected feature map with fixed resolution. S43. Perform the same geometric projection transformation and bilinear interpolation operation on the raster-level visibility probability map and the occlusion duration map; S44. Under a unified bird's-eye view reference coordinate system, the projection feature map, projection visibility probability map and projection occlusion duration map of each view are stitched together at the channel level to obtain multi-view to bird's-eye view grid sampling results containing multi-channel feature information. S45. Summarize the multi-channel sampling results of all cameras according to the camera identifier and time index to form a multi-view to bird's-eye view grid multi-channel sampling result set under the current time index.
[0048] This invention performs geometric projection transformation on intermediate feature maps of each viewpoint based on camera intrinsic and extrinsic parameters in the time-view joint index table and maps them to a unified bird's-eye view reference coordinate system. It generates a fixed-resolution projection feature map using bilinear interpolation sampling and performs the same projection and sampling operations on the grid-level visibility probability map and the occlusion duration map. Then, it performs channel-level stitching on the projection feature map, projection visibility probability map, and projection occlusion duration map of each viewpoint under the unified bird's-eye view reference coordinate system, and finally summarizes them to obtain a multi-channel sampling result set of multi-view to bird's-eye view grid under the current time index.
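The bilinear interpolation sampling of S42 can be sketched for a single channel as below. The projected pixel position of each bird's-eye view cell is taken as given; in practice it would be derived from the camera intrinsic and extrinsic parameters. The function names and the border-clamping policy are assumptions.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]


def bilinear_sample(fmap: Map2D, x: float, y: float) -> float:
    """Sample a single-channel map at continuous (x, y); x indexes columns, y rows.

    Coordinates outside the map are clamped to the border.
    """
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
    bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_to_grid(fmap: Map2D,
                   pixel_coords: List[List[Tuple[float, float]]]) -> Map2D:
    """Fill each grid cell by sampling fmap at that cell's projected pixel position."""
    return [[bilinear_sample(fmap, x, y) for (x, y) in row] for row in pixel_coords]


feature = [[0.0, 10.0], [20.0, 30.0]]
# Hypothetical projected positions for a 1x2 bird's-eye view grid.
coords = [[(0.5, 0.5), (1.0, 0.0)]]
bev = sample_to_grid(feature, coords)
```

The same routine would be applied to the visibility probability map and the occlusion duration map (S43), so that all three share one grid before channel-level stitching.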
[0049] In this embodiment, S5 specifically includes: S51. Establish a space filling curve index for the multi-channel sampling result set from multiple viewpoints to the bird's-eye view grid under a unified bird's-eye view reference coordinate system. The space filling curve index arranges all grid cells sequentially according to the preset grid scanning path. S52. According to the space filling curve index, the channel vector corresponding to the projection intermediate feature map, the probability value of the projection visibility probability map, and the duration value of the projection occlusion duration map at each grid cell are spliced together to form a gated input vector. S53. Input the gating input vector into the improved Mamba selective state space recursive unit, calculate the write gate and forget gate based on the current gating parameters, and output the gating decision. The selective state-space recursive unit is a deep neural network module based on a state-space model. It realizes the recursive modeling of the gated input vector through parameterized state update equations and output equations, and controls the writing and retention of the hidden state of the gated input vector based on the output gating decisions of the write gate and forget gate during the recursive process. The selective state-space recursion unit includes write gates and forget gates: The write gate is used to control the conditional writing of the gated input vector to the hidden state, and the forget gate is used to control the time decay update of the hidden state in the previous cycle. S54. When the gating decision is write, perform a conditional write operation on the hidden state of the corresponding grid cell in the previous cycle according to the gating input vector. S55. When the gating decision is forget, perform a time decay update operation on the hidden state of the corresponding grid cell in the previous cycle according to the time decay parameter. S56. Output and summarize the hidden states of all grid cells after processing by the selective state space recursion unit under the current time index to generate a consistent bird's-eye view feature.
[0050] This invention establishes a space-filling curve index for the multi-channel sampling result set under a unified bird's-eye view reference coordinate system. Based on the index, the channel vector of the projected intermediate feature map, the probability value of the projected visibility probability map, and the duration value of the projected occlusion duration map are concatenated to form a gated input vector. The gated input vector is then input into an improved Mamba selective state-space recursive unit. The gated decision outputs of the write gate and forget gate are used to perform conditional write or time decay update operations, respectively. Finally, the output is summarized to obtain a bird's-eye view feature with enhanced consistency.
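The write and forget behaviour of S54 and S55 can be illustrated with a deliberately simplified per-cell update. This is not the state-space recursion of Mamba and not the equations of the invention, which are not given in this description. The gate formulas below (visibility as the write weight, exponential decay in the occlusion duration), the threshold, and the decay rate are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def gated_update(hidden: Sequence[float], features: Sequence[float],
                 visibility: float, occluded_steps: float,
                 write_threshold: float = 0.5,
                 decay_rate: float = 0.1) -> List[float]:
    """One update of a grid cell's hidden state (illustrative gate formulas)."""
    if visibility > write_threshold:
        # Write decision: blend the new features into the state, weighted by visibility.
        return [(1.0 - visibility) * h + visibility * x
                for h, x in zip(hidden, features)]
    # Forget decision: keep the old state but decay it; longer occlusion decays more.
    decay = math.exp(-decay_rate * occluded_steps)
    return [decay * h for h in hidden]


# A fully visible cell takes the new features; an occluded cell only decays.
written = gated_update([1.0, 2.0], [3.0, 4.0], visibility=1.0, occluded_steps=0)
decayed = gated_update([2.0], [9.0], visibility=0.0, occluded_steps=10)
```

The point of the sketch is the control flow: the visibility probability and occlusion duration carried in the gated input vector decide whether a cell's state is overwritten or merely aged.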
[0051] In this embodiment, S6 specifically includes: S61. The bird's-eye view features are input to a multi-scale upsampling module, which consists of parallel upsampling branches. Each upsampling branch performs upsampling operations at different magnifications for the input bird's-eye view features with enhanced consistency, thereby obtaining multi-scale bird's-eye view features. S62. The multi-scale bird's-eye view features are spliced and standardized in the channel dimension to form a multi-scale fusion feature tensor. S63. Divide the multi-scale fusion feature tensor into two computation paths. The first path performs a convolution operation on the multi-scale fusion feature tensor. The second path keeps the original value of the multi-scale fusion feature tensor unchanged. The outputs of the two paths are added at the corresponding positions to obtain the residual coupling feature tensor. S64. Perform nonlinear activation and normalization operations on the residual coupled feature tensor to output a multi-view fused feature tensor under a unified bird's-eye view reference coordinate system. S65. The multi-view fusion feature tensor is used as the feature extraction result, and the hidden state corresponding to the current time index is kept in synchronous cache. S66. Under the next time index, the cached hidden state and the current gating input vector are jointly input into the recursive unit to drive the continuous update of the selective state space recursive unit.
[0052] In this embodiment, the improved Mamba algorithm introduces a gated input vector and adds write and forget gates to control the conditional writing and time decay updates of the hidden state.
[0053] This invention generates multi-scale bird's-eye view features by inputting the consistency-enhanced bird's-eye view features into parallel multi-scale upsampling branches, and then concatenates and normalizes them in the channel dimension to form a multi-scale fusion feature tensor. The outputs of a convolution path and an identity path are added position by position to obtain a residual coupled feature tensor. Nonlinear activation and normalization are then applied to output a multi-view fusion feature tensor in a unified bird's-eye view reference coordinate system. This tensor is cached synchronously as the feature extraction result and, at the next time index, is input into the recursive unit together with the gated input vector to drive the continuous update of the selective state-space recursive unit. The improved Mamba algorithm introduces a gated input vector and adds a write gate and a forget gate to realize the conditional writing and time decay update of the hidden state.
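The residual coupling of S63 followed by the nonlinear activation of S64 can be sketched for the channel vector of one grid cell. A 1x1 convolution (a per-cell channel mixing matrix) and ReLU are assumed here; the description does not specify the kernel size or the activation function, and the final normalization is omitted.

```python
from typing import List


def residual_couple(x: List[float], weights: List[List[float]]) -> List[float]:
    """Add a 1x1-convolution path and an identity path for one cell's channels."""
    conv = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [c + v for c, v in zip(conv, x)]


def relu(x: List[float]) -> List[float]:
    """Element-wise nonlinear activation (assumed to be ReLU)."""
    return [max(0.0, v) for v in x]


channels = [1.0, -2.0]
mixing = [[0.5, 0.0], [0.0, 0.5]]  # hypothetical learned 1x1 kernel
coupled = residual_couple(channels, mixing)
activated = relu(coupled)
```

Because the identity path passes the input through unchanged, the convolution path only has to model a correction to the fused features.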
[0054] Example 1: To verify the feasibility of this invention in practice, it was applied to the circumferential environmental perception task of an autonomous driving test vehicle. The vehicle was equipped with eight wide-angle cameras fixedly mounted at the front, rear, and sides. The camera sampling frequency was 30 frames per second, and all cameras acquired images simultaneously via a synchronous trigger signal during operation. The test scenario was a typical intelligent transportation environment, with roads including multi-lane straight sections, intersections, and multiple vehicles participating in the traffic. This scenario presented complex lighting conditions, such as alternating periods of strong direct sunlight and shadowed areas, along with occlusion phenomena, such as large trucks obscuring smaller cars and bicycles weaving through traffic. Traditional multi-camera image fusion methods are prone to inaccurate feature alignment and unstable modeling of occlusion persistence in such scenarios, thus affecting the environmental perception robustness of the autonomous driving system.
[0055] During the application, the vehicle drove continuously for 40 minutes on the test road section, accumulating approximately 72,000 sets of multi-camera image frames. All images first underwent distortion correction and brightness normalization preprocessing, and then intermediate feature maps were extracted by a convolutional feature encoder with shared weights. Subsequently, a visibility prediction branch was connected to each intermediate feature map to generate a raster-level visibility initial probability map with the same size as the image, and the occlusion duration map was updated by combining it with historical frames. Through binarization and connectivity filtering operations, a spatially continuous and stable visibility probability map was obtained, which was further used to generate a raster-level visibility mask. Table 1 shows the comparative experimental data, covering key indicators such as target alignment error, occlusion recovery rate, perceptual latency, and frame rate.
[0056] Table 1 Comparison of Experimental Data for Multi-View Feature Extraction Methods
[0057] Under a unified bird's-eye view reference coordinate system, the intermediate feature map, visibility probability map, and occlusion duration map are mapped to a fixed-resolution bird's-eye view grid through geometric projection and bilinear interpolation, generating multi-channel sampling results. Then, based on the Hilbert space-filling curve index, the channel vectors of the projected intermediate feature map, the probability values of the projected visibility probability map, and the projected occlusion duration values are concatenated to form a gated input vector, which is then input into an improved Mamba selective state-space recursive unit. During the recursion process, the write gate and forget gate control the conditional writing and temporal decay updates of the hidden states, respectively, ensuring consistency of the bird's-eye view features under multiple temporal indices. The output bird's-eye view features are then upsampled at multiple scales, concatenated, coupled with residuals, and normalized to form a multi-view fusion feature tensor under the unified bird's-eye view reference coordinate system, which is cached for subsequent temporal index recursion.
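The Hilbert space-filling curve index mentioned above can be computed with the standard coordinate-to-distance conversion, sketched below for a square grid whose side is a power of two. The grid size and function names are assumptions; the actual grid scanning path of the invention may differ.

```python
from typing import List, Tuple


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve of an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_scan_order(n: int) -> List[Tuple[int, int]]:
    """All grid cells sorted by their Hilbert index."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))


order = hilbert_scan_order(4)
```

Consecutive cells in `order` are always direct neighbours in the grid, which is why such a curve keeps spatially adjacent cells close together in the sequence fed to a recursive unit.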
[0058] The test results are shown in Table 1. Compared with traditional methods based on convolution and simple weighted fusion in the same scenario, the multi-view fusion features generated by this invention perform better in terms of target boundary clarity, occlusion recovery stability, and cross-frame consistency. Taking the detection of a small target 20 meters in front of the vehicle as an example, the feature alignment error of the traditional method is about 0.38 meters, while that of the method of this invention stays within 0.12 meters. Under continuous occlusion for 5 seconds, the feature recovery rate of the traditional method is only 62%, while that of the method of this invention reaches 89%. Under complex lighting conditions, the average perception latency of the traditional method is 210 milliseconds, while that of the method of this invention is reduced to 145 milliseconds. Throughout the entire test, the method of this invention maintained real-time operation on the 72,000 sets of multi-camera image inputs, with an average processing frame rate of 27 frames per second, meeting the real-time requirements of autonomous driving scenarios.
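As a quick arithmetic check, the figures quoted in Example 1 can be related to each other as follows. The snippet only restates numbers given in the text; the derived percentages are not separately reported results, and because the alignment error is stated as an upper bound (within 0.12 meters), the computed reduction is a lower bound.

```python
# Frame count: 40 minutes of driving at 30 frames per second.
frame_sets = 40 * 60 * 30

# Relative reductions computed from the quoted values.
alignment_reduction = (0.38 - 0.12) / 0.38    # alignment error, meters
latency_reduction = (210 - 145) / 210          # perception latency, milliseconds
recovery_gain_points = 89 - 62                 # recovery rate, percentage points

print(frame_sets)
print(round(alignment_reduction * 100, 1), round(latency_reduction * 100, 1))
```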
[0059] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A multi-view image feature extraction method based on deep learning, characterized in that it includes the following steps: S1. Acquire raw image frames from multiple cameras around the vehicle and establish camera identifiers and time indexes; S2. Perform preprocessing on the original image frames and output intermediate feature maps from multiple perspectives; S3. Connect the visibility prediction branch to the intermediate feature map, and form a grid-level visibility mask based on the generated grid-level visibility probability map and occlusion duration map; S4. The intermediate feature map, visibility probability map and occlusion duration map are processed by geometric projection transformation and bilinear interpolation to obtain a multi-channel sampling result set; S5. Establish a space-filling curve index for the multi-channel sampling result set, input the generated gated input vector into the selective state-space recursive unit of the improved Mamba algorithm, and output and summarize to obtain the bird's-eye view features; S6. Input the calculated multi-view fusion feature tensor and the gated input vector together to drive the continuous update of the selective state space recursive unit.
2. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S1 specifically includes S11. Equip the vehicle with multiple cameras fixed in the circumferential position, set camera labels and complete the calibration of camera internal and external parameters; S12. Based on the synchronous trigger signal during vehicle movement, acquire the original image frames corresponding to each camera under the same time index; S13. Correspond the original image frames with the corresponding timestamps, camera identifiers, camera intrinsic parameters and camera extrinsic parameters to construct a time-view joint index table. S14. Based on the camera identifiers and imaging parameters recorded in the time-view joint index table, establish a data structure for the multi-camera image acquisition task at each moment.
3. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S2 specifically includes S21. The preprocessing includes performing distortion correction operations on the original image frames respectively, correcting the geometric deformation of the image using camera intrinsic parameters, and outputting a geometrically corrected image. The geometrically corrected image is subjected to brightness normalization processing, which maps the pixel values to a uniform distribution range to obtain a normalized image frame; S22. Input the normalized image frame into the shared-weight convolutional feature encoder, and extract the intermediate layer feature representation using the preset convolutional layer structure; S23. Perform uniform dimensional adjustment and tensor standardization on the intermediate layer feature map corresponding to each viewpoint to form an intermediate feature map that matches the size of the input image; the intermediate feature map is a three-dimensional tensor structure, each spatial position corresponds to a channel vector, the channel vector is composed of feature values of multiple channel dimensions, and each feature value represents the representation intensity of the spatial position on the corresponding channel. S24. Store each intermediate feature map into the feature cache area according to the camera identifier.
4. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S3 specifically includes: S31. Read the intermediate feature maps of each view from the feature buffer according to the camera identifier, and match them with the time-view joint index table; S32. A visibility prediction branch is established on each intermediate feature map. The visibility prediction branch performs convolution, normalization, nonlinear activation and upsampling operations in sequence. A single-channel feature response map is output through the prediction convolution layer. A probability mapping operation is performed on the single-channel feature response map. The single-channel feature response map is normalized by the Sigmoid function to obtain the corresponding visibility probability, forming a grid-level visibility probability map with the same size as the intermediate feature map, which serves as the input for generating the occlusion duration map. S33. Based on the time index, read the initial probability map of the raster-level visibility under the previous time index from the visibility cache. Compare the initial probability map of the raster-level visibility under the current time index with the result of the previous time index raster by raster. If the visibility probability of the current raster is higher than the threshold, reset the occlusion duration value of the corresponding raster to zero. If the visibility probability of the current raster is lower than the threshold, add one to the occlusion duration value of the previous time index, generate and update the occlusion duration map under the current time index. S34. Perform binarization on the initial raster-level visibility probability map under the current time index, and combine connectivity filtering to remove isolated regions and noise, generating a spatially continuous and stable raster-level visibility probability map.
5. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S4 specifically includes: S41. Based on the camera intrinsic and extrinsic parameters in the time-view joint index table, perform geometric projection transformation on the intermediate feature maps of each viewpoint and map them to the preset grid coordinate positions under the unified bird's-eye view reference coordinate system. S42. Perform bilinear interpolation sampling on the intermediate feature map after geometric projection transformation, and map the feature values to the corresponding grid cells through bilinear interpolation to generate a projected feature map with fixed resolution. The fixed-resolution projection feature map is generated by aligning the positions of the preset grid coordinates. S43. Perform the same geometric projection transformation and bilinear interpolation operation on the raster-level visibility probability map and the occlusion duration map; S44. Under a unified bird's-eye view reference coordinate system, the projection feature map, projection visibility probability map and projection occlusion duration map of each view are stitched together at the channel level to obtain multi-view to bird's-eye view grid sampling results containing multi-channel feature information. S45. Summarize the multi-channel sampling results of all cameras according to the camera identifier and time index to form a multi-view to bird's-eye view grid multi-channel sampling result set under the current time index.
6. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S5 specifically includes: S51. Establish a space filling curve index for the multi-channel sampling result set from multiple viewpoints to the bird's-eye view grid under a unified bird's-eye view reference coordinate system. The space filling curve index arranges all grid cells sequentially according to the preset grid scanning path. S52. According to the space filling curve index, the channel vector corresponding to the projection intermediate feature map, the probability value of the projection visibility probability map, and the duration value of the projection occlusion duration map at each grid cell are spliced together to form a gated input vector. S53. Input the gating input vector into the improved Mamba selective state space recursive unit, calculate the write gate and forget gate based on the current gating parameters, and output the gating decision. The selective state-space recursive unit is a deep neural network module based on a state-space model. It realizes the recursive modeling of the gated input vector through parameterized state update equations and output equations, and controls the writing and retention of the hidden state of the gated input vector based on the output gating decisions of the write gate and forget gate during the recursive process. The selective state-space recursion unit includes write gates and forget gates: The write gate is used to control the conditional writing of the gated input vector to the hidden state, and the forget gate is used to control the time decay update of the hidden state in the previous cycle. S54. When the gating decision is write, perform a conditional write operation on the hidden state of the corresponding grid cell in the previous cycle according to the gating input vector. S55. When the gating decision is forget, perform a time decay update operation on the hidden state of the corresponding grid cell in the previous cycle according to the time decay parameter. S56. Output and summarize the hidden states of all grid cells after processing by the selective state space recursion unit under the current time index to generate a consistent bird's-eye view feature.
7. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that, S6 specifically includes: S61. The bird's-eye view features are input to a multi-scale upsampling module, which consists of parallel upsampling branches. Each upsampling branch performs upsampling operations at different magnifications for the input bird's-eye view features with enhanced consistency, thereby obtaining multi-scale bird's-eye view features. S62. The multi-scale bird's-eye view features are spliced and standardized in the channel dimension to form a multi-scale fusion feature tensor. S63. Divide the multi-scale fusion feature tensor into two computation paths. The first path performs a convolution operation on the multi-scale fusion feature tensor. The second path keeps the original value of the multi-scale fusion feature tensor unchanged. The outputs of the two paths are added at the corresponding positions to obtain the residual coupling feature tensor. S64. Perform nonlinear activation and normalization operations on the residual coupled feature tensor to output a multi-view fused feature tensor under a unified bird's-eye view reference coordinate system. S65. The multi-view fusion feature tensor is used as the feature extraction result, and the hidden state corresponding to the current time index is kept in synchronous cache. S66. Under the next time index, the cached hidden state and the current gating input vector are jointly input into the recursive unit to drive the continuous update of the selective state space recursive unit.
8. The method for multi-view image feature extraction based on deep learning according to claim 1, characterized in that the improved Mamba algorithm introduces a gated input vector and adds a write gate and a forget gate to control the conditional writing and time decay updates of the hidden state.