A multi-task cooperative perception method for automatic parking

By acquiring bird's-eye view images of the vehicle's surroundings and extracting multi-scale features using a backbone network with a cross-stage local connectivity structure, combined with a multi-task decoupling architecture for parking space detection and road element segmentation, the technical problems of feature conflict and computational efficiency in the automatic parking scenario are solved, thereby improving detection accuracy and segmentation precision.

CN122244828A · Pending · Publication Date: 2026-06-19 · DONGFENG MOTOR GRP

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DONGFENG MOTOR GRP
Filing Date
2026-03-16
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multi-task perception models in automatic parking scenarios suffer from technical problems such as difficulty in achieving both accuracy and efficiency due to the conflict between the features of parking space detection and road segmentation tasks, as well as the difficulty in balancing multi-scale feature extraction and computational efficiency in the backbone network.

Method used

By acquiring a bird's-eye view of the vehicle's surroundings, multi-scale features are extracted using a backbone network with a cross-stage local connectivity structure, and feature fusion is performed. Based on a multi-task decoupling architecture, parking space detection and road element segmentation tasks are executed independently.

Benefits of technology

It improves the accuracy of parking space detection and road element segmentation in complex scenarios, meets the requirements of location sensitivity for detection tasks and category discrimination for segmentation tasks, and solves the bottleneck problems of feature conflict and computational efficiency.



Abstract

This application discloses a multi-task collaborative perception method for automated parking, relating to the field of visual detection technology. The method includes: acquiring a bird's-eye view of the target environment surrounding the vehicle; extracting features from the bird's-eye view based on a pre-set backbone network to generate multi-scale feature maps; fusing features from the multi-scale feature maps to generate a fused feature map; and performing parking space detection and road element segmentation on the fused feature map based on a multi-task decoupling architecture to obtain parking space detection results and road element segmentation results. This application acquires a bird's-eye view of the vehicle's surroundings, extracts and fuses multi-scale features using a backbone network, and, based on a multi-task decoupling architecture, executes parking space detection and road element segmentation in parallel by independent task heads. This application achieves collaborative optimization and feature decoupling of the two types of tasks, improving detection accuracy and segmentation precision in complex scenarios.

Description

Technical Field

[0001] This application relates to the field of visual inspection technology, and in particular to a multi-task collaborative perception method for automatic parking.

Background Technology

[0002] In automated parking scenarios, vehicles need to simultaneously perform two tasks: parking space detection and road element segmentation, to adapt to complex and ever-changing parking environments. Currently, deep learning-based multi-task collaborative optimization models have become the mainstream technical solution in this field because they can extract common visual features through a shared backbone network and assign different task heads to perform detection and segmentation separately, thereby improving computational efficiency and reducing deployment costs.

[0003] However, existing models of this type still face significant technical bottlenecks in practical applications. On the one hand, parking space detection focuses on accurate target localization and boundary regression, while road element segmentation requires pixel-level semantic classification. The feature representation requirements of the two types of tasks are fundamentally different, but existing models usually adopt a strongly coupled feature sharing mechanism, leading to feature conflicts between tasks and making it difficult to balance the perception accuracy of both types of tasks in complex scenarios such as occlusion. On the other hand, mainstream backbone networks such as Darknet or ResNet often cannot effectively balance the needs of fine-grained spatial details and global semantic information when extracting multi-scale features, and the computational overhead is large, making it difficult to meet the real-time deployment requirements of in-vehicle platforms. Therefore, a multi-task collaborative perception method for automatic parking is urgently needed to solve the aforementioned technical problems.

Summary of the Invention

[0004] The summary section introduces a series of simplified concepts, which will be further explained in detail in the detailed description section. This summary section is not intended to limit the key and essential technical features of the claimed technical solutions, nor is it intended to determine the scope of protection of the claimed technical solutions.

[0005] This application aims to address the technical challenges of existing multi-task perception models in automated parking scenarios, where conflicting features between parking space detection and road segmentation tasks lead to difficulties in achieving both accuracy and computational efficiency, as well as the difficulty in balancing multi-scale feature extraction and computational efficiency within the backbone network. The proposed method acquires a bird's-eye view of the vehicle's surroundings, extracts and fuses multi-scale features using a backbone network, and employs a decoupled multi-task architecture to execute parking space detection and road element segmentation in parallel by independent task heads. This application achieves collaborative optimization and feature decoupling between the two tasks, improving detection accuracy and segmentation precision in complex scenarios.

[0006] First, this application provides a multi-task cooperative perception method for automated parking, including: acquiring a target bird's-eye view of the environment surrounding the target vehicle; performing feature extraction on the target bird's-eye view based on a preset backbone network to generate multi-scale feature maps; fusing the multi-scale feature maps to generate a fused feature map; and performing parking space detection and road element segmentation on the fused feature map based on a multi-task decoupling architecture to obtain parking space detection results and road element segmentation results.

[0007] In some implementations, acquiring the target bird's-eye view of the environment surrounding the target vehicle includes: acquiring multiple surround-view images of the target vehicle's surroundings using multiple fisheye cameras deployed around the target vehicle; performing distortion correction on each surround-view image based on preset camera calibration parameters to generate multiple corrected images; performing viewpoint transformation on the corrected images to generate multiple bird's-eye view images; and stitching and fusing the bird's-eye view images at the pixel level to generate the target bird's-eye view of the environment surrounding the target vehicle.

[0008] In some implementations, the preset backbone network is a backbone network with a cross-stage local connectivity structure, and extracting features from the target bird's-eye view based on the preset backbone network to generate multi-scale feature maps includes: sequentially processing the target bird's-eye view through the backbone layer and multiple stage layers of that backbone network to extract features, generating multi-scale feature maps at multiple different levels.

[0009] In some implementations, the multiple stage layers include a first stage layer, a second stage layer, a third stage layer, and a fourth stage layer, and sequentially processing the target bird's-eye view through the backbone layer and the multiple stage layers includes: performing a first downsampling on the target bird's-eye view based on the backbone layer to generate a first intermediate feature map; performing a second downsampling on the first intermediate feature map based on the first stage layer and extracting features through a first cross-stage local connection module to generate a first-level feature map; performing a third downsampling on the first-level feature map based on the second stage layer and extracting features through a second cross-stage local connection module to generate a second-level feature map; performing a fourth downsampling on the second-level feature map based on the third stage layer and extracting features through a third cross-stage local connection module to generate a third-level feature map; and performing a fifth downsampling on the third-level feature map based on the fourth stage layer and extracting features through a fourth cross-stage local connection module to generate a fourth-level feature map.
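The spatial sizes produced by the five downsampling steps above can be traced with a short Python sketch. The application does not state the downsampling factor or the input size; a factor of 2 per step and a 512 x 512 input are assumptions made here for illustration, and `pyramid_shapes` is a hypothetical helper name.

```python
def pyramid_shapes(height, width, num_downsamples=5, stride=2):
    """Return the (height, width) after each successive downsampling step.

    Assumption: every step reduces resolution by `stride` (2 here); the
    application does not specify the factor.
    """
    shapes = []
    for _ in range(num_downsamples):
        # Ceiling division, so odd sizes round up as with padded strided convolution.
        height = -(-height // stride)
        width = -(-width // stride)
        shapes.append((height, width))
    return shapes


names = ["first intermediate", "first-level", "second-level",
         "third-level", "fourth-level"]
for name, (h, w) in zip(names, pyramid_shapes(512, 512)):
    print(f"{name} feature map: {h} x {w}")
```

Under these assumptions the first-level feature map is at 1/4 of the input resolution and the fourth-level feature map at 1/32.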

[0010] In some implementations, after the fourth-level feature map is generated, the method further includes: performing channel adjustment on the fourth-level feature map to generate a pooling input feature map; inputting the pooling input feature map into a first pooling branch, a second pooling branch, a third pooling branch, and a fourth pooling branch, wherein the first pooling branch takes the pooling input feature map unchanged as a first pooling feature map, the second pooling branch applies a first max pooling layer to the pooling input feature map to generate a second pooling feature map, the third pooling branch applies the first max pooling layer to the second pooling feature map to generate a third pooling feature map, and the fourth pooling branch applies the first max pooling layer to the third pooling feature map to generate a fourth pooling feature map; concatenating the first, second, third, and fourth pooling feature maps along the channel dimension to generate a first concatenated feature map; and performing channel compression on the first concatenated feature map to generate an enhanced fourth-level feature map.
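The cascaded pooling branches can be sketched in plain Python as follows. This is a minimal illustration, assuming a stride-1 max pooling layer whose padding keeps the spatial size unchanged (needed for channel-wise concatenation); the kernel size is not given in the application and is a free parameter here. The channel adjustment and channel compression steps are omitted, and the function names are hypothetical.

```python
def max_pool_same(fmap, kernel=5):
    """Stride-1 max pooling over a 2-D map; border windows are clipped,
    so the output has the same size as the input."""
    rows, cols = len(fmap), len(fmap[0])
    r = kernel // 2
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [
                fmap[y][x]
                for y in range(max(0, i - r), min(rows, i + r + 1))
                for x in range(max(0, j - r), min(cols, j + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def cascaded_pooling(channels, kernel=5):
    """channels: list of 2-D maps (one per channel).

    Branch 1 passes the input through; branches 2 to 4 reuse the same
    pooling layer on the previous branch's output. The result is the
    concatenation of all four branches along the channel dimension.
    """
    branch1 = channels
    branch2 = [max_pool_same(c, kernel) for c in branch1]
    branch3 = [max_pool_same(c, kernel) for c in branch2]
    branch4 = [max_pool_same(c, kernel) for c in branch3]
    return branch1 + branch2 + branch3 + branch4
```

Because each branch pools the previous branch's output, later branches see progressively larger receptive fields while only one pooling layer is reused.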

[0011] In some implementations, fusing the multi-scale feature maps to generate the fused feature map includes: performing a first convolution process on the enhanced fourth-level feature map to generate a first adjusted feature map; performing a first upsampling process on the first adjusted feature map to generate a first upsampled feature map; performing a second convolution process on the third-level feature map to generate a second adjusted feature map; performing a first fusion process on the first upsampled feature map and the second adjusted feature map to generate a first fused feature map; performing a second upsampling process on the first fused feature map to generate a second upsampled feature map; performing a third convolution process on the second-level feature map to generate a third adjusted feature map; performing a second fusion process on the second upsampled feature map and the third adjusted feature map to generate a second fused feature map; performing a third upsampling process on the second fused feature map to generate a third upsampled feature map; performing a fourth convolution process on the first-level feature map to generate a fourth adjusted feature map; and performing a third fusion process on the third upsampled feature map and the fourth adjusted feature map to generate the fused feature map.
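The order of the upsampling and fusion steps above can be sketched on single-channel maps. The application does not say how upsampling or fusion is carried out; this sketch assumes nearest-neighbour 2x upsampling and element-wise addition, and stands in for each convolution process with a pluggable `adjust` function that defaults to the identity. All names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 in both dimensions."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2-D maps (assumed fusion operator)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(level1, level2, level3, level4, adjust=lambda f: f):
    """Fuse from the deepest map to the shallowest one.

    `adjust` is a placeholder for the per-level convolution process.
    """
    fused = adjust(level4)                    # first adjusted feature map
    for lateral in (level3, level2, level1):  # three upsample-and-fuse rounds
        fused = add_maps(upsample2x(fused), adjust(lateral))
    return fused


def constant(n, value):
    """Build an n x n map filled with `value` (toy input)."""
    return [[value] * n for _ in range(n)]


result = top_down_fuse(constant(8, 1000), constant(4, 100),
                       constant(2, 10), constant(1, 1))
print(len(result), len(result[0]), result[0][0])
```

The output has the resolution of the first-level feature map, which is why the fused feature map can carry both shallow spatial detail and deep semantics.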

[0012] In some implementations, the multi-task decoupling architecture includes a parking space detection head module and a road element segmentation head module, and performing parking space detection and road element segmentation on the fused feature map based on the multi-task decoupling architecture includes: inputting the fused feature map into the parking space detection head module and processing it through multiple branches of the parking space detection head module to generate the parking space detection result; and inputting the fused feature map into the road element segmentation head module and processing it through the road element segmentation head module to generate the road element segmentation result.

[0013] In some implementations, the parking space detection head module includes a heatmap branch, a corner branch, a center point offset branch, and a corner point offset branch, and generating the parking space detection result through the multiple branches of the parking space detection head module includes: convolving the fused feature map based on the heatmap branch to generate a first heatmap characterizing the position and category of the parking space center point; convolving the fused feature map based on the corner branch to generate a second heatmap characterizing the positions of the parking space corner points; convolving the fused feature map based on the center point offset branch to generate a first offset map characterizing the coordinate offset of the parking space center point; convolving the fused feature map based on the corner point offset branch to generate a second offset map characterizing the coordinate offset of each corner point of the parking space; and determining the parking space detection result based on the first heatmap, the second heatmap, the first offset map, and the second offset map.
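The application does not describe how the heatmaps and offset maps are combined into a detection result. One common approach for heatmap-based detectors is sketched below as an assumption: keep cells that are local maxima above a score threshold, then refine each cell position with its predicted sub-cell offset and scale back to input pixels. The threshold, the stride of 4, and the function name are hypothetical.

```python
def decode_centers(heatmap, offsets, threshold=0.5, stride=4):
    """Decode center points from a score heatmap and an offset map.

    heatmap: 2-D list of scores in [0, 1].
    offsets: 2-D list of (dx, dy) sub-cell offsets, same shape as heatmap.
    Returns a list of (x, y, score) in input-image pixels.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    detections = []
    for i in range(rows):
        for j in range(cols):
            score = heatmap[i][j]
            if score < threshold:
                continue
            # Keep only local maxima in the 3x3 neighbourhood.
            neighbours = [
                heatmap[y][x]
                for y in range(max(0, i - 1), min(rows, i + 2))
                for x in range(max(0, j - 1), min(cols, j + 2))
            ]
            if score < max(neighbours):
                continue
            dx, dy = offsets[i][j]
            detections.append(((j + dx) * stride, (i + dy) * stride, score))
    return detections


heatmap = [[0.0, 0.0, 0.0],
           [0.0, 0.6, 0.9],
           [0.0, 0.0, 0.0]]
offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
offsets[1][2] = (0.25, 0.5)
print(decode_centers(heatmap, offsets))  # [(9.0, 6.0, 0.9)]
```

The same routine could be applied to the corner heatmap and the corner point offset map; how corners are grouped with centers into parking spaces is not specified in the application and is left out here.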

[0014] In some implementations, generating the road element segmentation result through the road element segmentation head module includes: performing multiple rounds of feature compression and upsampling on the fused feature map based on the road element segmentation head module to generate the road element segmentation result.

[0015] In some implementations, the road element segmentation head module includes a first batch-normalized basic residual block, a second batch-normalized basic residual block, a first convolutional combination layer, a first transposed convolutional layer, a second convolutional combination layer, a second transposed convolutional layer, and an output convolutional combination layer, and performing multiple rounds of feature compression and upsampling on the fused feature map based on the road element segmentation head module includes: performing a first feature compression on the fused feature map based on the first batch-normalized basic residual block to generate a first compressed feature map; performing a second feature compression on the first compressed feature map based on the second batch-normalized basic residual block to generate a second compressed feature map; convolving the second compressed feature map based on the first convolutional combination layer to generate a first intermediate feature map, wherein the first convolutional combination layer includes convolution, batch normalization, and activation function processing; upsampling the first intermediate feature map based on the first transposed convolutional layer to generate a first upsampled feature map; convolving the first upsampled feature map based on the second convolutional combination layer to generate a second intermediate feature map, wherein the second convolutional combination layer includes convolution and batch normalization processing; upsampling the second intermediate feature map a second time based on the second transposed convolutional layer to generate a second upsampled feature map; and convolving the second upsampled feature map based on the output convolutional combination layer to generate a segmentation prediction map, which is determined as the road element segmentation result. The output convolutional combination layer includes convolution and batch normalization processing, and the segmentation prediction map has the same size as the target bird's-eye view.
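The requirement that the segmentation prediction map match the size of the target bird's-eye view can be checked with simple size arithmetic. The sketch below uses the standard output-size formula for a transposed convolution (dilation 1). The kernel size, stride, and padding are assumptions, as is the premise that the fused feature map sits at 1/4 of the input resolution; none of these values appear in the application.

```python
def transposed_conv_size(size, kernel=2, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def seg_head_output_size(fused_size, num_upsamples=2):
    """Apply the two transposed convolutional layers in sequence.

    The residual blocks and convolutional combination layers are assumed
    to preserve spatial size, so only the upsampling layers matter here.
    """
    size = fused_size
    for _ in range(num_upsamples):
        size = transposed_conv_size(size)
    return size


# A 512 x 512 bird's-eye view with a fused map at 1/4 resolution (128 x 128):
print(seg_head_output_size(128))  # 512
```

With two 2x upsampling layers, the head restores the input resolution exactly when the fused feature map is at 1/4 scale, which is consistent with fusing down to the first-level feature map under a factor-2 downsampling assumption.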

[0016] Second, this application provides a multi-task cooperative perception device for automatic parking, comprising: an image acquisition unit, used to acquire a target bird's-eye view of the environment surrounding the target vehicle; a feature extraction unit, used to extract features from the target bird's-eye view based on a preset backbone network to generate multi-scale feature maps; a feature fusion unit, used to fuse the multi-scale feature maps to generate a fused feature map; and a task-aware unit, used to perform parking space detection and road element segmentation on the fused feature map based on a multi-task decoupling architecture to obtain parking space detection results and road element segmentation results.

[0017] In summary, the multi-task collaborative perception method for automated parking provided in this application offers a unified coordinate system and distortion-free visual input by acquiring a target bird's-eye view of the target vehicle's surrounding environment. This avoids the spatial distortion caused by lens distortion in the original fisheye images, ensuring that the feature extraction process is based on accurate spatial relationships. Based on a preset backbone network, features are extracted from the target bird's-eye view to generate multi-scale feature maps. This allows the model to simultaneously capture fine-grained texture details of small targets such as parking lines and global semantic information of large scenes such as the overall road layout, providing rich multi-level feature representations. The multi-scale feature maps are fused to generate a fused feature map. This integrates shallow high-resolution detail features with deep strong semantic features across scales, giving the fused feature map both accurate spatial localization capability and rich semantic understanding capability, and meets the location sensitivity requirement of the detection task and the category discrimination requirement of the segmentation task. Based on a multi-task decoupling architecture, parking space detection and road element segmentation are performed on the fused feature map. Because parallel and independent task processing paths serve the feature preferences of target localization and pixel classification respectively, the two types of tasks avoid mutual interference during feature learning. This allows the detection head to focus on the regression of parking space boundaries and corners, and the segmentation head to focus on the pixel-level semantic classification of road elements. The two heads are jointly optimized on the shared fused features and finally output parking space detection results and road element segmentation results, achieving an overall improvement in multi-task perception performance in complex parking scenarios.

Attached Figure Description

[0018] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit this specification. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings: Figure 1 is a flowchart of a multi-task collaborative perception method for automatic parking provided in an embodiment of this application; Figure 2 is a schematic diagram of a multi-task collaborative perception device for automatic parking provided in an embodiment of this application.

Detailed Implementation

[0019] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus. The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0020] This application is primarily applied to in-vehicle environmental perception systems in automated parking scenarios. Specifically, it involves using visual sensors to perceive the surrounding environment in real time during low-speed driving or parking to identify available parking spaces and understand ground traffic signs. This method can be deployed in intelligent vehicles equipped with a panoramic surround-view system, providing automated parking functions with perception information on parking space location and type, as well as road elements such as lane lines and arrow markings, assisting the vehicle in completing parking path planning and decision-making control.

[0021] To facilitate understanding of the technical solution of this application, some terms appearing in the text are explained below.

[0022] The target bird's-eye view in this application refers to a top-down view of the vehicle's surrounding environment, generated by synchronously acquiring surround-view images through multiple fisheye cameras deployed around the vehicle, performing distortion correction and perspective transformation on each image, and then stitching and fusing the resulting bird's-eye view images at the pixel level. This image eliminates the distortion of the original fisheye images and provides a seamless top-down panoramic image centered on the vehicle.

[0023] The preset backbone network in this application refers to a deep convolutional neural network structure used to extract visual features from input images. The preset backbone network adopted is a neural network with a cross-stage local connection structure. The cross-stage local connection structure refers to a network design that divides the input feature map into two parts along the channel dimension: one part is passed through directly, while the other part undergoes deep processing by multiple bottleneck units, after which the two parts are concatenated and fused. This design enhances feature representation ability while reducing computational load.
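The split, transform, and concatenate pattern described above can be illustrated with a minimal Python sketch. Channels are represented as list elements, the bottleneck units are replaced by an arbitrary `transform` callable, and the split is an even half-and-half division; the split ratio and the function name are assumptions, not details from the application.

```python
def csp_block(channels, transform):
    """Cross-stage split along the channel dimension.

    channels: list with one entry per channel.
    transform: callable applied only to the second part (stand-in for
    the stack of bottleneck units).
    """
    half = len(channels) // 2
    shortcut, deep = channels[:half], channels[half:]
    deep = transform(deep)   # only this part pays the heavy computation
    return shortcut + deep   # concatenate along the channel dimension


# Toy example: four "channels", with the transform scaling its inputs.
print(csp_block([1, 2, 3, 4], lambda part: [v * 10 for v in part]))  # [1, 2, 30, 40]
```

Only part of the channels pass through the expensive transform, which is where the reduction in computational load comes from.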

[0024] The multi-scale feature maps in this application refer to multiple feature maps at different levels output after the target bird's-eye view image is input into a preset backbone network and then downsampled and feature extracted level by level through the backbone layer and multiple stage layers. These feature maps contain features at different levels, ranging from fine-grained spatial details to global semantic information.

[0025] The fused feature map in this application refers to the feature map generated by cross-scale fusion of multi-scale feature maps through a feature pyramid network. Specifically, it involves upsampling the deep feature map and then fusing it with the shallow feature map, so that the final feature map simultaneously possesses high-resolution spatial details and rich semantic information.

[0026] The multi-task decoupling architecture of this application refers to an architecture that inputs the fused feature map into independent parking space detection head modules and road element segmentation head modules for processing. This architecture avoids mutual interference between the two types of tasks during the feature learning process by separating the task processing paths, allowing the detection head to focus on target localization and boundary regression, and the segmentation head to focus on pixel-level semantic classification.

[0027] The parking space detection head module of this application refers to a neural network module used to generate parking space detection results, including a heatmap branch, a corner branch, a center point offset branch, and a corner point offset branch. The heatmap branch outputs a first heatmap representing the position and category of the parking space's center point; the corner branch outputs a second heatmap representing the position of the parking space's corner points; the center point offset branch outputs a first offset map representing the coordinate offset of the parking space's center point; and the corner point offset branch outputs a second offset map representing the coordinate offset of each corner point of the parking space. The information output by each branch collectively constitutes the parking space detection result.

[0028] The road element segmentation head module of this application refers to a neural network module used to generate road element segmentation results. It includes a first batch-normalized basic residual block, a second batch-normalized basic residual block, a first convolutional combination layer, a first transposed convolutional layer, a second convolutional combination layer, a second transposed convolutional layer, and an output convolutional combination layer. This module sequentially performs multiple rounds of feature compression and upsampling on the fused feature map to output a segmentation prediction map of the same size as the target bird's-eye view, which represents pixel-level classification results for road elements such as lane lines, arrows, and no-stopping zones.
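To read a class label per pixel from a segmentation prediction map, one common convention is to take the highest-scoring class at each pixel. The application does not state how class labels are derived, so the per-pixel argmax below and the class list are assumptions for illustration.

```python
def argmax_classes(scores):
    """Convert per-class score maps into a class-index map.

    scores[c][i][j] is the score of class c at pixel (i, j).
    Returns a 2-D list of class indices.
    """
    num_classes = len(scores)
    rows, cols = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][i][j])
         for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical class list; a 1 x 2 image with three class score maps.
class_names = ["background", "lane line", "arrow"]
scores = [[[0.1, 0.7]],   # background
          [[0.8, 0.2]],   # lane line
          [[0.1, 0.1]]]   # arrow
labels = argmax_classes(scores)
print([[class_names[k] for k in row] for row in labels])  # [['lane line', 'background']]
```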

[0029] Referring to Figure 1, which is a flowchart of a multi-task cooperative perception method for automatic parking provided in an embodiment of this application, the method includes: S110. Acquire a target bird's-eye view of the target vehicle's surrounding environment. For example, multiple fisheye cameras deployed around the target vehicle synchronously acquire surround-view images of the surroundings. Each image is then subjected to distortion correction to eliminate the geometric distortion caused by the lens. The corrected images are then converted from the original viewpoint to a top-down bird's-eye view. Finally, the multiple bird's-eye view images are stitched and fused at the pixel level to generate a seamless target bird's-eye view that covers the environment around the vehicle.

[0030] S120. Based on the preset backbone network, extract features from the target bird's-eye view to generate multi-scale feature maps. For example, the preset backbone network is a neural network with a cross-stage local connection structure. The backbone layer performs preliminary feature extraction and downsampling on the input target bird's-eye view, after which multiple stage layers perform progressive downsampling and deep feature extraction. Each stage layer uses the cross-stage local connection structure to split, transform, and fuse features, reducing the amount of computation while enhancing feature representation, and the network outputs multiple feature maps at different levels. These feature maps contain multi-scale visual information ranging from high-resolution spatial details to low-resolution global semantics.

[0031] S130. Perform feature fusion on the multi-scale feature maps to generate a fused feature map. For example, the feature maps from multiple different levels extracted by the backbone network are input into a feature pyramid network. Through a top-down upsampling path and lateral connections, the strong semantic information of the deep feature maps is progressively passed to and fused with the shallow high-resolution feature maps. After multiple upsampling and fusion operations, a fused feature map containing both spatial details and semantic information is generated. This fused feature map integrates multi-scale features, providing the parking space detection and road element segmentation tasks with a unified visual representation that has both localization accuracy and semantic understanding capability.

[0032] S140. Based on a multi-task decoupling architecture, parking space detection and road element segmentation are performed on the fused feature map to obtain parking space detection results and road element segmentation results.

[0033] For example, the generated fused feature maps are input into the parking space detection head module and the road element segmentation head module in the multi-task decoupled architecture, respectively. Parking space detection and road element segmentation tasks are executed separately through parallel and independent processing paths. This architecture, through decoupling, allows the detection head to focus on feature learning for target localization and boundary regression, while the segmentation head focuses on feature learning for pixel-level semantic classification, avoiding mutual interference between the two tasks during feature extraction. The parking space detection head outputs parking space detection results containing parking space location, category, and keypoint offset information. The road element segmentation head outputs a pixel-level classification map of the same size as the input image, serving as the road element segmentation result, thus enabling perception of the vehicle's surrounding environment.

[0034] In summary, this application provides a unified coordinate system and distortion-free spatial reference for visual perception by acquiring a target bird's-eye view of the target vehicle's surrounding environment. This avoids the spatial distortion caused by lens distortion in the original fisheye images, enabling feature extraction to be based on accurate positional relationships. Based on a preset backbone network with a cross-stage local connectivity structure, features are extracted from the target bird's-eye view and multi-scale feature maps are generated. This allows the model to reduce computational load while capturing fine-grained spatial details and global semantic information, addressing the difficulty mainstream backbone networks have in balancing multi-scale feature extraction and computational efficiency. By fusing the multi-scale feature maps to generate a fused feature map, shallow high-resolution features and deep strong semantic features are integrated across scales, giving the fused feature map both accurate spatial localization capability and strong semantic understanding capability, and thereby satisfying both the location sensitivity required by the detection task and the category discrimination required by the segmentation task. Based on a multi-task decoupling architecture, the fused feature map is input into an independent parking space detection head module and road element segmentation head module for parallel processing. By separating the task processing paths, the detection head can focus on target localization and boundary regression, while the segmentation head can focus on pixel-level semantic classification. This avoids mutual interference between the two types of tasks during feature learning, addressing the problem in existing multi-task models that feature coupling makes it difficult to achieve accuracy on both tasks at once. Ultimately, the method achieves collaborative optimization and performance improvement for both parking space detection and road element segmentation, improving detection accuracy and segmentation precision in complex parking scenarios such as occlusion and lighting changes, and providing a more reliable and comprehensive environmental perception foundation for automatic parking systems.

[0035] In some instances, obtaining a bird's-eye view of the environment surrounding the target vehicle includes: acquiring multi-channel surround-view (ring field-of-view) images of the target vehicle using multiple fisheye cameras deployed around the target vehicle; performing distortion correction on each surround-view image based on preset camera calibration parameters to generate corrected images; performing viewpoint transformation on the multi-channel corrected images to generate multi-channel bird's-eye view images; and performing pixel-level stitching and fusion of the multi-channel bird's-eye view images to generate the target bird's-eye view of the environment surrounding the target vehicle.

[0036] For example, to acquire a bird's-eye view of the environment surrounding the target vehicle, multiple surround-view images are first captured simultaneously by multiple fisheye cameras deployed around the vehicle. Specifically, ultra-wide-angle fisheye cameras are installed at the front, rear, left, and right of the vehicle. Together these cameras cover a 360° area around the vehicle, ensuring that parking lines and road elements are fully captured. Because of the extremely wide field of view of fisheye lenses, the original images are distorted, exhibiting curved lines and spatial compression. However, this design minimizes the number of cameras while ensuring blind-spot-free coverage of the environment. The synchronous acquisition mechanism ensures that the four images are aligned in time, so that subsequent image processing operates on temporally consistent frames.

[0037] Based on preset camera calibration parameters, distortion correction is performed on each surround-view image to generate a corrected image. The camera calibration parameters are obtained beforehand by calibrating each fisheye camera and include intrinsic parameters (focal length, principal point coordinates, and distortion coefficients) and extrinsic parameters (the rotation matrix and translation vector of the camera relative to the vehicle coordinate system). The distortion correction process uses these parameters to establish a mapping from distorted image coordinates to ideal image coordinates. A resampling algorithm maps each pixel in the original image to its corrected position, thereby eliminating the geometric distortion caused by the lens and restoring an image that conforms to perspective projection. After correction, the originally curved lane lines and parking lines are straightened, and the shapes and sizes of objects are closer to those in the real world, providing a pixel-level basis for viewpoint transformation.
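The coordinate mapping used in distortion correction can be illustrated with a minimal sketch. The application does not state which lens model or distortion coefficients are used, so the code below assumes an ideal equidistant fisheye model (image radius proportional to the incidence angle) and the same focal length `f` for the fisheye and corrected images. The function name `corrected_to_fisheye` and all parameter values are hypothetical. It computes, for a pixel of the corrected image, the fisheye pixel from which a resampling step would fetch its value.

```python
import math

def corrected_to_fisheye(u, v, f, cx, cy):
    """Map a pixel (u, v) of the corrected (perspective) image to the source
    pixel in the fisheye image. ASSUMPTION: ideal equidistant lens,
    r_fisheye = f * theta, with no extra distortion coefficients."""
    x = (u - cx) / f
    y = (v - cy) / f
    r_pinhole = math.hypot(x, y)      # equals tan(theta) for a perspective camera
    if r_pinhole == 0.0:
        return (cx, cy)               # the principal point maps to itself
    theta = math.atan(r_pinhole)      # incidence angle of the viewing ray
    scale = theta / r_pinhole         # ratio r_fisheye / r_pinhole
    return (cx + f * x * scale, cy + f * y * scale)

# Hypothetical intrinsics: f = 300 px, principal point (320, 240).
src_u, src_v = corrected_to_fisheye(620.0, 240.0, 300.0, 320.0, 240.0)
```

Because theta is smaller than tan(theta) away from the centre, the source pixel lies closer to the principal point than the corrected pixel, which is the spatial compression of fisheye images that the correction undoes.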

[0038] Viewpoint transformation is performed on multiple corrected images to generate multiple bird's-eye view images. Viewpoint transformation converts each corrected image from its original camera viewpoint (e.g., forward, rear, left, right) into a unified top-down bird's-eye view. This transformation is typically achieved using inverse perspective mapping technology. Its core is establishing a mapping relationship from the image plane to the ground plane. Assuming the ground is flat, and combining extrinsic parameters such as camera height, pitch angle, and yaw angle, the 3D coordinates of each pixel in the corrected image on the ground are calculated and then projected onto a unified bird's-eye view grid. The transformed bird's-eye view image presents a top-down view of the ground scene centered on the vehicle. Objects in the image maintain their true spatial proportions, and the perspective effect of objects appearing larger when closer and smaller when farther away is eliminated, ensuring consistent scale for visual information in different directions.
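As a rough illustration of inverse perspective mapping under the flat-ground assumption, the sketch below projects a ground point into a corrected camera image and converts a bird's-eye view grid cell into ground coordinates. The coordinate conventions (X forward, Y left, camera at height `height` pitched down by `pitch`, no yaw or roll), the function names, and the numeric values are all assumptions made for this example and are not taken from the application.

```python
import math

def ground_to_pixel(X, Y, height, pitch, fx, fy, cx, cy):
    """Project ground point (X forward, Y left, Z = 0) into the corrected image.
    ASSUMPTION: pinhole camera at (0, 0, height), pitched down by `pitch` radians."""
    s, c = math.sin(pitch), math.cos(pitch)
    xc = -Y                        # camera x axis points right
    yc = -X * s + height * c       # camera y axis points down
    zc = X * c + height * s        # camera z axis is the optical axis
    if zc <= 0:
        return None                # point is behind the camera
    return (cx + fx * xc / zc, cy + fy * yc / zc)

def bev_cell_to_ground(row, col, rows, cols, metres_per_cell):
    """Ground coordinates of the centre of BEV cell (row, col). The vehicle origin
    is at the grid centre; row 0 is farthest forward, col 0 is farthest left."""
    X = (rows / 2 - (row + 0.5)) * metres_per_cell
    Y = (cols / 2 - (col + 0.5)) * metres_per_cell
    return X, Y

# Hypothetical set-up: camera 1 m above ground, pitched down 45 degrees.
centre = ground_to_pixel(1.0, 0.0, 1.0, 0.7853981633974483, 400.0, 400.0, 320.0, 240.0)
```

To fill a bird's-eye view grid, each cell would be converted with `bev_cell_to_ground`, projected with `ground_to_pixel`, and assigned the image value sampled at the resulting pixel.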

[0039] Pixel-level stitching and fusion of multiple bird's-eye view images generates a target bird's-eye view of the environment surrounding the vehicle. Since the four cameras cover different directions, overlapping areas exist in the generated bird's-eye views around the vehicle. The stitching and fusion process first requires determining the spatial relationships between adjacent images. Typically, this is done by calculating the coordinate transformation matrix for each bird's-eye view based on the vehicle size, camera installation location, and calibration parameters, achieving pixel-level alignment. In the overlapping areas, image fusion algorithms (such as weighted averaging and multi-band fusion) are used to smoothly transition pixels from different cameras, eliminating stitching seams, lighting differences, and ghosting, generating a 360° surround-view bird's-eye view covering the area around the vehicle. This target bird's-eye view, centered on the vehicle and with a unified coordinate system and scale, presents the layout information of parking spaces and road elements around the vehicle, providing standardized input data for feature extraction and multi-task perception in neural network models.
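Weighted averaging in an overlapping area can be sketched as follows. The example assumes the two bird's-eye views are already aligned and that the overlap is a rectangular strip; the linear weight ramp and the function name `feather_blend` are illustrative choices, since the application does not specify the weighting scheme.

```python
def feather_blend(strip_a, strip_b):
    """Blend two aligned overlap strips (lists of rows with equal width).
    The weight of strip_a falls linearly from 1 at the first column
    to 0 at the last column, so the transition has no visible seam."""
    width = len(strip_a[0])
    blended = []
    for row_a, row_b in zip(strip_a, strip_b):
        row = []
        for j, (a, b) in enumerate(zip(row_a, row_b)):
            w = 1.0 if width == 1 else (width - 1 - j) / (width - 1)
            row.append(w * a + (1.0 - w) * b)
        blended.append(row)
    return blended

# Two cameras see the same ground strip with different brightness.
front = [[100.0] * 5 for _ in range(2)]
left = [[200.0] * 5 for _ in range(2)]
result = feather_blend(front, left)
```

The blended strip moves gradually from the brightness of one camera to that of the other, which is how such a scheme suppresses lighting differences at the seam.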

[0040] In summary, the target bird's-eye view obtained through the above steps in this application embodiment possesses accurate spatial geometric relationships due to synchronous acquisition and distortion correction, avoiding the positional distortion caused by lens distortion in the original fisheye image, and enabling feature extraction to be performed based on the real spatial location. Viewpoint transformation unifies multiple images to the same top-down perspective, eliminating scale inconsistencies caused by perspective differences, allowing parking spaces and road elements in different directions to be compared and correlated in the same coordinate system. Pixel-level stitching and fusion further eliminates stitching seams and brightness differences between multiple views, generating a continuous and complete panoramic image, avoiding information loss or duplication caused by image fragmentation. Therefore, this target bird's-eye view, as input to the model, provides a visual foundation for the backbone network to extract multi-scale features, ensuring that the feature extraction process can focus on the geometric structure and semantic content of the real environment, improving the accuracy and robustness of parking space detection and road element segmentation tasks.

[0041] In some instances, the pre-defined backbone network is a backbone network with a cross-stage local connectivity structure. Based on the pre-defined backbone network, features are extracted from the target bird's-eye view to generate multi-scale feature maps, including: Based on a backbone network with a cross-stage local connectivity structure, the target bird's-eye view is sequentially processed through the backbone layer and multiple stage layers to extract features, generating multi-scale feature maps at multiple different levels.

[0042] For example, the pre-defined backbone network performs initial downsampling on the input target bird's-eye view through the backbone layer, outputting a first intermediate feature map. This feature map is then processed sequentially through the first to fourth stage layers. Each stage layer first performs downsampling with a stride of 2 through a convolutional module, and then segments and fuses the features through a cross-stage local connection module. The cross-stage local connection module divides the input features into two parts along the channel dimension. One part is directly passed, and the other part is processed by multiple bottleneck units to extract deep features before being concatenated with the directly passed part, thus reducing computational cost while enhancing feature representation. Finally, the network outputs four feature maps at different levels, denoted as the first to fourth level feature maps, forming a multi-scale feature map.

[0043] In summary, this embodiment employs a cross-stage local connection structure to divide the input feature map into two parts for separate processing. This reduces computational load while achieving efficient feature reuse and gradient optimization, avoiding computational redundancy in traditional backbone networks. The stepwise downsampling and feature extraction in the backbone layer and multiple stage layers ensure that shallow feature maps retain fine spatial details, while deep feature maps contain global semantic information. This balances the positioning accuracy requirements of parking space detection with the semantic understanding requirements of road element segmentation, providing rich multi-scale visual representations for feature fusion and multi-task perception.

[0044] In some instances, the multiple stage layers include a first-stage layer, a second-stage layer, a third-stage layer, and a fourth-stage layer. Based on a backbone network with a cross-stage local connectivity structure, sequentially processing the target bird's-eye view through the backbone layer and the multiple stage layers to extract features and generate multi-scale feature maps at multiple different levels includes: based on the backbone layer, downsampling the target bird's-eye view to generate the first intermediate feature map; based on the first-stage layer, downsampling the first intermediate feature map a second time and extracting features through the first cross-stage local connection module to generate the first-level feature map; based on the second-stage layer, downsampling the first-level feature map a third time and extracting features through the second cross-stage local connection module to generate the second-level feature map; based on the third-stage layer, downsampling the second-level feature map a fourth time and extracting features through the third cross-stage local connection module to generate the third-level feature map; and based on the fourth-stage layer, downsampling the third-level feature map a fifth time and extracting features through the fourth cross-stage local connection module to generate the fourth-level feature map.

[0045] For example, the backbone layer uses a convolutional module to process the input target bird's-eye view image. This convolutional module is configured with a kernel size of 3×3, a stride of 2, and padding of 1. The input target bird's-eye view image is 352×352×3 in size. After processing by this convolutional module, the spatial resolution of the feature map is halved, the number of channels increases to 48, and the output is a first intermediate feature map with a size of 176×176×48. This operation completes the preliminary feature extraction and spatial downsampling of the input image, laying the feature foundation for the deep feature extraction of the stage layers.

[0046] The first-stage layer receives the first intermediate feature map output from the backbone layer, which has a size of 176×176×48. The first-stage layer first performs a second downsampling on the first intermediate feature map using a convolutional module with a kernel size of 3×3, a stride of 2, and padding of 1, reducing the feature map size from 176×176 to 88×88, while increasing the number of channels from 48 to 96, generating a second downsampled feature map. This feature map then enters the first cross-stage local connection module for processing. In the first cross-stage local connection module, the input feature map is divided into two branches along the channel dimension, each containing 48 channels. One branch acts as a short-circuit connection, directly preserving the original features without any convolutional transformation; the other branch sequentially passes through three cascaded bottleneck units for deep feature extraction. Each bottleneck unit consists of a 1×1 convolution for dimensionality reduction, a 3×3 convolution for feature extraction, and a residual connection, used to enhance the semantic expressiveness of the features while maintaining the feature map size. After processing by three bottleneck units, the feature map of this branch is then passed through a 1×1 convolutional layer to adjust the number of channels to 48, matching the channel dimension of the short-circuit branch. The feature maps of the two branches are then concatenated along the channel dimension to obtain an output feature map with 96 channels and a size of 88×88, which serves as the first-level feature map generated by the first-stage layer. This process preserves shallow spatial details while gradually abstracting more discriminative semantic features through the stacking of bottleneck units.
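The data flow of the cross-stage local connection module (split, pass-through branch, deep branch, concatenation) can be shown with a toy sketch. Each channel is reduced to a single number, and `bottleneck` is only a placeholder for the 1×1 convolution, 3×3 convolution, and residual connection; the scaling factor and function names are assumptions made for illustration, and the final 1×1 channel-adjusting convolution is modeled as an identity.

```python
def bottleneck(channels, scale=0.5):
    """Placeholder bottleneck unit: output = input + transform(input).
    The 'transform' (a simple scaling) stands in for the 1x1 and 3x3 convolutions."""
    return [c + scale * c for c in channels]

def csp_block(channels, num_bottlenecks):
    """Split channels in half, keep one half untouched, pass the other half
    through cascaded bottleneck units, then concatenate the two halves."""
    half = len(channels) // 2
    passthrough, deep = channels[:half], channels[half:]
    for _ in range(num_bottlenecks):
        deep = bottleneck(deep)
    # The channel-adjusting 1x1 convolution is modeled as identity here.
    return passthrough + deep

# First-stage layer: 96 channels split into 48 + 48, three bottleneck units.
features = [1.0] * 96
output = csp_block(features, num_bottlenecks=3)
```

Only half of the channels pass through the bottleneck units, which is the source of the computational saving described above, while the output keeps the full 96 channels.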

[0047] The second-stage layer receives the feature map from the first-stage layer as input, with a size of 88×88×96. The second-stage layer first performs a third downsampling using a convolutional module with a kernel size of 3×3, a stride of 2, and padding of 1, reducing the feature map size from 88×88 to 44×44 and increasing the number of channels from 96 to 192, generating a third downsampled feature map. This feature map then enters the second cross-stage local connection module. In the second cross-stage local connection module, the input feature map is divided into two branches along the channel dimension, each containing 96 channels. The short-circuit branch directly transmits the original features; the other branch sequentially passes through six cascaded bottleneck units for feature transformation. The structure of each bottleneck unit is the same as that of the first-stage layer; by stacking more bottleneck units, the receptive field is further expanded, and more complex spatial structural information is extracted. After processing by 6 bottleneck units, the feature map of this branch is adjusted back to 96 channels through a 1×1 convolutional layer, and then concatenated with the feature map of the short-circuit branch in the channel dimension to obtain an output feature map with 192 channels and a size of 44×44, which serves as the second-level feature map generated by the second-stage layer. This stage expands the receptive field of the network by increasing the number of bottleneck units, enabling the model to better perceive the layout relationship between parking spaces and the spatial correlation between road elements.

[0048] The third-stage layer receives the second-stage feature map as input, with a size of 44×44×192. First, it performs a fourth downsampling using a convolutional module with a kernel size of 3×3, a stride of 2, and padding of 1, reducing the feature map size from 44×44 to 22×22 and increasing the number of channels from 192 to 384, generating a fourth downsampled feature map. This feature map then enters the third cross-stage local connection module. In this module, the input feature map is split into two branches along the channel dimension, each containing 192 channels. The short-circuit branch retains the original features; the other branch sequentially passes through nine cascaded bottleneck units for deep semantic extraction, abstracting high-level semantic concepts, such as the category attributes of parking spaces and the semantic categories of road elements, through deeper network layers. After passing through 9 bottleneck units, the feature map of this branch is adjusted back to 192 channels using a 1×1 convolutional layer, and then concatenated with the short-circuit branch to obtain an output feature map with 384 channels and a size of 22×22, which serves as the third-level feature map generated by the third-stage layer. The feature map output at this stage has stronger semantic information and can effectively represent the abstract spatial relationships and contextual semantics between parking spaces and various road elements in complex scenes.

[0049] The fourth-stage layer receives the feature map from the third-stage layer as input, with a size of 22×22×384. The fourth-stage layer first performs fifth downsampling through a convolutional module with a kernel size of 3×3, a stride of 2, and padding of 1, reducing the feature map size from 22×22 to 11×11 and increasing the number of channels from 384 to 576, generating the fifth-downsampled feature map. This feature map then enters the fourth cross-stage local connection module. In this module, the input feature map is split into two branches along the channel dimension, each containing 288 channels. The short-circuit branch preserves the original feature structure; the other branch sequentially passes through 12 cascaded bottleneck units for high-intensity feature abstraction, mining global semantic information, such as the overall layout of the scene and the macroscopic distribution of different elements, through deeper network layers. After passing through 12 bottleneck units, the feature map of this branch is adjusted back to 288 channels by a 1×1 convolutional layer, and then concatenated with the short-circuit branch to obtain an output feature map with 576 channels and a size of 11×11, which serves as the fourth-level feature map generated by the fourth-stage layer. This feature map integrates deep global semantics with some shallow details, providing a high semantic density feature foundation for the spatial pyramid pooling module.

[0050] In summary, the four-stage layers of this embodiment employ progressive downsampling and alternating stacking of cross-stage local connection modules. The convolutional modules in each stage layer reduce spatial resolution and increase the number of channels to gradually expand the receptive field and compress the dimensionality of the feature map. The cross-stage local connection modules divide the feature map into short-circuit branches and deep processing branches, reducing computational load while ensuring full feature reuse and efficient gradient propagation. The short-circuit branches directly transmit the original features, avoiding information loss, while the deep processing branches gradually enhance the abstraction level of the features through different numbers of bottleneck units. The increasing number of bottleneck units (3, 6, 9, and 12 respectively) allows the network to adjust the feature extraction depth at different scales; that is, shallow stages use fewer bottleneck units to preserve spatial structure, while deep stages use more bottleneck units to extract global semantics. This design meets the positioning accuracy requirements of parking space detection tasks and the semantic understanding requirements of road element segmentation tasks.
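The feature map sizes quoted for the backbone layer and the four stage layers follow from the standard convolution output-size formula. The sketch below traces them for the 352×352 input; the helper names and the tuple layout are hypothetical, while the kernel, stride, padding, channel counts, and bottleneck counts are those given in this embodiment.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# (output channels, number of bottleneck units) for stage layers 1 to 4.
STAGES = [(96, 3), (192, 6), (384, 9), (576, 12)]

def trace_backbone(input_size=352, stem_channels=48):
    """Return (name, spatial size, channels, channels per branch, bottlenecks)."""
    size = conv_out(input_size)
    trace = [("stem", size, stem_channels, None, 0)]
    for index, (channels, bottlenecks) in enumerate(STAGES, start=1):
        size = conv_out(size)  # each stage layer halves the resolution first
        trace.append((f"stage{index}", size, channels, channels // 2, bottlenecks))
    return trace

for entry in trace_backbone():
    print(entry)
```

Running the trace reproduces the 176, 88, 44, 22, and 11 spatial sizes and the 48, 96, 192, and 288 channels per branch stated in the preceding paragraphs.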

[0051] It should be noted that the multiple stage layers include a first stage layer, a second stage layer, a third stage layer, and a fourth stage layer. Specifically, the first stage layer contains a convolutional module and a first cross-stage local connection module to generate the first-level feature map; the second stage layer contains a convolutional module and a second cross-stage local connection module to generate the second-level feature map; the third stage layer contains a convolutional module and a third cross-stage local connection module to generate the third-level feature map; and the fourth stage layer contains a convolutional module and a fourth cross-stage local connection module to generate the fourth-level feature map.

[0052] In some instances, after the third-level feature map is downsampled a fifth time based on the fourth-stage layer and the fourth-level feature map is generated through feature extraction via the fourth cross-stage local connection module, the process further includes: performing channel adjustment on the fourth-level feature map to generate a pooling input feature map; feeding the pooling input feature map into a first pooling branch, a second pooling branch, a third pooling branch, and a fourth pooling branch, wherein the first pooling branch determines the pooling input feature map as the first pooling feature map, the second pooling branch pools the pooling input feature map based on the first max pooling layer to generate the second pooling feature map, the third pooling branch pools the second pooling feature map based on the first max pooling layer to generate the third pooling feature map, and the fourth pooling branch pools the third pooling feature map based on the first max pooling layer to generate the fourth pooling feature map; concatenating the first, second, third, and fourth pooling feature maps along the channel dimension to generate the first concatenated feature map; and performing channel compression on the first concatenated feature map to generate an enhanced fourth-level feature map.

[0053] For example, in the fourth-stage layer, the fourth-level feature map output by the fourth cross-stage local connection module has a size of 11×11×576. This feature map is first input to the first convolutional layer for channel adjustment. The first convolutional layer uses a convolution operation with a kernel size of 1×1 to adjust the number of channels of the fourth-level feature map from 576 to 256, while keeping the spatial size of the feature map unchanged, i.e., outputting a pooled input feature map with a size of 11×11×256. The purpose of this channel adjustment operation is to unify the feature representation dimension, lay the feature foundation for multi-branch pooling processing, and enable different branches to perform parallel computation and fusion on the same feature dimension.

[0054] The generated pooling input feature map is processed by four pooling branches. The first pooling branch takes the pooling input feature map as the first pooling feature map and performs no pooling operation on it, preserving the original feature information. The second pooling branch pools the pooling input feature map using the first max-pooling layer, which is configured with a kernel size of 5×5, a stride of 1, and padding of 2. This generates the second pooling feature map, whose size remains 11×11×256. The third pooling branch passes the second pooling feature map through the same 5×5 max-pooling layer to generate the third pooling feature map, whose size also remains 11×11×256. The fourth pooling branch passes the third pooling feature map through the same 5×5 max-pooling layer to generate the fourth pooling feature map, whose size also remains 11×11×256. Because the stride is 1, the cascaded pooling gives the second, third, and fourth branches effective pooling windows of 5×5, 9×9, and 13×13 on the pooling input feature map, so the four branches obtain multi-scale contextual features with progressively expanding receptive fields.
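The effect of reusing one 5×5, stride-1 max-pooling layer in cascade can be checked on a single channel with a small sketch. The function `max_pool` below is a plain-Python stand-in written for this example (clipping the window at the border is equivalent to padding with negative infinity); the test pattern is arbitrary.

```python
def max_pool(x, k):
    """Stride-1 max pooling with 'same' output size on a 2D list.
    The window is clipped at the borders, which matches -inf padding of k // 2."""
    r = k // 2
    h, w = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            row.append(max(
                x[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            ))
        out.append(row)
    return out

# Arbitrary 11x11 single-channel test pattern.
x = [[(i * 7 + j * 13) % 17 for j in range(11)] for i in range(11)]
second = max_pool(x, 5)        # second pooling branch
third = max_pool(second, 5)    # third pooling branch (cascaded)
fourth = max_pool(third, 5)    # fourth pooling branch (cascaded)
print(third == max_pool(x, 9), fourth == max_pool(x, 13))
```

Two cascaded 5×5 pools match a single 9×9 pool, and three match a single 13×13 pool, so the cascade obtains the larger windows while only ever applying the small kernel.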

[0055] The first, second, third, and fourth pooling feature maps are concatenated along the channel dimension. Each of the four feature maps has a spatial size of 11×11 and 256 channels. The concatenated feature map has a size of 11×11×1024. This concatenation operation integrates the original features with features processed by different pooling operations, ensuring that the concatenated feature map simultaneously contains original detailed information, local contextual information, and a broader range of global contextual information, thus achieving the aggregation of multi-scale features along the channel dimension.

[0056] Channel compression is performed on the first concatenated feature map. The first concatenated feature map is input into the second convolutional layer, which uses a 1×1 kernel to compress the number of channels in the first concatenated feature map from 1024 to 576 while maintaining the same spatial size. The output feature map has a size of 11×11×576, which serves as the enhanced fourth-level feature map. This channel compression operation integrates and reduces the dimensionality of multi-scale fusion features, allowing the enhanced fourth-level feature map to retain the semantic information of the original features while incorporating the multi-scale contextual features obtained through cascaded pooling.

[0057] In summary, through the spatial pyramid pooling module described above, the receptive field of the fourth-level feature map in this embodiment is effectively expanded, and multi-scale contextual information fusion is achieved. This design enables the model to simultaneously perceive small-scale details (such as parking line endpoints) and large-scale scene layouts (such as the arrangement of multiple consecutive parking spaces) in a single-scale feature map, alleviating the scale adaptability problem caused by the fixed receptive field in traditional backbone networks. The multi-branch pooling and channel concatenation mechanism enhances the global contextual representation capability of the features while adding little computation, since max pooling introduces no learnable parameters. This enables the parking space detection head to locate occluded corner points more accurately and allows the road element segmentation head to delineate the boundaries between regions of different categories more finely, thereby improving the accuracy and robustness of the perception task in complex parking scenarios.

[0058] In some instances, performing feature fusion on the multi-scale feature maps to generate a fused feature map includes: performing a first convolution process on the enhanced fourth-level feature map to generate a first adjusted feature map; performing a first upsampling process on the first adjusted feature map to generate a first upsampled feature map; performing a second convolution process on the third-level feature map to generate a second adjusted feature map; performing a first fusion process on the first upsampled feature map and the second adjusted feature map to generate a first fused feature map; performing a second upsampling process on the first fused feature map to generate a second upsampled feature map; performing a third convolution process on the second-level feature map to generate a third adjusted feature map; performing a second fusion process on the second upsampled feature map and the third adjusted feature map to generate a second fused feature map; performing a third upsampling process on the second fused feature map to generate a third upsampled feature map; performing a fourth convolution process on the first-level feature map to generate a fourth adjusted feature map; and performing a third fusion process on the third upsampled feature map and the fourth adjusted feature map to generate the fused feature map.

[0059] For example, the process of fusing features from multi-scale feature maps to generate a fused feature map is implemented through a feature pyramid network structure. This structure aims to integrate the strong semantic information of deep features with the high-resolution details of shallow features across scales. Before feature fusion, the backbone network generates four different levels of feature maps: a first-level feature map, a second-level feature map, a third-level feature map, and an enhanced fourth-level feature map. The enhanced fourth-level feature map, processed by the spatial pyramid pooling module, has a size of 11×11×576 and contains the richest global semantic information; the third-level feature map has a size of 22×22×384; the second-level feature map has a size of 44×44×192; and the first-level feature map has a size of 88×88×96. These four feature maps form the basis of the multi-scale feature pyramid, representing different levels of visual information from high resolution to high semantics.

[0060] The fourth-level feature map, enhanced by the spatial pyramid pooling module, undergoes a first convolutional process. This enhanced fourth-level feature map has a size of 11×11×576. The first convolutional process uses a 1×1 kernel to adjust the number of channels in the enhanced fourth-level feature map to a preset uniform dimension of 64 channels, while maintaining the spatial size of the feature map, generating a first adjusted feature map with a size of 11×11×64. This channel adjustment operation allows feature maps from different levels to be aligned and merged along the same channel dimension.

[0061] The generated first adjusted feature map undergoes a first upsampling process using a transposed convolution operation with a stride of 2, effectively doubling the spatial size of the feature map. After this first upsampling, the size of the first adjusted feature map increases from 11×11 to 22×22, while maintaining the number of channels at 64, resulting in a first upsampled feature map with a size of 22×22×64. This operation progressively transfers strong semantic information from the deep feature map to a higher resolution space.
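The doubling of spatial size by a stride-2 transposed convolution can be checked with the usual output-size formula, which is the one documented for PyTorch's `ConvTranspose2d`. The application states only the stride, so the kernel size, padding, and output padding used below are assumptions; the sketch shows two common configurations that both double the size.

```python
def conv_transpose_out(size, kernel, stride, padding=0, output_padding=0, dilation=1):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

# ASSUMED configurations; only stride = 2 comes from the text.
print(conv_transpose_out(11, kernel=2, stride=2))             # kernel 2, no padding
print(conv_transpose_out(11, kernel=4, stride=2, padding=1))  # kernel 4, padding 1

# The three upsampling steps of the feature pyramid: 11 -> 22 -> 44 -> 88.
sizes = [11]
for _ in range(3):
    sizes.append(conv_transpose_out(sizes[-1], kernel=2, stride=2))
```

Either configuration maps 11×11 to 22×22, so the choice between them affects only how neighboring outputs overlap, not the sizes stated in this embodiment.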

[0062] A second convolution is performed on the third-level feature map, which originates from the output of the third-stage layer of the backbone network and has an original size of 22×22×384. The second convolution operation uses a 1×1 kernel to adjust the number of channels in the third-level feature map to the same 64 channels as the first upsampled feature map, while maintaining the spatial size of 22×22, generating a second adjusted feature map with a size of 22×22×64. This process aligns the upsampled features from the deep path with the original features from the same level of the backbone network in the channel dimension.

[0063] The first upsampled feature map and the second adjusted feature map are subjected to a first fusion process, which employs element-wise addition. The first upsampled feature map carries the deep semantic information passed down through upsampling, while the second adjusted feature map retains the original spatial structure features of the third level. Both have a size of 22×22×64. Through element-wise addition, the deep semantic information and the mid-level spatial details are integrated to generate a first fused feature map with a size of 22×22×64. This fused feature map both captures the global scene and retains the spatial structure information from the third-level feature map.

[0064] The generated first fused feature map is then subjected to a second upsampling process, which also employs a transposed convolution operation with a stride of 2. After the second upsampling process, the size of the first fused feature map is enlarged from 22×22 to 44×44, while the number of channels remains unchanged at 64, resulting in a second upsampled feature map with a size of 44×44×64.

[0065] A third convolution is applied to the second-level feature map, which originates from the output of the second-stage layer of the backbone network and has an original size of 44×44×192. The third convolution operation uses a 1×1 kernel to adjust the number of channels in the second-level feature map to 64, while maintaining the spatial size of 44×44, generating a third adjusted feature map with a size of 44×44×64. This process aligns the upsampled features from the previous fusion path with the original features of the second level in the channel dimension.

[0066] The second upsampled feature map and the third adjusted feature map are then subjected to a second fusion process, which employs element-wise addition. The second upsampled feature map carries the deep semantic and mid-level structural information passed down from the first fusion stage, while the third adjusted feature map retains the original detailed features of the second level. Both have a size of 44×44×64. Through element-wise addition, the high-level semantic information is integrated with the spatial details of the second level, generating a second fused feature map with a size of 44×44×64. This fused feature map further enriches the feature hierarchy, combining spatial details and semantic information more completely.

[0067] The generated second fused feature map undergoes a third upsampling process, employing a transposed convolution operation with a stride of 2. After this third upsampling, the size of the second fused feature map is enlarged from 44×44 to 88×88, while the number of channels remains unchanged at 64, resulting in a third upsampled feature map with a size of 88×88×64. This operation transfers the fused features to the same spatial resolution as the first-level feature map, providing a basis for size matching for the final fusion with the first-level feature map.

[0068] A fourth convolution is applied to the first-level feature map, which originates from the output of the first-stage layer of the backbone network and has an original size of 88×88×96. The fourth convolution operation uses a 1×1 kernel to adjust the number of channels in the first-level feature map to 64, while maintaining the spatial size of 88×88, generating a fourth adjusted feature map with a size of 88×88×64. This process aligns the upsampled features from the deepest fusion path with the original high-resolution features of the first level in the channel dimension.

[0069] The third upsampled feature map and the fourth adjusted feature map are then fused using element-wise addition. The third upsampled feature map carries global semantic information and multi-level spatial structure features passed down through multiple levels of fusion, while the fourth adjusted feature map retains the original high-resolution detail features of the first level. Both maps have a size of 88×88×64. Through element-wise addition, the deep semantic understanding and the fine, shallow spatial details are integrated to generate the fused feature map with a size of 88×88×64. This fused feature map, as the final output of the feature pyramid network, provides a unified visual representation with both high-resolution spatial details and rich semantic information for the parking space detection and road element segmentation tasks.

[0070] In summary, this embodiment constructs a top-down feature propagation path by adjusting convolutional channels, upsampling spatial alignment, and element-wise addition fusion of feature maps at different levels. In this path, each level's feature map undergoes a 1×1 convolution to unify the number of channels to the same dimension, eliminating channel differences between different levels and enabling features from different levels to be expressed in the same feature space. Through progressive upsampling, the semantic information of deep feature maps is gradually transferred to a higher spatial resolution, ensuring that deep semantics and shallow details are aligned in the same spatial location. The element-wise addition fusion method integrates the transmitted semantic information with the spatial details of the current level, preserving the integrity of both types of information. The final fused feature map retains spatial details such as parking line edges and corner positions from the first-level feature map, while also containing global scene semantics and contextual information from the enhanced fourth-level feature map, satisfying the requirements of parking space detection for position regression accuracy and road element segmentation for pixel-level semantic classification.
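The size bookkeeping of the last two stages of this top-down path can be checked with a short Python sketch. It is illustrative only: the helper names are hypothetical, and the transposed-convolution kernel size of 2 with zero padding is an assumption, since the text specifies only a stride of 2. Shapes are written as (height, width, channels).

```python
def lateral_1x1(shape, out_channels=64):
    """1x1 convolution, stride 1: spatial size unchanged, channels remapped."""
    h, w, _ = shape
    return (h, w, out_channels)


def upsample_tconv(shape, kernel=2, stride=2, padding=0):
    """Transposed convolution: out = (in - 1) * stride - 2 * padding + kernel."""
    h, w, c = shape

    def grow(n):
        return (n - 1) * stride - 2 * padding + kernel

    return (grow(h), grow(w), c)


def fuse_add(a, b):
    """Element-wise addition needs identical shapes and preserves them."""
    if a != b:
        raise ValueError(f"shape mismatch: {a} vs {b}")
    return a


first_fused = (22, 22, 64)   # output of the first fusion step
level2 = (44, 44, 192)       # second-stage backbone output
level1 = (88, 88, 96)        # first-stage backbone output

second_fused = fuse_add(upsample_tconv(first_fused), lateral_1x1(level2))
final_fused = fuse_add(upsample_tconv(second_fused), lateral_1x1(level1))
print(second_fused, final_fused)
```

The `fuse_add` check makes explicit why both the 1×1 channel adjustment and the stride-2 upsampling are needed before each addition: without either one, the two operands would not share a shape.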

[0071] In some instances, the multi-task decoupling architecture includes a parking space detection head module and a road element segmentation head module. Performing parking space detection and road element segmentation on the fused feature map based on this architecture, to obtain parking space detection results and road element segmentation results, includes: inputting the fused feature map into the parking space detection head module and processing it through the multiple branches of the parking space detection head module to generate the parking space detection results; and inputting the fused feature map into the road element segmentation head module and processing it through the road element segmentation head module to generate the road element segmentation results.

[0072] For example, after the fused feature map is input into the parking space detection head module, it is processed by four parallel convolutional branches. The heatmap branch outputs the center point location and category information of the parking space, the corner point branch outputs the corner point locations of the parking space, and the center point offset branch and corner point offset branch output the coordinate offsets of the center point and of each corner point, respectively. Through multi-branch collaboration and offset correction, the geometric boundary and category attributes of the parking space are determined, and the parking space detection result is generated.

[0073] The fused feature map is input to the road element segmentation head module. It first undergoes progressive feature compression through multiple basic convolutional blocks to preserve semantic information and reduce dimensionality. Then, it is upsampled step by step through transposed convolutional layers to restore the input image size. Finally, it generates a segmentation prediction map through the output convolutional layer, realizing pixel-level classification of road elements such as lane lines, arrows, and no-stopping zones, and outputting the road element segmentation results.

[0074] In summary, this application's embodiments employ a multi-task decoupling architecture, inputting the fused feature map into independent detection and segmentation heads respectively. This allows parking space detection and road element segmentation to be processed in parallel without feature interference. The detection head focuses on target localization and boundary regression, while the segmentation head concentrates on pixel-level semantic classification, avoiding feature conflicts between the two tasks. The shared fused feature map simultaneously preserves spatial details and global semantics, providing high-quality input for both heads. This collaboratively improves the accuracy of parking space detection and the fineness of road element segmentation in complex scenarios such as occlusion and lighting changes, providing a reliable environmental perception foundation for the automatic parking system.

[0075] In some instances, the parking space detection head module includes a heatmap branch, a corner point branch, a center point offset branch, and a corner point offset branch. Processing through these multiple branches to generate the parking space detection results includes: based on the heatmap branch, convolving the fused feature map to generate a first heatmap that represents the location and category of the parking space center point; based on the corner point branch, convolving the fused feature map to generate a second heatmap that represents the corner point positions of the parking space; based on the center point offset branch, convolving the fused feature map to generate a first offset map that represents the coordinate offset of the parking space center point; based on the corner point offset branch, convolving the fused feature map to generate a second offset map that represents the coordinate offset of each corner point of the parking space; and determining the parking space detection results based on the first heatmap, the second heatmap, the first offset map, and the second offset map.

[0076] For example, based on the heatmap branch, the fused feature map is convolutionally processed to generate a first heatmap representing the location and category of the parking space center point. The fused feature map is the feature map output by the feature pyramid network, with a spatial size of 88×88 pixels and 64 channels. The heatmap branch first maps the number of input channels from 64 to 32 through a convolutional layer with a kernel size of 3×3, padding of 1, and stride of 1, outputting an intermediate feature map with a size of 88×88×32; this convolutional layer is followed by a ReLU activation function to introduce non-linearity and enhance the expressive power of the features. The intermediate feature map is then compressed from 32 to 2 through a convolutional layer with a kernel size of 1×1 and stride of 1, generating a first heatmap with a size of 88×88×2. The heatmap has two channels, which are used to characterize the probability of the existence of the parking space center point and the parking space category information, respectively: the value of each pixel position in the first channel represents the confidence that the point belongs to the parking space center point, and the higher the value, the more likely it is to be the parking space center point; the second channel encodes the parking space type through different discrete values, including perpendicular parking spaces, parallel parking spaces and angled parking spaces, and outputs the category attribute at the same time as locating the center point, thereby realizing the synchronous output of center point location and category prediction.

[0077] Based on the corner point branch, the fused feature map is convolutionally processed to generate a second heatmap representing the location of parking space corner points. The corner point branch also takes the fused feature map of size 88×88×64 as input, and its structure is similar to that of the heatmap branch. The branch first passes the input through a convolutional layer with a kernel size of 3×3, padding of 1, and stride of 1, reducing the number of channels from 64 to 32 and outputting an intermediate feature map of size 88×88×32, followed by a ReLU activation function. This intermediate feature map is then passed through a convolutional layer with a kernel size of 1×1 and stride of 1, further compressing the number of channels from 32 to 1 and generating a second heatmap of size 88×88×1. This heatmap represents the probability that a parking space corner point exists at each position; that is, the value of each pixel reflects the confidence that the point is a parking space corner point. By finding local peak points in the second heatmap, the approximate integer coordinates of all candidate corner points can be obtained. The corner point branch enables the network to learn the visual features of corners, such as the endpoints and intersections of parking lines, so that it can still make inferences based on visible corners even when the parking space is partially occluded.

[0078] Based on the center point offset branch, the fused feature map is convolutionally processed to generate a first offset map representing the coordinate offset of the parking space center point. The input to the center point offset branch is also the 88×88×64 fused feature map. This branch first passes the input through a convolutional layer with a kernel size of 3×3, padding of 1, and stride of 1, adjusting the number of channels from 64 to 32 and outputting an 88×88×32 intermediate feature map, which is then activated by the ReLU function. The intermediate feature map then passes through a convolutional layer with a kernel size of 2×2 and stride of 1, with padding adjusted according to the output size to ensure that the spatial dimension remains unchanged, mapping the number of channels from 32 to 2 and generating a first offset map of size 88×88×2. Each pixel in this offset map corresponds to two values, representing the offset of the pixel's position relative to the true center point in the horizontal direction (x direction) and the vertical direction (y direction). Since the center point position predicted by the first heatmap is based on the integer coordinates of the downsampled feature map, there is a positioning error caused by quantization. The role of the center point offset branch is to compensate for this error through regression learning and provide a correction. The corrected center point coordinates are obtained by adding the integer coordinates of the peak point in the first heatmap to the offset value at the corresponding position in the first offset map.
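The quantization error that the center point offset branch compensates for can be illustrated numerically. The following Python sketch is an assumption-laden example rather than the method of this application: it assumes the feature-map stride is 352 / 88 = 4, that the integer cell is obtained by flooring, and that the regressed offset is the fractional remainder in feature-map units. The function names are hypothetical.

```python
import math

STRIDE = 352 // 88  # bird's-eye view size / feature map size (assumed) = 4


def encode_center(x_img, y_img, stride=STRIDE):
    """Split an image-space center into an integer cell and a fractional offset."""
    fx, fy = x_img / stride, y_img / stride
    ix, iy = math.floor(fx), math.floor(fy)
    return (ix, iy), (fx - ix, fy - iy)


def decode_center(cell, offset, stride=STRIDE):
    """Integer peak coordinates plus offset, mapped back to image pixels."""
    (ix, iy), (dx, dy) = cell, offset
    return ((ix + dx) * stride, (iy + dy) * stride)


cell, offset = encode_center(123.0, 201.0)
coarse = decode_center(cell, (0.0, 0.0))   # heatmap peak alone
refined = decode_center(cell, offset)      # peak plus regressed offset
print(cell, offset, coarse, refined)
```

Under these assumptions, the heatmap peak alone can be wrong by up to almost one feature-map cell (almost 4 pixels in the bird's-eye view), which is the error the offset map removes.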

[0079] Based on the corner point offset branch, the fused feature map is convolved to generate a second offset map representing the coordinate offset of each corner point of the parking space. The corner point offset branch takes the fused feature map of size 88×88×64 as input, and its processing flow is similar to that of the center point offset branch. First, a convolutional layer with a kernel size of 3×3, padding of 1, and stride of 1 compresses the number of channels from 64 to 32, outputting an intermediate feature map of size 88×88×32, to which the ReLU activation function is applied. The intermediate feature map is then passed through a convolutional layer with a kernel size of 1×1 and stride of 1, mapping the number of channels from 32 to 8 and generating a second offset map of size 88×88×8. This 8-channel feature map corresponds to the horizontal and vertical coordinate offsets of the four corner points of each parking space: the first two channels represent the x offset and y offset of the first corner point, the next two channels represent the x offset and y offset of the second corner point, and so on. Similar to the center point offset branch, this branch is used to correct the quantization error of the integer coordinates of corner points in the second heatmap, enabling the network to output sub-pixel-level positions of the corner points through regression learning. During inference, the precise coordinates of each corner point are obtained by combining the integer coordinates of its peak point in the second heatmap with the offset values at the corresponding position in the second offset map.
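The channel layout of the 8-channel offset map can be written down as a small indexing rule. The Python sketch below is illustrative: the zero-based corner index `k`, the function names, and the way a peak is associated with a particular corner index are assumptions, since the text does not specify how a detected corner is assigned its index.

```python
def corner_channels(k):
    """Return the (x, y) channel indices for corner k, with k in 0..3."""
    if not 0 <= k < 4:
        raise ValueError("a parking space has four corner points")
    return 2 * k, 2 * k + 1


def refine_corner(peak_xy, offset_vector, k):
    """Add corner k's regressed offsets to its integer peak coordinates.

    offset_vector is the 8-value vector read from the second offset map
    at the peak position (assumed lookup convention).
    """
    if len(offset_vector) != 8:
        raise ValueError("expected 8 channels")
    cx, cy = corner_channels(k)
    return (peak_xy[0] + offset_vector[cx], peak_xy[1] + offset_vector[cy])


vector = [0.1, 0.2, 0.3, 0.4, 0.5, 0.25, 0.7, 0.8]
print(corner_channels(2), refine_corner((10, 20), vector, 2))
```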

[0080] Based on the first heatmap, second heatmap, first offset map, and second offset map, the parking space detection results are determined. First, local peak points are extracted from the first heatmap as candidate parking space centers, and the category of each candidate center is determined according to the channel index of the heatmap. Simultaneously, local peak points are extracted from the second heatmap as candidate corner points. Based on the inherent geometric constraints of the parking space, such as the relative positions and distances from the center point to each corner point, the candidate centers are matched with surrounding candidate corner points to filter out effective corner point combinations, thus constructing preliminary parking space hypotheses. On this basis, the integer coordinates of each candidate center are corrected to subpixel level using the first offset map, and the integer coordinates of each candidate corner point are corrected to subpixel level using the second offset map, obtaining the coordinates of the center point and corner points. The parking space bounding box is constructed based on the corrected corner point coordinates, and combined with the corrected center point category information, the parking space detection results are output, including the parking space's location, shape, and type. This multi-branch collaborative working and offset compensation mechanism enables parking space output based on visible keypoints even in partially occluded scenes, improving the robustness and accuracy of parking space detection.
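The peak extraction and center-to-corner matching described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the confidence threshold of 0.5, the 3×3 neighbourhood, and the use of a single maximum-distance test as the geometric constraint are all hypothetical choices, and the helper names do not come from this application.

```python
import math


def local_peaks(heatmap, threshold=0.5):
    """Return (x, y, score) for cells above threshold that are 3x3 local maxima."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighbours = [heatmap[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (i, j) != (x, y)]
            if all(v >= n for n in neighbours):
                peaks.append((x, y, v))
    return peaks


def match_corners(center, corners, max_dist):
    """Keep up to four candidate corners within max_dist of the center."""
    def dist(c):
        return math.hypot(c[0] - center[0], c[1] - center[1])

    return sorted((c for c in corners if dist(c) <= max_dist), key=dist)[:4]


def grid(w, h, points):
    """Build a w x h map of zeros with the given {(x, y): score} entries."""
    g = [[0.0] * w for _ in range(h)]
    for (x, y), v in points.items():
        g[y][x] = v
    return g


center_map = grid(12, 8, {(3, 3): 0.9})
corner_map = grid(12, 8, {(1, 1): 0.8, (5, 1): 0.7, (1, 5): 0.8,
                          (5, 5): 0.9, (11, 7): 0.6})
centers = local_peaks(center_map)
corners = local_peaks(corner_map)
slot = match_corners(centers[0], corners, max_dist=4.0)
print(centers, slot)
```

In this toy example the corner candidate at (11, 7) is rejected because it lies too far from the center, leaving the four corners that form one parking space hypothesis; the integer coordinates would then be refined with the two offset maps.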

[0081] In summary, this application's embodiment achieves collaborative optimization of parking space key point detection and offset regression through the multi-branch decoupling design of the parking space detection head. The independent setting of the heatmap branch and corner branch allows the network to learn the features of the center point and corner points separately, avoiding mutual interference between the two types of key points when sharing features. This ensures that parking space key points can still be detected even in complex environments such as lighting changes and shadow occlusion. The introduction of the center point offset branch and corner point offset branch directly compensates for the integer coordinate quantization error caused by downsampling, enabling the model to output sub-pixel level coordinates, improving the positioning accuracy of the parking space bounding box, and meeting the high requirements of automatic parking for position accuracy. The parallel processing of the four branches allows the detection results to integrate multi-level information of the center point, corner points, and their offsets. Even when some corner points are occluded, parking space inference can still be completed based on the center point and visible corner points, enhancing the model's adaptability and robustness. This achieves parking space perception capability in automatic parking scenarios, providing an accurate environmental information foundation for path planning and decision control.

[0082] In some instances, the road element segmentation head module includes a first batch-normalized basic residual block, a second batch-normalized basic residual block, a first convolutional combination layer, a first transposed convolutional layer, a second convolutional combination layer, a second transposed convolutional layer, and an output convolutional combination layer. Processing the fused feature map through the road element segmentation head module to generate the road element segmentation results includes performing multiple feature compression and upsampling processes on the fused feature map, specifically: based on the first batch-normalized basic residual block, performing a first feature compression on the fused feature map to generate a first compressed feature map; based on the second batch-normalized basic residual block, performing a second feature compression on the first compressed feature map to generate a second compressed feature map; based on the first convolutional combination layer, convolving the second compressed feature map to generate a first intermediate feature map, where the first convolutional combination layer includes convolution processing, batch normalization processing, and activation function processing; based on the first transposed convolutional layer, performing a first upsampling on the first intermediate feature map to generate a first upsampled feature map; based on the second convolutional combination layer, convolving the first upsampled feature map to generate a second intermediate feature map, where the second convolutional combination layer includes convolution processing and batch normalization processing; based on the second transposed convolutional layer, performing a second upsampling on the second intermediate feature map to generate a second upsampled feature map; and based on the output convolutional combination layer, convolving the second upsampled feature map to generate a segmentation prediction map, which is determined as the road element segmentation result, where the output convolutional combination layer includes convolution processing and batch normalization processing. The size of the segmentation prediction map is the same as that of the target bird's-eye view.

[0083] For example, the road element segmentation head module takes the fused feature map output by the feature pyramid network as input. This fused feature map has a spatial size of 88×88 and 64 channels. The fused feature map is input into the first batch-normalized basic residual block for the first feature compression. The first batch-normalized basic residual block contains two parallel convolutional branches: the first branch uses a convolutional layer with a kernel size of 3×3, a stride of 1, and padding of 1, followed by batch normalization; the second branch uses a convolutional layer with a kernel size of 1×1 and a stride of 1, followed by batch normalization. The output feature maps of both branches have a spatial size of 88×88 and 16 channels. The two outputs are then added element-wise to generate a first compressed feature map with a size of 88×88×16. Through its dual-branch structure, this block compresses the channel dimension while maintaining spatial resolution, with the 3×3 branch extracting local spatial features and the 1×1 branch combining information across channels.

[0084] The first compressed feature map is input into the second batch-normalized basic residual block for a second feature compression process. The structure of the second block is the same as that of the first, also containing a 3×3 convolution branch with batch normalization and a 1×1 convolution branch with batch normalization. The output feature maps of the two branches remain 88×88 in size, with 16 channels. After element-wise addition, a second compressed feature map with a size of 88×88×16 is generated. The two cascaded basic residual blocks further enhance the semantic abstraction of the features through step-by-step feature compression, with the number of channels reduced from 64 to 16, which lowers the computational burden of the subsequent upsampling processes while retaining key information.

[0085] The second compressed feature map is input into the first convolutional combination layer for convolution processing. The first convolutional combination layer uses a 3×3 kernel, a stride of 1, and padding of 1, followed by batch normalization and ReLU activation. This operation maintains the spatial size of the feature map at 88×88 and the number of channels at 16, generating the first intermediate feature map. This convolutional combination layer further extracts local texture features through convolution, accelerates training convergence and stabilizes the feature distribution through batch normalization, and introduces non-linearity through the rectified linear unit (ReLU), enabling the model to learn local discriminative features such as the edges and corners of road elements.

[0086] Based on the first transposed convolutional layer, the generated first intermediate feature map undergoes a first upsampling process to generate a first upsampled feature map. The first transposed convolutional layer is configured with a kernel size of 2×2 and a stride of 2, which doubles the spatial size of the input feature map. After processing by this transposed convolutional layer, the spatial size of the first intermediate feature map is enlarged from 88×88 to 176×176, while the number of channels remains unchanged at 16, generating the first upsampled feature map. The transposed convolution achieves learnable upsampling by inserting zero values between pixels of the input feature map and applying a learnable convolution kernel. This allows the model to fill in detail adaptively according to task requirements, gradually restoring the spatial resolution reduced by earlier downsampling while preserving the semantic features obtained through training.
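A one-dimensional Python sketch can make the size-doubling behaviour of a kernel-2, stride-2 transposed convolution concrete. It uses the scatter formulation, which is mathematically equivalent to zero insertion followed by convolution; the function name and the example kernel weights are hypothetical, and real layers operate on two spatial dimensions with learned weights.

```python
def transposed_conv1d(x, kernel, stride):
    """Each input sample adds a scaled copy of the kernel to the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, wj in enumerate(kernel):
            out[i * stride + j] += xi * wj
    return out


y = transposed_conv1d([1.0, 2.0, 3.0], kernel=[1.0, 0.5], stride=2)
print(y)  # [1.0, 0.5, 2.0, 1.0, 3.0, 1.5]
```

Because the kernel length equals the stride, the scaled kernel copies do not overlap, and an input of length 88 yields an output of length (88 - 1) × 2 + 2 = 176.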

[0087] Based on the second convolutional combination layer, the generated first upsampled feature map is convolved to produce a second intermediate feature map. The second convolutional combination layer consists of a convolution with a 3×3 kernel followed by batch normalization. After processing by this convolutional combination layer, the spatial size of the first upsampled feature map remains unchanged at 176×176 and the number of channels remains unchanged at 16, generating the second intermediate feature map. This convolutional combination layer smooths out jagged artifacts that may exist in the upsampled feature map through convolution and stabilizes the feature distribution through batch normalization, preparing the features for the second upsampling.

[0088] Based on the second transposed convolutional layer, a second upsampling process is applied to the generated second intermediate feature map to produce a second upsampled feature map. The second transposed convolutional layer is configured with a kernel size of 2×2 and a stride of 2, which further doubles the spatial size of the input feature map. After processing by this transposed convolutional layer, the spatial size of the second intermediate feature map is enlarged from 176×176 to 352×352, while maintaining the number of channels at 16, thus generating the second upsampled feature map. This transposed convolutional layer, together with the first transposed convolutional layer, constitutes a cascaded upsampling structure. Through two upsampling operations with a stride of 2, the feature map is gradually restored from the original fused feature map's 88×88 size to 352×352, achieving spatial resolution alignment with the input target bird's-eye view and providing a spatially matched feature foundation for pixel-level classification output.

[0089] Based on the output convolutional combination layer, the generated second upsampled feature map is convolved to produce a segmentation prediction map, which is then used as the road element segmentation result. The output convolutional combination layer consists of a convolution with a kernel size of 1×1 and a stride of 1, followed by batch normalization. The convolution adjusts the number of channels in the input feature map from 16 to the preset number of output categories, which is 18, corresponding to the road element categories to be segmented plus the background category. After processing by the output convolutional combination layer, the spatial size of the second upsampled feature map remains unchanged at 352×352, while the number of channels expands from 16 to 18, generating a segmentation prediction map with a size of 352×352×18. Each pixel in the segmentation prediction map contains a vector of length 18. Each element of the vector represents the probability that the pixel belongs to the corresponding road element category, and the category label of the pixel is obtained by taking the index of the maximum value. The categories are solid lane lines, dashed lane lines, parking lines, zebra crossings, speed bumps, stop lines, no-stopping zones, guide lines, manhole covers, left turn arrows, right turn arrows, left front arrows, right front arrows, left and right turn arrows, straight arrows, U-turn arrows, left and right front arrows, and background. This yields pixel-level road element classification results with the same size as the input target bird's-eye view.
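Taking the per-pixel maximum over the 18 channels can be sketched as follows. The mapping of channel index to category is an assumption: the list below simply follows the order in which the categories are enumerated above, and the function name is hypothetical.

```python
CLASSES = [
    "solid lane line", "dashed lane line", "parking line", "zebra crossing",
    "speed bump", "stop line", "no-stopping zone", "guide line",
    "manhole cover", "left turn arrow", "right turn arrow",
    "left front arrow", "right front arrow", "left and right turn arrow",
    "straight arrow", "U-turn arrow", "left and right front arrow",
    "background",
]


def pixel_label(scores):
    """Return the category whose channel has the largest value for one pixel."""
    if len(scores) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} channel values")
    best = max(range(len(scores)), key=scores.__getitem__)
    return CLASSES[best]


scores = [0.0] * 18
scores[3] = 2.5  # strongest response in the fourth channel
print(pixel_label(scores))
```

Applying `pixel_label` to every position of the 352×352×18 prediction map would give a 352×352 label map.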

[0090] In summary, the road element segmentation head in this embodiment compresses channels while maintaining high resolution through dual-branch residual blocks, preserving spatial details and reducing computational load; convolutional combination layers enhance feature discrimination capabilities; two transposed convolutions combined with intermediate convolutions achieve upsampling, ensuring clear boundaries; and the output convolutional layer completes pixel-level classification. This module outputs a high-resolution segmentation map while preserving shallow details and deep semantics, identifying road elements and improving the robustness and accuracy of the automatic parking perception system.
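The sequence of feature map sizes through the segmentation head can be traced with a short Python sketch. It only reproduces the sizes stated in this embodiment; the function names are hypothetical, and padding of 1 for the 3×3 convolution in the second convolutional combination layer is an assumption implied by its unchanged output size.

```python
def tconv(n, kernel=2, stride=2, padding=0):
    """Transposed convolution output size along one axis."""
    return (n - 1) * stride - 2 * padding + kernel


def trace_segmentation_head(h=88, w=88, c=64, num_classes=18):
    """Return (stage name, (height, width, channels)) for each stage."""
    steps = [("fused feature map", (h, w, c))]
    c = 16  # both residual blocks output 16 channels
    steps.append(("BN residual block 1", (h, w, c)))
    steps.append(("BN residual block 2", (h, w, c)))
    steps.append(("conv combination 1 (3x3, BN, ReLU)", (h, w, c)))
    h, w = tconv(h), tconv(w)
    steps.append(("transposed conv 1", (h, w, c)))
    steps.append(("conv combination 2 (3x3, BN)", (h, w, c)))
    h, w = tconv(h), tconv(w)
    steps.append(("transposed conv 2", (h, w, c)))
    steps.append(("output conv (1x1, BN)", (h, w, num_classes)))
    return steps


steps = trace_segmentation_head()
for name, shape in steps:
    print(f"{name:38s} {shape}")
```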

[0091] In some instances, it also includes: Based on the parking space detection results and road element segmentation results, an automatic parking control strategy for the target vehicle is determined.

[0092] For example, the parking space detection results provide available parking space information in the environment surrounding the target vehicle, including the center point coordinates, the coordinates of the four corner points, and the parking space type for each space, which is categorized as perpendicular, parallel, or angled. The road element segmentation results provide a pixel-level classification map of the same size as the target bird's-eye view. This classification map annotates various road elements, including solid lane lines, dashed lane lines, parking lines, zebra crossings, speed bumps, stop lines, no-stopping zones, guide lines, manhole covers, and various directional arrows. These two types of results together constitute the vehicle's understanding of the surrounding parking environment, providing the basic data for formulating automatic parking control strategies.

[0093] Based on the parking space detection results, the automatic parking control system selects a suitable target parking space from multiple candidate spaces. The selection process comprehensively considers factors such as how well the parking space type matches the vehicle's minimum turning radius, whether the parking space size fits the vehicle's outline, the relative distance between the parking space and the vehicle's current position, and the presence of obstacles around the parking space. Simultaneously, the road element segmentation results are used to evaluate whether the target space can be parked in. For example, by checking whether the parking space lines are complete and whether no-stopping zone markings or guide lines appear within the parking space area, spaces carrying semantic information that prohibits parking, or that otherwise fail to meet regulations or safety requirements, are eliminated. Through the above analysis, the optimal target parking space and its precise pose relative to the vehicle are determined.
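One way to express the exclusion of spaces that contain prohibiting markings is a simple scan of the segmentation labels inside each candidate space. The Python sketch below is a hypothetical simplification: it approximates the parking space by the axis-aligned bounding box of its four corners, represents the label map as a nested list of category names, and treats only no-stopping zones and guide lines as prohibiting; none of these choices are specified by this application.

```python
PROHIBITED = {"no-stopping zone", "guide line"}  # assumed prohibiting classes


def slot_is_parkable(corners, label_map, prohibited=PROHIBITED):
    """Reject a slot if any pixel in its bounding box has a prohibited label."""
    h, w = len(label_map), len(label_map[0])
    xs = [int(round(x)) for x, _ in corners]
    ys = [int(round(y)) for _, y in corners]
    x0, x1 = max(min(xs), 0), min(max(xs), w - 1)
    y0, y1 = max(min(ys), 0), min(max(ys), h - 1)
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if label_map[y][x] in prohibited:
                return False
    return True


label_map = [["background"] * 6 for _ in range(6)]
label_map[4][4] = "no-stopping zone"

slot_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
slot_b = [(3, 3), (5, 3), (5, 5), (3, 5)]
print(slot_is_parkable(slot_a, label_map), slot_is_parkable(slot_b, label_map))
```

For angled parking spaces a polygon test would be more faithful than a bounding box; the box is used here only to keep the sketch short.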

[0094] After identifying the target parking space, the automatic parking control system combines lane lines, arrows, and other guidance information from the road element segmentation results to generate a global parking path. Lane lines in the road element segmentation results constrain the vehicle's driving area during parking, ensuring the vehicle does not cross the lines or enter the oncoming lane; arrows indicate the driving direction, guiding the vehicle to enter the parking space in the correct orientation. For example, if the segmentation results identify a left-turn arrow, the path planning prioritizes approaching the parking space in the left-turn direction. The path planning algorithm, based on the parking space corner coordinates and the vehicle's current position, uses curve interpolation or a search algorithm to generate a smooth reference trajectory that satisfies the vehicle's kinematic constraints. This trajectory includes the vehicle's position sequence and the corresponding desired heading angle.

[0095] During the path tracking control phase, the automatic parking control system calculates the lateral and heading deviations between the vehicle's current position and the desired trajectory in real time, based on the planned reference trajectory. Based on these deviations, a control algorithm generates steering control commands, driving the steering system to execute corresponding actions, ensuring the vehicle travels along the reference trajectory. Simultaneously, the system adjusts the driving speed according to road element segmentation results. For example, when a speed bump or stop line is detected ahead, the vehicle slows down in advance to ensure a smooth passage; when a zebra crossing is detected, the system slows down or stops to observe and ensure pedestrian safety.
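The lateral and heading deviations can be computed as in the following Python sketch. It is illustrative only: the nearest-point lookup, the sign convention (positive lateral deviation means the vehicle is to the left of the path), and the function name are assumptions, and the control algorithm that consumes these deviations is not shown because the text does not name one.

```python
import math


def tracking_errors(pose, trajectory):
    """Return (lateral deviation, heading deviation) to the nearest point.

    pose is (x, y, heading); trajectory is a list of
    (x, y, desired_heading), with angles in radians.
    """
    x, y, yaw = pose
    rx, ry, ryaw = min(trajectory,
                       key=lambda p: math.hypot(p[0] - x, p[1] - y))
    dx, dy = x - rx, y - ry
    # Project the position error onto the path's left-pointing normal.
    lateral = -math.sin(ryaw) * dx + math.cos(ryaw) * dy
    # Wrap the heading difference into [-pi, pi).
    heading = (yaw - ryaw + math.pi) % (2 * math.pi) - math.pi
    return lateral, heading


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
lateral, heading = tracking_errors((1.0, 0.3, 0.1), reference)
print(round(lateral, 3), round(heading, 3))
```

Wrapping the heading difference keeps the deviation small when the two angles straddle ±π, which matters for reverse-in manoeuvres where headings change by large amounts.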

[0096] During parking, the parking space detection results and road element segmentation results are continuously updated to adapt to dynamic environmental changes. For example, when a vehicle approaches a target parking space, the heatmap and corner information re-output by the parking space detection module can more accurately locate the parking space boundary, thereby fine-tuning the parking path for high-precision parking. The road element segmentation results monitor in real time whether parking space lines are obstructed or new obstacles suddenly appear. If an anomaly is detected, parking is immediately paused and the path is replanned or a warning is issued to avoid collisions.

[0097] The automatic parking control strategy also includes parking posture adjustment. Once the vehicle is roughly in the parking space, it is determined, based on the corner coordinates from the parking space detection results and the parking lines in the road element segmentation results, whether the vehicle is centered in the parking space and parallel to the parking lines. If there is a deviation, the vehicle is controlled to make minor forward, backward, and steering corrections until the vehicle posture meets the preset parking accuracy requirements. Then, based on the integrity of the parking lines in the road element segmentation results and the surrounding environment, parking is confirmed and the park gear is automatically engaged, completing the entire automatic parking process.

[0098] In summary, this application integrates parking space detection results obtained from multi-task collaborative perception with road element segmentation results, providing an environmental understanding basis for automatic parking control strategies. Parking space detection results ensure accurate identification and location of target parking spaces, while road element segmentation results provide rich semantic constraints and guidance information. The two complement each other, enabling vehicles to achieve safe and precise automatic parking operations in complex and ever-changing parking scenarios, thus improving the intelligence level of the automatic parking system and the user experience.

[0099] Referring to Figure 2, an embodiment of this application provides a multi-task cooperative perception device for automatic parking, comprising: an image acquisition unit 21, used to acquire a target bird's-eye view of the environment surrounding the target vehicle; a feature extraction unit 22, used to extract features from the target bird's-eye view based on a preset backbone network and generate multi-scale feature maps; a feature fusion unit 23, used to fuse the multi-scale feature maps to generate a fused feature map; and a task perception unit 24, used to perform parking space detection and road element segmentation on the fused feature map based on a multi-task decoupling architecture, so as to obtain parking space detection results and road element segmentation results.
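The data flow between the four units can be sketched as a simple pipeline. The class name, field names, and callable signatures below are hypothetical; the sketch only shows the order in which the units hand data to one another, not how any unit is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class CooperativePerceptionDevice:
    """Chains the four units; each field is a placeholder callable."""

    acquire_image: Callable[[], Any]                  # image acquisition unit 21
    extract_features: Callable[[Any], Any]            # feature extraction unit 22
    fuse_features: Callable[[Any], Any]               # feature fusion unit 23
    perceive_tasks: Callable[[Any], Tuple[Any, Any]]  # task perception unit 24

    def run(self) -> Tuple[Any, Any]:
        bev = self.acquire_image()
        multi_scale = self.extract_features(bev)
        fused = self.fuse_features(multi_scale)
        # Both task outputs are derived from the same fused feature map.
        return self.perceive_tasks(fused)
```

Because both results come from a single call on the shared fused feature map, the backbone and fusion stages run once per frame regardless of the number of task heads.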

[0100] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it; although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features.

[0101] Although preferred embodiments have been described in this specification, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications that fall within the scope of this specification.

[0102] Obviously, those skilled in the art can make various modifications to this specification without departing from its spirit and scope. Therefore, this specification also intends to include any modifications that fall within the scope of the claims and their equivalents.

Claims

1. A multi-task collaborative perception method for automated parking, characterized in that the method comprises: acquiring a target bird's-eye view of the environment surrounding a target vehicle; performing feature extraction on the target bird's-eye view based on a preset backbone network to generate multi-scale feature maps; fusing the multi-scale feature maps to generate a fused feature map; and performing parking space detection and road element segmentation on the fused feature map based on a multi-task decoupling architecture to obtain parking space detection results and road element segmentation results.

2. The method according to claim 1, characterized in that acquiring the target bird's-eye view of the environment surrounding the target vehicle comprises: acquiring multiple surround-view images of the target vehicle using multiple fisheye cameras deployed around the target vehicle; performing distortion correction on each of the surround-view images based on preset camera calibration parameters to generate corrected images; performing viewpoint transformation on the corrected images to generate multiple bird's-eye view images; and stitching and fusing the multiple bird's-eye view images at the pixel level to generate the target bird's-eye view of the environment surrounding the target vehicle.

3. The method according to claim 1, characterized in that, The preset backbone network is a backbone network with a cross-stage local connectivity structure. Based on the preset backbone network, feature extraction is performed on the target bird's-eye view to generate a multi-scale feature map, including: Based on the backbone network with the cross-stage local connectivity structure, the target bird's-eye view is sequentially processed through the backbone layer and multiple stage layers to extract features, generating multiple multi-scale feature maps at different levels.

4. The method according to claim 3, characterized in that, The multiple stage layers include a first stage layer, a second stage layer, a third stage layer, and a fourth stage layer. Based on the backbone network with a cross-stage local connectivity structure, the target bird's-eye view is sequentially processed through the backbone layer and multiple stage layers to extract features, generating multiple multi-scale feature maps at different levels, including: Based on the backbone layer, the target bird's-eye view is downsampled to generate a first intermediate feature map; Based on the first stage layer, the first intermediate feature map is downsampled a second time, and features are extracted through the first cross-stage local connection module to generate a first-level feature map. Based on the second stage layer, the first-level feature map is downsampled a third time, and features are extracted through the second cross-stage local connection module to generate the second-level feature map; Based on the third stage layer, the second-level feature map is downsampled a fourth time, and features are extracted through the third cross-stage local connection module to generate the third-level feature map; Based on the fourth stage layer, the third-level feature map is downsampled a fifth time, and features are extracted through the fourth cross-stage local connection module to generate the fourth-level feature map.

5. The method according to claim 4, characterized in that, after the third-level feature map is downsampled a fifth time based on the fourth stage layer and features are extracted through the fourth cross-stage local connection module to generate the fourth-level feature map, the method further includes: Channel adjustment is performed on the fourth-level feature map to generate a pooling input feature map; The pooling input feature map is input to a first pooling branch, a second pooling branch, a third pooling branch, and a fourth pooling branch, respectively; The first pooling branch takes the pooling input feature map as a first pooling feature map; The second pooling branch performs pooling processing on the pooling input feature map based on a first max pooling layer to generate a second pooling feature map; The third pooling branch performs pooling processing on the second pooling feature map based on the first max pooling layer to generate a third pooling feature map; The fourth pooling branch performs pooling processing on the third pooling feature map based on the first max pooling layer to generate a fourth pooling feature map; The first pooling feature map, the second pooling feature map, the third pooling feature map, and the fourth pooling feature map are concatenated along the channel dimension to generate a first concatenated feature map; Channel compression is performed on the first concatenated feature map to generate an enhanced fourth-level feature map.

6. The method according to claim 5, characterized in that, The step of fusing features from the multi-scale feature maps to generate a fused feature map includes: The enhanced fourth-level feature map is subjected to a first convolutional process to generate a first adjusted feature map; The first adjusted feature map is subjected to a first upsampling process to generate a first upsampled feature map; The third-level feature map is subjected to a second convolution process to generate a second adjusted feature map; The first upsampled feature map and the second adjusted feature map are subjected to a first fusion process to generate a first fused feature map; The first fused feature map is subjected to a second upsampling process to generate a second upsampled feature map; The second-level feature map is subjected to a third convolution process to generate a third adjusted feature map; The second upsampled feature map and the third adjusted feature map are subjected to a second fusion process to generate a second fused feature map; The second fused feature map is subjected to a third upsampling process to generate a third upsampled feature map; The first-level feature map is subjected to a fourth convolution process to generate a fourth adjusted feature map. The third upsampled feature map and the fourth adjusted feature map are subjected to a third fusion process to generate the fused feature map.

7. The method according to claim 1, characterized in that, The multi-task decoupling architecture includes a parking space detection head module and a road element segmentation head module. Based on the multi-task decoupling architecture, parking space detection and road element segmentation are performed on the fused feature map to obtain parking space detection results and road element segmentation results, including: The fused feature map is input into the parking space detection head module, and processed through multiple branches of the parking space detection head module to generate the parking space detection result; The fused feature map is input into the road element segmentation head module, and processed by the road element segmentation head module to generate the road element segmentation result.

8. The method according to claim 7, characterized in that, The parking space detection head module includes a heatmap branch, a corner branch, a center point offset branch, and a corner point offset branch. The parking space detection result is generated by processing through these multiple branches of the parking space detection head module, including: Based on the heatmap branch, the fused feature map is convolved to generate a first heatmap that characterizes the location and category of the parking space center point; Based on the corner branch, the fused feature map is convolved to generate a second heatmap that characterizes the corner positions of the parking space; Based on the center point offset branch, the fused feature map is convolved to generate a first offset map that characterizes the coordinate offset of the parking space center point; Based on the corner point offset branch, the fused feature map is convolved to generate a second offset map that characterizes the coordinate offset of each corner point of the parking space; The parking space detection result is determined based on the first heatmap, the second heatmap, the first offset map, and the second offset map.

9. The method according to claim 7, characterized in that, The process of generating the road element segmentation result through the road element segmentation head module includes: Based on the road element segmentation head module, the fused feature map is subjected to multiple feature compression and upsampling processes to generate the road element segmentation result.

10. The method according to claim 9, characterized in that, The road element segmentation head module includes a first batch-normalized basic residual block, a second batch-normalized basic residual block, a first convolutional combination layer, a first transposed convolutional layer, a second convolutional combination layer, a second transposed convolutional layer, and an output convolutional combination layer. Based on the road element segmentation head module, the fused feature map undergoes multiple feature compression and upsampling processes to generate the road element segmentation result, including: Based on the first batch-normalized basic residual block, the fused feature map is subjected to first feature compression to generate a first compressed feature map; Based on the second batch-normalized basic residual block, the first compressed feature map is subjected to second feature compression to generate a second compressed feature map; Based on the first convolutional combination layer, the second compressed feature map is subjected to convolutional processing to generate a first intermediate feature map, wherein the first convolutional combination layer includes convolutional processing, batch normalization processing, and activation function processing; Based on the first transposed convolutional layer, the first intermediate feature map is upsampled a first time to generate a first upsampled feature map; Based on the second convolutional combination layer, the first upsampled feature map is subjected to convolutional processing to generate a second intermediate feature map, wherein the second convolutional combination layer includes convolutional processing and batch normalization processing; Based on the second transposed convolutional layer, the second intermediate feature map is upsampled a second time to generate a second upsampled feature map; Based on the output convolutional combination layer, the second upsampled feature map is convolved to generate a segmentation prediction map, and the segmentation prediction map is determined as the road element segmentation result, wherein the output convolutional combination layer includes convolutional processing and batch normalization processing, and the size of the segmentation prediction map is the same as that of the target bird's-eye view.
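As a reading aid for claims 4, 6, and 10, the following sketch traces spatial sizes through the five downsampling steps, the three top-down upsampling steps, and the two transposed convolutions. The claims do not state kernel sizes, strides, or padding, so every numeric setting here is an assumption (each downsampling halves the size, each upsampling doubles it), and the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output size of a strided convolution (assumed 3x3, stride 2)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=2, stride=2, padding=0, output_padding=0):
    """Output size of a transposed convolution (assumed 2x2, stride 2)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_sizes(bev_size=512):
    """Return (stem, levels, fused, seg) side lengths for a square input."""
    # Backbone layer: first downsampling.
    stem = conv_out(bev_size)
    # Four stage layers: second to fifth downsampling.
    levels = []
    size = stem
    for _ in range(4):
        size = conv_out(size)
        levels.append(size)
    # Top-down fusion: upsample by 2 and merge with the next shallower level.
    fused = levels[-1]
    for lateral in reversed(levels[:-1]):
        fused *= 2
        if fused != lateral:
            raise ValueError("upsampled map does not match the lateral map")
    # Segmentation head: two transposed convolutions.
    seg = deconv_out(deconv_out(fused))
    return stem, levels, fused, seg
```

Under these assumptions a 512-pixel input yields level sizes of 128, 64, 32, and 16, a fused map of 128, and a segmentation map of 512, which is consistent with the requirement that the segmentation prediction map has the same size as the target bird's-eye view.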