Lightweight bird's eye view perception method and model construction method

CN122244829A · Pending · Publication Date: 2026-06-19 · QINGDAO INST OF COMPUTING TECH XIDIAN UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: QINGDAO INST OF COMPUTING TECH XIDIAN UNIV
Filing Date: 2026-03-18
Publication Date: 2026-06-19


IPC: G06V20/58; G06V10/82; G06V10/44; G06N3/096; G06N3/045; G06V10/774; G06V10/764; G06V10/77; G06V10/74; G06N3/0455

AI Tagging

Application Domain

Biological models Scene recognition




Abstract

This invention provides a lightweight bird's-eye view perception method and model construction method. Through a feature decoupling distillation strategy, it effectively eliminates computational redundancy in the BEV space construction process, enabling complex dense query perception algorithms to adapt to low-to-mid-range vehicle embedded chips with limited computing power, achieving deep synergy between perception accuracy and inference speed. By employing the Hungarian algorithm to obtain the optimal allocation of positive and negative samples of the teacher model, it clarifies the query-target correspondence that the student model should learn, injects the query embedding of the teacher model as prior knowledge into the student model decoder, forces allocation consistency, constrains the prediction logic of the student model, and achieves bidirectional logical alignment. This avoids the assignment drift problem in the early stages of DETR-like decoder training, significantly shortens the training cycle of the lightweight model, and improves the model's positioning stability under complex road conditions.


Description

Technical Field

[0001] This application relates to the field of autonomous driving perception and target detection technology, and in particular to a lightweight bird's-eye view perception method and model construction method.

Background Technology

[0002] As a core direction of future transportation systems, autonomous driving relies heavily on environmental perception modules for safe and efficient operation. These modules fuse multiple sensors, including LiDAR, vision cameras, and millimeter-wave radar, to detect the vehicle's surrounding environment in real time and build scene models, providing crucial decision-making support for downstream planning and control modules. While LiDAR-based target detection algorithms have demonstrated excellent accuracy in public benchmark tests, their high hardware costs and performance sensitivity under complex weather conditions hinder their widespread adoption in mass-produced vehicles. In contrast, pure vision solutions, with their significant advantages of low cost, high hardware integration, and rich semantic information, have become the core direction for the mass production and deployment of autonomous driving perception technologies, and the industry's R&D focus is gradually shifting towards pure vision 3D perception solutions.

[0003] With the popularization of the BEV (Bird's Eye View) perspective perception concept, domestic and international autonomous driving perception frameworks have transitioned from the traditional 2D image space to a unified 3D top-down view space. Among these, feature construction methods based on back-mapping have become a research hotspot due to their powerful spatiotemporal modeling capabilities. Depending on the query density, this technical path has further evolved into two schemes: sparse query and dense query. The sparse query method, represented by DETR3D, uses a fixed number of learnable sparse vectors to represent potential targets and utilizes camera intrinsic and extrinsic parameter matrices to achieve feature sampling from 3D reference points to multi-view images. While it boasts advantages in computational efficiency and inference speed, its limited query quantity makes it difficult to construct a complete and continuous BEV representation, resulting in insufficient robustness in environmental modeling under complex scenarios. The dense query method, represented by BEVFormer, rasterizes the BEV space, assigns query vectors to each grid point, and combines spatial cross-attention and temporal self-attention mechanisms to achieve dense feature aggregation and historical frame information integration. This provides semantically rich and spatially consistent dense BEV feature maps, with perception performance comparable to some LiDAR solutions, providing a solid data foundation for downstream tasks such as obstacle avoidance and path planning.

[0004] However, while current dense BEV perception solutions achieve high-precision perception, they face a core bottleneck that hinders deployment on in-vehicle devices. On one hand, the dense query mechanism makes computational complexity grow quadratically with the BEV grid resolution. The frequent spatial cross-attention and temporal self-attention interactions in the Transformer architecture generate significant computational overhead and memory usage, making the original dense model unsuitable for low- to mid-range in-vehicle edge computing platforms with limited computing power and restricting the application of advanced perception algorithms in mass-produced vehicles. On the other hand, autonomous driving BEV scenes are markedly sparse ("more background, fewer targets"), yet existing technologies generally apply equal-weight feature extraction and distillation over the whole feature map and fail to separate high-value target regions from redundant background noise. Lightweight student models are therefore prone to losing the deep semantic information of key targets during parameter compression, making it difficult to maintain perception accuracy in complex 3D spaces while reducing computational load.

[0005] Furthermore, dense BEV perception models generally employ a DETR-like set prediction mechanism, whose bipartite matching is markedly unstable in the early stages of training. For lightweight models, randomly initialized query vectors are prone to severe "assignment drift" during decoding, so feature alignment between the teacher and student models lacks a unified semantic reference. At the same time, the decoder's "black box" exploration lacks explicit guidance from real physical constraints, which further limits the convergence speed and perception ceiling of lightweight models. Given the limited computing power of in-vehicle embedded chips, achieving deep synergy between perception accuracy and inference efficiency while eliminating computational redundancy has become a core challenge for the mass production and deployment of pure vision-based BEV perception technology. A targeted lightweight training scheme is urgently needed to provide a feasible path for adapting high-level autonomous driving technology to mid-to-low-end vehicle models.

Summary of the Invention

[0006] To address the aforementioned technical issues, this application provides a lightweight bird's-eye view perception method and model construction method.

[0007] The first aspect of this application provides a method for constructing a lightweight bird's-eye view perception model, comprising the following steps:
Step S1: Construct a perception framework based on knowledge distillation, wherein the perception framework includes a pre-trained teacher model and a student model to be trained;
Step S2: Acquire multi-view images and input them into the perception framework; obtain the teacher BEV feature map $F^T$ through the BEV encoder of the teacher model, and obtain the student BEV feature map $F^S$ through the BEV encoder of the student model;
Step S3: Based on the dynamic query guidance mask $\Psi$, distill the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ to obtain the foreground Focal loss $\mathcal{L}_{fg}$, the attention alignment feature loss $\mathcal{L}_{att}$, and the global consistency loss $\mathcal{L}_{global}$;
Step S4: Obtain the bipartite matching permutation mapping $\hat{\sigma}$ of the teacher model based on the Hungarian algorithm, and establish an index mapping relationship between the teacher model and the student model to obtain the feature space collaborative loss $\mathcal{L}_{fs}$ between the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$;
Step S5: Based on the ground-truth injection query interaction, obtain the perception enhancement loss $\mathcal{L}_{pe}$ and the cross-modal feature alignment loss $\mathcal{L}_{cm}$;
Step S6: Combine the original detection task loss $\mathcal{L}_{det}$ with the foreground Focal loss, attention alignment feature loss, global consistency loss, feature space collaborative loss, perception enhancement loss, and cross-modal feature alignment loss to configure the integrated training loss function $\mathcal{L}_{total}$, which is calculated as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{det} + \lambda_1 \mathcal{L}_{att} + \lambda_2 \mathcal{L}_{fg} + \lambda_3 \mathcal{L}_{global} + \lambda_4 \mathcal{L}_{fs} + \lambda_5 \mathcal{L}_{pe} + \lambda_6 \mathcal{L}_{cm}$$

[0008] where $\mathcal{L}_{total}$ is the final integrated training loss function; $\mathcal{L}_{det}$ is the original detection task loss; $\mathcal{L}_{att}$ is the attention alignment feature loss; $\mathcal{L}_{fg}$ is the foreground Focal loss; $\mathcal{L}_{global}$ is the global consistency loss; $\mathcal{L}_{fs}$ is the feature space collaborative loss; $\mathcal{L}_{pe}$ is the perception enhancement loss; $\mathcal{L}_{cm}$ is the cross-modal feature alignment loss; and $\lambda_1$ to $\lambda_6$ are balance hyperparameters that adjust the weight of each term. The student model is trained with the integrated training loss function $\mathcal{L}_{total}$ to obtain the lightweight bird's-eye view perception model.
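As a minimal sketch of how such a combined objective could be assembled, the snippet below assumes the integrated loss is a plain weighted sum in which the detection loss is unweighted and each distillation term has its own balance hyperparameter. The function name `total_loss` and all numeric values are hypothetical and chosen only for illustration.

```python
def total_loss(l_det, distill_terms, weights):
    """Weighted sum: detection loss plus weight_k * distillation term k."""
    if len(distill_terms) != len(weights):
        raise ValueError("one weight per distillation term is required")
    return l_det + sum(w * t for w, t in zip(weights, distill_terms))


# Hypothetical loss values, in the order: attention alignment, foreground
# Focal, global consistency, feature space, perception enhancement, cross-modal.
terms = [0.5, 1.0, 0.2, 0.4, 0.3, 0.1]
lambdas = [1.0, 2.0, 0.5, 1.0, 1.0, 0.5]  # illustrative weights, not from the source
loss = total_loss(2.0, terms, lambdas)
```

In practice the balance hyperparameters would be tuned so that no single distillation term dominates the detection loss.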

[0009] In some embodiments of this application, step S3 includes the following steps:
Step S31: Obtain the query vectors $Q^T$ from the BEV decoder of the teacher model, and calculate the spatial similarity response between the teacher query vectors $Q^T$ and the teacher BEV feature map $F^T$ to obtain the spatial response map $R_i$ and initial mask $\psi_i$ corresponding to each query vector;
Step S32: Calculate the quality assessment score $q_i$ of each query vector, and obtain the spatial response maps corresponding to the valid query vectors and the dynamic query guidance mask $\Psi$;
Step S33: Using the dynamic query guidance mask $\Psi$, decouple the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ into a foreground target region and a background environment region, and perform weighted distillation on each region to obtain the foreground Focal loss $\mathcal{L}_{fg}$ and the attention alignment feature loss $\mathcal{L}_{att}$;
Step S34: Input the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ separately into the global context module to obtain the global consistency loss $\mathcal{L}_{global}$.

[0010] In some embodiments of this application, the quality assessment score $q_i$ and the dynamic query guidance mask $\Psi$ are calculated from the following quantities (the formulas themselves are not reproduced in this text): $s_i$ is the classification prediction score output by the teacher model for the $i$-th query; $\mathrm{IoU}(\cdot,\cdot)$ is the intersection over union; $N$ is the total number of object queries in the teacher model; $b_i$ is the bounding box predicted by the teacher model from the $i$-th query vector; $\hat{b}_i$ is the ground-truth bounding box assigned to the $i$-th query vector after bipartite matching; $\mathrm{IoU}(b_i, \hat{b}_i)$ is the intersection over union between the ground-truth and predicted bounding boxes; $\alpha$ adjusts the balance between the weights of the classification prediction score and the intersection over union; and $\psi_i$ is the initial mask corresponding to the $i$-th query vector.

[0011] In some embodiments of this application, step S4 includes the following steps:
Step S41: Obtain the prediction results of the teacher model, and derive positive and negative samples from them;
Step S42: Obtain the query vectors $Q^T$ from the BEV decoder of the teacher model, and input the teacher query vectors $Q^T$ together with the positive and negative samples into the Hungarian algorithm to obtain the bipartite matching permutation mapping $\hat{\sigma}$;
Step S43: Using the stable prediction results of the teacher model as pseudo ground truth, establish an index mapping relationship between the teacher model and the student model through the bipartite matching permutation mapping $\hat{\sigma}$, and associate the pseudo ground-truth labels of the teacher model with the corresponding query vectors of the student model;
Step S44: Calculate the feature space collaborative loss $\mathcal{L}_{fs}$ between the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$, together with the positive sample distillation loss and negative sample distillation loss between the student model's prediction results and the teacher model's pseudo ground truth.

[0012] In some embodiments of this application, in step S44, the feature space collaborative loss $\mathcal{L}_{fs}$ is calculated from the following quantities (the formula itself is not reproduced in this text): $N$ is the number of positive sample query vectors; $\phi(\cdot)$ is a linear transformation function for feature dimension alignment; $h^T_i$ is the output feature of the $i$-th positive sample query vector in the hidden layer of the teacher model decoder; and $h^S_{\hat{\sigma}(i)}$ is the corresponding query feature in the student model after matching.
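To make the roles of these quantities concrete, here is a minimal sketch of a feature space collaborative loss. It assumes a mean squared distance averaged over the positive queries, and it assumes the linear alignment function is applied to the student features to lift them to the teacher dimension; neither choice is confirmed by the text. The names `project` and `feature_space_loss` are hypothetical.

```python
def project(vec, weight):
    """Linear map phi: multiply a d_in vector by a d_out x d_in weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def feature_space_loss(teacher_feats, student_feats, mapping, weight):
    """Mean squared distance over matched positive queries.

    mapping: {teacher_query_index: student_query_index} for positive samples.
    Assumption: phi lifts student features to the teacher feature dimension.
    """
    total = 0.0
    for t_idx, s_idx in mapping.items():
        aligned = project(student_feats[s_idx], weight)
        total += sum((a - b) ** 2 for a, b in zip(teacher_feats[t_idx], aligned))
    return total / len(mapping)


teacher_feats = [[1.0, 2.0], [0.0, 0.0]]   # two teacher queries, d_T = 2
student_feats = [[5.0], [1.0]]             # two student queries, d_S = 1
weight = [[1.0], [1.0]]                    # hypothetical 2 x 1 alignment matrix
fs_loss = feature_space_loss(teacher_feats, student_feats, {0: 1}, weight)
```

Only matched positive queries contribute here; negative samples would be handled by the separate negative sample distillation loss.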

[0013] In some embodiments of this application, step S5 includes the following steps:
Step S51: Use a multilayer perceptron (MLP) to encode the target category label together with ten-dimensional physical attributes, namely the three-dimensional center coordinates $(x, y, z)$, the physical dimensions $(w, l, h)$, the trigonometric function values of the yaw angle $(\sin\theta, \cos\theta)$, and the two-dimensional instantaneous ground velocity vector $(v_x, v_y)$, into a ground-truth feature vector $E_{gt}$;
Step S52: Inject the ground-truth feature vector $E_{gt}$ dynamically, as a ground-truth query, into the query pool of the student model decoder, so that it participates in the decoding computation in parallel with the original learnable query vectors. Through the self-attention mechanism, the ground-truth query simulates the spatial constraints and interaction patterns between instances in the real scene; through the cross-attention mechanism, it extracts the deterministic object representation corresponding to the ground-truth target from the BEV feature map of the student model;
Step S53: Define the perception enhancement loss $\mathcal{L}_{pe}$, which precisely constrains the predicted output of the ground-truth query after it has passed through the detection head.
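The ten physical attributes listed in step S51 can be packed into a single vector before being passed to the encoder. The sketch below shows one possible layout; the attribute order and the function name `encode_box_attributes` are assumptions, and the MLP and the category label embedding are omitted.

```python
import math


def encode_box_attributes(center, size, yaw, velocity):
    """Pack center (3), size (3), yaw as sin/cos (2), and velocity (2) into 10 values."""
    x, y, z = center
    w, l, h = size
    vx, vy = velocity
    # Representing yaw by (sin, cos) avoids the discontinuity at +/- pi.
    return [x, y, z, w, l, h, math.sin(yaw), math.cos(yaw), vx, vy]


# Hypothetical box: a car 10 m ahead, heading straight, moving at 3 m/s.
attrs = encode_box_attributes((10.0, 0.0, 0.5), (1.9, 4.5, 1.6), 0.0, (3.0, 0.0))
```

The resulting list has exactly ten entries, matching the ten-dimensional attribute description in step S51.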

[0014] In some embodiments of this application, step S5 further includes the following steps:
Step S54: Based on the three-dimensional center coordinates of the real target, crop the local feature block at the corresponding position from the student BEV feature map $F^S$;
Step S55: Use a feature pooling layer to transform the local feature block into a fixed-dimensional visual vector;
Step S56: Following the contrastive learning mechanism of the multimodal neural network CLIP, calculate the cosine similarity between the visual vector and the ground-truth semantic feature vector $E_{gt}$ generated by the ground-truth encoder;
Step S57: Based on the obtained cosine similarities, construct a similarity matrix $S$ containing $N$ instance pairs;
Step S58: Calculate the normalized dot product between the in-batch object features $F$ and the ground-truth semantic feature vectors $E_{gt}$:

$$S = \exp(t)\,\hat{F} \cdot \hat{E}_{gt}^{\top}$$

where $\hat{F} = F / \lVert F \rVert_2$; $\hat{E}_{gt} = E_{gt} / \lVert E_{gt} \rVert_2$; $t$ is the logarithmic scaling factor learned during contrastive learning; and $\cdot$ denotes matrix multiplication;
Step S59: Obtain the cross-modal feature alignment loss $\mathcal{L}_{cm}$ by optimizing bidirectionally along the BEV axis and the ground-truth axis, forcing the BEV target features to align with the ground-truth semantic space.
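Steps S56 to S59 follow the familiar CLIP-style recipe, which can be sketched as below. The sketch assumes a symmetric cross-entropy in which the i-th BEV feature is paired with the i-th ground-truth vector, averaged over the two directions; the exact loss used in the application is not given in the text, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def cross_modal_loss(bev_feats, truth_feats, log_scale=0.0):
    """Symmetric contrastive loss over N (BEV feature, ground-truth vector) pairs."""
    f = [l2_normalize(v) for v in bev_feats]
    e = [l2_normalize(v) for v in truth_feats]
    scale = math.exp(log_scale)  # learnable log scaling factor t
    sim = [[scale * sum(a * b for a, b in zip(fi, ej)) for ej in e] for fi in f]
    sim_t = [list(col) for col in zip(*sim)]  # ground-truth axis view
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


bev = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_loss(bev, [[2.0, 0.0], [0.0, 3.0]], log_scale=2.0)
swapped = cross_modal_loss(bev, [[0.0, 3.0], [2.0, 0.0]], log_scale=2.0)
```

Correctly paired features yield a lower loss than mismatched ones, which is the signal that pulls the student's local BEV features toward the ground-truth semantic space.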

[0015] A second aspect of this application provides a lightweight bird's-eye view perception method, which deploys the lightweight bird's-eye view perception model constructed by the above-described construction method to an autonomous vehicle terminal, and includes the following process: Deploy cameras around autonomous vehicles to collect multi-view images of the surroundings in real time. The acquired multi-view images are subjected to image filtering, image normalization, and image standardization processing. The processed multi-view images are fed into a lightweight bird's-eye view perception model for perception, and the target detection results under the BEV perspective are obtained.

[0016] A third aspect of this application provides a lightweight bird's-eye view perception device, the device comprising at least one processor and at least one memory coupled together; the memory stores a computer executable program of a lightweight bird's-eye view perception model constructed by the construction method as described in any one of claims 1 to 8; when the processor executes the computer executable program stored in the memory, the processor executes a lightweight bird's-eye view perception method.

[0017] A fourth aspect of this application provides a computer-readable storage medium storing a computer program or instructions for a lightweight bird's-eye view perception model constructed by the above-described construction method, wherein when the program or instructions are executed by a processor, the processor performs a lightweight bird's-eye view perception method.

[0018] Compared with the prior art, the present invention has the following advantages and beneficial effects: (1) Significantly improved computational efficiency: Through the feature decoupling distillation strategy, the computational redundancy in the BEV space construction process is effectively eliminated, enabling the complex dense query perception algorithm to adapt to low-end vehicle embedded chips with limited computing power and achieving deep synergy between perception accuracy and inference speed. (2) Enhanced convergence robustness: By using the Hungarian algorithm to obtain the optimal allocation of positive and negative samples of the teacher model, the query-target correspondence that the student model should learn is clarified. The query embedding of the teacher model is injected into the student model decoder as prior knowledge, which enforces allocation consistency, constrains the prediction logic of the student model, and achieves bidirectional logical alignment. This avoids the assignment drift problem in the early training stage of DETR-like decoders, significantly shortens the training cycle of the lightweight model, and improves the localization stability of the model under complex road conditions. (3) More discriminative perceptual representation: By injecting the encoded ground-truth feature vector into the query pool of the student model decoder, the interaction between instances is simulated through the self-attention mechanism. At the same time, a cross-modal feature alignment loss based on contrastive learning is introduced to guide the local BEV features of the student model to align with the ground-truth semantic space. This breaks the black-box limitation of the decoding process and effectively repairs the feature discontinuity of small models when dealing with occluded targets and long-tailed scenes.

[0019] It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this document.

Attached Figure Description

[0020] The accompanying drawings, which form part of this document, are used to provide a further understanding of the document. The illustrative embodiments and descriptions herein are used to explain the document and do not constitute an undue limitation thereof. In the drawings:
Figure 1 is an overall logical block diagram of the perception method provided in an exemplary embodiment of this application;
Figure 2 is an overall architecture diagram of the perception model provided in an exemplary embodiment of this application;
Figure 3 is an overall architecture diagram of the knowledge-distillation-based perception network provided in an exemplary embodiment of this application;
Figure 4 is a logical block diagram of the truth-injection query interaction provided in an exemplary embodiment of this application;
Figure 5 is a block diagram of the query prior allocation distillation logic provided in an exemplary embodiment of this application;
Figure 6 is a schematic diagram of the global context module provided in an exemplary embodiment of this application;
Figure 7 is an example of a 3D object detection result based on the nuScenes dataset provided in an exemplary embodiment of this application;
Figure 8 is a rendering of the perception model generated by an exemplary embodiment of this application;
Figure 9 is a simplified structural diagram of a lightweight bird's-eye view perception device provided in an exemplary embodiment of this application.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of this application can be arbitrarily combined with each other.

[0022] Example 1: An exemplary embodiment of this application provides a method for constructing a lightweight bird's-eye view perception model. Based on knowledge distillation, the constructed perception framework is trained in three stages (feature representation enhancement, logical allocation alignment, and decoding interaction reinforcement) to transfer knowledge efficiently from the teacher model to the student model. As shown in Figure 1, the method includes the following steps: Step S1: Construct a perception framework based on knowledge distillation. As shown in Figures 2 and 5, the perception framework includes a pre-trained teacher model, a student model to be trained, and an image feature encoding module; both the teacher and student models include a BEV encoder and a decoder. Preferably, the teacher model uses the pre-trained BEVFormer-Base, which has a 6-layer encoder and a 6-layer decoder; the student model uses the lightweight BEVFormer-Tiny, which has a 3-layer decoder and lowers computational requirements by reducing the number of encoder layers and the BEV feature dimension.

[0023] Step S2: Acquire multi-view images and input them into the image feature encoding module of the perception framework. The image feature encoding module receives the multi-view image sequence acquired by the surround-view cameras and constructs a unified BEV representation through the spatiotemporal Transformer architecture. The teacher BEV feature map $F^T$ is then obtained through the BEV encoder of the teacher model, and the student BEV feature map $F^S$ is obtained through the BEV encoder of the student model. Each BEV feature map has size $H \times W \times d$, where $H$ and $W$ are the height and width of the BEV feature map, respectively, and $d$ is the feature channel dimension.

[0024] Step S3: Based on the dynamic query guidance mask $\Psi$, distill the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ to obtain the foreground Focal loss $\mathcal{L}_{fg}$, the attention alignment feature loss $\mathcal{L}_{att}$, and the global consistency loss $\mathcal{L}_{global}$. Specifically, step S3 includes the following steps: Step S31: Obtain the query vectors $Q^T \in \mathbb{R}^{M \times d}$ from the BEV decoder of the teacher model, where $M$ is the total number of query vectors preset by the teacher model and $d$ is the feature dimension of a query vector; the query vectors are output by the teacher model decoder and represent the feature encodings of potential targets. Calculate the spatial similarity response between the teacher query vectors $Q^T$ and the teacher BEV feature map $F^T$ to obtain the spatial response map $R_i$ and initial mask $\psi_i$ corresponding to each query vector. Specifically, the teacher BEV feature map $F^T \in \mathbb{R}^{H \times W \times d}$ is flattened to $\mathbb{R}^{HW \times d}$ while the dimensions of the query vectors $Q^T$ remain unchanged, and $Q^T$ is matrix-multiplied with the transposed flattened teacher BEV feature map to obtain the initial spatial response matrix in $\mathbb{R}^{M \times HW}$, in which the value in row $i$, column $j$ represents the feature similarity between the $i$-th query vector and the $j$-th spatial location of the teacher BEV feature map. The initial spatial response matrix is then reshaped to $\mathbb{R}^{M \times H \times W}$, which yields the spatial response map $R_i$ and initial mask $\psi_i$ for each query vector.
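The flatten, multiply, and reshape sequence in step S31 can be illustrated with a minimal dependency-free sketch. Nested lists stand in for tensors, the function name `spatial_response` is hypothetical, and the step that turns a response map into an initial mask is not shown because the text does not specify it.

```python
def spatial_response(queries, bev):
    """queries: M x d nested list; bev: H x W x d nested list -> M x H x W responses."""
    h, w = len(bev), len(bev[0])
    # Flatten the BEV map to (H*W) x d, row-major.
    flat = [bev[r][c] for r in range(h) for c in range(w)]
    # Entry (i, j) is the dot product of query i and spatial location j.
    resp = [[sum(qk * fk for qk, fk in zip(q, f)) for f in flat] for q in queries]
    # Reshape each length-(H*W) row back into an H x W response map.
    return [[row[r * w:(r + 1) * w] for r in range(h)] for row in resp]


bev = [[[1.0, 0.0], [0.0, 1.0]],
       [[1.0, 1.0], [0.0, 0.0]]]      # H = 2, W = 2, d = 2
queries = [[1.0, 0.0], [0.0, 2.0]]    # M = 2, d = 2
maps = spatial_response(queries, bev)
```

With a tensor library the same computation is a single matrix product followed by a reshape; the explicit loops here only make the indexing visible.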

[0025] Step S32: Calculate the quality assessment score $q_i$ of each query vector, and obtain the spatial response maps corresponding to the valid query vectors and the dynamic query guidance mask $\Psi$.

[0026] Specifically, for the $i$-th query vector and its initial mask $\psi_i$, the classification prediction score $s_i$ and the intersection over union (IoU) between the predicted bounding box and the corresponding ground-truth box are extracted from the output of the teacher model.

[0027] The quality assessment score $q_i$ and the dynamic query guidance mask $\Psi$ are calculated from the following quantities (the formulas themselves are not reproduced in this text): $s_i$ is the classification prediction score output by the teacher model for the $i$-th query; $\mathrm{IoU}(\cdot,\cdot)$ is the intersection over union; $N$ is the total number of object queries in the teacher model; $b_i$ is the bounding box predicted by the teacher model from the $i$-th query vector; $\hat{b}_i$ is the ground-truth bounding box assigned to the $i$-th query vector after bipartite matching; $\mathrm{IoU}(b_i, \hat{b}_i)$ is the intersection over union between the ground-truth and predicted bounding boxes; $\alpha$ adjusts the balance between the weights of the classification prediction score and the intersection over union; and $\psi_i$ is the initial mask corresponding to the $i$-th query vector.

[0028] Set a quality score threshold so that query vectors below the threshold are marked as invalid predictions, and their corresponding spatial response maps are not included in subsequent mask generation.
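The scoring and thresholding described above can be sketched as follows. Because the exact formulas are not available in this text, the sketch assumes one plausible form: a geometric weighting of the classification score and the IoU, and a mask formed as the quality-weighted average of the per-query initial masks. The functions `quality_score` and `guidance_mask`, the default balance value, and the threshold are all hypothetical.

```python
def quality_score(cls_score, iou, alpha=0.5):
    # Assumed form: geometric weighting of classification score and IoU.
    return (cls_score ** alpha) * (iou ** (1.0 - alpha))


def guidance_mask(scores, ious, masks, alpha=0.5, threshold=0.3):
    """Quality-weighted average of per-query masks (each an H x W nested list)."""
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    psi = [[0.0] * w for _ in range(h)]
    for s, u, m in zip(scores, ious, masks):
        q = quality_score(s, u, alpha)
        if q < threshold:
            continue  # invalid prediction: its mask is excluded
        for r in range(h):
            for c in range(w):
                psi[r][c] += q * m[r][c] / n
    return psi


masks = [[[1.0, 0.0], [0.0, 0.0]],
         [[0.0, 1.0], [0.0, 0.0]]]
psi = guidance_mask([0.81, 0.04], [0.81, 0.01], masks)
```

The second query scores about 0.02 and falls below the threshold, so only the first query's mask contributes to the result.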

[0029] Step S33: Using the dynamic query guidance mask $\Psi$, decouple the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ into a foreground target region and a background environment region, and perform weighted distillation on each region to obtain the foreground Focal loss $\mathcal{L}_{fg}$ and the attention alignment feature loss $\mathcal{L}_{att}$. Based on the dynamic query guidance mask $\Psi$, this application deeply decouples the alignment between the student BEV feature map $F^S$ and the teacher BEV feature map $F^T$. The high-interest regions with higher weights (i.e., the regions activated by the mask) correspond to key obstacles of interest to the teacher model, such as vehicles and pedestrians. Focal distillation is applied to these regions to compute the foreground Focal loss $\mathcal{L}_{fg}$ between $F^S$ and $F^T$, forcing the student model to accurately capture the teacher model's discriminative representation of key obstacles. The formula for the foreground Focal loss $\mathcal{L}_{fg}$ is not reproduced in this text.

[0030] The parameters of the formula are as follows: $A^{sp}_T$ is the spatial attention mask of the teacher model; $A^{ch}_T$ is the channel attention mask of the teacher model; $A^{sp}_S$ is the spatial attention mask of the student model; $A^{ch}_S$ is the channel attention mask of the student model; $\Psi$ is the dynamic query guidance mask; $1 - \Psi$ is the background environment region mask; $\phi(\cdot)$ is a linear transformation function for feature dimension alignment; $\alpha$ and $\beta$ are hyperparameters that balance the weights of foreground distillation and background distillation, respectively; $\odot$ denotes element-wise matrix multiplication (Hadamard product); and $\lVert \cdot \rVert^2$ denotes the squared norm.
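As a rough illustration of mask-weighted foreground and background distillation, the sketch below weights the squared feature error at each BEV location by the guidance mask for the foreground term and by its complement for the background term. This is a simplification under stated assumptions: the spatial and channel attention masks are omitted, the dimension alignment function is taken to be the identity (teacher and student share the channel dimension), and the name `masked_distillation` and the default weights are hypothetical.

```python
def masked_distillation(teacher, student, psi, w_fg=1.0, w_bg=0.5):
    """teacher/student: H x W x d nested lists; psi: H x W mask with values in [0, 1]."""
    fg = bg = 0.0
    for t_row, s_row, p_row in zip(teacher, student, psi):
        for t_vec, s_vec, p in zip(t_row, s_row, p_row):
            sq = sum((a - b) ** 2 for a, b in zip(t_vec, s_vec))
            fg += p * sq           # foreground: weighted by the mask
            bg += (1.0 - p) * sq   # background: weighted by its complement
    return w_fg * fg + w_bg * bg


teacher = [[[1.0, 1.0]], [[0.0, 0.0]]]   # H = 2, W = 1, d = 2
student = [[[0.0, 1.0]], [[0.0, 2.0]]]
psi_mask = [[1.0], [0.0]]
focal = masked_distillation(teacher, student, psi_mask)
```

Giving the foreground a larger weight than the background reflects the stated goal of prioritizing key obstacle regions over redundant background.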

[0031] To further narrow the gap in feature representation, the BEV activation distribution of the student model is guided to remain morphologically consistent with that of the teacher model through the attention alignment feature loss $\mathcal{L}_{att}$, which is calculated as follows:

$$\mathcal{L}_{att} = \gamma \left( \ell\left(A^{sp}_T, A^{sp}_S\right) + \ell\left(A^{ch}_T, A^{ch}_S\right) \right)$$

[0032] where $\ell(\cdot,\cdot)$ denotes the mean absolute error; $A^{sp}_T$ is the spatial attention mask of the teacher model; $A^{sp}_S$ is the spatial attention mask of the student model; $A^{ch}_T$ is the channel attention mask of the teacher model; $A^{ch}_S$ is the channel attention mask of the student model; and $\gamma$ is the balance coefficient. This loss term improves the student model's feature discrimination ability in the BEV space by forcing the student model to imitate the teacher's spatial and channel attention.
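A minimal sketch of an attention alignment term is shown below. It assumes the loss is the balance coefficient times the sum of two mean absolute errors, one over spatial attention masks and one over channel attention masks, with each mask given as a flat list. How the attention masks are computed from the feature maps is not specified in the text and is not shown; the function names are hypothetical.

```python
def mae(a, b):
    """Mean absolute error between two equal-length flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def attention_alignment_loss(sp_t, sp_s, ch_t, ch_s, gamma=1.0):
    """gamma * (MAE of spatial masks + MAE of channel masks)."""
    return gamma * (mae(sp_t, sp_s) + mae(ch_t, ch_s))


sp_teacher = [0.5, 0.5, 1.0, 0.0]   # hypothetical flattened H*W spatial attention
sp_student = [0.5, 0.0, 1.0, 0.5]
ch_teacher = [1.0, 0.0]             # hypothetical per-channel attention
ch_student = [0.5, 0.5]
att = attention_alignment_loss(sp_teacher, sp_student, ch_teacher, ch_student, gamma=2.0)
```

Here the spatial error is 0.25 and the channel error is 0.5, so the loss with a coefficient of 2 is 1.5.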

[0033] Step S34: Input the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ separately into the global context module to obtain the global consistency loss $\mathcal{L}_{global}$. The global context module is a mature model in existing technology, and its structure is shown in Figure 6.

[0034] The teacher BEV feature map $F^T$ and the student BEV feature map $F^S$ are input separately into the global context module, which extracts the long-range dependencies of the entire map through spatial pooling and a bottleneck transformation; the result is converted into the global consistency loss $\mathcal{L}_{global}$ for backpropagation. This ensures that the lightweight student model keeps its understanding of the macroscopic scene structure while computational redundancy is eliminated, avoiding the loss of long-range dependencies caused by lightweighting.
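The spatial pooling and bottleneck transformation can be sketched in a few lines. This is a simplified stand-in for a global context block, under several assumptions: attention pooling with a single key vector, a two-layer bottleneck with ReLU and no layer normalization, weights shared between teacher and student, and a squared-difference loss between the two global descriptors. All names and weights are hypothetical.

```python
import math


def context_vector(feats, key_w):
    """Attention-pool N x d location features into one d-vector (spatial pooling)."""
    logits = [sum(k * x for k, x in zip(key_w, f)) for f in feats]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    d = len(feats[0])
    return [sum(e / z * f[k] for e, f in zip(exps, feats)) for k in range(d)]


def bottleneck(vec, w_down, w_up):
    """d -> r -> d transform with ReLU (layer normalization omitted for brevity)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w_down]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_up]


def global_consistency_loss(feats_t, feats_s, key_w, w_down, w_up):
    g_t = bottleneck(context_vector(feats_t, key_w), w_down, w_up)
    g_s = bottleneck(context_vector(feats_s, key_w), w_down, w_up)
    return sum((a - b) ** 2 for a, b in zip(g_t, g_s))


feats_teacher = [[1.0, 0.0], [3.0, 0.0]]   # two flattened BEV locations, d = 2
feats_student = [[1.0, 0.0], [1.0, 0.0]]
g_loss = global_consistency_loss(feats_teacher, feats_student,
                                 key_w=[0.0, 0.0],      # zero key -> uniform pooling
                                 w_down=[[1.0, 1.0]],   # d = 2 -> r = 1
                                 w_up=[[1.0], [0.0]])   # r = 1 -> d = 2
```

Because every location contributes to the pooled descriptor, a mismatch anywhere in the scene affects the loss, which is how the term captures long-range structure.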

[0035] In this application, the dynamic query guidance mask $\Psi$ plays a "feature gating" role in the loss calculation: in the high-interest region covered by $\Psi$, the constraint that minimizes the mean-squared-error distance between features is strengthened, guiding the student model to reproduce features with high fidelity; in the background environment region $1 - \Psi$, long-range dependencies are extracted using the global context module. From the perspective of manifold learning, this ensures that the student model prioritizes preserving the manifold structure of key targets when compressing feature dimensions, while global operators compensate for the semantic fragmentation caused by decoupling.

[0036] Step S4: Obtain the bipartite matching permutation mapping $\hat{\sigma}$ of the teacher model based on the Hungarian algorithm, and establish an index mapping relationship between the teacher model and the student model to obtain the feature space collaborative loss $\mathcal{L}_{fs}$ between the teacher BEV feature map $F^T$ and the student BEV feature map $F^S$. For example, step S4 includes the following steps: Step S41: At the start of each training iteration, run the teacher model to obtain its prediction results, and derive positive and negative samples from these prediction results combined with the ground-truth annotations. Positive samples are prediction results whose IoU between the predicted bounding box and the ground-truth box exceeds a set threshold and whose class confidence is higher than a threshold; negative samples are prediction results that do not match any ground-truth box, whose IoU is below the set threshold, or whose class confidence is below the threshold.
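The positive and negative sample rule in step S41 amounts to a simple filter, sketched below. The dictionary layout, the function name `split_samples`, and the threshold values are hypothetical; the text only states that thresholds on IoU and class confidence are used.

```python
def split_samples(predictions, iou_thr=0.5, conf_thr=0.3):
    """predictions: list of dicts with 'iou' (None if unmatched) and 'conf'.

    Returns (positive_indices, negative_indices).
    """
    pos, neg = [], []
    for idx, p in enumerate(predictions):
        matched = p["iou"] is not None
        if matched and p["iou"] > iou_thr and p["conf"] > conf_thr:
            pos.append(idx)
        else:
            neg.append(idx)  # unmatched, low IoU, or low confidence
    return pos, neg


preds = [
    {"iou": 0.8, "conf": 0.9},    # good box, confident -> positive
    {"iou": None, "conf": 0.9},   # no matching ground-truth box -> negative
    {"iou": 0.2, "conf": 0.9},    # poor localization -> negative
    {"iou": 0.7, "conf": 0.1},    # low confidence -> negative
]
positives, negatives = split_samples(preds)
```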

[0037] Step S42: Obtain the query vectors of the teacher model using its BEV decoder. With the teacher query vectors and the positive and negative samples as input, the Hungarian algorithm solves the optimal bipartite graph matching, yielding the bipartite matching permutation mapping of the positive and negative samples at the teacher model output. This identifies the ground truth target or negative sample category corresponding to each teacher query vector.
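To make the matching output concrete, the sketch below finds the minimum-cost one-to-one assignment between queries and ground truth targets. Note the substitution: for readability it uses exhaustive search rather than the Hungarian algorithm. Both return an optimal assignment, but exhaustive search is only practical for tiny inputs; in practice a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used. The cost values are hypothetical.

```python
import math
from itertools import permutations


def match_queries(cost):
    """cost[q][t] is the cost of assigning query q to target t.

    Returns a list giving, for each query, its matched target index,
    or -1 when the query is left unmatched (a negative sample).
    Assumes at least as many queries as targets.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best_perm, best_cost = None, math.inf
    # perm[t] is the query assigned to target t.
    for perm in permutations(range(n_queries), n_targets):
        total = sum(cost[perm[t]][t] for t in range(n_targets))
        if total < best_cost:
            best_perm, best_cost = perm, total
    assigned = {q: t for t, q in enumerate(best_perm)}
    return [assigned.get(q, -1) for q in range(n_queries)]


# Three teacher queries, two ground truth targets (hypothetical costs).
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
mapping = match_queries(cost)
```

The returned list plays the role of the permutation mapping: it records, per query index, which target (or no target) that query is responsible for.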

[0038] Step S43: Using the stable prediction results of the teacher model as pseudo ground truth, extract the query embedding vectors from the last decoder layer of the teacher model, adapt them to the query vector dimension of the student model through a linear transformation layer, and inject them into the query vector pool of the student model decoder as prior knowledge for query allocation in the student model. Then, through the bipartite matching permutation mapping, establish an index mapping relationship between the teacher model and the student model, and associate the pseudo ground truth labels of the teacher model with the corresponding query vectors of the student model.

[0039] In this application, within the Transformer set prediction framework, the correspondence between prediction results and true targets is determined by a bipartite graph matching algorithm. Because the student model's parameters are extremely unstable in the early stages of training, it can experience severe assignment drift. This application therefore uses the converged bipartite matching permutation mapping of the teacher model as a logical baseline, and projects the query state of the teacher model directly into the student decoder space through the "query prior injection" path.

[0040] On this basis, the inference trajectory of the teacher model's multi-layer decoder, from "shallow global spatial distribution capture" to "deep fine-grained geometric attribute decoupling", is logically segmented. For the student model with a reduced layer count, a mapping function is used whose parameters are the matched teacher decoder layer index, the current student decoder layer index, the total number of teacher decoder layers, and the total number of student decoder layers. This establishes an equal-interval sampling relationship that projects the representation logic of key nodes in the teacher model onto the corresponding shallower architecture of the student model. Forcing the intermediate layers of the student model to learn the bipartite matching results that have reached a steady state in the teacher model suppresses the "assignment drift" caused by the random search of query points in the early training of Transformer-like decoders, so that the student model reproduces the complete perceptual evolution process while compressing the number of parameters. The student model is thereby given a pre-trained initial search space, which avoids the gradient oscillations caused by random initialization and keeps the inference trajectories of the lightweight and high-performance architectures semantically consistent.

[0041] For example, to address the representational fragmentation caused by a significant reduction in the number of decoder layers in the student model, this application employs an equal-interval sampling mapping strategy. When the student model retains only 3 decoder layers while the teacher model has 6, the mapping function is used to establish the layer correspondence.
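One common way to realize an equal-interval layer mapping is sketched below. The exact mapping function of this application is not reproduced in this text, so the ceiling-based form used here is an assumption, as is the 1-indexed layer numbering and the function name.

```python
def teacher_layer_for(student_layer: int, n_student: int, n_teacher: int) -> int:
    """Map a 1-indexed student decoder layer to a 1-indexed teacher layer.

    Assumed form: ceil(student_layer * n_teacher / n_student), computed
    with integer arithmetic so the last student layer always maps to the
    last teacher layer.
    """
    if not 1 <= student_layer <= n_student:
        raise ValueError("student_layer out of range")
    return (student_layer * n_teacher + n_student - 1) // n_student


# 3 student layers supervised by a 6-layer teacher.
pairs = [(l, teacher_layer_for(l, 3, 6)) for l in range(1, 4)]
```

Under this assumption, a 3-layer student paired with a 6-layer teacher is supervised by teacher layers 2, 4, and 6, so each student layer stands in for two consecutive teacher layers.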

[0042] Step S44: Calculate the feature space collaborative loss between the teacher BEV feature map and the student BEV feature map, together with the positive sample distillation loss and the negative sample distillation loss between the student model's prediction results and the pseudo ground truth of the teacher model.

[0043] The feature space collaborative loss is calculated from the following quantities: N, the number of positive sample query vectors; a linear transformation function used for feature dimension alignment; the output features of the i-th positive sample query vector in the hidden layer of the teacher model decoder; and the corresponding query features in the student model after matching.
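A sketch of how such a loss could be assembled from the quantities listed above is given here. The formula itself is not reproduced in this text, so the form is an assumption: the squared L2 distance between each teacher positive query feature and its matched, linearly projected student query feature, averaged over the N positive queries. Applying the linear transformation to the student side is also an assumption, and all names are hypothetical.

```python
def project(x, w):
    """Linear map for dimension alignment; w has shape [out_dim][in_dim]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def collaborative_loss(teacher_feats, student_feats, index_map, w_proj):
    """Average squared L2 distance over the N matched positive queries.

    teacher_feats[i]          features of the i-th positive teacher query
    student_feats[j]          features of the j-th student query
    index_map[i]              student query index matched to teacher query i
    """
    total = 0.0
    for i, t in enumerate(teacher_feats):
        s = project(student_feats[index_map[i]], w_proj)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(teacher_feats)


# Hypothetical data: teacher dimension 2, student dimension 1.
teacher_feats = [[1.0, 0.0], [0.0, 1.0]]
student_feats = [[1.0], [0.0]]
index_map = [1, 0]                 # teacher 0 <-> student 1, teacher 1 <-> student 0
w_proj = [[1.0], [1.0]]            # maps a 1-d student feature to 2-d
loss = collaborative_loss(teacher_feats, student_feats, index_map, w_proj)
```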

[0044] The feature space collaborative loss is added to the integrated training loss, and the student model parameters are updated via backpropagation, forcing the feature evolution trajectory of the student model to align with the higher-order inference path of the teacher model.

[0045] Step S5: Based on ground truth injection query interaction, obtain the perception enhancement loss function and the cross-modal feature alignment loss.

[0046] To overcome the "black box" limitation of decoder interaction, this application injects an explicit ground truth stream into the student model during the training phase.

[0047] Specifically, step S5 includes the following steps. Step S51: Use a multilayer perceptron (MLP) to encode the attributes of each annotated target, namely its category, 3D center coordinates, physical dimensions, the trigonometric functions of its yaw angle, and its two-dimensional instantaneous ground-plane velocity vector, into a ground truth feature vector. The MLP consists of three fully connected layers, with the middle layer using the ReLU activation function. Its output is a ground truth feature vector with the same feature dimension as the student model's BEV features, completing the transformation from physical attributes to semantic features.
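The attribute encoding of step S51 can be sketched as follows. Several details are assumptions made for illustration: the category is one-hot encoded, the attribute order is arbitrary, the MLP has no bias terms, ReLU is applied after every layer except the last, and the weights are random placeholders rather than trained values. The sizes (10 classes, hidden width 16, output width 8) are hypothetical.

```python
import math
import random


def encode_attributes(cls_id, num_classes, center, size, yaw, velocity):
    """Concatenate one-hot class with the ten physical regression values."""
    one_hot = [1.0 if i == cls_id else 0.0 for i in range(num_classes)]
    return (one_hot + list(center) + list(size)
            + [math.sin(yaw), math.cos(yaw)] + list(velocity))


def make_mlp(dims, seed=0):
    """Random placeholder weights; each layer has shape [out_dim][in_dim]."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
            for d_in, d_out in zip(dims[:-1], dims[1:])]


def mlp_forward(x, layers):
    """Fully connected layers with ReLU on all but the final layer."""
    for k, w in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


attrs = encode_attributes(
    cls_id=2, num_classes=10,
    center=(1.0, 2.0, 0.5), size=(4.0, 1.8, 1.5),
    yaw=math.pi / 2, velocity=(3.0, 0.0),
)
layers = make_mlp([len(attrs), 16, 16, 8])   # three fully connected layers
truth_query = mlp_forward(attrs, layers)     # same width as student queries
```

Encoding the yaw angle through its sine and cosine avoids the discontinuity at ±π that a raw angle would introduce, which is why the regression target has ten dimensions rather than nine.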

[0048] Step S52: During the training phase, the ground truth feature vectors are dynamically injected, as ground truth queries, into the query pool of the student model decoder, where they take part in decoding in parallel with the original learnable query vectors. Each ground truth query interacts with the other query vectors through the self-attention mechanism, simulating the spatial constraints and communication patterns between instances in a real scene. Through the cross-attention mechanism, it also extracts from the student model's BEV feature map the deterministic object representation corresponding to its ground truth target, forming a "ground-truth-guided" feature extraction path. Using ground truth information as an anchor provides deterministic physical constraints directly at the feature interaction level.

[0049] Step S53: Define the perception enhancement loss function, which precisely constrains the predicted output of each ground truth query after it has passed through the detection head.

[0050] The perception enhancement loss function is calculated from the following quantities: the predicted output of the ground truth query after it has been processed by the decoder and the detection head; the corresponding ground truth label, which contains the category of the real target and the ten-dimensional 3D physical regression parameters; and the perception enhancement constraint loss, a weighted combination of the classification discrimination loss and the 3D bounding box regression loss. Because each ground truth query has an unambiguous target, this constraint skips the Hungarian matching step and uses the natural index correspondence between predictions and ground truth, encouraging the student model's decoder to learn more discriminative feature representations.

[0051] Step S54: Based on the 3D center coordinates of the real target, crop the local feature block at the corresponding position from the student BEV feature map. The cropping range is a fixed-size area centered on the target center and covering the entire BEV projection of the target.

[0052] Step S55: Use a feature pooling layer to transform the local feature block into a fixed-dimensional visual vector. This provides standardized input for the subsequent cross-modal contrastive learning and establishes the basis for associating single-target visual features with ground truth semantic features.
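Steps S54 and S55 can be illustrated with the following sketch. It assumes a regular BEV grid described by an origin and a cell size, a square crop window clamped to the map borders, and average pooling as the pooling layer; the grid parameters and function names are hypothetical.

```python
import math


def world_to_cell(x, y, x_min, y_min, cell_size):
    """Convert ground-plane coordinates (metres) to a (row, col) grid index."""
    return (math.floor((y - y_min) / cell_size),
            math.floor((x - x_min) / cell_size))


def crop_and_pool(bev, row, col, half):
    """Crop a (2*half+1)-wide window around (row, col) and average-pool it.

    bev has shape [channel][row][col]; the result has one value per channel,
    so its length is fixed regardless of the crop size.
    """
    h, w = len(bev[0]), len(bev[0][0])
    r0, r1 = max(0, row - half), min(h, row + half + 1)
    c0, c1 = max(0, col - half), min(w, col + half + 1)
    vector = []
    for ch in bev:
        values = [ch[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        vector.append(sum(values) / len(values))
    return vector


# Hypothetical 1-channel 4x4 BEV map whose cell value is 4*row + col.
bev = [[[float(4 * r + c) for c in range(4)] for r in range(4)]]
row, col = world_to_cell(1.5, 1.5, x_min=0.0, y_min=0.0, cell_size=1.0)
visual_vector = crop_and_pool(bev, row, col, half=1)
```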

[0053] Step S56: Following the contrastive learning mechanism of the multimodal neural network CLIP, calculate the cosine similarity between the visual vector and the ground truth semantic feature vector generated by the ground truth encoder. Step S57: From the cosine similarities, construct a similarity matrix containing N instance pairs. Step S58: Calculate the normalized dot product between the in-batch object features F and the ground truth semantic feature vectors, scaled by a logit scaling factor that is learned during contrastive learning; the product between the two normalized feature sets is a matrix multiplication.

[0054] Furthermore, the cross-modal feature alignment loss is optimized bidirectionally, along the BEV axis and the ground truth axis, forcing the BEV target features to align with the ground truth semantic space and thereby repairing the feature discrimination gaps of the lightweight model in complex contexts.

[0055] Step S59: Obtain the cross-modal feature alignment loss by bidirectional optimization along the BEV axis and the ground truth axis. In its calculation formula, I denotes the target identity matrix, M denotes the cross-modal similarity matrix calculated from the visual features and the ground truth semantic features, and the loss operator is the cross-entropy loss function. During training, the visual features and ground truth features of the same instance form a positive pair, while different instances within the same batch serve as negative samples for each other. The loss forces the model to increase the similarity of positive pairs and suppress interference from negative samples, improving object discrimination and localization accuracy in the BEV space. In this application, explicit ground truth injection and cross-modal constraints enable the model to generate highly discriminative BEV representations even in complex scenes with blurred perceptual boundaries. The mechanism is removed at inference time and adds no computational load: it narrows the geometric distance between visual representations and ground truth semantics in the latent space, achieving explicit correction of the lightweight model's perceptual space without increasing inference cost.
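Steps S56 to S59 follow the familiar CLIP recipe, which the sketch below reproduces in plain Python. It assumes that the similarity matrix is the scaled product of L2-normalized features, that the scale is the exponential of a learnable logit parameter (as in CLIP), and that the two cross-entropy directions are averaged; the averaging and all names are assumptions rather than details confirmed by this text.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(visual, truth, logit_scale):
    """M[i][j] = exp(logit_scale) * cosine(visual_i, truth_j)."""
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in truth]
    scale = math.exp(logit_scale)
    return [[scale * sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_identity(m):
    """Cross-entropy of each row against the identity target (class i for row i)."""
    loss = 0.0
    for i, row in enumerate(m):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(m)


def alignment_loss(visual, truth, logit_scale=0.0):
    """Symmetric loss over the BEV axis (rows) and the ground truth axis (columns)."""
    m = similarity_matrix(visual, truth, logit_scale)
    m_t = [list(col) for col in zip(*m)]
    return 0.5 * (cross_entropy_identity(m) + cross_entropy_identity(m_t))


# Two hypothetical instances: matched pairs versus swapped pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(visual, [[2.0, 0.0], [0.0, 3.0]])
swapped = alignment_loss(visual, [[0.0, 3.0], [2.0, 0.0]])
```

Since the ground truth encoder and the pooling path are only needed to compute this loss, nothing in this sketch would run at inference time, which matches the claim above that the mechanism adds no inference cost.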

[0056] Step S6: Combine the original detection task loss, the foreground Focal loss, the attention alignment feature loss, the global consistency loss, the feature space collaborative loss, the perception enhancement loss function, and the cross-modal feature alignment loss to configure the integrated training loss function, a weighted combination of these terms.

[0057] The terms of the integrated training loss function are as follows. The original detection task loss includes the discrimination loss of the classification branch for foreground targets and the regression loss of the 3D bounding box regression branch for the ten-dimensional physical parameters. The attention alignment feature loss uses spatial attention masks and channel attention masks to achieve multi-dimensional alignment. The foreground Focal loss strengthens the student model's ability to lock onto target regions in the early stages of training. The global consistency loss optimizes the consistency of the global space in the later stages of training. The feature space collaborative loss minimizes the mean squared error (L2 distance) between the features of the query vectors at corresponding indices of the teacher and student models, suppressing assignment drift. The perception enhancement loss function precisely constrains the predicted output of the ground truth queries after the detection head and optimizes the interaction patterns between instances. The cross-modal feature alignment loss guides the BEV feature space toward the ground truth semantic space. The balancing hyperparameters adjust the relative weight of each loss term.

[0058] A joint training paradigm is employed: in the early stages of training, the foreground Focal loss strengthens the student model's ability to lock onto target regions; in the later stages, losses such as the global consistency loss and the perception enhancement loss function optimize the consistency of the global space and the interaction patterns between instances, achieving deep collaborative optimization from low-level localization to high-level semantics.
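The weighted combination and its staged emphasis could be organized as in the sketch below. The specific weights, the switch epoch, and the dictionary keys are hypothetical placeholders; the text only states that the foreground Focal loss matters most early in training and that the global consistency and instance interaction terms matter most later.

```python
LOSS_TERMS = ("focal", "attention", "global", "collab", "enhance", "align")


def staged_weights(epoch, switch_epoch=4):
    """Balancing weights per loss term; values here are placeholders."""
    weights = {name: 1.0 for name in LOSS_TERMS}
    if epoch < switch_epoch:
        # Early stage: emphasize foreground locking, damp high-level terms.
        weights["global"] = 0.5
        weights["enhance"] = 0.5
    else:
        # Later stage: emphasize global consistency and instance interaction.
        weights["focal"] = 0.5
    return weights


def total_loss(det_loss, aux_losses, weights):
    """Detection loss plus the weighted sum of the six distillation terms."""
    return det_loss + sum(weights[name] * aux_losses[name] for name in weights)


aux = {name: 1.0 for name in LOSS_TERMS}      # dummy per-term loss values
early = total_loss(2.0, aux, staged_weights(epoch=0))
late = total_loss(2.0, aux, staged_weights(epoch=10))
```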

[0059] The student model is trained with the integrated training loss function to obtain the lightweight bird's-eye view perception model. By jointly weighting the original detection task loss, the foreground Focal loss, the attention alignment feature loss, the global consistency loss, the feature space collaborative loss, the perception enhancement loss function, and the cross-modal feature alignment loss, this application constrains the student model along multiple dimensions, keeping it lightweight while raising its detection accuracy close to that of the teacher model.

[0060] Experimental verification: To verify the effectiveness of this method, lightweight perception and knowledge distillation experiments were conducted on nuScenes, a publicly available autonomous driving visual perception dataset.

[0061] Experimental setup: To verify the effectiveness of the method in resource-constrained scenarios, a lightweight deployment and distillation test environment was constructed. Model configuration: the BEVFormer-Base model, which has a large number of parameters, was selected as the teacher model, and a lightweight model with significantly fewer backbone parameters and decoder layers was used as the student model to simulate the computational constraints of low-end automotive chips. Data: the nuScenes dataset, which contains 1000 complex urban scenes with complete 3D annotations and varied weather conditions, was used for the full training and evaluation process. Evaluation metrics: mean average precision (mAP) and the nuScenes detection score (NDS) were used to measure perception performance, and the lightweighting effect was evaluated using inference latency and memory usage.

[0062] Experimental results: Table 1 Experimental Data

[0063] The experimental results are shown in Table 1. Under the same in-vehicle embedded computing power constraints, this method improved the NDS of the student model from 32.4% to 50.8% and increased its mAP to 35.6%. Logical alignment also improved convergence speed by a factor of 6, reducing the epochs required from 24 to 4. Furthermore, ablation experiments on long-tailed small targets show that the GT-QI module improved discrimination accuracy for such targets by 8.5% without increasing inference cost.

[0064] Logical alignment stability test: Visualizing the decoder's matching process shows that the baseline method suffers from severe Hungarian assignment oscillations in the early stages of training and needs approximately 24 epochs to converge to a stable state, whereas this invention, through teacher query prior injection, enables the student model to reach assignment stability by the 4th epoch. As shown in Figures 7 and 8, in complex intersection scenarios the prediction boxes generated by this invention anchor target instances more quickly and robustly, significantly reducing assignment delay and assignment drift.

[0065] Ablation experiments verified that removing the ground truth injection query interaction (GT-QI) module alone reduced the detection accuracy for long-tailed small targets (such as motorcycles and signs) by approximately 8.5%. This confirms the independent contribution of each module in this invention to raising the upper limit of lightweight perception and optimizing the feature evolution trajectory.

[0066] Embodiment 2: An exemplary embodiment of this application provides a lightweight bird's-eye view perception method, which deploys the lightweight bird's-eye view perception model constructed by the construction method of Embodiment 1 on an autonomous vehicle terminal and includes the following process: cameras mounted around the autonomous vehicle collect multi-view images of the surroundings in real time; the acquired multi-view images undergo image filtering, image normalization, and image standardization; and the processed multi-view images are fed into the lightweight bird's-eye view perception model, which outputs the target detection results in the BEV perspective.
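The normalization and standardization steps of the preprocessing chain can be sketched as below. This is an illustration only: it assumes 8-bit single-channel images, reads "normalization" as scaling to [0, 1] and "standardization" as subtracting a mean and dividing by a standard deviation, and omits the filtering step. The camera names and statistics are hypothetical.

```python
def normalize(image):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def standardize(image, mean, std):
    """Shift and scale pixel values using dataset statistics."""
    return [[(pixel - mean) / std for pixel in row] for row in image]


def preprocess_views(views, mean=0.5, std=0.5):
    """Apply both steps to every camera view; keys are camera names."""
    return {name: standardize(normalize(img), mean, std)
            for name, img in views.items()}


# Two hypothetical 1x2 camera images.
views = {"front": [[0, 255]], "rear": [[255, 0]]}
model_input = preprocess_views(views)
```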

[0067] Embodiment 3: An exemplary embodiment of this application provides a lightweight bird's-eye view perception device. The device includes at least one processor and at least one memory coupled to each other. The memory stores a computer-executable program for the lightweight bird's-eye view perception model constructed by the construction method of Embodiment 1. When the processor executes the computer-executable program stored in the memory, the processor performs the lightweight bird's-eye view perception method. The internal bus of the device can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus in the accompanying drawings is not limited to a single bus or a single type of bus. The memory may include high-speed RAM and may also include non-volatile memory (NVM), such as at least one disk storage device; it may also be a USB flash drive, external hard drive, read-only memory, magnetic disk, or optical disk.

[0068] The device can be provided as a terminal, server, or other form of device.

[0069] Figure 9 is a block diagram of an exemplary device. The device may include one or more of the following components: a processing component, a memory, a power supply component, a multimedia component, an audio component, an input / output (I / O) interface, a sensor component, and a communication component. The processing component typically controls the overall operation of the electronic device, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Furthermore, the processing component may include one or more modules to facilitate interaction between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.

[0070] Memory is configured to store various types of data to support the operation of electronic devices. Examples of this data include instructions for any application or method used to operate on an electronic device, contact data, phonebook data, messages, pictures, videos, etc. Memory can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.

[0071] A power supply component provides power to various components of an electronic device. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the electronic device. A multimedia component includes a screen that provides an output interface between the electronic device and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component includes a front-facing camera and / or a rear-facing camera. When the electronic device is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

[0072] The audio component is configured to output and / or input audio signals. For example, the audio component includes a microphone (MIC) configured to receive external audio signals when the electronic device is in an operating mode, such as call mode, recording mode, or voice recognition mode. The received audio signals may be further stored in memory or transmitted via a communication component. In some embodiments, the audio component also includes a speaker for outputting audio signals. The I / O interface provides an interface between the processing component and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to, a home button, volume buttons, a power button, and a lock button.

[0073] The sensor assembly includes one or more sensors for providing state assessments of various aspects of the electronic device. For example, the sensor assembly can detect the on / off state of the electronic device, the relative positioning of components such as the display and keypad of the electronic device, changes in the position of the electronic device or a component of the electronic device, the presence or absence of user contact with the electronic device, the orientation or acceleration / deceleration of the electronic device, and temperature changes of the electronic device. The sensor assembly may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.

[0074] The communication component is configured to facilitate wired or wireless communication between electronic devices and other devices. The electronic device can access wireless networks based on communication standards, such as WiFi, 2G, or 3G, or combinations thereof. In one exemplary embodiment, the communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

[0075] In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.

[0076] Embodiment 4: An exemplary embodiment of this application provides a computer-readable storage medium storing a computer program or instructions for the lightweight bird's-eye view perception model constructed by the construction method of Embodiment 1. When the program or instructions are executed by a processor, the processor performs the lightweight bird's-eye view perception method.

[0077] Specifically, a system, apparatus, or device may be provided equipped with a readable storage medium on which software program code implementing the functions of any of the embodiments described above is stored, and the computer or processor of the system, apparatus, or device reads and executes the instructions stored in the readable storage medium. In this case, the program code read from the readable medium itself can implement the functions of any of the embodiments described above, therefore, the machine-readable code and the readable storage medium storing the machine-readable code constitute a part of the present invention.

[0078] The aforementioned storage media can be implemented using any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disks or optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW), magnetic tape, etc. The storage media can be any available medium accessible to general-purpose or special-purpose computers.

[0079] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly manifested as execution by a hardware processor, or execution by a combination of hardware and software modules within the processor.

[0080] It should be understood that the storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and storage medium can reside in application-specific integrated circuits (ASICs). Alternatively, the processor and storage medium can exist as discrete components in a terminal or server.

[0081] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0082] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the status information of the computer-readable program instructions to implement various aspects of this disclosure.

[0083] In this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such an article or device. Without further limitation, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the article or device that includes that element.

[0084] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.

[0085] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if these modifications and variations fall within the scope of the claims of this application and their equivalents, the intent of this application also includes these modifications and variations.

Claims

1. A method for constructing a lightweight bird's-eye view perception model, characterized in that it includes the following steps:
Step S1: construct a perception framework based on knowledge distillation, wherein the perception framework includes a pre-trained teacher model and a student model to be trained;
Step S2: acquire multi-view images and input them into the perception framework, obtaining a teacher BEV feature map through the BEV encoder of the teacher model and a student BEV feature map through the BEV encoder of the student model;
Step S3: based on a dynamic query guidance mask, perform distillation on the teacher BEV feature map and the student BEV feature map to obtain a foreground focal loss, an attention alignment feature loss, and a global consistency loss;
Step S4: obtain the bipartite matching permutation mapping of the teacher model based on the Hungarian algorithm, and establish an index mapping relationship between the teacher model and the student model to obtain a feature space collaborative loss between the teacher BEV feature map and the student BEV feature map;
Step S5: based on ground-truth-injected query interaction, obtain a perception enhancement loss and a cross-modal feature alignment loss;
Step S6: combine the original detection task loss, the foreground focal loss, the attention alignment feature loss, the global consistency loss, the feature space collaborative loss, the perception enhancement loss, and the cross-modal feature alignment loss to configure an integrated training loss function, wherein the integrated training loss function is calculated as follows: [formula not reproduced in the source text]; where the terms of the formula denote, in order: the final integrated training loss function; the original detection task loss; the attention alignment feature loss; the foreground focal loss; the global consistency loss; the feature space collaborative loss; the perception enhancement loss; the cross-modal feature alignment loss; and the balance hyperparameters, which adjust the weight of each loss term; the student model is trained with the integrated training loss function to obtain the lightweight bird's-eye view perception model.
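The formula of step S6 is not reproduced in this text. As an illustration only, the sketch below assumes the common form in which the original detection task loss is added to a weighted sum of the six distillation losses, each scaled by its own balance hyperparameter. The function name, the dictionary keys, and the numeric values are hypothetical and do not come from the application.

```python
def integrated_loss(det_loss, aux_losses, weights):
    """Assumed form: det_loss + sum over k of weights[k] * aux_losses[k].

    aux_losses and weights are dicts keyed by loss name; every auxiliary
    loss must have a matching balance hyperparameter.
    """
    missing = set(aux_losses) - set(weights)
    if missing:
        raise KeyError(f"no balance hyperparameter for: {sorted(missing)}")
    return det_loss + sum(weights[name] * value
                          for name, value in aux_losses.items())


# Hypothetical loss values for one training step.
aux = {
    "attention_align": 0.5,
    "foreground_focal": 0.2,
    "global_consistency": 0.1,
    "feature_space": 0.4,
    "perception_enhance": 0.3,
    "cross_modal_align": 0.6,
}
weights = {name: 1.0 for name in aux}  # all balance hyperparameters set to 1
total = integrated_loss(1.0, aux, weights)
```

Keeping the weights in a dictionary keyed by loss name makes it easy to switch off one term (weight 0) when studying its contribution.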

2. The lightweight bird's-eye view perception model construction method according to claim 1, characterized in that Step S3 includes the following steps:
Step S31: obtain the query vectors using the BEV decoder of the teacher model, and calculate the spatial similarity response between the query vectors of the teacher model and the teacher BEV feature map to obtain the spatial response map and the initial mask corresponding to each query vector;
Step S32: calculate the quality assessment score of each query vector, and obtain the spatial response maps corresponding to the valid query vectors and the dynamic query guidance mask;
Step S33: using the dynamic query guidance mask, decouple the teacher BEV feature map and the student BEV feature map each into a foreground target region and a background environment region, and perform weighted distillation on the resulting foreground target regions and background environment regions respectively to obtain the foreground focal loss and the attention alignment feature loss;
Step S34: input the teacher BEV feature map and the student BEV feature map separately into a global context module to obtain the global consistency loss.

3. The lightweight bird's-eye view perception model construction method according to claim 2, characterized in that the quality assessment score and the dynamic query guidance mask are calculated as follows: [formulas not reproduced in the source text]; where the terms of the formulas denote: the classification prediction score output by the teacher model for the i-th query; the intersection over union (IoU); the total number of object queries in the teacher model; the bounding box predicted by the teacher model from the i-th query vector; the ground-truth bounding box assigned to the i-th query vector of the teacher model after bipartite matching; the IoU between the ground-truth bounding box and the predicted bounding box; a coefficient that adjusts the balance between the classification prediction score and the IoU; and the initial mask corresponding to the i-th query vector.

4. The lightweight bird's-eye view perception model construction method according to claim 1, characterized in that Step S4 includes the following steps:
Step S41: obtain the prediction results of the teacher model, and derive positive and negative samples from these prediction results;
Step S42: obtain the query vectors using the BEV decoder of the teacher model, and input the query vectors of the teacher model together with the positive and negative samples into the Hungarian algorithm to obtain the bipartite matching permutation mapping;
Step S43: using the stable prediction results of the teacher model as pseudo ground truth, establish an index mapping relationship between the teacher model and the student model through the bipartite matching permutation mapping, and associate the pseudo-ground-truth labels of the teacher model with the corresponding query vectors of the student model;
Step S44: calculate the feature space collaborative loss between the teacher BEV feature map and the student BEV feature map, as well as the positive sample distillation loss and the negative sample distillation loss between the prediction results of the student model and the pseudo ground truth of the teacher model.
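The permutation mapping of steps S42 and S43 is the solution of a minimum-cost assignment problem. The sketch below is illustrative only: it finds the optimal assignment by exhaustive search, which is a stand-in for the Hungarian algorithm and is practical only for a handful of queries (for real query counts, a solver such as `scipy.optimize.linear_sum_assignment` handles the same problem). The cost values, the square cost matrix, and the function name are assumptions, not taken from the application.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (perm, total) for a square cost matrix.

    cost[i][j] is the assumed matching cost between teacher
    pseudo-ground-truth label i and student query j; perm[i] is the
    student query index assigned to teacher label i.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical costs for three teacher labels and three student queries.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
perm, total = min_cost_assignment(cost)
# perm is the index mapping used to attach each teacher label
# to its matched student query.
```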

5. The lightweight bird's-eye view perception model construction method according to claim 4, characterized in that, in step S44, the feature space collaborative loss is calculated as follows: [formula not reproduced in the source text]; where N represents the number of positive sample query vectors, and the remaining terms denote the output feature of the i-th positive sample query vector in the decoder hidden layer of the teacher model and the corresponding query feature in the student model after matching.
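Because the formula of this claim is not reproduced in this text, the distance measure between matched features is unknown. The sketch below assumes, purely for illustration, the mean squared Euclidean distance over the N positive sample queries; the function name and the choice of norm are hypothetical.

```python
def feature_space_loss(teacher_feats, student_feats, perm):
    """Assumed form: (1/N) * sum_i ||t_i - s_perm[i]||^2.

    teacher_feats[i] is the decoder hidden-layer feature of the i-th
    positive sample query of the teacher model; perm[i] is the index of
    the matched student query feature.
    """
    n = len(teacher_feats)
    total = 0.0
    for i in range(n):
        t, s = teacher_feats[i], student_feats[perm[i]]
        total += sum((a - b) ** 2 for a, b in zip(t, s))
    return total / n


# Two positive queries with 2-dimensional features (hypothetical values).
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.0, 1.0], [1.0, 0.0]]
matched = feature_space_loss(teacher, student, perm=(1, 0))
unmatched = feature_space_loss(teacher, student, perm=(0, 1))
```

With the correct index mapping the loss vanishes for identical features, while the identity mapping pairs the wrong queries and is penalized.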

6. The lightweight bird's-eye view perception model construction method according to claim 1, characterized in that Step S5 includes the following steps:
Step S51: use a multilayer perceptron (MLP) to encode into a ground-truth feature vector the target category label and the ten-dimensional physical attribute encoding constructed from the three-dimensional center coordinates, the physical dimensions, the trigonometric function values of the yaw angle, and the two-dimensional instantaneous ground velocity vector;
Step S52: dynamically inject the ground-truth feature vector, as a ground-truth query, into the query pool of the student model decoder, so that it participates in the decoding calculation in parallel with the original learnable query vectors; the ground-truth query models the spatial constraints and interaction patterns between instances in the real scene through the self-attention mechanism, and extracts the deterministic object representation corresponding to the ground-truth target from the BEV feature map of the student model through the cross-attention mechanism;
Step S53: define the perception enhancement loss function, which imposes precise constraints on the prediction that the detection head outputs for the ground-truth query.
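The ten dimensions of step S51 are consistent with 3 (center) + 3 (physical dimensions) + 2 (yaw trigonometric values) + 2 (ground velocity). The sketch below assembles such a vector, assuming that the physical dimensions are length, width, and height, that the trigonometric values are the sine and cosine of the yaw angle, and that the components are concatenated in the order listed; none of these details is stated in the claim, and the MLP that consumes the vector is omitted.

```python
import math


def encode_box(center, size, yaw, velocity):
    """Build a 10-dimensional physical attribute vector.

    Assumed layout: [x, y, z, l, w, h, sin(yaw), cos(yaw), vx, vy].
    """
    x, y, z = center
    length, width, height = size
    vx, vy = velocity
    return [x, y, z, length, width, height,
            math.sin(yaw), math.cos(yaw), vx, vy]


# Hypothetical vehicle: 4 m long, heading along the x-axis at 0.5 m/s.
vec = encode_box(center=(1.0, 2.0, 3.0), size=(4.0, 2.0, 1.5),
                 yaw=0.0, velocity=(0.5, 0.0))
```

Encoding the yaw angle by its sine and cosine avoids the discontinuity at the wrap-around between -π and π.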

7. The lightweight bird's-eye view perception model construction method according to claim 6, characterized in that Step S5 further includes the following steps:
Step S54: based on the three-dimensional center coordinates of the real target, crop the local feature block at the corresponding position from the student BEV feature map;
Step S55: use a feature pooling layer to transform the local feature block into a fixed-dimensional visual vector;
Step S56: based on the contrastive learning mechanism of the multimodal neural network CLIP, calculate the cosine similarity between the visual vector and the ground-truth semantic feature vector generated by the ground-truth encoder;
Step S57: based on the obtained cosine similarities, construct a similarity matrix containing N instance pairs;
Step S58: calculate the normalized dot product between the in-batch object features F and the ground-truth semantic feature vectors as follows: [formula and two term definitions not reproduced in the source text]; where one term is the learnable logarithmic scaling factor of the contrastive learning process and one symbol represents matrix multiplication;
Step S59: obtain the cross-modal feature alignment loss, and perform bidirectional optimization along the BEV axis and the ground-truth axis, forcing the BEV target features to align with the ground-truth semantic space.
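Steps S56 to S59 follow the pattern of CLIP-style contrastive learning. The sketch below assumes the usual symmetric form: both feature sets are L2-normalized, their dot products are multiplied by the exponential of a learnable log scale, and cross-entropy is averaged over the two directions (BEV axis and ground-truth axis). The exact formula of the application is not reproduced in this text, so the function names, the 0.5 averaging, and the example values are assumptions.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum - row[i]
    return loss / len(logits)


def clip_alignment_loss(visual, truth, log_scale=0.0):
    """Symmetric contrastive loss over N (visual, ground-truth) pairs."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in truth]
    scale = math.exp(log_scale)
    # sim[i][j] = scaled cosine similarity of visual i and ground truth j.
    sim = [[scale * sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # ground-truth-to-visual direction
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(sim_t))


# Hypothetical features: pair i of `visual` matches pair i of `truth`.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_alignment_loss(visual, [[2.0, 0.0], [0.0, 3.0]], log_scale=2.0)
swapped = clip_alignment_loss(visual, [[0.0, 3.0], [2.0, 0.0]], log_scale=2.0)
```

The loss is small when each visual vector is most similar to its own ground-truth vector and large when the pairs are swapped, which is the behavior that pulls BEV target features toward the ground-truth semantic space.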

8. A lightweight bird's-eye view perception method, characterized in that the lightweight bird's-eye view perception model constructed by the construction method according to any one of claims 1 to 7 is deployed to an autonomous vehicle terminal, the method including the following process: deploy cameras around the autonomous vehicle to collect multi-view images of the surroundings in real time; subject the acquired multi-view images to image filtering, image normalization, and image standardization; and feed the processed multi-view images into the lightweight bird's-eye view perception model to obtain the target detection results in the BEV perspective.

9. A lightweight bird's-eye view perception device, characterized in that the device includes at least one processor and at least one memory, the processor and the memory being coupled together; the memory stores a computer-executable program of a lightweight bird's-eye view perception model constructed by the construction method according to any one of claims 1 to 7; when the processor executes the computer-executable program stored in the memory, the processor performs a lightweight bird's-eye view perception method.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or instructions for a lightweight bird's-eye view perception model constructed by the construction method according to any one of claims 1 to 7, wherein, when the program or instructions are executed by a processor, the processor performs a lightweight bird's-eye view perception method.