Unmanned vehicle visual transformer model compression method and system
By equating the Transformer Block structure to a fourth-order tensor kernel and performing TT-SVD dual-kernel decomposition, the high complexity problem of Transformer models on autonomous vehicle platforms is solved, achieving lightweight compression and real-time performance improvement, making it suitable for embedded inference chips in autonomous vehicles.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2025-12-16
- Publication Date
- 2026-06-30
AI Technical Summary
When existing Transformer models are deployed on autonomous vehicle platforms, they are limited by power consumption, computing power and storage space, resulting in high inference latency and reduced real-time performance. Furthermore, data-driven compression methods lack robustness and interpretability when migrating across tasks, modalities or datasets.
The Transformer Block structure is modeled as a fourth-order tensor kernel. By using TT-SVD dual-kernel tensor decomposition, a low-rank latent space is introduced to break the density of full-dimensional channel interaction and construct a lightweight structure suitable for embedded inference chips in autonomous vehicles, while preserving the integrity of nonlinear activation and residual connections.
The method maintains inference accuracy on resource-constrained in-vehicle hardware, improves the parameter compression rate and inference speed, reduces the deployment and inference latency of Transformer models, and meets the embedded deployment requirements of autonomous vehicles.
Figure CN121708558B_ABST
Description
Technical Field
[0001] This invention belongs to the field of intelligent deployment technology for unmanned vehicles, and specifically relates to a method and system for compressing the visual Transformer model of an unmanned vehicle.
Background Technology
[0002] In recent years, demand has grown for vision-based autonomous vehicles to perform target detection in complex environments. Such vehicles must process data from onboard cameras or other sensors in real time to recognize or track targets. Because task scenarios are complex, target scales vary, and real-time requirements are strict, models need strong global feature extraction and context modeling capabilities. The Transformer (a deep learning architecture based on the self-attention mechanism) captures global dependencies and therefore provides strong feature representations, achieving notable performance in autonomous vehicle tasks. However, the standard Transformer is built from a stack of Blocks (each containing an attention mechanism and a feedforward network), which leads to a very large number of parameters and heavy computation. With high-dimensional embeddings and long input sequences in particular, the parameter count grows quadratically with the embedding dimension and the attention computation grows quadratically with the sequence length. The onboard computing units of autonomous vehicle platforms (embedded NPUs) are limited in power consumption, computing power, and storage space, and struggle with the high-complexity inference of standard Transformers. Direct deployment of Transformers therefore often suffers from high inference latency and degraded real-time performance.
[0003] Current compression techniques for Transformer models are primarily data-driven: they rely on the weight distribution and statistical characteristics of the trained model and employ methods such as pruning and distillation. These are typically post hoc compression processes that depend on training samples from a specific task. While they may maintain good performance on that task, they lack robustness and interpretability when transferred across tasks, modalities, or datasets, showing insufficient generalization and degraded adaptability. Furthermore, data-driven methods struggle to reveal the operator-level structural redundancy within the Transformer, so they cannot provide theoretically interpretable compression of the model's high-dimensional mapping structure.
Summary of the Invention
[0004] The purpose of this invention is to address the problems in the prior art by providing a method and system for compressing the Transformer model for autonomous vehicles. The complete Transformer Block structure is modeled as a fourth-order tensor kernel, and TT-SVD (tensor-train singular value decomposition) dual-kernel tensor decomposition is performed to obtain the optimal dual-kernel approximation. A low-rank latent space is introduced, and feature transfer through this latent space breaks the density and redundancy of the full-dimensional channel interaction in the original Transformer Block. The physical structure is then reconstructed from the compressed TT-SVD dual kernels to achieve a lightweight structural replacement, while preserving the nonlinear activation and residual connections of the original structure, thus constructing an efficient Transformer model suitable for embedded inference chips in autonomous vehicles.
[0005] To achieve the above objectives, the present invention provides the following technical solution:
[0006] In a first aspect, a method for compressing the vision Transformer model of an autonomous vehicle is provided, including:
[0007] Acquiring image sequence perception data collected by the onboard camera of an autonomous vehicle, and segmenting and tokenizing the data to obtain the input sequence matrix of the Transformer model; extracting features from the input sequence matrix with the attention mechanism and mathematically deriving the expression of a fourth-order tensor kernel $\mathcal{T}$; using $\mathcal{T}$ to describe the tensor mapping from input features to output features, thereby constructing the tensorized representation of the attention mechanism;
[0008] Based on the Kronecker structure and linear homomorphism of the fourth-order tensor kernel $\mathcal{T}$, decoupling it into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario;
[0009] Using a Taylor expansion to locally linearize the activation function of the feedforward network of the Transformer model, so that the feedforward network is equivalent to a continuous linear tensor mapping, and bounding the error from above with the Frobenius norm so that the error is controllable;
[0010] Following the expression of the fourth-order tensor kernel $\mathcal{T}$, modeling the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel that gives a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and applying tensor-train singular value decomposition (TT-SVD) dual-kernel decomposition to obtain a rank-$R$ representation and the optimal dual-kernel approximation; the two kernels respectively implement spatial attention over target positions and channel mapping, so that the Transformer model after tensor decomposition and lightweight compression still maintains inference accuracy on resource-constrained vehicle hardware;
[0011] Mapping the compressed TT-SVD dual kernels back to physical space according to the source of their elements to construct a Transformer model suitable for embedded inference chips in autonomous vehicles.
[0012] As a preferred embodiment, the input sequence matrix of the Transformer model is $X$, satisfying $X \in \mathbb{R}^{n \times d}$, where $\mathbb{R}$ denotes the set of real numbers, $n$ is the number of tokens, and $d$ is the embedding dimension.
[0013] As a preferred embodiment, the step of extracting features from the input sequence matrix of the Transformer model with the attention mechanism, mathematically deriving the expression of the fourth-order tensor kernel $\mathcal{T}$, describing with it the tensor mapping from input features to output features, and constructing the tensorized representation of the attention mechanism is as follows. The attention mechanism is computed as:
[0014] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$
[0015] In the formula, $W_Q$, $W_K$, and $W_V$ are the projection matrices; $d_k$ is the projection dimension of the query and key, used to compute the attention weights; $d_v$ is the projection dimension of the value, used to compute the output;
[0016] Define the input sequence matrix of the Transformer model $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of tokens and $d$ is the embedding dimension; the $c$-th component of the $j$-th token is denoted $X_{j,c}$;
[0017] Define the query matrix $Q = XW_Q \in \mathbb{R}^{n \times d_k}$, the key matrix $K = XW_K \in \mathbb{R}^{n \times d_k}$, and the value matrix $V = XW_V \in \mathbb{R}^{n \times d_v}$, where $d_k$ is the projection dimension of Query and Key and $d_v$ is the projection dimension of Value;
[0018] Define the attention score matrix $S = QK^{\top}/\sqrt{d_k} \in \mathbb{R}^{n \times n}$, with components $S_{ij}$;
[0019] Define the attention matrix $A = \mathrm{softmax}(S)$, with components $A_{ij} = \exp(S_{ij}) / \sum_{j'} \exp(S_{ij'})$, which satisfy $\sum_{j} A_{ij} = 1$ for each $i$;
[0020] The output of the attention mechanism is:
[0021] $$Y = AV = AXW_V$$
[0022] The output component in the formula is denoted $Y_{i,o}$;
[0023] The input features at all key positions are weighted by their corresponding attention weights and summed to obtain the output feature value of each channel. On this basis, the attention computation is fully expanded in component form:
[0024] $$Y_{i,o} = \sum_{j=1}^{n} A_{ij}\, V_{j,o}$$
[0025] $$Y_{i,o} = \sum_{j=1}^{n} \sum_{c=1}^{d} A_{ij}\, X_{j,c}\, (W_V)_{c,o}$$
[0026] In the formula, $j$ is the key position; $i$ is the query position; the attention weight coefficient $A_{ij}$ reflects the degree of dependence between the two; $o$ is the output feature channel and $c$ is the token channel of the input sequence. The Value mapping matrix $W_V$ maps the input features to the output feature space, realizing a linear transformation.
[0027] According to the above formula, the mapping from the input features $X_{j,c}$ to the output features $Y_{i,o}$ is written in fourth-order tensor kernel form. This makes the multi-dimensional interaction between the Query position, output channel, Key position, and input channel explicit, and the entire mapping is expressed as a strict tensor contraction:
[0028] $$Y_{i,o} = \sum_{j=1}^{n} \sum_{c=1}^{d} \mathcal{T}_{(i,o),(j,c)}\, X_{j,c}, \qquad \mathcal{T}_{(i,o),(j,c)} = A_{ij}\,(W_V)_{c,o}$$
[0029] In the formula, $(i,o)$ corresponds to the output sequence position and output channel; $(j,c)$ corresponds to the input sequence position and input channel; the fourth-order tensor kernel $\mathcal{T}$ is jointly determined by the sequence interaction weights $A_{ij}$ and the channel mapping weights $(W_V)_{c,o}$, reflecting the correlation between sequence positions and the linear mapping between channels.
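The contraction above can be checked numerically. The following is a minimal NumPy sketch, not the patent's implementation: the toy sizes, the random weights, and the helper names `softmax_rows` and `attention_kernel` are assumptions made for illustration. It builds the kernel with entries $A_{ij}(W_V)_{c,o}$ and confirms that contracting it with $X$ reproduces $AXW_V$.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention_kernel(X, W_q, W_k, W_v):
    """Return (A, T) with T[i, o, j, c] = A[i, j] * W_v[c, o]."""
    d_k = W_q.shape[1]
    A = softmax_rows((X @ W_q) @ (X @ W_k).T / np.sqrt(d_k))
    T = np.einsum("ij,co->iojc", A, W_v)
    return A, T

rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 8, 4, 6          # toy sizes, illustration only
X = rng.standard_normal((n, d))
W_q = rng.standard_normal((d, d_k))
W_k = rng.standard_normal((d, d_k))
W_v = rng.standard_normal((d, d_v))

A, T = attention_kernel(X, W_q, W_k, W_v)
Y_direct = A @ (X @ W_v)                     # usual matrix form of attention
Y_kernel = np.einsum("iojc,jc->io", T, X)    # contraction over (j, c)
print(T.shape, np.allclose(Y_direct, Y_kernel))
```

Note that $A$ itself is computed from $X$, so the kernel is a linear map only once the attention weights are held fixed.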
[0030] As a preferred embodiment, the step of decoupling the fourth-order tensor kernel $\mathcal{T}$, based on its Kronecker structure and linear homomorphism, into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario includes:
[0031] The obtained fourth-order tensor kernel $\mathcal{T}$ is rearranged into a matrix $T$ according to the standard vectorization rule:
[0032] $$T \in \mathbb{R}^{(n d_v) \times (n d)}, \qquad T_{p,q} = \mathcal{T}_{(i,o),(j,c)}$$
[0033] That $T$ has the form of a Kronecker product is proved by the following two methods:
[0034] Method 1:
[0035] According to the standard vectorization rule, all components of the output $Y$ and the input $X$ are stacked column by column:
[0036] $$\mathrm{vec}(Y) \in \mathbb{R}^{n d_v}, \qquad \mathrm{vec}(X) \in \mathbb{R}^{n d}$$
[0037] In the formula, $p = i + (o-1)n$; $q = j + (c-1)n$;
[0038] Find $T$ such that $\mathrm{vec}(Y) = T\,\mathrm{vec}(X)$ holds and the components of $T$ satisfy the Kronecker product form;
[0039] Substituting $p$ and $q$ into the original component form gives:
[0040] $$T_{p,q} = A_{ij}\,(W_V)_{c,o}$$
[0041] Taking the matrices $W_V^{\top}$ and $A$ gives:
[0042] $$(W_V^{\top} \otimes A)_{p,q} = (W_V^{\top})_{o,c}\,A_{ij} = (W_V)_{c,o}\,A_{ij}$$
[0043] Because multiplication at the element level is commutative, $A_{ij}(W_V)_{c,o} = (W_V)_{c,o}A_{ij}$, so the two expressions above are equal, which gives:
[0044] $$T = W_V^{\top} \otimes A$$
[0045] In the formula, $\otimes$ denotes the Kronecker product;
[0046] Method 2:
[0047] From the mixed-product property of the Kronecker product, we obtain:
[0048] $$\mathrm{vec}(AXB) = (B^{\top} \otimes A)\,\mathrm{vec}(X)$$
[0049] Then:
[0050] $$\mathrm{vec}(Y) = \mathrm{vec}(AXW_V)$$
[0051] Applying the mixed-product rule of the Kronecker product again, we get:
[0052] $$\mathrm{vec}(Y) = (W_V^{\top} \otimes A)\,\mathrm{vec}(X)$$
[0053] Ultimately, we arrive at the same result:
[0054] $$T = W_V^{\top} \otimes A$$
[0055] Given that the attention weight matrix and the channel projection matrix are trainable, the fourth-order tensor kernel $\mathcal{T}$ has the linear homomorphism property, and the linear mapping decouples into independent tensor-product structures over the spatial position dimension and the channel mapping dimension. These correspond, respectively, to the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario.
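The Kronecker identity used in both methods can be verified numerically. This is a small sketch under assumed toy sizes; the random row-stochastic `A` stands in for an attention matrix, and `vec` is a hypothetical helper implementing column stacking.

```python
import numpy as np

def vec(M):
    """Column-stacking (column-major) vectorization."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)
n, d, d_v = 4, 6, 3                      # toy sizes, illustration only
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)        # rows sum to 1, like attention weights
X = rng.standard_normal((n, d))
W_v = rng.standard_normal((d, d_v))

T_mat = np.kron(W_v.T, A)                # shape (n*d_v, n*d)
lhs = vec(A @ X @ W_v)                   # vec(Y)
rhs = T_mat @ vec(X)                     # (W_V^T kron A) vec(X)

# Entry check with 0-based indices: row i + o*n, column j + c*n
i, o, j, c = 2, 1, 3, 4
entry_ok = np.isclose(T_mat[i + o * n, j + c * n], A[i, j] * W_v[c, o])
print(T_mat.shape, np.allclose(lhs, rhs), entry_ok)
```

The identity depends on column-major stacking; with row-major stacking the two Kronecker factors swap places.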
[0056] As a preferred embodiment, the step of using a Taylor expansion to locally linearize the activation function of the feedforward network of the Transformer model, so that the feedforward network is equivalent to a continuous linear tensor mapping, and bounding the error from above with the Frobenius norm so that the error is controllable, includes:
[0057] The feedforward network in the Transformer model acts on the feature vector of each token through two mapping layers with an element-wise nonlinearity between them:
[0058] $$\mathrm{FFN}(y) = W_2\,\sigma(W_1 y + b_1) + b_2$$
[0059] In the formula, $y$ is the output of the attention mechanism; $W_1$ and $b_1$ are the up-projection mapping and its bias vector; $W_2$ and $b_2$ are the down-projection mapping and its bias vector; $\sigma$ is the activation function, applied element-wise;
[0060] For each token and each feature dimension, the component forms are as follows:
[0061]
[0062]
[0063] Because the activation function $\sigma$ breaks linear separability and prevents a direct tensor kernel representation, a Taylor expansion is used for local linearization. Specifically, a first-order Taylor expansion of $\sigma$ at a point $h_0$ gives:
[0064] $$\sigma(h) \approx \sigma(h_0) + \sigma'(h_0)\,(h - h_0)$$
[0065] which, with $h = W_1 y + b_1$, gives:
[0066] $$\sigma(h) \approx D\,h + c, \qquad D = \mathrm{diag}\big(\sigma'(h_0)\big), \qquad c = \sigma(h_0) - D\,h_0$$
[0067] In the formula, $D$ is a diagonal matrix;
[0068] After local linearization of the activation function, the feedforward network is described by a continuous linear mapping:
[0069] $$\mathrm{FFN}(y) \approx W_2 D W_1\, y + \tilde{b}, \qquad \tilde{b} = W_2\,(D\,b_1 + c) + b_2$$
[0070] In the formula, $\tilde{b}$ is the bias term of the mapping;
[0071] Since the bias term does not change the low-rank structure of the tensor kernel, it is usually absorbed into the output vector or ignored during kernel decomposition. Furthermore, the residual connections within the Transformer Block and LayerNorm normalize the bias, making its impact on tensor kernel decomposition negligible. Ignoring the bias term therefore yields:
[0072] $$\mathrm{FFN}(y) \approx W_{\mathrm{eff}}\, y$$
[0073] In the formula, $W_{\mathrm{eff}} = W_2 D W_1$;
[0074] The linearization error described above is defined as follows:
[0075]
[0076] Bounding the error from above with the Frobenius norm gives:
[0077]
[0078] In the formula, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; the residual matrix directly quantifies the error of the local linearization; the norm of the second-layer weight matrix of the feedforward network (FFN) acts as the amplification factor;
[0079] For the ReLU and GELU activation functions, if the local linearization is performed at zero, we have:
[0080]
[0081] In all of these cases the upper bound of the error can be quantified and controlled through the Frobenius norm, which provides a basis for determining the latent space dimension in subsequent embedded deployment and engineering design.
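As an illustration of the kind of bound described above, the sketch below linearizes a GELU feedforward layer at zero and compares the Frobenius norm of the resulting output error with the product of the residual norm and the second-layer weight norm. It rests on several assumptions not stated in the source: the tanh approximation of GELU (value 0 and slope 0.5 at zero), tokens stored as rows, toy sizes, and the specific bound $\|R W_2^{\top}\|_F \le \|R\|_F \|W_2\|_F$, which follows from submultiplicativity of the Frobenius norm and may differ in form from the patent's own expression.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU (an assumption; the source does not fix the variant)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linearized_ffn_error(Y, W1, b1, W2, slope=0.5):
    """Linearize the activation at h0 = 0 and return (error norm, upper bound).

    Exact FFN:      gelu(H) @ W2.T + b2, with H = Y @ W1.T + b1
    Linearized FFN: (slope * H) @ W2.T + b2
    The bias b2 cancels in the difference, so it is not needed here.
    """
    H = Y @ W1.T + b1
    R = gelu(H) - slope * H                 # residual of the local linearization
    E = R @ W2.T                            # error at the FFN output
    return np.linalg.norm(E), np.linalg.norm(R) * np.linalg.norm(W2)

rng = np.random.default_rng(3)
n, d, d_ff = 5, 8, 32                       # toy sizes, illustration only
Y = 0.1 * rng.standard_normal((n, d))
W1 = rng.standard_normal((d_ff, d))
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d, d_ff))

err, bound = linearized_ffn_error(Y, W1, b1, W2)
print(err <= bound)
```

For 2-D arrays `np.linalg.norm` defaults to the Frobenius norm, which is why no `ord` argument is passed.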
[0082] As a preferred embodiment, the step of modeling the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel following the expression of the fourth-order tensor kernel $\mathcal{T}$, giving a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and applying tensor-train singular value decomposition (TT-SVD) dual-kernel decomposition to obtain a rank-$R$ representation and the optimal dual-kernel approximation, so that the two kernels respectively implement spatial attention over target positions and channel mapping and the lightweight compressed Transformer model maintains inference accuracy on resource-constrained vehicle hardware, includes:
[0083] By using the output of the attention mechanism as the input to the feedforward network of the Transformer model, the input-output relationship of the entire Transformer Block can be obtained:
[0084]
[0085] definition Then, we obtained:
[0086]
[0087] The complete fourth-order tensor kernel form of the Transformer Block is obtained from the mathematical expression at the element level:
[0088]
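[0089] The fourth-order tensor kernel is unfolded into a matrix $M$ under the mode-partition rule: one pair of its indices is merged into the row index according to the standard row indexing rule, and the other pair is merged into the column index according to the column indexing rule;
[0090] Perform singular value decomposition (SVD) on the matrix $M$:
[0091] $$M = U \Sigma V^{\top}$$
[0092] In the formula, $U$ and $V$ have orthonormal columns, and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots)$ holds the singular values in descending order, $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$;
[0093] Keeping the $R$ largest singular values gives the low-rank approximation:
[0094] $$M \approx M_R = U_R \Sigma_R V_R^{\top}$$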
[0095] Perform TT-style dual-core decomposition and structural reconstruction:
[0096]
[0097] In the formula, , ;
[0098] The two factors are reshaped by the inverse of the row-index mapping and of the column-index mapping, respectively, which yields the two kernels; at the same time:
[0099]
[0100] According to the Eckart-Young theorem, among all matrices of rank at most $R$, the truncated SVD $M_R$ is the globally optimal approximation of $M$ in the Frobenius norm sense, with a truncation error of:
[0101] $$\|M - M_R\|_F = \sqrt{\sum_{r > R} \sigma_r^{2}}$$
[0102] Let $\mathcal{K}$ denote the fourth-order kernel of the Block and $\mathcal{K}_R$ the tensor obtained by reshaping $M_R$. Since reshaping between matrix and tensor is an isometry, the Frobenius norm is preserved under reshape. Converting the above matrix optimality back to tensor form, we have:
[0103] $$\|\mathcal{K} - \mathcal{K}_R\|_F = \|M - M_R\|_F = \sqrt{\sum_{r > R} \sigma_r^{2}}$$
[0104] Therefore, the dual kernels given by TT-SVD are likewise the optimal tensor approximation in the Frobenius norm sense, and they respectively implement spatial attention between targets and channel mapping in the perception scenario of autonomous vehicles.
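The truncation step and its error can be sketched as follows. The source does not preserve the exact unfolding rule or how the singular values are split between the two factors, so this example assumes a row-major unfolding of a kernel with entries `A[i, j] * W[c, k]` and the common TT-SVD convention of keeping the left singular vectors in the first factor; `dual_core_truncate` is a hypothetical helper name.

```python
import numpy as np

def dual_core_truncate(K, R):
    """Unfold a 4th-order kernel K[i, k, j, c], truncate its SVD to rank R,
    and return two factors plus the reconstructed kernel and tail error."""
    n_out, d_out, n_in, d_in = K.shape
    M = K.reshape(n_out * d_out, n_in * d_in)      # (i, k) rows, (j, c) columns
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    G1 = U[:, :R]                                  # first factor (assumed split)
    G2 = s[:R, None] * Vt[:R, :]                   # second factor carries sigma
    K_R = (G1 @ G2).reshape(K.shape)
    tail = float(np.sqrt(np.sum(s[R:] ** 2)))      # Eckart-Young truncation error
    return G1, G2, K_R, tail

rng = np.random.default_rng(2)
n, d = 4, 6                                        # toy sizes, illustration only
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)
W = rng.standard_normal((d, d))
K = np.einsum("ij,ck->ikjc", A, W)

G1, G2, K_R, tail = dual_core_truncate(K, R=8)
err = float(np.linalg.norm((K - K_R).ravel()))     # Frobenius norm of the tensor
print(G1.shape, G2.shape, np.isclose(err, tail))
```

The tensor error equals the matrix truncation error because reshaping only reorders entries and leaves the Frobenius norm unchanged.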
[0105] As a preferred embodiment, the step of mapping the compressed TT-SVD dual cores back to physical space according to their element sources to construct a Transformer model suitable for embedded inference chips in autonomous vehicles includes:
[0106] A low-rank latent space is introduced between the two kernels obtained by TT-SVD tensor decomposition, and features are passed from one kernel to the other through it. Specifically, the first kernel realizes modal compression and feature extraction into the latent space dimension, while the second realizes latent feature recovery and channel reconstruction.
[0107] In engineering design, the rank R is selected through an energy retention rate strategy, which uniquely determines the dimension of the latent space.
[0108]
[0109] Based on the given upper bound on the absolute error, the minimum rank $R$ is constrained and the dimension is controlled:
[0110]
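A rank-selection rule of this kind can be sketched as below. The exact formulas are not preserved in the source, so this assumes the usual definitions: the energy retention rate is the fraction of the total squared singular values kept by the first $R$ values, and the absolute error is the Frobenius truncation error. The function name `select_rank` and the threshold parameters `energy` and `max_abs_error` are hypothetical.

```python
import numpy as np

def select_rank(singular_values, energy=0.95, max_abs_error=None):
    """Smallest R whose retained energy is >= `energy` and, if given,
    whose truncation error sqrt(sum_{r>R} sigma_r^2) is <= `max_abs_error`."""
    s = np.asarray(singular_values, dtype=float)
    cum = np.cumsum(s ** 2)
    total = cum[-1]
    # First index whose cumulative energy ratio reaches the target
    R = int(np.searchsorted(cum / total, energy, side="left")) + 1
    if max_abs_error is not None:
        tails = np.sqrt(np.maximum(total - cum, 0.0))   # tails[r-1]: error at rank r
        ok = np.nonzero(tails <= max_abs_error)[0]      # never empty: tails[-1] == 0
        R = max(R, int(ok[0]) + 1)
    return min(R, len(s))

s = [3.0, 2.0, 1.0]          # squared energies 9, 4, 1 (total 14)
print(select_rank(s, energy=0.5),
      select_rank(s, energy=0.9),
      select_rank(s, energy=0.5, max_abs_error=0.5))
```

The selected rank is the dimension of the latent space, so a higher energy target or a tighter error bound directly enlarges the compressed structure.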
[0111] We selected the TinyBERT model and the SST-2 task dataset to carry out the design and replacement of a lightweight structure based on TT-SVD dual-core tensor decomposition, and completed experimental verification.
[0112] As a preferred approach, the steps of selecting the TinyBERT model and the SST-2 task dataset, designing and replacing a lightweight structure based on TT-SVD dual-core tensor decomposition, and completing experimental verification include:
[0113] The two kernels obtained by TT-SVD dual-kernel tensor decomposition are mapped back to physical space according to the source of their elements, which gives a two-layer lightweight Block structure. The rank $R$ and the latent space dimension are uniquely determined by the "energy retention rate" and error upper bound control strategy. This structure replaces the Block structure in the original Transformer model, while the nonlinear activation and residual connections of the original model remain unchanged to preserve the generalization ability of the model.
[0114] A lightweight compression comparison experiment of the TinyBERT model was conducted on the SST-2 task dataset. The test device was a GPU RTX 3070, the batch size was 16, and the test conditions and test parameters were completely consistent to ensure the fairness of the experiment.
[0115] The lightweight compressed Transformer model is exported in the Open Neural Network Exchange (ONNX) format, and the Huawei Ascend Tensor Compiler (ATC) tool is used to convert and optimize the model into the OM format so that it fits the inference and computing architecture of the Huawei Ascend 310B chip.
[0116] The converted .om model is loaded into the inference framework of CANN, the Ascend heterogeneous computing architecture, and deployed on the Huawei Ascend 310B chip to complete the experimental verification.
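The two-layer lightweight replacement described in the steps above can be illustrated on a single dense mapping. This is a simplified sketch, not the patent's reconstruction procedure: it factors one weight matrix through a rank-$R$ latent space with a plain SVD, and the sizes, the helper names `factorize_dense` and `compression_ratio`, and the exactly rank-$R$ test matrix are all assumptions. Activations and residual connections sit outside this mapping and are untouched.

```python
import numpy as np

def factorize_dense(W, R):
    """Split W (d_in x d_out) into a down-projection (d_in x R) into the
    latent space and an up-projection (R x d_out) out of it."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_down = U[:, :R] * s[:R]      # scale each kept column by its singular value
    W_up = Vt[:R, :]
    return W_down, W_up

def compression_ratio(d_in, d_out, R):
    """Dense parameter count divided by the two-layer parameter count."""
    return (d_in * d_out) / (R * (d_in + d_out))

rng = np.random.default_rng(4)
d_in, d_out, R = 256, 256, 64      # toy sizes, illustration only
# An exactly rank-R matrix, so the factorization is lossless in this demo
W = rng.standard_normal((d_in, R)) @ rng.standard_normal((R, d_out))
W_down, W_up = factorize_dense(W, R)

x = rng.standard_normal((16, d_in))            # a batch of 16 feature vectors
same = np.allclose(x @ W, (x @ W_down) @ W_up)
print(same, compression_ratio(d_in, d_out, R))
```

For a real weight matrix the factorization is only approximate, and the parameter saving appears only when $R < d_{in} d_{out} / (d_{in} + d_{out})$.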
[0117] As a preferred embodiment, the input / output and inference process of the Huawei Ascend 310B chip and the autonomous vehicle platform data includes:
[0118] By embedding the Huawei Ascend 310B chip in the autonomous vehicle, the image sequence data acquired by the vehicle camera is preprocessed by calling the tokenizer tool through the Huawei Ascend hardware front-end application layer to obtain the input sequence matrix of the Transformer model.
[0119] The input data, including the token IDs and the attention mask, is packed into tensors, and the data transfer and memory copy operations are completed through the Ascend Computing Language (ACL) interface.
[0120] The .om model is scheduled and executed by the Huawei Ascend 310B chip. Each operator in the model is mapped to the corresponding operator kernel on the neural network processor (NPU) and runs on the chip core. The process sequentially performs embedding lookup, attention calculation, feedforward network calculation, and layer normalization. Finally, the output is generated through linear transformation. The entire process is accelerated end-to-end on the NPU.
[0121] In a second aspect, a system for compressing the vision Transformer model of an autonomous vehicle is provided, including:
[0122] An attention mechanism tensor representation module, used to acquire image sequence perception data collected by the vehicle's onboard camera, segment and tokenize the data to obtain the input sequence matrix of the Transformer model, extract features from the input sequence matrix with the attention mechanism, mathematically derive the expression of a fourth-order tensor kernel $\mathcal{T}$, and use $\mathcal{T}$ to describe the tensor mapping from input features to output features, thereby constructing the tensorized representation of the attention mechanism;
[0123] A fourth-order tensor kernel decoupling module, used to decouple the fourth-order tensor kernel $\mathcal{T}$, based on its Kronecker structure and linear homomorphism, into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario;
[0124] A local linearization module, used to locally linearize the activation function of the feedforward network of the Transformer model with a Taylor expansion, so that the feedforward network is equivalent to a continuous linear tensor mapping, and to bound the error from above with the Frobenius norm so that the error is controllable;
[0125] A TT-SVD dual-kernel decomposition module, used to model the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel following the expression of the fourth-order tensor kernel $\mathcal{T}$, giving a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and to apply tensor-train singular value decomposition (TT-SVD) dual-kernel decomposition to obtain a rank-$R$ representation and the optimal dual-kernel approximation; the two kernels respectively implement spatial attention over target positions and channel mapping, so that the Transformer model after tensor decomposition and lightweight compression still maintains inference accuracy on resource-constrained vehicle hardware;
[0126] A TT-SVD dual-kernel mapping module, used to map the compressed TT-SVD dual kernels back to physical space according to the source of their elements and construct a Transformer model suitable for embedded inference chips in autonomous vehicles.
[0127] Compared with the prior art, the present invention has at least the following beneficial effects:
[0128] This invention preprocesses the image sequence perception data acquired by an autonomous vehicle's onboard camera on the vehicle's CPU or a board-level CPU and uses it as input to the Transformer attention mechanism. An exact expression of a fourth-order tensor kernel is derived for the attention mechanism, and a tensorized representation of the attention mechanism is constructed to describe the tensor mapping from input to output. Based on the strictly separable Kronecker structure and linear homomorphism of this fourth-order tensor kernel, it can be decoupled into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario. This separability is the foundation for the subsequent tensor decomposition and structural compression. A Taylor expansion is used to locally linearize the activation function in the Transformer feedforward network, which removes the nonlinearity barrier to the tensor representation and makes the network equivalent to a continuous tensor mapping. The upper bound of the error is quantified with the Frobenius norm to achieve error control, providing a basis for embedded deployment and engineering design. A fourth-order tensor kernel is used to model the entire Transformer Block, giving a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task. Dual-kernel tensor decomposition is then performed with tensor-train singular value decomposition (TT-SVD), yielding the optimal rank-$R$ representation and the optimal dual-kernel approximation. This achieves spatial attention and channel mapping in the autonomous vehicle target detection task, while ensuring that the compressed model maintains inference accuracy on resource-constrained in-vehicle hardware.
The compressed dual-kernel structure introduces a low-rank latent space for feature transfer, breaking the density and redundancy of the full-dimensional channel interactions in the original Transformer Block. The compressed dual kernels are mapped back to physical space according to the source of their elements, constructing an efficient Transformer model suitable for embedded inference chips in autonomous vehicles. This invention selects the TinyBERT model and the SST-2 task dataset to verify the lightweight compression performance of the Transformer model on a GPU server. For the autonomous vehicle target detection task, relying on an autonomous vehicle platform, it selects the Huawei Ascend 310B chip for engineering design and deployment testing on a domestic NPU. The results show that, in the RTX 3070 GPU deployment scenario, the parameter compression rate within a single Transformer Block layer reaches 3.2× and the single-layer inference speed improves by 3×; over the total number of model parameters, the compression rate reaches 1.3× and the end-to-end inference speed improves by 1.8×, achieving lightweight compression of the overall Transformer structure. In the 310B chip deployment scenario, with a concurrency of 2, the NPU utilization reaches 70%, the average latency is 6 ms, and the throughput reaches 330 QPS (queries per second). This invention presents a Transformer compression method based on TT-SVD dual-kernel tensor decomposition for deployment in unmanned vehicles. It also has a degree of generality and scalability for other unmanned equipment based on computer vision, reducing the inference latency of deployed Transformer models and meeting the requirements for interpretable compression.
Attached Figure Description
[0129] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention. For those skilled in the art, other related drawings can be obtained from these drawings without creative effort.
[0130] Figure 1 is an overall architecture diagram of the unmanned vehicle vision Transformer model compression method according to an embodiment of the present invention;
[0131] Figure 2 is a flowchart of the TT-SVD dual-kernel tensor decomposition algorithm according to an embodiment of the present invention;
[0132] Figure 3 is a flowchart of the algorithm for determining the rank R and the latent space dimension based on the "energy retention rate" strategy according to an embodiment of the present invention;
[0133] Figure 4 is a data flow diagram for deploying the lightweight TinyBERT model on the Huawei Ascend 310B chip according to an embodiment of the present invention.
Detailed Implementation
[0134] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, those skilled in the art can obtain other embodiments without creative effort.
[0135] It is understood that the terms “first”, “second”, etc., used in this application may be used to describe various concepts in this application. Unless otherwise stated, these concepts are not limited by these terms, which are only used to distinguish one concept from another.
[0136] Unless otherwise defined, all technical and scientific terms used in this application have the same meaning as commonly understood by one of ordinary skill in the same technical field as this application. The terms used in this application are for the purpose of describing embodiments of this application only and are not intended to limit this application.
[0137] This invention provides a Transformer compression method based on TT-SVD dual-core tensor decomposition for autonomous vehicle deployment. Starting from the model structure itself, it directly focuses on the multi-layered nested dual-mode (token×feature) interaction structure within the Transformer, which can be mathematically modeled as a high-order tensor kernel operator with multiple-input multiple-output mapping. Based on this, this invention proposes a structured compression method that equivalently models the complete Transformer Block structure as a fourth-order tensor kernel. TT-SVD dual-core tensor decomposition is then performed to obtain the optimal dual-core approximation. A low-rank latent space is introduced, and feature propagation in the latent space breaks the density and redundancy of the full-dimensional channel interactions in the original Transformer Block. The physical structure is reconstructed based on the compressed TT-SVD dual core, achieving lightweight structural replacement while preserving the integrity of the original structure's nonlinear activations and residual connections, thus constructing an efficient Transformer model suitable for embedded inference chips in autonomous vehicles. Finally, the TinyBERT model and the SST-2 task dataset were selected to conduct lightweight compression verification of the Transformer model on the GPU; for the target detection task in the embedded deployment scenario of unmanned vehicles, the Huawei Ascend 310B chip was selected for engineering design and domestic NPU deployment testing.
[0138] Referring to Figure 1, the unmanned vehicle vision Transformer model compression method of this invention specifically includes:
[0139] S1. Acquire image sequence perception data collected by the onboard camera of the autonomous vehicle, and segment and tokenize the data to obtain the input sequence matrix of the Transformer model; extract features from the input sequence matrix with the attention mechanism and mathematically derive the expression of a fourth-order tensor kernel $\mathcal{T}$; use $\mathcal{T}$ to describe the tensor mapping from input features to output features, thereby constructing the tensorized representation of the attention mechanism;
[0140] S2. Based on the Kronecker structure and linear homomorphism of the fourth-order tensor kernel $\mathcal{T}$, decouple it into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario;
[0141] S3. Use a Taylor expansion to locally linearize the activation function of the feedforward network of the Transformer model, so that the feedforward network is equivalent to a continuous linear tensor mapping, and bound the error from above with the Frobenius norm so that the error is controllable;
[0142] S4. Following the expression of the fourth-order tensor kernel $\mathcal{T}$, model the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel that gives a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and apply tensor-train singular value decomposition (TT-SVD) dual-kernel decomposition to obtain a rank-$R$ representation and the optimal dual-kernel approximation; the two kernels respectively implement spatial attention over target positions and channel mapping, so that the Transformer model after tensor decomposition and lightweight compression still maintains inference accuracy on resource-constrained vehicle hardware;
[0143] S5. Map the compressed TT-SVD dual kernels back to physical space according to the source of their elements, and construct a Transformer model suitable for embedded inference chips in unmanned vehicles.
[0144] In one possible implementation, the input sequence matrix of the Transformer model in step S1 is $X$, which satisfies $X \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ denotes the set of real numbers, $N$ is the number of tokens, and $d$ is the embedding dimension.
[0145] Furthermore, for the image sequence perception data collected by the vehicle-mounted camera in the autonomous vehicle target detection task, the input sequence matrix of the Transformer is obtained through a standard tokenization operation performed on the onboard CPU. The attention mechanism dynamically weights and fuses the features at each position of the input sequence according to the similarity between query and key, generating the output features for each query position. It thereby captures the positional relationships in the sequence, that is, the spatial dependencies between different targets in the autonomous vehicle perception scenario, and models global information.
[0146] The complete computational process of the attention mechanism is as follows:
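[0147] $Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\quad \mathrm{Attention}(X) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$
[0148] In the formula, $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, and $W_V \in \mathbb{R}^{d \times d_v}$ are the projection matrices; $d_k$ is the projection dimension of the query and key, used to calculate the attention weights; $d_v$ is the projection dimension of the value, used to calculate the output;
[0149] Define the input sequence matrix of the Transformer model $X \in \mathbb{R}^{N \times d}$, where $N$ is the number of tokens and $d$ is the embedding dimension; the $l$-th component of the $k$-th token is denoted $x_{kl}$;
[0150] Define the query matrix $Q \in \mathbb{R}^{N \times d_k}$, the key matrix $K \in \mathbb{R}^{N \times d_k}$, and the value matrix $V \in \mathbb{R}^{N \times d_v}$, where $d_k$ is the projection dimension of Query and Key and $d_v$ is the projection dimension of Value;
[0151] Define the attention scoring matrix $S = QK^{\top}/\sqrt{d_k} \in \mathbb{R}^{N \times N}$, whose component is $s_{ik} = \frac{1}{\sqrt{d_k}}\sum_{m=1}^{d_k} q_{im}\,k_{km}$;
[0152] Define the attention matrix $A = \mathrm{softmax}(S) \in \mathbb{R}^{N \times N}$, whose component is $A_{ik} = \exp(s_{ik}) / \sum_{k'=1}^{N}\exp(s_{ik'})$ and which satisfies, for each $i$, $\sum_{k=1}^{N} A_{ik} = 1$;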
[0153] The output of the attention mechanism is:
[0154] $Y = AV = AXW_V \in \mathbb{R}^{N \times d_v}$
[0155] The output component in the formula is denoted as $y_{ij}$;
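The attention computation defined above can be sketched numerically as follows. This is a minimal single-head illustration, not the patent's implementation: the names `X`, `W_Q`, `W_K`, `W_V`, the scaling by the square root of `d_k`, and the toy sizes are assumptions made only for the example.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def single_head_attention(X, W_Q, W_K, W_V):
    """Return (A, Y): attention matrix A (N x N) and output Y = A V (N x d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = W_Q.shape[1]
    S = (Q @ K.T) / np.sqrt(d_k)   # attention scoring matrix
    A = softmax_rows(S)            # each row of A sums to 1
    return A, A @ V

rng = np.random.default_rng(0)
N, d, d_k, d_v = 5, 8, 4, 6        # toy sizes, chosen only for illustration
X = rng.standard_normal((N, d))
W_Q = rng.standard_normal((d, d_k))
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_v))
A, Y = single_head_attention(X, W_Q, W_K, W_V)
```

Each row of `A` holds the weights with which one query position mixes all key positions, which is the normalization property stated for the attention matrix.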
[0156] The input features at all key positions are weighted and summed according to their corresponding attention weights to obtain the output feature values for each channel. Based on this, the calculation expression of the attention mechanism is fully expanded in component form to obtain:
[0157] $y_{ij} = \sum_{k=1}^{N} A_{ik}\,v_{kj}$
[0158] $y_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} A_{ik}\,x_{kl}\,(W_V)_{lj}$
[0159] In the formula, $k$ is the position of the key; $i$ is the position of the query; the attention weight coefficient $A_{ik}$ reflects the degree of dependence between the two; $j$ indexes the output feature channels and $l$ indexes the token channels of the input sequence. The Value mapping matrix $W_V$ maps the input features to the output feature space as a linear transformation.
[0160] According to the above formula, the mapping relationship from the input features $x_{kl}$ to the output features $y_{ij}$ is represented in fourth-order tensor kernel form. This makes the multi-dimensional interaction between the query position, output channel, key position, and input channel explicit. The entire mapping can be written as a strict tensor contraction, as shown in the following expression:
[0161] $y_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} \mathcal{T}_{ijkl}\,x_{kl}, \qquad \mathcal{T}_{ijkl} = A_{ik}\,(W_V)_{lj}$
[0162] In the formula, $(i, j)$ corresponds to the output sequence position and output channel; $(k, l)$ corresponds to the input sequence position and input channel; the fourth-order tensor kernel $\mathcal{T}$ is jointly determined by the sequence interaction weights $A_{ik}$ and the channel mapping weights $(W_V)_{lj}$, reflecting the correlation between sequence positions and the linear mapping between channels.
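As a hedged illustration of the tensor contraction described above, the sketch below builds a fourth-order kernel from an attention matrix and a Value projection and checks that contracting it with the input reproduces the ordinary matrix product. The index order `(i, j, k, l)` and the random stand-in for the attention matrix are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, d_v = 4, 6, 3                   # toy sizes for illustration
X = rng.standard_normal((N, d))
W_V = rng.standard_normal((d, d_v))
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)     # stand-in attention matrix, rows sum to 1

# Fourth-order kernel: T[i, j, k, l] = A[i, k] * W_V[l, j]
T = np.einsum("ik,lj->ijkl", A, W_V)

# Contract the kernel with the input over the (k, l) index pair
Y_kernel = np.einsum("ijkl,kl->ij", T, X)
Y_matrix = A @ X @ W_V                # the same mapping as a matrix product
```

The kernel stores `N * d_v * N * d` entries even though it is generated by only `N * N + d * d_v` numbers, which is the redundancy the later decomposition exploits.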
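[0163] In one possible implementation, step S2, decoupling the fourth-order tensor kernel into the spatial dependencies between different targets and the feature channel mapping in the autonomous vehicle perception scenario on the basis of its Kronecker structure and linear homomorphic properties, includes:
[0164] Writing the attention output as $Y = AXW_V$, with input $X \in \mathbb{R}^{N \times d}$, attention matrix $A \in \mathbb{R}^{N \times N}$, and Value projection matrix $W_V \in \mathbb{R}^{d \times d_v}$, the obtained fourth-order tensor kernel $\mathcal{T}$ is rearranged into a matrix $T$ according to the standard vectorization rule, expressed as follows:
[0165] $\mathrm{vec}(Y) = T\,\mathrm{vec}(X), \qquad T \in \mathbb{R}^{N d_v \times N d}$
[0166] That $T$ has the form of a Kronecker product is proved by the following two methods:
[0167] Method 1:
[0168] According to the standard vectorization rule, all components of the output $Y$ and the input $X$ are stacked column by column:
[0169] $\mathbf{y} = \mathrm{vec}(Y),\ \mathbf{y}_p = y_{ij}; \qquad \mathbf{x} = \mathrm{vec}(X),\ \mathbf{x}_q = x_{kl}$
[0170] In the formula, $p = i + (j-1)N$; $q = k + (l-1)N$;
[0171] Find $T$ such that $\mathbf{y} = T\mathbf{x}$ holds, with the components of $T$ satisfying the Kronecker product form;
[0172] Substituting $p$ and $q$ into the original component form gives:
[0173] $T_{pq} = \mathcal{T}_{ijkl} = A_{ik}\,(W_V)_{lj}$
[0174] Taking the matrices $W_V^{\top}$ and $A$, we have:
[0175] $(W_V^{\top} \otimes A)_{pq} = (W_V^{\top})_{jl}\,A_{ik} = (W_V)_{lj}\,A_{ik}$
[0176] Because multiplication is commutative at the element level, $A_{ik}\,(W_V)_{lj} = (W_V)_{lj}\,A_{ik}$, the two expressions above are equal, which gives:
[0177] $T = W_V^{\top} \otimes A$
[0178] In the formula, $\otimes$ denotes the Kronecker product;
[0179] Method 2:
[0180] According to the identity relating vectorization to the Kronecker product, $\mathrm{vec}(BCD) = (D^{\top} \otimes B)\,\mathrm{vec}(C)$, we obtain:
[0181] $\mathrm{vec}(XW_V) = (W_V^{\top} \otimes I_N)\,\mathrm{vec}(X)$
[0182] Then:
[0183] $\mathrm{vec}(Y) = \mathrm{vec}(A\,XW_V) = (I_{d_v} \otimes A)\,\mathrm{vec}(XW_V) = (I_{d_v} \otimes A)(W_V^{\top} \otimes I_N)\,\mathrm{vec}(X)$
[0184] Applying the mixed-product property of the Kronecker product, $(B \otimes C)(D \otimes E) = (BD) \otimes (CE)$, we get:
[0185] $(I_{d_v} \otimes A)(W_V^{\top} \otimes I_N) = (I_{d_v}W_V^{\top}) \otimes (A\,I_N) = W_V^{\top} \otimes A$
[0186] Ultimately, we arrive at the same result:
[0187] $T = W_V^{\top} \otimes A$
[0188] Inverting the standard vectorization rule maps $T$ back to the fourth-order tensor components $\mathcal{T}_{ijkl} = A_{ik}\,(W_V)_{lj}$ given above. This strictly separable Kronecker structure shows that, with the attention weight matrix and the channel projection matrix both trainable, the tensor kernel $\mathcal{T}$ has linear homomorphic properties: its linear mapping can be decoupled into independent tensor-product factors along the spatial position dimension and the channel mapping dimension, which correspond respectively to the spatial dependency relationship between different targets and the mapping relationship of feature channels in the autonomous vehicle perception scenario. This modal separability at the tensor level is the basis for the subsequent tensor decomposition and structural compression.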
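The Kronecker structure can be checked numerically. The sketch below assumes column-by-column vectorization and the mapping `Y = A X W_V`; under those assumptions the vectorized kernel equals `kron(W_V.T, A)`. The names, sizes, and the zero-based index convention are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, d_v = 4, 6, 3
X = rng.standard_normal((N, d))
W_V = rng.standard_normal((d, d_v))
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)

def vec(M):
    """Column-major (column-by-column) vectorization."""
    return M.reshape(-1, order="F")

T_mat = np.kron(W_V.T, A)                 # shape (N*d_v, N*d)
lhs = vec(A @ X @ W_V)
rhs = T_mat @ vec(X)

# Mixed-product route: (I ⊗ A)(W_V^T ⊗ I) gives the same matrix
T_mixed = np.kron(np.eye(d_v), A) @ np.kron(W_V.T, np.eye(N))

# Entry check with zero-based indices: row p <-> (i, j), column q <-> (k, l)
i, j, k, l = 2, 1, 3, 4
p, q = i + j * N, k + l * N
entry_ok = bool(np.isclose(T_mat[p, q], A[i, k] * W_V[l, j]))
```

The two Kronecker factors are exactly the position-mixing matrix and the channel-mixing matrix, which is what makes the kernel separable along the two modes.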
[0189] In one possible implementation, step S3, using Taylor expansion to locally linearize the activation function in the feedforward network of the Transformer model so that the feedforward network is equivalent to a continuous linear tensor mapping, and quantifying the upper bound of the error with the Frobenius norm to make the error controllable, includes:
[0190] The feedforward network in the Transformer model is applied to the feature vector of each token, specifically as a two-layer mapping with an element-wise nonlinearity, as shown in the following expression:
[0191] $\mathrm{FFN}(y) = \sigma(yW_1 + b_1)\,W_2 + b_2$
[0192] In the formula, $y$ is the output of the attention mechanism for one token, written as a row vector; $W_1$ and $b_1$ are the dimension-raising mapping and its bias vector; $W_2$ and $b_2$ are the dimension-reducing mapping and its bias vector; $\sigma(\cdot)$ is the activation function, applied element-wise as a nonlinearity;
[0193] For each token $i$ and feature dimension $j$, the component forms are:
[0194] $h_{im} = \sum_{j'} y_{ij'}\,(W_1)_{j'm} + (b_1)_m$
[0195] $z_{ij} = \sum_{m} \sigma(h_{im})\,(W_2)_{mj} + (b_2)_j$
[0196] Because the activation function $\sigma$ breaks linear separability and prevents a direct tensor kernel representation, Taylor expansion is used here for local linearization. Specifically, a first-order Taylor expansion of $\sigma$ at a point $h_0$ gives the following expression, understood element-wise:
[0197] $\sigma(h) \approx \sigma(h_0) + \sigma'(h_0)\,(h - h_0)$
[0198] which gives:
[0199] $\sigma(yW_1 + b_1) \approx \sigma(h_0) + (yW_1 + b_1 - h_0)\,D$
[0200] In the formula, $D = \mathrm{diag}(\sigma'(h_0))$ is a diagonal matrix;
[0201] After local linearization of the activation function, the feedforward network is described by a continuous linear mapping:
[0202] $\mathrm{FFN}(y) \approx y\,W_1 D W_2 + \tilde{b}$
[0203] In the formula, $\tilde{b} = \big(\sigma(h_0) + (b_1 - h_0)D\big)W_2 + b_2$ is the bias term of the mapping;
[0204] Since the bias term does not change the low-rank structure of the tensor kernel, it is usually absorbed into the output vector or ignored during kernel decomposition. Furthermore, the residual connections within the Transformer Block and the normalization performed by LayerNorm regularize the bias, making its impact on the tensor kernel decomposition negligible. Therefore, ignoring the bias term yields:
[0205] $\mathrm{FFN}(y) \approx y\,W_{\mathrm{FFN}}$
[0206] In the formula, $W_{\mathrm{FFN}} = W_1 D W_2$;
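The local linearization can be sketched as follows. This is only an illustration under stated assumptions: a row-vector convention for token features, the tanh approximation of GELU, an expansion point of zero, and a finite-difference derivative; none of these choices is confirmed by the source, and the names `W1`, `W2`, `b1`, `b2`, `h0` are hypothetical.

```python
import numpy as np

def gelu(h):
    """tanh approximation of GELU."""
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def gelu_grad(h, eps=1e-6):
    """Central-difference derivative of gelu, adequate for a sketch."""
    return (gelu(h + eps) - gelu(h - eps)) / (2.0 * eps)

def linearize_ffn(W1, b1, W2, b2, h0):
    """First-order Taylor linearization of y -> gelu(y @ W1 + b1) @ W2 + b2 around h0."""
    D = np.diag(gelu_grad(h0))                         # diagonal matrix of derivatives
    W_ffn = W1 @ D @ W2                                # effective linear mapping
    b_tilde = ((b1 - h0) @ D + gelu(h0)) @ W2 + b2     # collected bias term
    return W_ffn, b_tilde

rng = np.random.default_rng(3)
d, d_ff = 6, 12
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
b1 = rng.standard_normal(d_ff) * 0.01
b2 = rng.standard_normal(d) * 0.01
h0 = np.zeros(d_ff)                                    # expansion point (assumed: zero)

W_ffn, b_tilde = linearize_ffn(W1, b1, W2, b2, h0)

Y = rng.standard_normal((4, d)) * 0.1                  # small inputs stay near h0
exact = gelu(Y @ W1 + b1) @ W2 + b2
approx = Y @ W_ffn + b_tilde
max_err = np.abs(exact - approx).max()
```

The approximation is only local: inputs whose pre-activations move far from `h0` will show a larger `max_err`, which is why the error needs to be bounded explicitly.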
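[0207] The linearization error described above is defined as follows:
[0208] $E = \mathrm{FFN}(Y) - \widehat{\mathrm{FFN}}(Y) = R\,W_2$
[0209] Bounding the error from above with the Frobenius norm gives:
[0210] $\|E\|_F = \|R\,W_2\|_F \le \|R\|_F\,\|W_2\|_F$
[0211] In the formula, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $\widehat{\mathrm{FFN}}$ is the locally linearized feedforward network with its bias term retained; $R$ is the residual matrix between the activation function and its first-order expansion, which directly quantifies the error of the local linearization; and $\|W_2\|_F$ is the amplification factor contributed by the weight matrix $W_2$ of the second feedforward layer;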
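The Frobenius-norm bound can be checked on a toy example. The sketch below is an assumption-laden illustration: it uses ReLU expanded at a positive point (where it is differentiable), omits biases, and compares the propagated error with both the spectral-norm and the Frobenius-norm amplification factors, since the source does not say which norm of the second-layer weights is intended.

```python
import numpy as np

def relu(h):
    return np.maximum(h, 0.0)

rng = np.random.default_rng(4)
n_tok, d, d_ff = 5, 6, 12
Y = rng.standard_normal((n_tok, d))
W1 = rng.standard_normal((d, d_ff))
W2 = rng.standard_normal((d_ff, d))

H = Y @ W1                                   # pre-activations (biases omitted)
h0 = np.full(d_ff, 0.5)                      # expansion point where ReLU is differentiable
D = np.diag((h0 > 0).astype(float))          # ReLU derivative at h0

R = relu(H) - (relu(h0) + (H - h0) @ D)      # residual of the local linearization
E = R @ W2                                   # error after the second layer

err = np.linalg.norm(E, "fro")
bound_spec = np.linalg.norm(R, "fro") * np.linalg.norm(W2, 2)      # tighter bound
bound_fro = np.linalg.norm(R, "fro") * np.linalg.norm(W2, "fro")   # looser bound
```

Both products are valid upper bounds on `err`; the Frobenius version is looser but needs no singular value computation.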
[0212] For the ReLU and GELU activation functions, when the local linearization is performed at the zero point, we have:
[0213]
[0214] In both cases the upper bound of the error can be quantified and controlled through the Frobenius norm, which provides a basis for determining the latent space dimension in the subsequent embedded deployment and engineering design.
[0215] In one possible implementation, step S4, modeling the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel following the fourth-order tensor kernel expression, giving a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and decomposing it by tensor-train (TT) singular value decomposition (SVD) dual-core decomposition to obtain a rank-R representation and the two cores, which respectively implement the spatial attention that associates target positions and the channel mapping so that the compressed Transformer model maintains inference accuracy on resource-constrained in-vehicle hardware, includes:
[0216] Taking the output of the attention mechanism as the input of the feedforward network of the Transformer model gives the input-output relationship of the entire Transformer Block:
[0217] $z_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} A_{ik}\,x_{kl}\,(W_V W_{\mathrm{FFN}})_{lj}$
[0218] Defining $\widetilde{W} = W_V W_{\mathrm{FFN}}$, the product of the Value projection and the linearized feedforward mapping, we obtain:
[0219] $z_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} A_{ik}\,\widetilde{W}_{lj}\,x_{kl}$
[0220] The complete fourth-order tensor kernel form of the Transformer Block follows from this element-level expression:
[0221] $\mathcal{T}_{ijkl} = A_{ik}\,\widetilde{W}_{lj}$
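[0222] The fourth-order tensor kernel $\mathcal{T}$ is flattened into a matrix: the standard row indexing rule $p = i + (j-1)N$ gives $N d_o$ rows in total, and the column indexing rule $q = k + (l-1)N$ gives $N d$ columns in total, where $d_o$ is the number of output channels; under this $(i,j)\,|\,(k,l)$ mode partitioning rule the matrix $M \in \mathbb{R}^{N d_o \times N d}$ with $M_{pq} = \mathcal{T}_{ijkl}$ is constructed;
[0223] Perform SVD on the matrix $M$:
[0224] $M = U\Sigma V^{\top}$
[0225] In the formula, $U \in \mathbb{R}^{N d_o \times \rho}$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_\rho)$, $V \in \mathbb{R}^{N d \times \rho}$, and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\rho$ are the positive singular values, with $\rho = \mathrm{rank}(M)$;
[0226] The rank-$R$ low-rank approximation is:
[0227] $M_R = U_R\Sigma_R V_R^{\top} = \sum_{r=1}^{R} \sigma_r\,u_r v_r^{\top}$
[0228] Perform TT-style dual-core decomposition and structural reconstruction:
[0229] $M_R = G_1 G_2$
[0230] In the formula, $G_1 \in \mathbb{R}^{N d_o \times R}$, $G_2 \in \mathbb{R}^{R \times N d}$;
[0231] Inverting the row index mapping on $G_1$ and the column index mapping on $G_2$ yields the dual cores $\mathcal{G}_1 \in \mathbb{R}^{N \times d_o \times R}$ and $\mathcal{G}_2 \in \mathbb{R}^{R \times N \times d}$, and at the same time:
[0232] $\mathcal{T}^{(R)}_{ijkl} = \sum_{r=1}^{R} \mathcal{G}_1(i,j,r)\,\mathcal{G}_2(r,k,l)$
[0233] According to the Eckart–Young theorem, among all approximations of rank at most $R$, the $M_R$ given by the SVD is the globally optimal approximation in the Frobenius norm sense, with a truncation error of:
[0234] $\|M - M_R\|_F = \sqrt{\sum_{r=R+1}^{\rho} \sigma_r^2}$
[0235] Let $\mathcal{T}^{(R)}$ be the tensor obtained by reshaping $M_R$. Since the reshape between matrix and tensor is isometric, the Frobenius norm is preserved under reshape. Converting the above matrix optimality back to tensor form, we have:
[0236] $\|\mathcal{T} - \mathcal{T}^{(R)}\|_F = \|M - M_R\|_F = \min_{\mathrm{rank}(\hat{M}) \le R}\|M - \hat{M}\|_F$
[0237] Therefore, the dual cores $\mathcal{G}_1$ and $\mathcal{G}_2$ given by TT-SVD likewise form the optimal tensor approximation under the Frobenius norm, and respectively implement spatial attention between targets and channel mapping in the perception scenario of autonomous vehicles.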
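A minimal sketch of a two-core TT-SVD split is given below. It rests on assumptions not confirmed by the source: the kernel is unfolded with output indices as rows and input indices as columns (in NumPy's row-major order, which permutes rows and columns relative to column-major indexing but leaves singular values and norms unchanged), and the singular values are absorbed into the second core. The names `dual_core_tt_svd`, `G1`, `G2`, and `W_tilde` are hypothetical.

```python
import numpy as np

def dual_core_tt_svd(T, R):
    """Split a 4th-order kernel T[i, j, k, l] into two cores of TT-rank R.

    Rows of the unfolding are indexed by (i, j) and columns by (k, l).
    Returns G1 (N, d_out, R), G2 (R, N, d_in) and all singular values.
    """
    N, d_out, N2, d_in = T.shape
    M = T.reshape(N * d_out, N2 * d_in)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    G1 = U[:, :R].reshape(N, d_out, R)
    G2 = (s[:R, None] * Vt[:R]).reshape(R, N2, d_in)   # singular values absorbed here (a choice)
    return G1, G2, s

rng = np.random.default_rng(5)
N, d_in, d_out, R = 4, 6, 5, 3
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)
W_tilde = rng.standard_normal((d_in, d_out))
T = np.einsum("ik,lj->ijkl", A, W_tilde)               # block kernel A[i,k] * W_tilde[l,j]

G1, G2, s = dual_core_tt_svd(T, R)
T_R = np.einsum("ijr,rkl->ijkl", G1, G2)               # kernel rebuilt from the two cores

tensor_err = np.linalg.norm(T - T_R)                   # Frobenius norm of the flattened tensor
tail = np.sqrt(np.sum(s[R:] ** 2))                     # Eckart-Young truncation error
```

Because reshaping does not change the Frobenius norm, `tensor_err` matches `tail`, the square root of the discarded squared singular values.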
[0238] In one possible implementation, step S5, mapping the compressed TT-SVD dual cores back to physical space according to the source of their elements and constructing a Transformer model suitable for embedded inference chips in autonomous vehicles, includes:
[0239] The two cores obtained by TT-SVD tensor decomposition introduce a low-rank latent space through which features are passed from one core to the other. Specifically, the former performs modal compression and feature extraction in the latent space dimension, while the latter performs latent feature recovery and channel reconstruction. Unlike the dense mapping of the original Transformer Block, the decomposed dual-core lightweight structure is essentially a low-rank tensor flow, which removes the redundancy of full-dimensional channel interaction in the original Transformer Block and thereby significantly reduces the number of parameters and the computational complexity.
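The low-rank flow through the latent space can be sketched as two successive contractions. Which of the patent's two cores plays which role is not recoverable from the text, so the sketch names them by function: `G_compress` is contracted with the input and `G_restore` expands the latent features. The core shapes and the entry-count comparison are assumptions of this example, not figures from the source.

```python
import numpy as np

def apply_dual_core(G_restore, G_compress, X):
    """Propagate X through the latent space: compress to R features, then restore."""
    latent = np.einsum("rkl,kl->r", G_compress, X)     # R latent features
    Y = np.einsum("ijr,r->ij", G_restore, latent)      # back to (N, d_out)
    return Y, latent

def param_counts(N, d_in, d_out, R):
    """Entries of the dense kernel versus entries of the two cores."""
    dense = (N * d_out) * (N * d_in)
    cores = R * (N * d_out + N * d_in)
    return dense, cores

rng = np.random.default_rng(6)
N, d_in, d_out, R = 4, 6, 5, 3
G_restore = rng.standard_normal((N, d_out, R))
G_compress = rng.standard_normal((R, N, d_in))
X = rng.standard_normal((N, d_in))

Y, latent = apply_dual_core(G_restore, G_compress, X)
T_R = np.einsum("ijr,rkl->ijkl", G_restore, G_compress)   # dense kernel implied by the cores
Y_dense = np.einsum("ijkl,kl->ij", T_R, X)
dense, cores = param_counts(N, d_in, d_out, R)
```

Applying the cores one after the other gives the same output as the dense kernel they imply, while storing 132 numbers instead of 480 in this toy configuration.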
[0240] In engineering design, the rank R is selected through an energy retention rate strategy, which uniquely determines the dimension of the latent space:
[0241]
[0242] Based on the given upper bound on the absolute error, the minimum rank R is constrained and the dimension is controlled:
[0243]
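One common way to realize such a rank selection is sketched below. The exact formulas of the source are not recoverable, so both criteria are assumptions: an energy retention threshold `eta` on the cumulative squared singular values, and an absolute error tolerance `eps` on the truncation error. The function names are hypothetical.

```python
import numpy as np

def rank_by_energy(s, eta):
    """Smallest R with sum(s[:R]**2) / sum(s**2) >= eta."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return min(int(np.searchsorted(energy, eta)) + 1, len(s))

def rank_by_error(s, eps):
    """Smallest R whose truncation error sqrt(sum(s[R:]**2)) is <= eps."""
    tail_sq = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
    # tail_sq[R] is the squared error left after keeping R singular values
    return int(np.argmax(np.sqrt(tail_sq) <= eps))

s = np.array([4.0, 2.0, 1.0, 0.5])        # toy singular values, sorted descending
R_energy = rank_by_energy(s, 0.95)        # keep at least 95% of the energy
R_error = rank_by_error(s, 1.2)           # keep the absolute error below 1.2
R = max(R_energy, R_error)                # satisfy both constraints
```

Taking the larger of the two ranks is a design choice of this sketch; it yields the smallest latent dimension that meets both the energy and the error constraint.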
[0244] The TinyBERT model (an open-source Transformer model) and the SST-2 task dataset were selected to design and substitute a lightweight structure based on TT-SVD dual-core tensor decomposition and to complete experimental verification.
[0245] Furthermore, in this embodiment of the invention, the experimental verification steps include:
[0246] The two cores obtained by TT-SVD dual-core tensor decomposition are mapped back to physical space according to the source of their elements, giving a two-layer lightweight Block structure. The rank R, and with it the latent space dimension, is uniquely determined by the energy retention rate and error upper bound control strategy. This structure replaces the Block structure in the original Transformer model, while the nonlinear activations and residual connections of the original model remain unchanged to preserve the generalization ability of the model.
[0247] A lightweight compression comparison experiment on the TinyBERT model was conducted on the SST-2 task dataset. The test device was an RTX 3070 GPU with a batch size of 16, and the test conditions and parameters were kept identical to ensure a fair comparison. The results show that, within a single Transformer Block, the parameter compression ratio reaches 3.2× and the single-layer inference speed improves by 3×; over the total number of model parameters, the compression ratio reaches 1.3× and the end-to-end inference speed improves by 1.8×, which better meets the real-time requirements of embedded deployment.
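The gap between the per-Block ratio (3.2×) and the whole-model ratio (1.3×) is what one would expect when only part of the parameters is compressed. The back-of-the-envelope sketch below assumes that every compressed Block shrinks by the same ratio and that all other parameters (embeddings, for example) are untouched; the implied fraction is an inference from the two reported ratios, not a figure stated in the source.

```python
def overall_ratio(block_fraction, block_ratio):
    """Whole-model compression ratio when only a fraction of parameters is compressed."""
    return 1.0 / ((1.0 - block_fraction) + block_fraction / block_ratio)

def implied_block_fraction(overall, block_ratio):
    """Fraction of parameters that must sit in compressed blocks to reach `overall`."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / block_ratio)

# Reported ratios: 3.2x per Block, 1.3x for the whole model
f = implied_block_fraction(1.3, 3.2)
check = overall_ratio(f, 3.2)
```

Under these assumptions roughly one third of the parameters would lie in the compressed Blocks, which illustrates why a large per-Block gain translates into a modest whole-model gain.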
[0248] The lightweight compressed Transformer model was exported in the Open Neural Network Exchange (ONNX) format and then converted and optimized using the Huawei Ascend Tensor Compiler (ATC) tool to make it compatible with the inference and computing architecture of the Huawei Ascend 310B chip.
[0249] The converted .om model was loaded into the inference framework of the Ascend heterogeneous computing architecture CANN and deployed on the Huawei Ascend 310B chip, and experimental verification was completed. The tests covered NPU utilization, average latency, and throughput (QPS). The results show that, in the 310B chip deployment scenario with a concurrency of 2, the NPU utilization reached 70%, the average latency was 6 ms, and the throughput reached 330 queries per second (QPS). The Transformer compression method based on TT-SVD dual-core tensor decomposition proposed in this invention for autonomous vehicle deployment therefore shows good adaptability and real-time performance on domestically produced chips.
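As a quick plausibility check on the reported deployment figures, steady-state throughput is approximately the number of requests in flight divided by the average latency (Little's law). The sketch below assumes the concurrency of 2 means two requests are always in flight; the helper name is hypothetical.

```python
def expected_qps(concurrency, avg_latency_ms):
    """Steady-state throughput implied by concurrency and average latency."""
    return concurrency * 1000.0 / avg_latency_ms

qps = expected_qps(2, 6.0)   # reported: concurrency 2, average latency 6 ms
```

The estimate of about 333 requests per second is close to the reported 330 QPS, so the three reported figures are mutually consistent.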
[0250] Furthermore, the data input/output and inference process between the Huawei Ascend 310B chip and the autonomous vehicle platform includes:
[0251] The Huawei Ascend 310B chip is embedded in the autonomous vehicle. For the image sequence data acquired by the vehicle camera, the front-end application layer on the Huawei Ascend hardware calls the tokenizer tool to preprocess the data and obtain the input sequence matrix of the Transformer model. Because this step does not involve tensor operator computation, it is performed on the onboard CPU.
[0252] The input data, including token IDs and attention masks, is encapsulated in the form of tensors. Data transfer and memory copy operations are completed through the Ascend Computing Library (ACL) to ensure data access efficiency in the subsequent model inference stage.
[0253] The .om model is scheduled and executed by the Huawei Ascend 310B chip. Each operator in the model is mapped to the corresponding operator kernel on the neural network processor (NPU) and runs on the chip core. The process sequentially performs embedding lookup, attention calculation, feedforward network calculation, and layer normalization. Finally, the output is generated through linear transformation. The entire process is accelerated end-to-end on the NPU.
[0254] Please refer to Table 1, which shows the experimental results of lightweight structure replacement and compression using the TinyBERT model and the SST-2 task dataset in this embodiment of the invention. The results show that the Transformer compression method based on TT-SVD dual-core tensor decomposition in this invention achieves a parameter compression ratio of 3.2× and a single-layer inference speed improvement of 3× within a single Transformer Block, and a compression ratio of 1.3× and an end-to-end inference speed improvement of 1.8× over the total number of model parameters.
[0255] Table 1
[0256]
[0257] Please refer to Table 2, which shows the performance of the embodiments of the present invention in the unmanned vehicle target detection task under embedded deployment scenarios. For this task, the Huawei Ascend 310B chip was selected for the deployment test of the lightweight TinyBERT model on a domestically produced NPU. The results show that, in the 310B chip deployment scenario with a concurrency of 2, the Transformer compression method based on TT-SVD dual-core tensor decomposition proposed in this invention achieves an NPU utilization of 70%, an average latency of 6 ms, and a throughput of 330 QPS, demonstrating good adaptability and real-time performance on domestically produced chips. It should be noted that, beyond unmanned vehicle target detection tasks, the proposed compression method has a degree of generality and scalability for various unmanned equipment based on computer vision.
[0258] Table 2
[0259]
[0260] Another embodiment of the present invention also proposes an unmanned vehicle vision Transformer model compression system, comprising:
[0261] The attention mechanism tensor representation module is used to acquire image sequence perception data collected by the onboard camera of the autonomous vehicle, perform segmentation and tokenization on the image sequence perception data to obtain the input sequence matrix of the Transformer model, extract features from the input sequence matrix through the attention mechanism, mathematically derive the expression of the fourth-order tensor kernel, use the fourth-order tensor kernel to describe the tensor mapping relationship from input features to output features, and construct the tensorized representation of the attention mechanism;
[0262] The fourth-order tensor kernel relation decoupling module is used to decouple the fourth-order tensor kernel, based on its Kronecker structure and linear homomorphic properties, into the spatial dependency relationship between different targets and the feature channel mapping relationship in the autonomous vehicle perception scenario;
[0263] The local linearization module is used to locally linearize, by Taylor expansion, the activation function in the feedforward network of the Transformer model, so that the feedforward network is equivalent to a continuous linear tensor mapping, and to quantify the upper bound of the error with the Frobenius norm so that the error is controllable;
[0264] The TT-SVD dual-core decomposition module is used to model the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel following the fourth-order tensor kernel expression, give a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and decompose it by tensor-train (TT) singular value decomposition (SVD) dual-core decomposition to obtain a rank-R representation and the two cores, which respectively implement the spatial attention that associates target positions and the channel mapping in the autonomous vehicle target detection task, so that the Transformer model after tensor decomposition and lightweight compression still maintains inference accuracy on resource-constrained vehicle hardware;
[0265] The TT-SVD dual-core mapping module is used to map the compressed TT-SVD dual cores back to physical space according to the source of their elements, and construct a Transformer model suitable for embedded inference chips in autonomous vehicles.
[0266] Another embodiment of the present invention provides an electronic device comprising:
[0267] A memory for storing at least one instruction; and a processor for executing the instructions stored in the memory to implement the autonomous vehicle vision Transformer model compression method.
[0268] Another embodiment of the present invention provides a computer-readable storage medium storing at least one instruction, which is executed by a processor in an electronic device to implement the aforementioned unmanned vehicle vision Transformer model compression method.
[0269] For example, the instructions stored in the memory can be divided into one or more modules / units. These modules / units are stored in a computer-readable storage medium and executed by the processor to complete the unmanned vehicle vision Transformer model compression method described in this invention. The one or more modules / units can be a series of computer-readable instruction segments capable of performing specific functions, which describe the execution process of the computer program on the server.
[0270] The electronic device may be a smartphone, laptop, PDA, or cloud server, among other computing devices. It may include, but is not limited to, a processor and memory. Those skilled in the art will understand that the electronic device may also include more or fewer components, or combinations of certain components, or different components; for example, it may also include input / output devices, network access devices, buses, etc.
[0271] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0272] The memory can be an internal storage unit of the server, such as a hard drive or RAM. It can also be an external storage device, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or FlashCard. Furthermore, the memory can include both internal and external storage units. The memory is used to store computer-readable instructions and other programs and data required by the server. It can also be used to temporarily store data that has been output or will be output.
[0273] It should be noted that the information interaction and execution process between the above-mentioned module units are based on the same concept as the method embodiment. For details on their specific functions and technical effects, please refer to the method embodiment section. They will not be repeated here.
[0274] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0275] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks.
[0276] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0277] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A method for compressing the visual Transformer model of an autonomous vehicle, characterized in that it includes:
acquiring image sequence perception data collected by the onboard camera of the autonomous vehicle, performing segmentation and tokenization on the data to obtain the input sequence matrix of the Transformer model, extracting features from the input sequence matrix through an attention mechanism, mathematically deriving the expression of a fourth-order tensor kernel, using the fourth-order tensor kernel to describe the tensor mapping relationship from input features to output features, and constructing the tensorized representation of the attention mechanism;
based on the Kronecker structure and linear homomorphic properties of the fourth-order tensor kernel, decoupling it into the spatial dependency relationship between different targets and the feature channel mapping relationship in the autonomous vehicle perception scenario;
using Taylor expansion to locally linearize the activation function in the feedforward network of the Transformer model, so that the feedforward network is equivalent to a continuous linear tensor mapping, and quantifying the upper bound of the error with the Frobenius norm so that the error is controllable;
following the expression of the fourth-order tensor kernel, modeling the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel, giving a tensorized representation of the complete input-output relationship in the autonomous vehicle target detection task, and decomposing it by tensor-train (TT) singular value decomposition (SVD) dual-core decomposition to obtain a rank-R representation and the two cores, which respectively implement the spatial attention that associates target positions and the channel mapping in the autonomous vehicle target detection task, so that the Transformer model after tensor decomposition and lightweight compression still maintains inference accuracy on resource-constrained vehicle hardware;
mapping the compressed TT-SVD dual cores back to physical space according to the source of their elements to construct a Transformer model suitable for embedded inference chips in autonomous vehicles;
wherein the step of modeling the overall structure of the Transformer Block as an equivalent fourth-order tensor kernel and performing the TT-SVD dual-core decomposition includes:
taking the output of the attention mechanism as the input of the feedforward network of the Transformer model to obtain the input-output relationship of the overall Transformer Block: $z_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} A_{ik}\,x_{kl}\,(W_V W_{\mathrm{FFN}})_{lj}$; in the formula, $A_{ik}$ is the attention weight coefficient, where $k$ is the position of the key and $i$ is the position of the query, reflecting the degree of dependence between $i$ and $k$; $(W_V W_{\mathrm{FFN}})_{lj}$ are the channel mapping weights; $x_{kl}$ is the $l$-th component of the $k$-th token;
defining $\widetilde{W} = W_V W_{\mathrm{FFN}}$, we obtain: $z_{ij} = \sum_{k=1}^{N}\sum_{l=1}^{d} A_{ik}\,\widetilde{W}_{lj}\,x_{kl}$;
the complete fourth-order tensor kernel form of the Transformer Block follows from this element-level expression: $\mathcal{T}_{ijkl} = A_{ik}\,\widetilde{W}_{lj}$;
the fourth-order tensor kernel $\mathcal{T}$ is flattened according to the standard row indexing rule $p = i + (j-1)N$, with $N d_o$ rows in total, and the column indexing rule $q = k + (l-1)N$, with $N d$ columns in total, where $d_o$ is the number of output channels, and under this $(i,j)\,|\,(k,l)$ mode partitioning rule the matrix $M \in \mathbb{R}^{N d_o \times N d}$ is constructed;
SVD is performed on the matrix $M$: $M = U\Sigma V^{\top}$; in the formula, $U \in \mathbb{R}^{N d_o \times \rho}$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_\rho)$, $V \in \mathbb{R}^{N d \times \rho}$, and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\rho$ are the positive singular values, with $\rho = \mathrm{rank}(M)$;
the rank-$R$ low-rank approximation is: $M_R = U_R\Sigma_R V_R^{\top}$;
TT-style dual-core decomposition and structural reconstruction are performed: $M_R = G_1 G_2$; in the formula, $G_1 \in \mathbb{R}^{N d_o \times R}$, $G_2 \in \mathbb{R}^{R \times N d}$;
inverting the row index mapping on $G_1$ and the column index mapping on $G_2$ yields the dual cores $\mathcal{G}_1$ and $\mathcal{G}_2$, and at the same time: $\mathcal{T}^{(R)}_{ijkl} = \sum_{r=1}^{R} \mathcal{G}_1(i,j,r)\,\mathcal{G}_2(r,k,l)$;
according to the Eckart–Young theorem, among all approximations of rank at most $R$, the $M_R$ given by the SVD is the globally optimal approximation in the Frobenius norm sense, with a truncation error of: $\|M - M_R\|_F = \sqrt{\sum_{r=R+1}^{\rho} \sigma_r^2}$;
letting $\mathcal{T}^{(R)}$ be the tensor obtained by reshaping $M_R$, since the reshape between matrix and tensor is isometric, the Frobenius norm is preserved under reshape, and converting the above matrix optimality back to tensor form gives: $\|\mathcal{T} - \mathcal{T}^{(R)}\|_F = \|M - M_R\|_F$;
therefore, the dual cores $\mathcal{G}_1$ and $\mathcal{G}_2$ given by TT-SVD likewise form the optimal tensor approximation under the Frobenius norm, and respectively implement spatial attention between targets and channel mapping in the perception scenario of autonomous vehicles.
2. The unmanned vehicle vision Transformer model compression method according to claim 1, characterized in that the input sequence matrix of the Transformer model is $X$, satisfying $X \in \mathbb{R}^{N\times d}$, where $\mathbb{R}$ denotes the set of real numbers, $N$ is the number of tokens, and $d$ is the embedding dimension.
3. The unmanned vehicle vision Transformer model compression method according to claim 1, characterized in that the step of extracting features from the input sequence matrix of the Transformer model using an attention mechanism, mathematically deriving the expression of the fourth-order tensor kernel $\mathcal{T}$, describing through $\mathcal{T}$ the tensor mapping relationship from input features to output features, and constructing the tensorized representation of the attention mechanism includes:

The computational expression of the attention mechanism is as follows:

$$\mathrm{Attention}(X) = \mathrm{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d_k}}\right) X W_V$$

In the formula, $W_Q$, $W_K$, $W_V$ are the projection matrices; $d_k$ is the projection dimension of the query and key, used to calculate the attention scores; $d_v$ is the projection dimension of the value, used to calculate the output.

Define the input sequence matrix of the Transformer model $X \in \mathbb{R}^{N\times d}$, where $N$ is the number of tokens and $d$ is the embedding dimension.

Define the query matrix $Q = XW_Q \in \mathbb{R}^{N\times d_k}$, the key matrix $K = XW_K \in \mathbb{R}^{N\times d_k}$, and the value matrix $V = XW_V \in \mathbb{R}^{N\times d_v}$, where $d_k$ is the projection dimension of the query and key and $d_v$ is the projection dimension of the value.

Define the attention score matrix $S = QK^{\top}/\sqrt{d_k} \in \mathbb{R}^{N\times N}$, with components $S_{ij}$.

Define the attention matrix $A = \mathrm{softmax}(S)$, with components $A_{ij} = \exp(S_{ij}) / \sum_{j'}\exp(S_{ij'})$, which satisfy $\sum_{j} A_{ij} = 1$ for each $i$.

The output of the attention mechanism is:

$$Y = AV = A\,X\,W_V$$

The output component in the formula is denoted $Y_{ic}$: the input features at all key positions are weighted and summed according to their corresponding attention weights to obtain the output feature value for each channel. On this basis, the computational expression of the attention mechanism is fully expanded in component form:

$$Y_{ic} = \sum_{j=1}^{N}\sum_{k=1}^{d} A_{ij}\,X_{jk}\,(W_V)_{kc}$$

In the formula, $c$ indexes the output feature channels and $k$ indexes the channels of the tokens in the input sequence. The value mapping matrix $W_V$ maps the input features to the output feature space, which is a linear transformation.

According to the above formula, the mapping from the input features $X$ to the output features $Y$ is represented in fourth-order tensor kernel form:

$$\mathcal{T}_{(i,c),(j,k)} = A_{ij}\,(W_V)_{kc}$$

This makes the multi-dimensional interaction between the query position, output channel, key position, and input channel explicit. The entire mapping relationship is represented by a strict tensor contraction:

$$Y_{ic} = \sum_{j,k}\mathcal{T}_{(i,c),(j,k)}\,X_{jk}$$

In the formula, $(i,c)$ corresponds to the output sequence position and output channel; $(j,k)$ corresponds to the input sequence position and input channel; the fourth-order tensor kernel $\mathcal{T}$ is jointly determined by $A$ and $W_V$, reflecting the correlation between sequence positions and the linear mapping between channels.
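The equivalence between the matrix form of attention and the fourth-order kernel contraction can be checked numerically. The sketch below is illustrative only: the sizes, the random weights, and the storage order `T[i, c, j, k]` are assumptions chosen for the example rather than details taken from the claims.

```python
import numpy as np

def softmax(s, axis=-1):
    """Numerically stable softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, d_k, d_v = 5, 8, 4, 6           # illustrative sizes (assumptions)
X = rng.standard_normal((N, d))
W_Q = rng.standard_normal((d, d_k))
W_K = rng.standard_normal((d, d_k))
W_V = rng.standard_normal((d, d_v))

# Standard attention in matrix form.
Q, K = X @ W_Q, X @ W_K
A = softmax(Q @ K.T / np.sqrt(d_k))   # attention matrix, rows sum to 1
Y = A @ X @ W_V

# Fourth-order kernel T[i, c, j, k] = A[i, j] * W_V[k, c].
T = np.einsum("ij,kc->icjk", A, W_V)

# Contract the kernel with the input over (j, k).
Y_kernel = np.einsum("icjk,jk->ic", T, X)

row_sums_ok = np.allclose(A.sum(axis=1), 1.0)
kernel_matches = np.allclose(Y, Y_kernel)
print(T.shape, row_sums_ok, kernel_matches)
```

Note that the kernel is built here for one fixed input, since $A$ itself depends on $X$ through the softmax.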
4. The unmanned vehicle vision Transformer model compression method according to claim 3, characterized in that the step of decoupling, by means of the Kronecker structure and linear homomorphic properties of the fourth-order tensor kernel $\mathcal{T}$, the spatial dependencies between different targets and the feature channel mappings in the autonomous vehicle perception scenario includes:

The obtained fourth-order tensor kernel $\mathcal{T}$ is reshaped into a matrix $T$ according to the standard vectorization rule:

$$\mathrm{vec}(Y) = T\,\mathrm{vec}(X)$$

That $T$ has the form of a Kronecker product is proved by the following two methods.

Method 1: According to the standard vectorization rule, all components of the output $Y$ and the input $X$ are stacked column by column:

$$y = \mathrm{vec}(Y),\quad y_{(c-1)N+i} = Y_{ic}; \qquad x = \mathrm{vec}(X),\quad x_{(k-1)N+j} = X_{jk}$$

In the formula, $y \in \mathbb{R}^{N d_v}$ and $x \in \mathbb{R}^{N d}$. We seek $T \in \mathbb{R}^{N d_v\times N d}$ such that $y = Tx$ holds and the components of $T$ satisfy the Kronecker product form. Substituting the row index $(c-1)N+i$ and the column index $(k-1)N+j$ into the original component form gives:

$$T_{(c-1)N+i,\;(k-1)N+j} = A_{ij}\,(W_V)_{kc}$$

If we take the matrices $W_V^{\top}$ and $A$, then:

$$\left(W_V^{\top}\otimes A\right)_{(c-1)N+i,\;(k-1)N+j} = (W_V^{\top})_{ck}\,A_{ij} = (W_V)_{kc}\,A_{ij}$$

Because multiplication at the element level is commutative, the two expressions above are equivalent, giving:

$$T = W_V^{\top}\otimes A$$

In the formula, $\otimes$ denotes the Kronecker product.

Method 2: Based on the mixed-product property of the Kronecker product, we obtain:

$$\mathrm{vec}(Y) = \mathrm{vec}\big(A\,(XW_V)\big) = (I_{d_v}\otimes A)\,\mathrm{vec}(XW_V)$$

and then:

$$\mathrm{vec}(XW_V) = (W_V^{\top}\otimes I_N)\,\mathrm{vec}(X)$$

Continuing to use the mixed-product rule of the Kronecker product, we get:

$$(I_{d_v}\otimes A)(W_V^{\top}\otimes I_N) = (I_{d_v}W_V^{\top})\otimes(A\,I_N) = W_V^{\top}\otimes A$$

Ultimately, we arrive at the same result:

$$T = W_V^{\top}\otimes A$$

Given that the attention weight matrix and the channel projection matrix are trainable, the fourth-order tensor kernel $\mathcal{T}$ has linear homomorphic properties, and the linear mapping can be decoupled into independent tensor product structures over the spatial position dimension and the channel mapping dimension, which respectively correspond to the spatial dependency relationship between different targets and the mapping relationship of feature channels in the autonomous vehicle perception scenario.
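The Kronecker identity in this claim, $\mathrm{vec}(AXW_V) = (W_V^{\top}\otimes A)\,\mathrm{vec}(X)$ with column-stacking vectorization, can be confirmed with a small NumPy check. The sizes and random matrices below are assumptions for illustration; `vec` is a helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, d_v = 4, 5, 3                   # illustrative sizes (assumptions)

A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)     # row-stochastic, like an attention matrix
X = rng.standard_normal((N, d))
W_V = rng.standard_normal((d, d_v))

def vec(M):
    """Column-stacking vectorization (Fortran order)."""
    return M.reshape(-1, order="F")

# Kronecker form of the attention mapping.
T = np.kron(W_V.T, A)                 # shape (N*d_v, N*d)

lhs = vec(A @ X @ W_V)                # vectorized output
rhs = T @ vec(X)                      # kernel applied to vectorized input

# Element-level check with 0-based indices:
# T[c*N + i, k*N + j] equals A[i, j] * W_V[k, c].
i, c, j, k = 2, 1, 3, 4
entry_ok = np.isclose(T[c * N + i, k * N + j], A[i, j] * W_V[k, c])
print(T.shape, np.allclose(lhs, rhs), entry_ok)
```

The spatial factor `A` and the channel factor `W_V.T` stay separate inside `T`, which is the decoupling the claim relies on.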
5. The unmanned vehicle vision Transformer model compression method according to claim 4, characterized in that the step of performing Taylor expansion on the feedforward network of the Transformer model, locally linearizing the activation function $\sigma$ so that the feedforward network is equated to a continuous linear tensor mapping, and quantifying the upper bound of the error using the Frobenius norm to achieve error controllability includes:

The feedforward network in the Transformer model is applied to the feature vector of each token, specifically through two layers of mapping with an element-wise nonlinearity, as shown in the following expression:

$$\mathrm{FFN}(y_i) = \sigma(y_i W_1 + b_1)\,W_2 + b_2$$

In the formula, $y_i$ is the output for token $i$ derived from the attention mechanism; $W_1$, $b_1$ are the dimension-raising mapping and its bias vector; $W_2$, $b_2$ are the dimension-reducing mapping and its bias vector; $\sigma(\cdot)$ is the activation function, applied element-wise. For each token $i$ and feature dimension $p$, the component form is:

$$Z_{ip} = \sum_{m}\sigma\!\Big(\sum_{c} Y_{ic}\,(W_1)_{cm} + (b_1)_m\Big)(W_2)_{mp} + (b_2)_p$$

Because the activation function $\sigma$ violates linear separability and hinders the direct representation as a tensor kernel, Taylor expansion is used here for local linearization. Specifically, a first-order Taylor expansion of $\sigma$ at a point $z_0$ yields:

$$\sigma(z) \approx \sigma(z_0) + \sigma'(z_0)\,(z - z_0)$$

which gives, for the vector $z_i = y_i W_1 + b_1$:

$$\sigma(z_i) \approx \sigma(z_0) + (z_i - z_0)\,D$$

In the formula, $D = \mathrm{diag}\big(\sigma'(z_0)\big)$ is a diagonal matrix. After local linearization of the activation function, the feedforward network is described by a continuous linear mapping:

$$\mathrm{FFN}_{\mathrm{lin}}(y_i) = y_i\,W_1 D W_2 + \tilde{b}$$

In the formula, $\tilde{b}$ is the bias term of the mapping. Since the bias term does not change the low-rank structure of the tensor kernel, the bias is uniformly absorbed into the output vector or ignored during kernel decomposition. Furthermore, the residual connections within the Transformer Block and the LayerNorm normalization regularize the bias, making its impact on the tensor kernel decomposition negligible. Therefore, ignoring the bias term yields:

$$\mathrm{FFN}(y_i) \approx y_i\,W_{\mathrm{eff}}$$

In the formula, $W_{\mathrm{eff}} = W_1 D W_2$. The linearization error is defined per token as follows:

$$r_i = \sigma(z_i) - \big[\sigma(z_0) + (z_i - z_0)\,D\big]$$

with $R$ the matrix whose rows are the $r_i$. Quantifying the upper bound of the error with the Frobenius norm, we obtain:

$$\|\mathrm{FFN}(Y) - \mathrm{FFN}_{\mathrm{lin}}(Y)\|_F = \|R\,W_2\|_F \le \|R\|_F\,\|W_2\|_F$$

In the formula, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $R$ is the residual matrix, which directly quantifies the error of local linearization; $\|W_2\|_F$ is the amplification factor contributed by the weight matrix of the second feedforward layer. For the ReLU and GELU activation functions, when local linearization is performed at zero, the upper bound of the error can in both cases be quantified and controlled through the Frobenius norm, providing a basis for determining the latent space dimension in subsequent embedded deployment and engineering design.
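The local linearization and its Frobenius-norm bound can be illustrated with a short NumPy sketch. Several choices here are assumptions made for the example: the row-vector convention `Y @ W1`, the tanh approximation of GELU, the expansion point `z0 = 0` (where GELU has value 0 and slope 0.5), and all sizes and random weights.

```python
import numpy as np

def gelu(z):
    """Tanh approximation of GELU (an assumed choice of activation)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

rng = np.random.default_rng(2)
N, d, d_ff = 6, 8, 16                 # illustrative sizes (assumptions)
Y = 0.1 * rng.standard_normal((N, d)) # small inputs keep pre-activations near z0
W1 = rng.standard_normal((d, d_ff)) / np.sqrt(d)
b1 = 0.01 * rng.standard_normal(d_ff)
W2 = rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
b2 = 0.01 * rng.standard_normal(d)

# Exact feedforward network.
Z = Y @ W1 + b1
ffn_exact = gelu(Z) @ W2 + b2

# First-order Taylor expansion of the activation at z0 = 0.
z0 = np.zeros(d_ff)
D = np.diag(np.full(d_ff, 0.5))       # slope of GELU at 0 is 0.5
sigma_lin = gelu(z0) + (Z - z0) @ D
ffn_lin = sigma_lin @ W2 + b2

# Effective linear map and absorbed bias.
W_eff = W1 @ D @ W2
b_tilde = (gelu(z0) + (b1 - z0) @ D) @ W2 + b2

# Residual matrix and the Frobenius-norm bound on the output error.
R = gelu(Z) - sigma_lin
err = np.linalg.norm(ffn_exact - ffn_lin, "fro")
bound = np.linalg.norm(R, "fro") * np.linalg.norm(W2, "fro")
print(f"error {err:.4e} <= bound {bound:.4e}")
```

The bound follows from the submultiplicativity of the Frobenius norm; it tightens as the pre-activations stay closer to the expansion point.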
6. The unmanned vehicle vision Transformer model compression method according to claim 5, characterized in that the step of mapping the compressed TT-SVD dual cores back to physical space according to the element source to construct a Transformer model suitable for embedded inference chips in autonomous vehicles includes:

The two cores obtained by TT-SVD tensor decomposition introduce a low-rank latent space through which features are passed from one core to the other. Specifically, one core realizes modal compression and feature extraction in the latent space dimension, while the other realizes latent feature recovery and channel reconstruction.

In engineering design, the rank $R$ is selected through an energy retention rate strategy, which uniquely determines the dimension of the latent space. Based on the given upper bound on the absolute error, the minimum rank $R$ that satisfies the bound is chosen, thereby controlling the latent dimension.

The TinyBERT model and the SST-2 task dataset are selected to carry out the design and replacement of a lightweight structure based on TT-SVD dual-core tensor decomposition, and experimental verification is completed.
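One way to read the energy retention rate strategy is to pick the smallest rank whose leading singular values retain a target fraction of the total squared singular-value energy. The sketch below follows that reading; the function names `select_rank` and `truncation_error`, the threshold value, and the example spectrum are hypothetical and not taken from the patent.

```python
import numpy as np

def select_rank(singular_values, energy=0.95):
    """Smallest rank whose leading singular values keep `energy` of sum(s**2)."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    retained = np.cumsum(s2) / s2.sum()          # retained-energy curve
    rank = int(np.searchsorted(retained, energy)) + 1
    return min(rank, len(s2))                    # guard against rounding at 1.0

def truncation_error(singular_values, rank):
    """Frobenius-norm error of the rank-`rank` truncated SVD."""
    s = np.asarray(singular_values, dtype=float)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

s = [4.0, 2.0, 1.0, 0.5]                          # example spectrum (assumption)
R = select_rank(s, energy=0.95)
abs_err = truncation_error(s, R)
print(R, abs_err)
```

For this spectrum the retained energy after one, two and three values is about 0.753, 0.941 and 0.988, so a 95% target selects rank 3 and leaves an absolute error of 0.5.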
7. The unmanned vehicle vision Transformer model compression method according to claim 6, characterized in that the step of selecting the TinyBERT model and the SST-2 task dataset, carrying out lightweight structure design and replacement based on TT-SVD dual-core tensor decomposition, and completing experimental verification includes:

The two cores obtained by TT-SVD dual-core tensor decomposition are mapped back to physical space according to the element source, yielding a two-layer lightweight Block structure. The rank $R$ and the latent space dimension are uniquely determined by the "energy retention rate" and error upper bound control strategy. This structure replaces the Block structure in the original Transformer model, while the nonlinear activation and residual connections of the original model remain unchanged to preserve the generalization ability of the model.

A lightweight compression comparison experiment on the TinyBERT model is conducted on the SST-2 task dataset. The test device is an RTX 3070 GPU, the batch size is 16, and the test conditions and test parameters are kept completely consistent to ensure the fairness of the experiment.

The lightweight compressed Transformer model is exported in the Open Neural Network Exchange (ONNX) format, and the Huawei Ascend Tensor Compiler (ATC) tool is used to convert and optimize the model into the OM format so that it is adapted to the inference and computing architecture of the Huawei Ascend 310B chip. The converted .om model is loaded into the inference framework of the Ascend heterogeneous computing architecture CANN and deployed on the Huawei Ascend 310B chip to complete the experimental verification.
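The two-layer lightweight Block can be pictured as a compression map into an $R$-dimensional latent space followed by a reconstruction map back to the original shape. The following NumPy sketch only illustrates that structure and its parameter count; the random dense operator `M`, the sizes, and the function name `lightweight_block` are assumptions, and the nonlinear activation and residual connection that the claim keeps unchanged are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, R = 8, 16, 12                   # illustrative sizes and rank (assumptions)

# Stand-in for the dense (N*d) x (N*d) unfolding of a block kernel.
M = rng.standard_normal((N * d, N * d))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Layer 1 compresses the input into the latent space; layer 2 reconstructs.
compress = Vt[:R]                     # shape (R, N*d)
reconstruct = U[:, :R] * s[:R]        # shape (N*d, R)

def lightweight_block(X):
    """Apply the two-layer low-rank replacement to an (N, d) input."""
    latent = compress @ X.reshape(-1)             # R-dimensional latent features
    return (reconstruct @ latent).reshape(N, d)   # back to (N, d)

X = rng.standard_normal((N, d))
Z_two_layer = lightweight_block(X)
Z_rank_R = ((reconstruct @ compress) @ X.reshape(-1)).reshape(N, d)

dense_params = M.size
factored_params = compress.size + reconstruct.size
ratio = dense_params / factored_params
print(dense_params, factored_params, round(ratio, 2))
```

With these sizes the dense operator holds 16384 parameters and the two factors hold 3072, so the factored form is smaller whenever $2R < Nd$.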
8. The unmanned vehicle vision Transformer model compression method according to claim 7, characterized in that the data input/output and inference process between the Huawei Ascend 310B chip and the autonomous vehicle platform includes:

With the Huawei Ascend 310B chip embedded in the autonomous vehicle, the image sequence data acquired by the vehicle camera is preprocessed by calling the tokenizer tool through the front-end application layer of the Huawei Ascend hardware to obtain the input sequence matrix of the Transformer model. The input data, including the token IDs and the attention mask, is encapsulated in tensor form, and the data transfer and memory copy operations are completed through the Ascend Computing Library (ACL).

The .om model is scheduled and executed by the Huawei Ascend 310B chip. Each operator in the model is mapped to the corresponding operator kernel on the neural network processor (NPU) and runs on the chip cores. The process sequentially performs embedding lookup, attention calculation, feedforward network calculation, and layer normalization, and the output is finally generated through a linear transformation. The entire process is accelerated end-to-end on the NPU.
9. A compression system for a vision Transformer model of an unmanned vehicle, characterized in that it includes:

An attention mechanism tensor representation module, used to acquire image sequence perception data collected by the vehicle's onboard camera, tokenize the image sequence perception data to obtain the input sequence matrix of the Transformer model, extract features from the input sequence matrix of the Transformer model through the attention mechanism, mathematically derive the expression of the fourth-order tensor kernel $\mathcal{T}$, describe through $\mathcal{T}$ the tensor mapping relationship from input features to output features, and construct the tensorized representation of the attention mechanism;

A fourth-order tensor kernel relation decoupling module, used to decouple, by means of the Kronecker structure and linear homomorphic properties of the fourth-order tensor kernel $\mathcal{T}$, the spatial dependency relationship between different targets and the feature channel mapping relationship in the autonomous vehicle perception scenario;

A local linearization module, used to perform Taylor expansion on the feedforward network of the Transformer model and locally linearize the activation function $\sigma$, thereby equating the feedforward network to a continuous linear tensor mapping, and to quantify the upper bound of the error using the Frobenius norm, achieving error controllability;

A TT-SVD dual-core decomposition module, used to equate the overall structure of the Transformer Block to a fourth-order tensor kernel according to the expression of the fourth-order tensor kernel $\mathcal{T}$, tensorize the complete input-output relationship in the autonomous vehicle target detection task, and perform tensor-train (TT) singular value decomposition (SVD) dual-core decomposition to obtain the rank-$R$ representation and the dual-core proof, so that spatial attention for target position association and channel mapping are realized in the autonomous vehicle target detection task and the tensor-decomposed, lightweight compressed Transformer model maintains inference accuracy on resource-constrained vehicle hardware;

A TT-SVD dual-core mapping module, used to map the compressed TT-SVD dual cores back to physical space according to the element source and construct a Transformer model suitable for embedded inference chips in autonomous vehicles.

Wherein, equating the overall structure of the Transformer Block to a fourth-order tensor kernel according to the expression of the fourth-order tensor kernel $\mathcal{T}$, tensorizing the complete input-output relationship in the autonomous vehicle target detection task, and performing TT-SVD dual-core decomposition to obtain the rank-$R$ representation and the dual-core proof, so that the tensor-decomposed, lightweight compressed Transformer model maintains inference accuracy on resource-constrained vehicle hardware, includes the following steps:

Taking the output of the attention mechanism as the input to the feedforward network of the Transformer model gives the input-output relationship of the overall Transformer Block:

$$Z_{ip} = \sum_{j}\sum_{k}\sum_{c} A_{ij}\,X_{jk}\,(W_V)_{kc}\,(W_{\mathrm{eff}})_{cp}$$

In the formula, $A_{ij}$ is the attention weight coefficient, where $j$ is the position of the key and $i$ is the position of the query, so that $A_{ij}$ reflects the degree of dependence between positions $i$ and $j$; $(W_V)_{kc}$ and $(W_{\mathrm{eff}})_{cp}$ are the channel mapping weights; $X_{jk}$ is the $k$-th component of the $j$-th token. Defining $\widetilde{W}_{kp} = \sum_{c}(W_V)_{kc}\,(W_{\mathrm{eff}})_{cp}$, we obtain:

$$Z_{ip} = \sum_{j}\sum_{k} A_{ij}\,\widetilde{W}_{kp}\,X_{jk}$$

From this element-level expression, the complete fourth-order tensor kernel form of the Transformer Block is obtained:

$$\mathcal{K}_{(i,p),(j,k)} = A_{ij}\,\widetilde{W}_{kp}, \qquad Z_{ip} = \sum_{j,k}\mathcal{K}_{(i,p),(j,k)}\,X_{jk}$$

For the fourth-order tensor kernel $\mathcal{K} \in \mathbb{R}^{N\times d\times N\times d}$, the pair $(i,p)$ is taken as the row index according to the standard row indexing rule, giving $Nd$ rows, and the pair $(j,k)$ is taken as the column index according to the column indexing rule, giving $Nd$ columns. Under this mode partitioning rule, the matrix $M \in \mathbb{R}^{Nd\times Nd}$ is constructed. Perform SVD on the matrix $M$:

$$M = U\,\Sigma\,V^{\top}$$

In the formula, $U \in \mathbb{R}^{Nd\times Nd}$, $V \in \mathbb{R}^{Nd\times Nd}$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_{Nd})$, and $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. The low-rank approximation gives:

$$M \approx M_R = U_R\,\Sigma_R\,V_R^{\top}$$

Perform the TT-style dual-core decomposition and structural reconstruction:

$$M_R = G_1\,G_2$$

In the formula, $G_1 \in \mathbb{R}^{Nd\times R}$ and $G_2 \in \mathbb{R}^{R\times Nd}$ are formed from $U_R$, $\Sigma_R$ and $V_R^{\top}$; they are indexed by the row index $(i,p)$ and the column index $(j,k)$ respectively. Applying the inverse index mapping yields the dual cores $\mathcal{G}^{(1)} \in \mathbb{R}^{N\times d\times R}$ and $\mathcal{G}^{(2)} \in \mathbb{R}^{R\times N\times d}$, with:

$$\mathcal{K}_{(i,p),(j,k)} \approx \sum_{r=1}^{R}\mathcal{G}^{(1)}_{i,p,r}\,\mathcal{G}^{(2)}_{r,j,k}$$

According to the Eckart–Young theorem, among all matrices of rank at most $R$, the truncated SVD $M_R$ is the globally optimal approximation of $M$ in the Frobenius norm sense, with truncation error:

$$\|M - M_R\|_F = \min_{\mathrm{rank}(B)\le R}\|M - B\|_F = \sqrt{\sum_{r>R}\sigma_r^{2}}$$

Let $\mathcal{K}_R$ denote the tensor obtained by reshaping $M_R$. Since reshaping between the matrix and the tensor is isometric, the Frobenius norm is preserved under reshape. Converting the above matrix optimality back to tensor form, we have:

$$\|\mathcal{K} - \mathcal{K}_R\|_F = \|M - M_R\|_F$$

Therefore, the dual cores $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ provided by TT-SVD are likewise the optimal tensor approximation in the Frobenius norm among all kernels whose $(i,p)\times(j,k)$ unfolding has rank at most $R$, and they respectively implement spatial attention between targets and channel mapping in the perception scenario of autonomous vehicles.