A large model inference acceleration method and system

By calculating the cosine of the angle between Transformer layers and the cache miss rate, hardware and feature redundancy are dynamically perceived. By utilizing singular value decomposition and low-dimensional subspace mapping, the problems of low efficiency and high latency in large model inference are solved, and efficient dynamic inference acceleration is achieved.

CN121920550B · Active · Publication Date: 2026-06-19 · ZIGUANG HENGYUE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZIGUANG HENGYUE TECH CO LTD
Filing Date: 2026-03-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies cannot dynamically optimize the computational complexity of large models based on real-time hardware status and data characteristics, resulting in low efficiency and high response latency for long text inference. In particular, when faced with soaring hardware cache miss rates or a large amount of redundancy in inter-layer features, static methods cannot adjust computational strategies in real time.

Method used

By calculating the cosine of the angle between Transformer layers and the cache miss rate, the hardware congestion level and model feature redundancy are dynamically perceived. The low-rank weight matrix is reconstructed using singular value decomposition, and word sequence merging is performed by combining low-dimensional subspace mapping to achieve dynamic inference acceleration.

Benefits of technology

It reduces inference latency, increases system throughput, breaks through the bottleneck of computational intensity and low energy efficiency, and achieves flexible adaptation to dynamic computing loads while ensuring accuracy.

✦ Generated by Eureka AI based on patent content.

Figure CN121920550B_ABST

Abstract

This application relates to the field of artificial intelligence technology and provides a method and system for accelerating large model inference. During long text inference of a large model, the method obtains the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory, and calculates the cosine of the angle between the output feature vectors to obtain the inter-layer similarity. When the cache miss rate is greater than a preset congestion threshold and the inter-layer similarity falls within a redundancy interval, a low-rank weight matrix is reconstructed. Through token merging, each word in the input word sequence of the current Transformer layer is mapped to a low-dimensional subspace defined by the low-rank weight matrix to obtain the target word sequence. The target word sequence is then input into the current Transformer layer to accelerate inference of that layer, thereby achieving efficient acceleration of large model inference.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method and system for accelerating large model inference.

Background Technology

[0002] With the rapid development of artificial intelligence technology, large language models based on the Transformer architecture have shown enormous application potential in fields such as natural language processing, intelligent question answering, and long text analysis. Inference acceleration technology has become crucial for achieving efficient deployment from the cloud to the edge. However, with the continuous increase in the number of model parameters and the length of input sequences, the inference process faces the dual challenges of explosive growth in computational load and limited hardware resources, urgently requiring efficient acceleration methods to meet the requirements of real-time performance and energy efficiency.

[0003] Existing solutions typically employ static model compression strategies, such as weight quantization or network pruning, to reduce computational load by performing low-precision transformations of model parameters or eliminating redundant connections before deployment.

[0004] However, existing solutions fail to detect dynamic changes in the input data and the real-time load status of the underlying hardware during inference. This leads to severe memory wall bottlenecks and wasted computing resources when dealing with long texts or sudden high-concurrency requests, making it impossible to flexibly handle dynamically changing computational and memory access pressures while ensuring accuracy. In particular, when hardware cache miss rates spike or there is significant redundancy in inter-layer features, static methods cannot adjust computational strategies in real time, resulting in high inference latency and low throughput. Therefore, existing technologies suffer from the technical problem of failing to dynamically optimize computational complexity based on real-time hardware status and data characteristics during large model inference, leading to low efficiency and high response latency in long text inference.

Summary of the Invention

[0005] The purpose of this application is to provide a method and system for accelerating large model inference, so as to solve the technical problem in the prior art that the computational complexity of the large model inference process cannot be dynamically optimized according to the real-time hardware status and data characteristics, resulting in low efficiency and high response latency of long text inference.

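[0006] In a first aspect, this application provides a method for accelerating large model inference, including: during long text inference of a large model, obtaining the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory; calculating the cosine of the angle between the output feature vectors to obtain the inter-layer similarity; when the cache miss rate is greater than a preset congestion threshold and the inter-layer similarity falls within the redundancy interval, reconstructing a low-rank weight matrix by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the top k largest singular values, where the redundancy interval is obtained by cluster analysis of the historical similarities of the current Transformer layer and k is a positive integer; mapping each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined by the low-rank weight matrix through token merging, calculating the Euclidean distance of each word in the low-dimensional subspace, and merging adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence; and inputting the target word sequence into the current Transformer layer and performing matrix multiplication and self-attention calculation using the low-rank weight matrix to obtain the target output feature vector, so as to accelerate inference of the current Transformer layer.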

[0007] Optionally, inter-layer similarity is obtained by calculating the cosine of the angle between the output feature vectors, including:

[0008] Perform a vector dot product operation between the output feature vector of the current Transformer layer and the output feature vector of the previous Transformer layer to obtain the vector dot product value;

[0009] The magnitude of the output feature vector of the current Transformer layer is multiplied by the magnitude of the output feature vector of the previous Transformer layer to obtain the magnitude product value.

[0010] Divide the vector dot product by the product of the magnitudes to obtain the cosine of the included angle, and use the cosine of the included angle as the inter-layer similarity.
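
The three operations above (dot product, magnitude product, division) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `inter_layer_similarity` and the choice to reject zero vectors are assumptions, and a production system would compute this on the accelerator instead of in pure Python.

```python
import math


def inter_layer_similarity(curr, prev):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(curr) != len(prev):
        raise ValueError("feature vectors must have the same dimension")
    # Vector dot product value: sum of component-wise products.
    dot = sum(a * b for a, b in zip(curr, prev))
    # Magnitude product value: product of the two Euclidean norms.
    magnitude_product = (
        math.sqrt(sum(a * a for a in curr)) * math.sqrt(sum(b * b for b in prev))
    )
    if magnitude_product == 0.0:
        # Assumption: a zero vector has no direction, so treat it as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / magnitude_product
```

Two vectors that point in the same direction give a value close to 1 regardless of their lengths, which is the case the method treats as redundant.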

[0011] Optionally, the method further includes:

[0012] Obtain the historical similarity between the current Transformer layer and the previous Transformer layer at multiple historical inference moments;

[0013] All historical similarities are sorted in ascending order to obtain a historical similarity sequence, and a preset number of historical similarities are uniformly extracted from the historical similarity sequence as the cluster center values of different similarity clusters;

[0014] Calculate the absolute value of the difference between each historical similarity and the values of different cluster centers, assign each historical similarity to the similarity cluster corresponding to the smallest absolute value, calculate the arithmetic mean of all historical similarities in each similarity cluster, and update the arithmetic mean to the cluster center value.

[0015] The cluster with the highest similarity value is identified as a highly redundant cluster, and the redundancy interval is determined based on the minimum and maximum historical similarities in the highly redundant cluster.
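
The clustering procedure above amounts to a one-dimensional k-means over the stored similarities. The sketch below is one possible reading of it, under stated assumptions: the function name `redundancy_interval`, the default of three clusters, the fixed iteration count, and the floor-based index spacing for the initial centers are all illustrative choices not fixed by the text.

```python
def redundancy_interval(history, num_clusters=3, iterations=10):
    """Return (min, max) of the cluster with the highest center value."""
    ordered = sorted(history)  # historical similarity sequence, ascending
    n = len(ordered)
    if num_clusters < 2 or n < num_clusters:
        raise ValueError("need at least num_clusters >= 2 similarities")
    # Uniformly extract initial centers (assumed floor-based index spacing).
    centers = [ordered[j * (n - 1) // (num_clusters - 1)] for j in range(num_clusters)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in ordered:
            # Assign to the center with the smallest absolute difference.
            nearest = min(range(len(centers)), key=lambda j: abs(value - centers[j]))
            clusters[nearest].append(value)
        # Update each center to the arithmetic mean of its cluster;
        # an empty cluster keeps its previous center.
        centers = [
            sum(members) / len(members) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    non_empty = [j for j, members in enumerate(clusters) if members]
    top = max(non_empty, key=lambda j: centers[j])  # high-redundancy cluster
    return min(clusters[top]), max(clusters[top])
```

With the hypothetical history `[0.91, 0.42, 0.95, 0.40, 0.70, 0.93, 0.68, 0.45]`, the three highest values form the high-redundancy cluster and the interval is bounded by 0.91 and 0.95.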

[0016] Optionally, a low-rank weight matrix can be reconstructed by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the top k largest singular values, including:

[0017] Perform matrix decomposition on the pre-trained weight matrix of the current Transformer layer to obtain the left singular vector matrix, the singular value sequence, and the right singular vector matrix. The singular value sequence includes multiple singular values sorted in descending order of their numerical values.

[0018] The total energy value is obtained by calculating the sum of squares of each singular value in the singular value sequence. When the proportion of the sum of squares of the first k singular values in the total energy value first exceeds the preset energy threshold, a singular value diagonal matrix is constructed based on the first k singular values in the singular value sequence.

[0019] Extract the column vectors corresponding to the first k singular values from the left singular vector matrix to form the truncated left matrix, and extract the row vectors corresponding to the first k singular values from the right singular vector matrix to form the truncated right matrix.

[0020] Perform matrix multiplication on the truncated left matrix, the singular value diagonal matrix, and the truncated right matrix to obtain the low-rank weight matrix.
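
The decomposition, energy-based choice of k, and reconstruction can be sketched with NumPy as follows. This is an illustrative sketch, not the patent's implementation: the function name `low_rank_weight`, the default threshold of 0.9, and the fallback to full rank when the threshold is never exceeded are assumptions. `numpy.linalg.svd` returns singular values already sorted in descending order.

```python
import numpy as np


def low_rank_weight(weight, energy_threshold=0.9):
    """Truncate the SVD of `weight` at the smallest k whose energy share exceeds the threshold."""
    # Left singular vectors, singular values (descending), right singular vectors (as rows).
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    total_energy = float(np.sum(s ** 2))
    if total_energy == 0.0:
        raise ValueError("weight matrix has no energy (all singular values are zero)")
    energy_ratio = np.cumsum(s ** 2) / total_energy
    exceeds = energy_ratio > energy_threshold
    # First k at which the cumulative share exceeds the threshold;
    # assumption: keep every singular value if it never does.
    k = int(np.argmax(exceeds)) + 1 if exceeds.any() else len(s)
    # Truncated left matrix @ singular value diagonal matrix @ truncated right matrix.
    low_rank = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    return low_rank, k
```

For a diagonal matrix with entries 3, 2, and 1, the first two singular values carry 13/14 of the energy, so a threshold of 0.9 keeps k = 2 and zeroes out the smallest direction.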

[0021] Optionally, each word in the input word sequence of the current Transformer layer is mapped to a low-dimensional subspace defined by the low-rank weight matrix through token merging. The Euclidean distance of each word in the low-dimensional subspace is calculated, and adjacent word pairs with a Euclidean distance less than a preset distance threshold are merged to obtain the target word sequence, including:

[0022] Define the row vector space of the low-rank weight matrix as a low-dimensional subspace;

[0023] Calculate the product of the feature vector corresponding to each word in the input word sequence of the current Transformer layer and the low-rank weight matrix to obtain the projection vector of each word in the low-dimensional subspace;

[0024] Perform vector subtraction on the projection vectors corresponding to each pair of adjacent word pairs in the input word sequence to obtain the difference vector;

[0025] The magnitude of the difference vector is used as the Euclidean distance between adjacent word pairs in the low-dimensional subspace.

[0026] When the Euclidean distance is less than a preset distance threshold, the average value of the feature vectors corresponding to adjacent word pairs in the input word sequence is calculated to obtain the merged word.

[0027] The target word sequence is obtained by replacing the corresponding adjacent word pairs in the input word sequence with the merged words.
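
The projection, distance test, and averaging can be sketched in plain Python. Several details are assumptions made for illustration: the names `project` and `merge_adjacent_tokens`, the representation of the weight matrix as a list of rows, and in particular the greedy left-to-right scan in which a word that has just been merged is not considered again. The text does not say how overlapping pairs are resolved.

```python
import math


def project(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(len(matrix[0]))
    ]


def merge_adjacent_tokens(features, low_rank_weight, distance_threshold):
    """Merge adjacent feature vectors whose projections are closer than the threshold."""
    projections = [project(f, low_rank_weight) for f in features]
    merged, i = [], 0
    while i < len(features):
        if i + 1 < len(features):
            # Euclidean distance between the two projection vectors.
            distance = math.dist(projections[i], projections[i + 1])
            if distance < distance_threshold:
                # Merged word: mean of the ORIGINAL feature vectors of the pair.
                merged.append(
                    [(a + b) / 2 for a, b in zip(features[i], features[i + 1])]
                )
                i += 2  # assumption: a merged pair is consumed as a unit
                continue
        merged.append(list(features[i]))
        i += 1
    return merged
```

Note that the distance is measured in the projected space while the average is taken over the original feature vectors, matching paragraphs [0025] and [0026].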

[0028] Optionally, the target word sequence is input into the current Transformer layer, and matrix multiplication and self-attention calculation are performed using the low-rank weight matrix to obtain the target output feature vector, thereby accelerating the inference of the current Transformer layer, including:

[0029] The target word sequence is input into the self-attention computation unit of the current Transformer layer. The target word sequence is linearly transformed using a low-rank weight matrix to obtain the query matrix, key matrix and value matrix.

[0030] Perform matrix multiplication on the query matrix and the transpose of the key matrix to obtain the attention matrix, and then normalize the attention matrix to obtain the standard attention matrix.

[0031] Perform matrix multiplication on the standard attention matrix and the value matrix to obtain the context feature matrix;

[0032] Perform matrix multiplication between the context feature matrix and the preset output projection matrix to obtain the target output feature vector;

[0033] The target output feature vector is used as the input vector for the next Transformer layer to accelerate inference in the current Transformer layer.

[0034] Optionally, a linear transformation is performed on the target word sequence using a low-rank weight matrix to obtain the query matrix, key matrix, and value matrix, including:

[0035] Based on the column index order of the low-rank weight matrix, the low-rank weight matrix is divided into a first submatrix, a second submatrix, and a third submatrix with the same number of column vectors.

[0036] Matrix multiplication is performed on the target word sequence with the first submatrix, the second submatrix, and the third submatrix respectively to obtain the query matrix, the key matrix, and the value matrix.
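
The attention computation with a column-split low-rank weight matrix can be sketched as below. This is a simplified single-head illustration under stated assumptions: the helper names are hypothetical, the normalization is taken to be a row-wise softmax (the text only says "normalize"), and no scaling factor, masking, or multi-head logic is included.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Normalize each row so that it is non-negative and sums to 1."""
    normalized = []
    for row in matrix:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return normalized


def low_rank_attention(tokens, low_rank_weight, output_projection):
    width = len(low_rank_weight[0])
    if width % 3 != 0:
        raise ValueError("column count must be divisible by 3")
    third = width // 3
    # First, second, and third submatrices, split by column index order.
    w_q = [row[:third] for row in low_rank_weight]
    w_k = [row[third:2 * third] for row in low_rank_weight]
    w_v = [row[2 * third:] for row in low_rank_weight]
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    k_transposed = [list(col) for col in zip(*k)]
    attention = softmax_rows(matmul(q, k_transposed))  # standard attention matrix
    context = matmul(attention, v)                     # context feature matrix
    return matmul(context, output_projection)          # target output features
```

Because token merging shortens `tokens` before this call, the attention matrix shrinks quadratically with the number of merged pairs, which is where most of the claimed saving would come from.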

[0037] In a second aspect, this application provides a large model inference acceleration system, including:

[0038] The acquisition module is used to acquire the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory during the long text inference process of the large model;

[0039] The calculation module is used to obtain the inter-layer similarity by calculating the cosine value of the angle between the output feature vectors;

[0040] The decomposition module is used to reconstruct a low-rank weight matrix when the cache miss rate is greater than the preset congestion threshold and the inter-layer similarity is in the redundant interval. This is done by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the top k largest singular values. The redundant interval is obtained by cluster analysis based on the historical similarity of the current Transformer layer, where k is a positive integer.

[0041] The calculation module is also used to map each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined by the low-rank weight matrix through token merging, calculate the Euclidean distance of each word in the low-dimensional subspace, and merge adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence.

[0042] The calculation module is also used to input the target word sequence into the current Transformer layer, and use the low-rank weight matrix to perform matrix multiplication and self-attention calculation to obtain the target output feature vector, so as to accelerate the inference of the current Transformer layer.

[0043] In a third aspect, this application provides an electronic device, comprising:

[0044] A memory, used to store a computer program;

[0045] A processor, used to implement the steps of the large model inference acceleration method described in the first aspect above when executing the computer program.

[0046] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the large model inference acceleration method described in the first aspect above.

[0047] This application provides a method for accelerating large model inference. By real-time monitoring of the model's feature output state and the underlying hardware's cache state, it achieves real-time awareness of dynamic changes in data features and hardware load pressure during the inference process. This solves the technical problems of existing technologies being unable to detect memory wall bottlenecks based on real-time hardware states and existing static compression strategies being unable to flexibly cope with dynamically changing computation and memory access pressures. It achieves a physical reduction in the length of the input sequence, reduces inference latency, and improves system throughput, overcoming the bottlenecks of computational intensity and low energy efficiency.

[0048] Furthermore, by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer, a sequence of singular values arranged in descending order of numerical value is obtained, and the cumulative energy percentage of the singular values in this sequence is calculated. When the cumulative energy percentage first exceeds a preset energy threshold, the number of singular values k to be retained is dynamically determined, and then the corresponding left and right singular vectors and singular values are extracted to construct a truncated matrix. Finally, a low-rank weight matrix is obtained through matrix multiplication. This solves the technical problem that existing static compression methods cannot flexibly adapt to dynamic computational loads while ensuring accuracy.

Attached Figure Description

[0049] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0050] Figure 1 is a flowchart of a method for accelerating large model inference provided in an embodiment of this application;

[0051] Figure 2 is a flowchart of a method for obtaining a low-rank weight matrix provided in an embodiment of this application;

[0052] Figure 3 is a flowchart of a method for obtaining a target word sequence provided in an embodiment of this application;

[0053] Figure 4 is a schematic diagram of a large model inference acceleration system provided in an embodiment of this application;

[0054] Figure 5 is a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.

Detailed Implementation

[0055] Existing solutions cannot detect changes in input data features or the real-time load status of the underlying hardware during inference. This results in memory wall bottlenecks and wasted computing resources when dealing with long texts or sudden high-concurrency requests, and prevents computational complexity from being dynamically optimized based on real-time hardware status and data features, causing low efficiency and high response latency in long text inference.

[0056] This application obtains the cache miss rate of the cache memory and the output feature vectors of adjacent Transformer layers and calculates the inter-layer similarity to dynamically perceive the degree of hardware congestion and the redundancy of model features. Then, when there is hardware congestion and feature redundancy, it uses the redundant interval based on historical clustering analysis to trigger singular value decomposition to reconstruct the low-rank weight matrix, and combines low-dimensional subspace mapping to merge word sequences. Finally, it uses the low-rank weight matrix and the reduced target word sequence to perform calculations, thereby achieving dynamic inference acceleration through hardware and software collaboration while ensuring accuracy.

[0057] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0058] The core of this application is to provide a method for accelerating large model inference. A flowchart of one specific implementation is shown in Figure 1. As shown in Figure 1, the method includes:

[0059] Step 101: During the long text inference process of the large model, obtain the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory.

[0060] In this step, the current Transformer layer refers to the specific network layer in the large model that is currently performing the computational task, with its layer index denoted $l$. The previous Transformer layer refers to the network layer that immediately precedes the current Transformer layer in the model structure, with layer index $l-1$. The output feature vector is a high-dimensional numerical vector, generated after processing by the corresponding Transformer layer, that represents the semantic features of the input data at that layer; the two output feature vectors are denoted $h_l$ and $h_{l-1}$. Cache memory refers to the high-speed storage units inside the processor used to temporarily store instructions and data, including the L1 cache and the L2 cache. The cache miss rate refers to the proportion of accesses, within a specific time window, in which the processor fails to retrieve the required data directly from the cache memory and must instead read it from main memory. This metric directly reflects the memory wall bottleneck and data movement pressure of the current computing node, and is denoted $r_{\text{miss}}$.

[0061] In this embodiment, the inference process of the large model is first monitored. When it is detected that the model is processing a long text input task, the layer index $l$ of the currently executing operation is located and read from the model's computation graph in real time, and the corresponding layer is marked as the current Transformer layer. At the same time, the layer corresponding to the preceding index $l-1$ is marked as the previous Transformer layer. Next, the output feature vectors $h_l$ and $h_{l-1}$ generated in real time are captured from the output interfaces of these two layers, respectively. Meanwhile, using the performance counter interface provided by the underlying hardware, the access statistics of the cache memory are sampled periodically, and the ratio of the number of cache misses to the total number of requests per unit time is calculated to obtain the real-time cache miss rate $r_{\text{miss}}$.

[0062] For example, in a task of summarizing a scientific document of length A, when inference reaches layer $l$, the feature vector of dimension D output by that layer is read directly as $h_l$, and the output vector of the previous layer is read from the cache as $h_{l-1}$. Simultaneous monitoring shows that the L2 cache miss rate at that moment is 0.45, i.e. $r_{\text{miss}} = 0.45$.
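
The ratio of cache misses to total requests over a sampling window can be sketched as the difference between two cumulative counter readings. This is a hedged illustration: the `CounterSample` type, its field names, and the convention that an empty window yields 0.0 are assumptions, and actually reading the hardware counters is platform-specific and not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Cumulative cache counters at one sampling instant (hypothetical structure)."""
    misses: int
    accesses: int


def cache_miss_rate(previous, current):
    """Miss rate over the window between two cumulative counter samples."""
    window_misses = current.misses - previous.misses
    window_accesses = current.accesses - previous.accesses
    if window_accesses <= 0:
        # Assumption: no cache traffic in the window means no congestion signal.
        return 0.0
    return window_misses / window_accesses
```

For instance, 450 misses out of 1000 accesses in a window reproduces the 0.45 miss rate used in the example above.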

[0063] Step 102: Calculate the cosine of the angle between the output feature vectors to obtain the inter-layer similarity.

[0064] In this step, the cosine of the angle refers to the cosine function value of the angle between two high-dimensional vectors in the vector space. It is used to measure the closeness of the two vectors in direction, and its value is usually between -1 and 1. Inter-layer similarity is a numerical index based on the cosine of the angle, which quantifies the smoothness of the evolution of the output features between the current Transformer layer and the previous Transformer layer. The closer the value is to 1, the smaller the feature change between the two layers and the higher the information redundancy.

[0065] Step 201: Perform a vector dot product operation on the output feature vector of the current Transformer layer and the output feature vector of the previous Transformer layer to obtain the vector dot product value.

[0066] In this embodiment, the vector calculation unit is invoked to read the output feature vector $h_l$ of the current Transformer layer and the output feature vector $h_{l-1}$ of the previous Transformer layer. Assume that both vectors have dimension $D$, represented as $h_l = (a_1, a_2, \dots, a_D)$ and $h_{l-1} = (b_1, b_2, \dots, b_D)$. Using parallel multiply-accumulate instructions, the components $a_i$ and $b_i$ at each dimension index $i$ are multiplied, and the products are then summed, as follows: $h_l \cdot h_{l-1} = \sum_{i=1}^{D} a_i b_i$. For example, in a simplified low-dimensional scenario, the vector dot product value is obtained by multiplying the corresponding components of the two vectors and adding the results.

[0067] Step 202: Multiply the vector magnitude of the output feature vector of the current Transformer layer with the vector magnitude of the output feature vector of the previous Transformer layer to obtain the magnitude product value.

[0068] In this embodiment of the application, for the output feature vector $h_l$ of the current Transformer layer, the sum of the squares of all its components $a_i$ ($i = 1, \dots, D$) is calculated and the square root of the sum is taken to obtain the vector magnitude $\|h_l\| = \sqrt{\sum_{i=1}^{D} a_i^2}$. Similarly, for the output feature vector $h_{l-1}$ of the previous Transformer layer, the sum of the squares of all its components $b_i$ is calculated and the square root is taken to obtain the vector magnitude $\|h_{l-1}\| = \sqrt{\sum_{i=1}^{D} b_i^2}$. Then, $\|h_l\|$ and $\|h_{l-1}\|$ are multiplied to obtain the magnitude product value $\|h_l\| \, \|h_{l-1}\|$.

[0069] Step 203: Divide the vector dot product by the product of the magnitudes to obtain the cosine of the included angle, and use the cosine of the included angle as the inter-layer similarity.

[0070] In this embodiment of the application, the vector dot product value $h_l \cdot h_{l-1}$ is read as the numerator and the magnitude product value $\|h_l\| \, \|h_{l-1}\|$ as the denominator, where $h_l$ and $h_{l-1}$ are the output feature vectors of the current and previous Transformer layers. Floating-point division is performed to calculate the ratio of the two, according to the definition of cosine similarity shown in formula (1):

[0071] $$S_l = \cos\theta = \frac{h_l \cdot h_{l-1}}{\|h_l\| \, \|h_{l-1}\|} \tag{1}$$

[0072] The obtained quotient is the cosine of the included angle, and this value is directly assigned to the inter-layer similarity $S_l$. This normalization eliminates the influence of the absolute length of the vectors and purely reflects the degree of similarity between $h_l$ and $h_{l-1}$ in feature-space direction. For example, when the calculated inter-layer similarity is $S_l = 1$, the output feature directions of the two layers are completely consistent, and the value is highly likely to fall within the redundancy interval, thereby triggering the acceleration mechanism.

[0073] Step 103: When the cache miss rate is greater than the preset congestion threshold and the inter-layer similarity is in the redundancy range, the low-rank weight matrix is reconstructed by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the top k largest singular values. The redundancy range is obtained by cluster analysis based on the historical similarity of the current Transformer layer, where k is a positive integer.

[0074] In this step, the preset congestion threshold refers to a pre-defined critical value used to determine whether the hardware memory bandwidth is in a congested state. When the real-time monitored indicators exceed this value, the hardware resources are considered strained. The redundancy interval refers to a range of similarity values dynamically defined based on historical model data. Layer similarity falling within this range indicates that the current computational layer has high compressibility. The pre-trained weight matrix refers to the original parameter matrix stored in the current Transformer layer, which has been learned and determined by the large model during the training phase. It typically includes projected weights used to generate queries, keys, and values.

[0075] Singular values are non-negative real numbers that reflect the energy distribution of a matrix after singular value decomposition. A low-rank weight matrix is an approximate matrix with a rank lower than that of the original matrix, reconstructed by retaining the largest singular values and their corresponding singular vectors. It reduces the number of parameters while preserving the main information of the original matrix.
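
The trigger condition of step 103 combines the two signals into a single test. The sketch below is illustrative only: the function name `should_accelerate` and the example threshold values are assumptions, and the interval is treated as closed at both ends.

```python
def should_accelerate(miss_rate, similarity, congestion_threshold, redundancy_interval):
    """True when the hardware is congested AND the layer output is redundant."""
    low, high = redundancy_interval
    hardware_congested = miss_rate > congestion_threshold
    features_redundant = low <= similarity <= high  # closed interval assumed
    return hardware_congested and features_redundant
```

Both conditions must hold: a congested cache alone, or a redundant layer alone, leaves the original full-rank computation in place.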

[0076] Step 3011: Obtain the historical similarity between the current Transformer layer and the previous Transformer layer at multiple historical inference moments.

[0077] In this step, historical similarity refers to the inter-layer similarity data calculated and stored for the same layer position (i.e., the current Transformer layer) over several inference cycles prior to the current inference time. Each historical similarity is denoted $s_t$.

[0078] In this embodiment, a fixed-capacity first-in-first-out queue or circular buffer is maintained to store the most recent inter-layer similarities. The inter-layer similarity is calculated at each inference time step, and the latest inter-layer similarity is stored in this queue while the oldest value is removed. When the redundancy interval needs to be determined, all stored values are read directly from this queue to form the historical similarity set $H$. For example, assuming the buffer size is $N = 5$, the currently stored historical similarity set $H$ contains five values.
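
A fixed-capacity first-in-first-out buffer of this kind can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest entry automatically when the buffer is full. The class name `SimilarityHistory` and its method names are hypothetical.

```python
from collections import deque


class SimilarityHistory:
    """Fixed-capacity FIFO store of recent inter-layer similarities."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) drops the oldest item when a new one is appended.
        self._buffer = deque(maxlen=capacity)

    def record(self, similarity):
        """Store the latest similarity, evicting the oldest if the buffer is full."""
        self._buffer.append(similarity)

    def snapshot(self):
        """Return the stored values, oldest first, as the historical similarity set."""
        return list(self._buffer)
```

One such buffer would be kept per layer position, since the redundancy interval is derived from the history of the current Transformer layer only.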

[0079] Step 3012: Sort all historical similarities in ascending order to obtain a historical similarity sequence, and extract a preset number of historical similarities from the historical similarity sequence as the cluster center values of different similarity clusters.

[0080] In this step, the historical similarity sequence refers to the ordered list generated by sorting the unordered historical similarity set by numerical value from smallest to largest. The cluster center value refers to the value representing the core position of a particular cluster during the initialization phase or an iteration of the clustering algorithm, denoted $c_j$.

[0081] In this embodiment of the application, a quicksort or merge sort algorithm is used to sort the historical similarity set $H$ in ascending order to obtain the sequence $H'$. Based on the preset number of clusters $K$ (assuming $K = 3$, representing the low, medium, and high redundancy states, respectively), $K$ values are extracted from $H'$ at equally spaced indices and used as the initial cluster center values $c_1, \dots, c_K$ of the $K$ similarity clusters. For example, if sorting the aforementioned set yields a sequence of five values, the values at indices 0, 2, and 4 are extracted as the initial centers $c_1$, $c_2$, and $c_3$.
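
The equally spaced extraction can be written as an index formula. This is an illustrative sketch: the floor-based spacing rule and the symbols $N$ (length of the sorted sequence), $K$ (number of clusters), and $i_j$ (zero-based index of the $j$-th initial center) are assumptions, chosen because they reproduce the indices 0, 2, and 4 of the example above for a sequence of five values.

```latex
i_j = \left\lfloor \frac{j\,(N-1)}{K-1} \right\rfloor, \qquad j = 0, 1, \dots, K-1

N = 5,\; K = 3: \qquad (i_0, i_1, i_2) = (0, 2, 4)
```

This rule always selects the smallest and largest stored similarities as the first and last initial centers.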

[0082] Step 3013: Calculate the absolute value of the difference between each historical similarity and the value of different cluster centers, assign each historical similarity to the similarity cluster corresponding to the smallest absolute value, calculate the arithmetic mean of all historical similarities in each similarity cluster, and update the arithmetic mean to the cluster center value.

[0083] In this step, a similarity cluster refers to one of the several subsets into which the historical similarities are divided based on their numerical closeness.

[0084] In this embodiment, iterative clustering logic is executed. For each historical similarity $s_t$ in the historical similarity set, the absolute value of its difference from every cluster center, $|s_t - c_j|$, is calculated. These absolute values are compared to find the cluster index $j^*$ corresponding to the minimum value, and $s_t$ is assigned to the $j^*$-th similarity cluster. After all historical data has been assigned, each similarity cluster is traversed: the historical similarity values within the cluster are summed and divided by the number of elements in the cluster to obtain a new mean, which replaces the original $c_j$ as the updated cluster center value. For example, for a given historical similarity, the absolute differences from $c_1$, $c_2$, and $c_3$ are calculated, and the value is assigned to the cluster whose center gives the smallest difference.

[0085] Step 3014: Determine the cluster with the highest cluster center value as a high-redundancy cluster, and determine the redundancy interval based on the minimum and maximum historical similarities in the high-redundancy cluster.

[0086] In this step, the high-redundancy cluster refers to the cluster with the largest cluster center value. It represents the set of states where the model's output features change least significantly and the information redundancy is the highest at this level.

[0087] In this embodiment, all updated cluster center values are compared, the maximum value is found, and the corresponding similarity cluster is marked as a high-redundancy cluster. All data elements in this high-redundancy cluster are traversed, and the minimum value is selected. and maximum value The closed interval formed by these two boundary values. Determined as a redundant interval For example, suppose that after iteration, the cluster corresponding to the largest cluster center includes... If so, then the cluster is a highly redundant cluster, and the determined redundancy interval is... .

[0088] Figure 2 is a flowchart illustrating a method for obtaining a low-rank weight matrix, as provided in an embodiment of this application.

[0089] Step 301: Perform matrix decomposition on the pre-trained weight matrix of the current Transformer layer to obtain the left singular vector matrix, the singular value sequence, and the right singular vector matrix. The singular value sequence includes multiple singular values sorted in descending order of their numerical values.

[0090] In this step, the left singular vector matrix refers to the matrix whose columns form an orthogonal basis of the feature space, usually denoted as $U$. The singular value sequence is the set of elements on the diagonal of the diagonal matrix obtained by decomposition, denoted as $\sigma_1, \sigma_2, \ldots$; its elements are arranged from largest to smallest, reflecting the energy or importance of the corresponding feature dimensions. The right singular vector matrix is a matrix that includes an orthogonal basis of the input space, usually denoted as $V$ or its transpose $V^{T}$.

[0091] In this embodiment, the pre-trained weight matrix pre-stored in the current Transformer layer is read, together with its dimensions. The singular value decomposition function of a linear algebra library is called to perform either a full decomposition or a truncated decomposition, yielding the left singular vector matrix, the singular value sequence, and the right singular vector matrix, where the singular values are sorted in non-increasing order. For example, suppose the pre-trained weight matrix is:

[0092]

[0093] Decomposition yields:

[0094]

[0095] Step 302: Calculate the sum of squares of each singular value in the singular value sequence to obtain the total energy value. When the proportion of the sum of squares of the current k singular values in the total energy value is greater than the preset energy threshold for the first time, construct a singular value diagonal matrix based on the first k singular values in the singular value sequence.

[0096] In this step, the total energy value refers to the sum of the squares of all singular values, representing the total information contained in the original matrix. The preset energy threshold refers to the minimum information retention ratio set to ensure the accuracy of the compressed model. The singular value diagonal matrix is a diagonal matrix whose diagonal contains only the first k selected singular values.

[0097] In this embodiment of the application, the complete singular value sequence is first traversed, and the squares of all singular values are summed to obtain the total energy value. Then, starting from the first singular value, the squares of the singular values are accumulated one at a time to obtain the cumulative energy, and the proportion of the cumulative energy in the total energy value is calculated. When this proportion is detected to exceed the preset energy threshold for the first time, accumulation stops and the current index is determined to be the retained rank k. A k × k zero matrix is then created, and the first k singular values are filled into its diagonal positions to construct the singular value diagonal matrix.

[0098] For example, the total energy is calculated first. With a given preset energy threshold, if the proportion contributed by the first singular value alone already exceeds the threshold, then k = 1 is determined and the singular value diagonal matrix is constructed from that single value.
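The rank selection rule above (the smallest k whose cumulative squared singular values first exceed the energy threshold) could be sketched with NumPy as follows. The function name `select_rank`, the sample matrix, and the threshold value 0.9 are assumptions chosen for illustration; the fallback of keeping every singular value when the threshold is never exceeded is also an assumption.

```python
import numpy as np


def select_rank(singular_values, energy_threshold):
    """Smallest k whose cumulative energy ratio first exceeds the threshold."""
    s = np.asarray(singular_values, dtype=float)
    energy = s ** 2
    ratio = np.cumsum(energy) / energy.sum()
    exceeds = ratio > energy_threshold
    if not exceeds.any():
        # Assumption: keep all singular values if the threshold is never exceeded.
        return len(s)
    return int(np.argmax(exceeds)) + 1


# Illustrative weight matrix with singular values 3, 1 and 0.5.
W = np.diag([3.0, 1.0, 0.5])
s = np.linalg.svd(W, compute_uv=False)  # returned in descending order
k = select_rank(s, energy_threshold=0.9)
sigma_k = np.diag(s[:k])  # k x k singular value diagonal matrix
```

With these sample values the first singular value carries about 88% of the energy, so a second one is needed to pass a 0.9 threshold.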

[0099] Step 303: Extract column vectors corresponding to the first k singular values from the left singular vector matrix to form a truncated left matrix, and extract row vectors corresponding to the first k singular values from the right singular vector matrix to form a truncated right matrix.

[0100] In the embodiments of this application, based on the determined value of k, a slicing operation is performed on the left singular vector matrix, keeping columns 1 through k, to generate the truncated left matrix. Meanwhile, a slicing operation is performed on the right singular vector matrix, keeping rows 1 through k, to generate the truncated right matrix. These two truncated matrices, together with the singular value diagonal matrix, constitute a low-rank approximation of the original weights.

[0101] For example, the first column of the left singular vector matrix is extracted to obtain the truncated left matrix, and the first row of the right singular vector matrix is extracted to obtain the truncated right matrix.

[0102] Step 304: Perform matrix multiplication on the truncated left matrix, the singular value diagonal matrix, and the truncated right matrix to obtain a low-rank weight matrix.

[0103] In this embodiment, a sequence of matrix multiplications is performed. First, the product of the truncated left matrix and the singular value diagonal matrix is calculated to obtain an intermediate matrix. Next, the product of the intermediate matrix and the truncated right matrix is calculated to obtain the low-rank weight matrix. The rank of this matrix does not exceed k, and it numerically approximates the original weight matrix.

[0104] For example, the low-rank weight matrix is calculated in this way. Although it differs from the original matrix by an approximation error, its rank is reduced while the main energy characteristics are preserved.
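Steps 301, 303 and 304 can be illustrated together with a short NumPy sketch: decompose, truncate to the first k components, and multiply back. The function name `low_rank_approx` and the sample matrix are assumptions for illustration and are not the example values of the source.

```python
import numpy as np


def low_rank_approx(W, k):
    """Rank-k approximation of W built from its truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k]        # truncated left matrix: first k columns
    S_k = np.diag(s[:k])  # singular value diagonal matrix
    Vt_k = Vt[:k, :]      # truncated right matrix: first k rows
    return U_k @ S_k @ Vt_k


# Illustrative 3 x 3 weight matrix with singular values 2, 1 and 0.1.
W = np.diag([2.0, 1.0, 0.1])
W_low = low_rank_approx(W, k=2)

# The Frobenius norm of the residual equals the discarded singular value.
error = np.linalg.norm(W - W_low)
```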

[0105] Step 104: Map each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined by the low-rank weight matrix through token merging, calculate the Euclidean distance of each word in the low-dimensional subspace, and merge adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence.

[0106] In this step, token merging refers to a data processing strategy that reduces sequence length by merging semantically similar tokens. A token is a vector representation of the input text sequence after word segmentation and embedding; it is the basic unit of model processing. A low-dimensional subspace is a vector space spanned by the row vectors of a low-rank weight matrix. A preset distance threshold is a critical distance value used to determine whether two tokens are semantically similar enough to be merged. Adjacent token pairs are two tokens that are sequentially adjacent in the input sequence. The target token sequence is the shortened token sequence that retains key semantic information after the merging operation.

[0107] Figure 3 is a flowchart illustrating a method for obtaining a target word sequence, as provided in an embodiment of this application.

[0108] Step 401: Define the row vector space of the low-rank weight matrix as a low-dimensional subspace.

[0109] In this step, the row vector space refers to the vector space that can be generated by a linear combination of all the row vectors of a matrix, and its dimension is equal to the row rank of the matrix.

[0110] In this embodiment, the low-rank weight matrix is read directly. Its row vectors are taken as the generating vectors, and the linear space they span is designated as the low-dimensional subspace. Multiplying any original high-dimensional word vector by the low-rank weight matrix essentially projects that vector into this subspace.

[0111] For example, if the low-rank weight matrix has rank 1, the low-dimensional subspace it defines is a one-dimensional linear space.

[0112] Step 402: Calculate the product of the feature vector and the low-rank weight matrix for each word in the input word sequence of the current Transformer layer to obtain the projection vector of each word in the low-dimensional subspace.

[0113] In this step, the projection vector refers to the coordinate representation of the high-dimensional word feature vector in the low-dimensional subspace after a linear transformation.

[0114] In this embodiment of the application, the input word sequence of the current Transformer layer is obtained. Using the matrix multiplication unit, the product of each feature vector and the low-rank weight matrix is calculated. Writing $x_i$ for the feature vector of the $i$-th word, $W_{\text{low}}$ for the low-rank weight matrix, and $p_i$ for the projection vector, the formula is as follows: $p_i = x_i W_{\text{low}}$.

[0115] For example, suppose the input word sequence contains two feature vectors of dimension 3. The projection of each feature vector is calculated by multiplying it by the low-rank weight matrix.

[0116] Step 403: Perform vector subtraction on the projection vectors corresponding to each pair of adjacent word pairs in the input word sequence to obtain the difference vector.

[0117] In this embodiment of the application, the projection vector sequence is traversed. For each index i from the first position to the second-to-last position, the two adjacent projection vectors at positions i and i + 1 are extracted, and vector subtraction is performed to calculate the difference vector. This operation is performed in the low-dimensional space. For example, the difference vector is calculated from the projection results of the preceding example.

[0118] Step 404: Use the magnitude of the difference vector as the Euclidean distance between adjacent word pairs in the low-dimensional subspace.

[0119] In this embodiment of the application, the modulus of each difference vector is calculated. The result is directly taken as the Euclidean distance between the i-th word and the (i + 1)-th word in the low-dimensional subspace. For example, the modulus of the difference vector from the preceding example is calculated.
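Steps 402 to 404 (projection, adjacent difference, modulus) might be sketched as below. The function name `adjacent_distances`, the rank-1 matrix `W_low`, and the two feature vectors are illustrative assumptions; they are not the example values of the source.

```python
import numpy as np


def adjacent_distances(X, W_low):
    """Euclidean distance between each adjacent pair of projected tokens."""
    P = X @ W_low            # projection vectors, one row per token
    diffs = P[1:] - P[:-1]   # difference vectors of adjacent pairs
    return np.linalg.norm(diffs, axis=1)


# Illustrative rank-1 matrix and two 3-dimensional feature vectors.
W_low = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 5.0, 7.0]])
distances = adjacent_distances(X, W_low)
```

Note that the two tokens differ strongly in their second and third coordinates, yet their projected distance is small because this particular `W_low` only retains the first coordinate.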

[0120] Step 405: When the Euclidean distance is less than the preset distance threshold, calculate the average value of the feature vectors corresponding to adjacent word pairs in the input word sequence to obtain the merged word.

[0121] In this step, the preset distance threshold refers to the critical value used to determine whether two words are sufficiently similar. The merged word refers to a new word vector generated by fusing the information of two similar words.

[0122] In this embodiment of the application, the Euclidean distance is compared with the preset distance threshold. If the distance is less than the threshold, the two adjacent words are determined to be highly redundant semantically and can be merged. In that case, the corresponding high-dimensional feature vectors are retrieved from the original input sequence, and their arithmetic mean is calculated to obtain the merged word.

[0123] For example, given a preset distance threshold, if the calculated distance is below the threshold, the pair is merged, and the average of the two original feature vectors is calculated.

[0124] Step 406: Replace the corresponding adjacent word pairs in the input word sequence with the merged word to obtain the target word sequence.

[0125] In this step, the target word sequence refers to the new sequence after merging and compression. Its length is less than that of the original sequence whenever at least one pair is merged.

[0126] In this embodiment of the application, a sequence reconstruction operation is performed. When the i-th word and the (i + 1)-th word are determined to be merged into a merged word, the two original words are removed from the sequence and the merged word is inserted in their place. To avoid merging conflicts, a greedy strategy or an odd-even index grouping strategy is typically used in a single traversal. After processing, the resulting sequence is the target word sequence, which is then fed into the Transformer layer for computation.

[0127] For example, the original sequence has length 2; after merging, the length is reduced to 1, which reduces the amount of subsequent computation. In this example, the target word sequence consists solely of the merged word.
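A single greedy left-to-right pass is one way to realize steps 405 and 406 without merge conflicts. The sketch below assumes that strategy; the function name `merge_adjacent_tokens`, the threshold value, and the sample data are illustrative and not taken from the source, and the input sequence is assumed to be non-empty.

```python
import numpy as np


def merge_adjacent_tokens(X, W_low, tau):
    """Greedy pass: average adjacent tokens whose projected distance < tau."""
    P = X @ W_low  # distances are measured in the low-dimensional subspace
    merged, i, n = [], 0, len(X)
    while i < n:
        if i + 1 < n and np.linalg.norm(P[i] - P[i + 1]) < tau:
            # Merge using the ORIGINAL high-dimensional feature vectors.
            merged.append((X[i] + X[i + 1]) / 2.0)
            i += 2  # both tokens are consumed, so no token merges twice
        else:
            merged.append(X[i])
            i += 1
    return np.stack(merged)


W_low = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
X = np.array([[1.0, 2.0, 3.0],
              [1.1, 5.0, 7.0],
              [9.0, 0.0, 0.0]])
target = merge_adjacent_tokens(X, W_low, tau=0.5)
```

Here the first two tokens are merged into their mean and the third is kept, so the sequence length drops from 3 to 2.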

[0128] Step 105: Input the target word sequence into the current Transformer layer, and use the low-rank weight matrix to perform matrix multiplication and self-attention calculation to obtain the target output feature vector, so as to accelerate the inference of the current Transformer layer.

[0129] In this step, the target output feature vector refers to the feature data output by the current Transformer layer after it has completed all of its computational logic; this data has been transformed and incorporates contextual information. It serves as the input for the next layer.

[0130] Step 501: Input the target word sequence into the self-attention calculation unit of the current Transformer layer, and use the low-rank weight matrix to perform a linear transformation on the target word sequence to obtain the query matrix, key matrix and value matrix.

[0131] In this step, the self-attention computation unit refers to the core logic module in the Transformer architecture responsible for executing the self-attention mechanism, which captures the dependencies between words at different positions within the input sequence. The query matrix, the key matrix, and the value matrix, denoted as $Q$, $K$, and $V$ respectively, are the intermediate feature matrices obtained by mapping the input sequence into three different semantic spaces. They represent, respectively, what information the current word is querying for, the index features against which a word is matched when queried, and the actual content information it carries.

[0132] Step 5011: Based on the column index order of the low-rank weight matrix, divide the low-rank weight matrix into a first submatrix, a second submatrix, and a third submatrix with the same number of column vectors.

[0133] In this step, column index order refers to the natural index of the matrix column vectors arranged from left to right, usually starting from 0 and incrementing. The first submatrix, the second submatrix, and the third submatrix are the three independent weight blocks obtained by dividing the low-rank weight matrix in the column direction. They correspond to the linear transformation parameters required to generate the query, key, and value vectors, respectively.

[0134] In this embodiment of the application, the low-rank weight matrix is read. In column index order, the first third of the columns are extracted to form the first submatrix, the next third of the columns are extracted to form the second submatrix, and the last third of the columns are extracted to form the third submatrix. This equal partitioning decomposes the single low-rank matrix into three functionally independent but identically structured projection submatrices. For example, based on the logic of the aforementioned embodiment and adapted to the partitioning requirements, the low-rank weight matrix is:

[0135]

[0136] The matrix is divided into three equal parts in column index order, each consisting of one column: the first submatrix is column 1, the second submatrix is column 2, and the third submatrix is column 3.

[0137] Step 5012: Perform matrix multiplication operations on the target word sequence with the first submatrix, the second submatrix, and the third submatrix respectively to obtain the query matrix, the key matrix, and the value matrix.

[0138] In this embodiment of the application, the target word sequence is obtained. Using the matrix multiplication unit, the product of the target word sequence and the first submatrix is calculated first to obtain the query matrix. Next, the product of the target word sequence and the second submatrix is calculated to obtain the key matrix. Finally, the product of the target word sequence and the third submatrix is calculated to obtain the value matrix.

[0139] For example, the target word sequence is multiplied by each of the submatrices obtained above to calculate the query matrix, the key matrix, and the value matrix.
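The column-wise partition and the three projections of steps 5011 and 5012 could look like the following NumPy sketch. The function name `project_qkv` and the sample numbers are assumptions for illustration; the sketch also assumes the number of columns is divisible by three, which the equal partition requires.

```python
import numpy as np


def project_qkv(X, W_low):
    """Split W_low into three equal column blocks and project X with each."""
    assert W_low.shape[1] % 3 == 0, "column count must be divisible by 3"
    W_q, W_k, W_v = np.split(W_low, 3, axis=1)  # column index order
    return X @ W_q, X @ W_k, X @ W_v


# Illustrative 3 x 3 matrix: each submatrix is a single column.
W_low = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0]])
X = np.array([[1.0, 2.0, 3.0]])  # a target sequence of one token
Q, K, V = project_qkv(X, W_low)
```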

[0140] Step 502: Perform matrix multiplication on the query matrix and the transpose of the key matrix to obtain the attention matrix, and normalize the attention matrix to obtain the standard attention matrix.

[0141] In this step, the attention matrix refers to the raw score matrix obtained by calculating the correlation between the query matrix and the key matrix. Its element values reflect the degree of attention between different words. The standard attention matrix is the probability distribution matrix obtained by scaling the raw attention matrix and normalizing it with Softmax. The sum of the elements in each row is 1.

[0142] In this embodiment of the application, the key matrix $K$ is first transposed to obtain $K^{T}$. Next, the matrix product of the query matrix $Q$ and $K^{T}$ is calculated to obtain the attention matrix $QK^{T}$. To prevent the dot product results from becoming too large, each element of the attention matrix is divided by the scaling factor $\sqrt{d_k}$, where $d_k$ is the column dimension of the key matrix. Finally, the scaled matrix is normalized along its row dimension using the Softmax function, as shown in formula (2):

[0143] $A = \mathrm{Softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)$ (2)

[0144] This yields the standard attention matrix $A$, which describes the weight that each position in the sequence assigns to every other position.

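[0145] For example, the attention matrix is calculated from the query matrix and the key matrix of the preceding example; assuming a scaling factor of 1, Softmax normalization then yields the standard attention matrix.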

[0146] Step 503: Perform matrix multiplication on the standard attention matrix and the value matrix to obtain the context feature matrix.

[0147] In this step, the context feature matrix refers to the feature matrix that incorporates global context information, obtained by weighting and summing the value matrix according to the attention weights.

[0148] In this embodiment, the product of the standard attention matrix and the value matrix is calculated using a matrix multiplication unit. Through this product, the model aggregates information from related words into the representation of the current word according to the attention distribution, thereby achieving information interaction and fusion. For example, when the target word sequence contains a single word, the resulting context feature matrix is a 1×1 matrix.
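Steps 502 and 503 correspond to standard scaled dot-product attention, which can be sketched as follows. The function name `attention` and the sample inputs are assumptions; subtracting the row maximum before exponentiation is a common numerical-stability measure that is not mentioned in the source and does not change the result.

```python
import numpy as np


def attention(Q, K, V):
    """Row-wise Softmax of Q K^T / sqrt(d_k), then weighted sum of V."""
    d_k = K.shape[1]                      # column dimension of the key matrix
    S = Q @ K.T / np.sqrt(d_k)            # scaled attention scores
    S = S - S.max(axis=1, keepdims=True)  # stability shift (assumption)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)  # each row sums to 1
    return A, A @ V                       # standard attention, context matrix


# Single-token case: the attention matrix degenerates to [[1.0]].
A1, C1 = attention(np.array([[4.0]]), np.array([[5.0]]), np.array([[5.0]]))

# Two-token case with illustrative values.
Q = K = np.eye(2)
V = np.array([[1.0, 2.0], [3.0, 4.0]])
A2, C2 = attention(Q, K, V)
```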

[0149] Step 504: Perform matrix multiplication on the context feature matrix and the preset output projection matrix to obtain the target output feature vector.

[0150] In this step, the preset output projection matrix refers to the linear transformation matrix in the Transformer layer used to map the output of the attention mechanism back to the original feature dimension or to the input dimension of the next layer. The target output feature vector refers to the final output vector, rich in semantic information, generated after a complete self-attention computation.

[0151] In this embodiment, the preset output projection matrix of the current Transformer layer is read, and the product of the context feature matrix and the output projection matrix is calculated to obtain the target output feature vector. This step completes the final linear transformation of the self-attention sublayer, ensuring that the dimensionality of the output data meets the requirements of the model architecture. For example, the output projection matrix may map the one-dimensional result of the preceding example back to three dimensions.

[0152] Step 505: Use the target output feature vector as the input vector of the next Transformer layer to accelerate the inference of the current Transformer layer.

[0153] In this embodiment of the application, the calculated target output feature vector is passed to the next adjacent Transformer layer via the on-chip interconnect network or the memory bus, where it serves as the input data of that layer. Because the computation of the current layer uses the shortened target word sequence and the low-rank weight matrix with reduced rank, the number of floating-point operations and memory accesses involved in the entire self-attention calculation is reduced, thereby accelerating the inference of the current layer in terms of actual elapsed time.
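To make the effect of a shorter sequence concrete, the following rough sketch counts multiply-accumulate operations for one self-attention pass. It is an estimate under stated assumptions, not a measurement from the source: the function name `attention_macs`, the single-head layout, and the dimensions 1024, 768 and 64 are all hypothetical, and the count only reflects the sequence-length reduction from token merging.

```python
def attention_macs(n, d_model, d_head):
    """Approximate multiply-accumulate count for one single-head attention pass."""
    proj = 3 * n * d_model * d_head  # query, key and value projections
    scores = n * n * d_head          # Q @ K^T
    context = n * n * d_head         # A @ V
    out = n * d_head * d_model       # output projection
    return proj + scores + context + out


# Hypothetical sizes: 1024 tokens merged down to 768 tokens.
before = attention_macs(n=1024, d_model=768, d_head=64)
after = attention_macs(n=768, d_model=768, d_head=64)
saving = 1.0 - after / before
```

The projection terms shrink linearly with the sequence length, while the two attention terms shrink quadratically.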

[0154] This application's embodiments achieve real-time perception of dynamic changes in data features and hardware load pressure during inference by monitoring the model's feature output state and the underlying hardware's cache state in real time. This solves the technical problems of existing technologies that cannot perceive memory wall bottlenecks based on real-time hardware states and that existing static compression strategies cannot flexibly cope with dynamically changing computation and memory access pressures. It also achieves a physical reduction in the length of the input sequence, reducing inference latency and improving system throughput, thus overcoming the bottlenecks of computational intensity and low energy efficiency.

[0155] Figure 4 is a schematic diagram illustrating a specific implementation of a large-model inference acceleration system provided in this application. Referring to Figure 4, the system may include:

[0156] The acquisition module 21 is used to acquire the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory during the long text inference process of the large model;

[0157] Calculation module 22 is used to obtain inter-layer similarity by calculating the cosine value of the angle between the output feature vectors;

[0158] Decomposition module 23 is used to reconstruct a low-rank weight matrix by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the first k largest singular values when the cache miss rate is greater than a preset congestion threshold and the inter-layer similarity is in the redundancy interval. The redundancy interval is obtained by clustering analysis based on the historical similarity of the current Transformer layer, and k is a positive integer.

[0159] The calculation module 22 is also used to map each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined according to the low-rank weight matrix by token merging, calculate the Euclidean distance of each word in the low-dimensional subspace, and merge adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence.

[0160] The calculation module 22 is also used to input the target word sequence into the current Transformer layer, and use the low-rank weight matrix to perform matrix multiplication and self-attention calculation to obtain the target output feature vector, so as to accelerate the inference of the current Transformer layer.
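The interplay of the calculation module and the decomposition module can be illustrated by a small gating sketch: compute the inter-layer cosine similarity, then trigger compression only when both conditions hold. The function names `cosine_similarity` and `should_compress`, and all numeric values, are illustrative assumptions; the output feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def should_compress(miss_rate, congestion_threshold, similarity, interval):
    """True when the cache is congested AND the layer output is redundant."""
    low, high = interval
    return miss_rate > congestion_threshold and low <= similarity <= high


# Hypothetical readings for the current and previous layer outputs.
sim = cosine_similarity([1.0, 2.0, 2.0], [1.0, 2.0, 2.1])
trigger = should_compress(0.40, 0.30, sim, interval=(0.95, 1.0))
no_trigger = should_compress(0.20, 0.30, sim, interval=(0.95, 1.0))
```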

[0161] This application provides a large model inference acceleration system for implementing the aforementioned large model inference acceleration method. For the specific implementation of the system, refer to the description of the corresponding embodiments of the method above, which will not be repeated here.

[0162] Figure 5 shows a schematic diagram of the hardware structure of the electronic device provided in an embodiment of this application.

[0163] This application also provides an electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above-described methods for accelerating large model inference.

[0164] The electronic device may include a processor 510 and a memory 520 storing computer program instructions.

[0165] Specifically, the processor 510 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0166] Memory 520 may include mass storage for data or instructions. By way of example and not limitation, memory 520 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 520 may include removable or non-removable (or fixed) media. Where appropriate, memory 520 may be internal or external to the electronic device. In a particular embodiment, memory 520 is non-volatile solid-state memory.

[0167] Memory may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, and electrical, optical, or other physical / tangible memory storage devices. Therefore, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the operations described with reference to the method according to the first aspect of this disclosure.

[0168] The processor 510 implements any of the large model inference acceleration methods described in the above embodiments by reading and executing computer program instructions stored in the memory 520.

[0169] In one example, the electronic device may also include a communication interface 530 and a bus 540. As shown in Figure 5, the processor 510, the memory 520, and the communication interface 530 are connected through the bus 540 and communicate with each other over it.

[0170] The communication interface 530 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application.

[0171] Bus 540 includes hardware, software, or both, that couples the components of the electronic device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 540 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.

[0172] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the above-described methods for accelerating large model inference.

[0173] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as USB flash drives, read-only memory, random access memory, portable hard drives, magnetic disks, or optical disks.

[0174] Embodiments of the present invention also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the large model inference acceleration method.

[0175] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0176] The foregoing has provided a detailed description of a method and system for accelerating large-scale model inference. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of this application.

Claims

1. A method for accelerating large-scale model inference, characterized in that it comprises:
during the long-text inference process of the large model, obtaining the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory;
obtaining the inter-layer similarity by calculating the cosine of the angle between the output feature vectors;
when the cache miss rate is greater than the preset congestion threshold and the inter-layer similarity is in the redundancy interval, reconstructing a low-rank weight matrix by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the first k largest singular values, wherein the redundancy interval is obtained by cluster analysis based on the historical similarity of the current Transformer layer, and k is a positive integer;
by token merging, mapping each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined according to the low-rank weight matrix, calculating the Euclidean distance of each word in the low-dimensional subspace, and merging adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence;
inputting the target word sequence into the current Transformer layer, and obtaining the target output feature vector by matrix multiplication and self-attention calculation using the low-rank weight matrix, so as to accelerate the inference of the current Transformer layer;
wherein mapping, by token merging, each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined according to the low-rank weight matrix, calculating the Euclidean distance of each word in the low-dimensional subspace, and merging adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence comprises:
defining the row vector space of the low-rank weight matrix as the low-dimensional subspace;
calculating the product of the feature vector corresponding to each word in the input word sequence of the current Transformer layer and the low-rank weight matrix to obtain the projection vector of each word in the low-dimensional subspace;
performing vector subtraction on the projection vectors corresponding to each adjacent word pair in the input word sequence to obtain the difference vector;
using the magnitude of the difference vector as the Euclidean distance between the adjacent word pair in the low-dimensional subspace;
when the Euclidean distance is less than the preset distance threshold, calculating the average value of the feature vectors corresponding to the adjacent word pair in the input word sequence to obtain the merged word;
replacing the corresponding adjacent word pair in the input word sequence with the merged word to obtain the target word sequence.

2. The method according to claim 1, characterized in that the method further comprises:
obtaining the historical similarity between the current Transformer layer and the previous Transformer layer at multiple historical inference moments;
sorting all historical similarities in ascending order to obtain a historical similarity sequence, and uniformly extracting a preset number of historical similarities from the historical similarity sequence as the cluster center values of different similarity clusters;
calculating the absolute value of the difference between each historical similarity and each cluster center value, assigning each historical similarity to the similarity cluster corresponding to the smallest absolute value, calculating the arithmetic mean of all historical similarities in each similarity cluster, and updating the cluster center value to the arithmetic mean;
determining the similarity cluster with the highest cluster center value as the high-redundancy cluster, and determining the redundancy interval based on the minimum and maximum historical similarities in the high-redundancy cluster.

3. The method according to claim 1, characterized in that reconstructing the low-rank weight matrix by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the k largest singular values includes: performing matrix decomposition on the pre-trained weight matrix of the current Transformer layer to obtain a left singular vector matrix, a singular value sequence, and a right singular vector matrix, where the singular value sequence contains multiple singular values sorted in descending order; calculating the sum of squares of all singular values in the singular value sequence to obtain a total energy value; when the ratio of the sum of squares of the first k singular values to the total energy value exceeds a preset energy threshold for the first time, constructing a singular value diagonal matrix from the first k singular values in the singular value sequence; extracting the column vectors corresponding to the first k singular values from the left singular vector matrix to form a truncated left matrix, and extracting the row vectors corresponding to the first k singular values from the right singular vector matrix to form a truncated right matrix; and multiplying the truncated left matrix, the singular value diagonal matrix, and the truncated right matrix to obtain the low-rank weight matrix.
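The energy-based choice of k and the truncated reconstruction can be sketched with `numpy.linalg.svd`. The function name `low_rank_reconstruct`, the default threshold of 0.9, and the fallback to full rank when the threshold is never exceeded are assumptions for illustration only.

```python
import numpy as np

def low_rank_reconstruct(w, energy_threshold=0.9):
    """Truncated-SVD reconstruction keeping the smallest k whose
    cumulative squared singular values first exceed the threshold."""
    # s is returned in descending order; vt holds right singular vectors as rows.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    energy = s ** 2
    ratio = np.cumsum(energy) / energy.sum()
    exceeds = ratio > energy_threshold
    # Assumption: fall back to full rank if the threshold is never exceeded.
    k = int(np.argmax(exceeds)) + 1 if exceeds.any() else len(s)
    w_low = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    return w_low, k

# Hypothetical weight matrix with one negligible direction.
w = np.diag([3.0, 1.0, 0.1])
w_low, k = low_rank_reconstruct(w, energy_threshold=0.9)
print(k, np.linalg.matrix_rank(w_low))
```

The reconstructed matrix has the same shape as the original weight matrix but rank at most k. A speedup in practice would come from keeping the truncated factors separate rather than multiplying them back together, which the claim does not describe.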

4. The method according to claim 1, characterized in that inputting the target word sequence into the current Transformer layer and obtaining the target output feature vector through matrix multiplication and self-attention calculation using the low-rank weight matrix, thereby accelerating the inference of the current Transformer layer, includes: inputting the target word sequence into the self-attention calculation unit of the current Transformer layer, and linearly transforming the target word sequence using the low-rank weight matrix to obtain a query matrix, a key matrix, and a value matrix; multiplying the query matrix by the transpose of the key matrix to obtain an attention matrix, and normalizing the attention matrix to obtain a standard attention matrix; multiplying the standard attention matrix by the value matrix to obtain a context feature matrix; multiplying the context feature matrix by a preset output projection matrix to obtain the target output feature vector; and using the target output feature vector as the input vector of the next Transformer layer to accelerate the inference of the current Transformer layer.
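A single-head version of the attention steps in claim 4 might look like the sketch below. The claim only says the attention matrix is "normalized"; the row-wise softmax and the division by the square root of the key dimension are the usual Transformer conventions and are assumptions here, as are the function name `self_attention` and all shapes.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, w_out):
    """x: (n, d) target word sequence; w_q/w_k/w_v: (d, h); w_out: (h, d_out)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # linear transforms
    scores = q @ k.T / np.sqrt(q.shape[-1])    # assumption: scaled dot product
    # Assumption: normalization = numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    context = weights @ v                      # context feature matrix
    return context @ w_out, weights            # output feeds the next layer

# Hypothetical shapes: 4 tokens, model width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v, w_out = (rng.normal(size=(8, 8)) for _ in range(4))
out, weights = self_attention(x, w_q, w_k, w_v, w_out)
print(out.shape)
```

Because token merging shortens the sequence, the attention matrix shrinks quadratically with the number of merged pairs, which is where most of the saving in this step would come from.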

5. The method according to claim 4, characterized in that linearly transforming the target word sequence using the low-rank weight matrix to obtain the query matrix, the key matrix, and the value matrix includes: dividing the low-rank weight matrix, in column index order, into a first submatrix, a second submatrix, and a third submatrix with the same number of column vectors; and multiplying the target word sequence by the first submatrix, the second submatrix, and the third submatrix, respectively, to obtain the query matrix, the key matrix, and the value matrix.
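The column-wise split in claim 5 corresponds to the familiar fused QKV weight layout. A minimal sketch, assuming the low-rank weight matrix has shape (d, 3h) and using the hypothetical name `split_qkv`:

```python
import numpy as np

def split_qkv(x, w_low_rank):
    """Split a fused (d, 3h) weight matrix by column order and project x."""
    if w_low_rank.shape[1] % 3 != 0:
        raise ValueError("column count must be divisible by 3")
    # First, second, and third submatrices with equal numbers of columns.
    w_q, w_k, w_v = np.split(w_low_rank, 3, axis=1)
    return x @ w_q, x @ w_k, x @ w_v

# Hypothetical shapes: 3 tokens of width 4, head width 2.
x = np.ones((3, 4))
w = np.arange(24, dtype=float).reshape(4, 6)
q, k, v = split_qkv(x, w)
print(q.shape, k.shape, v.shape)
```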

6. The method according to claim 1, characterized in that obtaining the inter-layer similarity by calculating the cosine of the angle between the output feature vectors includes: performing a vector dot product between the output feature vector of the current Transformer layer and the output feature vector of the previous Transformer layer to obtain a dot product value; multiplying the magnitude of the output feature vector of the current Transformer layer by the magnitude of the output feature vector of the previous Transformer layer to obtain a magnitude product value; and dividing the dot product value by the magnitude product value to obtain the cosine of the included angle, which is used as the inter-layer similarity.
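The three steps of claim 6 map directly onto a few lines of NumPy. The function name `interlayer_similarity` and the small `eps` guard against division by zero are assumptions; the claim does not address zero-magnitude vectors.

```python
import numpy as np

def interlayer_similarity(h_curr, h_prev, eps=1e-12):
    """Cosine of the angle between two layers' output feature vectors."""
    dot = float(np.dot(h_curr, h_prev))                       # dot product value
    norm_product = float(np.linalg.norm(h_curr) * np.linalg.norm(h_prev))
    return dot / max(norm_product, eps)                       # eps: assumption

# Parallel vectors give a similarity close to 1; orthogonal vectors give 0.
print(interlayer_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))
print(interlayer_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```

A value near 1 indicates that the current layer changed the representation very little relative to the previous layer, which is the redundancy signal the method relies on.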

7. A large model inference acceleration system, characterized by comprising: an acquisition module, configured to acquire the output feature vectors of the current Transformer layer and the previous Transformer layer, as well as the cache miss rate of the cache memory, during long text inference of the large model; a calculation module, configured to obtain the inter-layer similarity by calculating the cosine of the angle between the output feature vectors; and a decomposition module, configured to reconstruct a low-rank weight matrix, when the cache miss rate is greater than a preset congestion threshold and the inter-layer similarity is in the redundancy interval, by performing singular value decomposition on the pre-trained weight matrix of the current Transformer layer and retaining the vectors corresponding to the k largest singular values, where the redundancy interval is obtained by cluster analysis of the historical similarities of the current Transformer layer and k is a positive integer. The calculation module is further configured to map, through token merging, each word in the input word sequence of the current Transformer layer to a low-dimensional subspace defined by the low-rank weight matrix, calculate the Euclidean distance of each word in the low-dimensional subspace, and merge adjacent word pairs whose Euclidean distance is less than a preset distance threshold to obtain the target word sequence.
Specifically, the calculation module is configured to: define the row vector space of the low-rank weight matrix as the low-dimensional subspace; multiply the feature vector of each word in the input word sequence of the current Transformer layer by the low-rank weight matrix to obtain the projection vector of that word in the low-dimensional subspace; subtract the projection vectors of each adjacent word pair in the input word sequence to obtain a difference vector; take the magnitude of the difference vector as the Euclidean distance between the adjacent word pair in the low-dimensional subspace; when the Euclidean distance is less than the preset distance threshold, average the feature vectors of the adjacent word pair in the input word sequence to obtain a merged word; and replace the corresponding adjacent word pair in the input word sequence with the merged word to obtain the target word sequence. The calculation module is further configured to input the target word sequence into the current Transformer layer and perform matrix multiplication and self-attention calculation using the low-rank weight matrix to obtain the target output feature vector, so as to accelerate the inference of the current Transformer layer.
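The condition under which the decomposition module acts can be written as a small predicate. This is only a sketch: the name `should_accelerate`, the example threshold values, and the treatment of the redundancy interval as closed at both ends are assumptions, since the claim only says the similarity is "in" the interval.

```python
def should_accelerate(cache_miss_rate, similarity,
                      congestion_threshold, redundancy_interval):
    """True when both the hardware and the feature-redundancy conditions hold."""
    low, high = redundancy_interval
    congested = cache_miss_rate > congestion_threshold   # strictly greater
    redundant = low <= similarity <= high                # assumption: closed interval
    return congested and redundant

# Hypothetical values: 35% miss rate, similarity 0.93, interval [0.90, 0.95].
print(should_accelerate(0.35, 0.93, 0.30, (0.90, 0.95)))
```

When the predicate is false, the layer would presumably run with its original weights and unmerged sequence, although the claims do not spell out that path.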

8. An electronic device, characterized by comprising: a memory, configured to store a computer program; and a processor, configured to implement the steps of the large model inference acceleration method according to any one of claims 1 to 6 when executing the computer program.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the large model inference acceleration method according to any one of claims 1 to 6.

Citation Information

Patent Citations

Large language model reasoning method and device, electronic equipment and storage medium
CN120633822A
Inference acceleration method and device of pre-training model and electronic equipment
CN121168674A
