Method and apparatus for modifying architecture of large language model

WO2026127642A1 | PCT designated stage | Publication Date: 2026-06-18 | LG ELECTRONICS INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: LG ELECTRONICS INC
Filing Date: 2025-12-10
Publication Date: 2026-06-18

Abstract

According to at least one embodiment, a computer-implemented method for modifying the architecture of a large language model (LLM) comprises a step of compressing an embedding layer of the LLM so as to reduce the size of a parameter space of the LLM, wherein the embedding layer has an embedding dimension of n, and the step of compressing the embedding layer includes a step of using a first intermediate mapping for mapping a token to an m-dimensional vector, m being less than n. The method further comprises a step of compressing a plurality of transformer layers of the LLM so as to further reduce the size of the parameter space of the LLM.

Description

Method and apparatus for modifying the architecture of a large-scale language model

[0001] The present disclosure relates to a method for modifying the architecture of a large-scale language model.

[0002] Transformers are used in artificial intelligence (AI) models to capture long-range dependencies while processing sequential or structured data such as text, code, images, and audio. In the context of the Transformer architecture, the term parameter space refers to all the learnable weights and biases of a model. These are the numerical values that the model learns in order to perform tasks such as language understanding, translation, and text generation.

[0003] AI models utilizing large-scale transformer architectures can have parameter counts ranging from roughly 7 billion to as many as 70 billion. Such a large number of parameters yields generalized multi-expert large-scale language models (LLMs) that can offer better support and context-rich information to users.

[0004] However, such a large number of parameters makes edge deployment of these models difficult, because the memory and computational resources of edge devices are limited.

[0005] An object of the present disclosure may be to compress a transformer architecture used in modern deep learning models, such as LLMs, by utilizing embedding layer compression, tensor decomposition, low-rank decomposition, and adaptive transformer block pruning.

[0006] Another object of the present disclosure may be to compress the model architecture by decomposing weight tensors and replacing a transformer layer (e.g., the least impactful transformer layer) with a small adapter.

[0007] According to one or more aspects, this is achieved by moving the problem to a smaller parameter space, which prevents under-determination of the task and enables co-optimization that converges to a solution that solves the task. When considering a sufficiently specific task, under-determination of the task may occur if the parameter space is large.

[0008] A further object of the present disclosure may be to improve the ratio between the size of the parameter space and the size of the task design space (or selection space).

[0009] According to at least one embodiment, a computer-implemented method for modifying the architecture of a large-scale language model (LLM) comprises the step of compressing an embedding layer of the LLM to reduce the size of the parameter space of the LLM, wherein the embedding layer has an embedding dimension of n, and the step of compressing the embedding layer comprises the step of using a first intermediate mapping configured to map tokens to an m-dimensional vector, where m is smaller than n. The method further comprises the step of compressing a plurality of transformer layers of the LLM to further reduce the size of the parameter space of the LLM.

[0010] According to at least one embodiment, an artificial intelligence (AI) device is configured to modify the architecture of a large-scale language model (LLM). The AI device comprises one or more transceivers; and one or more processors that compress an embedding layer of the LLM to reduce the size of the parameter space of the LLM, wherein the embedding layer has an embedding dimension of n and utilizes a first intermediate mapping configured to map a token to an m-dimensional vector, where m is smaller than n, and that compress a plurality of transformer layers of the LLM to further reduce the size of the parameter space of the LLM.

[0011] According to at least one embodiment, a non-transitory storage medium stores instructions that, when executed, cause at least one processor to perform operations. The operations include: compressing an embedding layer of a large-scale language model (LLM) to reduce the parameter space size of the LLM, wherein the embedding layer has an embedding dimension of n, and the step of compressing the embedding layer includes utilizing a first intermediate mapping configured to map a token to an m-dimensional vector, where m is smaller than n; and compressing a plurality of transformer layers of the LLM to further reduce the parameter space size of the LLM.

[0012] According to an embodiment of the present disclosure, through a multi-stage approach, a pipeline can achieve an average compression rate of more than 10 times while maintaining the accuracy of the latest model.

[0013] According to an embodiment of the present disclosure, a much higher compression ratio can be achieved by retraining the model architecture to fit a specific subtask in a much smaller parameter space (e.g., retraining an LLM that operates with fewer parameters so that accuracy is recovered).

[0014] The accompanying drawings, which are included to provide a further understanding of this disclosure, illustrate embodiments of this disclosure and, together with the description, help to explain various aspects of this disclosure.

[0015] FIG. 1 illustrates a block diagram of a large-scale language model (LLM) architecture according to at least one embodiment.

[0016] FIG. 2 illustrates a block diagram of an artificial intelligence (AI) server according to at least one embodiment.

[0017] FIG. 3a illustrates a reconstructed form of embedding learning according to at least one embodiment.

[0018] FIG. 3b shows a simplified example of a linear function containing a two-dimensional matrix.

[0019] FIG. 4 illustrates an example calculation of the compression ratio achieved through the reconstruction of FIG. 3a.

[0020] FIG. 5 shows a representation of applying tensor train decomposition to the matrix of the constructed embeddings.

[0021] FIG. 6 shows an example calculation of a further improved compression ratio using tensor train decomposition.

[0022] FIG. 7 illustrates an exemplary factorization of a tensor having the shape n_1 × n_2 × n_3.

[0023] FIG. 8 illustrates an exemplary factorization of a tensor having the shape n_1 × n_2.

[0024] FIG. 9 shows an example of using matrix-wise training for three decomposed tensor cores.

[0025] FIG. 10 shows an example of using segment-wise training for three decomposed tensor cores.

[0026] FIG. 11a illustrates a block diagram of a transformer layer according to at least one embodiment.

[0027] FIG. 11b illustrates a block diagram of a low-rank adaptation according to at least one embodiment.

[0028] FIG. 12 illustrates a flowchart of a method 1200 for modifying the architecture of an LLM according to at least one embodiment.

[0029] Hereinafter, specific embodiments of the present invention will be described in more detail with reference to the drawings.

[0030] FIG. 1 illustrates a block diagram of a large-scale language model (LLM) architecture according to at least one embodiment.

[0031] LLM architectures are typically built around Transformers, which are neural network designs specialized in understanding and generating sequences such as text. Before text is input into the model, it is broken down into tokens such as whole words, subwords, characters, or word fragments. Each of these tokens is mapped to an integer ID.

[0032] The token ID is input into the embedding layer 102. In the embedding layer 102, the token ID is converted into a continuous vector (embedding).

[0033] The output of learning an embedding layer is essentially a lookup table (LUT) of dimension n × V. Here, V represents the size of the vocabulary under consideration, and n represents the dimension of the embedding layer. Each member (or word) of the vocabulary, i.e., each token, is represented as an n-dimensional vector. The size of the LUT also corresponds to the number of embedding parameters of the embedding layer 102.

[0034] Two main types of embeddings, token embeddings and position embeddings, can be added to each continuous vector. Token embeddings represent the identity or meaning of a token. Position embeddings provide the model with information about word order.

[0035] Referring again to Fig. 1, the continuous vectors are output to the transformer layer (or transformer block) 104.

[0036] Transformer layer 104 can be viewed as the core of the LLM. For simplification, a single transformer layer 104 is shown in Fig. 1. However, a typical LLM can have tens or hundreds of transformer layers stacked on top of each other.

[0037] Each transformer layer 104 includes a self-attention mechanism 106 and a feed-forward network (FFN) 110.

[0038] In the self-attention mechanism 106, every token attends to (or "pays attention to") all other tokens in the sequence. The self-attention mechanism 106 calculates contextual relationships and determines which parts of the text are related to each other. This calculation uses trainable Q (query), K (key), and V (value) matrices. The query (Q) represents what a specific token is looking for, the key (K) represents what the token provides, and the value (V) represents the information being conveyed. In the case of dense bidirectional attention, each token attends to all other tokens, providing a contextualized representation for each token.
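The Q, K, and V computation described above can be illustrated with a minimal NumPy sketch. It assumes the standard scaled dot-product form of dense bidirectional attention; the scaling by the square root of the key dimension, the function names, and all sizes are assumptions chosen for illustration and are not taken from the disclosure.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Dense bidirectional self-attention over a sequence x of shape (T, n)."""
    q = x @ w_q  # what each token is looking for
    k = x @ w_k  # what each token provides
    v = x @ w_v  # the information each token conveys
    d_k = q.shape[-1]
    # (T, T) matrix: row t holds how strongly token t attends to every token.
    weights = softmax(q @ k.T / np.sqrt(d_k))
    return weights @ v, weights

# Hypothetical sizes: 5 tokens, embedding dimension 16, head dimension 8.
rng = np.random.default_rng(0)
T, n, d_k = 5, 16, 8
x = rng.normal(size=(T, n))
w_q, w_k, w_v = (rng.normal(size=(n, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to one, so every output row is a weighted average of the value vectors of all tokens in the sequence.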

[0039] Multi-head attention 107 is a core mechanism within Transformer models that allows the model to examine multiple parts of the input simultaneously in multiple ways. Multi-head attention 107 is an extension of self-attention designed to enhance the model's ability to capture complex relationships.

[0040] Multi-head attention 107 can be viewed as including multiple self-attention layers in parallel. Instead of performing self-attention only once, the model performs it multiple times in parallel. Each "head" has its own learned Q, K, and V projection matrices.

[0041] "Add & Norm" 108 refers to two operations, residual addition and layer normalization, that are applied together after major sublayers such as multi-head attention and feed-forward networks. "Add & Norm" 108 keeps deep Transformers stable, trainable, and efficient.

[0042] In "Add & Norm" 108, the Transformer performs Add (residual connection) and Norm (layer normalization) after a sublayer such as the self-attention mechanism 106. Add refers to a shortcut (skip) connection that preserves the original information, helps gradients flow through the deep neural network, and mitigates vanishing gradients. Norm rescales and re-centers activations so that they remain numerically stable during training.

[0043] Therefore, the operation of "Add & Norm" 108 prevents information loss between layers, enables very deep models, helps the network learn modifications rather than the entire transformation, and improves learning stability.
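The two operations can be sketched in a few lines of NumPy. This is a simplified illustration that assumes normalization over the last (feature) axis and omits the learned gain and bias that layer normalization usually carries; the function names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Re-center and rescale each token vector along its feature axis.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Add: the shortcut keeps the original input x alongside the sublayer output.
    # Norm: the sum is normalized before it is passed to the next sublayer.
    return layer_norm(x + sublayer(x))

# Hypothetical sublayer: a small linear map standing in for attention or the FFN.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 8)) * 0.1
y = add_and_norm(x, lambda h: h @ w)
```

Because the sublayer output is added to `x` rather than replacing it, the sublayer only has to learn a modification of its input.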

[0044] In FFN 110, a set of multilayer perceptrons (MLPs) is applied independently to each token. This expands and contracts hidden dimensions, generating richer transformations.

[0045] FFN 110 transforms the hidden representation by processing each token independently. Unlike attention, which mixes information between tokens, FFN 110 applies the same neural network to all tokens.
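A minimal sketch of such a position-wise feed-forward network follows. The ReLU activation, the two-layer shape, and the sizes are assumptions for illustration; the disclosure does not fix them.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Expand each token vector to the hidden width, apply ReLU, then contract.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

# Hypothetical sizes: embedding dimension 8, hidden width 32, 3 tokens.
rng = np.random.default_rng(0)
n, d_ff = 8, 32
w1, b1 = rng.normal(size=(n, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, n)), np.zeros(n)
x = rng.normal(size=(3, n))

y = ffn(x, w1, b1, w2, b2)
# The same weights are applied to every token, so processing the first
# token on its own gives the same vector as the first row of the batch.
first_alone = ffn(x[:1], w1, b1, w2, b2)
```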

[0046] Similar to "Add & Norm" 108, which is applied after Multi-Head Attention 107, "Add & Norm" 112 is applied after FFN 110. The operation of "Add & Norm" 112 is similar to the operation described earlier with reference to "Add & Norm" 108.

[0047] FIG. 2 illustrates a block diagram of an artificial intelligence (AI) server according to at least one embodiment.

[0048] FIG. 2 illustrates a block diagram of an AI server 20 according to at least one embodiment of the present disclosure. As shown in FIG. 2, the AI server 20 is connected to an AI device 10.

[0049] The AI server 20 may refer to a device that uses a machine learning algorithm to train an artificial neural network (ANN) (e.g., the LLM of FIG. 1) or that uses a trained artificial neural network. The AI server 20 may include multiple servers that perform distributed processing and may be defined as a 5G network. The AI server 20 may be included as part of the configuration of the AI device 10 and may perform at least a part of the AI processing together with the AI device 10.

[0050] The AI server 20 may include a communication interface 21, memory 23, a learning processor 24, a processor 26, etc.

[0051] The communication interface 21 can transmit data to and receive data from an external device such as the AI device 10.

[0052] Memory 23 may include a model storage device 23a. The model storage device 23a may store a model (or ANN 26b) that is being trained or has been trained through the learning processor 24.

[0053] The learning processor 24 can train ANN 26b using training data. The trained model may be used while installed on the AI server 20, or while installed on an external device such as the AI device 10.

[0054] The trained model may be implemented in hardware, software, or a combination of hardware and software. If all or part of the trained model is implemented in software, one or more instructions constituting the trained model may be stored in memory 23.

[0055] Processor 26 can use the trained model to infer result values for new input data and generate responses or control commands based on the inferred result values.

[0056] FIG. 3a illustrates a reconstructed form of embedding learning according to at least one embodiment. Such embedding learning can be performed in the embedding layer 102 of FIG. 1.

[0057] The embedding space is generally expressed as the n-dimensional real vector space R^n. Here, n represents the embedding dimension (e.g., 768, 1024, 4096, etc.). Each token is mapped to a vector in this n-dimensional space.

[0058] According to aspects of the present disclosure, embedding learning is reconfigured to reduce or compress the number of embedding parameters. Such compression may use function composition of well-selected maps.

[0059] According to at least one embodiment, an intermediate mapping is used to process a vector of smaller dimensions when mapping tokens. This smaller dimension, referred to herein as width m, is smaller than the embedding dimension n. According to various embodiments, width m is much smaller than the embedding dimension n. For example, according to at least one additional embodiment, m is an integer less than or equal to 10. As another example, m is an integer less than or equal to 3 (e.g., m is equal to 3, 2, or 1).

[0060] According to at least one embodiment, an embedding map defined by the composition of two maps is used. The first map maps tokens to an m-dimensional vector. As mentioned earlier, the width m can be much smaller than the embedding dimension n.

[0061] The second map extends the m-dimensional vector back to the embedding dimension n. According to at least one embodiment, the second map is defined as a composition of a linear function and k non-linear functions, indexed by i = 1, 2, …, k. (Here, the term "function" is understood to mean matrices and activation functions.) For example, the second map can be defined as the k non-linear functions applied in sequence, followed by the linear function.

[0062] Each non-linear function (or map) can be defined as ReLU(Wx + b). Here, the sizes of W and b follow the input and output dimensions of that function, and the value of m1 is equal to the width m. Also, x represents the input vector, W represents the weight matrix, and b represents the bias value.

[0063] In the context of LLM, ReLU (Rectified Linear Unit) is a type of activation function used in neural networks and is defined as ReLU(z) = max(0, z).

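[0064] Referring to FIG. 3a, the first non-linear function maps the m-dimensional vector to a vector of the next intermediate dimension. Likewise, the second non-linear function maps that vector to a vector of the following intermediate dimension, and so on up to the last non-linear function.

[0065] The last function is linear and corresponds to a weighted sum, which maps the output of the last non-linear function to the n-dimensional embedding vector.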

[0066] Figure 3b shows a simplified example of a linear function containing a two-dimensional matrix.

[0067] It is understood that various parameters can be tuned during the disclosed reconstruction process. These parameters include the width m, the intermediate dimensions, and the k non-linear functions.
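The composed embedding can be illustrated with a short NumPy sketch: a small lookup table maps a token to an m-dimensional vector, a chain of ReLU layers widens it, and a final linear map produces the n-dimensional embedding. The vocabulary size, the widths, the random initialization, and all function names are hypothetical choices made for this example, and the parameter count covers only this toy configuration.

```python
import numpy as np

def init_params(vocab_size, widths, n, seed=0):
    """widths = [m, second width, ...]; widths[0] is the width m."""
    rng = np.random.default_rng(seed)
    first_map = rng.normal(size=(vocab_size, widths[0]))  # token -> m-dim vector
    hidden = [(rng.normal(size=(d_out, d_in)), np.zeros(d_out))
              for d_in, d_out in zip(widths[:-1], widths[1:])]
    last = rng.normal(size=(n, widths[-1]))               # final linear map
    return first_map, hidden, last

def embed(token_id, first_map, hidden, last):
    x = first_map[token_id]
    for w, b in hidden:
        x = np.maximum(0.0, w @ x + b)  # ReLU(Wx + b)
    return last @ x                     # weighted sum up to dimension n

def count_params(first_map, hidden, last):
    return first_map.size + sum(w.size + b.size for w, b in hidden) + last.size

# Hypothetical configuration: V = 32000 tokens, n = 4096, m = 3.
V, n = 32000, 4096
widths = [3, 64, 256]
params = init_params(V, widths, n)
vec = embed(123, *params)

dense_params = V * n                      # plain n x V lookup table
composed_params = count_params(*params)   # composed embedding
ratio = dense_params / composed_params
```

In this toy configuration most of the remaining parameters sit in the final linear map, which is why a further decomposition of the largest matrix is a natural next step.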

[0068] Figure 4 illustrates an example calculation of the compression ratio achieved through the reconstruction of Figure 3a.

[0069] According to at least one embodiment, in addition to utilizing the composed embeddings of the disclosed reconstruction, tensor train decomposition is utilized to further compress the embedding layer parameters. For example, tensor train decomposition may be applied to a larger (or the largest) matrix of the embeddings. Applying tensor train decomposition to this matrix can further improve the compression ratio while maintaining excellent accuracy.

[0070] In the embedding layer, tensor train decomposition decomposes a large matrix into a chain of smaller 3D tensors, referred to as tensor train (TT) cores. The matrix is then represented by the product of these cores.

[0071] In this regard, the embedding index i is expressed in the form of a multi-index across multiple modes. The embedding vector dimension j can also be factored. The TT-rank controls compression, and a lower rank results in a higher compression ratio. Therefore, a larger dense matrix can be replaced with a smaller tensor sequence.
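The effect of the TT-rank on the parameter count can be sketched with plain arithmetic. The example assumes the common TT-matrix layout, in which core k has shape r_(k-1) × (I_k · J_k) × r_k, the row index i is factored into the modes I_k, the column index j into the modes J_k, and r_0 = r_d = 1. The factorizations and ranks below are hypothetical.

```python
from math import prod

def tt_matrix_params(row_factors, col_factors, ranks):
    """Parameter count of a TT-matrix with the given inner TT-ranks."""
    full_ranks = [1, *ranks, 1]  # boundary ranks are fixed to 1
    return sum(full_ranks[k] * i * j * full_ranks[k + 1]
               for k, (i, j) in enumerate(zip(row_factors, col_factors)))

# Hypothetical 4096 x 256 matrix: 4096 = 16*16*16 and 256 = 4*8*8.
row_factors = [16, 16, 16]
col_factors = [4, 8, 8]

dense_params = prod(row_factors) * prod(col_factors)
tt_params_rank8 = tt_matrix_params(row_factors, col_factors, ranks=[8, 8])
tt_params_rank4 = tt_matrix_params(row_factors, col_factors, ranks=[4, 4])
```

Halving the ranks from 8 to 4 shrinks the middle core by a factor of four, which shows why a lower TT-rank gives a higher compression ratio.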

[0072] Figure 5 shows the representation of applying tensor train decomposition to the matrix of the constructed embeddings.

[0073] Figure 6 shows an example calculation of the further improved compression ratio (e.g., compared to the compression ratio in Figure 4) achieved by utilizing tensor train decomposition.

[0074] Now, the compression of a transformer layer (e.g., transformer layer 104 of FIG. 1) will be described with reference to various embodiments. According to at least one embodiment, the compression of a transformer layer includes applying tensor train decomposition (or tensor decomposition) and performing transformer layer pruning. The two processes are combined to reduce the number of parameters according to the presented task.

[0075] When applied to the Transformer layer, large weight matrices within the Transformer layer can be compressed using Tensor Train Decomposition. The method for decomposing a larger tensor T into smaller 3D tensor sequences is explained in more detail below.

[0076] The tensor-train decomposition of a tensor T of shape n_1 × n_2 × … × n_d can be expressed as T(i_1, i_2, …, i_d) = G_1[i_1] G_2[i_2] … G_d[i_d]. Here, G_k is a 3D tensor core of shape r_(k-1) × n_k × r_k, G_k[i_k] is its r_(k-1) × r_k matrix slice, and r_k is called the TT-rank, which controls the computation size.

[0077] FIG. 7 illustrates an exemplary factorization of a tensor having the shape n_1 × n_2 × n_3. FIG. 8 illustrates an exemplary factorization of a tensor having the shape n_1 × n_2.

[0078] It is understood that r_0 = r_d = 1, so that the product yields a scalar result. A common technique for decomposition is sequential singular value decomposition (SVD), but the compression of the transformer layer according to the embodiments disclosed herein generates a new tensor from scratch for retraining.
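For reference, the sequential SVD technique mentioned above can be sketched in NumPy as follows. This illustrates the common SVD-based procedure only; it is not the from-scratch generation and retraining described in the disclosure. The function names, the single `max_rank` cap, and the tensor sizes are assumptions for this example.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into TT cores of shape (r_prev, n_k, r_next)."""
    shape = tensor.shape
    cores = []
    rank_prev = 1
    remainder = tensor
    for n_k in shape[:-1]:
        # Unfold so that rows combine the incoming rank and the current mode.
        remainder = remainder.reshape(rank_prev * n_k, -1)
        u, s, vt = np.linalg.svd(remainder, full_matrices=False)
        rank = min(max_rank, len(s))
        cores.append(u[:, :rank].reshape(rank_prev, n_k, rank))
        # Carry the singular values forward into the rest of the chain.
        remainder = s[:rank, None] * vt[:rank]
        rank_prev = rank
    cores.append(remainder.reshape(rank_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the chain of cores back into the full tensor."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.squeeze(axis=(0, -1))  # drop the boundary ranks r_0 = r_d = 1

rng = np.random.default_rng(0)
tensor = rng.normal(size=(4, 5, 6))

exact_cores = tt_svd(tensor, max_rank=30)     # cap is never reached: no truncation
rebuilt = tt_reconstruct(exact_cores)
truncated_cores = tt_svd(tensor, max_rank=2)  # lower ranks, fewer parameters
truncated_shapes = [core.shape for core in truncated_cores]
```

With an untruncated rank the cores reproduce the tensor up to floating-point error; lowering `max_rank` trades reconstruction accuracy for fewer parameters.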

[0079] For example, the pipeline decomposes the weight matrix selected in the manner described above using tensor-train decomposition, generating three low-rank tensors. The TT-rank used for these tensors is configurable. Matrix-wise training or segment-wise training can be utilized to retrain the decomposed tensor cores.

[0080] Figure 9 shows an example of applying matrix-wise training to three decomposed tensor cores (decomposed tensor cores 902, 904, and 906). In this example, training epochs are executed after each weight matrix is updated, to ensure consistent changes after decomposition. Since the remaining weights of the model are fixed, the decomposed weights can replicate the function of the original tensor as closely as possible.

[0081] Figure 10 shows an example of applying segment-wise training to three decomposed tensor cores (decomposed tensor cores 1002, 1004, and 1006). Here, the term "segment" refers to a set of successive transformer layers. Segment-wise training executes a training epoch after all weight matrices of transformer layer 1000 have been updated, and then executes the retraining pipeline after the entire segment has been decomposed. Compared to matrix-wise training, segment-wise training requires significantly fewer training epochs while maintaining accuracy. Therefore, segment-wise training may be preferable unless significant fluctuations or a decrease in accuracy due to the loss of each layer's specialization are observed.

[0082] As previously mentioned regarding transformer layer compression, in addition to applying tensor train decomposition, transformer layer pruning can be performed to reduce the number of parameters according to the presented task. According to at least one embodiment, to better fit a large model to a dataset, one or more transformer layers deemed to have less influence are completely replaced with low-rank adaptations.

[0083] For example, each of one or more transformer layers is designated for replacement based on how sensitive the performance of the LLM is to that transformer layer. Sensitivity may relate to the impact on the model's overall accuracy and performance if the transformer layer is replaced. A transformer layer may be designated for replacement if it is determined to have lower sensitivity than other transformer layers.
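One way to picture this selection is the sketch below. It assumes that sensitivity is measured as the drop in a benchmark score when a single layer is replaced, and that a fixed number of the least sensitive layers is chosen. Both the definition and the scores are hypothetical illustrations, not measured results.

```python
def rank_layers_by_sensitivity(baseline_score, score_with_layer_replaced):
    """Return layer indices ordered from least to most sensitive.

    score_with_layer_replaced maps a layer index to the benchmark score
    obtained when only that layer is replaced.
    """
    sensitivity = {layer: baseline_score - score
                   for layer, score in score_with_layer_replaced.items()}
    return sorted(sensitivity, key=sensitivity.get)

def select_layers_to_replace(baseline_score, score_with_layer_replaced, budget):
    # Keep the `budget` layers whose replacement hurts the score the least.
    ranked = rank_layers_by_sensitivity(baseline_score, score_with_layer_replaced)
    return ranked[:budget]

# Made-up scores for a four-layer model, for illustration only.
scores = {0: 0.41, 1: 0.60, 2: 0.62, 3: 0.55}
chosen = select_layers_to_replace(0.63, scores, budget=2)
```

In practice the score would come from running an evaluation dataset on the modified model, which is the expensive part that this sketch leaves out.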

[0084] FIG. 11a shows a block diagram of transformer layer 1102, which is considered to have less impact. Transformer layer 1102 may be similar to transformer layer 104, which was previously described with reference to FIG. 1.

[0085] During transformer layer pruning, transformer layer 1102 is completely replaced by the low-rank adaptation 1104 of FIG. 11b. According to at least one embodiment, the low-rank adaptation 1104 takes the form of a gated multilayer perceptron (gated MLP). Alternatively, the low-rank adaptation may take the form of a pair of low-rank matrices. The same segment-wise training may be performed to maintain the characteristics of transformer layer 1102. Since transformer layer 1102 is completely replaced by the low-rank adaptation 1104, matrix-wise training is not considered.
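The two adapter forms can be sketched in NumPy as follows. The residual connection around each adapter, the sigmoid gate, the zero initialization, and all sizes are assumptions made for this illustration; the disclosure does not specify them.

```python
import numpy as np

def low_rank_adapter(x, a, b):
    """Pair of low-rank matrices: a has shape (n, r), b has shape (r, n)."""
    return x + (x @ a) @ b  # residual connection is an assumption

def gated_mlp_adapter(x, w_gate, w_up, w_down):
    """Gated MLP: a sigmoid gate modulates the up-projection element-wise."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))
    return x + (gate * (x @ w_up)) @ w_down

# Hypothetical sizes: hidden dimension 64, adapter rank 4, 3 tokens.
n, r = 64, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, n))

a = rng.normal(size=(n, r)) * 0.01
b = np.zeros((r, n))  # zero init: the adapter starts out as the identity map
y = low_rank_adapter(x, a, b)

w_gate = rng.normal(size=(n, r)) * 0.01
w_up = rng.normal(size=(n, r)) * 0.01
w_down = rng.normal(size=(r, n)) * 0.01
z = gated_mlp_adapter(x, w_gate, w_up, w_down)

adapter_params = a.size + b.size  # 2 * n * r
dense_square_params = n * n       # one dense n x n matrix, for comparison
```

Even a single dense n × n weight matrix holds more parameters than the whole low-rank pair here, which is what makes replacing an entire layer with such an adapter attractive for compression.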

[0086] FIG. 12 illustrates a flowchart of a method 1200 for modifying the architecture of an LLM according to at least one embodiment.

[0087] In block 1202, the embedding layer of the LLM is compressed to reduce the parameter space size of the LLM. The embedding layer has an embedding dimension of n. Compressing the embedding layer (e.g., embedding layer 102 in FIG. 1) involves utilizing a first intermediate mapping configured to map tokens to an m-dimensional vector (where m is smaller than n).

[0088] For example, as explained earlier with reference to Fig. 3a, an intermediate mapping is used to process smaller-dimensional vectors when mapping tokens. The first mapping maps tokens to m-dimensional vectors.

[0089] According to another embodiment, m represents an integer less than or equal to 10.

[0090] According to another embodiment, m represents an integer less than or equal to 3.

[0091] According to another embodiment, the step of compressing the embedding layer further includes the step of using a second intermediate mapping, wherein the second intermediate mapping is configured to map an m-dimensional vector to an n-dimensional vector. The second intermediate mapping may be based on the composition of a linear function and a plurality of non-linear functions.

[0092] For example, as explained earlier with reference to Fig. 3a, the second map extends the m-dimensional vector back to the embedding dimension n. According to at least one embodiment, the second map is defined as a composition of a linear function and non-linear functions indexed by i = 1, 2, …, k, that is, the non-linear functions applied in sequence followed by the linear function.

[0093] According to an additional embodiment, the step of compressing the embedding layer further includes the step of applying tensor train decomposition to one or more larger matrices of a linear function and a plurality of non-linear functions.

[0094] For example, as shown in Fig. 5, tensor train decomposition is applied to the matrix of the constructed embedding.

[0095] Referring again to Fig. 12, in block 1204, multiple transformer layers of the LLM are compressed to further reduce the size of the LLM's parameter space.

[0096] According to additional embodiments, compressing a plurality of transformer layers includes performing tensor train decomposition and performing transformer layer pruning.

[0097] For example, as previously described with reference to Figures 7, 8, 9, 10, 11a and 11b, compression of the transformer layer involves applying tensor train decomposition (or tensor decomposition) and performing transformer layer pruning.

[0098] According to an additional embodiment, performing tensor decomposition generates a new tensor for retraining based on matrix-wise training (e.g., see FIG. 9) or segment-wise training (e.g., see FIG. 10).

[0099] According to another embodiment, the step of performing transformer layer pruning includes replacing one of the plurality of transformer layers with a coarse-granularity adapter. The coarse-granularity adapter may be based on a gated MLP or a pair of low-rank nonlinear functions. Each of two or more of the plurality of transformer layers may be replaced with a respective coarse-granularity adapter.

[0100] For example, as previously explained with reference to FIGS. 11a and 11b, during transformer layer pruning, transformer layer 1102 of FIG. 11a is completely replaced by the low-rank adaptation 1104 of FIG. 11b.

[0101] According to another embodiment, each of two or more transformer layers is identified as a replacement target based on how sensitive the performance of the LLM, as evaluated on a common evaluation dataset, is to that transformer layer. Examples of such datasets include, but are not limited to, Massive Multitask Language Understanding (MMLU), Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), CommonsenseQA (CSQA), WinoGrande, etc.

[0102] The aspects and features described herein with reference to various embodiments relate to compressing an embedding layer and a transformer layer together as a compression methodology. For example, after the embedding layer is compressed so that the size of the parameter space is reduced, the transformer layer is retrained with consideration for a specific task to be processed. As previously described with reference to various embodiments, retraining may be performed through matrix-wise training or segment-wise training.

[0103] The embodiments disclosed herein enable flexible modification of the model architecture in both the embedding layer and the transformer layer by using various techniques and combining coarse-grained and fine-grained components. This enhances control over the type of model compression, allowing the embedding size or the priority of attention modules to be assigned flexibly depending on the specific task to be processed. Additionally, the disclosed embodiments focus on achieving higher compression ratios. Tensor decomposition provides the maximum compression ratio because lower ranks can be assigned to obtain higher compression ratios. Furthermore, because the relative contribution of the embedding layer to the parameter count increases as the transformer layers are compressed, compressing the embedding layer significantly improves the compression ratio. Additionally, as previously disclosed, one or more transformer layers are replaced in consideration of the specific task to be processed. This fine-tuning is intended to ensure that the complexity of the task is accurately reflected in the number of layers used.

[0104] The embodiments described above are combinations of the components and features of the present invention in specific forms. Each component or feature should be considered optional unless otherwise explicitly stated. Each component or feature may be implemented without being combined with other components or features. Additionally, some components and/or features may be combined to implement embodiments of the present invention. The order of operations described in the embodiments of the present invention may be changed. Some components or features of one embodiment may be included in another embodiment, and such components or features may be replaced by related components or features of another embodiment. It is obvious that claims that do not explicitly cite one another in the appended claims may be combined to form embodiments or incorporated as new claims through post-filing amendments. It is obvious to those skilled in the art that the present invention may be implemented in various specific forms within the scope of the features of the present invention. Accordingly, the above detailed description should not be interpreted restrictively in any respect and should be considered exemplary. The scope of the present disclosure shall be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present disclosure are included within the scope of the present disclosure.

Claims

1. A computer-implemented method for modifying the architecture of a large-scale language model (LLM), the computer-implemented method comprising: a step of compressing an embedding layer of the LLM to reduce the parameter space size of the LLM, wherein the embedding layer has an embedding dimension of n, and the step of compressing the embedding layer includes a step of utilizing a first intermediate mapping configured to map tokens to an m-dimensional vector, m being smaller than n; and a step of compressing a plurality of transformer layers of the LLM to further reduce the parameter space size of the LLM.

2. The computer-implemented method of claim 1, wherein m represents an integer less than or equal to 10.

3. The computer-implemented method of claim 1, wherein m represents an integer less than or equal to 3.

4. The computer-implemented method of claim 1, wherein the step of compressing the embedding layer further includes a step of utilizing a second intermediate mapping, and the second intermediate mapping is configured to map the m-dimensional vector to an n-dimensional vector.

5. The computer-implemented method of claim 4, wherein the second intermediate mapping is based on the composition of a linear function and a plurality of non-linear functions.

6. The computer-implemented method of claim 5, wherein the step of compressing the embedding layer further comprises a step of applying tensor train decomposition to one or more larger matrices of the linear function and the plurality of non-linear functions.

7. The computer-implemented method of claim 1, wherein the step of compressing the plurality of transformer layers includes: a step of performing tensor train decomposition; and a step of performing transformer layer pruning.

8. The computer-implemented method of claim 7, wherein the step of performing the tensor train decomposition comprises a step of generating a new tensor for retraining based on matrix-wise training or segment-wise training.

9. The computer-implemented method of claim 7, wherein the step of performing the transformer layer pruning comprises a step of replacing one of the plurality of transformer layers with a coarse-granularity adapter.

10. The computer-implemented method of claim 9, wherein the coarse-granularity adapter is based on a gated multilayer perceptron (MLP) or a pair of low-rank non-linear functions.

11. The computer-implemented method of claim 9, wherein each of two or more of the plurality of transformer layers is replaced by a respective coarse-granularity adapter.

12. The computer-implemented method of claim 11, wherein each of the two or more transformer layers is identified for replacement based on the sensitivity of the performance of the LLM, as scored on a common evaluation dataset, to that transformer layer.

13. An artificial intelligence (AI) device configured to modify the architecture of a large-scale language model (LLM), the AI device comprising: one or more transceivers; and one or more processors configured to: compress an embedding layer of the LLM to reduce the size of the parameter space of the LLM, wherein the embedding layer has an embedding dimension of n and utilizes a first intermediate mapping configured to map tokens to an m-dimensional vector, m being smaller than n; and compress a plurality of transformer layers of the LLM to further reduce the parameter space size of the LLM.

14. The AI device of claim 13, wherein m represents an integer less than or equal to 10.

15. The AI device of claim 13, wherein m represents an integer less than or equal to 3.

16. The AI device of claim 13, wherein the one or more processors utilize a second intermediate mapping to compress the embedding layer, and the second intermediate mapping is configured to map the m-dimensional vector to an n-dimensional vector.

17. The AI device of claim 13, wherein the one or more processors are further configured to compress the plurality of transformer layers by: performing tensor train decomposition; and performing transformer layer pruning.

18. The AI device of claim 17, wherein performing the tensor train decomposition comprises generating a new tensor for retraining based on matrix-wise training or segment-wise training.

19. The AI device of claim 17, wherein performing the transformer layer pruning comprises replacing one of the plurality of transformer layers with a coarse-granularity adapter.

20. A non-transitory storage medium storing instructions that, when executed, cause at least one processor to perform operations, the operations comprising: compressing an embedding layer of a large-scale language model (LLM) to reduce the parameter space size of the LLM, wherein the embedding layer has an embedding dimension of n, and compressing the embedding layer includes utilizing a first intermediate mapping configured to map tokens to an m-dimensional vector, m being smaller than n; and compressing a plurality of transformer layers of the LLM to further reduce the parameter space size of the LLM.