Language modeling with factored memories

CN122197518A · Pending · Publication Date: 2026-06-12 · RAKUTEN GROUP INC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
RAKUTEN GROUP INC
Filing Date
2025-11-13
Publication Date
2026-06-12


Abstract

Language modeling with factorized memory is performed by computing a topic membership score for each topic vector based on an input token embedding and a topic membership weight matrix, updating each topic vector based on the corresponding topic membership score, and merging the updated topic vectors to produce an output token embedding.

Description

[0001] Cross-reference of related applications

[0002] This application claims priority to U.S. Provisional Patent Application No. 63/730,898, filed December 11, 2024; U.S. Non-Provisional Patent Application No. 19/260,365, filed July 4, 2025; and U.S. Non-Provisional Patent Application No. 19/260,361, filed July 4, 2025, the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This disclosure relates to language modeling with factorized memory.

Background

[0004] The information disclosed in the Background section is intended only to enhance the understanding of the general background of this disclosure and should not be construed as an admission or in any way implying that such information constitutes prior art known to those skilled in the art.

[0005] Transformer architectures in Large Language Models (LLMs) use a context window to consider the previous L tokens when generating the next token. Generating a sentence with L tokens therefore requires $O(L^2)$ calculations.

Summary of the Invention

[0006] In at least some embodiments, language modeling with factorized memory is performed by an operational method comprising: calculating a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

[0007] In at least some embodiments, language modeling with factorized memory is performed by a device configured to: compute a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; update each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merge the updated at least some of the plurality of topic vectors to produce an output token embedding.

[0008] In at least some embodiments, language modeling with factorized memory is performed by a non-transitory computer-readable medium, the non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause the following operations to be performed: calculating a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

Brief Description of the Drawings

[0009] Features, aspects, and advantages of embodiments of this disclosure will be described below with reference to the accompanying drawings, wherein similar reference numerals denote similar elements, and wherein:

[0010] Figure 1 is a schematic diagram of a factorized memory block of a language model according to at least some embodiments of the present disclosure.

[0011] Figure 2 is an operational flow for utilizing a recurrent memory state according to at least some embodiments of the present disclosure.

[0012] Figure 3 is an operational flow for updating topic vectors according to at least one embodiment of the present disclosure.

[0013] Figure 4 is an operational flow for merging updated topic vectors according to at least some embodiments of the present disclosure.

[0014] Figure 5 is a schematic diagram of a language model with factorized memory according to at least some embodiments of the present disclosure.

[0015] Figure 6 is an operational flow for assembling and training a language model with factorized memory according to at least some embodiments of the present disclosure.

[0016] Figure 7 illustrates a device for language modeling with factorized memory according to at least some embodiments of the present disclosure.

Detailed Description

[0017] The following disclosure provides numerous different embodiments or instances for implementing various features of the provided subject matter. Specific examples of components, values, operations, materials, arrangements, or the like are described below to simplify this disclosure. Of course, these are merely examples and are not intended to be limiting. Other components, values, operations, materials, arrangements, or the like may be considered. Furthermore, in various instances, references to numbers and / or letters may be repeated. This repetition is for simplicity and clarity and does not in itself indicate a relationship between the various embodiments and / or configurations discussed.

[0018] It should be understood that the systems and / or methods described herein can be implemented in various forms, including hardware, software, or a combination of both. The actual dedicated control hardware or software code used to implement these systems and / or methods should not limit their implementation. Therefore, no specific software code is referenced herein to describe the operation and behavior of the systems and / or methods. It should be understood that software and hardware can be designed to implement the systems and / or methods based on the descriptions herein.

[0019] Although specific combinations of features are cited in the claims and / or disclosed in the description, such specific combinations are not intended to limit the disclosure of the embodiments. In fact, many of these features can be combined in ways not expressly cited in the claims and / or disclosed in the description. Even if a dependent claim directly depends on only one claim, this disclosure may indicate that a dependent claim depends on other claims in the claim set.

[0020] Unless explicitly stated otherwise, elements, actions, or instructions used herein should not be construed as critical or essential. Furthermore, as used herein, the articles “a” and “an” are intended to include one or more items and are interchangeable with “one or more”. Additionally, as used herein, the terms “has/have/having,” “include/including,” or similar terms are intended to be open-ended terms. Furthermore, unless explicitly stated otherwise, the phrase “based on” is intended to mean “at least partially based on.” Moreover, phrases such as “at least one of [A] and [B],” “[A] and/or [B],” or “at least one of [A] or [B]” should be understood to include only A, only B, or both A and B.

[0021] In this disclosure, specific tasks can be performed using AI / ML (Artificial Intelligence / Machine Learning) models. An AI / ML model is a model generated using one or more AI techniques, one or more ML algorithms, or both, and produces output data based on input data. This output data is used to perform the task. Tasks performed using AI / ML models include tasks commonly referred to as intelligent tasks, such as classification, prediction, natural language processing, etc.

[0022] Although AI and ML have been explained separately, ML is a technology encompassed within AI. In ML, systems improve their performance over time by recognizing patterns and making inferences from training data, rather than being explicitly programmed for a specific task. Typically, the generation of an ML model involves data collection, model training, and model inference. Data collection involves gathering and preprocessing the data that will be used for training and inference. Model training involves developing and validating the model using the collected data. Model inference involves applying the trained model to new data to generate new output data and perform tasks.

[0023] Machine learning encompasses various types of learning methods, such as supervised learning, unsupervised learning, reinforcement learning, semi-supervised learning, self-supervised learning, transformational learning, transfer learning, meta-learning, and the like. These types of learning methods may be appropriately selected depending on the implementation. Unless otherwise specified, the application of types not mentioned in this description is not excluded. Furthermore, the structure of the ML model may vary depending on the implementation and learning method, and is not limited to the disclosed methods. In addition, ML includes deep learning, which uses models containing neural networks. Deep learning models may include, for example, deep neural networks (DNNs), convolutional neural networks (CNNs), etc.

[0024] It should be noted that the AI / ML model presented below is an example and is not limited to the AI / ML model described herein. It can be modified or changed by using different AI or ML algorithms. The configuration of the neural network is not limited to the configuration disclosed in this disclosure and can be modified.

[0025] Scaling L to a very large number is computationally infeasible. For example, a 1 GB email contains hundreds of millions of tokens, far exceeding the limits of any commercial API (typically 32k to 100k tokens). Transformers do not learn during inference. To adjust their behavior, a prompt shorter than L must be carried in each conversation turn. As task complexity increases, prompt engineering and retrieval-augmented generation (RAG) become increasingly prone to errors.

[0026] The language model according to at least some embodiments of this disclosure utilizes a recurrent memory state, which includes encoded states from previous inputs on which the output is based. In at least some embodiments, the recurrent memory state has a fixed memory size.

[0027] In at least some embodiments, a language model that generates output based on a recurrent memory state representing previous input requires less memory than a language model that generates output based on a transformer architecture that directly considers previous input within a context window.
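The memory comparison above can be made concrete with a rough count of stored values. The sketch below assumes a transformer that caches one key vector and one value vector per token per layer, and a recurrent model that keeps one fixed-length vector per topic per layer. The function names and all sizes are hypothetical and are not taken from this disclosure.

```python
def kv_cache_values(context_len, num_layers, model_dim):
    """Values held by an assumed key/value cache: K and V per token per layer."""
    return 2 * context_len * num_layers * model_dim


def recurrent_state_values(num_topics, num_layers, memory_dim):
    """Values held by an assumed fixed-size state: one vector per topic per layer."""
    return num_topics * num_layers * memory_dim


# Hypothetical sizes: 24 layers, width 1024, 64 topics per layer.
for context_len in (1_000, 100_000):
    cache = kv_cache_values(context_len, 24, 1024)
    state = recurrent_state_values(64, 24, 1024)
    print(f"L={context_len}: cache={cache} values, recurrent state={state} values")
```

Under these assumptions the cache grows linearly with the context length, while the recurrent state stays the same size regardless of how many tokens have been processed.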

[0028] In at least some embodiments, the token embedding hyperspace is divided into M fixed topics. In at least some embodiments, the centroid of each topic, $h_m$, is used as the anchor point for its partition, where $h_m$ represents the memory vector for the m-th topic. In at least some embodiments, each memory vector uses hardware storage of the size of the memory vector.

[0029] In at least some embodiments, given an input embedding $x_t$, the topic membership $\alpha_t$ is calculated across all topic centroids. In at least some embodiments, the final output embedding $y_t$ is formed as a weighted average of the memory vectors using $\alpha_t$ as the weights, where $\alpha_t \in \mathbb{R}^m$ and $\mathbb{R}^m$ represents a feature space with m dimensions.
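The read path described above can be sketched in a few lines. This is a minimal illustration, assuming that the membership scores come from a softmax over a linear map of the input (as a later paragraph suggests) and that the output is the membership-weighted average of the memory vectors. The helper names `softmax`, `matvec`, and `read_memory`, and all numeric values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def read_memory(x_t, w_alpha, memories):
    """Return (alpha_t, y_t): membership scores and the weighted average of memories."""
    alpha_t = softmax(matvec(w_alpha, x_t))  # assumed form of the membership score
    dim = len(memories[0])
    y_t = [sum(a * h[j] for a, h in zip(alpha_t, memories)) for j in range(dim)]
    return alpha_t, y_t


# Toy example: 3 topics, 2-dimensional embeddings (illustrative values only).
w_alpha = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
memories = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
alpha_t, y_t = read_memory([1.0, 0.0], w_alpha, memories)
print(alpha_t, y_t)
```

Because the input aligns with the first row of `w_alpha`, almost all of the weight falls on the first memory vector, and the output stays close to it.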

[0030] In at least some embodiments, memory updates are also gated by topic membership. In at least some embodiments, only memory vectors corresponding to topics closely aligned with the input receive significant updates, while other topic-specific memories remain unaffected. In the update rule, η is the learning rate and τ is the temperature scaling parameter.

[0031] In at least some embodiments, the negative terms in the original memory update rule can be simplified by utilizing the assumption that, in a well-trained embedding space, the token embeddings are uniformly distributed across their topic partitions. In at least some embodiments, the input embeddings of each transformer are RMS or layer-normalized.

[0032] In at least some embodiments, the updated expansion factor is defined as:

[0033] In at least some embodiments, by simplifying $P(x_t \mid h_t)$ so that it does not depend on $h_t$, associative parallel scan computation is achieved, resulting in:

[0034] or

[0035] And the model becomes:

[0036] Based on the above memory update equation, the memory accumulates to a stable norm over multiple steps. The memory tensors can be initialized to zero at the start of the sequence. This asymptotic norm accumulation can delay convergence. In at least some embodiments, normalization layers, particularly RMS normalization, are incorporated before the output of the memory layers. In at least some embodiments, RMS normalization helps stabilize the output scale across updates. In at least some embodiments, the expressiveness of the layers is expanded by incorporating input and output projections, which allows for dynamic control over the memory dimension and the number of heads. In at least some embodiments, an output gating mechanism is also introduced, which empirically enhances model performance with minimal computational footprint. In at least some embodiments, the architecture here does not require one-dimensional convolutional (Conv1D) processing to maintain robust sequential inference, likely due to its inherent structure and topic-adaptive memory design.
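To illustrate why RMS normalization stabilizes the output scale, the sketch below normalizes two vectors that point in the same direction but differ in magnitude by a factor of 100. It assumes the plain form of RMS normalization without a learned gain; the function name `rms_norm` and the `eps` value are hypothetical choices, not details from this disclosure.

```python
import math


def rms_norm(vector, eps=1e-6):
    """Divide a vector by its root mean square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(v * v for v in vector) / len(vector) + eps)
    return [v / rms for v in vector]


# A memory readout early in a sequence (small norm) and later (large norm).
small = rms_norm([0.1, -0.2, 0.3])
large = rms_norm([10.0, -20.0, 30.0])
print(small)
print(large)
```

Both inputs map to nearly the same normalized vector, so later layers see a consistent scale while the memory norm is still accumulating.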

[0037] In at least some embodiments, the foregoing is collectively referred to below as a set of equations for utilizing recurrent memory states. In at least some embodiments, this set of equations for utilizing recurrent memory states enables scaling to a large number of topic partitions m. In at least some embodiments, with $\alpha_t$ used as the routing probability, updates are biased toward the most relevant partitions.

[0038] In at least some embodiments, the router weights, such as $\theta_t$ described below, can be regarded as the centroid of the space designated by memory $h_m$. In at least some embodiments, this removes the obstacle to parallel training optimization.

[0039] In at least some embodiments, the gating network also steers $(W_i \cdot x_t)$ away from a memory block $h_m$ when $h_m$ is far from $(W_i \cdot x_t)$ and has a small topic membership score, thereby preventing the overwriting of memories that store different topics, similar to context switching. In at least some embodiments, the topic membership $\alpha_t$ is a probability distribution that sums to 1.0, and therefore the weights of $\theta_t (W_i \cdot x_t)$ sum to the learning rate η.

[0040] In at least some embodiments, the network is "sparsely activated" such that computations in which $\theta_t (W_i \cdot x_t)$ is close to zero can be discarded without significantly affecting the results. In at least some embodiments, this property enables the memory to scale to billions of topics m. In at least some embodiments, by performing direct adaptation, the dense set of equations for utilizing the recurrent memory state can be transformed into a sparse variant, thereby significantly enhancing computational efficiency without significantly sacrificing model expressiveness, as will be described below.

[0041] Figure 1 is a schematic diagram of a factorized memory block 100 of a language model according to at least some embodiments of the present disclosure. The factorized memory block 100 includes an input token embedding 101, a memory update function 102, topic vectors 104A, 104B, and 104M, a memory merging weight 106, a memory merging function 108, and an output token embedding 109. In at least some embodiments, the factorized memory block 100 is configured to selectively update portions of the recurrent memory state on which the output is based.

[0042] Input token embedding 101 is an example of input to factorized memory block 100. In at least some embodiments, input token embedding 101 is configured to represent the tokens of the natural language prompt as vectors in the feature space.

[0043] Memory update function 102 is an element of factorized memory block 100. In at least some embodiments, memory update function 102 is configured to update topic vectors, such as topic vectors 104A, 104B, and 104M, based on topic update weight values. In at least some embodiments, memory update function 102 is further configured to store the updated topic vectors in physical memory, such as memory 763 of Figure 7 described below.

[0044] Topic vectors, such as topic vectors 104A, 104B, and 104M, are elements of factorized memory block 100. In at least some embodiments, the topic vectors form a recurrent memory state. In at least some embodiments, each of topic vectors 104A, 104B, and 104M is updated by memory update function 102 based on a topic update weight value. In at least some embodiments, only some topic vectors are updated in response to each input token embedding. In at least some embodiments, the topic vectors are merged by memory merging function 108 to produce output token embedding 109.

[0045] Memory merging weight 106 is an element of factorized memory block 100. In at least some embodiments, memory merging weight 106 is configured to control the merging of topic vectors (e.g., topic vectors 104A, 104B, and 104M) based on the membership degree of the input token embeddings for each topic. In at least some embodiments, memory merging weight 106 is configured to bias the influence on output token embeddings 109 towards topics whose token embeddings have higher membership degrees. In at least some embodiments, memory merging weight 106 is calculated using topic merging rate and topic membership score.

[0046] Memory merging function 108 is an element of factorized memory block 100. In at least some embodiments, memory merging function 108 is configured to merge updated topic vectors, such as topic vectors 104A, 104B, and 104M, to produce output token embedding 109. In at least some embodiments, memory merging function 108 is configured to compute the output projection of the merged topic vectors. In at least some embodiments, memory merging function 108 is further configured to read the updated topic vectors from physical memory, such as memory 763 of Figure 7 described below.

[0047] Output token embedding 109 is an example of the output of factorized memory block 100. In at least some embodiments, output token embedding 109 is generated by merging updated topic vectors (e.g., topic vectors 104A, 104B, and 104M) using memory merging function 108. In at least some embodiments, output token embedding 109 is an output projection of the merged topic vectors.

[0048] Figure 2 is an operational flow for utilizing a recurrent memory state according to at least some embodiments of the present disclosure. In at least some embodiments, the operational flow provides a method of utilizing a recurrent memory state. In at least some embodiments, the method is executed by a processor of a device described below, for example, processor 762 of device 760 of Figure 7.

[0049] In S220, the processor calculates the topic membership score. In at least some embodiments, the processor receives an input token embedding. In at least some embodiments, the processor normalizes the input token embedding before calculating the topic membership score. In at least some embodiments, the normalization is root mean square (RMS) normalization. In at least some embodiments, the processor applies Softmax to calculate the topic membership score. In at least some embodiments, the processor calculates the topic membership score $\alpha_t$ according to the following equation:

$$\alpha_t = \mathrm{softmax}(W_\alpha x_t) \in \mathbb{R}^m \qquad (1)$$

where $W_\alpha$ represents the topic membership weight matrix, $x_t$ represents the input token embedding at time step t, m represents the number of topics, and $\mathbb{R}^m$ represents a feature space. In at least some embodiments, the topic membership scores sum to 1. In at least some embodiments, the processor retrieves the topic membership weight matrix from the trained parameter values of the language model. In at least some embodiments, the processor calculates the topic membership score $\alpha_t$ according to the following equation:

$$\alpha_t = \mathrm{softmax}\!\left(\frac{W_\alpha x_t}{\tau}\right) \qquad (2)$$

where τ represents the topic membership temperature value. In at least some embodiments, the processor retrieves the topic membership temperature value from configurable parameters or hyperparameters. In at least some embodiments, the processor calculates a topic membership score based on the input token embedding, the topic membership weight matrix, and the topic membership temperature value. In at least some embodiments, the processor calculates a topic membership score for each of a plurality of topic vectors based on the input token embedding and the topic membership weight matrix.
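The effect of the temperature value on the membership scores can be seen in a small sketch. It assumes the common convention of dividing the softmax logits by τ, which may differ from the exact equation in the original filing. The function name `topic_membership` and the weights are hypothetical.

```python
import math


def topic_membership(x_t, w_alpha, tau=1.0):
    """Softmax over (W_alpha . x_t) / tau; dividing by tau is an assumed convention."""
    logits = [sum(w * x for w, x in zip(row, x_t)) / tau for row in w_alpha]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy weights for 3 topics over a 2-dimensional embedding (illustrative only).
w_alpha = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x_t = [1.0, 0.5]
sharp = topic_membership(x_t, w_alpha, tau=0.5)  # low temperature: peaked scores
flat = topic_membership(x_t, w_alpha, tau=5.0)   # high temperature: near-uniform scores
print(sharp)
print(flat)
```

A lower temperature concentrates the scores on the best-matching topic, while a higher temperature spreads them more evenly; in both cases the scores sum to 1.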

[0050] In at least some embodiments, the processor performs sparse updates. In at least some embodiments, the processor calculates topic membership scores such that only topics with the highest membership scores are updated. In at least some embodiments, after $\alpha_t$ is calculated according to Equation 1 or Equation 2, the processor selects the top k relevant memory states by applying a top-k function to $\alpha_t$ (Equation 3), where k represents the number of topics to be updated. In at least some embodiments, as a result of applying Equation 3, only the top k membership scores are kept, and the other membership scores are set to 0. In at least some embodiments, the processor then renormalizes the kept membership scores to obtain sparse membership scores. In at least some embodiments, the processor proceeds to use the sparse membership scores instead of $\alpha_t$ to update the topic vectors at S224 and to merge the updated topic vectors at S228, thereby performing sparse update and merge.
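The top-k selection and renormalization described above can be sketched as follows. This is a minimal illustration in which renormalization divides the kept scores by their sum; that choice, the function name `sparse_membership`, and the tie-breaking by lowest index are assumptions.

```python
def sparse_membership(alpha_t, k):
    """Keep the k largest membership scores, zero the rest, and renormalize to sum to 1."""
    # Indices of the k largest scores; ties resolve to the lower index.
    ranked = sorted(range(len(alpha_t)), key=lambda i: alpha_t[i], reverse=True)
    kept = set(ranked[:k])
    masked = [a if i in kept else 0.0 for i, a in enumerate(alpha_t)]
    total = sum(masked)
    return [a / total for a in masked]


alpha_t = [0.5, 0.1, 0.3, 0.1]
sparse = sparse_membership(alpha_t, k=2)
print(sparse)  # approximately [0.625, 0.0, 0.375, 0.0]
```

Topics whose score is exactly zero after this step can be skipped entirely in the update and merge steps, which is where the computational saving comes from.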

[0051] In S224, the processor updates the topic vector based on the topic update weight value. In at least some embodiments, the processor computes the updated topic vector based on the input token embedding, the topic update weight, and the previous topic vector. In at least some embodiments, the processor computes updated topic vectors for only some topics. In at least some embodiments, the processor updates each of at least some of a plurality of topic vectors whose corresponding topic membership weight value is among a predetermined number of maximum topic membership weight values. In at least some embodiments, the processor computes the updated topic vectors using the sparse topic membership scores instead of $\alpha_t$ to perform sparse updates. In at least some embodiments, the processor updates each of at least some of a plurality of topic vectors whose corresponding topic membership weight values are greater than a threshold membership weight value. In at least some embodiments, the processor retrieves the previous topic vector from physical memory, such as memory 763 of Figure 7 described below. In at least some embodiments, the processor stores the updated topic vector in physical memory. In at least some embodiments, the processor updates each of at least some of the plurality of topic vectors based on the corresponding topic membership weight value. In at least some embodiments, the processor performs the operational flow of Figure 3 described below.

[0052] In S228, the processor merges the updated topic vectors. In at least some embodiments, the processor calculates the output projection of the merged topic vectors based on the topic merging weight, the updated topic vectors, and the output projection weight value. In at least some embodiments, the processor retrieves the updated topic vectors from physical memory. In at least some embodiments, the processor merges multiple updated topic vectors to produce an output token embedding. In at least some embodiments, the processor merges each of at least some of multiple topic vectors whose corresponding topic membership weight value is among a predetermined number of maximum topic membership weight values. In at least some embodiments, the processor merges the updated topic vectors using the sparse topic membership scores instead of $\alpha_t$ to perform sparse merging. In at least some embodiments, the processor performs the operational flow of Figure 4 described below.

[0053] Figure 3 is an operational flow for updating topic vectors according to at least one embodiment of the present disclosure. In at least some embodiments, the operational flow provides a method for updating topic vectors. In at least some embodiments, the method is executed by a processor of a device described below, for example, processor 762 of device 760 of Figure 7.

[0054] In S330, the processor calculates the topic update rate. In at least some embodiments, the processor uses a topic update rate weight value and an input token embedding to calculate the topic update rate. In at least some embodiments, the processor calculates the topic update rate $\eta_t$ according to an equation that takes the topic update rate weight value and the input token embedding as inputs, and in which $\sigma(\cdot)$ represents the sigmoid activation. In at least some embodiments, the processor retrieves the topic update rate weight value from the trained parameter values of the language model. In at least some embodiments, the processor uses the topic update rate weight value and the input token embedding as input to determine the topic update rate.

[0055] In S332, the processor calculates the topic update weight. In at least some embodiments, the processor uses the topic update rate and the topic membership score to calculate the topic update weight. In at least some embodiments, the processor calculates the topic update weight $\theta_t$ from the topic update rate $\eta_t$ and the topic membership score $\alpha_t$. In at least some embodiments, the processor uses the topic update rate and topic membership score as inputs to determine the topic update weight.
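Steps S330 and S332 can be sketched together. The sketch assumes a scalar update rate obtained by applying a sigmoid to a dot product of a weight vector with the input, and update weights formed by multiplying that rate by each membership score, which matches the earlier remark that the update weights sum to the learning rate. The names `sigmoid`, `topic_update_weights`, and `w_eta` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def topic_update_weights(x_t, w_eta, alpha_t):
    """Return (eta_t, theta_t) with theta_t = eta_t * alpha_t (assumed product form)."""
    eta_t = sigmoid(sum(w * x for w, x in zip(w_eta, x_t)))  # S330: update rate in (0, 1)
    theta_t = [eta_t * a for a in alpha_t]                   # S332: per-topic update weight
    return eta_t, theta_t


# Illustrative values: the dot product is 0, so the update rate is 0.5.
eta_t, theta_t = topic_update_weights([1.0, -1.0], [0.5, 0.5], [0.7, 0.2, 0.1])
print(eta_t, theta_t)
```

Because the membership scores sum to 1, the per-topic update weights in this sketch sum to the update rate, so a single token can never move the memory as a whole by more than that rate.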

[0056] In S334, the processor calculates the input projection. In at least some embodiments, the processor uses the input projection weight matrix and the input token embedding to calculate the input projection. In at least some embodiments, the processor calculates the input projection $W_i \cdot x_t$, where $W_i$ represents the input projection weight matrix. In at least some embodiments, the processor retrieves the input projection weight matrix from the trained parameter values of the language model.

[0057] In S336, the processor computes the updated topic vector. In at least some embodiments, the processor computes the updated topic vector using the input projection, topic update weights, and the previous topic vector. In at least some embodiments, each of the plurality of topic vectors has a fixed length. In at least some embodiments, the processor computes the updated topic vector $h_t$ according to an equation in which $h_{t-1}$ represents the previous topic vector. In at least some embodiments, the processor uses the input projection, topic update weights, and the previous topic vector as input to determine the updated topic vector.
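Steps S334 and S336 can be illustrated with a gated interpolation between the previous topic vector and the input projection. The interpolation rule used here is an assumption chosen for illustration, since the exact update equation is not reproduced in this text. The names `matvec`, `update_topic_vectors`, and the toy values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def update_topic_vectors(prev_vectors, theta_t, w_i, x_t):
    """Assumed rule per topic: h_t = (1 - theta) * h_{t-1} + theta * (W_i . x_t)."""
    u_t = matvec(w_i, x_t)  # S334: input projection
    updated = []
    for h_prev, theta in zip(prev_vectors, theta_t):
        if theta == 0.0:
            # Sparse case: a topic with zero update weight is left untouched.
            updated.append(list(h_prev))
        else:
            updated.append([(1.0 - theta) * h + theta * u for h, u in zip(h_prev, u_t)])
    return updated


prev_vectors = [[0.0, 0.0], [1.0, 1.0]]
w_i = [[2.0, 0.0], [0.0, 4.0]]
new_vectors = update_topic_vectors(prev_vectors, [0.5, 0.0], w_i, [1.0, 1.0])
print(new_vectors)  # [[1.0, 2.0], [1.0, 1.0]]
```

The first topic moves halfway toward the projected input, while the second topic, whose update weight is zero, keeps its stored content.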

[0058] In S338, the processor stores the updated topic vector in memory. In at least some embodiments, the processor stores each updated topic vector in one or more memory banks with a capacity equal to the size of the updated topic vector. In at least some embodiments, the processor stores the updated topic vector in memory by overwriting the previous topic vector. In at least some embodiments, the processor stores the updated topic vector in memory while preserving the previous topic vector during the training of the language model.
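The two storage behaviors described above, overwriting during normal use and preserving previous vectors during training, can be sketched with a small container. The class name `TopicMemory`, the `keep_history` flag, and the layout are hypothetical illustrations rather than the storage design of this disclosure.

```python
class TopicMemory:
    """Fixed-size store holding one fixed-length vector per topic."""

    def __init__(self, num_topics, dim, keep_history=False):
        self.vectors = [[0.0] * dim for _ in range(num_topics)]
        self.keep_history = keep_history  # True models the training-time behavior
        self.history = []

    def store(self, topic, vector):
        if len(vector) != len(self.vectors[topic]):
            raise ValueError("topic vectors have a fixed length")
        if self.keep_history:
            # Preserve the previous vector before it is replaced.
            self.history.append((topic, list(self.vectors[topic])))
        self.vectors[topic] = list(vector)  # overwrite in place


inference_memory = TopicMemory(num_topics=2, dim=2)
inference_memory.store(0, [1.0, 2.0])

training_memory = TopicMemory(num_topics=2, dim=2, keep_history=True)
training_memory.store(0, [1.0, 2.0])
training_memory.store(0, [3.0, 4.0])
print(inference_memory.vectors, training_memory.history)
```

In the overwriting mode the footprint stays constant no matter how many tokens are processed, whereas the history list in the training mode grows with each stored update.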

[0059] Figure 4 is an operational flow for merging updated topic vectors according to at least some embodiments of the present disclosure. In at least some embodiments, the operational flow provides a method for merging updated topic vectors. In at least some embodiments, the method is performed by a processor of a device described below, for example, processor 762 of device 760 of Figure 7.

[0060] In S440, the processor calculates the topic merging rate. In at least some embodiments, the processor uses a topic merging rate weight value and an input token embedding to calculate the topic merging rate. In at least some embodiments, the processor calculates the topic merging rate $\mu_t$ according to an equation that takes the topic merging rate weight value and the input token embedding as inputs. In at least some embodiments, the processor retrieves the topic merging rate weight value from the trained parameter values of the language model.

[0061] In S444, the processor calculates the topic merging weight. In at least some embodiments, the processor uses the topic merging rate and topic membership score to calculate the topic merging weight. In at least some embodiments, the processor calculates the topic merging weight $\phi_t$ from the topic merging rate $\mu_t$ and the topic membership score $\alpha_t$. In at least some embodiments, the processor uses topic merging weights to determine the contribution each topic vector should make to the merged representation.

[0062] In S448, the processor calculates the output projection of the merged topic vectors. In at least some embodiments, the processor uses topic merging weights, updated topic vectors, and output projection weight values to calculate the output projection. In at least some embodiments, the processor calculates the output projection $y_t$ according to an equation in which $W_o$ represents the output projection weight matrix. In at least some embodiments, the processor retrieves the output projection weight values from the trained parameter values of the language model. In at least some embodiments, the processor utilizes the output projection as an output token embedding.
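The merge flow of S440 through S448 can be sketched end to end. The sketch assumes that the merging rate is a sigmoid of a dot product (by analogy with the update rate), that the merging weight is the product of that rate and the membership score, and that the output is a linear projection of the weighted sum of topic vectors. These forms, the function name `merge_topic_vectors`, and the parameter names are assumptions, since the corresponding equations are not reproduced in this text.

```python
import math


def merge_topic_vectors(x_t, w_mu, alpha_t, topic_vectors, w_o):
    """Return y_t = W_o . sum_m(phi_m * h_m), with phi = mu_t * alpha_t (assumed forms)."""
    # S440: merging rate, assumed to be a sigmoid of a dot product.
    mu_t = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(w_mu, x_t))))
    # S444: merging weight per topic.
    phi_t = [mu_t * a for a in alpha_t]
    # Weighted sum of the updated topic vectors.
    dim = len(topic_vectors[0])
    merged = [sum(p * h[j] for p, h in zip(phi_t, topic_vectors)) for j in range(dim)]
    # S448: output projection.
    return [sum(w * v for w, v in zip(row, merged)) for row in w_o]


y_t = merge_topic_vectors(
    x_t=[0.0, 0.0],
    w_mu=[1.0, 1.0],
    alpha_t=[0.5, 0.5],
    topic_vectors=[[4.0, 0.0], [0.0, 8.0]],
    w_o=[[1.0, 0.0], [0.0, 1.0]],  # identity projection for readability
)
print(y_t)  # [1.0, 2.0]
```

With a zero input the assumed merging rate is 0.5, so each topic contributes a quarter of its vector to the merged result before the identity projection.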

[0063] Figure 5 is a schematic diagram of a language model with factorized memory according to at least some embodiments of the present disclosure. The language model 510 includes a token embedding layer 512, one or more decoder layers (e.g., decoder layer 514), and a language model head layer 518. In at least some embodiments, the language model 510 is configured to receive a natural language prompt 511 as input. In at least some embodiments, the language model 510 is configured to produce a natural language response 519 as output. Although the language model 510 is primarily designed for natural language, the natural language prompt 511 and the natural language response 519 are not strictly limited to natural language. The natural language prompt 511 and the natural language response 519 may contain non-linguistic text, such as code, mathematical algorithms, programming or markup languages, or any other non-linguistic elements that typically accompany natural language.

[0064] The token embedding layer 512 is a group of layers included in the language model 510. In at least some embodiments, the token embedding layer 512 is configured to parse the natural language prompt 511 into tokens. In at least some embodiments, the token embedding layer 512 is configured to embed each token into a vector in the feature space. In at least some embodiments, the token embedding layer 512 is configured to encode the natural language prompt 511 into an input token embedding, for example, the input token embedding 101 of Figure 1. In at least some embodiments, the token embedding layer 512 is generally compatible with the language model. In at least some embodiments, the token embedding layer 512 may be trained separately from the language model 510. In at least some embodiments, the token embedding layer 512 and the language model 510 are trained together as a whole.

[0065] The decoder layers, including decoder layer 514, are groups of layers included in language model 510. Decoder layer 514 includes factorized memory block 500 and feedforward block 516. In at least some embodiments, each decoder layer includes a factorized memory block. In at least some embodiments, each decoder layer includes only a factorized memory block. In at least some embodiments, some decoder layers optionally include feedforward blocks, fully connected blocks, etc., or any combination thereof with factorized memory blocks. In at least some embodiments, some decoder layers include attention blocks or multilayer perceptron (MLP) blocks without factorized memory blocks.

[0066] Factorized memory block 500 is a component of decoder layer 514. In at least some embodiments, factorized memory block 500 is configured to selectively update portions of the recurrent memory state upon which the output is based. In at least some embodiments, factorized memory block 500 includes memory update functionality and memory merging functionality. In at least some embodiments, factorized memory block 500 is configured to calculate topic membership scores, update topic vectors based on topic update weight values, and merge updated topic vectors to produce output token embeddings. In at least some embodiments, factorized memory block 500 is configured as described with respect to Figure 1.

[0067] Feedforward block 516 is a component of decoder layer 514. In at least some embodiments, feedforward block 516 is an optional block within decoder layer 514. In at least some embodiments, feedforward block 516 is configured to perform additional processing on the output token embedding. In at least some embodiments, feedforward block 516 is configured to refine the output projection into the output token embedding.

[0068] The language model head layer 518 is a group of layers included in the language model 510. In at least some embodiments, the language model head layer 518 is configured to decode embedded token vectors into tokens. In at least some embodiments, the language model head layer 518 is configured to assemble tokens into a natural language response 519. In at least some embodiments, the language model head layer 518 is configured to decode output token embeddings into a natural language response 519. In at least some embodiments, the language model head layer 518 is generally compatible with the language model. In at least some embodiments, the language model head layer 518 and the language model 510 are trained together as a whole.
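The data flow of Figure 5 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the class name `ToyLanguageModel` and the three toy components are hypothetical stand-ins, and the decoder layers here are stateless callables, whereas a factorized memory block would carry recurrent topic vectors between tokens.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class ToyLanguageModel:
    """Token embedding layer -> decoder layers -> language model head."""

    def __init__(
        self,
        embed: Callable[[str], List[Vector]],
        decoder_layers: Sequence[Callable[[Vector], Vector]],
        head: Callable[[List[Vector]], str],
    ) -> None:
        self.embed = embed
        self.decoder_layers = list(decoder_layers)
        self.head = head

    def respond(self, prompt: str) -> str:
        embeddings = self.embed(prompt)        # prompt -> input token embeddings
        outputs = []
        for x in embeddings:                   # tokens are processed in order
            for layer in self.decoder_layers:  # each layer maps embedding -> embedding
                x = layer(x)
            outputs.append(x)
        return self.head(outputs)              # output embeddings -> response text


# Hypothetical toy components, chosen only to make the flow visible.
def toy_embed(prompt: str) -> List[Vector]:
    return [[float(len(token))] for token in prompt.split()]


def toy_layer(x: Vector) -> Vector:
    return [2.0 * value for value in x]


def toy_head(outputs: List[Vector]) -> str:
    return " ".join(str(int(vec[0])) for vec in outputs)


model = ToyLanguageModel(toy_embed, [toy_layer], toy_head)
```

Calling `model.respond("ab cde")` embeds the two tokens as their lengths, doubles each value in the single toy layer, and joins the results into the string `"4 6"`.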

[0069] Figure 6 is an operational flow for assembling and training a language model with factorized memory according to at least some embodiments of the present disclosure. In at least some embodiments, the operational flow provides a method for assembling and training a language model with factorized memory. In at least some embodiments, the method is executed by a processor of a device described below, such as processor 762 of device 760 of Figure 7.

[0070] In S650, the processor constructs decoder layers with factorized memory. In at least some embodiments, the processor constructs decoder layers, at least some of which include factorized memory blocks. In at least some embodiments, the processor constructs the decoder layers in a number, configuration, and pattern based on user input. In at least some embodiments, the processor includes one or more optional blocks in the decoder layers, such as feedforward blocks, fully connected blocks, etc. In at least some embodiments, the processor constructs decoder layers, at least some of which include attention blocks or MLP blocks but do not include factorized memory blocks.

[0071] In S652, the processor assembles the token embedding layer, the decoder layers, and the language model head layer. In at least some embodiments, the processor assembles the language model by combining the decoder layers with the token embedding layer on the input side and the language model head layer on the output side. In at least some embodiments, the processor configures the output dimension of the token embedding layer to match the input dimension of the decoder layers. In at least some embodiments, the processor configures the input dimension of the language model head layer to match the output dimension of the decoder layers.
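The dimension matching performed during assembly in S652 can be sketched as a simple consistency check. The `LayerSpec` record and `assemble` function below are hypothetical names introduced only for illustration; the disclosure does not prescribe how the check is carried out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerSpec:
    name: str
    in_dim: int
    out_dim: int


def assemble(embedding: LayerSpec, decoders: List[LayerSpec], head: LayerSpec) -> List[LayerSpec]:
    """Chain the layers, checking that adjacent dimensions agree."""
    stack = [embedding, *decoders, head]
    for prev, nxt in zip(stack, stack[1:]):
        if prev.out_dim != nxt.in_dim:
            raise ValueError(
                f"{prev.name} outputs {prev.out_dim} dims but {nxt.name} expects {nxt.in_dim}"
            )
    return stack


# Assumed sizes: a vocabulary of 1000 tokens and an embedding width of 64.
stack = assemble(
    LayerSpec("token_embedding", in_dim=1000, out_dim=64),
    [LayerSpec("decoder_0", 64, 64), LayerSpec("decoder_1", 64, 64)],
    LayerSpec("lm_head", in_dim=64, out_dim=1000),
)
```

A mismatch, such as a decoder layer expecting 128 dimensions after a 64-dimensional embedding layer, raises `ValueError` before any training begins.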

[0072] In S654, the processor selects values for configurable parameters. In at least some embodiments, the processor selects values for parameters that include the total number of topic vectors (e.g., m in Equation 1), the topic membership temperature (e.g., τ in Equation 2), and the number of topic vectors updated per input token embedding (e.g., k in Equation 3). In at least some embodiments, the processor sets the values of the configurable parameters based on user input. In at least some embodiments, the processor selects the values of the configurable parameters based on training results.
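How the temperature τ and the count k shape topic selection can be illustrated with a short Python sketch. The softmax-with-temperature form is an assumption made for illustration, since Equations 1 to 3 are not reproduced in this passage; the function names and the example logits are likewise hypothetical.

```python
import math
from typing import List


def membership_scores(logits: List[float], tau: float) -> List[float]:
    """Softmax over m topic logits with temperature tau (assumed form)."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_topics(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring topics, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


logits = [2.0, 0.5, 1.0, -1.0]               # m = 4 topics
sharp = membership_scores(logits, tau=0.5)   # low temperature: peaked scores
flat = membership_scores(logits, tau=2.0)    # high temperature: flatter scores
selected = top_k_topics(sharp, k=2)          # topics 0 and 2
```

Under this assumed form, a lower τ concentrates membership on the best-matching topic, while k bounds how many topic vectors are touched per input token embedding.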

[0073] In S656, the processor trains the language model. In at least some embodiments, the processor uses a training set of training samples to calculate a loss based on a loss function, and updates the trainable parameters of the language model based on the calculated loss. In at least some embodiments, the parameters trained by the processor include topic membership weights, topic update rate weights, topic merging rate weights, input projection weights, output projection weights, token embedding layer parameters, language model head layer parameters, and any other trainable parameters of the language model. In at least some embodiments, when training the language model, the processor divides the token embedding hyperspace into multiple topic partitions. In at least some embodiments, when training the language model, the processor encodes the centroid of each topic partition into a topic vector, and the topic membership weights are based on the topic vectors.
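The loss calculation and parameter update of S656 can be sketched for a single training sample. The disclosure does not name a loss function or optimizer, so cross-entropy and plain gradient descent are assumptions here, and the single weight matrix `W` is a hypothetical stand-in for all trainable parameters of the language model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_step(W: List[List[float]], x: List[float], target: int, lr: float) -> float:
    """Return the cross-entropy loss for one sample, then update W in place."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    for i, row in enumerate(W):
        err = probs[i] - (1.0 if i == target else 0.0)  # dLoss/dlogit_i
        for j, xj in enumerate(x):
            row[j] -= lr * err * xj                      # gradient descent step
    return loss


W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 output tokens, 2 embedding dims
x = [1.0, 0.5]                            # one output token embedding
first = loss_and_step(W, x, target=1, lr=0.1)
second = loss_and_step(W, x, target=1, lr=0.1)
```

With all-zero initial weights the first loss equals ln 3, and the second call reports a smaller loss because the first update moved probability toward the target token.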

[0074] In S658, the processor determines whether the accuracy and computational efficiency are acceptable. If the processor determines that the accuracy and computational efficiency are unacceptable, the operation returns to S654 to select different values for the configurable parameters. If the processor determines that the accuracy and computational efficiency are acceptable, the operation ends.
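The loop formed by S654, S656, and S658 can be sketched as a search over candidate parameter values. The function names, the thresholds, and the stand-in evaluation below are hypothetical; in particular, `fake_train_and_evaluate` returns made-up numbers solely so that the loop can run, and does not reflect measured results.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Config = Dict[str, float]


def select_configuration(
    candidates: Iterable[Config],
    train_and_evaluate: Callable[[Config], Tuple[float, float]],
    min_accuracy: float,
    max_cost: float,
) -> Optional[Config]:
    """Try candidate values (S654), train (S656), and stop once acceptable (S658)."""
    for config in candidates:
        accuracy, cost = train_and_evaluate(config)
        if accuracy >= min_accuracy and cost <= max_cost:
            return config
    return None  # no candidate met both thresholds


def fake_train_and_evaluate(config: Config) -> Tuple[float, float]:
    # Hypothetical stand-in: pretend accuracy grows with k and cost with m * k.
    return 0.6 + 0.1 * config["k"], config["m"] * config["k"]


candidates = [
    {"m": 8, "tau": 1.0, "k": 1},
    {"m": 8, "tau": 1.0, "k": 4},
    {"m": 16, "tau": 0.5, "k": 2},
]
chosen = select_configuration(candidates, fake_train_and_evaluate, min_accuracy=0.75, max_cost=40)
```

Here the first candidate is rejected for low accuracy and the second is accepted, so the loop ends without evaluating the third.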

[0075] Figure 7 illustrates a device 760 for language modeling with factorized memory according to at least some embodiments of the present disclosure. As shown in Figure 7, device 760 includes processor 762, memory 763, storage device 764, input component 765, output component 766, communication interface 767, and bus 768. As used herein, processor 762 refers to any type of computing circuitry that may include hardware and software elements. Processor 762 may be embodied as a multi-core processor, a single-core processor, or a combination of one or more multi-core processors and / or one or more single-core processors, a distributed processing system, or the like. Processor 762 may be a central processing unit (CPU), graphics processing unit (GPU), accelerated processing unit (APU), application-specific integrated circuit (ASIC), or another type of processing component.

[0076] Memory 763 includes non-transitory computer-readable media. Memory 763 includes random access memory (RAM), read-only memory (ROM), and / or another type of dynamic or static storage device (e.g., flash memory, magnetic memory, and / or optical memory) that stores information and / or instructions for use by processor 762. Memory 763 includes machine-readable instructions executable by processor 762. When executed by processor 762, these machine-readable instructions cause processor 762 to perform one or more method steps of the embodiments described above.

[0077] Storage device 764 stores information and / or software related to the operation and use of device 760. For example, storage device 764 may include a hard disk (e.g., a magnetic disk, optical disk, magneto-optical disk, and / or solid-state drive), a compact disc (CD), a digital versatile disc (DVD), a floppy disk, a cassette tape, a magnetic tape, and / or another type of non-transitory computer-readable medium, together with a corresponding drive.

[0078] Input component 765 is configured to receive information, such as user input. For example, input component 765 may include, but is not limited to, a touchscreen display, keyboard, keypad, mouse, buttons, switches, and / or microphone. Alternatively, input component 765 may include sensors for sensing information (such as a Global Positioning System (GPS), accelerometer, gyroscope, and / or actuator).

[0079] Output component 766 is configured to provide information output from device 760. For example, output component 766 may be, but is not limited to, a display, a speaker, an instruction device for an external device, and / or one or more light-emitting diodes (LEDs).

[0080] Communication interface 767 is an interface that provides a communication connection with other devices (such as external and internal devices). The connection via communication interface 767 can be a wired connection, a wireless connection, or a combination of wired and wireless connections, and can be a direct connection or an indirect connection via a communication network existing between device 760 and other devices. In other words, the standard of communication interface 767 is not limited.

[0081] Bus 768 serves as an interconnect between the processor 762, memory 763, storage device 764, input component 765, output component 766, and communication interface 767 of device 760. Bus 768 may include wired or wireless interconnects.

[0082] The number and arrangement of components shown in Figure 7 are provided for illustrative purposes only. In practice, device 760 may include additional components, fewer components, different components, or differently arranged components than those shown in Figure 7. Additionally or alternatively, a set of components of device 760 (e.g., one or more components) may perform one or more functions described as being performed by another set of components of device 760. Furthermore, one or more method steps described in any of the embodiments may be performed using multiple devices 760 communicating with each other.

[0083] In at least some embodiments, language modeling with factorized memory is performed by an operational method comprising: calculating a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

[0084] In at least some embodiments, the update operation of the method includes calculating a topic update rate value for each of at least some of the plurality of topic vectors based on topic update rate weights and the input token embedding. In at least some embodiments, the update further includes calculating a topic update weight value for each of the at least some of the plurality of topic vectors based on the topic update rate value and the topic membership score. In at least some embodiments, the update further includes calculating an input projection based on an input projection weight matrix and the input token embedding. In at least some embodiments, the update further includes calculating an updated topic vector for each of the at least some of the plurality of topic vectors based on the topic update weight value, the input projection, and the previous topic vector. In at least some embodiments, the update further includes storing each updated topic vector in physical memory. In at least some embodiments, the update further includes retrieving the previous topic vector corresponding to each of the at least some of the plurality of topic vectors from the physical memory. In at least some embodiments, the merging includes calculating a topic merging rate value for each of the at least some of the plurality of topic vectors based on a topic merging rate weight and the input token embedding. In at least some embodiments, the merging further includes calculating a topic merging weight value for each topic among at least some of the plurality of topic vectors based on the topic merging rate value and the topic membership score. In at least some embodiments, the merging further includes calculating an output projection based on the output projection weight matrix of each topic among at least some of the plurality of topic vectors, the updated topic vector, and the topic merging weight value. 
In at least some embodiments, the method further includes: encoding natural language input into the input token embedding; and decoding the output token embedding into natural language output. In at least some embodiments, the calculation of each topic membership score is further based on a topic membership temperature value. In at least some embodiments, the calculation includes selecting a predetermined number of topic vectors with the highest topic membership score from the plurality of topic vectors, the predetermined number of topic vectors being at least some of the plurality of topic vectors.
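The update and merging operations described above can be sketched as a single per-token step. Every formula in the sketch is an assumption made for illustration, because the governing equations are not reproduced in this passage: membership scores use a softmax with temperature, rate values use a sigmoid, each weight value is the product of a rate value and a membership score, each updated topic vector is a convex blend of the previous topic vector and the input projection, and a single output projection matrix is applied to the weighted sum. All function and parameter names are hypothetical, and the `memory` list stands in for topic vectors retrieved from and stored to physical memory.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z: Vector, tau: float) -> Vector:
    peak = max(z)
    exps = [math.exp((v - peak) / tau) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def factorized_memory_step(
    x: Vector, memory: Matrix, W_mem: Matrix, W_upd: Matrix, W_mrg: Matrix,
    W_in: Matrix, W_out: Matrix, tau: float, k: int,
) -> Tuple[Vector, Matrix]:
    """One token step: score topics, update the top k, merge them into an output."""
    scores = softmax(matvec(W_mem, x), tau)             # topic membership scores
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    upd_rate = [sigmoid(v) for v in matvec(W_upd, x)]   # topic update rate values
    mrg_rate = [sigmoid(v) for v in matvec(W_mrg, x)]   # topic merging rate values
    proj = matvec(W_in, x)                              # input projection
    new_memory = [row[:] for row in memory]             # unselected topics stay as-is
    merged = [0.0] * len(proj)
    for i in chosen:
        a = upd_rate[i] * scores[i]                     # topic update weight value
        new_memory[i] = [(1 - a) * h + a * p for h, p in zip(memory[i], proj)]
        b = mrg_rate[i] * scores[i]                     # topic merging weight value
        merged = [acc + b * h for acc, h in zip(merged, new_memory[i])]
    return matvec(W_out, merged), new_memory            # output token embedding


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]            # 3 topics, 2 dimensions
out, new_memory = factorized_memory_step(
    x=[1.0, 0.0], memory=zeros,
    W_mem=[[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]], W_upd=zeros, W_mrg=zeros,
    W_in=identity, W_out=identity, tau=1.0, k=2,
)
```

In this example only the two best-matching topic vectors are rewritten; the third is returned unchanged, which is the selective update the factorized memory is meant to provide.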

[0085] In at least some embodiments, language modeling with factorized memory is performed by a device configured to: compute a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; update each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merge the updated at least some of the plurality of topic vectors to produce an output token embedding.

[0086] In at least some embodiments, the update operation performed by the device includes calculating a topic update rate value for each of at least some of the plurality of topic vectors based on topic update rate weights and the input token embedding. In at least some embodiments, the update further includes calculating a topic update weight value for each of the at least some of the plurality of topic vectors based on the topic update rate value and the topic membership score. In at least some embodiments, the update further includes calculating an input projection based on an input projection weight matrix and the input token embedding. In at least some embodiments, the update further includes calculating an updated topic vector for each of the at least some of the plurality of topic vectors based on the topic update weight value, the input projection, and the previous topic vector. In at least some embodiments, the update further includes storing each updated topic vector in physical memory. In at least some embodiments, the update further includes retrieving the previous topic vector corresponding to each of the at least some of the plurality of topic vectors from the physical memory. In at least some embodiments, the merging includes calculating a topic merging rate value for each of the at least some of the plurality of topic vectors based on a topic merging rate weight and the input token embedding. In at least some embodiments, the merging further includes calculating a topic merging weight value for each topic among at least some of the plurality of topic vectors based on the topic merging rate value and the topic membership score. In at least some embodiments, the merging further includes calculating an output projection based on the output projection weight matrix of each topic among at least some of the plurality of topic vectors, the updated topic vector, and the topic merging weight value. 
In at least some embodiments, the operation performed by the device further includes: encoding natural language input into the input token embedding; and decoding the output token embedding into natural language output. In at least some embodiments, the calculation of each topic membership score is further based on a topic membership temperature value. In at least some embodiments, the calculation includes selecting a predetermined number of topic vectors with the highest topic membership score from the plurality of topic vectors, the predetermined number of topic vectors being at least some of the plurality of topic vectors.

[0087] In at least some embodiments, language modeling with factorized memory is performed by a non-transitory computer-readable medium, the non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause the following operations to be performed: calculating a topic membership score for each of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each of at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

[0088] In at least some embodiments, the update operation includes calculating a topic update rate value for each of the at least some of the plurality of topic vectors based on topic update rate weights and the input token embedding. In at least some embodiments, the update further includes calculating a topic update weight value for each of the at least some of the plurality of topic vectors based on the topic update rate value and the topic membership score. In at least some embodiments, the update further includes calculating an input projection based on an input projection weight matrix and the input token embedding. In at least some embodiments, the update further includes calculating an updated topic vector for each of the at least some of the plurality of topic vectors based on the topic update weight value, the input projection, and the previous topic vector. In at least some embodiments, the update further includes storing each updated topic vector in physical memory. In at least some embodiments, the update further includes retrieving the previous topic vector corresponding to each of the at least some of the plurality of topic vectors from the physical memory. In at least some embodiments, the merging includes calculating a topic merging rate value for each of the at least some of the plurality of topic vectors based on a topic merging rate weight and the input token embedding. In at least some embodiments, the merging further includes calculating a topic merging weight value for each topic among at least some of the plurality of topic vectors based on the topic merging rate value and the topic membership score. In at least some embodiments, the merging further includes calculating an output projection based on the output projection weight matrix of each topic among at least some of the plurality of topic vectors, the updated topic vector, and the topic merging weight value. 
In at least some embodiments, the operation further includes: encoding natural language input into the input token embedding; and decoding the output token embedding into natural language output. In at least some embodiments, the calculation of each topic membership score is further based on a topic membership temperature value. In at least some embodiments, the calculation includes selecting a predetermined number of topic vectors with the highest topic membership score from the plurality of topic vectors, the predetermined number of topic vectors being at least some of the plurality of topic vectors. In at least some embodiments, the operation further includes training a language model comprising multiple token embedding layers, at least one decoder layer comprising factorized memory blocks, and multiple language model head layers, wherein the factorized memory blocks comprise trainable parameters, including the topic membership weight matrix, the topic update rate, the topic merging rate, the input projection weight matrix, and the output projection weight matrix. In at least some embodiments, the operation further includes selecting a value for each of at least some configurable parameters comprising the total number of topic vectors, the number of topic vectors updated per input token embedding, and the topic membership temperature.

Claims

1. A method comprising: calculating a topic membership score for each topic vector of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each topic vector among at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

2. The method of claim 1, wherein the updating includes calculating a topic update rate value for each of the at least some of the plurality of topic vectors based on a topic update rate weight and the input token embedding.

3. The method of claim 2, wherein the updating further includes calculating a topic update weight value for each of the at least some of the plurality of topic vectors based on the topic update rate value and the topic membership score.

4. The method of claim 3, wherein the updating further includes calculating an input projection based on an input projection weight matrix and the input token embedding.

5. The method of claim 4, wherein the updating further includes calculating an updated topic vector for each of the at least some of the plurality of topic vectors based on the topic update weight value, the input projection, and a previous topic vector.

6. The method of claim 5, wherein the updating further includes storing each updated topic vector in physical memory.

7. The method of claim 6, wherein the merging includes calculating a topic merging rate value for each of the at least some of the plurality of topic vectors based on a topic merging rate weight and the input token embedding.

8. The method of claim 7, wherein the merging further includes calculating a topic merging weight value for each of the at least some of the plurality of topic vectors based on the topic merging rate value and the topic membership score.

9. The method of claim 8, wherein the merging further includes calculating an output projection based on the output projection weight matrix of each topic in at least some of the plurality of topic vectors, the updated topic vectors, and the topic merging weight values.

10. The method of claim 9, further comprising: training a language model, the language model comprising multiple token embedding layers, at least one decoder layer comprising factorized memory blocks, and multiple language model head layers, wherein the factorized memory blocks comprise trainable parameters, the trainable parameters comprising the topic membership weight matrix, the topic update rate, the topic merging rate, the input projection weight matrix, and the output projection weight matrix.

11. The method of claim 10, further comprising: selecting a value for each of at least some configurable parameters, the configurable parameters including the total number of topic vectors, the number of topic vectors updated per input token embedding, and the topic membership temperature.

12. The method of claim 5, wherein the updating further includes retrieving from the physical memory the previous topic vector corresponding to each of the at least some of the plurality of topic vectors.

13. The method of claim 1, further comprising: encoding natural language input into the input token embedding.

14. The method of claim 13, further comprising: decoding the output token embedding into natural language output.

15. The method of claim 1, wherein the calculating of each topic membership score is further based on a topic membership temperature value.

16. The method of claim 1, wherein the calculating includes selecting a predetermined number of topic vectors with the highest topic membership score from the plurality of topic vectors, wherein the predetermined number of topic vectors is the at least some of the plurality of topic vectors.

17. An apparatus configured to perform operations including: calculating a topic membership score for each topic vector of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each topic vector among at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

18. The apparatus of claim 17, wherein the updating includes calculating a topic update rate value for each of the at least some of the plurality of topic vectors based on a topic update rate weight and the input token embedding.

19. A non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause execution of operations including: calculating a topic membership score for each topic vector of a plurality of topic vectors based on an input token embedding and a topic membership weight matrix; updating each topic vector among at least some of the plurality of topic vectors based on the corresponding topic membership score; and merging the updated at least some of the plurality of topic vectors to produce an output token embedding.

20. The computer-readable medium of claim 19, wherein the updating includes calculating a topic update rate value for each of the at least some of the plurality of topic vectors based on a topic update rate weight and the input token embedding.