Efficient generative neural network serving using in-context caching
The in-context caching system efficiently routes requests to appropriate generative neural networks, enhancing LLM serving performance by improving throughput and reducing latency while maintaining response quality.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2025-12-23
- Publication Date
- 2026-07-02
AI Technical Summary
Serving large language models (LLMs) at scale is challenging due to high computational costs and latency, and existing caching methods lead to quality degradation in responses due to small differences in user requests.
An in-context caching system that selects appropriate demonstration examples based on similarity and utility, adaptively routes requests to generative neural networks with varying capabilities, and employs a cost-aware cache replay mechanism to balance response quality, latency, and system throughput.
Improves LLM serving throughput by 1.4-5.9x and reduces latency by 28-71% without degrading response quality, effectively managing resource demands and latency.
Abstract
Description
[0001] EFFICIENT GENERATIVE NEURAL NETWORK SERVING USING IN-CONTEXT CACHING
[0002] CROSS REFERENCE TO RELATED APPLICATIONS
[0003] This application claims priority to U.S. Application No. 63/738,475, filed December 23, 2024. The disclosure of the foregoing application is hereby incorporated by reference in its entirety.
[0004] BACKGROUND
[0005] This specification relates to processing inputs using neural networks.
[0006] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to another layer in the network, e.g., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of weights.
[0007] SUMMARY
[0008] This specification describes a system implemented as computer programs on one or more computers in one or more locations that performs model serving for a set of generative neural networks, e.g., language model neural networks, e.g., large language model neural networks (LLM), recurrent neural networks, diffusion models, or any other appropriate type of generative neural networks. That is, the system uses the set of generative neural networks to respond to user requests.
[0009] In one aspect, a method includes maintaining cache data specifying a set of demonstration examples, wherein each demonstration example comprises a respective example request and a respective example response for the respective example request; receiving a new request; selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request; selecting, based on the subset of demonstration examples and the new request, a generative neural network from a set of a plurality of generative neural networks for processing the new request; processing the new request using the selected generative neural network to generate a new response to the new request; and providing the new response in response to the new request.
In some implementations, selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request comprises: generating an embedding for the new request; and selecting a first subset of the demonstration examples based on a similarity between the embedding for the new request and respective embeddings of at least some of the demonstration examples.
[0010] In some implementations, selecting a first subset of the demonstration examples based on a similarity between the embedding for the new request and respective embeddings of at least some of the demonstration examples comprises: maintaining a respective centroid embedding for each of a plurality of clusters of the demonstration examples; selecting a cluster from the plurality of clusters having a respective centroid embedding that is most similar to the embedding for the new request; and selecting the first subset of demonstration examples from the demonstration examples in the selected cluster.
[0011] In some implementations, selecting the first subset of demonstration examples from the demonstration examples in the selected cluster comprises: selecting a specified number of demonstration examples from the demonstration examples in the selected cluster having respective embeddings that are most similar to the embedding for the new request.
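The centroid-based selection described above can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and a simple in-memory cluster layout; the names `cosine`, `select_first_subset`, and the toy embeddings are hypothetical and are not taken from the specification.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_first_subset(query_emb, clusters, k):
    """Return the k examples most similar to the request within the closest cluster.

    clusters: list of dicts with a 'centroid' embedding and a list of
    'examples' given as (embedding, example) pairs. This layout is hypothetical.
    """
    # Pick the cluster whose centroid embedding is most similar to the request.
    best = max(clusters, key=lambda c: cosine(query_emb, c["centroid"]))
    # Within that cluster, rank examples by similarity and keep the top k.
    ranked = sorted(best["examples"],
                    key=lambda pair: cosine(query_emb, pair[0]),
                    reverse=True)
    return [example for _, example in ranked[:k]]

# Toy two-dimensional embeddings, for illustration only.
clusters = [
    {"centroid": [1.0, 0.0],
     "examples": [([0.9, 0.1], "ex_a"), ([1.0, 0.0], "ex_b"), ([0.6, 0.4], "ex_c")]},
    {"centroid": [0.0, 1.0],
     "examples": [([0.1, 0.9], "ex_d")]},
]
first_subset = select_first_subset([1.0, 0.05], clusters, k=2)
print(first_subset)
```

Comparing the request against centroids first means only one cluster's examples need to be scored individually, which keeps the lookup cost low as the cache grows.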
[0012] In some implementations, selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request further comprises: processing an input representing the new request and the demonstration examples in the first subset using a proxy neural network to generate a respective predicted score for each demonstration example that represents a utility of the demonstration example in responding to the new request; and selecting the subset of demonstration examples based on the predicted scores for the demonstration examples in the first subset.
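As a rough illustration of this second selection stage, the sketch below filters the first subset by a predicted utility score. The proxy neural network is replaced by a toy word-overlap scorer, and the `threshold` and `max_examples` parameters are assumptions made for the example; the specification only states that the subset is selected based on the predicted scores.

```python
def toy_proxy_score(request, example):
    """Stand-in for the proxy neural network: word overlap (Jaccard) in [0, 1]."""
    req_words = set(request.lower().split())
    ex_words = set(example["request"].lower().split())
    union = req_words | ex_words
    return len(req_words & ex_words) / len(union) if union else 0.0

def select_by_utility(request, candidates, proxy_score, threshold=0.5, max_examples=2):
    """Keep the highest-scoring candidates whose predicted utility meets the threshold."""
    scored = [(proxy_score(request, ex), ex) for ex in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in kept[:max_examples]]

candidates = [
    {"request": "sort a list in python", "response": "Use sorted(items)."},
    {"request": "reverse a python list", "response": "Use items[::-1]."},
    {"request": "bake a cake", "response": "Preheat the oven."},
]
subset = select_by_utility("sort a python list", candidates, toy_proxy_score)
print([ex["request"] for ex in subset])
```

In a real deployment the scorer would be a learned model; the point of the sketch is only that similarity retrieves candidates and a separate utility estimate decides which of them are worth including.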
[0013] In some implementations, the method further comprises: determining whether the new request matches any of the example requests in the demonstration examples in the cached data; and performing the receiving, selecting, selecting, and processing in response to determining that the new request does not match any of the example requests in the demonstration examples in the cached data.
[0014] In some implementations, the generative neural networks in the set of the plurality of generative neural networks each have a different computational cost.
[0015] In some implementations, selecting, based on the subset of demonstration examples and the new request, a generative neural network from a set of a plurality of generative neural networks for processing the new request comprises: processing an input representing the subset of demonstration examples and the new request using a request router model to generate an action output that identifies one of the generative neural networks in the set of generative neural networks.
[0016] In some implementations, the method further comprises: receiving a feedback signal that indicates (i) a computational cost of processing the new request using the selected generative neural network and (ii) a quality measure for the new response; combining the computational cost and the quality measure to generate a reward score; and updating the request router model using the reward score.
[0017] In some implementations, the request router model is a contextual bandit model.
[0018] In some implementations, the request router model has been trained by performing an initial bootstrapping training phase followed by an online adaptation phase.
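A minimal sketch of a request router of this kind follows. It assumes an epsilon-greedy contextual bandit with one linear reward model per generative neural network and a reward of the form quality minus weighted cost; the class `RouterBandit`, the feature vector, the `cost_weight` parameter, and the simulated feedback values are all hypothetical choices for illustration, not the specification's method.

```python
import random

def reward_score(quality, cost, cost_weight=0.5):
    """Combine a quality measure and a computational cost into one reward (assumed form)."""
    return quality - cost_weight * cost

class RouterBandit:
    """Epsilon-greedy contextual bandit with one linear reward model per arm."""

    def __init__(self, n_models, n_features, epsilon=0.1, lr=0.1, seed=0):
        self.weights = [[0.0] * n_features for _ in range(n_models)]
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, arm, features):
        """Predicted reward of routing a request with these features to this arm."""
        return sum(w * x for w, x in zip(self.weights[arm], features))

    def select(self, features):
        """Explore with probability epsilon, otherwise pick the best predicted arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        return max(range(len(self.weights)), key=lambda a: self.predict(a, features))

    def update(self, arm, features, reward):
        """One gradient step moving the arm's prediction toward the observed reward."""
        error = reward - self.predict(arm, features)
        self.weights[arm] = [w + self.lr * error * x
                             for w, x in zip(self.weights[arm], features)]

# Hypothetical features: a bias term and the similarity of the best demonstration example.
features = [1.0, 0.8]
router = RouterBandit(n_models=2, n_features=2, epsilon=0.0)

# Simulated feedback standing in for a bootstrapping phase:
# arm 0 is a cheaper model, arm 1 a costlier model with slightly higher quality.
for _ in range(50):
    router.update(0, features, reward_score(quality=0.90, cost=0.2))
    router.update(1, features, reward_score(quality=0.95, cost=1.0))

chosen = router.select(features)
print(chosen)
```

With these simulated rewards the router learns to prefer the cheaper model for this context, because the small quality gap does not offset the cost difference. Calling `update` on live feedback after each served request would correspond to the online adaptation phase.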
[0019] In some implementations, the method further comprises: updating the cache data using the new request and the new response.
[0020] In some implementations, updating the cache data using the new request and the new response comprises: adding data specifying a new demonstration example that comprises the new request and the new response to the cache data.
[0021] In some implementations, updating the cache data using the new request and the new response comprises: generating, from the new request and the new response, a synthetic demonstration example that comprises a synthetic request and a synthetic response; and adding data specifying the synthetic demonstration example to the cache data.
[0022] In some implementations, the method further comprises: determining that first criteria are satisfied for updating the cache data; and in response, generating a respective distilled response for each of one or more of the example requests and, for each distilled response, adding data specifying a new demonstration example that includes the corresponding example request and the distilled response to the cache data.
[0023] In some implementations, the first criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value.
[0024] In some implementations, the method further comprises: determining that second criteria are satisfied for updating the cache data; and in response, generating a respective expanded response for each of one or more of the example requests and, for each expanded response, adding data specifying a new demonstration example that includes the corresponding example request and the expanded response to the cache data.
[0025] In some implementations, the second criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value.
In some implementations, the cache data comprises, for a second subset of the demonstration examples, a respective internal model representation of the demonstration example.
[0026] In another aspect, a system comprises one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the respective method of any of the above aspects.
[0027] In another aspect, one or more computer storage media store instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any of the above aspects.
[0028] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.
[0029] Large language models (LLMs) and other generative neural networks have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. That is, serving LLMs incurs significant computation costs, i.e., both in terms of latency in responding to user requests and compute resources required to generate the responses.
[0030] Moreover, although a large fraction of user requests to LLMs are semantically similar to previously submitted requests, performing knowledge sharing among requests to mitigate these serving costs remains a challenge. For example, naively caching and reusing past responses leads to large quality degradation in outputs for the new requests, because even small differences in received requests can lead to large differences in expected responses for the requests.
[0031] This specification describes techniques that address these issues by making use of an in-context caching system that leverages historical requests as examples to guide response generation, enabling selective offloading of requests to more efficient LLMs. That is, the described techniques not only effectively select the most appropriate demonstration examples for any given received request, but also use the selected demonstration examples to effectively guide appropriate received requests to generative neural networks with lower computational costs, effectively decreasing serving costs.
[0032] Moreover, enabling this real-time knowledge transfer leads to intricate tradeoffs between response quality, latency, and system throughput at scale. That is, when a large number of requests need to be processed at any given time (“at scale”), the described techniques effectively balance response quality, latency, and system throughput in order to achieve high serving performance.
[0033] For a new request, the described techniques identify similar, high-utility examples and efficiently prepend them to the input for better response quality. Moreover, at scale, the described techniques adaptively route requests to LLMs of varying capabilities, accounting for response quality and serving loads. As another example, the described techniques can employ a cost-aware cache replay mechanism to improve example quality and coverage offline, maximizing cache utility and runtime efficiency.
[0034] As a particular example, evaluations on millions of realistic requests demonstrate that the described techniques improve LLM serving throughput by 1.4-5.9x and reduce latency by 28-71% without hurting response quality relative to an existing LLM serving system.
[0035] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
[0036] BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of an example model serving system.
[0037] FIG. 2 is a flow diagram of an example process for generating a response to a new request.
[0038] FIG. 3A is a flow diagram of an example process for selecting demonstration examples.
[0039] FIG. 3B shows an example of performing a two-stage selection process.
[0040] FIG. 4 is a flow diagram of an example process for selecting a generative neural network.
[0041] FIG. 5 shows an example of the request flow during operation of the system.
[0042] FIG. 6 shows an example of the performance of the described techniques.
[0043] FIG. 7 shows another example of the performance of the described techniques.
[0044] Like reference numbers and designations in the various drawings indicate like elements.
[0045] DETAILED DESCRIPTION
FIG. 1 is a diagram of an example model serving system 100. The model serving system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
[0046] The system 100 uses a set of generative neural networks 110A-N, e.g., language model neural networks, e.g., large language model neural networks (LLM), to respond to user requests.
[0047] In particular, the system 100 can efficiently perform serving, i.e., the process of processing new requests and returning responses to the new requests, using the set of generative neural networks 110A-N by making use of a cache.
[0048] More specifically, the system 100 maintains cache data 120 specifying a set of demonstration examples 122. Each demonstration example 122 includes a respective example request and a respective example response for the respective example request.
[0049] The system 100 then uses the demonstration examples 122 in the cache 120 to more efficiently generate responses 112 to user requests 102 while maintaining high response quality.
[0050] In particular, when the system 100 receives a new request 102, e.g., one that does not match any of the example requests already in the cache 120, the system 100 selects, from the demonstration examples 122 in the cache data 120 and using the new request 102, a subset 124 of demonstration examples for the new request 102.
[0051] The system 100 then selects, based on the subset 124 of demonstration examples and the new request 102, a generative neural network from the set of generative neural networks 110A-N for processing the new request 102. In the example of FIG. 1, the system has selected generative neural network 110A for processing the request 102.
[0052] Generally, the generative neural networks 110A-N each have a different computational cost. That is, processing a given input using different ones of the generative neural networks 110A-N results in different amounts of latency, different amounts of memory used, different amounts of processor cycles, and so on.
[0053] For example, the system 100 can be implemented on a user device, and one or more of the generative neural networks 110A-N may be deployed on the user device, while one or more others are deployed remotely, e.g., in the cloud. As a result, processing an input using a generative neural network deployed on the user device will result in less latency and less network bandwidth consumed than processing an input using a generative neural network deployed remotely. However, although processing using the remote neural network(s) results in higher latency, the remote neural networks can be larger neural networks as a result of the additional processing and memory resources available in the cloud. Thus, the remote neural network(s) can be expected to generate a higher quality output for at least some requests than the local generative neural network(s).
[0054] As another example, the system 100 can be deployed remote from the user device, e.g., in the cloud, and can have access to different neural networks 110A-N that have different computational costs.
[0055] As another example, the system 100 can be deployed on the user device and the different neural networks 110A-N can also be deployed on the user device but can have different computational costs.
[0056] As another example, two neural networks that are deployed on the same set of one or more devices can have different computational costs as a result of being deployed on different hardware. For example, one neural network can be deployed on a first number of hardware accelerators while the other neural network can be deployed on a smaller number of hardware accelerators or on general purpose hardware.
[0057] As another example, two neural networks that are deployed on the same set of one or more devices can have different computational costs as a result of one neural network having a higher capacity than the other neural network, i.e., one neural network having more parameters than the other, being able to handle more modalities of data, or both. For example, one neural network can have more layers, can include more attention heads within a given attention layer, can have a larger model dimension, and so on. That is, one neural network can have a different value for any of the aspects of the architecture of the generative neural networks described below that allows the neural network to have a higher capacity than the other generative neural network(s).
[0058] The system 100 processes the new request 102 using the selected one of the generative neural networks 110A-N to generate a new response 112 to the new request 102 and provides the new response 112 in response to the new request 102. In some implementations, the system 100 processes the new request and the selected subset of demonstration examples using the selected generative neural network to generate the new response 112. In some other implementations, the system 100 determines whether to include the selected subset of demonstration examples as part of the input to the generative neural network based on which generative neural network is selected. For example, the system 100 can include the demonstration examples as part of the input when a generative neural network with smaller capacity is selected, but can refrain from including the examples when a generative neural network with more capacity is selected.
For example, the system 100 can receive requests from users of user devices and can provide the responses to the corresponding user devices in response to the requests. As another example, the system 100 can receive requests from other systems through an application programming interface (API) or a model communication protocol and provide the responses to the other systems.
[0059] The generative neural networks in the set can be any appropriate neural network that receives as input a sequence of tokens and processes the sequence of tokens to generate an output sequence of tokens. A “token” is data that represents a unit of data, e.g., a text symbol or data of another modality, e.g., a portion of an image, audio signal, or video signal. For example, a “token” can be a one-hot vector or a dense embedding.
[0060] In some cases, the generative neural network is a language model neural network that processes tokens representing text symbols or a multi-modal language model neural network that can process tokens representing text symbols and tokens representing data of one or more other modalities, e.g., image, video, audio, and so on. As a particular example of this, the generative neural network can be an auto-regressive neural network that generates the tokens in the output sequence auto-regressively, i.e., one after another. One example of such a neural network is a decoder-only Transformer neural network. Examples of such neural networks include Gemini and Gemma.
[0061] The tasks performed by the system to generate responses can be any appropriate machine learning task. Some examples of tasks now follow.
[0062] For example, the machine learning task can be a text processing task.
[0063] A “text processing” task is any task that requires processing an input that includes a sequence of text, i.e., a sequence of text tokens, generating an output that includes a sequence of text tokens, or both.
[0064] The text tokens can be tokens selected from a vocabulary of text tokens that includes, e.g., one or more of characters, word pieces, words, punctuation marks, numerical symbols, or any other text symbols.
[0065] For example, the text processing task can be a text rewriting task that requires processing an input text sequence to generate an output text sequence that is a rewritten version of the input text sequence.
[0066] For example, one text rewriting task can be to generate an output text sequence that is a more formal version of the input text sequence but that conveys the same semantic meaning.
As another example, one text rewriting task can be to generate an output text sequence that is a shorter version of the input text sequence but that conveys the same semantic meaning.
[0067] As another example, one text rewriting task can be to generate an output text sequence that is a more elaborate version of the input text sequence but that conveys the same semantic meaning.
[0068] As another example, one text rewriting task can be to generate an output text sequence that is a paraphrased version of the input text sequence, i.e., one that uses different words from the input text sequence but that conveys the same semantic meaning.
[0069] As another example, one text rewriting task can be to generate an output text sequence that is a proofread version of the input text sequence, i.e., one that corrects grammar and spelling mistakes in the input text sequence.
[0070] As another example, the text processing task can be a task that requires generating an output text sequence that is a completion of an input text sequence.
[0071] As another example, the text processing tasks can include a task that requires generating an output text sequence that is an answer to or a response to a query posed by the input text sequence. For example, the inference system can be deployed as part of a “chat bot” or dialog system that responds to queries posed by users.
[0072] As another example, the text processing task can be a text classification task, e.g., a task that requires classifying an input sequence of text into one of multiple categories.
[0073] Examples of such tasks include entailment tasks, textual similarity tasks, sentiment tasks, grammaticality tasks, and so on.
[0074] As another example, the task can be a computer code generation task, where the input is a sequence of text describing the functionality of a piece of computer code, or a sequence of computer code to be modified or completed, or both, and the output is a sequence of computer code that modifies the computer code, that has the functionality that is described by the sequence of text, or both.
[0075] As another example, the task can be a computer code understanding task, where the input is a sequence of computer code, and the output characterizes the sequence of computer code, e.g., summarizes the function of the code, describes review comments on the code, and so on.
[0076] As yet another example, the task can be an image processing task, e.g., a task that requires processing an input sequence that includes one or more tokens representing an image, e.g., generated by processing the image using a pre-trained encoder neural network.
Examples of such tasks include image captioning, e.g., where the input represents an image and the output is a natural language text caption for the image, visual question-answering, where the input includes a text question about an image and tokens representing the image and the output includes a natural language answer to the question, and so on.
[0077] In some cases, the task can be a multi-modal task that requires processing, generating, or both tokens of multiple different modalities, e.g., two or more of text, images, video, audio, or other sensor data.
[0078] A more detailed description of examples of generative neural network architectures and tasks that can be performed by the system 100 is provided below.
[0079] FIG. 2 is a flow diagram of an example process 200 for generating a response to a new request. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a model serving system, e.g., the model serving system 100 depicted in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.
[0080] The system maintains cache data specifying a set of demonstration examples (step 202). Each demonstration example includes a respective example request and a respective example response for the respective example request.
[0081] In some cases, the cache data represents the demonstration examples in their original form, e.g., as plain text or as input tokens to the generative neural network. In some cases, for at least some of the demonstration examples, the cache data can instead or in addition include a respective internal model representation of the demonstration example. The respective internal model representation can include intermediate outputs generated by one of the generative neural networks in the set by processing the demonstration example. As one example, the internal model representation can be the keys and values (KVs) of the attention heads of one or more of the attention layers of the generative neural network.
[0082] The system receives a new request (step 204).
[0083] The system selects, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request (step 206).
[0084] Selecting the subset of demonstration examples will be described in more detail below with reference to FIGS. 3A and 3B.
[0085] The system selects, based on the subset of demonstration examples and the new request, a generative neural network from a set of a plurality of generative neural networks for processing the new request (step 208). That is, the system selects the generative neural network based not only on the new request, but also based on the demonstration examples that the system has determined will be most useful in responding to the new request.
The subset is generally a proper subset of, i.e., less than all of, the demonstration examples in the cache data.
[0086] Selecting a generative neural network will be described in more detail below with reference to FIG. 4.
[0087] The system processes the new request using the selected generative neural network to generate a new response to the new request (step 210). In some cases, for any generative neural network in the set, the system also processes the selected subset of demonstration examples as part of the input to the selected generative neural network. In some other cases, the system only includes the subset of demonstration examples as part of the input when the selected generative neural network is one of a predetermined subset of generative neural networks from the set. For example, smaller models may benefit more from additional context while larger models may at times be harmed by including the additional context. In this example, the system can include the subset of demonstration examples as part of the input when the selected generative neural network is one of a predetermined subset of smallest generative neural networks from the set.
[0088] The system provides the new response in response to the new request (step 212). In some cases, the system can perform steps 204-210 only responsive to determining that the new request does not match any of the example requests in the demonstration examples in the cached data. That is, when a new request is received, the system can determine whether the new request matches any of the example requests in the demonstration examples in the cached data and only perform the receiving, selecting, selecting, and processing in response to determining that the new request does not match any of the example requests in the demonstration examples in the cached data. If the new request does match one of the requests in the demonstration examples, the system can provide the example response to the matching example request in response to the new request, without needing to perform the computationally expensive steps described above.
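The request flow of process 200 can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the specification's implementation: the function names `serve` and `build_prompt`, the prompt layout, the exact-string match test, and the stand-in selector, router, and model callables are all hypothetical. The sketch also appends each new request and response to the cache, which corresponds to only one of the cache-update options described in this specification.

```python
def build_prompt(request, demos):
    """Prepend demonstration examples to the request (hypothetical prompt layout)."""
    parts = [f"Request: {d['request']}\nResponse: {d['response']}" for d in demos]
    parts.append(f"Request: {request}\nResponse:")
    return "\n\n".join(parts)

def serve(request, cache, select_examples, route, models, demo_models):
    """Answer a request from the cache if it matches, otherwise route it to a model."""
    # Exact-match check: reuse the cached response and skip the expensive steps.
    for example in cache:
        if example["request"] == request:
            return example["response"]
    # Select demonstration examples for the new request.
    demos = select_examples(request, cache)
    # Select a generative neural network based on the request and the examples.
    model_name = route(request, demos)
    # Include the examples only for models in the predetermined subset.
    prompt = build_prompt(request, demos if model_name in demo_models else [])
    # Generate the new response.
    response = models[model_name](prompt)
    # Optional cache update with the new request and response.
    cache.append({"request": request, "response": response})
    return response

# Stand-in components, for illustration only.
calls = []

def small_model(prompt):
    calls.append(prompt)
    return "small-answer"

def large_model(prompt):
    calls.append(prompt)
    return "large-answer"

cache = [{"request": "hi", "response": "hello"}]
models = {"small": small_model, "large": large_model}

def pick_first(request, cache):
    return cache[:1]

def route_by_demos(request, demos):
    return "small" if demos else "large"

r1 = serve("hi", cache, pick_first, route_by_demos, models, {"small"})
r2 = serve("hey", cache, pick_first, route_by_demos, models, {"small"})
r3 = serve("hey", cache, pick_first, route_by_demos, models, {"small"})
print(r1, r2, r3, len(calls))
```

The third call is answered from the cache because the second call stored its response, so only one model invocation occurs across the three requests.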
[0089] In some implementations, the cache data is static once the system begins serving the generative neural networks.
[0090] In some other implementations, the system continues updating the cache data as new responses are generated.
[0091] In these implementations, for at least some requests, the system can update the cache data using the new request and the new response.
For example, the system can update the cache data by adding data specifying a new demonstration example that includes the new request and the new response to the cache data.
[0092] As another example, the system can generate, from the new request and the new response, a synthetic demonstration example that includes a synthetic request and a synthetic response and then add data specifying the synthetic demonstration example to the cache data. For example, the system can generate the synthetic demonstration example using example distillation, example expansion, or both. These will be described in more detail below. As another example, rather than storing the new request and new response directly, the system can generate a differentially private (DP) synthetic example and store the synthetic example in the cache data.
[0093] In some implementations, when certain criteria are satisfied, the system can optimize the demonstration examples already in the cache data. This is also referred to as implementing a cost-aware cache replay mechanism.
[0094] For example, the system can delete certain examples from the cache data, e.g., when the cache data reaches a maximum size. The system can remove, e.g., the oldest examples in the cache, the least frequently selected examples in the cache, or examples identified by any other appropriate criterion.
[0095] As a particular example of this, the system can employ an online cache management policy that evicts low-utility examples. To do this, the system can treat each example as an item with a weight (its cache size, such as plaintext length) and a value (representing the achievable efficiency gain by including the example in the cache). The objective is to maximize the total value, i.e., the cumulative efficiency gains from caching the selected examples, subject to the maximum size of the cache. The solution yields a binary caching decision for each example: whether to retain it in the cache or evict it. For example, the efficiency gain of an example can be measured by the number of successful offloadings it enables, i.e., the number of requests for which the example was selected and that are routed to a smaller neural network in the set of generative neural networks. To adapt to changing request patterns over time, the system can maintain a moving average of this gain, e.g., by applying a decay factor of, e.g., 0.8, 0.9, or 0.95 every hour or other unit of time that passes to emphasize recent usage while gradually discounting stale patterns. This one-dimensional knapsack problem can be solved efficiently using a knapsack solver. The system can run the solver periodically in the background or whenever the memory limit is approached, ensuring that cache optimization does not interfere with online serving.
As yet another example, the system can store KVs for only a subset of the examples in the cache. Storing KVs for a given example reduces the computational cost of processing an input that includes the example but increases the storage cost of storing the example in the KV cache. In this example, the system can, given a fixed memory budget, determine which examples to cache in KV cache format by periodically (e.g., at intervals, when the maximum memory size is reached, and so on) solving the knapsack problem.
Here, each example is treated as an item with a weight (its KV cache size, i.e., a value that defines the amount of memory that the KVs for the example consume, e g., measured in total number of KVs or in amount of memory required to store the KVs) and a value (the latency savings achieved once repurposing relative to storing only the original form of the example). Here, repurposing refers to including the example in an input to a generative neural network. The objective is to maximize the total value, i.e., the cumulative latency savings from caching the selected examples. The output of this optimization problem is a binary' caching decision for each example: whether to cache it using KVs or not. The total latency savings of caching an example’s KV cache may be defined, for example, as
[0096] $(\text{compute\_latency} - \text{IO\_latency}) \times \text{repurposing\_freq}$, where compute_latency measures the compute latency savings from caching the example, IO_latency measures the input / output latency savings from caching the example, so that (compute_latency - IO_latency) measures the total latency savings, and repurposing_freq measures how frequently the example is selected for inclusion in inputs. Both compute and IO latencies can be accurately estimated, as they are linear with respect to the input length and independent of the content. To account for temporal variations in repurposing patterns, the system can use a moving average of the repurposing frequency in place of the absolute repurposing frequency. As one example, this value can be updated by applying a decay factor of, e.g., 0.8, 0.9, or 0.95 to the objective every hour or other unit of time that passes, ensuring that recent trends have greater influence while gradually discounting older usage patterns. Like the above, this one-dimensional knapsack problem can be solved efficiently using a knapsack solver. The system can run the solver periodically in the background or whenever the memory limit is approached, ensuring that cache optimization does not interfere with online serving.
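The two knapsack formulations above can be illustrated with a minimal Python sketch. It is not the specification's solver: the names `CacheItem`, `decay`, `kv_latency_savings`, and `solve_knapsack` are hypothetical, weights are assumed to be integers (e.g., memory blocks or characters), and a textbook dynamic program stands in for "a knapsack solver".

```python
from dataclasses import dataclass


@dataclass
class CacheItem:
    name: str
    weight: int   # assumed integer size, e.g., KV memory blocks or plaintext length
    value: float  # decayed efficiency gain or latency savings


def decay(value: float, factor: float = 0.9, periods: int = 1) -> float:
    """Apply the decay factor once per elapsed period (e.g., per hour)."""
    return value * factor ** periods


def kv_latency_savings(compute_latency: float, io_latency: float,
                       repurposing_freq: float) -> float:
    """(compute_latency - IO_latency) x repurposing_freq, as in the text."""
    return (compute_latency - io_latency) * repurposing_freq


def solve_knapsack(items: list[CacheItem], budget: int) -> set[str]:
    """Textbook 0/1 knapsack; returns the names of the items to keep."""
    # best[i][b] = max total value using the first i items within budget b.
    best = [[0.0] * (budget + 1) for _ in range(len(items) + 1)]
    for i, item in enumerate(items, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if item.weight <= b:
                candidate = best[i - 1][b - item.weight] + item.value
                if candidate > best[i][b]:
                    best[i][b] = candidate
    # Backtrack to recover the binary keep/evict decision per item.
    keep, b = set(), budget
    for i in range(len(items), 0, -1):
        if best[i][b] != best[i - 1][b]:
            keep.add(items[i - 1].name)
            b -= items[i - 1].weight
    return keep
```

Any example whose name is not in the returned set would be evicted (or stored without KVs); the solver can be rerun in the background whenever the decayed values have changed.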
[0097] As another example, the system can make use of example distillation, example expansion, or both.
[0098] When making use of example distillation, the system can determine that a set of criteria are satisfied for updating the cache data and, in response, generate a respective distilled response for each of one or more of the example requests and, for each distilled response, add data specifying a new demonstration example that includes the corresponding example request and the distilled response to the cache data. For example, the system can determine that these criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value. Generating a distilled response is described in more detail below.
[0099] When making use of example expansion, the system can determine that a set of criteria are satisfied for updating the cache data and, in response, generate a respective expanded response for each of one or more of the example requests and, for each expanded response, add data specifying a new demonstration example that includes the corresponding example request and the expanded response to the cache data. For example, the system can determine that these criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value. Generating an expanded response is described in more detail below.
[0100] Updating the cache is described in more detail below with reference to FIG. 5.
[0101] FIG. 3A is a flow diagram of an example process 300 for selecting a subset of demonstration examples for a new request. As described above, the subset is generally a proper subset of the demonstration examples in the cache data. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the model serving system 100 depicted in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.
[0102] The system generates an embedding for the new request (step 302). For example, the system can process the new request using an appropriate pre-trained embedding neural network, e.g., a text embedding neural network, an image embedding neural network, a multi-modal embedding neural network, and so on.
[0103] The system can also have pre-computed, using the embedding neural network or a different embedding neural network that has been trained jointly with the above embedding neural network, a respective embedding of each demonstration example in the cache.
[0104] The system selects a subset (also referred to as a “first subset”) of the demonstration examples based on a similarity between the embedding for the new request and respective embeddings of at least some of the demonstration examples (step 304). The system can measure similarity using any appropriate similarity measure between two embeddings, e.g., cosine distance, L2 distance, and so on.
For example, the system can compute a respective similarity measure between the embedding for the new request and the respective embeddings for each of the demonstration examples and then select a specified number of demonstration examples having respective embeddings that are most similar to the embedding for the new request.
[0105] However, when there are a large number of demonstration examples, computing similarities between the new request and all of the demonstration examples in the cache may not be feasible or may incur excessive latency.
[0106] Instead, the system can maintain a respective centroid embedding for each of a plurality of clusters of the demonstration examples. That is, the system can maintain, for each cluster, an embedding that is the centroid of the embeddings of the demonstration examples in the cluster. For example, the system can have clustered the demonstration examples using any appropriate technique applied to the embeddings of the demonstration examples, e.g., k-means clustering, spectral clustering, and so on.
[0107] The system can then select a cluster from the plurality of clusters having a respective centroid embedding that is most similar to the embedding for the new request according to the similarity measure and then select the first subset of demonstration examples from the demonstration examples in the selected cluster. For example, the system can select a specified number of demonstration examples from the demonstration examples in the selected cluster having respective embeddings that are most similar to the embedding for the new request.
[0108] Thus, when there are K clusters that each have N / K members, the system can effectively select the subset of demonstration examples by computing only K + N / K similarities even though there are N total demonstration examples.
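As an illustration of the cluster-based lookup, the following Python sketch picks the nearest centroid and then ranks only that cluster's members. The function names and the `(example_id, embedding)` data layout are hypothetical, cosine similarity is just one of the permitted similarity measures, and the centroids are recomputed here for brevity whereas the system described above would maintain them.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_first_subset(query, clusters, num_examples):
    """clusters: list of clusters, each a list of (example_id, embedding)."""
    centroids = [centroid([emb for _, emb in members]) for members in clusters]
    # Stage 1: K comparisons, one per centroid.
    best = max(range(len(clusters)),
               key=lambda k: cosine_similarity(query, centroids[k]))
    # Stage 2: roughly N / K comparisons inside the selected cluster only.
    ranked = sorted(clusters[best],
                    key=lambda member: cosine_similarity(query, member[1]),
                    reverse=True)
    return [example_id for example_id, _ in ranked[:num_examples]]
```

With N = 1,000,000 cached examples and K = 1,000 clusters of equal size, this amounts to about 2,000 similarity computations per request instead of 1,000,000.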
[0109] In some cases, the system uses this first subset as the final subset.
[0110] However, making the final selection solely based on the similarities may yield suboptimal results. For example, the relevance of an example may have a relatively weak correlation with its actual helpfulness in generating a high-quality response. This is because relevance-based selection fails to consider model-specific capabilities and example quality, leading to biased utility estimation. For example, examples with poor response quality or those a smaller model already excels at offer little quality improvement and can even be detrimental to generation, all while adding unnecessary overhead. Moreover, while relevance can enrich response details, overall response quality depends on a broader set of factors, such as accuracy, depth, and creativity, which extend beyond relevance.
In these cases, the system makes use of a proxy neural network to further filter the subset of demonstration examples. The proxy neural network is a neural network that is configured to process an input representing (i) the new request and (ii) a demonstration example to generate a predicted score for the demonstration example that represents a utility of the demonstration example in responding to the new request. For example, the input representing the new request and the demonstration example can include the new request and the demonstration example or can include the respective embeddings of the new request and the demonstration example.
[0111] The proxy neural network can generally be any appropriate neural network that can map the input representing the new request and the demonstration example to a predicted score. As a particular example, the proxy neural network can be a computationally efficient self-attention or recurrent neural network.
[0112] The system can train the proxy neural network on training examples that each include (i) an input representing a request and a demonstration example and (ii) a target score that represents a quality measure assigned to a response generated by a generative neural network by processing the request. This ensures the model is aware of the example’s end-to-end quality in improving the response. The system can train the proxy model offline, outside the critical path of online serving, and deploy the model online to predict the pairwise helpfulness of each example relative to the new request.
[0113] In particular, in these cases, the system can process an input representing the new request and the demonstration examples in the first subset using the proxy neural network to generate a respective predicted score for each demonstration example that represents a utility of the demonstration example in responding to the new request (step 306).
[0114] The system can then select the subset of demonstration examples based on the predicted scores for the demonstration examples in the first subset (step 308). For example, the system can select a specified number of examples with the highest predicted scores as the final subset of demonstration examples or can select each example that has a score that exceeds a threshold.
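The second-stage filtering in steps 306 and 308 might look like the following sketch. The proxy neural network is stood in for by a hypothetical callable `score_fn` (with the new request already bound), and `filter_by_utility`, `top_k`, and `threshold` are illustrative names rather than anything defined in this specification.

```python
def filter_by_utility(first_subset, score_fn, top_k=None, threshold=None):
    """Keep examples by predicted utility: above a threshold, top-k, or both."""
    scored = [(score_fn(example), example) for example in first_subset]
    if threshold is not None:
        scored = [(s, ex) for s, ex in scored if s > threshold]
    # Highest predicted utility first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if top_k is not None:
        scored = scored[:top_k]
    return [example for _, example in scored]
```

Passing only `top_k` gives the "specified number of examples" variant, while passing only `threshold` gives the "score exceeds a threshold" variant.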
[0115] FIG. 3B shows an example 350 of the two-stage demonstration example selection process. In particular, as shown in the example 350, the system selects the subset of demonstration examples by performing four steps.
[0116] In step 1, the system embeds the input (“new”) request.
[0117] In step 2, the system identifies similar examples, i.e., using the similarity as described above with reference to step 304.
In step 3, the system calculates the utility of each selected similar example using the proxy neural network. The “utility” is referred to as the “predicted score” of the demonstration examples above with reference to FIG. 3A.
[0118] In step 4, the system selects examples using the utilities of the selected similar examples. For example, the system can select a specified number of examples with the highest utilities as the final subset of demonstration examples.
[0119] Thus, as shown in the example 350, the system considers both the relevance of the demonstration examples and the predicted usefulness or helpfulness of the demonstration examples when included as part of an input when responding to the new request.
[0120] FIG. 4 is a flow diagram of an example process 400 for selecting a generative neural network for a new request. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the model serving system 100 depicted in FIG. 1, appropriately programmed in accordance with this specification, can perform the process 400.
[0121] The system selects a subset of demonstration examples for the new request (step 402), e.g., as described above with reference to FIGS. 3A and 3B.
[0122] The system processes an input representing the subset of demonstration examples and the new request using a request router model to generate an action output that identifies one of the generative neural networks in the set of generative neural networks (step 404).
[0123] For example, the input representing the new request and the demonstration examples can include the new request and the demonstration examples or can include the respective embeddings of the new request and the demonstration examples.
[0124] For example, the action output can include a respective score (“logit”) for each of the generative neural networks and the system can select a generative neural network from the set using the respective scores, e.g., by selecting the highest-scoring generative neural network or by sampling a generative neural network in accordance with the scores. In some implementations, as described below, the system can apply a bias on these scores that is based on the current system load, so that smaller models are favored when system load is excessive.
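A minimal sketch of turning the per-model scores into a routing decision follows. The function names are hypothetical, and converting the logits to sampling probabilities with a softmax is an assumption; the text only requires sampling "in accordance with the scores".

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_model(scores, sample=False, rng=None):
    """Return the index of the selected generative neural network."""
    if not sample:
        # Greedy: pick the highest-scoring model.
        return max(range(len(scores)), key=lambda i: scores[i])
    rng = rng or random.Random()
    # Stochastic: sample a model with probability softmax(scores).
    return rng.choices(range(len(scores)), weights=softmax(scores), k=1)[0]
```

For instance, `choose_model([0.2, 1.5])` returns index 1, i.e., the second model in the set.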
[0125] Making use of a request router model that takes as input both the request and the selected examples allows the system to effectively route between the generative neural networks in the set. As a particular example, with well-chosen examples, small generative neural networks can generate high-quality responses, enabling them to offload requests from larger, more expensive counterparts. However, overly aggressive offloading to small generative neural networks risks degrading response quality, while a too conservative approach constrains efficiency improvement. An effective routing strategy must balance the achievable response quality across different models with the current serving load. Moreover, the router must efficiently adapt to evolving data distributions and model characteristics, such as emerging request trends and shifts in example utility.
[0126] Consequently, relying on fixed classifiers or heuristics, such as training a request classifier, becomes infeasible due to the non-stationarity and high volume of requests.
[0127] To account for this, the request router model can be a contextual bandit model. The contextual bandit based formulation of the router enables dynamic adaptation to changing request patterns. For example, during high traffic, the system can prioritize a smaller model to manage resources efficiently. However, when shifts in query distribution are detected, the system can favor the larger model to maintain accuracy.
[0128] In more detail, a contextual bandit is, at time $t$, provided a context $c_t \in C$ as input, and a set of permissible actions $A_t$. It then chooses an action $a_t \in A_t$ and receives feedback $y_t \in \mathbb{R}^k$ indicating action quality with respect to $k > 1$ metrics. In this instance, the context is a representation of the new request and the retrieved examples. The action space is a binary decision of which generative neural network to route the request to, e.g., whether to route the request to a smaller model or a larger model. The feedback is generated as the output of a composite reward function that combines, e.g., as a weighted sum, the relative cost and quality between model responses when routing actions are made. In some implementations, the system can add an amplified penalty term when the pair-wise quality difference is large to penalize overly poor responses.
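One way such a composite reward could be written is sketched below. Every parameter here (`cost_weight`, `gap_threshold`, `penalty_scale`) and the specific way of forming relative cost and quality against a baseline model are assumptions chosen for illustration; the specification does not fix them.

```python
def composite_reward(quality, baseline_quality, cost, baseline_cost,
                     cost_weight=0.5, gap_threshold=0.3, penalty_scale=2.0):
    """Weighted sum of relative quality and relative cost, plus a penalty."""
    relative_quality = quality - baseline_quality  # negative if worse
    relative_cost = cost / baseline_cost           # below 1 if cheaper
    reward = (1.0 - cost_weight) * relative_quality - cost_weight * relative_cost
    # Amplified penalty only when the quality shortfall is large.
    shortfall = baseline_quality - quality
    if shortfall > gap_threshold:
        reward -= penalty_scale * (shortfall - gap_threshold)
    return reward
```

Under these assumed settings, a small quality loss at a quarter of the cost is penalized only mildly, while a large quality gap triggers the additional penalty and makes the same routing action clearly unattractive.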
[0129] Generally, the request router model has been trained by performing an initial bootstrapping training phase in which the request router model is trained offline. Optionally, this can be followed by an online adaptation phase.
[0130] When the process 400 is being performed during the online adaptation phase, the system receives a feedback signal that indicates (i) a computational cost of processing the new request using the selected generative neural network and (ii) a quality measure for the new response (step 406). In some cases, the cost and quality can be relative measures of cost and quality, i.e., the cost or quality relative to processing the response using one or more other generative neural networks in the set. The quality of a given response can be determined based on the outputs of a reward model or based on user feedback.
The computational cost can be computed based on, e.g., the latency of generating the response, the amount of FLOPS required to generate the response, or any appropriate set of one or more computational cost measures.
[0131] For example, the quality measure can be a binary quality measure, e.g., zero or one, or a continuous valued quality score.
[0132] The system combines the computational cost and the quality measure to generate a reward score (step 408). For example, the system can compute a weighted sum of the computational cost and the quality measure.
[0133] The system updates the request router model using the reward score (step 410). The system can perform this updating using any appropriate online adaptation technique.
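Steps 406 to 410 can be sketched as follows. Because the text permits "any appropriate online adaptation technique", the choice below (one linear reward estimator per model, epsilon-greedy selection, and a single stochastic gradient step per feedback signal) is only one hypothetical option, and the class and parameter names are not from the specification.

```python
import random


def reward_score(quality, cost, quality_weight=1.0, cost_weight=0.5):
    """Step 408: weighted sum in which cost counts against the reward."""
    return quality_weight * quality - cost_weight * cost


class LinearBanditRouter:
    """One linear reward estimator per model; epsilon-greedy selection."""

    def __init__(self, num_models, context_dim, lr=0.1, epsilon=0.1, seed=0):
        self.weights = [[0.0] * context_dim for _ in range(num_models)]
        self.lr = lr
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def scores(self, context):
        return [sum(w * x for w, x in zip(ws, context)) for ws in self.weights]

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.weights))
        s = self.scores(context)
        return max(range(len(s)), key=lambda i: s[i])

    def update(self, context, action, reward):
        # Step 410: one SGD step on squared error, for the chosen model only.
        error = reward - self.scores(context)[action]
        self.weights[action] = [w + self.lr * error * x
                                for w, x in zip(self.weights[action], context)]
```

In use, the context vector would be derived from the new request and the selected examples, and `update` would be called once the cost and quality feedback for the routed request arrives.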
[0134] Thus, the contextual bandit based formulation of the router enables dynamic adaptation to changing request patterns. During high traffic, the system prioritizes the smaller model to manage resources efficiently. However, when shifts in query distribution are detected, it favors the larger model to maintain accuracy.
[0135] In some implementations, to handle load fluctuations, the request router incorporates a load-aware biasing strategy. Specifically, the system can track the Exponential Moving Average (EMA) of the system serving load over time. When the EMA remains below a target operational threshold (e.g., the service capacity of large models), the router prioritizes response quality. In contrast, when the EMA exceeds the operational threshold, the router triggers a feedback controller to compute a corrective bias. This bias can be calculated using the hyperbolic tangent (tanh) function or another appropriate function, applied to the positive load deviation (i.e., current load - threshold). The resulting bias adjusts the bandit’s output logits, reducing the selection scores of high-cost models and favoring more efficient, lower-cost alternatives to relieve system pressure. This design offers several advantages: the tanh function provides a smooth, saturating response, enhancing stability by preventing unbounded bias values, and the bias is only active during actual overload conditions.
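The load-aware biasing can be sketched in a few lines of Python. The class name, the smoothing constant `alpha`, the `max_bias` and `scale` parameters, and the per-model `relative_costs` vector (assumed to be 1.0 for the most expensive model and 0.0 for the cheapest) are all hypothetical choices made for illustration.

```python
import math


class LoadAwareBias:
    def __init__(self, threshold, alpha=0.2, max_bias=5.0, scale=1.0):
        self.threshold = threshold  # target operational threshold
        self.alpha = alpha          # EMA smoothing constant
        self.max_bias = max_bias    # saturation level of the bias
        self.scale = scale          # load deviation at which tanh starts to saturate
        self.ema = None

    def observe(self, load):
        """Update the exponential moving average of the serving load."""
        if self.ema is None:
            self.ema = load
        else:
            self.ema = self.alpha * load + (1.0 - self.alpha) * self.ema
        return self.ema

    def bias(self):
        """Zero below the threshold; smooth and bounded above it."""
        current = self.ema if self.ema is not None else 0.0
        deviation = max(0.0, current - self.threshold)
        return self.max_bias * math.tanh(deviation / self.scale)

    def adjust(self, logits, relative_costs):
        """Lower each model's logit in proportion to its relative cost."""
        b = self.bias()
        return [logit - b * cost for logit, cost in zip(logits, relative_costs)]
```

Because the bias is applied to the logits after the router has produced them, the router's parameters are untouched, and the bias vanishes as soon as the EMA falls back below the threshold.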
[0136] Crucially, this lightweight control mechanism adjusts routing preferences without modifying or retraining the underlying request router, effectively decoupling overload management from the core routing logic. The persistent magnitude of this applied bias can be used as a signal for infrastructure auto-scaling.
[0137] FIG. 5 shows an example 500 of the operation of the system. As shown in the example 500, when a new request, e.g., one of requests 1-n, is received, the system performs either step 1a or 1b, depending on whether a matching example exists in the example cache.
If a match exists, the system performs step 1b and directly returns the response in the matching example to the requesting user as the response to the new request.
[0138] If a match does not exist, the system performs step 1a and leverages the cache to retrieve a set of examples (by performing the operations of an “example retriever”).
[0139] The system then uses a request router to determine which generative neural network (which “LLM”) to route the request to. Depending on which generative neural network is selected, the system performs either step 2a or step 2b. At step 2a, the system provides the request and the selected examples to the selected generative neural network to generate a response. At step 2b, the system provides only the request (and not the selected examples) to the selected generative neural network to generate a response. As described above, while in the example 500 only one of the models receives the selected examples, in other implementations, all of the models in the set would receive the selected examples.
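The control flow of steps 1a/1b and 2a/2b can be summarized in the following sketch. The `serve` function, its callable arguments, and the `uses_examples` set (naming which models receive the selected examples) are hypothetical stand-ins, and exact-match lookup in a dictionary is an assumed way of detecting a "matching example".

```python
def serve(request, cache, retrieve, route, models, uses_examples):
    """Return a response for `request` following the flow of example 500."""
    # Step 1b: a matching example exists, so return its cached response.
    if request in cache:
        return cache[request]
    # Step 1a: retrieve demonstration examples from the cache.
    examples = retrieve(request)
    # Route to one of the generative neural networks by name.
    name = route(request, examples)
    if name in uses_examples:
        # Step 2a: the selected model receives the request and the examples.
        return models[name](request, examples)
    # Step 2b: the selected model receives only the request.
    return models[name](request, [])
```

In implementations where every model in the set receives the examples, `uses_examples` would simply contain all model names.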
[0140] The system then returns the response to the requesting user.
[0141] In some cases, the system can obtain user feedback, e.g., a binary like / dislike response or other feedback, indicating a quality of the generated response. The system can then use this feedback to perform step 4, in which the system updates the cache data.
[0142] An example of updating the cache data now follows.
[0143] In the example 500, the system optimizes the collective utility of all examples with two strategies: (i) improving the quality of each request-response pair to refine its helpfulness when repurposed, which ensures the cache retains only high-quality examples, and (ii) maximizing overall example coverage to ensure examples complement one another, particularly when combined, so that new requests can find helpful examples.
[0144] To improve example quality, the system employs an example distillation replay process, which opportunistically queries the model, e.g., a designated generative neural network from the set, offline to generate multiple responses for the same example request and retains only the highest-quality response in the cache, e.g., as measured by the output of a reward model. This replay design is guided by several practical considerations. First, given the dynamic nature of workloads, off-peak hours provide an opportunity to replay and refine low-quality examples without introducing overhead during online serving. Second, LLMs inherently produce responses of varying quality due to the stochasticity in generation, e.g., the word (token) sampling strategies used in generating the next token. This variability allows the system to select the best response among multiple runs. Finally, practical serving deployments often generate multiple candidate responses for requests (e.g., in beam search or for user preference comparisons such as "Which response do you prefer?"). By leveraging these pre-existing candidates, example distillation introduces minimal additional effort.
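A best-of-n form of this distillation replay might be sketched as below. The `generate` and `reward` callables are hypothetical stand-ins for the designated generative neural network and the reward model, and keeping the currently cached response as one of the candidates is an assumption made so that a replay never lowers the stored quality.

```python
def distill_example(request, current_response, generate, reward, num_samples=4):
    """Replay a cached request and keep the highest-reward response."""
    # Sample several fresh responses; stochastic decoding makes them differ.
    candidates = [current_response]
    candidates += [generate(request) for _ in range(num_samples)]
    # Retain only the best response according to the reward model.
    return max(candidates, key=lambda response: reward(request, response))
```

Candidate responses that were already produced during serving (e.g., for preference comparisons) could be passed in instead of calling `generate`, which removes the extra generation cost.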
[0145] As each request often selects multiple in-context examples, ensuring cache coverage is important, especially when the request data distribution evolves rapidly, and fresh examples are scarce. It can also help in generating synthetic requests for better data governance. Similar to the aforementioned distillation replay, the system addresses this with an example expansion process to generate companion examples. Specifically, the system identifies examples with low semantic similarity to others in the cache and queries the model, e.g., a designated one of the generative neural networks in the set, to generate variations of the original request and responses during off-peak hours. This expansion maintains contextual relevance while introducing diversity. The expanded examples are then added to the cache, broadening its repository and improving the system’s ability to handle a wider array of incoming requests.
[0146] Example distillation and expansion are conducted opportunistically during off-peak hours, with their overhead amortized across many daily requests. Indeed, as examples are often accessed frequently per single model inference request, the relative cost is minimal, considering that a refined example reused hundreds of times incurs only around 1% amortized overhead.
[0147] In some cases, the system can further optimize this overhead by effectively determining which examples are prioritized for optimization and whether to optimize an example by distilling it for higher quality or expanding it for better coverage. In particular, the system can maximize the overall cache utility gains by optimizing the examples with larger potential gains in terms of offloading opportunities. Intuitively, when repurposing an example, the smaller the model to which the augmented request can be routed and the higher the response quality it achieves, the smaller the gain one can expect from further optimization. Therefore, the potential gain of optimizing an example e can be defined as $G(e) = \text{normalized\_model\_cost} \times (1 - \text{normalized\_response\_quality})$, where G(e) prioritizes examples that require using larger models, achieve lower quality, or are frequently selected (repurposed). Note that this multiplicative form represents the quality improvement per unit efficiency cost. As examples are selected, their potential gains accumulate, and the system maintains a moving average of G(e) that decays over time to account for data drifts.
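The accumulation and decay of the potential gain can be illustrated as follows. The `GainTracker` class and its method names are hypothetical, and applying the decay as a periodic multiplicative factor is one assumed way of realizing a moving average that decays over time.

```python
def potential_gain(normalized_model_cost, normalized_response_quality):
    """G(e) = normalized_model_cost x (1 - normalized_response_quality)."""
    return normalized_model_cost * (1.0 - normalized_response_quality)


class GainTracker:
    """Accumulates G(e) per example on each selection and decays it over time."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.gains = {}

    def record_selection(self, example_id, model_cost, response_quality):
        # Each time the example is repurposed, its potential gain accumulates.
        gain = potential_gain(model_cost, response_quality)
        self.gains[example_id] = self.gains.get(example_id, 0.0) + gain

    def tick(self):
        # Called once per time unit (e.g., hourly) to discount stale usage.
        for example_id in self.gains:
            self.gains[example_id] *= self.decay
```

An example that is routed to the largest model (cost 1.0) and yields a response of normalized quality 0.25 contributes a gain of 0.75 per selection, whereas one already served well by the smallest model contributes almost nothing.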
[0148] However, even after identifying examples with large G(e), deciding whether to distill (improving its quality) or expand (creating a companion example to improve coverage) remains non-trivial. Note that whether to expand an example depends on its contribution (coverage) to other examples. Therefore, the system can introduce a hypothetical companion example e+ for each existing example, with its potential gain G(e+) defined as the weighted average gains of the top-k semantically closest neighboring examples to the existing example, i.e., G(e+) is the weighted average of the gains G for the k most similar examples to the existing example. For example, the weight for each similar example i can be proportional to its similarity to e, so that $G(e^+) = \frac{\sum_{i \in \text{top-}k} s_{ie} \, G(i)}{\sum_{i \in \text{top-}k} s_{ie}}$, where $s_{ie}$ denotes the similarity between examples i and e. This design captures the fact that more similar examples can benefit more from coverage expansion, and is lightweight as the system already captures the potential gain of each example and clusters examples. As such, the system prioritizes examples with the highest potential gains G(e). If a hypothetical companion example (e+) is selected, the system performs expansion; otherwise, it performs distillation to improve the existing example's quality.
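The distill-versus-expand decision can be sketched as a single ranking over real examples and their hypothetical companions. This is an illustration under stated assumptions: the weights are taken to be the similarities normalized to sum to one, and `companion_gain`, `plan_optimizations`, and the `budget` parameter (how many optimizations to run in one off-peak window) are hypothetical names.

```python
def companion_gain(neighbors, k):
    """Similarity-weighted average gain of the k most similar neighbors.

    neighbors: list of (similarity to e, gain G(i)) pairs for other examples.
    """
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    total_similarity = sum(s for s, _ in top)
    if total_similarity <= 0:
        return 0.0
    return sum(s * g for s, g in top) / total_similarity


def plan_optimizations(gains, companion_gains, budget):
    """Rank examples and companions together; return (example_id, action)."""
    candidates = [(g, example_id, "distill") for example_id, g in gains.items()]
    candidates += [(g, example_id, "expand")
                   for example_id, g in companion_gains.items()]
    # Highest potential gain first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(example_id, action) for _, example_id, action in candidates[:budget]]
```

When the companion of an example ranks above the example itself, the plan calls for expansion around it; otherwise the existing example is distilled.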
[0149] FIG. 6 shows an example 600 of the performance of the described techniques. In particular, the example 600 shows the “win rate” of a smaller Gemma model relative to a larger Gemma model on generating responses across a variety of tasks. As can be seen from the example 600, incorporating the described techniques (“w / IC”) to effectively select demonstration examples allows the smaller model to achieve significantly higher win rates than not using the described techniques (“w / o IC”).
[0150] FIG. 7 shows another example 700 of the performance of the described techniques. As can be seen from the example 700, the described techniques achieve a significantly better quality-efficiency tradeoff than an existing serving technique (“RouteLLM”) across multiple different task types.
[0151] As one example, a machine learning model as described herein may comprise an autoregressive generative model (e.g., a Transformer, a recurrent neural network, etc.) that can auto-regressively generate an output sequence based on the input. A transformer neural network is a neural network comprising a stack of transformer blocks, each typically including an attention or self-attention neural network layer, generally followed by a feedforward neural network layer (where a self-attention neural network layer applies a self-attention operation, e.g. QKV self-attention, to elements of an embedding, to update each element of the embedding).
[0152] The model can, for example, comprise a vision-language model (VLM) that can generate tokenized representations of image or video data, e.g. in response to a text input or that can generate tokenized representations of text, e.g. in response to an image input; or a multimodal model that can generate tokens representing any of text, image or audio, e.g. in response to an input comprising any of text, image or audio; and so on.
[0153] As previously mentioned, the sequence processing neural network may be a language model or vision language model neural network. In general, a (vision) language model neural network can be a neural network that has been trained so that, given a text prompt that includes a sequence of tokens in a natural language, the neural network can generate the next token in the sequence. This process can be repeated to extend the text prompt one token at a time to generate a natural language output, i.e., to generate the natural language output auto-regressively token by token. At each “time step,” the language model neural network processes the current sequence to generate a probability distribution over a vocabulary of tokens. The next token can then be selected using the probability distribution, e.g., by sampling from the distribution using nucleus sampling or another sampling technique or by selecting the highest-probability token. The tokens in the vocabulary can include any of a variety of tokens, e.g., some combination of words, sub-words, characters, punctuation and other symbols, and numbers. In general, the language model neural network is trained on a corpus of text made up of tokens from the vocabulary (and optionally other tokens that can be mapped to a designated out-of-vocabulary token), to predict the next token in a sequence of tokens from the training data. The (vision) language model neural network can comprise an autoregressive Transformer neural network.
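The token-by-token generation loop with nucleus sampling can be sketched as follows. The language model is replaced by a hypothetical callable `next_token_probs` that maps the current sequence to a dictionary of token probabilities, and the `"<eos>"` end marker and function names are illustrative assumptions.

```python
import random


def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    rng = rng or random.Random()
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*nucleus)
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(next_token_probs, prompt, max_new_tokens, eos="<eos>", top_p=0.9):
    """Extend the prompt one token per time step until eos or the limit."""
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        token = nucleus_sample(next_token_probs(sequence), top_p=top_p)
        if token == eos:
            break
        sequence.append(token)
    return sequence
```

Replacing `nucleus_sample` with a simple maximum over `probs` gives the "highest-probability token" alternative mentioned above.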
[0154] A (vision) language model neural network can be made to perform a particular task by providing a natural language description of the desired response as an input or “prompt” (input sequence). In some cases, the prompt may be a few-shot prompt where a few, e.g., 1 to 10, examples of a query and an example output are provided in the text prior to the actual query.
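As a small illustration of a few-shot prompt, the sketch below places the selected request-response examples ahead of the actual query. The `Request:` / `Response:` template and the function name are hypothetical; the specification does not prescribe a prompt format.

```python
def build_few_shot_prompt(examples, query):
    """examples: list of (request, response) pairs placed before the query."""
    parts = [f"Request: {request}\nResponse: {response}"
             for request, response in examples]
    # The actual query comes last, with the response left for the model.
    parts.append(f"Request: {query}\nResponse:")
    return "\n\n".join(parts)
```

The same construction applies when the examples are demonstration examples selected from the cache data rather than hand-written ones.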
[0155] Instead or in addition, a (vision) language model neural network may be “fine-tuned” to perform a particular task, by obtaining a pre-trained language model neural network trained on a large corpus of examples as previously described and then further training part or all of the language model neural network on a relatively small number of examples particular to the type of task that is to be performed.
[0156] The (vision) language model neural network may be a large language model neural network, e.g., one that has greater than 1 billion, 10 billion or 100 billion trained parameters. The (vision) language model neural network may have been trained on greater than 10 billion, 100 billion or 1000 billion words or tokens representing words or other tokens.
In implementations the model input and the model output (and the training sequences) each comprise a sequence of elements referred to herein as tokens. A "token" as used in this specification is a vector of numerical values having a specified dimensionality, i.e. the number of numerical values is constant across different tokens. Each token can comprise a respective predetermined or learned embedding (an ordered collection of numerical values having a predetermined dimensionality).
[0157] The model implementations and the generative neural network (or machine learning model) can have a sequence processing architecture, in which the model input comprises an input sequence of tokens and the model is configured to generate an output comprising an output sequence of tokens. The model implementations and the generative neural network (or machine learning model) can implement a generative neural network system configured to process the input sequence of tokens using a sequence processing neural network to generate the output sequence of tokens.
[0158] In some implementations the tokens can represent text, e.g., words, wordpieces or characters, in a natural or computer language. For example, text may be received, e.g., as a series of encoded characters, e.g. UTF-8 encoded characters; such “characters” can include Chinese and other similar characters, as well as logograms, syllabograms and the like. A text encoder, i.e. a tokenizer, can process a sequence of text to represent the text as a series of text tokens from a vocabulary of text tokens, e.g. that each represent words, wordpieces or characters in a natural or computer language. The computer language may be any formal language used to communicate with a computer, e.g. a markup language, or a command or configuration language, or a data exchange language such as JSON, or a programming language. The tokenizer can, e.g., implement BPE (Byte Pair Encoding) or Wordpiece tokenization. Optionally the text can be obtained from audio data representing speech; the output tokens may be converted into audio data that represent speech corresponding to the text.
[0159] Also or instead the tokens may represent an image. For example a set (sequence) of input or output tokens can represent an image. Each image token may comprise a block encoding of values of the pixels in a different region of an image that maps a set of values of the pixels to a respective image token.
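Merely as an illustrative sketch of how pixel values in different image regions can be grouped into image tokens of constant dimensionality, the following splits a single-channel image into non-overlapping square blocks and flattens each block into a vector. The function name image_to_patch_tokens and the choice of square, non-overlapping blocks are assumptions; an actual implementation may further encode each block, e.g. with a learned projection.

```python
def image_to_patch_tokens(image, patch):
    """Split a 2D grid of pixel values into flattened, non-overlapping blocks.

    `image` is a list of equal-length rows. Its height and width are assumed
    to be multiples of `patch`, so every token has patch * patch values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    tokens = []
    # Visit blocks in row-major order: left to right, then top to bottom.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ]
            tokens.append(block)
    return tokens
```

For a 4x4 image and a block size of 2 this yields four tokens of four values each, one per image region.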
[0160] As used herein an image may be any still or moving image, i.e. the image may be part of a video, in 2D or 3D, and may be a monochrome, color or hyperspectral image, i.e. comprising monochrome or color pixels. As defined herein an “image” includes a point cloud e.g. from a LIDAR system, and a “pixel” includes a point of the point cloud. An image may have been captured by a camera or other image sensor from the real world; and objects in the image may comprise physical objects, represented by the image.
[0161] Also or instead the tokens may represent an audio waveform. For example a set (sequence) of input or output tokens can represent audio data representing a waveform e.g. instantaneous audio amplitude values or time-frequency audio data. Each audio token may comprise a block encoding of the audio waveform in a different time segment of the audio that maps a set of values representing the audio waveform to a respective audio token.
[0162] In a multimodal system audio data or an image may be flagged by a start-of-audio token or start-of-image token.
[0163] In some implementations the tokens represent text, pixels of an image, or an audio waveform and the generative neural network system is configured to generate the output sequence of tokens to perform a task represented by the input sequence of tokens. These tasks can also be referred to as "downstream" tasks.
[0164] In some implementations the task comprises an image or audio generation task. The input sequence of tokens can then characterize the image or audio to be generated, and the output sequence of tokens can comprise tokens defining an image or audio waveform characterized by the input sequence of tokens, e.g. text tokens.
[0165] In some implementations the task comprises an image or audio processing task. The input sequence of tokens can define an image or audio input, and the output sequence of tokens can comprise tokens defining text that describes the image or audio input. As some examples, the task can be a speech recognition task, an object or action detection task, a classification task, a captioning task, a question-answering task, or a character or word recognition task.
[0166] In some implementations the task comprises a multimodal processing task. One or both of the input sequence of tokens and the output sequence of tokens can comprise multimodal data. For example the input sequence of tokens can characterize both an image or audio input and a text input and the output sequence of tokens can comprise tokens defining a result of an image or audio processing task defined by the text, such as an open vocabulary classification or object detection task.
[0167] In general, multimodal data comprises a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data. As one example the multimodal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform. As another example the multimodal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform.
[0168] Some examples of multimodal tasks include: open-vocabulary image classification (the output can classify the image input based on a text input comprising text descriptions of one or more classes in the image); open-vocabulary object detection (the output can detect one or more objects in the image input based on a text input comprising text descriptions of the one or more objects); image captioning (the output can comprise text that describes the image input); text-based image search (the output can identify from amongst multiple images in the image input one or more images that meet a text description of images to be retrieved, the text description being provided in a text input); image-based retrieval (the output can identify from amongst multiple images in the image input one or more images that match a further image in the image input), and so on. The multimodal processing task to be performed can be defined by text in the input sequence.
[0169] In some implementations the task comprises an agent control task in which the agent interacts with an environment to perform the task. The agent can be a mechanical agent such as a robot or (semi-)autonomous vehicle, interacting with a real-world environment to perform the task. The generative neural network system can be trained to control a simulated version of the agent in a simulated version of the environment and then afterwards used to control the real agent in the real-world environment. The input sequence of tokens can comprise tokens that represent an observation of the environment, e.g. an image captured by a camera or other imaging device from a real-world environment. The output sequence of tokens comprises tokens that define one or more actions to be performed by the agent in the environment in response to the observation.
[0170] Further examples of tasks are described later.
[0171] Merely as an example a sequence processing neural network model can be trained using a token-predicting or other objective, such as a softmax cross entropy loss (with teacher forcing) or an autoregressive negative log likelihood (NLL) loss. As an example such a loss could be $-\sum_{l=1}^{L} \log p(y_l \mid y_{<l}, x_{<l})$ for a multimodal input comprising a sequence of text encoded as $L$ tokens with the $l$-th text token $y_l$ conditioned on preceding second modality inputs $x_{<l}$, such as one or more images or videos, and conditioned on preceding text tokens $y_{<l}$. As another example the model could be trained with a masking loss, e.g. a loss that requires the model to predict masked-out data such as masked-out text or image tokens.
[0172] There are many suitable training datasets available, depending on the task to be performed. Just as some examples these include, for text: WebLI (Web Language Image, Chen et al., arXiv:2305.18565v1). Some examples for images include: the Visual Genome dataset for Visual Question Answering (Krishna et al., arXiv:1602.07332); Objects365 (Shao et al., “Objects365: A large-scale, high-quality dataset for object detection”, IEEE / CVF international conference on computer vision, pages 8430-8439); Open Images V4 (Kuznetsova et al., arXiv:1811.00982); the SBU dataset (Ordonez et al., “Im2Text: Describing Images Using 1 Million Captioned Photographs”, NeurIPS 2011); the Conceptual Captions datasets, e.g. V1 (2M images) or V2 (10M images) (Sharma et al., “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning”, ACL 2018); and Kinetics for video (Kay et al., arXiv:1705.06950). An example for audio data is AudioSet (Gemmeke et al., “Audio set: An ontology and human-labeled dataset for audio events,” ICASSP, IEEE, 2017, pp. 776-780). An example training dataset for agent (robot) control is described in Ebert et al., arXiv:2109.13396.
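Merely as an illustrative sketch of the autoregressive negative log likelihood loss mentioned above, the following sums the negative log probability that a model assigns to each target token of one sequence. It assumes the per-position probabilities have already been computed with teacher forcing, i.e. each position was conditioned on the true preceding tokens and any preceding second modality inputs; the function name autoregressive_nll and the dictionary representation of the distributions are hypothetical.

```python
import math


def autoregressive_nll(step_probs, targets):
    """Return the sum over positions of -log p(target | preceding context).

    step_probs[l] maps each vocabulary token to the probability the model
    assigns to it at position l; targets[l] is the true token at position l.
    """
    if len(step_probs) != len(targets):
        raise ValueError("need one probability distribution per target token")
    return -sum(math.log(probs[token]) for probs, token in zip(step_probs, targets))
```

In practice such a loss is typically averaged over a batch of sequences and minimized by gradient descent; that training loop is outside the scope of this sketch.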
[0173] In general a training dataset for a particular task may comprise task-specific training examples that have been manually generated by a human being and / or task-specific training data may be generated automatically using existing tools. For example an OCR (Optical Character Recognition) task dataset may be generated by applying an OCR tool to a corpus of images; or a dataset for an object detection task that requires generating object bounding-box coordinates may be generated by applying an existing object detection tool, such as a trained neural network, to a corpus of images; or a set of aligned image and text representations may be generated using ALIGN (Jia et al., arXiv:2102.05918); or instruction-annotated robot trajectories may be obtained as described in Brohan et al., arXiv:2212.06817, in either the real-world or in simulation.
[0174] As previously described, implementations of the described techniques can be used to obtain a (trained) generative neural network system.
[0175] In some implementations the generative neural network system, e.g. a language model or a visual language model, is stored on a user computing device, i.e. a device local to the user, such as a mobile device e.g. a mobile phone, or a smart speaker.
[0176] In some implementations the generative neural network system is implemented on a remote server in communication with a user computing device over a wired or wireless network communications link between the user computing device and the server.
[0177] The user computing device may be provided with an input mechanism, such as a text or voice interface, that enables user input from the user in a natural language. The user computing device may be provided with an output mechanism that provides a system output for the user in the or another natural language e.g. as speech or text; or in some other way, e.g. by displaying an image. The input and output mechanism may comprise, e.g., a keyboard, microphone, speaker, display, and / or camera.
[0178] As an example the input mechanism may comprise a system configured to input audio data characterizing a speech waveform of speech representing the input from the user in a natural language, and configured to convert the audio data into tokens representing the speech in the natural language, e.g. representing a transcription of the spoken input. The output mechanism may comprise a system configured to receive tokens representing the output for the user in the or another natural language and a system configured to convert the received tokens into audio data representing a waveform of speech representing the output to the user in the natural language, i.e. representing spoken words.
[0179] As a further example, the trained system can be deployed in an environment that enables a user to provide a request for the system, e.g. to process a multimodal input to generate a corresponding output sequence. A user can provide the request, e.g., by way of a user interface or through an application programming interface (API). The request can be transmitted from a user device, e.g., over a data communications network such as the internet, to one or more computers implementing the system, e.g., in a data center. The system can generate an output sequence and then transmit the output sequence to a user device over a data communications network.
[0180] Further example applications
[0181] A user computing device may be provided as an interface for the generative neural network system, with an input mechanism that enables user input from the user in a natural language and an output mechanism that provides a system output to the user in the natural language. The input and output mechanism may comprise, e.g., a keyboard and display. Also or instead the input and output mechanism may comprise a speech-based mechanism. For example the input mechanism may comprise a system configured to input audio data characterizing a speech waveform of speech representing the input from the user in the natural language and configured to convert the audio data into tokens representing the speech in the natural language, e.g. representing a transcription of the spoken input. The output mechanism may comprise a system configured to receive tokens representing the output to the user in the natural language and a system configured to convert the received tokens into audio data representing a waveform of speech representing the output to the user in the natural language, i.e. representing spoken words. In some implementations the input sequence comprises one or more natural language statements relating to an environment, in particular a real-world environment, and includes a natural language request relating to the environment. Similarly the output sequence may be a natural language reply or natural language output statement that also relates to the environment i.e. it provides information relating to the environment, in some implementations relating to or specifying actions to be taken in the environment.
[0182] The (trained) generative neural network system can be used for diagnosing a fault, or for correcting undesired behavior, in a mechanical or computing system operating in the real-world environment. The input may comprise a description and / or image of one or more observations of the mechanical or computing system, e.g. of operation of the system, optionally obtained from one or more sensors sensing a condition or operation of the system. An image observation may be converted into a text description e.g. using an image captioning system or in other ways. The generated output sequence may comprise an image, audio, or text that identifies (describes) a likely cause of the fault or undesired behavior. This may be used to repair the fault or correct the behavior. The reward model(s) can define relatively more useful types of output for repairing the fault or correcting the behavior, and other aspects of the response as previously described.
[0183] The (trained) generative neural network system can be used for controlling a mechanical agent such as a robot or vehicle. For example the input may comprise a description of a task to be performed, and the generated output sequence may comprise a list of sub-tasks to be performed by the mechanical agent (trained to perform such sub-tasks), in order to perform the task. The reward model(s) can define relatively more preferable or useful types of sub-task, task safety, efficiency, and so on.
[0184] As another example, the environment can be a computer security monitoring environment, e.g., the system can be deployed as part of a system that monitors the security of one or more computers. For example, the environment may be a computer network security monitoring environment, and the system can be deployed as part of a system that monitors the security of one or more computers on a computer network, e.g. a wireless network, a cellular network, a local area network and / or the internet. As another example, the environment may alternatively or additionally be a computer system security monitoring environment and the system can be deployed as part of a system that monitors the system for the presence of computer viruses and / or an unresolved software vulnerability, e.g. a zero-day exploit. A software vulnerability may be resolved by updating the software (e.g. patching) and / or removing (e.g. uninstalling) the software from the computer system. In these examples, the natural language request can query whether a computer security incident has been resolved (e.g., “has the incident been resolved?”) and the input sequence may comprise relevant statements from system logs, i.e., that are potentially relevant to the event being queried. A computer security incident can be, e.g., a data breach, an unauthorized log-in or other access of a secured system, a detection of a computer virus or detection of a software vulnerability. The incident can be “resolved” when the underlying incident is no longer a threat to the security of the computer system e.g., the computer virus has been removed, the access to the secured system has been removed, the data breach has been mitigated, or the software having the vulnerability has been updated or removed. The system can use the input sequence to generate a reply to the request that comprises a natural language statement indicating whether the incident has been resolved, optionally displaying evidence used to determine this.
[0185] The input sequence may include one or more of: code snippets from the software code, system logs, program logs, or other artifacts that should be left on the computer by running the program, or verification rules that represent requirements for the execution of the software program, or natural language statements describing the computer system on which the software executes. In general the input sequence may include relevant statements, i.e., statements that are potentially relevant to the event being queried.
[0186] In some implementations obtaining the input sequence may comprise obtaining, from the system logs, the data characterizing the computer network, or both, or from other data as described above, one or more observations of the computer network (which here includes computers on the network), and processing the one or more observations to generate a natural language representation of the one or more observations. The natural language request may relate to the computer security incident or to the secure operation of the computer network. The method may include using the natural language representation of the one or more observations to provide one or more of the natural language statements describing the computer network, and using the natural language reply or the natural language output statement to identify a security status of the computer network or a security flaw in the computer network.
[0187] As another example, the environment can be a software testing or evaluation environment, e.g., the system can be deployed as part of a system that tests software before deployment or that evaluates already-deployed software to identify bugs. In these examples, when the system tests software before deployment, the natural language request can ask whether the software will execute as intended, and the input sequence can include code snippets from the software code and, optionally, natural language statements describing the computer system on which the software will execute. The system can then use the input sequence to generate a reply that indicates whether the code will execute as intended, optionally displaying evidence used to determine this. When the system monitors the execution of code after deployment, the natural language request can ask whether a software program, or a portion of a software program, has executed as intended, and the input sequence can include one or more of: code snippets from the software code, system logs, program logs, or other artifacts that should be left on the computer by running the program, or verification rules that represent requirements for the execution of the software program, or natural language statements describing the computer system on which the software executes. The system can then use the input sequence to generate a reply that indicates whether the code has executed as intended, optionally displaying evidence used to determine this. As a particular example, the software program can be part of the boot up of a computer, and the system can generate a reply each time that the computer starts up to verify whether the computer will function correctly after start up.
[0188] As another example, the environment can be an educational environment, e.g., the system can be deployed as part of an education software program that assists a user in learning or practicing one or more corresponding skills. In these examples, the input sequence can include natural language statements describing or referencing a scenario or scene in a real-world or imagined environment, and the request can be a question about the scenario or scene.
[0189] As another example, the environment can be an information retrieval environment, e.g., the system can be deployed as part of a search engine or other software that allows a user to search for information in a corpus of documents, e.g., the Internet or another electronic document corpus. In these examples, the request can be any appropriate natural language question, and the reply can optionally include evidence such as relevant statements from the corpus of documents, e.g. as identified by searching the corpus using conventional information retrieval techniques.
[0190] In some implementations, the language model neural network is a visual language model (VLM). In general, the VLM may process input sequences comprising tokens that each represent natural language or (a part of) an image or video to generate output tokens that each represent natural language or (a part of) an image or video. For example, the VLM may be configured to describe an image or video using natural language, e.g., to perform an image or video captioning task. As another example, the VLM may be configured to process input tokens representing an image and text tokens representing a query about the image or a request to modify the image, and to generate output tokens representing an answer to the query or representing a version of the image that has been modified in accordance with the request. The VLM may generate output tokens representing an image or video that is generated in response to input tokens providing a visual and / or audio and / or textual description of a desired image or video.
[0191] In some implementations, the “language” of the language model is not a natural language (e.g. English), but may instead be a text-based encoding describing an entity or class of entities, e.g. a chemical or biological entity, such as a chemical structure or molecule. For example, the text-based encoding may be a sequence of tokens that defines a molecule or protein, e.g. a sequence specifying an arrangement of atoms or chemical functional groups in a molecule, or the amino acid residues of a protein. The language model may be referred to as a chemical and / or biological language model in such cases. The input for the language generation neural network may therefore be an input string defining a chemical (e.g. protein) structure and the output may be an output string defining a different chemical structure from the input string. The strings may be in the Simplified Molecular Input Line Entry System, SMILES, format, for example.
[0192] In another example of a computer language text generation task, a task-specific training example may comprise an image or video and a sequence of text in a computer language for performing a task in relation to the image or video, e.g. a data processing task that involves analyzing the content of the image or video to provide a result of the analysis or, e.g., a search to search for information relating to the content of the image or video. The computer language in the model output may comprise computer language for invoking a function or calling one or more external APIs. Merely as one example, such an output may be formatted as a JSON object. As previously, the sequence of text in the multimodal input may define the task to be performed and the second modality input may comprise, e.g. an image or video in relation to which the task is to be performed, e.g. a task that involves manipulation of particular types of data that may benefit from access to an API such as mathematical data, date / time related data, scientific data, recent data that may post-date training of the model (that may be accessed by a search function or API), and so on. After training, when the model is used in inference, the model output may comprise text in the or another computer language for performing a task, e.g. as described above, in relation to an image or video in the second modality input. The method may then include using the text in the computer language to perform the task. In some implementations, the language model neural network may be used to interact with a human user of a digital assistant such as a smart speaker, smart display, or other device. For example, information defining a task can be obtained from the digital assistant, and the digital assistant can be used to instruct the user to perform the task.
For example, this may comprise receiving, at the digital assistant, a request from the user for assistance and determining, in response to the request, a series of tasks for the user to perform, e.g. steps or sub-tasks of an overall task. Then for one or more tasks of the series of tasks, e.g. for each task, e.g. until a final task of the series, the digital assistant can be used to output to the user an indication of the task, e.g. step or sub-task, to be performed. This may be done using natural language, e.g. on a display and / or using a speech synthesis subsystem of the digital assistant. Visual, e.g. video, and / or audio observations of the user performing the task may be captured, e.g. using the digital assistant. A system may then be used to determine whether the user has successfully achieved the task e.g. step or sub-task, i.e. from the answer as previously described. If there are further tasks to be completed the digital assistant may then, in response, progress to the next task (if any) of the series of tasks, e.g. by outputting an indication of the next task to be performed. In this way the user may be led step-by-step through a series of tasks to perform an overall task.
[0193] As an illustrative example, a user may be interacting with a digital assistant and ask for help performing an overall task consisting of multiple steps, e.g. cooking a pasta dish. While the user performs the task, the digital assistant receives audio and / or video inputs representative of the user's progress on the task, e.g. images or video or sound clips of the user cooking. The digital assistant uses a system as described above, in particular by providing it with the captured audio and / or video and a question that asks whether the user has completed a particular step, e.g. 'Has the user finished chopping the peppers?', to determine whether the user has successfully completed the step. If the answer confirms that the user has successfully completed the step then the digital assistant progresses to telling the user to perform the next step or, if at the end of the task, or if the overall task is a single-step task, then the digital assistant may indicate this to the user. The digital assistant may then stop receiving or processing audio and / or video inputs to ensure privacy and / or reduce power use.
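Merely as an illustrative sketch of the step-by-step guidance described above, the following loop announces each step, repeatedly asks whether the step has been completed, and moves on once the answer is affirmative. The names guide_user, step_completed and announce, and the limit on the number of checks per step, are assumptions; step_completed stands in for capturing audio and / or video observations and querying the language model neural network with a question about them.

```python
def guide_user(steps, step_completed, announce, max_checks_per_step=3):
    """Lead a user through `steps` and return the steps confirmed as complete.

    step_completed(question) -> bool is a hypothetical callable that captures
    observations of the user and asks the model the given question.
    announce(message) is a hypothetical callable that outputs text or speech.
    """
    completed = []
    for step in steps:
        announce(f"Next step: {step}")
        for _ in range(max_checks_per_step):
            if step_completed(f"Has the user finished this step: {step}?"):
                completed.append(step)
                break
        else:
            # No affirmative answer within the allowed number of checks.
            announce(f"Could not confirm: {step}")
            return completed
    # All steps confirmed; a real assistant could stop capturing observations here.
    announce("All steps complete.")
    return completed
```

Returning as soon as the final step is confirmed mirrors the point at which the digital assistant may stop receiving or processing audio and / or video inputs.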
[0194] In a further aspect there is provided a digital assistant device including a system as described above. The digital assistant can also include a user interface to enable a user to request assistance and to output information. In implementations this is a natural language user interface and may comprise a keyboard, voice input-output subsystem, and / or a display. The digital assistant can further include an assistance subsystem configured to determine, in response to the request, a series of tasks for the user to perform. In implementations this may comprise a generative (large) language model, in particular for dialog. The digital assistant can have an observation capture subsystem to capture visual and / or audio observations of the user performing a task; and an interface for the above-described language model neural network (which may be implemented locally or remotely). The digital assistant can also have an assistance control subsystem configured to assist the user. The assistance control subsystem can be configured to perform the steps described above, for one or more tasks e.g. of a series of tasks, e.g. until a final task of the series. More particularly, the assistance control subsystem can output to the user an indication of the task to be performed, capture, using the observation capture subsystem, visual or audio observations of the user performing the task, and determine from the above-described answer whether the user has successfully achieved the task. In response, the digital assistant can progress to a next task of the series of tasks and / or control the digital assistant, e.g. to stop capturing observations.
[0195] The generative neural network system may comprise a multimodal machine learning system such as a visual language model (VLM). That is, implementations of the generative neural network system can perform a multimodal task in which the input and output sequence, collectively, comprise data of multiple different types. As used herein text can include numbers, punctuation, special symbols, and so on.
[0196] In some implementations, after training, a particular task that is to be performed by the generative neural network system can be described by part or all of a sequence of text in the input to the system. For example in an input that includes an image such a prompt might specify “Generate a caption”, “Generate a description”, “Answer the following question: [about the image or video]”, or “Detect a person”. Where the system is used for an agent control task a prompt may define “Take the knife out of the drawer”, or “Q: What action should the robot take to take the knife out of the drawer?”. Also or instead such a prompt may give one or more examples of a task to be performed. The generative neural network system can be trained on multiple natural and / or computer languages and the prompt may then specify a language to use.
[0197] A few further examples of some machine learning tasks that can be performed by a system trained as described herein follow. The tasks described below may be tasks that require spatial awareness or other context from the image or video. For example, a prompt may ask “What is the object in the top left corner?”.
[0198] In general for the tasks below the system can have been trained or fine-tuned on examples of the input and output for the task. For example the system can have been trained using still or moving images containing one or more objects or actions, and corresponding sequences of text or other data e.g. describing or classifying the images. However large, “foundation” models can, in general, perform some tasks zero-shot, i.e. without having been specifically trained on those tasks.
[0199] As one example the task may comprise an object or action detection task. For example the generated output sequence may comprise or represent text that describes or otherwise labels detected object(s) or action(s) in an input comprising an image or audio, and may include coordinates such as bounding-box coordinates for the detected object(s) or action(s), e.g. “10 20 90 100 cat 20 30 100 100 dog”.
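Merely as an illustrative sketch of how such a detection output could be consumed downstream, the following parses a whitespace-separated string in which each detection is assumed to consist of four integer coordinates followed by a single-word label. The function name parse_detections is hypothetical, and the meaning and order of the four coordinates are not specified here and would depend on the convention used in training.

```python
def parse_detections(text):
    """Parse 'c1 c2 c3 c4 label ...' into a list of (coordinates, label) pairs.

    Assumes every detection is exactly four integers followed by one
    single-word label, separated by whitespace.
    """
    fields = text.split()
    if len(fields) % 5:
        raise ValueError("expected groups of four coordinates and one label")
    detections = []
    for start in range(0, len(fields), 5):
        coords = tuple(int(value) for value in fields[start:start + 4])
        detections.append((coords, fields[start + 4]))
    return detections
```

Multi-word labels or a different number of coordinates per detection would require a different separator or grammar than the one assumed above.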
[0200] As another example the task may comprise a classification task, e.g. an object or action classification task. The generated output sequence may comprise data, e.g. text, that classifies the object(s) or action(s) represented in the conditioning data, e.g. in an image or audio, into one of a plurality of classes, or that otherwise classifies object(s) or action(s) represented in the conditioning data.
[0201] As another example the task may comprise a still or moving image describing task, e.g. a captioning task (which, as used here, includes an audio description task to explain what is happening in an image). The generated output sequence may comprise data, e.g. text, describing an image or video in the conditioning data. For example the generated output sequence may provide a caption or description or it may count objects in the image or video, or it may provide some other form of description.
[0202] As another example the task may comprise a still or moving image question-answering task. The generated output sequence may comprise data, e.g. text, that answers a question about the input, e.g. an image or audio, where the question is also specified in the input, e.g. as a sequence of text. This may be used, e.g., to answer questions about visual plots and charts or about sounds.
[0203] As another example the task may comprise a character or word recognition task, e.g. an OCR (optical character recognition) task. The input may comprise a still or moving image and the generated output sequence may comprise text that represents characters or words in the input, e.g. in a natural language.
[0204] As another example the task may comprise a still or moving image generation task. The generated output sequence may comprise image data defining values for pixels of a still or moving image, and the input, e.g. a sequence of text, may describe or characterize the image to be generated. Merely as an example, an image of a plot or chart may be generated to represent the input, e.g. comprising text. As another example the task may comprise a computer language text generation task. The conditioning data may comprise a natural language description of a task to be performed, and optionally an image (if the task is to be performed on or in relation to an image), and the generated output sequence may comprise text in a computer language to perform the task, e.g. a task of analyzing the content of the image to provide a result of the analysis or to search for information relating to the content of the image.
[0205] As a particular example the computer language in the generated output sequence may comprise computer language for invoking a function or calling one or more external APIs. Merely as one example, such an output sequence may comprise data formatted as a JSON object. As previously, the input may define the task to be performed and may also include an image in relation to which the task is to be performed. In general the task can involve manipulation of particular types of data that may benefit from access to an API such as mathematical data, date / time related data, scientific data, recent data that may post-date training of the system (that may be accessed by a search function or API), and so on; and the generated output sequence may comprise text in a computer language for performing the task. The method may then include using the text in the computer language to perform the task.
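As an illustration of the function-calling case, the following minimal Python sketch parses a generated output sequence formatted as a JSON object and uses it to perform the task. The JSON layout (`name`, `arguments`), the registry `TOOLS`, and the helper `days_between` are hypothetical names chosen for this sketch and are not taken from this specification.

```python
import json
from datetime import date


def days_between(start: str, end: str) -> int:
    """Hypothetical date/time tool: whole days from start to end (ISO dates)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Hypothetical registry mapping function names to local callables.
TOOLS = {"days_between": days_between}


def run_function_call(output_text: str):
    """Parse a JSON function call emitted by the model and execute it."""
    call = json.loads(output_text)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    return TOOLS[name](**call["arguments"])


# Example generated output sequence (assumed format).
generated = (
    '{"name": "days_between", '
    '"arguments": {"start": "2024-12-23", "end": "2025-12-23"}}'
)
result = run_function_call(generated)
```

Rejecting names that are not in the registry keeps the generated text from invoking arbitrary code, which is one reasonable design choice when executing model output.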
[0206] In general where the generated output sequence comprises text this may be converted to speech representing the text, and an audio (speech) output provided.
[0207] In some implementations the task comprises an agent control task in which the agent interacts with an environment to perform the agent control task. In these implementations the input can include an observation characterizing the environment. For example the input can include a sequence of text that defines the task to be performed by the agent and the image can represent an observation of the environment, e.g. captured by a camera or other imaging device from a real-world environment. The generated output sequence can comprise an action selection output, e.g. including text, that is used to select one or more actions to be performed by the agent in the environment in response to the observation. As an illustration the generated output sequence may define an action as text such as “A: 132 114 128 5 25 156”, which can be converted into a control signal for a mechanical agent, such as a robot, e.g. “ΔT=[0.1,-0.2,0] ΔR=[10°,25°,-7°]”. The action selection output may also or instead define one or more low-level skills, e.g. from a vocabulary of previously learnt skills. As before, the sequence of text in the input to the system may describe the task to be performed, e.g. “What action should the robot take to [perform task]”. Examples of systems for controlling an agent that may be fine tuned as described herein can include PaLM-E (Driess et al. arXiv:2303.03378), RT-1 (Brohan et al. arXiv:2212.06817), and RT-2 (Brohan et al. arXiv:2307.15818).
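One way such a text action might be converted into continuous values is sketched below in Python. The sketch assumes that each integer is a bin index in a uniform discretization of a per-dimension range; the bin count of 256, the range of -1 to 1, and the function name `decode_action` are assumptions made for illustration, and the sketch does not attempt to reproduce the control-signal values quoted above.

```python
from typing import List


def decode_action(text: str, low: float = -1.0, high: float = 1.0,
                  bins: int = 256) -> List[float]:
    """Map a text action such as 'A: 132 114 ...' to continuous values.

    Assumes (hypothetically) that each integer is a bin index in
    [0, bins - 1] and that the bins uniformly cover [low, high].
    """
    prefix, _, body = text.partition(":")
    if prefix.strip() != "A":
        raise ValueError("expected an action of the form 'A: ...'")
    tokens = [int(t) for t in body.split()]
    for t in tokens:
        if not 0 <= t < bins:
            raise ValueError(f"bin index out of range: {t}")
    step = (high - low) / (bins - 1)
    return [low + t * step for t in tokens]


# The six values could then be interpreted by a controller, e.g. as
# translation and rotation components (an assumption of this sketch).
action = decode_action("A: 132 114 128 5 25 156")
```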
[0208] In some agent control implementations, the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment, e.g., a robot or an autonomous or semi-autonomous land, air, or sea vehicle operating in or navigating through the environment, and the actions are actions taken by the mechanical agent in the real-world environment to perform the task. For example, the agent may be a robot or other mechanical agent interacting with the environment to accomplish a specific task, e.g., to locate or manipulate an object of interest in the environment or to move an object of interest to a specified location in the environment or to navigate to a specified destination in the environment. In these implementations, the observations may include, e.g., one or more of: images, object position data, and sensor data to capture observations as the agent interacts with the environment. The actions may define control signals to control the robot or other mechanical agent, e.g., positions, torques, or other control signals for the parts of the mechanical agent, or higher-level control commands.
[0209] In some agent control implementations the agent may be a human agent and the environment may be a real-world environment. For example the agent can be a human user of a digital assistant such as a smart speaker, smart display, or some other device that is used to instruct the user to perform actions. The task may be any real-world task that the user wishes to perform. The observations may be obtained from an observation capture subsystem, e.g. a monitoring system such as a video camera or sound capture system, to capture visual observations of the user performing the task. The actions may comprise instructions in the form of, e.g., text, image, video, or audio data such as speech, that guide the user in performing the task.
[0210] In any of the above examples, the task may be a “planning” task that requires the generative neural network to generate one or more intermediate outputs prior to generating a final output that is used as the final output for the task. For example, the generative neural network may be caused to perform planning by virtue of the prompt that is provided as input to the neural network, e.g., which can include a natural language instruction or other type of instruction that instructs the neural network to generate intermediate outputs, a few-shot example of planning “trajectories”, or both.
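A prompt of this kind could be assembled as in the following minimal Python sketch, which combines an optional instruction with optional few-shot planning trajectories. The function name `build_planning_prompt`, the `Task:` / `Step n:` / `Final:` layout, and the example content are all hypothetical; the specification does not prescribe a prompt format.

```python
from typing import Optional, Sequence, Tuple

# A trajectory is (example task, intermediate steps, final output).
Trajectory = Tuple[str, Sequence[str], str]


def build_planning_prompt(task: str,
                          instruction: Optional[str] = None,
                          trajectories: Sequence[Trajectory] = ()) -> str:
    """Build a prompt that asks for intermediate outputs before a final one."""
    parts = []
    if instruction:
        parts.append(instruction)
    for ex_task, steps, final in trajectories:
        lines = [f"Task: {ex_task}"]
        lines += [f"Step {i}: {s}" for i, s in enumerate(steps, start=1)]
        lines.append(f"Final: {final}")
        parts.append("\n".join(lines))
    # End with an open first step so the model continues with its plan.
    parts.append(f"Task: {task}\nStep 1:")
    return "\n\n".join(parts)


prompt = build_planning_prompt(
    task="Move the red block onto the blue block.",
    instruction="Think step by step, then give the final answer.",
    trajectories=[(
        "Put the cup on the shelf.",
        ["Locate the cup.", "Grasp the cup.", "Place it on the shelf."],
        "Cup placed on shelf.",
    )],
)
```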
[0211] The described systems and techniques may be applied to a wide range of different types of input sequence and output sequence. In implementations of the described techniques the tokens may represent, characterize, or encode any type of information in a sequence, e.g. a stream of data. The term "represent" is used, below, generally to refer to any way in which a token can encode part of a sequence. The tokens may include marker tokens, such as a start of sequence token, an end of sequence token, and a separator token (indicating a separation or break between two distinct parts of a sequence). The tokens may, but need not be, drawn from a defined vocabulary of tokens.
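The role of the marker tokens can be illustrated with a minimal Python sketch that encodes a two-part sequence using a start of sequence token, a separator token, and an end of sequence token. The marker strings (`<bos>`, `<sep>`, `<eos>`), the whitespace word splitting, and the on-the-fly vocabulary are simplifying assumptions of this sketch, not details of the described system.

```python
from typing import Dict, List, Sequence

# Hypothetical marker tokens and their ids.
MARKERS = {"<bos>": 0, "<eos>": 1, "<sep>": 2}


def encode(parts: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Encode text parts as token ids, separated by marker tokens.

    Unknown words are added to `vocab` with the next free id.
    """
    ids = [vocab["<bos>"]]
    for i, part in enumerate(parts):
        if i > 0:
            ids.append(vocab["<sep>"])  # break between distinct parts
        for word in part.split():
            ids.append(vocab.setdefault(word, len(vocab)))
    ids.append(vocab["<eos>"])
    return ids


vocab = dict(MARKERS)
token_ids = encode(["what is shown", "a chart"], vocab)
```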
[0212] Some of these implementations may be used for natural language tasks such as providing a natural language response to a natural language input, e.g. for question answering, or for text completion. In some implementations the input sequence may represent text in a natural language and the output sequence may represent text in the same natural language, e.g. a longer item of text. For example in some implementations the input sequence may represent text in a natural language and the output sequence may represent the same text with a missing portion of the text added or filled in. For example the output sequence may represent a predicted completion of text represented by the input sequence. Such an application may be used, e.g. to provide an auto-completion function e.g. for natural language-based search. In some implementations the input sequence may represent a text in a natural language e.g. posing a question or defining a topic, and the output sequence may represent a text in a natural language which is a response to the question or about the specified topic.
[0213] As another example the input sequence may represent a first item of text and the output sequence may represent a second, shorter item of text e.g. the second item of text may be a summary of a passage that is the first item of text. As another example the input sequence may represent a first item of text and the output sequence may represent an aspect of the first item of text e.g. it may represent an entailment task, a paraphrase task, a textual similarity task, a sentiment analysis task, a sentence completion task, a grammaticality task, and in general any natural language understanding task that operates on a sequence of text in some natural language e.g. to generate an output that classifies or predicts some property of the text. For example some implementations may be used to identify a natural language of the first item of text, or of spoken words where the input is audio (as described below).
[0214] Some implementations may be used to perform neural machine translation. Thus in some implementations the input tokens represent words, wordpieces, or characters in a first natural language and the output tokens represent words, wordpieces or characters in a second, different natural language. That is, the input sequence may represent input text in the first language and the output sequence may represent a translation of the input text into the second language.
Some implementations may be used for automatic code generation. For example the input tokens may represent words, wordpieces or characters in a first natural language and the output tokens may represent instructions in a computer programming or markup language, or instructions for controlling an application program to perform a task e.g. build a data item such as an image or web page.
[0215] Some implementations may be used for speech recognition. In such applications the input sequence may represent spoken words and the output sequence may represent a conversion of the spoken words to a machine-written representation e.g. text. Then the input tokens may comprise tokens representing an audio data input including the spoken words e.g. characterizing a waveform of the audio in the time domain or in the time-frequency domain. The output tokens may represent words, wordpieces, characters, or graphemes of a machine-written, e.g. text, representation of the spoken input, that is representing a transcription of the spoken input.
[0216] Some implementations may be used for handwriting recognition. In such applications the input sequence may represent handwritten words, syllabograms or characters and the output sequence may represent a conversion of the input sequence to a machine-written representation e.g. text. Then the input tokens may comprise tokens representing portions of the handwriting and the output tokens may represent words, wordpieces, characters or graphemes of a machine-written, e.g. text, representation of the handwritten input.
[0217] Some implementations may be used for text-to-speech conversion. In such applications the input sequence may represent text and the output sequence may represent a conversion of the text to spoken words. Then the input tokens may comprise tokens representing words or wordpieces or graphemes of the text and the output tokens may represent portions of audio data for generating speech corresponding to the text, e.g. tokens characterizing a portion of a waveform of the speech in the time domain or in the time-frequency domain, or phonemes.
[0218] Some implementations may be used for a genomics task, where the input sequence represents a fragment of a DNA sequence or other molecule sequence and the output sequence is either an embedding of the fragment for use in a downstream task, e.g., by making use of an unsupervised learning technique on a data set of DNA sequence fragments, or an output for the downstream task. Examples of downstream tasks include promoter site prediction, methylation analysis, predicting functional effects of non-coding variants, and so on.
In some cases, the machine learning task is a combination of multiple individual machine learning tasks, i.e., the system is configured to perform multiple different individual machine learning tasks, e.g., two or more of the machine learning tasks mentioned above. For example, the system can be configured to perform multiple individual natural language understanding tasks, with the network input including an identifier for the individual natural language understanding task to be performed on the network input.
[0219] In some implementations the input sequence and the output sequence represent different modalities of input. For example the input sequence may represent text in a natural language and the output sequence may represent an image or video corresponding to the text; or vice-versa. In general the tokens may represent image or video features and a sequence of such tokens may represent an image or video. There are many ways to represent an image (or video) using tokens. As one example an image (or video) may be represented as a sequence of regions of interest (RoIs) in the image, optionally including one or more tokens for global image features. For example an image may be encoded using a neural network to extract RoI features; optionally (but not essentially) a token may also include data, e.g. a position encoding, representing a position of the RoI in the image. As another example, the tokens may encode color or intensity values for pixels of an image. As another example, some image processing neural network systems e.g. autoregressive systems, naturally represent images as sequences of image features. As another example, a transformer-based sequence processing neural network system as previously described may be used to process images instead of or as well as text (e.g. if trained on images instead of or as well as text).
[0220] Thus in some implementations at least one of the input sequence and the output sequence is a sequence representing an image or video, and the tokens represent the image or video. For example the input sequence may be a sequence of text, the input tokens may represent words, wordpieces, or characters and the output sequence may comprise output tokens representing an image or video e.g. described by the text, or providing a visual answer to a question posed by the text, or providing a visualization of a topic of the text. In another example the input sequence may comprise a sequence of input tokens representing an image or video, and the output tokens may represent words or wordpieces, or characters representing text e.g. for a description or characterization of the image or video, or providing an answer to a question posed visually by the image or video, or providing information on a topic of the image or video.
[0221] In some other implementations both the input sequence and the output sequence may represent an image or video, and both the input tokens and the output tokens may represent a respective image or video. In such implementations the method / system may be configured to perform an image or video transformation. For example the input sequence and the output sequence may represent the same image or video in different styles e.g. one as an image the other as a sketch of the image; or different styles for the same item of clothing.
[0222] In some implementations the input sequence represents data to be compressed, e.g. image data, text data, audio data, or any other type of data; and the output sequence represents a compressed version of the data. The input and output tokens may each comprise any representation of the data to be compressed / compressed data e.g. symbols or embeddings generated / decoded by a respective neural network.
[0223] In some implementations the input sequence represents a sequence of actions to be performed by an agent e.g. a mechanical agent in a real-world environment implementing the actions to perform a mechanical task. The output sequence may comprise a modified sequence of actions e.g. one in which an operating parameter, such as a speed of motion or power consumption, has a limited value; or one in which a safety or other boundary is less likely to be crossed. Then both the input tokens and the output tokens may represent the actions to be performed.
[0224] In some implementations the input sequence represents a sequence of health data and the output sequence may comprise a sequence of predicted treatments. Then the input tokens may represent any aspect of the health of a patient e.g. data from blood and other medical tests on the patient and / or EHR (Electronic Health Record) data; and the output tokens may represent diagnostic information e.g. relating to a disease status of the patient and / or relating to suggested treatments for the patient, and / or relating to a likelihood of an adverse health event for the patient.
[0225] As a particular example the sequence processing neural network can comprise a multimodal model neural network in which one or both of the model input (i.e. input sequence) and the model output (i.e. output sequence) comprise an image or audio. For example the multimodal machine learning model may be configured to process an input sequence comprising visual tokens representing pixels of a still or moving image (which here may include a point cloud image), and / or data representing an audio waveform e.g. values or features of the audio waveform such as audio tokens, and / or text tokens representing a sequence of text, to generate an output sequence e.g. comprising text tokens representing the still or moving image or audio waveform, and / or comprising a sequence of intensity value inputs for the pixels of an image or a sequence of values defining an audio waveform. A visual token may, e.g., represent multiple pixels in a region of the image, e.g. as features of the region. Such a multimodal model may perform any of the previously described tasks, e.g. using a multimodal input, or by providing a multimodal output, or by converting between different input and output modes (e.g. text / image / audio). For example it may generate text representing, describing (e.g. captioning), or otherwise characterizing an image or audio input, e.g. by answering a question related to the image or audio input, e.g. relating to a future e.g. physical prediction of a state of objects represented by the image or audio. As another example it may generate an image or audio represented, described, or otherwise characterized by a text input, or otherwise in response to the text input, e.g. representing an image or audio answer to a text question.
[0226] In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions.
[0227] The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure.
[0228] The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics.
[0229] A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPGPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics.
In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.
[0230] The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases.
[0231] Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random-access memory (RAM), or both. The elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage.
[0232] Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence.
[0233] To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard, touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction.
[0234] Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.
[0235] Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.
[0236] The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP / IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities.
[0237] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[0238] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0239] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
[0240] What is claimed is:
Claims
1. A method performed by one or more computers, the method comprising: maintaining cache data specifying a set of demonstration examples, wherein each demonstration example comprises a respective example request and a respective example response for the respective example request; receiving a new request; selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request; selecting, based on the subset of demonstration examples and the new request, a generative neural network from a set of a plurality of generative neural networks for processing the new request; processing the new request using the selected generative neural network to generate a new response to the new request; and providing the new response in response to the new request.
2. The method of claim 1, wherein selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request comprises: generating an embedding for the new request; and selecting a first subset of the demonstration examples based on a similarity between the embedding for the new request and respective embeddings of at least some of the demonstration examples.
3. The method of claim 2, wherein selecting a first subset of the demonstration examples based on a similarity between the embedding for the new request and respective embeddings of at least some of the demonstration examples comprises: maintaining a respective centroid embedding for each of a plurality of clusters of the demonstration examples; selecting a cluster from the plurality of clusters having a respective centroid embedding that is most similar to the embedding for the new request; and selecting the first subset of demonstration examples from the demonstration examples in the selected cluster.
4. The method of claim 3, wherein selecting the first subset of demonstration examples from the demonstration examples in the selected cluster comprises: selecting a specified number of demonstration examples from the demonstration examples in the selected cluster having respective embeddings that are most similar to the embedding for the new request.
5. The method of any one of claims 2-4, wherein selecting, from the demonstration examples in the cache data and using the new request, a subset of demonstration examples for the new request further comprises: processing an input representing the new request and the demonstration examples in the first subset using a proxy neural network to generate a respective predicted score for each demonstration example that represents a utility of the demonstration example in responding to the new request; and selecting the subset of demonstration examples based on the predicted scores for the demonstration examples in the first subset.
6. The method of any preceding claim, further comprising: determining whether the new request matches any of the example requests in the demonstration examples in the cached data; and performing the receiving, selecting, selecting, and processing in response to determining that the new request does not match any of the example requests in the demonstration examples in the cached data.
7. The method of any preceding claim, wherein the generative neural networks in the set of the plurality of generative neural networks each have a different computational cost.
8. The method of any preceding claim, wherein selecting, based on the subset of demonstration examples and the new request, a generative neural network from a set of a plurality of generative neural networks for processing the new request comprises: processing an input representing the subset of demonstration examples and the new request using a request router model to generate an action output that identifies one of the generative neural networks in the set of generative neural networks.
9. The method of claim 8, further comprising: receiving a feedback signal that indicates (i) a computational cost of processing the new request using the selected generative neural network and (ii) a quality measure for the new response; combining the computational cost and the quality measure to generate a reward score; and updating the request router model using the reward score.
10. The method of claim 8 or claim 9, wherein the request router model is a contextual bandit model.
11. The method of any one of claims 8-10, wherein the request router model has been trained by performing an initial bootstrapping training phase followed by an online adaptation phase.
12. The method of any preceding claim, further comprising: updating the cache data using the new request and the new response.
13. The method of claim 12, wherein updating the cache data using the new request and the new response comprises: adding data specifying a new demonstration example that comprises the new request and the new response to the cache data.
14. The method of claim 12, wherein updating the cache data using the new request and the new response comprises: generating, from the new request and the new response, a synthetic demonstration example that comprises a synthetic request and a synthetic response; and adding data specifying the synthetic demonstration example to the cache data.
15. The method of any preceding claim, further comprising: determining that first criteria are satisfied for updating the cache data; and in response, generating a respective distilled response for each of one or more of the example requests and, for each distilled response, adding data specifying a new demonstration example that includes the corresponding example request and the distilled response to the cache data.
16. The method of claim 15, wherein the first criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value.
17. The method of any preceding claim, further comprising: determining that second criteria are satisfied for updating the cache data; and in response, generating a respective expanded response for each of one or more of the example requests and, for each expanded response, adding data specifying a new demonstration example that includes the corresponding example request and the expanded response to the cache data.
18. The method of claim 17, wherein the second criteria are satisfied when a system load on one or more of the generative neural networks is below a threshold value.
19. The method of any preceding claim, wherein the cache data comprises, for a second subset of the demonstration examples, a respective internal model representation of the demonstration example.
20. The method of any preceding claim, when dependent on claim 14, wherein the synthetic demonstration example is a differentially private version of a new demonstration example that includes the new request and the new response.
21. The method of any preceding claim, further comprising: generating, for each of one or more of the example requests, one or more additional responses, comprising determining, for each of the one or more example requests, whether to generate an expanded response or a distilled response for the example request; and for each additional response, adding data specifying a new demonstration example that includes the corresponding example request and the additional response to the cache data.
22. The method of any preceding claim, when dependent on claim 9, wherein the feedback signal is based on a current system load of the set of one or more generative neural networks.
23. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-22.
24. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the respective method of any one of claims 1-22.
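
As a non-limiting illustration, the cluster-based selection recited in claims 2-4 and the reward computation recited in claim 9 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names Demo, cosine, select_demonstrations, and reward_score are hypothetical, cosine similarity is assumed as the similarity measure, and the linear form quality minus weighted cost (with the cost_weight parameter) is one assumed way of combining the two signals, since the claims do not fix a particular combination.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Demo:
    """Hypothetical container for one cached demonstration example."""
    request: str
    response: str
    embedding: List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity (assumed similarity measure); 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_demonstrations(
    query_emb: Sequence[float],
    clusters: Sequence[Tuple[Sequence[float], List[Demo]]],
    k: int,
) -> List[Demo]:
    """Pick the nearest cluster by centroid, then its k most similar members.

    `clusters` is a sequence of (centroid_embedding, members) pairs.
    """
    # Claim 3: choose the cluster whose centroid is most similar to the query.
    _, members = max(clusters, key=lambda c: cosine(query_emb, c[0]))
    # Claim 4: keep a specified number of the most similar examples in it.
    ranked = sorted(
        members, key=lambda d: cosine(query_emb, d.embedding), reverse=True
    )
    return ranked[:k]


def reward_score(quality: float, cost: float, cost_weight: float = 0.5) -> float:
    """Claim 9: combine quality and cost (assumed linear trade-off)."""
    return quality - cost_weight * cost


# Example usage with toy two-dimensional embeddings.
clusters = [
    ([1.0, 0.0], [Demo("q1", "r1", [0.9, 0.1]),
                  Demo("q2", "r2", [1.0, 0.0]),
                  Demo("q3", "r3", [0.5, 0.5])]),
    ([0.0, 1.0], [Demo("q4", "r4", [0.0, 1.0])]),
]
picked = select_demonstrations([1.0, 0.05], clusters, k=2)
print([d.request for d in picked])  # ['q2', 'q1']
```

In this sketch the centroid comparison limits the per-request similarity search to one cluster, which is the apparent purpose of maintaining centroids in claim 3. A request router model as in claims 8-10 would consume the selected examples and be updated with the value returned by reward_score; that model is not shown here.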