Multistage data extraction and compression for machine-learned agent system memory
The two-step process of memory extraction and consolidation in machine-learned agent systems optimizes memory management, reducing computational complexity and latency by using specialized models for each task, thereby improving efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2025-12-11
- Publication Date
- 2026-06-18
AI Technical Summary
Existing machine-learned agent systems face inefficiencies in managing and utilizing memory, particularly in processing and responding to queries, due to the complexity of integrating extracted memory data with existing representations, which can lead to increased computational burden and latency.
A two-step process involving memory extraction and consolidation is employed, where relevant information is first extracted from interactions and then integrated into an updated memory representation, allowing for specialized models to optimize each task, thereby reducing computational complexity and improving efficiency.
This approach enhances processing efficiency, reduces latency, and improves accuracy by allowing for optimized models to handle each task separately, leading to more efficient memory management and energy usage.
Smart Images

Figure US2025059260_18062026_PF_FP_ABST
Abstract
Description
MULTISTAGE DATA EXTRACTION AND COMPRESSION FOR MACHINE-LEARNED AGENTSYSTEM MEMORYPRIORITY
[0001] This application is based on and claims priority to U.S. Provisional Patent Application No. 63 / 730,725 (filed December 11 , 2024). U.S. Provisional Patent Application No. 63 / 730,725 is hereby incorporated by reference herein in its entirety.BACKGROUND
[0002] A computer can receive input(s). The computer can execute instructions to process the input(s) to generate output(s) using a parameterized model. The computer can obtain feedback on its performance in generating the outputs with the model. The computer can generate feedback by evaluating its performance. The computer can receive feedback from an external source. The computer can update parameters of the model based on the feedback to improve its performance In this manner, the computer can iteratively "learn" to generate the desired outputs. The resulting model is often referred to as a machine-learned model.SUMMARY
[0003] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[0004] In an aspect, the present disclosure provides a first example computer-implemented method. In some implementations, the first example computer-implemented method includes generating a first input data structure describing one or more interactions of a machine-learned agent system. In some implementations, the first example computer-implemented method includes generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure. In some implementations, the first example computer-implemented method includes providing extracted memory data from the one or more extracted memory objects for processing using one or more second machine-learned model inference operations to generate an updated memory representation, wherein the one or more second machine-learned model inference operations process a second input data structure comprising the extracted memory data and prior memory data describing one or more prior interactions of the machine-learned agent system to generate the updated memory representation based on the second input data structure. In some implementations, the example computer- implemented method includes generating a third input data structure based on the updated memory representation and a query. In some implementations, the first example computer-implemented methodincludes generating, based on one or more third machine-learned model inference operations, a response to the query based on the third input data structure.
[0005] In some implementations, the first example computer-implemented method includes transmitting the extracted memory data to a machine-learned model system that executes the one or more second machine-learned model inference operations. In some implementations, the first example computer- implemented method includes receiving, from the machine-learned model system, output data describing the updated memory representation.
[0006] In an aspect, the present disclosure provides a second example computer-implemented method. In some implementations, the second example computer-implemented method includes receiving one or more extracted memory objects describing one or more interactions of a machine-learned agent system. In some implementations, the second example computer-implemented method includes generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine-learned agent system. In some implementations, the second example computer-implemented method includes generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure. In some implementations, the second example computer-implemented method includes storing the updated memory representation.
[0007] In an aspect, the present disclosure provides a third example computer-implemented method. In some implementations, the third example computer-implemented method includes generating a first input data structure describing one or more interactions of a machine-learned agent system. In some implementations, the third example computer-implemented method includes generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure. In some implementations, the third example computer-implemented method includes generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine- learned agent system. In some implementations, the third example computer-implemented method includes generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure. In some implementations, the third example computer-implemented method includes storing the updated memory representation.
[0008] In some implementations, the second example computer-implemented method or the third example computer-implemented method includes generating a third input data structure based on the updated memory representation and a query. In some implementations, the second example computer- implemented method or the third example computer-implemented method includes generating, based on one or more third machine-learned model inference operations, a response to the query based on the third input data structure.
[0009] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes providing, as input to a machine-learned model, the third input data structure to cause the machine-learned model to generate an output conditioned on the third input data structure.
[0010] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes providing, as the input to the machine-learned model, the updated memory representation in the third input data structure to cause the machine-learned model to generate the output conditioned on the updated memory representation.
[0011] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the machine-learned model attends over the third input data structure to generate one or more attention values representing one or more associations between one or more portions of a representation of the third input data structure and one or more portions of a context window comprising the representation of the third input data structure. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the machine-learned model generates one or more predicted values based on the one or more attention values. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the machine-learned model generates the output based on the one or more predicted values.
[0012] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more third machine-learned model inference operations are implemented using a multimodal machine- learned model.
[0013] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more third machine-learned model inference operations are implemented using a machine-learned model different from a machine-learned model used for the one or more first machine-learned model inference operations or the one or more second machine-learned model inference operations.
[0014] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more third machine-learned model inference operations are implemented using a multimodal machine- learned model configured to generate speech outputs based on image and text inputs, wherein at least one text input of the text inputs includes the updated memory representation.
[0015] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes executing a query over one or more extracted memory objects. In some implementations, the first example computer- implemented method, the second example computer-implemented method, or the third example computer- implemented method includes updating one or more values associated with the one or more extracted memory objects based on the query.
[0016] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, updating the one or more values includes deleting the one or more extracted memory objects.
[0017] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the query returns the one or more extracted memory objects based on metadata stored in association with the one or more extracted memory objects.
[0018] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the query returns the one or more extracted memory objects based on a session identifier associated with the one or more extracted memory objects.
[0019] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more values are updated based on an instruction received by the machine-learned agent system.
[0020] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the machine-learned agent system streams query data comprising images and audio from an input device to the multimodal machine-learned model to generate output data conditioned on the query data, wherein the output data is generated conditioned on the updated memory representation.
[0021] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, at least one of the following conditions is satisfied: the updated memory representation represents an additional memory object as compared to the prior memory representation; the updated memory representation represents an updated version of a memory object represented in the prior memory representation; or the updated memory representation represents, in a single memory object, memory data represented in a plurality of memory objects in the prior memory representation.
[0022] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the extracted memory data includes memory values from the one or more extracted memory objects. In someimplementations of the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method, the prior memory data includes memory values from one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system.
[0023] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the second input data structure includes a plurality of portions representing the extracted memory data and the prior memory data, wherein each respective portion of the plurality of portions corresponds to each respective memory object of a plurality of memory objects, the plurality of memory objects comprising the one or more extracted memory objects and the one or more prior memory objects.
[0024] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the extracted memory data includes memory values from the one or more extracted memory objects. In some implementations of the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method, the prior memory data includes a compressed memory representation generated, using one or more third machine-learned model inference operations, based on one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system.
[0025] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes replacing the prior memory representation with the updated memory representation.
[0026] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, replacing the prior memory representation with the updated memory representation includes storing the updated memory representation as a current version of one or more stored memory representations.
[0027] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, replacing the prior memory representation with the updated memory representation includes overwriting the prior memory representation with the updated memory representation.
[0028] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes storing the one or more extracted memory objects in association with a session identifier corresponding to the one or more interactions of the machine-learned agent system.
[0029] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one ormore first machine-learned model inference operations comprise an attention operation performed over the first input data structure.
[0030] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, generating the first input data structure includes obtaining interaction record data describing one or more interactions of a machine-learned agent system. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, generating the first input data structure includes parsing the interaction record data to remove one or more portions of the interaction record data.
[0031] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more portions of the interaction record data contain control sequences for the machine-learned agent system.
[0032] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the control sequences comprise a tool call configured to cause the machine-learned agent system to invoke a tool specified by the tool call.
[0033] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, generating the second input data structure includes retrieving, from a memory datastore, the prior memory data
[0034] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, generating the second input data structure includes retrieving, from a memory datastore, a plurality of prior memory objects describing the one or more prior interactions of the machine-learned agent system. In some implementations of the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method, generating the second input data structure includes composing the prior memory data based on the plurality of prior memory objects.
[0035] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes providing the first input data structure as input to a first machine-learned model to generate an output. In some implementations, the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method includes generating the one or more extracted memory objects based on the output.
[0036] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes: providingthe first input data structure as input to a machine-learned model to generate a first output; and generating the one or more extracted memory objects based on the first output.
[0037] In some implementations, the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method includes providing the second input data structure as input to the machine-learned model to generate a second output
[0038] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more first machine-learned model inference operations are configured for memory extraction. In some implementations of the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method, the one or more second machine-learned model inference operations are configured for memory consolidation.
[0039] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more first machine-learned model inference operations are configured for memory extraction by constructing the first input data structure to cause the machine-learned model to generate the first output to contain extracted memory data. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more second machine-learned model inference operations are configured for memory consolidation by constructing the second input data structure to cause the machine-learned model to generate the second output to contain consolidated memory data.
[0040] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more first machine-learned model inference operations comprise the execution of a first machine-learned model to generate a first output. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, wherein the first machine-learned model is different from the second machine-learned model.
[0041] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first machine-learned model is configured for memory extraction. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the second machine-learned model is configured for memory consolidation.
[0042] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first machine-learned model is trained for memory extraction. In some implementations of the first examplecomputer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the second machine-learned model is trained for memory consolidation.
[0043] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first machine-learned model includes a first adapted variant of a base machine-learned model. In some implementations of the first example computer-implemented method, the second example computer- implemented method, or the third example computer-implemented method, the second machine-learned model includes a second adapted variant of the base machine-learned model.
[0044] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first input data structure includes textual data.
[0045] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first input data structure includes audio recording data, and wherein the first machine-learned model inference operations comprise processing the audio recording data.
[0046] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first input data structure includes video recording data, and wherein the first machine-learned model inference operations comprise processing the video recording data.
[0047] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the first input data structure includes image data, and wherein the first machine-learned model inference operations comprise processing the image data.
[0048] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more first machine-learned model inference operations are executed using cached attention values computed during the one or more interactions with the machine-learned agent system.
[0049] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, one or more of the one or more first machine-learned model inference operations are executed in sequence with one or more of the one or more interactions with the machine-learned agent system.
[0050] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the one or more of the one or more first machine-learned model inference operations comprise generation of amemory tool call that instructs the machine-learned agent system to invoke a memory tool for storing extracted memory data associated with the one or more interactions.
[0051] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, generating the second input data structure includes querying a memory datastore for a set of stored memory objects, wherein the set of stored memory objects correspond to a selection of memory objects from an interaction memory datastore based on a memory filter.
[0052] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the memory filter includes a recency-based filter. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the memory filter includes a priority-based filter. In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the memory filter includes a relevance-based filter.
[0053] In some implementations of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, the memory filter operates to select the set of stored memory objects based on a respective value of a metadata attribute stored in association with each respective memory object of the set of stored memory objects.
[0054] In an aspect, the present disclosure provides an example one or more non-transitory computer- readable media storing instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising an implementation of the first example computer- implemented method, the second example computer-implemented method, or the third example computer- implemented method, or any combination thereof.
[0055] In an aspect, the present disclosure provides an example computing system. In some implementations, the example computing system includes one or more processors. In some implementations, the example computing system includes one or more non-transitory computer-readable media storing instructions that are executable by the one or more processors to cause the computing system to perform operations, the operations comprising an implementation of the first example computer- implemented method, the second example computer-implemented method, or the third example computer- implemented method, or any combination thereof.
[0056] In an aspect, the present disclosure provides an example computer program product comprising instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising an implementation of the first example computer-implemented method, the second example computer-implemented method, or the third example computer-implemented method, or any combination thereof.
[0057] Other example aspects of the present disclosure are directed to other systems, methods, apparatuses, tangible non-transitory computer-readable media, and devices for performing functions described herein. These and other features, aspects, and advantages of various implementations will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate implementations of the present disclosure and, together with the description, help explain the related principles.BRIEF DESCRIPTION OF THE DRAWINGS
[0058] Figure 1 is a block diagram of an example machine-learned agent system according to example implementations of aspects of the present disclosure.
[0059] Figure 2 is a block diagram of an example implementation of a memory extractor according to example implementations of aspects of the present disclosure.
[0060] Figure 3 is a block diagram of an example implementation of a memory consolidator according to example implementations of aspects of the present disclosure.
[0061] Figure 4 is a block diagram of an example implementation of a memory consolidator according to example implementations of aspects of the present disclosure.
[0062] Figure 5 is a communication diagram illustrating an example sequence of memory cycles according to example aspects of the present disclosure.
[0063] Figure 6 is a block diagram of an example implementation in which latent values may be cached for use in both interaction outputs and memory extraction according to example implementations of aspects of the present disclosure.
[0064] Figure 7 is a block diagram of an example implementation in which latent values may be cached for use in both interaction outputs and memory extraction according to example implementations of aspects of the present disclosure.
[0065] Figure 8 is a flow chart diagram illustrating an example method for training a machine-learned model according to example implementations of aspects of the present disclosure.
[0066] Figure 9 is a block diagram of an example processing flow for using machine-learned model(s) to process input(s) to generate output(s) according to example implementations of aspects of the present disclosure.
[0067] Figure 10 is a block diagram of an example sequence processing model according to example implementations of aspects of the present disclosure.
[0068] Figure 11 is a block diagram of an example technique for populating an example input sequence for processing by a sequence processing model according to example implementations of aspects of the present disclosure.
[0069] Figure 12 is a block diagram of an example model development platform according to example implementations of aspects of the present disclosure.
[0070] Figure 13 is a block diagram of an example training workflow for training a machine-learned model according to example implementations of aspects of the present disclosure.
[0071] Figure 14 is a block diagram of an inference system for operating one or more machine-learned model(s) to perform inference according to example implementations of aspects of the present disclosure.
[0072] Figure 15 is a block diagram of an example networked computing system according to example implementations of aspects of the present disclosure.
[0073] Figure 16 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure.
[0074] Figure 17 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure.
[0075] Figure 18 is a flow chart diagram illustrating an example method for implementing a machine- learned agent system according to example implementations of aspects of the present disclosure.
[0076] Figure 19 is a flow chart diagram illustrating an example method for implementing a machine- learned agent system according to example implementations of aspects of the present disclosure.
[0077] Figure 20 is a flow chart diagram illustrating an example method for implementing a machine- learned agent system according to example implementations of aspects of the present disclosure.DETAILED DESCRIPTION
[0078] Example implementations of the present disclosure provide a system and method for efficiently managing and utilizing memory in machine-learned agent systems. For example, the machine-learned agent system can leverage user-specific contextual information, received from the user organically over the course of one or more interactive sessions, to condition the outputs of a machine-learned model to align with the preferences of and / or prior instructions from the user. In some example implementations, a two- step process may be employed to extract memory information from the interaction sessions and consolidate the memory information into a compact representation for use in future interactions. First, a machine-learned model may extract relevant information from input data representing agent interactions. Second, a machine-learned model may consolidate this extracted information with existing memory, resulting in an updated memory representation. This updated representation may then be used to respond to subsequent queries.
[0079] In some example implementations, this two-step approach may offer several benefits. The separation of data extraction and memory consolidation may allow for more efficient processing and improved accuracy. For example, the extraction step may focus on identifying and isolating relevantinformation, while the consolidation step may focus on integrating this information effectively into the existing memory representation The two-step approach may also facilitate the use of specialized machine- learned models optimized for each respective task, potentially leading to improved performance. The two- step approach may also allow for efficient computation by subdividing a complex inference task into two less complex inference tasks, thereby facilitating the processing of each task with fewer computational resources, or facilitating a lower-latency processing of each task with fixed resources. Further, the two-step approach may provide for improved memory data indexing by emitting intermediate outputs from the extraction step prior to consolidation so that the intermediate outputs may be indexed and stored separately from the ultimately consolidated memory representation.
[0080] Specifically, for instance, outputting structured memory objects as an intermediate stage can facilitate fine-grained querying of memories not enabled by storage of only a higher-level representation thereof. For instance, a user may query a machine-learned agent system to recall information or update information associated with a particular day, interaction session, or conversation. A concise, consolidated memory representation may lack sufficient indexing metadata to facilitate tracing all memory information associated with the requested interval or event. However, the query can be served by looking to a more detailed datastore of memory objects that record, in addition to extracted memory values, metadata that characterize the extracted memory values. Such metadata may be useful for returning, updating, or deleting memory data associated with a particular query.
[0081] An example technical benefit of the two-step process is that factorizing the extraction and consolidation enables the size (e.g., data size, sequence length in tokens, etc.) of the consolidated memory representation to be adjusted or adapted for a given execution environment. For example, a given execution environment may have limited resources for executing machine-learned models, so that smaller sequence lengths are desired to reduce a memory footprint for performing forward passes through the model. In an example, a latency bound may be applied to minimize delays in obtaining responses, requiring smaller memory representations that are associated with lower memory footprints, fewer FLOPs, etc. An energy usage parameter may be specified to balance resource consumption for execution with a quality of recall - based on the parameter, a larger or smaller memory representation may be used to represent stored memories in a context window of a machine-learned model. Further memory consolidation size can be adjusted to provide network latency adaptation. For instance, low network latency can mean that the model itself using more tokens taking longer to process is tolerable as total latency will be within bounds, whereas if high network latency then it may be desirable for model responses to be as fast as possible to compensate. In general, multiple different memory representations may be generated for multiple different implementation contexts. The multiple different memory representations can correspond to multiple different tradeoffs of recall and data size or token length.
[0082] Further, in some examples, outputting structured memory objects as an intermediate stage can facilitate improved conflict resolution. For example, instead of resolving conflicts during the initial extraction phase, conflict resolution can be deferred until the consolidation step, allowing for context-dependent or query-dependent resolution strategies. For example, a memory datastore may store conflicting memory objects and allow the consolidation operation to generate a memory representation that resolves the conflicts based on, for instance, metadata stored in conjunction with each memory object. For example, metadata can indicate temporal features, environmental features, or other contextualizing features that can be used to resolve conflicts. Advantageously, a consolidation stage can implement a machine-learned model that executes attention operations over multiple memory objects together with contextualizing metadata to generate a memory representation that resolves conflicting signals.
[0083] Extracted memory data can operate as a form of memory compression In some example implementations, this extracted data may represent a compressed form of the original input data. For example, indefinitely maintaining a full record of all interactions with a machine-learned agent system may be computationally costly. A compressed representation of the interactions may facilitate recall of desired information while avoiding unnecessary storage of undesired information. For instance, compression may be achieved through the extraction of salient information during the extraction phase. Extracted memory data may then provide a compressed representation of one or more interactions with a machine-learned agent system.
[0084] Additionally, a consolidated memory representation may provide a form of memory compression. In some example implementations, the consolidated memory representation may also be a compressed representation of stored extracted memory objects. In an example, extracted memory objects may store a variety of data characterizing an interaction. While such data may be useful for indexing, querying, or maintaining a robust memory datastore, all such data may not be equally salient for every query. Accordingly, a consolidated memory representation generated based on stored extracted memory objects can represent data in the extracted memory objects in a format that discards data less relevant to a query. In this manner, for instance, a consolidated memory representation may provide a form of memory compression.
[0085] Various aspects of the technology described herein can provide other technical effects and benefits. For instance, an example technical effect of example implementations of the present disclosure is improved efficiency in processing and reduced latency in responding to subsequent queries. In some example implementations, this may be achieved by reducing the complexity of the overall memory management process. For example, in some example implementations, explicitly outputting extracted memories as intermediate representations may reduce the computational burden. The extraction model may then only be tasked with identifying portions and attributes of input data that are to be remembered. Downstream models may then only be tasked with consolidating these explicitly identified memories, ratherthan requiring a single model that is powerful enough to implicitly extract insights from raw input data to extract and directly output a consolidated representation thereof. This separation of tasks may lead to more efficient use of model parameters and improved overall performance. More efficient use of model parameters can include increasing a performance of a given model without a commensurate increase in model size or computational cost. More efficient use of model parameters can include achieving a desired performance with a smaller model. Smaller models executed in sequence may further provide a benefit of smaller maximum memory footprint on a host device that is executing the model.
[0086] In this manner, for instance, an example technical effect of example implementations of the present disclosure is increased energy efficiency in performing operations using machine-learned models, thereby improving the functioning of computers implementing such models. For instance, example implementations can provide for more energy-efficient runtime execution or inference (e.g., by the use of pre-processed memory data). In some scenarios, increased energy efficiency can provide for less energy to be used to perform a given task (e.g., less energy expended to maintain the model in memory, less energy expended to perform calculations within the model, etc.). In some scenarios, increased energy efficiency can provide for more task(s) to be completed for a given energy budget (e.g., a larger quantity of tasks, more complex tasks, the same task but with more accuracy or precision, etc.). In this manner, for instance, the improved energy efficiency of example implementations of the present disclosure can reduce an amount of pollution or other waste associated with implementing machine-learned models and systems, thereby advancing the field of machine-learning and artificial intelligence as a whole. The amount of pollution can be reduced in toto (e.g., an absolute magnitude thereof) or on a normalized basis (e.g., energy per task, per model size, etc.). For example, an amount of CO2 released (e.g., by a power source) in association with training and execution of machine-learned models can be reduced by implementing more energy-efficient training or inference operations. An amount of heat pollution in an environment (e.g., by the processors / storage locations) can be reduced by implementing more energy-efficient training or inference operations.
[0087] Various example implementations are described herein with respect to the accompanying Figures.
[0088] Figure 1 is a block diagram of an example machine-learned agent for extracting, consolidating, and using memory data in service of assisting users to perform tasks. An example machine-learned agent system 100 can obtain an interaction record 102 that contains data describing interactions between an external environment (e.g., a user) and machine-learned agent system 100. Machine-learned agent system 100 can include a memory extractor 104 that operates to extract memories from interaction record 102. For example, memory extractor 104 can provide an input data structure 106 containing interaction data 108 to machine-learned model system(s) 110. Machine-learned model systems 110 can include one or more machine-learned models that execute one or more inference operations based on input data structure(s)106 to produce output data structure 112. Output data structure 112 can include data for extracted memory object(s) 114. For example, extracted memory object(s) 114 can include at least a portion of interaction data 108 or memory data derived therefrom, as well as associated metadata describing features of interaction data 108.
[0089] Extracted memory object(s) 114 can be stored in interaction memory datastore 116. Interaction memory datastore 116 can operate to store an indexed data structure of extracted memory object(s) 114, supporting queries over memory objects based on their memory content or metadata. Machine-learned agent system 100 can include a memory manager 118 to maintain, curate, query, retrieve, or perform other operations over memory objects in interaction memory datastore 116.
[0090] Machine-learned agent system 100 can include a memory consolidator 120 that operates to consolidate information extracted from one or multiple interactions between an external environment (e.g., a user) and machine-learned agent system 100. For example, memory consolidator 120 can receive data from extracted memory object(s) 114 as well as retrieved memory data 122 from prior interactions and use machine-learned model system(s) 110 to generate a consolidated memory representation. For example, memory consolidator 120 can provide an input data structure 124 containing extracted memory data 126 from extracted memory object(s) 114. Input data structure 124 can further contain prior memory data 128 based on retrieved memory data 122 from prior interactions. Machine-learned model system(s) 110 can include one or more machine-learned models that execute one or more inference operations based on input data structure(s) 124 to produce an output data structure 130. Memory consolidator 120 can use output data structure 130 to generate one or more memory representations 132 that represent a current state of a memory of the agent system with respect to its interactions with the environment (e.g., a user). Memory representation(s) 132 can be stored in interaction memory datastore 116.
[0091] Memory representation (s) 132 can be leveraged by machine-learned agent system 100 for subsequent interaction with the external environment (e.g., to assist in addressing a given task). For example, machine-learned agent system 100 can receive, from an input interface 134, a query 136. Machine-learned agent system 100 can generate an input data structure 138 that contains memory representation(s) 132 (or data derived therefrom) and query data 140. Machine-learned model system(s) 110 can include one or more machine-learned models that execute one or more inference operations based on input data structure(s) 138 to produce an output data structure(s) 142. Output data structure(s) 142 can include response data generated conditioned on memory representation(s) 132 and query data 140. Machine-learned agent system 100 can output response 144 based on output data structure(s) 142. Response 144 can be a response to a user based on query 136 (e.g., an answer to a question posed in query 136), a response to an inquiry of a user based on query 136 (e.g., a recommendation for an action based on query 136), or an intermediate response generated by machine-learned agent system 100 to perform in service of query 136 (e.g., a function call generated by a machine-learned model to execute inservice of a task indicated by query 136). Machine-learned agent system 100 can output response 144 via one or more output interfaces 146.
[0092] As used herein, a "user” can refer to a number of different entities including, as some examples, an account (e.g., a “user account" associated with a software or a service), a sub-account of an account, a person or individual, a corporation or corporate user, a legal entity or other defined entity, an administrator, a system manager, a computer-implemented user (e.g., an agent system, a debug user or testing user, etc.), and / or other suitable users. A “user” associated with an interaction can be a profile associated with an interaction submitted via an entity user. For instance, a “user” can be an instance of a profile contained in a superset of profiles, wherein the superset can be associated with an individual or an entity or organization.
[0093] A user can be associated with a key or other credential that can authenticate inputs and outputs received from and output to the user. A user can be specified using a key or credential used to accompany or sign calls to an application programming interface.
[0094] Machine-learned agent system 100 can be or include processing logic, software, firmware, or hardware configured to automate one or more operations of or interactions with a computing device or system. For instance, machine-learned agent system 100 can be hosted on a local device or a cloud server to control operations of the host device or other devices. Machine-learned agent system 100 can receive inputs and initiate actions or tasks based on the inputs.
[0095] Machine-learned agent system 100 can be or include an artificial intelligence (“Al”) agent. Machine-learned agent system 100 can control machine-learned models and Al-enabled systems to help users solve tasks. For instance, machine-learned agent system 100 can employ one or more machine- learned models to generate outputs responsive to queries from users. As one example, an agent system can operate on a computing system configured to receive an input from a user device and provide an output responsive to the input to the user device. The agent system can be or can implement a multi-modal agent (e.g., a multi-modal artificial intelligence agent). For instance, a multi-modal agent can process inputs from one or more data modalities. In some implementations, the agent system can be implemented as a “situated agent” in which the agent system shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and / or textual data which are also observable by the human user. The agent system can process these inputs to generate responses that are contextually-relevant for the user’s physical or digital environment, for example enabling the agent system to generate dialogue or other responses or outputs which assist the user in understanding and / or navigating the environment.
[0096] Machine-learned agent system 100 can execute operations that include both learned operators (e.g., operators that execute predictions using machine-learned models having learned values, or operators that execute code generated based on predictions using machine-learned models having learned values) and non-learned operators. In some implementations, machine-learned agent system 100 can execute oneor multiple learned operators to perform tasks and make decisions within a non-learned framework of rules and software infrastructure. For instance, an agent framework can include routing layers, i / o callbacks, or other subroutines that route and trigger processing of incoming information with various learned operators. For some processing stages, learned operators can apply predetermined recipes or templates for processing the information (e.g., using a series of prompt templates to ingest new information or format output information). For some processing stages, learned operators can use the outputs of a first prediction to control the inputs for downstream operations, such as by predicting parameters for application programming interface calls, invoking code interpreter environments for executing generated code, etc.
[0097] Machine-learned agent system 100 and any components thereof can engage one or more machine-learned models to generate inferences for performing various tasks. Machine-learned agent system 100 and any components thereof can interact with machine-learned model system(s) 110 to obtain inferences from one or multiple models. As described herein, reference to machine-learned agent system 100 and any components thereof using a machine-learned model can include machine-learned agent system 100 and any components thereof interacting with machine-learned model system(s) 110. As described herein, reference to machine-learned agent system 100 and any components thereof using a machine-learned model can include machine-learned agent system 100 and any components thereof interacting with machine-learned models other than machine-learned model system(s) 110.
[0098] Machine-learned agent system 100 can operate responsive to queries obtained from one or more input interfaces. Machine-learned agent system 100 can execute to assist a user with a task by processing a query based on a user input received from a user. For instance, machine-learned agent system 100 can reduce a complexity of inputs to perform various tasks on a computing device or system. For example, machine-learned agent system 100 can use machine-learned models to recognize tasks to perform and execute operations to achieve the tasks based on inputs in view of accumulated context from prior inputs. In this manner, for instance, each input can be augmented by the model's learned skillset as well as the available context from the interaction memory, so that even minimal inputs can be effective to initiate execution of complex tasks.
[0099] Machine-learned agent system 100 can operate in a user-specific manner. For example, memory data can be stored in a user-specific manner. For instance, a user can be associated with one or more accounts and / or an account can be associated with one or more users, and memory data can be specific to an account or specific to a user. For instance, an account may be associated with one or more individuals that have access to (e.g., manage) the account, and user-specific memory data may be associated with the individuals themselves and / or the account. As one particular example, one user can be a corporate entity associated with a corporate account, and one or more employees of the corporate entity can each be individual users that have access to memory data linked to the account associated with the corporate entity. Each of the individual users can further be linked to memory data associated with their respectiveuser profiles or accounts. In this manner, for instance, multiple overlapping memories can be stored and accessed as desired. In some additional examples, a user or account can be associated with one or more user sub-accounts or profiles. For example, an account may have a first profile associated with personal use and a second profile associated with business use. Memory data may be associated with each profile. In some implementations, these example aspects may be combined in a variety of combinations. For example, a user may be associated with a first user-specific memory associated with a first profile and a second user-specific memory associated with a second profile, each of which is inaccessible by the other profile, and a third user-specific memory that is associated with the user (e.g. the user's overall account) directly such that it is accessible to both the first profile and the second profile.
[0100] Machine-learned agent system 100 can perform memory extraction and compression on prior interactions to obtain memory representations for conditioning future interactions. This process may enable the system to efficiently manage and utilize memory resources by selectively storing only the most relevant information from past interactions, thereby reducing storage requirements and improving query response times. The system may further leverage the compressed representation of past interactions to improve the efficiency of subsequent queries by reducing the computational burden associated with processing large amounts of data. This may be particularly advantageous in resource-constrained environments or when dealing with a large volume of historical interaction data. For example, for multimodal interaction mechanisms (e.g., image and audio based virtual agents that consume image and audio streams), historical interaction data can consume significant storage volumes.
[0101] To extract useful memory information from prior interactions, machine-learned agent system 100 can obtain interaction record(s) 102 from one or more data sources. Interaction record 102 can be generated by machine-learned agent system 100 itself or received from another source.
[0102] Interaction record 102 can be or include a data structure containing data describing one or more inputs received by machine-learned agent system 100. Interaction record 102 can include data describing outputs generated by machine-learned agent system 100 in response to the one or more inputs. For instance, interaction record 102 can form a dialog history between machine-learned agent system 100 and its environment. The environment can include one or more users. The environment can include one or more tools, data sources, or other systems with which machine-learned agent system 100 interacts. The environment can be a physical environment or a virtual or digital environment.
[0103] Interaction record 102 can include one or more interactive "turns” in a multi-turn interaction. Each turn may include data describing an input from the environment and a corresponding output from machine- learned agent system 100. The input data may include user queries, commands, or other forms of input. The output data may include responses generated by machine-learned agent system 100, actions taken by the system, or other forms of output.
[0104] Interaction record 102 can include no interactive “turns.” For example, interaction record 102 can include a one-way interaction between machine-learned agent system 100 and the external environment. For instance, interaction record 102 can include observations of the environment by machine-learned agent system 100 that do not elicit corresponding interactions with the environment.
[0105] Interaction record 102 can include one or more data modalities. These data modalities may include text, images, audio, video, sensor data, or combinations thereof. Text data may be stored as strings in a database or as tokens in a vector database. Image data may be stored as arrays of pixel values or as feature vectors derived from a convolutional neural network. Audio data may be stored as waveforms or as spectrograms, potentially leveraging techniques such as mel-frequency cepstral coefficients (MFCCs) for efficient storage and retrieval. Video data may be stored as sequences of images or as compressed video streams. Sensor data, such as accelerometer or gyroscope readings, may be stored as time series in a database. In some implementations, a structured data format, such as JSON or Protocol Buffers, may be used to represent the interaction record, allowing for flexible storage and querying of diverse data modalities.
[0106] Interaction record 102 can contain data captured by one or more sensors of a device operating machine-learned agent system 100. For example, such sensor data may include data from accelerometers, gyroscopes, cameras, microphones, GPS modules, or other sensors. The sensor data may itself capture the interaction (e.g., an audio recording of speech interactions; an image recording of an environment; etc). The sensor data may be incorporated into the interaction record to provide context for the interaction(s).
[0107] Interaction record 102 can contain data captured by one or more sensors of other devices that are associated with a user of machine-learned agent system 100. For example, a user account may be associated with a mobile phone, a tablet, and a laptop. Data from each device may be included in interaction record 102, potentially tagged with device identifiers to distinguish the source of the data. This may allow the system to personalize responses and tailor memory management strategies based on the specific device and context of use. The system may also use this information to consolidate data from multiple devices, providing a unified view of user interactions across all associated devices, while still maintaining the ability to query and retrieve data specific to a particular device. This may be particularly useful for applications requiring context-aware responses or for managing user preferences across multiple devices.
[0108] Interaction record 102 can include a record of content or other data ingested by both machine- learned agent system 100 and a user (e.g., as a “situated agent"). These shared observations can include various different data modalities, such as video, audio, and / or textual data which are also observable by a user. The video, audio, or text can be observable in a physical environment (e.g., image captures or audio recordings of a physical environment) or a digital environment (e.g., renderings or screen captures of digital content rendered to a user as well as to machine-learned agent system 100).
[0109] For example, interaction record 102 can include on-screen context data from a user device, such as a screenshot or screen recording. On-screen context data can include image or textual content representing the visual elements displayed on the user's screen at the time of the interaction. This onscreen context data can be captured using various techniques, such as screen capture APIs, screen recording APIs, or other methods capable of capturing the visual output of a user device. The on-screen context data can be encoded in various formats, such as JPEG, PNG, or video formats, and can be included as part of the interaction record 102. On-screen context data can be combined with other data modalities within interaction record 102, such as user instructions, audio recordings, or sensor data. Onscreen context data can also be used to improve the performance of machine-learned agent system 100 in assisting user tasks by conditioning agent responses on the specific visual context of the interaction. Example on-screen captures can be obtained depicting a variety of software applications, such as word processing applications, image editing applications, browser applications, operating systems and file management systems, gaming applications, media applications, etc. Machine-learned agent system 100 can ingest such content to generate outputs to control the depicted applications (e.g., generating application programming interface calls to directly provide inputs to the application via application programming interfaces of the application; generating application programming interface calls to simulate user inputs to the application; etc.).
[0110] Interaction record 102 can include audio data representing audio being rendered by a user device or audio being received at a microphone of the user device. This audio data can be stored in various formats. Metadata associated with the audio data can include timestamps indicating the start and end times of the audio segment, a source identifier specifying whether the audio was captured by the device's microphone or rendered by the device’s speakers, or other contextual information such as the application or process that generated or captured the audio. For instance, if the audio is part of a newscast or other content being played on the device, the metadata could include the name of the content and a summary of the specific segment being played. If the audio is captured from the device's microphone, the metadata could include environmental information or other contextualizing descriptors (e.g., "morning walk"). Audio information can then be used by machine-learned models to better understand user intentions, provide more relevant responses, or enhance the user’s overall experience. For example, if the user is listening to a newscast and asks a question related to a specific topic mentioned in the newscast, the system can use the audio data to identify the relevant segment of the newscast and provide a more accurate and contextually appropriate response. Similarly, if the user is in an outdoor environment and asks about the surrounding wildlife, the system can analyze the audio data to identify bird calls or other animal sounds and provide information about the species present.
[0111] An example multimodal interaction record 102 follows:
[0112] {
[0113] “interaction Record” : [
[0114] {
[0115] “role” : “user”,
[0116] “parts” : {
[0117] “text” : "",
[0118] “audio” : base64 encoded audio recording of request to transcribe recipe and generate shopping list,
[0119] “imageFrames” : [
[0120] {
[0121] “image” : base64 encoded image depicting handwritten recipe,
[0122] “mime” : “image / jpeg”
[0123] }
[0124] ]
[0125] },
[0126] “time” : T1
[0127] },
[0128] {
[0129] “role”: “agent”,
[0130] “parts” : {
[0131] “text” : String of transcribed recipe and shopping list,
[0132] “audio” : base64 encoded audio recording of agent’s confirmation of completion of task,
[0133] “imageFrames” : []
[0134] },
[0135] “time” : T2
[0136] }
[0137] ]
[0138] }
[0139] Other data modalities may be used. Interaction record 102 may be single-modality (e.g., text- only, image-only, audio-only, etc.).
[0140] Interaction record 102 can include discrete data messages or can include data streams or portions thereof (e.g., sliding windows of data from a stream). For instance, interaction record 102 can include streamed audiovisual content (e.g., from a situated agent).
[0141] Interaction record 102 can include internal commands that are not output to a user. For instance, interaction record 102 can include messages generated by machine-learned agent system 100 for internaluse by the system to initiate or control operations, such as tool or function calls, internal reasoning or deliberation regarding a query, debugging logs, or other internal messages.
[0142] For example, to continue the above example, in response to the provided recipe, the user could ask for an instructional video to help the user learn to make the recipe. To service this request, machine- learned agent system 100 can generate one or more messages that invoke external search tools to search an internet database of available media resources. The invocation of a tool and the response from the tool can be captured in interaction record 102. For example, an example extension of the above interaction record object follows:
[0143] {
[0144] “interaction Record”: [
[0145]
[0146] {
[0147] “role”: “user”,
[0148] “parts” : {
[0149] “text” : ““,
[0150] “audio” : base64 encoded audio recording of request to find instructional video for making recipe
[0151] “imageFrames” : []
[0152] },
[0153] “time” : T3
[0154] },
[0155] {
[0156] “role”: “agent”,
[0157] “parts” : {
[0158] “functionCall” : {
[0159] “name" : “videoSearch”,
[0160] “args” : [
[0161] “q uery” : String summarizing key terms from recipe
[0162] ]
[0163] },
[0164] “time” : T4
[0165] },
[0166] {
[0167] “role”: “agent”,
[0168] “parts” : {
[0169] “functionResponse” : {
[0170] "name” : “videoSearch”,
[0171] "response” : !
[0172] “url” : URL pointing to video
[0173] ]
[0174] },
[0175] "time” : T5
[0176] },
[0177] {
[0178] "role”: "agent”,
[0179] "parts” : {
[0180] “text” : hyperlink to URL,
[0181] “audio” : base64 encoded audio recording of response confirming completion
[0182] },
[0183] “time” : T6
[0184] }
[0185] ]
[0186] }
[0187] Other types of internal interactions can be included with various different tools.
[0188] Memory extractor 104 can be or include a processing logic, software, firmware, or hardware configured to ingest, process, consolidate, and manipulate interaction data to extract memory information. Memory extractor 104 can ingest all or part of interaction record 102 to extract memory information. This process may involve the use of one or more machine-learned models to identify and extract relevant information from the interaction record, such as key events, entities, relationships, or other salient features. The extracted information may then be structured into one or more memory objects, each containing the extracted data along with associated metadata, such as timestamps, source identifiers, or contextual information.
[0189] Memory extractor 104 can use one or more machine-learned models to extract memory information. For example, memory extractor 104 can use one or more machine-learned models to generate structured data objects that represent salient information from interaction record 102 for storage as discrete memory objects in interaction memory datastore 116. In general, memory extractor 104 can cause a machine-learned model to generate a structured memory object by providing, as an input to the machine- learned model, an input data structure configured to generate outputs that contain information extracted from interaction record 102 in a format that can be parsed, stored, and indexed in interaction memory datastore 116. An example input data structure is input data structure 106.
[0190] Input data structure 106 can be or include a data structure configured to provide information from interaction record 102 to machine-learned model system(s) 110 in a format that causes one or more machine-learned models of machine-learned model system(s) 110 to generate outputs that can be used to extract useful memory objects from interaction record 102. Input data structure 106 can contain explicit or implicit conditioning values that are selected to cause one or more machine-learned models that process the conditioning values to generate outputs in alignment with a desired structure or schema for memory objects extracted from interaction record 102.
[0191] Input data structure 106 can include instructions for instructing a machine-learned model that has been trained to follow instructions. For example, input data structure 106 can include a prompt for a machine-learned model that specifies parameters of the extraction task, including a format of the inputs, a format of the outputs, a type of memory information to be extracted, a size of an extracted memory value, a schema or structure for memory objects to be extracted (e.g., a name for the memory object, a list of associated components, etc.), or other information to guide the extraction workflow.
[0192] In general, input data structure 106 can be or include a data object configured for input to machine-learned model system(s) 110. Input data structure 106 can include a structure defined based on an application programming interface of machine-learned model system(s) 110.
[0193] Input data structure 106 can contain a single data modality or multiple data modalities. Various example data modalities for input data structure 106 are described herein with respect to the description of inputs to example machine-learned model 1 . Input data structure 106 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0194] Interaction data 108 can be or include one or more data representations of interaction record 102. Interaction data 108 can be the same as interaction record 102. Interaction data 108 can be a subset of or superset of interaction record 102.
[0195] Interaction data 108 can be a filtered or processed version of interaction record 102. For example, memory extractor 104 can obtain interaction record 102, filter interaction record 102 to generate interaction data 108, and provide interaction data 108 to machine-learned model system(s) 110. In an example, memory extractor 104 can filter interaction record 102 to omit formal structures, formatting data, internal markers, or other features of interaction record 102 that are not designated as within the scope of the memory for machine-learned agent system 100. For instance, tool or function calls emitted by machine- learned agent system 100 within one or more interactions of interaction record 102 may be filtered out as potentially not being strong memory signals. Similarly, extraneous metadata associated with individual interactions may be suppressed (e.g., internal debugging information).
[0196] Interaction data 108 can include the same or different modalities of data as interaction record 102. For example, interaction record 102 can include image data and audio data (e.g., video data), and interaction data 108 can include image data and audio data (e.g., video data). In some implementations,interaction record 102 can include image data and audio data (e.g. , video data), and interaction data 108 can include text data (e.g., captions of images; transcriptions of audio).
[0197] For instance, to continue the above recipe example, interaction data 108 can include the full interaction record 102. Interaction data 108 can include the interaction record with the internal function calls filtered out. Interaction data can include, in place of multimodal messages, textual summaries thereof. For instance, an example of interaction data 108 follows:
[0198] (T1) User: “Please transcribe this recipe and generate a shopping list.” <image id = “1234” caption-'recipe for custard">
[0199] (T2) Agent: “Sure! Here you go: “ + String of transcribed recipe and shopping list
[0200] In an example, interaction data 108 can include textual data generated based on a multimodal interaction record 102. Using textual data to represent multimodal interaction records can provide a compact representation of complex interactions.
[0201] Machine-learned model system(s) 110 can be or include processing logic, software, firmware, or hardware configured to host and execute machine-learned models to obtain predictions. For example, machine-learned model system(s) 110 can include software platforms configured to manage the storage and execution of various different machine-learned models, various different adapters or other profiles for one or more machine-learned models, etc. Machine-learned model system(s) 110 can be implemented locally to machine-learned agent system 100 (e.g., on a same device or system) or remotely from machine- learned agent system 100 (e.g., on a different device or system). Machine-learned model system(s) 110 can load machine-learned parameters from storage into memory devices (e.g., memory of one or more hardware accelerator devices), transform inputs based on an architecture of the loaded machine-learned model, maintain a cache of intermediate states (e.g., latent or otherwise) for the machine-learned model during execution, and return outputs generated by the executed model. Machine-learned model system(s) 110 can execute, for instance, one or more machine-learned model(s) 1 . Example machine-learned model types and configurations that can be used to process input data structure 106 are described herein with respect to machine-learned model 1. Example aspects of machine-learned model system(s) 110 are described herein with respect to model host 31 .
[0202] Machine-learned model system(s) 110 can facilitate interactions between components of machine-learned agent system 100 and one or more machine-learned models. Machine-learned model system(s) 110 can directly execute one or more machine-learned models or can provide API access to other systems (on-device or on external devices) that execute one or more machine-learned models using inputs provided via the API.
[0203] Machine-learned model system(s) 110 can provide access to generalist models. For instance, machine-learned model systems 110 can provide access to foundational models that are configured to perform inference for a wide variety of tasks. Machine-learned model systems 110 can provide access tomachine-learned sequence processing models, such as large language models or “LLMs’’ or small language models or "SLMs,” vision-language models of "VLMs,” vision models (e.g , convolutional neural nets), audio models, etc.
[0204] Machine-learned model systems 110 can provide access to a variety of specialized models that implement various functionality of components of machine-learned agent system 100. The models can be designed and trained (e.g., fine-tuned) to perform specific tasks such as parsing and analyzing input data, extracting information to remember, predicting a classification or type of information, extracting memories relevant to a query, etc.
[0205] Machine-learned model systems 110 can include or invoke various types of hardware and software components specifically designed to execute machine learning algorithms and models. Examples of devices or systems that can be used to implement machine-learned model systems 110 include dedicated machine learning engines equipped with one or more high-performance GPUs or other hardware accelerators for accelerated computing. These environments can be hosted locally on an edge device or hosted on a server to offload computational tasks.
[0206] Machine-learned agent system 100 can invoke a machine-learned model to process interaction record 102 to generate interaction data 108. For example, a multimodal machine-learned model can ingest multimodal interaction data to generate interaction data in a modality compatible with input data structure 106. In an example, a multimodal machine-learned model can ingest multimodal interaction data to caption imagery or video, transcribe or summarize audio, etc.
[0207] Memory extractor 104 can provide input data structure 106 as input to one or more machine- learned models implemented by machine-learned model system(s) 110. Machine-learned model system 110 can execute one or more machine-learned model inference operations to generate outputs based on the provided inputs. One or more machine-learned models can process input data structure 106 to generate one or more outputs. The one or more outputs can be output back to machine-learned agent system 100 in output data structure 112.
[0208] For example, machine-learned model system(s) 110 can implement a generative language model to process input data structure 106 to generate language content (e.g., a string of text) containing extracted memory information. The generation of the extracted memory information can be conditioned on input data structure 106. Example generation and conditioning mechanisms are described in further detail below with respect to machine-learned model 1.
[0209] Output data structure 112 can include complete or partial outputs from machine-learned model system(s) 110. For instance, output data structure 112 can include a portion of a data stream as machine- learned model system 110 streams generated content back to machine-learned agent system 100. Output data structure 112 can include a complete response generated by machine-learned model system(s) 110 based on input data structure 106.
[0210] In general, output data structure 112 can be or include a data object output by machine-learned model system(s) 110. Output data structure 112 can include a structure defined based on an application programming interface of machine-learned model system(s) 110.
[0211] Output data structure 112 can contain a single data modality or multiple data modalities. Various example data modalities for output data structure 112 are described herein with respect to the description of outputs from example machine-learned model 1. Output data structure 112 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0212] Output data structure 112 can include content generated by one or more machine-learned models executed by machine-learned model system(s) 110. Output data structure 112 can include metadata associated with the generated content, such as confidence values, scores (e.g., based on an evaluation policy), logits, etc.
[0213] Memory extractor 104 can engage in multiple rounds of communication with machine-learned model system(s) 110 to process all or part of interaction record 102. For instance, extraction can proceed in parts by using one or more models to execute one or more inference operations that generate a first output. The first output can be an initial breakdown or analysis of interaction data 108. A second processing round can include using one or more models (the same or different model(s)) to process the output of the first part (or a new input) to perform a second portion of the task (e.g., the generation of the extracted memory information).
[0214] Memory extractor 104 can generate extracted memory object(s) 114 based on data from output data structure 112. For example, output data structure 112 can include a string representation of a data structure recording memory information and attributes thereof. Memory extractor 104 can parse output data structure 112 to store an extracted memory object in a storage mechanism (e.g., a storage data structure, such as a database).
[0215] Extracted memory object(s) 114 can include a discrete data structure that represents one or more pieces of extracted memory information. Extracted memory object(s) 114 can include a single piece of extracted memory information or multiple pieces of extracted memory information. Extracted memory object(s) 114 can include one or more metadata fields along with the data extracted from interaction record 102. For example, metadata can include time stamps, user identifiers, device or system identifiers, contextual metadata, etc.
[0216] Extracted memory object(s) 114 can include information excerpted from interaction record 102 or interaction data 108 (e.g., data copied). Extracted memory object(s) 114 can include inferred information derived or generated from input data structure 106, output data structure 112, or interaction record 102 (e.g., based on an inference or prediction operation).
[0217] Interaction memory datastore 116 can be a data structure configured for the structured storage of data. Interaction memory datastore 116 can be or include a relational or non-relational database, a datatable, a document, a file system, or other structured data representation. Examples of devices or systems that can be used to implement interaction memory datastore 116 include traditional relational databases, NoSQL databases, in-memory data stores, and distributed file systems. Additionally, cloud storage solutions provided by network-hosted platforms can also be used. Interaction memory datastore 116 can be implemented locally to machine-learned agent system 100 (e.g., on a same device or system) or remotely from machine-learned agent system 100 (e.g., on a different device or system).
[0218] Interaction memory datastore 116 can store data of various different modalities. Interaction memory datastore 116 can store text data. Interaction memory datastore 116 can store image data. Interaction memory datastore 116 can store audio data. Interaction memory datastore 116 can store combined audio and image data (e.g., video data). Interaction memory datastore 116 may store arbitrary data types. Various example data modalities that can be stored in interaction memory datastore 116 are described herein with respect to the description of inputs to or outputs from example machine-learned model 1.
[0219] Interaction memory datastore 116 can store memory data in a native modality in which the interaction was performed. For example, an agent system that implements a machine-learned model that can perform inference natively on input audio data to generate responses (e.g., rather than requiring a transcription to text first) can store, in interaction memory datastore 116, audio recordings of interactions in addition to or in lieu of transcribed text. These audio recordings can capture rich contextual information that may be lost in the transcription, such as cadence, tone, inflection, background noise, etc. of a user’s utterance. Similarly, an agent system that implements a machine-learned model that can perform inference natively on input image data to generate responses (e.g., rather than requiring a transcription to text first) can store, in interaction memory datastore 116, image data of interactions in addition to or in lieu of captioning text. The image data (e.g., video frames, still image captures) can capture rich contextual information that may be lost in captioning, such as environmental information, lighting, colors, mood, facial expressions, body posture, etc. In general, an agent system that implements a machine-learned model to perform inference natively on multiple modalities of input data to generate responses can store, in interaction memory datastore 116, multiple of such modalities of memories.
[0220] Interaction memory datastore 116 can include a vector-based data recall structure. For instance, a vector database can store embedded representations of memory values to facilitate similarity searches based on an embedding of a query value. Based on identified similarity matches, a corresponding data record in interaction memory datastore 116 for a given matched vector in the vector database can be retrieved and served in response to the query.
[0221] Interaction memory datastore 116 can include a hierarchical storage structure. For example, a low-precision storage layer can provide for rapid querying with low precision. For some queries, low precision may be sufficient. For other queries, the results of the rapid low-precision query can be used todirect or guide slower searches through more detailed storage layers (e.g., higher-dimensional vector embeddings, higher-detail textual, audio, or image entries).
[0222] Interaction memory datastore 116 can be multimodal. For example, one or more queries can include data of a first modality and one or more queries can include data of a second modality. Interaction memory datastore 116 can store memory information in both the first and the second modality. For example, interaction memory datastore 116 can store text data and image data, audio data and image data, text data and audio data, or other combinations of data modalities.
[0223] Figure 2 is a block diagram of an example implementation of memory extractor 104 and example memory objects generated thereby. Memory extractor 104 can include instructions 200 in input data structure 106 that specify desired contents of generated memory information received in output data structure 112. Memory extractor 104 can parse output data structure 112 according to the desired schema to populate fields of a structured memory object 202-1 . Example fields in memory object 202-1 include a memory value 204-1 (e.g., storing information to be remembered) and memory metadata 206-1 (e.g., storing information that contextualizes the information to be remembered, such as a timestamp, session id, or other context).
[0224] In an example, instructions 200 can include a textual prompt that provides explicit natural language instructions that condition the generations of a machine-learned model implemented by machine- learned model system(s) 110.
[0225] For instance, memory extractor 104 may leverage instructions 200 to distinguish between system configuration memories and personal interaction memories. These instructions may refer to explicit labels in extracted memory data. For example, each extracted memory object 114 may include a designated field specifying its memory type ("system configuration" or "personal interaction"). Alternatively, instructions 200 may include an instruction indicating the intended type of memory extraction to be performed (e.g., "extract system configuration memories" or " extract personal interaction memories").
[0226] In some example implementations, the instructions may take the form of prompts or conditioning values provided to machine-learned model system(s) 110. These prompts may explicitly instruct the models to process and extract different memory types separately. For instance, the prompt may instruct the model to: "Extract only memories tagged with ‘personal interaction1and disregard all other memory types." Alternatively, the prompt may guide the model to differentiate based on content. For example, the prompt may state: "Extract memories related to user preferences, interactions, and conversational history into ‘personal interaction' memory objects. Extract memories related to system settings, configurations, and internal states into ‘system configuration’ memory objects."
[0227] The machine-learned models within machine-learned model system(s) 110 may operate to recognize and respond to these instructions. The models may use these instructions to filter, weight, or otherwise selectively process different information types during the extraction process. For example, themodel may, based on internal latent states activated based on the provided instructions, attend over input data structure 106 in a particular manner to cause the generation of outputs that extract memory information according to a designated type. A model may employ different extraction strategies for different memory types, such as generating more concise memory values for system configuration memories and more detailed memory values for personal interaction memories.
[0228] Output data structure 112 generated by machine-learned model system(s) 110 may reflect the separation of memory types. For example, output data structure 112 can contain generated memory object data that correspond to a respective schema specified in instructions 200.
[0229] An example textual prompt for instructions 200 follows:
[0230] “You are a specialist tasked with extracting crucial information from conversations between a user and an Al assistant. This extracted information will be used to personalize future interactions and enhance the user experience.
[0231] Your objective is to identify and categorize two types of information:
[0232] 1. **User Preferences for Al Behavior:** These are instructions or desires expressed by the user regarding how the Al assistant should interact with them. Examples include preferred communication style, level of formality, and any specific naming conventions.
[0233] 2. **Memory of Shared Experiences:** These are summaries of significant events, topics, and details from the conversation that are relevant to the user’s ongoing interaction with the Al. This includes explicit requests from the user to remember something, as well as implicit cues about important information.
[0234] # Process:
[0235] 1. ** Identify User Preferences:** Analyze the conversation and extract any explicit or implicit instructions from the user about how the Al should behave. Focus solely on directives related to the Al’s functionality, not on general user preferences.
[0236] 2. **Summarize Shared Experiences:** Create concise, diary-like entries for each significant interaction, event, or topic discussed. Prioritize information the user explicitly asked the Al to remember. Also, include details that are implicitly important based on the conversation flow, such as plans, discussions about other people, and recurring topics. Exclude information already captured as a user preference. Include objective details about individuals involved (like name, location, or profession) and subjective details about the user (like their attitudes or goals related to the discussed topics).
[0237] # Output Format:
[0238] Provide two distinct lists:
[0239] 1. **User Preferences:** A timestamped list of identified user preferences for Al behavior. If none are present, indicate “None.”
[0240] 2. **Shared Experiences:** A timestamped list of concise summaries of significant interactions.
[0241] Use a consistent timestamp format and maintain a third-person perspective throughout. Avoid special characters in the output. The length and detail of the summaries should reflect the conversation’s complexity while remaining as concise as possible.”
[0242] Memory object 202-1 can include memory value 204-1 . Memory value 204-1 can include one or more pieces of information that a machine-learned model executed by machine-learned model system(s) should remember or associate with a query. Memory value 204-1 can include one or multiple pieces of information extracted from interaction record 102. Memory value 204-1 can include one or multiple data modalities.
[0243] Memory object 202-1 can include memory metadata 206-1 . For instance, a time stamp can be parsed from the returned output and stored in a corresponding field of the memory object. A session identifier can be input into memory object 202-1 based on a session identifier associated with interaction record 102 or based on a session identifier generated by memory extractor 104 (e.g., for tracing memory sources). A user account credential or designation can be input into memory object 202-1 based on a designation associated with interaction record 102 or based on a user designation input into memory extractor 104. This can provide for granular recall and disambiguation of multiple user accounts from a single session or from multiple sessions.
[0244] Memory extractor 104 can populate one or more fields of memory object 202-1 based on output(s) generated by one or more models. Memory extractor 104 can populate one or more fields based on data obtained without using a model. For instance, memory extractor 104 can populate a field with a raw portion of interaction data 108 or other portions of interaction record 102
[0245] With reference again to Figure 1 , machine-learned agent system 100 can include memory manager 118 to maintain the memory data generated and extracted by memory extractor 104 within interaction memory data store 116. For instance, of particular advantage, memory objects stored in interaction memory data store 116 can be indexed and queried based on the structured data format of the memory objects. Memory manager 118 can execute various actions or operations to store, retrieve, delete, or otherwise manage memory data. For instance, requests to delete information associated with a particular session can be implemented by causing memory manager 118 to query interaction memory data store 116 for memory objects from a particular session and delete any such memory objects.
[0246] For instance, because the memory of the machine-learned agent system can be formed by interactions with the user, the user can also use the same interaction mechanism to limit, erase, or otherwise modify the memory of the machine-learned agent system. For instance, the machine-learned agent system can store memory objects that contain a timestamp value. The user can instruct the machine- learned agent system to forget information after three months. The machine-learned agent system can implement this instruction by configuring the interaction memory datastore to delete memory objects on a rolling basis based on the specified horizon. Similarly, a user can utter a command such as, "don'tremember this conversation,” and the machine-learned agent system can purge any memory data related to the conversation upon conclusion of the session. Memory objects associated with the session can be stored with a session identifier to facilitate grouping of memory data associated with the session.
[0247] In general, memory manager 118 can control a schedule for memory extraction, memory consolidation, and memory updated (e.g., organizing, deleting, etc.). Memory manager 118 can operate based on one or more triggers. Example triggers may be based on explicit user commands, implicit system inferences, or scheduled events. Explicit user commands may include, but are not limited to, verbal or textual instructions such as “remember this,” “forget that,” “remember [information],” “forget [information],” “add to memory,” “delete from memory,” “save this,” “clear memory,” or other similar directives. These commands may be accompanied by specific data to be remembered or forgotten. Scheduled events may trigger memory updates at predetermined intervals, such as daily, weekly, or monthly These events may involve reviewing and updating existing memory objects, deleting expired or irrelevant information, or consolidating related memory objects. The system may also employ a combination of these triggers, using explicit commands to prioritize certain information while relying on implicit inferences and scheduled events for maintaining a comprehensive and up-to-date memory datastore. Further, triggers may be based on events external to the user interaction, such as receiving data from a connected device or an external system. For example, a sensor reading exceeding a threshold may trigger the system to store the sensor data as a memory object. The system may also be configured to receive external commands or instructions to add, update, or delete memory objects. The system may prioritize certain triggers over others based on configurable parameters or system-learned heuristics. For example, explicit user commands may have higher priority than implicit inferences.
[0248] Similarly, specific requests to recall particular information can be implemented by causing memory manager 118 to query interaction memory data store 116 and return, to machine-learned agent system 100, one or more memory objects containing relevant content. Such queries may be implemented using a structured or indexed data query language such as SQL. Other query techniques may also be used, such as similarity searches over vector representations of memory objects or fields thereof. Storing robust memory objects in a structured manner can facilitate a powerful memory manager 118. In this manner, for instance, the two-stage memory extraction and consolidation technique of the present disclosure can provide for powerful recall and memory control performance while using compact memory representations at inference time.
[0249] Memory manager 118 can be or include processing logic, software, firmware, or hardware configured to govern the storage and recall of memory data describing interactions with machine-learned agent system 100. Memory manager 118 can execute operations that include both learned operators (e.g., operators that execute predictions using machine-learned models having learned values, or operators thatexecute code generated based on predictions using machine-learned models having learned values) and non-learned operators.
[0250] Memory manager 118 can execute one or multiple learned operators to perform tasks and make decisions within a non-learned framework of rules and software infrastructure. For instance, memory manager 118 may execute learned operators that apply predetermined recipes or templates for processing incoming information (e.g . , using a series of prompt templates to ingest new information or format output information). For some processing stages, learned operators can use the outputs of a first prediction to control the inputs for downstream operations, such as by predicting parameters for application programming interface calls, invoking code interpreter environments for executing generated code, etc.
[0251] Memory manager 118 can control the storage of new memories. For example, memory manager 118 can perform one or more operations to ingest new information to extract information to remember. For example, memory manager 118 can use one or more inferences generated using one or more machine- learned models to process one or multiple inputs (e.g., a single new input, a dialog between a user and agent system 100) to predict new information to remember. Predicting new information to remember can include generating a summary of information to remember. Predicting new information to remember can include generating a portion or excerpt of content to remember. For example, memory manager 118 can invoke a machine-learned model input (e.g., after one or more inputs are received) that causes a machine- learned model to predict whether there is anything in the query that is to be remembered.
[0252] Memory manager 118 can control the updating of stored memories. For example, memory manager 118 can purge or update stale information based on new information. Memory manager 118 can query interaction memory datastore 116 to retrieve relevant entries for new information and update the relevant entries based on the new information. The new information can replace the information stored in the relevant entries (e.g., an updated date for an event) or can instruct deletion of the information stored in the relevant entries (e.g., an instruction to forget a particular conversation).
[0253] Memory manager 118 can control the recall of stored memories. For example, memory manager 118 can query interaction memory datastore 116 to retrieve information for processing a query. For example, memory manager 118 can initiate similarity search operations to retrieve a number of relevant stored memories for a particular query. Memory manager 118 can use one or more machine-learned models to generate a query (e.g., generate SQL string) for querying interaction memory datastore 116 to retrieve memories for a particular query.
[0254] To generate a compact representation of agent memory for use at inference time (e.g., to condition one or more machine-learned model inferences), machine-learned agent system 100 can use memory consolidator 120 to process extracted memory data along with any existing prior memory data to generate consolidated memory data.
[0255] Memory consolidator 120 can be or include a processing logic, software, firmware, or hardware configured to ingest, process, consolidate, and manipulate memory data (e.g., memory values, memory objects, memory representations) to generate consolidated memory data. In general, memory consolidator 120 can operate to consolidate newly extracted memory objects with existing memory data 122 that can be retrieved from interaction memory datastore 116.
[0256] Retrieved memory data 122 can include previously generated memory objects or memory information derived therefrom (e.g., memory values from one or more fields of a memory object). Retrieved memory data 122 can include a previously generated compact memory representation generated based on a prior memory consolidation cycle.
[0257] Memory consolidator 120 can use one or more machine-learned model inference operations to consolidate memory information into a compact representation thereof. Memory consolidator 120 can use one or more machine-learned model inference operations with conditioning inputs to consolidate memory information.
[0258] Memory consolidator 120 can engage machine-learned model system(s) 110 to execute one or more machine-learned model inference operations to generate a consolidated memory representation. For example, memory consolidator 120 can provide input data structure 124 to machine-learned model system(s) 110 as input to one or more machine-learned models.
[0259] Input data structure 124 can be or include a data structure configured to provide extracted memory data 126 and prior memory data 128 to machine-learned model system(s) 110 in a format that causes one or more machine-learned models of machine-learned model system(s) 110 to generate outputs that can be used to consolidate extracted memory information into a compact representation. Input data structure 124 can contain explicit or implicit conditioning values that are selected to cause one or more machine-learned models that process the conditioning values to generate outputs in alignment with a desired structure or schema for a consolidated memory representation.
[0260] Input data structure 124 can include instructions for instructing a machine-learned model that has been trained to follow instructions. For example, input data structure 124 can include a prompt for a machine-learned model that specifies parameters of the consolidation task, including a format of the inputs, a format of the outputs, a type of memory information to be consolidated, a desired size or length of the consolidated memory representation, a schema or structure for the consolidated memory representation, or other information to guide the consolidation workflow.
[0261] In general, input data structure 124 can be or include a data object configured for input to machine-learned model system(s) 110. Input data structure 124 can include a structure defined based on an application programming interface of machine-learned model system(s) 110. Input data structure 124 can contain a single data modality or multiple data modalities. Various example data modalities for input data structure 124 are described herein with respect to the description of inputs to example machine-learned model 1 . Input data structure 124 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, vector embeddings, etc.
[0262] Extracted memory data 126 can be or include one or more data representations of data from extracted memory object(s) 114. Extracted memory data 126 can be the same as extracted memory object(s) 114. Extracted memory data 126 can be a subset of or superset of extracted memory object(s) 114. Extracted memory data 126 can be a filtered or processed version of extracted memory object(s) 114. For example, memory consolidator 120 can obtain extracted memory object(s) 114, filter extracted memory object(s) 114 to generate extracted memory data 126, and provide extracted memory data 126 to machine- learned model system(s) 110. In an example, memory consolidator 120 can filter extracted memory object(s) 114 to omit metadata fields, or other features of extracted memory object(s) 114 that are not designated as within the scope of the consolidated memory representation for machine-learned agent system 100. Extracted memory data 126 can include the same or different modalities of data as extracted memory object(s) 114. For example, extracted memory object(s) 114 can include multiple data modalities (e.g., text and image data), and extracted memory data 126 can include a single data modality (e.g., text data).
[0263] Prior memory data 128 can be or include one or more data representations of memory object(s) or prior compact memory representation(s) of prior interactions. Prior memory data 128 can be the same as memory representation(s) 132. Prior memory data 128 can be a subset of or superset of memory representation(s) 132. Prior memory data 128 can be a filtered or processed version of memory representation(s) 132. For example, memory consolidator 120 can obtain memory representation(s) 132, filter memory representation(s) 132 to generate prior memory data 128, and provide prior memory data 128 to machine-learned model system(s) 110. Prior memory data 128 can include the same or different modalities of data as memory representation(s) 132.
[0264] Input data structure 124 can include both extracted memory data 126 and prior memory data 128, allowing the machine-learned model(s) to consider both recent and historical information when generating the consolidated memory representation. The specific format and content of input data structure 124 will depend on the specific machine-learned model(s) used for consolidation and the desired characteristics of the consolidated memory representation. For example, input data structure 124 may be structured as a JSON object, a Protocol Buffer message, or another suitable data structure.
[0265] Output data structure 130 may be a data structure generated by machine-learned model system(s) 110 based on input data structure 124. Output data structure 130 may contain one or more data representations of a consolidated memory representation. Output data structure 130 may be generated by machine-learned model system(s) 110 by executing one or more machine-learned model inference operations.
[0266] Output data structure 130 may include metadata associated with the generated consolidated memory representation, such as confidence values, scores (e.g., based on an evaluation policy), logits, or other relevant metadata. Output data structure 130 may include a single data modality or multiple data modalities. For example, output data structure 130 may include textual data, numerical data, vector embeddings, or other suitable data representations. Output data structure 130 may be a complete or partial output from machine-learned model system(s) 110. For instance, output data structure 130 may include a portion of a data stream as machine-learned model system(s) 110 streams generated content back to memory consolidator 120. Output data structure 130 may include a complete response generated by machine-learned model system(s) 110 based on input data structure 124.
[0267] In general, output data structure 130 may be or include a data object output by machine-learned model system(s) 110. Output data structure 130 may include a structure defined based on an application programming interface of machine-learned model system(s) 110. Output data structure 130 may contain a single data modality or multiple data modalities. Output data structure 130 may be or include any one or more of text or other symbolic data, numerical data, image data, audio data, compressed or encoded data, vector embeddings, or other suitable data representations. The format of output data structure 130 may be determined by the specific machine-learned model(s) used for consolidation and the desired characteristics of the consolidated memory representation. For example, output data structure 130 may be structured as a JSON object, a Protocol Buffer message, a vector embedding, or another suitable data structure.
[0268] Output data structure 130 may be used by memory consolidator 120 to generate one or more memory representations 132. The content of output data structure 130 may include a more compact or compressed representation of the extracted memory data 126 and prior memory data 128. This process may involve parsing and interpreting the data within output data structure 130. For example, if output data structure 130 contains a textual representation of the consolidated memory, memory consolidator 120 may parse this text to extract relevant information. Alternatively, if output data structure 130 contains a vector embedding, memory consolidator 120 may directly utilize this embedding as the compact memory representation 132. In other implementations, memory consolidator 120 may perform further processing on the data in output data structure 130, such as dimensionality reduction or quantization, to generate an even more compact representation.
[0269] Figure 3 is a block diagram of an example implementation of memory consolidator 120. Memory consolidator 120 can perform consolidation over a set of new and prior memory objects. An example process includes retrieving relevant prior memory data 122 (e.g., data from memory object(s) 202-1 , . . . , 202-M) from interaction memory datastore 116. Retrieved prior memory data 122 may then be combined with newly extracted memory data 126 (e.g., obtained from extracted memory object(s) 114) to form input data structure 124. Input data structure 124 may then be provided to machine-learned model system(s) 110. Machine-learned model system(s) 110 may execute one or more machine-learned models to processinput data structure 124, generating output data structure 130. This output data structure 130 may include data for a memory representation object (e.g., memory representation 132).
[0270] Input data structure 124 can include instructions 300 for generating memory representation object 302. For example, instructions 300 can include a prompt for conditioning the generation of outputs by one or more machine-learned models of machine-learned model system(s) 110.
[0271] For instance, memory consolidator 120 may leverage instructions 300 to distinguish between system configuration memories and personal interaction memories. These instructions may refer to explicit labels in extracted memory data 126. For example, each extracted memory object 114 may include a designated field specifying its memory type ("system configuration" or "personal interaction"). Alternatively, input data structure 124 may include an instruction indicating the intended type of memory consolidation to be performed (e.g., "consolidate system configuration memories" or "consolidate personal interaction memories"). This metadata may be provided by memory extractor 104 or another component of machine- learned agent system 100.
[0272] In some example implementations, the instructions may take the form of prompts or conditioning values provided to machine-learned model system(s) 110. These prompts may explicitly instruct the models to process and consolidate different memory types separately. For instance, the prompt may instruct the model to: "Consolidate only memories tagged with 'personal interaction' and disregard all other memory types." Alternatively, the prompt may guide the model to differentiate based on content. For example, the prompt may state: "Consolidate memories related to user preferences, interactions, and conversational history into a ‘personal interaction' memory representation. Consolidate memories related to system settings, configurations, and internal states into a separate ‘system configuration' memory representation."
[0273] The machine-learned models within machine-learned model system(s) 110 may operate to recognize and respond to these instructions. The models may use these instructions to filter, weight, or otherwise selectively process different memory types during the consolidation process. For example, the model may, based on internal latent states activated based on the provided instructions, attend over input data structure 124 in a particular manner to cause the generation of outputs that consolidate memory information according to a designated type. A model may employ different consolidation strategies for different memory types, such as generating more concise representations for system configuration memories and more detailed representations for personal interaction memories.
[0274] Output data structure 130 generated by machine-learned model system(s) 110 may reflect the separation of memory types. Memory consolidator 120 may obtain separate portions of a memory representations for system configuration memories and personal interaction memories. These separate portions may be stored separately or together in interaction memory datastore 116, potentially using different indexing schemes or storage locations to facilitate efficient retrieval based on memory type. The specific format and structure of the separate memory representations may be determined by theinstructions provided in input data structure 124 and the capabilities of the machine-learned models used for consolidation.
[0275] An example textual prompt for instructions 300 follows:
[0276] “# Task: Consolidating Extracted Information for Personalized Al Interactions
[0277] You are a specialist tasked with consolidating previously extracted information from user-AI conversations. This consolidated information will be used to personalize future interactions and enhance the user experience. You will receive two inputs: ( 1 ) previously extracted information from prior conversations and (2) newly extracted information from the most recent conversation. Your objective is to integrate this new information with the existing data, updating and organizing it effectively.
[0278] The information is categorized into two types:
[0279] 1. **User Preferences for Al Behavior:'”* These are the user’s instructions or desires regarding how the Al assistant should interact with them (e.g., communication style, level of formality, naming conventions).
[0280] 2. ** Memory of Shared Experiences:** These are summaries of significant events, topics, and details from conversations relevant to the user’s ongoing interaction with the Al. They include explicit requests to remember something and implicitly important information.
[0281] # Consolidation Process:
[0282] **User Preferences for Al Behavior:**
[0283] * Eliminate duplicate entries from the newly extracted information.
[0284] * Resolve conflicts between new and existing preferences by keeping the most recent entry(based on the timestamp).
[0285] * The final output should be a chronologically ordered list of distinct user preferences, each with its initial timestamp.
[0286] **Memory of Shared Experiences:**
[0287] * Condense all newly extracted memories into a *singie, concise, diary-like summary*, using the earliest timestamp from the new extractions. Retain all essential details about events, topics, individuals (objective details like name, location, profession), and the user’s subjective perspectives (attitudes, goals) while removing redundancy.
[0288] * Any conflicting information within the newly extracted memories should be ignored. Do not attempt to resolve these conflicts ordraw conclusions from them.
[0289] * Append the consolidated new memory to the existing list of shared experiences.
[0290] # Output Format:
[0291] Provide two distinct, chronologically ordered lists:
[0292] 1. ** User Preferences:** A timestamped list of user preferences for Al behavior. If none exist, output "None. "
[0293] 2. **Shared Experiences:** A timestamped list of concise summaries of significant interactions.
[0294] # Important Considerations:
[0295] * For user preferences, only add new preferences or update existing ones based on recency.
[0296] * For shared experiences, always produce a *single* consolidated, diary-like entry for the new interaction and append it to the existing history. Focus on conciseness and prioritize information the user explicitly asked to be remembered, as well as implicitly important details.
[0297] * Maintain a consistent third-person perspective and timestamp format throughout. Avoid special characters in the output. The length and detail of the summaries should reflect the conversation’s complexity while remaining concise.”
[0298] Memory representation object 302 may be a data structure representing a consolidated summary of user interactions with the machine-learned agent system. It may be generated by memory consolidator 120. This object may include a compact representation of information extracted from multiple memory objects.
[0299] In an example, memory representation object 302 can include textual data. For instance, memory representation object 302 can be or include a string representation of a current snapshot of an agent memory with respect to a history of interactions. Compression may be achieved through various techniques, such as summarization, abstraction, or dimensionality reduction. The object may contain textual data summarizing key events, user preferences, and conversational context. This textual data may be generated by a machine-learned model, potentially a large language model, trained to generate concise and informative summaries The model may be conditioned on the input data structure 124, which may include both newly extracted memory data and prior memory data. The model may be further conditioned by instructions 300, which may specify the desired format and content of the summary. The resulting textual summary may discard less relevant details, thereby achieving compression.
[0300] The length and detail of the summary may be dynamically adjusted based on the complexity of the input data and the specified constraints. For example, the summary may prioritize information explicitly requested by the user to be remembered. The summary may also include implicitly important information, such as recurring topics or plans. The object may also include numerical data, such as timestamps or confidence scores associated with the summarized information. These numerical values may provide additional context or metadata for the summary.
[0301] In some instances, extracted memory objects 114 may contain extracted memory data 126 that may include potentially conflicting information from the extraction phase. Instead of resolving these conflicts during the extraction phase, machine-learned agent system 100 may defer conflict resolution to the consolidation step. This deferral may be achieved by directly providing the extracted memory objects 114, or a representation thereof, as input to the consolidation process. Memory consolidator 120 may then employ machine-learned model system(s) 110 to process this input, including the potentially conflictinginformation within the extracted memory objects 114. The machine-learned model(s) may generate outputs that resolve, identify, or otherwise navigate conflicting signals in full view of the context of the conflicting signals, such as by using timestamps, user designations, or other metadata associated with the extracted memory objects 114 to determine a resolution. For example, the model may employ attention mechanisms to weigh the importance of different memory objects and their associated metadata when generating the consolidated memory representation 132. The model may prioritize certain types of information or to resolve conflicts based on predefined rules or heuristics (e.g., as specified in instructions). The output of this consolidation process may be a consolidated memory representation 132 that resolves the conflicts and provides a coherent and compact representation of the agent’s memory.
[0302] This approach of deferring conflict resolution to the consolidation step may offer several advantages in some situations. First, it may simplify the extraction process by allowing memory extractor 104 to focus solely on identifying and extracting relevant information without the added complexity of conflict resolution. For instance, conflict resolution may require more contextual information. By omitting such context (e.g., prior memories), the extraction process may be more efficiently computed. For instance, a more focused task provided to a machine-learned model (e.g., focusing only on extraction) may result in higher performance outcomes, outputs that are easier to validate or evaluate due to factorization, etc. This separation of concerns may lead to improved efficiency and accuracy in the extraction phase. Second, it may enable the use of more sophisticated and context-aware conflict resolution strategies. By deferring conflict resolution to the consolidation step, the system may leverage the full context of the extracted memory data 126 and any prior memory data 128 to make informed decisions about how to resolve conflicts. This may result in a more accurate and robust memory representation 132. Third, this approach may facilitate the use of specialized machine-learned models or machine-learned model instructions (e.g., prompts) for each task. Memory extractor 104 may employ models or prompts optimized for information extraction, while memory consolidator 120 may employ models or prompts optimized for conflict resolution and memory consolidation. This specialization may lead to improved performance and efficiency. Fourth, deferring conflict resolution may reduce the computational burden on the extraction phase, potentially allowing extraction on edge devices, which may correspond to a source of interaction data. This can reduce a transmission of the full raw interaction record, decreasing network transmissions. Further, this can potentially increase a granularity of data control by separating which stages have access to the full interaction record.
[0303] In some implementations, conflict resolution can be deferred to runtime, such as when a memory representation is being used to condition a generation for a future interaction. For example, consolidation can be performed with instructions that cause a model to generate a memory representation that preserves all conflicting memory information without collapse into a resolved representation. This can further defer the conflict resolution so that, at inference time, when a full set of updated context is available (e.g., prioravailable context plus newly available context), a model may evaluate the most likely output in view of all input signals.
[0304] Memory consolidator 120 may retrieve a subset (e.g., a top-K set) of stored memory objects from interaction memory datastore 116 based on a ranking system that may incorporate recency, relevance, priority, or other metrics, either individually or in combination. The ranking may be determined by one or more machine-learned models within machine-learned model system(s) 110. These models may be trained to assign scores reflecting the importance of each memory object based on the specified criteria. The ranking can be based on metadata values stored in the respective memory objects.
[0305] Recency may be determined by a timestamp associated with each memory object (e.g., memory metadata 206-1). More recent memory objects may receive higher scores. This recency-based ranking may be weighted more heavily for certain types of memory, such as user preferences for Al behavior, where the most recent preference may be prioritized.
[0306] Relevance may be determined by comparing the content of each memory object to a current query or a moving average of recent queries (e.g., a centroid of a cluster of recent queries). This comparison may involve techniques such as keyword matching, semantic similarity analysis, or other machine-learned methods. Memory objects with higher semantic similarity to the query may receive higher relevance scores. Relevance scores may be weighted differently depending on the type of memory and the nature of the query. For example, a query regarding a specific event may prioritize relevance over recency.
[0307] Priority may be assigned based on metadata associated with memory objects. For instance, memory objects explicitly marked as high priority by the user or system may receive higher priority scores. Priority may also be inferred from the content of the memory object (e.g., by memory manager 118 using one or more machine-learned models of machine-learned model system(s) 110). For example, a memory object detailing a critical system event may be assigned a higher priority than a memory object summarizing a casual conversation.
[0308] Combinations of recency, relevance, and priority may be used to generate a comprehensive ranking. For example, a weighted average of the individual scores may be computed, with the weights adjusted based on the specific application and query context.
[0309] Figure 4 is a block diagram of an example implementation of memory consolidator 120. Memory consolidator 120 can perform consolidation over a set of new memory objects incrementally over a prior consolidated memory representation. For example, memory representation object 302 can be retrieved by memory consolidator 120 and processed using instructions 400 to generate an updated memory representation object 402. Updated consolidated memory representation object 402 may then be stored in interaction memory datastore 116.
[0310] This incremental approach can be more efficient than the full consolidation approach, particularly when dealing with a large volume of memory data, as it avoids the computational cost of re-consolidating the entire memory representation with each new set of interactions.
[0311] The input data structure 124 may include instructions 400 (e.g., prompts) that specifically guide the incremental update process, specifying the method of integration (e.g., additive, subtractive, or replacement), the weighting of new versus old information, or other parameters to control the update process.
[0312] An example textual prompt for instructions 400 follows:
[0313] “# Task: Consolidating Extracted Information for Personalized Al Interactions
[0314] You are a specialist tasked with consolidating previously extracted information from user-AI conversations. This consolidated information will be used to personalize future interactions and enhance the user experience. You will receive two inputs: (1) the existing consolidated information and (2) newly extracted information from the most recent conversation. Your objective is to integrate this new information with the existing data, updating and organizing it effectively.
[0315] The information is categorized into two types:
[0316] 1. **User Preferences for Al Behavior:'”* These are the user’s instructions or desires regarding how the Al assistant should interact with them (e.g., communication style, level of formality, naming conventions).
[0317] 2. ** Memory of Shared Experiences:** These are summaries of significant events, topics, and details from conversations relevant to the user’s ongoing interaction with the Al. They include explicit requests to remember something and implicitly important information.
[0318] # Consolidation Process:
[0319] **User Preferences for Al Behavior:**
[0320] * Eliminate duplicate entries from the newly extracted information.
[0321] * Resolve conflicts between new and existing preferences by keeping the most recent entry(based on the timestamp).
[0322] * The final output should be a chronologically ordered list of distinct user preferences, each with its initial timestamp.
[0323] **Memory of Shared Experiences:**
[0324] * Condense all newly extracted memories into a *singie, concise, diary-like summary*, using the earliest timestamp from the new extractions. Retain all essential details about events, topics, individuals (objective details like name, location, profession), and the user’s subjective perspectives (attitudes, goals) while removing redundancy.
[0325] * Any conflicting information within the newly extracted memories should be ignored. Do not attempt to resolve these conflicts ordraw conclusions from them.
[0326] * Append the consolidated new memory to the existing list of shared experiences.
[0327] # Output Format:
[0328] Provide two distinct, chronologically ordered lists:
[0329] 1. ** User Preferences:** A timestamped list of user preferences for Al behavior. If none exist, output "None. "
[0330] 2. ** Shared Experiences:** A timestamped list of concise summaries of significant interactions.
[0331] # Important Considerations:
[0332] * For user preferences, only add new preferences or update existing ones based on recency.
[0333] * For shared experiences, always produce a *single* consolidated, diary-like entry for the new interaction and append it to the existing history. Focus on conciseness and prioritize information the user explicitly asked to be remembered, as well as implicitly important details.
[0334] * Maintain a consistent third-person perspective and timestamp format throughout. Avoid special characters in the output. The length and detail of the summaries should reflect the conversation’s complexity while remaining concise.”
[0335] Memory representation 402 may be an updated version of memory representation 302.
[0336] With reference again to Figure 1 , memory representation(s) 132 (e.g., memory representation object(s) 302, 402) can be leveraged by machine-learned agent system 100 for subsequent interaction with the external environment (e.g., to assist in addressing a given task). For example, machine-learned agent system 100 can receive, from an input interface 134, a query 136. Machine-learned agent system 100 can generate an input data structure 138 that contains memory representation(s) 132 (or data derived therefrom) and query data 140. Machine-learned model system(s) 110 can include one or more machine- learned models that execute one or more inference operations based on input data structure(s) 138 to produce an output data structure(s) 142. Output data structure(s) 142 can include response data generated conditioned on memory representation(s) 132 and query data 140. Machine-learned agent system 100 can output response 144 based on output data structure(s) 142. Response 144 can be a response to a user based on query 136 (e.g., an answer to a question posed in query 136), a response to an inquiry of a user based on query 136 (e.g., a recommendation for an action based on query 136), or an intermediate response generated by machine-learned agent system 100 to perform in service of query 136 (e.g., a function call generated by a machine-learned model to execute in service of a task indicated by query 136). Machine-learned agent system 100 can output response 144 via one or more output interfaces 146.
[0337] Input interface(s) 134 can include various devices or systems that receive input signals or other data from users or other devices or systems. Examples of such devices or systems can include microphones and cameras that capture voice and visual inputs, respectively, keyboards and touchscreens that allow for textual and touch-based inputs, sensors, transducers, or other digital or analog signal sources, including network adapters, wireless receivers, etc.
[0338] Input interface(s) 134 can process a variety of inputs provided by users or automated systems. For instance, users can provide voice commands, typed messages, or selections made via touchscreen interfaces. For instance, users can provide input through spoken commands detected via an analog transducer, and a system can convert the detected signals into a digital format that the machine-learned agent system can process. External devices and systems can communicate via input interface(s) 134 to transmit messages or other data objects or signals to machine-learned agent system 100. Input interface(s) 134 can be implemented using an application programming interface (API) exposed over a network or within an execution environment. For example, machine-learned agent system 100 can expose an application programming interface accessible by one or more other applications (e.g., a front end application hosting a user interface, a back-end application, such as an operating system, or other software application) to engage machine-learned agent system 100 to perform tasks.
[0339] Query 136 can be or include any type of data provided to machine-learned agent system 100 for processing. Query 136 can be structured or unstructured and may come from diverse sources such as text entries, voice commands, stored files, sensor outputs, downloaded networked content, etc. For example, a user might type a request into a text interface, speak a command into a voice-activated device, or a connected device might automatically send data based on certain triggers.
[0340] Query 136 can include instructions to perform a task. Various example tasks are described herein with respect to the description of example machine-learned model 1 .
[0341] Query 136 can include a declaration of information. The information can be declared in a recorded statement from a user or data recorded from other sources. Machine-learned agent system 100 can ingest query 136 to determine whether and how to remember the provided information.
[0342] Query 136 can contain a single data modality or multiple data modalities. Various example data modalities for query 136 are described herein with respect to the description of inputs to example machine- learned model 1 . Query 136 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0343] Query 136 can be ingested by machine-learned agent system 100 using an input ingestion system. An input ingestion system can process and analyze input data received from users or other systems. An input ingestion system can parse the raw query data, which may be in structured or unstructured forms, such as text entries, voice commands, or other digital formats. An input ingestion system can employ natural language processing (NLP) techniques to interpret and transform textual data into a structured format that can be further utilized within the system. For instance, NLP techniques such as tokenization, entity recognition, and syntactic parsing can be used to parse the input data into chunks. Chunks can encapsulate semantically associated portions of query 136. An input ingestion system can use machine-learned model system(s) 110 to ingest query 136.
[0344] Input data structure 138 may be a data structure configured to provide memory representation (s) 132 and query data 140 to machine-learned model system(s) 110 in a format that may cause one or more machine-learned models of machine-learned model system(s) 110 to produce output data structure(s) 142. Input data structure 138 may contain explicit or implicit conditioning values selected to cause one or more machine-learned models processing such values to generate outputs aligned with a desired structure or schema for response data. Input data structure 138 may include instructions for instructing a machine- learned model trained to follow instructions. Input data structure 138 may include a prompt for a machine- learned model specifying parameters of the inference task, including a format of the inputs, a format of the outputs, a type of response data to be generated, a desired size or length of the response data, a schema or structure for the response data, or other information to guide the inference workflow. Input data structure 138 may be, or may include, a data object configured for input to machine-learned model system(s) 110 and may include a structure defined based on an application programming interface of machine-learned model system(s) 110. Input data structure 138 may contain a single data modality or multiple data modalities. Input data structure 138 may be, or may include, any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, vector embeddings, or other suitable data representations. Memory representation(s) 132, or data derived therefrom, may be included within input data structure 138. Query data 140 may be included within input data structure 138. Input data structure 138 may further include metadata associated with memory representation(s) 132 and / or query data 140, such as timestamps, user identifiers, confidence values, scores, or other relevant metadata. The format of input data structure 138 may be determined by the specific machine-learned model(s) used for inference and the desired characteristics of the output data structure(s) 142. For example, input data structure 138 may be structured as a JSON object, a Protocol Buffer message, or another suitable data structure.
[0345] Query data 140 may be a data structure containing information derived from query 136, formatted for use by machine-learned model system(s) 110. Query 136 may be received from input interface 134 and may represent a user request or inquiry. Query data 140 may be generated by machine-learned agent system 100 and may include, but is not limited to, one or more of the following: a textual representation of query 136; a structured representation of query 136; metadata associated with query 136, such as a timestamp, user designation, or device identifier; and / or a combination of these data modalities. The generation of query data 140 from query 136 may involve preprocessing steps, such as natural language processing (NLP) techniques, to extract key information and transform the query into a format suitable for use by the machine-learned models. Query data 140 may be formatted to align with the input requirements of the machine-learned models within machine-learned model system(s) 110, potentially including specific formatting, encoding, or other transformations to ensure compatibility. The specific content and format of query data 140 may vary depending on the specific implementation and the requirements of the machine- learned models used for processing the query. In some examples, query data 140 may be a directrepresentation of query 136, while in other examples, query data 140 may be a significantly transformed representation of query 136.
[0346] Output data structure 142 may be a data structure generated by machine-learned model system(s) 110 based on input data structure 138. Output data structure 142 may contain one or more data representations of a final or intermediate response to query data 140. Output data structure 142 may be generated by machine-learned model system(s) 110 by executing one or more machine-learned model inference operations.
[0347] Output data structure 142 may be a complete or partial output from machine-learned model system(s) 110. For instance, output data structure 142 may include a portion of a data stream as machine- learned model system(s) 110 streams generated content back to machine-learned agent 100. Output data structure 142 may include a complete response generated by machine-learned model system(s) 110 based on input data structure 138.
[0348] In general, output data structure 142 may be or include a data object output by machine-learned model system(s) 110. Output data structure 142 may include a structure defined based on an application programming interface of machine-learned model system(s) 110. Output data structure 142 may contain a single data modality or multiple data modalities. Output data structure 142 may be or include any one or more of text or other symbolic data, numerical data, image data, audio data, compressed or encoded data, vector embeddings, or other suitable data representations. The format of output data structure 142 may be determined by the specific machine-learned model(s) used for responding to user queries at runtime and the desired characteristics of the interaction experience.
[0349] Output data structure(s) 142 may include response data generated conditioned on memory representation(s) 132 and query data 140. In some example implementations, output data structure(s) 142 may represent a final response to a user query 136, such as a message to render for a user (audio message, textual message, etc.). This final response may be generated by one or more machine-learned models within machine-learned model system(s) 110 based on the input data structure 138, which includes memory representation(s) 132 and query data 140. The format of this final response may vary depending on the specific application and may include text, images, audio, or other data modalities. The final response 144 may be output via one or more output interfaces 146.
[0350] Output data structure(s) 142 may represent an intermediate response to a query 136, such as a tool call or function call. This intermediate response may be generated by one or more machine-learned models within machine-learned model system(s) 110 to facilitate the processing of the user query. For example, the intermediate response may involve invoking an external tool or service to obtain additional information or perform a specific task. The intermediate response may be structured as a data object containing information such as the name of the tool or service to be invoked, the arguments to be passed to the tool or service, and any other relevant parameters. This intermediate response may be processed bymachine-learned agent system 100 to initiate the tool call and obtain the results. The results of the tool call may then be incorporated into subsequent processing to generate a final response to the user. In some example implementations, the intermediate response may itself be included in interaction record 102 to maintain a complete record of the agent’s interactions. The intermediate response may also be used to generate subsequent intermediate responses or final responses. Output data structure(s) 142 may include multiple intermediate responses and / or a final response. The specific content and structure of the intermediate and final responses may vary depending on the nature of the user query and the capabilities of the machine-learned agent system 100.
[0351] In general, machine-learned agent system 100 can engage machine-learned model system(s) 110 over multiple interaction cycles to process query 136 before generating response 144. Machine- learned agent system 100 can invoke one or more machine-learned models over multiple turns or iterations (e.g., N iterations) prior to preparing and outputting response 144. For example, machine-learned agent system 100 can implement a chain-of-thought self-deliberation technique to generate content describing reasoning about query 136 in view of memory representation(s) 132. This generated content can be used to condition future generations of additional content in service of a final response 144. For example, a first input data structure can contain an instruction to generate a thorough analysis of query 136 in view of memory representation(s) 132. Output data structure 142 can contain the generated analysis. A second input data structure can contain the generated analysis along with an instruction to generate a final response based on the analysis. A second output data structure can contain the final response. Machine- learned agent system 100 can generate response 144 based on the second output data structure, which was based on the first output data structure. The raw outputs may not be surfaced to the user. For example, only a response 144 might be output to output interface(s) 146. The intermediate outputs generated during the N iterations may be logged for analysis but may not be used in an output.
[0352] Response 144 can include data from output data structure 142. Response 144 can be a final or complete response to query 136, or response 144 can be a partial response to query 136 that effectuates a step in a multi-step response (e.g., performing a subtask in a multi-part task).
[0353] Response 144 can include data for output to a user interface for rendering for a user. Response 144 can include text, image, audio, or other data that an output interface can render for a user.
[0354] Response 144 can include data for communication to a user (e.g., to a device associated with an account, to another agent, to an external system). Response 144 can include instructions to control a receiving system. For instance, response 144 can include an application programming interface call generated based on output data structure 142. For instance, one or more parameters of the application programming interface call can be generated by a machine-learned model and provided to machine- learned agent system 100 in output data structure 142. The application programming interface call can be parsed and packaged into response 144. Response 144 can be output via an output interface to areceiving system (e.g., over a network, over a system bus, via a software queue or operating system communication pathway) to initiate an action of the receiving system according to the application programming interface call.
[0355] Output interfaces 146 can include various mechanisms and devices that enable the machine- learned agent system to communicate with users or other systems. These interfaces may consist of graphical user interfaces (GUIs), audio output devices, network connectivity devices, API libraries, and other components that can send data or render data relevant to the agent actions executed by the system. For example, GUIs can display responses to questions, results of a task, notifications about upcoming tasks, or alerts regarding any required user inputs. Audio outputs can provide auditory responses (e.g., spoken language responses) or alerts.
[0356] Output interface(s) 146 can include network interfaces that enable machine-learned agent system 100 to send data or commands to other systems. For instance, the output interfaces can facilitate the execution of API calls to external services.
[0357] In an example implementation, an input interface 134 and an output interface 146 can be associated with the same software application or device. For example, a virtual assistant or agent application can provide an interface for interacting with a virtual assistant. The interactions can be implemented in one or multiple interaction modalities, such as a textual chat, a voice conversation, a situated agent (e.g., implemented using a wearable device). Inputs (e.g., query 136) can represent one part of a dialog corresponding to a user and responses 144 can represent the other part of the dialog corresponding to the agent’s responses.
[0358] Figure 5 is a communication diagram illustrating an example sequence of operations and communication cycles according to example aspects of the present disclosure. Specifically, Figure 5 illustrates decoupling of the timing of memory extraction and consolidation. One or more interactions 500 can occur at a first time. During these interactions, machine-learned agent system 100 can interact with input interface(s) 134, output interface(s) 146, interaction memory datastore 116, and machine-learned model system(s) 110 as described herein. Subsequently, memory extraction cycle(s) 502 (e.g., as described above with respect to the operations of memory extractor 104) can execute to extract memory information from the interactions. This alternating pattern can repeat one or more times.
[0359] After some time, memory consolidation cycle(s) 504 (e.g., as described above with respect to operations of memory consolidator 120) can execute to consolidate the memories extracted during the one or more memory extraction cycle(s) 502. Memory consolidation cycle(s) 504 can execute based on an accumulation of one or more new memory objects over time. Memory consolidation cycle(s) 504 can execute based on a termination of a session.
[0360] Memory consolidation cycle(s) 504 can execute when maintaining a full interaction record becomes impractical. For instance, while a full interaction record may be maintained in memory, it may befeasible to input the entire record to a machine-learned model to condition future generations. However, over time the data footprint for such an interaction record could become larger than an available memory or storage allocation. Memory consolidation cycle(s) 504 could execute to consolidate extracted memories into a compact memory representation 506 that can be passed forward for use to condition future interactions.
[0361] For instance, interaction(s) 508 may not have access to the full interaction record of all of interactions 500. However, memory representation(s) 506 may be used to condition machine-learned model inference operations with salient extracted memories with a much more compact data footprint (e.g., in storage, in memory, in a model context window, etc.). In this manner, for instance, machine-learned agent system 100 can generate responses in interaction(s) 508 in view of the most relevant memories from interaction(s) 500.
[0362] During or after interaction(s) 508, memory extraction cycle(s) 502 can execute to extract any new memories from interaction(s) 508. This alternating pattern can repeat one or more times.
[0363] After some time, memory consolidation cycle(s) 504 can execute to consolidate the memories extracted during the one or more memory extraction cycle(s) 502 for interaction (s) 508. Memory consolidation cycle(s) 504 can generate updated memory representation(s) 510 for use in future interactions.
[0364] In some implementations, memory extraction is coupled with memory consolidation. For instance, in some implementations, memory extraction and memory consolidation are executed in consecutive sequence. For example, all memories from a preceding session can be extracted and then the extracted memories can be consolidated with prior memory data.
[0365] In some implementations, memory extraction can be implemented in an online fashion in parallel with or in sequence with interactions. Memory consolidation can occur in an offline fashion. For example, memory extraction can scale with a current size of an interaction record since last extraction is performed. For instance, in some implementations, extraction can be performed only over interaction record portions over which extraction has not been performed. In this manner, for instance, executing extraction multiple times over a session can reduce a scope over which extraction is performed, thereby reducing a size of a model input to a model performing extraction. This can allow for such models to be executed with less resource utilization (e.g., less memory, fewer FLOPs, etc.). For example, for models implementing quadratic attention, computational cost can be quadratic with input sequence size.
[0366] In some implementations, memory consolidation scales with a size of interaction memory datastore (or a size of a top-K set of objects retrieved therefrom). In some implementations, memory consolidation benefits from a more powerful model (e.g., larger context window, larger number of parameters to encode skills for reasoning across diverse data, etc.) that may be more computationallyexpensive or slower to execute. In some implementations, memory consolidation can be executed offline when immediate low-latency responsiveness is not needed.
[0367] In some implementations, memory extraction can be performed using a same model as used to generate agent responses to user messages or other queries. This can unlock efficiencies of performing memory extraction in an online manner while latent values for the model are still cached for the generation of the agent responses.
[0368] For example, an attention-based model may compute attention value across an input sequence to generate outputs expected to follow from the input sequence (e.g ., see machine-learned model 4 below). These attention values may be carried forward and re-used for future generations that include the same input sequence. To the extent that memory extraction and agent response generation share the input sequence, the attention values may be cached and used for both the generation of agent responses as well as the generation of extracted memory information.
[0369] Figure 6 is a block diagram of an example implementation in which latent values may be cached for use in both interaction outputs and memory extraction. A current state of an interaction record 602 can include one or more messages capturing interactions between a user and the agent system (e.g., a user message 604-1 , a responsive agent message 604-2, a user message 604-3). Machine-learned model system(s) 110 can input the current state of the interaction record 602 as input to a machine-learned model 606 that computes latent values for the current state of the interaction record 608. Based on the latent values for the current state of the interaction record 608, machine-learned model 606 can generate an interaction output 610. The interaction output 610 can be used by machine-learned agent system 100 to generate an agent message 604-4a responsive to the user message. Because it may be of interest to extract memories from the same portion of the current state of the interaction record, the same latent values for the current state of the interaction record 608 may be used by machine-learned model 606 to compute a memory extraction output 612. Machine-learned agent system 100 can use memory extraction output 612 to generate extracted memory 604-4b. For instance, instructions for extracting memories from the current state of the interaction record can be appended to a sequence of the current state of the interaction record so that the added instructions do not disturb the left-to-right language modeling attention used to compute the attention values, enabling their re-use for computing memory extraction output 612.
[0370] In some implementations, interaction output 610 can be computed in parallel with memory extraction output 612 (e.g., along a batch dimension of machine-learned model 606). This can leverage batch dimension efficiencies to avoid additional forward passes through machine-learned model 606 over the same context (e.g., the current state of the interaction record) while generating memory extraction output 612.
[0371] In some implementations, when computed in sequence with interaction outputs, memory cycles (e.g., extraction or consolidation) can be computed while awaiting completion of a user input to obscure a latency or system utilization impacted by performance of the cycle(s).
[0372] Figure 7 is a block diagram of an example implementation in which latent values may be cached for use in both interaction outputs and memory extraction. Here, the current state of the interaction record includes the agent message 604-4a and a memory extraction message 704-4b (e.g., that records the extraction of extracted memory 604-4b) at time T4. Memory extraction message 704-4b can be hidden or not rendered via a user interface. Memory extraction message 704-4b can be included in the current state of the interaction record to indicate memory extraction checkpoints. Memory extraction message 704-4b can be included in the current state of the interaction record to indicate to a machine-learned model what information has already been extracted, so as to avoid duplicate extraction. For example, at T5 a user can provide user message 704-5. Machine-learned model system(s) 110 can execute machine-learned model 606 to generate latent values for the current state of the interaction record 708. Based on the new latent values, machine-learned model 606 can generate interaction output 710 and memory extraction output 712, which machine-learned agent system 100 can use respectively to generate agent message 704-6a and 704-6b.
[0373] In an example, memory extraction messages can operate tool calls for causing machine-learned agent system 100 to active memory extractor to store a memory. For instance, machine-learned model 606 can be configured (e.g., trained or prompted) to, responsive to a triggering condition, process a query asking what information from the interaction record should be remembered. For instance, the query can be injected into a position at the end of a current state of the interaction record (e.g., on a batch dimension). Machine-learned model 606 can generate a response to this memory query and respond with a tool call that invokes memory extractor to store an extracted memory object.
[0374] Here is a toy example transcript to illustrate the operation:
[0375] User: Recommend me a movie to watch.
[0376] Agent: Sure, happy to help on that. What’s the genre you like? Are you looking for something thought-provoking or something to just relax and enjoy?
[0377] User: I like thought-provoking ones, please recommend me some recent ones.
[0378] "'tool_code
[0379] user_memory.remember(user_fact=‘The user likes thought-provoking movies’])
[0380] user_memory. recall])
[0381]
[0382] # User facts
[0383] ‘The user likes thought-provoking movies’
[0384] Agent: Some recent films that might fit the bill include "Everything Everywhere All at Once, " "Parasite, " "The Father, " and "The Power of the Dog. " Each of these films tackle complex topics in a unique and thought-provoking way
[0385] User: Great, do you have some book recommendations?
[0386] "'tool_code
[0387] user_memory.recall()
[0388]
[0389] # User facts
[0390] ‘The user likes thought-provoking movies’
[0391] Agent: If you like stories that are thought-provoking, I recommend reading "The Sense of anEnding" by Julian Barnes. It explores the topic of memory and how our understanding of the past is often limited and unreliable. Let me know if you would like to try something different though?
[0392] User: That looks great, let me take a look.
[0393] As illustrated in the above example, memory calls can be generated by a machine-learned model based on preceding context and, based on the performed memory operations, agent responses can be generated, all in the same sequence continuation.
[0394] This alignment between extraction and agentic interactions can introduce efficiencies that can improve operation of memory systems on edge devices. For instance, by using the same model for extraction and agentic interaction, and further by leveraging a common cache of generated attention values, lower-resourced computational systems can perform both agentic interaction and memory extraction. In this manner, for instance, access to and processing of raw interaction records can be performed locally, or primarily locally, with a decreased amount of transmission of full context interaction data to other systems for in depth computations.
[0395] Figure 8 depicts a flowchart of a method 800 for training one or more machine-learned models according to aspects of the present disclosure. For instance, an example machine-learned model can include any one or more models of machine-learned model system 110 used for any operation of any component of machine-learned agent system 100.
[0396] One or more portion(s) of example method 800 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 800 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 800 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. Figure 8 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted,rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 8 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 800 can be performed additionally, or alternatively, by other systems.
[0397] At 802, example method 800 can include obtaining a training instance. A set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or testing dataset). A training instance can be labeled or unlabeled. Although referred to in example method 800 as a “training” instance, it is to be understood that runtime inferences can form training instances when a model is trained using an evaluation of the model's performance on that runtime instance (e.g., online training / learning). Example data types for the training instance and various tasks associated therewith are described throughout the present disclosure.
[0398] At 804, example method 800 can include processing, using one or more machine-learned models, the training instance to generate an output. The output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models.
[0399] At 806, example method 800 can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function. Various determinations of loss can be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, contrastive loss, or various other loss functions. The evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g , semi- or self-supervised learning), or without labels (e.g., unsupervised learning). The evaluation signal can be a reward (e.g., for reinforcement learning). The reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received. The reward can be computed using feedback data describing human feedback on the output(s).
[0400] At 808, example method 800 can include updating the machine-learned model using the evaluation signal. For example, values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation. For example, the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)). For example, system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. Example method 800 can include implementing anumber of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
[0401] In some implementations, example method 800 can be implemented for training a machine- learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).
[0402] In some implementations, example method 800 can be implemented for particular stages of a training procedure. For instance, in some implementations, example method 800 can be implemented for pre-training a machine-learned model. Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks / data types.
[0403] In some implementations, example method 800 can be implemented for fine-tuning a machine- learned model Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data. Fine-tuning can affect all or a portion of the parameters of a machine-learned model.For example, various portions of the machine-learned model can be “frozen” for certain training stages. For example, parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)). In some implementations, example method 800 uses adapter modules. Adapters can be small trainable layers that are inserted between pre-existing layers of a pre-trained model. During the fine-tuning process, the original parameters of the pre-trained model are typically frozen, and only the parameters of the adapters are updated.
[0404] In some implementations, example method 800 can be implemented to execute parameterefficient fine-tuning methods, such as Low Rank Adaptation (LoRA). LoRA can refine pre-trained models with minimal adjustments to the original parameters. This can be achieved by introducing trainable low-rank matrices that modify the behavior of the pre-trained weights without directly altering them. In some implementations, during fine-tuning, only these auxiliary matrices are updated, which significantly reduces the number of parameters that are trained.
[0405] An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use.
[0406] Figure 9 is a block diagram of an example processing flow for using machine-learned model(s) 1 to process input(s) 2 to generate output(s) 3.
[0407] Machine-learned model(s) 1 can be or include one or multiple machine-learned models or model components. Example machine-learned models can include neural networks (e.g., deep neural networks). Example machine-learned models can include non-linear models or linear models. Example machine- learned models can use other architectures in lieu of or in addition to neural networks. Example machine- learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc.
[0408] Machine-learned model(s) 1 can be or include, or otherwise be representative of any one or more of the machine-learned models described above with respect to the preceding figures. For example, machine-learned model(s) 1 can be or include, or otherwise be representative of any one or more models of machine-learned model system 110 used for any operation of any component of machine-learned agent system 100. Although various features, variations, and implementations described below are described with respect to machine-learned model(s) 1 , it is to be understood that such features, variations, and implementations are to be understood as described with respect to each of any one or more models of machine-learned model system 110 used for any operation of any component of machine-learned agent system 100, or any other machine-learned component described herein.
[0409] Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks. Example neural networks can be deep neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention or cross-attention. For example, some example machine-learned models can include multi-headed self-attention models, multi-query self-attention models, or other attention mechanisms.
[0410] Machine-learned model(s) 1 can include a single or multiple instances of the same model configured to operate on data from input(s) 2. Machine-learned model(s) 1 can include multiple different models or multiple different model portions configured to operate on data from input(s) 2.
[0411] Machine-learned model(s) 1 can include an ensemble of different models that can cooperatively interact to process data from input(s) 2. For example, a model ensemble can include multiple models that have different attributes (e.g., different architectures, trained with different recipes, etc.). The ensemble can output an overall output based on the individual outputs of the constituent models. In this manner, for instance, the diverse constituent models can work together to provide system-level robustness by effectively aggregating over individual strengths and weaknesses of any given model. The respective individual outputs can be combined in a weighted combination, using a voting or routing mechanism, or a learned output layer (e.g., one or more feedforward or fully-connected layers).
[0412] Machine-learned model(s) 1 can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, ARXIV:2202.09368V2 (Oct. 14, 2022). For example, different portions of a model can learn (explicitly or implicitly) different expertise areas, with pathways through the model being selected by a learned routing mechanism that engages the appropriate expert for a given input (e.g., a given portion of an input, such as on a per-token basis). For example, a feedforward network can be sparsely activated for a given portion of an input based on an output of a routing mechanism that processes the portion of the input. In this manner, for instance, the group of activated weights can form an “expert” that is selected by the router. On each forward pass, only a subset of the total model weights maybe engaged, thereby decreasing a quantity of operations performed for processing a given input compared to a densely activated model. In this manner, for instance, the expressive and interpretive power of a high- parameter-count model can be achieved with more compute-efficient forward passes.
[0413] Input(s) 2 can generally include or otherwise represent various types of data. Input(s) 2 can include one type or many different types of data. Output(s) 3 can be data of the same type(s) or of different types of data as compared to input(s) 2. Output(s) 3 can include one type or many different types of data.
[0414] Example data types for input(s) 2 or output(s) 3 include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer’s central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), genetic data or other chemical or biochemical data, image data, audio data, audiovisual data, haptic data, biometric data, medical data, financial data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like. Data can be raw or processed and can be in any format or schema.
[0415] In multimodal inputs 2 or outputs 3, example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and biometric data, sensor data and medical data, etc. It is to be understood that any combination of data types in an input 2 or an output 3 can be present.
[0416] An example input 2 can include one or multiple data types, such as the example data types noted above. An example output 3 can include one or multiple data types, such as the example data types noted above. The data type(s) of input 2 can be the same as or different from the data type(s) of output 3. It is to be understood that the example data types noted above are provided for illustrative purposes only. Data types contemplated within the scope of the present disclosure are not limited to those examples noted above.
[0417] Figure 10 is a block diagram of an example implementation of an example machine-learned model configured to process sequences of information. For instance, an example implementation of machine-learned model(s) 1 can include machine-learned sequence processing model(s) 4. An example system can pass input(s) 2 to sequence processing model(s) 4. Sequence processing model(s) 4 can include one or more machine-learned components. Sequence processing model(s) 4 can process the data from input(s) 2 to obtain an input sequence 5. Input sequence 5 can include one or more input elements 5- 1 , 5-2, . . . , 5-M, etc. obtained from input(s) 2. Sequence processing model 4 can process input sequence 5 using prediction layer(s) 6 to generate an output sequence 7. Output sequence 7 can include one ormore output elements 7-1 , 7-2 , . . . . 7-N, etc. generated based on input sequence 5. The system can generate output(s) 3 based on output sequence 7.
[0418] Sequence processing model(s) 4 can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information. For example, some example sequence processing models are referred to as language models and can leverage language-based understandings across one or multiple modalities of input information. Sequence processing model(s) 4 can include relatively large models (e.g., more parameters, computationally expensive, etc.), which may be referred to as "Large Language Models” or LLMs. Sequence processing model(s) 4 can include relatively small models (e.g., fewer parameters, computationally lightweight, etc.), which may be referred to as “Small Language Models” or SLMs. Example language models include, for instance, models described in Gemma: Open Models Based on Gemini Research and Technology, GOOGLE, https: / / arxiv.org / abs / 2403.08295; Gemma 2: Improving Open Language Models at a Practical Size, GOOGLE, https: / / arxiv.org / abs / 2408.00118; PaLI: A Jointly-Scaled Multilingual Language-Image Model, GOOGLE, https: / / arxiv.org / abs / 2209.06794.
[0419] Sequence processing model(s) 4 can process one or multiple types of data simultaneously. Variations of language models that can perform joint vision and language tasks may be referred to as “Vision-Language Models,” or VLMs. Example VLMs include models described in PaliGemma: A versatile 3B VLM for transfer, GOOGLE, https: / / arxiv.org / abs / 2407.07726; PaliGemma 2: A Family of Versatile VLMs for Transfer, GOOGLE, https: / / arxiv.org / abs / 2412.03555; Flamingo: a Visual Language Model for Few-Shot Learning, GOOGLE, https: / / arxiv.org / abs / 2204.14198.
[0420] Sequence processing model(s) 4 can be multimodal. Example multimodal sequence processing models include, for instance, models described in Gemini: A Family of Highly Capable Multimodal Models, GOOGLE, https: / / arxiv.org / abs / 2312.11805; Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, GOOGLE, https: / / arxiv.org / abs / 2403.05530.
[0421] Other example sequence processing models can operate to generate outputs or receive inputs in specific domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ARXIV:2010.11929v2 (Jun. 3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, ARXIV:2301 .11325v1 (Jan. 26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug. 26, 2021 ), by way of example.
[0422] In general, sequence processing model(s) 4 can obtain input sequence 5 using data from input(s) 2. For instance, input sequence 5 can include a representation of data from input(s) 2 in a format understood by sequence processing model(s) 4. One or more machine-learned components of sequence processing model(s) 4 can ingest the data from input(s) 2, parse the data into pieces compatible with theprocessing architectures of sequence processing model(s) 4 (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layer(s) 6 (e.g., via “embedding”).
[0423] Sequence processing model(s) 4 can ingest the data from input(s) 2 and parse the data into a sequence of elements to obtain input sequence 5. For example, a portion of input data from input(s) 2 can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence.
[0424] Elements 5-1 , 5-2, . . . , 5-M can represent, in some cases, building blocks for capturing or expressing meaningful information in a particular data domain. For instance, the elements can describe “atomic units” across one or more domains. For example, for textual input source(s), the elements can correspond to groups of one or more words or sub-word components, such as sets of one or more characters.
[0425] For example, elements 5-1 , 5-2, . . . , 5-M can represent tokens obtained using a tokenizer. For instance, a tokenizer can process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements 5-1 , 5-2, . . . . 5-M) that represent the portion of the input source. Various approaches to tokenization can be used. For instance, textual input source(s) can be tokenized using a byte-pair encoding (BPE) technique. See, e.g., Kudo et al., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, PROCEEDINGS OF THE 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (System Demonstrations), pages 66-71 (October 31 -November 4, 2018), https: / / aclanthology.org / D18-2012.pdf. Image-based input source(s) can be tokenized by extracting and serializing patches from an image.
[0426] In general, arbitrary data types can be serialized and processed into input sequence 5. It is to be understood that element(s) 5-1 , 5-2, . . . . 5-M depicted in Figure 10 can be the tokens or can be the embedded representations thereof.
[0427] Prediction layer(s) 6 can predict one or more output elements 7-1 , 7-2, . . . , 7-N based on the input elements. Prediction layer(s) 6 can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the input(s) to extract higher- order meaning from, and relationships between, input element(s) 5-1 , 5-2, . . . . 5-M. In this manner, for instance, example prediction layer(s) 6 can predict new output element(s) in view of the context provided by input sequence 5.
[0428] Prediction layer(s) 6 can evaluate associations between portions of input sequence 5 and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter’s toolbox was small and heavy. It was full of .” Example prediction layer(s) 6 can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings. Example prediction layer(s) 6 can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations,prediction layer(s) 6 can, for instance, assign a higher probability to the word “nails’’ than to the word “sawdust.”
[0429] A transformer is an example architecture that can be used in prediction layer(s) 6. See, e.g., Vaswani et al., Attention Is All You Need, ARXIV: 1706.03762V7 (Aug. 2, 2023). A transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window. The context window can include a sequence that contains input sequence 5 and potentially one or more output element(s) 7-1, 7-2, . . . , 7-N. A transformer block can include one or more attention layer(s) and one or more post-attention layer(s) (e.g., feedforward layer(s), such as a multi-layer perceptron).
[0430] Prediction layer(s) 6 can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. For example, recurrent neural networks (RNNs) and long shortterm memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs). In general, prediction layer(s) 6 can leverage various kinds of artificial neural networks that can understand or generate sequences of information.
[0431] Output sequence 7 can include or otherwise represent the same or different data types as input sequence 5. For instance, input sequence 5 can represent textual data, and output sequence 7 can represent textual data. Input sequence 5 can represent image, audio, or audiovisual data, and output sequence 7 can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layer(s) 6, and any other interstitial model components of sequence processing model(s) 4, can be configured to receive a variety of data types in input sequence(s) 5 and output a variety of data types in output sequence(s) 7.
[0432] Output sequence 7 can have various relationships to input sequence 5. Output sequence 7 can be a continuation of input sequence 5. Output sequence 7 can be complementary to input sequence 5. Output sequence 7 can translate, transform, augment, or otherwise modify input sequence 5. Output sequence 7 can answer, evaluate, confirm, or otherwise respond to input sequence 5. Output sequence 7 can implement (or describe instructions for implementing) an instruction provided via input sequence 5.
[0433] Output sequence 7 can be generated autoreg ressively. For instance, for some applications, an output of one or more prediction layer(s) 6 can be passed through one or more output layers (e.g., softmax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. In this manner, for instance, output sequence 7 can be autoregressively generated by sampling a likely next output element, adding that element to the context window, and re-generating the probability distribution based on the updated context window, and sampling a likely next output element, and so forth.
[0434] Output sequence 7 can also be generated non-autoregressively. For instance, multiple output elements of output sequence 7 can be predicted together without explicit sequential conditioning on each other. See, e.g., Saharia et al., Non-Autoregressive Machine Translation with Latent Alignments, ARXIV:2004.07437 v3 (Nov. 16, 2020).
[0435] Output sequence 7 can include one or multiple portions or elements. In an example content generation configuration, output sequence 7 can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.). In an example classification configuration, output sequence 7 can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified. For instance, a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image.
[0436] Figure 11 is a block diagram of an example technique for populating an example input sequence 8. Input sequence 8 can include various functional elements that form part of the model infrastructure, such as an element 8-0 obtained from a task indicator 9 that signals to any model(s) that process input sequence 8 that a particular task is being performed (e.g., to help adapt a performance of the model(s) to that particular task). Input sequence 8 can include various data elements from different data modalities. For instance, an input modality 10-1 can include one modality of data. A data-to-sequence model 11-1 can process data from input modality 10-1 to project the data into a format compatible with input sequence 8 (e.g., one or more vectors dimensioned according to the dimensions of input sequence 8) to obtain elements 8-1 , 8-2, 8-3. Another input modality 10-2 can include a different modality of data. A data-to- sequence model 11-2 can project data from input modality 10-2 into a format compatible with input sequence 8 to obtain elements 8-4, 8-5, 8-6. Another input modality 10-3 can include yet another different modality of data. A data-to-sequence model 11-3 can project data from input modality 10-3 into a format compatible with input sequence 8 to obtain elements 8-7, 8-8, 8-9.
[0437] Input sequence 8 can be the same as or different from input sequence 5. Input sequence 8 can be a multimodal input sequence that contains elements that represent data from different modalities using a common dimensional representation. For instance, an embedding space can have P dimensions. Input sequence 8 can be configured to contain a plurality of elements that have P dimensions. In this manner, for instance, example implementations can facilitate information extraction and reasoning across diverse data modalities by projecting data into elements in the same embedding space for comparison, combination, or other computations therebetween.
[0438] For example, elements 8-0, . . . . 8-9 can indicate particular locations within a multidimensional embedding space. Some elements can map to a set of discrete locations in the embedding space. For instance, elements that correspond to discrete members of a predetermined vocabulary of tokens can mapto discrete locations in the embedding space that are associated with those tokens. Other elements can be continuously distributed across the embedding space. For instance, some data types can be broken down into continuously defined portions (e.g., image patches) that can be described using continuously distributed locations within the embedding space.
[0439] In some implementations, the expressive power of the embedding space may not be limited to meanings associated with any particular set of tokens or other building blocks. For example, a continuous embedding space can encode a spectrum of high-order information. An individual piece of information (e.g., a token) can map to a particular point in that space: for instance, a token for the word “dog” can be projected to an embedded value that points to a particular location in the embedding space associated with canine-related information. Similarly, an image patch of an image of a dog on grass can also be projected into the embedding space. In some implementations, the projection of the image of the dog can be similar to the projection of the word “dog” while also having similarity to a projection of the word “grass,” while potentially being different from both. In some implementations, the projection of the image patch may not exactly align with any single projection of a single word. In some implementations, the projection of the image patch can align with a combination of the projections of the words “dog” and “grass.” In this manner, for instance, a high-order embedding space can encode information that can be independent of data modalities in which the information is expressed.
[0440] Task indicator 9 can include a model or model component configured to identify a task being performed and inject, into input sequence 8, an input value represented by element 8-0 that signals which task is being performed. For instance, the input value can be provided as a data type associated with an input modality and projected along with that input modality (e.g., the input value can be a textual task label that is embedded along with other textual data in the input; the input value can be a pixel-based representation of a task that is embedded along with other image data in the input; etc.). The input value can be provided as a data type that differs from or is at least independent from other input(s). For instance, the input value represented by element 8-0 can be learned within a continuous embedding space.
[0441] Input modalities 10-1 , 10-2, and 10-3 can be associated with various different data types (e.g., as described above with respect to input(s) 2 and output(s) 3).
[0442] Data-to-sequence models 11-1 , 11-2, and 11-3 can be the same or different from each other. Data-to-sequence models 11-1 , 11-2, and 11-3 can be adapted to each respective input modality 10-1 , 10- 2, and 10-3. For example, a textual data-to-sequence model can subdivide a portion of input text and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-1, 8-2, 8-3, etc.). An image data-to-sequence model can subdivide an input image and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-4, 8-5, 8-6, etc.). An arbitrary data type data-to-sequence model can subdivide an input of that arbitrary data type and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-7, 8-8, 8-9, etc.).
[0443] Data-to-sequence models 11-1 , 11-2, and 11-3 can form part of machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1 , 11-2, and 11-3 can be jointly trained with or trained independently from machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11- 2, and 11-3 can be trained end-to-end with machine-learned sequence processing model(s) 4.
[0444] Figure 12 is a block diagram of an example model development platform 12 that can facilitate creation, adaptation, and refinement of example machine-learned models (e.g., machine-learned model(s) 1 , sequence processing model(s) 4, etc.). Model development platform 12 can provide a number of different toolkits that developer systems can employ in the development of new or adapted machine- learned models.
[0445] Model development platform 12 can provide one or more model libraries 13 containing building blocks for new models. Model libraries 13 can include one or more pre-trained foundational models 13-1 , which can provide a backbone of processing power across various tasks. Model libraries 13 can include one or more pre-trained expert models 13-2, which can be focused on performance in particular domains of expertise. Model libraries 13 can include various model primitives 13-3, which can provide low-level architectures or components (optionally pre-trained), which can be assembled in various arrangements as desired. Model primitives 13-3 can include a library of pre-trained adapters or LoRA modules that can adapt a baseline foundational model to align its outputs with a desired performance profile, augment model capabilities (e.g., to adapt to a different input modality, etc.), and the like.
[0446] Model development platform 12 can receive selections of various model components 14. Model development platform 12 can pass selected model components 14 to a workbench 15 that combines selected model components 14 into a development model 16.
[0447] Workbench 15 can facilitate further refinement and adaptation of development model 16 by leveraging a number of different toolkits integrated with model development platform 12. For example, workbench 15 can facilitate alignment of the development model 16 with a desired performance profile on various tasks using a model alignment toolkit 17.
[0448] Model alignment toolkit 17 can provide a number of tools for causing development model 16 to generate outputs aligned with desired behavioral characteristics. Alignment can include increasing the accuracy, precision, recall, etc. of model outputs. Alignment can include enforcing output styles, schema, or other preferential characteristics of model outputs. Alignment can be general or domain-specific. For instance, a pre-trained foundational model 13-1 can begin with an initial level of performance across multiple domains. Alignment of the pre-trained foundational model 13-1 can include improving a performance in a particular domain of information or tasks (e.g., even at the expense of performance in another domain of information or tasks).
[0449] Model alignment toolkit 17 can integrate one or more dataset(s) 17-1 for aligning development model 16. Curated dataset(s) 17-1 can include labeled or unlabeled training data. Dataset(s) 17-1 can beobtained from public domain datasets. Dataset(s) 17-1 can be obtained from private datasets associated with one or more developer system(s) for the alignment of bespoke machine-learned model(s) customized for private use-cases.
[0450] Pre-training pipelines 17-2 can include a machine-learned model training workflow configured to update development model 16 over large-scale, potentially noisy datasets. For example, pre-training can leverage unsupervised learning techniques (e.g., de-noising, etc.) to process large numbers of training instances to update model parameters from an initialized state and achieve a desired baseline performance. Pre-training pipelines 17-2 can leverage unlabeled datasets in dataset(s) 17-1 to perform pretraining. Workbench 15 can implement a pre-training pipeline 17-2 to pre-train development model 16.
[0451] Fine-tuning pipelines 17-3 can include a machine-learned model training workflow configured to refine the model parameters of development model 16 with higher-quality data. Fine-tuning pipelines 17-3 can update development model 16 by conducting supervised training with labeled dataset(s) in dataset(s) 17-1. Fine-tuning pipelines 17-3 can update development model 16 by conducting reinforcement learning using reward signals from user feedback signals. Workbench 15 can implement a fine-tuning pipeline 17-3 to fine-tune development model 16.
[0452] Prompt libraries 17-4 can include sets of inputs configured to induce behavior aligned with desired performance criteria. Prompt libraries 17-4 can include few-shot prompts (e.g., inputs providing examples of desired model outputs for prepending to a desired runtime query), chain-of-thought prompts (e.g., inputs providing step-by-step reasoning within the exemplars to facilitate thorough reasoning by the model), and the like.
[0453] Example prompts can be retrieved from an available repository of prompt libraries 17-4. Example prompts can be contributed by one or more developer systems using workbench 15.
[0454] In some implementations, pre-trained or fine-tuned models can achieve satisfactory performance without exemplars in the inputs. For instance, zero-shot prompts can include inputs that lack exemplars. Zero-shot prompts can be within a domain within a training dataset or outside of the training domain(s).
[0455] Prompt libraries 17-4 can include one or more prompt engineering tools. Prompt engineering tools can provide workflows for retrieving or learning optimized prompt values. Prompt engineering tools can facilitate directly learning prompt values (e.g., input element values) based on one or more training iterations. Workbench 15 can implement prompt engineering tools in development model 16.
[0456] Prompt libraries 17-4 can include pipelines for prompt generation. For example, inputs can be generated using development model 16 itself or other machine-learned models. In this manner, for instance, a first model can process information about a task and output an input for a second model to process in order to perform a step of the task. The second model can be the same as or different from the first model. Workbench 15 can implement prompt generation pipelines in development model 16.
[0457] Prompt libraries 17-4 can include pipelines for context injection. For instance, a performance of development model 16 on a particular task can improve if provided with additional context for performing the task. Prompt libraries 17-4 can include software components configured to identify desired context, retrieve the context from an external source (e.g., a database, a sensor, etc.), and add the context to the input prompt. Workbench 15 can implement context injection pipelines in development model 16.
[0458] Although various training examples described herein with respect to model development platform 12 refer to "pre-training” and "fine-tuning,” it is to be understood that model alignment toolkit 17 can generally support a wide variety of training techniques adapted for training a wide variety of machine- learned models. Example training techniques can correspond to the example training method 800 described above.
[0459] Model development platform 12 can include a model plugin toolkit 18. Model plugin toolkit 18 can include a variety of tools configured for augmenting the functionality of a machine-learned model by integrating the machine-learned model with other systems, devices, and software components. For instance, a machine-learned model can use tools to increase performance quality where appropriate. For instance, deterministic tasks can be offloaded to dedicated tools in lieu of probabilistically performing the task with an increased risk of error. For instance, instead of autoregressively predicting the solution to a system of equations, a machine-learned model can recognize a tool to call for obtaining the solution and pass the system of equations to the appropriate tool. The tool can be a traditional system of equations solver that can operate deterministically to resolve the system of equations. The output of the tool can be returned in response to the original query. In this manner, tool use can allow some example models to focus on the strengths of machine-learned models— e.g., understanding an intent in an unstructured request for a task— while augmenting the performance of the model by offloading certain tasks to a more focused tool for rote application of deterministic algorithms to a well-defined problem.
[0460] Model plugin toolkit 18 can include validation tools 18-1. Validation tools 18-1 can include tools that can parse and confirm output(s) of a machine-learned model. Validation tools 18-1 can include engineered heuristics that establish certain thresholds applied to model outputs. For example, validation tools 18-1 can ground the outputs of machine-learned models to structured data sources (e.g., to mitigate “hallucinations").
[0461] Model plugin toolkit 18 can include tooling packages 18-2 for implementing one or more tools that can include scripts or other executable code that can be executed alongside development model 16. Tooling packages 18-2 can include one or more inputs configured to cause machine-learned model(s) to implement the tools (e.g., few-shot prompts that induce a model to output tool calls in the proper syntax, etc.). Tooling packages 18-2 can include, for instance, fine-tuning training data for training a model to use a tool.
[0462] Model plugin toolkit 18 can include interfaces for calling external application programming interfaces (APIs) 18-3. For instance, in addition to or in lieu of implementing tool calls or tool code directly with development model 16, development model 16 can be aligned to output instructions that initiate API calls to send or obtain data via external systems.
[0463] Model plugin toolkit 18 can integrate with prompt libraries 17-4 to build a catalog of available tools for use with development model 16. For instance, a model can receive, in an input, a catalog of available tools, and the model can generate an output that selects a tool from the available tools and initiates a tool call for using the tool.
[0464] Model development platform 12 can include a computational optimization toolkit 19 for optimizing a computational performance of development model 16. For instance, tools for model compression 19-1 can allow development model 16 to be reduced in size while maintaining a desired level of performance. For instance, model compression 19-1 can include quantization workflows, weight pruning and sparsification techniques, etc. Tools for hardware acceleration 19-2 can facilitate the configuration of the model storage and execution formats to operate optimally on different hardware resources. For instance, hardware acceleration 19-2 can include tools for optimally sharding models for distributed processing over multiple processing units for increased bandwidth, lower unified memory requirements, etc. Tools for distillation 19-3 can provide for the training of lighter-weight models based on the knowledge encoded in development model 16. For instance, development model 16 can be a highly performant, large machine- learned model optimized using model development platform 12. To obtain a lightweight model for running in resource-constrained environments, a smaller model can be a "student model” that learns to imitate development model 16 as a "teacher model." In this manner, for instance, the investment in learning the parameters and configurations of development model 16 can be efficiently transferred to a smaller model for more efficient inference.
[0465] Workbench 15 can implement one, multiple, or none of the toolkits implemented in model development platform 12. Workbench 15 can output an output model 20 based on development model 16. Output model 20 can be a deployment version of development model 16. Output model 20 can be a development or training checkpoint of development model 16. Output model 20 can be a distilled, compressed, or otherwise optimized version of development model 16.
[0466] Figure 13 is a block diagram of an example training flow for training a machine-learned development model 16. One or more portion(s) of the example training flow can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of the example training flow can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of the example training flow can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. Figure 13 depicts elementsperformed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 13 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of the example training flow can be performed additionally, or alternatively, by other systems.
[0467] Initially, development model 16 can persist in an initial state as an initialized model 21 . Development model 16 can be initialized with weight values. Initial weight values can be random or based on an initialization schema. Initial weight values can be based on prior pre-training for the same or for a different model.
[0468] Initialized model 21 can undergo pre-training in a pre-training stage 22. Pre-training stage 22 can be implemented using one or more pre-training pipelines 17-2 over data from dataset(s) 17-1 . Pre-training can be omitted, for example, if initialized model 21 is already pre-trained (e.g., development model 16 contains, is, or is based on a pre-trained foundational model or an expert model).
[0469] Pre-trained model 23 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Pre-trained model 23 can be the initial state if development model 16 was already pre-trained. Pre-trained model 23 can undergo fine-tuning in a fine- tuning stage 24. Fine-tuning stage 24 can be implemented using one or more fine-tuning pipelines 17-3 over data from dataset(s) 17-1 . Fine-tuning can be omitted, for example, if a pre-trained model has satisfactory performance, if the model was already fine-tuned, or if other tuning approaches are preferred.
[0470] Fine-tuned model 25 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Fine-tuned model 25 can be the initial state if development model 16 was already fine-tuned. Fine-tuned model 25 can undergo refinement with user feedback 26. For instance, refinement with user feedback 26 can include reinforcement learning, optionally based on human feedback from human users of fine-tuned model 25. As reinforcement learning can be a form of fine-tuning, it is to be understood that fine-tuning stage 24 can subsume the stage for refining with user feedback 26. Refinement with user feedback 26 can produce a refined model 27. Refined model 27 can be output to downstream system(s) 28 for deployment or further development.
[0471] In some implementations, computational optimization operations can be applied before, during, or after each stage. For instance, initialized model 21 can undergo computational optimization 29-1 (e.g., using computational optimization toolkit 19) before pre-training stage 22. Pre-trained model 23 can undergo computational optimization 29-2 (e.g., using computational optimization toolkit 19) before fine-tuning stage 24. Fine-tuned model 25 can undergo computational optimization 29-3 (e.g., using computational optimization toolkit 19) before refinement with user feedback 26. Refined model 27 can undergocomputational optimization 29-4 (e.g., using computational optimization toolkit 19) before output to downstream system(s) 28. Computational optimization(s) 29-1 , . . . , 29-4 can all be the same, all be different, or include at least some different optimization techniques.
[0472] Figure 14 is a block diagram of an inference system for operating one or more machine-learned model(s) 1 to perform inference (e.g., for training, for deployment, etc.). A model host 31 can receive machine-learned model(s) 1 . Model host 31 can host one or more model instance(s) 31-1 , which can be one or multiple instances of one or multiple models. Model host 31 can host model instance(s) 31-1 using available compute resources 31-2 associated with model host 31 .
[0473] Model host 31 can perform inference on behalf of one or more client(s) 32. Client(s) 32 can transmit an input request 33 to model host 31 . Using input request 33, model host 31 can obtain input(s) 2 for input to machine-learned model(s) 1 . Machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3. Using output(s) 3, model host 31 can return an output payload 34 for responding to input request 33 from client(s) 32. Output payload 34 can include or be based on output(s) 3.
[0474] Model host 31 can leverage various other resources and tools to augment the inference task. For instance, model host 31 can communicate with tool interfaces 35 to facilitate tool use by model instance(s) 31-1 . Tool interfaces 35 can include local or remote APIs. Tool interfaces 35 can include integrated scripts or other software functionality. Model host 31 can engage online learning interface(s) 36 to facilitate ongoing improvements to machine-learned model(s) 1 . For instance, online learning interface(s) 36 can be used within reinforcement learning loops to retrieve user feedback on inferences served by model host 31 . Model host 31 can access runtime data source(s) 37 for augmenting input(s) 2 with additional contextual information. For instance, runtime data source(s) 37 can include a knowledge graph 37-1 that facilitates structured information retrieval for information associated with input request(s) 33 (e.g., a search engine service). Runtime data source(s) 37 can include public or private, external or local database(s) 37-2 that can store information associated with input request(s) 33 for augmenting input(s) 2. Runtime data source(s) 37 can include account data 37-3 which can be retrieved in association with a user account corresponding to a client 32 for customizing the behavior of model host 31 accordingly.
[0475] Model host 31 can be implemented by one or multiple computing devices or systems. Client(s) 2 can be implemented by one or multiple computing devices or systems, which can include computing devices or systems shared with model host 31 .
[0476] For example, model host 31 can operate on a server system that provides a machine-learning service to client device(s) that operate client(s) 32 (e.g., over a local or wide-area network). Client device(s) can be end-user devices used by individuals. Client device(s) can be server systems that operate client(s) 32 to provide various functionality as a service to downstream end-user devices.
[0477] In some implementations, model host 31 can operate on the same device or system as client(s) 32. Model host 31 can be a machine-learning service that runs on-device to provide machine-learningfunctionality to one or multiple applications operating on a client device, which can include an application implementing client(s) 32 Model host 31 can be a part of the same application as client(s) 32. For instance, model host 31 can be a subroutine or method implemented by one part of an application, and client(s) 32 can be another subroutine or method that engages model host 31 to perform inference functions within the application. It is to be understood that model host 31 and client(s) 32 can have various different configurations.
[0478] Model instance(s) 31-1 can include one or more machine-learned models that are available for performing inference. Model instance(s) 31-1 can include weights or other model components that are stored in persistent storage, temporarily cached, or loaded into high-speed memory. Model instance(s) 31- 1 can include multiple instance(s) of the same model (e.g., for parallel execution of more requests on the same model). Model instance(s) 31-1 can include instance(s) of different model(s). Model instance(s) 31-1 can include cached intermediate states of active or inactive model(s) used to accelerate inference of those models. For instance, an inference session with a particular model may generate significant amounts of computational results that can be re-used for future inference runs (e.g., using a KV cache for transformerbased models). These computational results can be saved in association with that inference session so that session can be executed more efficiently when resumed.
[0479] Compute resource(s) 31-2 can include one or more processors (central processing units, graphical processing units, tensor processing units, machine-learning accelerators, etc.) connected to one or more memory devices. Compute resource(s) 31-2 can include a dynamic pool of available resources shared with other processes. Compute resource(s) 31-2 can include memory devices large enough to fit an entire model instance in a single memory instance. Compute resource(s) 31-2 can also shard model instance(s) across multiple memory devices (e.g., using data parallelization or tensor parallelization, etc.). This can be done to increase parallelization or to execute a large model using multiple memory devices which individually might not be able to fit the entire model into memory.
[0480] Input request 33 can include data for input(s) 2. Model host 31 can process input request 33 to obtain input(s) 2. Input(s) 2 can be obtained directly from input request 33 or can be retrieved using input request 33. Input request 33 can be submitted to model host 31 via an API.
[0481] Model host 31 can perform inference over batches of input requests 33 in parallel. For instance, a model instance 31-1 can be configured with an input structure that has a batch dimension. Separate input(s) 2 can be distributed across the batch dimension (e.g., rows of an array). The separate input(s) 2 can include completely different contexts. The separate input(s) 2 can be multiple inference steps of the same task. The separate input(s) 2 can be staggered in an input structure, such that any given inference cycle can be operating on different portions of the respective input(s) 2. In this manner, for instance, model host 31 can perform inference on the batch in parallel, such that output(s) 3 can also contain the batch dimension and return the inference results for the batched input(s) 2 in parallel. In this manner, forinstance, batches of input request(s) 33 can be processed in parallel for higher throughput of output payload(s) 34.
[0482] Output payload 34 can include or be based on output(s) 3 from machine-learned model(s) 1 . Model host 31 can process output(s) 3 to obtain output payload 34. This can include chaining multiple rounds of inference (e.g., iteratively, recursively, across the same model(s) or different model(s)) to arrive at a final output for a task to be returned in output payload 34. Output payload 34 can be transmitted to client(s) 32 via an API.
[0483] Online learning interface(s) 36 can facilitate reinforcement learning of machine-learned model(s) 1 . Online learning interface(s) 36 can facilitate reinforcement learning with human feedback (RLHF). Online learning interface(s) 36 can facilitate federated learning of machine-learned model(s) 1 .
[0484] Model host 31 can access a library of pre-trained adapters or LoRA modules that can adapt a baseline model to align its outputs with a desired performance profile, augment model capabilities (e.g., to adapt to a different input modality, etc.), and the like. For instance, model host 31 can receive an input request to load a customized model, and model host 31 can retrieve one or more components to adapt a baseline model to the custom profile. Model host 31 can determine that a particular functionality is needed for a particular task (e.g., based on an output of a model that preprocesses an input) and retrieve a pretrained component accordingly.
[0485] Model host 31 can execute machine-learned model(s) 1 to perform inference for various tasks using various types of data. For example, various different input(s) 2 and output(s) 3 can be used for various different tasks. In some implementations, input(s) 2 can be or otherwise represent image data. Machine-learned model(s) 1 can process the image data to generate an output. As an example, machine- learned model(s) 1 can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an image segmentation output. As another example, machine-learned model(s) 1 can process the image data to generate an image classification output. As another example, machine-learned model(s) 1 can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an encoded image data output (e.g., an encoded and / or compressed representation of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an upscaled image data output. As another example, machine-learned model(s) 1 can process the image data to generate a prediction output.
[0486] In some implementations, the task is a computer vision task. In some cases, input(s) 2 includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score correspondingto a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
[0487] In some implementations, input(s) 2 can be or otherwise represent natural language data. Machine-learned model(s) 1 can process the natural language data to generate an output. As an example, machine-learned model(s) 1 can process the natural language data to generate a language encoding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a latent text embedding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a translation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a classification output. As another example, machine- learned model(s) 1 can process the natural language data to generate a textual segmentation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a semantic intent output. As another example, machine-learned model(s) 1 can process the natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, machine-learned model(s) 1 can process the natural language data to generate a prediction output (e.g., one or more predicted next portions of natural language content).
[0488] In some implementations, input(s) 2 can be or otherwise represent speech data (e.g., data describing spoken natural language, such as audio data, textual data, etc.). Machine-learned model(s) 1 can process the speech data to generate an output. As an example, machine-learned model(s) 1 can process the speech data to generate a speech recognition output. As another example, machine-learned model(s) 1 can process the speech data to generate a speech translation output. As another example, machine-learned model(s) 1 can process the speech data to generate a latent embedding output. As another example, machine-learned model(s) 1 can process the speech data to generate an encoded speech output (e.g., an encoded and / or compressed representation of the speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate an upscaled speech output(e.g. , speech data that is higher quality than the input speech data, etc.). As another example, machine- learned model(s) 1 can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a prediction output.
[0489] In some implementations, input(s) 2 can be or otherwise represent latent encoding data (e.g., a latent space representation of an input, etc.). Machine-learned model(s) 1 can process the latent encoding data to generate an output. As an example, machine-learned model(s) 1 can process the latent encoding data to generate a recognition output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reconstruction output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a search output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reclustering output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a prediction output.
[0490] In some implementations, input(s) 2 can be or otherwise represent statistical data. Statistical data can be, represent, or otherwise include data computed and / or calculated from some other data source. Machine-learned model(s) 1 can process the statistical data to generate an output. As an example, machine-learned model(s) 1 can process the statistical data to generate a recognition output. As another example, machine-learned model(s) 1 can process the statistical data to generate a prediction output. As another example, machine-learned model(s) 1 can process the statistical data to generate a classification output. As another example, machine-learned model(s) 1 can process the statistical data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the statistical data to generate a visualization output. As another example, machine-learned model(s) 1 can process the statistical data to generate a diagnostic output.
[0491] In some implementations, input(s) 2 can be or otherwise represent sensor data. Machine-learned model(s) 1 can process the sensor data to generate an output. As an example, machine-learned model(s) 1 can process the sensor data to generate a recognition output. As another example, machine-learned model(s) 1 can process the sensor data to generate a prediction output. As another example, machine- learned model(s) 1 can process the sensor data to generate a classification output. As another example, machine-learned model(s) 1 can process the sensor data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the sensor data to generate a visualization output. As another example, machine-learned model(s) 1 can process the sensor data to generate a diagnostic output. As another example, machine-learned model(s) 1 can process the sensor data to generate a detection output.
[0492] In some implementations, machine-learned model(s) 1 can be configured to perform a task that includes encoding input data for reliable and / or efficient transmission or storage (and / or corresponding decoding). For example, the task may be an audio compression task. The input may include audio dataand the output may include compressed audio data. In another example, the input includes visual data (e.g. one or more images or videos), the output includes compressed visual data, and the task is a visual data compression task. In another example, the task may include generating an embedding for input data (e.g. input audio or visual data). In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may include a text output which is mapped to the spoken utterance. In some cases, the task includes encrypting or decrypting input data. In some cases, the task includes a microprocessor performance task, such as branch prediction or memory address translation.
[0493] In some implementations, the task is a generative task, and machine-learned model(s) 1 can be configured to output content generated in view of input(s) 2. For instance, input(s) 2 can be or otherwise represent data of one or more modalities that encodes context for generating additional content.
[0494] In some implementations, the task can be a text completion task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent textual data and to generate output(s) 3 that represent additional textual data that completes a textual sequence that includes input(s) 2. For instance, machine-learned model(s) 1 can be configured to generate output(s) 3 to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by input(s) 2.
[0495] In some implementations, the task can be an instruction-following task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent instructions to perform a function and to generate output(s) 3 that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions.
[0496] In some implementations, the task can be a question answering task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent a question to answer and to generate output(s) 3 that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure toperform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine- learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question.
[0497] In some implementations, the task can be an image generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of image content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent image data that depicts imagery related to the context. For instance, machine-learned model(s) 1 can be configured to generate pixel data of an image. Values for channel(s) associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context).
[0498] In some implementations, the task can be an audio generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of audio content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent audio data related to the context. For instance, machine- learned model(s) 1 can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channel(s) associated with pixels of the image can be selected based on the context. Machine-learned model(s) 1 can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context).
[0499] In some implementations, the task can be a data generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.). The desired data can be, for instance, synthetic data for training other machine-learned models. The context can include arbitrary data type(s). Machine-learned model(s) 1 can be configured to generateoutput(s) 3 that represent data that aligns with the desired data. For instance, machine-learned model(s) 1 can be configured to generate data values for populating a dataset. Values for the data object(s) can be selected based on the context (e.g., based on a probability determined based on the context).
[0500] Figure 15 is a block diagram of an example networked computing system that can perform aspects of example implementations of the present disclosure. The system can include a number of computing devices and systems that are communicatively coupled over a network 49. An example computing device 50 is described to provide an example of a computing device that can perform any aspect of the present disclosure (e.g., implementing model host 31 , client(s) 32, or both). An example server computing system 60 is described as an example of a server computing system that can perform any aspect of the present disclosure (e.g., implementing model host 31 , client(s) 32, or both). Computing device 50 and server computing system(s) 60 can cooperatively interact (e.g , over network 49) to perform any aspect of the present disclosure (e.g., implementing model host 31 , client(s) 32, or both). Model development platform system 70 is an example system that can host or serve model development platform(s) 12 for development of machine-learned models. Third-party system(s) 80 are example system(s) with which any of computing device 50, server computing system(s) 60, or model development platform system(s) 70 can interact in the performance of various aspects of the present disclosure (e.g., engaging third-party tools, accessing third-party databases or other resources, etc.).
[0501] Network 49 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over network 49 can be carried via any type of wired or wireless connection, using a wide variety of communication protocols (e.g., TCP / IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), or protection schemes (e.g., VPN, secure HTTP, SSL). Network 49 can also be implemented via a system bus. For instance, one or more devices or systems of Figure 15 can be co-located with, contained by, or otherwise integrated into one or more other devices or systems.
[0502] Computing device 50 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, a server computing device, a virtual machine operating on a host device, or any other type of computing device. Computing device 50 can be a client computing device. Computing device 50 can be an end-user computing device. Computing device 50 can be a computing device of a service provided that provides a service to an end user (who may use another computing device to interact with computing device 50).
[0503] Computing device 50 can include one or more processors 51 and a memory 52. Processor(s) 51 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 52 can include one or more non-transitory computer-readable storage media, such asHBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 52 can store data 53 and instructions 54 which can be executed by processor(s) 51 to cause computing device 50 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.
[0504] Computing device 50 can also include one or more input components that receive user input. For example, a user input component can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, camera, LIDAR, a physical keyboard or other buttons, or other means by which a user can provide user input.
[0505] Computing device 50 can store or include one or more machine-learned models 55. Machine- learned models 55 can include one or more machine-learned model(s) 1 , such as a sequence processing model 4. Machine-learned models 55 can include one or multiple model instance(s) 31-1. Machine-learned model(s) 55 can be received from server computing system(s) 60, model development platform system 70, third party system(s) 80 (e.g., an application distribution platform), or developed locally on computing device 50. Machine-learned model(s) 55 can be loaded into memory 52 and used or otherwise implemented by processor(s) 51 . Computing device 50 can implement multiple parallel instances of machine-learned model(s) 55.
[0506] Server computing system(s) 60 can include one or more processors 61 and a memory 62. Processor(s) 61 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 62 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 62 can store data 63 and instructions 64 which can be executed by processor(s) 61 to cause server computing system(s) 60 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.
[0507] In some implementations, server computing system 60 includes or is otherwise implemented by one or multiple server computing devices. In instances in which server computing system 60 includes multiple server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
[0508] Server computing system 60 can store or otherwise include one or more machine-learned models 65. Machine-learned model(s) 65 can be the same as or different from machine-learned model(s) 55. Machine-learned models 65 can include one or more machine-learned model(s) 1 , such as a sequenceprocessing model 4. Machine-learned models 65 can include one or multiple model instance(s) 31-1 . Machine-learned model(s) 65 can be received from computing device 50, model development platform system 70, third party system(s) 80, or developed locally on server computing system(s) 60. Machine- learned model(s) 65 can be loaded into memory 62 and used or otherwise implemented by processor(s) 61 . Server computing system(s) 60 can implement multiple parallel instances of machine-learned model(s) 65.
[0509] In an example configuration, machine-learned models 65 can be included in or otherwise stored and implemented by server computing system 60 to establish a client-server relationship with computing device 50 for serving model inferences. For instance, server computing system(s) 60 can implement model host 31 on behalf of client(s) 32 on computing device 50. For instance, machine-learned models 65 can be implemented by server computing system 60 as a portion of a web service (e.g., remote machine-learned model hosting service, such as an online interface for performing machine-learned model operations over a network on server computing system(s) 60). For instance, server computing system(s) 60 can communicate with computing device 50 over a local intranet or internet connection. For instance, computing device 50 can be a workstation or endpoint in communication with server computing system(s) 60, with implementation of machine-learned models 65 being managed by server computing system(s) 60 to remotely perform inference (e.g., for runtime or training operations), with output(s) returned (e.g., cast, streamed, etc.) to computing device 50. Machine-learned models 65 can work cooperatively or interoperatively with machine-learned models 55 on computing device 50 to perform various tasks.
[0510] Model development platform system(s) 70 can include one or more processors 71 and a memory 72. Processor(s) 71 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 72 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 72 can store data 73 and instructions 74 which can be executed by processor(s) 71 to cause model development platform system(s) 70 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to model development platform 12. This and other functionality can be implemented by developer tool(s) 75.
[0511] Third-party system(s) 80 can include one or more processors 81 and a memory 82. Processor(s) 81 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 82 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinationsthereof. Memory 82 can store data 83 and instructions 84 which can be executed by processor(s) 81 to cause third-party system(s) 80 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to tools and other external resources called when training or performing inference with machine-learned model(s) 1 , 4, 16, 20, 55, 65, etc. (e.g., third-party resource(s) 85).
[0512] Figure 15 illustrates one example arrangement of computing systems that can be used to implement the present disclosure. Other computing system configurations can be used as well. For example, in some implementations, one or both of computing system 50 or server computing system(s) 60 can implement all or a portion of the operations of model development platform system 70. For example, computing system 50 or server computing system(s) 60 can implement developer tool(s) 75 (or extensions thereof) to develop, update / train, or refine machine-learned models 1 , 4, 16, 20, 55, 65, etc. using one or more techniques described herein with respect to model alignment toolkit 17. In this manner, for instance, computing system 50 or server computing system(s) 60 can develop, update / train, or refine machine- learned models based on local datasets (e.g., for model personalization / customization, as permitted by user data preference selections).
[0513] Figure 16 is a block diagram of an example computing device 98 that performs according to example embodiments of the present disclosure. Computing device 98 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 98 can implement model host 31 . For instance, computing device 98 can include a number of applications (e.g., applications 1 through N). Each application can contain its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. As illustrated in Figure 16, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[0514] Figure 17 is a block diagram of an example computing device 99 that performs according to example embodiments of the present disclosure. Computing device 99 can be the same as or different from computing device 98. Computing device 99 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 98 can implement model host 31 . For instance, computing device 99 can include a number of applications (e.g., applications 1 through N). Each application can be in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, avirtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
[0515] The central intelligence layer can include a number of machine-learned models. For example, as illustrated in Figure 17, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of computing device 99.
[0516] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for computing device 99 As illustrated in Figure 17, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
[0517] Figure 18 depicts a flowchart of a method 1800 for implementing a machine-learned agent system (e.g., 100) according to aspects of the present disclosure.
[0518] One or more portion(s) of example method 1800 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 1800 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 1800 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models.
[0519] Figure 18 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 18 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting One or more portions of example method 1800 can be performed additionally, or alternatively, by other systems.
[0520] At 1802, example method 1800 includes generating a first input data structure describing one or more interactions of a machine-learned agent system. For example, the first input data structure 106 can include interaction data 108 that describes interaction record 102, which can contain data recording one or more interactions 102 of machine-learned agent system 100 with a current environment.
[0521] At 1804, example method 1800 includes generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure. In some implementations, generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure can include executing one or more first machine-learned model inference operations to generate one or more extracted memory objects based on the first input data structure. For example, executing one or more first machine- learned model inference operations can include engaging (e.g., via an application programming interface) machine-learned model system 110 to execute machine-learned model(s) 1 (e.g., sequence processing model 4) to generate one or more memory objects 114 based on interaction data 108. The one or more machine-learned model inference operations can be conditioned on data describing the interactions. The one or more first machine-learned model inference operations can be conditioned on instructions describing an analysis to be performed on interaction data 108. The one or more first machine-learned model inference operations can include generated, using a machine-learned model, an output. The output can be or be included in output data structure 112.
[0522] The output can be used to generate one or more memory objects 114. Memory extractor 104 can store the one or more memory objects 114 in interaction memory datastore 116. In some implementations, example method 1800 includes storing the one or more extracted memory objects in association with a session identifier corresponding to the one or more interactions of the machine-learned agent system.
[0523] At 1806, example method 1800 includes providing extracted memory data of the one or more extracted memory objects for processing using one or more second machine-learned model inference operations to generate an updated memory representation. For example, based on the provided extracted memory data, memory consolidator 120 can generate the second input data structure 124, which can include extracted memory data 126 describing one or more extracted memory objects 114, and can include prior memory data 128 describing one or more prior interactions 102 of machine-learned agent system 100. In some implementations, example method 1800 includes generating a second input data structure including extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine-learned agent system.
[0524] At 1808, example method 1800 includes generating a third input data structure based on the updated memory representation and a query. For example, machine-learned agent system 100 can receive, from input interface 134, a query 136. Machine-learned agent system 100 can generate an input data structure 138 that contains memory representation(s) 132 (or data derived therefrom) and query data 140.
[0525] At 1810, example method 1800 includes generating, based on one or more third machine- learned model inference operations, a response to the query based on the third input data structure. For example, machine-learned model system(s) 110 can include one or more machine-learned models (e.g.,machine-learned model(s) 1) that execute one or more inference operations based on input data structure(s) 138 to produce an output data structure(s) 142.
[0526] In some implementations, in alternative or in addition to 1808 and 1810, example method 1800 includes storing the updated memory representation. For example, memory consolidator 120 can store the updated memory representation in interaction datastore 116.
[0527] In example method 1800, the execution of the machine-learned models performing the inference operations can be performed locally to a device or system executing example method 1800 or remote from a device or system executing example method 1800. For example, the providing of extracted memory data can include providing the extracted memory data from one component executing on a first device to another component executing on the same device. For example, the providing of extracted memory data can include providing the extracted memory data from one component executing on a first device to another component executing on a different second device.
[0528] For instance, in some implementations, example method 1800 includes transmitting the extracted memory data to a machine-learned model system that executes the one or more second machine-learned model inference operations. For example, a computing device 50 can execute one or more portions of example method 1800 and interact with server computing system(s) 60 to cause or control execution of machine-learned models 65. In some implementations, example method 1800 includes receiving, from the machine-learned model system, output data describing the updated memory representation.
[0529] In some implementations, example method 1800 includes generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure. In some implementations, generating, based on one or more second machine- learned model inference operations, an updated memory representation based on the second input data structure can include executing one or more second machine-learned model inference operations to generate an updated memory representation based on the second input data structure. For example, executing one or more second machine-learned model inference operations can include engaging (e.g., via an application programming interface) machine-learned model system 110 to execute machine-learned model(s) 1 (e.g., sequence processing model 4) to generate output data structure 130 based on extracted memory data 126 and prior memory data 128. Memory consolidator 120 can generate an updated memory representation (e.g., memory representation(s) 132, memory representation object 302, 402, etc.).
[0530] Figure 19 depicts a flowchart of a method 1900 for implementing a machine-learned agent system (e.g., 100) according to aspects of the present disclosure.
[0531] One or more portion(s) of example method 1900 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 1900 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of examplemethod 1900 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models.
[0532] Figure 19 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 19 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 1900 can be performed additionally, or alternatively, by other systems.
[0533] At 1902, example method 1900 includes receiving one or more extracted memory objects describing one or more interactions of a machine-learned agent system. For example, one or more extracted memory objects can be received by memory consolidator 120 from memory extractor 104 or interaction memory datastore 116. One or more extracted memory objects can be received from the same or different device that is executing one or more aspects of example method 1900.
[0534] At 1904, example method 1900 includes generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine-learned agent system. For example, based on the provided extracted memory data, memory consolidator 120 can generate the second input data structure 124, which can include extracted memory data 126 describing one or more extracted memory objects 114, and can include prior memory data 128 describing one or more prior interactions 102 of machine-learned agent system 100.
[0535] At 1906, example method 1900 includes generating, based on one or more second machine- learned model inference operations, an updated memory representation based on the second input data structure. In some implementations, generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure can include executing one or more second machine-learned model inference operations to generate an updated memory representation based on the second input data structure. For example, executing one or more second machine-learned model inference operations can include engaging (e.g. , via an application programming interface) machine-learned model system 110 to execute machine-learned model(s) 1 (e.g , sequence processing model 4) to generate output data structure 130 based on extracted memory data 126 and prior memory data 128. Memory consolidator 120 can generate an updated memory representation (e.g., memory representation(s) 132, memory representation object 302, 402, etc.).
[0536] In some implementations, a device or system executing aspects of example method 1900 can be the same or different from a device or system that generates extracted memory data or that executes a user interface for facilitating interactions between a user and machine-learned agent system 100. Forexample, a server computing system 60 can execute one or more operations of example method 1900 to provide a memory consolidation operation service (e.g., a software service accessible via an application programming interface). For example, local applications executing on a first device can provide, via an application programming interface, extracted memory data and optionally prior memory data for consolidation. In this manner, for instance, models used for performing inference operations for generating consolidated memory representations can be executed on different hardware than models used for performing inference operations for generating extracted memory data. It is to be understood, however, that a consolidation software component can be executed on the same system or device as a device or system that generates extracted memory data or that executes a user interface for facilitating interactions between a user and machine-learned agent system 100.
[0537] Figure 20 depicts a flowchart of a method 2000 for implementing a machine-learned agent system (e.g., 100) according to aspects of the present disclosure.
[0538] One or more portion(s) of example method 2000 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 2000 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 2000 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models.
[0539] Figure 20 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 20 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 2000 can be performed additionally, or alternatively, by other systems.
[0540] At 2002, example method 2000 includes generating a first input data structure describing one or more interactions of a machine-learned agent system. For example, the first input data structure 106 can include interaction data 108 that describes interaction record 102, which can contain data recording one or more interactions 102 of machine-learned agent system 100 with a current environment.
[0541] At 2004, example method 2000 includes generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure. In some implementations, generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure can include executing one or more first machine-learned model inference operations to generate one or more extracted memory objects based on the first input data structure. For example, executing one or more first machine-learned model inference operations can include engaging (e.g., via an application programming interface) machine-learned model system 110 to execute machine-learned model(s) 1 (e.g., sequence processing model 4) to generate one or more memory objects 114 based on interaction data 108. The one or more machine-learned model inference operations can be conditioned on data describing the interactions. The one or more first machine-learned model inference operations can be conditioned on instructions describing an analysis to be performed on interaction data 108. The one or more first machine-learned model inference operations can include generated, using a machine-learned model, an output. The output can be or be included in output data structure 112.
[0542] At 2006, example method 2000 includes generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine-learned agent system. For example, based on the provided extracted memory data, memory consolidator 120 can generate the second input data structure 124, which can include extracted memory data 126 describing one or more extracted memory objects 114, and can include prior memory data 128 describing one or more prior interactions 102 of machine-learned agent system 100. In some implementations, example method 1800 includes generating a second input data structure including extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine-learned agent system.
[0543] At 2008, example method 2000 includes generating, based on one or more second machine- learned model inference operations, an updated memory representation based on the second input data structure. In some implementations, generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure can include executing one or more second machine-learned model inference operations to generate an updated memory representation based on the second input data structure. For example, executing one or more second machine-learned model inference operations can include engaging (e.g., via an application programming interface) machine-learned model system 110 to execute machine-learned model(s) 1 (e.g., sequence processing model 4) to generate output data structure 130 based on extracted memory data 126 and prior memory data 128. Memory consolidator 120 can generate an updated memory representation (e.g., memory representation(s) 132, memory representation object 302, 402, etc.).
[0544] In some implementations, example method 1900 or 2000 includes generating a third input data structure based on the updated memory representation and a query. For example, machine-learned agent system 100 can receive, from input interface 134, a query 136. Machine-learned agent system 100 can generate an input data structure 138 that contains memory representation(s) 132 (or data derived therefrom) and query data 140.
[0545] In some implementations, example method 1900 or 2000 includes generating, based on one or more third machine-learned model inference operations, a response to the query based on the third inputdata structure. In some implementations, generating, based on one or more third machine-learned model inference operations, a response to the query based on the third input data structure includes executing one or more third machine-learned model inference operations to generate a response to the query based on the third input data structure. For example, machine-learned model system(s) 110 can include one or more machine-learned models (e.g., machine-learned model(s) 1) that execute one or more inference operations based on input data structure(s) 138 to produce an output data structure(s) 142.
[0546] In some implementations, example method 1800, 1900, or 2000 includes storing the updated memory representation. For example, memory consolidator 120 can store the updated memory representation in interaction datastore 116. The updated memory representation can be used to condition future responses of machine-learned agent system 100.
[0547] In some implementations, example method 1900 or 2000 includes storing the one or more extracted memory objects (e.g., in interaction memory datastore 116). In some implementations, example method 1900 or 2000 includes storing the one or more extracted memory objects in association with a session identifier corresponding to the one or more interactions of the machine-learned agent system.
[0548] In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more third machine-learned model inference operations to generate the response to the query based on the third input data structure) providing, as input to a machine-learned model, the third input data structure to cause the machine-learned model to generate an output conditioned on the third input data structure. In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more third machine-learned model inference operations to generate the response to the query based on the third input data structure) providing, as the input to the machine-learned model, the third input data structure to cause the machine-learned model to generate the output conditioned on the updated memory representation. For example, the machine-learned model executed by machine-learned model system 110 to generate output data structure 142 can include an attention mechanism that conditions outputs based on an input sequence in a context window. Input data structure 138 can be tokenized into a sequence representation by a tokenizer of a machine-learned sequence processing model (e.g., sequence processing model 4) an input into a context window for generated outputs.
[0549] In some implementations of example method 1800, 1900, or 2000, the machine-learned model attends over the third input data structure to generate one or more attention values representing one or more associations between one or more portions of a representation of the third input data structure and one or more portions of a context window including the representation of the third input data structure. In some implementations of example method 1800, 1900, or 2000, the machine-learned model generates one or more predicted values based on the one or more attention values. In some implementations of example method 1800, 1900, or 2000, the machine-learned model generates the output based on the one or more predicted values.
[0550] In some implementations of example method 1800, 1900, or 2000, the one or more third machine-learned model inference operations are implemented using a multimodal machine-learned model. For example, a multimodal machine-learned model can generate outputs for engaging in multimodal interactions using various input interface(s) 134 and output interface(s) 146.
[0551] In some implementations of example method 1800, 1900, or 2000, the one or more third machine-learned model inference operations are implemented using a machine-learned model different from a machine-learned model used for the one or more first machine-learned model inference operations or the one or more second machine-learned model inference operations. For example, a multimodal machine-learned model can be used for ingesting inputs and generating outputs for multimodal agentic interactions with a user. For example, a situated agent that operates to ingest sensor data (e.g., image streams, audio streams, etc.) can use a multimodal machine-learned model (e.g., machine-learned model(s) 1) to generate inputs and perform inference operations for understanding inputs and generating outputs. A different machine-learned model (e.g., a text-based machine-learned model) can execute to perform one or more inference operations for memory extraction or consolidation.
[0552] In some implementations of example method 1800, 1900, or 2000, the one or more third machine-learned model inference operations are implemented using a multimodal machine-learned model configured to generate speech outputs based on image and text inputs, wherein at least one text input of the text inputs includes the updated memory representation. For example, a text-based memory representation can be used to condition generations for responses by machine-learned agent system 100 to user inputs. The responses can be generated natively in various different modalities. In an example, a multimodal machine-learned model can process text-based inputs and generate native audio outputs. In an example, a multimodal machine-learned model can process text-based inputs and generate textual outputs that can be rendered into audible speech using a text-to-speech system.
[0553] In some implementations, example method 1800, 1900, or 2000 includes executing a query over one or more extracted memory objects. For example, memory manager 118 can query interaction memory datastore 116 to retrieve extracted memory objects for performing one or more operations. In some implementations, example method 1800, 1900, or 2000 includes updating one or more values associated with the one or more extracted memory objects based on the query. In some implementations of example method 1800, 1900, or 2000, updating the one or more values includes deleting the one or more extracted memory objects.
[0554] In some implementations of example method 1800, 1900, or 2000, the query returns the one or more extracted memory objects based on metadata stored in association with the one or more extracted memory objects. In some implementations of example method 1800, 1900, or 2000, the query returns the one or more extracted memory objects based on a session identifier associated with the one or more extracted memory objects. In some implementations of example method 1800, 1900, or 2000, the one ormore values are updated based on an instruction received by the machine-learned agent system (e.g., an instruction to forget a particular statement, conversation, or other portion of information).
[0555] In some implementations of example method 1800, 1900, or 2000, the machine-learned agent system streams query data including images and audio from an input device to the multimodal machine- learned model to generate output data conditioned on the query data, wherein the output data is generated conditioned on the updated memory representation. In an example, the input device can be associated with a situated agent system (e.g., on a mobile device, a wearable device, etc.) that operates to perceive a user or a user's environment along with inputs from a user to generate responses thereto (e.g., to perform tasks).
[0556] In some implementations of example method 1800, 1900, or 2000, the updated memory representation represents an additional memory object as compared to the prior memory representation. In some implementations of example method 1800, 1900, or 2000, the updated memory representation represents an updated version of a memory object represented in the prior memory representation. In some implementations of example method 1800, 1900, or 2000, the updated memory representation represents, in a single memory object, memory data represented in a plurality of memory objects in the prior memory representation.
[0557] In some implementations of example method 1800, 1900, or 2000, the extracted memory data includes memory values from the one or more extracted memory objects. In some implementations of example method 1800, 1900, or 2000, the prior memory data includes memory values from one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system.
[0558] In some implementations of example method 1800, 1900, or 2000, the second input data structure includes a plurality of portions representing the extracted memory data and the prior memory data, wherein each respective portion of the plurality of portions corresponds to each respective memory object of a plurality of memory objects, the plurality of memory objects including the one or more extracted memory objects and the one or more prior memory objects. For example, as described with respect to Figure 3, a memory consolidation cycle can be performed in view of prior memory objects and newly extracted memory objects. For example, a plurality of portions correspond to a plurality of substrings of a string-based memory representation. The plurality of substrings can be appended or concatenated to generate the string-based memory representation. The plurality of substrings can be memory values obtained from a plurality of memory objects. Data other than strings can be used or included. For instance, images, audio recordings, or other values can be used for representation of the plurality of memory objects. In some implementations of example method 1800, 1900, or 2000, the extracted memory data includes memory values from the one or more extracted memory objects.
[0559] In some implementations of example method 1800, 1900, or 2000, the prior memory data includes a compressed memory representation generated, using one or more third machine-learned modelinference operations, based on one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system. For example, as described with respect to Figure 4, a memory consolidation cycle can be performed in view of a prior consolidation representation of prior memory objects and also in view of newly extracted memory objects.
[0560] In some implementations, example method 1800, 1900, or 2000 includes replacing the prior memory representation with the updated memory representation. In some implementations of example method 1800, 1900, or 2000, replacing the prior memory representation with the updated memory representation includes storing the updated memory representation as a current version of one or more stored memory representations. In some implementations of example method 1800, 1900, or 2000, replacing the prior memory representation with the updated memory representation includes overwriting the prior memory representation with the updated memory representation.
[0561] In some implementations of example method 1800, 1900, or 2000, the one or more first machine- learned model inference operations include an attention operation performed over the first input data structure. An attention operation can include self-attention executed over the first input data structure to compute values indicating associations between portions of the first input data structure. These associations can be used to generate updated representations of the portions to encode the context of the portions. These contextualized representations can be provided to an output mechanism (e.g., a feedforward layer) to predict one or more values that indicate a likely next portion in a sequence of portions. In this manner, for instance, the arrangement and content of the first input data structure can activate different internal latent states so as to cause the generation of outputs that extract memory information according to a particular format or technique.
[0562] In some implementations of example method 1800, 1900, or 2000, the one or more second machine-learned model inference operations include an attention operation performed over the second input data structure. The attention operation can include self-attention executed over the second input data structure to compute values indicating associations between portions of the second input data structure. These associations can be used to generate updated representations of the portions to encode the context of the portions. These contextualized representations can be provided to an output mechanism (e.g., a feedforward layer) to predict one or more values that indicate a likely next portion in a sequence of portions In this manner, for instance, the arrangement and content of the second input data structure can activate different internal latent states so as to cause the generation of outputs that consolidate memory information according to a particular format or technique.
[0563] In some implementations of example method 1800, 1900, or 2000, the one or more third machine-learned model inference operations include an attention operation performed over the third input data structure based on the updated memory representation. The attention operation can include selfattention executed over the third input data structure to compute values indicating associations betweenportions of the third input data structure. These associations can be used to generate updated representations of the portions to encode the context of the portions These contextualized representations can be provided to an output mechanism (e.g., a feedforward layer) to predict one or more values that indicate a likely next portion in a sequence of portions. In this manner, for instance, the arrangement and content of the third input data structure can activate different internal latent states so as to cause the generation of outputs for agentic interactions based on consolidated memory information according to a particular format or technique.
[0564] In some implementations of example method 1800, 1900, or 2000, generating the first input data structure includes obtaining interaction record data describing one or more interactions of a machine- learned agent system. In some implementations of example method 1800, 1900, or 2000, generating the first input data structure includes parsing the interaction record data to remove one or more portions of the interaction record data. In some implementations of example method 1800, 1900, or 2000, the one or more portions of the interaction record data contain control sequences for the machine-learned agent system. In some implementations of example method 1800, 1900, or 2000, the control sequences include a tool call configured to cause the machine-learned agent system to invoke a tool specified by the tool call.
[0565] In some implementations of example method 1800, 1900, or 2000, generating the second input data structure includes retrieving, from a memory datastore, the prior memory data. In some implementations of example method 1800, 1900, or 2000, generating the second input data structure includes retrieving, from a memory datastore, a plurality of prior memory objects describing the one or more prior interactions of the machine-learned agent system. In some implementations of example method 1800, 1900, or 2000, generating the second input data structure includes composing the prior memory data based on the plurality of prior memory objects.
[0566] In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more first machine-learned model inference operations) providing the first input data structure as input to a first machine-learned model to generate an output. In some implementations of example method 1800, 1900, or 2000, executing the one or more first machine-learned model inference operations includes generating the one or more extracted memory objects based on the output.
[0567] The machine-learned model(s) used for extraction and consolidation can be the same. In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more first machine-learned model inference operations) providing the first input data structure as input to a machine- learned model to generate a first output and generating the one or more extracted memory objects based on the first output. In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more second machine-learned model inference operations) providing the second input data structure as input to the machine-learned model to generate a second output and generating the updated memory representation based on the second output.
[0568] The machine-learned model(s) used for extraction and consolidation can be different. In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more first machine-learned model inference operations) providing the first input data structure as input to a first machine-learned model to generate a first output and generating the one or more extracted memory objects based on the first output. In some implementations, example method 1800, 1900, or 2000 includes (e.g., for executing the one or more second machine-learned model inference operations) providing the second input data structure as input to a second machine-learned model to generate a second output and generating the updated memory representation based on the second output; wherein the first machine- learned model is different from the second machine-learned model.
[0569] Model selection for extraction and consolidation can vary based on the complexity and context window requirements of each task. For extraction, the input data, which may include multimodal data such as text, images, and audio, may be processed to identify and isolate relevant information. Depending on the frequency of execution, extraction may involve long or short context windows. If the relevant information is spread across a large context window (e.g., a long conversation), a model with a larger context window may be used. The model’s architecture may also be selected based on an input data modality (e.g., modality of data in interaction record 102).
[0570] For extraction, as an example, more complex reasoning may be involved to assemble and collect related memories. A larger model may be used to obtain enhanced capabilities for recognizing patterns in inputs (e.g., latent patterns encoded in embedded representations of the inputs). A size of the context window may depend on the desired level of detail in the consolidated memory representation and the number of memory objects being consolidated.
[0571] In some implementations of example method 1800, 1900, or 2000, the one or more first machine- learned model inference operations are configured for memory extraction. In some implementations of example method 1800, 1900, or 2000, the one or more second machine-learned model inference operations are configured for memory consolidation.
[0572] In some implementations of example method 1800, 1900, or 2000, the one or more first machine- learned model inference operations are configured for memory extraction by constructing the first input data structure to cause the machine-learned model to generate the first output to contain extracted memory data. In some implementations of example method 1800, 1900, or 2000, the one or more second machine- learned model inference operations are configured for memory consolidation by constructing the second input data structure to cause the machine-learned model to generate the second output to contain consolidated memory data.
[0573] In some implementations of example method 1800, 1900, or 2000, the first machine-learned model is configured for memory extraction. In some implementations of example method 1800, 1900, or 2000, the second machine-learned model is configured for memory consolidation.
[0574] In some implementations of example method 1800, 1900, or 2000, the first machine-learned model is trained for memory extraction. For example, a machine-learned model can be fine-tuned to generate extracted memory data based on a recall or precision metric.
[0575] In some implementations of example method 1800, 1900, or 2000, the second machine-learned model is trained for memory consolidation. For example, a machine-learned model can be fine-tuned to generate consolidated memory representations based on a recall metric. For example, a machine-learned model can be fine-tuned to generate consolidated memory representations based on a data-weighted recall metric that optimizes for compact representations that provide high recall.
[0576] In some implementations of example method 1800, 1900, or 2000, the first machine-learned model includes a first adapted variant of a base machine-learned model. In some implementations of example method 1800, 1900, or 2000, the second machine-learned model includes a second adapted variant of the base machine-learned model.
[0577] In some implementations of example method 1800, 1900, or 2000, the first input data structure includes textual data.
[0578] In some implementations of example method 1800, 1900, or 2000, the first input data structure includes audio recording data, and wherein the first machine-learned model inference operations include processing the audio recording data.
[0579] In some implementations of example method 1800, 1900, or 2000, the first input data structure includes video recording data, and wherein the first machine-learned model inference operations include processing the video recording data
[0580] In some implementations of example method 1800, 1900, or 2000, the first input data structure includes image data, and wherein the first machine-learned model inference operations include processing the image data.
[0581] In some implementations of example method 1800, 1900, or 2000, the one or more first machine- learned model inference operations are executed using cached attention values computed during the one or more interactions with the machine-learned agent system. For example, example implementations of using latent values 608 and 708 are described with respect to Figure 6 and Figure 7, respectively.
[0582] In some implementations of example method 1800, 1900, or 2000, one or more of the one or more first machine-learned model inference operations are executed in sequence with one or more of the one or more interactions with the machine-learned agent system.
[0583] In some implementations of example method 1800, 1900, or 2000, the one or more of the one or more first machine-learned model inference operations include generation of a memory tool call that instructs the machine-learned agent system to invoke a memory tool for storing extracted memory data associated with the one or more interactions.
[0584] In some implementations of example method 1800, 1900, or 2000, generating the second input data structure includes querying a memory datastore for a set of stored memory objects, wherein the set of stored memory objects correspond to a selection of memory objects from an interaction memory datastore based on a memory filter.
[0585] In some implementations of example method 1800, 1900, or 2000, the memory filter includes a recency-based filter. In some implementations of example method 1800, 1900, or 2000, the memory filter includes a priority-based filter. In some implementations of example method 1800, 1900, or 2000, the memory filter includes a relevance-based filter.
[0586] In some implementations of example method 1800, 1900, or 2000, the memory filter operates to select the set of stored memory objects based on a respective value of a metadata attribute stored in association with each respective memory object of the set of stored memory objects.
[0587] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
[0588] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
[0589] Aspects of the disclosure have been described in terms of illustrative embodiments thereof. Any and all features in the following claims can be combined or rearranged in any way possible, including combinations of claims not explicitly enumerated in combination together, as the example claim dependencies listed herein should not be read as limiting the scope of possible combinations of features disclosed herein. Accordingly, the scope of the present disclosure is by way of example rather than by way of limitation, and the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Moreover, terms are described herein using lists of example elements joined by conjunctions such as“and,” “or," “but,” etc. It should be understood that such conjunctions are provided for explanatory purposes only. Clauses and other sequences of items joined by a particular conjunction such as “or,” for example, can refer to “and / or,” “at least one of, “any combination of example elements listed therein, etc. Terms such as “based on” should be understood as “based at least in part on.”
[0590] The term “can” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X can perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.
[0591] The term “may” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X may perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.
Claims
1. WHAT IS CLAIMED IS:1 . A computer-implemented method comprising: generating a first input data structure describing one or more interactions of a machine-learned agent system; generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure; providing extracted memory data of the one or more extracted memory objects for processing using one or more second machine-learned model inference operations to generate an updated memory representation, wherein the one or more second machine-learned model inference operations process a second input data structure comprising the extracted memory data and prior memory data describing one or more prior interactions of the machine-learned agent system to generate the updated memory representation based on the second input data structure; generating a third input data structure based on the updated memory representation and a query; and generating, based on one or more third machine-learned model inference operations, a response to the query based on the third input data structure.
2. The computer-implemented method of claim 1 , comprising: transmitting the extracted memory data to a machine-learned model system that executes the one or more second machine-learned model inference operations; and receiving, from the machine-learned model system, output data describing the updated memory representation.
3. A computer-implemented method comprising: receiving one or more extracted memory objects describing one or more interactions of a machine- learned agent system; generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine- learned agent system;generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure; and storing the updated memory representation.
4. A computer-implemented method comprising: generating a first input data structure describing one or more interactions of a machine-learned agent system; generating, based on one or more first machine-learned model inference operations, one or more extracted memory objects based on the first input data structure; generating a second input data structure comprising extracted memory data from the one or more extracted memory objects and prior memory data describing one or more prior interactions of the machine- learned agent system; generating, based on one or more second machine-learned model inference operations, an updated memory representation based on the second input data structure; and storing the updated memory representation.
5. The computer-implemented method of claim 3 or 4, comprising: generating a third input data structure based on the updated memory representation and a query; and generating, based on one or more third machine-learned model inference operations, a response to the query based on the third input data structure.
6. The computer-implemented method of any of the preceding claims, comprising: providing, as input to a machine-learned model, the third input data structure to cause the machine-learned model to generate an output conditioned on the third input data structure.
7. The computer-implemented method of any of the preceding claims, comprising:providing, as the input to the machine-learned model, the updated memory representation in the third input data structure to cause the machine-learned model to generate the output conditioned on the updated memory representation.
8. The computer-implemented method of any of claim 6 or claim 7, wherein: the machine-learned model attends over the third input data structure to generate one or more attention values representing one or more associations between one or more portions of a representation of the third input data structure and one or more portions of a context window comprising the representation of the third input data structure; the machine-learned model generates one or more predicted values based on the one or more attention values; and the machine-learned model generates the output based on the one or more predicted values.
9. The computer-implemented method of any of the preceding claims, wherein the one or more third machine-learned model inference operations are implemented using a multimodal machine-learned model.
10. The computer-implemented method of any of the preceding claims, wherein the one or more third machine-learned model inference operations are implemented using a machine-learned model different from a machine-learned model used for the one or more first machine-learned model inference operations or the one or more second machine-learned model inference operations.11 . The computer-implemented method of any of the preceding claims, wherein the one or more third machine-learned model inference operations are implemented using a multimodal machine-learned model configured to generate speech outputs based on image and text inputs, wherein at least one text input of the text inputs comprises the updated memory representation.
12. The computer-implemented method of any of the preceding claims, comprising: executing a query over one or more extracted memory objects;updating one or more values associated with the one or more extracted memory objects based on the query.
13. The computer-implemented method of claim 12, wherein updating the one or more values comprises: deleting the one or more extracted memory objects.
14. The computer-implemented method of any of claims 12 or 13, wherein the query returns the one or more extracted memory objects based on metadata stored in association with the one or more extracted memory objects.
15. The computer-implemented method of any of claims 12, 13, or 14, wherein the query returns the one or more extracted memory objects based on a session identifier associated with the one or more extracted memory objects.
16. The computer-implemented method of any of claims 12, 13, 14, or 15, wherein the one or more values are updated based on an instruction received by the machine-learned agent system.
17. The computer-implemented method of any of the preceding claims, wherein: the machine-learned agent system streams query data comprising images and audio from an input device to the multimodal machine-learned model to generate output data conditioned on the query data, wherein the output data is generated conditioned on the updated memory representation.
18. The computer-implemented method of any of the preceding claims, wherein the prior memory data is obtained from a prior memory representation, and wherein at least one of the following conditions is satisfied: the updated memory representation represents an additional memory object as compared to the prior memory representation; the updated memory representation represents an updated version of a memory object represented in the prior memory representation; orthe updated memory representation represents, in a single memory object, memory data represented in a plurality of memory objects in the prior memory representation.
19. The computer-implemented method of any of the preceding claims, wherein: the extracted memory data comprises memory values from the one or more extracted memory objects; and the prior memory data comprises memory values from one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system.
20. The computer-implemented method of any of the preceding claims, wherein the second input data structure comprises a plurality of portions representing the extracted memory data and the prior memory data, wherein each respective portion of the plurality of portions corresponds to each respective memory object of a plurality of memory objects, the plurality of memory objects comprising the one or more extracted memory objects and the one or more prior memory objects.21 . The computer-implemented method of any of the preceding claims, wherein: the extracted memory data comprises memory values from the one or more extracted memory objects; and the prior memory data comprises a compressed memory representation generated, using one or more third machine-learned model inference operations, based on one or more prior memory objects that describe the one or more prior interactions of the machine-learned agent system.
22. The computer-implemented method of any of the preceding claims, comprising: replacing the prior memory representation with the updated memory representation23. The computer-implemented method of claim 22, wherein replacing the prior memory representation with the updated memory representation comprises: storing the updated memory representation as a current version of one or more stored memory representations.
24. The computer-implemented method of claim 22, wherein replacing the prior memory representation with the updated memory representation comprises: overwriting the prior memory representation with the updated memory representation.
25. The computer-implemented method of any of the preceding claims, comprising: storing the one or more extracted memory objects in association with a session identifier corresponding to the one or more interactions of the machine-learned agent system.
26. The computer-implemented method of any of the preceding claims, wherein the one or more first machine-learned model inference operations comprise an attention operation performed over the first input data structure.
27. The computer-implemented method of any of the preceding claims, wherein generating the first input data structure comprises: obtaining interaction record data describing one or more interactions of a machine-learned agent system; and parsing the interaction record data to remove one or more portions of the interaction record data28. The computer-implemented method of claim 27, wherein the one or more portions of the interaction record data contain control sequences for the machine-learned agent system.
29. The computer-implemented method of claim 27, wherein the control sequences comprise a tool call configured to cause the machine-learned agent system to invoke a tool specified by the tool call.
30. The computer-implemented method of any of the preceding claims, wherein generating the second input data structure comprises: retrieving, from a memory datastore, the prior memory data.31 . The computer-implemented method of any of the preceding claims, wherein generating the second input data structure comprises: retrieving, from a memory datastore, a plurality of prior memory objects describing the one or more prior interactions of the machine-learned agent system; and composing the prior memory data based on the plurality of prior memory objects.
32. The computer-implemented method of any of the preceding claims, comprising: providing the first input data structure as input to a first machine-learned model to generate an output; and generating the one or more extracted memory objects based on the output33. The computer-implemented method of any of the preceding claims, comprising: providing the first input data structure as input to a machine-learned model to generate a first output; and generating the one or more extracted memory objects based on the first output.
34. The computer-implemented method of any of the preceding claims, comprising: providing the second input data structure as input to the machine-learned model to generate a second output; and generating the updated memory representation based on the second output.
35. The computer-implemented method of any of the preceding claims, wherein: the one or more first machine-learned model inference operations are configured for memory extraction; and the one or more second machine-learned model inference operations are configured for memory consolidation.
36. The computer-implemented method of claim 35, wherein: the one or more first machine-learned model inference operations are configured for memory extraction by constructing the first input data structure to cause the machine-learned model to generate the first output to contain extracted memory data; and the one or more second machine-learned model inference operations are configured for memory consolidation by constructing the second input data structure to cause the machine-learned model to generate the second output to contain consolidated memory data.
37. The computer-implemented method of any of the preceding claims, wherein: the one or more first machine-learned model inference operations comprise the execution of a first machine-learned model to generate a first output; and the one or more second machine-learned model inference operations comprise the execution of a second machine-learned model to generate a second output; wherein the first machine-learned model is different from the second machine-learned model.
38. The computer-implemented method of claim 37, wherein: the first machine-learned model is configured for memory extraction; and the second machine-learned model is configured for memory consolidation.
39. The computer-implemented method of claim 37, wherein: the first machine-learned model is trained for memory extraction; and the second machine-learned model is trained for memory consolidation.
40. The computer-implemented method of claim 37, wherein: the first machine-learned model comprises a first adapted variant of a base machine-learned model; and the second machine-learned model comprises a second adapted variant of the base machine- learned model.41 . The computer-implemented method of any of the preceding claims, wherein the first input data structure comprises textual data.
42. The computer-implemented method of any of the preceding claims, wherein the first input data structure comprises audio recording data, and wherein the first machine-learned model inference operations comprise processing the audio recording data.
43. The computer-implemented method of any of the preceding claims, wherein the first input data structure comprises video recording data, and wherein the first machine-learned model inference operations comprise processing the video recording data.
44. The computer-implemented method of any of the preceding claims, wherein the first input data structure comprises image data, and wherein the first machine-learned model inference operations comprise processing the image data.
45. The computer-implemented method of any of the preceding claims, wherein the one or more first machine-learned model inference operations are executed using cached attention values computed during the one or more interactions with the machine-learned agent system.
46. The computer-implemented method of claim 45, wherein one or more of the one or more first machine-learned model inference operations are executed in sequence with one or more of the one or more interactions with the machine-learned agent system.
47. The computer-implemented method of claim 46, wherein the one or more of the one or more first machine-learned model inference operations comprise generation of a memory tool call that instructs the machine-learned agent system to invoke a memory tool for storing extracted memory data associated with the one or more interactions.
48. The computer-implemented method of any of the preceding claims, wherein generating the second input data structure comprises: querying a memory datastore for a set of stored memory objects, wherein the set of stored memory objects correspond to a selection of memory objects from an interaction memory datastore based on a memory filter.
49. The computer-implemented method of claim 48, wherein the memory filter comprises at least one of the following: a recency-based filter; a priority-based filter; or a relevance-based filter.
50. The computer-implemented method of claim 48 or 49, wherein the memory filter operates to select the set of stored memory objects based on a respective value of a metadata attribute stored in association with each respective memory object of the set of stored memory objects.51 . One or more non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising the method of any of the preceding claims.
52. A computing system, comprising: one or more processors; and one or more non-transitory computer-readable media storing instructions that are executable by the one or more processors to cause the computing system to perform operations, the operations comprising the method of any of the preceding claims.
53. A computer program product comprising instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising the method of any of the preceding claims.