Dynamic interaction memory datastore for machine-learned agent systems
The machine-learned agent system addresses the lack of memory in existing systems by dynamically updating a user-specific datastore, reducing repetitive interactions and improving task execution efficiency through contextual awareness and controlled data access.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GDM HOLDING LLC
- Filing Date
- 2025-12-11
- Publication Date
- 2026-06-18
AI Technical Summary
Existing agent systems lack a memory capability, leading to repetitive user interactions and inefficient task execution due to the lack of contextual information, and often require access to external data sources without user control over data sharing.
The disclosure describes a machine-learned agent system that dynamically updates a user-specific interaction memory datastore with multimodal data, allowing it to condition responses on past interactions and user preferences, reducing the need for repetitive inputs and improving asynchronous task execution.
The system reduces computational load and latency by leveraging pre-processed memory data, enabling efficient task execution and user-controlled data access, enhancing responsiveness and computational efficiency.
Description
DYNAMIC INTERACTION MEMORY DATASTORE FOR MACHINE-LEARNED AGENT SYSTEMS
PRIORITY
[0001] This application is based on and claims priority to U.S. Provisional Patent Application No. 63/730,845 (filed December 11, 2024). U.S. Provisional Patent Application No. 63/730,845 is hereby incorporated by reference herein in its entirety.
BACKGROUND
[0002] A computer can receive input(s). The computer can execute instructions to process the input(s) to generate output(s) using a parameterized model. The computer can obtain feedback on its performance in generating the outputs with the model. The computer can generate feedback by evaluating its performance. The computer can receive feedback from an external source. The computer can update parameters of the model based on the feedback to improve its performance. In this manner, the computer can iteratively “learn” to generate the desired outputs. The resulting model is often referred to as a machine-learned model.
SUMMARY
[0003] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.
[0004] In an aspect, the present disclosure provides an example computer-implemented method. In some implementations, the example computer-implemented method includes receiving, by a machine-learned agent system, a query associated with a user. In some implementations, the example computer-implemented method includes accessing, by the machine-learned agent system, memory data from an interaction memory datastore associated with the user, wherein the interaction memory datastore includes one or more memory objects that were generated based on one or more prior interactions between the machine-learned agent system and the user. In some implementations, the example computer-implemented method includes inputting, by the machine-learned agent system and to a machine-learned sequence processing model, an input data structure based on the query and the memory data. In some implementations, the example computer-implemented method includes generating, by the machine-learned agent system and based on processing the input data structure using the machine-learned sequence processing model, an output. In some implementations, the example computer-implemented method includes outputting, by the machine-learned agent system, and based on the output, a response to the query.
[0005] In some implementations of the example method, the interaction memory datastore includes multimodal data. In some implementations of the example method, the memory data is based on a first memory object associated with a first data modality and a second memory object associated with a second data modality.
[0006] In some implementations of the example method, the first data modality comprises audio data. In some implementations of the example method, the second data modality comprises image data. In some implementations of the example method, the first data modality comprises text data. In some implementations of the example method, the second data modality comprises image data. In some implementations of the example method, the first data modality comprises text data. In some implementations of the example method, the second data modality comprises audio data.
[0007] In some implementations, the example method includes retrieving, by the machine-learned agent system, and from the interaction memory datastore, the one or more memory values based on a relevance of the one or more memory values to the query.
[0008] In some implementations, the example method includes, after receiving an input during an interactive session, processing, by the machine-learned agent system, the input using a machine-learned model to generate one or more values. In some implementations, the example method includes, based on the generated one or more values indicating that at least a portion of the input is to be stored, extracting, by the machine-learned agent system, the portion. In some implementations, the example method includes storing, by the machine-learned agent system, the portion as a memory value in a memory object in the interaction memory datastore.
[0009] In some implementations, the example method includes storing, by the machine-learned agent system, metadata associated with the memory value in the memory object.
[0010] In some implementations of the example method, the portion includes at least one of: text data, image data, or audio data.
[0011] In some implementations of the example method, the portion corresponds to an explicit instruction to remember information.
[0012] In some implementations, the example method includes receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information. In some implementations, the example method includes deleting, by the machine-learned agent system, the specified information from the interaction memory datastore.
[0013] In some implementations, the example method includes receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information after a specified interval. In some implementations, the example method includes queuing, by the machine-learned agent system, the specified information for deletion from the interaction memory datastore after the specified interval.
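The two forgetting behaviors described in paragraphs [0012] and [0013] could be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class name `ForgettingDatastore`, its methods, and the use of a min-heap as the deletion queue are all assumptions introduced here.

```python
import heapq
from datetime import datetime, timedelta


class ForgettingDatastore:
    """Hypothetical store that supports immediate and delayed forgetting."""

    def __init__(self):
        self._values = {}    # memory_id -> memory value
        self._pending = []   # min-heap of (delete_at, memory_id)

    def remember(self, memory_id, value):
        self._values[memory_id] = value

    def forget_now(self, memory_id):
        # Immediate deletion in response to a "forget" instruction.
        self._values.pop(memory_id, None)

    def forget_after(self, memory_id, now, interval):
        # Queue the memory for deletion once the interval has elapsed.
        heapq.heappush(self._pending, (now + interval, memory_id))

    def process_due(self, now):
        # Delete every queued memory whose deadline has passed.
        deleted = []
        while self._pending and self._pending[0][0] <= now:
            _, memory_id = heapq.heappop(self._pending)
            if memory_id in self._values:
                del self._values[memory_id]
                deleted.append(memory_id)
        return deleted

    def ids(self):
        return sorted(self._values)


t0 = datetime(2025, 1, 1)
store = ForgettingDatastore()
store.remember("m1", "User's travel dates")
store.remember("m2", "User's gate code")
store.forget_after("m1", now=t0, interval=timedelta(days=30))
store.forget_now("m2")
```

In this sketch, a periodic job would call `process_due` with the current time; the heap keeps the earliest deadline at the front so only due entries are examined.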
[0014] In some implementations of the example method, the memory data includes an ordered combination of the one or more memory values, the ordered combination ordered based on corresponding timestamps of the plurality of memory objects.
[0015] In some implementations, the example method includes filtering the plurality of memory objects to obtain the one or more memory values. In some implementations of the example method, the filtering includes computing one or more relevance measures for the plurality of memory objects based on the query. In some implementations of the example method, the filtering includes returning a subset of the plurality of memory objects based on the relevance measure.
[0016] In some implementations of the example method, the one or more relevance measures comprise a respective score generated based on distance between a query embedding and a respective memory value embedding. In some implementations of the example method, the one or more relevance measures comprise a respective sequence output generated by processing the query and one or more respective memory values using a second machine-learned sequence processing model that is optionally the same as or different from the machine-learned sequence processing model.
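The embedding-distance variant of the relevance measure in paragraphs [0015] and [0016] might look like the sketch below. The function names, the choice of cosine similarity as the distance-based score, the `top_k` cutoff, and the three-dimensional toy embeddings are assumptions made for illustration; a real system would obtain embeddings from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Score in [-1, 1]; higher means the two embeddings are closer in direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_relevance(query_embedding, memories, top_k=2):
    """Return the top_k memory values ranked by similarity to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), value)
        for value, embedding in memories
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [value for _, value in scored[:top_k]]


# Toy (value, embedding) pairs; the numbers are made up for the example.
memories = [
    ("User eats a vegan diet.", [0.9, 0.1, 0.0]),
    ("User prefers window seats.", [0.0, 0.2, 0.9]),
    ("User dislikes cilantro.", [0.8, 0.3, 0.1]),
]
query_embedding = [1.0, 0.2, 0.0]  # stands in for an embedded food-related query
relevant = filter_by_relevance(query_embedding, memories, top_k=2)
```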
[0017] In some implementations of the example method, the one or more memory values each comprise a respective string, and wherein the memory data includes a string comprising the one or more respective strings from the one or more memory values.
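As a rough illustration of paragraphs [0014] and [0017], memory data can be formed by ordering string memory values by timestamp and joining them into one string. The dictionary layout, the newline separator, and the oldest-first order are assumptions of this sketch.

```python
from datetime import datetime


def build_memory_data(memory_objects):
    """Join memory value strings into a single string, oldest first."""
    ordered = sorted(memory_objects, key=lambda obj: obj["timestamp"])
    return "\n".join(obj["value"] for obj in ordered)


memory_objects = [
    {"value": "User is training for a marathon.", "timestamp": datetime(2025, 3, 2)},
    {"value": "User eats a vegan diet.", "timestamp": datetime(2025, 1, 15)},
]
memory_data = build_memory_data(memory_objects)
```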
[0018] In some implementations, the example method includes inputting at least a portion of the output to the machine-learned sequence processing model. In some implementations, the example method includes generating, based on processing the output using the machine-learned sequence processing model, the response to the query. In some implementations, the output includes a plurality of predictions for analytical content that includes an analysis of the query with respect to the context data.
[0019] In some implementations of the example method, the plurality of predictions for analytical content are not exposed on a user interface associated with the query.
[0020] In some implementations, the example method includes consolidating, in the interaction memory datastore, a subset of the plurality of memory objects into a single memory object based on an alignment between the subset.
[0021] In some implementations, the example method includes receiving input data associated with a timestamp. In some implementations, the example method includes creating a current memory object. In some implementations, the current memory object includes the timestamp. In some implementations, the current memory object includes a current memory value based on the input data. In some implementations, the example method includes matching the current memory object to a prior memory object. In some implementations, the example method includes replacing, in the interaction memory datastore, the prior memory object with the current memory object.
[0022] In some implementations, the example method includes receiving input data associated with a timestamp. In some implementations, the example method includes matching at least a portion of the input data to a prior memory object. In some implementations, the example method includes updating, in the interaction memory datastore, the prior memory object based on the portion of the input data.
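One possible reading of the match-and-replace flow in paragraph [0021] is sketched below. The disclosure does not say how matching is performed; the `topic` key used here is a hypothetical stand-in, as are the class and function names. The in-place update of paragraph [0022] would modify fields of the matched object instead of swapping it out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MemoryObject:
    topic: str           # hypothetical matching key, not from the disclosure
    value: str
    timestamp: datetime


def replace_or_add(store, current):
    """Replace a matching prior memory object, or append if none matches."""
    for index, prior in enumerate(store):
        if prior.topic == current.topic:
            # Keep the newer object so stale input cannot overwrite fresh memory.
            if current.timestamp >= prior.timestamp:
                store[index] = current
            return store
    store.append(current)
    return store


store = [MemoryObject("diet", "User eats a vegetarian diet.", datetime(2025, 1, 1))]
replace_or_add(store, MemoryObject("diet", "User eats a vegan diet.", datetime(2025, 2, 1)))
replace_or_add(store, MemoryObject("seating", "User prefers window seats.", datetime(2025, 2, 3)))
```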
[0023] In some implementations of the example method, the response includes instructions configured to control an application programming interface to cause a computing system to perform an action.
[0024] In some implementations of the example method, the query indicates a configuration parameter that controls an operation of the computing system, and wherein the instructions are configured to control the application programming interface to cause the computing system to assign a value for the configuration parameter.
[0025] In some implementations of the example method, the configuration parameter is associated with data access by the computing system. In some implementations of the example method, the configuration parameter is associated with data retention by the computing system. In some implementations of the example method, the configuration parameter is associated with data communication by the computing system.
[0026] In an aspect, the present disclosure provides an example one or more non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising any implementation of the example method.
[0027] In an aspect, the present disclosure provides an example computing system. In some implementations, the example computing system includes one or more processors. In some implementations, the example computing system includes one or more non-transitory computer-readable media storing instructions that are executable by the one or more processors to cause the computing system to perform operations, the operations comprising any implementation of the example method.
[0028] In an aspect, the present disclosure provides a computer program product comprising instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising any implementation of the example method.
[0029] Other example aspects of the present disclosure are directed to other systems, methods, apparatuses, tangible non-transitory computer-readable media, and devices for performing functions described herein. These and other features, aspects, and advantages of various implementations will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate implementations of the present disclosure and, together with the description, help explain the related principles.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] Figure 1 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0031] Figure 2 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0032] Figure 3 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0033] Figure 4 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0034] Figure 5 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0035] Figure 6 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0036] Figure 7 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0037] Figure 8 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0038] Figure 9 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0039] Figure 10 is a block diagram that illustrates aspects of a machine-learned agent system according to example implementations of aspects of the present disclosure.
[0040] Figure 11 is a flow chart diagram illustrating an example method for training a machine-learned model according to example implementations of aspects of the present disclosure.
[0041] Figure 12 is a block diagram of an example processing flow for using machine-learned model(s) to process input(s) to generate output(s) according to example implementations of aspects of the present disclosure.
[0042] Figure 13 is a block diagram of an example sequence processing model according to example implementations of aspects of the present disclosure.
[0043] Figure 14 is a block diagram of an example technique for populating an example input sequence for processing by a sequence processing model according to example implementations of aspects of the present disclosure.
[0044] Figure 15 is a block diagram of an example model development platform according to example implementations of aspects of the present disclosure.
[0045] Figure 16 is a block diagram of an example training workflow for training a machine-learned model according to example implementations of aspects of the present disclosure.
[0046] Figure 17 is a block diagram of an inference system for operating one or more machine-learned model(s) to perform inference according to example implementations of aspects of the present disclosure.
[0047] Figure 18 is a block diagram of an example networked computing system according to example implementations of aspects of the present disclosure.
[0048] Figure 19 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure.
[0049] Figure 20 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure.
[0050] Figure 21 is a flow chart diagram illustrating an example method for using a machine-learned agent system according to example implementations of aspects of the present disclosure.
DETAILED DESCRIPTION
[0051] Example implementations of the present disclosure provide a machine-learned agent system that can intelligently generate, curate, and update a user-specific memory or datastore of contextual information that guides the responses of the agent to inputs. For example, the machine-learned agent system can leverage user-specific contextual information, received from the user organically over the course of one or more interactive sessions, to condition the outputs of a machine-learned model to align with the preferences of and/or prior instructions from the user. By dynamically updating the user-specific memory over the course of interactions with the user, the user-specific interaction memory can provide a dynamic, adaptive representation of the user’s interests, instructions, and goals that can flexibly conform to the user’s variable needs over time.
[0052] More particularly, a user can interact with a system (e.g., an agent-based system) over a number of sessions. The sessions can be interactive sessions during which the machine-learned agent system ingests inputs from the user and generates outputs that are returned to the user or that are implemented to control one or more computing devices on behalf of the user. During or after each session, the machine-learned agent system can refresh the interaction memory by reviewing recent interactions and identifying new pieces of relevant information (e.g., “memory pieces”) that should be “remembered” to better assist the user in the future. In future interactions, inferences generated by the model can be conditioned on context extracted from the interaction memory.
[0053] An interaction memory datastore can contain structured representations of memory data. For instance, the machine-learned agent system can process the content of an interaction with the user and selectively generate memory objects that encapsulate salient features of the interaction. For instance, a memory object can be a data container that stores one or more memory values (e.g., strings, images, audio, etc.) along with relevant metadata. The stored memory objects can represent pieces of information that can collectively form a working memory of past interactions with the user that can inform future predictions made by the agent system on behalf of the user. During future inference operations, memory values can be inserted into the input of a machine-learned model to condition its outputs based on the memories retained of the past interactions.
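A memory object as described in paragraph [0053], a container holding a memory value together with metadata, could be modeled along the following lines. The class names, field names, and metadata keys are illustrative assumptions rather than structures given in the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class MemoryObject:
    """Hypothetical container for one memory value plus its metadata."""
    value: Any                       # e.g., a string, image bytes, or audio bytes
    modality: str                    # "text", "image", or "audio"
    timestamp: datetime
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class InteractionMemoryDatastore:
    """Hypothetical per-user collection of memory objects."""
    user_id: str
    objects: list[MemoryObject] = field(default_factory=list)

    def add(self, obj: MemoryObject) -> None:
        self.objects.append(obj)


store = InteractionMemoryDatastore(user_id="user-1")
store.add(MemoryObject(
    value="User eats a vegan diet.",
    modality="text",
    timestamp=datetime(2025, 1, 15, tzinfo=timezone.utc),
    metadata={"trigger": "implicit", "session_id": "s-42"},
))
```

Keeping the value in its native modality, with the modality recorded alongside it, matches the multimodal storage the disclosure describes; how image or audio values are encoded is left open here.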
[0054] The machine-learned agent system can add or update the interaction memory datastore based on explicit or implicit triggers. An example explicit trigger may occur when a user instructs the machine-learned agent system to remember a particular piece of information, such as the user’s favorite architect. This piece of information can be stored (e.g., as a string) as a context value in a memory object of the context datastore. Additional metadata may be stored in the memory object, such as a timestamp, the type of trigger used for storing the memory, a summary of the interaction during which the memory value was received, or other attributes.
[0055] Implicit triggers may occur when the machine-learned agent system processes data descriptive of the interaction with the user and generates an inference that indicates that a particular piece of information should be “remembered” to better assist the user in the future. For example, a user may request suggestions for a pizza recipe. The machine-learned agent system can receive the request and generate an output list of recipes. In response to reviewing the list of suggestions, the user may state, “Please show me recipes without dairy; I eat a vegan diet.” The machine-learned agent system can process the user’s response and generate an updated output list for responding to the request. Additionally, the machine-learned agent system can process the interaction and identify that the user’s expressed preference for a vegan diet may be highly relevant to future tasks performed on behalf of the user. Responsive to this determination, the machine-learned agent system can add, to the interaction memory datastore, a memory object that stores a memory value indicating that the user eats a vegan diet. This self-reflection over the interaction history can occur online (e.g., in parallel with or in between generating responses during the interaction) or offline (e.g., after the interaction during an otherwise inactive session).
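The implicit-trigger flow in paragraph [0055] can be outlined as a gate in front of the datastore: a model decides whether any portion of an input is worth storing, and the portion is stored only if so. In the sketch below, the keyword heuristic `extract_memory_portion` is only a placeholder for the machine-learned model, and every name and marker phrase is an assumption made for the example.

```python
# Placeholder phrases; a real system would use a machine-learned model instead.
MARKERS = ("i eat", "i prefer", "i am allergic to")


def extract_memory_portion(utterance):
    """Stand-in for the model: return the clause to remember, or None."""
    for clause in utterance.split(";"):
        clause = clause.strip()
        if any(marker in clause.lower() for marker in MARKERS):
            return clause
    return None


def maybe_store(utterance, datastore, classifier=extract_memory_portion):
    """Store the extracted portion as a memory value only when one is found."""
    portion = classifier(utterance)
    if portion is not None:
        datastore.append({"value": portion, "trigger": "implicit"})
    return portion


datastore = []
stored = maybe_store(
    "Please show me recipes without dairy; I eat a vegan diet.", datastore
)
skipped = maybe_store("What time is it?", datastore)
```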
[0056] In future requests and/or instructions, the user may naturally tend to omit details previously provided to the machine-learned agent system (e.g., details regarding the vegan diet, details instructed to be remembered) because, in the user’s experience, the user has already provided that information. Advantageously, when servicing future requests that omit the specification of such information, the machine-learned agent system can refer back to the stored context and generate responses that align with the previously-specified information.
[0057] Because the memory of the machine-learned agent system can be formed by interactions with the user, the user can also use the same interaction mechanism to limit, erase, or otherwise modify the memory of the machine-learned agent system. For instance, the machine-learned agent system can store memory objects that contain a timestamp value. The user can instruct the machine-learned agent system to forget information after three months. The machine-learned agent system can implement this instruction by configuring the interaction memory datastore to delete memory objects on a rolling basis based on the specified horizon. Similarly, a user can utter a command such as, “don’t remember this conversation,” and the machine-learned agent system can purge any memory data related to the conversation upon conclusion of the session. Memory objects associated with the session can be stored with a session identifier to facilitate grouping of memory data associated with the session.
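The rolling retention horizon and the per-session purge described in paragraph [0057] might be expressed as two small filters over timestamped, session-tagged memory objects. The names below are hypothetical, and approximating "three months" as 90 days is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryObject:
    value: str
    timestamp: datetime
    session_id: str


def apply_retention_horizon(objects, now, horizon):
    """Keep only memory objects newer than the rolling cutoff."""
    cutoff = now - horizon
    return [obj for obj in objects if obj.timestamp >= cutoff]


def purge_session(objects, session_id):
    """Drop every memory object recorded during the given session."""
    return [obj for obj in objects if obj.session_id != session_id]


objects = [
    MemoryObject("User eats a vegan diet.", datetime(2025, 1, 15), "s1"),
    MemoryObject("User is planning a trip.", datetime(2025, 5, 1), "s2"),
    MemoryObject("User mentioned a surprise gift.", datetime(2025, 5, 20), "s3"),
]
now = datetime(2025, 6, 1)
retained = apply_retention_horizon(objects, now, horizon=timedelta(days=90))
after_purge = purge_session(retained, session_id="s3")
```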
[0058] Advantageously, the memory of the machine-learned agent system can be multimodal. For instance, interactions between an agent system and users can leverage various modalities of data for inputs and outputs. Agent systems can process image data (e.g., streaming from a camera or read from a file) that describes a user’s environment. Agent systems can process text data input by a user or output by a transcription system. Agent systems can process audio data recorded by a microphone of a user device (e.g., that captures utterances from a user). The original data modalities of each interaction can encode contextual signals that may be lost if transformed into a different data modality. An example agent system can leverage a multimodal machine-learned model to process multimodal memories directly to leverage the benefits of the rich context in the native memory modality.
[0059] Example implementations of machine-learned agent systems according to the present disclosure can provide a number of technical improvements that solve problems experienced by existing approaches. For instance, some traditional agent systems that lack a memory capability may lead to interruptions or prompts to process additional inputs and render additional outputs to perform a task. Without a memory of useful context, traditional agent devices may require entry of the same information multiple times over the same or multiple sessions in order to perform the same or similar tasks. Some possible alternative approaches to resolve this problem include exposing an agent system to various application programming interfaces that facilitate retrieval of relevant information from external data sources. While sometimes helpful, external data sources that have not been curated by the agent based on interactions with a user may still lack sufficient context for seamlessly supporting future interactions with the user. Further, the agent may not have control over the data for maintaining fresh information. Further, a user may not wish to permit an agent device to access all information in all available external data sources. For example, dietary restrictions may be relevant context for many daily tasks for which an agent device may be used (e.g., ordering food). A user may wish to permit the agent device to access this information. However, context sources that might contain dietary information (e.g., health data applications, medical entity communication applications) may contain other sensitive data to which the user does not wish to extend access. Under traditional approaches to enabling external data sources, the user may be faced with an all-or-nothing choice, which may be undesirable in some circumstances.
[0060] Machine-learned agent systems according to the present disclosure can provide solutions to these and other technical problems currently experienced by existing technologies. Example machine-learned agent systems can reduce a number and timing of user interactions to perform a task. For example, in situations in which a traditional system would require further interaction to obtain additional details, an example machine-learned agent system, leveraging a memory according to the present disclosure, can initiate execution with a single interaction. This can provide efficient use of computational resources by avoiding the processing of additional inputs and rendering of additional outputs for a given task.
[0061] Example machine-learned agent systems can further facilitate improved asynchronous execution of previously instructed tasks. Example machine-learned agent systems can leverage a memory of past interactions with a user to improve asynchronous execution of previously instructed tasks. An example agent system can store user preferences and instructions in a dynamic interaction memory datastore. The system can react to new events (e.g., inputs received from other systems, prompts rendered on a UI and viewed by the agent system, etc.) by using stored context to execute a response without requiring further synchronous interaction from the user. This can reduce a latency of tasks (e.g., measured between a time of the event and a time of the executed response) relative to traditional agent systems. Further, this can provide efficient use of computational resources by avoiding the processing of additional inputs and rendering of additional outputs for a given task.
[0062] Example machine-learned agent systems can reduce a computational load at runtime through pre-processing and storage of high-quality memory data objects. For example, by pre-processing user interactions offline, the system can create high-quality memory data objects containing key features and metadata. These objects can be stored in a structured format, enabling efficient retrieval at runtime. This reduces the computational load on the machine-learned model during online inference, as the system can respond by processing concise, relevant information instead of large amounts of raw data (e.g., such as a full transcript of all prior interactions). This can reduce latency in response to queries by reducing a complexity of an inference task.
[0063] Example machine-learned agent systems can enable improved control over stored contextual information. For example, stored memories can be grounded in information provided by the user to the agent system throughout a history of interactions. These interactions can be sourced directly from the user or based on authorizations provided by the user. The user can decline to provide information that the user does not wish to share, or the user can instruct the agent system to use the information for limited purposes and subsequently forget.
[0064] Example implementations are discussed in greater detail with respect to the Figures.
[0065] Figure 1 is a block diagram that illustrates an example implementation of a machine-learned agent system 100 according to example aspects of the present disclosure. Input interface(s) 102 can receive inputs (e.g., from a user). Query 104 can be based on data from input interfaces 102. Machine-learned agent system 100 can process query 104 to obtain one or more inferences and execute one or more actions. To condition the inferences based on relevant contextual information or memories, machine-learned agent system 100 can use a memory model 105.
[0066] Memory model 105 can include hand-tuned and/or learned logic that governs how machine-learned agent system 100 performs memory and recall to improve its performance. Memory model 105 can execute a recall cycle 106. Memory model 105 can execute a recall cycle 106 at runtime in an online execution (e.g., synchronously with processing of query 104) or previously in an offline execution. Machine-learned agent system 100 can issue a memory query 108 to interaction memory datastore 110. Interaction memory datastore 110 can contain stored contextual information catalogued and stored over prior interactions with the user. Data in interaction memory datastore 110 can be specific to the user. Machine-learned agent system 100 can obtain a memory response 112 that can include data retrieved from interaction memory datastore 110 responsive to memory query 108.
[0067] Machine-learned agent system 100 can obtain memory data 114 based on memory response 112. Memory data 114 can include or be based on memory values (and metadata) retrieved from memory objects stored in interaction memory datastore 110. Machine-learned agent system 100 can output input data structure 116 based on memory data 114. For example, input data structure 116 can include memory data 114 (e.g., in a textual or multimodal prompt).
[0068] Machine-learned model system(s) 118 can manage and perform execution of one or more machine-learned models based on inputs received from machine-learned agent system 100. Machine-learned model system 118 can process input data structure 116 to generate output data structure 120. For example, output data structure 120 can include predicted content that was predicted conditioned on data in input data structure 116 (e.g., including memory data 114).
[0069] Machine-learned agent system 100 can output response 122 based on output data structure 120. Response 122 can include data from output data structure 120. Response 122 can be a final or complete response to query 104, or response 122 can be a partial response to query 104 that effectuates a step in a multi-step response (e.g., performing a subtask in a multi-part task). Output interface(s) 124 can render, transmit, or execute data from response 122.
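The flow of Figure 1 as described in paragraphs [0065] to [0069] (query, recall, memory data, input data structure, model output, response) can be traced in a few lines. This is a toy sketch under stated assumptions: the tag-matching `recall` function stands in for memory query 108 and memory response 112, `stub_model` stands in for machine-learned model system(s) 118, and the prompt layout is invented for the example.

```python
def recall(query, datastore):
    """Placeholder recall cycle: return values whose tags appear in the query."""
    words = {word.strip(".,?!").lower() for word in query.split()}
    return [memory["value"] for memory in datastore if words & memory["tags"]]


def respond(query, datastore, model):
    """Condition the model call on recalled memory and return a response."""
    memory_data = "\n".join(recall(query, datastore))
    input_data_structure = f"Context:\n{memory_data}\n\nQuery: {query}"
    output_data_structure = model(input_data_structure)
    return output_data_structure.strip()


datastore = [
    {"value": "User eats a vegan diet.", "tags": {"recipe", "food", "restaurant"}},
    {"value": "User prefers window seats.", "tags": {"flight", "train"}},
]

prompts_seen = []


def stub_model(prompt):
    # Stand-in for a machine-learned model system; records the prompt it received.
    prompts_seen.append(prompt)
    return " Here are three vegan pizza recipes. "


response = respond("Suggest a pizza recipe", datastore, stub_model)
```

Only the memory relevant to the query reaches the prompt in this sketch, which mirrors the point that the model processes concise, relevant context rather than the full interaction history.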
[0070] Machine-learned agent system 100 can be or include processing logic, software, firmware, or hardware configured to automate one or more operations of or interactions with a computing device or system. For instance, machine-learned agent system 100 can be hosted on a local device or a cloud server to control operations of the host device or other devices. Machine-learned agent system 100 can receive inputs and initiate actions or tasks based on the inputs.
[0071] Machine-learned agent system 100 can be or include an artificial intelligence (“AI”) agent. Machine-learned agent system 100 can control machine-learned models and AI-enabled systems to help users solve tasks. For instance, machine-learned agent system 100 can employ one or more machine-learned models to generate outputs responsive to queries from users. As one example, an agent system can operate on a computing system configured to receive an input from a user device and provide an output responsive to the input to the user device. The agent system can be or can implement a multi-modal agent (e.g., a multi-modal artificial intelligence agent). For instance, a multi-modal agent can process inputs from one or more data modalities. In some implementations, the agent system can be implemented as a “situated agent” in which the agent system shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and/or textual data which are also observable by the human user. The agent system can process these inputs to generate responses that are contextually-relevant for the user’s physical or digital environment, for example enabling the agent system to generate dialogue or other responses or outputs which assist the user in understanding and/or navigating the environment.
[0072] Machine-learned agent system 100 can execute operations that include both learned operators (e.g., operators that execute predictions using machine-learned models having learned values, or operators that execute code generated based on predictions using machine-learned models having learned values) and non-learned operators. In some implementations, machine-learned agent system 100 can execute one or multiple learned operators to perform tasks and make decisions within a non-learned framework of rules and software infrastructure. For instance, an agent framework can include routing layers, I/O callbacks, or other subroutines that route and trigger processing of incoming information with various learned operators. For some processing stages, learned operators can apply predetermined recipes or templates for processing the information (e.g., using a series of prompt templates to ingest new information or format output information). For some processing stages, learned operators can use the outputs of a first prediction to control the inputs for downstream operations, such as by predicting parameters for application programming interface calls, invoking code interpreter environments for executing generated code, etc.
[0073] Machine-learned agent system 100 and any components thereof can engage one or more machine-learned models to generate inferences for performing various tasks. Machine-learned agent system 100 and any components thereof can interact with machine-learned model system(s) 118 to obtain inferences from one or multiple models. As described herein, reference to machine-learned agent system 100 and any components thereof using a machine-learned model can include machine-learned agent system 100 and any components thereof interacting with machine-learned model system(s) 118. As described herein, reference to machine-learned agent system 100 and any components thereof using a machine-learned model can include machine-learned agent system 100 and any components thereof interacting with machine-learned models other than machine-learned model system(s) 118.
[0074] Machine-learned agent system 100 can operate responsive to queries 104 obtained from one or more input interfaces 102. Machine-learned agent system 100 can execute to assist a user with a task by processing a query 104 based on a user input received from a user. For instance, machine-learned agent system 100 can reduce the complexity of the inputs needed to perform various tasks on a computing device or system. For example, machine-learned agent system 100 can use machine-learned models to recognize tasks to perform and execute operations to achieve the tasks based on inputs in view of accumulated context from prior inputs. In this manner, for instance, each input can be augmented by the model’s learned skillset as well as the available context from the interaction memory, so that even minimal inputs can be effective to initiate execution of complex tasks.
[0075] As used herein, a “user” can refer to a number of different entities including, as some examples, an account (e.g., a “user account” associated with a software or a service), a sub-account of an account, a person or individual, a corporation or corporate user, a legal entity or other defined entity, an administrator, a system manager, a computer-implemented user (e.g., an agent system, a debug user or testing user, etc.), and/or other suitable users. A user can be associated with a key or other credential that can authenticate inputs and outputs received from and output to the user. A user can be specified using a key or credential used to accompany or sign calls to an application programming interface.
[0076] Machine-learned agent system 100 can operate in a user-specific manner. For example, memory data can be stored in a user-specific manner. For instance, a user can be associated with one or more accounts and/or an account can be associated with one or more users, and memory data can be specific to an account or specific to a user. For instance, an account may be associated with one or more individuals that have access to (e.g., manage) the account, and user-specific memory data may be associated with the individuals themselves and/or the account. As one particular example, one user can be a corporate entity associated with a corporate account, and one or more employees of the corporate entity can each be individual users that have access to memory data linked to the account associated with the corporate entity. Each of the individual users can further be linked to memory data associated with their respective user profiles or accounts. In this manner, for instance, multiple overlapping memories can be stored and accessed as desired. In some additional examples, a user or account can be associated with one or more user sub-accounts or profiles. For example, an account may have a first profile associated with personal use and a second profile associated with business use. Memory data may be associated with each profile. In some implementations, these example aspects may be combined in a variety of combinations. For example, a user may be associated with a first user-specific memory associated with a first profile and a second user-specific memory associated with a second profile, each of which is inaccessible by the other profile, and a third user-specific memory that is associated with the user (e.g., the user’s overall account) directly such that it is accessible to both the first profile and the second profile.
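As a non-limiting illustration, the profile-scoped arrangement described above (two mutually inaccessible profile memories plus an account-level memory visible to both) can be sketched as follows. The class names, method names, and sample entries are hypothetical and are not taken from the disclosure; the sketch assumes memories are plain text strings.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryScope:
    """Hypothetical container for memory entries tied to one owner."""
    owner: str
    entries: list[str] = field(default_factory=list)


class UserMemories:
    """Hypothetical store: one account-level scope plus one scope per profile."""

    def __init__(self, user_id, profile_ids):
        self.account = MemoryScope(owner=user_id)
        self.profiles = {
            p: MemoryScope(owner=f"{user_id}/{p}") for p in profile_ids
        }

    def remember(self, text, profile=None):
        # With no profile given, the entry is stored at the account level.
        scope = self.account if profile is None else self.profiles[profile]
        scope.entries.append(text)

    def visible_to(self, profile):
        # A profile sees account-level entries and its own entries only.
        return self.account.entries + self.profiles[profile].entries


memories = UserMemories("user-1", ["personal", "business"])
memories.remember("Prefers metric units")
memories.remember("Gym on Tuesdays", profile="personal")
memories.remember("Quarterly report due Friday", profile="business")

personal_view = memories.visible_to("personal")
business_view = memories.visible_to("business")
```

In this sketch the isolation between profiles comes only from how `visible_to` assembles its result; an actual implementation could instead enforce it with credentials or access control on the datastore.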
[0077] Input interface(s) 102 can include various devices or systems that receive input signals or other data from users or other devices or systems. Examples of such devices or systems can include microphones and cameras that capture voice and visual inputs, respectively, keyboards and touchscreens that allow for textual and touch-based inputs, sensors, transducers, or other digital or analog signal sources, including network adapters, wireless receivers, etc.
[0078] Input interface(s) 102 can process a variety of inputs provided by users or automated systems. For instance, users can provide voice commands, typed messages, or selections made via touchscreen interfaces. For instance, users can provide input through spoken commands detected via an analog transducer, and a system can convert the detected signals into a digital format that the machine-learned agent system can process. External devices and systems can communicate via input interface(s) 102 to transmit messages or other data objects or signals to machine-learned agent system 100. Input interface(s) 102 can be implemented using an application programming interface (API) exposed over a network or within an execution environment. For example, machine-learned agent system 100 can expose an application programming interface accessible by one or more other applications (e.g., a front end application hosting a user interface, a back-end application, such as an operating system, or other software application) to engage machine-learned agent system 100 to perform tasks.
[0079] Query 104 can be or include any type of data provided to machine-learned agent system 100 for processing. Query 104 can be structured or unstructured and may come from diverse sources such as text entries, voice commands, stored files, sensor outputs, downloaded networked content, etc. For example, a user might type a request into a text interface, speak a command into a voice-activated device, or a connected device might automatically send data based on certain triggers.
[0080] Query 104 can include instructions to perform a task. Various example tasks are described herein with respect to the description of example machine-learned model 1.
[0081] Query 104 can include a declaration of information. The information can be declared in a recorded statement from a user or data recorded from other sources. Machine-learned agent system 100 can ingest query 104 to determine whether and how to remember the provided information.
[0082] Query 104 can contain a single data modality or multiple data modalities. Various example data modalities for query 104 are described herein with respect to the description of inputs to example machine-learned model 1. Query 104 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0083] An example query can include image data. For example, image data can include on-screen context data from a user device, such as a screenshot or screen recording. On-screen context data can include image or textual content representing the visual elements displayed on the user’s screen at the time of the interaction. This on-screen context data can be captured using various techniques, such as screen capture APIs, screen recording APIs, or other methods capable of capturing the visual output of a user device. The on-screen context data can be encoded in various formats, such as JPEG, PNG, or video formats, and can be included as part of the query. On-screen context data can be combined with other data modalities within query 104, such as user instructions, audio recordings, or sensor data. On-screen context data can also be used to improve the performance of machine-learned agent system 100 in assisting user tasks by conditioning agent responses on the specific visual context of the interaction. Example on-screen captures can be obtained depicting a variety of software applications, such as word processing applications, image editing applications, browser applications, operating systems and file management systems, gaming applications, media applications, etc. Machine-learned agent system 100 can ingest such content to generate outputs to control the depicted applications (e.g., generating application programming interface calls to directly provide inputs to the application via application programming interfaces of the application; generating application programming interface calls to simulate user inputs to the application; etc.).
[0084] Query 104 can include audio data representing audio being rendered by a user device or audio being received at a microphone of the user device. This audio data can be stored in various formats. Metadata associated with the audio data can include timestamps indicating the start and end times of the audio segment, a source identifier specifying whether the audio was captured by the device’s microphone or rendered by the device’s speakers, or other contextual information such as the application or process that generated or captured the audio. For instance, if the audio is part of a newscast or other content being played on the device, the metadata could include the name of the content and a summary of the specific segment being played. If the audio is captured from the device’s microphone, the metadata could include environmental information or other contextualizing descriptors (e.g., “morning walk”). Audio information can then be used by machine-learned models to better understand user intentions, provide more relevant responses, or enhance the user’s overall experience. For example, if the user is listening to a newscast and asks a question related to a specific topic mentioned in the newscast, the system can use the audio data to identify the relevant segment of the newscast and provide a more accurate and contextually appropriate response. Similarly, if the user is in an outdoor environment and asks about the surrounding wildlife, the system can analyze the audio data to identify bird calls or other animal sounds and provide information about the species present.
[0085] Query 104 can be ingested by machine-learned agent system 100 using an input ingestion system. An input ingestion system can process and analyze input data received from users or other systems. An input ingestion system can parse the raw query data, which may be in structured or unstructured forms, such as text entries, voice commands, or other digital formats. An input ingestion system can employ natural language processing (NLP) techniques to interpret and transform textual data into a structured format that can be further utilized within the system. For instance, NLP techniques such as tokenization, entity recognition, and syntactic parsing can be used to parse the input data into chunks. Chunks can encapsulate semantically associated portions of query 104. An input ingestion system can use machine-learned model system(s) 118 to ingest query 104. For example, an input ingestion system can pass query 104 to a machine-learned model to generate a set of chunks. A machine-learned model can process query 104 and return a set of chunks that each collect, summarize, or otherwise represent semantically associated portions of query 104.
[0086] Memory model 105 can be or include processing logic, software, firmware, or hardware configured to govern the storage and recall of memory data describing interactions with machine-learned agent system 100. Memory model 105 can execute operations that include both learned operators (e.g., operators that execute predictions using machine-learned models having learned values, or operators that execute code generated based on predictions using machine-learned models having learned values) and non-learned operators.
[0087] Memory model 105 can execute one or multiple learned operators to perform tasks and make decisions within a non-learned framework of rules and software infrastructure. For instance, memory model 105 may execute learned operators that apply predetermined recipes or templates for processing incoming information (e.g., using a series of prompt templates to ingest new information or format output information). For some processing stages, learned operators can use the outputs of a first prediction to control the inputs for downstream operations, such as by predicting parameters for application programming interface calls, invoking code interpreter environments for executing generated code, etc.
[0088] Memory model 105 can control the storage of new memories. For example, memory model 105 can perform one or more operations to ingest new information to extract information to remember. For example, memory model 105 can use one or more inferences generated using one or more machine-learned models to process one or multiple inputs (e.g., a single new input, a dialog between a user and agent system 100) to predict new information to remember. Predicting new information to remember can include generating a summary of information to remember. Predicting new information to remember can include generating a portion or excerpt of content to remember. For example, memory model 105 can invoke a machine-learned model input (e.g., after one or more inputs are received) that causes a machine-learned model to predict whether there is anything in the query that is to be remembered.
[0089] Memory model 105 can control the updating of stored memories. For example, memory model 105 can purge or update stale information based on new information. Memory model 105 can query interaction memory datastore 110 to retrieve relevant entries for new information and update the relevant entries based on the new information. The new information can replace the information stored in the relevant entries (e.g., an updated date for an event) or can instruct deletion of the information stored in the relevant entries (e.g., an instruction to forget a particular conversation).
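As a non-limiting illustration, the two update behaviors described above (replacing stale information, and deleting entries on an instruction to forget) can be sketched over a simple key-value store. The function name, the keys, and the sample values are hypothetical; retrieval of the relevant entry is assumed to have already resolved to a key.

```python
def apply_update(store, key, new_value=None, forget=False):
    """Hypothetical helper: replace or delete one memory entry in place."""
    if forget:
        # An instruction to forget removes the entry if it exists.
        store.pop(key, None)
    elif new_value is not None:
        # New information replaces the stale value for the same entry.
        store[key] = new_value
    return store


store = {
    "dentist_appointment": "2025-03-04",
    "chat_2025_01_10": "Discussed vacation plans",
}

# An updated date for an event replaces the stored date.
apply_update(store, "dentist_appointment", new_value="2025-03-11")
# An instruction to forget a particular conversation deletes its entry.
apply_update(store, "chat_2025_01_10", forget=True)
```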
[0090] Memory model 105 can control the recall of stored memories. For example, memory model 105 can query interaction memory datastore 110 to retrieve information for processing a query 104. For example, memory model 105 can initiate similarity search operations to retrieve a number of relevant stored memories for a particular query. Memory model 105 can use one or more machine-learned models to generate a query (e.g., generate a SQL string) for querying interaction memory datastore 110 to retrieve memories for a particular query.
[0091] Recall cycle 106 can include a set of operations performed by machine-learned agent system 100 to retrieve data from interaction memory datastore 110. Memory model 105 can control recall cycle 106. In general, recall cycle 106 can include querying interaction memory datastore 110 using a memory query 108 and receiving memory response 112.
[0092] Recall cycle 106 can occur at query time (e.g., at runtime processing of query 104). Recall cycle 106 can be performed to recall relevant memory data associated with query 104.
[0093] Recall cycle 106 can occur in an offline manner in advance of runtime or in advance of processing query 104. For example, baseline information applicable to multiple different queries can be recalled offline and cached by memory model 105 to improve latency at runtime. In an example, memory model 105 can maintain a cached memory snapshot of top-K memory values. For instance, memory model 105 can store a string of listed declarations of information from interaction memory datastore 110. The string can be inserted into one or more inputs to one or more machine-learned models to condition inference on the memory snapshot.
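As a non-limiting illustration, a cached snapshot of top-K memory values can be sketched as below. The class name, the numeric scores, and the bullet-list string format are assumptions made for illustration; the disclosure does not specify how the top-K values are ranked.

```python
class SnapshotCache:
    """Hypothetical cache holding a pre-built string of top-K memories."""

    def __init__(self, k):
        self.k = k
        self._snapshot = ""

    def refresh(self, scored_entries):
        # Offline step: rank (text, score) pairs and keep the K highest.
        top = sorted(scored_entries, key=lambda e: e[1], reverse=True)[: self.k]
        self._snapshot = "\n".join(f"- {text}" for text, _ in top)

    def get(self):
        # Runtime step: return the cached string without querying the store.
        return self._snapshot


cache = SnapshotCache(k=2)
cache.refresh(
    [
        ("Name is Sam", 0.9),
        ("Prefers metric units", 0.7),
        ("Asked about weather once", 0.1),
    ]
)
cached_snapshot = cache.get()
```

The design point is that `refresh` can run ahead of time, so that at query time only the inexpensive `get` call sits on the latency-sensitive path.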
[0094] Memory query 108 can include a command, string (e.g., SQL string), vector, or other data object or message that is configured to filter, sort, or otherwise engage with a structure of interaction memory datastore 110 to facilitate the retrieval of relevant memory data.
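As a non-limiting illustration, a memory query expressed as a SQL string can be sketched using Python's standard `sqlite3` module and an in-memory database. The table name, column names, and sample rows are hypothetical; in the described system the string could instead be generated by a machine-learned model.

```python
import sqlite3

# Hypothetical schema standing in for interaction memory datastore 110.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    "id INTEGER PRIMARY KEY, user_id TEXT, topic TEXT, "
    "value TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO memories (user_id, topic, value, created_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("user-1", "food", "Allergic to peanuts", "2025-01-02"),
        ("user-1", "food", "Likes Thai food", "2025-02-10"),
        ("user-1", "travel", "Prefers aisle seats", "2025-01-20"),
        ("user-2", "food", "Vegetarian", "2025-01-05"),
    ],
)

# The memory query filters by user and topic and sorts newest first.
memory_query = (
    "SELECT value FROM memories "
    "WHERE user_id = ? AND topic = ? "
    "ORDER BY created_at DESC LIMIT ?"
)
rows = conn.execute(memory_query, ("user-1", "food", 2)).fetchall()
memory_response = [row[0] for row in rows]
conn.close()
```

Values are bound with `?` placeholders rather than concatenated into the string, which matters when parts of the query originate from model output or user input.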
[0095] Interaction memory datastore 110 can be a data structure configured for the structured storage of data. Interaction memory datastore 110 can be or include a relational or non-relational database, a data table, a document, a file system, or other structured data representation. Examples of devices or systems that can be used to implement interaction memory datastore 110 include traditional relational databases, NoSQL databases, in-memory data stores, and distributed file systems. Additionally, cloud storage solutions provided by network-hosted platforms can also be used. Interaction memory datastore 110 can be implemented locally to machine-learned agent system 100 (e.g., on a same device or system) or remotely from machine-learned agent system 100 (e.g., on a different device or system).
[0096] Interaction memory datastore 110 can store data of various different modalities. Interaction memory datastore 110 can store text data. Interaction memory datastore 110 can store image data. Interaction memory datastore 110 can store audio data. Interaction memory datastore 110 can store combined audio and image data (e.g., video data). Interaction memory datastore 110 may store arbitrary data types. Various example data modalities that can be stored in interaction memory datastore 110 are described herein with respect to the description of inputs to example machine-learned model 1.
[0097] Interaction memory datastore 110 can store memory data in a native modality in which the interaction was performed. For example, an agent system that implements a machine-learned model that can perform inference natively on input audio data to generate responses (e.g., rather than requiring a transcription to text first) can store, in interaction memory datastore 110, audio recordings of interactions in addition to or in lieu of transcribed text. These audio recordings can capture rich contextual information that may be lost in the transcription, such as cadence, tone, inflection, background noise, etc. of a user's utterance. Similarly, an agent system that implements a machine-learned model that can perform inference natively on input image data to generate responses (e.g., rather than requiring a transcription to text first) can store, in interaction memory datastore 110, image data of interactions in addition to or in lieu of captioning text. The image data (e.g., video frames, still image captures) can capture rich contextual information that may be lost in captioning, such as environmental information, lighting, colors, mood, facial expressions, body posture, etc. In general, an agent system that implements a machine-learned model to perform inference natively on multiple modalities of input data to generate responses can store, in interaction memory datastore 110, multiple of such modalities of memories.
[0098] Interaction memory datastore 110 can include a vector-based data recall structure. For instance, a vector database can store embedded representations of memory values to facilitate similarity searches based on an embedding of a query value. Based on identified similarity matches, a corresponding data record in interaction memory datastore 110 for a given matched vector in the vector database can be retrieved and served in response to the query.
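As a non-limiting illustration, the vector-based recall described above can be sketched with a small in-process index and cosine similarity. The three-dimensional vectors are hand-written toy values rather than outputs of an embedding model, and the record identifiers and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Data records, keyed by record id.
records = {
    "m1": "Favorite cuisine is Thai",
    "m2": "Flight to Denver on Friday",
    "m3": "Allergic to peanuts",
}
# Toy embedded representations of the same records (assumed values).
vector_index = {
    "m1": [0.9, 0.1, 0.0],
    "m2": [0.0, 0.2, 0.9],
    "m3": [0.8, 0.3, 0.1],
}


def recall(query_vec, top_k):
    # Rank record ids by similarity, then serve the corresponding records.
    ranked = sorted(
        vector_index,
        key=lambda rid: cosine(query_vec, vector_index[rid]),
        reverse=True,
    )
    return [records[rid] for rid in ranked[:top_k]]


results = recall([1.0, 0.2, 0.0], top_k=2)
```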
[0099] Interaction memory datastore 110 can include a hierarchical storage structure. For example, a low-precision storage layer can provide for rapid querying with low precision. For some queries, low precision may be sufficient. For other queries, the results of the rapid low-precision query can be used to direct or guide slower searches through more detailed storage layers (e.g., higher-dimensional vector embeddings, higher-detail textual, audio, or image entries).
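As a non-limiting illustration, a two-layer search in which a rapid low-precision pass guides a slower detailed pass can be sketched as follows. The sketch assumes, purely for illustration, that the low-precision layer is a truncation of each full vector to its first two dimensions; the disclosure does not prescribe how the layers are derived. All vectors and names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Detailed layer: full-dimensional toy vectors (assumed values).
fine = {
    "m1": [0.9, 0.1, 0.8, 0.0],
    "m2": [0.8, 0.2, 0.0, 0.9],
    "m3": [0.0, 0.9, 0.1, 0.1],
    "m4": [0.1, 0.8, 0.7, 0.2],
}
# Low-precision layer: first two dimensions only (an assumption).
coarse = {rid: vec[:2] for rid, vec in fine.items()}


def hierarchical_search(query, shortlist_size, top_k):
    # Stage 1: rapid, low-precision scoring over every entry.
    shortlist = sorted(
        coarse, key=lambda rid: dot(query[:2], coarse[rid]), reverse=True
    )[:shortlist_size]
    # Stage 2: detailed scoring over the shortlist only.
    reranked = sorted(
        shortlist, key=lambda rid: dot(query, fine[rid]), reverse=True
    )
    return reranked[:top_k]


best = hierarchical_search([1.0, 0.0, 0.0, 1.0], shortlist_size=2, top_k=1)
```

In this toy data the coarse pass ranks `m1` ahead of `m2`, while the detailed pass reverses that order, which is the case where the second layer changes the result.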
[0100] Interaction memory datastore 110 can be multimodal. For example, one or more queries can include data of a first modality and one or more queries can include data of a second modality. Interaction memory datastore 110 can store memory information in both the first and the second modality. For example, interaction memory datastore 110 can store text data and image data, audio data and image data, text data and audio data, or other combinations of data modalities.
[0101] Memory response 112 can include data retrieved from interaction memory datastore 110. Memory response 112 can include one or more complete or partial entries from interaction memory datastore 110.
[0102] Memory data 114 can be or include data based on memory response 112. Memory data 114 can be or include all or part of memory response 112. For example, memory values stored in memory objects returned in memory response 112 can be extracted from the corresponding memory objects and compiled into memory data 114.
[0103] Memory data 114 can be generated based on memory response 112. For example, memory model 105 can receive memory objects returned in memory response 112. Memory model 105 can input the memory objects returned in memory response 112 to a machine-learned model to generate a summary or other representation of the memory objects returned in memory response 112.
[0104] Memory data 114 can contain a single data modality or multiple data modalities. Various example data modalities for memory data 114 are described herein with respect to the description of inputs to example machine-learned model 1. Memory data 114 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc. In an example, memory data 114 can be in a format that matches the source of the memory. For instance, image-based memories can be represented with image data; audio-based memories can be represented with audio data; text-based memories can be represented with text data. A multimodal machine-learned model can be configured to process multimodal memory data 114 to condition generation of content based on the memory data.
[0105] Input data structure 116 can be or include a data object configured for input to machine-learned model system(s) 118. Input data structure 116 can include a structure defined based on an application programming interface of machine-learned model system(s) 118.
[0106] Input data structure 116 can contain a single data modality or multiple data modalities. Various example data modalities for input data structure 116 are described herein with respect to the description of inputs to example machine-learned model 1. Input data structure 116 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0107] Input data structure 116 can include memory data 114. Input data structure 116 can include data based on query 104 (e.g., all or part of query 104). Input data structure 116 can be formatted or otherwise configured based on an input template or predefined structure for processing all or part of query 104 in view of memory data 114. For instance, input data structure 116 can include a prompt for prompting a machine-learned sequence processing model to generate an output sequence. The prompt can include memory data 114 and all or part of query 104.
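As a non-limiting illustration, formatting an input data structure from a template that combines memory data with the query can be sketched as below. The template wording, the dictionary layout, and the function name are hypothetical; an actual structure would follow the application programming interface of the model system in use.

```python
# Hypothetical input template with slots for memory data and the query.
PROMPT_TEMPLATE = (
    "You are assisting a user.\n"
    "Relevant memories:\n{memory}\n\n"
    "User query: {query}\n"
)


def build_input(memory_data, query):
    """Fill the template; fall back to a marker when no memories exist."""
    memory_block = "\n".join(f"- {item}" for item in memory_data) or "- (none)"
    return {"prompt": PROMPT_TEMPLATE.format(memory=memory_block, query=query)}


input_data = build_input(
    ["Allergic to peanuts", "Likes Thai food"],
    "Suggest a dinner recipe.",
)
empty_input = build_input([], "Suggest a dinner recipe.")
```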
[0108] In an example, machine-learned agent system 100 can provide a multimodal input data structure 116 for processing by a multimodal machine-learned model of machine-learned model system(s) 118. The multimodal machine-learned model can ingest the multimodal input data structure 116 to generate outputs conditioned on memory data based on multiple different data modalities. The multimodal machine-learned model can generate output values in one or multiple modalities.
[0109] Machine-learned model system(s) 118 can be or include processing logic, software, firmware, or hardware configured to host and execute machine-learned models to obtain predictions. For example, machine-learned model system(s) 118 can include software platforms configured to manage the storage and execution of various different machine-learned models, various different adapters or other profiles for one or more machine-learned models, etc. Machine-learned model system(s) 118 can be implemented locally to machine-learned agent system 100 (e.g., on a same device or system) or remotely from machine-learned agent system 100 (e.g., on a different device or system). Machine-learned model system(s) 118 can load machine-learned parameters from storage into memory devices (e.g., memory of one or more hardware accelerator devices), transform inputs based on an architecture of the loaded machine-learned model, maintain a cache of intermediate states (e.g., latent or otherwise) for the machine-learned model during execution, and return outputs generated by the executed model. Machine-learned model system(s) 118 can execute, for instance, one or more machine-learned model(s) 1. Example machine-learned model types and configurations that can be used to process input data structure 116 are described herein with respect to machine-learned model 1. Example aspects of machine-learned model system(s) 118 are described herein with respect to model host 31.
[0110] Machine-learned model system(s) 118 can facilitate interactions between components of machine-learned agent system 100 and one or more machine-learned models. Machine-learned model system(s) 118 can directly execute one or more machine-learned models or can provide API access to other systems (on-device or on external devices) that execute one or more machine-learned models using inputs provided via the API.
[0111] Machine-learned model system(s) 118 can provide access to generalist models. For instance, machine-learned model systems 118 can provide access to foundational models that are configured to perform inference for a wide variety of tasks. Machine-learned model systems 118 can provide access to machine-learned sequence processing models, such as large language models or “LLMs” or small language models or “SLMs,” vision-language models or “VLMs,” vision models (e.g., convolutional neural nets), audio models, etc.
[0112] Machine-learned model systems 118 can provide access to a variety of specialized models that implement various functionality of components of machine-learned agent system 100. The models can be designed and trained (e.g., fine-tuned) to perform specific tasks such as parsing and analyzing input data, extracting information to remember, predicting a classification or type of information, extracting memories relevant to a query, etc.
[0113] Machine-learned model systems 118 can include or invoke various types of hardware and software components specifically designed to execute machine learning algorithms and models. Examples of devices or systems that can be used to implement machine-learned model systems 118 include dedicated machine learning engines equipped with one or more high-performance GPUs or other hardware accelerators for accelerated computing. These environments can be hosted locally on an edge device or hosted on a server to offload computational tasks.
[0114] Output data structure 120 can be or include a data object output by machine-learned model system(s) 118. Output data structure 120 can include a structure defined based on an application programming interface of machine-learned model system(s) 118.
[0115] Output data structure 120 can contain a single data modality or multiple data modalities. Various example data modalities for output data structure 120 are described herein with respect to the description of outputs from example machine-learned model 1. Output data structure 120 can be or include any one or more of text or other symbolic data, image data, audio data, compressed or encoded data, etc.
[0116] Output data structure 120 can include content generated by one or more machine-learned models executed by machine-learned model system(s) 118. Output data structure 120 can include metadata associated with the generated content, such as confidence values, scores (e.g., based on an evaluation policy), logits, etc.
[0117] Machine-learned agent system 100 can invoke one or more machine-learned models over multiple turns or iterations (e.g., N iterations) prior to preparing and outputting response 122. For example, machine-learned agent system 100 can implement a chain-of-thought self-deliberation technique to generate content describing reasoning about query 104 in view of memory data 114. This generated content can be used to condition future generations of additional content in service of a final response 122. For example, a first input data structure can contain an instruction to generate a thorough analysis of query 104 in view of memory data 114. Output data structure 120 can contain the generated analysis. A second input data structure can contain the generated analysis along with an instruction to generate a final response based on the analysis. A second output data structure can contain the final response. Machine-learned agent system 100 can generate response 122 based on the second output data structure, which was based on the first output data structure. The raw outputs may not be surfaced to the user. For example, only a response 122 might be output to output interface(s) 124. The intermediate outputs generated during the N iterations may be logged for analysis but may not be used in an output.
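As a non-limiting illustration, the two-iteration flow described above (a first pass producing an analysis, a second pass producing the final response, with the intermediate output logged but not surfaced) can be sketched as follows. The `stub_model` function is a hypothetical stand-in that returns canned text so the sketch runs without a real model; the prompt wording is likewise assumed.

```python
def run_two_pass(model, query, memory_data):
    """Hypothetical two-iteration invocation; `model` maps text to text."""
    first_input = (
        "Analyze the query in view of the memory.\n"
        f"Memory: {memory_data}\nQuery: {query}"
    )
    analysis = model(first_input)  # first output: intermediate analysis

    second_input = (
        f"Analysis: {analysis}\n"
        "Write the final response based on the analysis."
    )
    final = model(second_input)  # second output: final response

    # Only the response is surfaced; the analysis is kept for logging.
    return {"response": final, "log": [analysis]}


def stub_model(text):
    # Canned outputs standing in for a machine-learned model (assumption).
    if text.startswith("Analyze"):
        return "User is allergic to peanuts; avoid peanut dishes."
    return "Try the green curry, which contains no peanuts."


result = run_two_pass(
    stub_model, "Suggest a Thai dish.", "Allergic to peanuts"
)
```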
[0118] Response 122 can include data from output data structure 120. Response 122 can be a final or complete response to query 104, or response 122 can be a partial response to query 104 that effectuates a step in a multi-step response (e.g., performing a subtask in a multi-part task).
[0119] Response 122 can include data for output to a user interface for rendering for a user. Response 122 can include text, image, audio, or other data that an output interface can render for a user.
[0120] Response 122 can include data for communication to a user (e.g., to a device associated with an account, to another agent, to an external system). Response 122 can include instructions to control a receiving system. For instance, response 122 can include an application programming interface call generated based on output data structure 120. For instance, one or more parameters of the application programming interface call can be generated by a machine-learned model and provided to machine-learned agent system 100 in output data structure 120. The application programming interface call can be parsed and packaged into response 122. Response 122 can be output via an output interface to a receiving system (e.g., over a network, over a system bus, via a software queue or operating system communication pathway) to initiate an action of the receiving system according to the application programming interface call.
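As a non-limiting illustration, parsing a model-generated application programming interface call and packaging it into a response can be sketched as below. The sketch assumes the model emits the call as a JSON string; the call name `set_thermostat`, its parameter, and the allowlist check are hypothetical additions for illustration and are not described in the disclosure.

```python
import json

# Hypothetical allowlist: call name -> exact set of expected parameters.
ALLOWED_CALLS = {"set_thermostat": {"temperature_c"}}


def package_api_call(model_output):
    """Parse a JSON-encoded call and package it for an output interface."""
    call = json.loads(model_output)
    name, params = call["name"], call["parameters"]
    # Reject calls the receiving system is not expected to accept.
    if name not in ALLOWED_CALLS or set(params) != ALLOWED_CALLS[name]:
        raise ValueError(f"unsupported call: {name}")
    return {"type": "api_call", "name": name, "parameters": params}


response = package_api_call(
    '{"name": "set_thermostat", "parameters": {"temperature_c": 21}}'
)
```

Validating the parsed call before it is dispatched is one way to keep generated parameters from triggering unintended actions on the receiving system.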
[0121] Output interfaces 124 can include various mechanisms and devices that enable the machine- learned agent system to communicate with users or other systems. These interfaces may consist of graphical user interfaces (GUIs), audio output devices, network connectivity devices, API libraries, and other components that can send data or render data relevant to the agent actions executed by the system. For example, GUIs can display responses to questions, results of a task, notifications about upcomingtasks, or alerts regarding any required user inputs. Audio outputs can provide auditory responses (e.g., spoken language responses) or alerts
[0122] Output interface(s) 124 can include network interfaces that enable machine-learned agent system 100 to send data or commands to other systems. For instance, the output interfaces can facilitate the execution of API calls to external services.
[0123] In an example implementation, an input interface 102 and an output interface 124 can be associated with the same software application or device. For example, a virtual assistant or agent application can provide an interface for interacting with a virtual assistant. The interactions can be implemented in one or multiple interaction modalities, such as a textual chat, a voice conversation, a situated agent (e.g., implemented using a wearable device). Inputs (e.g., query 104) can represent one part of a dialog corresponding to a user and responses 122 can represent the other part of the dialog corresponding to the agent's responses.
[0124] In a practical use case example, a user can be walking along a walkway. The user can operate a mobile device that captures imagery of surrounding buildings. The user can utter a question, and the utterance can be recorded by the mobile device and input to machine-learned agent system 100 as a query. For example, the question can be, “Who was the architect who designed this building?” The mobile device can store an image of the building and pass it to machine-learned agent system 100 as part of the query.
[0125] Machine-learned agent system 100 can process this query. Machine-learned agent system 100 can determine that no saved memories help answer the query. Machine-learned agent system 100 can invoke a machine-learned model to assist in answering the query. A machine-learned model can perform image analysis and language analysis to recognize the building and interpret the question. The machine-learned model can generate an output (e.g., an output string) that contains an application programming interface call to engage a search engine service to perform a search based on the question and the image. Machine-learned agent system 100 can receive the output string and output a first response 122 via an output interface over a network to communicate with the search engine service to execute the application programming interface call. Machine-learned agent system 100 can receive via an input interface over a network a response from the search engine service that identifies the architect.
[0126] Machine-learned agent system 100 can invoke a machine-learned model to assist in answering the query based on the response from the search engine. A machine-learned model can perform language analysis to interpret the response from the search engine and formulate a natural language response to the user’s query. The machine-learned model can generate an output that responds to the user’s query. For example, output data structure 120 can contain a generated audio segment. Machine-learned agent system 100 can receive the audio segment and output the audio segment in response 122 to an audio driver output interface of the mobile device for rendering the audio segment.
[0127] The user can respond to the answer by uttering a statement, “Thanks! Remember her. Her work is probably my favorite architecture.” The utterance can be recorded by the mobile device and input to machine-learned agent system 100 as a query. Machine-learned agent system 100 can process this query. Machine-learned agent system 100 can invoke a machine-learned model to assist in processing the query. For example, memory model 105 can generate a machine-learned model input that causes a machine-learned model to predict whether there is anything in the query that is to be remembered. The prediction can be affirmative based on the user's utterance. Based on an affirmative prediction, memory model 105 can store data based on the query. Memory model 105 can store the entire input or can invoke a machine-learned model to return a portion of the query that is to be remembered (this invocation can be the same as or different from the prior invocation to determine whether there is anything in the query that is to be remembered). Memory model 105 can cause this information to be stored in interaction memory datastore 110.
[0128] Accordingly, in a future interaction, in the same or in a different session, the user can utter a command, “Take me to another building by my favorite architect.” The utterance can be recorded by the mobile device and input to machine-learned agent system 100 as a query. Machine-learned agent system 100 can process this query. For example, memory model 105 can execute a recall cycle 106 to recall data associated with the user's favorite architect. The stored memory from the prior interaction can be recalled and input to a machine-learned model along with the uttered command. The machine-learned model can generate an output (e.g., an output string) that contains an application programming interface call to engage a search engine service to perform a search based on the command to find nearby buildings by that architect. Machine-learned agent system 100 can receive the output string and output a first response 122 via an output interface over a network to communicate with the search engine service to execute the application programming interface call. Machine-learned agent system 100 can receive via an input interface over a network a response from the search engine service that identifies the next nearest building by the architect.
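As an illustrative, non-limiting sketch, the store-then-recall flow of this example can be outlined as follows. Keyword matching stands in here for the machine-learned predictions described above; the names `should_remember`, `tokens`, `STOPWORDS`, and the class layout are assumptions introduced for illustration only, and the stored string uses a placeholder rather than an actual architect name.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical stopword list so that common words do not trigger a recall.
STOPWORDS = {"a", "the", "is", "my", "me", "to", "by", "what", "her"}


def tokens(text: str) -> Set[str]:
    """Lowercased words with simple punctuation stripped and stopwords removed."""
    return {w.strip(".,!?:").lower() for w in text.split()} - STOPWORDS


def should_remember(query: str) -> bool:
    """Stand-in for a model prediction of whether the query contains something to store."""
    return "remember" in tokens(query)


@dataclass
class InteractionMemoryDatastore:
    memories: List[str] = field(default_factory=list)

    def store(self, memory: str) -> None:
        self.memories.append(memory)

    def recall(self, query: str) -> List[str]:
        """Return stored memories that share at least one keyword with the query."""
        wanted = tokens(query)
        return [m for m in self.memories if wanted & tokens(m)]


datastore = InteractionMemoryDatastore()
utterance = "Thanks! Remember her. Her work is probably my favorite architecture."
if should_remember(utterance):
    datastore.store("favorite architect: <architect name>")
recalled = datastore.recall("Take me to another building by my favorite architect.")
```

In this sketch, `recalled` would be supplied to a machine-learned model together with the uttered command; a practical implementation could replace the keyword overlap with a learned retrieval step.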
[0129] Machine-learned agent system 100 can invoke a machine-learned model to assist in answering the query based on the response from the search engine. A machine-learned model can generate an output (e.g., an output string) that contains an application programming interface call to engage a mapping service to generate navigation instructions based on the identified nearby building. Machine-learned agent system 100 can receive the output string and output a first response 122 via an output interface over a network to communicate with the mapping service to execute the application programming interface call. Machine-learned agent system 100 can receive via an input interface over a network a response from the mapping service that identifies navigation instructions to the building. Alternatively, execution of the application programming interface call output by machine-learned agent system 100 can initiate execution of a mapping application on the mobile device that provides step-by-step instructions to navigate to the building.
[0130] In this manner, for instance, machine-learned agent system 100 can remember interactions over multiple sessions and intelligently select and use context from prior interactions to inform later processing steps. This capability to remember can be used for not only recalling information as a memory aid but also to configure and customize the execution of tasks on behalf of a user.
[0131] Another practical example implementation relates to navigating web-based resources using a browser application. Web-based resources (e.g., web applications, web pages, etc.) often require users to interact with pop-up user interface features to proceed to view or interact with underlying content on the resource. Such pop-up user interface features may relate to configuration parameters for analytics systems associated with the web-based resources.
[0132] Such interface features can be intrusive and cause delays in rendering the subject content of a web resource. A machine-learned agent system 100 could respond to the interface features on behalf of a user in near real-time to reduce a latency of loading the underlying content. To do so, machine-learned agent system 100 can recall previously-expressed preferences regarding such configuration parameters. Memories of previously-expressed preferences regarding such configuration parameters can be stored in interaction memory datastore 110 and retrieved for generating responses 112 that contain an application programming interface call that, when processed by a device interacting with web resources, can automatically execute the interaction with the pop-up user interface elements consistent with the user’s previously-expressed preferences.
[0133] In this manner, for instance, the described memory system may facilitate a more intuitive and adaptive approach to configuration management by leveraging contextual awareness and user history. The system may analyze the context of each interaction, including the user’s current task, the application being used, and the type of configuration being requested, along with the user’s short-term and long-term history of interactions and configuration decisions. This analysis may enable the system to infer values for configuration in situations where the user's past behavior or stated preferences provide a high confidence inference of configuration parameter values. For instance, if a user has consistently configured web resources to use location services in a specific application or processing context, the system may infer configuration for continued access without requiring the user to respond to repeated prompts. Furthermore, the system may learn user preferences regarding different types of data and levels of access.
[0134] The system's ability to dynamically update its memory based on user interactions and inferred configuration values may further improve the efficiency of a user experience. As the system learns more about the user’s preferences and behavior, it may become increasingly adept at anticipating configuration needs and minimizing interruptions. For example, in an application, the system may remember which data points the user consistently configures as accessible by one or more other systems, allowing it to automatically configure data access without requiring manual intervention. The system may also allow users to specify time horizons for inference, enabling temporary access to data for specific tasks or periods without establishing support for a long-term preference.
[0135] The system may seamlessly integrate with modern computing environments by utilizing standard APIs and data formats. The system may be designed to interact with various applications and services through well-defined interfaces. The system may also store configuration information in a structured and standardized format, facilitating interoperability and data exchange between different systems. The system's ability to handle multimodal inputs and outputs may further enhance its adaptability and usability across various computing platforms and devices. For example, the system may be able to interpret both textual and visual cues to infer configuration values, and provide feedback in various formats such as text, speech, or visual indicators.
[0136] Example implementations of machine-learned agent systems according to the present disclosure can provide a number of technical improvements over existing approaches, resolving several outstanding challenges in the field. For instance, some traditional agent systems that lack a memory capability may lead to interruptions or prompts that increase a required physiological effort (e.g., cognitive effort to redirect attention, musculoskeletal effort to provide additional inputs) associated with interacting with and controlling computing devices, as well as increased computational cost to process additional inputs and render additional outputs. For example, a user may utter a speech command to a user device, "Create a calendar entry for an appointment at my dentist during my lunch break tomorrow.” Interpreted without a memory of context, the command provides insufficient information to identify the dentist or the time interval during which the user breaks for lunch on the following day. Without a memory of useful context, traditional agent devices may cause further disruptions of a user's attention (e.g., stop performance of a task, divert mental focus, redirect gaze, interrupt a current utterance) and consume further computational resources in order to solicit the necessary information to perform the instructed task. Without a memory of useful context, traditional agent devices may require entry of the same information multiple times over the same or multiple sessions.
[0137] Some possible alternative approaches to resolve this problem include exposing an agent system to various application programming interfaces that facilitate retrieval of relevant information from external data sources. While sometimes helpful, external data sources may not be fully contextualized for a particular interaction between a user and an agent device. For example, a user may have a best friend named Blake. The user’s dentist may also be named Blake. Absent any other context to resolve the ambiguity in a request, “call Blake,” reliance on the contact list alone may not help avoid causing further disruptions of a user’s attention (e.g., stop performance of a task, divert mental focus, redirect gaze, interrupt a current utterance) in order to solicit the necessary information to perform the instructed task.
[0138] Further, a user may not wish to permit an agent device to access all information in all available external data sources. For example, dietary restrictions may be relevant context for many daily tasks for which an agent device may be used (e.g., ordering food). A user may wish to permit the agent device to access this information. However, context sources that might contain dietary information (e.g., health data applications, medical entity communication applications) may contain other sensitive data to which the user does not wish to extend access. Under traditional approaches to enabling external data sources, the user may be faced with an all-or-nothing choice, which may be undesirable in some circumstances.
[0139] Example implementations of machine-learned agent systems according to the present disclosure can provide solutions to these and other technical problems currently experienced by existing technologies. For example, an outstanding problem with prior approaches to assistive technology is how to perform tasks for a user while reducing the number and timing of interactions imposed on the user to perform the task. Approaches that require repeated synchronous follow-up instructions or clarifications can waste computing resources to process (e.g., receive inputs, parse inputs, generate outputs, render outputs) the follow-up interactions. Even if the user does not respond to the synchronous follow-up events, each repeated synchronous follow-up event can cause involuntary physiological effects (e.g., arresting auditory attention, visual attention, mental focus) for a time period following the event. For example, a switching cost can refer to a measurable interval of time required for the human brain to switch from performing any other task to the task of responding to the follow-up event.
[0140] Advantageously, example implementations of machine-learned agent systems according to the present disclosure can use the technical improvement of an intelligently managed, dynamic interaction memory datastore to avoid unnecessary interactions required to perform a task by retrieving and using stored memory data to generate responses. This use of stored context can reduce repeated synchronous follow-up instructions or clarifications that can waste computing resources to process (e.g., receive inputs, parse inputs, generate outputs, render outputs) the follow-up interactions. Further, this use of stored context can reduce repeated synchronous follow-up instructions or clarifications that can cause involuntary physiological effects (e.g., arresting auditory attention, visual attention, mental focus) that introduce switching costs.
[0141] Another outstanding problem with some existing computing technology is how to perform tasks for a user with asynchronous operation. Asynchronicity can refer to decoupling a timing of the user's configuration or instruction from performance of a task. For example, many existing computing systems render, when a user device browser application loads a web-based resource, prompts that request an input of one or more configuration parameters for controlling an operation of a computing system associated with the web-based resource (e.g., the data handling operations of an analytics system). Generally, existing systems require synchronous perception of and response to the prompt (e.g., inputting a configuration parameter).
[0142] Advantageously, example implementations of machine-learned agent systems according to the present disclosure can use the technical improvement of an intelligently managed, dynamic interaction memory datastore to implement asynchronous execution of previously instructed tasks (e.g., based on explicit or implicit instructions). For example, a user can at a first time communicate to the machine-learned agent system one or more preferences for performing a task (e.g., configuring a parameter value for data handling operations of an analytics system). The machine-learned agent system can parse and store these communicated preferences in an interaction memory datastore. The machine-learned agent system can at a later time receive an input associated with the task (e.g., data describing the prompt for the web-based resource) and generate an assistive response on behalf of the user based on the stored contextual information. In this manner, for instance, the task can be completed without initiating a synchronous interaction event at the time of the task performance. This use of stored context can reduce repeated synchronous follow-up instructions or clarifications that can waste computing resources to process (e.g., receive inputs, parse inputs, generate outputs, render outputs) the follow-up interactions. Further, this use of stored context can reduce repeated synchronous follow-up instructions or clarifications that can cause involuntary physiological effects (e.g., arresting auditory attention, visual attention, mental focus) that introduce switching costs.
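As an illustrative, non-limiting sketch, the asynchronous flow described above (preferences stored at a first time, a prompt answered at a later time) can be outlined as follows. The function name `respond_to_prompt` and the parameter names are hypothetical and introduced for illustration only.

```python
from typing import Dict, List, Tuple


def respond_to_prompt(
    stored_preferences: Dict[str, str], requested_parameters: List[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Answer a configuration prompt from stored preferences.

    Returns the parameter values that can be supplied without a synchronous
    interaction, plus the parameters that would still need user input.
    """
    resolved = {
        p: stored_preferences[p] for p in requested_parameters if p in stored_preferences
    }
    unresolved = [p for p in requested_parameters if p not in stored_preferences]
    return resolved, unresolved


# First time: preferences communicated by the user are parsed and stored.
preferences = {"analytics_tracking": "deny", "functional_storage": "allow"}

# Later time: a prompt from a web-based resource is answered from memory.
resolved, unresolved = respond_to_prompt(
    preferences, ["analytics_tracking", "ad_personalization"]
)
```

In this sketch, only the entries in `unresolved` would require a synchronous interaction event; the entries in `resolved` can be supplied on behalf of the user.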
[0143] In this manner, for instance, example implementations provide a mechanism for controlling a computing device that would not be possible if the machine-learned agent system did not implement an intelligently managed, dynamic context datastore as described herein.
[0144] In another aspect, the present disclosure provides for structured context management. In an example, structured context management can include categorization and storage of objects derived from unstructured input data. A machine-learned agent system can obtain pieces of information that form “memories” of prior interactions. Each memory piece can be stored or represented in a memory object. A memory object can store input data and its attributes to facilitate recall (e.g., for maintaining a list of reminder tasks). A memory object can store input data and its attributes for conditioning future machine-learned model inferences (e.g., by injecting context data from the context object into an input of the model). Each object can encapsulate a plurality of attributes that not only contain the content from a respective input chunk but also attributes characterizing features of the input. These attributes may include, for instance, temporal attributes (e.g., indicating timing or urgency), a classification of the input (e.g., as an action to execute or as context for storage), related objects (e.g., related chunks of context, related actions to perform based on the input), etc.
[0145] A technical problem overcome by example implementations of the present disclosure is the efficient indexing and storing of numerous contextual signals and commands received over time for processing by a machine-learned agent system. Machine-learned agent systems can aim to provide an integrated interface for inputting various commands, notes, comments, and reminders as users or other systems perform other tasks so that the agent system can provide customized active assistance. These inputs, if not catalogued effectively, can inhibit the performance of the agent system as an interface for a computing device. For instance, if an agent system fails to return outputs that are responsive to inputs in an accurate and reliable fashion (e.g., leading to a failure to retrieve relevant context from memory or retrieving irrelevant memories), then the agent system may render a device to be unresponsive to user inputs and commands.
[0146] In an example solution to this technical problem, an example implementation of the present disclosure can leverage inferences from a machine-learned model to implement decision logic for when and how to store memory data in association with one or more of its attributes. This learned contextualizing of memory data can help persist not only the memory itself but also its context, timing, grouping with other inputs, etc. These attributes can be encapsulated with the memory data in an object. The object can be stored for later retrieval and ingestion by the machine-learned agent system. As the machine-learned agent system processes the object, its attributes can be exposed to various downstream processes that can use the attribute data to maintain a consistent and coherent memory.
[0147] For example, stored contextual attributes can help disambiguate or deconflict memories. For instance, conflicting memories can be resolved based on recency or based on contextual distinctions that explain or eliminate a conflict between contrary inputs. The machine-learned agent system can use one or more machine-learned models to perform contextual analysis of its memory data, both new and old. For example, the machine-learned agent system can process new information in view of existing related memory data to predict memory updates. The predicted memory updates can include updates that resolve conflicts, clarify ambiguities, or replace stale information.
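As an illustrative, non-limiting sketch, resolving conflicting memories based on recency can be outlined as follows. The `Memory` type and its field names are assumptions introduced for illustration; the disclosure also contemplates resolution based on contextual distinctions predicted by a machine-learned model, which this sketch does not attempt to show.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class Memory:
    attribute: str      # what the memory is about (hypothetical field)
    value: str          # the remembered value
    updated: datetime   # when the value was last stated or inferred


def resolve_by_recency(memories: Iterable[Memory]) -> Dict[str, Memory]:
    """Keep, for each attribute, only the most recently updated memory."""
    latest: Dict[str, Memory] = {}
    for memory in memories:
        current = latest.get(memory.attribute)
        if current is None or memory.updated > current.updated:
            latest[memory.attribute] = memory
    return latest
```

For example, if two stored memories give different lunch-break times, only the more recently updated one would be retained under this policy.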
[0148] Various aspects of the technology described herein can provide other technical effects and benefits. For instance, an example technical effect of example implementations of the present disclosure is the reduction in the use of computing resources during runtime by pre-processing and storing high-quality memory data objects. Unstructured input data received by the system can be initially processed through a machine-learned model. The resulting memory objects created by the machine-learned agent system can provide much higher quality context signals by condensing key features into explicit attributes. At runtime, then, a machine-learned agent system can quickly retrieve memory data for a particular inference and access the high-quality context. Pre-processed, structured storage of memory objects may significantly reduce the computational load at runtime, as the system may not need to ingest larger amounts of unstructured data. For instance, directly inputting large amounts of unstructured data into a machine-learned model (e.g., prepending an exhaustive record of raw interactions) for obtaining an inference for a particular task can sometimes be less reliable or accurate than inputting clean, curated, and condensed data for the same query. And performing such data curation and cleaning at input time and caching for later use can facilitate lower latency responses at runtime and leverage the usage of compute on a time-shifted schedule (e.g., shifted off peak demand times).
[0149] An example technical effect of example implementations of the present disclosure is increased energy efficiency in performing operations using machine-learned models, thereby improving the functioning of computers implementing such models. For instance, example implementations can provide for more energy-efficient runtime execution or inference (e.g., by the use of pre-processed memory data). In some scenarios, increased energy efficiency can provide for less energy to be used to perform a given task (e.g., less energy expended to maintain the model in memory, less energy expended to perform calculations within the model, etc.). In some scenarios, increased energy efficiency can provide for more task(s) to be completed for a given energy budget (e.g., a larger quantity of tasks, more complex tasks, the same task but with more accuracy or precision, etc.).
[0150] In this manner, for instance, the improved energy efficiency of example implementations of the present disclosure can reduce an amount of pollution or other waste associated with implementing machine-learned models and systems, thereby advancing the field of machine-learning and artificial intelligence as a whole. The amount of pollution can be reduced in toto (e.g., an absolute magnitude thereof) or on a normalized basis (e.g., energy per task, per model size, etc.). For example, an amount of CO2 released (e.g., by a power source) in association with training and execution of machine-learned models can be reduced by implementing more energy-efficient training or inference operations. An amount of heat pollution in an environment (e.g., by the processors / storage locations) can be reduced by implementing more energy-efficient training or inference operations.
[0151] Figure 2 is a block diagram that illustrates an example implementation of interaction memory datastore 110 according to example aspects of the present disclosure.
[0152] Interaction memory datastore 110 can contain a plurality of memory objects 202-1, . . . , 202-M. A memory object can be a data structure configured for storing memory information. The memory object can encapsulate pieces of memory information along with relevant metadata associated with the piece of memory information as well as indicating associations between pieces of memory information (e.g., a knowledge graph). A memory object can be stored in a human-readable format, such as JSON, YAML, etc. A memory object can be encoded and stored in a non-human readable format.
[0153] Each memory object can contain one or more memory value(s). For instance, memory object 202-1 can contain one or more memory value(s) 204-1. Memory object 202-M can contain one or more memory value(s) 204-M. A memory value can be an individual piece of memory information that is desired to be stored for future recall. For example, a memory value can be a value for an attribute associated with the user (e.g., an attribute of the user, an attribute of an object associated with the user, an attribute associated with an event or occurrence associated with the user).
[0154] A memory value can be extracted or excerpted from an interaction. For example, a visual memory can be stored by saving all or part of one or more images associated with an interaction session. A textual memory can be stored by saving all or part of one or more messages or documents associated with an interaction session. An audio memory can be stored by saving all or part of one or more audio files or streams associated with an interaction session.
[0155] A memory value can be generated using a machine-learned model based on an interaction. For example, a machine-learned model can receive a query 104 and generate one or more memory value(s) 204-1 based on the query. A machine-learned model can generate a summary of the query and the summary can be used as memory value(s) 204-1. A machine-learned model can infer data based on query 104 and other context (e.g., other memories, prior conversational context, user preferences) and the inferred data can be used as memory value(s) 204-1. A machine-learned model can transform input data in one modality and store a memory in another modality (e.g., a more data-efficient modality). For example, a machine-learned model can transform an audio input into a textual input that is then stored in memory object 202-1. A machine-learned model can generate a caption for image data to record a memory value in text data.
[0156] Example memory values can include various data elements. Text data can include strings of characters, representing natural language, code, or other symbolic information, including structured data objects. Text data can be represented in UTF-8 encoded strings. Image data can include pixel data representing visual information. Image data can be represented in JPEG, PNG, or other image formats, or byte-encoded representations thereof. Audio data can include sampled waveform data representing sound. Audio data can be represented in WAV, MP3, or other audio formats. Video data can include a combination of image data and audio synchronized over time. Video data can be represented in MP4, AVI, or other video formats. Vector data can include numerical representations of data, which may be used for similarity comparisons. Vector data can be represented in floating-point or other numerical arrays of a specific dimension. Vector data can be generated by an embedding model based on one or more other memory values of the memory object. Sensor data can include data collected from various sensors, such as temperature, pressure, or acceleration. Sensor data can be represented with numerical values with associated units and timestamps. Location data can be represented by latitude and longitude coordinates, addresses, or the like. Document data can include collections of structured and unstructured data, often containing text, images, and other media. Document data can be represented as TXT, PDF, XML, or other document formats.
[0157] Each memory object can contain respective memory metadata 206-1, . . . , 206-M. Memory metadata 206-1, . . . , 206-M can indicate information about a corresponding memory object. For example, memory metadata 206-1 can indicate the context in which a memory was collected. Example metadata fields can include a time (e.g., date) at which the memory was collected, a location of where the memory was collected, corresponding raw input data that was used to generate the information in the memory object, or other properties of the memory object.
[0158] Example metadata can include various data elements. Time data can be captured in a time stamp. Time data can include a creation date, an update date, etc. Source data can identify an origin of the memory value or object (e.g., an explicit user command, implicit system inference). Summary data can contain a brief description of the memory value or the interaction that led to the memory object’s creation. Confidence data can include a numerical value representing the system's confidence in the accuracy of the memory value. Expiration data can include a sunset date for "forgetting” the value. Session data can include a session identifier linking the memory object to a specific user interaction session. User data can include an identifier of an account or profile to which the memory object is linked. Tag data can include keywords or labels for efficient searching and retrieval.
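As an illustrative, non-limiting sketch, a memory object holding memory values and metadata, serialized to a human-readable JSON format, might be structured as follows. The class names, field names, and example values are assumptions introduced for illustration; they loosely mirror the example metadata listed above and do not represent a required schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MemoryMetadata:
    created: str                       # time stamp of creation
    source: str                        # e.g., explicit user command, implicit inference
    confidence: float                  # confidence in the accuracy of the memory value
    summary: Optional[str] = None      # brief description of the memory or interaction
    expires: Optional[str] = None      # sunset date for "forgetting" the value
    session_id: Optional[str] = None   # links the object to an interaction session
    user_id: Optional[str] = None      # account or profile the object is linked to
    tags: List[str] = field(default_factory=list)  # keywords for search and retrieval


@dataclass
class MemoryObject:
    values: List[str]                  # one or more memory values (text only here)
    metadata: MemoryMetadata

    def to_json(self) -> str:
        """Serialize the memory object to a human-readable JSON string."""
        return json.dumps(asdict(self), sort_keys=True)
```

This sketch stores text values only; values of other modalities (e.g., image, audio, or vector data) could be represented with additional typed fields or byte-encoded strings.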
[0159] One or more memory objects can include multiple different modalities of data. A memory object can include data of a first modality and data of a second modality. Different modalities can be stored in different memory objects and linked through metadata. For example, one memory object might store the first modality, and another might store the second modality, with metadata linking them to the same interaction. Each memory object can be retrieved independently or may be linked to be retrieved together.
[0160] In some implementations, more context-rich data modalities can be “forgotten” (e.g., deleted, subsampled, noised, or otherwise obfuscated) first according to a memory deletion schedule of memory model 105. For instance, in a first “forgetting” action, video data in a memory object can be deleted and replaced with image frames sampled from the video data. In a second “forgetting” action, the image frames can be replaced by a single image frame. In a third “forgetting” action, the image frame can be replaced with a textual summary of the image frame. Such actions can be scheduled to occur over time. The schedule can be based on a time at which the memory was last accessed. A corresponding memory in a different modality may be degraded less. For instance, a textual memory may be summarized after a longer interval as compared to the interval(s) between the actions applied to the video data.
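The staged degradation described above can be sketched as a lookup from elapsed time since last access to a representation level. The ladder names and the day thresholds below are assumptions chosen for illustration; the disclosure does not specify particular intervals.

```python
from bisect import bisect_right

# Hypothetical degradation ladders, most context-rich representation first.
LADDERS = {
    "video": ["video", "sampled_frames", "single_frame", "text_summary"],
    "text": ["text", "text_summary"],
}

# Hypothetical thresholds (days since last access) at which each
# successive forgetting action is applied. Text degrades more slowly.
THRESHOLDS_DAYS = {
    "video": [30, 90, 180],
    "text": [365],
}


def representation_after(modality: str, days_since_access: int) -> str:
    """Return the representation that remains after the scheduled actions."""
    actions_applied = bisect_right(THRESHOLDS_DAYS[modality], days_since_access)
    return LADDERS[modality][actions_applied]
```

For example, under these assumed thresholds a video memory last accessed 100 days ago would be held as a single frame, while a textual memory of the same age would be unchanged.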
[0161] Although a schedule of forgetting memory information by memory model 105 may be understood in terms of time, it is also to be understood that a schedule can also correspond to a hierarchy of permissions. For instance, a first permissions level for an operation of agent system 100 can correspond to access to original memory information. A second permissions level for an operation of agent system 100 can correspond to access to memory information obfuscated according to a first “forgetting” action. A third permissions level for an operation of agent system 100 can correspond to access to memory information obfuscated according to a second “forgetting” action (e.g., the first and the second combined, or just a second). A fourth permissions level for an operation of agent system 100 can correspond to access to memory information obfuscated according to a third “forgetting” action (e.g., the first and the second and the third combined, or just a third).
[0162] In an example, interaction memory datastore 110 can store image data in association with audio data, such as a video recording recorded during an interaction session and a voice recording recorded during the interaction session. For instance, a user of a wearable device can interact with the wearable device to discuss topics related to an environment, and the wearable device can record video content of the environment. These interactions can be stored together. For example, a memory value can be an image (e.g., a video frame), multiple images (e.g., a list of video frames), an embedding of at least a portion of the video, an embedding of at least a portion of the video and audio together, etc. A memory value can be an audio recording, an embedding of at least a portion of the audio recording, a summary or excerpt transcribed from the audio recording, etc. A single memory object can include multiple memory values.
[0163] In an example, interaction memory datastore 110 can store image data in association with text data, such as image data from a video recording recorded during an interaction session and a transcription from a voice recording recorded during the interaction session. For instance, a user of a wearable device can interact with the wearable device to discuss topics related to an environment, and the wearable device can record video content of the environment. These interactions can be stored together. For example, a memory value can be an image (e.g., a video frame), multiple images (e.g., a list of video frames), an embedding of at least a portion of the video, an embedding of at least a portion of the video and text together, etc. A memory value can be text transcribed from an audio recording, an embedding of at least a portion of the text, a summary of the transcription, etc.
[0164] In an example, interaction memory datastore 110 can store audio data in association with text data. Audio data (e.g., music) and text data (e.g., comments on the music) can be stored as multimodal memories in interaction memory datastore 110. For example, a memory object can be created to store a user's interaction with a piece of music. The memory object can contain a memory value representing the audio data (e.g., an audio file or a vector embedding representing features of the audio). Another memory value within the same memory object or a different object can store text data (e.g., user comments, reviews, or annotations about the music). Metadata associated with the memory object can include timestamps indicating when the interaction occurred or register one or more textual annotations within the interval of the audio recording.
[0165] Memory data 114 can include a combination of memory values from different memory objects. The combination can be composed based on metadata associated with the different memory objects. In some implementations, the memory data includes an ordered combination of one or more memory values, the ordered combination ordered based on corresponding timestamps of the plurality of memory objects. This ordering allows the machine-learned agent system to prioritize more recent context when making inferences. This temporal ordering can be helpful for resolving potential conflicts or ambiguities in the memory data. For example, if a user provides conflicting information at different times, a machine-learned model can prioritize the most recent information based on its ordering within the list. The ordered list can contain time information (e.g., the timestamp) or the ordering can implicitly communicate the temporal relationships between listed memory values.
[0166] The ordered combination of memory values may be presented to the machine-learned sequence processing model in various formats. For example, the memory values may be concatenated into a single string, with separators or other delimiters used to distinguish individual memory items. Alternatively, the memory values may be presented as a structured data object, such as a JSON array or a key-value store, with metadata such as timestamps and confidence scores included to provide additional context. The specific format employed may depend on the input requirements of the machine-learned sequence processing model and the overall design of the machine-learned agent system. The system may dynamically adapt the format of the memory data to optimize its compatibility with different models or processing pipelines. In some implementations, the system may employ a machine-learned model to generate a summary or condensed representation of the ordered memory values, providing a more concise and efficient input for the sequence processing model.
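A minimal sketch of the two presentation formats mentioned above (a delimited string and a JSON array) follows. The function names, the dictionary keys, and the use of integer dates in YYYYMMDD form are assumptions for illustration.

```python
import json

# Hypothetical retrieved memory values; "time" is assumed to be a YYYYMMDD integer.
memories = [
    {"value": "is vegetarian", "time": 20241015, "confidence": 0.8},
    {"value": "prefers to spend less than $20 for lunch", "time": 20240610, "confidence": 0.6},
    {"value": "low-sodium diet", "time": 20241101, "confidence": 1.0},
]


def order_by_time(items, newest_first=True):
    """Order memory values by timestamp so recent context can be prioritized."""
    return sorted(items, key=lambda m: m["time"], reverse=newest_first)


def as_delimited_string(items, sep="\n"):
    """Concatenate values into one string; ordering alone conveys recency."""
    return sep.join(f"- {m['value']}" for m in items)


def as_json(items):
    """Serialize values with explicit timestamps and confidence scores."""
    return json.dumps(items, ensure_ascii=False)


ordered = order_by_time(memories)
```

The string form relies on position to communicate temporal relationships, while the JSON form carries the timestamps explicitly; which is preferable would depend on the input requirements of the model.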
[0167] A memory object can contain a support record. A support record can contain a trace of one or more supporting interactions or data elements that contribute to generation of content of a memory object. For example, memory object 202-1 can contain a list of interaction objects that store interaction data (e.g., input data, output data) that support a memory.
[0168] A support record can include one or multiple modalities of data in various levels of precision. For example, the support record may be stored in a hierarchical structure, with higher-level summaries providing a concise overview and lower-level details providing granular information about individual interactions. The support record may also include metadata such as timestamps, confidence scores, and source identifiers to further contextualize the supporting data. The system may use machine learning models to analyze the support record and identify inconsistencies or ambiguities in the memory objects, enabling more accurate and reliable memory management. The support record may be dynamically updated as new interactions occur, providing a continuously evolving record of evidence for each memory object. In some implementations, the support record may be selectively pruned to manage storage space and computational resources, while retaining sufficient information to maintain the integrity and accuracy of the memory objects. The support record may be used to resolve conflicts between memory objects, by considering the relative strength and consistency of the supporting evidence. The system may prioritize memory objects with stronger supporting evidence, or resolve conflicts by creating new memory objects that synthesize information from multiple sources.
[0169] Maintaining the support record can facilitate intelligent updating over time of the contents of a memory object (e.g., updating an attribute of an object associated with a user). Each time an interaction is received relevant to a memory object, the memory object can be updated based on the additional information received from the interaction in full view of the other related interactions. This can stabilize the evolution of memories over time by anchoring new memories in context of prior support. For example, memory model 105 can provide an input to a machine-learned model containing data from the support record and the new information (e.g., from query 104) along with an instruction to return an update to the memory based on the new and the old information. The machine-learned model can return a prediction that contains data for updating the contents of the memory object that is conditioned on and consistent with the full support record.
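One way such a model input could be assembled is sketched below. The prompt wording, the function name, and the dictionary layout are hypothetical; no particular model or prompt format is prescribed by this disclosure, and the call to the model itself is omitted.

```python
def build_update_prompt(memory: dict, new_input: str) -> str:
    """Combine the prior memory, its support record, and new information."""
    support_lines = "\n".join(
        f"- [{entry['time']}] {entry['content']}"
        for entry in memory.get("support", [])
    )
    parts = [
        f"Current memory value: {memory['value']}",
        "Supporting interactions:",
        support_lines,
        f"New interaction: {new_input}",
        "Return an updated memory value consistent with all of the above.",
    ]
    return "\n".join(parts)


# Hypothetical memory with a two-entry support record.
example_memory = {
    "value": "prefers window seats",
    "support": [
        {"content": "Can I get a window seat?", "time": 20240301},
        {"content": "Window seat again, please.", "time": 20240412},
    ],
}
update_prompt = build_update_prompt(example_memory, "An aisle seat is fine for short flights.")
```

Because the full support record is included, the model sees the basis for the prior memory rather than only its summarized value.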
[0170] The support record can facilitate deconflicting competing memories. For example, a memory value may operate as an information bottleneck as compared to the full record of interactions that support the memory value. For example, a memory value can be a summary or conclusion based on a record of distinct interactions, statements, or actions. The record of supporting actions can provide valuable context when updating the memory. When a new memory appears to contradict a prior memory, the support record can be used to generate more nuanced memories that can reconcile the memory record. For example, a user may only enjoy a certain food from a particular restaurant. The user might repeatedly avoid ordering that food over a period of time, such that memory model 105 may, using a machine-learned model, generate one or more memory values that indicate an aversion to the food. However, at a later time, the user might order the food.
[0171] In some situations, if operating based only on the prior memory indicating the aversion to the food, memory model 105, using a machine-learned model, might predict that a new memory indicating a preference for the food should replace the prior memory. But the apparent conflict may be resolved in view of the support record. For example, by processing the new query in view of the full support record, memory model 105, using a machine-learned model, might generate an inference that the old memory should instead be clarified to indicate that the user avoids the food except when served at a particular restaurant.
[0172] In this manner, for instance, any generalizations or errors made in memory creation (e.g., when memory model 105 engages a machine-learned model to generate new memory information based on interactions) might be resolved or corrected in view of new information. In this manner, for instance, new information can serve to refine and build more nuanced memories rather than noisily overcorrecting based on limited information.
[0173] A memory object can be associated with one or more other memory objects. For example, relationships between memory objects can be represented as a knowledge graph. For instance, an example knowledge graph can be composed of memory objects interconnected through various relationships, forming a dynamic, user-specific representation of contextual information. Each memory object may correspond to a node in the graph. The relationships between memory objects can correspond to edges between nodes that are dynamically updated based on user interactions and the system’s inference mechanisms.
[0174] For example, edges can indicate a chronological order of memory objects within a session or across sessions. This allows for tracking the evolution of context over time. Edges can represent semantic connections between memory objects. These relationships can be inferred by a machine-learned model based on the content of the memory objects and can include: hierarchical relationships (e.g., “Vegan Diet” IS-A “Dietary Restriction”); relationships indicating components or constituents (e.g., “Dairy-Free Pizza Recipe” PART-OF “Vegan Recipes”); relationships indicating attributes or properties (e.g., “Appointment” HAS-PROPERTY “Time,” “Location,” “Dentist”); more general relationships indicating semantic similarity or relevance (e.g., “Favorite Architect” RELATED-TO “Architectural Style”); causal links between memory objects; support links associating memories that support other memories (e.g., objects in a support record), which can form a chain of evidence for the memory, allowing for context-aware updates and conflict resolution.
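The typed relationships above can be sketched as labeled edges in a small adjacency structure. The class name and methods are hypothetical, and a production system might instead use a dedicated graph database.

```python
from collections import defaultdict


class MemoryGraph:
    """Hypothetical in-memory graph: nodes are memory objects, edges are typed."""

    def __init__(self):
        self._edges = defaultdict(list)  # source -> list of (relation, target)

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self._edges[source].append((relation, target))

    def neighbors(self, node: str, relation=None) -> list:
        """Return targets reachable from node, optionally filtered by relation."""
        return [
            target
            for rel, target in self._edges.get(node, [])
            if relation is None or rel == relation
        ]


graph = MemoryGraph()
graph.add_edge("Vegan Diet", "IS-A", "Dietary Restriction")
graph.add_edge("Dairy-Free Pizza Recipe", "PART-OF", "Vegan Recipes")
graph.add_edge("Appointment", "HAS-PROPERTY", "Time")
graph.add_edge("Appointment", "HAS-PROPERTY", "Location")
graph.add_edge("Favorite Architect", "RELATED-TO", "Architectural Style")
```

Filtering by relation type lets a retrieval step follow, for instance, only support links or only hierarchical links from a given memory.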
[0175] Memory model 105 can use a number of different techniques for retrieving memory objects from interaction memory datastore 110. An example process can begin by interpreting new information in query 104. For example, query 104 can include a new input for processing in view of stored memories.
[0176] Query 104 can be preprocessed to generate memory query 108. For example, memory model 105 can process query 104 using a machine-learned model to generate memory query 108. The processing can include generating an embedding of all or part of query 104 for conducting a similarity search. The processing can include generating a caption, summary, or other rewriting/reformatting prior to or in lieu of embedding. Memory model 105 can obtain a memory query 108 that contains structured query information for executing a query against interaction memory datastore 110.
[0177] Memory model 105 can use this structured query representation (e.g., memory query 108) to initiate a search within interaction memory datastore 110. The search strategy can depend on the nature of the query and the organization of the datastore. Several methods may be employed concurrently or sequentially. Example techniques are provided below.
[0178] If the pre-processed query contains keywords, memory model 105 can perform a keyword search across the metadata fields (e.g., tags, summaries, session IDs) of memory objects. This search can return a set of candidate memory objects containing the specified keywords. The search may utilize efficient indexing techniques such as inverted indexes or keyword trees to speed up the process. For example, memory query 108 can include search terms, a SQL string, etc.
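A keyword search over tag metadata using an inverted index can be sketched as follows. The object identifiers and the restriction to the tags field are assumptions made to keep the example short.

```python
from collections import defaultdict


def build_inverted_index(memory_objects: dict) -> dict:
    """Map each lowercase tag to the set of memory object ids that carry it."""
    index = defaultdict(set)
    for obj_id, obj in memory_objects.items():
        for tag in obj["tags"]:
            index[tag.lower()].add(obj_id)
    return index


def keyword_search(index: dict, keywords: list) -> list:
    """Return ids of objects matching any keyword, sorted for determinism."""
    hits = set()
    for keyword in keywords:
        hits |= index.get(keyword.lower(), set())
    return sorted(hits)


# Hypothetical stored objects, reduced to their tag metadata.
stored_objects = {
    "m1": {"tags": ["finance", "food"]},
    "m2": {"tags": ["food", "health"]},
    "m3": {"tags": ["food", "health"]},
}
tag_index = build_inverted_index(stored_objects)
```

Building the index once avoids scanning every memory object per query, which is the speed-up that inverted indexes are intended to provide.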
[0179] Memory model 105 can leverage a semantic search mechanism, such as by employing a vector database within interaction memory datastore 110. A vector embedding of the pre-processed query can be compared to vector embeddings of values within memory objects stored in the database. This comparison can use similarity metrics like cosine similarity to identify memory objects semantically related to the query. The top-K most similar memory objects can be retrieved, where K is a configurable parameter. For example, memory query 108 can include a vector representation of a query for a similarity search.
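The top-K similarity retrieval described above can be sketched with a brute-force cosine comparison. The two-dimensional vectors and identifiers are toy values for illustration; an actual vector database would typically use an approximate nearest-neighbor index rather than an exhaustive scan.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored_vectors: dict, k: int = 2) -> list:
    """Return ids of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), obj_id)
        for obj_id, vec in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [obj_id for _, obj_id in scored[:k]]


# Hypothetical embeddings of stored memory values.
stored_vectors = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], stored_vectors, k=2)
```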
[0180] A temporal filter can be applied to candidate memory objects to reduce a search space. This filter can operate over timestamps associated with the memory objects, retaining only those within a specified time window relative to the query's timestamp or other relevant temporal context. For example, memory query 108 can include a time interval or cutoff time. A filter can be applied based on a composite metric that measures importance and recency. For example, old but important memories (e.g., a peanut allergy) may score highly enough on the composite metric that they are recalled even after extended intervals. Memory model 105 can apply a contextual filter to leverage relationships between memory objects represented in a knowledge graph within interaction memory datastore 110. For example, if a memory object related to “diet” is retrieved, the system might also retrieve related objects such as “recipes” or “food orders.” For example, memory query 108 can include keywords, categories, etc.
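A composite importance and recency metric of the kind mentioned above might, as one assumption, blend an importance score with an exponentially decaying recency term. The half-life, the equal weighting, the threshold, and the importance values below are all hypothetical.

```python
def recall_score(importance: float, age_days: float,
                 half_life_days: float = 30.0, weight: float = 0.5) -> float:
    """Weighted blend of importance and recency (recency halves every half-life)."""
    recency = 0.5 ** (age_days / half_life_days)
    return weight * importance + (1.0 - weight) * recency


def filter_candidates(candidates: list, threshold: float = 0.4) -> list:
    """Keep values whose composite score meets the threshold, preserving order."""
    return [
        c["value"]
        for c in candidates
        if recall_score(c["importance"], c["age_days"]) >= threshold
    ]


# Hypothetical candidates with assumed importance scores in [0.0, 1.0].
candidates = [
    {"value": "peanut allergy", "importance": 1.0, "age_days": 400},
    {"value": "asked about the weather", "importance": 0.1, "age_days": 400},
    {"value": "wants tacos this week", "importance": 0.1, "age_days": 1},
]
kept = filter_candidates(candidates)
```

Under these assumed parameters the old but important allergy memory and the recent low-importance memory both pass, while the old low-importance memory is filtered out.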
[0181] Interaction memory datastore 110 can return a number of memory objects, or data retrieved therefrom, based on memory query 108.
[0182] Memory model 105 can generate memory data 114 by extracting relevant information from the selected memory objects. This generation can include extracting the memory values from the objects directly, such as by concatenating multiple textual memory values into a string representation for insertion into a prompt; passing image data to an image object portion of input data structure 116 and passing text data to a text data portion of input data structure 116; interleaving multiple modalities of memory data into semantically related groups; etc. This generation can include using a machine-learned model to generate a summary or other representation of the retrieved memories, such as a textual summary, one or more embeddings representing the collected memories, etc.
[0183] Figure 3 is a block diagram of an example recall cycle 106 according to example aspects of the present disclosure. Memory model 105 can query interaction memory datastore 110 based on input 300 and receive three memory objects 302, 304, and 306. Memory model 105 can process the received memory objects to generate memory data 308.
[0184] Input 300 can include data from query 104. Input 300 can include all or a portion of query 104. Input 300 can include a caption or other rewriting of query 104.
[0185] Memory object 302 can include structured data indicating a first memory. Memory object 302 can be associated with a unique identifier. In an example, memory object 302 can be represented as a JSON object:
[0186] {
[0187] “tags”: [“finance”, “food”],
[0188] “value”: “prefers to spend less than $20 for lunch”,
[0189] “created”: 20240610,
[0190] “lastUpdated”: 20240610,
[0191] “confidence”: 0.6,
[0192] “support”: [
[0193] {
[0194] “content”: “I'd like to keep lunch options under $20”,
[0195] “time”: 20240610
[0196] }
[0197] ]
[0198] }
[0199] For example, memory object 302 can be retrieved based on a relevance of the stored tags to input 300.
[0200] Memory object 304 can include structured data indicating a second memory. Memory object 304 can be associated with a unique identifier. In an example, memory object 304 can be represented as a JSON object:
[0201] {
[0202] “tags”: [“food”, “health”],
[0203] “value”: “is vegetarian”,
[0204] “created”: 20240822,
[0205] “lastUpdated”: 20241015,
[0206] “confidence”: 0.8,
[0207] “support”: [
[0208] {
[0209] “content”: “Are there any vegetarian options?”,
[0210] “time”: 20240801
[0211] },
[0212] {
[0213] “content”: “What on the menu is vegetarian?”,
[0214] “time”: 20240822
[0215] },
[0216] {
[0217] “content”: “That contains meat - please suggest another recipe.”,
[0218] “time”: 20241015
[0219] }
[0220] ]
[0221] }
[0222] For example, memory object 304 can be retrieved based on a relevance of the stored tags to input 300.
[0223] Memory object 306 can include structured data indicating a third memory. Memory object 306 can be associated with a unique identifier. In an example, memory object 306 can be represented as a JSON object:
[0224] {
[0225] “tags”: [“food”, “health”],
[0226] “value”: “low-sodium diet”,
[0227] “created”: 20241101,
[0228] “lastUpdated”: 20241101,
[0229] “confidence”: 1,
[0230] “support”: [
[0231] {
[0232] “content”: “Remember not to suggest high-sodium foods.”,
[0233] “time”: 20241101
[0234] }
[0235] ]
[0236] }
[0237] For example, memory object 306 can be retrieved based on a relevance of the stored tags to input 300.
[0238] Memory data 308 can contain extracted portions of the memory objects. For example, memory data 308 can include a concatenated string of the memory values contained in the memory objects:
[0239] “- is vegetarian
[0240] - low-sodium diet
[0241] - prefers to spend less than $20 for lunch”.
[0242] Memory data 308 may be incorporated into input data structure 116 to condition the predictions of a machine-learned model within machine-learned model system(s) 118. This may be achieved by formatting memory data 308 as a structured prompt or instruction within input data structure 116. For example, if input data structure 116 is designed for a large language model, memory data 308 may be prepended to a user’s query as contextual information. This may take the form of a concise summary of relevant memories, such as: “User prefers vegetarian, low-sodium meals costing under $20.” Alternatively, individual memory items may be presented as separate bullet points or in a more detailed structured format, potentially including metadata such as confidence scores or timestamps. The specific formatting may depend on the capabilities and requirements of the machine-learned model.
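The composition of memory data 308 and its prepending to a user query can be sketched as follows. The prompt template, the function names, and the sample query are hypothetical; the memory values are those of the example memory objects, listed in the order shown for memory data 308.

```python
def compose_memory_data(memory_objects: list) -> str:
    """Concatenate memory values into a bulleted string."""
    return "\n".join(f"- {obj['value']}" for obj in memory_objects)


def build_prompt(memory_data: str, user_query: str) -> str:
    """Prepend memory data to the user's query as contextual information."""
    return f"Known user context:\n{memory_data}\n\nUser query: {user_query}"


# Values taken from the example memory objects; the remaining fields are omitted.
retrieved = [
    {"value": "is vegetarian", "confidence": 0.8},
    {"value": "low-sodium diet", "confidence": 1},
    {"value": "prefers to spend less than $20 for lunch", "confidence": 0.6},
]
memory_data = compose_memory_data(retrieved)
prompt = build_prompt(memory_data, "Where should I get lunch today?")
```

A more detailed variant could append each item's confidence score or timestamp to its bullet, at the cost of a longer prompt.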
[0243] In scenarios involving multimodal data, memory data 308 may include both textual and visual components. For instance, if the user has previously interacted with images of specific dishes, these images may be included in input data structure 116 alongside the textual summary of dietary preferences. The machine-learned model may then use both the textual and visual information to generate a more relevant and accurate response. The image data may be represented as embedded vectors, image URLs, or directly embedded image data, depending on the model's input requirements. The model may be trained to interpret and utilize this multimodal context to refine its predictions. For example, the model may be able to identify specific dishes in images and cross-reference them with the user’s dietary preferences expressed in the text data.
[0244] Figure 4 is a block diagram illustrating generation of data for a memory object according to example aspects of the present disclosure.
[0245] Input 400 can include a request indicating interest in vegetarian food options. For instance, input 400 can include a textual statement or audible utterance, “Are there any vegetarian options?” Input 400 can include data from query 104. Input 400 can include all or a portion of query 104. Input 400 can include a caption or other rewriting of query 104.
[0246] Memory model 105 can interact with machine-learned model system(s) 118 to generate values for a new memory object 402 based on input 400. For example, memory model 105 can use machine-learned model system(s) 118 to predict a memory value “prefers vegetarian” based on the input. Memory model 105 can use machine-learned model system(s) 118 to predict one or more tags for categorizing or classifying the memory object to facilitate retrieval (e.g., food, health).
[0247] Memory model 105 can store a support object that supports the inferred memory value. For example, a support object can include the original content of the input and a timestamp value. For example, a support object for the present example can include
[0248] {
[0249] “content”: “Are there any vegetarian options?”,
[0250] “time”: 20240801
[0251] }.
[0252] A support object can include other information, such as other context associated with the input (e.g., a current task, a location, a summary of associated dialog, etc.).
[0253] Based on the support data, memory model 105 can interact with machine-learned model system(s) 118 to obtain confidence values. A confidence value can represent a confidence of machine-learned model system(s) 118 in the accuracy of an inferred memory. The confidence can be based on logit values or other internal states associated with the neural network or other machine-learned model used for the prediction. The confidence can be based on an explicit confidence prediction output by a machine-learned model in the machine-learned model system(s) 118.
[0254] In some examples, an amount of supporting data (e.g., entries in a support record) can be associated with increased confidence. For instance, a repeated preference or declaration may be remembered with increased confidence or importance as compared to a memory with only one supporting statement. Further, a type of support can affect a confidence value. For example, even a single supporting declaration of “I love coffee” may result in a higher-confidence memory of the user's preference for coffee as compared to a record of a series of coffee orders from a food service entity. For example, explicit support for a memory can be associated with higher confidence memories.
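The two effects described above (more support raises confidence, and explicit support counts for more than implicit support) can be illustrated with a simple hand-written heuristic. This is an assumption for illustration only: the disclosure describes confidence as obtained from a machine-learned model, and the weights and the combination rule below are hypothetical.

```python
# Hypothetical per-entry weights: explicit declarations count for more
# than implicit signals such as order history.
SUPPORT_WEIGHTS = {"explicit": 0.6, "implicit": 0.15}


def support_confidence(support_kinds: list) -> float:
    """Combine support entries so each one removes a share of remaining doubt."""
    remaining_doubt = 1.0
    for kind in support_kinds:
        remaining_doubt *= 1.0 - SUPPORT_WEIGHTS[kind]
    return 1.0 - remaining_doubt


single_declaration = support_confidence(["explicit"])      # e.g., "I love coffee"
three_orders = support_confidence(["implicit"] * 3)        # e.g., three coffee orders
```

With these assumed weights, one explicit declaration yields higher confidence than three implicit observations, and each additional entry of either kind increases confidence without exceeding 1.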
[0255] Because this example memory is formed with only one supporting interaction, a confidence might be relatively low. For example, a user might be vegetarian or may only be interested in eating a vegetarian meal at a particular time.
[0256] Memory model 105 can cause memory object 402 to store time data associated with the memory. For example, a creation date can record when memory object 402 was created. An updated date can record when memory object 402 was last updated.
[0257] Memory model 105 may generate memory objects using a predefined schema including predefined attributes. For each memory object, values for these attributes may be inferred from user interactions. This inference may leverage one or more machine-learned models to process user inputs and identify salient information to be stored. The predefined attributes may include, but are not limited to, timestamps indicating when the information was obtained, confidence scores reflecting the certainty of the inferred information, tags for categorization and retrieval, and summaries describing the context of the information. The schema may also include fields for storing the raw input data or a representation thereof, along with any supporting evidence used to generate the memory object. Memory model 105 may use these attributes to create structured representations of user interactions, facilitating efficient storage, retrieval, and utilization of contextual information by the machine-learned agent system. The schema may be designed to support various data modalities, allowing for the storage of text, images, audio, or other relevant data types. Furthermore, the schema may incorporate mechanisms for managing the lifespan of memory objects, enabling the system to selectively forget or update information based on user instructions or system-inferred relevance.
[0258] Memory model 105 may generate a schema for storing memory objects dynamically, rather than using a predefined schema. This may involve a machine-learned model trained to predict not only which information should be remembered, but also the optimal attributes to describe that information within a memory object. The model may be trained on a dataset of user interactions and associated metadata, learning to identify patterns and relationships between the content of interactions and the relevant attributes for effective memory representation. For example, the model may learn to associate certain keywords or phrases with specific attribute types, such as associating “favorite restaurant” with attributes like “name,” “location,” “cuisine,” and “last visit.” A machine-learned model can generate an output schema dynamically for each memory object.
[0259] Figure 5 is a block diagram illustrating generation of data for updating a memory object according to example aspects of the present disclosure.
[0260] Input 500 can include a statement “I don’t eat meat - please suggest another recipe.” Input 500 can include data from query 104. Input 500 can include all or a portion of query 104. Input 500 can include a caption or other rewriting of query 104.
[0261] Memory model 105 can update prior memory object 402 to obtain an updated memory object 502. For example, based on prior support, the memory value was inferred to be “prefers vegetarian.” The supporting statements recorded in the support list suggest that the user may be vegetarian but do not preclude other interpretations (e.g., the user was simply interested in a lighter meal option).
[0262] Memory model 105 can provide an input to a machine-learned model containing the new input to obtain an inference indicating a new or updated memory. For example, memory model 105 can provide an input to a machine-learned model containing data from memory object 402 (e.g., from the support record of memory object 402) and input 500 along with an instruction to return an update to the memory based on the new and the old information. The machine-learned model can return a prediction that contains data for updating the contents of memory object 402. The prediction can include an updated memory value, such as “is vegetarian.” The prediction can include an updated confidence value, such as 0.8. The prediction can include an updated support record. The updated support record can include the original support record and the new input.
[0263] Memory model 105 can update memory object 402 based on the prediction. For example, memory model 105 can update the memory value, the confidence value, and the support record. The updated memory object 502 can be stored in interaction memory datastore 110.
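Applying such a prediction to obtain updated memory object 502 can be sketched as follows. The prediction is hard-coded here in place of a model call, and the initial confidence of 0.5, the update date, and the function name are assumptions for illustration.

```python
import copy


def apply_update(memory: dict, prediction: dict, new_support: dict, updated_on: int) -> dict:
    """Return an updated copy; the new input is appended to the support record."""
    updated = copy.deepcopy(memory)
    updated["value"] = prediction.get("value", memory["value"])
    updated["confidence"] = prediction.get("confidence", memory["confidence"])
    updated["support"].append(new_support)
    updated["lastUpdated"] = updated_on
    return updated


memory_402 = {
    "tags": ["food", "health"],
    "value": "prefers vegetarian",
    "created": 20240801,
    "lastUpdated": 20240801,
    "confidence": 0.5,  # assumed initial value; the text says only "relatively low"
    "support": [{"content": "Are there any vegetarian options?", "time": 20240801}],
}

# Stand-in for the machine-learned model's prediction.
prediction = {"value": "is vegetarian", "confidence": 0.8}
new_support = {"content": "I don't eat meat - please suggest another recipe.", "time": 20241015}

memory_502 = apply_update(memory_402, prediction, new_support, updated_on=20241015)
```

Returning a copy rather than mutating in place keeps the prior object available, which may be useful if an update later needs to be reviewed or reverted.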
[0264] In this manner, for instance, memory model 105 can use machine-learned models to refine and update memories over time based on new information and prior context.
[0265] Figure 6 is a block diagram illustrating generation of data for updating a memory object according to example aspects of the present disclosure.
[0266] In an example, input 600 can include a multimodal input containing an audio recording of an utterance “This was the best dish of the entire trip!” paired with a recorded image of the dish. Input 600 can include data from query 104. Input 600 can include all or a portion of query 104.
[0267] Memory model 105 can provide an input to a machine-learned model containing the new input to obtain an inference indicating a new or updated memory value. For example, memory model 105 can provide an input to a machine-learned model containing the new input and the prior memory object 502 to obtain an inference indicating a new or updated memory value for memory object 602. The machine-learned model can return a prediction that contains data for updating the contents of the memory.
[0268] In an example, the dish is a fish-based dish (e.g., nigiri sushi with fish). One possible determination based on the prior memory and the new input is that the prior memory is incorrect and is to be deleted. Another possible determination is that the prior memory is incomplete or lacks nuance and is to be updated.
[0269] In an example, memory model 105 can provide an input to a machine-learned model containing the new input to obtain an inference, and the inference can indicate that a memory of “is pescatarian” is more likely to be a helpful memory as compared to simply “not vegetarian.” The machine-learned model can return a prediction for the memory value of “is pescatarian.”
[0270] In an example, this prediction can be conditioned on the support record. For instance, a record of prior basis for the inference of “is vegetarian” can inform a prediction that the memory value should be “is pescatarian” rather than simply “not vegetarian.” For example, if presented with only “is vegetarian” and an input indicating enthusiasm for a fish-based dish, there may be some uncertainty whether the prior memory was simply incorrect or whether it simply failed to capture a vegetarian diet plus seafood. Maintaining a record of supporting basis for the inferred memory can allow for a re-evaluation of the memory in view of all the evidence. Viewed in context, a prediction of “is pescatarian” may be made with higher confidence.
[0271] In this manner, for instance, memory model 105 can use machine-learned models to refine and update memories over time based on new information and prior context.
[0272] Figure 7 is a communication diagram illustrating communication sequences in an example implementation according to example aspects of the present disclosure. Machine-learned agent system 100 can detect occurrence of one or more trigger(s) 702. A trigger 702 can be a defined event or condition that triggers execution of a memory update cycle 704. Machine-learned agent system 100 can then execute a memory update cycle 704 after some period of time following the occurrence of a trigger 702 (e.g., immediately, or at a scheduled time or after a scheduled interval). Memory update cycle 704 can include execution of operations of machine-learned agent system 100 for adding new or updating existing memory objects in interaction memory datastore 110. Memory update cycle 704 can involve interactions between machine-learned agent system 100, interaction memory datastore 110, and machine-learned model system(s) 118.
[0273] At some time after memory update cycle 704 has been completed, machine-learned agent system 100 can receive query 104. Machine-learned agent system 100 can initiate a recall cycle 106 to retrieve stored memory data from interaction memory datastore 110 for servicing query 104. After recalling pertinent information from interaction memory datastore 110, machine-learned agent system 100 can compose and transmit input data structure 116 to machine-learned model system(s) 118 for inference using machine-learned model(s). Machine-learned agent system 100 can receive output data structure 120 from machine-learned model system(s) 118. Machine-learned agent system 100 can engage in multiple rounds of interaction with machine-learned model system(s) 118 before concluding with an output response 122 to query 104. Machine-learned agent system 100 can engage in other operations in between rounds of interaction with machine-learned model system(s) 118. For instance, machine-learned agent system 100 can engage in operations for executing tasks using other functions or tools. Examples of such functions or tools may include: operating system APIs for file I / O, network communication, or process management;database APIs for data storage and retrieval; cloud computing APIs; natural language processing (NLP) libraries for text analysis; computer vision libraries for image processing; speech recognition and synthesis APIs for audio processing; and application programming interfaces (APIs) for interacting with external services or applications.
[0274] Example triggers 702 may be based on explicit user commands, implicit system inferences, or scheduled events. Explicit user commands may include, but are not limited to, verbal or textual instructions such as “remember this,” “forget that,” “remember [information],” “forget [information],” “add to memory,” “delete from memory,” “save this,” “clear memory,” or other similar directives. These commands may be accompanied by specific data to be remembered or forgotten. Implicit system inferences may be generated by machine-learned models processing user interactions. These models may analyze user input, output, and contextual information to identify patterns and infer information that may be relevant to future interactions. For example, frequent mention of a particular topic or entity may trigger the system to store related information as a memory object. Similarly, the system may infer a user's preference or intention based on their actions or stated goals. Scheduled events may trigger memory updates at predetermined intervals, such as daily, weekly, or monthly. These events may involve reviewing and updating existing memory objects, deleting expired or irrelevant information, or consolidating related memory objects. The system may also employ a combination of these triggers, using explicit commands to prioritize certain information while relying on implicit inferences and scheduled events for maintaining a comprehensive and up-to-date memory datastore. Further, triggers 702 may be based on events external to the user interaction, such as receiving data from a connected device or an external system. For example, a sensor reading exceeding a threshold may trigger the system to store the sensor data as a memory object. The system may also be configured to receive external commands or instructions to add, update, or delete memory objects. The system may prioritize certain triggers over others based on configurable parameters or system-learned heuristics. 
For example, explicit user commands may have higher priority than implicit inferences. The system may also incorporate mechanisms for handling conflicting triggers, such as resolving conflicts based on recency, confidence scores, or user-defined rules.
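For illustration only, the trigger prioritization and conflict handling described above might be sketched as follows. The names `Trigger`, `TRIGGER_PRIORITY`, and `resolve_triggers` are hypothetical, and the full ordering shown is an assumption: the description states only that explicit user commands may outrank implicit inferences, with recency and confidence scores as possible tie-breakers.

```python
from dataclasses import dataclass

# Assumed ordering (lower value = handled first). Only "explicit over implicit"
# comes from the text; the remaining ranks are illustrative.
TRIGGER_PRIORITY = {"explicit": 0, "external": 1, "implicit": 2, "scheduled": 3}


@dataclass(frozen=True)
class Trigger:
    kind: str            # one of the TRIGGER_PRIORITY keys
    payload: str         # information to add, update, or delete
    timestamp: float     # time the trigger occurred
    confidence: float = 1.0


def resolve_triggers(triggers):
    """Order triggers by priority, then most recent first, then confidence."""
    return sorted(
        triggers,
        key=lambda t: (TRIGGER_PRIORITY[t.kind], -t.timestamp, -t.confidence),
    )


pending = [
    Trigger("implicit", "likes fish dishes", timestamp=200.0, confidence=0.6),
    Trigger("explicit", "remember: is pescatarian", timestamp=100.0),
    Trigger("scheduled", "weekly consolidation", timestamp=300.0),
]
ordered = resolve_triggers(pending)
print([t.kind for t in ordered])
```

In this sketch the explicit command is handled first even though it is the oldest trigger, because priority is compared before recency.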
[0275] The receipt of an input (e.g., query 104) may be a trigger 702. The machine-learned agent system 100 may initiate memory update cycle 704 upon receiving said input. This cycle may include several steps. First, the system may analyze query 104 using one or more machine-learned models within machine-learned model system 118. This analysis may identify information within query 104 that may be relevant for updating or adding to the interaction memory datastore 110. This identification may be based on criteria such as the presence of explicit memory instructions (e.g., “remember this”), the identification of novel or important information, or the detection of patterns or trends in user interactions. The system may then extract the identified information, which may include text, images, audio, or other data modalities. This extracted information may be further processed to generate one or more memory objects 202, each containing a memory value 204 and associated metadata 206. The generation of memory objects may involve the use of additional machine-learned model inferences to summarize, categorize, or otherwise structure the extracted information. The system may then update interaction memory datastore 110 by adding the newly generated memory objects or updating existing objects based on the newly processed information. The update process may involve various operations such as insertion, modification, or deletion of memory objects. The system may also utilize a conflict resolution mechanism to handle potential conflicts between new and existing memory objects. This mechanism may prioritize information based on recency, confidence scores, or other criteria. Finally, the system may log the memory update operation for auditing or debugging purposes. The entire memory update cycle 704 may be executed either synchronously with the processing of query 104 or asynchronously, potentially during periods of low system load. 
The system may employ a queuing mechanism to manage asynchronous memory updates.
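As a non-limiting sketch, a queuing mechanism for asynchronous memory updates might look like the following. The class `MemoryUpdateQueue` and the function `extract_explicit` are hypothetical names introduced for illustration; `extract_explicit` is a trivial stand-in for the machine-learned model inference described above, and the datastore is assumed to be a plain list.

```python
from collections import deque


class MemoryUpdateQueue:
    """FIFO queue of inputs awaiting an asynchronous memory update cycle."""

    def __init__(self):
        self._pending = deque()
        self.audit_log = []  # record of write operations for auditing

    def enqueue(self, text):
        self._pending.append(text)

    def drain(self, extract, datastore):
        """Process queued inputs in arrival order; return the number of writes."""
        written = 0
        while self._pending:
            text = self._pending.popleft()
            for value in extract(text):
                datastore.append({"value": value, "source": text})
                self.audit_log.append(("insert", value))
                written += 1
        return written


def extract_explicit(text):
    """Stand-in for model inference: keep text after an explicit 'remember'."""
    prefix = "remember "
    if text.lower().startswith(prefix):
        return [text[len(prefix):]]
    return []


queue = MemoryUpdateQueue()
datastore = []
queue.enqueue("Remember I am pescatarian")
queue.enqueue("What is the weather today?")
writes = queue.drain(extract_explicit, datastore)
```

Inputs are queued as they arrive, and `drain` could be invoked later, for example during a period of low system load.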
[0276] In an example implementation, machine-learned agent system 100 may conduct a systematic review of interactions within a preceding time interval (e.g., hourly, daily) to identify content for memory updates. This review may be initiated by a scheduler component that triggers memory update cycle 704 at predetermined intervals. The scheduler may be configured to operate asynchronously, independent of user interactions.
[0277] During the interval, the recent interactions can be cached so that all information is retained in an interaction log. For instance, each turn in a multi-turn dialog between an agent and a user can be stored in a transcript. The transcript can be used to condition responses from the agent to the user until the cached record is parsed into memory objects for storage in interaction memory datastore 110.
[0278] Upon initiation, the system may access the cache of recent interactions stored in a separate datastore. This log may contain a chronological record of user inputs and system outputs, including timestamps, data modalities (e.g., text, audio, images), and other relevant metadata. Machine-learned agent system 100 may then employ one or more machine-learned models within machine-learned model system 118 to analyze the interaction log. For example, all or part of the interaction log can be input to a machine-learned model to generate an output that returns information to remember. Using this information, machine-learned agent system 100 can then update memory datastore 110.
[0279] Machine-learned agent system 100 may utilize multiple inference cycles to decompose complex queries and leverage memory context. A first inference cycle may involve inputting query 104 and retrieved memory data 114 to a machine-learned model within machine-learned model system 118. The model may generate an intermediate output data structure 120 representing an analysis of the query in light of the memory context. This analysis may include identifying relevant memory objects, assessing their significance, and formulating intermediate hypotheses or conclusions. Subsequent inference cycles may then be initiated, each using the previous cycle's output data structure 120 as input, along with additional memory data 114 or refined query aspects. Each cycle may focus on a specific aspect of the query or refine previous conclusions. For example, subsequent cycles may involve verifying hypotheses, resolving ambiguities, or exploring alternative interpretations based on the available memory context. The number of cycles may be dynamically determined based on factors such as query complexity, available memory, and confidence thresholds. The final inference cycle may synthesize the results from previous cycles to generate a comprehensive response 122. Intermediate outputs from the inference cycles may be retained for logging and analysis, but may not be directly presented to the user. This iterative process may improve the accuracy and explainability of the system's responses to complex queries.
[0280] The iterations can be used to process sub-tasks of an overall task. For example, machine-learned agent system 100 may use one or more tools via their application programming interfaces (APIs) to perform sub-tasks within a larger user request. For example, the system may receive a query to create a calendar event. The API definition may require, in each call, details like time, location, and participants. Machine-learned agent system 100 may use recalled memory data to populate parameters such as the user's default calendar, preferred notification settings, or frequently contacted participants. One or more recall cycles 106 can be executed within an inference iteration.
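By way of a minimal sketch, populating tool-call parameters from recalled memory data might proceed as shown below. The function `build_event_call`, the field names, and the sample values are assumptions made for illustration and do not correspond to any particular calendar API.

```python
REQUIRED_FIELDS = ("time", "location", "participants")


def build_event_call(request, recalled):
    """Merge request fields with recalled defaults; explicit request values win."""
    params = {**recalled, **request}
    missing = [name for name in REQUIRED_FIELDS if name not in params]
    return params, missing


recalled = {
    "calendar": "personal",
    "notification_minutes": 15,
    "participants": ["Sam"],
}
request = {"time": "2025-01-10T09:00", "location": "Room 4"}
params, missing = build_event_call(request, recalled)
```

If `missing` were non-empty, the agent system could execute a further recall cycle or ask the user for the outstanding details before issuing the call.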
[0281] Another example sub-task can engage image processing tools. A user may request an image editing operation to be performed on an image (e.g., adjusting brightness or contrast, color tone, white balance, cropping, retouching). The agent system may use an image editing API to perform the operations. The API calls may require detailed information not provided in the user’s initial, high-level request. Memory objects may store the user’s preferred image formats or editing styles, or records of prior editing tasks approved by the user, which may be used to populate parameters for API calls to image editing services.
[0282] Figure 8 is a block diagram of an example memory update cycle 704 according to example aspects of the present disclosure. During a memory update cycle, machine-learned agent system 100 can provide, in one or more input data structures, cached input(s) 802 to machine-learned model system(s) 118 for processing. Cached input(s) 802 can include values cached from interactions between machine-learned agent system 100 and a user. Machine-learned agent system 100 (e.g., memory model 105) can instruct one or more machine-learned model(s) within machine-learned model system(s) 118 to analyze cached input(s) 802 to evaluate whether to extract information for storing in a memory. In response, optionally after one or more inference iterations performing intermediate analysis, machine-learned model system(s) 118 can provide, in one or more output data structures, memory data update(s) 804 to machine-learned agent system 100. Memory data update(s) 804 can include information extracted or inferred from cached input(s) 802. Memory model 105 can receive and parse memory data update(s) 804 for storage in interaction memory datastore 110. Memory model 105 can execute a memory data write 806 to add or update memory data objects in interaction memory datastore 110.
[0283] In some examples, machine-learned models configured to process natural language (among other modalities) may be used to extract information for storing in interaction memory datastore 110. For example, natural language prompts may be designed to elicit salient information for memory storage from cached user interactions. These prompts may leverage various techniques, including instruction-following prompts, few-shot learning prompts, and chain-of-thought prompts. Instruction-following prompts may directly instruct the model to identify key pieces of information from a given input, for example: “Identify the three most important pieces of information in the following conversation that should be remembered for future reference: [cached conversation transcript]”. Such prompts may specify the desired format for the extracted information, such as a list of bullet points, a structured JSON object, or a concise summary. The prompt may also include constraints or criteria for selecting salient information, such as specifying a time horizon for relevance or limiting the output to specific data types.
[0284] Few-shot learning prompts may provide the model with several examples of input-output pairs demonstrating the desired behavior. Each example may consist of a sample cached interaction and the corresponding extracted salient information. The prompt may then present a new cached interaction and request the model to generate the corresponding output based on the provided examples. For instance, the prompt may include several examples of conversations with extracted key phrases, followed by a new conversation and the instruction: “Based on the previous examples, extract the salient information from the following conversation that should be remembered: [cached conversation transcript]”.
[0285] Chain-of-thought prompting may guide the model through a step-by-step reasoning process to extract salient information. The prompt may break down the task into smaller sub-tasks, such as identifying key entities, relationships, and actions within the cached interaction. The model may then be prompted to reason about the identified elements and determine which information is most important for remembering. For example, the prompt may include instructions such as: “First, identify the key entities mentioned in the following conversation. Next, describe the relationships between these entities. Finally, summarize the most important information that should be remembered: [cached conversation transcript]”.
[0286] Furthermore, prompts may incorporate metadata associated with the cached inputs to guide the extraction process. This metadata may include timestamps, user identifiers, interaction types, or other relevant context. The prompt may instruct the model to consider this metadata when identifying salient information. For example, a prompt may include: “Extract the important information from the following conversation, considering the timestamp and user ID. Prioritize information relevant to the current task: [cached conversation transcript], Timestamp: [timestamp], User ID: [user ID]”. The inclusion of metadata may improve the accuracy and relevance of the extracted information. The prompts may also specify constraints on the length or format of the extracted information. For instance, the prompt may limit the output to a maximum number of characters or words, or require the output to be structured in a specific format, such as a key-value pair or a list of items.
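For illustration, the prompt styles described above could be assembled programmatically along the following lines. The function `build_extraction_prompt`, its parameters, and the exact wording of the instruction are hypothetical; the sketch only combines optional few-shot examples, an instruction with a format constraint, and optional metadata into one prompt string.

```python
def build_extraction_prompt(transcript, timestamp=None, user_id=None,
                            max_items=3, examples=()):
    """Assemble an instruction-following prompt, optionally with few-shot examples."""
    parts = []
    for sample_transcript, sample_output in examples:
        parts.append(f"Conversation: {sample_transcript}\nRemember: {sample_output}")
    parts.append(
        f"Identify up to {max_items} pieces of information in the following "
        "conversation that should be remembered for future reference. "
        "Return a JSON list of strings."
    )
    parts.append(f"Conversation: {transcript}")
    if timestamp is not None:
        parts.append(f"Timestamp: {timestamp}")
    if user_id is not None:
        parts.append(f"User ID: {user_id}")
    return "\n\n".join(parts)


prompt = build_extraction_prompt(
    "User: I loved the salmon, but I still avoid meat.",
    timestamp="2025-01-10T09:00",
    user_id="user-123",
    examples=[("User: Call me Alex.", '["prefers to be called Alex"]')],
)
```

The resulting string would be placed in an input data structure for the machine-learned model; how the model's reply is parsed is outside the scope of this sketch.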
[0287] Each memory data update 804 may include one or more updates to one or more memory objects or one or more new memory objects. Memory data update 804 can be communicated in a structured format, such as JSON or XML. Each update may include a unique identifier, a timestamp indicating the time of the update, and a set of key-value pairs representing the updated memory data. The key-value pairs may represent various attributes of the memory object, such as tags for categorization, confidence scores reflecting the certainty of the information, and summaries providing contextual information. The update may also include a field indicating the type of update, such as “add,” “modify,” or “delete,” to indicate which operations are to be performed on the interaction memory datastore 110. The update may contain a reference to the original memory object being updated. The data types of the values may vary depending on the nature of the memory information, including text strings, numerical values, embedded vectors, image data, audio data, or other data modalities.
[0288] Memory data update 804 may include instructions to perform a sequence of operations, each specifying a modification to the memory datastore. Each operation may include an operation type, such as “insert,” “update,” or “delete,” and the parameters necessary to perform the operation. For example, an “insert” operation may specify the new memory object to be added, while an “update” operation may specify the identifier of the memory object to be modified and the new values for its attributes. A “delete” operation may specify the identifier of the memory object to be removed.
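A minimal sketch of applying such a sequence of operations is shown below, assuming a JSON payload with an `operations` list and a datastore modeled as a dictionary keyed by memory object identifier. The field names (`type`, `id`, `attributes`) and the function `apply_memory_update` are illustrative assumptions, not a format defined by the disclosure.

```python
import json


def apply_memory_update(datastore, update_json):
    """Apply insert/update/delete operations to a dict keyed by object id."""
    for op in json.loads(update_json)["operations"]:
        kind, object_id = op["type"], op["id"]
        if kind == "insert":
            datastore[object_id] = dict(op["attributes"])
        elif kind == "update":
            datastore[object_id].update(op["attributes"])
        elif kind == "delete":
            datastore.pop(object_id, None)
        else:
            raise ValueError(f"unknown operation type: {kind}")
    return datastore


store = {
    "m1": {"value": "is vegetarian", "confidence": 0.7},
    "m2": {"value": "lives in Oslo", "confidence": 0.4},
}
update = json.dumps({
    "operations": [
        {"type": "update", "id": "m1",
         "attributes": {"value": "is pescatarian", "confidence": 0.9}},
        {"type": "insert", "id": "m3",
         "attributes": {"value": "prefers morning meetings", "confidence": 0.8}},
        {"type": "delete", "id": "m2"},
    ]
})
apply_memory_update(store, update)
```

Operations are applied in the order listed, so a later operation in the same update can act on an object inserted by an earlier one.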
[0289] Figure 9 is a block diagram of an example memory update cycle 704 according to example aspects of the present disclosure. As compared to Figure 8, the example cycle in Figure 9 generates memory updates based on a current reading of memory objects with a current memory data read operation 902. For instance, machine-learned agent system 100 can retrieve one or more current memory objects from interaction memory datastore 110. The retrieved memory objects can include all stored memory objects or just objects retrieved based on relevance to cached inputs 802.
[0290] The retrieved memory objects can be passed as current memory data 904 to machine-learned model system(s) 118 for processing. For example, machine-learned model system(s) 118 can process current memory data 904 along with cached input(s) 802 to generate and provide memory data update(s) 804 to machine-learned agent system 100. In this manner, for instance, machine-learned model system(s) 118 can generate inferences of memory updates in view of existing memories.
[0291] Figures 8 and 9 provide examples of different approaches to memory management. For instance, by updating a memory in view of existing memories, the approach shown in Figure 9 can perform conflict resolution, deduplication, or other updates to consolidate memory data prior to a runtime inference for a given query. However, the memory update operations can be more expensive, depending on the size of the current interaction memory datastore 110. In contrast, memory updates performed without accessing the existing memories may be less expensive, as they may primarily focus on ingesting recent information and adding new memories. This can offload the conflict resolution / interpretation task to a runtime execution of one or more models for processing a given query. While this can lead to the runtime inference processing more memory information, it can avoid propagating any intermediate errors in memory consolidation / conflict resolution. Runtime queries can be processed in full view of all retrieved memories. Memories that appear more often, or in a more emphatic presentation (e.g., explicit instructions to remember), can be stronger signals in the conditioning context that outweigh weaker conflicting signals.
[0292] Figure 10 is a block diagram of an example memory update cycle 704 according to example aspects of the present disclosure. Memory update cycles 704 can be performed at various times independent of the receipt of new information to review. For example, some memory update cycles 704 can be memory maintenance cycles that are used to prune, consolidate, or otherwise organize memory objects. In general, machine-learned agent system 100 (e.g., memory model 105) can transmit instructions for performing a memory update in memory update request(s) 1002. The instructions can be transmitted along with current memory data 904 to cause a machine-learned model of machine-learned model system(s) 118 to generate one or more memory data update(s) 804 for execution.
[0293] The system may perform memory maintenance operations to simulate forgetting data, either based on explicit user instructions or implicit predictions. Explicit forgetting may be triggered by user commands such as “forget that,” “delete this conversation,” or “forget information related to [topic].” These commands may specify data to be deleted directly or indirectly, e.g., by specifying a time window (“forget everything from last week”) or a topic (“forget details of my last doctor's appointment”). The system may interpret these instructions using natural language processing (NLP) techniques and may leverage machine-learned models to identify the relevant data to delete from the interaction memory datastore. Deletion may be immediate or scheduled for a later time.
[0294] Implicit forgetting may involve the system predicting which data is no longer relevant or likely to be needed. This prediction may be based on factors such as data age, frequency of access, and contextual relevance. The system may employ machine-learned models trained on data usage patterns to assign a relevance score to each memory object. Memory objects with scores below a predefined threshold may be candidates for deletion. The system may prioritize deleting less relevant or older data to manage storage space and computational resources.
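As one hedged illustration, a relevance score combining data age and frequency of access could be computed as below. The scoring formula (exponential age decay with an assumed 30-day half-life, boosted by a logarithm of the access count) and the threshold value are assumptions chosen for this sketch; the description contemplates machine-learned models for this purpose and does not prescribe a formula.

```python
import math


def relevance_score(age_days, access_count, half_life_days=30.0):
    """Assumed heuristic: exponential age decay boosted by access frequency."""
    decay = 0.5 ** (age_days / half_life_days)
    return decay * (1.0 + math.log1p(access_count))


def deletion_candidates(objects, threshold=0.25):
    """Return ids of memory objects whose relevance falls below the threshold."""
    return [
        obj["id"]
        for obj in objects
        if relevance_score(obj["age_days"], obj["access_count"]) < threshold
    ]


objects = [
    {"id": "m1", "age_days": 0, "access_count": 0},    # new, never recalled
    {"id": "m2", "age_days": 90, "access_count": 0},   # old, never recalled
    {"id": "m3", "age_days": 90, "access_count": 20},  # old, recalled often
]
candidates = deletion_candidates(objects)
```

Under these assumed parameters, only the old and never-recalled object falls below the threshold; the equally old but frequently recalled object is retained.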
[0295] The system may also employ a time-based forgetting mechanism. Each memory object may be associated with a timestamp and an optional expiration time. Memory objects exceeding their expiration time may be automatically deleted during scheduled maintenance operations. The expiration time may be specified explicitly by the user or implicitly determined by the system based on predicted expiration. The system may use machine-learned models to predict appropriate expiration times based on data usage patterns and contextual information. The system may implement different forgetting strategies for various data types and contexts. For example, highly sensitive data may have shorter expiration times, while less sensitive data may be retained for longer periods. The system may provide users with configurable parameters to control the time-based forgetting mechanism, allowing them to customize the system's memory retention policies. These parameters may include setting default expiration times for different data categories or specifying a maximum age for retaining data. The system may log all memory maintenance operations, including explicit deletions and implicit removals, for auditing and debugging purposes.
[0296] Memory model 105 may implement gradual “forgetting.” For example, for an initial period of time, a memory object can be stored in full detail. After a period of time, in a maintenance update cycle, such as upon expiration of some interval after ingestion or use of memory data, memory model 105 can replace a value in a stored memory object with a noised, reduced resolution, or otherwise less detailed version of values in the memory object (e.g., replacing document content with summaries, downsampling images or audio recordings or adding noise, etc.). For example, in a scenario in which the user indicates a dietary restriction, periodically the memory model 105 may purge information identifying the exact restriction. This can facilitate the maintenance of fresh context information, such as by deleting or replacing stale context.
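The time-based mechanism can be sketched as a scheduled maintenance pass, under the assumption that each memory object is a dictionary with a `timestamp` and an optional `expires_at` field. The function `purge_expired`, the field names, and the `default_ttl` parameter are hypothetical and stand in for the configurable retention parameters described above.

```python
def purge_expired(objects, now, default_ttl=None):
    """Split objects into kept and expired; return kept objects and a log."""
    kept, log = [], []
    for obj in objects:
        expires_at = obj.get("expires_at")
        if expires_at is None and default_ttl is not None:
            expires_at = obj["timestamp"] + default_ttl
        if expires_at is not None and now > expires_at:
            log.append(("expired", obj["id"]))
        else:
            kept.append(obj)
    return kept, log


objects = [
    {"id": "m1", "timestamp": 0, "expires_at": 50},  # explicit expiration
    {"id": "m2", "timestamp": 80},                   # falls back to default_ttl
    {"id": "m3", "timestamp": 10},                   # falls back to default_ttl
]
kept, log = purge_expired(objects, now=100, default_ttl=60)
```

The returned log corresponds to the logging of maintenance operations for auditing; objects with neither an explicit expiration nor a default are kept indefinitely in this sketch.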
[0297] Gradual “forgetting” may enable a gentle rolloff of a designated memory duration. For example, a hard memory deletion threshold can lead to disturbances in user experience when information suddenly is no longer available. In contrast, a gentle rolloff can lead to interactions that confirm the details. For example, in the dietary restriction example, the memory model may refresh its memory by confirming details, such as by rendering a request for confirmation, “I recall you previously had a dietary restriction - what was the restriction?” A gradual memory rolloff may also provide for improved resource utilization. For example, older memories may be progressively represented with smaller and smaller amounts of data. After an interaction reaffirming the memory, the “age” of the memory can be reset.
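A simplified sketch of the gradual rolloff, using the dietary restriction example, is given below. The functions `maintain` and `reaffirm` and the fields `summary`, `detail`, and `last_affirmed` are hypothetical; a single reduction step from a detailed value to a coarse summary is assumed, although several progressively coarser levels could be used.

```python
def maintain(obj, now, detail_ttl):
    """Replace a stale detailed value with its coarse summary."""
    stale = now - obj["last_affirmed"] > detail_ttl
    if stale and obj["detail"] == "full":
        return {**obj, "value": obj["summary"], "detail": "reduced"}
    return obj


def reaffirm(obj, value, now):
    """Restore full detail and reset the memory's age after confirmation."""
    return {**obj, "value": value, "detail": "full", "last_affirmed": now}


memory = {
    "value": "is pescatarian",
    "summary": "has a dietary restriction",
    "detail": "full",
    "last_affirmed": 0,
}
faded = maintain(memory, now=100, detail_ttl=30)
restored = reaffirm(faded, "is pescatarian", now=100)
```

After the maintenance pass the stored value retains only that a restriction exists, which is what would prompt a confirmation request; the user's answer restores the detail and resets the age.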
[0298] A configuration for “forgetting” information may be applied to create different versions of memory data. For instance, some applications may benefit from perfect memory (e.g., without “forgetting” anything). For instance, a user may prefer that an agent system maintain perfect memory when acting as a personal assistant to assist the user in performing tasks and remembering information. A configuration of the agent system can preserve access to an intact interaction memory datastore 110 for certain operations. However, for other operations, the agent system may access an imperfect or fuzzy memory. For example, a user may prefer that the agent system only leverage a fuzzy or noisy memory when executing operations that interact with certain external applications, devices, or services. For instance, the agent system may provide limited context when communicating with certain systems. Interaction memory datastore 110 can store multiple versions of memory content, with various levels of noise or obfuscation applied, to enable retrieval and use of various qualities of memory data.
[0299] Memory model 105 may consolidate related memory objects, merging redundant or overlapping information to reduce storage overhead. This consolidation may be performed periodically, such as during scheduled maintenance operations, or triggered by specific events, such as the detection of a high density of memory objects exceeding a predefined threshold. The consolidation process may employ machine-learned models to identify memory objects with significant overlap in content or semantic meaning. These models may analyze the memory values and metadata associated with the memory objects, using techniques such as semantic similarity measures or clustering algorithms to group related objects.
[0300] The memory model 105 may merge identified related memory objects into a single, more concise representation. This merging may involve combining memory values, such as concatenating text strings, replacing single image objects with an object storing multiple images, etc. The memory model 105 may also integrate metadata from multiple memory objects, potentially resolving conflicts or inconsistencies through weighted averaging or other conflict resolution techniques. For example, conflicting timestamps may be resolved by selecting the most recent timestamp, or by creating a composite timestamp reflecting the range of recorded times. Conflicting memory values may be resolved by selecting the value with the highest confidence score, or by generating a new value that synthesizes information from multiple sources. The memory model 105 may prioritize merging memory objects with high similarity scores and high confidence values, while preserving information from highly reliable sources.
[0301] The system may employ various strategies for merging memory objects, depending on the data modalities and the nature of the overlap. For textual data, the system may concatenate strings, summarize content, or generate a more concise representation using techniques such as sentence compression or abstractive summarization. For numerical data, the system may compute averages, medians, or other statistical measures to combine values. For multimodal data, the system may integrate information from different modalities, such as combining textual descriptions with image or audio data.
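For illustration, merging a group of related memory objects with two of the conflict resolution options mentioned above (selecting the highest-confidence value, and recording both the most recent timestamp and a composite range) might be sketched as follows. The function `merge_memory_objects` and the dictionary fields are assumptions; identifying which objects are related is taken as already done.

```python
def merge_memory_objects(objects):
    """Merge related objects: highest-confidence value, composite timestamp."""
    best = max(objects, key=lambda o: o["confidence"])
    times = [o["timestamp"] for o in objects]
    return {
        "value": best["value"],
        "confidence": best["confidence"],
        "timestamp": max(times),                  # most recent wins
        "timestamp_range": (min(times), max(times)),
        "tags": sorted({tag for o in objects for tag in o["tags"]}),
        "evidence": [item for o in objects for item in o["evidence"]],
    }


related = [
    {"value": "is vegetarian", "confidence": 0.6, "timestamp": 10,
     "tags": ["diet"], "evidence": ["ordered a veggie burger"]},
    {"value": "is pescatarian", "confidence": 0.9, "timestamp": 42,
     "tags": ["diet", "food"], "evidence": ["praised a salmon dish"]},
]
merged = merge_memory_objects(related)
```

Supporting evidence from every source object is carried into the merged object, so a later re-evaluation can still consider the full record.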
[0302] In some embodiments, the system may manage memory object size by splitting large memory objects into smaller, distinct memory objects. This may be particularly beneficial when a large memory object accumulates substantial supporting evidence over time, suggesting the presence of multiple distinct sub-memories within the original object. The process may involve analyzing the supporting evidence associated with the large memory object using one or more machine-learned models. These models may identify clusters or subgroups within the supporting evidence, each representing a distinct aspect or sub-memory related to the original memory object.
[0303] Each identified cluster of supporting evidence may then be used to generate a new, smaller memory object. The new memory objects may each contain a subset of the original memory object’s supporting evidence, along with a refined memory value reflecting the specific aspect represented by the cluster. The refined memory values may be generated using machine-learned models, which may summarize, abstract, or otherwise distill the information contained within each cluster of supporting evidence. The system may also assign new metadata to each smaller memory object, reflecting its specific content and context. This metadata may include refined tags, confidence scores, and timestamps, reflecting the specific supporting evidence associated with each new object.
[0304] The splitting process may be performed periodically, such as during scheduled memory maintenance operations, or triggered by specific events, such as the detection of a large memory object exceeding a predefined size threshold or accumulating a significant amount of supporting evidence. The system may prioritize splitting memory objects with high confidence scores and substantial supporting evidence, ensuring that the resulting smaller memory objects are accurate and reliable. The system may also maintain links or relationships between the original large memory object and the newly created smaller memory objects, allowing for efficient retrieval and contextualization of information. This may involve creating a hierarchical structure within the interaction memory datastore, where the original large memory object serves as a parent object, and the smaller memory objects represent its children. This hierarchical structure may facilitate efficient querying and retrieval of information, allowing the system to access both the overall summary and the specific details associated with each sub-memory. The system may employ various strategies for managing the relationships between the original and split memory objects, such as using pointers, linked lists, or graph-based structures.
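A minimal sketch of splitting a memory object and maintaining parent-child links is shown below. The function `split_memory_object` and the field names are hypothetical, and `by_topic` is a trivial stand-in for the machine-learned clustering of supporting evidence; each child's value is simply the cluster label rather than a model-generated refinement.

```python
from collections import defaultdict


def split_memory_object(parent, cluster_of):
    """Group evidence into clusters and emit one child object per cluster."""
    clusters = defaultdict(list)
    for item in parent["evidence"]:
        clusters[cluster_of(item)].append(item)
    children = [
        {"id": f"{parent['id']}.{index}", "parent_id": parent["id"],
         "value": label, "evidence": items}
        for index, (label, items) in enumerate(sorted(clusters.items()))
    ]
    linked_parent = {**parent, "child_ids": [c["id"] for c in children]}
    return linked_parent, children


def by_topic(item):
    """Stand-in for model-based clustering: use the text before the colon."""
    return item.split(":", 1)[0]


parent = {
    "id": "m7",
    "value": "travel preferences",
    "evidence": ["flights: prefers aisle seats", "hotels: prefers quiet rooms",
                 "flights: avoids red-eye departures"],
}
linked_parent, children = split_memory_object(parent, by_topic)
```

The parent keeps its summary value and gains a list of child identifiers, while each child records its parent, giving the two-level hierarchy described above.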
[0305] Figure 11 depicts a flowchart of a method 1100 for training one or more machine-learned models according to aspects of the present disclosure. For instance, an example machine-learned model can include any one or more of the machine-learned agent system, machine-learned preprocessing model, machine-learned description generation model, machine-learned context classification model, or any other machine-learned model or component described herein.
[0306] One or more portion(s) of example method 1100 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 1100 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 1100 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. Figure 11 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 11 is described with reference to elements / terms described with respect to other systems and figures for illustrative purposes and is not meant to be limiting. One or more portions of example method 1100 can be performed additionally, or alternatively, by other systems.
[0307] At 1102, example method 1100 can include obtaining a training instance. A set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or a testing dataset). A training instance can be labeled or unlabeled. Although referred to in example method 1100 as a “training” instance, it is to be understood that runtime inferences can form training instances when a model is trained using an evaluation of the model’s performance on that runtime instance (e.g., online training / learning). Example data types for the training instance and various tasks associated therewith are described throughout the present disclosure.
[0308] At 1104, example method 1100 can include processing, using one or more machine-learned models, the training instance to generate an output. The output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models.
[0309] At 1106, example method 1100 can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function. Various determinations of loss can be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, contrastive loss, or various other loss functions. The evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g., semi- or self-supervised learning), or without labels (e.g., unsupervised learning). The evaluation signal can be a reward (e.g., for reinforcement learning). The reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received. The reward can be computed using feedback data describing human feedback on the output(s).
[0310] At 1108, example method 1100 can include updating the machine-learned model using the evaluation signal. For example, values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation. For example, the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)). For example, system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. Example method 1100 can include implementing a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
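By way of illustration only, elements 1102 through 1108 can be sketched as a minimal gradient-descent loop. The sketch assumes a two-parameter linear model, a squared-error evaluation signal, and a hand-derived gradient; the function names, learning rate, and data are hypothetical and are not part of the disclosure.

```python
# Illustrative sketch only: a linear model y = w * x + b trained with a
# squared-error evaluation signal and per-instance gradient descent.
# All names and values here are hypothetical.

def train(instances, lr=0.05, epochs=200):
    w, b = 0.0, 0.0                      # initialized state
    for _ in range(epochs):
        for x, y in instances:           # 1102: obtain a training instance
            pred = w * x + b             # 1104: process it to generate an output
            err = pred - y               # 1106: evaluation signal is err ** 2
            grad_w = 2.0 * err * x       # gradient of err ** 2 with respect to w
            grad_b = 2.0 * err           # gradient of err ** 2 with respect to b
            w -= lr * grad_w             # 1108: update the parameters
            b -= lr * grad_b
    return w, b


def mean_squared_error(instances, w, b):
    return sum((w * x + b - y) ** 2 for x, y in instances) / len(instances)


# Toy data drawn from y = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(DATA)
```

In this toy setting the learned parameters approach the values used to build the data; larger models replace the hand-derived gradient with backpropagation.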
[0311] In some implementations, example method 1100 can be implemented for training a machine-learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).
[0312] In some implementations, example method 1100 can be implemented for particular stages of a training procedure. For instance, in some implementations, example method 1100 can be implemented for pre-training a machine-learned model. Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks / data types.
[0313] In some implementations, example method 1100 can be implemented for fine-tuning a machine-learned model. Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data. Fine-tuning can affect all or a portion of the parameters of a machine-learned model. For example, various portions of the machine-learned model can be “frozen” for certain training stages. For example, parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)). In some implementations, example method 1100 uses adapter modules. Adapters can be small trainable layers that are inserted between pre-existing layers of a pre-trained model. During the fine-tuning process, the original parameters of the pre-trained model are typically frozen, and only the parameters of the adapters are updated.
[0314] In some implementations, example method 1100 can be implemented to execute parameter-efficient fine-tuning methods, such as Low Rank Adaptation (LoRA). LoRA can refine pre-trained models with minimal adjustments to the original parameters. This can be achieved by introducing trainable low-rank matrices that modify the behavior of the pre-trained weights without directly altering them. In some implementations, during fine-tuning, only these auxiliary matrices are updated, which significantly reduces the number of parameters that are trained.
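As a non-limiting sketch of the low-rank idea described above, a frozen weight matrix W can be paired with two small trainable matrices A and B so that the layer computes W x + scale * B (A x). The class name, the deterministic initialization of A, and the scale parameter are assumptions made for illustration; practical implementations typically initialize A randomly and B to zero.

```python
# Illustrative sketch only: a linear layer with a low-rank (LoRA-style) update.
# The frozen weight is never modified; only A and B would be trained.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    def __init__(self, weight, rank, scale=1.0):
        self.weight = weight                       # frozen, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        # A: rank x d_in, small deterministic values (hypothetical init).
        self.A = [[0.01 * (i + j + 1) for j in range(d_in)] for i in range(rank)]
        # B: d_out x rank, zeros so the adapter starts as a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.weight, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))        # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
initial_output = layer.forward(x)
```

With a rank of 1 on a 4 x 4 weight, the sketch trains 8 values rather than 16; the saving grows with the size of the frozen matrix.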
[0315] An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use.
[0316] Figure 12 is a block diagram of an example processing flow for using machine-learned model(s) 1 to process input(s) 2 to generate output(s) 3.
[0317] Machine-learned model(s) 1 can be or include one or multiple machine-learned models or model components. Example machine-learned models can include neural networks (e.g., deep neural networks). Example machine-learned models can include non-linear models or linear models. Example machine-learned models can use other architectures in lieu of or in addition to neural networks. Example machine-learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc.
[0318] Machine-learned model(s) 1 can be or include, or otherwise be representative of any one or more of the machine-learned models described above with respect to the preceding figures. For example, machine-learned model(s) 1 can be or include, or otherwise be representative of any one or more of the machine-learned models described herein used by machine-learned agent system 100, including any one or more of model(s) operated by machine-learned model system 118. Although various features, variations, and implementations described below are described with respect to machine-learned model(s) 1 for the sake of concision, it is to be understood that such features, variations, and combinations and other implementations thereof are to be understood as described with respect to any such machine-learned model described herein.
[0319] Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks. Example neural networks can be deep neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention or cross attention. For example, some example machine-learned models can include multi-headed self-attention models, multi-query self-attention models, or other attention mechanisms.
[0320] Machine-learned model(s) 1 can include a single or multiple instances of the same model configured to operate on data from input(s) 2. Machine-learned model(s) 1 can include multiple different models or multiple different model portions configured to operate on data from input(s) 2.
[0321] Machine-learned model(s) 1 can include an ensemble of different models that can cooperatively interact to process data from input(s) 2. For example, a model ensemble can include multiple models that have different attributes (e.g., different architectures, trained with different recipes, etc.). The ensemble can output an overall output based on the individual outputs of the constituent models. In this manner, for instance, the diverse constituent models can work together to provide system-level robustness by effectively aggregating over individual strengths and weaknesses of any given model. The respective individual outputs can be combined in a weighted combination, using a voting or routing mechanism, or a learned output layer (e.g., one or more feedforward or fully-connected layers).
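For illustration, two of the combination strategies mentioned above (a weighted combination and a voting mechanism) can be sketched as follows. The function names and example values are hypothetical, and a learned output layer or routing mechanism is not shown.

```python
# Illustrative sketch only: combining outputs of constituent ensemble models.
from collections import Counter


def weighted_combination(outputs, weights):
    """Weighted average of per-model score vectors (one vector per model)."""
    total = sum(weights)
    size = len(outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, outputs)) / total
        for i in range(size)
    ]


def majority_vote(labels):
    """Return the label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


combined = weighted_combination([[0.8, 0.2], [0.4, 0.6]], weights=[3.0, 1.0])
voted = majority_vote(["cat", "dog", "cat"])
```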
[0322] Machine-learned model(s) 1 can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, arXiv:2202.09368v2 (Oct. 14, 2022). For example, different portions of a model can learn (explicitly or implicitly) different expertise areas, with pathways through the model being selected by a learned routing mechanism that engages the appropriate expert for a given input (e.g., a given portion of an input, such as on a per-token basis). For example, a feedforward network can be sparsely activated for a given portion of an input based on an output of a routing mechanism that processes the portion of the input. In this manner, for instance, the group of activated weights can form an “expert” that is selected by the router. On each forward pass, only a subset of the total model weights may be engaged, thereby decreasing the quantity of operations performed for processing a given input compared to a densely activated model. In this manner, for instance, the expressive and interpretive power of a high-parameter-count model can be achieved with more compute-efficient forward passes.
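A minimal sketch of sparse, per-token routing is given below for illustration. It uses token-choice top-k gating, in which each token selects its highest-scoring experts; this is a simpler scheme than the expert-choice routing of the cited reference and is swapped in only to show how a router can engage a subset of weights. The router weights, expert functions, and names are hypothetical.

```python
# Illustrative sketch only: token-choice top-k routing over toy experts.
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(token, router_weights, experts, k=1):
    """Run only the k experts with the highest gate values for this token."""
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)           # experts not chosen are never run
        output = [o + (gates[i] / norm) * v for o, v in zip(output, expert_out)]
    return output, chosen


experts = [
    lambda t: [2.0 * v for v in t],              # hypothetical expert 0
    lambda t: [-1.0 * v for v in t],             # hypothetical expert 1
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]        # one row of router weights per expert
output, chosen = route_top_k([3.0, 1.0], router_weights, experts, k=1)
```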
[0323] Input(s) 2 can generally include or otherwise represent various types of data. Input(s) 2 can include one type or many different types of data. Output(s) 3 can be data of the same type(s) or of different types of data as compared to input(s) 2. Output(s) 3 can include one type or many different types of data.
[0324] Input(s) 2 can be or include, or otherwise be representative of any one or more of the inputs described above with respect to the preceding figures. For example, input(s) 2 can be or include, or otherwise be representative of any one or more of query 104, input data structure 116, etc. Although various features, variations, and implementations described below are described with respect to input(s) 2, it is to be understood that such features, variations, and implementations are to be understood as described with respect to each of input data 104, etc., or any other input data to a component described herein.
[0325] Output(s) 3 can be or include, or otherwise be representative of any one or more of the outputs described above with respect to the preceding figures. For example, output(s) 3 can be or include, or otherwise be representative of any one or more of output data structure 120, memory update 804, or any other output of a model described herein. Although various features, variations, and implementations described below are described with respect to output(s) 3, it is to be understood that such features, variations, and implementations are to be understood as described with respect to any output data from a component described herein.
[0326] Example data types for input(s) 2 or output(s) 3 include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer’s central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), genetic data or other chemical or biochemical data, image data, audio data, audiovisual data, haptic data, biometric data, medical data, financial data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like. Data can be raw or processed and can be in any format or schema.
[0327] In multimodal inputs 2 or outputs 3, example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and biometric data, sensor data and medical data, etc. It is to be understood that any combination of data types in an input 2 or an output 3 can be present.
[0328] An example input 2 can include one or multiple data types, such as the example data types noted above. An example output 3 can include one or multiple data types, such as the example data types noted above. The data type(s) of input 2 can be the same as or different from the data type(s) of output 3. It is to be understood that the example data types noted above are provided for illustrative purposes only. Data types contemplated within the scope of the present disclosure are not limited to those examples noted above.
[0329] Figure 13 is a block diagram of an example implementation of an example machine-learned model configured to process sequences of information. For instance, an example implementation of machine-learned model(s) 1 can include machine-learned sequence processing model(s) 4. An example system can pass input(s) 2 to sequence processing model(s) 4. Sequence processing model(s) 4 can include one or more machine-learned components. Sequence processing model(s) 4 can process the data from input(s) 2 to obtain an input sequence 5. Input sequence 5 can include one or more input elements 5-1, 5-2, . . . , 5-M, etc. obtained from input(s) 2. Sequence processing model 4 can process input sequence 5 using prediction layer(s) 6 to generate an output sequence 7. Output sequence 7 can include one or more output elements 7-1, 7-2, . . . , 7-N, etc. generated based on input sequence 5. The system can generate output(s) 3 based on output sequence 7.
[0330] Sequence processing model(s) 4 can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information. For example, some example sequence processing models are referred to as language models and can leverage language-based understandings across one or multiple modalities of input information. Sequence processing model(s) 4 can include relatively large models (e.g., more parameters, computationally expensive, etc.), which may be referred to as “Large Language Models” or LLMs. Sequence processing model(s) 4 can include relatively small models (e.g., fewer parameters, computationally lightweight, etc.), which may be referred to as “Small Language Models” or SLMs. Example language models include, for instance, models described in Gemma: Open Models Based on Gemini Research and Technology, GOOGLE, https://arxiv.org/abs/2403.08295; Gemma 2: Improving Open Language Models at a Practical Size, GOOGLE, https://arxiv.org/abs/2408.00118.
[0331] Sequence processing model(s) 4 can process one or multiple types of data simultaneously. Variations of language models that can perform joint vision and language tasks may be referred to as “Vision-Language Models,” or VLMs. Example VLMs include models described in PaliGemma: A versatile 3B VLM for transfer, GOOGLE, https://arxiv.org/abs/2407.07726; PaliGemma 2: A Family of Versatile VLMs for Transfer, GOOGLE, https://arxiv.org/abs/2412.03555; Flamingo: a Visual Language Model for Few-Shot Learning, GOOGLE, https://arxiv.org/abs/2204.14198; PaLI: A Jointly-Scaled Multilingual Language-Image Model, GOOGLE, https://arxiv.org/abs/2209.06794.
[0332] Sequence processing model(s) 4 can be multimodal. Example multimodal sequence processing models include, for instance, models described in Gemini: A Family of Highly Capable Multimodal Models, GOOGLE, https://arxiv.org/abs/2312.11805; Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, GOOGLE, https://arxiv.org/abs/2403.05530.
[0333] Other example sequence processing models can operate to generate outputs or receive inputs in specific domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929v2 (Jun. 3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, arXiv:2301.11325v1 (Jan. 26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug. 26, 2021), by way of example.
[0334] In general, sequence processing model(s) 4 can obtain input sequence 5 using data from input(s) 2. For instance, input sequence 5 can include a representation of data from input(s) 2 in a format understood by sequence processing model(s) 4. One or more machine-learned components of sequence processing model(s) 4 can ingest the data from input(s) 2, parse the data into pieces compatible with the processing architectures of sequence processing model(s) 4 (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layer(s) 6 (e.g., via “embedding”).
[0335] Sequence processing model(s) 4 can ingest the data from input(s) 2 and parse the data into a sequence of elements to obtain input sequence 5. For example, a portion of input data from input(s) 2 can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence.
[0336] Elements 5-1, 5-2, . . . , 5-M can represent, in some cases, building blocks for capturing or expressing meaningful information in a particular data domain. For instance, the elements can describe “atomic units” across one or more domains. For example, for textual input source(s), the elements can correspond to groups of one or more words or sub-word components, such as sets of one or more characters.
[0337] For example, elements 5-1, 5-2, . . . , 5-M can represent tokens obtained using a tokenizer. For instance, a tokenizer can process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements 5-1, 5-2, . . . , 5-M) that represent the portion of the input source. Various approaches to tokenization can be used. For instance, textual input source(s) can be tokenized using a byte-pair encoding (BPE) technique. See, e.g., Kudo et al., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations), pages 66-71 (October 31-November 4, 2018), https://aclanthology.org/D18-2012.pdf. Image-based input source(s) can be tokenized by extracting and serializing patches from an image.
[0338] In general, arbitrary data types can be serialized and processed into input sequence 5. It is to be understood that element(s) 5-1, 5-2, . . . , 5-M can be the tokens or can be the embedded representations thereof.
[0339] Prediction layer(s) 6 can predict one or more output elements 7-1, 7-2, . . . , 7-N based on the input elements. Prediction layer(s) 6 can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the input(s) to extract higher-order meaning from, and relationships between, input element(s) 5-1, 5-2, . . . , 5-M. In this manner, for instance, example prediction layer(s) 6 can predict new output element(s) in view of the context provided by input sequence 5.
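For illustration, the merge-learning step of a byte-pair encoding approach can be sketched as follows. This is a toy, character-level version with a deterministic tie-break chosen for the example; it is not the SentencePiece implementation, and all function names are hypothetical.

```python
# Illustrative sketch only: learning BPE merges from a whitespace-split corpus.
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return None
    # Ties are broken by the pair itself so the result is deterministic.
    return max(pairs, key=lambda p: (pairs[p], p))


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, vocab_words = learn_bpe("low low lower", num_merges=2)
```

After two merges on this toy corpus, the frequent word is represented by a single token while the rarer word keeps its trailing characters as separate tokens.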
[0340] Prediction layer(s) 6 can evaluate associations between portions of input sequence 5 and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter's toolbox was small and heavy. It was full of ___.” Example prediction layer(s) 6 can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings. Example prediction layer(s) 6 can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations, prediction layer(s) 6 can, for instance, assign a higher probability to the word “nails” than to the word “sawdust.”
[0341] A transformer is an example architecture that can be used in prediction layer(s) 6. See, e.g., Vaswani et al., Attention Is All You Need, arXiv:1706.03762v7 (Aug. 2, 2023). A transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window. The context window can include a sequence that contains input sequence 5 and potentially one or more output element(s) 7-1, 7-2, . . . , 7-N. A transformer block can include one or more attention layer(s) and one or more post-attention layer(s) (e.g., feedforward layer(s), such as a multi-layer perceptron).
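As an illustrative sketch, the association computation of an attention mechanism can be expressed as scaled dot-product attention over a small context. The example below is a single head without learned projections or masking, which a practical transformer block would add; the names and values are hypothetical.

```python
# Illustrative sketch only: single-head scaled dot-product attention.
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Association score between the query and every key in the context.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# With identical keys, every value receives the same weight.
uniform = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 3.0], [3.0, 5.0]])
# With one strongly matching key, the output is dominated by that key's value.
focused = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
```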
[0342] Prediction layer(s) 6 can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. For example, recurrent neural networks (RNNs) and long short-term memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs). In general, prediction layer(s) 6 can leverage various kinds of artificial neural networks that can understand or generate sequences of information.
[0343] Output sequence 7 can include or otherwise represent the same or different data types as input sequence 5. For instance, input sequence 5 can represent textual data, and output sequence 7 can represent textual data. Input sequence 5 can represent image, audio, or audiovisual data, and output sequence 7 can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layer(s) 6, and any other interstitial model components of sequence processing model(s) 4, can be configured to receive a variety of data types in input sequence(s) 5 and output a variety of data types in output sequence(s) 7.
[0344] Output sequence 7 can have various relationships to input sequence 5. Output sequence 7 can be a continuation of input sequence 5. Output sequence 7 can be complementary to input sequence 5. Output sequence 7 can translate, transform, augment, or otherwise modify input sequence 5. Output sequence 7 can answer, evaluate, confirm, or otherwise respond to input sequence 5. Output sequence 7 can implement (or describe instructions for implementing) an instruction provided via input sequence 5.
[0345] Output sequence 7 can be generated autoregressively. For instance, for some applications, an output of one or more prediction layer(s) 6 can be passed through one or more output layers (e.g., softmax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. In this manner, for instance, output sequence 7 can be autoregressively generated by sampling a likely next output element, adding that element to the context window, re-generating the probability distribution based on the updated context window, sampling a likely next output element, and so forth.
[0346] Output sequence 7 can also be generated non-autoregressively. For instance, multiple output elements of output sequence 7 can be predicted together without explicit sequential conditioning on each other. See, e.g., Saharia et al., Non-Autoregressive Machine Translation with Latent Alignments, arXiv:2004.07437v3 (Nov. 16, 2020).
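The autoregressive loop described above can be sketched, for illustration, as follows. The sketch assumes greedy selection of the most probable element in place of random sampling, a hypothetical four-element vocabulary, and a toy stand-in for the prediction layer(s); none of these choices is drawn from the disclosure.

```python
# Illustrative sketch only: greedy autoregressive decoding over a toy vocabulary.
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_token_logits(context, vocab_size=4):
    """Hypothetical stand-in for prediction layers: favors (last + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [2.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(next_token_logits, prompt, max_new_tokens, eos):
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(context))        # distribution over vocabulary
        nxt = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        context.append(nxt)                                 # extend the context window
        if nxt == eos:
            break
    return context


full = generate(toy_next_token_logits, [0], max_new_tokens=10, eos=3)
truncated = generate(toy_next_token_logits, [0], max_new_tokens=2, eos=3)
```

Replacing the greedy choice with a draw from the probability distribution would yield the sampling behavior described above.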
[0347] Output sequence 7 can include one or multiple portions or elements. In an example content generation configuration, output sequence 7 can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.). In an example classification configuration, output sequence 7 can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified. For instance, a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image.
[0348] Figure 14 is a block diagram of an example technique for populating an example input sequence 8. Input sequence 8 can include various functional elements that form part of the model infrastructure, such as an element 8-0 obtained from a task indicator 9 that signals to any model(s) that process input sequence 8 that a particular task is being performed (e.g., to help adapt a performance of the model(s) to that particular task). Input sequence 8 can include various data elements from different data modalities. For instance, an input modality 10-1 can include one modality of data. A data-to-sequence model 11-1 can process data from input modality 10-1 to project the data into a format compatible with input sequence 8 (e.g., one or more vectors dimensioned according to the dimensions of input sequence 8) to obtain elements 8-1, 8-2, 8-3. Another input modality 10-2 can include a different modality of data. A data-to-sequence model 11-2 can project data from input modality 10-2 into a format compatible with input sequence 8 to obtain elements 8-4, 8-5, 8-6. Another input modality 10-3 can include yet another different modality of data. A data-to-sequence model 11-3 can project data from input modality 10-3 into a format compatible with input sequence 8 to obtain elements 8-7, 8-8, 8-9.
[0349] Input sequence 8 can be the same as or different from input sequence 5. Input sequence 8 can be a multimodal input sequence that contains elements that represent data from different modalities using a common dimensional representation. For instance, an embedding space can have P dimensions. Input sequence 8 can be configured to contain a plurality of elements that have P dimensions. In this manner, for instance, example implementations can facilitate information extraction and reasoning across diverse data modalities by projecting data into elements in the same embedding space for comparison, combination, or other computations therebetween.
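For illustration, populating a multimodal input sequence with elements of a common dimension P can be sketched as follows. The toy data-to-sequence functions below use fixed, hand-written mappings where a practical system would use learned projections, and the value of P, the task element, and all names are hypothetical.

```python
# Illustrative sketch only: assembling a multimodal sequence of P-dimensional elements.
P = 4  # hypothetical shared embedding dimension


def text_to_elements(text):
    """Toy text data-to-sequence model: one P-dimensional element per whitespace token."""
    elements = []
    for tok in text.split():
        codes = [ord(c) for c in tok]
        elements.append([
            float(len(tok)), float(codes[0]), float(codes[-1]), float(sum(codes) % 97),
        ])
    return elements


def image_to_elements(pixels, patch=2):
    """Toy image data-to-sequence model: flatten non-overlapping patch x patch blocks.

    With patch=2 each block has 4 values, which matches P in this sketch.
    """
    elements = []
    for r in range(0, len(pixels), patch):
        for c in range(0, len(pixels[0]), patch):
            elements.append([
                float(pixels[r + dr][c + dc])
                for dr in range(patch) for dc in range(patch)
            ])
    return elements


def build_input_sequence(task_element, text, pixels):
    """Concatenate the task element, text elements, and image elements."""
    sequence = [task_element] + text_to_elements(text) + image_to_elements(pixels)
    if any(len(e) != P for e in sequence):
        raise ValueError("every element must have P dimensions")
    return sequence


pixels = [[4 * r + c for c in range(4)] for r in range(4)]   # 4 x 4 toy image
sequence = build_input_sequence([0.0, 0.0, 0.0, 1.0], "a dog", pixels)
```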
[0350] For example, elements 8-0, . . . , 8-9 can indicate particular locations within a multidimensional embedding space. Some elements can map to a set of discrete locations in the embedding space. For instance, elements that correspond to discrete members of a predetermined vocabulary of tokens can map to discrete locations in the embedding space that are associated with those tokens. Other elements can be continuously distributed across the embedding space. For instance, some data types can be broken down into continuously defined portions (e.g., image patches) that can be described using continuously distributed locations within the embedding space.
[0351] In some implementations, the expressive power of the embedding space may not be limited to meanings associated with any particular set of tokens or other building blocks. For example, a continuous embedding space can encode a spectrum of high-order information. An individual piece of information (e.g., a token) can map to a particular point in that space: for instance, a token for the word “dog” can be projected to an embedded value that points to a particular location in the embedding space associated with canine-related information. Similarly, an image patch of an image of a dog on grass can also be projected into the embedding space. In some implementations, the projection of the image of the dog can be similar to the projection of the word “dog” while also having similarity to a projection of the word “grass,” while potentially being different from both. In some implementations, the projection of the image patch may not exactly align with any single projection of a single word. In some implementations, the projection of the image patch can align with a combination of the projections of the words “dog” and “grass.” In this manner, for instance, a high-order embedding space can encode information that can be independent of data modalities in which the information is expressed.
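The relationship described above between an image-patch projection and word projections can be illustrated with cosine similarity. The three-dimensional vectors below are invented for the example and are not produced by any model; they only show how a patch embedding can be similar to both word embeddings and closer still to their combination.

```python
# Illustrative sketch only: cosine similarity between hypothetical embeddings.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog = [1.0, 0.0, 0.0]            # hypothetical embedding of the token "dog"
grass = [0.0, 1.0, 0.0]          # hypothetical embedding of the token "grass"
car = [0.0, 0.0, 1.0]            # hypothetical unrelated token
patch = [0.6, 0.6, 0.1]          # hypothetical embedding of a dog-on-grass image patch
dog_plus_grass = [d + g for d, g in zip(dog, grass)]

sim_dog = cosine(patch, dog)
sim_grass = cosine(patch, grass)
sim_car = cosine(patch, car)
sim_combo = cosine(patch, dog_plus_grass)
```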
[0352] Task indicator 9 can include a model or model component configured to identify a task being performed and inject, into input sequence 8, an input value represented by element 8-0 that signals which task is being performed. For instance, the input value can be provided as a data type associated with an input modality and projected along with that input modality (e.g., the input value can be a textual task label that is embedded along with other textual data in the input; the input value can be a pixel-based representation of a task that is embedded along with other image data in the input; etc.). The input value can be provided as a data type that differs from or is at least independent from other input(s). For instance, the input value represented by element 8-0 can be a learned value within a continuous embedding space.
[0353] Input modalities 10-1, 10-2, and 10-3 can be associated with various different data types (e.g., as described above with respect to input(s) 2 and output(s) 3).
[0354] Data-to-sequence models 11-1, 11-2, and 11-3 can be the same or different from each other. Data-to-sequence models 11-1, 11-2, and 11-3 can be adapted to each respective input modality 10-1, 10-2, and 10-3. For example, a textual data-to-sequence model can subdivide a portion of input text and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-1, 8-2, 8-3, etc.). An image data-to-sequence model can subdivide an input image and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-4, 8-5, 8-6, etc.). An arbitrary datatype data-to-sequence model can subdivide an input of that arbitrary datatype and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-7, 8-8, 8-9, etc.).
[0355] Data-to-sequence models 11-1, 11-2, and 11-3 can form part of machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be jointly trained with or trained independently from machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be trained end-to-end with machine-learned sequence processing model(s) 4.
[0356] Figure 15 is a block diagram of an example model development platform 12 that can facilitate creation, adaptation, and refinement of example machine-learned models (e.g., machine-learned model(s) 1, sequence processing model(s) 4, etc.). Model development platform 12 can provide a number of different toolkits that developer systems can employ in the development of new or adapted machine-learned models.
[0357] Model development platform 12 can provide one or more model libraries 13 containing building blocks for new models. Model libraries 13 can include one or more pre-trained foundational models 13-1, which can provide a backbone of processing power across various tasks. Model libraries 13 can include one or more pre-trained expert models 13-2, which can be focused on performance in particular domains of expertise. Model libraries 13 can include various model primitives 13-3, which can provide low-level architectures or components (optionally pre-trained), which can be assembled in various arrangements as desired. Model primitives 13-3 can include a library of pre-trained adapters or LoRA modules that can adapt a baseline foundational model to align its outputs with a desired performance profile, augment model capabilities (e.g., to adapt to a different input modality, etc.), and the like.
[0358] Model development platform 12 can receive selections of various model components 14. Model development platform 12 can pass selected model components 14 to a workbench 15 that combines selected model components 14 into a development model 16.
[0359] Workbench 15 can facilitate further refinement and adaptation of development model 16 by leveraging a number of different toolkits integrated with model development platform 12. For example, workbench 15 can facilitate alignment of the development model 16 with a desired performance profile on various tasks using a model alignment toolkit 17.
[0360] Model alignment toolkit 17 can provide a number of tools for causing development model 16 to generate outputs aligned with desired behavioral characteristics. Alignment can include increasing the accuracy, precision, recall, etc. of model outputs. Alignment can include enforcing output styles, schema, or other preferential characteristics of model outputs. Alignment can be general or domain-specific. For instance, a pre-trained foundational model 13-1 can begin with an initial level of performance across multiple domains. Alignment of the pre-trained foundational model 13-1 can include improving a performance in a particular domain of information or tasks (e.g., even at the expense of performance in another domain of information or tasks).
[0361] Model alignment toolkit 17 can integrate one or more dataset(s) 17-1 for aligning development model 16. Curated dataset(s) 17-1 can include labeled or unlabeled training data. Dataset(s) 17-1 can be obtained from public domain datasets. Dataset(s) 17-1 can be obtained from private datasets associated with one or more developer system(s) for the alignment of bespoke machine-learned model(s) customized for private use-cases.
[0362] Pre-training pipelines 17-2 can include a machine-learned model training workflow configured to update development model 16 over large-scale, potentially noisy datasets. For example, pre-training can leverage unsupervised learning techniques (e.g., de-noising, etc.) to process large numbers of training instances to update model parameters from an initialized state and achieve a desired baseline performance. Pre-training pipelines 17-2 can leverage unlabeled datasets in dataset(s) 17-1 to perform pretraining. Workbench 15 can implement a pre-training pipeline 17-2 to pre-train development model 16.
[0363] Fine-tuning pipelines 17-3 can include a machine-learned model training workflow configured to refine the model parameters of development model 16 with higher-quality data. Fine-tuning pipelines 17-3 can update development model 16 by conducting supervised training with labeled dataset(s) in dataset(s) 17-1. Fine-tuning pipelines 17-3 can update development model 16 by conducting reinforcement learning using reward signals from user feedback signals. Workbench 15 can implement a fine-tuning pipeline 17-3 to fine-tune development model 16.
[0364] Prompt libraries 17-4 can include sets of inputs configured to induce behavior aligned with desired performance criteria. Prompt libraries 17-4 can include few-shot prompts (e.g., inputs providing examples of desired model outputs for prepending to a desired runtime query), chain-of-thought prompts (e.g., inputs providing step-by-step reasoning within the exemplars to facilitate thorough reasoning by the model), and the like.
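As a non-limiting illustration, a few-shot prompt of the kind described above can be assembled by prepending exemplars to a runtime query. The following Python sketch is hypothetical: the function name `build_few_shot_prompt` and the "Input:" / "Output:" layout are assumptions made for illustration and are not part of the disclosure.

```python
from typing import List, Tuple


def build_few_shot_prompt(exemplars: List[Tuple[str, str]], query: str) -> str:
    """Prepend (input, output) exemplars to a runtime query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in exemplars]
    # The runtime query goes last, with an empty output slot for the model to fill.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    [("2 + 2", "4"), ("3 + 5", "8")],
    "7 + 6",
)
```

Passing an empty exemplar list yields a prompt containing only the query, which corresponds to a zero-shot input.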
[0365] Example prompts can be retrieved from an available repository of prompt libraries 17-4. Example prompts can be contributed by one or more developer systems using workbench 15.
[0366] In some implementations, pre-trained or fine-tuned models can achieve satisfactory performance without exemplars in the inputs. For instance, zero-shot prompts can include inputs that lack exemplars. Zero-shot prompts can be within a domain within a training dataset or outside of the training domain(s).
[0367] Prompt libraries 17-4 can include one or more prompt engineering tools. Prompt engineering tools can provide workflows for retrieving or learning optimized prompt values. Prompt engineering tools can facilitate directly learning prompt values (e.g., input element values) based on one or more training iterations. Workbench 15 can implement prompt engineering tools in development model 16.
[0368] Prompt libraries 17-4 can include pipelines for prompt generation. For example, inputs can be generated using development model 16 itself or other machine-learned models. In this manner, for instance, a first model can process information about a task and output an input for a second model to process in order to perform a step of the task. The second model can be the same as or different from the first model. Workbench 15 can implement prompt generation pipelines in development model 16.
[0369] Prompt libraries 17-4 can include pipelines for context injection. For instance, a performance of development model 16 on a particular task can improve if provided with additional context for performing the task. Prompt libraries 17-4 can include software components configured to identify desired context, retrieve the context from an external source (e.g., a database, a sensor, etc.), and add the context to the input prompt. Workbench 15 can implement context injection pipelines in development model 16.
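As a non-limiting illustration, the three context injection steps described above (identifying desired context, retrieving it, and adding it to the input prompt) can be sketched as follows. All names here are hypothetical, the keyword matching is a deliberately naive placeholder for context identification, and the lambdas stand in for a database lookup and a sensor reading.

```python
from typing import Callable, Dict, List


def inject_context(prompt: str, sources: Dict[str, Callable[[], str]]) -> str:
    """Identify which sources the prompt mentions, retrieve them, and prepend them."""
    lowered = prompt.lower()
    # Identify desired context: naive keyword match on the source name (assumption).
    needed: List[str] = [name for name in sources if name in lowered]
    if not needed:
        return prompt
    # Retrieve each piece of context from its (hypothetical) external source.
    lines = [f"{name}: {sources[name]()}" for name in needed]
    # Add the retrieved context to the input prompt.
    return "Context:\n" + "\n".join(lines) + "\n\nRequest: " + prompt


sources = {
    "calendar": lambda: "dentist at 15:00",  # stands in for a database lookup
    "temperature": lambda: "21.5 C",         # stands in for a sensor reading
}
augmented = inject_context("What is on my calendar today?", sources)
```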
[0370] Although various training examples described herein with respect to model development platform 12 refer to "pre-training" and "fine-tuning," it is to be understood that model alignment toolkit 17 can generally support a wide variety of training techniques adapted for training a wide variety of machine-learned models. Example training techniques can correspond to the example training method 1100 described above.
[0371] Model development platform 12 can include a model plugin toolkit 18. Model plugin toolkit 18 can include a variety of tools configured for augmenting the functionality of a machine-learned model by integrating the machine-learned model with other systems, devices, and software components. For instance, a machine-learned model can use tools to increase performance quality where appropriate. For instance, deterministic tasks can be offloaded to dedicated tools in lieu of probabilistically performing the task with an increased risk of error. For instance, instead of autoregressively predicting the solution to a system of equations, a machine-learned model can recognize a tool to call for obtaining the solution and pass the system of equations to the appropriate tool. The tool can be a traditional system of equations solver that can operate deterministically to resolve the system of equations. The output of the tool can be returned in response to the original query. In this manner, tool use can allow some example models to focus on the strengths of machine-learned models (e.g., understanding an intent in an unstructured request for a task) while augmenting the performance of the model by offloading certain tasks to a more focused tool for rote application of deterministic algorithms to a well-defined problem.
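As a non-limiting illustration of a deterministic tool to which a model could offload the system-of-equations example, the following Python sketch solves a square linear system exactly using Gauss-Jordan elimination over rational numbers. The choice of algorithm and the function name `solve_linear_system` are assumptions made for illustration; the disclosure does not specify a particular solver.

```python
from fractions import Fraction
from typing import List


def solve_linear_system(a: List[List[int]], b: List[int]) -> List[Fraction]:
    """Solve a @ x = b exactly by Gauss-Jordan elimination over rationals."""
    n = len(a)
    # Build the augmented matrix [a | b] with exact rational entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row to 1, then clear the column in every other row.
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [rv - factor * cv for rv, cv in zip(m[r], m[col])]
    return [row[n] for row in m]


# x + y = 3 and 2x - y = 0
solution = solve_linear_system([[1, 1], [2, -1]], [3, 0])
```

Because the arithmetic is exact, the same input always produces the same output, which is the property that motivates offloading such a task rather than predicting the answer token by token.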
[0372] Model plugin toolkit 18 can include validation tools 18-1. Validation tools 18-1 can include tools that can parse and confirm output(s) of a machine-learned model. Validation tools 18-1 can include engineered heuristics that establish certain thresholds applied to model outputs. For example, validation tools 18-1 can ground the outputs of machine-learned models to structured data sources (e.g., to mitigate "hallucinations").
[0373] Model plugin toolkit 18 can include tooling packages 18-2 for implementing one or more tools that can include scripts or other executable code that can be executed alongside development model 16. Tooling packages 18-2 can include one or more inputs configured to cause machine-learned model(s) to implement the tools (e.g., few-shot prompts that induce a model to output tool calls in the proper syntax, etc.). Tooling packages 18-2 can include, for instance, fine-tuning training data for training a model to use a tool.
[0374] Model plugin toolkit 18 can include interfaces for calling external application programming interfaces (APIs) 18-3. For instance, in addition to or in lieu of implementing tool calls or tool code directly with development model 16, development model 16 can be aligned to output instructions that initiate API calls to send or obtain data via external systems.
[0375] Model plugin toolkit 18 can integrate with prompt libraries 17-4 to build a catalog of available tools for use with development model 16. For instance, a model can receive, in an input, a catalog of available tools, and the model can generate an output that selects a tool from the available tools and initiates a tool call for using the tool.
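As a non-limiting illustration, a catalog of available tools can be rendered into a model input, and a tool call emitted by the model can be parsed and dispatched. In the following Python sketch, the catalog contents, the JSON tool-call format, and the function names are all hypothetical; the model output is written by hand rather than generated.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical catalog: tool name -> (description shown to the model, callable).
CATALOG: Dict[str, Tuple[str, Callable[..., Any]]] = {
    "add": ("Add two numbers a and b.", lambda a, b: a + b),
    "upper": ("Upper-case a string s.", lambda s: s.upper()),
}


def render_catalog() -> str:
    """Text listing of available tools, suitable for inclusion in a model input."""
    return "\n".join(f"- {name}: {desc}" for name, (desc, _) in CATALOG.items())


def dispatch(model_output: str) -> Any:
    """Parse a tool call emitted by the model and run the selected tool."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in CATALOG:
        raise KeyError(f"unknown tool: {name}")
    _, fn = CATALOG[name]
    return fn(**call["args"])


# A model output selecting the "add" tool (hand-written for this sketch).
result = dispatch('{"tool": "add", "args": {"a": 2, "b": 3}}')
```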
[0376] Model development platform 12 can include a computational optimization toolkit 19 for optimizing a computational performance of development model 16. For instance, tools for model compression 19-1 can allow development model 16 to be reduced in size while maintaining a desired level of performance. For instance, model compression 19-1 can include quantization workflows, weight pruning and sparsification techniques, etc. Tools for hardware acceleration 19-2 can facilitate the configuration of the model storage and execution formats to operate optimally on different hardware resources. For instance, hardware acceleration 19-2 can include tools for optimally sharding models for distributed processing over multiple processing units for increased bandwidth, lower unified memory requirements, etc. Tools for distillation 19-3 can provide for the training of lighter-weight models based on the knowledge encoded in development model 16. For instance, development model 16 can be a highly performant, large machine-learned model optimized using model development platform 12. To obtain a lightweight model for running in resource-constrained environments, a smaller model can be a "student model" that learns to imitate development model 16 as a "teacher model." In this manner, for instance, the investment in learning the parameters and configurations of development model 16 can be efficiently transferred to a smaller model for more efficient inference.
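As a non-limiting illustration of one quantization workflow, the following Python sketch applies symmetric per-tensor quantization of floating-point weights to 8-bit integers. The specific scheme (a single scale factor mapping the largest magnitude to 127) is an assumption made for illustration; the disclosure does not specify a quantization method.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for all-zero weights.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the trade-off accepted in exchange for the smaller storage format.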
[0377] Workbench 15 can implement one, multiple, or none of the toolkits implemented in model development platform 12. Workbench 15 can output an output model 20 based on development model 16. Output model 20 can be a deployment version of development model 16. Output model 20 can be a development or training checkpoint of development model 16. Output model 20 can be a distilled, compressed, or otherwise optimized version of development model 16.
[0378] Figure 16 is a block diagram of an example training flow for training a machine-learned development model 16. One or more portion(s) of the example training flow can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of the example training flow can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of the example training flow can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. Figure 16 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 16 is described with reference to elements / terms described with respect to other systems and figures for illustrative purposes and is not meant to be limiting. One or more portions of the example training flow can be performed additionally, or alternatively, by other systems.
[0379] Initially, development model 16 can persist in an initial state as an initialized model 21. Development model 16 can be initialized with weight values. Initial weight values can be random or based on an initialization schema. Initial weight values can be based on prior pre-training for the same or for a different model.
[0380] Initialized model 21 can undergo pre-training in a pre-training stage 22. Pre-training stage 22 can be implemented using one or more pre-training pipelines 17-2 over data from dataset(s) 17-1. Pre-training can be omitted, for example, if initialized model 21 is already pre-trained (e.g., development model 16 contains, is, or is based on a pre-trained foundational model or an expert model).
[0381] Pre-trained model 23 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Pre-trained model 23 can be the initial state if development model 16 was already pre-trained. Pre-trained model 23 can undergo fine-tuning in a fine-tuning stage 24. Fine-tuning stage 24 can be implemented using one or more fine-tuning pipelines 17-3 over data from dataset(s) 17-1. Fine-tuning can be omitted, for example, if a pre-trained model has satisfactory performance, if the model was already fine-tuned, or if other tuning approaches are preferred.
[0382] Fine-tuned model 25 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Fine-tuned model 25 can be the initial state if development model 16 was already fine-tuned. Fine-tuned model 25 can undergo refinement with user feedback 26. For instance, refinement with user feedback 26 can include reinforcement learning, optionally based on human feedback from human users of fine-tuned model 25. As reinforcement learning can be a form of fine-tuning, it is to be understood that fine-tuning stage 24 can subsume the stage for refining with user feedback 26. Refinement with user feedback 26 can produce a refined model 27. Refined model 27 can be output to downstream system(s) 28 for deployment or further development.
[0383] In some implementations, computational optimization operations can be applied before, during, or after each stage. For instance, initialized model 21 can undergo computational optimization 29-1 (e.g., using computational optimization toolkit 19) before pre-training stage 22. Pre-trained model 23 can undergo computational optimization 29-2 (e.g., using computational optimization toolkit 19) before fine-tuning stage 24. Fine-tuned model 25 can undergo computational optimization 29-3 (e.g., using computational optimization toolkit 19) before refinement with user feedback 26. Refined model 27 can undergo computational optimization 29-4 (e.g., using computational optimization toolkit 19) before output to downstream system(s) 28. Computational optimization(s) 29-1, ..., 29-4 can all be the same, all be different, or include at least some different optimization techniques.
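As a non-limiting illustration, the staged training flow, in which any stage can be omitted and an optimization step can optionally precede a stage, can be sketched as follows. The `Model` class is a toy stand-in that only records which stages were applied, and every name is hypothetical. For brevity the sketch applies the optional optimization only before stages that actually run, which is one of several orderings the description permits.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Model:
    """Toy stand-in for a development model; `history` records applied stages."""
    history: List[str] = field(default_factory=list)


Stage = Callable[[Model], Model]


def make_stage(name: str) -> Stage:
    def run(model: Model) -> Model:
        # Each stage returns a new model version rather than mutating the old one.
        return Model(history=model.history + [name])
    return run


def training_flow(model: Model, stages: List[Optional[Stage]],
                  optimize: Optional[Stage] = None) -> Model:
    """Apply stages in order; None marks an omitted stage."""
    for stage in stages:
        if stage is None:
            continue
        if optimize is not None:
            model = optimize(model)  # optional optimization before the stage
        model = stage(model)
    return model


# Pre-training is omitted, as for a model that starts from a pre-trained state.
refined = training_flow(
    Model(),
    [None, make_stage("fine_tune"), make_stage("refine_with_feedback")],
    optimize=make_stage("optimize"),
)
```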
[0384] Figure 17 is a block diagram of an inference system for operating one or more machine-learned model(s) 1 to perform inference (e.g., for training, for deployment, etc.). A model host 31 can receive machine-learned model(s) 1. Model host 31 can host one or more model instance(s) 31-1, which can be one or multiple instances of one or multiple models. Model host 31 can host model instance(s) 31-1 using available compute resources 31-2 associated with model host 31.
[0385] Model host 31 can perform inference on behalf of one or more client(s) 32. Client(s) 32 can transmit an input request 33 to model host 31. Using input request 33, model host 31 can obtain input(s) 2 for input to machine-learned model(s) 1. Machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3. Using output(s) 3, model host 31 can return an output payload 34 for responding to input request 33 from client(s) 32. Output payload 34 can include or be based on output(s) 3.
[0386] Model host 31 can leverage various other resources and tools to augment the inference task. For instance, model host 31 can communicate with tool interfaces 35 to facilitate tool use by model instance(s) 31-1. Tool interfaces 35 can include local or remote APIs. Tool interfaces 35 can include integrated scripts or other software functionality. Model host 31 can engage online learning interface(s) 36 to facilitate ongoing improvements to machine-learned model(s) 1. For instance, online learning interface(s) 36 can be used within reinforcement learning loops to retrieve user feedback on inferences served by model host 31. Model host 31 can access runtime data source(s) 37 for augmenting input(s) 2 with additional contextual information. For instance, runtime data source(s) 37 can include a knowledge graph 37-1 that facilitates structured information retrieval for information associated with input request(s) 33 (e.g., a search engine service). Runtime data source(s) 37 can include public or private, external or local database(s) 37-2 that can store information associated with input request(s) 33 for augmenting input(s) 2. Runtime data source(s) 37 can include account data 37-3 which can be retrieved in association with a user account corresponding to a client 32 for customizing the behavior of model host 31 accordingly.
[0387] Model host 31 can be implemented by one or multiple computing devices or systems. Client(s) 32 can be implemented by one or multiple computing devices or systems, which can include computing devices or systems shared with model host 31.
[0388] For example, model host 31 can operate on a server system that provides a machine-learning service to client device(s) that operate client(s) 32 (e.g., over a local or wide-area network). Client device(s) can be end-user devices used by individuals. Client device(s) can be server systems that operate client(s) 32 to provide various functionality as a service to downstream end-user devices.
[0389] In some implementations, model host 31 can operate on the same device or system as client(s) 32. Model host 31 can be a machine-learning service that runs on-device to provide machine-learning functionality to one or multiple applications operating on a client device, which can include an application implementing client(s) 32. Model host 31 can be a part of the same application as client(s) 32. For instance, model host 31 can be a subroutine or method implemented by one part of an application, and client(s) 32 can be another subroutine or method that engages model host 31 to perform inference functions within the application. It is to be understood that model host 31 and client(s) 32 can have various different configurations.
[0390] Model instance(s) 31-1 can include one or more machine-learned models that are available for performing inference. Model instance(s) 31-1 can include weights or other model components that are stored in persistent storage, temporarily cached, or loaded into high-speed memory. Model instance(s) 31-1 can include multiple instance(s) of the same model (e.g., for parallel execution of more requests on the same model). Model instance(s) 31-1 can include instance(s) of different model(s). Model instance(s) 31-1 can include cached intermediate states of active or inactive model(s) used to accelerate inference of those models. For instance, an inference session with a particular model may generate significant amounts of computational results that can be re-used for future inference runs (e.g., using a KV cache for transformer-based models). These computational results can be saved in association with that inference session so that session can be executed more efficiently when resumed.
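As a non-limiting illustration of saving computational results in association with an inference session, the following Python sketch caches per-token intermediate states and, when a session is resumed, recomputes only the tokens beyond the longest shared prefix. This is a simplified analogy to a KV cache rather than an implementation of one: the `_encode` method is a hypothetical stand-in for expensive per-token computation, and all names are assumptions.

```python
from typing import Dict, List, Tuple


class SessionCache:
    """Keeps per-session intermediate results so a resumed session skips rework."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[List[str], List[int]]] = {}
        self.steps_computed = 0  # counts per-token work actually performed

    def _encode(self, token: str) -> int:
        # Stand-in for expensive per-token computation (assumption).
        self.steps_computed += 1
        return len(token)

    def run(self, session_id: str, tokens: List[str]) -> List[int]:
        cached_tokens, cached_states = self._store.get(session_id, ([], []))
        # Reuse cached states only for the longest shared prefix.
        shared = 0
        while (shared < len(cached_tokens) and shared < len(tokens)
               and cached_tokens[shared] == tokens[shared]):
            shared += 1
        states = cached_states[:shared] + [self._encode(t) for t in tokens[shared:]]
        self._store[session_id] = (list(tokens), states)
        return states


cache = SessionCache()
cache.run("s1", ["hello", "world"])
states = cache.run("s1", ["hello", "world", "again"])
```

In this example the second call performs one unit of work instead of three, because the states for the first two tokens are reused from the saved session.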
[0391] Compute resource(s) 31-2 can include one or more processors (central processing units, graphical processing units, tensor processing units, machine-learning accelerators, etc.) connected to one or more memory devices. Compute resource(s) 31-2 can include a dynamic pool of available resources shared with other processes. Compute resource(s) 31-2 can include memory devices large enough to fit an entire model instance in a single memory instance. Compute resource(s) 31-2 can also shard model instance(s) across multiple memory devices (e.g., using data parallelization or tensor parallelization, etc.). This can be done to increase parallelization or to execute a large model using multiple memory devices which individually might not be able to fit the entire model into memory.
[0392] Input request 33 can include data for input(s) 2. Model host 31 can process input request 33 to obtain input(s) 2. Input(s) 2 can be obtained directly from input request 33 or can be retrieved using input request 33. Input request 33 can be submitted to model host 31 via an API.
[0393] Model host 31 can perform inference over batches of input requests 33 in parallel. For instance, a model instance 31-1 can be configured with an input structure that has a batch dimension. Separate input(s) 2 can be distributed across the batch dimension (e.g., rows of an array). The separate input(s) 2 can include completely different contexts. The separate input(s) 2 can be multiple inference steps of the same task. The separate input(s) 2 can be staggered in an input structure, such that any given inference cycle can be operating on different portions of the respective input(s) 2. In this manner, for instance, model host 31 can perform inference on the batch in parallel, such that output(s) 3 can also contain the batch dimension and return the inference results for the batched input(s) 2 in parallel. In this manner, for instance, batches of input request(s) 33 can be processed in parallel for higher throughput of output payload(s) 34.
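As a non-limiting illustration of distributing separate inputs across a batch dimension, the following Python sketch stacks variable-length inputs as rows of a rectangular array, tracks which positions are padding, and splits the batched outputs back into per-request results. The padding value, the mask representation, and the element-wise "inference" step are assumptions made for illustration.

```python
from typing import List, Tuple

PAD = 0  # hypothetical padding token id


def make_batch(inputs: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Stack variable-length inputs as rows of a rectangular batch plus a mask."""
    width = max((len(seq) for seq in inputs), default=0)
    batch = [seq + [PAD] * (width - len(seq)) for seq in inputs]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in inputs]
    return batch, mask


def unbatch(outputs: List[List[int]], mask: List[List[int]]) -> List[List[int]]:
    """Drop padded positions so each request gets back only its own results."""
    return [[o for o, m in zip(row, row_mask) if m]
            for row, row_mask in zip(outputs, mask)]


batch, mask = make_batch([[5, 6, 7], [8], [9, 10]])
# Stand-in for one batched inference pass: add 1 to every element.
outputs = [[v + 1 for v in row] for row in batch]
per_request = unbatch(outputs, mask)
```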
[0394] Output payload 34 can include or be based on output(s) 3 from machine-learned model(s) 1. Model host 31 can process output(s) 3 to obtain output payload 34. This can include chaining multiple rounds of inference (e.g., iteratively, recursively, across the same model(s) or different model(s)) to arrive at a final output for a task to be returned in output payload 34. Output payload 34 can be transmitted to client(s) 32 via an API.
[0395] Online learning interface(s) 36 can facilitate reinforcement learning of machine-learned model(s) 1. Online learning interface(s) 36 can facilitate reinforcement learning with human feedback (RLHF). Online learning interface(s) 36 can facilitate federated learning of machine-learned model(s) 1.
[0396] Model host 31 can access a library of pre-trained adapters or LoRA modules that can adapt a baseline model to align its outputs with a desired performance profile, augment model capabilities (e.g., to adapt to a different input modality, etc.), and the like. For instance, model host 31 can receive an input request to load a customized model, and model host 31 can retrieve one or more components to adapt a baseline model to the custom profile. Model host 31 can determine that a particular functionality is needed for a particular task (e.g., based on an output of a model that preprocesses an input) and retrieve a pretrained component accordingly.
[0397] Model host 31 can execute machine-learned model(s) 1 to perform inference for various tasks using various types of data. For example, various different input(s) 2 and output(s) 3 can be used for various different tasks. In some implementations, input(s) 2 can be or otherwise represent image data. Machine-learned model(s) 1 can process the image data to generate an output. As an example, machine-learned model(s) 1 can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an image segmentation output. As another example, machine-learned model(s) 1 can process the image data to generate an image classification output. As another example, machine-learned model(s) 1 can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an encoded image data output (e.g., an encoded and / or compressed representation of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an upscaled image data output. As another example, machine-learned model(s) 1 can process the image data to generate a prediction output.
[0398] In some implementations, the task is a computer vision task. In some cases, input(s) 2 includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
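As a non-limiting illustration of an image classification output, the following Python sketch converts raw per-class values into a set of scores, one per object class, that can be read as likelihoods. The use of a softmax normalization, the class names, and the input values are assumptions made for illustration; the description does not specify how the scores are computed.

```python
import math
from typing import Dict, List


def class_scores(logits: List[float], classes: List[str]) -> Dict[str, float]:
    """Convert raw per-class values into likelihood scores that sum to 1 (softmax)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(classes, exps)}


# Hypothetical raw values for three object classes.
scores = class_scores([2.0, 0.5, -1.0], ["cat", "dog", "car"])
best = max(scores, key=scores.get)
```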
[0399] In some implementations, input(s) 2 can be or otherwise represent natural language data. Machine-learned model(s) 1 can process the natural language data to generate an output. As an example, machine-learned model(s) 1 can process the natural language data to generate a language encoding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a latent text embedding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a translation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a classification output. As another example, machine-learned model(s) 1 can process the natural language data to generate a textual segmentation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a semantic intent output. As another example, machine-learned model(s) 1 can process the natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, machine-learned model(s) 1 can process the natural language data to generate a prediction output (e.g., one or more predicted next portions of natural language content).
[0400] In some implementations, input(s) 2 can be or otherwise represent speech data (e.g., data describing spoken natural language, such as audio data, textual data, etc.). Machine-learned model(s) 1 can process the speech data to generate an output. As an example, machine-learned model(s) 1 can process the speech data to generate a speech recognition output. As another example, machine-learned model(s) 1 can process the speech data to generate a speech translation output. As another example, machine-learned model(s) 1 can process the speech data to generate a latent embedding output. As another example, machine-learned model(s) 1 can process the speech data to generate an encoded speech output (e.g., an encoded and / or compressed representation of the speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a prediction output.
[0401] In some implementations, input(s) 2 can be or otherwise represent latent encoding data (e.g., a latent space representation of an input, etc.). Machine-learned model(s) 1 can process the latent encoding data to generate an output. As an example, machine-learned model(s) 1 can process the latent encoding data to generate a recognition output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reconstruction output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a search output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reclustering output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a prediction output.
[0402] In some implementations, input(s) 2 can be or otherwise represent statistical data. Statistical data can be, represent, or otherwise include data computed and / or calculated from some other data source. Machine-learned model(s) 1 can process the statistical data to generate an output. As an example, machine-learned model(s) 1 can process the statistical data to generate a recognition output. As another example, machine-learned model(s) 1 can process the statistical data to generate a prediction output. As another example, machine-learned model(s) 1 can process the statistical data to generate a classification output. As another example, machine-learned model(s) 1 can process the statistical data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the statistical data to generate a visualization output. As another example, machine-learned model(s) 1 can process the statistical data to generate a diagnostic output.
[0403] In some implementations, input(s) 2 can be or otherwise represent sensor data. Machine-learned model(s) 1 can process the sensor data to generate an output. As an example, machine-learned model(s) 1 can process the sensor data to generate a recognition output. As another example, machine-learned model(s) 1 can process the sensor data to generate a prediction output. As another example, machine-learned model(s) 1 can process the sensor data to generate a classification output. As another example, machine-learned model(s) 1 can process the sensor data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the sensor data to generate a visualization output. As another example, machine-learned model(s) 1 can process the sensor data to generate a diagnostic output. As another example, machine-learned model(s) 1 can process the sensor data to generate a detection output.
[0404] In some implementations, machine-learned model(s) 1 can be configured to perform a task that includes encoding input data for reliable and / or efficient transmission or storage (and / or corresponding decoding). For example, the task may be an audio compression task. The input may include audio data and the output may include compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output includes compressed visual data, and the task is a visual data compression task. In another example, the task may include generating an embedding for input data (e.g., input audio or visual data). In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may include a text output which is mapped to the spoken utterance. In some cases, the task includes encrypting or decrypting input data. In some cases, the task includes a microprocessor performance task, such as branch prediction or memory address translation.
[0405] In some implementations, the task is a generative task, and machine-learned model(s) 1 can be configured to output content generated in view of input(s) 2. For instance, input(s) 2 can be or otherwise represent data of one or more modalities that encodes context for generating additional content.
[0406] In some implementations, the task can be a text completion task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent textual data and to generate output(s) 3 that represent additional textual data that completes a textual sequence that includes input(s) 2. For instance, machine-learned model(s) 1 can be configured to generate output(s) 3 to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by input(s) 2.
[0407] In some implementations, the task can be an instruction following task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent instructions to perform a function and to generate output(s) 3 that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions.
[0408] In some implementations, the task can be a question answering task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent a question to answer and to generate output(s) 3 that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question.
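The iterative generation described in this paragraph can be illustrated with a minimal sketch. Every name here (`run_steps`, `toy_model`, `toy_execute`, the `final`/`content` keys) is a hypothetical placeholder rather than part of the disclosure, and the toy model merely stands in for machine-learned model(s) 1.

```python
from typing import Callable


def run_steps(
    instruction: str,
    model_step: Callable[[str, list], dict],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> str:
    """Generate outputs step by step until the model marks one as final."""
    history: list = []  # results of intermediate outputs executed so far
    for _ in range(max_steps):
        output = model_step(instruction, history)
        if output["final"]:
            return output["content"]
        # An intermediate output is executed by an "external system" stand-in.
        history.append(execute(output["content"]))
    raise RuntimeError("no final output within max_steps")


def toy_model(instruction: str, history: list) -> dict:
    """Hypothetical model: first requests a computation, then reports it."""
    if not history:
        return {"final": False, "content": "2+3"}
    return {"final": True, "content": f"The answer is {history[-1]}"}


def toy_execute(expression: str) -> str:
    """Hypothetical external tool: adds integers separated by '+'."""
    return str(sum(int(part) for part in expression.split("+")))


result = run_steps("add 2 and 3", toy_model, toy_execute)
```

The `max_steps` bound is a design choice of the sketch, added so that a model that never produces a final output cannot loop indefinitely.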
[0409] In some implementations, the task can be an image generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of image content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent image data that depicts imagery related to the context. For instance, machine-learned model(s) 1 can be configured to generate pixel data of an image. Values for channel(s) associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context).
[0410] In some implementations, the task can be an audio generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of audio content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent audio data related to the context. For instance, machine-learned model(s) 1 can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channel(s) associated with pixels of the image can be selected based on the context. Machine-learned model(s) 1 can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context).
[0411] In some implementations, the task can be a data generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.). The desired data can be, for instance, synthetic data for training other machine-learned models. The context can include arbitrary data type(s). Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent data that aligns with the desired data. For instance, machine-learned model(s) 1 can be configured to generate data values for populating a dataset. Values for the data object(s) can be selected based on the context (e.g., based on a probability determined based on the context).
[0412] Figure 18 is a block diagram of an example networked computing system that can perform aspects of example implementations of the present disclosure. The system can include a number of computing devices and systems that are communicatively coupled over a network 49. An example computing device 50 is described to provide an example of a computing device that can perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). An example server computing system 60 is described as an example of a server computing system that can perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). Computing device 50 and server computing system(s) 60 can cooperatively interact (e.g., over network 49) to perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). Model development platform system 70 is an example system that can host or serve model development platform(s) 12 for development of machine-learned models. Third-party system(s) 80 are example system(s) with which any of computing device 50, server computing system(s) 60, or model development platform system(s) 70 can interact in the performance of various aspects of the present disclosure (e.g., engaging third-party tools, accessing third-party databases or other resources, etc.).
[0413] Network 49 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over network 49 can be carried via any type of wired or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), or protection schemes (e.g., VPN, secure HTTP, SSL). Network 49 can also be implemented via a system bus. For instance, one or more devices or systems of Figure 18 can be co-located with, contained by, or otherwise integrated into one or more other devices or systems.
[0414] Computing device 50 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, a server computing device, a virtual machine operating on a host device, or any other type of computing device. Computing device 50 can be a client computing device. Computing device 50 can be an end-user computing device. Computing device 50 can be a computing device of a service provider that provides a service to an end user (who may use another computing device to interact with computing device 50).
[0415] Computing device 50 can include one or more processors 51 and a memory 52. Processor(s) 51 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 52 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 52 can store data 53 and instructions 54 which can be executed by processor(s) 51 to cause computing device 50 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.
[0416] Computing device 50 can also include one or more input components that receive user input. For example, a user input component can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, camera, LIDAR, a physical keyboard or other buttons, or other means by which a user can provide user input.
[0417] Computing device 50 can store or include one or more machine-learned models 55. Machine-learned models 55 can include one or more machine-learned model(s) 1, such as a sequence processing model 4. Machine-learned models 55 can include one or multiple model instance(s) 31-1. Machine-learned model(s) 55 can be received from server computing system(s) 60, model development platform system 70, third party system(s) 80 (e.g., an application distribution platform), or developed locally on computing device 50. Machine-learned model(s) 55 can be loaded into memory 52 and used or otherwise implemented by processor(s) 51. Computing device 50 can implement multiple parallel instances of machine-learned model(s) 55.
[0418] Server computing system(s) 60 can include one or more processors 61 and a memory 62. Processor(s) 61 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 62 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 62 can store data 63 and instructions 64 which can be executed by processor(s) 61 to cause server computing system(s) 60 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.
[0419] In some implementations, server computing system 60 includes or is otherwise implemented by one or multiple server computing devices. In instances in which server computing system 60 includes multiple server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
[0420] Server computing system 60 can store or otherwise include one or more machine-learned models 65. Machine-learned model(s) 65 can be the same as or different from machine-learned model(s) 55. Machine-learned models 65 can include one or more machine-learned model(s) 1, such as a sequence processing model 4. Machine-learned models 65 can include one or multiple model instance(s) 31-1. Machine-learned model(s) 65 can be received from computing device 50, model development platform system 70, third party system(s) 80, or developed locally on server computing system(s) 60. Machine-learned model(s) 65 can be loaded into memory 62 and used or otherwise implemented by processor(s) 61. Server computing system(s) 60 can implement multiple parallel instances of machine-learned model(s) 65.
[0421] In an example configuration, machine-learned models 65 can be included in or otherwise stored and implemented by server computing system 60 to establish a client-server relationship with computing device 50 for serving model inferences. For instance, server computing system(s) 60 can implement model host 31 on behalf of client(s) 32 on computing device 50. For instance, machine-learned models 65 can be implemented by server computing system 60 as a portion of a web service (e.g., remote machine-learned model hosting service, such as an online interface for performing machine-learned model operations over a network on server computing system(s) 60). For instance, server computing system(s) 60 can communicate with computing device 50 over a local intranet or internet connection. For instance, computing device 50 can be a workstation or endpoint in communication with server computing system(s) 60, with implementation of machine-learned models 65 being managed by server computing system(s) 60 to remotely perform inference (e.g., for runtime or training operations), with output(s) returned (e.g., cast, streamed, etc.) to computing device 50. Machine-learned models 65 can work cooperatively or interoperatively with machine-learned models 55 on computing device 50 to perform various tasks.
[0422] Model development platform system(s) 70 can include one or more processors 71 and a memory 72. Processor(s) 71 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 72 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 72 can store data 73 and instructions 74 which can be executed by processor(s) 71 to cause model development platform system(s) 70 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to model development platform 12. This and other functionality can be implemented by developer tool(s) 75.
[0423] Third-party system(s) 80 can include one or more processors 81 and a memory 82. Processor(s) 81 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 82 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 82 can store data 83 and instructions 84 which can be executed by processor(s) 81 to cause third-party system(s) 80 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to tools and other external resources called when training or performing inference with machine-learned model(s) 1, 4, 16, 20, 55, 65, etc. (e.g., third-party resource(s) 85).
[0424] Figure 18 illustrates one example arrangement of computing systems that can be used to implement the present disclosure. Other computing system configurations can be used as well. For example, in some implementations, one or both of computing device 50 or server computing system(s) 60 can implement all or a portion of the operations of model development platform system 70. For example, computing device 50 or server computing system(s) 60 can implement developer tool(s) 75 (or extensions thereof) to develop, update/train, or refine machine-learned models 1, 4, 16, 20, 55, 65, etc. using one or more techniques described herein with respect to model alignment toolkit 17. In this manner, for instance, computing device 50 or server computing system(s) 60 can develop, update/train, or refine machine-learned models based on local datasets (e.g., for model personalization/customization, as permitted by user data preference selections).
[0425] Figure 19 is a block diagram of an example computing device 98 that performs according to example embodiments of the present disclosure. Computing device 98 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 98 can implement model host 31. For instance, computing device 98 can include a number of applications (e.g., applications 1 through N). Each application can contain its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. As illustrated in Figure 19, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.
[0426] Figure 20 is a block diagram of an example computing device 99 that performs according to example embodiments of the present disclosure. Computing device 99 can be the same as or different from computing device 98. Computing device 99 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 98 can implement model host 31. For instance, computing device 99 can include a number of applications (e.g., applications 1 through N). Each application can be in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
[0427] The central intelligence layer can include a number of machine-learned models. For example, as illustrated in Figure 20, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of computing device 99.
[0428] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for computing device 99. As illustrated in Figure 20, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
[0429] Figure 21 depicts a flowchart of a method 2100 for implementing a machine-learned agent system according to aspects of the present disclosure.
[0430] One or more portion(s) of example method 2100 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the preceding figures. Each respective portion of example method 2100 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 2100 can be implemented on the hardware components of the device(s) described herein. Figure 21 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 21 is described with reference to elements/terms described with respect to other systems and figures for illustrative purposes and is not meant to be limiting. One or more portions of example method 2100 can be performed additionally, or alternatively, by other systems.
[0431] At 2102, example method 2100 includes receiving, by a machine-learned agent system, a query associated with a user. For example, the query (e.g., 104) can be received from one or more input interfaces (e.g., 102) of a machine-learned agent system (e.g., 100). The query can be received from a user via a user interface (e.g., a virtual assistant application) or from another system (e.g., via an API call). The query can be in the form of a natural language utterance, a structured data object, or a combination thereof. The query can include multimodal data (e.g., text, audio, and/or image data). The machine-learned agent system (e.g., 100) can process the query to determine the user’s intent and relevant context.
[0432] At 2104, example method 2100 includes accessing, by the machine-learned agent system, memory data from an interaction memory datastore associated with the user. In some implementations, the interaction memory datastore includes one or more memory objects that were generated based on one or more prior interactions between the machine-learned agent system and the user. In some implementations, the interaction memory datastore includes multimodal data. For example, the memory data can be based on a first memory object associated with a first data modality and a second memory object associated with a second data modality. For example, the first data modality can include audio data, and the second data modality can include image data. For example, the first data modality can include text data, and the second data modality can include image data. For example, the first data modality can include text data, and the second data modality can include audio data.
[0433] For example, the machine-learned agent system (e.g., 100) can issue a memory query (e.g., 108) to an interaction memory datastore (e.g., 110) associated with the user. The memory query can be based on the query (e.g., 104) and can include keywords, semantic embeddings, or other data structures that facilitate retrieval of relevant memory data from the interaction memory datastore (e.g., 110). The interaction memory datastore (e.g., 110) can contain a plurality of memory objects (e.g., 202-1, ..., 202-M), each including one or more memory values (e.g., 204-1, ..., 204-M) and associated metadata (e.g., 206-1, ..., 206-M). The memory query (e.g., 108) can be processed by the interaction memory datastore (e.g., 110) to retrieve relevant memory objects (e.g., 302, 304, 306 in Figure 3) based on criteria such as keyword matching, semantic similarity, or temporal proximity. The retrieved memory objects can be used to generate memory data (e.g., 308, 114) that is relevant to the query (e.g., 104). The generation of memory data (e.g., 308) can involve extracting memory values (e.g., 204-1, ..., 204-M) from the retrieved memory objects or using a machine-learned model to generate a summary or other representation of the retrieved memory objects. The memory data (e.g., 114) can then be used to condition the input to a machine-learned model (e.g., 118) for generating a response to the query (e.g., 104). A recall cycle (e.g., 106) can encapsulate these operations. The recall cycle (e.g., 106) can be performed online (e.g., during processing of the query) or offline (e.g., in advance of processing the query). The memory data (e.g., 114) can be incorporated into an input data structure (e.g., 116) that is provided to the machine-learned model (e.g., 118) for inference.
The memory data (e.g., 114) can be formatted in various ways to optimize its compatibility with the machine-learned model (e.g., 118), such as by concatenating textual memory values into a string, presenting multimodal data as a structured object, or generating embeddings representing the collected memories.
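As a rough illustration of one recall cycle, the sketch below retrieves memory objects by keyword matching, breaks ties by recency, and concatenates the textual memory values into a string. The `MemoryObject` class, the `recall` function, the `timestamp` metadata key, and the sample memories are all assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryObject:
    """Hypothetical stand-in for a memory object (e.g., 202-1)."""
    value: str                                    # memory value (e.g., 204-1)
    metadata: dict = field(default_factory=dict)  # metadata (e.g., 206-1)


def recall(query: str, datastore: list, top_k: int = 2) -> str:
    """Keyword-match the query against memory values and join the best hits."""
    query_terms = set(query.lower().split())
    scored = []
    for obj in datastore:
        overlap = len(query_terms & set(obj.value.lower().split()))
        if overlap > 0:
            # Keep the timestamp so that ties on overlap favor recent memories.
            scored.append((overlap, obj.metadata.get("timestamp", 0), obj))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    # Concatenate the textual memory values into a single string.
    return "\n".join(obj.value for _, _, obj in scored[:top_k])


datastore = [
    MemoryObject("user prefers window seats on flights", {"timestamp": 1}),
    MemoryObject("user has a peanut allergy", {"timestamp": 2}),
    MemoryObject("user usually books morning flights", {"timestamp": 3}),
]
memory_data = recall("book flights to Boston", datastore)
```

In this toy run, only the two flight-related memories share a term with the query, so the allergy memory is left out of `memory_data`. A semantic or embedding-based criterion could replace the keyword overlap without changing the surrounding flow.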
[0434] At 2106, example method 2100 includes inputting, by the machine-learned agent system and to a machine-learned sequence processing model, an input data structure based on the query and the memory data. For example, the input data structure (e.g., 116) can include the query (e.g., 104) and the memory data (e.g., 114). The memory data (e.g., 114) can be formatted as a structured prompt or instruction (e.g., as in Figure 3) that is prepended to the query (e.g., 104) as contextual information. The input data structure (e.g., 116) can be provided to the machine-learned sequence processing model (e.g., 4 in Figure 13) for processing, with the output sequence (e.g., 7 in Figure 13) being used to generate the output (e.g., 120 in Figure 1). The input data structure (e.g., 116) can be formatted to optimize its compatibility with the machine-learned model (e.g., 118), such as by concatenating textual memory values into a string, presenting multimodal data as a structured object, or generating embeddings representing the collected memories.
[0435] At 2108, example method 2100 includes generating, by the machine-learned agent system and based on processing the input data structure using the machine-learned sequence processing model, an output. For example, the machine-learned agent system (e.g., 100) can use the machine-learned sequence processing model (e.g., 4 in Figure 13) to process the input data structure (e.g., 116) and generate an output (e.g., 120 in Figure 1). The output (e.g., 120) can be a sequence of elements (e.g., corresponding to 7-1, 7-2, ..., 7-N in Figure 13) that represent a response to the query (e.g., 104). The output (e.g., 120) can be generated autoregressively or non-autoregressively. The output (e.g., 120) can be in the same or a different modality as the query (e.g., 104). The output (e.g., 120) can include metadata, such as confidence scores or probabilities. The machine-learned agent system (e.g., 100) can process the output (e.g., 120) to generate a response (e.g., 122 in Figure 1) to the query (e.g., 104). The response (e.g., 122) can be rendered or transmitted to the user via output interfaces (e.g., 124 in Figure 1). The generation of the output can involve multiple rounds of interaction between the machine-learned agent system (e.g., 100) and the machine-learned sequence processing model (e.g., 4), with intermediate outputs being used to refine subsequent predictions.
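One way the input data structure might be assembled is sketched below: memory values are rendered as a short structured block that is prepended to the query. The function name `build_input` and the prompt layout ("Relevant memories:", "Query:") are illustrative assumptions, not the format shown in Figure 3.

```python
def build_input(query: str, memory_values: list) -> str:
    """Prepend memory data to the query as a structured prompt (layout assumed)."""
    if not memory_values:
        # With no retrieved memories, the query is passed through on its own.
        return f"Query: {query}"
    memory_block = "\n".join(f"- {value}" for value in memory_values)
    return f"Relevant memories:\n{memory_block}\n\nQuery: {query}"


input_data = build_input(
    "Find me a flight to Boston",
    ["User prefers window seats"],
)
```

The resulting string would then be tokenized and passed to the sequence processing model. A multimodal variant could instead return a structured object with separate fields for each modality.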
[0436] At 2110, example method 2100 includes outputting, by the machine-learned agent system, and based on the output, a response to the query. For example, the machine-learned agent system (e.g., 100) can generate response 122 based on output data structure 120. Response 122 can include data from output data structure 120. Response 122 can be a final or complete response to query 104, or response 122 can be a partial response to query 104 that effectuates a step in a multi-step response (e.g., performing a subtask in a multi-part task). Output interface(s) 124 can render, transmit, or execute data from response 122. Response 122 can include data for output to a user interface for rendering for a user. Response 122 can include text, image, audio, or other data that an output interface can render for a user. Response 122 can include data for communication to a user (e.g., to a device associated with an account, to another agent, to an external system). Response 122 can include instructions to control a receiving system. For instance, response 122 can include an application programming interface call generated based on output data structure 120. For instance, one or more parameters of the application programming interface call can be generated by a machine-learned model and provided to machine-learned agent system 100 in output data structure 120. The application programming interface call can be parsed and packaged into response 122. Response 122 can be output via an output interface to a receiving system (e.g., over a network, over a system bus, via a software queue or operating system communication pathway) to initiate an action of the receiving system according to the application programming interface call. Output interfaces 124 can include various mechanisms and devices that enable the machine-learned agent system to communicate with users or other systems.
These interfaces may consist of graphical user interfaces (GUIs), audio output devices, network connectivity devices, API libraries, and other components that can send data or render data relevant to the agent actions executed by the system. For example, GUIs can display responses to questions, results of a task, notifications about upcoming tasks, or alerts regarding any required user inputs. Audio outputs can provide auditory responses (e.g., spoken language responses) or alerts.
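The parsing and packaging of an application programming interface call can be sketched as follows. The sketch assumes, purely for illustration, that the model emits the call as a JSON object with `function` and `parameters` keys. The allow-list, the `create_reminder` function name, and the response dictionary layout are likewise hypothetical.

```python
import json

# Hypothetical allow-list of functions the receiving system exposes.
ALLOWED_FUNCTIONS = {"create_reminder"}


def package_response(output_text: str) -> dict:
    """Package model output as an API-call response, else as plain text."""
    try:
        call = json.loads(output_text)
    except json.JSONDecodeError:
        # Output that is not JSON is treated as text to render for the user.
        return {"type": "text", "text": output_text}
    if not isinstance(call, dict) or call.get("function") not in ALLOWED_FUNCTIONS:
        # Unknown or malformed calls are not forwarded to a receiving system.
        return {"type": "text", "text": output_text}
    return {
        "type": "api_call",
        "function": call["function"],
        "parameters": call.get("parameters", {}),
    }


api_response = package_response(
    '{"function": "create_reminder", "parameters": {"when": "9am"}}'
)
text_response = package_response("Your reminder is set.")
```

Checking the function name against an allow-list before packaging is a design choice of the sketch, made so that model-generated calls reach a receiving system only for known functions.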
[0437] In some implementations, example method 2100 includes retrieving, by the machine-learned agent system, and from the interaction memory datastore, the one or more memory values based on a relevance of the one or more memory values to the query. For example, the machine-learned agent system (e.g., 100) can retrieve memory objects (e.g., 202-1, ..., 202-M in Figure 2) from the interaction memory datastore (e.g., 110) based on a relevance measure (e.g., a similarity score, a distance metric, or a probability computed by a machine-learned model) between the query (e.g., 104) and the memory values (e.g., 204-1, ..., 204-M) stored in the memory objects. This retrieval process can involve a search operation (e.g., a keyword search, a semantic search, or a similarity search) that uses the query (e.g., 104) as a search key to identify relevant memory objects (e.g., 302, 304, 306 in Figure 3) in the interaction memory datastore (e.g., 110). The relevance measure can be computed using various techniques, such as comparing vector embeddings of the query (e.g., 104) and the memory values (e.g., 204-1, ..., 204-M), using keyword matching techniques, or using a machine-learned model to assess semantic similarity. The retrieved memory objects (e.g., 302, 304, 306) can then be used to generate memory data (e.g., 308, 114) that is relevant to the query (e.g., 104). The memory data (e.g., 114) can be incorporated into an input data structure (e.g., 116) that is provided to the machine-learned model (e.g., 118) for inference.
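A relevance measure based on comparing vector embeddings can be illustrated with cosine similarity. The two-dimensional embeddings below are made-up toy values; in practice an embedding model (not shown, and not specified here) would produce them. The function names and the `min_score` cutoff are assumptions of the sketch.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector is treated as irrelevant to everything
    return dot / (norm_a * norm_b)


def retrieve(query_embedding: list, memories: list, top_k: int = 1,
             min_score: float = 0.0) -> list:
    """Return the top_k memory values ranked by similarity to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), value)
        for value, embedding in memories
    ]
    scored = [item for item in scored if item[0] >= min_score]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [value for _, value in scored[:top_k]]


# Toy (memory value, embedding) pairs; the vectors are illustrative only.
memories = [
    ("prefers window seats", [1.0, 0.0]),
    ("allergic to peanuts", [0.0, 1.0]),
]
top = retrieve([0.9, 0.1], memories)
```

The query vector points mostly along the first axis, so the seat preference ranks first. A distance metric or a model-computed probability could serve as the relevance measure in the same loop.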
[0438] In some implementations, example method 2100 includes, after receiving an input during an interactive session, processing, by the machine-learned agent system, the input using a machine-learned model to generate one or more values. In some implementations, example method 2100 includes, based on the generated one or more values indicating that at least a portion of the input is to be stored: extracting, by the machine-learned agent system, the portion. In some implementations, example method 2100 includes storing, by the machine-learned agent system, the portion as a memory value in a memory object in the interaction memory datastore.
[0439] For example, after receiving an input (e.g., a textual input, an image, audio recording, etc.) during an interactive session (e.g., between a user and a machine-learned agent system 100), the machine-learned agent system (e.g., 100) can process the input using a machine-learned model (e.g., a model implemented by machine-learned model system(s) 118) to generate one or more values. The values can indicate that at least a portion of the input is to be stored. For example, the values can include scores associated with portions of the input, excerpts from the input, summaries or other generated representations of the input, etc. The machine-learned agent system (e.g., 100) can extract a portion of the input (e.g., using memory model 105) or generated representation indicated by the generated values. The extracted portion is then stored as a memory value (e.g., 204-1 in Figure 2) in a memory object (e.g., 202-1 in Figure 2) in the interaction memory datastore (e.g., 110). The memory object (e.g., 202-1) may also store associated metadata (e.g., 206-1 in Figure 2), such as a timestamp, the type of input, or other relevant context. The machine-learned model used for this processing (e.g., a component of machine-learned model system(s) 118) may be trained to identify salient information for memory storage, such as by analyzing the input for keywords, semantic meaning, or other relevant features. The machine-learned model used for this processing (e.g., a component of machine-learned model system(s) 118) may be a general purpose or foundational model that is prompted to perform this task. The process of generating values, extracting the portion, and storing the memory value and metadata can be part of a larger memory update cycle (e.g., 704 in Figure 7) that may be triggered by various events, such as explicit user commands or implicit system inferences. The memory update cycle (e.g., 704) may involve interactions between the machine-learned agent system (e.g., 100), the interaction memory datastore (e.g., 110), and the machine-learned model system(s) (e.g., 118), as shown in Figures 7, 8, 9, and 10.
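The score, extract, and store steps described above can be illustrated with a minimal sketch. Everything named here is hypothetical: `MemoryObject`, `score_portions`, and `update_memory` are illustrative names, and the cue-phrase scorer is only a stand-in for the machine-learned model, which the description does not specify at this level.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryObject:
    """Illustrative memory object: a memory value plus associated metadata."""
    value: str
    metadata: dict = field(default_factory=dict)


def score_portions(portions):
    # Hypothetical stand-in for the machine-learned model: a portion scores 1.0
    # if it contains a cue phrase suggesting it is worth remembering, else 0.0.
    cues = ("my ", "i prefer", "remember")
    return [1.0 if any(c in p.lower() for c in cues) else 0.0 for p in portions]


def update_memory(datastore, user_input, threshold=0.5, now=None):
    """Store each sentence of the input whose score meets the threshold."""
    now = now or datetime.now(timezone.utc)
    portions = [p.strip() for p in user_input.split(".") if p.strip()]
    stored = []
    for portion, score in zip(portions, score_portions(portions)):
        if score >= threshold:
            obj = MemoryObject(
                value=portion,
                metadata={
                    "timestamp": now.isoformat(),
                    "input_type": "text",
                    "score": score,
                },
            )
            datastore.append(obj)
            stored.append(obj)
    return stored
```

In this sketch the datastore is a plain list; a deployed system would more plausibly use a persistent, per-user store.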
[0440] In some implementations, example method 2100 includes storing, by the machine-learned agent system, metadata associated with the memory value in the memory object. For example, metadata (e.g., 206-1 in Figure 2) associated with the memory value (e.g., 204-1 in Figure 2) can include a timestamp indicating when the memory value was created or last updated, a source identifier indicating the origin of the memory value (e.g., an explicit user command or an implicit system inference), a confidence score indicating the certainty of the memory value, tags or keywords for categorization and retrieval, a summary describing the context in which the memory value was created, a session identifier linking the memory value to a specific user interaction session, or other relevant attributes. The metadata (e.g., 206-1) can be stored in a structured format, such as JSON or XML, to facilitate efficient storage and retrieval. The metadata (e.g., 206-1) can be used by the machine-learned agent system (e.g., 100) to manage and retrieve memory data (e.g., 114). For example, the machine-learned agent system (e.g., 100) can use the timestamp metadata to prioritize more recent memories, the confidence score metadata to filter out unreliable memories, and the tags or keywords metadata to perform efficient searches. The metadata (e.g., 206-1) can also be used to resolve conflicts between different memory values or to generate summaries or other representations of the memory data. A support record (e.g., within 202-1 in Figure 2) can be used to maintain a trace of the interactions and data elements that contributed to the generation of the memory value, enabling context-aware updates and conflict resolution.
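As a sketch of how such metadata might be used, the following filters out low-confidence memories and orders the rest from most to least recent. The JSON field names (`metadata`, `confidence`, `timestamp`) and the function name `rank_memories` are assumptions made for illustration, and the sort relies on all timestamps being ISO 8601 strings in the same format and time zone, so that string order matches time order.

```python
import json


def rank_memories(records_json, min_confidence=0.5):
    """Drop memories below the confidence threshold, then sort newest first."""
    records = json.loads(records_json)
    reliable = [
        r for r in records if r["metadata"]["confidence"] >= min_confidence
    ]
    # Assumes uniformly formatted ISO 8601 timestamps (lexicographic == temporal).
    return sorted(
        reliable, key=lambda r: r["metadata"]["timestamp"], reverse=True
    )
```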
[0441] In some implementations of example method 2100, the portion includes at least one of: text data, image data, or audio data.
[0442] In some implementations of example method 2100, the portion corresponds to an explicit instruction to remember information. For example, an explicit instruction to remember information (e.g., a user command) can be processed by the machine-learned agent system (e.g., 100) to extract relevant information for storage in the interaction memory datastore (e.g., 110). The system (e.g., memory model 105) may use a machine-learned model (e.g., within machine-learned model system(s) 118) to detect that a user has instructed the machine-learned agent system to remember information and to detect the information to be remembered. For example, the machine-learned agent system (e.g., 100) can receive an explicit instruction to remember information (e.g., a user command such as “remember that my favorite color is blue”) via input interface(s) 102. Memory model 105 can then process this input (e.g., using a machine-learned model within machine-learned model system(s) 118) to identify the information to be remembered (“my favorite color is blue”). This information is then extracted and stored as a memory value (e.g., 204-1 in Figure 2) within a new memory object (e.g., 202-1 in Figure 2) in interaction memory datastore 110. Associated metadata (e.g., 206-1 in Figure 2), such as a timestamp, the type of input (explicit user command), and potentially a confidence score reflecting the certainty of the information, can also be stored in the memory object. The process of creating this memory object can be part of a memory update cycle (e.g., 704 in Figure 7), which may include interactions with machine-learned model system(s) 118.
[0443] In some implementations, example method 2100 includes receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information. In some implementations, example method 2100 includes deleting, by the machine-learned agent system, the specified information from the interaction memory datastore. For example, the machine-learned agent system (e.g., 100) can receive input data (e.g., a user command such as “forget that”) via input interface(s) 102. Memory model 105 can then process this input (e.g., using a machine-learned model within machine-learned model system(s) 118) to identify the information to be forgotten (e.g., based on processing a cache of recent interactions). This information can be matched to one or more memory objects (e.g., 202-1, ..., 202-M in Figure 2) in interaction memory datastore 110 based on various criteria (e.g., keyword matching, semantic similarity, or temporal proximity). Memory model 105 can then cause the identified memory objects to be deleted from interaction memory datastore 110. This deletion can be performed immediately or scheduled for a later time. The process of identifying and deleting the specified information can be part of a memory update cycle (e.g., 704 in Figure 7), which may include interactions with machine-learned model system(s) 118.
[0444] In some implementations, example method 2100 includes receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information after a specified interval. In some implementations, example method 2100 includes queuing, by the machine-learned agent system, the specified information for deletion from the interaction memory datastore after the specified interval. For example, the machine-learned agent system (e.g., 100) can receive input data (e.g., a user command such as “forget my appointments after three months”) via input interface(s) 102. Memory model 105 can then process this input (e.g., using a machine-learned model within machine-learned model system(s) 118) to identify the information to be forgotten (“my appointments”) and the specified interval (“three months”). This information can be used to configure a deletion policy for interaction memory datastore 110. Memory model 105 can instruct interaction memory datastore 110 to delete memory objects (e.g., 202-1, ..., 202-M in Figure 2) that meet specified criteria (e.g., memory objects containing “appointments” as a tag and having a timestamp older than three months from the current time). The deletion can be performed immediately or scheduled for a later time.
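A deletion policy of this kind could be sketched as follows. The function name `apply_deletion_policy` and the dictionary fields `tags` and `timestamp` are hypothetical, and treating “three months” as 90 days is an assumption of this example, not something the description fixes.

```python
from datetime import datetime, timedelta


def apply_deletion_policy(datastore, tag, max_age, now):
    """Delete memory objects carrying `tag` whose timestamp is older than `max_age`.

    Mutates `datastore` in place and returns the number of objects removed.
    """
    cutoff = now - max_age
    kept = [
        m for m in datastore
        if not (tag in m["tags"] and m["timestamp"] < cutoff)
    ]
    removed = len(datastore) - len(kept)
    datastore[:] = kept
    return removed


# Usage: "forget my appointments after three months", approximated as 90 days.
three_months = timedelta(days=90)
```

Running the policy periodically with the current time as `now` would correspond to scheduled deletion; calling it once on receipt of the command would correspond to immediate deletion.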
[0445] In some implementations of example method 2100, the memory data includes an ordered combination of the one or more memory values, the ordered combination ordered based on corresponding timestamps of the plurality of memory objects. For example, the ordered combination of memory values (e.g., 204-1, ..., 204-M in Figure 2) can be presented to the machine-learned sequence processing model (e.g., 4 in Figure 13) in a structured format, with metadata such as timestamps (e.g., within 206-1, ..., 206-M in Figure 2) and confidence scores included to provide additional context. The combination of memory values can be presented in an unstructured format or natural language format, such as a bulleted list. The timestamps associated with each memory value (e.g., 204-1, ..., 204-M) allow the model to prioritize more recent context when making inferences. This temporal ordering can be helpful for resolving potential conflicts or ambiguities in the memory data. For example, if a user provides conflicting information at different times, the model can prioritize the most recent information based on its ordering within the list (e.g., as in memory data 308 in Figure 3). The ordered list can contain time information (e.g., the timestamp) or the ordering can implicitly communicate the temporal relationships between listed memory values.
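One way to picture the ordered combination is a bulleted list sorted by timestamp. In the sketch below, `build_memory_data` and the field names are illustrative, and placing the oldest entry first (so the most recent appears last) is an assumed convention; the description leaves the direction of the ordering open.

```python
def build_memory_data(memories, include_timestamps=True):
    """Render memory values as a bulleted list ordered oldest to newest."""
    ordered = sorted(memories, key=lambda m: m["timestamp"])
    lines = []
    for m in ordered:
        if include_timestamps:
            # Explicit time information accompanies each value.
            lines.append(f"- [{m['timestamp']}] {m['value']}")
        else:
            # Position in the list alone conveys the temporal relationship.
            lines.append(f"- {m['value']}")
    return "\n".join(lines)
```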
[0446] In some implementations, example method 2100 includes filtering the plurality of memory objects to obtain the one or more memory values, wherein the filtering includes computing one or more relevance measures for the plurality of memory objects based on the query. In some implementations, example method 2100 includes filtering the plurality of memory objects to obtain the one or more memory values, wherein the filtering includes returning a subset of the plurality of memory objects based on the relevance measure. For example, at 2104, the machine-learned agent system (e.g., 100) can use a memory query (e.g., 108) to filter the plurality of memory objects (e.g., 202-1, ..., 202-M in Figure 2) stored in the interaction memory datastore (e.g., 110). The memory query (e.g., 108) can be generated based on the received query (e.g., 104) and can include keywords, semantic embeddings (e.g., vectors), or other data structures that facilitate the retrieval of relevant memory objects. The filtering process can involve computing one or more relevance measures (e.g., similarity scores, distance metrics, or probabilities) for each memory object (e.g., 202-1, ..., 202-M) based on the query (e.g., 104). For example, the relevance measures can be computed by comparing vector embeddings of the query (e.g., 104) and the memory values (e.g., 204-1, ..., 204-M) stored in the memory objects, using keyword matching techniques, or using a machine-learned model (e.g., within 118) to assess semantic similarity. The filtering process can then return a subset of the memory objects (e.g., 302, 304, 306 in Figure 3) based on the computed relevance measures, such as by selecting the top-K most relevant memory objects, where K is a configurable parameter. The retrieved memory objects can then be used to generate the memory data (e.g., 308, 114) that is input to the machine-learned sequence processing model (e.g., 4 in Figure 13) along with the query (e.g., 104) in the input data structure (e.g., 116).
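The embedding-based variant of this filtering can be sketched with cosine similarity as the relevance measure and a top-K cut. Cosine similarity is one choice among the measures the description lists; the names `cosine_similarity` and `top_k_memories` and the `embedding` field are assumptions, and the embeddings themselves are taken as given rather than computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_memories(query_embedding, memories, k=3):
    """Return the k memory objects whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, m["embedding"]), m) for m in memories
    ]
    # Sort on the score only, so memory dicts are never compared to each other.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```

For large datastores, an approximate nearest-neighbor index would likely replace this linear scan.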
[0447] In some implementations of example method 2100, the one or more relevance measures include a respective score generated based on distance between a query embedding and a respective memory value embedding. In some implementations of example method 2100, the one or more relevance measures include a respective sequence output generated by processing the query and one or more respective memory values using a second machine-learned sequence processing model that is optionally the same as or different from the machine-learned sequence processing model.
[0448] In some implementations of example method 2100, the one or more memory values each include a respective string, and wherein the memory data includes a string including the one or more respective strings from the one or more memory values. For example, the one or more respective strings (e.g., from memory values 204-1, ..., 204-M in Figure 2) can be concatenated into a single string (e.g., as in memory data 308 in Figure 3) using a delimiter (e.g., a newline character, a semicolon, or other separator) to create a composite string representing the memory data (e.g., 114). This composite string can then be included in the input data structure (e.g., 116 in Figure 1) that is provided to the machine-learned sequence processing model (e.g., 4 in Figure 13). The machine-learned sequence processing model (e.g., 4) can process this composite string along with the query (e.g., 104) to generate an output (e.g., 120) that is used to create the response (e.g., 122) to the query.
[0449] In some implementations, example method 2100 includes inputting at least a portion of the output to the machine-learned sequence processing model. In some implementations, example method 2100 includes generating, based on processing the output using the machine-learned sequence processing model, the response to the query. In some implementations of example method 2100, the output includes a plurality of predictions for analytical content that includes an analysis of the query with respect to the memory data. For example, at 2108, the machine-learned agent system (e.g., 100) can use the machine-learned sequence processing model (e.g., 4 in Figure 13) to process the input data structure (e.g., 116 in Figure 1) and generate an output (e.g., 120 in Figure 1) that includes a plurality of predictions for analytical content. This analytical content can comprise an analysis of the query (e.g., 104) with respect to the memory data (e.g., 114) retrieved from the interaction memory datastore (e.g., 110). The machine-learned sequence processing model (e.g., 4) may generate this analytical content using techniques such as chain-of-thought reasoning, where the model generates intermediate steps of reasoning that are used to support its final prediction. These intermediate steps can be included in the output (e.g., 120) as part of the analytical content. The analytical content may include a summary of relevant memories (e.g., from memory objects 302, 304, 306 in Figure 3), an assessment of their relevance to the query (e.g., 104), and a reasoned explanation of how the memories influenced the final response (e.g., 122). The analytical content may be generated in multiple rounds of interaction between the machine-learned agent system (e.g., 100) and the machine-learned sequence processing model (e.g., 4), with intermediate outputs being used to refine subsequent predictions. The analytical content can be used for debugging, logging, or evaluating the performance of the system. In some implementations, the analytical content is not exposed to the user via a user interface (e.g., not included in response 122). The machine-learned agent system (e.g., 100) can use the analytical content (e.g., within output 120) to refine its response (e.g., 122) to the query (e.g., 104) or to improve its memory management strategies.
[0450] In some implementations of example method 2100, the plurality of predictions for analytical content are not exposed on a user interface associated with the query.
[0451] In some implementations, example method 2100 includes consolidating, in the interaction memory datastore, a subset of the plurality of memory objects into a single memory object based on an alignment between the subset. For example, the machine-learned agent system (e.g., 100) can identify a subset of memory objects (e.g., a subset of 202-1, ..., 202-M in Figure 2) in the interaction memory datastore (e.g., 110) that are semantically similar or related based on their content or metadata (e.g., 206-1, ..., 206-M in Figure 2). This identification can be performed using various techniques, such as clustering algorithms, semantic similarity measures, or machine-learned models trained to identify relationships between memory objects. The system (e.g., memory model 105) may then merge the identified memory objects (e.g., using a machine-learned model within machine-learned model system(s) 118) into a single memory object (e.g., a new memory object in Figure 2) that represents a consolidated summary or representation of the information contained in the subset. This consolidation can involve combining memory values (e.g., 204-1, ..., 204-M in Figure 2), such as concatenating text strings or averaging numerical values, and integrating metadata (e.g., 206-1, ..., 206-M in Figure 2) from multiple memory objects. The system (e.g., memory model 105) may use machine-learned models (e.g., within machine-learned model system(s) 118) to resolve conflicts or inconsistencies in the metadata (e.g., 206-1, ..., 206-M) during the consolidation process. For example, conflicting timestamps may be resolved by selecting the most recent timestamp, or by creating a composite timestamp reflecting the range of recorded times. Conflicting memory values may be resolved by selecting the value with the highest confidence score, or by generating a new value that synthesizes information from multiple sources. The consolidated memory object can then be stored in the interaction memory datastore (e.g., 110) to replace the original subset of memory objects. This consolidation process can improve the efficiency of memory management by reducing redundancy and improving the retrieval of relevant information. The consolidation process can be part of a memory maintenance cycle (e.g., 704 in Figure 7) that is performed periodically to optimize the organization and structure of the interaction memory datastore (e.g., 110).
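The rule-based conflict resolution mentioned above (highest confidence wins for values, most recent timestamp wins for time) can be sketched as follows. The subset is assumed to have been identified already, for instance by clustering; the function names and the fields `id`, `confidence`, `time_range`, and `merged_from` are hypothetical, and a model-generated synthesis of the values is outside the scope of this sketch.

```python
def consolidate(subset):
    """Merge related memory objects into one consolidated object."""
    best = max(subset, key=lambda m: m["confidence"])
    stamps = [m["timestamp"] for m in subset]
    return {
        "value": best["value"],              # value with the highest confidence
        "confidence": best["confidence"],
        "timestamp": max(stamps),            # most recent timestamp
        "time_range": (min(stamps), max(stamps)),  # composite range of times
        "merged_from": [m["id"] for m in subset],  # trace of the originals
    }


def replace_with_consolidated(datastore, subset_ids, new_id):
    """Replace the identified subset in `datastore` with its consolidation."""
    subset = [m for m in datastore if m["id"] in subset_ids]
    if len(subset) < 2:
        return None  # nothing to consolidate
    merged = consolidate(subset)
    merged["id"] = new_id
    datastore[:] = [m for m in datastore if m["id"] not in subset_ids] + [merged]
    return merged
```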
[0452] In some implementations, example method 2100 includes receiving input data associated with a timestamp. In some implementations, example method 2100 includes creating a current memory object including: the timestamp; and a current memory value based on the input data. In some implementations, example method 2100 includes matching the current memory object to a prior memory object. In some implementations, example method 2100 includes replacing, in the interaction memory datastore, the prior memory object with the current memory object.
[0453] In some implementations, example method 2100 includes receiving input data associated with a timestamp. In some implementations, example method 2100 includes matching at least a portion of the input data to a prior memory object. In some implementations, example method 2100 includes updating, in the interaction memory datastore, the prior memory object based on the portion of the input data. For example, at 2108, the machine-learned agent system (e.g., 100) receives input data (e.g., 500 in Figure 5) associated with a timestamp. The system then matches at least a portion of the input data to a prior memory object (e.g., 402 in Figure 4). Based on this match, the system updates (e.g., as in Figure 5), in the interaction memory datastore (e.g., 110), the prior memory object (e.g., 402) based on the portion of the input data (e.g., 500). This update might involve modifying existing attributes of the memory object (e.g., updating the memory value, confidence score, or support record) or adding new attributes. The updated memory object (e.g., 502 in Figure 5) reflects the new information provided in the input data (e.g., 500) while preserving the context and history associated with the prior memory object (e.g., 402). The update process might involve using a machine-learned model (e.g., within 118) to assess the relevance of the new information and to determine how to integrate it into the existing memory object. The updated memory object (e.g., 502) can then be stored in the interaction memory datastore (e.g., 110) for future use in conditioning the inferences of the machine-learned model (e.g., 118).
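A minimal sketch of updating a prior memory object in place is shown below. Matching on an exact `key` field is a simplifying assumption (the description contemplates model-based matching), and the function name, the field names, and the use of a list of source identifiers as the support record are all illustrative.

```python
def update_memory_object(datastore, key, new_value, timestamp, source_id):
    """Update the prior memory object matching `key`, or create one if none matches."""
    for obj in datastore:
        if obj["key"] == key:
            obj["value"] = new_value
            obj["timestamp"] = timestamp
            # The support record keeps a trace of every contributing input.
            obj["support"].append(source_id)
            return obj
    obj = {
        "key": key,
        "value": new_value,
        "timestamp": timestamp,
        "support": [source_id],
    }
    datastore.append(obj)
    return obj
```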
[0454] In some implementations of example method 2100, the response includes instructions configured to control an application programming interface to cause a computing system to perform an action. For example, the response (e.g., 122) can include an API call (e.g., generated by a machine-learned model (e.g., 118) and provided to machine-learned agent system 100 (e.g., in output data structure 120)) to a calendar application to create a calendar entry. The API call can specify parameters such as the time, date, and description of the appointment, which can be extracted from the query (e.g., 104) and / or memory data (e.g., 114). The API call can be packaged into response 122 and output via an output interface (e.g., 124) to the calendar application to initiate the creation of the calendar entry. This allows the machine-learned agent system (e.g., 100) to automate tasks on behalf of the user by generating and executing API calls based on the user’s query (e.g., 104) and the context provided by the memory data (e.g., 114). The API call can be generated using a machine-learned model (e.g., 118) that is trained to generate API calls in the correct format and with the appropriate parameters. The API call can include parameters that are customized based on the user’s preferences and prior interactions, as stored in the interaction memory datastore (e.g., 110). For example, the API call might specify a default calendar or notification settings based on the user’s preferences stored in memory objects (e.g., 202-1, ..., 202-M in Figure 2). The API call can be sent to the calendar application via an output interface (e.g., 124) that is configured to communicate with the calendar application (e.g., via a network connection or a system bus). The calendar application can then process the API call and create the calendar entry accordingly. The response (e.g., 122) might also include a confirmation message to the user indicating that the calendar entry has been created successfully.
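The combination of query-derived parameters with remembered preferences can be sketched as a small payload builder. The action name `create_event`, the parameter names, and the fallback defaults are all invented for illustration and do not correspond to any particular calendar application's API; in the described system the call would be generated by the machine-learned model rather than by fixed code.

```python
import json


def build_calendar_call(query_fields, memory_prefs):
    """Assemble a hypothetical calendar API payload as a JSON string."""
    params = {
        # Defaults drawn from remembered preferences, with assumed fallbacks.
        "calendar": memory_prefs.get("default_calendar", "primary"),
        "reminder_minutes": memory_prefs.get("reminder_minutes", 10),
    }
    # Fields extracted from the query take precedence over remembered defaults.
    params.update(query_fields)
    return json.dumps({"action": "create_event", "params": params}, sort_keys=True)
```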
[0455] In some implementations of example method 2100, the query indicates a configuration parameter that controls an operation of the computing system, and wherein the instructions are configured to control the application programming interface to cause the computing system to assign a value for the configuration parameter. For example, the query (e.g., 104) might indicate a configuration parameter that controls an operation of the computing system. The computing system can be a cloud server, a local device, or a host device for the machine-learned agent system. The machine-learned agent system (e.g., 100) processes the query (e.g., 104) and accesses relevant memory data (e.g., 114) from the interaction memory datastore (e.g., 110). This memory data (e.g., 114) might include past user preferences or instructions related to the configuration parameter. The system then generates an output (e.g., 120) that includes instructions (e.g., in response 122) configured to control an API call (e.g., via output interface 124) to cause the computing system to assign a value for the configuration parameter. For instance, if the query (e.g., 104) requests to enable location services, and the memory data (e.g., 114) indicates the user previously enabled location services for a specific app, or in a situation having matching context, the system can generate an API call (e.g., via output interface 124) to enable those services for that app without requiring further user input.
[0456] In some implementations of example method 2100, the configuration parameter is associated with at least one of: data access by the computing system, data retention by the computing system, or data communication by the computing system. For example, the configuration parameter (e.g., a setting within an analytics system for a web-based resource) might relate to data access permissions. The user might have previously expressed preferences regarding which data points should be accessible by the analytics system (e.g., via a memory value (e.g., 204-1) stored in a memory object (e.g., 202-1) within interaction memory datastore (e.g., 110)). When the web-based resource is loaded, and a pop-up user interface element requests confirmation of the data access permissions, the machine-learned agent system (e.g., 100) can automatically respond (e.g., at 2110) by generating an API call (e.g., within response 122) that configures the analytics system according to the user’s previously expressed preferences (e.g., retrieved at 2104) without requiring further user interaction. The machine-learned agent system (e.g., 100) can use the memory data (e.g., 114) to determine the appropriate value for the configuration parameter and to construct the API call (e.g., using a machine-learned model). The API call can be sent to the analytics system via an output interface (e.g., 124), which can be a network interface or other communication mechanism. The analytics system can then process the API call and apply the specified data access permissions. The entire process, from receiving the query (e.g., at 2102) to generating the API call (e.g., at 2110) and updating the analytics system, can be automated by the machine-learned agent system (e.g., 100) based on the user’s past interactions and preferences stored in the interaction memory datastore (e.g., 110).
[0457] The technology discussed herein refers to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.
[0458] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.
[0459] Aspects of the disclosure have been described in terms of illustrative embodiments thereof. Any and all features in the following claims can be combined or rearranged in any way possible, including combinations of claims not explicitly enumerated in combination together, as the example claim dependencies listed herein should not be read as limiting the scope of possible combinations of features disclosed herein. Accordingly, the scope of the present disclosure is by way of example rather than by way of limitation, and the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Moreover, terms are described herein using lists of example elements joined by conjunctions such as “and,” “or,” “but,” etc. It should be understood that such conjunctions are provided for explanatory purposes only. Clauses and other sequences of items joined by a particular conjunction such as “or,” for example, can refer to “and / or,” “at least one of”, “any combination of” example elements listed therein, etc. Terms such as “based on” should be understood as “based at least in part on.”
[0460] The term “can” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X can perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.
[0461] The term “may” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X may perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.
Claims
WHAT IS CLAIMED IS:
1. A computer-implemented method, comprising: receiving, by a machine-learned agent system, a query associated with a user; accessing, by the machine-learned agent system, memory data from an interaction memory datastore associated with the user, wherein the interaction memory datastore comprises one or more memory objects that were generated based on one or more prior interactions between the machine-learned agent system and the user; inputting, by the machine-learned agent system and to a machine-learned sequence processing model, an input data structure based on the query and the memory data; generating, by the machine-learned agent system and based on processing the input data structure using the machine-learned sequence processing model, an output; and outputting, by the machine-learned agent system, and based on the output, a response to the query.
2. The method of claim 1, wherein the interaction memory datastore comprises multimodal data, and wherein the memory data is based on a first memory object associated with a first data modality and a second memory object associated with a second data modality.
3. The method of claim 2, wherein: the first data modality comprises audio data, and wherein the second data modality comprises image data; the first data modality comprises text data, and wherein the second data modality comprises image data; or the first data modality comprises text data, and wherein the second data modality comprises audio data.
4. The method of claim 1, comprising: retrieving, by the machine-learned agent system, and from the interaction memory datastore, the one or more memory values based on a relevance of the one or more memory values to the query.
5. The method of any of the preceding claims, comprising: after receiving an input during an interactive session, processing, by the machine-learned agent system, the input using a machine-learned model to generate one or more values; and based on the generated one or more values indicating that at least a portion of the input is to be stored: extracting, by the machine-learned agent system, the portion; and storing, by the machine-learned agent system, the portion as a memory value in a memory object in the interaction memory datastore.
6. The method of claim 5, comprising: storing, by the machine-learned agent system, metadata associated with the memory value in the memory object.
7. The method of claim 5, wherein the portion comprises at least one of: text data, image data, or audio data.
8. The method of claim 5, wherein the portion corresponds to an explicit instruction to remember information.
9. The method of any of the preceding claims, comprising: receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information; and deleting, by the machine-learned agent system, the specified information from the interaction memory datastore.
10. The method of any of the preceding claims, comprising: receiving, by the machine-learned agent system, input data indicating an instruction to forget specified information after a specified interval; and queuing, by the machine-learned agent system, the specified information for deletion from the interaction memory datastore after the specified interval.
11. The method of any of the preceding claims, wherein the memory data comprises an ordered combination of the one or more memory values, the ordered combination ordered based on corresponding timestamps of the plurality of memory objects.
12. The method of any of the preceding claims, comprising: filtering the plurality of memory objects to obtain the one or more memory values, wherein the filtering comprises: computing one or more relevance measures for the plurality of memory objects based on the query; and returning a subset of the plurality of memory objects based on the relevance measure.
13. The method of claim 12, wherein the one or more relevance measures comprise: a respective score generated based on distance between a query embedding and a respective memory value embedding; or a respective sequence output generated by processing the query and one or more respective memory values using a second machine-learned sequence processing model that is optionally the same as or different from the machine-learned sequence processing model.
14. The method of any of the preceding claims, wherein the one or more memory values each comprise a respective string, and wherein the memory data comprises a string comprising the one or more respective strings from the one or more memory values.
15. The method of any of the preceding claims, comprising: inputting at least a portion of the output to the machine-learned sequence processing model; generating, based on processing the output using the machine-learned sequence processing model, the response to the query; wherein the output comprises a plurality of predictions for analytical content that comprises an analysis of the query with respect to the memory data.
16. The method of any of the preceding claims, wherein the plurality of predictions for analytical content are not exposed on a user interface associated with the query.
17. The method of any of the preceding claims, comprising: consolidating, in the interaction memory datastore, a subset of the plurality of memory objects into a single memory object based on an alignment between the subset.
18. The method of any of the preceding claims, comprising: receiving input data associated with a timestamp; creating a current memory object comprising: the timestamp; and a current memory value based on the input data; matching the current memory object to a prior memory object; and replacing, in the interaction memory datastore, the prior memory object with the current memory object.
19. The method of any of the preceding claims, comprising: receiving input data associated with a timestamp; matching at least a portion of the input data to a prior memory object; updating, in the interaction memory datastore, the prior memory object based on the portion of the input data.
20. The method of any of the preceding claims, wherein the response comprises instructions configured to control an application programming interface to cause a computing system to perform an action.
21. The method of claim 20, wherein the query indicates a configuration parameter that controls an operation of the computing system, and wherein the instructions are configured to control the application programming interface to cause the computing system to assign a value for the configuration parameter.
22. The method of claim 21, wherein the configuration parameter is associated with at least one of: data access by the computing system; data retention by the computing system; or data communication by the computing system.
23. One or more non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising the method of any of the preceding claims.
24. A computing system, comprising: one or more processors; and one or more non-transitory computer-readable media storing instructions that are executable by the one or more processors to cause the computing system to perform operations, the operations comprising the method of any of the preceding claims.
25. A computer program product comprising instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising the method of any of the preceding claims.
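By way of non-limiting illustration only, and not forming part of the claims, the following Python sketch shows one way the storing, forgetting, relevance filtering, and timestamp ordering recited above might be arranged. All names (MemoryObject, InteractionMemoryDatastore, cosine_distance, build_memory_data) are hypothetical, the embeddings are hand-written placeholder vectors rather than outputs of any model, and cosine distance is assumed here as one possible distance between a query embedding and a memory value embedding.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryObject:
    """Hypothetical memory object: a value, a timestamp, an embedding, metadata."""
    value: str
    timestamp: float
    embedding: list
    metadata: dict = field(default_factory=dict)


def cosine_distance(a, b):
    """Assumed relevance measure: 1 - cosine similarity (smaller is more relevant)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class InteractionMemoryDatastore:
    """Hypothetical per-user datastore holding memory objects in a list."""

    def __init__(self):
        self.objects = []

    def store(self, obj):
        # Store a memory value (with its metadata) as a memory object.
        self.objects.append(obj)

    def forget(self, predicate):
        # Delete every memory object matching the predicate; return the count removed.
        before = len(self.objects)
        self.objects = [o for o in self.objects if not predicate(o)]
        return before - len(self.objects)

    def retrieve(self, query_embedding, k=2):
        # Compute a relevance measure per object and keep the k closest ...
        ranked = sorted(
            self.objects,
            key=lambda o: cosine_distance(query_embedding, o.embedding),
        )
        # ... then order the returned subset by timestamp, oldest first.
        return sorted(ranked[:k], key=lambda o: o.timestamp)


def build_memory_data(objects):
    # Combine the string memory values into a single string of memory data.
    return "\n".join(o.value for o in objects)


store = InteractionMemoryDatastore()
store.store(MemoryObject("prefers aisle seats", 3.0, [1.0, 0.0]))
store.store(MemoryObject("allergic to peanuts", 1.0, [0.0, 1.0]))
store.store(MemoryObject("flies from SFO", 2.0, [0.9, 0.1]))

memory_data = build_memory_data(store.retrieve([1.0, 0.0], k=2))
print(memory_data)
```

In this sketch the resulting string would be combined with the query to form the input data structure for the sequence processing model; that step is omitted because the claims do not fix a particular model interface.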