Multimodal memory for machine-learned agents

The multimodal multi-index data structure addresses inefficiencies in storing and retrieving multimodal memories for machine-learned agents, enhancing inference output quality and reducing computational cost through optimized retrieval methods.

WO2026128110A1 · PCT designated stage · Publication Date: 2026-06-18 · GDM HOLDING LLC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GDM HOLDING LLC
Filing Date
2025-10-30
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing machine-learned systems face challenges in efficiently storing and retrieving multimodal memories for effective inference, leading to suboptimal performance and increased computational costs.

Method used

A multimodal multi-index data structure is employed to store and retrieve memories, using various indexing techniques to facilitate rapid retrieval of relevant data entries based on multimodal inputs, including image embeddings, video captions, natural language embeddings, and extracted facts, enabling higher-quality inference outputs at reduced computational cost.

Benefits of technology

The solution enhances the quality and relevance of inference outputs while reducing computational complexity, improving the functioning of machine-learned agents by ensuring more accurate and efficient retrieval of memories.

✦ Generated by Eureka AI based on patent content.


Abstract

A computing system can obtain a multimodal multi-index data structure. The data structure can include a first and second plurality of data entries having first or second data types, and a plurality of index data structures. Each index data structure can include a plurality of index data entries each correlating a respective data entry of the first or second plurality of data entries with a corresponding indexing value. The computing system can receive a multimodal input comprising a query directed to a machine-learned agent. The computing system can retrieve, based at least in part on the query and the plurality of index data structures, a first data entry of the first or second plurality of data entries. The computing system can provide, to the machine-learned agent, the first data entry. The machine-learned agent can generate, based at least in part on the first data entry, an inference output.

Description

MULTIMODAL MEMORY FOR MACHINE-LEARNED AGENTS

PRIORITY CLAIM

[0001] The present application claims priority to United States Provisional Application No. 63/730,687 filed December 11, 2024, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

[0002] The present disclosure relates generally to machine learning processes and machine-learned devices and systems. A computer can receive input(s). The computer can execute instructions to process the input(s) to generate output(s) using a parameterized model. The computer can obtain feedback on its performance in generating the outputs with the model. The computer can generate feedback by evaluating its performance. The computer can receive feedback from an external source. The computer can update parameters of the model based on the feedback to improve its performance. In this manner, the computer can iteratively "learn" to generate the desired outputs. The resulting model is often referred to as a machine-learned model.

SUMMARY

[0003] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

[0004] Example aspects of the present disclosure provide an example method. In some implementations, the example method can include obtaining, by a computing system comprising one or more computing devices, a multimodal multi-index data structure. The multimodal multi-index data structure can include a first plurality of data entries having at least a first data type. The multimodal multi-index data structure can include a second plurality of data entries having at least a second data type different from the first data type. The multimodal multi-index data structure can include a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value. The method can include receiving, by the computing system, a multimodal input comprising a query directed to a machine-learned agent. The method can include retrieving, by the computing system based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries. The method can include providing, by the computing system to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.

[0005] In the example method, obtaining the multimodal multi-index data structure can include receiving, by the computing system, one or more inputs directed to a machine-learned agent. In the example method, obtaining the multimodal multi-index data structure can include extracting, by the computing system and based at least in part on the one or more inputs, one or more data entries to be added to the first plurality of data entries or second plurality of data entries. In the example method, obtaining the multimodal multi-index data structure can include generating, by the computing system and based at least in part on the one or more data entries, one or more index data entries to be added to the plurality of index data entries. In the example method, obtaining the multimodal multi-index data structure can include storing, by the computing system, the one or more data entries and the one or more index data entries in the multimodal multi-index data structure.

[0006] In the example method, the one or more data entries can include one or more image frames. In the example method, generating the one or more index data entries can include providing, by the computing system to a machine-learned image embedding model, the one or more image frames. In the example method, generating the one or more index data entries can include receiving, by the computing system from the machine-learned image embedding model, an image embedding. In the example method, generating the one or more index data entries can include generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the image embedding.

[0007] In the example method, the one or more data entries can include one or more captions associated with one or more video segments. In the example method, extracting the one or more data entries can include segmenting the multimodal input to generate one or more video segments. In the example method, extracting the one or more data entries can include providing the one or more video segments to a machine-learned video captioning model. In the example method, generating the one or more index data entries can include providing, by the computing system to a machine-learned language embedding model, the one or more captions. In the example method, generating the one or more index data entries can include receiving, by the computing system from the machine-learned language embedding model, a language embedding vector. In the example method, generating the one or more index data entries can include generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the language embedding vector.

[0008] In the example method, the one or more data entries can include data indicative of a natural language input component. In the example method, generating the one or more index data entries can include providing, by the computing system to a machine-learned natural language embedding model, the data indicative of a natural language input component. In the example method, generating the one or more index data entries can include receiving, by the computing system from the machine-learned natural language embedding model, a natural language embedding based on the data indicative of the natural language input component. In the example method, generating the one or more index data entries can include generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the natural language embedding.

[0009] In the example method, the one or more data entries can include one or more facts extracted from one or more dialogue turns between a user and the machine-learned agent. In the example method, extracting the one or more data entries can include providing, by the computing system to a machine-learned fact extraction system, at least a portion of the one or more dialogue turns. In the example method, generating the one or more index data entries can include providing, by the computing system to a machine-learned language embedding model, at least a portion of the one or more facts.

[0010] In the example method, the one or more data entries can include one or more summaries of content of one or more dialogue turns between a user and the machine-learned agent. In the example method, extracting the one or more data entries can include providing, by the computing system to a machine-learned language model, at least a portion of the one or more dialogue turns. In the example method, generating the one or more index data entries can include providing, by the computing system to a machine-learned natural language embedding model, at least a portion of the one or more summaries.

[0011] In the example method, the one or more data entries can include multimodal data. In the example method, generating the one or more index data entries can include providing, by the computing system to a multimodal machine-learned model, the multimodal data. In the example method, generating the one or more index data entries can include receiving, by the computing system and from the multimodal machine-learned model, a multimodal embedding. In the example method, generating the one or more index data entries can include generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the multimodal embedding.

[0012] In the example method, retrieving the first data entry can include generating, by the computing system based at least in part on the query, a plurality of indexing values associated with the plurality of index data structures. In the example method, retrieving the first data entry can include identifying, by the computing system based at least in part on the plurality of indexing values and based at least in part on the plurality of index data structures, a plurality of candidate data entries. In the example method, retrieving the first data entry can include selecting, by the computing system from the plurality of candidate data entries, the first data entry.

[0013] In the example method, retrieving the first data entry can include generating, by the computing system, a plurality of scores respectively associated with the plurality of candidate data entries. In the example method, selecting the first data entry can include selecting based at least in part on the plurality of scores.

[0014] In the example method, generating the plurality of scores can include scoring the candidate data entries according to a common scoring function that is shared among the plurality of index data structures.

[0015] In the example method, generating the plurality of scores can include generating based at least in part on a metric of diversity relative to one or more other candidate data entries of the plurality of candidate data entries.

[0016] In the example method, generating the plurality of scores can include generating based at least in part on data indicative of a time at which a candidate data entry was obtained.
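
The scoring considerations recited above (a common scoring function, a diversity metric, and the time at which a candidate was obtained) could be combined in many ways. The following is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes each candidate already carries a relevance value on a scale shared by all indexes, models recency as an exponential decay, and handles diversity with a greedy selection that penalizes similarity to already-chosen candidates. The names `Candidate`, `score`, and `select_diverse`, the one-day half-life, and both weights are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    memory_id: str
    relevance: float    # assumed to already be on a scale shared by all indexes
    obtained_at: float  # seconds since epoch when the memory was obtained
    embedding: tuple    # used only for the diversity term


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(c, now, half_life_s=86_400.0, recency_weight=0.2):
    """Relevance plus a recency bonus that halves every `half_life_s` seconds."""
    age = max(0.0, now - c.obtained_at)
    return c.relevance + recency_weight * 0.5 ** (age / half_life_s)


def select_diverse(candidates, now, m, diversity_weight=0.5):
    """Greedily pick up to m candidates, penalizing similarity to earlier picks."""
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < m:
        def marginal(c):
            redundancy = max(
                (cosine(c.embedding, s.embedding) for s in chosen), default=0.0
            )
            return score(c, now) - diversity_weight * redundancy

        best = max(remaining, key=marginal)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With these assumed weights, a near-duplicate of an already selected candidate can lose its slot to a less relevant candidate that carries different information.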

[0017] In the example method, generating the plurality of indexing values can include segmenting, by the computing system, the multimodal input to extract a segment comprising the query. In the example method, generating the plurality of indexing values can include generating, by the computing system based on the segment, one or more revised queries having one or more formats that are compatible with one or more indexers associated with the plurality of index data structures. In the example method, generating the plurality of indexing values can include providing, by the computing system, to the one or more indexers, the one or more revised queries to generate the plurality of indexing values.

[0018] In the example method, retrieving the plurality of candidate data entries can include retrieving, by the computing system based on a metric of similarity between a first indexing value associated with the query and a plurality of indexing values associated with the plurality of first data entries, a plurality of top-k candidate data entries.
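
As an illustration of similarity-based top-k retrieval, the sketch below scores every index entry against the query's indexing value and keeps the k best. It is a brute-force example under assumptions: cosine similarity is only one possible similarity metric, the function names and the `(indexing_value, entry_id)` pair layout are hypothetical, and a production system might use an approximate nearest-neighbor index instead of a linear scan.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_value, index_entries, k):
    """Return the k (similarity, entry_id) pairs most similar to the query.

    index_entries: iterable of (indexing_value, entry_id) pairs.
    """
    scored = (
        (cosine_similarity(query_value, value), entry_id)
        for value, entry_id in index_entries
    )
    # nlargest returns results sorted from most to least similar.
    return heapq.nlargest(k, scored)
```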

[0019] In the example method, the computing system can include a client device comprising one or more input devices. In the example method, the computing system can include a server device comprising one or more of the machine-learned agent and the plurality of index data structures.

[0020] In the example method, the client device can include a wearable device configured to capture at least one of audio input and video input.

[0021] In the example method, the inference output can include data indicative of one or more application programming interfaces. The example method can include calling, by the computing system, the one or more application programming interfaces based on the inference output.

[0022] Example aspects of the present disclosure provide one or more example non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform example operations. In some implementations, the example operations can include the example method described above.

[0023] Example aspects of the present disclosure provide an example computing system that includes one or more processors and one or more example non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform example operations. In some implementations, the example operations can include the example method described above.

[0024] Example aspects of the present disclosure provide one or more example non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform example operations. In some implementations, the example operations can include obtaining a multimodal multi-index data structure. The multimodal multi-index data structure can include a first plurality of data entries having at least a first data type. The multimodal multi-index data structure can include a second plurality of data entries having at least a second data type different from the first data type. The multimodal multi-index data structure can include a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value. The example operations can include receiving a multimodal input comprising a query directed to a machine-learned agent. The example operations can include retrieving, based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries. The example operations can include providing, to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.

[0025] Example aspects of the present disclosure provide an example computing system that includes one or more processors and one or more example non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform example operations. In some implementations, the example operations can include obtaining a multimodal multi-index data structure. The multimodal multi-index data structure can include a first plurality of data entries having at least a first data type. The multimodal multi-index data structure can include a second plurality of data entries having at least a second data type different from the first data type. The multimodal multi-index data structure can include a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value. The example operations can include receiving a multimodal input comprising a query directed to a machine-learned agent. The example operations can include retrieving, based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries. The example operations can include providing, to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.

[0026] Other example aspects of the present disclosure are directed to other systems, methods, apparatuses, tangible non-transitory computer-readable media, and devices for performing functions described herein. These and other features, aspects, and advantages of various implementations will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate implementations of the present disclosure and, together with the description, help explain the related principles.

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] Figure 1 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0028] Figure 2 is a block diagram of an example system for storing multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0029] Figure 3 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0030] Figure 4A is a block diagram of a first view of an example system for storing and retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0031] Figure 4B is a block diagram of a second view of an example system for storing and retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0032] Figure 5 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent configured to use tools according to example implementations of some aspects of the present disclosure;

[0033] Figure 6 is a flow chart diagram of an example method for retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0034] Figure 7 is a flow chart diagram of an example method for storing multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure;

[0035] Figure 8 is a flow chart diagram illustrating an example method for training a machine-learned model according to example implementations of aspects of the present disclosure;

[0036] Figure 9 is a block diagram of an example processing flow for using machine-learned model(s) to process input(s) to generate output(s) according to example implementations of aspects of the present disclosure;

[0037] Figure 10 is a block diagram of an example sequence processing model according to example implementations of aspects of the present disclosure;

[0038] Figure 11 is a block diagram of an example technique for populating an example input sequence for processing by a sequence processing model according to example implementations of aspects of the present disclosure;

[0039] Figure 12 is a block diagram of an example model development platform according to example implementations of aspects of the present disclosure;

[0040] Figure 13 is a block diagram of an example training workflow for training a machine-learned model according to example implementations of aspects of the present disclosure;

[0041] Figure 14 is a block diagram of an inference system for operating one or more machine-learned model(s) to perform inference according to example implementations of aspects of the present disclosure;

[0042] Figure 15 is a block diagram of an example networked computing system according to example implementations of aspects of the present disclosure;

[0043] Figure 16 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure; and

[0044] Figure 17 is a block diagram of an example computing device according to example implementations of aspects of the present disclosure.

DETAILED DESCRIPTION

[0045] Generally, the present disclosure is directed to persistent storage and retrieval of memories for multimodal machine-learned agents. A computing system comprising a machine-learned agent can receive multimodal inputs associated with a plurality of interactions (e.g., with a user), such as multimodal inputs comprising audio and video data. The computing system can extract, from the multimodal inputs, relevant memories to store in a persistent data structure so that the memories can be used in later interactions of the machine-learned agent (e.g., with the same user, etc.). Later, the machine-learned agent can receive (e.g., from a user) a multimodal input comprising a query. A computing system can retrieve, from the persistent data structure, a memory that is relevant to the query, and the machine-learned agent can use the memory to generate an inference output that is responsive to the query.

[0046] In some instances, a data structure for storing memories can include a multi-index data structure comprising a plurality of indexes that may store indexing values indicative of different aspects of the memory data. An index can include, for example, a data structure to facilitate rapid retrieval of memory data based on indexing values associated with the data. For example, an index can include a plurality of data entries each correlating an indexing value with data (e.g., numerical identifier data, memory address data, etc.) indicative of a corresponding memory associated with the indexing value. To retrieve data based on an indexing value, a computing system can identify, in the index, an index data entry associated with the indexing value; and retrieve, based on the index data entry, a memory associated with the indexing value.
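
The two-step lookup described in this paragraph (find the index data entry for an indexing value, then follow it to the memory) can be sketched as below. This is an illustrative exact-match index only; the `Index` class, the `memories` store, and the choice of a session identifier as the indexing value are assumptions made for the example, and the sample memory contents are invented.

```python
from collections import defaultdict


class Index:
    """Correlates an indexing value with identifiers of the memories it describes."""

    def __init__(self):
        self._entries = defaultdict(list)  # indexing value -> [memory_id, ...]

    def add(self, indexing_value, memory_id):
        self._entries[indexing_value].append(memory_id)

    def lookup(self, indexing_value):
        # .get avoids creating an empty entry for values never indexed.
        return list(self._entries.get(indexing_value, []))


memories = {}            # memory_id -> stored memory data
session_index = Index()  # hypothetical index keyed by session identifier


def store(memory_id, data, session_id):
    memories[memory_id] = data
    session_index.add(session_id, memory_id)


def retrieve(session_id):
    # Step 1: identify index entries for the value. Step 2: follow them to memories.
    return [memories[memory_id] for memory_id in session_index.lookup(session_id)]
```

Because the index stores only identifiers, the same memory can be referenced from several indexes without being copied.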

[0047] In some instances, a multi-index data structure can include index data structures comprising machine-learned indexing values, such as semantic embedding values; index data structures comprising raw indexing values, such as timestamp values, raw input data, or the like; or other index data structures. In some instances, an indexing value can be generated from raw multimodal inputs, or from data generated based on the multimodal inputs (e.g., using a machine-learned model).

[0048] For example, in some instances, multimodal inputs received by the machine-learned agent can include a combination of video data and natural language data (e.g., speech data, text data, etc.). In some instances, stored memories can include raw image inputs (e.g., video frames, etc.), raw video inputs (e.g., video segments comprising multiple frames, etc.), raw natural language inputs (e.g., speech inputs, text inputs, etc.), or a combination thereof. Additionally, in some instances, stored memories can include data generated based on raw image inputs, such as caption data (e.g., caption data describing a video or image input, caption generated by a machine-learned vision language model, etc.), transcript data (e.g., audio speech recognition data, etc.), summary data (e.g., summary data summarizing natural language inputs, video inputs, or the like; summary generated by a machine-learned language model or multimodal language model, etc.), extracted facts (e.g., facts about a user or entities related to the user), or other data generated from the multimodal inputs received by the machine-learned model.

[0049] In some instances, indexing values stored in an index data structure can include raw values extracted from the memories to be stored (e.g., raw timestamps, session identifiers, or the like; raw text data such as names or natural language data; raw numerical data; etc.) or generated indexing values generated based on all or part of the memories to be stored. In some instances, generated indexing values can include semantic embedding values generated by a machine-learned semantic embedding model, such as a natural language embedding model, image embedding model, multimodal embedding model, or other semantic embedding model. For example, in some instances, one or more natural language embedding models can be used to generate a plurality of indexing values for one or more memories based on a plurality of different indexing inputs, such as one or more of raw or processed natural language input data (e.g., audio data, text transcripts of natural language speech data, etc.), caption data (e.g., video caption data, image frame caption data, etc.), summary data (e.g., natural language text summary of transcripts, captions, or other data, etc.), and extracted facts. As another example, in some instances, one or more image embedding models or video embedding models can generate an embedding vector based on a video frame or video segment to be stored in the memory data structure, and a computing system can store, in a video embedding index data structure, a data entry correlating the embedding vector with the frame or segment to be stored. As another example, in some instances, a multimodal memory (e.g., comprising raw multimodal inputs, etc.) can be provided to a multimodal machine-learned embedding model to generate an embedding vector indicative of the multimodal memory, and a computing system can store, in an index data structure comprising multimodal embeddings, a data entry correlating the embedding vector with the memory to be stored.

[0050] In some instances, a single memory can be indexed in one index data structure or a plurality of index data structures. For example, in some instances, memories comprising facts (e.g., facts about a user, etc.) extracted from past interactions of a machine-learned agent can be stored in a data structure (e.g., database, table, etc.) that may be separate from a data structure for storing raw memories (e.g., raw video inputs, raw natural language inputs, etc.), and an index data structure for extracted facts may comprise data entries correlating indexing values with the extracted facts (e.g., with or without reference to any raw multimodal inputs from which the facts were extracted, etc.). As another example, in some instances, memories comprising a plurality of data values (e.g., one or more of raw video data, raw audio data, audio transcript, automatically generated caption(s), summary data, or other memory data) can be stored in a single data structure (e.g., single database, single row, etc.), and a plurality of index data structures (e.g., caption index, transcript index, summary index, image embedding index, audio embedding index, etc.) can each comprise a plurality of data entries, with each data entry correlating a memory with a corresponding indexing value associated with the memory.

[0051] In some instances, storing memories based on interactions of a machine-learned agent can include segmenting multimodal inputs received by the machine-learned agent; processing (e.g., summarizing, captioning, etc.) one or more segments to generate one or more memories; providing the one or more memories or portions thereof to one or more indexers to generate indexing values; storing index data entries comprising the indexing values in one or more index data structures; and storing the memories in a memory data structure (e.g., in a compressed or uncompressed format).
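
The storage steps listed above (segment, process, index, store) can be sketched end to end as follows. Everything model-related here is a stand-in and an assumption: `caption` simply joins the inputs where a captioning or summarizing model would run, `embed` is a toy hashed bag-of-words vector rather than a learned embedding, and the fixed-size segmentation, the per-segment timestamps, and the `MemoryStore` layout are hypothetical choices made for illustration.

```python
import zlib
from dataclasses import dataclass, field


def segment(inputs, size):
    """Split a list of inputs into fixed-size segments (one simple strategy)."""
    return [inputs[i:i + size] for i in range(0, len(inputs), size)]


def caption(seg):
    """Stand-in for a captioning or summarizing model."""
    return " ".join(seg)


def embed(text, dim=8):
    """Stand-in for an embedding model: a toy hashed bag-of-words vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % dim] += 1.0
    return tuple(vec)


@dataclass
class MemoryStore:
    memories: dict = field(default_factory=dict)       # memory_id -> compressed memory
    caption_index: list = field(default_factory=list)  # (embedding, memory_id)
    time_index: list = field(default_factory=list)     # (timestamp, memory_id)

    def add(self, inputs, start_time, size=2):
        for n, seg in enumerate(segment(inputs, size)):
            memory_id = f"mem-{len(self.memories)}"
            text = caption(seg)
            # One memory is referenced from two indexes with different value types.
            self.caption_index.append((embed(text), memory_id))
            self.time_index.append((start_time + n, memory_id))
            self.memories[memory_id] = zlib.compress(text.encode())
        return self

    def load(self, memory_id):
        return zlib.decompress(self.memories[memory_id]).decode()
```

Storing the memory once and referencing it by identifier from each index keeps the raw and generated indexing values separate from the (here compressed) memory payload.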

[0052] In some instances, retrieving memories to respond to a query can include segmenting multimodal inputs received by the machine-learned agent; identifying a segment comprising the query; rewriting the query in a form that is suitable for indexing; providing the rewritten query to one or more indexers to generate one or more indexing values; and retrieving, from a memory data structure, one or more memories based on the indexing values. In some instances, retrieving the memories can include retrieving the top K candidate memories (e.g., nearest K indexing values wherein K is a positive integer, etc.) from each index data structure; reranking the candidate memories (e.g., based on a common reranking scale that is shared between the index data structures); and providing the top M reranked memories to the machine-learned agent, where M can be a positive integer (e.g., constant integer, variable integer, maximum number of memories that can fit in a context window of the machine-learned agent, etc.). In some instances, reranking can be based on various factors, such as relevance of an individual memory (e.g., based on a plurality of indexing values, etc.); diversity of a plurality of memories (e.g., similarity to or difference from other candidate memories); or other relevant ranking factors.
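
The retrieve-then-rerank flow in this paragraph can be sketched as follows: take the top K hits from each index, put their scores on one scale, merge, and return the top M. The common scale used here (min-max normalization within each index) is purely an assumption for illustration, since the text does not specify how the shared reranking scale is computed; the function names and the shape of `index_hits` are likewise hypothetical.

```python
def top_k_per_index(index_hits, k):
    """index_hits: {index_name: [(memory_id, raw_score), ...]}, higher = closer."""
    return {
        name: sorted(hits, key=lambda h: h[1], reverse=True)[:k]
        for name, hits in index_hits.items()
    }


def to_common_scale(hits):
    """Min-max normalize one index's raw scores into [0, 1]."""
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(memory_id, (s - lo) / span if span else 1.0) for memory_id, s in hits]


def retrieve(index_hits, k, m):
    merged = {}
    for hits in top_k_per_index(index_hits, k).values():
        if not hits:
            continue
        for memory_id, score in to_common_scale(hits):
            # A memory found through several indexes keeps its best score.
            merged[memory_id] = max(score, merged.get(memory_id, 0.0))
    # Sort by score descending, breaking ties by identifier for determinism.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return [memory_id for memory_id, _ in ranked[:m]]
```

Normalizing per index matters because raw scores from different indexers (for example, a caption embedding index and an image embedding index) are generally not directly comparable.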

[0053] In some instances, a machine-learned agent can include a machine-learned agent configured to act as a digital assistant (e.g., mobile digital assistant, situated agent, etc.). In some implementations, a machine-learned agent can be implemented as a “situated agent”. The term situated agent refers to a setting in which the agent shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and/or textual data which are also observable by the human user. The agent can process these inputs to generate responses that are contextually relevant for the user’s physical or digital environment, for example enabling the agent to generate dialogue or other responses or outputs which assist the user in understanding and/or navigating the environment.

[0054] In some instances, a machine-learned agent can include an agent configured to perform actions using tools, such as tools accessible through an application programming interface (API). For example, in some instances, a machine-learned agent can receive multimodal inputs comprising a query (e.g., user query); a computing system comprising the machine-learned agent can retrieve, based on the query, one or more memories to assist the agent in responding to the query; the computing system can provide the one or more memories to the agent (e.g., along with the multimodal inputs, etc.); and the machine-learned agent can generate, based at least in part on the one or more memories, an inference output indicative of an action to be performed by a tool (e.g., API call, action name, action identifier, etc.). Based on the inference output, the computing system can perform the action or cause the action to be performed (e.g., using an API, etc.). In some instances, an action to be performed can be selected according to a chain-of-thought reasoning process, such as a thought-observation-action reasoning process.
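
The last step of this flow, in which the computing system performs the action named in the inference output, can be sketched as a small dispatcher. The JSON shape of the inference output (`action` and `arguments` keys), the `tool` registration decorator, and the `set_reminder` tool are all hypothetical; the text says only that the output is indicative of an action such as an API call, action name, or action identifier.

```python
import json

TOOLS = {}


def tool(fn):
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def set_reminder(text, minutes):
    # Hypothetical tool; a real system would call an external API here.
    return f"reminder in {minutes} min: {text}"


def perform(inference_output):
    """Parse the agent's output and call the named tool if it is registered."""
    request = json.loads(inference_output)
    name = request.get("action")
    if name not in TOOLS:
        raise ValueError(f"unknown action: {name!r}")
    return TOOLS[name](**request.get("arguments", {}))
```

Restricting dispatch to an explicit registry means the agent's output can only trigger actions the computing system has chosen to expose.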

[0055] In some instances, a computing system for implementing example aspects of the present disclosure can include a client / server computing system. For example, in some instances, a client device can capture (e.g., using camera(s), microphone(s), or other sensor(s)) multimodal inputs (e.g., from a user, etc.), and can provide the multimodal inputs to one or more machine-learned agents executing on the client device or a server device. In some instances, the computing system can retrieve memories from a memory data structure located on the client device or a server device, and the machine-learned agent can generate an inference output based on the memories. In some instances, a client device can include an augmented reality headset, smart glasses, or other client device (e.g., other wearable device, smartphone, tablet, laptop, desktop, etc.).

[0056] Systems and methods according to some aspects of the present disclosure can provide a variety of technical effects and benefits. For example, in some instances, systems and methods according to some aspects of the present disclosure can generate higher quality (e.g., more accurate, more useful, more relevant, etc.) inference outputs compared to some alternative implementations. As another example, in some instances, systems and methods according to some aspects of the present disclosure can generate inference outputs at a reduced computational cost compared to some alternative implementations.

[0057] In some instances, systems and methods according to some aspects of the present disclosure can generate higher quality (e.g., more accurate, more useful, more relevant, etc.) inference outputs compared to some alternative implementations. For example, in some instances, the quality (e.g., accuracy, usefulness, relevance, etc.) of a machine-learned inference output may depend on a quality of the inputs provided to the machine-learned model. In some instances, the most relevant input for a first query (e.g., “I forgot the entry code for this building. Do you remember what I typed in last time?”, etc.) may be identifiable using a first index data structure (e.g., image embedding data structure, etc.), while the most relevant input for a second query (e.g., “What kind of flowers are Jennifer's favorite?”, etc.) may be identifiable using a second index data structure (e.g., extracted fact data structure, etc.) different from the first index data structure. By indexing memories according to a plurality of indexing techniques, systems and methods according to some aspects of the present disclosure can in some instances retrieve more relevant memories compared to some alternative implementations, thereby improving the functioning of a computing system comprising the machine-learned agent. As another example, in some instances, the quality of an inference output may depend on the diversity of memories available, as duplicate or near-duplicate memories may be less useful to a machine-learned agent compared to a new memory comprising new or different information. Advantageously, systems and methods according to some aspects of the present disclosure can deduplicate memories or can select memories based at least in part on an amount of additional information provided by the selected memory, thereby improving the functioning of a computing system comprising the machine-learned agent.

[0058] Additionally, in some instances, systems and methods according to some aspects of the present disclosure can generate similar-quality inference outputs at a reduced computational cost compared to some alternative implementations. For example, in some instances, the quality of an inference output can scale with a computational complexity (e.g., parameter count, context window size, computational cost of inference, etc.) of a machine-learned model. In some instances, methods that can enable higher-quality outputs using a similar-complexity (e.g., same parameter count, same context window size, same computational cost, etc.) machine-learned model can be adapted to generate similar-quality outputs using a reduced-computational-cost (e.g., reduced parameter count, reduced context window size, reduced electricity cost of inference, reduced memory footprint, reduced processor usage, etc.) machine-learned model, thereby improving the functioning of a computing system comprising the machine-learned model by enabling similar (e.g., same) functionality at a reduced computational cost compared to some alternative implementations.

[0059] Various example implementations are described herein with respect to the accompanying Figures.

[0060] Figure 1 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent 112 according to example implementations of some aspects of the present disclosure. A computing system 102 can receive one or more inputs 104a. Based on the inputs 104a, the computing system 102 can send one or more requests 106 to a memory storage system 108, and can receive one or more memories 110 from the memory storage system 108 in response to the request. The computing system 102 can provide one or more unmodified or modified inputs 104b and the one or more memories 110 to a machine-learned agent 112, which can generate an output 111 based on the one or more inputs 104b and one or more memories 110.

[0061] A computing system 102 can be or include one or more software, firmware, or hardware components configured to retrieve memories 110 based on inputs 104a. In some instances, the computing system 102 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., computing device 50, third-party system 80, computing device 98, computing device 99, etc.).

[0062] Input(s) 104a can generally include or otherwise represent various types of data. An input 104a can include one type or many different types of data. Example data types for an input 104a can include, for example, video data, natural language data (e.g., speech data, text data, etc.), sensor data (e.g., global positioning system (GPS) sensor, accelerometer, gyroscopic sensor or tilt sensor, heart rate, blood oxygenation, heart rate variability, skin temperature sensor, etc.), image data, text data, audio data, or other appropriate data types. In some instances, inputs 104a can include multimodal data comprising a plurality of data types (e.g., synchronized video and audio data, etc.). In some instances, inputs 104b can include raw data received from one or more input / output devices (e.g., camera, microphone, keyboard, etc.), such as raw input received from a user via one or more input / output devices. In some instances, inputs 104a can include a user input directed to a machine-learned agent 112. Inputs 104b can generally include or otherwise represent various types of data.

[0063] An input 104b can include one type or many different types of data. Example data types for an input 104b can include, for example, video data, image data, natural language data (e.g., text data, etc.), or other data types. In some instances, inputs 104b can include multimodal data directed to a multimodal machine-learned agent 112, such as multimodal data comprising text and image data; text and video data; text and audio data; audio and video data; or other multimodal data. In some instances, inputs 104b can include processed data extracted from raw inputs 104a, such as data extracted by a machine-learned model (e.g., second machine-learned model different from the machine-learned agent, etc.), data extracted according to a heuristic or algorithm, or other data extraction. Further details of some example methods for extracting data from inputs 104a are provided below with respect to Figure 2 and a memory generation / extraction system 220.

[0064] In some instances, inputs 104 can include data to be displayed to a user, such as visual frames (e.g., rendered frame of a video game, video frame of a video file to be displayed to a user, composited frame showing one or more windows associated with application(s) executing on an operating system of a computing system 102, screen capture frame, etc.) that have been or will be displayed on a monitor or other display device of a computing system 102; audio data that has been or will be output to speakers or headphones of the computing system 102; or other content (e.g., perceptual data, natural language content, etc.) to be provided to a user. As a non-limiting illustrative example, a machine-learned agent 112 can include a desktop assistant, mobile assistant, or gaming assistant configured to assist a user in navigating a digital task or environment (e.g., gaming task or environment; productivity task such as word processing or the like; desktop or mobile software environment; navigation task, communication task, or other mobile digital assistant task; etc.), and the inputs 104 can include data indicative of perceptual outputs (e.g., visual outputs, audio outputs, etc.) to be provided to the user, or other data indicative of the digital task or environment.

[0065] A request 106 can include a command (e.g., database query, hypertext transfer protocol (HTTP) request, application programming interface (API) request, etc.), string (e.g., SQL string), vector, or other data object (e.g., data entry identifier, key value, or indexing value of a data entry to be received, etc.) or message that is configured to filter, sort, or otherwise engage with a data structure of a memory storage system 108 to facilitate the retrieval of relevant memory data.

[0066] A memory storage system 108 can be or include one or more software, firmware, or hardware components configured to store memories 110 and output retrieved memories 110 responsive to one or more requests 106. In some instances, the memory storage system 108 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., computing device 50, third-party system 80, computing device 98, computing device 99, etc.).

[0067] Memory storage system 108 can include one or more data structures configured for the structured storage of data. Memory storage system 108 can be or include a relational or non-relational database, a data table, a document, a file system, or other structured data representation. Examples of devices or systems that can be used to implement memory storage system 108 include traditional relational databases, NoSQL databases, in-memory data stores, and distributed file systems. Additionally, cloud storage solutions provided by network-hosted platforms can also be used. For example, in some instances, a memory storage system 108 can include a distributed storage system (e.g., distributed SQL system such as Spanner, etc.). Memory storage system 108 can be implemented locally to a computing system 102 (e.g., on a same device or system) or remotely from a computing system 102 (e.g., on a different device or system).

[0068] Memory storage system 108 can store data of various different modalities. Memory storage system 108 can store text data. Memory storage system 108 can store image data. Memory storage system 108 can store video data. Memory storage system 108 can store audio data. Memory storage system 108 can store data entries comprising multiple types of data per data entry (e.g., image plus caption, video segment plus transcript, etc.). Memory storage system 108 may store arbitrary data types. Various example data modalities that can be stored in memory storage system 108 are described herein with respect to Figures 9-17 and input(s) 2.

[0069] Memory storage system 108 can include one or more indexing data structures. Further details of some example indexing data structures are provided below with respect to Figures 2 and 3.

[0070] Memory storage system 108 can be multimodal. For example, one or more queries can include data of a first modality and one or more queries can include data of a second modality. Memory storage system 108 can store memory information in both the first and the second modality. For example, memory storage system 108 can store text data and image data, audio data and image data, text data and audio data, or other combinations of data modalities.

[0071] A memory 110 can generally include or otherwise represent various types of data. A memory 110 can include one type or many different types of data. A memory 110 can include one or more data types that are similar to (e.g., same as) or different from one or more data types of an input 104a or input 104b. Example data types for a memory 110 can include, for example, video data, image data, natural language data (e.g., text data, etc.), sensor data, or other data types. In some instances, memories 110 can include multimodal data associated with a user interaction with a multimodal machine-learned agent 112, such as multimodal data comprising text and image data; text and video data; text and audio data; audio and video data; or other multimodal data. In some instances, memories 110 can include processed data extracted from raw inputs 104a, such as data extracted by a machine-learned model (e.g., second machine-learned model different from the machine-learned agent, etc.), data extracted according to a heuristic or algorithm, or other data extraction. Further details of some example methods for extracting data from memories 110 are provided below with respect to Figure 2 and a memory generation / extraction system 220.

[0072] In some instances, a memory 110 can include or be based on (e.g., generated or extracted based on, etc.) data collected in a prior interaction between a user and a machine-learned agent 112, such as a user associated with the input(s) 104a. However, this is not required. For example, in some instances, a memory 110 can include data generated or collected in other ways, such as general knowledge data (e.g., Wiki data, web search data, etc.), data collected from other interactions of a user (e.g., configuration settings set by the user; interactions of the user with a website, application, or other system; data provided by the user to the memory storage system 108; etc.) that is the same as or different from a user associated with the input(s) 104a; or other relevant data without deviating from the scope of the present disclosure. As a non-limiting illustrative example, a computing system 102 comprising a machine-learned gaming assistant can obtain (e.g., receive, generate, etc.) inputs 104 associated with a game executing on the computing system 102 or another computing device (e.g., rendered video frames associated with the game, audio outputs of the game, user inputs to the game, etc.); retrieve, based on the inputs 104, one or more memories 110 comprising one or more of: general knowledge data associated with the game (e.g., Wiki data provided by a plurality of game users; manual, walkthrough, or game guide data provided by a game maker or guidance publisher; etc.), recorded gameplay data provided by a plurality of users different from a user currently interacting with a machine-learned agent 112 (e.g., users who have uploaded streamed or prerecorded video to a system for sharing content with a machine-learned agent 112 or other users, etc.), recorded gameplay data provided by the user currently interacting with the machine-learned agent 112 (e.g., including recordings of gameplay that did not include interactions with a machine-learned agent 112, etc.), or other data associated with the game. Other examples are possible.

[0073] An output 111 can generally include or otherwise represent various types of data. An output 111 can include one type or many different types of data. An output 111 can include one or more data types that are similar to (e.g., same as) or different from one or more data types of an input 104. Example data types for an output 111 can include, for example, natural language outputs (e.g., text outputs, audio natural language outputs such as text-to-speech outputs), generative outputs (e.g., video generation outputs, image generation outputs, audio generation outputs, etc.), action selection outputs (e.g., API calls, HTTP requests, name or numerical identifier of a selected action, natural language description of a selected action, etc.), computer code outputs (e.g., API calls, programming language code such as Python or C, assembly language code, machine language code, etc.), or other data types. In some instances, an output 111 can include an output to be provided to a user, an output 111 to be provided to a computing device (e.g., client device) associated with the user to cause the computing device to perform an action (e.g., display action, audio output action, etc.), an output 111 (e.g., action selection output) to be processed by a computing system 102 (e.g., server device of a computing system 102, etc.) to perform further actions or generate further outputs, or the like.

[0074] A machine-learned agent 112 can include one or more machine-learned models. The machine-learned agent 112 can include various model architectures, such as various neural network model architectures. An example model architecture for a machine-learned agent 112 can include a sequence processing model architecture (e.g., a transformer model). For example, the machine-learned agent 112 can be configured to receive an input sequence and generate an output sequence. For instance, the machine-learned agent 112 can be configured to generate an output sequence where elements of the output sequence are predicted based on the elements of the input sequence. In some instances, a machine-learned agent 112 can include a model architecture having an attention mechanism (e.g., self-attention). In some instances, the machine-learned agent 112 can be a pre-trained model (e.g., pretrained using large-scale unsupervised learning). In some instances, the machine-learned agent 112 can be fine-tuned over one or more fine-tuning datasets, such as a fine-tuning dataset associated with one or more specialized generation tasks. In some instances, a machine-learned agent 112 can include a situated agent that shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and / or textual inputs 104a which are also observable by the human user, such as inputs 104a received from a camera, microphone, or other sensor that observes a physical environment of the user; inputs 104 comprising data indicative of a digital environment of the user, such as output data to be provided to a user (e.g., screen capture data indicative of a visual frame to be displayed to a user via a monitor, augmented reality display, or other display device; audio data indicative of audio to be output to speakers, headphones, or another audio output device; etc.). In some instances, the machine-learned agent 112 can include an agent configured to select actions to be performed by a computing system 102 or other tools. Further details of an example machine-learned agent 112 for generating action selection outputs are provided below with respect to Figure 5.

[0075] Figure 2 is a block diagram of an example system for storing multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure. An input buffer 214 of a computing system 102 can receive and temporarily store one or more inputs 104a. An input segmentation system 216 of the computing system 102 can split the input(s) 104a into one or more segments 218, which can be provided to a memory storage system 108 for storage in a memory table 219 or provided to a memory generation / extraction system 220 for extraction of one or more memories 210 based on the segment(s) 218. The one or more memories 210 or segments 218 can be provided to one or more indexers 222 to generate one or more indexing values 224. One or more index tables 226 of the memory storage system 108 can then be updated with one or more data entries mapping the indexing values 224 to one or more memory table 219 entries comprising one or more corresponding memories 210 or segments 218 used to generate the indexing values 224.
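
As a non-limiting illustration, the relationship between a memory table 219, indexers 222, indexing values 224, and index tables 226 could be sketched as follows. The sketch assumes text-only memories, exact-match indexing values, and in-memory dictionaries in place of database tables; the class `MemoryStore` and its methods are hypothetical, and an implementation using embedding indexers would replace the exact-match lookup with a nearest-neighbor search.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List


@dataclass
class MemoryStore:
    # Memory table: entry identifier -> stored memory or segment.
    memory_table: Dict[int, str] = field(default_factory=dict)
    # Index tables: index name -> indexing value -> entry identifiers.
    index_tables: Dict[str, Dict[Hashable, List[int]]] = field(default_factory=dict)

    def add(self, memory: str, indexers: Dict[str, Callable[[str], Hashable]]) -> int:
        """Store a memory and update every index table with its indexing value."""
        entry_id = len(self.memory_table)
        self.memory_table[entry_id] = memory
        for name, indexer in indexers.items():
            value = indexer(memory)
            self.index_tables.setdefault(name, {}).setdefault(value, []).append(entry_id)
        return entry_id

    def lookup(self, index_name: str, value: Hashable) -> List[str]:
        """Return the memories mapped to an indexing value in one index table."""
        ids = self.index_tables.get(index_name, {}).get(value, [])
        return [self.memory_table[i] for i in ids]
```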

[0076] In some instances, a memory 210 can be, comprise, be comprised by, or otherwise share one or more properties with a memory 110. For example, in some instances, a memory 210 can have any property described herein with respect to a memory 110, and vice versa. In some instances, a memory 210 can include relevant data generated or extracted from inputs 104 or segments 218. For example, in some instances, a memory 210 can include caption data, such as a natural language (e.g., text, etc.) caption describing one or more video segments 218, images, audio segments 218, multimodal segments 218, or other data.

[0077] As another example, in some instances, a memory 210 can include transcript data generated or extracted based on one or more inputs 104 or segments 218, such as natural language (e.g., text, etc.) transcript of audio data (e.g., speech data or other natural language input component, etc.), a natural language (e.g., text, etc.) transcript of one or more events (e.g., sign language conversations, nonverbal interactions or other events, etc.) depicted in image data or video data, or other transcript data. As another example, in some instances, a memory 210 can include summary data (e.g., natural language summary data, machine-generated summary data) summarizing other data (e.g., memory 210 data, segment 218 data, etc.), such as natural language (e.g., text, etc.) summary data summarizing a natural language transcript; natural language (e.g., text, etc.) summary data summarizing one or more natural language captions (e.g., captions describing events depicted in images or video segments 218, etc.), natural language (e.g., text, etc.) summary data summarizing a plurality of data items such as a combination of a transcript and one or more captions, hierarchical natural language (e.g., text, etc.) summary data comprising a summary of other summaries, or the like. As another example, in some instances, a memory 210 can include factual data (e.g., knowledge graph data, structured factual data in a human-readable structured data format such as XML or JSON, etc.), such as factual data learned from one or more prior interactions of the machine-learned agent 112 with a user (e.g., a user that is the same as or different from a user providing input(s) 104a, etc.) or other factual data (e.g., wiki data, knowledge graph data, computer knowledge base data, etc.). In some instances, factual data can include data that is known, certain, or nearly certain (e.g., ground truth data provided by a human annotator or trusted data source, etc.) or uncertain data (e.g., machine-learned data inferred from inputs 104a using a machine-learned fact extraction system, etc.), such as uncertain data associated with a numerical confidence level (e.g., machine-learned estimated probability that a fact is correct, etc.).

[0078] In some instances, a memory 210 can include metadata associated with the memory 210. Metadata can indicate information about a corresponding memory object. For example, memory 210 metadata can indicate the context in which a memory 210 was collected. Example metadata fields can include a time (e.g., date) at which the memory was collected, a location of where the memory was collected, corresponding raw input data that was used to generate the information in the memory object, or other properties of the memory 210.

[0079] Example metadata can include various data elements. Time data can be captured in a time stamp. Time data can include a creation date, an update date, etc. Source data can identify an origin of the memory value or object (e.g., an explicit user command, implicit system inference). Summary data can contain a brief description of the memory value or the interaction that led to the memory object’s creation. Confidence data can include a numerical value representing the system’s confidence in the accuracy of the memory value. Expiration data can include a sunset date for “forgetting” the value. Session data can include a session identifier linking the memory object to a specific user interaction session. User data can include an identifier of an account or profile to which the memory object is linked. Tag data can include keywords or labels for efficient searching and retrieval. In some instances, metadata for a memory 210 generated or extracted based on one or more inputs 104 or segments 218 can include data indicative of the input(s) 104 or segment(s) 218 used to generate or extract the memory 210. As a non-limiting illustrative example, metadata for a memory 210 comprising an extracted fact may comprise data indicative of one or more inputs 104 or segments 218 used to determine the fact; data indicative of one or more inputs 104, segments 218, or other data sources (e.g., websites, computer knowledge base data records, etc.) that may provide supporting evidence that may influence a confidence level associated with the fact, such as data entry identifier(s) identifying data records comprising the raw input(s) 104 or segment(s) 218, or the like.
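
As a non-limiting illustration, the example metadata elements listed above could be grouped into a single record as follows. The field names, types, and the `is_expired` helper are hypothetical choices made for this sketch; the disclosure does not prescribe a schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class MemoryMetadata:
    created_at: datetime                  # time data: creation time stamp
    source: str                           # e.g., "explicit_user_command"
    confidence: float                     # confidence in the memory value
    updated_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None  # sunset date for "forgetting"
    session_id: Optional[str] = None
    user_id: Optional[str] = None
    summary: str = ""
    tags: List[str] = field(default_factory=list)
    # Identifiers of data records holding the raw inputs or segments
    # from which the memory was generated or extracted.
    source_entry_ids: List[int] = field(default_factory=list)

    def is_expired(self, now: datetime) -> bool:
        """A memory with no expiration date never expires."""
        return self.expires_at is not None and now >= self.expires_at
```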

[0080] An input buffer 214 can include, for example, one or more non-transitory computer-readable media configured to store (e.g., temporarily store, etc.) inputs 104a. For example, an input buffer 214 can include a data structure (e.g., cache, buffer, file, database, table, etc.) to store (e.g., temporarily store) inputs 104a for processing by a computing system 102, machine-learned agent 112, or other system. For example, in some instances, an input buffer 214 can include a data structure having a finite size (e.g., predetermined fixed size, etc.), and old inputs 104a can be periodically removed (e.g., deleted, overwritten, etc.) from the input buffer 214 as new inputs 104a are received. For example, in some instances, an input buffer 214 can include a buffer having a fixed number of video frames or other fixed data size (e.g., fixed data size in bytes, etc.), and one or more older (e.g., oldest, etc.) inputs 104a in the buffer can be deleted as the buffer is filled.
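
As a non-limiting illustration, a fixed-size input buffer that discards the oldest inputs as new inputs arrive could be sketched with a bounded double-ended queue from the Python standard library. The class name `InputBuffer` and the choice of a frame count as the size limit are assumptions of this sketch.

```python
from collections import deque
from typing import Any, List


class InputBuffer:
    """Holds at most max_frames inputs; the oldest input is dropped first."""

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen discards items from the opposite end when full.
        self._frames: deque = deque(maxlen=max_frames)

    def push(self, frame: Any) -> None:
        self._frames.append(frame)

    def snapshot(self) -> List[Any]:
        """Return the buffered inputs, oldest first."""
        return list(self._frames)
```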

[0081] An input segmentation system 216 can be or include one or more software, firmware, or hardware components configured to segment (e.g., divide, separate, etc.) inputs 104a (e.g., streamed inputs 104a, etc.) into segments 218. In some instances, segmenting the inputs 104a can include splitting the inputs 104a into fixed-size (e.g., fixed time period, fixed compressed or uncompressed data size in bytes, etc.) segments 218. In some instances, segmenting the inputs 104a can include adaptively splitting the inputs 104a into variable-size segments, such as adaptively splitting based on logical or semantic relationships. For example, in some instances, segmentation can include heuristic or algorithmic separation based on logical boundaries in the data, such as sentence boundaries (e.g., based on a regular expression indicative of a period followed by a space, etc.) or other grammatical heuristics, image or video segment boundaries (e.g., frame boundaries such as intra-coded frame boundaries, boundaries based on metrics of change between video frames, such as absolute pixel difference, dynamic flow, bitrate of a variable-bitrate compression algorithm, etc.), or other logical boundaries. In some instances, segmentation can include machine-learned segmentation, such as machine-learned segmentation based on semantic boundaries (e.g., based on metrics of similarity or difference between machine-learned semantic embeddings, such as cosine distance or Euclidean distance, etc.), grammatical boundaries, visual boundaries or other visual groupings (e.g., as detected by a machine-learned model comprising one or more convolutional layers, object identification boundaries, etc.), or other machine-learned segmentation boundaries. In some instances, segmentation can include aligning related segments 218, such as aligning video inputs 104a with audio inputs 104a or other data based on a time the inputs 104a were collected.
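
As a non-limiting illustration, two of the heuristic segmentation approaches described above could be sketched as follows: splitting text at a period followed by whitespace, and starting a new video segment when the mean absolute pixel difference between consecutive frames exceeds a threshold. The sketch assumes frames are non-empty, equal-length sequences of pixel values; the function names and the threshold parameter are hypothetical.

```python
import re
from typing import List, Sequence


def split_sentences(text: str) -> List[str]:
    """Split at whitespace that follows a period (sentence-boundary heuristic)."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def split_on_change(frames: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group frame indices into segments, cutting where frames differ sharply."""
    segments: List[List[int]] = []
    for i, frame in enumerate(frames):
        new_segment = i == 0
        if i > 0:
            previous = frames[i - 1]
            # Mean absolute pixel difference between consecutive frames.
            diff = sum(abs(a - b) for a, b in zip(frame, previous)) / len(frame)
            new_segment = diff > threshold
        if new_segment:
            segments.append([])
        segments[-1].append(i)
    return segments
```

A machine-learned variant could keep the same loop and replace the pixel difference with a cosine or Euclidean distance between semantic embeddings of consecutive inputs.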

[0082] A segment 218 can be, comprise, be comprised by, or otherwise share one or more properties with an input 104. For example, in some instances, a segment 218 can have any property described herein with respect to inputs 104a, 104b, or vice versa. In some instances, a segment 218 can include all or part of a raw input 104 (e.g., received from user) or data generated based on one or more raw inputs 104. In some instances, a segment 218 can include compressed data generated by compressing all or part of a raw input 104, such as a compressed video segment (e.g., compressed video segment having a compressed data size small enough to fit into a single “cell” of a database table, such as less than or equal to 10 megabytes, etc.).

[0083] A memory table 219 can include, for example, any data structure (e.g., database table or plurality of tables, file or files, spreadsheet, collection, etc.) for storing a plurality of memories 210 or segments 218. In some instances, a memory table 219 can include one or more database tables, such as a table of a distributed database (e.g., relational database, structured query language (SQL) database, Spanner database, etc.). Each data entry (e.g., database row, cell, etc.) of a memory table 219 can include or be associated with one memory 210 or segment 218, or a plurality of memories 210 or segments 218. As a non-limiting illustrative example, a memory table 219 could include a plurality of rows, with each row comprising a first segment 218 comprising video data collected during a first time period, a second segment 218 comprising audio data associated with the first time period, a first memory 210 comprising a caption generated based on the first segment 218, a second memory 210 comprising transcript data generated based on the second segment 218, and a third memory 210 comprising summary data generated based on one or more of the first and second memories 210 and first and second segments 218. As another example, in some instances, a memory storage system 108 can include one or more separate memory tables 219 that may store memories 210 comprising summary data generated based on a plurality of corresponding segments 218 or memories 210 over an extended period of time. For example, in some instances, summarization memories 210 can be stored as part of a hierarchical summarization scheme. An example hierarchical summarization scheme can include storing first summaries of segments 218, captions, or transcripts; second summaries, which can each summarize a plurality of related (e.g., chronologically adjacent, logically related, etc.) first summaries; third summaries, which can each summarize a plurality of related second summaries (e.g., logically related, such as summaries of every math class of a user’s current semester, etc.); and so on.
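
As a non-limiting illustration, a hierarchical summarization scheme of the kind described above could be sketched as follows. The sketch assumes that related items are simply chronologically adjacent and are grouped in fixed-size runs, and that `summarize` is a caller-supplied stand-in for a machine-learned summarizer; both the function name `build_hierarchy` and the grouping rule are hypothetical.

```python
from typing import Callable, List


def build_hierarchy(
    items: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int,
) -> List[List[str]]:
    """Return levels of summaries: level 0 holds the items, each later
    level summarizes adjacent groups of the level before it."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(items)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        levels.append(
            [summarize(current[i:i + group_size]) for i in range(0, len(current), group_size)]
        )
    return levels
```

Each level could be stored in its own memory table 219, with index entries created for the summaries at every level.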

[0084] A memory generation / extraction system 220 can be or include one or more software, firmware, or hardware components configured to generate or extract memories 210 from raw input(s) 104 or segment(s) 218. In some instances, the memory generation / extraction system 220 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., computing device 50, third-party system 80, computing device 98, computing device 99, etc.).

[0085] In some instances, a memory generation / extraction system 220 can include one or more machine-learned models (e.g., second machine-learned models that are the same as or different from the machine-learned agent 112, etc.). In such instances, machine-learned model(s) of a memory generation / extraction system 220 can include various model architectures, such as various neural network model architectures. An example model architecture for a machine-learned model(s) of a memory generation / extraction system 220 can include a sequence processing model architecture (e.g., a transformer model). For example, machine-learned model(s) of a memory generation / extraction system 220 can be configured to receive an input sequence and generate an output sequence. For instance, machine-learned model(s) of a memory generation / extraction system 220 can be configured to generate an output sequence where elements of the output sequence are predicted based on the elements of the input sequence. In some instances, machine-learned model(s) of a memory generation / extraction system 220 can include a model architecture having an attention mechanism (e.g., self-attention). In some instances, machine-learned model(s) of a memory generation / extraction system 220 can be a pre-trained model (e.g., pretrained using large-scale unsupervised learning). In some instances, machine-learned model(s) of a memory generation / extraction system 220 can be fine-tuned over one or more fine-tuning datasets, such as a fine-tuning dataset associated with one or more specialized generation tasks.

[0086] For example, in some instances, a memory generation / extraction system 220 can include a machine-learned model configured to generate a caption (e.g., natural language caption such as text caption, etc.) based on one or more raw inputs 104 or segments 218 (e.g., video segments 218, images, video frames, etc.). In some instances, a machine-learned model configured to generate a caption (e.g., video captioning model, image captioning model, etc.) can include a multimodal model configured to receive at least a first data type (e.g., image data, video data, etc.) as input, and generate at least a second data type (e.g., natural language data such as text data, etc.) as output. In some instances, a machine-learned model configured to generate a caption can include a machine-learned model that was fine-tuned on a caption generation task, or a general-purpose machine-learned model (e.g., multimodal language model, vision language model, etc.) that has been pretrained on data that is not specific to caption generation. For example, in some instances, a computing system 102 can prompt a general-purpose machine-learned model (e.g., vision language model, etc.) with in-context learning content (e.g., instruction content, few-shot prompt content, chain-of-thought prompt content, etc.) to cause the machine-learned model to generate a caption based on an input segment 218 (e.g., video segment, etc.).

[0087] As another example, in some instances, a memory generation / extraction system 220 can include a machine-learned model configured to generate a transcript (e.g., natural language transcript such as text transcript, etc.) based on one or more raw inputs 104 or segments 218 (e.g., audio segments 218, video segments, etc.). In some instances, a machine-learned model configured to generate a transcript can include a multimodal model configured to receive at least a first data type (e.g., audio data, etc.) as input, and generate at least a second data type (e.g., natural language data such as text data, etc.) as output. In some instances, a machine-learned model configured to generate a transcript can include a speech-to-text model or a machine-learned model that was fine-tuned on a transcript generation task, or a general-purpose machine-learned model (e.g., multimodal language model, etc.) that has been pretrained on data that is not specific to transcript generation. For example, in some instances, a computing system 102 can prompt a general-purpose machine-learned model (e.g., multimodal language model, etc.) with in-context learning content (e.g., instruction content, few-shot prompt content, chain-of-thought prompt content, etc.) to cause the machine-learned model to generate a transcript based on an input segment 218 (e.g., audio segment, etc.).

[0088] As another example, in some instances, a memory generation / extraction system 220 can include a machine-learned model configured to generate a summary (e.g., natural language summary such as text summary, etc.) based on one or more raw inputs 104 or segments 218 (e.g., audio segments 218, video segments, etc.), or based on processed data (e.g., memories 210, transcripts, captions, etc.). In some instances, a machine-learned model configured to generate a summary can include a language model configured to receive natural language inputs (e.g., text inputs, etc.) and generate natural language outputs (e.g., text outputs, etc.), or a multimodal model configured to receive at least a first data type (e.g., audio data, video data, etc.) as input, and generate at least a second data type (e.g., natural language data such as text data, etc.) as output. In some instances, a machine-learned model configured to generate a summary can include a machine-learned model that was fine-tuned on a summary generation task, or a general-purpose machine-learned model (e.g., language model, etc.) that has been pretrained on data that is not specific to summary generation. For example, in some instances, a computing system 102 can prompt a general-purpose machine-learned model (e.g., language model, etc.) with in-context learning content (e.g., instruction content, few-shot prompt content, chain-of-thought prompt content, etc.) to cause the machine-learned model to generate a summary based on an input segment 218 (e.g., audio segment, etc.).

[0089] As another example, in some instances, a memory generation / extraction system 220 can include a machine-learned model configured to extract factual data from raw inputs 104 or segments 218 or from processed data (e.g., memories 210, transcripts, captions, etc.). In some instances, a machine-learned model configured to extract factual data can include a language model configured to receive natural language inputs (e.g., text inputs, etc.) and generate structured factual outputs (e.g., JavaScript Object Notation (JSON)-structured outputs, Extensible Markup Language (XML)-structured formats, SQL-structured factual data, graph-structured knowledge graph data, object or struct of a programming language, etc.), or a multimodal model configured to receive at least a first data type (e.g., audio data, video data, etc.) as input, and generate at least a second data type (e.g., structured factual data such as JSON-structured factual data, etc.) as output. In some instances, structured factual data can include multimodal data, such as a structured data item correlating image data (e.g., screenshot or image capture of an entity of interest, such as a user’s pet, etc.) with textual data (e.g., “user’s dog Max,” etc.) or other factual data or metadata associated with the image data. In some instances, a machine-learned model configured to extract factual data can include a fact extraction model or a machine-learned model that was fine-tuned on a fact extraction task, or a general-purpose machine-learned model (e.g., language model, etc.) that has been pretrained on data that is not specific to fact extraction. For example, in some instances, a computing system 102 can prompt a general-purpose machine-learned model (e.g., language model, etc.) with in-context learning content (e.g., instruction content, few-shot prompt content, chain-of-thought prompt content, etc.) to cause the machine-learned model to extract factual data based on an input segment 218 (e.g., video segment, etc.) or memory 210 (e.g., transcript, caption, etc.).
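As a non-limiting illustrative sketch, a JSON-structured factual data item correlating image data with textual data might resemble the following. Every field name and value here (fact_type, image_ref, attributes, the file name) is a hypothetical assumption chosen for illustration and is not a schema defined by the present disclosure.

```python
import json

# Hypothetical structured fact; all field names and values are illustrative assumptions.
extracted_fact = {
    "fact_type": "pet",
    "text": "user's dog Max",
    "image_ref": "frame_0042.png",  # hypothetical reference to an image capture
    "attributes": {"name": "Max", "species": "dog"},
}

# Serialize to JSON for storage as a memory, then parse it back.
serialized = json.dumps(extracted_fact, sort_keys=True)
restored = json.loads(serialized)
```

A structure of this kind keeps the image reference and the textual data in one item, so each part could later be indexed separately.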

[0090] Indexer(s) 222 can be or include one or more software, firmware, or hardware components configured to generate indexing value(s) 224 based on one or more memories 210 or segments 218. In some instances, an indexer 222 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., computing device 50, third-party system 80, computing device 98, computing device 99, etc.). In some instances, an indexer 222 can include a system (e.g., heuristic system, algorithmic system, etc.) configured to extract raw indexing values 224, such as timestamps, text-based indexing values (e.g., names, titles, keywords, etc.), numerical indexing values (e.g., numerical identifiers such as segment identifier, frame identifier, etc.), or other raw indexing values 224. In some instances, an indexer 222 can include one or more machine-learned models configured to generate machine-learned indexing values 224. For example, in some instances, a machine-learned indexing value 224 can include a machine-learned embedding value, such as an embedding vector associated with a machine-learned embedding space. In some instances, a machine-learned model configured to generate embedding-based indexing values 224 can include a machine-learned embedding model (e.g., Word2Vec, Doc2Vec, Sentence-BERT, Universal Sentence Encoder, InferSent, Google multimodal embedding model, Contrastive Language-Image Pretraining (CLIP) model, etc.) configured to output an embedding vector as a final output, or another model (e.g., MagicLens, sentence text-to-text transfer transformer (sT5), multimodal machine-learned model such as Gemini, PaliGemma, PaliGemma2, or other multimodal language model, etc.) configured to generate another data type (e.g., natural language, text, video, audio, etc.) as a final output. For example, in some instances, a model configured to generate a non-vector output can include one or more input layers, a plurality of intermediate (e.g., hidden, etc.) layers, and one or more output layers, and an embedding can be generated by passing one or more inputs (e.g., inputs 104, segments 218, memories 210, etc.) through one or more of the input layer(s) and intermediate layer(s), wherein an output (e.g., embedding output, vector or tensor output, etc.) of a non-output layer of the machine-learned model can be used as an indexing value 224.
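As a non-limiting illustrative sketch, the use of a non-output layer's activation as an indexing value can be pictured with a toy two-layer network. The weights, layer sizes, and function names below are hypothetical assumptions; a real model would use learned parameters and many more layers.

```python
import math

def dense(x, weights, biases):
    """One fully connected layer with a tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical toy parameters (3 inputs -> 2 hidden units -> 1 output).
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]

def forward(x):
    """Return both the hidden activation and the final output."""
    hidden = dense(x, HIDDEN_W, HIDDEN_B)        # non-output (intermediate) layer
    output = dense(hidden, OUTPUT_W, OUTPUT_B)   # final output, not used for indexing
    return hidden, output

def indexing_value(x):
    """Use the intermediate-layer activation as the embedding-style indexing value."""
    hidden, _ = forward(x)
    return hidden
```

Here the final output is discarded and only the hidden activation is kept, which mirrors the idea of reading an embedding from a layer other than the output layer.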

[0091] In some instances, indexer(s) 222 can include an image indexer 222a configured to generate an indexing value 224 (e.g., embedding vector, etc.) based on one or more input images (e.g., video frames, etc.), such as an image embedding model, multimodal embedding model configured to process images, or other machine-learned image processing model (e.g., Gemini, PaliGemma, PaliGemma2, MagicLens, etc.).

[0092] In some instances, indexer(s) 222 can include a caption indexer 222b configured to generate an indexing value 224 (e.g., embedding vector, etc.) based on one or more input captions, such as a natural language embedding model (e.g., text embedding model, sentence embedding model, etc.) or other machine-learned language model (e.g., multimodal language model, etc.).

[0093] In some instances, indexer(s) 222 can include a transcript indexer 222c configured to generate an indexing value 224 (e.g., embedding vector, etc.) based on one or more input transcripts, such as a natural language embedding model (e.g., text embedding model, sentence embedding model, etc.) or other machine-learned language model (e.g., multimodal language model, etc.). In some instances, a transcript indexer 222c can include a machine-learned model that is the same as or different from a machine-learned model of the caption indexer 222b. In some instances, a transcript indexer 222c can generate embedding values based solely on a transcript, or based on a combination of the transcript and additional data (e.g., in-context learning content, etc.). In some instances, in-context learning content used to generate an indexing value 224 based on a transcript can be the same as or different from in-context learning content used to generate an indexing value 224 based on a caption.

[0094] In some instances, indexer(s) 222 can include an extracted fact indexer 222d configured to generate one or more indexing values 224 (e.g., embedding vector, etc.) based on factual input data. In some instances, an extracted fact indexer 222d can include one or more (e.g., a plurality of, etc.) machine-learned models, such as a natural language embedding model (e.g., text embedding model, sentence embedding model, etc.) or other machine-learned language model (e.g., multimodal language model, etc.), an image embedding model, a multimodal embedding model, or other machine-learned models. For example, in some instances, a memory 210 comprising factual content can include a memory 210 correlating a first data value with a corresponding second data value. As a non-limiting illustrative example, a memory 210 could include a fact correlating one or more images of a user’s pet with natural language content associated with the user’s pet (e.g., name, dietary restrictions, age, veterinary history, etc.). In some instances, an extracted fact indexer 222d can generate a first indexing value 224 (e.g., embedding vector; raw data value such as name, keyword, numerical identifier, etc.) based on the first data value, and a second indexing value 224 based on the second data value. Continuing the non-limiting illustrative example, the extracted fact indexer 222d can use a machine-learned image embedding model or other image processing model to generate an image embedding vector based on the image of the user’s pet, and can use a machine-learned natural language embedding model (e.g., text embedding model, etc.) to generate an embedding vector based on the natural language content. Other examples are possible.

[0095] In some instances, indexer(s) 222 can include a summary indexer 222e configured to generate an indexing value 224 (e.g., embedding vector, etc.) based on one or more input summaries, such as a natural language embedding model (e.g., text embedding model, sentence embedding model, etc.) or other machine-learned language model (e.g., multimodal language model, etc.), which can be the same as or different from a machine-learned model associated with a caption indexer 222b, transcript indexer 222c, or extracted fact indexer 222d.

[0096] In some instances, indexer(s) 222 can include one or more other indexers 222f, such as other machine-learned indexer(s) 222 or non-machine-learned indexers configured to extract other indexing values 224 (e.g., timestamps, keywords, etc.) from inputs 104, memories 210, or segments 218. For example, in some instances, indexer(s) 222 can include one or more multimodal indexers (e.g., multimodal vision-language models such as Gemini, PaliGemma, PaliGemma2, etc.) configured to generate a single embedding vector based on multimodal data comprising a plurality of related segments 218 or memories 210 (e.g., multimodal data comprising one or more of a video segment 218, an audio segment 218, a caption, a transcript, a summary, or other segments 218 or memories 210).

[0097] An index table 226 can include, for example, a data structure (e.g., database, table, spreadsheet, file, folder, column(s), etc.) comprising a plurality of data entries (e.g., rows, cells, objects, files, lines or other file segments, etc.). In some instances, a data entry of the index table 226 can map one or more indexing values 224 to one or more corresponding memories 210 stored in the memory table 219. For example, in some instances, a data entry of an index table 226 can map an indexing value 224 to an identifier (e.g., numerical identifier, database key, etc.) or other data indicative of a corresponding memory 210 or segment 218 that was used to generate the indexing value 224. In some instances, a single memory 210 or segment 218 can be associated with one indexing value 224 or a plurality of indexing values 224, and a single indexing value 224 can be associated with one memory 210 or segment 218 or a plurality of memories 210 or segments 218. As a non-limiting illustrative example, a memory table 219 can include a data row comprising a transcript memory 210 transcribing a speech input received from a user during a particular time period; a video segment 218 or image (e.g., video frame) segment 218 collected during the time period; a caption memory 210 generated based on the video or image segment 218; a summary memory 210 generated based on one or more of the caption, the transcript, and the video or image; or other data. In such instances, an image index 226a can store a data entry correlating an image-based indexing value 224 with the row; a caption index 226b can store a data entry correlating a caption-based indexing value 224 with the row; a transcript index 226c can store a data entry correlating a transcript-based indexing value 224 with the row; and a summary index 226e can store a data entry correlating a summary-based indexing value 224 with the row. As a second non-limiting illustrative example, the memory table(s) 219 can in some instances include a plurality of separate tables, such as a separate table for extracted fact memories 210, a separate table for summaries that may summarize data from a plurality of rows of another memory table 219, or the like. Continuing the second non-limiting illustrative example, a data entry of the extracted fact index 226d may correlate an indexing value 224 generated by an extracted fact indexer 222d with a corresponding factual memory 210 stored in a separate memory table 219 (e.g., with or without reference to any memories 210 or segments 218 from which the factual content was extracted, etc.). Other examples are possible.
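As a non-limiting illustrative sketch, the relationship between a memory table and several index tables can be modeled with plain dictionaries, where each index entry pairs an indexing value with the identifier of the row it was generated from. The names memory_table, index_tables, and add_row, the sample text, and the two-element vectors are hypothetical assumptions standing in for real storage and real embeddings.

```python
from collections import defaultdict

memory_table = {}                  # entry_id -> row of memories / segments
index_tables = defaultdict(list)   # index name -> list of (indexing_value, entry_id)

def add_row(entry_id, row, indexing_values):
    """Store a row once, and register one index entry per modality that points back to it."""
    memory_table[entry_id] = row
    for index_name, value in indexing_values.items():
        index_tables[index_name].append((value, entry_id))

# One row indexed three ways; the vectors are made-up stand-ins for embeddings.
add_row(
    1,
    {
        "transcript": "Where did I leave my keys?",
        "caption": "Keys on a kitchen counter",
        "summary": "User asked about keys; keys were seen on the counter.",
    },
    {"transcript": [0.1, 0.9], "caption": [0.8, 0.2], "summary": [0.5, 0.5]},
)
```

Because every index entry carries only an identifier, the row itself is stored once regardless of how many index tables refer to it.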

[0098] Figure 3 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure. A query detection system 328 of a computing system 102 can receive one or more inputs 104a. Responsive to determining that an input 104a comprises a query (e.g., user query for which a memory 210 may be helpful), the query detection system 328 can provide the input 104a to a query rewriting system 330, which can generate one or more queries 332 in one or more formats that are suitable for input to one or more indexers 222. Based on the one or more queries 332, the indexers 222 can generate one or more indexing values 224. A memory storage system 108 can retrieve, from one or more index tables 226 based on the one or more indexing values 224, one or more data entry identifiers 333 corresponding to the indexing values 224. A deduplication system 334 can remove duplicate data entry identifiers 333, and a plurality of candidate memories 336 can be retrieved from one or more memory tables 219 based on the remaining data entry identifiers 333. A memory reranking system 338 can rank the candidate memories 336, and one or more candidate memories 336 can be selected based on the reranking. The selected memories 310 can be provided to a machine-learned agent 112, which can generate one or more outputs 111 based at least in part on the selected memories 310.

[0099] In some instances, a selected memory 310 can be, comprise, be comprised by, or otherwise share one or more properties with a memory 210. For example, in some instances, a selected memory 310 can have any property described herein with respect to a memory 210, and vice versa. In some instances, a selected memory 310 can include a memory that was stored in the past, such as a memory obtained during a previous session (e.g., session that occurred days, weeks, or months ago, etc.) between a user and a machine-learned agent 112, or a memory 310 comprising general (e.g., non-user-specific) knowledge that may have been obtained in the past (e.g., from a ground truth data source, etc.). In some instances, a selected memory 310 can include a memory 210 that was selected to be provided to a machine-learned agent 112 (e.g., in response to a user query directed to the machine-learned agent 112, etc.).

[0100] A query detection system 328 can be or include one or more software, firmware, or hardware components configured to detect one or more queries (e.g., user queries directed to a machine-learned agent 112, etc.). In some instances, the query detection system 328 can include a lightweight machine-learned model (e.g., machine-learned model having a lower latency, lower computational cost, or lower number of parameters compared to the machine-learned agent 112, such as less than 0.1, 0.01, or 0.001 times as many parameters as the machine-learned agent 112, etc.) configured to detect a user query, such as a lightweight model configured to detect whether a user (e.g., user of interest having a known voice profile, etc.) has spoken a designated phrase (e.g., “Hey Google,” etc.) to indicate that the user plans to provide a query to the machine-learned agent 112. In some instances, the query detection system 328 can include a non-machine-learned query detection system 328, such as a system that uses a lightweight heuristic to identify a possible query, such as a heuristic that compares data associated with an audio input (e.g., audio input loudness, change in audio input loudness, etc.) to one or more heuristic rules (e.g., loudness threshold, change threshold, etc.). In some instances, a query detection system 328 can include a keyword-based query detection heuristic, such as a heuristic that correlates detection of specific keywords (e.g., “what,” “where,” “when,” “my,” etc.) with the existence of a user query or the existence of a user query for which memory 110 data may be available. In some instances, query detection 328 can include detecting whether information needed to respond to a user query is already available to the machine-learned agent 112 (e.g., as part of a context window of the machine-learned agent 112, such as a context window comprising past input 104 data, output 111 data, or other context data, etc.). In some instances, the query detection system 328 can be separate from or combined with the query rewriting system 330. For example, in some instances, query detection can be performed separately (e.g., sequentially, etc.) from query rewriting; or query detection and rewriting can be performed in combination or simultaneously; or one or more of query detection and query rewriting can be omitted. For example, in some instances, a computing system 102 can provide (e.g., periodically provide, etc.) raw inputs 104a or processed inputs 104b to one or more indexers 222 without deviating from the scope of the present disclosure.
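As a non-limiting illustrative sketch, the two non-machine-learned heuristics described above (keyword detection and loudness rules) might be expressed as follows. The keyword set comes from the examples in the text; the decibel thresholds and function names are hypothetical assumptions, not values taken from the present disclosure.

```python
import re

QUERY_KEYWORDS = {"what", "where", "when", "my"}  # example keywords from the text
LOUDNESS_THRESHOLD_DB = -30.0                     # hypothetical loudness threshold
CHANGE_THRESHOLD_DB = 10.0                        # hypothetical change threshold

def keyword_heuristic(transcript):
    """Flag a possible query if any designated keyword appears as a whole word."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return bool(words & QUERY_KEYWORDS)

def loudness_heuristic(loudness_db, previous_loudness_db):
    """Flag a possible query if the audio is loud enough or rose sharply."""
    is_loud = loudness_db >= LOUDNESS_THRESHOLD_DB
    rose_sharply = (loudness_db - previous_loudness_db) >= CHANGE_THRESHOLD_DB
    return is_loud or rose_sharply
```

Checks of this kind are cheap enough to run on every input, so a more expensive retrieval path is only triggered when a query is plausible.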

[0101] A query rewriting system 330 can be or include one or more software, firmware, or hardware components configured to receive inputs 104 (e.g., inputs 104 associated with a detected user query, etc.) and generate a revised query 332 based on the inputs 104. Generating a query 332 can include, for example, receiving data (e.g., input(s) 104a, segment(s) 218, etc.) indicative of a user query, and outputting a query 332 configured to be processed by an indexer 222. In some instances, generating a query can include converting data indicative of a user query into a format used by an indexer 222, such as converting audio input(s) 104a (e.g., speech inputs, etc.) comprising a query to a text-based input for processing by an indexer 222 (e.g., language model, etc.) configured to generate indexing values 224 based on text inputs; converting video data into image data (e.g., by extracting one or more video frames, etc.); or other data type conversion. In some instances, converting speech inputs 104a to text inputs can include generating a transcript of the speech inputs 104a (e.g., using a machine-learned automatic speech recognition model, etc.).

[0102] In some instances, rewriting a query can include adding context to the query based on past interactions of the machine-learned agent 112. As a non-limiting illustrative example, rewriting a query can include identifying inputs 104a that implicitly refer to prior data (e.g., pronouns such as “that,” “it,” “there,” etc.); identifying one or more referents that are referred to by the query; and rewriting the query to expressly include the prior data. As a non-limiting illustrative example, if a user asks, “When was the last time I went there?”, rewriting the query can include identifying, in a prior interactional turn (e.g., most recent output 111 of the machine-learned agent 112, etc.), data indicating where the user is referring to (e.g., an output 111 mentioning a particular place, etc.). Rewriting the query to include prior data can include, for example, concatenating the prior data with the query (e.g., “My memory data indicates that your favorite travel destination is Rocky Mountain National Park. When was the last time I went there?”, etc.); replacing the reference with the data (e.g., replacing “there” with “Central Park” or “to Rome,” etc.); or otherwise including the prior data in a query 332. In some instances, query rewriting 330 can include machine-learned rewriting or non-machine-learned rewriting. For example, in some instances, a lightweight (e.g., having lower latency, fewer parameters, or otherwise reduced computational cost compared to a machine-learned agent 112, etc.) machine-learned model can be provided with inputs 104a comprising a user query and prior interaction data (e.g., N prior conversational turns, prior inputs 104, prior outputs 111, etc.), and the machine-learned model can generate a query 332. As another example, query rewriting 330 can in some instances include rewriting based on one or more heuristics (e.g., simple, low-latency heuristics to reduce overall response times, etc.), such as a heuristic mapping one or more references (e.g., pronouns such as “that,” “there,” “then,” etc.) to a simple low-latency strategy for including additional data (e.g., append N most recent conversational turns, where N can be a positive integer, etc.) in the query 332.
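As a non-limiting illustrative sketch, the heuristic of adding the N most recent conversational turns when a query contains a reference word might look like the following. The function name, the reference-word set, and the choice to place the prior turns before the query (matching the concatenation example above) are assumptions made for illustration.

```python
REFERENCE_WORDS = {"that", "it", "there", "then"}  # example reference words

def rewrite_query(query, prior_turns, n=1):
    """Prepend the n most recent turns if the query contains a reference word."""
    words = {w.strip(".,?!\"'").lower() for w in query.split()}
    if n < 1 or not (words & REFERENCE_WORDS):
        return query  # nothing to resolve, or no context requested
    return " ".join(prior_turns[-n:] + [query])

turns = [
    "Hello.",
    "Your favorite travel destination is Rocky Mountain National Park.",
]
rewritten = rewrite_query("When was the last time I went there?", turns, n=1)
```

This approach does not resolve the referent itself; it only supplies nearby context so that a downstream indexer or model has the place name available.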

[0103] In some instances, a query rewriting system 330 can include one or more components that are the same as or different from components of a memory generation / extraction system 220. As a non-limiting illustrative example, generating a query 332 for processing by an image indexer 222 (e.g., in response to a user query saying “I know I’ve seen that guy before, but I can’t remember his name. Help me out here,” etc.) can include extracting (e.g., selecting, segmenting, etc.) one or more image frames (e.g., video frames, etc.) from video inputs 104a, in a manner similar to (e.g., same as) a manner for generating memories 210 comprising image data. As another example, generating a query 332 can include generating a text transcript of an audio query in a manner similar to (e.g., same as) a manner for generating a transcript for storage as a memory 210.

[0104] A query 332 can generally include or otherwise represent various types of data. A query 332 can include one type or many different types of data. Example data types for a query 332 can include, for example, natural language data (e.g., audio input(s) 104a, textual input(s) 104a, textual transcript of audio input(s) 104a, etc.), keyword data, timestamp data, image data (e.g., frame(s) of a video input 104a, etc.), or other data types.

[0105] In some instances, the query 332 can be provided to one or more indexers 222 to generate one or more indexing values 224, and one or more data entry identifiers 333 or candidate memories 336 can be retrieved based on the indexing values 224 and index tables 226. In some instances, retrieving based on an indexing value 224 can include retrieving based on an exact match (e.g., keyword = “food”, etc.), based on a range (e.g., timestamp between January 1 and December 31, 2024, etc.), or based on a metric of similarity between one or more indexing values 224 generated based on a query 332 and one or more indexing values 224 stored in one or more index tables 226. For example, in some instances, an index table 226 can include a vector-based data recall structure. For instance, an index table 226 can store machine-learned embedded representations of memory 210 values to facilitate similarity searches based on an embedding of a query 332 value. In some instances, a similarity search can include retrieving based on a metric of distance between one or more indexing values 224 generated based on a query 332 and one or more indexing values 224 stored in one or more index tables 226. A similarity metric can include, for example, a metric of vector distance between two embedding vectors, such as cosine distance, Euclidean distance, or other distance metric.
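As a non-limiting illustrative sketch, the three retrieval modes described above (exact match, range, and similarity) can be written against a simple list of (indexing value, entry identifier) pairs. The function names, the sample indexes, and the max_distance cutoff are hypothetical assumptions; a production system would more likely use a dedicated vector index than a linear scan.

```python
import math
from datetime import date

def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def euclidean_distance(a, b):
    return math.dist(a, b)

def exact_match(index, value):
    return [entry_id for v, entry_id in index if v == value]

def range_match(index, low, high):
    return [entry_id for v, entry_id in index if low <= v <= high]

def similarity_match(index, query_vector, max_distance, metric=cosine_distance):
    return [entry_id for v, entry_id in index if metric(v, query_vector) <= max_distance]

# Hypothetical sample indexes.
keyword_index = [("food", 1), ("travel", 2), ("food", 3)]
timestamp_index = [(date(2023, 12, 31), 1), (date(2024, 6, 15), 2), (date(2025, 1, 1), 3)]
embedding_index = [([1.0, 0.0], 1), ([0.0, 1.0], 2)]
```

For instance, range_match(timestamp_index, date(2024, 1, 1), date(2024, 12, 31)) selects only the entry whose timestamp falls within 2024.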

[0106] In some instances, a computing system 102 can provide one or more queries 332 to a plurality of indexers 222; generate a plurality of indexing values based on the one or more queries 332; and retrieve a plurality of sets of data entry identifiers 333 from a plurality of index tables 226. As a non-limiting example, a computing system 102 can provide an image-based query 332 to one or more image indexers 222; provide a text-based query 332 (e.g., transcript of an audio input 104a comprising a user query, etc.) to one or more of a caption indexer 222b, transcript indexer 222c, extracted fact indexer 222d, summary indexer 222e, or other indexer 222 configured to process textual inputs; and provide a multimodal query 332 (e.g., query 332 comprising a video segment 218 and a transcript generated from an audio input 104a comprising a user query, etc.) to a multimodal indexer 222 (e.g., indexer 222 comprising a vision-language model such as Gemini, PaliGemma, PaliGemma2, etc.). Continuing the example, the memory storage system 108 can retrieve, based on the text-based indexing value 224, a first plurality of data entry identifiers 333 based on a caption index 226b; a second plurality of data entry identifiers 333 based on a transcript index 226c; a third plurality of data entry identifiers 333 based on an extracted fact index 226d; and a fourth plurality of data entry identifiers 333 based on a summary index 226e. In this manner, for instance, a plurality of sets of data entry identifiers 333 can be retrieved based on one indexing value 224. Other implementations are possible.

[0107] In some instances, a computing system 102 or memory storage system 108 can select, for each respective index table 226 of a plurality of index tables 226, a respective set of data entry identifiers 333 based on the respective index table 226. For example, in some instances, a memory storage system 108 can retrieve, for each respective index table 226 of a plurality of index tables 226, a respective set of K nearest neighbors, wherein the K nearest neighbors correspond to K data entries of the index table 226 having an indexing value 224 that is most similar to (e.g., according to a metric of vector distance, etc.) an indexing value 224 generated based on a query, where K is a positive integer. In some instances, a value of K can be the same for each respective index table 226, or can be a different value for different index tables 226.
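As a non-limiting illustrative sketch, per-index K-nearest-neighbor retrieval with a different K for each index table might be written as follows. Cosine distance is used as one possible metric; the table contents, the k_per_table values, and the function names are hypothetical assumptions, and the brute-force scan stands in for whatever nearest-neighbor structure an implementation actually uses.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def k_nearest(index_table, query_value, k):
    """Return the entry ids of the k entries closest to query_value, nearest first."""
    scored = ((cosine_distance(value, query_value), entry_id)
              for value, entry_id in index_table)
    return [entry_id for _, entry_id in heapq.nsmallest(k, scored)]

def retrieve_per_table(index_tables, query_values, k_per_table):
    """Run one nearest-neighbor lookup per index table, each with its own K."""
    return {
        name: k_nearest(table, query_values[name], k_per_table[name])
        for name, table in index_tables.items()
        if name in query_values
    }

# Hypothetical index tables: (embedding vector, data entry identifier).
index_tables = {
    "caption": [([1.0, 0.0], 1), ([0.0, 1.0], 2), ([0.9, 0.1], 3)],
    "transcript": [([0.0, 1.0], 1), ([1.0, 0.0], 4)],
}
# The same text-derived query vector is reused for both text indexes.
query_values = {"caption": [1.0, 0.0], "transcript": [1.0, 0.0]}
results = retrieve_per_table(index_tables, query_values, {"caption": 2, "transcript": 1})
```

The result is one set of identifiers per index table, which can then be merged and deduplicated.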

[0108] A data entry identifier 333 can include, for example, a value (e.g., numerical identifier, database key, etc.) that uniquely identifies a candidate memory 336 stored in a memory table 219. For example, in some instances, a data entry identifier 333 can include a database key that can be used to retrieve a memory 210 corresponding to the data entry identifier 333 from a memory table 219.

[0109] A deduplication system 334 can be or include one or more software, firmware, or hardware components configured to receive a plurality of memories 310 and return a subset (e.g., diverse subset, etc.) of the memories 310 according to a deduplication process. In some instances, deduplication can include removing exact duplicate data entry identifiers 333, such as a data entry identifier 333 that was included in a plurality of K-nearest-neighbor sets of data entry identifiers 333 associated with a plurality of index tables 226. In some instances, deduplication can also include removing data entry identifiers 333 associated with near-duplicate or highly similar data, such as data entry identifiers 333 associated with pairs of data entry identifiers 333 having indexing values 224 that are nearly identical to each other. However, this is not required. For example, in some instances, removal of duplicate or near-duplicate memories can be performed as part of a memory reranking 338 process (e.g., instead of or in addition to deduplication 334, etc.), or can be omitted altogether (e.g., providing duplicate copies of a selected memory 310 to a machine-learned agent 112 whose output quality is not significantly harmed by duplication of inputs, etc.).
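As a non-limiting illustrative sketch, exact deduplication across several K-nearest-neighbor sets, followed by optional near-duplicate removal, might be written as follows. The Euclidean distance threshold of 0.05, the sample vectors, and the function names are hypothetical assumptions.

```python
import math

def deduplicate(id_sets):
    """Merge several lists of entry ids, keeping the first occurrence of each id."""
    seen = set()
    unique = []
    for ids in id_sets:
        for entry_id in ids:
            if entry_id not in seen:
                seen.add(entry_id)
                unique.append(entry_id)
    return unique

def drop_near_duplicates(entry_ids, values_by_id, threshold=0.05):
    """Keep an id only if its indexing value is farther than threshold from every kept id."""
    kept = []
    for entry_id in entry_ids:
        if all(math.dist(values_by_id[entry_id], values_by_id[k]) > threshold
               for k in kept):
            kept.append(entry_id)
    return kept

# Entry 1 appears in two neighbor sets; entries 1 and 3 have nearly identical values.
merged = deduplicate([[1, 3], [4, 1], [3, 7]])
values_by_id = {1: [1.0, 0.0], 3: [1.0, 0.01], 4: [0.0, 1.0], 7: [0.5, 0.5]}
diverse = drop_near_duplicates(merged, values_by_id)
```

The first function handles identifiers returned by more than one index table; the second is the optional step that favors a more diverse subset.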

[0110] In some instances, a candidate memory 336 can be, comprise, be comprised by, or otherwise share one or more properties with a memory 210, 310. For example, in some instances, a candidate memory 336 (e.g., candidate data entry of a memory table 219, etc.) can have any property described herein with respect to a memory 210, 310, and vice versa. In some instances, candidate memories 336 can be retrieved from a memory table 219 based on a set (e.g., deduplicated set, etc.) of data entry identifiers 333 retrieved from the index tables 226. For example, in some instances, data entry identifiers 333 can include database key values, and candidate memories 336 can be retrieved based on the key values using a retrieval command (e.g., SQL command such as “SELECT * WHERE entry_id = <data entry identifier 333 value>”, etc.).
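As a non-limiting illustrative sketch, key-based retrieval of candidate memories can be shown with an in-memory SQLite database. The table name memory_table, its columns, and the sample rows are hypothetical assumptions; the query uses parameter placeholders rather than string interpolation of identifier values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory_table ("
    "entry_id INTEGER PRIMARY KEY, caption TEXT, transcript TEXT)"
)
conn.executemany(
    "INSERT INTO memory_table VALUES (?, ?, ?)",
    [
        (1, "Keys on a kitchen counter", "Where did I leave my keys?"),
        (2, "Dog playing in a park", "Max loves this park."),
        (3, "Receipt on a desk", "I paid for lunch today."),
    ],
)

def fetch_candidates(connection, entry_ids):
    """Fetch the rows whose primary key is in entry_ids, ordered by key."""
    if not entry_ids:
        return []
    placeholders = ", ".join("?" for _ in entry_ids)
    sql = (
        "SELECT entry_id, caption, transcript FROM memory_table "
        f"WHERE entry_id IN ({placeholders}) ORDER BY entry_id"
    )
    return connection.execute(sql, list(entry_ids)).fetchall()

candidates = fetch_candidates(conn, [3, 1])
```

Only the placeholder list is built dynamically, so the identifier values themselves are always passed as bound parameters.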

[0111] A memory reranking system 338 can be or include one or more software, firmware, or hardware components configured to perform a plurality of evaluations (e.g., generate a plurality of evaluation scores, rankings, etc.) on a plurality of candidate memories 336 for use in selecting selected memories 310. In some instances, a memory reranking system 338 can generate a score for each candidate memory 336. In some instances, a memory reranking system 338 can select a plurality of selected memories 310 based on the generated scores. For example, in some instances, a memory reranking system 338 can select the candidate memories 336 with the top N scores, where N can be a positive integer (e.g., constant integer, adaptive or variable integer, etc.). In some instances, a number of selected memories 310 can be equal to a maximum number of selected memories 310 that can fit into a context window of the machine-learned agent 112. Other implementations are possible.
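
As a non-limiting illustrative sketch, top-N selection with an optional context-window budget could be written as follows. The whitespace-based token count is a stand-in assumption for whatever tokenizer a given agent uses, and all names are hypothetical.

```python
def select_memories(scored, n=None, token_budget=None, count_tokens=None):
    """Pick the highest-scoring memories, limited by count and/or token budget.

    scored: list of (memory_text, score) pairs.
    n: optional maximum number of memories to return.
    token_budget: optional maximum total token count (assumed context limit).
    count_tokens: optional tokenizer stand-in; defaults to whitespace splitting.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    selected = []
    used = 0
    for memory, _score in ranked:
        if n is not None and len(selected) >= n:
            break
        cost = count_tokens(memory)
        # Stop at the first memory that would overflow the budget, so the
        # result is always a prefix of the ranking.
        if token_budget is not None and used + cost > token_budget:
            break
        selected.append(memory)
        used += cost
    return selected
```

For example, `select_memories(scored, n=2)` returns the two best-scoring memories, while `select_memories(scored, token_budget=3)` returns as many top-ranked memories as fit within three tokens.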

[0112] In some instances, a score generated by the memory reranking system 338 can be based at least in part on a metric of relevance of a candidate memory 336 being scored. For example, in some instances, a metric of relevance can include a metric based on a similarity between one or more indexing values 224 of a candidate memory 336 and one or more indexing values 224 generated based on a query 332. As a non-limiting illustrative example, a candidate memory 336 can include a data row comprising a plurality of data values (e.g., caption, transcript, video segment 218, audio segment 218, summary, etc.), with each data value being associated with a corresponding indexing value 224. In some instances, each indexing value 224 of the memory can be compared (e.g., according to a similarity metric such as cosine distance, etc.) to a corresponding indexing value 224 generated from a user query, and a score can be generated based at least in part on a plurality of comparison values (e.g., distance metric values, etc.). For example, in some instances, a combined relevance metric can be generated based on a combination of a plurality of distance metric values (e.g., additive combination, multiplicative combination, mean, median, maximum, minimum, or other value generated from a plurality of values, etc.). In some instances, a metric of relevance can include other data, such as a metric of recency (e.g., timestamp data, etc.), a metric of up-to-dateness (e.g., based on a comparison of timestamp data to an update schedule, based on data indicative of later memories 210 that may have superseded a candidate memory 336, etc.), a metric of informativeness (e.g., heuristic metric indicating that a memory 210 comprising an extracted fact may be more informative or relevant than a raw segment 218 with a similar indexing value 224, etc.), or other relevance metric. In some instances, a memory ranking score can be based at least in part on user preference data, such as data (e.g., user feedback data, etc.) indicative of a user preference for outputs 111 generated based on different types of memories 110.
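
As a non-limiting illustrative sketch, a combined relevance metric over several per-field cosine distances could be computed as follows. The field names, the dictionary layout, and the choice to report `1 - combined distance` as the score are assumptions for illustration only.

```python
import math
from statistics import mean, median


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# A few of the combinations named in the text; others could be added.
COMBINERS = {"mean": mean, "median": median, "max": max, "min": min}


def relevance_score(memory_values, query_values, combine="mean"):
    """Score a candidate memory against a query; higher means more relevant.

    memory_values / query_values: dicts mapping a field name (e.g. "caption")
    to that field's indexing value. Only fields present in both are compared.
    """
    distances = [
        cosine_distance(vec, query_values[name])
        for name, vec in memory_values.items()
        if name in query_values
    ]
    if not distances:
        return 0.0
    return 1.0 - COMBINERS[combine](distances)
```

Choosing `"min"` rewards a memory for matching the query well on any single field, while `"max"` requires every compared field to match; recency or informativeness terms could be added to the returned score in the same way.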

[0113] In some instances, a ranking score can be based at least in part on a metric of diversity between candidate memories 336. For example, in some instances, selected memories 310 can be selected according to a greedy strategy, wherein a top-scoring (e.g., most relevant, etc.) candidate memory 336 can be added to a set of selected memories 310; a metric of similarity between the top-scoring memory and the remaining candidate memories 336 can be computed; and a ranking score can be updated based on the similarity metric values. For example, an updated ranking score may increase the ranking score of candidate memories 336 that are highly different from the top-scoring memory (e.g., according to a metric of similarity between indexing values 224, according to a similarity of data types, based on whether the candidate memory 336 was retrieved based on the same index table 226 as the top-scoring memory, etc.) or otherwise add new information relative to the top-scoring memory, and may decrease the ranking score of candidate memories 336 that are highly similar to the top-scoring memory or otherwise add little new information. As another example, an indexing value 224 space (e.g., embedding space, etc.) can be divided into regions (e.g., quadrants, octants, regions determined based on dimensions associated with a principal component analysis, etc.), and a top-scoring memory from each region can be selected. As another example, indexing values 224 can be clustered (e.g., according to a k-means clustering or g-means clustering algorithm, etc.), and a top-scoring candidate memory 336 from each cluster can be selected. Other implementations are possible.
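
As a non-limiting illustrative sketch, the greedy strategy described above could be implemented with a simple similarity penalty, as shown below. The penalty form and the `diversity_weight` parameter are assumptions; the approach is comparable in spirit to maximal marginal relevance, which the disclosure does not name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_diverse_select(candidates, k, diversity_weight=0.5):
    """Greedily select up to k candidate ids, penalizing redundancy.

    candidates: list of (candidate_id, relevance_score, indexing_value).
    diversity_weight: hypothetical knob; 0 reduces to plain top-k by relevance.
    """
    remaining = {cid: vec for cid, _rel, vec in candidates}
    scores = {cid: rel for cid, rel, _vec in candidates}
    selected = []
    while remaining and len(selected) < k:
        # Take the current top-scoring candidate.
        best = max(remaining, key=lambda cid: scores[cid])
        best_vec = remaining.pop(best)
        selected.append(best)
        # Lower the score of every remaining candidate in proportion to its
        # similarity to the memory just selected.
        for cid, vec in remaining.items():
            scores[cid] -= diversity_weight * cosine_similarity(vec, best_vec)
    return selected
```

Because only similar candidates are penalized, dissimilar candidates rise in the ranking relative to them, which corresponds to the score update described in the paragraph above.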

[0114] Figure 4A is a block diagram of a first view of an example system for storing and retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure. A computing system 102 can include a client / server system comprising one or more client devices 440 and one or more server devices 441. A client device 440 comprising one or more input / output devices 442 can receive interactions 404a from a user 444, and can generate input 104a data based on the interactions 404a. One or more server devices 441 comprising one or more of a machine-learned agent system 412, a memory management module 446, and a memory storage system 108 can receive inputs 104a and generate outputs 411 based on the inputs 104a, which can be provided to a user 444 by the client device 440.

[0115] An interaction 404a can include, for example, a user interaction with one or more input / output devices 442, such as user speech captured by a microphone 442b, video or image data captured by a camera 442a (e.g., camera having a field of view controlled by a user 444, etc.), sensor data detected by a sensor 442c, a button press or touchscreen input, a keyboard input, or other interaction.

[0116] In some instances, an output 411 can be, comprise, be comprised by, or otherwise share one or more properties with an output 111. For example, in some instances, an output 411 can have any property described herein with respect to an output 111, and vice versa.

[0117] In some instances, a machine-learned agent system 412 can be, comprise, be comprised by, or otherwise share one or more properties with a machine-learned agent 112. For example, in some instances, a machine-learned agent system 412 can have any property described herein with respect to a machine-learned agent system 112, and vice versa. Further details of some example implementations of a machine-learned agent system 412 are provided below with respect to Figure 4B.

[0118] A client device 440 can be or include one or more software, firmware, or hardware components configured to provide input(s) 104a to one or more server devices 441 or machine-learned agent systems 412 based on one or more interactions 404a. In some instances, a client device 440 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., computing device 50, third-party system 80, computing device 98, computing device 99, etc.). In some instances, a client device can include a wearable client device such as an augmented reality headset, smart glasses, wearable camera (e.g., helmet camera, chest-mounted clip-on camera, camera-equipped smart watch, etc.), clip-on artificial intelligence device (e.g., AI pin, etc.), or other wearable device. In some instances, a client device 440 can include a smart phone, tablet, laptop, desktop, or other client device 440. In some instances, a client device 440 can include a vehicle-mounted device (e.g., dashboard camera, onboard computing system, etc.), robot-mounted device, smart appliance, or the like.

[0119] Server device(s) 441 can be or include one or more software, firmware, or hardware components configured to operate a machine-learned agent system 412, memory management module 446, memory storage system 108, or other components (e.g., processes, devices, etc.) that may interact with the machine-learned agent system 412. In some instances, the computing system 102 can be, comprise, be comprised by, or share one or more properties with a computing device or system described below with respect to Figures 15-17 (e.g., server computing system 60, third-party system 80, computing device 98, computing device 99, etc.). In some instances, a single server device 441 can interact with one client device 440 or a plurality of client devices 440, and a single client device 440 can interact with one server device 441 or a plurality of server devices 441. For example, in some instances, a plurality of server devices 441 may perform a plurality of operations in parallel to reduce a latency associated with responding to a user query. As another example, in some instances, a client device 440 can connect with a first server device 441 at one or more first times (e.g., at the beginning of a first session of a machine-learned agent 112, etc.) and connect with a second server device 441 (e.g., over a network such as the internet) at one or more second times (e.g., at the beginning of a second session of a machine-learned agent 112, etc.).

[0120] An input / output device 442 can include one or more software, firmware, or hardware components configured to receive inputs or transmit outputs (e.g., to and from a user 444, etc.) or to generate inputs 104a based on interactions 404a. For example, in some instances, an input / output device 442 can include an input device such as a camera, a microphone, a keyboard, a touchscreen, a mouse or trackball, a communication device for communicating via a bus or network (e.g., peripheral component interconnect device, network interface device, etc.), sensor (e.g., health sensor such as heart rate variability, skin temperature, pulse, sleep stage, blood pressure, activity level, oxygen saturation, etc.; industrial sensors such as temperature, pressure, current, voltage, vibration, movement, etc.; environmental sensors such as air pollutant concentration sensors, temperature, humidity, wind speed, direction, precipitation, etc.; biological compound analysis sensors, such as fluorescence, luminescence, absorbance, or cell morphology sensors; or other sensors); an output device such as speakers, display component, or other output device; or other input / output device 442.

[0121] A user 444 can include, for example, one or more people interacting with a machine-learned agent 112 or client device 440. However, this is not required. For example, in some instances, a user 444 can include, or a computing system 102 can interact with, another computing system or device, such as another machine-learned agent in a multi-agent environment; a robot or self-driving vehicle; or another computing device.

[0122] A memory management module 446 can be or include one or more software, firmware, or hardware components configured to provide one or more memory management functions to the machine-learned agent system 412 or memory storage system 108, such as functionality for interfacing between the machine-learned agent system 412 and memory storage system 108. Further details of some example implementations of a memory management module 446 are provided below with respect to Figure 4B. In some instances, a memory management module 446 can execute on a server device 441 that is the same as or different from a server device 441 on which a machine-learned agent system 412 or memory storage system 108 is executing. Additionally, although Figure 4A depicts the machine-learned agent system 412, memory storage system 108, and memory management module 446 all executing on server device(s) 441, all or part of one or more of the machine-learned agent system 412, memory storage system 108, and memory management module 446 can execute on one or more client devices 440 without deviating from the scope of the present disclosure. For example, in some instances, a client device 440 can include one or more machine-learned agents 112 (e.g., lightweight machine-learned agents such as machine-learned agents 112 having a small enough memory footprint to execute in a client device 440), data structures storing memories 210 (e.g., private memories that are not shared with a server device 441, etc.), or other components performing other memory storage or retrieval functions.

[0123] Figure 4B is a block diagram of a second view of an example system for storing and retrieving multimodal memories for a machine-learned agent according to example implementations of some aspects of the present disclosure. A machine-learned agent system 412 can receive inputs 104a, and can extract inputs 104b from the inputs 104a. For example, a visual memory manager 452 can extract one or more visual inputs 104b, while a voice activity detection / audio speech recognition (VAD / ASR) module 454 can extract one or more natural language inputs 104b (e.g., speech-to-text inputs 104b, etc.). The inputs 104b can be provided to a memory generation and storage module 448, which can generate memories 210, segments 218, and indexing values 224 for storage in the memory storage system 108 (e.g., according to one or more methods described above with respect to Figure 2, etc.). Additionally, the inputs 104b can be provided to a dialogue history manager 456 or tool manager 457, which can provide one or more requests 106 to a memory retrieval module 450, which can generate indexing values 224 and retrieve candidate memories 336 or selected memories 310 from the memory storage system 108 based on the indexing values 224 (e.g., according to one or more methods described above with respect to Figure 3, etc.).

[0124] A memory generation / storage module 448 can be or include one or more software, firmware, or hardware components configured to perform various functionality described above with respect to Figure 2, such as memory generation / extraction 220 functionality, input buffer 214 and segmentation 216 functionality, indexer 222 functionality, and the like.

[0125] A memory retrieval module 450 can be or include one or more software, firmware, or hardware components configured to perform various functionality described above with respect to Figure 3, such as query detection or rewriting, indexing value 224 generation or request 106 generation, deduplication, reranking, or other memory retrieval functionality.

[0126] A visual memory manager 452 can be or include one or more software, firmware, or hardware components configured to process (e.g., segment, etc.) visual inputs 104a (e.g., video inputs, still image inputs, etc.) and provide corresponding visual inputs 104b to a memory generation and storage module 448. In some instances, a visual memory manager 452 can select visual inputs 104b for storage from a stream of raw visual inputs 104a according to one or more heuristics, such as according to a fixed or variable frame rate (e.g., one frame per second, variable framerate based on a rate of change between frames of raw video data, etc.), based on the occurrence of speech inputs (e.g., user speech inputs) or other types of user input data (e.g., increasing a number of inputs 104b selected for storage when a user is actively interacting with a machine-learned agent 112, etc.), or other heuristic. In some instances, a visual memory manager 452 can include or not include one or more components described herein with respect to a memory generation / extraction system 220.
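
As a non-limiting illustrative sketch, a frame-selection heuristic combining a fixed base rate with a change-triggered variable rate could look as follows. Representing each frame by a small feature vector, measuring change as a mean absolute difference, and the specific threshold values are all assumptions for illustration.

```python
def select_frames(frames, base_interval=1.0, change_threshold=0.2):
    """Choose which frames of a raw stream to keep for storage.

    frames: list of (timestamp_seconds, feature_vector) in time order.
    A frame is kept if base_interval seconds have passed since the last kept
    frame (fixed rate), or if it differs enough from the last kept frame
    (variable rate driven by change between frames).
    Returns the timestamps of the kept frames.
    """
    kept = []
    last_time = None
    last_features = None
    for timestamp, features in frames:
        if last_time is None:
            keep = True  # always keep the first frame
        else:
            elapsed = timestamp - last_time
            change = sum(
                abs(x - y) for x, y in zip(features, last_features)
            ) / max(len(features), 1)
            keep = elapsed >= base_interval or change > change_threshold
        if keep:
            kept.append(timestamp)
            last_time, last_features = timestamp, features
    return kept
```

A similar rule could shorten `base_interval` while a user is actively speaking, which would correspond to the speech-based heuristic mentioned above.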

[0127] A voice activity detection and audio speech recognition module 454 can be or include one or more software, firmware, or hardware components configured to detect and record voice activity; convert speech into natural language inputs 104b (e.g., transcripts generated using a machine-learned model configured for speech-to-text translation, etc.); or other voice activity detection and automatic speech recognition actions.

[0128] A dialogue history manager 456 can be or include one or more software, firmware, or hardware components configured to determine whether and when to store new memories 210; whether and when to send a request 106 to retrieve memories 210; or other dialogue history management functions. In some instances, a dialogue history manager 456 can execute one or more actions (e.g., storage actions, retrieval actions, decision-making actions, etc.) at every conversational turn of a user interaction, or can otherwise regularly perform actions according to a fixed schedule. For example, in some instances, a dialogue history manager 456 can generate a request 106 at every interaction turn, and a memory retrieval module 450 can determine whether to provide any memories 110 to the machine-learned agent system 412 (e.g., based on determining whether or not any relevant memories exist in the memory storage system 108, etc.). In some instances, a dialogue history manager 456 can identify or otherwise manage dialogue turns in user-agent systems that are not strictly turn-based, such as voice-activated machine-learned agent systems 412. In some instances, a dialogue history manager 456 can perform actions according to a variable or context-dependent schedule. For example, in some instances, a dialogue history manager 456 or other retrieval component can be treated as a tool (e.g., tool controlled by outputs 111 of a machine-learned agent 112, etc.), and can retrieve data only when called (e.g., by a machine-learned agent 112). Other implementations are possible. Further details of an example system for tool use by a machine-learned agent 112 are provided below with respect to Figure 5.

[0129] A tool manager 457 can be or include one or more software, firmware, or hardware components configured to facilitate the use of one or more tools by a machine-learned agent system 412, such as one or more API tools called based on an output 111 of a machine-learned agent 112. For example, in some instances, a tool manager 457 can include one or more software, firmware, or hardware components (e.g., glue code, etc.) for causing a tool to perform an action selected by a machine-learned agent 112. Further details of an example system for tool use by a machine-learned agent 112 are provided below with respect to Figure 5.

[0130] In some instances, one or more modules 108, 412, 446, 448, 450, 452, 454, 456, 457 depicted herein can include one or more standalone modules or standalone computing processes (e.g., Python programs, etc.) that may communicate with other modules via inter-process communication or inter-device communication (e.g., WebRTC, etc.); components of a computing process comprising a plurality of modules 108, 412, 446, 448, 450, 452, 454, 456, 457; or other configuration. For example, although Figures 4A and 4B depict separate modules 108, 412, 446, 448, 450, 452, 454, 456, 457 performing designated functionality, such functionality could be performed by a single component (e.g., single application or computing process, etc.) or by another number of components that is smaller or larger than the number of components depicted in Figures 4A and 4B.

[0131] Figure 5 is a block diagram of an example system for retrieving multimodal memories for a machine-learned agent 112 configured to use tools 558 according to example implementations of some aspects of the present disclosure. A computing system 102 can receive one or more inputs 104a. Based on the inputs 104a, the computing system 102 can send one or more requests 106 to a memory storage system 108, and can receive one or more memories 110 from the memory storage system 108 in response to the request. The computing system 102 can provide one or more unmodified or modified inputs 104b and the one or more memories 110 to a machine-learned agent 112, which can generate an action selection 511 based on the one or more inputs 104b and one or more memories 110. Based on the action selection 511, one or more tools 558 can perform one or more actions, such as actions comprising generation of an output 560a (e.g., output provided to a user 444, etc.) or generation of a response 560b to be provided to the machine-learned agent 112, which can generate one or more additional action selections 511 or outputs 111 based on the response 560b.

[0132] In some instances, an action selection 511 can be, comprise, be comprised by, or otherwise share one or more properties with an output 111. For example, in some instances, an action selection 511 can have any property described herein with respect to an output 111, and vice versa. In some instances, an action selection 511 can include computer code that, when executed, causes one or more tools 558 to perform the selected actions, such as one or more API calls associated with an API of a tool 558. In some instances, an action selection 511 can include other data indicative of a selected action, such as an action name, action identifier (e.g., numerical identifier, token, etc.), action description, action parameters, or the like. For example, in some instances, a computing system 102 or tool 558 can receive an action selection 511 that does not comprise executable computer code, and can execute one or more operations based on the action selection 511. As a non-limiting illustrative example, a computing system 102 or tool 558 can include glue code configured to receive an action selection 511 comprising an action name; and perform, based on the action selection 511, an action identified by the action name.
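
As a non-limiting illustrative sketch, glue code that receives an action selection containing an action name and parameters, and performs the named action, could look as follows. The dictionary layout of the action selection, the registry, and the two toy tools are hypothetical and are not taken from the disclosure.

```python
def set_alarm(time):
    """Toy stand-in for a scheduling tool."""
    return f"alarm set for {time}"


def play_media(title):
    """Toy stand-in for a media player tool."""
    return f"now playing {title}"


# Hypothetical registry mapping action names to callables.
TOOL_REGISTRY = {
    "set_alarm": set_alarm,
    "play_media": play_media,
}


def execute_action(action_selection):
    """Dispatch an action selection that contains no executable code.

    action_selection: dict with an "action_name" key and an optional
    "parameters" dict (assumed format).
    Returns a response dict that could be passed back to the agent.
    """
    name = action_selection["action_name"]
    params = action_selection.get("parameters", {})
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"status": "error", "detail": f"unknown action: {name}"}
    return {"status": "ok", "detail": handler(**params)}
```

Restricting execution to names present in a registry means the agent's output is treated as data rather than code, which is one reason a system might prefer this form over executing generated code directly.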

[0133] In some instances, an output 560a can be, comprise, be comprised by, or otherwise share one or more properties with an output 111. For example, in some instances, output 560a can have any property described herein with respect to an output 111, and vice versa.

[0134] In some instances, a response 560b can be, comprise, be comprised by, or otherwise share one or more properties with one or more of inputs 104, outputs 111, memories 210, or other data. For example, in some instances, a response 560b can be provided as input 104b to the machine-learned agent 112 (e.g., alone or in combination with other inputs 104a, 104b), and the machine-learned agent 112 can process the response 560b in a manner similar to (e.g., same as) a manner for processing memories 210, inputs 104b, or other values provided to the machine-learned agent 112. For example, in some instances, a machine-learned agent 112 can include an agent 112 configured to execute a multi-component reasoning process or multi-component action plan (e.g., recursive or hierarchical reasoning process or action plan, etc.), such as an action plan comprising retrieving one or more memories 110; generating an action selection 511 based at least in part on the memories 110; performing or causing to be performed a first action based on the action selection 511; receiving a response 560b associated with the first action; and performing additional operations (e.g., retrieval operations, action selection 511 generation, output 111 generation, etc.) based on the response 560b.
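
As a non-limiting illustrative sketch, the multi-component action plan described above (retrieve memories, generate an action selection, perform the action, receive a response, continue) could be organized as a simple loop. The callables `toy_agent`, `toy_retrieve`, and `toy_tools`, and the `"kind"` field of the agent's decision, are hypothetical placeholders for a machine-learned agent, a memory retrieval module, and tools.

```python
def run_agent(agent, retrieve, tools, user_input, max_steps=5):
    """Run a retrieve / act / respond loop until the agent emits an output.

    agent: callable(context, memories) -> decision dict (assumed format).
    retrieve: callable(user_input) -> list of memories.
    tools: dict mapping action names to callables.
    """
    memories = retrieve(user_input)           # retrieve memories
    context = [user_input]
    for _ in range(max_steps):
        decision = agent(context, memories)   # action selection or output
        if decision["kind"] == "output":
            return decision["text"]
        tool = tools[decision["action_name"]]
        response = tool(**decision.get("parameters", {}))  # perform action
        context.append(response)              # response fed back to the agent
    return None  # step limit reached without a final output


def toy_retrieve(user_input):
    return ["user lives in Zurich"]


def toy_agent(context, memories):
    # First turn: use a retrieved memory to choose a tool call.
    if len(context) == 1:
        city = memories[0].split()[-1]
        return {
            "kind": "action",
            "action_name": "get_weather",
            "parameters": {"city": city},
        }
    # Later turns: turn the tool response into a final output.
    return {"kind": "output", "text": f"Forecast: {context[-1]}"}


toy_tools = {"get_weather": lambda city: f"sunny in {city}"}
answer = run_agent(toy_agent, toy_retrieve, toy_tools, "What is the weather at home?")
```

The `max_steps` bound is an assumption of this sketch; it simply prevents an agent that never emits a final output from looping indefinitely.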

[0135] A tool 558 can be or include one or more software, firmware, or hardware components configured to perform actions identified in an action selection 511. In some instances, a tool 558 can include one or more API tools having an application programming interface (API) that can be called to invoke the API tool. In some instances, a tool 558 can include a tool 558 that is executed or invoked in another manner, such as a tool 558 comprising glue code configured to receive an action selection 511 output and perform one or more actions identified by the action selection 511.

[0136] An API tool can include, for example, any tool (e.g., hardware tool, software tool, firmware tool, etc.) that can be accessed via an API. For example, in some instances, an API tool can include software (e.g., application, operating system, etc.) installed on a device (e.g., mobile device, smartphone, etc.) running the machine-learned agent 112; software available via a network (e.g., internet); hardware devices (e.g., internet-connected hardware devices, Bluetooth-connected hardware devices, etc.); etc. In some instances, a hardware API tool can include a hardware tool connected (e.g., via a network, via a wireless or wired connection, etc.) to a device running the machine-learned agent 112. In some instances, an API tool can include a navigation API (e.g., map-related API, global positioning system-related API, etc.); a communication API (e.g., API associated with making phone calls, emails, or text messages such as SMS or MMS; APIs associated with communication applications, such as messaging applications, social media communication applications, etc.); a scheduling API (e.g., calendar, alarm, automated task scheduling, etc.); media player API (e.g., video player such as YouTube, audio player, etc.); shopping API; payment API such as Google Pay; mobile banking API; travel-related API (e.g., flight booking, hotel booking, etc.); or API of any application installed on a mobile device. In some instances, an API tool can include a hardware device such as a Bluetooth-connected lock, gate opener, garage door opener, etc.; an internet-connected doorbell or surveillance camera device; smart home device such as smart TV, smart appliance, lighting devices, thermostats, etc.; or any other API-accessible hardware tool.

Example Methods

[0137] Figure 6 depicts a flowchart diagram of an example method for generating an inference output based on a retrieved memory according to example embodiments of the present disclosure. Although Figure 6 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of example method 600 can be omitted, rearranged, combined, and / or adapted in various ways without deviating from the scope of the present disclosure.

[0138] At 602, example method 600 can include obtaining, by a computing system (e.g., computing system 102, etc.) comprising one or more computing devices, a multimodal multi-index data structure (e.g., memory storage system 108, data structure comprising memory tables 219 and index tables 226, etc.) comprising a first plurality of data entries (e.g., data table rows or cells; segments 218, memories 210, etc.) having at least a first data type; a second plurality of data entries having at least a second data type different from the first data type; and a plurality of index data structures (e.g., index tables 226), wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value (e.g., indexing value 224, etc.). In some instances, example method 600 at 602 can include using one or more systems or performing one or more activities described with respect to Figure 3.

[0139] At 604, example method 600 can include receiving, by the computing system, a multimodal input (e.g., input 104a, etc.) comprising a query directed to a machine-learned agent (e.g., machine-learned agent 112, etc.). In some instances, example method 600 at 604 can include using one or more systems or performing one or more activities described with respect to Figure 3.

[0140] At 606, example method 600 can include retrieving, by the computing system based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry (e.g., selected memory 310, etc.) of the first plurality of data entries or second plurality of data entries. In some instances, example method 600 at 606 can include using one or more systems or performing one or more activities described with respect to Figure 3.

[0141] At 608, example method 600 can include providing, by the computing system to the machine-learned agent, the first data entry. In some instances, example method 600 at 608 can include using one or more systems or performing one or more activities described with respect to Figure 3.

[0142] At 610, example method 600 can include generating, by the machine-learned agent based at least in part on the first data entry, an inference output (e.g., output 111, action selection 511, etc.). In some instances, example method 600 at 610 can include using one or more systems or performing one or more activities described with respect to Figure 1 or 5.
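
As a non-limiting illustrative sketch, steps 602 through 610 could be exercised end to end with a toy in-memory structure as shown below. The `MultiIndexStore` class, the brute-force cosine search, the hand-written two-dimensional indexing values, and the `toy_agent` function are all assumptions; a practical system would likely use learned embeddings and an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class MultiIndexStore:
    """Toy multimodal multi-index structure (hypothetical layout)."""
    entries: dict = field(default_factory=dict)   # entry_id -> (data_type, payload)
    indexes: dict = field(default_factory=dict)   # index name -> [(value, entry_id)]

    def add(self, entry_id, data_type, payload, index_name, indexing_value):
        self.entries[entry_id] = (data_type, payload)
        self.indexes.setdefault(index_name, []).append((indexing_value, entry_id))

    def retrieve(self, query_values, k=1):
        """query_values: index name -> query vector. Returns top-k entries."""
        best = {}
        for index_name, query_vec in query_values.items():
            for value, entry_id in self.indexes.get(index_name, []):
                sim = cosine_similarity(query_vec, value)
                best[entry_id] = max(sim, best.get(entry_id, -1.0))
        ranked = sorted(best, key=best.get, reverse=True)
        return [self.entries[i] for i in ranked[:k]]


def toy_agent(query_text, memory):
    """Placeholder for a machine-learned agent's inference."""
    _data_type, payload = memory
    return f"Based on memory: {payload}"


# 602: obtain the structure, holding entries of two different data types.
store = MultiIndexStore()
store.add("m1", "caption", "keys on the kitchen table", "text_embedding", [1.0, 0.0])
store.add("m2", "image", "frame_0042.jpg", "image_embedding", [0.0, 1.0])

# 604: receive a query (its indexing value is hand-written here).
query_text = "Where are my keys?"
query_values = {"text_embedding": [0.9, 0.1]}

# 606: retrieve a first data entry using the index tables.
first_entry = store.retrieve(query_values, k=1)[0]

# 608 and 610: provide the entry to the agent and generate an output.
output = toy_agent(query_text, first_entry)
```

Keeping entries and index tables in separate mappings mirrors the separation between memory tables 219 and index tables 226, so that one entry can be reachable through several indexes.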

[0143] Figure 7 depicts a flowchart diagram of an example method for storing a memory in a memory data structure according to example embodiments of the present disclosure. Although Figure 7 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of example method 700 can be omitted, rearranged, combined, and / or adapted in various ways without deviating from the scope of the present disclosure.

[0144] At 702, example method 700 can include receiving, by a computing system (e.g., computing system 102, etc.), one or more inputs (e.g., inputs 104a, etc.) directed to a machine-learned agent (e.g., machine-learned agent 112, etc.). In some instances, example method 700 at 702 can include using one or more systems or performing one or more activities described with respect to Figure 2.

[0145] At 704, example method 700 can include extracting, by the computing system and based at least in part on one or more inputs, one or more data entries (e.g., segments 218, memories 210, etc.) to be added to a first plurality of data entries or second plurality of data entries of a multimodal multi-index data structure (e.g., memory storage system 108, data structure comprising one or more memory tables 219 and index tables 226, etc.) comprising a first plurality of data entries having at least a first data type; a second plurality of data entries having at least a second data type different from the first data type; and a plurality of index data structures (e.g., index tables 226, etc.), wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value (e.g., indexing value 224, etc.). In some instances, example method 700 at 704 can include using one or more systems or performing one or more activities described with respect to Figure 2.

[0146] At 706, example method 700 can include generating, by the computing system and based at least in part on the one or more data entries, one or more index data entries to be added to the plurality of index data entries. In some instances, example method 700 at 706 can include using one or more systems or performing one or more activities described with respect to Figure 2.

[0147] At 708, example method 700 can include storing, by the computing system, the one or more data entries and the one or more index data entries in the multimodal multi-index data structure. In some instances, example method 700 at 708 can include using one or more systems or performing one or more activities described with respect to Figure 2.
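
As a non-limiting illustrative sketch, steps 702 through 708 could be expressed as a single storage routine over a dictionary-based structure. The structure layout, the per-modality index naming, and in particular `toy_indexing_value` (a crude two-feature stand-in for a learned embedding model) are assumptions made for illustration.

```python
def toy_indexing_value(payload):
    """Placeholder for an embedding model (assumption): two crude text features."""
    text = str(payload)
    return [float(len(text)), float(text.count(" "))]


def new_structure():
    """Empty structure with one memory table and a set of named index tables."""
    return {"memory_table": {}, "index_tables": {}}


def store_inputs(structure, inputs):
    """Extract, index, and store one data entry per input item.

    inputs: list of dicts with "modality" and "content" keys (assumed format).
    Returns the entry ids that were stored.
    """
    stored_ids = []
    for item in inputs:                                   # 702: received inputs
        # 704: extract a data entry to be added to the structure.
        entry_id = len(structure["memory_table"])
        entry = {"data_type": item["modality"], "payload": item["content"]}
        # 706: generate an index data entry correlating the data entry
        # with an indexing value.
        index_name = item["modality"] + "_index"
        index_entry = (toy_indexing_value(item["content"]), entry_id)
        # 708: store the data entry and the index data entry.
        structure["memory_table"][entry_id] = entry
        structure["index_tables"].setdefault(index_name, []).append(index_entry)
        stored_ids.append(entry_id)
    return stored_ids
```

Using the current table size as the next entry id is a simplification that only holds while entries are never deleted; a production system would more likely use a database-generated key.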

[0148] Figure 8 depicts a flowchart of a method 800 for training one or more machine-learned models according to aspects of the present disclosure. For instance, an example machine-learned model can include a machine-learned agent 112.

[0149] One or more portion(s) of example method 800 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 800 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 800 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. Figure 8 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. Figure 8 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 800 can be performed additionally, or alternatively, by other systems.

[0150] At 802, example method 800 can include obtaining a training instance. A set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or testing dataset). A training instance can be labeled or unlabeled. Although referred to in example method 800 as a “training” instance, it is to be understood that runtime inferences can form training instances when a model is trained using an evaluation of the model’s performance on that runtime instance (e.g., online training / learning). Example data types for the training instance and various tasks associated therewith are described throughout the present disclosure.

[0151] At 804, example method 800 can include processing, using one or more machine-learned models, the training instance to generate an output. The output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models.

[0152] At 806, example method 800 can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function. Various determinations of loss can be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, contrastive loss, or various other loss functions. The evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g., semi- or self-supervised learning), or without labels (e.g., unsupervised learning). The evaluation signal can be a reward (e.g., for reinforcement learning). The reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received. The reward can be computed using feedback data describing human feedback on the output(s).
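
By way of illustration only, three of the loss functions named above can be written in their standard textbook forms as follows. The function names and the sample values are assumptions chosen for this sketch and do not reflect any particular implementation of example method 800.

```python
import math


def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy(probabilities, true_index):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(probabilities[true_index])


def hinge_loss(score, label):
    """Hinge loss for a raw score and a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(cross_entropy([0.5, 0.5], 0))
print(hinge_loss(0.5, -1))
```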

[0153] At 808, example method 800 can include updating the machine-learned model using the evaluation signal. For example, values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation. For example, the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)). For example, system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. Example method 800 can include implementing a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
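
The four operations 802 through 808 can be sketched, under simplifying assumptions, as a gradient descent loop over a one-parameter-pair linear model with a mean squared error evaluation signal. The model form, learning rate, iteration count, and data below are hypothetical; a practical system would typically rely on an automatic differentiation framework rather than hand-derived gradients.

```python
def train_linear_model(data, learning_rate=0.05, iterations=1000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in data:                 # 802: obtain a training instance
            prediction = w * x + b        # 804: process it to generate an output
            error = prediction - y        # 806: evaluation signal (MSE residual)
            grad_w += 2 * error * x / n   # gradient of MSE with respect to w
            grad_b += 2 * error / n       # gradient of MSE with respect to b
        w -= learning_rate * grad_w       # 808: update the parameters
        b -= learning_rate * grad_b
    return w, b


# Samples drawn from y = 2x + 1 (illustrative data only).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_model(data)
print(round(w, 3), round(b, 3))
```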

[0154] In some implementations, example method 800 can be implemented for training a machine-learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).

[0155] In some implementations, example method 800 can be implemented for particular stages of a training procedure. For instance, in some implementations, example method 800 can be implemented for pre-training a machine-learned model. Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks / data types. In some implementations, example method 800 can be implemented for fine-tuning a machine-learned model. Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data. Fine-tuning can affect all or a portion of the parameters of a machine-learned model. For example, various portions of the machine-learned model can be “frozen” for certain training stages. For example, parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)). An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use.

Example Machine-Learned Models

[0156] Figure 9 is a block diagram of an example processing flow for using machine-learned model(s) 1 to process input(s) 2 to generate output(s) 3.

[0157] Machine-learned model(s) 1 can be or include one or multiple machine-learned models or model components. Example machine-learned models can include neural networks (e.g., deep neural networks). Example machine-learned models can include nonlinear models or linear models. Example machine-learned models can use other architectures in lieu of or in addition to neural networks. Example machine-learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc.

[0158] Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks. Example neural networks can be deep neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models.

[0159] Machine-learned model(s) 1 can include a single or multiple instances of the same model configured to operate on data from input(s) 2. Machine-learned model(s) 1 can include an ensemble of different models that can cooperatively interact to process data from input(s) 2. For example, machine-learned model(s) 1 can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, ARXIV:2202.09368V2 (Oct. 14, 2022).

[0160] Input(s) 2 can generally include or otherwise represent various types of data. Input(s) 2 can include one type or many different types of data. Output(s) 3 can be data of the same type(s) or of different types of data as compared to input(s) 2. Output(s) 3 can include one type or many different types of data.

[0161] Example data types for input(s) 2 or output(s) 3 include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer's central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), chemical or biochemical data, image data, audio data, audiovisual data, haptic data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like. Data can be raw or processed and can be in any format or schema.

[0162] In multimodal inputs 2 or outputs 3, example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and astronomical data, sensor data and chemical data, etc. It is to be understood that any combination of data types in an input 2 or an output 3 can be present.

[0163] An example input 2 can include one or multiple data types, such as the example data types noted above. An example output 3 can include one or multiple data types, such as the example data types noted above. The data type(s) of input 2 can be the same as or different from the data type(s) of output 3. It is to be understood that the example data types noted above are provided for illustrative purposes only. Data types contemplated within the scope of the present disclosure are not limited to those examples noted above.

Example Machine-Learned Sequence Processing Models

[0164] Figure 10 is a block diagram of an example implementation of an example machine-learned model configured to process sequences of information. For instance, an example implementation of machine-learned model(s) 1 can include machine-learned sequence processing model(s) 4. An example system can pass input(s) 2 to sequence processing model(s) 4. Sequence processing model(s) 4 can include one or more machine-learned components. Sequence processing model(s) 4 can process the data from input(s) 2 to obtain an input sequence 5. Input sequence 5 can include one or more input elements 5-1, 5-2, . . . , 5-M, etc. obtained from input(s) 2. Sequence processing model 4 can process input sequence 5 using prediction layer(s) 6 to generate an output sequence 7. Output sequence 7 can include one or more output elements 7-1, 7-2, . . . , 7-N, etc. generated based on input sequence 5. The system can generate output(s) 3 based on output sequence 7.

[0165] Sequence processing model(s) 4 can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information. For example, some example sequence processing models in the text domain are referred to as “Large Language Models,” or LLMs. See, e.g., PaLM 2 Technical Report, GOOGLE, https://ai.google/static/documents/palm2techreport.pdf (n.d.). Other example sequence processing models can operate in other domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ARXIV:2010.11929V2 (Jun. 3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, ARXIV:2301.11325V1 (Jan. 26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug. 26, 2021), by way of example. Sequence processing model(s) 4 can process one or multiple types of data simultaneously. Sequence processing model(s) 4 can include relatively large models (e.g., more parameters, computationally expensive, etc.), relatively small models (e.g., fewer parameters, computationally lightweight, etc.), or both.

[0166] In general, sequence processing model(s) 4 can obtain input sequence 5 using data from input(s) 2. For instance, input sequence 5 can include a representation of data from input(s) 2 in a format understood by sequence processing model(s) 4. One or more machine-learned components of sequence processing model(s) 4 can ingest the data from input(s) 2, parse the data into pieces compatible with the processing architectures of sequence processing model(s) 4 (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layer(s) 6 (e.g., via “embedding”).

[0167] Sequence processing model(s) 4 can ingest the data from input(s) 2 and parse the data into a sequence of elements to obtain input sequence 5. For example, a portion of input data from input(s) 2 can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence.

[0168] Elements 5-1, 5-2, . . . , 5-M can represent, in some cases, building blocks for capturing or expressing meaningful information in a particular data domain. For instance, the elements can describe “atomic units” across one or more domains. For example, for textual input source(s), the elements can correspond to groups of one or more words or sub-word components, such as sets of one or more characters.

[0169] For example, elements 5-1, 5-2, . . . , 5-M can represent tokens obtained using a tokenizer. For instance, a tokenizer can process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements 5-1, 5-2, . . . , 5-M) that represent the portion of the input source. Various approaches to tokenization can be used. For instance, textual input source(s) can be tokenized using a byte-pair encoding (BPE) technique. See, e.g., Kudo et al., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, PROCEEDINGS OF THE 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (System Demonstrations), pages 66-71 (October 31-November 4, 2018), https://aclanthology.org/D18-2012.pdf. Image-based input source(s) can be tokenized by extracting and serializing patches from an image.
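
As one hedged illustration of the byte-pair encoding idea, the sketch below performs a single merge step of the classic BPE procedure: it counts adjacent symbol pairs over a small word-frequency table and merges the most frequent pair into a new symbol. The corpus, function names, and frequencies are hypothetical, and the sketch is not the SentencePiece implementation cited above.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, count in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, count in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + count
    return merged


# Words split into characters, mapped to illustrative frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pair = most_frequent_pair(corpus)
corpus_after = merge_pair(corpus, pair)
print(pair, corpus_after)
```

Repeating the merge step a chosen number of times yields a vocabulary of sub-word tokens of the kind that can serve as elements of input sequence 5.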

[0170] In general, arbitrary data types can be serialized and processed into input sequence 5. It is to be understood that element(s) 5-1, 5-2, . . . , 5-M depicted in Figure 10 can be the tokens or can be the embedded representations thereof.

[0171] Prediction layer(s) 6 can predict one or more output elements 7-1, 7-2, . . . , 7-N based on the input elements. Prediction layer(s) 6 can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the input(s) to extract higher-order meaning from, and relationships between, input element(s) 5-1, 5-2, . . . , 5-M. In this manner, for instance, example prediction layer(s) 6 can predict new output element(s) in view of the context provided by input sequence 5.

[0172] Prediction layer(s) 6 can evaluate associations between portions of input sequence 5 and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter’s toolbox was small and heavy. It was full of ___.” Example prediction layer(s) 6 can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings. Example prediction layer(s) 6 can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations, prediction layer(s) 6 can, for instance, assign a higher probability to the word “nails” than to the word “sawdust.”

[0173] A transformer is an example architecture that can be used in prediction layer(s) 6. See, e.g., Vaswani et al., Attention Is All You Need, ARXIV:1706.03762V7 (Aug. 2, 2023). A transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window. The context window can include a sequence that contains input sequence 5 and potentially one or more output element(s) 7-1, 7-2, . . . , 7-N. A transformer block can include one or more attention layer(s) and one or more post-attention layer(s) (e.g., feedforward layer(s), such as a multi-layer perceptron).
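
For illustration, an attention mechanism of the scaled dot-product kind can be sketched as follows. This simplified example assumes that queries, keys, and values are all the raw element vectors, with no learned projections, multiple heads, or masking; the function names and the three-element context are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(elements):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(elements[0])
    outputs, weights = [], []
    for query in elements:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in elements
        ]
        row = softmax(scores)  # association of this element with every element
        weights.append(row)
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, elements)) for d in range(dim)]
        )
    return outputs, weights


context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(context)
print(weights[0])
```

Each row of `weights` sums to one and expresses how strongly one item in the context window is associated with every other item.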

[0174] Prediction layer(s) 6 can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. For example, recurrent neural networks (RNNs) and long short-term memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs). In general, prediction layer(s) 6 can leverage various kinds of artificial neural networks that can understand or generate sequences of information.

[0175] Output sequence 7 can include or otherwise represent the same or different data types as input sequence 5. For instance, input sequence 5 can represent textual data, and output sequence 7 can represent textual data. Input sequence 5 can represent image, audio, or audiovisual data, and output sequence 7 can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layer(s) 6, and any other interstitial model components of sequence processing model(s) 4, can be configured to receive a variety of data types in input sequence(s) 5 and output a variety of data types in output sequence(s) 7.

[0176] Output sequence 7 can have various relationships to input sequence 5. Output sequence 7 can be a continuation of input sequence 5. Output sequence 7 can be complementary to input sequence 5. Output sequence 7 can translate, transform, augment, or otherwise modify input sequence 5. Output sequence 7 can answer, evaluate, confirm, or otherwise respond to input sequence 5. Output sequence 7 can implement (or describe instructions for implementing) an instruction provided via input sequence 5.

[0177] Output sequence 7 can be generated autoregressively. For instance, for some applications, an output of one or more prediction layer(s) 6 can be passed through one or more output layers (e.g., softmax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. In this manner, for instance, output sequence 7 can be autoregressively generated by sampling a likely next output element, adding that element to the context window, re-generating the probability distribution based on the updated context window, sampling another likely next output element, and so forth.
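
The autoregressive loop described above can be sketched as follows. For brevity, this toy example assumes a hypothetical table of next-element scores that depends only on the last element of the context window, whereas an actual model would condition on the whole window. It also selects the most probable element (greedy decoding) instead of sampling, so that the result is deterministic.

```python
import math

# Hypothetical next-element scores keyed by the last element in the context.
TOY_LOGITS = {
    "<start>": {"the": 2.0, "toolbox": 0.5, "<end>": 0.1},
    "the": {"toolbox": 2.5, "the": 0.1, "<end>": 0.2},
    "toolbox": {"<end>": 2.0, "the": 0.3, "toolbox": 0.1},
}


def softmax(logits):
    """Turn a mapping of scores into a probability distribution."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def generate(context, max_steps=10):
    """Repeatedly pick a next element and append it to the context window."""
    context = list(context)
    for _ in range(max_steps):
        probabilities = softmax(TOY_LOGITS[context[-1]])
        next_element = max(probabilities, key=probabilities.get)  # greedy choice
        if next_element == "<end>":
            break
        context.append(next_element)  # updated context for the next step
    return context


output_sequence = generate(["<start>"])
print(output_sequence)
```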

[0178] Output sequence 7 can also be generated non-autoregressively. For instance, multiple output elements of output sequence 7 can be predicted together without explicit sequential conditioning on each other. See, e.g., Saharia et al., Non-Autoregressive Machine Translation with Latent Alignments, ARXIV:2004.07437V3 (Nov. 16, 2020).

[0179] Output sequence 7 can include one or multiple portions or elements. In an example content generation configuration, output sequence 7 can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.). In an example classification configuration, output sequence 7 can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified. For instance, a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image.

[0180] Figure 11 is a block diagram of an example technique for populating an example input sequence 8. Input sequence 8 can include various functional elements that form part of the model infrastructure, such as an element 8-0 obtained from a task indicator 9 that signals to any model(s) that process input sequence 8 that a particular task is being performed (e.g., to help adapt a performance of the model(s) to that particular task). Input sequence 8 can include various data elements from different data modalities. For instance, an input modality 10-1 can include one modality of data. A data-to-sequence model 11-1 can process data from input modality 10-1 to project the data into a format compatible with input sequence 8 (e.g., one or more vectors dimensioned according to the dimensions of input sequence 8) to obtain elements 8-1, 8-2, 8-3. Another input modality 10-2 can include a different modality of data. A data-to-sequence model 11-2 can project data from input modality 10-2 into a format compatible with input sequence 8 to obtain elements 8-4, 8-5, 8-6. Another input modality 10-3 can include yet another different modality of data. A data-to-sequence model 11-3 can project data from input modality 10-3 into a format compatible with input sequence 8 to obtain elements 8-7, 8-8, 8-9.

[0181] Input sequence 8 can be the same as or different from input sequence 5. Input sequence 8 can be a multimodal input sequence that contains elements that represent data from different modalities using a common dimensional representation. For instance, an embedding space can have P dimensions. Input sequence 8 can be configured to contain a plurality of elements that have P dimensions. In this manner, for instance, example implementations can facilitate information extraction and reasoning across diverse data modalities by projecting data into elements in the same embedding space for comparison, combination, or other computations therebetween.
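
As a minimal sketch of projecting different modalities into a common P-dimensional representation, the example below applies one hypothetical linear projection per modality and concatenates the results into a single sequence. The value of `P`, the weight matrices, and the input vectors are illustrative assumptions; actual data-to-sequence models would typically use learned parameters.

```python
P = 3  # assumed shared embedding dimension


def project(vector, weights):
    """Linear projection: `weights` has P rows, each as long as `vector`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


# Hypothetical projection weights: text features are 2-d, image patches 4-d.
text_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_weights = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[0.2, 0.4, 0.6, 0.8]]

# Elements from both modalities share the same dimension and one sequence.
input_sequence = [project(t, text_weights) for t in text_tokens] + [
    project(p, image_weights) for p in image_patches
]
print(input_sequence)
```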

[0182] For example, elements 8-0, . . . , 8-9 can indicate particular locations within a multidimensional embedding space. Some elements can map to a set of discrete locations in the embedding space. For instance, elements that correspond to discrete members of a predetermined vocabulary of tokens can map to discrete locations in the embedding space that are associated with those tokens. Other elements can be continuously distributed across the embedding space. For instance, some datatypes can be broken down into continuously defined portions (e.g., image patches) that can be described using continuously distributed locations within the embedding space.

[0183] In some implementations, the expressive power of the embedding space may not be limited to meanings associated with any particular set of tokens or other building blocks. For example, a continuous embedding space can encode a spectrum of high-order information. An individual piece of information (e.g., a token) can map to a particular point in that space: for instance, a token for the word “dog” can be projected to an embedded value that points to a particular location in the embedding space associated with canine-related information. Similarly, an image patch of an image of a dog on grass can also be projected into the embedding space. In some implementations, the projection of the image of the dog can be similar to the projection of the word “dog” while also having similarity to a projection of the word “grass,” while potentially being different from both. In some implementations, the projection of the image patch may not exactly align with any single projection of a single word. In some implementations, the projection of the image patch can align with a combination of the projections of the words “dog” and “grass.” In this manner, for instance, a high-order embedding space can encode information that can be independent of data modalities in which the information is expressed.
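
The “dog on grass” example can be made concrete with cosine similarity, one common way to compare locations in an embedding space. The three-dimensional vectors below are invented purely for illustration and are not outputs of any actual embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings in a shared three-dimensional space.
dog_word = [1.0, 0.0, 0.0]
grass_word = [0.0, 1.0, 0.0]
car_word = [0.0, 0.0, 1.0]
dog_on_grass_patch = [0.7, 0.6, 0.1]

for name, vector in [("dog", dog_word), ("grass", grass_word), ("car", car_word)]:
    print(name, round(cosine_similarity(dog_on_grass_patch, vector), 3))
```

In this sketch the image patch is similar to both “dog” and “grass” without coinciding with either, and is far from the unrelated word.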

[0184] Task indicator 9 can include a model or model component configured to identify a task being performed and inject, into input sequence 8, an input value represented by element 8-0 that signals which task is being performed. For instance, the input value can be provided as a data type associated with an input modality and projected along with that input modality (e.g., the input value can be a textual task label that is embedded along with other textual data in the input; the input value can be a pixel-based representation of a task that is embedded along with other image data in the input; etc.). The input value can be provided as a data type that differs from or is at least independent from other input(s). For instance, the input value represented by element 8-0 can be learned within a continuous embedding space.

[0185] Input modalities 10-1, 10-2, and 10-3 can be associated with various different data types (e.g., as described above with respect to input(s) 2 and output(s) 3).

[0186] Data-to-sequence models 11-1, 11-2, and 11-3 can be the same or different from each other. Data-to-sequence models 11-1, 11-2, and 11-3 can be adapted to each respective input modality 10-1, 10-2, and 10-3. For example, a textual data-to-sequence model can subdivide a portion of input text and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-1, 8-2, 8-3, etc.). An image data-to-sequence model can subdivide an input image and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-4, 8-5, 8-6, etc.). An arbitrary datatype data-to-sequence model can subdivide an input of that arbitrary datatype and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-7, 8-8, 8-9, etc.).

[0187] Data-to-sequence models 11-1, 11-2, and 11-3 can form part of machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be jointly trained with or trained independently from machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be trained end-to-end with machine-learned sequence processing model(s) 4.

Example Machine-Learned Model Development Platform

[0188] Figure 12 is a block diagram of an example model development platform 12 that can facilitate creation, adaptation, and refinement of example machine-learned models (e.g., machine-learned model(s) 1, sequence processing model(s) 4, etc.). Model development platform 12 can provide a number of different toolkits that developer systems can employ in the development of new or adapted machine-learned models.

[0189] Model development platform 12 can provide one or more model libraries 13 containing building blocks for new models. Model libraries 13 can include one or more pre-trained foundational models 13-1, which can provide a backbone of processing power across various tasks. Model libraries 13 can include one or more pre-trained expert models 13-2, which can be focused on performance in particular domains of expertise. Model libraries 13 can include various model primitives 13-3, which can provide low-level architectures or components (optionally pre-trained), which can be assembled in various arrangements as desired.

[0190] Model development platform 12 can receive selections of various model components 14. Model development platform 12 can pass selected model components 14 to a workbench 15 that combines selected model components 14 into a development model 16.

[0191] Workbench 15 can facilitate further refinement and adaptation of development model 16 by leveraging a number of different toolkits integrated with model development platform 12. For example, workbench 15 can facilitate alignment of the development model 16 with a desired performance profile on various tasks using a model alignment toolkit 17.

[0192] Model alignment toolkit 17 can provide a number of tools for causing development model 16 to generate outputs aligned with desired behavioral characteristics. Alignment can include increasing an accuracy, precision, recall, etc. of model outputs. Alignment can include enforcing output styles, schema, or other preferential characteristics of model outputs. Alignment can be general or domain-specific. For instance, a pre-trained foundational model 13-1 can begin with an initial level of performance across multiple domains. Alignment of the pre-trained foundational model 13-1 can include improving a performance in a particular domain of information or tasks (e.g., even at the expense of performance in another domain of information or tasks).

[0193] Model alignment toolkit 17 can integrate one or more dataset(s) 17-1 for aligning development model 16. Curated dataset(s) 17-1 can include labeled or unlabeled training data. Dataset(s) 17-1 can be obtained from public domain datasets. Dataset(s) 17-1 can be obtained from private datasets associated with one or more developer system(s) for the alignment of bespoke machine-learned model(s) customized for private use-cases.

[0194] Pre-training pipelines 17-2 can include a machine-learned model training workflow configured to update development model 16 over large-scale, potentially noisy datasets. For example, pre-training can leverage unsupervised learning techniques (e.g., denoising, etc.) to process large numbers of training instances to update model parameters from an initialized state and achieve a desired baseline performance. Pre-training pipelines 17-2 can leverage unlabeled datasets in dataset(s) 17-1 to perform pre-training. Workbench 15 can implement a pre-training pipeline 17-2 to pre-train development model 16.

[0195] Fine-tuning pipelines 17-3 can include a machine-learned model training workflow configured to refine the model parameters of development model 16 with higher-quality data. Fine-tuning pipelines 17-3 can update development model 16 by conducting supervised training with labeled dataset(s) in dataset(s) 17-1. Fine-tuning pipelines 17-3 can update development model 16 by conducting reinforcement learning using reward signals from user feedback signals. Workbench 15 can implement a fine-tuning pipeline 17-3 to fine-tune development model 16.

[0196] Prompt libraries 17-4 can include sets of inputs configured to induce behavior aligned with desired performance criteria. Prompt libraries 17-4 can include few-shot prompts (e.g., inputs providing examples of desired model outputs for prepending to a desired runtime query), chain-of-thought prompts (e.g., inputs providing step-by-step reasoning within the exemplars to facilitate thorough reasoning by the model), and the like.
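
A few-shot prompt of the kind described above can be assembled, in one simple and hypothetical form, by prepending exemplar input/output pairs to the runtime query. The “Input:” and “Output:” labels and the function name are assumptions for this sketch; actual prompt formats vary by model.

```python
def build_few_shot_prompt(exemplars, query):
    """Prepend exemplar input/output pairs to a runtime query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in exemplars]
    blocks.append(f"Input: {query}\nOutput:")  # left open for the model
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt([("2 + 2", "4"), ("3 + 5", "8")], "7 + 1")
print(prompt)
```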

[0197] Example prompts can be retrieved from an available repository of prompt libraries 17-4. Example prompts can be contributed by one or more developer systems using workbench 15.

[0198] In some implementations, pre-trained or fine-tuned models can achieve satisfactory performance without exemplars in the inputs. For instance, zero-shot prompts can include inputs that lack exemplars. Zero-shot prompts can be within a domain within a training dataset or outside of the training domain(s).

[0199] Prompt libraries 17-4 can include one or more prompt engineering tools. Prompt engineering tools can provide workflows for retrieving or learning optimized prompt values. Prompt engineering tools can facilitate directly learning prompt values (e.g., input element values) based on one or more training iterations. Workbench 15 can implement prompt engineering tools in development model 16.

[0200] Prompt libraries 17-4 can include pipelines for prompt generation. For example, inputs can be generated using development model 16 itself or other machine-learned models. In this manner, for instance, a first model can process information about a task and output an input for a second model to process in order to perform a step of the task. The second model can be the same as or different from the first model. Workbench 15 can implement prompt generation pipelines in development model 16.

[0201] Prompt libraries 17-4 can include pipelines for context injection. For instance, a performance of development model 16 on a particular task can improve if provided with additional context for performing the task. Prompt libraries 17-4 can include software components configured to identify desired context, retrieve the context from an external source (e.g., a database, a sensor, etc.), and add the context to the input prompt. Workbench 15 can implement context injection pipelines in development model 16.
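
A context injection pipeline can be sketched, under simplifying assumptions, as a retrieval step followed by prompt assembly. The toy retriever below ranks stored snippets by word overlap with the query, which stands in for whatever retrieval method an actual system would use; the knowledge base, function names, and prompt layout are hypothetical.

```python
def retrieve_context(query, knowledge_base, top_k=2):
    """Rank stored snippets by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), text)
        for text in knowledge_base
    ]
    # Keep only snippets with some overlap; the sort is stable for ties.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [text for _, text in ranked[:top_k]]


def inject_context(query, knowledge_base):
    """Add the retrieved context to the input prompt ahead of the query."""
    context = retrieve_context(query, knowledge_base)
    lines = ["Context:"] + [f"- {c}" for c in context] + ["", f"Query: {query}"]
    return "\n".join(lines)


knowledge_base = [
    "the toolbox is in the garage",
    "the dog sleeps outside",
    "nails are stored in the toolbox",
]
prompt = inject_context("where is my toolbox", knowledge_base)
print(prompt)
```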

[0202] Although various training examples described herein with respect to model development platform 12 refer to “pre-training” and “fine-tuning,” it is to be understood that model alignment toolkit 17 can generally support a wide variety of training techniques adapted for training a wide variety of machine-learned models. Example training techniques can correspond to the example training method 800 described above.

[0203] Model development platform 12 can include a model plugin toolkit 18. Model plugin toolkit 18 can include a variety of tools configured for augmenting the functionality of a machine-learned model by integrating the machine-learned model with other systems, devices, and software components. For instance, a machine-learned model can use tools to increase performance quality where appropriate. For instance, deterministic tasks can be offloaded to dedicated tools in lieu of probabilistically performing the task with an increased risk of error. For instance, instead of autoregressively predicting the solution to a system of equations, a machine-learned model can recognize a tool to call for obtaining the solution and pass the system of equations to the appropriate tool. The tool can be a traditional system of equations solver that can operate deterministically to resolve the system of equations. The output of the tool can be returned in response to the original query. In this manner, tool use can allow some example models to focus on the strengths of machine-learned models — e.g., understanding an intent in an unstructured request for a task — while augmenting the performance of the model by offloading certain tasks to a more focused tool for rote application of deterministic algorithms to a well-defined problem.
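
The system-of-equations example can be illustrated with a small deterministic tool and a dispatcher. In this sketch the structured `tool_call` dictionary stands in for a hypothetical model output; the tool name, call format, and registry are assumptions, and the solver handles only two equations in two unknowns using Cramer's rule with exact rational arithmetic.

```python
from fractions import Fraction


def solve_2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e and c*x + d*y = f exactly via Cramer's rule."""
    det = Fraction(a) * d - Fraction(b) * c
    if det == 0:
        raise ValueError("system has no unique solution")
    x = (Fraction(e) * d - Fraction(b) * f) / det
    y = (Fraction(a) * f - Fraction(e) * c) / det
    return x, y


# Hypothetical registry of deterministic tools available to the model.
TOOLS = {"solve_2x2": solve_2x2}


def run_tool_call(tool_call):
    """Dispatch a structured tool call to the matching tool."""
    tool = TOOLS[tool_call["tool"]]
    return tool(*tool_call["arguments"])


# Stand-in for a model output requesting: 2x + y = 5 and x - y = 1.
tool_call = {"tool": "solve_2x2", "arguments": [2, 1, 1, -1, 5, 1]}
solution = run_tool_call(tool_call)
print(solution)
```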

[0204] Model plugin toolkit 18 can include validation tools 18-1. Validation tools 18-1 can include tools that can parse and confirm output(s) of a machine-learned model. Validation tools 18-1 can include engineered heuristics that establish certain thresholds applied to model outputs. For example, validation tools 18-1 can ground the outputs of machine-learned models to structured data sources (e.g., to mitigate “hallucinations”).
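One possible shape for such a validation tool is sketched below in Python. The output schema (a JSON object with `country`, `capital`, and `confidence` fields), the threshold value, and the `KNOWN_CAPITALS` table are all hypothetical; they only illustrate parsing an output, applying a heuristic threshold, and grounding a claim against a structured data source.

```python
import json
from typing import Optional

# Hypothetical structured data source used for grounding.
KNOWN_CAPITALS = {"France": "Paris", "Japan": "Tokyo"}
MIN_CONFIDENCE = 0.5  # assumed heuristic threshold


def validate(raw_output: str) -> Optional[dict]:
    """Return the parsed output if it passes every check, else None."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # output is not parseable
    if not isinstance(parsed, dict):
        return None
    confidence = parsed.get("confidence")
    if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
        return None  # fails the threshold heuristic
    expected = KNOWN_CAPITALS.get(parsed.get("country"))
    if expected is None or expected != parsed.get("capital"):
        return None  # claim is not supported by the data source
    return parsed


good = validate('{"country": "France", "capital": "Paris", "confidence": 0.9}')
bad = validate('{"country": "France", "capital": "Lyon", "confidence": 0.9}')
print(good, bad)
```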

[0205] Model plugin toolkit 18 can include tooling packages 18-2 for implementing one or more tools that can include scripts or other executable code that can be executed alongside development model 16. Tooling packages 18-2 can include one or more inputs configured to cause machine-learned model(s) to implement the tools (e.g., few-shot prompts that induce a model to output tool calls in the proper syntax, etc.). Tooling packages 18-2 can include, for instance, fine-tuning training data for training a model to use a tool.

[0206] Model plugin toolkit 18 can include interfaces for calling external application programming interfaces (APIs) 18-3. For instance, in addition to or in lieu of implementing tool calls or tool code directly with development model 16, development model 16 can be aligned to output instructions that initiate API calls to send or obtain data via external systems.

[0207] Model plugin toolkit 18 can integrate with prompt libraries 17-4 to build a catalog of available tools for use with development model 16. For instance, a model can receive, in an input, a catalog of available tools, and the model can generate an output that selects a tool from the available tools and initiates a tool call for using the tool.
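A tool catalog of this kind could be sketched as follows. The catalog contents, the JSON tool-call syntax, and the helper names `build_prompt` and `run_tool_call` are assumptions made for illustration; the model output is hard-coded rather than produced by a model.

```python
import json

# Hypothetical catalog: each tool has a description shown to the model
# and a callable that is executed when the model selects it.
CATALOG = {
    "add": {"description": "Add two numbers.", "fn": lambda a, b: a + b},
    "upper": {"description": "Uppercase a string.",
              "fn": lambda text: text.upper()},
}


def build_prompt(query: str) -> str:
    """Place the catalog of available tools in the model input."""
    lines = ["Available tools:"]
    lines += [f"- {name}: {spec['description']}"
              for name, spec in sorted(CATALOG.items())]
    lines.append(f"Query: {query}")
    return "\n".join(lines)


def run_tool_call(model_output: str):
    """Parse a JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    spec = CATALOG.get(call["tool"])
    if spec is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return spec["fn"](**call["args"])


prompt = build_prompt("What is 2 + 3?")
# Stand-in for the model output that selects a tool from the catalog.
result = run_tool_call('{"tool": "add", "args": {"a": 2, "b": 3}}')
print(prompt)
print(result)
```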

[0208] Model development platform 12 can include a computational optimization toolkit 19 for optimizing a computational performance of development model 16. For instance, tools for model compression 19-1 can allow development model 16 to be reduced in size while maintaining a desired level of performance. For instance, model compression 19-1 can include quantization workflows, weight pruning and sparsification techniques, etc. Tools for hardware acceleration 19-2 can facilitate the configuration of the model storage and execution formats to operate optimally on different hardware resources. For instance, hardware acceleration 19-2 can include tools for optimally sharding models for distributed processing over multiple processing units for increased bandwidth, lower unified memory requirements, etc. Tools for distillation 19-3 can provide for the training of lighter-weight models based on the knowledge encoded in development model 16. For instance, development model 16 can be a highly performant, large machine-learned model optimized using model development platform 12. To obtain a lightweight model for running in resource-constrained environments, a smaller model can be a “student model” that learns to imitate development model 16 as a “teacher model.” In this manner, for instance, the investment in learning the parameters and configurations of development model 16 can be efficiently transferred to a smaller model for more efficient inference.
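As one illustration of a quantization workflow, the Python sketch below applies symmetric 8-bit quantization to a short list of weights. The scheme (a single per-tensor scale, values rounded into the range -127 to 127) is a common choice assumed here for illustration and is not stated in the disclosure.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, scale, restored)
```

Each restored weight differs from the original by at most half of the scale, which is the trade made for storing one byte per weight instead of a full floating-point value.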

[0209] Workbench 15 can implement one, multiple, or none of the toolkits implemented in model development platform 12. Workbench 15 can output an output model 20 based on development model 16. Output model 20 can be a deployment version of development model 16. Output model 20 can be a development or training checkpoint of development model 16. Output model 20 can be a distilled, compressed, or otherwise optimized version of development model 16.

[0210] Figure 13 is a block diagram of an example training flow for training a machine-learned development model 16. One or more portion(s) of the example training flow can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of the example training flow can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of the example training flow can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. FIG. 13 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. FIG. 13 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of the example training flow can be performed additionally, or alternatively, by other systems.

[0211] Initially, development model 16 can persist in an initial state as an initialized model 21. Development model 16 can be initialized with weight values. Initial weight values can be random or based on an initialization schema. Initial weight values can be based on prior pre-training for the same or for a different model.

[0212] Initialized model 21 can undergo pre-training in a pre-training stage 22. Pre-training stage 22 can be implemented using one or more pre-training pipelines 17-2 over data from dataset(s) 17-1. Pre-training can be omitted, for example, if initialized model 21 is already pre-trained (e.g., development model 16 contains, is, or is based on a pre-trained foundational model or an expert model).

[0213] Pre-trained model 23 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Pre-trained model 23 can be the initial state if development model 16 was already pre-trained. Pre-trained model 23 can undergo fine-tuning in a fine-tuning stage 24. Fine-tuning stage 24 can be implemented using one or more fine-tuning pipelines 17-3 over data from dataset(s) 17-1. Fine-tuning can be omitted, for example, if a pre-trained model has satisfactory performance, if the model was already fine-tuned, or if other tuning approaches are preferred.

[0214] Fine-tuned model 25 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Fine-tuned model 25 can be the initial state if development model 16 was already fine-tuned. Fine-tuned model 25 can undergo refinement with user feedback 26. For instance, refinement with user feedback 26 can include reinforcement learning, optionally based on human feedback from human users of fine-tuned model 25. As reinforcement learning can be a form of fine-tuning, it is to be understood that fine-tuning stage 24 can subsume the stage for refining with user feedback 26. Refinement with user feedback 26 can produce a refined model 27. Refined model 27 can be output to downstream system(s) 28 for deployment or further development.

[0215] In some implementations, computational optimization operations can be applied before, during, or after each stage. For instance, initialized model 21 can undergo computational optimization 29-1 (e.g., using computational optimization toolkit 19) before pre-training stage 22. Pre-trained model 23 can undergo computational optimization 29-2 (e.g., using computational optimization toolkit 19) before fine-tuning stage 24. Fine-tuned model 25 can undergo computational optimization 29-3 (e.g., using computational optimization toolkit 19) before refinement with user feedback 26. Refined model 27 can undergo computational optimization 29-4 (e.g., using computational optimization toolkit 19) before output to downstream system(s) 28. Computational optimization(s) 29-1, . . . , 29-4 can all be the same, all be different, or include at least some different optimization techniques.

Example Machine-Learned Model Inference System

[0216] Figure 14 is a block diagram of an inference system for operating one or more machine-learned model(s) 1 to perform inference (e.g., for training, for deployment, etc.). A model host 31 can receive machine-learned model(s) 1. Model host 31 can host one or more model instance(s) 31-1, which can be one or multiple instances of one or multiple models. Model host 31 can host model instance(s) 31-1 using available compute resources 31-2 associated with model host 31.

[0217] Model host 31 can perform inference on behalf of one or more client(s) 32. Client(s) 32 can transmit an input request 33 to model host 31. Using input request 33, model host 31 can obtain input(s) 2 for input to machine-learned model(s) 1. Machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3. Using output(s) 3, model host 31 can return an output payload 34 for responding to input request 33 from client(s) 32. Output payload 34 can include or be based on output(s) 3.

[0218] Model host 31 can leverage various other resources and tools to augment the inference task. For instance, model host 31 can communicate with tool interfaces 35 to facilitate tool use by model instance(s) 31-1. Tool interfaces 35 can include local or remote APIs. Tool interfaces 35 can include integrated scripts or other software functionality. Model host 31 can engage online learning interface(s) 36 to facilitate ongoing improvements to machine-learned model(s) 1. For instance, online learning interface(s) 36 can be used within reinforcement learning loops to retrieve user feedback on inferences served by model host 31. Model host 31 can access runtime data source(s) 37 for augmenting input(s) 2 with additional contextual information. For instance, runtime data source(s) 37 can include a knowledge graph 37-1 that facilitates structured information retrieval for information associated with input request(s) 33 (e.g., a search engine service). Runtime data source(s) 37 can include public or private, external or local database(s) 37-2 that can store information associated with input request(s) 33 for augmenting input(s) 2. Runtime data source(s) 37 can include account data 37-3 which can be retrieved in association with a user account corresponding to a client 32 for customizing the behavior of model host 31 accordingly.

[0219] Model host 31 can be implemented by one or multiple computing devices or systems. Client(s) 32 can be implemented by one or multiple computing devices or systems, which can include computing devices or systems shared with model host 31.

[0220] For example, model host 31 can operate on a server system that provides a machine-learning service to client device(s) that operate client(s) 32 (e.g., over a local or wide-area network). Client device(s) can be end-user devices used by individuals. Client device(s) can be server systems that operate client(s) 32 to provide various functionality as a service to downstream end-user devices.

[0221] In some implementations, model host 31 can operate on a same device or system as client(s) 32. Model host 31 can be a machine-learning service that runs on-device to provide machine-learning functionality to one or multiple applications operating on a client device, which can include an application implementing client(s) 32. Model host 31 can be a part of a same application as client(s) 32. For instance, model host 31 can be a subroutine or method implemented by one part of an application, and client(s) 32 can be another subroutine or method that engages model host 31 to perform inference functions within the application. It is to be understood that model host 31 and client(s) 32 can have various different configurations.

[0222] Model instance(s) 31-1 can include one or more machine-learned models that are available for performing inference. Model instance(s) 31-1 can include weights or other model components that are stored in persistent storage, temporarily cached, or loaded into high-speed memory. Model instance(s) 31-1 can include multiple instance(s) of the same model (e.g., for parallel execution of more requests on the same model). Model instance(s) 31-1 can include instance(s) of different model(s). Model instance(s) 31-1 can include cached intermediate states of active or inactive model(s) used to accelerate inference of those models. For instance, an inference session with a particular model may generate significant amounts of computational results that can be re-used for future inference runs (e.g., using a KV cache for transformer-based models). These computational results can be saved in association with that inference session so that session can be executed more efficiently when resumed.
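The idea of saving computational results in association with an inference session can be sketched with a toy cache. This is not a transformer KV cache; the class `SessionCache`, its per-token `_encode` placeholder, and the assumption that a resumed session always extends the previously seen token sequence are all hypothetical simplifications.

```python
from typing import Dict, List


class SessionCache:
    """Toy stand-in for a KV cache: stores per-token results per session."""

    def __init__(self) -> None:
        self._states: Dict[str, List[int]] = {}
        self.tokens_computed = 0  # counts how much work was actually done

    def _encode(self, token: str) -> int:
        self.tokens_computed += 1
        return sum(ord(ch) for ch in token)  # placeholder computation

    def run(self, session_id: str, tokens: List[str]) -> List[int]:
        # Assumes `tokens` extends the sequence previously seen for this
        # session; a real system would verify the cached prefix matches.
        cached = self._states.setdefault(session_id, [])
        for token in tokens[len(cached):]:
            cached.append(self._encode(token))
        return list(cached)


cache = SessionCache()
cache.run("session-1", ["a", "b"])
resumed = cache.run("session-1", ["a", "b", "c"])
print(resumed, cache.tokens_computed)
```

When the session is resumed with one additional token, only that token is processed, so three tokens are computed in total rather than five.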

[0223] Compute resource(s) 31-2 can include one or more processors (central processing units, graphical processing units, tensor processing units, machine-learning accelerators, etc.) connected to one or more memory devices. Compute resource(s) 31-2 can include a dynamic pool of available resources shared with other processes. Compute resource(s) 31-2 can include memory devices large enough to fit an entire model instance in a single memory instance. Compute resource(s) 31-2 can also shard model instance(s) across multiple memory devices (e.g., using data parallelization or tensor parallelization, etc.). This can be done to increase parallelization or to execute a large model using multiple memory devices which individually might not be able to fit the entire model into memory.

[0224] Input request 33 can include data for input(s) 2. Model host 31 can process input request 33 to obtain input(s) 2. Input(s) 2 can be obtained directly from input request 33 or can be retrieved using input request 33. Input request 33 can be submitted to model host 31 via an API.
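A minimal sketch of this request handling is given below, assuming a hypothetical request format in which a request either carries its input text directly or names a stored document to be retrieved. The `ModelHost` class, the field names, and the string-reversing stand-in model are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelHost:
    model: Callable[[str], str]
    documents: Dict[str, str]  # hypothetical store for retrievable inputs

    def handle(self, request: dict) -> dict:
        # The input is taken directly from the request when present,
        # otherwise it is retrieved using a reference in the request.
        if "text" in request:
            model_input = request["text"]
        else:
            model_input = self.documents[request["document_id"]]
        output = self.model(model_input)
        # The output payload is based on the model output.
        return {"request_id": request["request_id"], "result": output}


host = ModelHost(model=lambda text: text[::-1], documents={"doc-1": "abc"})
direct = host.handle({"request_id": 1, "text": "hello"})
by_reference = host.handle({"request_id": 2, "document_id": "doc-1"})
print(direct, by_reference)
```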

[0225] Model host 31 can perform inference over batches of input requests 33 in parallel. For instance, a model instance 31-1 can be configured with an input structure that has a batch dimension. Separate input(s) 2 can be distributed across the batch dimension (e.g., rows of an array). The separate input(s) 2 can include completely different contexts. The separate input(s) 2 can be multiple inference steps of the same task. The separate input(s) 2 can be staggered in an input structure, such that any given inference cycle can be operating on different portions of the respective input(s) 2. In this manner, for instance, model host 31 can perform inference on the batch in parallel, such that output(s) 3 can also contain the batch dimension and return the inference results for the batched input(s) 2 in parallel. In this manner, for instance, batches of input request(s) 33 can be processed in parallel for higher throughput of output payload(s) 34.
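Distributing separate inputs across a batch dimension can be sketched as follows. Padding shorter inputs with a `PAD` value so that all rows share one width is an assumption added for illustration, as is the row-summing stand-in for a batched model.

```python
from typing import List

PAD = 0  # assumed padding value for inputs shorter than the batch width


def make_batch(inputs: List[List[int]]) -> List[List[int]]:
    """Place each input on its own row, padded to a common width."""
    width = max((len(row) for row in inputs), default=0)
    return [row + [PAD] * (width - len(row)) for row in inputs]


def batched_model(batch: List[List[int]]) -> List[int]:
    # Stand-in for a model that processes every row in one call and
    # returns outputs that keep the batch dimension.
    return [sum(row) for row in batch]


batch = make_batch([[1, 2, 3], [4], [5, 6]])
outputs = batched_model(batch)
print(batch, outputs)
```

Each output row corresponds to the input request in the same row, so results can be routed back to the requests that produced them.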

[0226] Output payload 34 can include or be based on output(s) 3 from machine-learned model(s) 1. Model host 31 can process output(s) 3 to obtain output payload 34. This can include chaining multiple rounds of inference (e.g., iteratively, recursively, across the same model(s) or different model(s)) to arrive at a final output for a task to be returned in output payload 34. Output payload 34 can be transmitted to client(s) 32 via an API.
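Iterative chaining of inference rounds might be structured like the sketch below. The `chain_inference` helper, the convention that each round returns an output plus a completion flag, the round limit, and the toy halving step are assumptions chosen only to make the loop concrete.

```python
from typing import Callable, Tuple


def chain_inference(step: Callable[[str], Tuple[str, bool]],
                    first_input: str, max_rounds: int = 8) -> str:
    """Feed each round's output into the next until the step reports done."""
    current = first_input
    for _ in range(max_rounds):
        current, done = step(current)
        if done:
            break
    return current


def toy_step(text: str) -> Tuple[str, bool]:
    # Stand-in for one round of inference: halve an even number, and
    # report completion once the number is odd.
    value = int(text)
    if value % 2:
        return text, True
    return str(value // 2), False


final_output = chain_inference(toy_step, "40")
print(final_output)
```

A bound on the number of rounds is included so that a step that never reports completion cannot loop indefinitely.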

[0227] Online learning interface(s) 36 can facilitate reinforcement learning of machine-learned model(s) 1. Online learning interface(s) 36 can facilitate reinforcement learning with human feedback (RLHF). Online learning interface(s) 36 can facilitate federated learning of machine-learned model(s) 1.

[0228] Model host 31 can execute machine-learned model(s) 1 to perform inference for various tasks using various types of data. For example, various different input(s) 2 and output(s) 3 can be used for various different tasks. In some implementations, input(s) 2 can be or otherwise represent image data. Machine-learned model(s) 1 can process the image data to generate an output. As an example, machine-learned model(s) 1 can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an image segmentation output. As another example, machine-learned model(s) 1 can process the image data to generate an image classification output. As another example, machine-learned model(s) 1 can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an encoded image data output (e.g., an encoded and / or compressed representation of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an upscaled image data output. As another example, machine-learned model(s) 1 can process the image data to generate a prediction output.

[0229] In some implementations, the task is a computer vision task. In some cases, input(s) 2 includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
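For the image classification case, a set of per-class scores could be obtained from raw model outputs with a softmax, as in the sketch below. The use of softmax, the example logits, and the class labels are assumptions for illustration; the disclosure does not specify how the scores are computed.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to likelihoods that sum to 1."""
    peak = max(logits)  # subtracting the peak avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float], labels: List[str]) -> Dict[str, float]:
    """Pair each object class with its likelihood score."""
    return dict(zip(labels, softmax(logits)))


labels = ["cat", "dog", "car"]       # hypothetical object classes
scores = classify([2.0, 1.0, 0.1], labels)
best = max(scores, key=scores.get)
print(scores, best)
```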

[0230] In some implementations, input(s) 2 can be or otherwise represent natural language data. Machine-learned model(s) 1 can process the natural language data to generate an output. As an example, machine-learned model(s) 1 can process the natural language data to generate a language encoding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a latent text embedding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a translation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a classification output. As another example, machine-learned model(s) 1 can process the natural language data to generate a textual segmentation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a semantic intent output. As another example, machine-learned model(s) 1 can process the natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, machine-learned model(s) 1 can process the natural language data to generate a prediction output (e.g., one or more predicted next portions of natural language content).

[0231] In some implementations, input(s) 2 can be or otherwise represent speech data (e.g., data describing spoken natural language, such as audio data, textual data, etc.). Machine-learned model(s) 1 can process the speech data to generate an output. As an example, machine-learned model(s) 1 can process the speech data to generate a speech recognition output. As another example, machine-learned model(s) 1 can process the speech data to generate a speech translation output. As another example, machine-learned model(s) 1 can process the speech data to generate a latent embedding output. As another example, machine-learned model(s) 1 can process the speech data to generate an encoded speech output (e.g., an encoded and / or compressed representation of the speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a prediction output.

[0232] In some implementations, input(s) 2 can be or otherwise represent latent encoding data (e.g., a latent space representation of an input, etc.). Machine-learned model(s) 1 can process the latent encoding data to generate an output. As an example, machine-learned model(s) 1 can process the latent encoding data to generate a recognition output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reconstruction output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a search output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reclustering output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a prediction output.

[0233] In some implementations, input(s) 2 can be or otherwise represent statistical data. Statistical data can be, represent, or otherwise include data computed and / or calculated from some other data source. Machine-learned model(s) 1 can process the statistical data to generate an output. As an example, machine-learned model(s) 1 can process the statistical data to generate a recognition output. As another example, machine-learned model(s) 1 can process the statistical data to generate a prediction output. As another example, machine-learned model(s) 1 can process the statistical data to generate a classification output. As another example, machine-learned model(s) 1 can process the statistical data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the statistical data to generate a visualization output. As another example, machine-learned model(s) 1 can process the statistical data to generate a diagnostic output.

[0234] In some implementations, input(s) 2 can be or otherwise represent sensor data. Machine-learned model(s) 1 can process the sensor data to generate an output. As an example, machine-learned model(s) 1 can process the sensor data to generate a recognition output. As another example, machine-learned model(s) 1 can process the sensor data to generate a prediction output. As another example, machine-learned model(s) 1 can process the sensor data to generate a classification output. As another example, machine-learned model(s) 1 can process the sensor data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the sensor data to generate a visualization output. As another example, machine-learned model(s) 1 can process the sensor data to generate a diagnostic output. As another example, machine-learned model(s) 1 can process the sensor data to generate a detection output.

[0235] In some implementations, machine-learned model(s) 1 can be configured to perform a task that includes encoding input data for reliable and / or efficient transmission or storage (and / or corresponding decoding). For example, the task may be an audio compression task. The input may include audio data and the output may comprise compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output comprises compressed visual data, and the task is a visual data compression task. In another example, the task may comprise generating an embedding for input data (e.g., input audio or visual data). In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may comprise a text output which is mapped to the spoken utterance. In some cases, the task comprises encrypting or decrypting input data. In some cases, the task comprises a microprocessor performance task, such as branch prediction or memory address translation.

[0236] In some implementations, the task is a generative task, and machine-learned model(s) 1 can be configured to output content generated in view of input(s) 2. For instance, input(s) 2 can be or otherwise represent data of one or more modalities that encodes context for generating additional content.

[0237] In some implementations, the task can be a text completion task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent textual data and to generate output(s) 3 that represent additional textual data that completes a textual sequence that includes input(s) 2. For instance, machine-learned model(s) 1 can be configured to generate output(s) 3 to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by input(s) 2.

[0238] In some implementations, the task can be an instruction following task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent instructions to perform a function and to generate output(s) 3 that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions.

[0239] In some implementations, the task can be a question answering task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent a question to answer and to generate output(s) 3 that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question.

[0240] In some implementations, the task can be an image generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of image content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent image data that depicts imagery related to the context. For instance, machine-learned model(s) 1 can be configured to generate pixel data of an image. Values for channel(s) associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context).

[0241] In some implementations, the task can be an audio generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of audio content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent audio data related to the context. For instance, machine-learned model(s) 1 can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channel(s) associated with pixels of the image can be selected based on the context. Machine-learned model(s) 1 can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context).
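Selecting a sequence of discrete waveform samples based on a context-dependent probability can be sketched as follows. The quantized amplitude `LEVELS`, the hand-written `toy_probabilities` function that favors levels near the previous sample, and the use of the previous sample as the only context are hypothetical stand-ins for a learned model.

```python
import random
from typing import List

LEVELS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed quantized amplitude levels


def toy_probabilities(previous: float) -> List[float]:
    # Hypothetical context-dependent distribution: levels closer to the
    # previous sample receive higher probability.
    weights = [1.0 / (1.0 + abs(level - previous)) for level in LEVELS]
    total = sum(weights)
    return [w / total for w in weights]


def generate_samples(count: int, seed: int = 0) -> List[float]:
    """Draw each sample from a distribution conditioned on the last one."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples: List[float] = []
    previous = 0.0
    for _ in range(count):
        probabilities = toy_probabilities(previous)
        previous = rng.choices(LEVELS, weights=probabilities, k=1)[0]
        samples.append(previous)
    return samples


waveform = generate_samples(16)
print(waveform)
```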

[0242] In some implementations, the task can be a data generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.). The desired data can be, for instance, synthetic data for training other machine-learned models. The context can include arbitrary data type(s). Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent data that aligns with the desired data. For instance, machine-learned model(s) 1 can be configured to generate data values for populating a dataset. Values for the data object(s) can be selected based on the context (e.g., based on a probability determined based on the context).

Example Computing Systems and Devices

[0243] Figure 15 is a block diagram of an example networked computing system that can perform aspects of example implementations of the present disclosure. The system can include a number of computing devices and systems that are communicatively coupled over a network 49. An example computing device 50 is described to provide an example of a computing device that can perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). An example server computing system 60 is described as an example of a server computing system that can perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). Computing device 50 and server computing system(s) 60 can cooperatively interact (e.g., over network 49) to perform any aspect of the present disclosure (e.g., implementing model host 31, client(s) 32, or both). Model development platform system 70 is an example system that can host or serve model development platform(s) 12 for development of machine-learned models. Third-party system(s) 80 are example system(s) with which any of computing device 50, server computing system(s) 60, or model development platform system(s) 70 can interact in the performance of various aspects of the present disclosure (e.g., engaging third-party tools, accessing third-party databases or other resources, etc.).

[0244] Network 49 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over network 49 can be carried via any type of wired or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), or protection schemes (e.g., VPN, secure HTTP, SSL). Network 49 can also be implemented via a system bus. For instance, one or more devices or systems of Figure 15 can be co-located with, contained by, or otherwise integrated into one or more other devices or systems.

[0245] Computing device 50 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, a server computing device, a virtual machine operating on a host device, or any other type of computing device. Computing device 50 can be a client computing device. Computing device 50 can be an end-user computing device. Computing device 50 can be a computing device of a service provider that provides a service to an end user (who may use another computing device to interact with computing device 50).

[0246] Computing device 50 can include one or more processors 51 and a memory 52. Processor(s) 51 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 52 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 52 can store data 53 and instructions 54 which can be executed by processor(s) 51 to cause computing device 50 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.

[0247] Computing device 50 can also include one or more input components that receive user input. For example, a user input component can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, camera, LIDAR, a physical keyboard or other buttons, or other means by which a user can provide user input.

[0248] Computing device 50 can store or include one or more machine-learned models 55. Machine-learned models 55 can include one or more machine-learned model(s) 1, such as a sequence processing model 4. Machine-learned models 55 can include one or multiple model instance(s) 31-1. Machine-learned model(s) 55 can be received from server computing system(s) 60, model development platform system 70, third-party system(s) 80 (e.g., an application distribution platform), or developed locally on computing device 50. Machine-learned model(s) 55 can be loaded into memory 52 and used or otherwise implemented by processor(s) 51. Computing device 50 can implement multiple parallel instances of machine-learned model(s) 55.

[0249] Server computing system(s) 60 can include one or more processors 61 and a memory 62. Processor(s) 61 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 62 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 62 can store data 63 and instructions 64 which can be executed by processor(s) 61 to cause server computing system(s) 60 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein.

[0250] In some implementations, server computing system 60 includes or is otherwise implemented by one or multiple server computing devices. In instances in which server computing system 60 includes multiple server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.

[0251] Server computing system 60 can store or otherwise include one or more machine-learned models 65. Machine-learned model(s) 65 can be the same as or different from machine-learned model(s) 55. Machine-learned models 65 can include one or more machine-learned model(s) 1, such as a sequence processing model 4. Machine-learned models 65 can include one or multiple model instance(s) 31-1. Machine-learned model(s) 65 can be received from computing device 50, model development platform system 70, third-party system(s) 80, or developed locally on server computing system(s) 60. Machine-learned model(s) 65 can be loaded into memory 62 and used or otherwise implemented by processor(s) 61. Server computing system(s) 60 can implement multiple parallel instances of machine-learned model(s) 65.

[0252] In an example configuration, machine-learned models 65 can be included in or otherwise stored and implemented by server computing system 60 to establish a client-server relationship with computing device 50 for serving model inferences. For instance, server computing system(s) 60 can implement model host 31 on behalf of client(s) 32 on computing device 50. For instance, machine-learned models 65 can be implemented by server computing system 60 as a portion of a web service (e.g., remote machine-learned model hosting service, such as an online interface for performing machine-learned model operations over a network on server computing system(s) 60). For instance, server computing system(s) 60 can communicate with computing device 50 over a local intranet or internet connection. For instance, computing device 50 can be a workstation or endpoint in communication with server computing system(s) 60, with implementation of machine-learned models 65 being managed by server computing system(s) 60 to remotely perform inference (e.g., for runtime or training operations), with output(s) returned (e.g., cast, streamed, etc.) to computing device 50. Machine-learned models 65 can work cooperatively or interoperatively with machine-learned models 55 on computing device 50 to perform various tasks.

[0253] Model development platform system(s) 70 can include one or more processors 71 and a memory 72. Processor(s) 71 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 72 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 72 can store data 73 and instructions 74 which can be executed by processor(s) 71 to cause model development platform system(s) 70 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to model development platform 12. This and other functionality can be implemented by developer tool(s) 75.

[0254] Third-party system(s) 80 can include one or more processors 81 and a memory 82. Processor(s) 81 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. Memory 82 can include one or more non-transitory computer-readable storage media, such as HBM, RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. Memory 82 can store data 83 and instructions 84 which can be executed by processor(s) 81 to cause third-party system(s) 80 to perform operations. The operations can implement any one or multiple features described herein. The operations can implement example methods and techniques described herein. Example operations include the functionality described herein with respect to tools and other external resources called when training or performing inference with machine-learned model(s) 1, 4, 16, 20, 55, 65, etc. (e.g., third-party resource(s) 85).

[0255] Figure 15 illustrates one example arrangement of computing systems that can be used to implement the present disclosure. Other computing system configurations can be used as well. For example, in some implementations, one or both of computing device 50 or server computing system(s) 60 can implement all or a portion of the operations of model development platform system 70. For example, computing device 50 or server computing system(s) 60 can implement developer tool(s) 75 (or extensions thereof) to develop, update/train, or refine machine-learned models 1, 4, 16, 20, 55, 65, etc. using one or more techniques described herein with respect to model alignment toolkit 17. In this manner, for instance, computing device 50 or server computing system(s) 60 can develop, update/train, or refine machine-learned models based on local datasets (e.g., for model personalization/customization, as permitted by user data preference selections).

[0256] Figure 16 is a block diagram of an example computing device 98 that performs according to example embodiments of the present disclosure. Computing device 98 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 98 can implement model host 31. For instance, computing device 98 can include a number of applications (e.g., applications 1 through N). Each application can contain its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. As illustrated in Figure 16, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.

[0257] Figure 17 is a block diagram of an example computing device 99 that performs according to example embodiments of the present disclosure. Computing device 99 can be the same as or different from computing device 98. Computing device 99 can be a user computing device or a server computing device (e.g., computing device 50, server computing system(s) 60, etc.). Computing device 99 can implement model host 31. For instance, computing device 99 can include a number of applications (e.g., applications 1 through N). Each application can be in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).

[0258] The central intelligence layer can include a number of machine-learned models. For example, as illustrated in Figure 17, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of computing device 99.
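As a non-limiting illustration, the routing behavior described for the central intelligence layer can be sketched in Python as follows. The class name `CentralIntelligenceLayer`, its methods, and the stand-in models (plain string functions) are hypothetical and are not drawn from the figures; the sketch only shows a respective per-application model taking precedence over a single shared model.

```python
from typing import Callable, Dict, Optional

# A "model" here is any callable from a request string to a response string.
Model = Callable[[str], str]


class CentralIntelligenceLayer:
    """Hypothetical sketch: manages models on behalf of applications."""

    def __init__(self, shared_model: Optional[Model] = None) -> None:
        self._shared_model = shared_model      # single model for all applications
        self._per_app: Dict[str, Model] = {}   # respective model per application

    def register(self, app_name: str, model: Model) -> None:
        self._per_app[app_name] = model

    def infer(self, app_name: str, request: str) -> str:
        # Common API across applications: prefer the application's own model,
        # otherwise fall back to the shared model.
        model = self._per_app.get(app_name, self._shared_model)
        if model is None:
            raise LookupError(f"no model available for application {app_name!r}")
        return model(request)


layer = CentralIntelligenceLayer(shared_model=str.upper)
layer.register("email", str.title)
print(layer.infer("email", "hello there"))    # handled by the email-specific model
print(layer.infer("browser", "hello there"))  # handled by the shared model
```

In this sketch, an implementation with one model for all applications simply registers no per-application models.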

[0259] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for computing device 99. As illustrated in Figure 17, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).

Additional Disclosure

[0260] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

[0261] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

[0262] Aspects of the disclosure have been described in terms of illustrative embodiments thereof. Any and all features in the following claims can be combined or rearranged in any way possible, including combinations of claims not explicitly enumerated in combination together, as the example claim dependencies listed herein should not be read as limiting the scope of possible combinations of features disclosed herein. Accordingly, the scope of the present disclosure is by way of example rather than by way of limitation, and the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Moreover, terms are described herein using lists of example elements joined by conjunctions such as “and,” “or,” “but,” etc. It should be understood that such conjunctions are provided for explanatory purposes only. Clauses and other sequences of items joined by a particular conjunction such as “or,” for example, can refer to “and/or,” “at least one of,” “any combination of” example elements listed therein, etc. Terms such as “based on” should be understood as “based at least in part on.”

[0263] The term “can” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X can perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.

[0264] The term “may” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X may perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.

Claims

WHAT IS CLAIMED IS:

1. A method comprising: obtaining, by a computing system comprising one or more computing devices, a multimodal multi-index data structure comprising: a first plurality of data entries having at least a first data type; a second plurality of data entries having at least a second data type different from the first data type; and a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value; receiving, by the computing system, a multimodal input comprising a query directed to a machine-learned agent; retrieving, by the computing system based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries; and providing, by the computing system to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.
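For illustration only, and without limiting the claim, the data structure and steps recited in claim 1 might be sketched in Python as below. All names (`DataEntry`, `IndexEntry`, `MultiIndexMemory`, `agent_stub`), the use of embedding vectors as indexing values, and the use of cosine similarity for retrieval are assumptions made for this sketch; the claim does not prescribe them.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Any, Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


@dataclass
class DataEntry:
    entry_id: str
    data_type: str   # illustrative labels, e.g. "image_frame" or "caption"
    payload: Any


@dataclass
class IndexEntry:
    entry_id: str                 # the data entry this index entry correlates
    indexing_value: List[float]   # assumed here to be an embedding vector


@dataclass
class MultiIndexMemory:
    entries: Dict[str, DataEntry] = field(default_factory=dict)
    indexes: Dict[str, List[IndexEntry]] = field(default_factory=dict)

    def add(self, entry: DataEntry, index_name: str, value: List[float]) -> None:
        self.entries[entry.entry_id] = entry
        self.indexes.setdefault(index_name, []).append(IndexEntry(entry.entry_id, value))

    def retrieve(self, query_values: Dict[str, List[float]]) -> Optional[DataEntry]:
        """Return the entry whose indexing value best matches the query in any index."""
        best_id, best_score = None, float("-inf")
        for index_name, query_value in query_values.items():
            for item in self.indexes.get(index_name, []):
                score = cosine(query_value, item.indexing_value)
                if score > best_score:
                    best_id, best_score = item.entry_id, score
        return self.entries[best_id] if best_id is not None else None


def agent_stub(query: str, memory_entry: Optional[DataEntry]) -> str:
    """Stand-in for the machine-learned agent; it only echoes what it was given."""
    recalled = memory_entry.payload if memory_entry else "nothing"
    return f"answer to {query!r} using {recalled!r}"


memory = MultiIndexMemory()
memory.add(DataEntry("f1", "image_frame", "frame showing keys on a desk"), "image", [1.0, 0.0])
memory.add(DataEntry("c1", "caption", "a dog runs in a park"), "text", [0.0, 1.0])
hit = memory.retrieve({"image": [0.9, 0.1], "text": [1.0, 0.0]})
print(agent_stub("where are my keys?", hit))
```

The two index names stand for two index data structures over entries of different data types; a practical system would likely replace the linear scan with an approximate nearest-neighbor index.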

2. The method of claim 1, wherein obtaining the multimodal multi-index data structure comprises: receiving, by the computing system, one or more inputs directed to the machine-learned agent; extracting, by the computing system and based at least in part on the one or more inputs, one or more data entries to be added to the first plurality of data entries or second plurality of data entries; generating, by the computing system and based at least in part on the one or more data entries, one or more index data entries to be added to the plurality of index data entries; and storing, by the computing system, the one or more data entries and the one or more index data entries in the multimodal multi-index data structure.

3. The method of claim 2, wherein the one or more data entries comprise one or more image frames, and wherein generating the one or more index data entries comprises: providing, by the computing system to a machine-learned image embedding model, the one or more image frames; receiving, by the computing system from the machine-learned image embedding model, an image embedding; and generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the image embedding.

4. The method of claim 2, wherein: the one or more data entries comprise one or more captions associated with one or more video segments; extracting the one or more data entries comprises: segmenting the multimodal input to generate one or more video segments; and providing the one or more video segments to a machine-learned video captioning model; and generating the one or more index data entries comprises: providing, by the computing system to a machine-learned language embedding model, the one or more captions; receiving, by the computing system from the machine-learned language embedding model, a language embedding vector; and generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the language embedding vector.

5. The method of claim 2, wherein the one or more data entries comprise data indicative of a natural language input component, and wherein generating the one or more index data entries comprises: providing, by the computing system to a machine-learned natural language embedding model, the data indicative of a natural language input component; receiving, by the computing system from the machine-learned natural language embedding model, a natural language embedding based on the data indicative of the natural language input component; and generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the natural language embedding.

6. The method of claim 2, wherein: the one or more data entries comprise one or more facts extracted from one or more dialogue turns between a user and the machine-learned agent; extracting the one or more data entries comprises providing, by the computing system to a machine-learned fact extraction system, at least a portion of the one or more dialogue turns; and generating the one or more index data entries comprises providing, by the computing system to a machine-learned language embedding model, at least a portion of the one or more facts.

7. The method of claim 2, wherein: the one or more data entries comprise one or more summaries of content of one or more dialogue turns between a user and the machine-learned agent; extracting the one or more data entries comprises providing, by the computing system to a machine-learned language model, at least a portion of the one or more dialogue turns; and generating the one or more index data entries comprises providing, by the computing system to a machine-learned natural language embedding model, at least a portion of the one or more summaries.

8. The method of claim 2, wherein the one or more data entries comprise multimodal data, and generating the one or more index data entries comprises: providing, by the computing system to a multimodal machine-learned model, the multimodal data; receiving, by the computing system and from the multimodal machine-learned model, a multimodal embedding; and generating, by the computing system, an index data entry of the plurality of index data entries based at least in part on the multimodal embedding.
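By way of a hedged example, the extract, index, and store flow of claims 2 through 8 could be sketched in Python as follows. The embedding functions below are deliberately trivial stand-ins for the machine-learned embedding models named in the claims, and the `INDEXERS` registry, the `ingest` helper, and the dictionary layout of `store` are hypothetical names chosen for this sketch.

```python
from typing import Any, Callable, Dict, List

Vector = List[float]


def toy_text_embedding(text: str) -> Vector:
    """Stand-in for a language embedding model: counts of the vowels a, e, i, o, u."""
    lowered = text.lower()
    return [float(lowered.count(vowel)) for vowel in "aeiou"]


def toy_image_embedding(pixels: List[int]) -> Vector:
    """Stand-in for an image embedding model: mean and maximum pixel intensity."""
    return [sum(pixels) / len(pixels), float(max(pixels))]


# One indexer per data type (assumption: each data type feeds its own index).
INDEXERS: Dict[str, Callable[[Any], Vector]] = {
    "image_frame": toy_image_embedding,
    "caption": toy_text_embedding,
    "fact": toy_text_embedding,
    "summary": toy_text_embedding,
}


def ingest(store: Dict[str, Dict], data_type: str, payload: Any) -> str:
    """Store a data entry together with the index data entry generated for it."""
    entry_id = f"{data_type}-{len(store['entries'])}"
    store["entries"][entry_id] = (data_type, payload)
    indexing_value = INDEXERS[data_type](payload)
    store["indexes"].setdefault(data_type, []).append((entry_id, indexing_value))
    return entry_id


store: Dict[str, Dict] = {"entries": {}, "indexes": {}}
caption_id = ingest(store, "caption", "a dog")
frame_id = ingest(store, "image_frame", [0, 10, 20])
print(caption_id, store["indexes"]["caption"])
print(frame_id, store["indexes"]["image_frame"])
```

Upstream extraction steps such as video segmentation, captioning, fact extraction, and summarization are omitted; the sketch starts from entries that have already been extracted.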

9. The method of claim 1, wherein retrieving the first data entry comprises: generating, by the computing system based at least in part on the query, a plurality of indexing values associated with the plurality of index data structures; identifying, by the computing system based at least in part on the plurality of indexing values and based at least in part on the plurality of index data structures, a plurality of candidate data entries; and selecting, by the computing system from the plurality of candidate data entries, the first data entry.

10. The method of claim 9, wherein retrieving the first data entry further comprises: generating, by the computing system, a plurality of scores respectively associated with the plurality of candidate data entries; wherein selecting the first data entry comprises selecting based at least in part on the plurality of scores.

11. The method of claim 10, wherein generating the plurality of scores comprises scoring the candidate data entries according to a common scoring function that is shared among the plurality of index data structures.

12. The method of claim 10, wherein generating the plurality of scores comprises generating based at least in part on a metric of diversity relative to one or more other candidate data entries of the plurality of candidate data entries.

13. The method of claim 10, wherein generating the plurality of scores comprises generating based at least in part on data indicative of a time at which a candidate data entry was obtained.
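One possible reading of claims 10 through 13, offered as a sketch rather than as the claimed scoring method, combines query similarity, a recency term, and a diversity penalty in a single scoring function applied to candidates from every index. The weights, the half-life decay, the greedy selection loop, and the names `Candidate`, `common_score`, and `select` are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Candidate:
    entry_id: str
    vector: List[float]
    timestamp: float  # seconds; time at which the entry was obtained


def cosine(a: List[float], b: List[float]) -> float:
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def common_score(query: List[float], cand: Candidate, now: float,
                 selected: List[Candidate], half_life_s: float = 3600.0,
                 w_time: float = 0.2, w_div: float = 0.5) -> float:
    """One scoring function shared by candidates from all index data structures."""
    similarity = cosine(query, cand.vector)
    recency = 0.5 ** ((now - cand.timestamp) / half_life_s)   # newer entries score higher
    # Diversity: penalize closeness to anything already selected.
    redundancy = max((cosine(cand.vector, s.vector) for s in selected), default=0.0)
    return similarity + w_time * recency - w_div * redundancy


def select(query: List[float], candidates: List[Candidate], now: float,
           k: int = 2) -> List[Candidate]:
    """Greedily pick k candidates, rescoring after each pick."""
    selected: List[Candidate] = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: common_score(query, c, now, selected))
        selected.append(best)
        pool.remove(best)
    return selected


now = 10_000.0
candidates = [
    Candidate("a", [1.0, 0.0], now),           # exact match, fresh
    Candidate("b", [1.0, 0.0], now - 7200.0),  # duplicate of "a", two half-lives old
    Candidate("c", [0.8, 0.6], now),           # less similar, but fresh and different
]
print([c.entry_id for c in select([1.0, 0.0], candidates, now)])
```

With these particular weights, the stale duplicate loses the second slot to the fresher and more diverse candidate; different weights would change that trade-off.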

14. The method of claim 9, wherein generating the plurality of indexing values comprises: segmenting, by the computing system, the multimodal input to extract a segment comprising the query; generating, by the computing system based on the segment, one or more revised queries having one or more formats that are compatible with one or more indexers associated with the plurality of index data structures; and providing, by the computing system, to the one or more indexers, the one or more revised queries to generate the plurality of indexing values.

15. The method of claim 9, wherein retrieving the plurality of candidate data entries comprises retrieving, by the computing system based on a metric of similarity between a first indexing value associated with the query and a plurality of indexing values associated with the plurality of candidate data entries, a plurality of top-k candidate data entries.
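As a minimal sketch of the top-k retrieval recited in claim 15, assuming cosine similarity as the similarity metric and an index held as a list of (entry identifier, indexing value) pairs, the selection could be written in Python as below. The function name `top_k` and the index layout are hypothetical.

```python
import heapq
from math import sqrt
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k(query_value: List[float],
          index_entries: List[Tuple[str, List[float]]],
          k: int) -> List[str]:
    """Return the ids of the k index entries most similar to the query value."""
    scored = ((cosine(query_value, value), entry_id) for entry_id, value in index_entries)
    return [entry_id for _, entry_id in heapq.nlargest(k, scored)]


index = [("x", [1.0, 0.0]), ("y", [0.0, 1.0]), ("z", [0.7, 0.7])]
print(top_k([1.0, 0.0], index, k=2))
```

An exhaustive scan like this one is linear in the index size; larger indexes would typically use an approximate nearest-neighbor structure instead.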

16. The method of claim 1, wherein the computing system comprises: a client device comprising one or more input devices; and a server device comprising one or more of the machine-learned agent and the plurality of index data structures.

17. The method of claim 16, wherein the client device comprises a wearable device configured to capture at least one of audio input and video input.

18. The method of claim 1, wherein the inference output comprises data indicative of one or more application programming interfaces, and further comprising: calling, by the computing system, the one or more application programming interfaces based on the inference output.
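The API-calling step of claim 18 might look like the following Python sketch, assuming the inference output has already been parsed into a list of records with a name and keyword arguments. That record format, the `API_REGISTRY` mapping, and the `set_timer` function are hypothetical examples, not interfaces described in the disclosure.

```python
from typing import Any, Callable, Dict, List


def set_timer(minutes: int) -> str:
    """Hypothetical application programming interface exposed to the agent."""
    return f"timer set for {minutes} min"


# Only interfaces listed here can be called on behalf of the agent.
API_REGISTRY: Dict[str, Callable[..., Any]] = {"set_timer": set_timer}


def call_apis(inference_output: List[Dict[str, Any]]) -> List[Any]:
    """Call each API indicated by the inference output; unknown names yield None."""
    results: List[Any] = []
    for call in inference_output:
        api = API_REGISTRY.get(call["name"])
        if api is None:
            results.append(None)  # unregistered interfaces are skipped, not guessed
            continue
        results.append(api(**call.get("args", {})))
    return results


output = [{"name": "set_timer", "args": {"minutes": 5}}, {"name": "unknown_api"}]
print(call_apis(output))
```

Restricting calls to a registry is a design choice of this sketch, so that an unexpected name in the inference output cannot trigger an arbitrary function.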

19. A computing system comprising one or more processors and one or more non-transitory computer-readable media storing instructions that are executable by one or more processors to cause the computing system to perform operations, the operations comprising: obtaining a multimodal multi-index data structure comprising: a first plurality of data entries having at least a first data type; a second plurality of data entries having at least a second data type different from the first data type; and a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value; receiving a multimodal input comprising a query directed to a machine-learned agent; retrieving, based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries; and providing, to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.

20. One or more non-transitory computer-readable media storing instructions that are executable by one or more processors to cause a computing system to perform operations, the operations comprising: obtaining a multimodal multi-index data structure comprising: a first plurality of data entries having at least a first data type; a second plurality of data entries having at least a second data type different from the first data type; and a plurality of index data structures, wherein each index data structure of the plurality of index data structures comprises a plurality of index data entries each correlating a respective data entry of the first plurality of data entries or second plurality of data entries with a corresponding indexing value; receiving a multimodal input comprising a query directed to a machine-learned agent; retrieving, based at least in part on the query and based at least in part on the plurality of index data structures, a first data entry of the first plurality of data entries or second plurality of data entries; and providing, to the machine-learned agent, the first data entry for generation of an inference output based at least in part on the first data entry.