Display of emerging outputs from machine-learned models

A user interface dynamically visualizes machine-learned model outputs using certainty and tonality scores to improve user understanding and resource efficiency, addressing the challenge of conveying evolving model outputs.

WO2026143036A1 · PCT designated stage · Publication Date: 2026-07-02 · GDM HOLDING LLC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GDM HOLDING LLC
Filing Date
2025-12-22
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing systems struggle to effectively display and convey the evolving nature of outputs from machine-learned models, particularly in terms of confidence and certainty, leading to suboptimal user understanding and resource inefficiencies.

Method used

A user interface is developed to dynamically visualize emerging outputs from machine-learned models by incorporating metric values such as certainty scores and tonality scores, using gradient formatting characteristics to enhance the display of output values, allowing for improved spatiotemporal correspondence and reduced resource usage.

Benefits of technology

The solution provides enhanced user understanding of output progression and quality through intuitive visualization, reducing latency and resource consumption by efficiently conveying multiple channels of information without requiring frequent gaze shifts.

✦ Generated by Eureka AI based on patent content.


Abstract

Systems and methods of the present disclosure can provide for, for each of a sequence of output cycles associated with a machine-learned model, obtaining, by a computing system comprising one or more computing devices, a model output comprising a plurality of output values; obtaining, by the computing system, a plurality of metric values associated with the plurality of output values; determining, by the computing system, a plurality of formatting characteristic values respective to the plurality of output values based on the plurality of metric values; and providing, by the computing system, the formatting characteristic values for use in a user interface that displays the plurality of output values formatted according to the plurality of formatting characteristic values.

Description

DISPLAY OF EMERGING OUTPUTS FROM MACHINE-LEARNED MODELS

PRIORITY CLAIM

[0001] The present application is based on and claims priority to United States Provisional Application Number 63/739,310 having a filing date of December 27, 2024. The present application claims priority to and the benefit of each of such applications and incorporates all such applications herein by reference in their entirety.

FIELD

[0002] The present disclosure relates generally to machine-learning and artificial intelligence systems. More particularly, the present disclosure relates to improved display of emerging outputs from machine-learned models.

BACKGROUND

[0003] Machine learning (ML) is a field of study in artificial intelligence (AI) that allows machines to learn and improve from data without being explicitly programmed. ML uses statistical algorithms to analyze large amounts of data, identify patterns, and make decisions.

SUMMARY

[0004] Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

[0005] Example aspects of the present disclosure provide a computer-implemented method. The method includes obtaining, by a computing system including one or more computing devices, a first output including a plurality of first output values, the first output associated with a first output cycle of a machine-learned model. The method includes obtaining, by the computing system, a plurality of first metric values associated with the plurality of first output values. The method includes determining, by the computing system, a plurality of first formatting characteristic values respective to the plurality of first output values based on the plurality of first metric values. The method includes providing, by the computing system, the first formatting characteristic values for use in a first user interface having the plurality of first output values formatted according to the plurality of first formatting characteristic values. The method includes obtaining, by the computing system, a second output including a plurality of second output values, the second output associated with a second output cycle of the machine-learned model. The method includes obtaining, by the computing system, a plurality of second metric values respective to the plurality of second output values. The method includes determining, by the computing system, a plurality of second formatting characteristic values respective to the plurality of second output values based on the plurality of second metric values. The method includes providing, by the computing system, the second formatting characteristic values for use in updating the first user interface to a second user interface having the plurality of second output values formatted according to the plurality of second formatting characteristic values.

[0006] In some implementations, each metric value of the plurality of first metric values or the plurality of second metric values is between a minimum metric value and a maximum metric value.

[0007] In some implementations, each formatting characteristic value of the plurality of first formatting characteristic values or the plurality of second formatting characteristic values is between a minimum formatting characteristic value and a maximum formatting characteristic value.

[0008] In some implementations, a scale of each formatting characteristic value between the minimum formatting characteristic value and the maximum formatting characteristic value corresponds to a scale of its respective metric value between the minimum metric value and the maximum metric value.

[0009] In some implementations, at least one of the plurality of first formatting characteristic values or the plurality of second formatting characteristic values includes a shading value, the shading value indicative of a shade with which a respective output value of the plurality of first output values or the plurality of second output values is rendered in the first user interface or the second user interface.

[0010] In some implementations, the shading value is between a minimum shading value and a maximum shading value.
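One way to read paragraphs [0006] to [0010] is as a rescaling of each metric value from its range into the shading range. The sketch below assumes a linear correspondence and default ranges of 0.0 to 1.0 for metrics and 0.2 to 1.0 for shades; the disclosure does not fix either choice, and the function name `metric_to_shading` is hypothetical.

```python
def metric_to_shading(metric, metric_min=0.0, metric_max=1.0,
                      shade_min=0.2, shade_max=1.0):
    """Linearly rescale a metric value into the shading range.

    The linear form and the default ranges are illustrative assumptions.
    """
    if metric_max <= metric_min:
        raise ValueError("metric_max must exceed metric_min")
    # Clamp so that out-of-range metrics still yield a valid shade.
    clamped = min(max(metric, metric_min), metric_max)
    fraction = (clamped - metric_min) / (metric_max - metric_min)
    return shade_min + fraction * (shade_max - shade_min)
```

Under these assumed defaults, a metric at the midpoint of its range lands at the midpoint of the shading range, so an output value with certainty 0.5 would be rendered at a shade of about 0.6.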

[0011] In some implementations, at least one of the plurality of first metric values or the plurality of second metric values includes a certainty score associated with a respective output value of the plurality of first output values or the plurality of second output values.

[0012] In some implementations, at least one of the plurality of first metric values or the plurality of second metric values includes a tonality score associated with a tone characteristic respective to an output value of the plurality of first output values or the plurality of second output values.

[0013] In some implementations, the machine-learned model is or includes a block generation model, wherein the block generation model is configured to generate a plurality of outputs including the first output and the second output over a respective plurality of output cycles including the first output cycle and the second output cycle.

[0014] In some implementations, the block generation model is or includes a diffusion model.

[0015] In some implementations, the diffusion model is or includes a text diffusion model.

[0016] In some implementations, the diffusion model is or includes a discrete text diffusion model.

[0017] In some implementations, the diffusion model is or includes a continuous text diffusion model.

[0018] In some implementations, the diffusion model is or includes a discrete diffusion model.

[0019] In some implementations, the machine-learned model includes a sequence processing model, the sequence processing model configured to generate a sequence of output values, and wherein the first output includes a first sequence of the plurality of first output values and the second output includes a second sequence of the plurality of second output values.

[0020] In some implementations, the sequence processing model is or includes one of a large language model (LLM) or a large multimodal model (LMM).

[0021] In some implementations, at least one of the plurality of first output values or the plurality of second output values includes text data.

[0022] In some implementations, the method further includes detecting, by the computing system, a modify interaction with the first user interface by a user during the first output cycle, the modify interaction descriptive of a selection of one or more of the plurality of first output values and a contextual aspect to be associated with the one or more of the plurality of first output values.

[0023] In some implementations, the method further includes providing, by the computing system, contextual input to the machine-learned model at the second output cycle, the contextual input based on the contextual aspect associated with the one or more of the plurality of first output values.

[0024] In some implementations, the method further includes obtaining, by the computing system, the second output from the machine-learned model in response to providing the contextual input.

[0025] In some implementations, the method further includes providing the first output as input to the machine-learned model at the second output cycle.

[0026] In some implementations, providing the contextual input to the machine-learned model includes determining, by the computing system, one or more modified metric values associated with the one or more of the plurality of first output values based on the contextual aspect.

[0027] In some implementations, the one or more modified metric values are provided in place of the first metric values respective to the one or more of the plurality of first output values as input to the machine-learned model at the second output cycle.

[0028] In some implementations, the contextual input is or includes one of a positive indication or a negative indication with respect to the one or more of the plurality of first output values.

[0029] In some implementations, determining the one or more modified metric values includes one of increasing or decreasing the first metric values respective to the one or more of the plurality of first output values in response to the positive indication or the negative indication.
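The metric modification of paragraphs [0026] to [0029] can be sketched as nudging the metric values of the user-selected output values up or down. In this minimal sketch the fixed step size, the clamping range, and the name `apply_feedback` are assumptions made for illustration; the disclosure only states that the values are increased or decreased.

```python
def apply_feedback(metrics, selected, positive, step=0.25, lo=0.0, hi=1.0):
    """Return a copy of metrics with the selected positions nudged up or down.

    metrics:  metric values from the first output cycle
    selected: indices of the output values the user selected
    positive: True for a positive indication, False for a negative one
    The step size and clamping range are illustrative assumptions.
    """
    delta = step if positive else -step
    updated = list(metrics)
    for i in selected:
        # Keep each modified value inside the allowed metric range.
        updated[i] = min(hi, max(lo, updated[i] + delta))
    return updated
```

The returned list could then stand in for the first metric values when the model is invoked at the second output cycle, as paragraph [0027] describes.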

[0030] In some implementations, providing the contextual input to the machine-learned model includes determining, by the computing system, a guidance message based on the contextual input, the guidance message including instructions for the machine-learned model responsive to the contextual aspect.

[0031] In some implementations, providing the contextual input to the machine-learned model includes providing the guidance message as the contextual input to the machine-learned model at the second output cycle.

[0032] In some implementations, the method further includes, prior to detecting the modify interaction with the first user interface, detecting, by the computing system, a suspend interaction with the first user interface during the first output cycle.

[0033] In some implementations, the method further includes, in response to detecting the suspend interaction, suspending, by the computing system, a computing operation of the machine-learned model at the first output cycle.

[0034] In some implementations, the method further includes, prior to obtaining the second output from the machine-learned model, detecting, by the computing system, a resume interaction with the first user interface during suspension of the computing operation of the machine-learned model.

[0035] In some implementations, the method further includes, in response to detecting the resume interaction, resuming the computing operation of the machine-learned model at the second output cycle to cause the machine-learned model to generate the second output.
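The suspend and resume interactions of paragraphs [0032] to [0035] can be pictured as a gate around the model's per-cycle step. The sketch below is illustrative only: `GenerationSession`, `step_fn`, and `tick` are hypothetical names, and a real system would suspend an actual model computation rather than a Python callable.

```python
class GenerationSession:
    """Wraps a per-cycle step function with suspend/resume control."""

    def __init__(self, step_fn, state):
        self.step_fn = step_fn    # hypothetical stand-in for one output cycle
        self.state = state        # current emerging output
        self.suspended = False
        self.cycle = 0

    def suspend(self):
        """Called when a suspend interaction is detected."""
        self.suspended = True

    def resume(self):
        """Called when a resume interaction is detected."""
        self.suspended = False

    def tick(self):
        """Run one output cycle unless suspended; return the current output."""
        if not self.suspended:
            self.state = self.step_fn(self.state)
            self.cycle += 1
        return self.state
```

While the session is suspended, `tick` leaves the output unchanged, which is the window in which a modify interaction could be collected before the next cycle runs.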

[0036] Example aspects of the present disclosure additionally provide a computing system. The computing system includes one or more processors and one or more non-transitory computer-readable media storing instructions that, when implemented, cause the one or more processors to perform operations. The operations include obtaining a first output including a plurality of first output values, the first output associated with a first output cycle of a machine-learned model. The operations include obtaining a plurality of first metric values associated with the plurality of first output values. The operations include determining a plurality of first formatting characteristic values respective to the plurality of first output values based on the plurality of first metric values. The operations include providing the first formatting characteristic values for use in a first user interface having the plurality of first output values formatted according to the plurality of first formatting characteristic values. The operations include obtaining a second output including a plurality of second output values, the second output associated with a second output cycle of the machine-learned model. The operations include obtaining a plurality of second metric values respective to the plurality of second output values. The operations include determining a plurality of second formatting characteristic values respective to the plurality of second output values based on the plurality of second metric values. The operations include providing the second formatting characteristic values for use in updating the first user interface to a second user interface having the plurality of second output values formatted according to the plurality of second formatting characteristic values.

[0037] Example aspects of the present disclosure additionally provide a computing system configured to dynamically visualize an evolving model output by a user interface configured to display one or more output values with a gradient formatting characteristic, wherein the computing system is configured to vary the gradient formatting characteristic of each of the one or more output values along a gradient corresponding to a progression of generation of the evolving model output over one or more output cycles.

[0038] Example aspects of the present disclosure additionally provide a computer-implemented method. The method includes, for each of a sequence of output cycles associated with a machine-learned model: obtaining, by a computing system including one or more computing devices, a model output including a plurality of output values; obtaining, by the computing system, a plurality of metric values associated with the plurality of output values; determining, by the computing system, a plurality of formatting characteristic values respective to the plurality of output values based on the plurality of metric values; and providing, by the computing system, the formatting characteristic values for use in a user interface that displays the plurality of output values formatted according to the plurality of formatting characteristic values.
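The per-cycle method of paragraph [0038] can be sketched as a loop that pairs each output value with a formatting characteristic value derived from its metric value. The sketch assumes a one-to-one correspondence between output values and metric values, which the disclosure describes for only some implementations; `format_cycles` and the sample data are hypothetical.

```python
def format_cycles(cycles, to_format):
    """Yield one list of (output_value, formatting_value) pairs per output cycle.

    cycles:    iterable of (output_values, metric_values) tuples, one per cycle
    to_format: callable mapping a metric value to a formatting characteristic
    """
    for values, metrics in cycles:
        if len(values) != len(metrics):
            raise ValueError("expected one metric value per output value")
        yield [(v, to_format(m)) for v, m in zip(values, metrics)]


# Two made-up output cycles in which certainty rises and one value is revised.
cycles = [
    (["the", "cat", "sat"], [0.5, 0.25, 0.25]),
    (["the", "dog", "sat"], [1.0, 0.75, 0.5]),
]
# Here the formatting characteristic is simply a percentage shade.
frames = list(format_cycles(cycles, lambda m: round(m * 100)))
```

Each element of `frames` is what a user interface would receive for one output cycle, so the display can be updated frame by frame as the output emerges.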

[0039] Example aspects of the present disclosure additionally provide a computer-implemented method. The method includes, for each of a sequence of output cycles associated with a machine-learned text diffusion model: obtaining, by a computing system including one or more computing devices, a model output from the machine-learned text diffusion model, the model output including a plurality of text tokens; obtaining, by the computing system, a plurality of confidence scores respectively associated with the plurality of text tokens; determining, by the computing system, a plurality of shading values respectively associated with the plurality of text tokens based on the plurality of confidence scores; and providing, by the computing system, the plurality of text tokens for use in a user interface that displays the plurality of text tokens rendered according to the plurality of shading values representative of the plurality of confidence scores respectively associated with the plurality of text tokens.
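As an illustration of the token rendering in paragraph [0039], the sketch below emits HTML in which each token's CSS opacity stands in for its shading value. Using HTML, using opacity as the shade, the minimum opacity of 0.25, and the name `render_tokens` are all assumptions; the disclosure does not name a rendering technology.

```python
from html import escape


def render_tokens(tokens, confidences, min_opacity=0.25):
    """Render tokens as HTML spans whose opacity reflects confidence.

    Opacity is an assumed stand-in for the shading value; low-confidence
    tokens stay faintly visible because of min_opacity.
    """
    spans = []
    for token, conf in zip(tokens, confidences):
        conf = min(1.0, max(0.0, conf))  # clamp to the [0, 1] range
        opacity = min_opacity + conf * (1.0 - min_opacity)
        spans.append(
            f'<span style="opacity:{opacity:.2f}">{escape(token)}</span>'
        )
    return " ".join(spans)
```

Because the shade is carried by the token itself, the confidence information occupies no additional display space beyond the text.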

[0040] Other example aspects of the present disclosure are directed to other systems, methods, apparatuses, tangible non-transitory computer-readable media, and devices for performing functions described herein. These and other features, aspects, and advantages of various implementations will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate implementations of the present disclosure and, together with the description, help explain the related principles.

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] FIGS. 1A - 1C are block diagrams of example networked computing systems according to example implementations of aspects of the present disclosure;

[0042] FIG. 2 is a block diagram of an example agent system according to example implementations of aspects of the present disclosure;

[0043] FIGS. 3A - 3C are example user interfaces according to example implementations of aspects of the present disclosure.

[0044] FIGS. 4A - 4C are example user interfaces according to example implementations of aspects of the present disclosure.

[0045] FIG. 5 is a flow chart diagram illustrating example methods for implementing an agent system according to example implementations of aspects of the present disclosure;

[0046] FIG. 6 is a flow chart diagram illustrating an example method for training a machine-learned model according to example implementations of aspects of the present disclosure;

[0047] FIG. 7 is a block diagram of an example processing flow for using machine-learned model(s) to process input(s) to generate output(s) according to example implementations of aspects of the present disclosure;

[0048] FIG. 8 is a block diagram of an example sequence processing model according to example implementations of aspects of the present disclosure;

[0049] FIG. 9 is a block diagram of an example technique for populating an example input sequence for processing by a sequence processing model according to example implementations of aspects of the present disclosure;

[0050] FIG. 10 is a block diagram of an example model development platform according to example implementations of aspects of the present disclosure;

[0051] FIG. 11 is a block diagram of an example training workflow for training a machine-learned model according to example implementations of aspects of the present disclosure; and

[0052] FIG. 12 is a block diagram of an inference system for operating one or more machine-learned model(s) to perform inference according to example implementations of aspects of the present disclosure.

DETAILED DESCRIPTION

[0053] Systems and methods described herein can provide for improved display of emerging outputs from machine-learned models. An emerging output refers to a progressively-generated output of a machine-learned model that is provided over a plurality of iterations, or output cycles. For example, an emerging output can include a first output at a first output cycle, a second output at a second output cycle, and so on. The emerging output may also include a final output at a final output cycle. The final output can occur when a model has converged on a desired output, such as an output that the model expects to adequately respond to input that prompted generation of the emerging output. In some implementations, each output of the emerging output may depend in some manner on prior output(s) of the emerging output. For instance, some example machine-learned models can provide an emerging output that emerges or evolves from an initial (e.g., noisy) input and / or initial output into a final output over a plurality of output cycles or iterations of the machine-learned model.

[0054] The systems and methods described herein can provide for improved display of these emerging outputs by providing a user interface capable of displaying and modifying the emerging output as it changes over time and displaying aspects about the generation of the emerging output in a readily understandable fashion. For instance, a metric value associated with each output value in the output can be used to alter a gradient formatting characteristic of how the output value is displayed to convey generation progress of the output value relative to the entire output. As an example, the metric value can be a confidence or certainty score associated with the output value.

[0055] For example, in some implementations, an initial output is generated by a machine-learned model. The initial output may be refined over subsequent computer cycles or iterations to improve responsiveness to the user’s input. This can provide for an iterative refinement process of the emerging output. For instance, at subsequent output cycles, the outputs may provide improved quality and / or alignment in relation to user input. In one example aspect, for instance, the initial output may be refined over subsequent output cycles to improve confidence scores or certainty scores associated with the output values of subsequent outputs and / or the subsequent outputs as a whole relative to confidence scores or certainty scores associated with the output values of the initial output and / or the initial output as a whole. The model may consider some or all of its output at a first output cycle when generating output at a second output cycle. For instance, the model may utilize some or all of its prior outputs as context for the output at a given output cycle. For example, the model may revise lower-certainty output values or blocks of output values based on contextual aspects available in the output as a whole. Because the validity of an output value may depend not only on preceding output values but also subsequent output values, these models can provide improved accuracy and output quality when considering the output as a whole.

[0056] As used herein, an “output cycle” can include one or more iterations of a machine-learned model. For instance, in some implementations, each iteration of a machine-learned model can trigger an output cycle such that display of the emerging output is updated at each iteration. As another example, in some implementations, an output cycle can include a plurality of iterations of a machine-learned model, such that display of the emerging output is only updated after some number of iterations of the machine-learned model. This can provide, for example, that display of the emerging output is updated in a manner that is delayed enough for a viewer to comprehend the changes in the emerging output and / or to conserve computing resources by only updating after some significant change in probabilities and / or model output.
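The relationship between model iterations and output cycles in paragraph [0056] can be sketched as a predicate that decides when the display should be refreshed. Combining a fixed iteration interval with a change threshold, the default values, and the name `should_update` are assumptions; the disclosure mentions both triggers without prescribing how they interact.

```python
def should_update(iteration, prev_metrics, new_metrics,
                  every_n=4, min_change=0.1):
    """Decide whether this model iteration should close an output cycle.

    Returns True every `every_n` iterations, or earlier when any metric
    value moved by at least `min_change` since the last displayed output.
    Both thresholds are illustrative assumptions.
    """
    if iteration % every_n == 0:
        return True
    largest = max(
        (abs(n - p) for p, n in zip(prev_metrics, new_metrics)),
        default=0.0,
    )
    return largest >= min_change
```

With `every_n=1` this reduces to the first case in the paragraph above, in which every iteration triggers an output cycle.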

[0057] As another example, in some implementations, the initial output can be generated by a second machine-learned model (e.g., a draft model) and refined by the described machine-learned model over a plurality of output cycles. As yet another example, in some implementations, the initial state of the machine-learned model may be a noisy state. For instance, if the machine-learned model is autoregressive, inputs to the machine-learned model at an initial output cycle may be or may include noise values.

[0058] In this manner, these emerging outputs provide a powerful approach for generating output values in response to user input. The output from the model at each output cycle provides rich information including not only the response to the user’s input, but also about the status of the output, how noisy the response is, how close the response is to completion, and / or which portions of the response are most likely to change over future iterations. For instance, as the model progresses from an initial (e.g., completely noisy) output to an ultimate output, the output of the model can indicate respective amounts of certainty or noise associated with each output value and / or the overall output. It can be desirable for a user interface (e.g., a graphical user interface) to provide the capability of displaying this rich information in a manner that improves the latency of human understanding of the information available from the output.

[0059] Example aspects of the present disclosure provide systems and methods for improved display of emerging outputs from machine-learned models. For instance, the systems and methods described herein can improve the information conveyance capabilities of displayed user interfaces, such as by providing for users to readily understand multiple channels of information relating to an output value without requiring the user to change the user’s point of focus or move the user's gaze. For example, aspects of the present disclosure can provide improved spatiotemporal correspondence between multiple channels of information relating to an output value compared to some existing approaches, such as, for example, previous outputs, changes compared to previous outputs, confidence or certainty of the present output, areas that will likely be changed in future outputs, and so on. Furthermore, this information can be displayed requiring less space on a display device and / or fewer pixels per output value than some existing approaches.

[0060] In an aspect, the systems and methods described herein can provide for obtaining (e.g., by a computing system including one or more computing devices) an output including one or more output values. In some implementations, the output can include a plurality of output values. An output value can be a portion of a larger response or answer produced by the machine-learned model. For instance, in some example implementations, the model can generate a response, such as a block of text or passage of text, including the plurality of output values. The output value can be any suitable output value, such as, for example, a text output value including text data, such as a word, phrase, sentence, letter, or other suitable delineation of text, a numerical value (e.g., an integer value, a floating point value, etc.), such as a pixel value of image data and / or an audio channel value of audio data, and / or any other suitable value. In some implementations, for example, the output value can be a token (e.g., a text data token).

[0061] In some implementations, the emerging output can be obtained from a machine-learned model. The machine-learned model can be any suitable model. Furthermore, in some implementations, the emerging output can be obtained from an agent system. An agent system (e.g., an “artificial intelligence agent” or “AI agent”) can employ one or more machine-learned models to generate outputs responsive to queries from users. As one example, an agent system can be or can include a computing system including one or more machine-learned models, where the computing system is configured to receive an input from a user device or calling device and provide an output including a plurality of output values that are responsive to the input to the user device or calling device. As one example, the plurality of output values can be or can include text data such as words, and / or the plurality of output values can collectively form a passage or block of text that is displayed to the user. The plurality of output values can, for instance, collectively describe an answer or response to the user’s input. Furthermore, in some implementations, the plurality of output values can be or can include (e.g., or can be converted to) spoken language data such as audio data that can be spoken to the user. In some implementations, the machine-learned model(s) can output values, such as confidence scores or certainty scores, tonality scores, and / or other suitable scores and / or values, which are respectively associated with the plurality of output values. For instance, a user can interact with an agent system in a similar manner to interacting with a human assistant, providing for improved information conveyance compared to interacting with some computing system directly. The agent system may, for example, receive user input, optionally process the user input into a format suitable for a machine-learned model, provide the user input to the machine-learned model, and / or receive the emerging output from the machine-learned model over a plurality of output cycles.

[0062] In some implementations, the agent system can be or can implement a multimodal agent (e.g., a multi-modal artificial intelligence agent). A multi-modal agent can process inputs from one or more data modalities. In some implementations, the agent system can be implemented as a “situated agent”. The term situated agent refers to a setting in which the agent system shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and / or textual data which are also observable by the human user. The agent system can process these inputs to generate responses that are contextually relevant for the user’s physical or digital environment, for example enabling the agent system to generate dialogue or other responses or outputs which assist the user in understanding and / or navigating the environment.

[0063] In an aspect, the systems and methods described herein can provide obtaining (e.g., by the computing system), a metric value. The metric value can be associated with an output value of the output. For instance, in some implementations, the systems and methods described herein can obtain a plurality of metric values respective to a plurality of output values in an output. For instance, in some implementations, the plurality of metric values are respective to a plurality of output values in a one-to-one correspondence. As another example, in some implementations, a plurality of output values can be associated with a single rendering value and / or a plurality of rendering values can be associated with a single output value. The metric value can be any suitable value, such as a scalar value, a numerical value, a floating point value, a classification value (e.g., a discrete classification or an ordered classification along a range of classification values), or other suitable values. In some implementations, the metric value can be within a range of possible metric values. For example, in some implementations, the metric value can be between a minimum metric value and a maximum metric value.

[0064] In some implementations, the metric values can be generated by the machine-learned model. For instance, the metric values can be included in the output of the machine-learned model. Additionally and / or alternatively, in some implementations, the metric values can be generated by one or more machine-learned models other than the model, such as, for example, adversarial models, discriminator models, classifier models, or other suitable models.

[0065] As one example, in some implementations, a metric value can be a certainty score or confidence score associated with a respective output value. The certainty score can be representative of a certainty of the model associated with the output value, such as, for example, a certainty that the output value is correct, a certainty that the output value will be included in the final output, a certainty that the output value is a best choice among a plurality of possible choices, or other certainty that is generally reflective of a confidence in the output value as output. For example, in some implementations, the model is configured to select each output value from a plurality of candidate output values based on a probability distribution associated with the plurality of candidate output values. For instance, the model can select an output value having a highest probability. The certainty score can be determined based on the probability distribution associated with the plurality of candidate output values. For example, in some implementations, the certainty score can be the probability of the probability distribution that is associated with the selected output value from the plurality of candidate output values. As one example, a per-token softmax layer can be applied to each denoising step to obtain the certainty score. Output values with higher probabilities can therefore have a higher associated metric value and / or output values with lower probabilities can therefore have a lower associated metric value, for example. Still further, in some implementations, an additional machine-learned model other than the model that generates the emerging output (e.g., a critique model) may generate the certainty score.
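The softmax-based certainty score described in paragraph [0065] can be sketched for a single output position as follows. The sketch assumes the model exposes one raw score (logit) per candidate output value; `softmax` and `select_with_certainty` are illustrative helper names, not part of any particular library.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_with_certainty(candidates, logits):
    """Pick the most probable candidate and report its probability.

    The returned probability plays the role of the certainty score for
    the selected output value.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best], probs[best]
```

Applying this at every position of every denoising step would yield one certainty score per output value per cycle, which is the input the formatting step needs.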

[0066] As another example, in some implementations, a metric value can be a tonality score associated with the output value. The tonality score can be representative of a tone of voice or tone characteristic associated with the output value, such as, for example, a degree to which the output value contributes to a tone of a passage of text including the output value. The tone can be a classification value or other attribute that is defined or learned based on human interaction such as, for example, a serious tone, a playful tone, a lighthearted tone, a professional-etiquette tone, a sorrowful tone, and / or other suitable tones. The tonality score may be output by the model. For instance, the model may be trained to associate output values or combinations of output values with particular tonalities during and / or after prediction of the output values themselves. As another example, in some implementations, a model other than the model (e.g., a classifier model or a discriminator model) may be configured to generate the tonality score.

[0067] In some implementations, the machine-learned model can be a block generation model. As used herein, a block generation model refers to a model configured to produce the output in “blocks” over a plurality of output cycles. Each “block” can be associated with a particular output cycle and can include a plurality of output values. For instance, the output of the block generation model at each output cycle of the plurality of output cycles can include a plurality of output values (e.g., and / or a plurality of metric values respective to the plurality of output values). For instance, in some implementations, the plurality of output values can be produced “concurrently,” e.g., wherein each output value is output at a same iteration or output cycle of the block generation model. For instance, an output cycle can be an initial generation cycle (e.g., where initial values of the plurality of output values are generated) and / or an update cycle (e.g., where values of the plurality of output values are refined based on information available at a previous cycle). One example block generation model is a diffusion model, such as a text diffusion model. In some implementations, the block generation model may be autoregressive (e.g., a block autoregressive generation model) or semi-autoregressive. For instance, an output of the block autoregressive generation model at a first output cycle may depend on the output of the block autoregressive generation model at a second output cycle that precedes the first output cycle.

[0068] For example, one example machine-learned model that can be employed herein is referred to as a Step-unrolled Denoising Autoencoder for Text Generation (SUNDAE) model. See, e.g., Savinov et al., Step-unrolled Denoising Autoencoders for Text Generation, ARXIV:2112.06749v3 (Apr. 19, 2022). Another example model that may be employed herein is a diffusion model, which provides continuous-time modeling of categorical data. See, e.g., Dieleman et al., Continuous Diffusion for Categorical Data, ARXIV:2211.15089v3 (Dec. 15, 2022). Continuous diffusion models can be utilized to generate image data as well as flexible and scalable implementations for both conditional and unconditional text generation. See, e.g., Strudel et al., Self-conditioned Embedding Diffusion for Text Generation, ARXIV:2211.04236 (Nov. 8, 2022). Additionally and / or alternatively, masked diffusion can be used for generative modeling of discrete data. See, e.g., Shi et al., Simplified and Generalized Masked Diffusion for Discrete Data, ARXIV:2406.04329v2 (Dec. 4, 2024). Furthermore, text diffusion models can be utilized for the training and deployment of large language models (LLMs) and other models through transfer learning. See, e.g., Han et al., Transfer Learning for Text Diffusion Models, ARXIV:2401.17181 (Jan. 30, 2024).

[0069] In some implementations, the model may be designed or configured to operate over “blocks” of output values (e.g., text output values) having a length, such as a fixed length. For instance, in some example implementations, a text diffusion model can diffuse or denoise over a fixed number of output values. However, responses produced by the model may not necessarily always require the fixed number of output values to appropriately answer the user’s input. According to example aspects of the present disclosure, a model may produce a fixed-length block of output values, where the length defines a number of output value positions that correspond to output values in the output of the model. The model may further be operable to predict an end of sequence (EOS) output value at each output value position. Output values occurring after the position of the EOS output value can be, for example, a null output value, an empty output value, or another suitable “no-information” output value (e.g., whitespace). As one example, the model can predict a maximum probability associated with the null output value at each output value position after the EOS output value.
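The fixed-length block with an EOS output value can be sketched as follows. This is an illustrative assumption about data layout only: the `EOS` and `NULL` marker strings and both helper names are hypothetical and are not taken from the disclosure.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker
NULL = ""      # hypothetical "no-information" output value


def pad_block(values, length):
    """Place values in a fixed-length block, followed by EOS and null padding."""
    if len(values) + 1 > length:
        raise ValueError("block length too small for values plus EOS")
    return list(values) + [EOS] + [NULL] * (length - len(values) - 1)


def visible_values(block):
    """Return only the output values that precede the first EOS marker."""
    if EOS in block:
        return block[: block.index(EOS)]
    return list(block)


block = pad_block(["Hello", "world"], 6)
shown = visible_values(block)
```

A user interface built on this layout would render only the values returned by `visible_values` and ignore the positions after the EOS marker.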

[0070] In some implementations, the length of the block of output values may be fixed for each evaluation of the model. For instance, in some implementations, the length may be set at a value that is high enough to cover most or all use cases of the model, such as one thousand output values. As another example, in some implementations, the model and / or another model may predict the length of the block of output values based on the user’s input, then generate a block of output values having the predicted length. For example, the model may be trained to predict an approximate number of output values that the response to the user’s query will include and may determine the length based on the predicted approximate number of output values.

[0071] As another example, in some implementations, the machine-learned model can be or can include a sequence processing model. The sequence processing model can be configured to individually generate a sequence of output values (e.g., and / or respective metric values). For instance, the output at each iteration or output cycle can be or can include the plurality (e.g., sequence) of output values. Furthermore, in some implementations, each sequence of output values can be associated with a respective set (e.g., sequence) of metric values. Examples of a sequence processing model include, but are not limited to, a large language model (LLM) or a large multimodal model (LMM). For instance, the sequence processing model may generate output values in a sequential manner. The sequence processing model may not provide a fixed-length block of output values but rather generate each output value as evaluation of the model progresses. The EOS output value may signal the model to cease generation of output values. The output of the model may have fewer output values than a maximum length, for instance. Iterations of the sequence processing model (e.g., output cycles) can refer to multiple “passes” over the sequence of output values, where output values in the sequence from one output cycle and / or their respective metric values are provided as input in a subsequent output cycle to improve the model’s understanding about the entire sequence of output values at the subsequent output cycle.

[0072] In an aspect, the systems and methods described herein can provide for determining (e.g., by the computing system) a formatting characteristic value associated with each output value based on the metric value respective to the output value. The formatting characteristic value can be a value that is interpretable by a computing system (e.g., a graphical user interface) to convey information to the computing system regarding a formatting characteristic with which the output value is to be rendered. For example, formatting characteristic values can provide indications regarding text font, text boldness or italics, color, shading, position, size, and other aspects of rendered interface elements corresponding to the output values.

[0073] In some implementations, the formatting characteristic can be a gradient formatting characteristic that defines a gradient characteristic of the formatted output values. For instance, in some implementations, the formatting characteristic value can be within a range of possible formatting characteristic values. The gradient characteristic of the formatted output values can vary over the range of formatting characteristic values. For example, the formatting characteristic may be displayed more prominently for formatting characteristic values toward one end of the range than those toward another end of the range. For instance, in some implementations, the formatting characteristic value can be between a minimum formatting characteristic value and a maximum formatting characteristic value. Furthermore, in some implementations, a scale of the formatting characteristic value between the minimum formatting characteristic value and the maximum formatting characteristic value can correspond to a scale of the metric value between the minimum metric value and the maximum metric value. For instance, in some implementations, the formatting characteristic value can be a scalar value or other numerical value, such as a value correlated on a range of possible formatting characteristic values to the metric value along its own range of possible metric values. For example, a ratio of metric value to minimum metric value or maximum metric value can be equivalent to a ratio of formatting characteristic value to minimum formatting characteristic value or maximum formatting characteristic value.
As one example, in some implementations, determining the formatting characteristic value based on the metric value can include determining a ratio based on the metric value and at least one of the minimum metric value and the maximum metric value and determining the formatting characteristic value based on a corresponding ratio between the formatting characteristic value and at least one of the minimum formatting characteristic value and the maximum formatting characteristic value. For instance, the metric value can be mapped to a formatting characteristic value between the minimum formatting characteristic value and the maximum formatting characteristic value in a linear, logarithmic, or other conversion manner based on the corresponding range between the minimum metric value and the maximum metric value. For example, in some implementations, a metric value that is 50% confidence (e.g., between a minimum metric value of 0% confidence and a maximum metric value of 100% confidence) may be mapped to a midpoint between a minimum formatting characteristic value and a maximum formatting characteristic value (e.g., 50% shading or brightness). Additionally and / or alternatively, a metric value of 75% confidence may be mapped to a formatting characteristic value that is closer to the maximum formatting characteristic value than the minimum formatting characteristic value.
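The range-to-range mapping described above can be sketched as below. This is one possible reading under stated assumptions: the function name `map_metric`, the clamping of out-of-range metric values, and the particular logarithmic curve (base 10 over a rescaled ratio) are illustrative choices, not details from the disclosure.

```python
import math


def map_metric(metric, metric_min, metric_max, fmt_min, fmt_max, mode="linear"):
    """Map a metric value onto the formatting characteristic range.

    The metric's relative position in [metric_min, metric_max] is reused as the
    relative position in [fmt_min, fmt_max], optionally through a log curve.
    """
    if metric_max <= metric_min:
        raise ValueError("metric_max must exceed metric_min")
    clamped = min(max(metric, metric_min), metric_max)
    ratio = (clamped - metric_min) / (metric_max - metric_min)
    if mode == "log":
        # Assumed curve: 0 maps to 0 and 1 maps to 1, with low values boosted.
        ratio = math.log1p(9.0 * ratio) / math.log(10.0)
    elif mode != "linear":
        raise ValueError(f"unknown mode: {mode}")
    return fmt_min + ratio * (fmt_max - fmt_min)


midpoint = map_metric(0.5, 0.0, 1.0, 0.0, 100.0)        # 50% confidence
upper = map_metric(0.75, 0.0, 1.0, 0.0, 100.0)          # 75% confidence
boosted = map_metric(0.5, 0.0, 1.0, 0.0, 100.0, "log")  # log conversion
```

With the linear mode, 50% confidence lands at the midpoint of the formatting range and 75% confidence lands closer to the maximum, matching the examples in the paragraph above.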

[0074] For example, in some implementations, the formatting characteristic value can be a shading value. The shading value can instruct the computing system on a shade or degree of shading to render the output value within a user interface. The shading can represent, for example, how intensities of one or more color channels associated with rendering the output value are adjusted. For example, at a maximum shading value, the color channels may not be adjusted such that the output value is rendered with a maximum shade or a maximum brightness or (e.g., exactly) as the color channels define. As another example, at a minimum shading value, the color channels may be adjusted such that the output value is not rendered at all, or rendered with some minimum shade or minimum brightness such that the output value is significantly less perceptible than output values rendered according to the maximum shading value. For example, the rendered output value element can range from being near or identical to a background color of the user interface at minimum shading values to fully saturated at a target color (e.g., a specified font color) at maximum shading values. It should be understood that “minimum” and “maximum” are used herein to denote endpoints of a range of possible values for the purposes of illustration only, and similar endpoints or effective endpoints can fall within the use of “minimum” and “maximum” herein.
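One way to realize a shading value, offered only as a sketch, is to interpolate each color channel between the background color and the target color. The function name `shade_color`, the 0-to-1 shading range, and the RGB tuples are assumptions made for illustration.

```python
def shade_color(target_rgb, background_rgb, shading):
    """Blend from the background color (shading 0) to the target color (shading 1)."""
    s = min(max(shading, 0.0), 1.0)  # clamp to the assumed 0..1 shading range
    return tuple(
        round(b + s * (t - b)) for t, b in zip(target_rgb, background_rgb)
    )


# Hypothetical dark-red font color on a black background.
invisible = shade_color((200, 0, 0), (0, 0, 0), 0.0)
half = shade_color((200, 0, 0), (0, 0, 0), 0.5)
full = shade_color((200, 0, 0), (0, 0, 0), 1.0)
```

At the minimum shading value the text matches the background and is effectively not rendered, while at the maximum it appears exactly in the specified font color.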

[0075] As another example, in some implementations, the formatting characteristic value can be a color value. The color value can instruct the computing system on a color to render the output value within a user interface. For example, the color value may be a named color and / or pixel intensity values descriptive of the desired color. In some implementations, for example, colors may be associated with respective tonalities of text output values. For example, a color such as red may be associated with text conveying an authoritative tone, whereas a color such as orange may be associated with text conveying a happy tone, or any of a number of different color-tone associations. Furthermore, in some implementations, the color value may be combined with the aforementioned shading value to indicate a degree to which a given output value conveys the tonality. For example, a darker red output value may suggest more authoritative text than a lighter red output value. In this manner, the user can rapidly understand a variety of information about a given output value without being required to shift the user’s gaze to other user interface elements to ascertain the information.

[0076] In an aspect, the systems and methods described herein can provide for generating (e.g., by the computing system) a user interface having the output value(s) of the output arranged according to a plurality of output value positions and formatted according to the respective formatting characteristic value(s). For instance, the user interface can define output value positions that are configured to receive, store, and / or otherwise accept values or data from the output value(s) of the output. As one example, the output value positions may receive characters or words defined by the output value(s). As another example, the output value positions may be placeholder elements in the user interface. For instance, in some implementations, generating the user interface can include replacing the placeholder elements with values defined by the output value(s) of the output. In some implementations, generating the user interface can include generating data that is implementable by a computing system to cause display of the user interface. Such data can include, for example, Extensible Markup Language (XML) data, Hypertext Markup Language (HTML) data, text document data, slide show data, spreadsheet data, and / or any other suitable data. For example, in some implementations, the output value position can define a text element in a larger webpage or document, where the text element displays an output value and is formatted according to a formatting characteristic based on the formatting characteristic value respective to the output value.
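As a hedged sketch of generating HTML data for the user interface, each output value position can be emitted as a text element whose style reflects its formatting characteristic value. The `pos-N` element identifiers, the use of CSS `opacity` as the shading mechanism, and the function names are assumptions for illustration only.

```python
from html import escape


def render_span(position, value, shading):
    """Render one output value position as an HTML span with a shading style."""
    opacity = min(max(shading, 0.0), 1.0)
    return (
        f'<span id="pos-{position}" style="opacity:{opacity:.2f}">'
        f"{escape(value)}</span>"
    )


def render_block(values, shadings):
    """Render all output value positions, in order, separated by spaces."""
    return " ".join(
        render_span(i, v, s) for i, (v, s) in enumerate(zip(values, shadings))
    )


markup = render_block(["Hello", "a<b"], [1.0, 0.5])
```

Because each span carries a stable identifier tied to its position, a later output cycle could replace the contents and style of an individual position without regenerating the surrounding document.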

[0077] In some implementations, the systems and methods described herein can provide for rendering the output value as well as one or more alternate output values that represent an alternative to the output value in the output of the machine-learned model. For instance, the alternate output values may be displayed in response to a user gesture (e.g., by hovering over the output value), as alternatives (e.g., in a stacked arrangement or similar adjacent arrangement), or in another suitable manner. For example, in some implementations, the formatting characteristic can be a size with which the output values are to be rendered, and the metric value respective to each output value can represent a certainty score associated with each output value. In this manner, a more certain output value can be displayed at a larger size and adjacent to its alternatives, which are displayed with a smaller size. In some implementations, the systems and methods described herein may determine to render the alternate output values based on some ratio or relationship between the respective metric values of the output value and the alternate output value. For example, a computing system may determine not to render alternate output values for output values having a great disparity in metric value (e.g., certainty), such as in cases where these output values are highly separated from the next best alternative output value. As another example, the computing system may determine to render alternate output values for output values that are close in metric value to at least one alternate output value. For example, in some implementations, determining to render an alternate output value can include determining a metric value ratio based on the metric value associated with the output value and the metric value associated with the alternate output value and determining to render the alternate output value if the metric value ratio satisfies a metric value ratio range or threshold.
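The ratio test for rendering an alternate output value can be sketched as follows. The function name, the direction of the ratio (alternate over primary), and the default threshold of 0.5 are hypothetical choices; the disclosure does not fix a particular ratio or threshold.

```python
def should_render_alternate(primary_metric, alternate_metric, min_ratio=0.5):
    """Decide whether an alternate output value is close enough to be shown.

    The alternate is rendered when its metric value is at least `min_ratio`
    times the metric value of the selected output value.
    """
    if primary_metric <= 0.0:
        # Degenerate case: any alternate with positive certainty is competitive.
        return alternate_metric > 0.0
    return (alternate_metric / primary_metric) >= min_ratio


# Large disparity: the selected value dominates, so the alternate is hidden.
hide = should_render_alternate(0.90, 0.05)
# Close contest: the alternate is nearly as certain, so it is shown.
show = should_render_alternate(0.45, 0.40)
```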

[0078] In an aspect, the systems and methods described herein can provide for causing display of the user interface by the computing system. As one example, in some implementations, causing display of the user interface by the computing system can include communicating the user interface (e.g., data descriptive of the user interface) to a display device. The display device may be provided at the computing system and / or an external computing system. For example, in some implementations, communicating the user interface can include communicating the user interface over one or more communication networks.

[0079] Furthermore, in some implementations causing display of the user interface can provide for rendering one or more user interface elements via the display device based on the user interface. The display device can be configured to render and / or display the user interface (e.g., to a user or other observer). For instance, the display device can control one or more display elements (e.g., pixels) based on the user interface including the rendered output value element such that the user interface is presented to the user. Any suitable method of displaying the user interface can be employed in accordance with the present disclosure. As one example, a graphics processing unit (GPU) or similar computing device can compute a matrix of pixel intensities that represent the user interface and provide the matrix of pixel intensities to a display (e.g., a screen or monitor) configured to interpret the matrix of pixel intensities and light up corresponding pixels in the display in accordance with the matrix of pixel intensities.

[0080] The one or more user interface elements can include one or more rendered output value elements based on the output value and the formatting characteristic values. For instance, the output value can be rendered on the display device according to the formatting characteristic value. As one example, in some implementations, the output value can be rendered on a user interface with a shading based on the formatting characteristic value. For instance, pixel intensity values of the portion of the user interface including the output value can be determined such that the shading of the output value (e.g., relative to other rendered output values or elements) reflects the metric value output by the machine-learned model.

[0081] Another example aspect of the present disclosure can provide for a user to modify or alter an ongoing emerging output during its generation. For instance, a user can modify inputs or other state data of the machine-learned model during generation of the emerging output. In this manner, the user may be provided with the capability to alter the trajectory of the emerging output for any number of reasons. As one example, the output of the machine-learned model can be associated with a first output cycle of a plurality of output cycles. Furthermore, for the purposes of convention, the aforementioned output value(s) can be referred to as first output value(s), the aforementioned metric value(s) can be referred to as first metric value(s), and the aforementioned formatting characteristic value(s) can be referred to as first formatting characteristic value(s).

[0082] The systems and methods according to example aspects of the present disclosure can further provide for detecting (e.g., by the computing system) a modify interaction with the user interface by the user during the first output cycle. The modify interaction can be descriptive of a selection of some or all of the first output value(s) and a contextual aspect to be associated with the selected first output value(s). The modify interaction can be any of a variety of suitable interactions or combinations thereof that can be interpreted as indicative of a desire of the user to modify the generation of the emerging output and / or selecting one or more output values. As one example, the modify interaction can include a spoken interaction, such as the user speaking a phrase (e.g., captured by a microphone or similar voice input device) indicating that the user wishes to modify the generation of the emerging output. As another example, the user can tap, touch, highlight, or otherwise provide tangible feedback to select the first output value (e.g., and / or one or more additional selected output values).

[0083] In addition to the selection of the output value(s), the modify interaction can include or otherwise be descriptive of a contextual aspect. The contextual aspect can indicate some context regarding the user’s selection of the output value(s). For example, in some implementations, the contextual aspect can be or can include an indication by the user of whether the selected output value is correct or incorrect, good or bad, or some other contrasting position indicating the user’s approval or desire to include the output value in future output of the machine-learned model. The contextual aspect can be obtained by capturing user interactions, such as clicking, tapping, or otherwise interacting with interface elements that indicate the user’s position with respect to the output value. As another example, the contextual aspect can be a phrase obtained from the user, such as a spoken phrase from the user and / or a text phrase input by the user (e.g., into a text field element of the user interface). As yet another example, the contextual aspect can be or can include the user indicating desired values for given output values. For example, the user can “fill in” certain output values or groups of output values exactly as the user desires, and the model can generate other output values to complement the user-provided output values.

[0084] The systems and methods according to example aspects of the present disclosure can further provide for providing (e.g., by the computing system) contextual input to the machine-learned model at a second output cycle of the plurality of output cycles. The contextual input can be based on the contextual aspect associated with the first output value(s). For example, in some implementations, the contextual input can be equivalent to the contextual aspect. Additionally and / or alternatively, in some implementations, the contextual input can be represented in a format that may be understood by the machine-learned model. For instance, in some implementations, providing the contextual input can include determining (e.g., by the computing system) a guidance message based on the contextual input. The guidance message can include instructions for the machine-learned model responsive to the contextual aspect associated with the first output value(s). Providing the contextual input can further include providing the guidance message as the contextual input to the machine-learned model at the second output cycle. For example, if the contextual aspect provided by the user is an indication that the user dislikes a particular output value (e.g., an output value at a particular position within a block of text) such as that obtained by the user highlighting the output value and interacting with a “thumbs down” user interface element, the contextual input may be a phrase that would be more readily interpretable by the machine-learned model, such as a phrase indicating the user’s disapproval and identifying which output value(s) the user has disapproved of.
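A guidance message of the kind described above might be assembled as in the following sketch. The message wording, the function name `build_guidance_message`, and the `"positive"` / `"negative"` labels are all hypothetical; the disclosure does not specify a message format.

```python
def build_guidance_message(selected, indication):
    """Turn a user's selection and approval signal into a text instruction.

    selected: list of (position, output_value) pairs the user highlighted.
    indication: "positive" (e.g., thumbs up) or "negative" (e.g., thumbs down).
    """
    if indication not in ("positive", "negative"):
        raise ValueError(f"unknown indication: {indication}")
    positive = indication == "positive"
    verb = "approved of" if positive else "disapproved of"
    action = "Keep" if positive else "Replace"
    items = ", ".join(f'"{value}" at position {pos}' for pos, value in selected)
    return (
        f"The user {verb} the following output values: {items}. "
        f"{action} them in the next output cycle."
    )


# Hypothetical case: the user highlights one word and clicks "thumbs down".
message = build_guidance_message([(3, "terrible")], "negative")
```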

[0085] In some implementations, the systems and methods herein can provide the first output (e.g., the first output value(s) and the first metric value(s)) as input to the machine-learned model at the second output cycle. For instance, the model can consume outputs from a prior output cycle when computing the next output cycle to iteratively refine the output over subsequent processing cycles.

[0086] Furthermore, in some implementations, providing the contextual input to the machine-learned model can include determining (e.g., by the computing system) modified metric value(s) associated with the first output value(s) associated with the contextual aspect. The modified metric value(s) can be provided in place of the first metric value(s) respective to the selected first output value(s) as input to the machine-learned model at the second output cycle. As one example, in some implementations, the contextual input can be one of a positive indication or a negative indication with respect to the selected first output value(s). Determining the modified metric value(s) can include one of increasing or decreasing the first metric value(s) respective to the selected first output value(s) in response to the positive indication or the negative indication. For example, in some implementations, the system can “force” the metric value associated with a given output value to be different (e.g., higher or lower) when used as input for a subsequent computing cycle than the original metric value (e.g., a value that was originally output by the model). For example, if the metric value is a certainty score, the system may “force” a high certainty or a maximum certainty for output values that the user indicates approval of or “force” a low certainty or a minimum certainty for output values that the user indicates disapproval of.
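Forcing metric values in response to user feedback can be sketched as below, assuming certainty scores in the range 0 to 1. The function name `apply_feedback` and the choice to force the extreme values (rather than merely nudging the scores) are illustrative assumptions.

```python
def apply_feedback(metric_values, selected_positions, indication,
                   forced_low=0.0, forced_high=1.0):
    """Return metric values with the user-selected positions overridden.

    A positive indication forces the maximum certainty; a negative indication
    forces the minimum. Unselected positions keep the model's original values.
    """
    if indication == "positive":
        forced = forced_high
    elif indication == "negative":
        forced = forced_low
    else:
        raise ValueError(f"unknown indication: {indication}")
    selected = set(selected_positions)
    return [forced if i in selected else v for i, v in enumerate(metric_values)]


# Hypothetical first-cycle certainty scores; the user disapproves of positions 0 and 2.
modified = apply_feedback([0.4, 0.9, 0.6], [0, 2], "negative")
```

The modified list would then be supplied to the model at the second output cycle in place of the first metric values.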

[0087] Furthermore, in an aspect, the systems and methods described herein can provide for obtaining (e.g., by the computing system) a second output from the machine-learned model. The second output can be or can include one or more second output values. Furthermore, in an aspect, the systems and methods described herein can provide for obtaining second metric value(s) associated with the second output value(s). For instance, in some implementations, the second output can be or can include the second metric value(s). Additionally and / or alternatively, in some implementations, one or more machine-learned models other than the model can generate the second metric values.

[0088] The second output can be associated with the second output cycle. For instance, the second output can include output values output during a second “pass” through a text passage by a sequence processing model. As another example, the second output may be a second block of output values (e.g., text) output by a block generation model. A second output value can occur in place of a respective first output value from the first processing cycle. For example, a second output value may be generated to occupy a same output value position as a respective first output value. For example, if the output is or includes a block of text, a second output value may occur at a same position within the block of text as a respective first output value. Generally, a second output value may have an approximately similar meaning to a respective first output value, although the second output value may not necessarily have a similar meaning to the first output value in every instance. For instance, in some implementations, the model may refine its choice of output value in the output value position of the first and second output values over subsequent output cycles, such that the second output value may have an increased certainty score and / or be an improvement to some aspect of the set of output values, such as an improvement to accuracy, coherency, consistency, grammatical correctness, or other aspect of the set of output values.

[0089] Example aspects of the present disclosure can further provide for determining (e.g., by the computing system) a second formatting characteristic value associated with the second output value(s) based on the second metric value(s). As described above with respect to the first output value(s), for instance, the second formatting characteristic value(s) may correspond along a range with respective second metric value(s). The second formatting characteristic value(s) may be or may include, for example, shading value(s).

[0090] Example aspects of the present disclosure can further provide for generating (e.g., by the computing system) a second user interface having the plurality of second output values arranged in place of the plurality of first output values according to the plurality of output value positions and formatted according to the plurality of second formatting characteristic values. For instance, if as described above a second metric value (e.g., a certainty score) has increased relative to a respective first metric value due to an improvement in the second output relative to the first output, the second output value may be provided differently from the first output value in the second user interface (e.g., based on the second formatting characteristic value(s)) in such a manner that indicates that the second output value is an improved output value from the first output value. For example, if a metric value is a certainty score and a formatting characteristic value is a shading value, the second output value can have a darker shading than the first output value to indicate that the model is more certain about the second output value. It should be understood that some or all of the second output value(s) may not necessarily differ from respective first output value(s) across all output cycles. For instance, in some implementations, especially as the model converges to a final output, the output values may not be substantially modified across subsequent output cycles. The second output value can then be rendered in place of the first output value in the user interface and / or according to the second formatting characteristic value.

[0091] The systems and methods described herein can further provide for causing display of the second user interface (e.g., by the computing system). For instance, the computing system can communicate the second user interface to the display device. For example, when the second user interface is updated to include the second output value(s) in place of the first output value(s), the display device can modify the interface as displayed to the user such that the displayed interface reflects the second output value(s) and second formatting characteristic value(s). As one example, a GPU may compute a second matrix of pixel intensities that represents the user interface including the rendered second output value and provide the second matrix of pixel intensities to the display device.

[0092] Another example aspect of the present disclosure provides for a user to suspend evaluation of the emerging output to facilitate the user providing the contextual input and / or modifying the generation trajectory of the emerging output. For instance, the systems and methods described herein can provide for, prior to detecting the modify interaction with the user interface, detecting (e.g., by the computing system) a suspend interaction with the user interface during the first output cycle. For example, in some implementations, the user can be provided with a suspend interface element. If the user interacts with the suspend interface element (e.g., by clicking, toggling, tapping, etc.), the output cycles of the machine-learned models can be suspended or interrupted. The suspend element may be, for example, a button on the user interface, a hotkey, or other similar input mechanism.

[0093] Additionally and / or alternatively, the systems and methods herein can provide for suspending (e.g., by the computing system) a computing operation of the machine-learned model at the first output cycle in response to detecting the suspend interaction. For example, an application or computing device displaying the user interface can communicate a suspend message or suspend signal to the machine-learned model and / or an application layer or program executing the machine-learned model. In response to the suspend message, the output cycles of the machine-learned model can be suspended such that the emerging output of the machine-learned model and / or the state of the machine-learned model is maintained at its current value or state. Additionally or alternatively, in some implementations, significantly fewer (e.g., no) computing resources may be consumed by the machine-learned model while its output cycles are suspended. Still further, in some implementations, the output cycles of the machine-learned model may continue after the suspend interaction, but display of updated outputs may be suspended. For instance, in response to the suspend message, the system can suspend generating updated user interfaces for display.

[0094] In some implementations, pausing or suspending the output cycles of a machine-learned model can differ from aborting a generation instance of a machine-learned model. For example, whereas in the case of aborting a generation instance the current state and / or output of a machine-learned model can be discarded, one aspect of the present disclosure provides for machine-learned models to be suspended while maintaining a current state and / or output. For example, state data of the machine-learned models (e.g., activations, parameter values, etc.) can be maintained at present values at the moment the model is suspended throughout the duration that the model is suspended, unless intentionally modified. Some example aspects of the present disclosure can provide for the inputs and / or state data of the model to be modified while the model is suspended through user interaction, as described further herein. For example, the modification of state data while the output cycles are suspended can provide for a user to pause and “guide” or “correct” subsequent output cycles from the model, such as if the user wishes to alter a trajectory of the generation. The model can be unsuspended at a subsequent point in time, at which the model can continue its computation or evaluation using the state data (e.g., as if the model was never suspended). Furthermore, in some implementations, while the evolving output is suspended, a user may be provided with control elements to “rewind” generation to a prior output cycle at which the user may then provide contextual input.

[0095] The systems and methods can further provide for, prior to obtaining the second output from the machine-learned model, detecting (e.g., by the computing system) a resume interaction with the user interface during suspension of the computing operation of the machine-learned model. The systems and methods can further provide for, in response to detecting the resume interaction, resuming the computing operation of the machine-learned model at the second output cycle. For example, the user can interact with a resume interface element (e.g., and / or the suspend interface element) to cause the computing system to communicate a resume message to the model. In response to receiving the resume message, the model can continue to evaluate its output over output cycles and / or consume computing resources to provide further outputs. For example, in some implementations, resuming the computing operations can cause the machine-learned output value generation to provide the second output. If the user has provided contextual input, the second output (and / or further output(s)) after resuming the output cycles can be responsive to the contextual input.
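
As a non-limiting illustration of the suspend and resume behavior described above, the following sketch models a generation session whose state is held unchanged while suspended, accepts contextual input during the suspension, and continues from the held state on resume. The `GenerationSession` class, its method names, and its string outputs are hypothetical stand-ins for illustration only and do not represent an actual model interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenerationSession:
    """Hypothetical stand-in for a model that refines an output over cycles."""
    cycle: int = 0
    suspended: bool = False
    context: List[str] = field(default_factory=list)
    last_output: Optional[str] = None

    def step(self) -> Optional[str]:
        """Run one output cycle; while suspended, state is held unchanged."""
        if self.suspended:
            return self.last_output
        self.cycle += 1
        hints = ", ".join(self.context) or "none"
        self.last_output = f"output at cycle {self.cycle} (context: {hints})"
        return self.last_output

    def suspend(self) -> None:
        """Handle a suspend message: stop advancing but keep all state."""
        self.suspended = True

    def add_context(self, text: str) -> None:
        """Record contextual input.

        Assumption of this sketch: input is accepted only while suspended.
        """
        if not self.suspended:
            raise RuntimeError("suspend the session before adding context")
        self.context.append(text)

    def resume(self) -> None:
        """Handle a resume message: continue from the held state."""
        self.suspended = False


session = GenerationSession()
session.step()                            # first output cycle
session.suspend()                         # suspend interaction
held = session.step()                     # no progress while suspended
session.add_context("use a formal tone")  # contextual input during suspension
session.resume()                          # resume interaction
print(session.step())                     # second output cycle reflects the context
```

Unlike aborting, suspending in this sketch discards nothing: the cycle counter and last output survive the suspension, so the next cycle continues rather than restarts.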

[0096] Example aspects of the present disclosure can provide for a number of technical effects and benefits, including improvements to computing technology. As one example, providing an output value in a user interface according to a metric value associated with the output value can improve the spatiotemporal association between the output value and its associated metric value when presented to a user. For instance, the user can readily interpret the output value itself and the information conveyed by the metric value while in view of only the rendered output value element. In this manner, the user is not required to shift the user’s gaze between, for example, the output value and a separate UI element that conveys the metric value. This, in turn, can improve the latency with which the user is able to understand the information available on the user interface.

[0097] As another example, providing an output value in a user interface according to a metric value associated with the output value to improve the spatiotemporal association between the output value and its associated metric value can improve the information density available in the user interface. This, in turn, can provide for a comparable amount of information to be conveyed using fewer pixels and / or fewer user interface elements, thereby providing a reduction in computing resources needed to store, render, and / or display the user interface. In applications with limited display space, such as, for example, on mobile computing devices such as smartphones, these space savings can significantly improve usability and user experience of the systems and methods according to the present disclosure. Furthermore, because the information in the metric value is conveyed as a formatting characteristic, the association between the output value and the information conveyed by the metric value is made clearer on the user interface when viewed by the user, aiding in the user’s review of the information presented on the user interface.

[0098] As yet another example, aspects of the present disclosure enable new modes of operation of a computing device. For instance, aspects of the present disclosure provide for outputs at progressive output cycles to be included in a user interface according to formatting characteristic values that are indicative of metric values (e.g., certainty scores) associated with an output value and representative of a certainty or noise associated with each output value over a plurality of output cycles. This information can readily convey how complete a model’s generation is. For instance, while output values rendered according to some formatting characteristic values (e.g., lightly shaded output values) may change substantially over subsequent output cycles, other output values (e.g., darker shaded output values) may be relatively unchanged over subsequent output cycles. This information may otherwise have been either unavailable to the user or could be conveyed only in such a manner that the latency of the user’s understanding was less than the output cycles, such that the user could not respond to the emerging output. According to example aspects of the present disclosure, however, by suspending computing operations of a model at a particular output cycle in response to a suspend interaction from the user and subsequently providing new contextual input to the machine-learned output value generation, the computing system can be enabled to correct a trajectory of an ongoing emerging output during the output’s generation, rather than requiring that the output be completed and then regenerated. The aforementioned example aspects of the present disclosure therefore provide that a computing device can be enabled to alter generation trajectory in response to a user's contextual indication regarding the current generation trajectory. Furthermore, the computing device can alter generation trajectory without requiring that the generation be completed or that the existing generation be discarded. This can conserve significant otherwise wasted computing resources associated with completing evaluation of the model output and / or restarting the model generation from its initial state.

[0099] The described systems and methods for displaying emerging outputs can be extended to image generation models. Similar to text-based models, image generation models often produce outputs iteratively, with the image details and quality improving over multiple steps. Applying the principles of this disclosure to image generation could involve visualizing metrics related to the model's confidence or the quality of different image regions.

[0100] For example, in image inpainting or super-resolution tasks, the model's confidence in its reconstruction of specific pixels or regions could be represented visually. Similarly, in image synthesis from text prompts, the model could provide confidence scores for different generated features. These confidence levels could be mapped to display attributes like brightness, contrast, sharpness, or color saturation, allowing the user to quickly assess the model's certainty about different parts of the image. As the model refines its output, the visual representation would dynamically update, reflecting the changing confidence levels.
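
As a non-limiting illustration of mapping a confidence level to a display attribute, the following sketch scales the color saturation of a pixel by its confidence, so that low-confidence pixels appear washed out and regain color as confidence rises. The function name and the [0, 1] confidence range are assumptions for illustration; the sketch uses the `colorsys` module from the Python standard library.

```python
import colorsys


def render_pixel(rgb, confidence):
    """Scale a pixel's saturation by its confidence (assumed to lie in [0, 1]).

    rgb is a tuple of 0-255 integers; the return value has the same form.
    """
    r, g, b = (channel / 255 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    c = min(1.0, max(0.0, confidence))  # clamp out-of-range scores
    r2, g2, b2 = colorsys.hsv_to_rgb(h, s * c, v)
    return tuple(round(channel * 255) for channel in (r2, g2, b2))


# The same red pixel at three confidence levels across output cycles.
for confidence in (0.0, 0.5, 1.0):
    print(confidence, render_pixel((255, 0, 0), confidence))
```

The same pattern could drive brightness or contrast instead of saturation by scaling a different component.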

[0101] This approach could also facilitate user interaction with the image generation process. The user could, for example, identify regions of low confidence and provide additional guidance to the model, influencing the subsequent iterations and improving the final output. This interactive feedback loop, combined with the dynamic visualization of confidence metrics, could significantly enhance the user experience and the effectiveness of image generation models. Several possible metrics analogous to token certainty were discussed, including classifier-free guidance scale, conditional log-likelihood, variance / uncertainty estimates, sharpness / blur, contrast, perceptual similarity, aesthetic quality metrics, and detector confidence for specific features within the image.

[0102] As one example, the systems and methods herein can be applied to repairing damaged or old photos, removing unwanted objects from images, and / or filling in missing portions of images. For example, a user can provide an image to be modified. Confidence metrics, such as variance estimates or perceptual similarity to surrounding areas, are calculated for each pixel or region. These metrics are visualized, for example, by overlaying a semi-transparent mask on low-confidence areas. As the model refines the image, the mask fades in areas where confidence increases, providing visual feedback on the restoration progress. The user can interact with low-confidence areas, providing further guidance to the model.
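
As a non-limiting illustration of the fading mask described above, the following sketch alpha-blends a mask color over a pixel with an opacity that shrinks as confidence grows. The function name, the default mask color, and the maximum opacity are assumptions chosen for illustration.

```python
def overlay_mask(pixel, confidence, mask_color=(255, 0, 255), max_alpha=0.6):
    """Blend a semi-transparent mask over a pixel; the mask fades with confidence.

    Assumption: confidence lies in [0, 1], and mask opacity is
    max_alpha * (1 - confidence), so a fully confident pixel is unmasked.
    """
    c = min(1.0, max(0.0, confidence))
    alpha = max_alpha * (1.0 - c)
    return tuple(
        round((1.0 - alpha) * p + alpha * m) for p, m in zip(pixel, mask_color)
    )


# Confidence for one restored pixel rising over successive refinement steps.
for confidence in (0.1, 0.5, 1.0):
    print(confidence, overlay_mask((40, 120, 200), confidence))
```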

[0103] As another example, the systems and methods herein can be applied to image synthesis tasks, such as creating images from textual descriptions, generating creative content, or exploring visual concepts. For example, the user can input a text prompt. The model can generate an image iteratively, starting from noise. Metrics such as classifier-free guidance scale or aesthetic quality scores are calculated for different features or regions. These metrics can be visualized as described herein through display attributes such as sharpness, contrast, or color saturation. For example, features with low confidence might initially appear blurry or desaturated and become sharper and more vibrant as the model refines them. The user can observe the evolution of the image and provide feedback at different stages, influencing the final result.

[0104] As another example, the systems and methods herein can be applied to AI-assisted image editing tasks, such as changing image backgrounds, modifying object appearance, or creating composite images. The user can provide an image and / or specify desired edits (e.g., changing the background). The model described herein iteratively implements these edits. Metrics such as blending quality or realism are calculated for the edited regions. These are visualized as described herein, for example, by highlighting areas with poor blending using colored outlines. As the model refines the edits, the outlines fade, indicating improved quality. The user can interact with highlighted areas, providing further input to the model.

[0105] As another example, the systems and methods herein can be applied to medical image analysis tasks, such as assisting doctors in identifying diseases or anomalies in medical scans (e.g., X-rays, MRIs, CT scans, etc.). The model described herein can analyze a medical scan and identify potentially problematic areas. Metrics like the probability of a specific condition being present are calculated for different regions. These probabilities are visualized as described herein through color intensity, with higher probabilities represented by warmer colors (e.g., red). This allows the doctor to quickly focus on the most suspicious areas. The model can also provide uncertainty estimates, visualized through blur or transparency, helping the doctor assess the reliability of the model's predictions.

[0106] As another example, the systems and methods herein can be applied to super-resolution tasks, such as increasing the resolution of low-resolution images and / or enhancing image details. The user can provide a low-resolution image. The model described herein can iteratively generate a higher-resolution version. Metrics like sharpness or perceptual similarity to high-resolution training data are calculated. These metrics are visualized as described herein, for example, by highlighting areas with lower sharpness or similarity. As the model refines the image, these highlights fade. The user can observe the enhancement process and potentially guide the model towards desired details.

[0107] These examples demonstrate how visualizing confidence and quality metrics can enhance the usability and effectiveness of image generation models in various applications. The specific metrics and visualization techniques can depend on the particular task and the needs of the user. In addition to the aforementioned text-based tasks and image-based tasks, the systems and methods herein can be applied to other tasks that are neither text-based nor image-based.

[0108] As one example, the systems and methods herein can be applied to audio generation or synthesis tasks, such as generating music, creating sound effects, synthesizing speech, and so on. For instance, a user can compose an audio segment by specifying characteristics of the desired audio segment to the model described herein. As the model generates notes or musical phrases, confidence metrics could be calculated based on harmony, melody, rhythm, or similarity to a target style. These could be visualized as described herein on a musical score or piano roll interface. Notes with low confidence might be dimmed or displayed with a different color, allowing the user to easily identify and modify them. As the model refines the composition, the visualization updates to reflect increasing confidence.

[0109] As another example, the systems and methods described herein can be applied to model generation tasks, such as designing objects for 3D printing, creating virtual environments, or generating characters or other assets for games and renders. For instance, as a model generates parts of a 3D object, confidence metrics could be calculated based on structural integrity, feasibility for printing, or similarity to a reference design. These could be visualized directly on the 3D model. Areas with low confidence might be highlighted with a different color or texture, or displayed with a semi-transparent overlay. As the model refines the object, the visualization updates, providing feedback on the evolving design.

[0110] As another example, the systems and methods described herein can be applied to code generation tasks, such as automated software development and / or assisting programmers in writing code. For instance, as a model generates code, confidence metrics could be calculated based on syntax correctness, code style, or potential bugs. These could be visualized within the code editor. Lines of code with low confidence might be underlined or highlighted with a specific color. As the model refines the code, the visualization updates, helping the programmer identify areas that require attention.
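
As a non-limiting illustration of the code-editor example above, the following sketch flags generated lines whose confidence falls below a threshold, which an editor could then underline or highlight. The function name, the threshold value, and the per-line confidence scores are hypothetical; how such scores would be computed is outside the scope of the sketch.

```python
def annotate_lines(lines, confidences, threshold=0.5):
    """Return (line_number, text, flagged) tuples for generated code lines.

    A line is flagged when its confidence is below the threshold (an assumed
    cutoff); flagged lines are candidates for underlining or highlighting.
    """
    return [
        (number, text, confidence < threshold)
        for number, (text, confidence) in enumerate(zip(lines, confidences), start=1)
    ]


generated = ["total = 0", "for x in items:", "    total += x.price"]
scores = [0.95, 0.90, 0.35]  # hypothetical per-line confidence scores
for number, text, flagged in annotate_lines(generated, scores):
    marker = "!" if flagged else " "
    print(f"{marker} {number:>2} | {text}")
```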

[0111] As another example, the systems and methods described herein can be used for scientific data analysis and / or simulation tasks, such as analyzing complex datasets, simulating physical phenomena, and predicting experimental outcomes. For example, if a scientist is simulating fluid dynamics, the model described herein may generate data representing the flow of a liquid. Confidence metrics could be calculated for different regions of the simulation based on factors like numerical stability or agreement with experimental data. These could be visualized using color gradients, contour lines, or other graphical representations. Areas with low confidence might be highlighted, guiding the scientist's interpretation of the results.

[0112] As another example, the systems and methods described herein can be applied to robotics or control systems tasks, such as designing control algorithms for robots, optimizing robot movements, and / or predicting robot behavior. For instance, a model can predict a robot’s trajectory. Confidence metrics are calculated for different points along the trajectory, reflecting the model's certainty about the robot's position and actions. These metrics could be visualized in a simulation environment, with low-confidence segments of the trajectory highlighted. This allows engineers to identify potential issues and refine the control algorithms.
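
As a non-limiting illustration of identifying low-confidence trajectory segments, the following sketch groups consecutive trajectory points whose confidence falls below a threshold into index ranges that a simulation view could highlight. The function name, the threshold, and the sample scores are assumptions for illustration.

```python
def low_confidence_segments(confidences, threshold=0.5):
    """Group consecutive below-threshold points into (start, end) index ranges.

    Both indices are inclusive and refer to positions along the trajectory.
    """
    segments = []
    start = None
    for index, confidence in enumerate(confidences):
        if confidence < threshold and start is None:
            start = index  # a low-confidence run begins here
        elif confidence >= threshold and start is not None:
            segments.append((start, index - 1))  # the run ended at the prior point
            start = None
    if start is not None:
        segments.append((start, len(confidences) - 1))  # run extends to the end
    return segments


# Hypothetical per-point confidence along a predicted trajectory.
print(low_confidence_segments([0.9, 0.4, 0.3, 0.8, 0.2]))
```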

[0113] These examples demonstrate the broad applicability of visualizing evolving outputs of machine-learned models across many diverse domains. Notably, in each instance, relevant metrics for each application can be identified and mapped to intuitive visual representations of those metrics, to generate a visual representation of the output that provides valuable feedback to the user.

[0114] Various example implementations are described herein with respect to the accompanying FIGS.

Example Model Systems and Architectures

[0115] Figure 1A depicts a block diagram of an example computing system 100 that performs tasks according to example embodiments of the present disclosure. The system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.

[0116] The user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.

[0117] The user computing device 102 includes one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.

[0118] In some implementations, the user computing device 102 can store or include one or more machine-learned models 120. For example, the machine-learned models 120 can be or can otherwise include various machine-learned models such as block generation models, as described herein. Additionally and / or alternatively, as described herein, the models 120 can be or can include neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and / or linear models. Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).

[0119] In some implementations, the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112. In some implementations, the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel tasks across multiple instances of machine-learned models).

[0120] Additionally or alternatively, one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship. For example, the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service. Thus, one or more models 120 can be stored and implemented at the user computing device 102 and / or one or more models 140 can be stored and implemented at the server computing system 130.

[0121] The user computing device 102 can also include one or more user input components 122 that receive user input. For example, the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus). The touch-sensitive component can serve to implement a virtual keyboard. Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.

[0122] The server computing system 130 includes one or more processors 132 and a memory 134. The one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.

[0123] In some implementations, the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.

[0124] As described above, the server computing system 130 can store or otherwise include one or more machine-learned models 140. For example, the models 140 can be or can otherwise include various machine-learned models. Example machine-learned models include neural networks or other multi-layer non-linear models. Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multi-headed self-attention models (e.g., transformer models). Some example machine-learned models can include diffusion models.

[0125] The user computing device 102 and / or the server computing system 130 can train the models 120 and / or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180. The training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.

[0126] The training computing system 150 includes one or more processors 152 and a memory 154. The one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 154 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations. In some implementations, the training computing system 150 includes or is otherwise implemented by one or more server computing devices.

[0127] The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and / or 140 stored at the user computing device 102 and / or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and / or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
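
As a non-limiting illustration of iteratively updating parameters based on a gradient of a loss function, the following sketch fits a one-variable linear model by gradient descent on a mean squared error loss. A linear model with hand-derived gradients is substituted here for a neural network trained by backpropagation in order to keep the sketch self-contained; the function name, learning rate, and step count are assumptions for illustration.

```python
def train_linear(xs, ys, learning_rate=0.1, steps=500):
    """Fit y = w * x + b by gradient descent on mean squared error.

    Gradients of the loss mean((w*x + b - y)^2) with respect to w and b are
    computed analytically; this is a simplification of backpropagation.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w  # step against the gradient
        b -= learning_rate * grad_b
    return w, b


# Training examples drawn from y = 2x + 1; the fit should approach w=2, b=1.
w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 3), round(b, 3))
```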

[0128] In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. The model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.

[0129] In particular, the model trainer 160 can train the machine-learned models 120 and / or 140 based on a set of training data 162. The training data 162 can include, for example, examples of data that generally correspond to data that would be consumed by a machine-learned model configured to perform a particular task. In some implementations, the examples may be labeled with an expected or desired output from the models.

[0130] In some implementations, if the user has provided consent, the training examples can be provided by the user computing device 102. Thus, in such implementations, the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.

[0131] The model trainer 160 includes computer logic utilized to provide desired functionality. The model trainer 160 can be implemented in hardware, firmware, and / or software controlling a general purpose processor. For example, in some implementations, the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.

[0132] The network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links. In general, communication over the network 180 can be carried via any type of wired and / or wireless connection, using a wide variety of communication protocols (e.g., TCP / IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and / or protection schemes (e.g., VPN, secure HTTP, SSL).

[0133] The machine-learned models described in this specification may be used in a variety of tasks, applications, and / or use cases.

[0134] In some implementations, the input to the machine-learned model(s) of the present disclosure can be image data. The machine-learned model(s) can process the image data to generate an output. As an example, the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an image segmentation output. As another example, the machine-learned model(s) can process the image data to generate an image classification output. As another example, the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and / or compressed representation of the image data, etc.). As another example, the machine-learned model(s) can process the image data to generate an upscaled image data output. As another example, the machine-learned model(s) can process the image data to generate a prediction output.

[0135] In some implementations, the input to the machine-learned model(s) of the present disclosure can be text or natural language data. The machine-learned model(s) can process the text or natural language data to generate an output. As an example, the machine-learned model(s) can process the natural language data to generate a language encoding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output. As another example, the machine-learned model(s) can process the text or natural language data to generate a translation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a classification output. As another example, the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output. As another example, the machine-learned model(s) can process the text or natural language data to generate a semantic intent output. As another example, the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, the machine-learned model(s) can process the text or natural language data to generate a prediction output.

[0136] In some implementations, the input to the machine-learned model(s) of the present disclosure can be speech data. The machine-learned model(s) can process the speech data to generate an output. As an example, the machine-learned model(s) can process the speech data to generate a speech recognition output. As another example, the machine-learned model(s) can process the speech data to generate a speech translation output. As another example, the machine-learned model(s) can process the speech data to generate a latent embedding output. As another example, the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and / or compressed representation of the speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, the machine-learned model(s) can process the speech data to generate a prediction output.

[0137] In some implementations, the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.). The machine-learned model(s) can process the latent encoding data to generate an output. As an example, the machine-learned model(s) can process the latent encoding data to generate a recognition output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reconstruction output. As another example, the machine-learned model(s) can process the latent encoding data to generate a search output. As another example, the machine-learned model(s) can process the latent encoding data to generate a reclustering output. As another example, the machine-learned model(s) can process the latent encoding data to generate a prediction output.

[0138] In some implementations, the input to the machine-learned model(s) of the present disclosure can be statistical data. Statistical data can be, represent, or otherwise include data computed and / or calculated from some other data source. The machine-learned model(s) can process the statistical data to generate an output. As an example, the machine-learned model(s) can process the statistical data to generate a recognition output. As another example, the machine-learned model(s) can process the statistical data to generate a prediction output. As another example, the machine-learned model(s) can process the statistical data to generate a classification output. As another example, the machine-learned model(s) can process the statistical data to generate a segmentation output. As another example, the machine-learned model(s) can process the statistical data to generate a visualization output. As another example, the machine-learned model(s) can process the statistical data to generate a diagnostic output.

[0139] In some implementations, the input to the machine-learned model(s) of the present disclosure can be sensor data. The machine-learned model(s) can process the sensor data to generate an output. As an example, the machine-learned model(s) can process the sensor data to generate a recognition output. As another example, the machine-learned model(s) can process the sensor data to generate a prediction output. As another example, the machine-learned model(s) can process the sensor data to generate a classification output. As another example, the machine-learned model(s) can process the sensor data to generate a segmentation output. As another example, the machine-learned model(s) can process the sensor data to generate a visualization output. As another example, the machine-learned model(s) can process the sensor data to generate a diagnostic output. As another example, the machine-learned model(s) can process the sensor data to generate a detection output.

[0140] In some cases, the machine-learned model(s) can be configured to perform a task that includes encoding input data for reliable and / or efficient transmission or storage (and / or corresponding decoding). For example, the task may be an audio compression task. The input may include audio data and the output may include compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output includes compressed visual data, and the task is a visual data compression task. In another example, the task may include generating an embedding for input data (e.g., input audio or visual data).

[0141] In some cases, the input includes visual data and the task is a computer vision task. In some cases, the input includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.

[0142] In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output may include a text output which is mapped to the spoken utterance. In some cases, the task includes encrypting or decrypting input data. In some cases, the task includes a microprocessor performance task, such as branch prediction or memory address translation.

[0143] Figure 1A illustrates one example computing system that can be used to implement the present disclosure. Other computing systems can be used as well. For example, in some implementations, the user computing device 102 can include the model trainer 160 and the training dataset 162. In such implementations, the models 120 can be both trained and used locally at the user computing device 102. In some of such implementations, the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.

[0144] Figure 1B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure. The computing device 10 can be a user computing device or a server computing device.

[0145] The computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.

[0146] As illustrated in Figure 1B, each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and / or additional components. In some implementations, each application can communicate with each device component using an API (e.g., a public API). In some implementations, the API used by each application is specific to that application.

[0147] Figure 1C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure. The computing device 50 can be a user computing device or a server computing device.

[0148] The computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer. Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc. In some implementations, each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).

[0149] The central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 1C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.

[0150] The central intelligence layer can communicate with a central device data layer. The central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 1C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and / or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).

[0151] Referring now to FIG. 2, a block diagram illustrates an example computing system 200 configured to implement an agent system 202, according to example implementations of aspects of the present disclosure. The depicted computing system 200 is designed to receive multiple types of input data, process this data, and generate outputs that are responsive to the inputs in a contextually appropriate manner.

[0152] The agent system 202 within the computing system 200 is configured to receive visual data 204, audio data 206, and additional context data 208. Each type of data is processed by the agent system 202 using one or more machine-learned model(s) 205 to facilitate interaction within its operational environment. For example, visual data 204 can include live video streams from a camera or recorded video streams from a web resource, while audio data 206 can include spoken commands or ambient sounds captured by microphones.

[0153] Additional context data 208 can include sensor data, textual information, or other forms of digital data that provide further insights into the environment or the context of the interaction. As one example, the additional context data 208 can include sensor data that captures user inputs beyond speech inputs, such as touch-screen inputs, gestures, facial expressions, and / or other inputs. These user inputs can, in some implementations, be merged with other inputs such as visual data 204 to create combined inputs. In one example, a user can be provided with an interface that displays a real-time field of view of the agent system (e.g., which may correspond to visual data 204). The interface can enable the user to “draw” on or otherwise interact with the interface to mark up the real-time field of view. For example, the user could draw an arrow or make a circle to identify a particular object included within the scene displayed on the interface. The user’s graphical input can be added onto or merged with the visual data 204 to form a combined input. For example, the visual data 204 can be amended to include the arrow or circle, which can then be processed by the agent system 202. In such manner, interactive interfaces can provide the ability for the user to more granularly interact with or identify portions of the environment when querying the agent system 202.

[0154] Furthermore, it should be appreciated that in some cases the user will be able to control the type, nature, content, or other characteristics of the visual data 204, audio data 206, and / or additional context data 208. As one example, the user can manipulate a field of view of a camera to alter the content of the visual data 204 that is provided to the agent system 202. Similarly, by speaking into a microphone, the user can provide additional audio data 206 as an input for the agent system 202. The agent system's ability to process and combine visual, auditory, and textual information allows it to generate more comprehensive and nuanced responses, carefully tailored to the user's multi-modal context.

[0155] The agent system 202 processes these diverse inputs to generate an agent action 210, which can include an output designed to respond to the processed inputs effectively. As examples, this action can range from textual responses, vocal responses, displaying information, controlling connected devices, or any other form of interaction output that is deemed appropriate based on the input data. Specifically, the agent system 202 can provide concise answers, generate detailed explanations, offer step-by-step instructions, display information through visual highlights or augmented reality overlays, control connected devices, and / or other forms of actions 210.

[0156] In some implementations, the agent system 202 can include and use specialized sequence processing models to integrate and analyze the input data. These models are configured to process complex patterns across different data modalities, enabling the agent system 202 to generate more accurate and contextually relevant responses. The sequence processing models may be specifically fine-tuned to handle various interaction dynamics, such as turn-based dialogues or more open-ended conversational formats, enhancing the flexibility and adaptability of the agent system. Additionally and / or alternatively, the agent system 202 can include block generation models.

[0157] Furthermore, the computing system 200 can be connected to a real-time communication framework that facilitates the immediate and efficient exchange of data, including the inputs and outputs to and from the agent system 202. This configuration reduces latency in data processing and response generation.

[0158] The agent system 202 can include or can have access to a user-specific memory layer 212. The user-specific memory layer 212 can provide for the agent system to access user-specific data, such as image data, video data, documents, and other data provided by the user to the agent system. As one example, the user-specific memory layer 212 can access data at a designated local repository on a calling device belonging to the user and / or other memory within the computing system 200 or accessible by the computing system 200. For example, the user-specific memory layer can be a directory, folder, file repository, or other non-tangible, computer-readable media in which the user consents to store video data, pictures or image data, documents, files, music or audio data, or other computer-readable data that the user wishes for the agent system 202 to have access to. Additionally or alternatively, the user can ask the agent system 202 to store data in the user-specific memory layer 212, such as by asking the agent system 202 to record and store video data from a camera of the user device. For example, the user may instruct the agent system 202 to “remember where I parked,” in response to which the agent system 202 may capture image data and / or geopositional data of a vehicle of the user. As another example, the user may instruct the agent system 202 to “remember that for later,” in which case the agent system 202 may capture image data or video data of the environment at which the user is looking (e.g., through a camera on a wearable device, such as smart glasses).

[0159] As another example, the user can provide the agent system 202 with access instructions for user-specific data streams, such as video watch history, historical geodata, and other data that the user wishes for the agent system 202 to have access to, which can either be or can provide data to the user-specific memory layer 212. The user-specific memory layer 212 can be, in some implementations, a long-term memory layer that can, with the consent of the user, provide context relating to long-term memories of the user, such as birthdays, anniversaries, and so on.

[0160] In some implementations, in addition to the user-specific memory layer 212, the agent system 202 can include or have access to a model memory layer 214 or other memory system. The agent system 202 can store and retrieve various types of information to and from the model memory layer 214. For example, the agent system 202 can store past interactions, observations, preferences, and / or information from the environment in the model memory layer 214. The agent system 202 can then recall this information for use in generating new predictions, outputs, or agent actions.

[0161] A number of different types of data can be stored in the model memory layer 214. One example of data stored within the model memory layer 214 can include object detections. This can include indexed records of objects that the agent system encounters during its operations, complete with metadata such as timestamps, location coordinates, and / or contextual tags. By archiving these detections, the agent system 202 can recognize and recall objects from a “history” of observed scenes. The agent system 202 can leverage this information to refine interactions and bolster situational awareness, potentially spanning different sessions of user interaction.

[0162] As another example data type, the model memory layer 214 can store embeddings of observed visual content, textual content, or other inputs. These embeddings can be low-dimensional numerical representations that encode the essential features of input data into a latent embedding space. The storage of embeddings associated with observed inputs allows the agent system 202 to conduct rapid comparisons and recognition tasks efficiently. In particular, these embeddings, which can be derived from various layer(s) of the agent system’s machine-learned models, can be used to perform similarity searches to facilitate quick data retrieval.
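
As a non-limiting illustration, the similarity search over stored embeddings described above could be sketched as follows. The disclosure does not specify a similarity measure or a storage format; cosine similarity, the dictionary-based store, and the names `cosine_similarity` and `most_similar` are assumptions made only for this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],  # hypothetical store: record key -> embedding
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k (key, similarity) pairs, most similar first."""
    scored = [(key, cosine_similarity(query, emb)) for key, emb in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

A brute-force scan such as this one is linear in the number of stored embeddings; a larger memory layer would likely use an approximate nearest-neighbor index instead, which is outside the scope of this sketch.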

[0163] As another example, intermediate model activations can be stored in the model memory layer 214. Capturing and preserving the state of model activations at various stages can enable the agent system 202 to efficiently resume or adjust its processing activities as needed. This feature can be used in scenarios involving long running or complex processing tasks that may be interrupted or require dynamic adjustments such as resetting the agent system to a prior state associated with a prior time.

[0164] As another example, the model memory layer 214 can store raw tokens generated by the agent system’s natural language processing, image processing, or other tokenization mechanisms. For example, a cache of tokens can be stored, with each being associated with a specific timestamp. This data allows for the reconstruction of the sequence of inputs and internal states over time, which can be used to retrieve and replay perceptual inputs associated with a particular timestamp or setting, or to otherwise provide the raw tokens as a contextual input for a later prediction.
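
As a non-limiting illustration, a timestamped token cache of the kind described above could be sketched as follows. The class name `TokenCache`, its methods, the use of string tokens, and the requirement that timestamps arrive in non-decreasing order are all assumptions of this sketch rather than features stated in the disclosure.

```python
import bisect
from typing import List, Tuple


class TokenCache:
    """Append-only cache of (timestamp, token) pairs kept in timestamp order."""

    def __init__(self) -> None:
        self._timestamps: List[float] = []
        self._tokens: List[str] = []

    def append(self, timestamp: float, token: str) -> None:
        """Store a token; timestamps are assumed to arrive in non-decreasing order."""
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("timestamps must be non-decreasing")
        self._timestamps.append(timestamp)
        self._tokens.append(token)

    def replay(self, start: float, end: float) -> List[Tuple[float, str]]:
        """Return the tokens with start <= timestamp <= end, in stored order."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._tokens[lo:hi]))
```

Because the timestamps are kept sorted, a binary search locates the requested window without scanning the whole cache, and the returned tokens can then be supplied as contextual input for a later prediction.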

[0165] By maintaining a repository of these data types, the agent system 202 can be equipped with a knowledge base that supports advanced functionalities such as context-aware computing, personalized interactions, and information retrieval from past observations. For example, upon retrieving stored information from the model memory layer 214, the agent system 202 can integrate the retrieved data into the current processing workflow. This integration can include aligning historical and current data to enhance the accuracy and relevance of the output.

[0166] In some implementations, the agent system 202 can include or have access to both short-term and long-term memory components. The short-term memory may be volatile, designed for the temporary storage of recent interactions and sensory inputs. In contrast, the long-term memory may be non-volatile, storing valuable learned information, user preferences, historical interaction data, and significant environmental events for longer-term recall and usage. In addition, the design of the model memory layer 214 can accommodate both structured and unstructured data. As an example, for immediate processing needs, volatile memory such as Random Access Memory (RAM) can be used. As another example, for the purpose of long-term data retention, non-volatile storage solutions such as Hard Disk Drives (HDDs) or Solid-State Drives (SSDs) can be used. Furthermore, the model memory layer 214 can include hybrid memory solutions that combine the rapid access capabilities of RAM with the extensive storage capacity of disk storage, thereby optimizing the performance of the agent system 202 across various tasks.

[0167] FIGS. 3A - 3C depict example user interfaces 300, 320, and 340, respectively, that progressively display evolving output from a machine-learned model, according to example implementations of the present disclosure. The evolving output can be generated responsive to a user input 302. For example, the user input 302 may be a text prompt or other form of user input that the user provides to a computing system, such as the computing system displaying the user interfaces 300, 320, 340. Each of the user interfaces 300, 320, and 340 may display output 305, 325, 345, respectively, from a machine-learned model to a user in view of the user interfaces 300, 320, 340. For instance, FIG. 3A depicts the user interface 300 having first output 305 at a first output cycle, FIG. 3B depicts the user interface 320 having second output 325 at a second output cycle subsequent to the first output cycle, and FIG. 3C depicts the user interface 340 having third output 345 at a third output cycle subsequent to the second output cycle. For instance, if the user interfaces 300, 320, and 340 are progressively displayed to the user (e.g., via a same display device), the user can gain comprehension of how the outputs 305, 325, and 345 are evolving over time. The user interface 300 may, for example, depict an earlier output 305 of the model having lower confidence than the outputs 325, 345 of the user interfaces 320 and 340. As illustrated in FIG. 3A, the formatting characteristic representing the shading of the tokens can be determined based on a metric value indicative of confidence of the model in each token of the output 305. Because the output 305 is relatively early in generation, many of the tokens in the output 305 are rendered with low shading values such that the tokens are nearly invisible against a background color of the user interface 300. By comparison, as illustrated in FIGS. 3B and 3C, as the evolving output progresses and the model gets more certain, the tokens in the outputs 325 and 345 are rendered with increasingly darkening shading relative to the increasing confidence in the tokens. In some implementations, at a final output cycle (e.g., when the model’s output has converged), each token in the output may be displayed with a common formatting characteristic (e.g., a darkest shading).
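
As a non-limiting illustration, the mapping from a per-token confidence value to a shading value could be sketched as follows. The disclosure does not fix a particular mapping; the linear interpolation, the 8-bit gray levels (255 for the background, 0 for the darkest shading), the assumed confidence range of 0 to 1, and the function names are assumptions made only for this sketch.

```python
from typing import List, Sequence


def shade_for_confidence(confidence: float, background: int = 255, darkest: int = 0) -> int:
    """Linearly interpolate a gray level: confidence 0 -> background, 1 -> darkest."""
    clamped = min(1.0, max(0.0, confidence))  # tolerate out-of-range metric values
    return round(background + (darkest - background) * clamped)


def shades_for_cycle(confidences: Sequence[float], converged: bool = False) -> List[int]:
    """Shading values for one output cycle, one per token.

    When the output has converged, every token receives the common darkest
    shading regardless of its individual confidence.
    """
    if converged:
        return [shade_for_confidence(1.0) for _ in confidences]
    return [shade_for_confidence(c) for c in confidences]
```

Under these assumptions, a token with zero confidence is rendered in the background color and is therefore invisible, while tokens darken as their confidence approaches 1 over successive output cycles.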

[0168] The user interface 300 can further provide the user with various controls to modify the evolving output during its generation. As one example, the user interface 300 can include a suspend interface element 306. The suspend interface element 306 can be a button on the user interface 300 with which the user can perform a suspend interaction, whether by clicking, toggling, tapping, etc. In response to the suspend interaction, the user can provide additional contextual information regarding the generation. Additionally and / or alternatively, the user interface 300 can provide contextual feedback interface elements 308 for indicating approval or disapproval of the generated response. For example, the contextual feedback interface elements may include a first element for indicating approval (e.g., depicting a “thumbs up,” a check mark, a smiley face, or other graphic conventionally understood to indicate approval) and a second element for indicating disapproval (e.g., depicting a “thumbs down,” an X, a frowning face, or other graphic conventionally understood to indicate disapproval). When the user has not selected any particular tokens, this contextual feedback from the user may be associated with the evolving output as a whole. Example aspects of the present disclosure provide that a user may suspend generation of an evolving output, provide some contextual input, and modify future generation of that evolving output without discarding the past output cycles.

[0169] Additionally, the user interface 300 includes an input field 310. The input field 310 can provide for a user to enter, speak, or otherwise provide contextual input before, during, and / or after generation of the emerging outputs 305, 325, and / or 345. For instance, the input field 310 can accept text input, audio input, image input, or other suitable types of input from the user. In some implementations, the user can select one or more selected tokens in the outputs 305, 325, 345 for which the contextual input in the input field 310 is directed, as described in greater detail with respect to FIGS. 4A - 4C below.

[0170] Furthermore, in some implementations, the user interface 300 can include generation scroll element 312. The generation scroll element 312 can provide for a user to scroll through prior output cycles as they are generated by the machine-learned model. For example, the generation scroll element 312 may range from a first end (e.g., left) to a second end (e.g., right). The first end can correspond to the initial output cycle of the emerging output and / or the second end can correspond to a current or most recent output cycle (e.g., a final output cycle or an intermediate output cycle if the output has not converged). As one example, a user can click on an indicator element (e.g., a point) at a point in the generation scroll element 312 to select an output cycle and / or can drag left or right to move throughout the generation history. According to example aspects of the present disclosure, the user can “rewind” generation of the model using the generation scroll element 312, provide contextual input at an earlier output cycle through the input field 310, and cause a branching generation that diverges from the output cycles after the contextual input is provided. The prior branch of generation may be overwritten or may be archived for the user to return to if desired.
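
As a non-limiting illustration, the rewind-and-branch behavior described above could be sketched as follows. The class `GenerationHistory`, its fields, and the representation of each output cycle as a string are assumptions of this sketch; the `archive` flag corresponds to the choice between archiving and overwriting the prior branch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GenerationHistory:
    """Output cycles on the active branch, plus any archived prior branches."""

    cycles: List[str] = field(default_factory=list)          # one entry per output cycle
    archived: List[List[str]] = field(default_factory=list)  # prior branches kept for return
    context: Dict[int, str] = field(default_factory=dict)    # cycle index -> contextual input

    def record(self, output: str) -> None:
        """Append the output of the next output cycle to the active branch."""
        self.cycles.append(output)

    def rewind_and_branch(self, cycle_index: int, contextual_input: str,
                          archive: bool = True) -> None:
        """Rewind to cycle_index and attach contextual input, starting a new branch."""
        if not 0 <= cycle_index < len(self.cycles):
            raise IndexError("no such output cycle")
        if archive:
            self.archived.append(list(self.cycles))  # keep the prior branch intact
        # Cycles after the selected one are dropped from the active branch.
        self.cycles = self.cycles[: cycle_index + 1]
        self.context = {i: c for i, c in self.context.items() if i <= cycle_index}
        self.context[cycle_index] = contextual_input
```

In this sketch, the scroll position corresponds to `cycle_index`, and subsequent calls to `record` extend the new branch from that point.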

[0171] FIGS. 4A - 4C depict example user interfaces 400, 420, and 440, respectively, illustrating user modification of an evolving output, according to example implementations of the present disclosure. For instance, FIG. 4A depicts a first user interface 400. In the example of FIG. 4A, the user may have interacted with the suspend interface element 306 to suspend output cycles of the machine-learned model. In response to the user interacting with the suspend interface element 306, the model has suspended generation at a particular output cycle with the output as displayed in FIG. 4A. Furthermore, in the example of FIG. 4A, the user has provided a modify interaction that indicates selected (e.g., highlighted) tokens 402. The selected tokens 402 reference questions asked by an audience. Additionally, the modify interaction from the user includes the user providing a guidance message 404 in the input field 310 stating “please modify this section to remove reference to audience questions.” The selected tokens 402 and the contextual aspect (the guidance message 404) in the input field 310 can be provided as contextual input to the machine-learned model at a future output cycle, after the generation is unsuspended.

[0172] FIG. 4A depicts a user providing a guidance message 404 in the input field 310 to guide future output cycles from the machine-learned model. In addition to and / or alternatively to a guidance message 404, in some implementations, the user may be provided with the ability to directly modify tokens in the emerging output through the input field 310 and / or another input element. For example, in some implementations, a user may select the tokens 402 and type or otherwise input data to directly overwrite the selected tokens 402.

[0173] FIG. 4B depicts another user interface 420. In the example of FIG. 4B, the user may have suspended the output cycles at the same output cycle as FIG. 4A. For example, the interface 420 could be displayed before or after the user has submitted the contextual aspect in FIG. 4A. In the example of FIG. 4B, the user has provided a modify interaction indicating that the user has selected tokens 426, which read “sharing your insights.” The user can subsequently interact with the contextual feedback interface elements 308 to provide a contextual aspect indicative of whether the user approves or disapproves of the output values displayed in the selected tokens 426.

[0174] FIG. 4C depicts another user interface 440. The user interface 440 can be displayed after the user has unsuspended output cycles from the user interfaces 400 and 420. The user interface 440 includes new output values for the tokens 442 relative to the selected tokens 402 of FIG. 4A. For example, the user’s contextual aspect asking the model to remove references to audience questions could be interpreted by the model and cause the model to generate new output values for the tokens 442 that do not reference audience questions. Additionally, the user interface 440 displays the tokens 446 containing the “sharing your insights” phrase in a solid, darkest shading. This can occur, for example, if the user indicated approval of the tokens 446 in the interface 420 of FIG. 4B. The user has overwritten the formatting characteristic value of the tokens 446 and / or “forced” the metric values of the output values of the “sharing your insights” phrase for the tokens 446 to be a maximum value by providing a contextual aspect indicating approval, even if the metric values of those output values from the model are not a maximum.

Example Methods

[0175] FIG. 5 depicts a flowchart of a method 500 for implementing agent systems according to aspects of the present disclosure. For instance, an example agent system can include one or more machine-learned models and / or other systems configured to perform tasks in response to a query from a user.

[0176] One or more portion(s) of example method 500 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 500 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 500 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. FIG. 5 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. FIG. 5 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 500 can be performed additionally, or alternatively, by other systems.

[0177] At 502, the method 500 can include obtaining (e.g., by a computing system including one or more computing devices) an output including one or more output values. In some implementations, the output can include a plurality of output values. An output value can be a portion of a larger response or answer produced by the machine-learned model. For instance, in some example implementations, the model can generate a response, such as a block of text or passage of text, including the plurality of output values. The output value can be any suitable output value, such as, for example, a text output value including text data, such as a word, phrase, sentence, letter, or other suitable delineation of text, a numerical value (e.g., an integer value, a floating point value, etc.), such as a pixel value of image data and / or an audio channel value of audio data, and / or any other suitable value. In some implementations, for example, the output value can be a token (e.g., a text data token).

[0178] In some implementations, the output can be obtained from a machine-learned model. The machine-learned model can be any suitable model. For instance, in some implementations, an agent system (e.g., an “artificial intelligence agent” or “AI agent”) can employ one or more machine-learned models to generate outputs responsive to queries from users. As one example, an agent system can be or can include a computing system including one or more machine-learned models, where the computing system is configured to receive an input from a user device or calling device and provide an output including a plurality of output values that are responsive to the input to the user device or calling device. As one example, the plurality of output values can be or can include text data such as words, and / or the plurality of output values can collectively form a passage or block of text that is displayed to the user. The plurality of output values can, for instance, collectively describe an answer or response to the user’s input. Furthermore, in some implementations, the plurality of output values can be or can include (e.g., or can be converted to) spoken language data such as audio data that can be spoken to the user. In some implementations, the machine-learned model(s) can output values, such as confidence scores or certainty scores, tonality scores, and / or other suitable scores and / or values, which are respectively associated with the plurality of output values.

[0179] In some implementations, the agent system can be or can implement a multimodal agent (e.g., a multi-modal artificial intelligence agent). For instance, a multi-modal agent can process inputs from one or more data modalities. In some implementations, the agent system can be implemented as a “situated agent”. The term situated agent refers to a setting in which the agent system shares one or more perceptual inputs with a human user. For example, the situated agent can receive and process various data inputs, including video, audio, and / or textual data which are also observable by the human user. The agent system can process these inputs to generate responses that are contextually relevant for the user's physical or digital environment, for example enabling the agent system to generate dialogue or other responses or outputs which assist the user in understanding and / or navigating the environment.

[0180] At 504, the method 500 can include obtaining (e.g., by the computing system) a metric value. The metric value can be associated with an output value of the output. For instance, in some implementations, the systems and methods described herein can obtain a plurality of metric values respective to a plurality of output values in an output. For instance, in some implementations, the plurality of metric values are respective to a plurality of output values in a one-to-one correspondence. As another example, in some implementations, a plurality of output values can be associated with a single rendering value and / or a plurality of rendering values can be associated with a single output value. The metric value can be any suitable value, such as a scalar value, a numerical value, a floating-point value, a classification value (e.g., a discrete classification or an ordered classification along a range of classification values), or other suitable values. In some implementations, the metric value can be within a range of possible metric values. For example, in some implementations, the metric value can be between a minimum metric value and a maximum metric value.

[0181] In some implementations, the metric values can be generated by the machine-learned model. For instance, the metric values can be included in the output of the machine-learned model. Additionally and / or alternatively, in some implementations, the metric values can be generated by one or more machine-learned models other than the model, such as, for example, adversarial models, discriminator models, classifier models, or other suitable models.

[0182] As one example, in some implementations, a metric value can be a certainty score or confidence score associated with a respective output value. The certainty score can be representative of a certainty of the model associated with the output value, such as, for example, a certainty that the output value is correct, a certainty that the output value will be included in the final output, a certainty that the output value is a best choice among a plurality of possible choices, or other certainty that is generally reflective of a confidence in the output value as output. For example, in some implementations, the model is configured to select each output value from a plurality of candidate output values based on a probability distribution associated with the plurality of candidate output values. For instance, the model can select an output value having a highest probability. The certainty score can be determined based on the probability distribution associated with the plurality of candidate output values. For example, in some implementations, the certainty score can be the probability of the probability distribution that is associated with the selected output value from the plurality of candidate output values. Output values with higher probabilities can therefore have a higher associated metric value and / or output values with lower probabilities can therefore have a lower associated metric value, for example. Still further, in some implementations, a model other than the model (e.g., an adversarial model) may generate the certainty score.
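As a non-limiting illustration of the preceding paragraph, the following minimal sketch selects the highest-probability candidate and reports that probability as the certainty score. The use of a softmax over raw scores, and the names `softmax` and `select_with_certainty`, are assumptions made for illustration only; the disclosure requires only a probability distribution over candidate output values.

```python
import math


def softmax(logits):
    """Convert raw candidate scores into a probability distribution (assumed step)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_with_certainty(candidates, logits):
    """Return the highest-probability candidate and its probability.

    The returned probability plays the role of the certainty score.
    """
    probs = softmax(logits)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs[best]


# Hypothetical candidates and scores for a single output value position.
value, certainty = select_with_certainty(["cat", "dog", "car"], [2.0, 1.0, 0.1])
```

In this sketch a candidate that dominates the distribution yields a certainty near 1, while near-ties yield a lower certainty, which is the property the formatting steps rely on.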

[0183] As another example, in some implementations, a metric value can be a tonality score associated with the output value. The tonality score can be representative of a tone of voice or tone characteristic associated with the output value, such as, for example, a degree to which the output value contributes to a tone of a passage of text including the output value. The tone can be a classification value or other attribute that is defined or learned based on human interaction such as, for example, a serious tone, a playful tone, a lighthearted tone, a professional-etiquette tone, a sorrowful tone, and / or other suitable tones. The tonality score may be output by the model. For instance, the model may be trained to associate output values or combinations of output values with particular tonalities during and / or after prediction of the output values themselves. As another example, in some implementations, a model other than the model (e.g., a classifier model or a discriminator model) may be configured to generate the tonality score.

[0184] In some implementations, the machine-learned model can be a block generation model. As used herein, a block generation model refers to a model configured to produce the output in “blocks,” each block including a plurality of output values, over a plurality of output cycles. For instance, the output of the block generation model at each output cycle of the plurality of output cycles can include a plurality of output values (e.g., and / or a plurality of metric values respective to the plurality of output values). For instance, in some implementations, the plurality of output values can be produced “simultaneously,” e.g., wherein each output value is output at a same iteration or output cycle of the block generation model. For instance, an output cycle can be an initial generation cycle (e.g., where initial values of the plurality of output values are generated) and / or an update cycle (e.g., where values of the plurality of output values are refined based on information available at a previous cycle). One example block generation model is a diffusion model, such as a text diffusion model. In some implementations, the block generation model may be autoregressive (e.g., a block autoregressive generation model). For instance, an output of the block autoregressive generation model at a first output cycle may depend on the output of the block autoregressive generation model at a second output cycle that precedes the first output cycle.

[0185] In some implementations, the model may be designed or configured to operate over “blocks” of output values (e.g., text output values) having a length, such as a fixed length. For instance, in some example implementations, a text diffusion model can diffuse or denoise over a fixed number of output values. However, responses produced by the model may not necessarily always require the fixed number of output values to appropriately answer the user’s input. According to example aspects of the present disclosure, a model may produce a fixed-length block of output values, where the length defines a number of output value positions that correspond to output values in the output of the model. The model may further be operable to predict an end of sequence (EOS) output value at each output value position. Output values occurring after the position of the EOS output value can be, for example, a null output value, an empty output value, or another suitable “no-information” output value (e.g., whitespace). As one example, the model can predict a maximum probability associated with the null output value at each output value position after the EOS output value.
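As a non-limiting illustration of the fixed-length block handling described above, the following minimal sketch treats every position at or after the first EOS output value as carrying no information. The marker strings `EOS` and `NULL` and the function names are hypothetical, and replacing the EOS position itself with the null value is an assumption made for display purposes.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker
NULL = ""      # hypothetical "no-information" output value


def visible_values(block):
    """Return the output values that precede the first EOS value."""
    if EOS in block:
        return block[:block.index(EOS)]
    return list(block)


def pad_after_eos(block):
    """Keep the block length fixed, replacing EOS and later positions with NULL."""
    kept = visible_values(block)
    return kept + [NULL] * (len(block) - len(kept))


# A fixed-length block of five positions whose response needs only two.
block = ["Hi", "there", EOS, "x", "y"]
padded = pad_after_eos(block)
```

Keeping the padded block at its original length preserves the one-to-one mapping between output value positions and interface positions across output cycles.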

[0186] In some implementations, the length of the block of output values may be fixed for each evaluation of the model. For instance, in some implementations, the length may be set at a value that is high enough to cover most or all use cases of the model, such as one thousand output values. As another example, in some implementations, the model and / or another model may predict the length of the block of output values based on the user’s input, then generate a block of output values having the predicted length. For example, the model may be trained to predict an approximate number of output values that the response to the user’s query will include and may determine the length based on the predicted approximate number of output values.

[0187] As another example, in some implementations, the machine-learned model can be or can include a sequence processing model. The sequence processing model can be configured to individually generate a sequence of output values (e.g., and / or respective metric values). For instance, the output at each iteration or output cycle can be or can include the plurality (e.g., sequence) of output values. Furthermore, in some implementations, each sequence of output values can be associated with a respective set (e.g., sequence) of metric values. Examples of a sequence processing model include, but are not limited to, a large language model (LLM) or a large multimodal model (LMM). For instance, the sequence processing model may generate output values in a sequential manner. The sequence processing model may not provide a fixed-length block of output values but rather generate each output value as evaluation of the model progresses. The EOS output value may signal the model to cease generation of output values. The model’s output may have fewer output values than a maximum length, for instance. Iterations of the sequence processing model (e.g., output cycles) can refer to multiple “passes” over the sequence of output values, where output values in the sequence from one output cycle and / or their respective metric values are provided as input in a subsequent output cycle to improve the model’s understanding about the entire sequence of output values at the subsequent output cycle.

[0188] At 506, the method 500 can include determining (e.g., by the computing system) a formatting characteristic value associated with each output value based on the metric value respective to the output value. The formatting characteristic value can be a value that is interpretable by a computing system (e.g., a graphical user interface) to convey information to the computing system regarding a formatting characteristic with which the output value is to be rendered. For example, formatting characteristic values can provide indications regarding text font, text boldness or italics, color, shading, position, size, and other aspects of rendered interface elements corresponding to the output values.

[0189] In some implementations, the formatting characteristic can be a gradient formatting characteristic that defines a gradient characteristic of the formatted output values. For instance, in some implementations, the formatting characteristic value can be within a range of possible formatting characteristic values. The gradient characteristic of the formatting output values can vary over the range of formatting characteristic values. For example, the formatting characteristic may be displayed more prominently for formatting characteristic values toward one end of the range than those toward another end of the range. For instance, in some implementations, the formatting characteristic value can be between a minimum formatting characteristic value and a maximum formatting characteristic value. Furthermore, in some implementations, a scale of the formatting characteristic value between the minimum formatting characteristic value and the maximum formatting characteristic value can correspond to a scale of the metric value between the minimum metric value and the maximum metric value. For instance, in some implementations, the formatting characteristic value can be a scalar value or other numerical value, such as a value correlated on a range of possible formatting characteristic values to the metric value along its own range of possible metric values. For example, a ratio of metric value to minimum metric value or maximum metric value can be equivalent to a ratio of formatting characteristic value to minimum formatting characteristic value or maximum formatting characteristic value. 
As one example, in some implementations, determining the formatting characteristic value based on the metric value can include determining a ratio based on the metric value and at least one of the minimum metric value and the maximum metric value and determining the formatting characteristic value based on a corresponding ratio between the formatting characteristic value and at least one of the minimum formatting characteristic value and the maximum formatting characteristic value.
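As a non-limiting illustration of the ratio-based determination described above, the following minimal sketch maps a metric value onto a formatting characteristic value by preserving its relative position between the respective minimum and maximum. The linear form, the clamping of out-of-range metric values, and the function name `metric_to_format` are assumptions made for illustration; other monotonic mappings could be used.

```python
def metric_to_format(metric, metric_min, metric_max, fmt_min, fmt_max):
    """Map a metric value to a formatting characteristic value by equal ratios.

    The ratio (metric - metric_min) / (metric_max - metric_min) is reused as
    the position of the result between fmt_min and fmt_max.
    """
    if metric_max <= metric_min:
        raise ValueError("metric_max must be greater than metric_min")
    ratio = (metric - metric_min) / (metric_max - metric_min)
    ratio = min(1.0, max(0.0, ratio))  # assumed: clamp out-of-range metrics
    return fmt_min + ratio * (fmt_max - fmt_min)


# A certainty of 0.5 on [0, 1] maps to the midpoint of an 8-bit intensity range.
midpoint = metric_to_format(0.5, 0.0, 1.0, 0.0, 255.0)
```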

[0190] For example, in some implementations, the formatting characteristic value can be a shading value. The shading value can instruct the computing system on a shade or degree of shading to render the output value within a user interface. The shading can represent, for example, how intensities of one or more color channels associated with rendering the output value are adjusted. For example, at a maximum shading value, the color channels may not be adjusted such that the output value is rendered with a maximum shade or a maximum brightness or (e.g., exactly) as the color channels define. As another example, at a minimum shading value, the color channels may be adjusted such that the output value is not rendered at all, or rendered with some minimum shade or minimum brightness such that the output value is significantly less perceptible than output values rendered according to the maximum shading value. For example, the rendered output value element can range from being near or identical to a background color of the user interface at minimum shading values to fully saturated at a target color (e.g., a specified font color) at maximum shading values. It should be understood that “minimum” and “maximum” are used herein to denote endpoints of a range of possible values for the purposes of illustration only, and similar endpoints or effective endpoints can fall within the use of “minimum” and “maximum” herein.
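As a non-limiting illustration of a shading value, the following minimal sketch interpolates each color channel between a background color (minimum shading) and a target font color (maximum shading). The shading range of 0 to 1, the per-channel linear interpolation, and the names `shade_color` and `to_hex` are assumptions made for illustration.

```python
def shade_color(shading, target_rgb, background_rgb=(255, 255, 255)):
    """Blend from the background color (shading 0) to the target color (shading 1)."""
    s = min(1.0, max(0.0, shading))  # assumed shading range of [0, 1]
    return tuple(
        round(b + s * (t - b)) for t, b in zip(target_rgb, background_rgb)
    )


def to_hex(rgb):
    """Format an (r, g, b) tuple as a CSS-style hexadecimal color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Minimum shading is indistinguishable from a white background;
# maximum shading renders exactly the specified font color.
faint = shade_color(0.0, (0, 0, 0))
full = shade_color(1.0, (0, 0, 0))
```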

[0191] As another example, in some implementations, the formatting characteristic value can be a color value. The color value can instruct the computing system on a color to render the output value within a user interface. For example, the color value may be a named color and / or pixel intensity values descriptive of the desired color. In some implementations, for example, colors may be associated with respective tonalities of text output values. For example, a color such as red may be associated with text conveying an authoritative tone, whereas a color such as orange may be associated with text conveying a happy tone, or any of a number of different color-tone associations. Furthermore, in some implementations, the color value may be combined with the aforementioned shading value to indicate a degree to which a given output value conveys the tonality. For example, a darker red output value may suggest more authoritative text than a lighter red output value. In this manner, the user can rapidly understand a variety of information about a given output value without being required to shift the user’s gaze to other user interface elements to ascertain the information.
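As a non-limiting illustration of combining a tone-dependent color with a degree of tonality, the following minimal sketch emits a CSS color declaration whose hue follows the tone and whose opacity follows the tonality score. The specific RGB triples, the use of the CSS alpha channel as the shading mechanism, and the names `TONE_COLORS` and `tone_style` are assumptions made for illustration.

```python
# Illustrative color-tone associations; any other mapping could be substituted.
TONE_COLORS = {
    "authoritative": (200, 0, 0),  # red
    "happy": (255, 140, 0),        # orange
}
DEFAULT_COLOR = (0, 0, 0)  # assumed fallback for tones without an association


def tone_style(tone, tonality_score):
    """Return a CSS declaration: hue from the tone, opacity from the score."""
    r, g, b = TONE_COLORS.get(tone, DEFAULT_COLOR)
    alpha = min(1.0, max(0.0, tonality_score))  # assumed score range of [0, 1]
    return f"color: rgba({r}, {g}, {b}, {alpha:.2f})"


strong = tone_style("authoritative", 0.9)
weak = tone_style("authoritative", 0.3)
```

On a light background, the higher-alpha declaration renders as a darker red than the lower-alpha one, matching the darker-red example in the text.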

[0192] At 508, the method 500 can include providing (e.g., by the computing system) the first formatting characteristic values for use in a first user interface. For instance, the computing system can generate a user interface having the output value(s) of the output arranged according to a plurality of output value positions and formatted according to the respective formatting characteristic value(s). For instance, the user interface can define output value positions that are configured to receive, store, and / or otherwise accept values or data from the output value(s) of the output. As one example, the output value positions may receive characters or words defined by the output value(s). As another example, the output value positions may be placeholder elements in the user interface. For instance, in some implementations, generating the user interface can include replacing the placeholder elements with values defined by the output value(s) of the output. In some implementations, generating the user interface can include generating data that is implementable by a computing system to cause display of the user interface. Such data can include, for example, Extensible Markup Language (XML) data, Hypertext Markup Language (HTML) data, text document data, slide show data, spreadsheet data, and / or any other suitable data. For example, in some implementations, the output value position can define a text element in a larger webpage or document, where the text element displays an output value and is formatted according to a formatting characteristic based on the formatting characteristic value respective to the output value.
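As a non-limiting illustration of generating HTML data for the user interface, the following minimal sketch emits one text element per output value position and formats it according to its shading value. The use of `span` elements, the `data-pos` attribute, the CSS `opacity` property as the shading mechanism, and the function name `render_html` are assumptions made for illustration.

```python
from html import escape


def render_html(values, shadings):
    """Build an HTML paragraph with one formatted span per output value position."""
    spans = []
    for position, (value, shading) in enumerate(zip(values, shadings)):
        opacity = min(1.0, max(0.0, shading))  # assumed shading range of [0, 1]
        spans.append(
            f'<span data-pos="{position}" style="opacity: {opacity:.2f}">'
            f"{escape(value)}</span>"
        )
    return "<p>" + " ".join(spans) + "</p>"


# Two output values: the first rendered confidently, the second faintly.
markup = render_html(["Hello", "world"], [1.0, 0.25])
```

Keying each element by position lets a later output cycle replace the value and formatting at a given position without rebuilding the rest of the page.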

[0193] In some implementations, the systems and methods described herein can provide for rendering the output value as well as one or more alternate output values that represent an alternative to the output value in the output of the machine-learned model. For instance, the alternate output values may be displayed in response to a user gesture (e.g., by hovering over the output value), as alternatives (e.g., in a stacked arrangement or similar adjacent arrangement), or in another suitable manner. For example, in some implementations, the formatting characteristic can be a size of the output values, and the metric value respective to each output value can represent a certainty score associated with each output value. In this manner, a more certain output value can be displayed at a larger size and adjacent to its alternatives, which are displayed with a smaller size. In some implementations, the systems and methods described herein may determine to render the alternate output values based on some ratio or relationship between the respective metric values of the output value and the alternate output value. For example, a computing system may determine not to render alternate output values for output values having a great disparity in metric value (e.g., certainty), such as in cases where these output values are highly separated from the next best alternative output value. As another example, the computing system may determine to render alternate output values for output values that are close in metric value to at least one alternate output value. For example, in some implementations, determining to render an alternate output value can include determining a metric value ratio based on the metric value associated with the output value and the metric value associated with the alternate output value and determining to render the alternate output value if the metric value ratio satisfies a metric value ratio range or threshold.
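As a non-limiting illustration of the metric value ratio test described above, the following minimal sketch renders an alternate output value only when its metric value is close enough to that of the selected output value. The direction of the ratio (alternate divided by selected), the default threshold of 0.5, and the function name `should_render_alternate` are assumptions made for illustration.

```python
def should_render_alternate(primary_metric, alternate_metric, min_ratio=0.5):
    """Decide whether an alternate output value is close enough to display.

    The ratio alternate_metric / primary_metric is compared to min_ratio;
    a small ratio means the selected value is well separated from the
    alternate, so the alternate is not rendered.
    """
    if primary_metric <= 0:
        # Assumed edge case: with no certainty in the selected value,
        # always surface the alternate.
        return True
    return alternate_metric / primary_metric >= min_ratio


# A dominant selection hides its alternate; a near-tie shows it.
hide = should_render_alternate(0.90, 0.05)
show = should_render_alternate(0.50, 0.40)
```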

[0194] In some implementations, providing the first formatting characteristic values for use in a user interface can include causing display of the user interface by the computing system. As one example, in some implementations, causing display of the user interface by the computing system can include communicating the user interface (e.g., data descriptive of the user interface) to a display device. The display device may be provided at the computing system and / or an external computing system. For example, in some implementations, communicating the user interface can include communicating the user interface over one or more communication networks.

[0195] Furthermore, in some implementations, causing display of the user interface can provide for rendering one or more user interface elements via the display device based on the user interface. The display device can be configured to render and / or display the user interface (e.g., to a user or other observer). For instance, the display device can control one or more display elements (e.g., pixels) based on the user interface including the rendered output value element such that the user interface is presented to the user. Any suitable method of displaying the user interface can be employed in accordance with the present disclosure. As one example, a graphics processing unit (GPU) or similar computing device can compute a matrix of pixel intensities that represent the user interface and provide the matrix of pixel intensities to a display (e.g., a screen or monitor) configured to interpret the matrix of pixel intensities and light up corresponding pixels in the display in accordance with the matrix of pixel intensities.

[0196] The one or more user interface elements can include one or more rendered output value elements based on the output value and the formatting characteristic values. For instance, the output value can be rendered on the display device according to the formatting characteristic value. As one example, in some implementations, the output value can be rendered on a user interface with a shading based on the formatting characteristic value. For instance, pixel intensity values of the portion of the user interface including the output value can be determined such that the shading of the output value (e.g., relative to other rendered output values or elements) reflects the metric value output by the machine-learned model.

[0197] Another example aspect of the present disclosure can provide for a user to modify or alter an ongoing emerging output during its generation. For instance, a user can modify inputs or other state data of the machine-learned model during generation of the emerging output. In this manner, the user may be provided with the capability to alter the trajectory of the emerging output for any number of reasons. As one example, the output of the machine-learned model can be associated with a first output cycle of a plurality of output cycles. Furthermore, for the purposes of convention, the aforementioned output value(s) can be referred to as first output value(s), the aforementioned metric value(s) can be referred to as first metric value(s), and the aforementioned formatting characteristic value(s) can be referred to as first formatting characteristic value(s).

[0198] The systems and methods according to example aspects of the present disclosure can further provide for detecting (e.g., by the computing system) a modify interaction with the user interface by the user during the first output cycle. The modify interaction can be descriptive of a selection of some or all of the first output value(s) and a contextual aspect to be associated with the selected first output value(s). The modify interaction can be any of a variety of suitable interactions or combinations thereof that can be interpreted as indicative of a desire of the user to modify the generation of the emerging output and / or selecting one or more output values. As one example, the modify interaction can include a spoken interaction, such as the user speaking a phrase (e.g., captured by a microphone or similar voice input device) indicating that the user wishes to modify the generation of the emerging output. As another example, the user can tap, touch, highlight, or otherwise provide tangible feedback to select the first output value (e.g., and / or one or more additional selected output values).

[0199] In addition to the selection of the output value(s), the modify interaction can include or otherwise be descriptive of a contextual aspect. The contextual aspect can indicate some context regarding the user’s selection of the output value(s). For example, in some implementations, the contextual aspect can be or can include an indication by the user of whether the selected output value is correct or incorrect, good or bad, or some other contrasting position indicating the user’s approval or desire to include the output value in future output of the machine-learned model. The contextual aspect can be obtained by capturing user interactions, such as clicking, tapping, or otherwise interacting with interface elements that indicate the user’s position with respect to the output value. As another example, the contextual aspect can be a phrase obtained from the user, such as a spoken phrase from the user and / or a text phrase input by the user (e.g., into a text field element of the user interface). As yet another example, the contextual aspect can be or can include the user indicating desired values for given output values. For example, the user can “fill in” certain output values or groups of output values exactly as the user desires, and the model can generate other output values to complement the user-provided output values.

[0200] For instance, the computing system can provide contextual input to the machine-learned model at a second output cycle of the plurality of output cycles to generate the second output. The contextual input can be based on the contextual aspect associated with the first output value(s). For example, in some implementations, the contextual input can be equivalent to the contextual aspect. Additionally and / or alternatively, in some implementations, the contextual input can be represented in a format that may be understood by the machine-learned model. For instance, in some implementations, providing the contextual input can include determining (e.g., by the computing system) a guidance message based on the contextual input. The guidance message can include instructions for the machine-learned model responsive to the contextual aspect associated with the first output value(s). Providing the contextual input can further include providing the guidance message as the contextual input to the machine-learned model at the second output cycle. For example, if the contextual aspect provided by the user is an indication that the user dislikes a particular output value (e.g., an output value at a particular position within a block of text) such as that obtained by the user highlighting the output value and interacting with a “thumbs down” user interface element, the contextual input may be a phrase that would be more readily interpretable by the machine-learned model, such as a phrase indicating the user’s disapproval and identifying which output value(s) the user has disapproved of.
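As a non-limiting illustration of determining a guidance message, the following minimal sketch converts a selection and an approval or disapproval indication into a text phrase that identifies the affected output values. The exact wording of the message, the representation of a selection as (position, value) pairs, and the function name `build_guidance_message` are assumptions made for illustration.

```python
def build_guidance_message(selections, approved):
    """Turn a user's selection and indication into a text instruction.

    selections: list of (position, value) pairs chosen by the user.
    approved: True for a positive indication, False for a negative one.
    """
    if not selections:
        raise ValueError("at least one output value must be selected")
    described = ", ".join(
        f"'{value}' at position {pos}" for pos, value in selections
    )
    if approved:
        return f"The user approves of {described}. Keep these values."
    return f"The user disapproves of {described}. Choose different values there."


# The user highlighted one word and pressed a "thumbs down" element.
message = build_guidance_message([(3, "quickly")], approved=False)
```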

[0201] In some implementations, the systems and methods herein can provide the first output (e.g., the first output value(s) and the first metric value(s)) as input to the machine-learned model at the second output cycle. For instance, the model can consume outputs from a prior output cycle (e.g., in addition to and / or alternatively to additional contextual input from a user) when computing the next output cycle to iteratively refine the output over subsequent processing cycles.

[0202] Furthermore, in some implementations, providing the contextual input to the machine-learned model can include determining (e.g., by the computing system) modified metric value(s) associated with the first output value(s) associated with the contextual aspect. The modified metric value(s) can be provided in place of the first metric value(s) respective to the selected first output value(s) as input to the machine-learned model at the second output cycle. As one example, in some implementations, the contextual input can be one of a positive indication or a negative indication with respect to the selected first output value(s). Determining the modified metric value(s) can include one of increasing or decreasing the first metric value(s) respective to the selected first output value(s) in response to the positive indication or the negative indication. For example, in some implementations, the system can “force” the metric value associated with a given output value to be different (e.g., higher or lower) when used as input for a subsequent computing cycle than the original metric value (e.g., a value that was originally output by the model). For example, if the metric value is a certainty score, the system may “force” a high certainty or a maximum certainty for output values that the user indicates approval of or “force” a low certainty or a minimum certainty for output values that the user indicates disapproval of.
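As a non-limiting illustration of forcing metric values, the following minimal sketch replaces the certainty scores at the user-selected positions with the maximum value for a positive indication or the minimum value for a negative indication, leaving all other positions unchanged. Forcing to the exact endpoints (rather than merely increasing or decreasing), the default range of 0 to 1, and the function name `apply_feedback` are assumptions made for illustration.

```python
def apply_feedback(metrics, selected_positions, positive,
                   metric_min=0.0, metric_max=1.0):
    """Return modified metric values for use as input to the next output cycle.

    Selected positions are forced to metric_max on a positive indication
    and to metric_min on a negative indication.
    """
    forced = metric_max if positive else metric_min
    return [
        forced if index in selected_positions else metric
        for index, metric in enumerate(metrics)
    ]


# The user disapproved of the value at position 1.
first_metrics = [0.4, 0.7, 0.2]
modified = apply_feedback(first_metrics, {1}, positive=False)
```

The function returns a new list, so the first metric values remain available for rendering the first user interface.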

[0203] At 510, the method 500 can include obtaining a second output from the machine-learned model. The second output can be or can include one or more second output values. Furthermore, at 512, the method 500 can include obtaining second metric value(s) associated with the second output value(s). For instance, in some implementations, the second output can be or can include the second metric value(s). Additionally and / or alternatively, in some implementations, one or more machine-learned models other than the model can generate the second metric values.

[0204] The second output can be associated with the second output cycle. For instance, the second output can include output values output during a second “pass” through a text passage by a sequence processing model. As another example, the second output may be a second block of output values (e.g., text) output by a block generation model. A second output value can occur in place of a respective first output value from the first processing cycle. For example, a second output value may be generated to occupy a same output value position as a respective first output value. For example, if the output is or includes a block of text, a second output value may occur at a same position within the block of text as a respective first output value. Generally, a second output value may have an approximately similar meaning to a respective first output value, although the second output value may not necessarily have a similar meaning to the first output value in every instance. For instance, in some implementations, the model may refine its choice of output value in the output value position of the first and second output values over subsequent output cycles, such that the second output value may have an increased certainty score and / or be an improvement to some aspect of the set of output values, such as an improvement to accuracy, coherency, consistency, grammatical correctness, or other aspect of the set of output values.

[0205] At 514, the method 500 can include determining (e.g., by the computing system) a second formatting characteristic value associated with the second output value(s) based on the second metric value(s). As described above with respect to the first output value(s), for instance, the second formatting characteristic value(s) may correspond along a range with respective second metric value(s). The second formatting characteristic value(s) may be or may include, for example, shading value(s).

[0206] At 516, the method 500 can include providing the second formatting characteristic values for use in updating the first user interface to a second user interface having the plurality of second output values formatted according to the plurality of second formatting characteristic values. For instance, the computing system can generate a second user interface having the plurality of second output values arranged in place of the plurality of first output values according to the plurality of output value positions and formatted according to the plurality of second formatting characteristic values. For instance, if as described above a second metric value (e.g., a certainty score) has increased relative to a respective first metric value due to an improvement in the second output relative to the first output, the second output value may be provided differently from the first output value in the second user interface (e.g., based on the second formatting characteristic value(s)) in such a manner that indicates that the second output value is an improved output value from the first output value. For example, if a metric value is a certainty score and a formatting characteristic value is a shading value, the second output value may have a darker shading than the first output value to indicate that the model is more certain about the second output value. It should be understood that some or all of the second output value(s) may not necessarily differ from respective first output value(s) across all output cycles. For instance, in some implementations, especially as the model converges to a final output, the output values may not be substantially modified across subsequent output cycles. The second output value can then be rendered in place of the first output value in the user interface and / or according to the second formatting characteristic value.
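As a minimal sketch of the update from the first output cycle to the second, the following assumes certainty scores in the range [0, 1] and an 8-bit gray level as the shading value. The function names, the gray-level range, and the sample values are hypothetical and are not taken from the disclosure.

```python
def certainty_to_shade(certainty: float) -> int:
    """Map a certainty score in [0, 1] to a gray level (0 = darkest, 200 = lightest).

    Assumption: higher certainty is shown as darker text.
    """
    clamped = min(1.0, max(0.0, certainty))
    lightest, darkest = 200, 0
    return round(lightest + (darkest - lightest) * clamped)


def format_cycle(values, certainties):
    """Pair each output value with its shading value, keeping output positions."""
    return [(value, certainty_to_shade(c)) for value, c in zip(values, certainties)]


# First and second output cycles over the same three output value positions.
first = format_cycle(["teh", "cat", "sat"], [0.2, 0.6, 0.5])
second = format_cycle(["the", "cat", "sat"], [0.9, 0.8, 0.5])

# Positions whose value or shading changed and would need to be re-rendered.
changed = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
```

Here the third position is unchanged between cycles, so only the first two positions would need to be redrawn.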

[0207] The systems and methods described herein can further provide for causing display of the second user interface (e.g., by the computing system). For instance, the computing system can communicate the second user interface to the display device. For example, when the second user interface is updated to include the second output value(s) in place of the first output value(s), the display device can modify the interface as displayed to the user such that the displayed interface reflects the second output value(s) and second formatting characteristic value(s). As one example, a GPU may compute a second matrix of pixel intensities that represents the user interface including the rendered second output value and provide the second matrix of pixel intensities to the display device.

[0208] FIG. 6 depicts a flowchart of a method 1000 for training one or more machine-learned models according to aspects of the present disclosure. For instance, an example machine-learned model can include a sequence processing model.

[0209] One or more portion(s) of example method 1000 can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of example method 1000 can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of example method 1000 can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. FIG. 6 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. FIG. 6 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of example method 1000 can be performed additionally, or alternatively, by other systems.

[0210] At 1002, example method 1000 can include obtaining a training instance. A set of training data can include a plurality of training instances divided between multiple datasets (e.g., a training dataset, a validation dataset, or testing dataset). A training instance can be labeled or unlabeled. Although referred to in example method 1000 as a “training” instance, it is to be understood that runtime inferences can form training instances when a model is trained using an evaluation of the model’s performance on that runtime instance (e.g., online training / learning). Example data types for the training instance and various tasks associated therewith are described throughout the present disclosure.

[0211] At 1004, example method 1000 can include processing, using one or more machine-learned models, the training instance to generate an output. The output can be directly obtained from the one or more machine-learned models or can be a downstream result of a chain of processing operations that includes an output of the one or more machine-learned models.

[0212] At 1006, example method 1000 can include receiving an evaluation signal associated with the output. The evaluation signal can be obtained using a loss function. Various determinations of loss can be used, such as mean squared error, likelihood loss, cross entropy loss, hinge loss, contrastive loss, or various other loss functions. The evaluation signal can be computed using known ground-truth labels (e.g., supervised learning), predicted or estimated labels (e.g., semi- or self-supervised learning), or without labels (e.g., unsupervised learning). The evaluation signal can be a reward (e.g., for reinforcement learning). The reward can be computed using a machine-learned reward model configured to generate rewards based on output(s) received. The reward can be computed using feedback data describing human feedback on the output(s).

[0213] At 1008, example method 1000 can include updating the machine-learned model using the evaluation signal. For example, values for parameters of the machine-learned model(s) can be learned, in some embodiments, using various training or learning techniques, such as, for example, backwards propagation. For example, the evaluation signal can be backpropagated from the output (or another source of the evaluation signal) through the machine-learned model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the evaluation signal with respect to the parameter value(s)). For example, system(s) containing one or more machine-learned models can be trained in an end-to-end manner. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations. In some implementations, performing backwards propagation of errors can include performing truncated backpropagation through time. Example method 1000 can include implementing a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
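The loop formed by 1002 through 1008 can be illustrated with a deliberately small sketch. It assumes a one-feature linear model, a squared-error loss, and plain stochastic gradient descent with an optional weight-decay term; the model, the data, and all names are hypothetical stand-ins chosen only to show the obtain, process, evaluate, and update steps.

```python
def train(instances, lr=0.05, weight_decay=0.0, epochs=500):
    """Fit pred = w * x + b to (x, y) pairs by stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in instances:            # 1002: obtain a training instance
            pred = w * x + b              # 1004: process it to generate an output
            err = pred - y                # 1006: evaluation signal, d/dpred of 0.5 * (pred - y)**2
            grad_w = err * x + weight_decay * w   # chain rule back to each parameter
            grad_b = err
            w -= lr * grad_w              # 1008: update parameters along the negative gradient
            b -= lr * grad_b
    return w, b


# Labeled instances generated from y = 2x + 1 (supervised learning).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(data)
```

With the weight-decay term left at zero, the learned parameters approach the generating values of 2 and 1; a nonzero term pulls `w` toward zero as a simple generalization technique.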

[0214] In some implementations, example method 1000 can be implemented for training a machine-learned model from an initialized state to a fully trained state (e.g., when the model exhibits a desired performance profile, such as based on accuracy, precision, recall, etc.).

[0215] In some implementations, example method 1000 can be implemented for particular stages of a training procedure. For instance, in some implementations, example method 1000 can be implemented for pre-training a machine-learned model. Pre-training can include, for instance, large-scale training over potentially noisy data to achieve a broad base of performance levels across a variety of tasks / data types. In some implementations, example method 1000 can be implemented for fine-tuning a machine-learned model. Fine-tuning can include, for instance, smaller-scale training on higher-quality (e.g., labeled, curated, etc.) data. Fine-tuning can affect all or a portion of the parameters of a machine-learned model. For example, various portions of the machine-learned model can be “frozen” for certain training stages. For example, parameters associated with an embedding space can be “frozen” during fine-tuning (e.g., to retain information learned from a broader domain(s) than present in the fine-tuning dataset(s)). An example fine-tuning approach includes reinforcement learning. Reinforcement learning can be based on user feedback on model performance during use.

Example Machine-learned Models

[0216] FIG. 7 is a block diagram of an example processing flow for using machine-learned model(s) 1 to process input(s) 2 to generate output(s) 3.

[0217] Machine-learned model(s) 1 can be or include one or multiple machine-learned models or model components. Example machine-learned models can include neural networks (e.g., deep neural networks). Example machine-learned models can include nonlinear models or linear models. Example machine-learned models can use other architectures in lieu of or in addition to neural networks. Example machine-learned models can include decision tree based models, support vector machines, hidden Markov models, Bayesian networks, linear regression models, k-means clustering models, etc.

[0218] Example neural networks can include feed-forward neural networks, recurrent neural networks (RNNs), including long short-term memory (LSTM) based recurrent neural networks, convolutional neural networks (CNNs), diffusion models, generative-adversarial networks, or other forms of neural networks. Example neural networks can be deep neural networks. Some example machine-learned models can leverage an attention mechanism such as self-attention. For example, some example machine-learned models can include multiheaded self-attention models. For example, the machine-learned models can be or include transformer models.

[0219] Machine-learned model(s) 1 can include a single or multiple instances of the same model configured to operate on data from input(s) 2. Machine-learned model(s) 1 can include an ensemble of different models that can cooperatively interact to process data from input(s) 2. For example, machine-learned model(s) 1 can employ a mixture-of-experts structure. See, e.g., Zhou et al., Mixture-of-Experts with Expert Choice Routing, ARXIV:2202.09368v2 (Oct. 14, 2022).

[0220] Input(s) 2 can generally include or otherwise represent various types of data. Input(s) 2 can include one type or many different types of data. Output(s) 3 can be data of the same type(s) or of different types of data as compared to input(s) 2. Output(s) 3 can include one type or many different types of data.

[0221] Example data types for input(s) 2 or output(s) 3 include natural language text data, software code data (e.g., source code, object code, machine code, or any other form of computer-readable instructions or programming languages), machine code data (e.g., binary code, assembly code, or other forms of machine-readable instructions that can be executed directly by a computer's central processing unit), assembly code data (e.g., low-level programming languages that use symbolic representations of machine code instructions to program a processing unit), genetic data or other chemical or biochemical data, image data, audio data, audiovisual data, haptic data, biometric data, medical data, financial data, statistical data, geographical data, astronomical data, historical data, sensor data generally (e.g., digital or analog values, such as voltage or other absolute or relative level measurement values from a real or artificial input, such as from an audio sensor, light sensor, displacement sensor, etc.), and the like. Data can be raw or processed and can be in any format or schema.

[0222] In multimodal inputs 2 or outputs 3, example combinations of data types include image data and audio data, image data and natural language data, natural language data and software code data, image data and biometric data, sensor data and medical data, etc. It is to be understood that any combination of data types in an input 2 or an output 3 can be present.

[0223] An example input 2 can include one or multiple data types, such as the example data types noted above. An example output 3 can include one or multiple data types, such as the example data types noted above. The data type(s) of input 2 can be the same as or different from the data type(s) of output 3. It is to be understood that the example data types noted above are provided for illustrative purposes only. Data types contemplated within the scope of the present disclosure are not limited to those examples noted above.

Example Machine-Learned Sequence Processing Models

[0224] FIG. 8 is a block diagram of an example implementation of an example machine-learned model configured to process sequences of information. For instance, an example implementation of machine-learned model(s) 1 can include machine-learned sequence processing model(s) 4. An example system can pass input(s) 2 to sequence processing model(s) 4. Sequence processing model(s) 4 can include one or more machine-learned components. Sequence processing model(s) 4 can process the data from input(s) 2 to obtain an input sequence 5. Input sequence 5 can include one or more input elements 5-1, 5-2, . . . , 5-M, etc. obtained from input(s) 2. Sequence processing model 4 can process input sequence 5 using prediction layer(s) 6 to generate an output sequence 7. Output sequence 7 can include one or more output elements 7-1, 7-2, . . . , 7-N, etc. generated based on input sequence 5. The system can generate output(s) 3 based on output sequence 7.

[0225] Sequence processing model(s) 4 can include one or multiple machine-learned model components configured to ingest, generate, or otherwise reason over sequences of information. For example, some example sequence processing models in the text domain are referred to as “Large Language Models,” or LLMs. See, e.g., PaLM 2 Technical Report, GOOGLE, https://ai.google/static/documents/palm2techreport.pdf (n.d.). Other example sequence processing models can operate in other domains, such as image domains, see, e.g., Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ARXIV:2010.11929V2 (Jun. 3, 2021), audio domains, see, e.g., Agostinelli et al., MusicLM: Generating Music From Text, ARXIV:2301.11325V1 (Jan. 26, 2023), biochemical domains, see, e.g., Jumper et al., Highly accurate protein structure prediction with AlphaFold, 596 Nature 583 (Aug. 26, 2021), by way of example. Sequence processing model(s) 4 can process one or multiple types of data simultaneously. Sequence processing model(s) 4 can include relatively large models (e.g., more parameters, computationally expensive, etc.), relatively small models (e.g., fewer parameters, computationally lightweight, etc.), or both.

[0226] In general, sequence processing model(s) 4 can obtain input sequence 5 using data from input(s) 2. For instance, input sequence 5 can include a representation of data from input(s) 2 in a format understood by sequence processing model(s) 4. One or more machine-learned components of sequence processing model(s) 4 can ingest the data from input(s) 2, parse the data into pieces compatible with the processing architectures of sequence processing model(s) 4 (e.g., via “tokenization”), and project the pieces into an input space associated with prediction layer(s) 6 (e.g., via “embedding”).

[0227] Sequence processing model(s) 4 can ingest the data from input(s) 2 and parse the data into a sequence of elements to obtain input sequence 5. For example, a portion of input data from input(s) 2 can be broken down into pieces that collectively represent the content of the portion of the input data. The pieces can provide the elements of the sequence.

[0228] Elements 5-1, 5-2, . . . , 5-M can represent, in some cases, building blocks for capturing or expressing meaningful information in a particular data domain. For instance, the elements can describe “atomic units” across one or more domains. For example, for textual input source(s), the elements can correspond to groups of one or more words or sub-word components, such as sets of one or more characters.

[0229] For example, elements 5-1, 5-2, . . . , 5-M can represent tokens obtained using a tokenizer. For instance, a tokenizer can process a given portion of an input source and output a series of tokens (e.g., corresponding to input elements 5-1, 5-2, . . . , 5-M) that represent the portion of the input source. Various approaches to tokenization can be used. For instance, textual input source(s) can be tokenized using a byte-pair encoding (BPE) technique. See, e.g., Kudo et al., SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, PROCEEDINGS OF THE 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (System Demonstrations), pages 66-71 (October 31-November 4, 2018), https://aclanthology.org/D18-2012.pdf. Image-based input source(s) can be tokenized by extracting and serializing patches from an image. Other tokenization approaches can be performed as well, including linear projections, non-linear transformations, and / or other data transformations.
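To make the byte-pair encoding idea concrete, the following is a simplified character-level sketch: it repeatedly merges the most frequent adjacent symbol pair in a small word list, then applies the learned merges to a new word. It is not the SentencePiece implementation cited above, and the function names, tie-breaking rule, and sample corpus are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(corpus, num_merges=3)
tokens = tokenize("lowest", merges)
```

Concatenating the resulting tokens always reproduces the original word, which is the property that lets a detokenizer invert the process.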

[0230] In general, arbitrary data types can be serialized and processed into input sequence 5. It is to be understood that element(s) 5-1, 5-2, . . . , 5-M depicted in FIG. 8 can be the tokens or can be the embedded representations thereof.

[0231] Prediction layer(s) 6 can predict one or more output elements 7-1, 7-2, . . . , 7-N based on the input elements. Prediction layer(s) 6 can include one or more machine-learned model architectures, such as one or more layers of learned parameters that manipulate and transform the input(s) to extract higher-order meaning from, and relationships between, input element(s) 5-1, 5-2, . . . , 5-M. In this manner, for instance, example prediction layer(s) 6 can predict new output element(s) in view of the context provided by input sequence 5.

[0232] Prediction layer(s) 6 can evaluate associations between portions of input sequence 5 and a particular output element. These associations can inform a prediction of the likelihood that a particular output follows the input context. For example, consider the textual snippet, “The carpenter’s toolbox was small and heavy. It was full of ____.” Example prediction layer(s) 6 can identify that “It” refers back to “toolbox” by determining a relationship between the respective embeddings. Example prediction layer(s) 6 can also link “It” to the attributes of the toolbox, such as “small” and “heavy.” Based on these associations, prediction layer(s) 6 can, for instance, assign a higher probability to the word “nails” than to the word “sawdust.”

[0233] A transformer is an example architecture that can be used in prediction layer(s) 6. See, e.g., Vaswani et al., Attention Is All You Need, ARXIV:1706.03762v7 (Aug. 2, 2023). A transformer is an example of a machine-learned model architecture that uses an attention mechanism to compute associations between items within a context window. The context window can include a sequence that contains input sequence 5 and potentially one or more output element(s) 7-1, 7-2, . . . , 7-N. A transformer block can include one or more attention layer(s) and one or more post-attention layer(s) (e.g., feedforward layer(s), such as a multi-layer perceptron).
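One common form of attention is scaled dot-product attention, in which each query is compared against every key and the resulting weights mix the corresponding values. The sketch below is a single-head, pure-Python illustration under that assumption; the disclosure does not specify this exact formulation, and the vectors used are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q . k / sqrt(d)) weighted sum of values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # association strength between this query and each key
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Two items in the context window; the query aligns strongly with the first key.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = attention([[10.0, 0.0]], keys, values)
```

Because the query matches the first key far more closely than the second, nearly all of the attention weight falls on the first value.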

[0234] Prediction layer(s) 6 can include other machine-learned model architectures in addition to or in lieu of transformer-based architectures. For example, recurrent neural networks (RNNs) and long short-term memory (LSTM) models can also be used, as well as convolutional neural networks (CNNs). In general, prediction layer(s) 6 can leverage various kinds of artificial neural networks that can understand or generate sequences of information.

[0235] Output sequence 7 can include or otherwise represent the same or different data types as input sequence 5. For instance, input sequence 5 can represent textual data, and output sequence 7 can represent textual data. Input sequence 5 can represent image, audio, or audiovisual data, and output sequence 7 can represent textual data (e.g., describing the image, audio, or audiovisual data). It is to be understood that prediction layer(s) 6, and any other interstitial model components of sequence processing model(s) 4, can be configured to receive a variety of data types in input sequence(s) 5 and output a variety of data types in output sequence(s) 7.

[0236] Output sequence 7 can have various relationships to input sequence 5. Output sequence 7 can be a continuation of input sequence 5. Output sequence 7 can be complementary to input sequence 5. Output sequence 7 can translate, transform, augment, or otherwise modify input sequence 5. Output sequence 7 can answer, evaluate, confirm, or otherwise respond to input sequence 5. Output sequence 7 can implement (or describe instructions for implementing) an instruction provided via an input sequence 5.

[0237] Output sequence 7 can be generated autoregressively. For instance, for some applications, an output of one or more prediction layer(s) 6 can be passed through one or more output layers (e.g., SoftMax layer) to obtain a probability distribution over an output vocabulary (e.g., a textual or symbolic vocabulary) conditioned on a set of input elements in a context window. In this manner, for instance, output sequence 7 can be autoregressively generated by sampling a likely next output element, adding that element to the context window and re-generating the probability distribution based on the updated context window, and sampling a likely next output element, and so forth.
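The autoregressive loop described above can be sketched as follows. The `toy_logits` function is a hypothetical stand-in for the prediction layer(s) that simply favors the next word of a fixed phrase; the vocabulary, the end-of-sequence marker, and the choice between greedy selection and random sampling are likewise assumptions.

```python
import math
import random

VOCAB = ["<eos>", "the", "toolbox", "was", "full", "of", "nails"]
PHRASE = ["the", "toolbox", "was", "full", "of", "nails", "<eos>"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context):
    """Hypothetical model: scores the word that follows the last context word in PHRASE."""
    last = context[-1] if context else None
    if last in PHRASE:
        target = PHRASE[(PHRASE.index(last) + 1) % len(PHRASE)]
    else:
        target = "the"
    return [3.0 if token == target else 0.0 for token in VOCAB]


def generate(prompt, max_new=10, rng=None):
    """Extend `prompt` one element at a time until <eos> or `max_new` elements."""
    context = list(prompt)
    for _ in range(max_new):
        probs = softmax(toy_logits(context))      # distribution over the output vocabulary
        if rng is None:
            index = max(range(len(VOCAB)), key=probs.__getitem__)   # greedy choice
        else:
            index = rng.choices(range(len(VOCAB)), weights=probs)[0]  # random sampling
        token = VOCAB[index]
        if token == "<eos>":
            break
        context.append(token)                     # the new element joins the context window
    return context


greedy = generate(["the"])
sampled = generate(["the"], rng=random.Random(0))
```

Greedy selection always takes the most probable element, while passing a random number generator draws from the full distribution, so sampled outputs can vary with the seed.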

[0238] Output sequence 7 can also be generated non-autoregressively. For instance, multiple output elements of output sequence 7 can be predicted together without explicit sequential conditioning on each other. See, e.g., Saharia et al., Non-Autoregressive Machine Translation with Latent Alignments, ARXIV:2004.07437V3 (Nov. 16, 2020).

[0239] Output sequence 7 can include one or multiple portions or elements. In an example content generation configuration, output sequence 7 can include multiple elements corresponding to multiple portions of a generated output sequence (e.g., a textual sentence, values of a discretized waveform, computer code, etc.). In an example classification configuration, output sequence 7 can include a single element associated with a classification output. For instance, an output “vocabulary” can include a set of classes into which an input sequence is to be classified. For instance, a vision transformer block can pass latent state information to a multilayer perceptron that outputs a likely class value associated with an input image.

[0240] FIG. 9 is a block diagram of an example technique for populating an example input sequence 8. Input sequence 8 can include various functional elements that form part of the model infrastructure, such as an element 8-0 obtained from a task indicator 9 that signals to any model(s) that process input sequence 8 that a particular task is being performed (e.g., to help adapt a performance of the model(s) to that particular task). Input sequence 8 can include various data elements from different data modalities. For instance, an input modality 10-1 can include one modality of data. A data-to-sequence model 11-1 can process data from input modality 10-1 to project the data into a format compatible with input sequence 8 (e.g., one or more vectors dimensioned according to the dimensions of input sequence 8) to obtain elements 8-1, 8-2, 8-3. Another input modality 10-2 can include a different modality of data. A data-to-sequence model 11-2 can project data from input modality 10-2 into a format compatible with input sequence 8 to obtain elements 8-4, 8-5, 8-6. Another input modality 10-3 can include yet another different modality of data. A data-to-sequence model 11-3 can project data from input modality 10-3 into a format compatible with input sequence 8 to obtain elements 8-7, 8-8, 8-9.

[0241] Input sequence 8 can be the same as or different from input sequence 5. Input sequence 8 can be a multimodal input sequence that contains elements that represent data from different modalities using a common dimensional representation. For instance, an embedding space can have P dimensions. Input sequence 8 can be configured to contain a plurality of elements that have P dimensions. In this manner, for instance, example implementations can facilitate information extraction and reasoning across diverse data modalities by projecting data into elements in the same embedding space for comparison, combination, or other computations therebetween.
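A rough sketch of populating such a sequence is given below. It assumes each data-to-sequence model is a single linear projection into a shared P-dimensional space, with randomly initialized weights standing in for learned ones; the feature sizes, the value of P, and the function names are all hypothetical.

```python
import random

P = 4  # assumed common embedding dimension


def make_projection(in_dim, out_dim, seed):
    """Random out_dim x in_dim matrix standing in for a learned data-to-sequence model."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product mapping a modality-specific vector to P dimensions."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_input_sequence(task_element, modalities):
    """Concatenate the task element with the projected elements of each modality."""
    sequence = [list(task_element)]
    for projection, items in modalities:
        sequence.extend(project(projection, item) for item in items)
    return sequence


text_items = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]  # 3-dim text features
image_items = [[0.5] * 6, [0.2] * 6, [0.9] * 6]                   # 6-dim patch features
text_projection = make_projection(3, P, seed=1)
image_projection = make_projection(6, P, seed=2)
task_element = [0.0] * P                                          # stand-in for element 8-0

sequence = build_input_sequence(
    task_element,
    [(text_projection, text_items), (image_projection, image_items)],
)
```

Although the text and image features start with different sizes, every element of the resulting sequence has P entries, so downstream layers can compare or combine them directly.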

[0242] For example, elements 8-0, . . . , 8-9 can indicate particular locations within a multidimensional embedding space. Some elements can map to a set of discrete locations in the embedding space. For instance, elements that correspond to discrete members of a predetermined vocabulary of tokens can map to discrete locations in the embedding space that are associated with those tokens. Other elements can be continuously distributed across the embedding space. For instance, some datatypes can be broken down into continuously defined portions (e.g., image patches) that can be described using continuously distributed locations within the embedding space.

[0243] In some implementations, the expressive power of the embedding space may not be limited to meanings associated with any particular set of tokens or other building blocks. For example, a continuous embedding space can encode a spectrum of high-order information. An individual piece of information (e.g., a token) can map to a particular point in that space: for instance, a token for the word “dog” can be projected to an embedded value that points to a particular location in the embedding space associated with canine-related information. Similarly, an image patch of an image of a dog on grass can also be projected into the embedding space. In some implementations, the projection of the image of the dog can be similar to the projection of the word “dog” while also having similarity to a projection of the word “grass,” while potentially being different from both. In some implementations, the projection of the image patch may not exactly align with any single projection of a single word. In some implementations, the projection of the image patch can align with a combination of the projections of the words “dog” and “grass.” In this manner, for instance, a high order embedding space can encode information that can be independent of data modalities in which the information is expressed.
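The relationship between the image patch and the word projections can be illustrated with cosine similarity over small, hand-picked vectors. The three-dimensional embeddings below are invented for illustration only and do not come from any trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: axis 0 ~ canine content, axis 1 ~ grass content.
dog = [1.0, 0.0, 0.0]
grass = [0.0, 1.0, 0.0]
patch = [0.7, 0.7, 0.1]          # image patch showing a dog on grass

combined = [d + g for d, g in zip(dog, grass)]

sim_dog = cosine(patch, dog)
sim_grass = cosine(patch, grass)
sim_combined = cosine(patch, combined)
```

In this toy setting the patch is moderately similar to each word on its own and much closer to their combination, without coinciding exactly with either single-word projection.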

[0244] Task indicator 9 can include a model or model component configured to identify a task being performed and inject, into input sequence 8, an input value represented by element 8-0 that signals which task is being performed. For instance, the input value can be provided as a data type associated with an input modality and projected along with that input modality (e.g., the input value can be a textual task label that is embedded along with other textual data in the input; the input value can be a pixel-based representation of a task that is embedded along with other image data in the input; etc.). The input value can be provided as a data type that differs from or is at least independent from other input(s). For instance, the input value represented by element 8-0 can be learned within a continuous embedding space.

[0245] Input modalities 10-1, 10-2, and 10-3 can be associated with various different data types (e.g., as described above with respect to input(s) 2 and output(s) 3).

[0246] Data-to-sequence models 11-1, 11-2, and 11-3 can be the same or different from each other. Data-to-sequence models 11-1, 11-2, and 11-3 can be adapted to each respective input modality 10-1, 10-2, and 10-3. For example, a textual data-to-sequence model can subdivide a portion of input text and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-1, 8-2, 8-3, etc.). An image data-to-sequence model can subdivide an input image and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-4, 8-5, 8-6, etc.). An arbitrary datatype data-to-sequence model can subdivide an input of that arbitrary datatype and project the subdivisions into element(s) in input sequence 8 (e.g., elements 8-7, 8-8, 8-9, etc.).

[0247] Data-to-sequence models 11-1, 11-2, and 11-3 can form part of machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be jointly trained with or trained independently from machine-learned sequence processing model(s) 4. Data-to-sequence models 11-1, 11-2, and 11-3 can be trained end-to-end with machine-learned sequence processing model(s) 4.

Example Machine-learned Model Development Platform

[0248] FIG. 10 is a block diagram of an example model development platform 12 that can facilitate creation, adaptation, and refinement of example machine-learned models (e.g., machine-learned model(s) 1, sequence processing model(s) 4, etc.). Model development platform 12 can provide a number of different toolkits that developer systems can employ in the development of new or adapted machine-learned models.

[0249] Model development platform 12 can provide one or more model libraries 13 containing building blocks for new models. Model libraries 13 can include one or more pretrained foundational models 13-1, which can provide a backbone of processing power across various tasks. Model libraries 13 can include one or more pre-trained expert models 13-2, which can be focused on performance in particular domains of expertise. Model libraries 13 can include various model primitives 13-3, which can provide low-level architectures or components (optionally pre-trained), which can be assembled in various arrangements as desired.

[0250] Model development platform 12 can receive selections of various model components 14. Model development platform 12 can pass selected model components 14 to a workbench 15 that combines selected model components 14 into a development model 16.

[0251] Workbench 15 can facilitate further refinement and adaptation of development model 16 by leveraging a number of different toolkits integrated with model development platform 12. For example, workbench 15 can facilitate alignment of the development model 16 with a desired performance profile on various tasks using a model alignment toolkit 17.

[0252] Model alignment toolkit 17 can provide a number of tools for causing development model 16 to generate outputs aligned with desired behavioral characteristics. Alignment can include increasing an accuracy, precision, recall, etc. of model outputs. Alignment can include enforcing output styles, schema, or other preferential characteristics of model outputs. Alignment can be general or domain specific. For instance, a pre-trained foundational model 13-1 can begin with an initial level of performance across multiple domains. Alignment of the pre-trained foundational model 13-1 can include improving a performance in a particular domain of information or tasks (e.g., even at the expense of performance in another domain of information or tasks).

[0253] Model alignment toolkit 17 can integrate one or more dataset(s) 17-1 for aligning development model 16. Curated dataset(s) 17-1 can include labeled or unlabeled training data. Dataset(s) 17-1 can be obtained from public domain datasets. Dataset(s) 17-1 can be obtained from private datasets associated with one or more developer system(s) for the alignment of bespoke machine-learned model(s) customized for private use-cases.

[0254] Pre-training pipelines 17-2 can include a machine-learned model training workflow configured to update development model 16 over large-scale, potentially noisy datasets. For example, pre-training can leverage unsupervised learning techniques (e.g., denoising, etc.) to process large numbers of training instances to update model parameters from an initialized state and achieve a desired baseline performance. Pre-training pipelines 17-2 can leverage unlabeled datasets in dataset(s) 17-1 to perform pre-training. Workbench 15 can implement a pre-training pipeline 17-2 to pre-train development model 16.

[0255] Fine-tuning pipelines 17-3 can include a machine-learned model training workflow configured to refine the model parameters of development model 16 with higher-quality data. Fine-tuning pipelines 17-3 can update development model 16 by conducting supervised training with labeled dataset(s) in dataset(s) 17-1. Fine-tuning pipelines 17-3 can update development model 16 by conducting reinforcement learning using reward signals from user feedback signals. Workbench 15 can implement a fine-tuning pipeline 17-3 to fine-tune development model 16.

[0256] Prompt libraries 17-4 can include sets of inputs configured to induce behavior aligned with desired performance criteria. Prompt libraries 17-4 can include few-shot prompts (e.g., inputs providing examples of desired model outputs for prepending to a desired runtime query), chain-of-thought prompts (e.g., inputs providing step-by-step reasoning within the exemplars to facilitate thorough reasoning by the model), and the like.
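As a non-limiting illustration, assembly of a few-shot prompt of the kind described above can be sketched as follows. The `Exemplar` type, the `build_few_shot_prompt` function, and the "Q:/A:" layout are hypothetical names and conventions chosen for this sketch and are not part of the disclosure; a prompt library could use any other format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example of a desired input/output pair (hypothetical type)."""
    query: str
    answer: str


def build_few_shot_prompt(exemplars: list[Exemplar], runtime_query: str) -> str:
    """Prepend exemplars to the runtime query, leaving the final answer open."""
    blocks = [f"Q: {ex.query}\nA: {ex.answer}" for ex in exemplars]
    blocks.append(f"Q: {runtime_query}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplars only; the content is an assumption of this sketch.
exemplars = [
    Exemplar("Translate 'chat' to English.", "cat"),
    Exemplar("Translate 'chien' to English.", "dog"),
]
prompt = build_few_shot_prompt(exemplars, "Translate 'oiseau' to English.")
print(prompt)
```

In this sketch, passing an empty exemplar list yields an input that contains only the runtime query.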

[0257] Example prompts can be retrieved from an available repository of prompt libraries 17-4. Example prompts can be contributed by one or more developer systems using workbench 15.

[0258] In some implementations, pre-trained or fine-tuned models can achieve satisfactory performance without exemplars in the inputs. For instance, zero-shot prompts can include inputs that lack exemplars. Zero-shot prompts can be within a domain within a training dataset or outside of the training domain(s).

[0259] Prompt libraries 17-4 can include one or more prompt engineering tools. Prompt engineering tools can provide workflows for retrieving or learning optimized prompt values. Prompt engineering tools can facilitate directly learning prompt values (e.g., input element values) based on one or more training iterations. Workbench 15 can implement prompt engineering tools in development model 16.

[0260] Prompt libraries 17-4 can include pipelines for prompt generation. For example, inputs can be generated using development model 16 itself or other machine-learned models. In this manner, for instance, a first model can process information about a task and output an input for a second model to process in order to perform a step of the task. The second model can be the same as or different from the first model. Workbench 15 can implement prompt generation pipelines in development model 16.

[0261] Prompt libraries 17-4 can include pipelines for context injection. For instance, a performance of development model 16 on a particular task can improve if provided with additional context for performing the task. Prompt libraries 17-4 can include software components configured to identify desired context, retrieve the context from an external source (e.g., a database, a sensor, etc.), and add the context to the input prompt. Workbench 15 can implement context injection pipelines in development model 16.
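As a non-limiting illustration, the three context injection operations described above (identifying desired context, retrieving it, and adding it to the input prompt) can be sketched as follows. The in-memory `KNOWLEDGE_BASE`, the keyword-matching heuristic, and all function names are assumptions made for this sketch; an actual pipeline could retrieve context from a database, a sensor, or another external source.

```python
from typing import Callable

# Hypothetical external source: a mapping standing in for a database or sensor feed.
KNOWLEDGE_BASE = {
    "warranty": "Warranty period: 24 months from purchase.",
    "returns": "Returns accepted within 30 days.",
}


def identify_context_keys(prompt: str, known_keys: list[str]) -> list[str]:
    """Identify desired context by naive keyword matching (illustrative only)."""
    lowered = prompt.lower()
    return [key for key in known_keys if key in lowered]


def inject_context(
    prompt: str, retrieve: Callable[[str], str], known_keys: list[str]
) -> str:
    """Retrieve context for each matched key and add it to the input prompt."""
    keys = identify_context_keys(prompt, known_keys)
    if not keys:
        return prompt  # Nothing relevant found; leave the prompt unchanged.
    context = "\n".join(retrieve(key) for key in keys)
    return f"Context:\n{context}\n\nRequest:\n{prompt}"


augmented = inject_context(
    "How long is the warranty?", KNOWLEDGE_BASE.__getitem__, list(KNOWLEDGE_BASE)
)
print(augmented)
```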

[0262] Although various training examples described herein with respect to model development platform 12 refer to “pre-training” and “fine-tuning,” it is to be understood that model alignment toolkit 17 can generally support a wide variety of training techniques adapted for training a wide variety of machine-learned models. Example training techniques can correspond to the example training method 1000 described above.

[0263] Model development platform 12 can include a model plugin toolkit 18. Model plugin toolkit 18 can include a variety of tools configured for augmenting the functionality of a machine-learned model by integrating the machine-learned model with other systems, devices, and software components. For instance, a machine-learned model can use tools to increase performance quality where appropriate. For instance, deterministic tasks can be offloaded to dedicated tools in lieu of probabilistically performing the task with an increased risk of error. For instance, instead of autoregressively predicting the solution to a system of equations, a machine-learned model can recognize a tool to call for obtaining the solution and pass the system of equations to the appropriate tool. The tool can be a traditional system of equations solver that can operate deterministically to resolve the system of equations. As another example, a tool can have a machine-learned model with reduced model overhead compared to a larger machine-learned model. For instance, the model of the tool can be a less sophisticated model than the calling model that is specialized to a particular task or subset of tasks and can require fewer computing resources to produce a usable output. The output of the tool can be returned in response to the original query. In this manner, tool use can allow some example models to focus on the strengths of machine-learned models (e.g., understanding an intent in an unstructured request for a task) while augmenting the performance of the model by offloading certain tasks to a more focused tool for rote application of deterministic algorithms to a well-defined problem or for evaluation of simpler tasks that can be adequately performed by a less sophisticated model.

[0264] Model plugin toolkit 18 can include validation tools 18-1. Validation tools 18-1 can include tools that can parse and confirm output(s) of a machine-learned model. Validation tools 18-1 can include engineered heuristics that establish certain thresholds applied to model outputs. For example, validation tools 18-1 can ground the outputs of machine-learned models to structured data sources (e.g., to mitigate "hallucinations"). One example tool that can be included in validation tools 18-1 is a routing tool for routing a query from a user to a user-specific memory layer or a public data interface.

[0265] Model plugin toolkit 18 can include tooling packages 18-2 for implementing one or more tools that can include scripts or other executable code that can be executed alongside development model 16. Tooling packages 18-2 can include one or more inputs configured to cause machine-learned model(s) to implement the tools (e.g., few-shot prompts that induce a model to output tool calls in the proper syntax, etc.). Tooling packages 18-2 can include, for instance, fine-tuning training data for training a model to use a tool.

[0266] Model plugin toolkit 18 can include interfaces for calling external application programming interfaces (APIs) 18-3. For instance, in addition to or in lieu of implementing tool calls or tool code directly with development model 16, development model 16 can be aligned to output instructions that initiate API calls to send or obtain data via external systems. As an example, the development model 16 can initiate API calls to one or more public data interface(s) to send or obtain data from one or more public data sources, such as webpages, databases, and so on.

[0267] Model plugin toolkit 18 can integrate with prompt libraries 17-4 to build a catalog of available tools for use with development model 16. For instance, a model can receive, in an input, a catalog of available tools, and the model can generate an output that selects a tool from the available tools and initiates a tool call for using the tool.
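As a non-limiting illustration, a tool catalog and the dispatch of a model-selected tool call can be sketched as follows, using the system-of-equations example given above. The JSON call syntax, the `TOOL_CATALOG` structure, and the function names are assumptions of this sketch; the disclosure does not prescribe a tool-call format. The model output is a hard-coded stand-in, and no model is invoked.

```python
import json
from fractions import Fraction
from typing import Callable


def solve_2x2(a: list[list[int]], b: list[int]) -> list[Fraction]:
    """Deterministically solve a 2x2 linear system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("system has no unique solution")
    x = Fraction(b[0] * a[1][1] - a[0][1] * b[1], det)
    y = Fraction(a[0][0] * b[1] - b[0] * a[1][0], det)
    return [x, y]


# Hypothetical catalog: tool name -> (description shown to the model, callable).
TOOL_CATALOG: dict[str, tuple[str, Callable]] = {
    "solve_2x2": ("Solve a 2x2 linear system given 'a' and 'b'.", solve_2x2),
}


def catalog_prompt() -> str:
    """Render the catalog as text that could be included in a model input."""
    return "\n".join(f"- {name}: {desc}" for name, (desc, _) in TOOL_CATALOG.items())


def dispatch_tool_call(model_output: str):
    """Parse a JSON tool call emitted by a model and run the selected tool."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in TOOL_CATALOG:
        raise KeyError(f"unknown tool: {name}")
    _, fn = TOOL_CATALOG[name]
    return fn(**call["arguments"])


# Stand-in for a model output selecting a tool for 2x + y = 5 and x - y = 1.
model_output = (
    '{"tool": "solve_2x2", "arguments": {"a": [[2, 1], [1, -1]], "b": [5, 1]}}'
)
result = dispatch_tool_call(model_output)
print(result)  # x = 2, y = 1, as Fraction values
```

Exact rational arithmetic is used here so that the tool result is reproducible, which reflects the motivation for offloading deterministic tasks rather than predicting the solution autoregressively.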

[0268] Model development platform 12 can include a computational optimization toolkit 19 for optimizing a computational performance of development model 16. For instance, tools for model compression 19-1 can allow development model 16 to be reduced in size while maintaining a desired level of performance. For instance, model compression 19-1 can include quantization workflows, weight pruning and sparsification techniques, etc. Tools for hardware acceleration 19-2 can facilitate the configuration of the model storage and execution formats to operate optimally on different hardware resources. For instance, hardware acceleration 19-2 can include tools for optimally sharding models for distributed processing over multiple processing units for increased bandwidth, lower unified memory requirements, etc. Tools for distillation 19-3 can provide for the training of lighter-weight models based on the knowledge encoded in development model 16. For instance, development model 16 can be a highly performant, large machine-learned model optimized using model development platform 12. To obtain a lightweight model for running in resource-constrained environments, a smaller model can be a “student model” that learns to imitate development model 16 as a “teacher model.” In this manner, for instance, the investment in learning the parameters and configurations of development model 16 can be efficiently transferred to a smaller model for more efficient inference.
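As a non-limiting illustration of one quantization workflow that tools for model compression 19-1 could provide, symmetric per-tensor 8-bit quantization can be sketched as follows. The choice of this particular scheme, the integer range of -127 to 127, and the function names are assumptions of this sketch; the disclosure does not specify a quantization method.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Map stored integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Rounding bounds the per-weight error by half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one small integer plus a shared scale, which is how this scheme reduces model size at the cost of a bounded rounding error.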

[0269] Workbench 15 can implement one, multiple, or none of the toolkits implemented in model development platform 12. Workbench 15 can output an output model 20 based on development model 16. Output model 20 can be a deployment version of development model 16. Output model 20 can be a development or training checkpoint of development model 16. Output model 20 can be a distilled, compressed, or otherwise optimized version of development model 16.

[0270] FIG. 11 is a block diagram of an example training flow for training a machine-learned development model 16. One or more portion(s) of the example training flow can be implemented by a computing system that includes one or more computing devices such as, for example, computing systems described with reference to the other figures. Each respective portion of the example training flow can be performed by any (or any combination) of one or more computing devices. Moreover, one or more portion(s) of the example training flow can be implemented on the hardware components of the device(s) described herein, for example, to train one or more systems or models. FIG. 11 depicts elements performed in a particular order for purposes of illustration and discussion. Those of ordinary skill in the art, using the disclosures provided herein, will understand that the elements of any of the methods discussed herein can be adapted, rearranged, expanded, omitted, combined, or modified in various ways without deviating from the scope of the present disclosure. FIG. 11 is described with reference to elements / terms described with respect to other systems and figures for exemplary illustrated purposes and is not meant to be limiting. One or more portions of the example training flow can be performed additionally, or alternatively, by other systems.

[0271] Initially, development model 16 can persist in an initial state as an initialized model 21. Development model 16 can be initialized with weight values. Initial weight values can be random or based on an initialization schema. Initial weight values can be based on prior pre-training for the same or for a different model.

[0272] Initialized model 21 can undergo pre-training in a pre-training stage 22. Pre-training stage 22 can be implemented using one or more pre-training pipelines 17-2 over data from dataset(s) 17-1. Pre-training can be omitted, for example, if initialized model 21 is already pre-trained (e.g., development model 16 contains, is, or is based on a pre-trained foundational model or an expert model).

[0273] Pre-trained model 23 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Pre-trained model 23 can be the initial state if development model 16 was already pre-trained. Pre-trained model 23 can undergo fine-tuning in a fine-tuning stage 24. Fine-tuning stage 24 can be implemented using one or more fine-tuning pipelines 17-3 over data from dataset(s) 17-1. Fine-tuning can be omitted, for example, if a pre-trained model has satisfactory performance, if the model was already fine-tuned, or if other tuning approaches are preferred.

[0274] Fine-tuned model 25 can then be a new version of development model 16, which can persist as development model 16 or as a new development model. Fine-tuned model 25 can be the initial state if development model 16 was already fine-tuned. Fine-tuned model 25 can undergo refinement with user feedback 26. For instance, refinement with user feedback 26 can include reinforcement learning, optionally based on human feedback from human users of fine-tuned model 25. As reinforcement learning can be a form of fine-tuning, it is to be understood that fine-tuning stage 24 can subsume the stage for refining with user feedback 26. Refinement with user feedback 26 can produce a refined model 27. Refined model 27 can be output to downstream system(s) 28 for deployment or further development.
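As a non-limiting illustration, the staged training flow described above, in which pre-training, fine-tuning, and refinement with user feedback are applied in order and any stage can be omitted, can be sketched as follows. The dictionary used as model state and the function names are hypothetical placeholders; each stage here only records that it ran and performs no training.

```python
from typing import Callable, Optional

Model = dict  # Hypothetical stand-in for model state (e.g., weights plus metadata).
Stage = Callable[[Model], Model]


def run_training_flow(
    initialized: Model,
    pre_train: Optional[Stage] = None,
    fine_tune: Optional[Stage] = None,
    refine_with_feedback: Optional[Stage] = None,
) -> Model:
    """Apply each provided stage in order; a stage passed as None is omitted."""
    model = dict(initialized)
    for stage in (pre_train, fine_tune, refine_with_feedback):
        if stage is not None:
            model = stage(model)
    return model


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(model: Model) -> Model:
        return {**model, "history": model.get("history", []) + [name]}
    return stage


# Pre-training is omitted here, as when the initialized model is already pre-trained.
refined = run_training_flow(
    {"history": ["loaded_pretrained_weights"]},
    fine_tune=make_stage("fine_tuning"),
    refine_with_feedback=make_stage("feedback_refinement"),
)
print(refined["history"])
```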

[0275] In some implementations, computational optimization operations can be applied before, during, or after each stage. For instance, initialized model 21 can undergo computational optimization 29-1 (e.g., using computational optimization toolkit 19) before pre-training stage 22. Pre-trained model 23 can undergo computational optimization 29-2 (e.g., using computational optimization toolkit 19) before fine-tuning stage 24. Fine-tuned model 25 can undergo computational optimization 29-3 (e.g., using computational optimization toolkit 19) before refinement with user feedback 26. Refined model 27 can undergo computational optimization 29-4 (e.g., using computational optimization toolkit 19) before output to downstream system(s) 28. Computational optimization(s) 29-1, ..., 29-4 can all be the same, all be different, or include at least some different optimization techniques.

Example Machine-learned Model Inference System

[0276] FIG. 12 is a block diagram of an inference system for operating one or more machine-learned model(s) 1 to perform inference (e.g., for training, for deployment, etc.). A model host 31 can receive machine-learned model(s) 1. Model host 31 can host one or more model instance(s) 31-1, which can be one or multiple instances of one or multiple models. Model host 31 can host model instance(s) 31-1 using available compute resources 31-2 associated with model host 31.

[0277] Model host 31 can perform inference on behalf of one or more client(s) 32. Client(s) 32 can transmit an input request 33 to model host 31. Using input request 33, model host 31 can obtain input(s) 2 for input to machine-learned model(s) 1. Machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3. Using output(s) 3, model host 31 can return an output payload 34 for responding to input request 33 from client(s) 32. Output payload 34 can include or be based on output(s) 3.

[0278] Model host 31 can leverage various other resources and tools to augment the inference task. For instance, model host 31 can communicate with tool interfaces 35 to facilitate tool use by model instance(s) 31-1. Tool interfaces 35 can include local or remote APIs. Tool interfaces 35 can include integrated scripts or other software functionality. Model host 31 can engage online learning interface(s) 36 to facilitate ongoing improvements to machine-learned model(s) 1. For instance, online learning interface(s) 36 can be used within reinforcement learning loops to retrieve user feedback on inferences served by model host 31. Model host 31 can access runtime data source(s) 37 for augmenting input(s) 2 with additional contextual information. For instance, runtime data source(s) 37 can include a knowledge graph 37-1 that facilitates structured information retrieval for information associated with input request(s) 33 (e.g., a search engine service). Runtime data source(s) 37 can include public or private, external or local database(s) 37-2 that can store information associated with input request(s) 33 for augmenting input(s) 2. Runtime data source(s) 37 can include account data 37-3 which can be retrieved in association with a user account corresponding to a client 32 for customizing the behavior of model host 31 accordingly.

[0279] Model host 31 can be implemented by one or multiple computing devices or systems. Client(s) 32 can be implemented by one or multiple computing devices or systems, which can include computing devices or systems shared with model host 31.

[0280] For example, model host 31 can operate on a server system that provides a machine-learning service to client device(s) that operate client(s) 32 (e.g., over a local or wide-area network). Client device(s) can be end-user devices used by individuals. Client device(s) can be server systems that operate client(s) 32 to provide various functionality as a service to downstream end-user devices.

[0281] In some implementations, model host 31 can operate on a same device or system as client(s) 32. Model host 31 can be a machine-learning service that runs on-device to provide machine-learning functionality to one or multiple applications operating on a client device, which can include an application implementing client(s) 32. Model host 31 can be a part of a same application as client(s) 32. For instance, model host 31 can be a subroutine or method implemented by one part of an application, and client(s) 32 can be another subroutine or method that engages model host 31 to perform inference functions within the application. It is to be understood that model host 31 and client(s) 32 can have various different configurations.

[0282] Model instance(s) 31-1 can include one or more machine-learned models that are available for performing inference. Model instance(s) 31-1 can include weights or other model components that are stored on or in persistent storage, temporarily cached, or loaded into high-speed memory. Model instance(s) 31-1 can include multiple instance(s) of the same model (e.g., for parallel execution of more requests on the same model). Model instance(s) 31-1 can include instance(s) of different model(s). Model instance(s) 31-1 can include cached intermediate states of active or inactive model(s) used to accelerate inference of those models. For instance, an inference session with a particular model can generate significant amounts of computational results that can be re-used for future inference runs (e.g., using a KV cache for transformer-based models). These computational results can be saved in association with that inference session so that session can be executed more efficiently when resumed.
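As a non-limiting illustration, saving computational results in association with an inference session so that a resumed session avoids recomputation can be sketched as follows. This is a simplified analogy and not an actual KV cache for a transformer-based model: the `SessionCache` class is hypothetical, and `_process_token` is a placeholder for an expensive per-token computation. The sketch assumes that each call in a session extends the tokens of the previous call.

```python
class SessionCache:
    """Caches per-token intermediate results keyed by inference session."""

    def __init__(self) -> None:
        self._states: dict[str, list[int]] = {}
        self.computed_tokens = 0  # Counts tokens that were actually processed.

    def _process_token(self, previous_state: int, token: str) -> int:
        # Placeholder for an expensive per-token computation.
        self.computed_tokens += 1
        return (previous_state * 31 + sum(token.encode())) % 1_000_003

    def run(self, session_id: str, tokens: list[str]) -> int:
        """Process tokens, reusing states cached by earlier calls in the session.

        Assumes each call's tokens extend the tokens of the previous call.
        """
        states = self._states.setdefault(session_id, [])
        # Only tokens beyond the cached prefix need to be processed.
        for token in tokens[len(states):]:
            previous = states[-1] if states else 0
            states.append(self._process_token(previous, token))
        return states[-1] if states else 0


cache = SessionCache()
first = cache.run("session-1", ["a", "b", "c"])
second = cache.run("session-1", ["a", "b", "c", "d"])  # Resumed session.
print(cache.computed_tokens)  # Four tokens processed in total, not seven.
```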

[0283] Compute resource(s) 31-2 can include one or more processors (central processing units, graphical processing units, tensor processing units, machine-learning accelerators, etc.) connected to one or more memory devices. Compute resource(s) 31-2 can include a dynamic pool of available resources shared with other processes. Compute resource(s) 31-2 can include memory devices large enough to fit an entire model instance in a single memory instance. Compute resource(s) 31-2 can also shard model instance(s) across multiple memory devices (e.g., using data parallelization or tensor parallelization, etc.). This can be done to increase parallelization or to execute a large model using multiple memory devices which individually might not be able to fit the entire model into memory.

[0284] Input request 33 can include data for input(s) 2. Model host 31 can process input request 33 to obtain input(s) 2. Input(s) 2 can be obtained directly from input request 33 or can be retrieved using input request 33. Input request 33 can be submitted to model host 31 via an API.

[0285] Model host 31 can perform inference over batches of input requests 33 in parallel. For instance, a model instance 31-1 can be configured with an input structure that has a batch dimension. Separate input(s) 2 can be distributed across the batch dimension (e.g., rows of an array). The separate input(s) 2 can include completely different contexts. The separate input(s) 2 can be multiple inference steps of the same task. The separate input(s) 2 can be staggered in an input structure, such that any given inference cycle can be operating on different portions of the respective input(s) 2. In this manner, for instance, model host 31 can perform inference on the batch in parallel, such that output(s) 3 can also contain the batch dimension and return the inference results for the batched input(s) 2 in parallel. In this manner, for instance, batches of input request(s) 33 can be processed in parallel for higher throughput of output payload(s) 34.
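As a non-limiting illustration, distributing separate inputs across a batch dimension so that outputs carry the same batch dimension can be sketched as follows. The padding scheme, the mask, and the `toy_model` placeholder are assumptions of this sketch; the disclosure does not specify how inputs of different lengths are arranged in the input structure.

```python
def pad_batch(
    inputs: list[list[int]], pad_id: int = 0
) -> tuple[list[list[int]], list[list[int]]]:
    """Stack variable-length inputs along a batch dimension with a padding mask."""
    width = max((len(seq) for seq in inputs), default=0)
    batch = [seq + [pad_id] * (width - len(seq)) for seq in inputs]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in inputs]
    return batch, mask


def toy_model(batch: list[list[int]], mask: list[list[int]]) -> list[int]:
    """Placeholder model: one output per batch row (sum over unmasked positions)."""
    return [
        sum(value for value, keep in zip(row, row_mask) if keep)
        for row, row_mask in zip(batch, mask)
    ]


# Three separate inputs, each occupying one row of the batch.
requests = [[3, 1, 4], [1, 5], [9]]
batch, mask = pad_batch(requests)
outputs = toy_model(batch, mask)  # One result per request, in request order.
print(batch, outputs)
```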

[0286] Output payload 34 can include or be based on output(s) 3 from machine-learned model(s) 1. Model host 31 can process output(s) 3 to obtain output payload 34. This can include chaining multiple rounds of inference (e.g., iteratively, recursively, across the same model(s) or different model(s)) to arrive at a final output for a task to be returned in output payload 34. Output payload 34 can be transmitted to client(s) 32 via an API.

[0287] Online learning interface(s) 36 can facilitate reinforcement learning of machine-learned model(s) 1. Online learning interface(s) 36 can facilitate reinforcement learning with human feedback (RLHF). Online learning interface(s) 36 can facilitate federated learning of machine-learned model(s) 1.

[0288] Model host 31 can execute machine-learned model(s) 1 to perform inference for various tasks using various types of data. For example, various different input(s) 2 and output(s) 3 can be used for various different tasks. In some implementations, input(s) 2 can be or otherwise represent image data. Machine-learned model(s) 1 can process the image data to generate an output. As an example, machine-learned model(s) 1 can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an image segmentation output. As another example, machine-learned model(s) 1 can process the image data to generate an image classification output. As another example, machine-learned model(s) 1 can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an encoded image data output (e.g., an encoded and / or compressed representation of the image data, etc.). As another example, machine-learned model(s) 1 can process the image data to generate an upscaled image data output. As another example, machine-learned model(s) 1 can process the image data to generate a prediction output.

[0289] In some implementations, the task is a computer vision task. In some cases, input(s) 2 includes pixel data for one or more images and the task is an image processing task. For example, the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class. The image processing task can be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest. As another example, the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories. For example, the set of categories can be foreground and background. As another example, the set of categories can be object classes. As another example, the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value. As another example, the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
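As a non-limiting illustration, an image classification output in the form of one score per object class can be sketched as follows. The use of a softmax over raw scores is an assumption of this sketch, as are the class names and numeric values; the disclosure states only that each score represents a likelihood and does not specify how the scores are computed.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw class scores to likelihoods that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: list[float], class_names: list[str]) -> dict[str, float]:
    """Pair each object class with the likelihood assigned to it."""
    if len(logits) != len(class_names):
        raise ValueError("expected one raw score per object class")
    return dict(zip(class_names, softmax(logits)))


# Hypothetical raw scores for one image over three illustrative classes.
scores = classify([2.0, 0.5, -1.0], ["cat", "dog", "car"])
top_class = max(scores, key=scores.get)
print(scores, top_class)
```

The same per-class likelihood structure could be computed per region for object detection or per pixel for image segmentation.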

[0290] In some implementations, input(s) 2 can be or otherwise represent natural language data. Machine-learned model(s) 1 can process the natural language data to generate an output. As an example, machine-learned model(s) 1 can process the natural language data to generate a language encoding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a latent text embedding output. As another example, machine-learned model(s) 1 can process the natural language data to generate a translation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a classification output. As another example, machine-learned model(s) 1 can process the natural language data to generate a textual segmentation output. As another example, machine-learned model(s) 1 can process the natural language data to generate a semantic intent output. As another example, machine-learned model(s) 1 can process the natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.). As another example, machine-learned model(s) 1 can process the natural language data to generate a prediction output (e.g., one or more predicted next portions of natural language content).

[0291] In some implementations, input(s) 2 can be or otherwise represent speech data (e.g., data describing spoken natural language, such as audio data, textual data, etc.). Machine-learned model(s) 1 can process the speech data to generate an output. As an example, machine-learned model(s) 1 can process the speech data to generate a speech recognition output. As another example, machine-learned model(s) 1 can process the speech data to generate a speech translation output. As another example, machine-learned model(s) 1 can process the speech data to generate a latent embedding output. As another example, machine-learned model(s) 1 can process the speech data to generate an encoded speech output (e.g., an encoded and / or compressed representation of the speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.). As another example, machine-learned model(s) 1 can process the speech data to generate a prediction output.

[0292] In some implementations, input(s) 2 can be or otherwise represent latent encoding data (e.g., a latent space representation of an input, etc.). Machine-learned model(s) 1 can process the latent encoding data to generate an output. As an example, machine-learned model(s) 1 can process the latent encoding data to generate a recognition output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reconstruction output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a search output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a reclustering output. As another example, machine-learned model(s) 1 can process the latent encoding data to generate a prediction output.

[0293] In some implementations, input(s) 2 can be or otherwise represent statistical data. Statistical data can be, represent, or otherwise include data computed and / or calculated from some other data source. Machine-learned model(s) 1 can process the statistical data to generate an output. As an example, machine-learned model(s) 1 can process the statistical data to generate a recognition output. As another example, machine-learned model(s) 1 can process the statistical data to generate a prediction output. As another example, machine-learned model(s) 1 can process the statistical data to generate a classification output. As another example, machine-learned model(s) 1 can process the statistical data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the statistical data to generate a visualization output. As another example, machine-learned model(s) 1 can process the statistical data to generate a diagnostic output.

[0294] In some implementations, input(s) 2 can be or otherwise represent sensor data. Machine-learned model(s) 1 can process the sensor data to generate an output. As an example, machine-learned model(s) 1 can process the sensor data to generate a recognition output. As another example, machine-learned model(s) 1 can process the sensor data to generate a prediction output. As another example, machine-learned model(s) 1 can process the sensor data to generate a classification output. As another example, machine-learned model(s) 1 can process the sensor data to generate a segmentation output. As another example, machine-learned model(s) 1 can process the sensor data to generate a visualization output. As another example, machine-learned model(s) 1 can process the sensor data to generate a diagnostic output. As another example, machine-learned model(s) 1 can process the sensor data to generate a detection output.

[0295] In some implementations, machine-learned model(s) 1 can be configured to perform a task that includes encoding input data for reliable and / or efficient transmission or storage (and / or corresponding decoding). For example, the task can be an audio compression task. The input can include audio data and the output can be or can include compressed audio data. In another example, the input includes visual data (e.g., one or more images or videos), the output includes compressed visual data, and the task is a visual data compression task. In another example, the task can include generating an embedding for input data (e.g., input audio or visual data). In some cases, the input includes audio data representing a spoken utterance and the task is a speech recognition task. The output can be or can include a text output which is mapped to the spoken utterance. In some cases, the task includes encrypting or decrypting input data. In some cases, the task includes a microprocessor performance task, such as branch prediction or memory address translation.

[0296] In some implementations, the task is a generative task, and machine-learned model(s) 1 can be configured to output content generated in view of input(s) 2. For instance, input(s) 2 can be or otherwise represent data of one or more modalities that encodes context for generating additional content.

[0297] In some implementations, the task can be a text completion task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent textual data and to generate output(s) 3 that represent additional textual data that completes a textual sequence that includes input(s) 2. For instance, machine-learned model(s) 1 can be configured to generate output(s) 3 to complete a sentence, paragraph, or portion of text that follows from a portion of text represented by input(s) 2.

[0298] In some implementations, the task can be an instruction following task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent instructions to perform a function and to generate output(s) 3 that advance a goal of satisfying the instruction function (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the instructions (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward accomplishing the requested functionality. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of performing a function. Multiple steps can be performed, with a final output being obtained that is responsive to the initial instructions.

[0299] In some implementations, the task can be a question answering task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent a question to answer and to generate output(s) 3 that advance a goal of returning an answer to the question (e.g., at least a step of a multi-step procedure to perform the function). Output(s) 3 can represent data of the same or of a different modality as input(s) 2. For instance, input(s) 2 can represent textual data (e.g., natural language instructions for a task to be performed) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). Input(s) 2 can represent image data (e.g., image-based instructions for a task to be performed, optionally accompanied by textual instructions) and machine-learned model(s) 1 can process input(s) 2 to generate output(s) 3 that represent textual data responsive to the question (e.g., natural language responses, programming language responses, machine language responses, etc.). One or more output(s) 3 can be iteratively or recursively generated to sequentially process and accomplish steps toward answering the question. For instance, an initial output can be executed by an external system or be processed by machine-learned model(s) 1 to complete an initial step of obtaining an answer to the question (e.g., querying a database, performing a computation, executing a script, etc.). Multiple steps can be performed, with a final output being obtained that is responsive to the question.

[0300] In some implementations, the task can be an image generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of image content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent image data that depicts imagery related to the context. For instance, machine-learned model(s) 1 can be configured to generate pixel data of an image. Values for channel(s) associated with the pixels in the pixel data can be selected based on the context (e.g., based on a probability determined based on the context).

[0301] In some implementations, the task can be an audio generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of audio content. The context can include text data, image data, audio data, etc. Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent audio data related to the context. For instance, machine-learned model(s) 1 can be configured to generate waveform data in the form of an image (e.g., a spectrogram). Values for channel(s) associated with pixels of the image can be selected based on the context. Machine-learned model(s) 1 can be configured to generate waveform data in the form of a sequence of discrete samples of a continuous waveform. Values of the sequence can be selected based on the context (e.g., based on a probability determined based on the context).

[0302] In some implementations, the task can be a data generation task. Machine-learned model(s) 1 can be configured to process input(s) 2 that represent context regarding a desired portion of data (e.g., data from various data domains, such as sensor data, image data, multimodal data, statistical data, etc.). The desired data can be, for instance, synthetic data for training other machine-learned models. The context can include arbitrary data type(s). Machine-learned model(s) 1 can be configured to generate output(s) 3 that represent data that aligns with the desired data. For instance, machine-learned model(s) 1 can be configured to generate data values for populating a dataset. Values for the data object(s) can be selected based on the context (e.g., based on a probability determined based on the context).

Additional Disclosure

[0303] The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

[0304] While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

[0305] Aspects of the disclosure have been described in terms of illustrative embodiments thereof. Any and all features in the following claims can be combined or rearranged in any way possible, including combinations of claims not explicitly enumerated in combination together, as the example claim dependencies listed herein should not be read as limiting the scope of possible combinations of features disclosed herein. Accordingly, the scope of the present disclosure is by way of example rather than by way of limitation, and the subject disclosure does not preclude inclusion of such modifications, variations or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. Moreover, terms are described herein using lists of example elements joined by conjunctions such as “and,” “or,” “but,” etc. It should be understood that such conjunctions are provided for explanatory purposes only. Clauses and other sequences of items joined by a particular conjunction such as “or,” for example, can refer to “and/or,” “at least one of,” “any combination of” example elements listed therein, etc. Terms such as “based on” should be understood as “based at least in part on.”

[0306] The term “can” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X can perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.

[0307] The term “may” should be understood as referring to a possibility of a feature in various implementations and not as prescribing an ability that is necessarily present in every implementation. For example, the phrase “X may perform Y” should be understood as indicating that, in various implementations, X has the potential to be configured to perform Y, and not as indicating that in every instance X must always be able to perform Y. It should be understood that, in various implementations, X might be unable to perform Y and remain within the scope of the present disclosure.

Claims

WHAT IS CLAIMED IS:

1. A computer-implemented method comprising:
obtaining, by a computing system comprising one or more computing devices, a first output comprising a plurality of first output values, the first output associated with a first output cycle of a machine-learned model;
obtaining, by the computing system, a plurality of first metric values associated with the plurality of first output values;
determining, by the computing system, a plurality of first formatting characteristic values respective to the plurality of first output values based on the plurality of first metric values;
providing, by the computing system, the first formatting characteristic values for use in a first user interface having the plurality of first output values formatted according to the plurality of first formatting characteristic values;
obtaining, by the computing system, a second output comprising a plurality of second output values, the second output associated with a second output cycle of the machine-learned model;
obtaining, by the computing system, a plurality of second metric values respective to the plurality of second output values;
determining, by the computing system, a plurality of second formatting characteristic values respective to the plurality of second output values based on the plurality of second metric values; and
providing, by the computing system, the second formatting characteristic values for use in updating the first user interface to a second user interface having the plurality of second output values formatted according to the plurality of second formatting characteristic values.

2. The computer-implemented method of claim 1, wherein each metric value of the plurality of first metric values or the plurality of second metric values is between a minimum metric value and a maximum metric value;
wherein each formatting characteristic value of the plurality of first formatting characteristic values or the plurality of second formatting characteristic values is between a minimum formatting characteristic value and a maximum formatting characteristic value; and
wherein a scale of each formatting characteristic value between the minimum formatting characteristic value and the maximum formatting characteristic value corresponds to a scale of its respective metric value between the minimum metric value and the maximum metric value.

3. The computer-implemented method of claim 1, wherein at least one of the plurality of first formatting characteristic values or the plurality of second formatting characteristic values comprises a shading value, the shading value indicative of a shade with which a respective output value of the plurality of first output values or the plurality of second output values is rendered in the first user interface or the second user interface; and
wherein the shading value is between a minimum shading value and a maximum shading value.

4. The computer-implemented method of claim 1, wherein at least one of the plurality of first metric values or the plurality of second metric values comprises a certainty score associated with a respective output value of the plurality of first output values or the plurality of second output values.

5. The computer-implemented method of claim 1, wherein at least one of the plurality of first metric values or the plurality of second metric values comprises a tonality score associated with a tone characteristic respective to an output value of the plurality of first output values or the plurality of second output values.

6. The computer-implemented method of claim 1, wherein the machine-learned model comprises a block generation model, wherein the block generation model is configured to generate a plurality of outputs comprising the first output and the second output over a respective plurality of output cycles comprising the first output cycle and the second output cycle.

7. The computer-implemented method of claim 6, wherein the block generation model comprises a diffusion model.

8. The computer-implemented method of claim 7, wherein the diffusion model comprises a text diffusion model.

9. The computer-implemented method of claim 8, wherein the diffusion model comprises a discrete text diffusion model.

10. The computer-implemented method of claim 8, wherein the diffusion model comprises a continuous text diffusion model.

11. The computer-implemented method of claim 7, wherein the diffusion model comprises a discrete diffusion model.

12. The computer-implemented method of claim 1, wherein the machine-learned model comprises a sequence processing model, the sequence processing model configured to generate a sequence of output values, and wherein the first output comprises a first sequence of the plurality of first output values and the second output comprises a second sequence of the plurality of second output values.

13. The computer-implemented method of claim 12, wherein the sequence processing model comprises one of a large language model (LLM) or a large multimodal model (LMM).

14. The computer-implemented method of claim 1, wherein at least one of the plurality of first output values or the plurality of second output values comprises text data.

15. The computer-implemented method of claim 1, wherein the method further comprises:
detecting, by the computing system, a modify interaction with the first user interface by a user during the first output cycle, the modify interaction descriptive of a selection of one or more of the plurality of first output values and a contextual aspect to be associated with the one or more of the plurality of first output values;
providing, by the computing system, contextual input to the machine-learned model at the second output cycle, the contextual input based on the contextual aspect associated with the one or more of the plurality of first output values; and
obtaining, by the computing system, the second output from the machine-learned model in response to providing the contextual input.

16. The computer-implemented method of claim 15, wherein the method further comprises providing the first output as input to the machine-learned model at the second output cycle;
wherein providing the contextual input to the machine-learned model comprises determining, by the computing system, one or more modified metric values associated with the one or more of the plurality of first output values based on the contextual aspect; and
wherein the one or more modified metric values are provided in place of the first metric values respective to the one or more of the plurality of first output values as input to the machine-learned model at the second output cycle.

17. The computer-implemented method of claim 16, wherein the contextual input comprises one of a positive indication or a negative indication with respect to the one or more of the plurality of first output values; and
wherein determining the one or more modified metric values comprises one of increasing or decreasing the first metric values respective to the one or more of the plurality of first output values in response to the positive indication or the negative indication.

18. The computer-implemented method of claim 15, wherein providing the contextual input to the machine-learned model comprises:
determining, by the computing system, a guidance message based on the contextual input, the guidance message comprising instructions for the machine-learned model responsive to the contextual aspect; and
providing the guidance message as the contextual input to the machine-learned model at the second output cycle.

19. The computer-implemented method of claim 15, wherein the method further comprises:
prior to detecting the modify interaction with the first user interface, detecting, by the computing system, a suspend interaction with the first user interface during the first output cycle;
in response to detecting the suspend interaction, suspending, by the computing system, a computing operation of the machine-learned model at the first output cycle;
prior to obtaining the second output from the machine-learned model, detecting, by the computing system, a resume interaction with the first user interface during suspension of the computing operation of the machine-learned model; and
in response to detecting the resume interaction, resuming the computing operation of the machine-learned model at the second output cycle to cause the machine-learned model to generate the second output.

20. A computing system, comprising:
one or more processors; and
one or more non-transitory, computer-readable media storing instructions that, when implemented, cause the one or more processors to perform operations comprising:
obtaining a first output comprising a plurality of first output values, the first output associated with a first output cycle of a machine-learned model;
obtaining a plurality of first metric values associated with the plurality of first output values;
determining a plurality of first formatting characteristic values respective to the plurality of first output values based on the plurality of first metric values;
providing the first formatting characteristic values for use in a first user interface having the plurality of first output values formatted according to the plurality of first formatting characteristic values;
obtaining a second output comprising a plurality of second output values, the second output associated with a second output cycle of the machine-learned model;
obtaining a plurality of second metric values respective to the plurality of second output values;
determining a plurality of second formatting characteristic values respective to the plurality of second output values based on the plurality of second metric values; and
providing the second formatting characteristic values for use in updating the first user interface to a second user interface having the plurality of second output values formatted according to the plurality of second formatting characteristic values.

21. A computing system configured to dynamically visualize an evolving model output by a user interface configured to display one or more output values with a gradient formatting characteristic, wherein the computing system is configured to vary the gradient formatting characteristic of each of the one or more output values along a gradient corresponding to a progression of generation of the evolving model output over one or more output cycles.

22. A computer-implemented method comprising:
for each of a sequence of output cycles associated with a machine-learned model:
obtaining, by a computing system comprising one or more computing devices, a model output comprising a plurality of output values;
obtaining, by the computing system, a plurality of metric values associated with the plurality of output values;
determining, by the computing system, a plurality of formatting characteristic values respective to the plurality of output values based on the plurality of metric values; and
providing, by the computing system, the formatting characteristic values for use in a user interface that displays the plurality of output values formatted according to the plurality of formatting characteristic values.

23. A computer-implemented method comprising:
for each of a sequence of output cycles associated with a machine-learned text diffusion model:
obtaining, by a computing system comprising one or more computing devices, a model output from the machine-learned text diffusion model, the model output comprising a plurality of text tokens;
obtaining, by the computing system, a plurality of confidence scores respectively associated with the plurality of text tokens;
determining, by the computing system, a plurality of shading values respectively associated with the plurality of text tokens based on the plurality of confidence scores; and
providing, by the computing system, the plurality of text tokens for use in a user interface that displays the plurality of text tokens rendered according to the plurality of shading values representative of the plurality of confidence scores respectively associated with the plurality of text tokens.
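
By way of illustration only, and not as part of or a limitation on the claims, the per-cycle operations recited above can be sketched in Python. The sketch assumes a linear correspondence between the metric scale and the formatting scale, a confidence range of 0.0 to 1.0, and a shading range of 0.2 to 1.0; none of these values is specified in the disclosure. The names `FormattedToken`, `scale_metric`, `format_cycle`, and `adjust_metric`, and the adjustment step of 0.25, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FormattedToken:
    """One output value together with its metric and formatting values."""
    text: str
    confidence: float
    shade: float


def scale_metric(metric: float,
                 metric_min: float = 0.0,
                 metric_max: float = 1.0,
                 fmt_min: float = 0.2,
                 fmt_max: float = 1.0) -> float:
    """Map a metric value onto the formatting range.

    Assumption: the correspondence between the two scales is linear.
    Metric values outside [metric_min, metric_max] are clamped first.
    """
    if metric_max <= metric_min:
        raise ValueError("metric_max must be greater than metric_min")
    clamped = min(max(metric, metric_min), metric_max)
    fraction = (clamped - metric_min) / (metric_max - metric_min)
    return fmt_min + fraction * (fmt_max - fmt_min)


def format_cycle(tokens: Sequence[str],
                 confidences: Sequence[float]) -> List[FormattedToken]:
    """Determine one shading value per token for a single output cycle."""
    if len(tokens) != len(confidences):
        raise ValueError("expected one confidence score per token")
    return [FormattedToken(t, c, scale_metric(c))
            for t, c in zip(tokens, confidences)]


def adjust_metric(metric: float, positive: bool, step: float = 0.25) -> float:
    """Raise or lower a metric after a positive or negative indication.

    The step size is a hypothetical choice; the result stays in [0.0, 1.0].
    """
    delta = step if positive else -step
    return min(max(metric + delta, 0.0), 1.0)


if __name__ == "__main__":
    # Two hypothetical output cycles of a text model with made-up scores.
    cycles = [
        (["the", "cat", "sat"], [0.5, 0.25, 0.0]),
        (["the", "cat", "slept"], [1.0, 0.75, 0.5]),
    ]
    for index, (tokens, confidences) in enumerate(cycles, start=1):
        formatted = format_cycle(tokens, confidences)
        print(index, [(f.text, round(f.shade, 2)) for f in formatted])
```

In this sketch a user interface would re-render the tokens after each cycle using the returned shading values, so that a token's shade deepens as its confidence score rises. The `adjust_metric` helper shows one possible way a positive or negative indication could increase or decrease a metric value before it is provided back to the model; other adjustment rules are equally consistent with the claims.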