Generating outputs using a trained model and a task-specific model

The adaptive ensemble of pre-trained and task-specific models addresses the issue of spurious features and overfitting, enhancing generalization and performance on unseen tasks with reduced computational costs.

WO2025166364A9 · PCT designated stage · Publication Date: 2026-06-25 · DEEPMIND TECH LTD +1

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
DEEPMIND TECH LTD
Filing Date
2025-02-03
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing machine learning models struggle with generalization to unseen tasks and are prone to overfitting on spurious features, leading to poor performance in new distributions due to reliance on training and fine-tuning data.

Method used

A system that adaptively ensembles a pre-trained model with a task-specific model, combining intermediate outputs from both models through multiple adapting layers to reduce the impact of spurious features while preserving necessary representations, thereby improving generalization.

Benefits of technology

The system enhances generalization capabilities with reduced computational resources by leveraging different views from pre-trained and task-specific models, ensuring improved performance on unseen tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating outputs using machine learning models. One of the methods includes receiving input data for a machine learning task; processing the input data to generate a respective output at each of one or more iterations, comprising, at each of the one or more iterations: processing an input for the iteration derived from the input data using a trained model that has been trained to perform one or more machine learning tasks; processing the input for the iteration using a task-specific model to generate a task-specific representation of the input for the machine learning task; for each adapting layer in a set of multiple adapting layers, processing an adapting layer input to generate a candidate output for the iteration; and generating the output for the iteration from the candidate outputs for the iteration generated by the adapting layers.

Description

[0001] Attorney Docket No. 45288-0431WO1

[0002] GENERATING OUTPUTS USING A TRAINED MODEL AND A TASK-SPECIFIC MODEL

[0004] CROSS-REFERENCE TO RELATED APPLICATION

[0005] This application claims priority to U.S. Application No. 63/548,840, filed February 1, 2024, the disclosure of which is incorporated herein by reference.

[0006] BACKGROUND

[0007] This specification relates to processing data using machine learning models.

[0008] Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model. Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

[0009] SUMMARY

[0010] This specification describes a system implemented as computer programs on one or more computers in one or more locations that generates outputs using a trained model and a task-specific model. For example, the outputs can be for a machine learning task that the trained model was not trained to perform.

[0011] According to one aspect there is provided a computer-implemented method comprising: receiving input data for a machine learning task; processing the input data to generate a respective output at each of one or more iterations, comprising, at each of the one or more iterations: processing an input for the iteration derived from the input data using a trained model that has been trained to perform one or more machine learning tasks, wherein the trained model comprises a plurality of trained layers, and wherein the trained model is configured to: process the input for the iteration using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; processing the input for the iteration using a task-specific model to generate a task-specific representation of the input for the machine learning task; for each adapting layer in a set of multiple adapting layers, processing an adapting layer input to generate a candidate output for the iteration, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer; and generating the output for the iteration from the candidate outputs for the iteration generated by the adapting layers.

[0012] In some implementations, the trained model has been trained to perform one or more machine learning tasks other than the machine learning task.

[0013] In some implementations, generating the output for the iteration comprises combining two or more of the candidate outputs.

[0014] In some implementations, combining two or more of the candidate outputs comprises averaging the two or more of the candidate outputs.

[0015] In some implementations, generating the output for the iteration comprises identifying one of the candidate outputs as the output.
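
The two strategies just described, averaging candidate outputs and identifying a single candidate as the output, can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names `average_candidates` and `select_candidate` are hypothetical, candidate outputs are assumed to be equal-length lists of scores, and selection by a caller-supplied index is an assumption, since the specification does not state how the single candidate is identified.

```python
from typing import List, Sequence


def average_candidates(candidates: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the candidate outputs (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate output")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("candidate outputs must have the same length")
    return [sum(c[i] for c in candidates) / len(candidates) for i in range(length)]


def select_candidate(candidates: Sequence[Sequence[float]], index: int) -> List[float]:
    """Return one candidate unchanged; how `index` is chosen is an assumption."""
    return list(candidates[index])


# Three adapting layers, each producing scores over two classes.
candidates = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
averaged = average_candidates(candidates)
last = select_candidate(candidates, -1)
```

Averaging uses every adapting layer's view of the input, while selection keeps a single view; which is preferable likely depends on the task.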

[0016] In some implementations, each of the multiple adapting layers has a same set of weights.

[0017] In some implementations, each of the multiple adapting layers corresponds to a different trained layer.

[0018] In some implementations, the adapting layer input is a concatenation of the task-specific representation and the respective intermediate output generated by each corresponding trained layer.

[0019] In some implementations, the input data comprises any one or more of: text data, image data, audio data, or video data.

[0020] In some implementations, the trained model is a Transformer-based neural network.

[0021] In some implementations, the task-specific model comprises one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

[0022] In some implementations, the task-specific model has a smaller size than the trained model.

[0023] In some implementations, the task-specific model has a smaller number of trainable parameters or a smaller number of layers than the trained model.

[0024] In some implementations, each of the one or more adapting layers comprises a multilayer perceptron.

[0025] In some implementations, the machine learning task comprises one of: a language processing task, or a computer vision task.

[0026] In some implementations, the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

[0027] According to a second aspect there is provided a computer-implemented method for generating an ensemble model for one or more second machine learning tasks, comprising: obtaining data specifying a trained model, wherein the trained model comprises a plurality of trained layers, wherein the trained model has been trained to perform one or more machine learning tasks, and wherein the trained model is configured to: receive input data; at each of one or more iterations, process an input for the iteration derived from the input data using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; obtaining data specifying a task-specific model that has a plurality of model parameters, wherein the task-specific model is configured to: at each of the one or more iterations, process the input for the iteration in accordance with the model parameters to generate a task-specific representation of the input; obtaining a plurality of training examples for the one or more second machine learning tasks; and generating the ensemble model for the one or more second machine learning tasks, wherein the ensemble model comprises the trained model, the task-specific model, and a set of multiple adapting layers, and wherein the ensemble model is configured to: at each of the one or more iterations, for each of the adapting layers, process a respective adapting layer input to generate a candidate output for the iteration for the one or more second machine learning tasks, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer, and wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model.

[0028] In some implementations, generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed.
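
One way to read "holding the trained model fixed" is that a gradient step updates only the parameters of the adapting layers and the task-specific model. The sketch below illustrates this with plain gradient descent on scalar parameters. The parameter keys, the `TRAINABLE_GROUPS` set, the learning rate, and the gradient values are all hypothetical, and the specification does not prescribe an optimizer.

```python
from typing import Dict, Tuple

# Parameters are keyed by (group, name); the group labels are hypothetical.
Params = Dict[Tuple[str, str], float]

TRAINABLE_GROUPS = {"adapting_layers", "task_specific_model"}


def sgd_step(params: Params, grads: Params, lr: float = 0.1) -> Params:
    """Apply one gradient-descent step, skipping the frozen trained model."""
    updated = {}
    for key, value in params.items():
        group, _ = key
        if group in TRAINABLE_GROUPS:
            updated[key] = value - lr * grads.get(key, 0.0)
        else:
            updated[key] = value  # frozen: the trained model is left unchanged
    return updated


params = {
    ("trained_model", "w"): 1.0,
    ("task_specific_model", "w"): 1.0,
    ("adapting_layers", "w"): 1.0,
}
grads = {key: 0.5 for key in params}
new_params = sgd_step(params, grads)
```

Only the two trainable groups move after the step; the trained model's parameter keeps its original value.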

[0029] In some implementations, generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers, the task-specific model, and the trained model.

[0030] In some implementations, the model parameters of the task-specific model specified by the obtained data are randomly initialized.

[0031] In some implementations, training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed comprises: training the set of multiple adapting layers and the task-specific model to minimize an aggregated loss function over losses for the candidate outputs.

[0032] In some implementations, the aggregated loss function comprises a weighted sum of the losses.
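
A weighted sum over per-candidate losses could look like the sketch below. Cross-entropy is used only as an example per-candidate loss, since the specification does not name the loss; the function names and the weight values are likewise hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target])


def aggregated_loss(
    candidate_probs: Sequence[Sequence[float]],
    target: int,
    weights: Sequence[float],
) -> float:
    """Weighted sum of the losses of the candidate outputs."""
    if len(candidate_probs) != len(weights):
        raise ValueError("one weight per candidate output is required")
    return sum(
        w * cross_entropy(p, target) for w, p in zip(weights, candidate_probs)
    )


# Two candidate outputs over two classes; the true class is 0.
candidate_probs = [[0.5, 0.5], [0.25, 0.75]]
loss = aggregated_loss(candidate_probs, target=0, weights=[0.5, 0.5])
```

Because every candidate output contributes a term, each adapting layer receives a training signal even when only a combined output is used at inference.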

[0033] In some implementations, each adapting layer has a same set of weights.

[0034] In some implementations, each adapting layer has a different set of weights.

[0035] In some implementations, the input data comprises any one or more of: text data, image data, audio data, or video data.

[0036] In some implementations, the trained model is a Transformer-based neural network.

[0037] In some implementations, the trained model has been further trained on a fine-tuning dataset for the one or more second machine learning tasks.

[0038] In some implementations, the task-specific model comprises any one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

[0039] In some implementations, the task-specific model has a smaller size than the trained model.

[0040] In some implementations, each of the one or more adapting layers comprises a multilayer perceptron.

[0041] In some implementations, the one or more second machine learning tasks comprises any one or more of: a language processing task, or a computer vision task.

[0042] In some implementations, the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

[0043] According to another aspect there are provided one or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of the methods described herein.

[0044] According to another aspect there is provided a system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the methods described herein.

[0045] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

[0046] The system can provide for improved generalization over possible inputs for a machine learning task. For example, the system can generate outputs of high quality for a variety of machine learning tasks and possible inputs for the machine learning tasks.

[0047] Fine-tuning a machine learning model, that has been trained on one or more machine learning tasks, for an unseen machine learning task can allow the machine learning model to learn characteristics of the unseen machine learning task. For example, the trained machine learning model can have been trained, or pre-trained, for one or more machine learning tasks that do not include the unseen machine learning task. In some cases, fine-tuning can result in improved performance of the machine learning model when deployed on the unseen machine learning task. However, in some cases, the machine learning model may not perform well during inference on inputs that were not represented in the fine-tuning data, e.g., in subpopulation, quality, time, or domain.

[0048] Furthermore, in some cases, the training data that the machine learning model was trained on, or the fine-tuning data that the machine learning model was fine-tuned on, can have spurious features. Spurious features can include features that are useful to increase accuracy during training or fine-tuning, but are not transferable at deployment with new distributions. Thus, the fine-tuned model may have learned representations for spurious features of the training data or the fine-tuning data, e.g., by over-fitting to the training or fine-tuning data. In some cases, the fine-tuned model can be affected by more spurious features than the machine learning model or a model that was trained from scratch on the fine-tuning data. The fine-tuned machine learning model may thus not generalize well to unseen distributions at inference.

[0049] Some conventional methods for improving generalization include identifying the limitations, such as spurious features, of the fine-tuning data. However, these methods may require information about the fine-tuning data or the training data the machine learning model was trained on, which may not be available.

[0050] Other conventional methods include preserving the representations learned by the machine learning model prior to fine-tuning to avoid overfitting to the fine-tuning data. However, these methods may assume that the representations learned by the machine learning model prior to fine-tuning are sufficient for the unseen machine learning task. In many cases, the machine learning model may not learn important representations for the unseen machine learning task. For example, relying on pre-trained representations may not be enough to learn essential representations for the unseen machine learning task. In addition, the representations learned by the machine learning model prior to fine-tuning may be based on spurious features.

[0051] The system described in this specification can reduce the impact of spurious features in training data or fine-tuning data while maintaining necessary representations to improve generalization. For example, the system described in this specification can leverage different views from different models to mitigate problems in the training data and fine-tuning data. The system can adaptively ensemble a pre-trained machine learning model, or a machine learning model that has been pre-trained on one or more machine learning tasks, with a task-specific model, also referred to as a trained-from-scratch model. The task-specific model can be trained on fine-tuning data. By combining the two models layer-wise, the system can reduce the impact of potentially problematic features while preserving necessary features. For example, the system can generate the output over one or more iterations. At each of the one or more iterations, the system can process an input for the iteration using a trained model with multiple trained layers. The system can generate an intermediate output for each of the trained layers. The system can also process the input for the iteration using a task-specific model to generate a task-specific representation. The system can provide one or more of the intermediate outputs from the trained layers of the trained model and the task-specific representation from the task-specific model to each adapting layer in a set of multiple adapting layers to generate a candidate output for the adapting layer. The system can generate an output for the iteration from the candidate outputs. Because the task-specific model is trained on fine-tuning data, the system can thus leverage different views (representations) from the pre-trained machine learning model and the task-specific model to reduce the impact of problematic features while preserving necessary features.
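
The per-iteration flow described in this paragraph can be sketched in a few lines. This is a toy illustration under stated assumptions, not the claimed implementation: plain Python functions stand in for neural network layers, each adapting layer is paired with exactly one trained layer, the adapting layer input is formed by concatenation, the same adapting function is shared across positions, and the candidate outputs are combined by averaging. The names `run_trained_model` and `ensemble_forward` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_trained_model(layers: Sequence[Layer], x: Vector) -> List[Vector]:
    """Run the trained layers in order and keep every intermediate output."""
    intermediates = []
    h = x
    for layer in layers:
        h = layer(h)
        intermediates.append(h)
    return intermediates


def ensemble_forward(
    trained_layers: Sequence[Layer],
    task_model: Layer,
    adapting_layers: Sequence[Layer],
    x: Vector,
) -> Vector:
    """One iteration: one adapting layer per trained layer, then averaging."""
    intermediates = run_trained_model(trained_layers, x)
    task_repr = task_model(x)
    candidates = [
        adapt(task_repr + h)  # list concatenation of the two representations
        for adapt, h in zip(adapting_layers, intermediates)
    ]
    n = len(candidates)
    return [sum(c[i] for c in candidates) / n for i in range(len(candidates[0]))]


# Toy stand-ins for the real networks (all hypothetical).
trained_layers = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
task_model = lambda v: [sum(v)]
shared_adapter = lambda v: [sum(v), max(v)]

output = ensemble_forward(
    trained_layers, task_model, [shared_adapter, shared_adapter], [1.0, 2.0]
)
```

Each adapting layer sees the same task-specific representation but a different intermediate output, which is how the earlier and later layers of the trained model contribute different views to the final output.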

[0052] The system can improve generalization using ensembling while using fewer computational resources during training and inference compared to conventional ensembling methods. For example, conventional ensembling may require training multiple machine learning models, or deploying multiple trained machine learning models, which can consume large amounts of computational resources. The system described in this specification can use a task-specific model that is smaller than the trained machine learning model, decreasing the computational costs of ensembling while allowing for the learning of task-specialized features for new tasks.

[0053] The system can also improve generalization using ensembling beyond conventional prediction ensembling by integrating intermediate layers of the trained model. For example, each adapting layer can correspond to one or more trained layers of the trained model. The system can provide one or more of the intermediate outputs from the corresponding trained layers and the task-specific representation to each adapting layer to generate a candidate output for the adapting layer. Furthermore, integrating different layers of the trained model can provide the system with different views. For example, earlier layers can be more general, while later layers can be more specific.

[0054] The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

[0055] BRIEF DESCRIPTION OF THE DRAWINGS

[0056] FIG. 1A shows an example system for generating outputs.

[0057] FIG. 1B shows an example trained model, an example task-specific model, and example adapting layers.

[0058] FIG. 2 is a flow diagram of an example process for generating outputs.

[0059] FIG. 3 shows an example system for generating an ensemble model.

[0060] FIG. 4 is a flow diagram of an example process for generating an ensemble model.

[0061] FIG. 5 shows the performance of a system for generating outputs.

[0062] Like reference numbers and designations in the various drawings indicate like elements.

[0063] DETAILED DESCRIPTION

[0064] FIG. 1A shows an example system 100 for generating outputs. The system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

[0065] The system 100 generates an output 140 for a machine learning task given input data 102 for the machine learning task using a trained model 110 and a task-specific model 120.

[0066] The system 100 can perform any kind of machine learning task, i.e., can be configured to receive any kind of digital data input and to generate any kind of score, classification, or regression output based on the input.

[0067] In some cases, the system 100 can perform an image processing or computer vision task, i.e., receive an input image and process the input image, i.e., process the intensity values of the pixels of the input image, to generate an output for the input image. For example, the task may be image classification and the output generated by the system for a given image may be respective scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category. As another example, the task can be image embedding generation and the output generated by the system can be a numeric embedding of the input image. As yet another example, the task can be object detection and the output generated by the system can identify locations in the input image at which particular types of objects are depicted. As yet another example, the task can be image segmentation and the output generated by the system can assign each pixel of the input image to a category from a set of categories.

[0068] As another example, if the input data includes Internet resources (e.g., web pages), documents, or portions of documents or features extracted from Internet resources, documents, or portions of documents, the task can be to classify the resource or document, i.e., the output generated by the system for a given Internet resource, document, or portion of a document may be a respective score for each of a set of topics, with each score representing an estimated likelihood that the Internet resource, document, or document portion is about the topic.

[0069] As another example, if the input data includes features of an impression context for a particular advertisement, the output generated by the system 100 may be a score that represents an estimated likelihood that the particular advertisement will be clicked on.

[0070] As another example, if the input data includes features of a personalized recommendation for a user, e.g., features characterizing the context for the recommendation, e.g., features characterizing previous actions taken by the user, the output generated by the system may be a respective score for each of a set of content items, with each score representing an estimated likelihood that the user will respond favorably to being recommended the content item.

[0071] As another example, if the input data includes a sequence of text in one language, the output generated by the system may be a respective score for each of a set of pieces of text in another language, with each score representing an estimated likelihood that the piece of text in the other language is a proper translation of the input text into the other language.

[0072] As another example, the task may be an audio processing task. For example, if the input data includes a sequence representing a spoken utterance, the output generated by the system 100 may be a respective score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance. As another example, the task may be a keyword spotting task where, if the input data is a sequence representing a spoken utterance, the output generated by the system can indicate whether a particular word or phrase (“hotword”) was spoken in the utterance. As another example, if the input to the system 100 is a sequence representing a spoken utterance, the output generated by the system can identify the natural language in which the utterance was spoken.

[0073] As another example, the task can be a language processing task. For example, the task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language.

[0074] For example, the system 100 can be configured to perform a natural language processing (NLP) task, i.e., receive input data that includes text and process the input data to generate a sequence of text responsive to the input data, such as question answering. For example, the trained model 110 can be a language model neural network.

[0075] As another example, the system 100 can be configured to generate code for computer programs (e.g., in a programming language such as Python, Java, C++, etc.) to perform tasks specified by the input data.

[0076] As another example, the task can be a text to speech task, where the input is text in a natural language or features of text in a natural language and the network output is a spectrogram or other data defining audio of the text being spoken in the natural language.

[0077] As another example, the task can be a health prediction task, where the input is electronic health record data for a patient and the output is a prediction that is relevant to the future health of the patient, e.g., a predicted treatment that should be prescribed to the patient, the likelihood that an adverse health event will occur to the patient, or a predicted diagnosis for the patient.

[0078] As another example, the task can be an agent control task, where the input is an observation characterizing the state of an environment and the output defines an action to be performed by the agent in response to the observation. The agent can be, e.g., a real-world or simulated robot, a control system for an industrial facility, or a control system that controls a different kind of agent.

[0079] In some cases, the machine learning task is a multi-modal processing task that requires processing multi-modal data. In general, multi-modal data is a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data. As one example the multi-modal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform. As another example, the multi-modal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform. Optionally, but not necessarily, the different types of data may represent the same or overlapping objects using the different modalities (types), and when processing multi-modal data, the data may be mapped into a common embedding space.

[0080] As a particular example, the task is a multi-modal processing task that requires processing both text and image inputs, so that the neural network 110 includes both a computer vision neural network and a text processing neural network. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa). Examples of such tasks include image captioning, image generation, and so on. More generally, the multi-modal processing task may correspond to any of the tasks previously described for any of the types of data making up the multi-modal combination.

[0081] In general, a multimodal machine learning model can be trained to perform any sort of machine learning task or tasks. After the multimodal machine learning model has been trained it can be deployed for use in performing the machine learning task(s). For instance, the machine learning model can be deployed in an environment that enables users to provide requests for the machine learning model to process specified multimodal inputs to generate corresponding model outputs. Users can provide the requests, e.g., by way of a user interface or through an application programming interface (API). The requests can be transmitted from a user device (e.g., over a data communication network, e.g., the internet) to one or more computers implementing the machine learning model, e.g., in a data center. The machine learning model can process multimodal inputs specified by user requests to generate corresponding model outputs, and then transmit the model outputs to user devices (e.g., over a data communication network).

[0082] In some implementations, after training, a particular task that is to be performed by the multimodal machine learning model can be described by part or all of the sequence of text in the multimodal input to the model. For example in a multimodal input that includes an image such a prompt might specify “Generate a caption”, “Generate a description”, “Answer the following question: [about the image or video]”, or “Detect a person”. Where the model is used for an agent control task a prompt may define “Take the knife out of the drawer”, or “Q: What action should the robot take to take the knife out of the drawer?”. Also or instead such a prompt may give one or more examples of a task to be performed. A multimodal machine learning model can be trained on multiple natural and / or computer languages and the prompt may then specify a language to use.

[0083] A few examples of some machine learning tasks that can be performed by a model trained as described herein follow. The tasks described below may be tasks that require spatial awareness or other context from the image or video. For example, a prompt may ask “What is the object in the top left corner?”

[0084] As one example, the task may comprise an object or action detection task. A task-specific training data item may comprise an image or video containing one or more objects or actions, and a sequence of text. The sequence of text may describe or otherwise label the object(s) or action(s) and may include text giving bounding box coordinates for the object(s) or action(s). After training, when the model is used in inference, the model output may comprise or represent text that describes or otherwise labels detected object(s) or action(s) in the image input, and may include bounding-box coordinates for the detected object(s) or action(s), e.g. “10 20 90 100 cat 20 30 100 100 dog”.
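
A text output in the style of the example string above could be turned into structured detections as sketched below. The grouping into four integer coordinates followed by a single-token label is inferred from that one example, and the specification does not say what each coordinate means, so the coordinates are kept as an unlabeled tuple. The name `parse_detections` is hypothetical.

```python
from typing import List, Tuple

Detection = Tuple[Tuple[int, int, int, int], str]


def parse_detections(text: str) -> List[Detection]:
    """Split the text into (four coordinates, label) groups of five tokens."""
    tokens = text.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected groups of four coordinates and one label")
    detections = []
    for i in range(0, len(tokens), 5):
        a, b, c, d = (int(t) for t in tokens[i : i + 4])
        detections.append(((a, b, c, d), tokens[i + 4]))
    return detections


detections = parse_detections("10 20 90 100 cat 20 30 100 100 dog")
```

Labels that span several words would need a different delimiter or a fixed label vocabulary; this sketch does not handle them.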

[0085] As another example, the task may comprise a classification task, e.g. an object or action classification task. A task-specific training data item may comprise an image or video item containing one or more objects or actions and a sequence of text. The sequence of text may describe or otherwise classify the object(s) or action(s). After training, when the model is used in inference, the model output may comprise data, e.g. text, that classifies the object(s) or action(s) in the image input into one of a plurality of classes.

[0086] As another example the task may comprise an image or video describing task, e.g. a captioning task (which, as used here, includes an audio description task to explain what is happening in a video). A task-specific training data item may comprise an image or video and a sequence of text describing the image or video. After training, when the model is used in inference, the model output may comprise data, e.g. text, describing the image or video. For example the model output may provide a caption or description or it may count objects in the image or video, or it may provide some other form of description.

[0087] As another example the task may comprise an image or video question-answering task. A task-specific training data item may comprise an image or video and a sequence of text that describes the image or video. After training, when the model is used in inference, the model output may comprise data, e.g. text, that answers a question about the second modality input specified in a prompt sequence of text, e.g. as described above. This may be used, e.g., to answer questions about visual plots and charts or about sounds.

[0088] As another example the task may comprise a character or word recognition task, e.g. an OCR (optical character recognition) task. A task-specific training data item may comprise an image or video and a sequence of text that includes text that is depicted in the image or video, or that is represented as speech in the audio item. After training, when the model is used in inference, the model output may comprise text that represents characters or words in the second modality input, e.g. in a natural language.

[0089] As another example the task may comprise a still or moving image generation task; Google DeepMind Imagen 3 is an example of a system that can generate an image output. For example an image such as a plot or chart may be decoded from one or more (language) tokens generated by the system. A training data item for such a system may comprise an image or video and a sequence of text that describes the image or video. After training, when the model is used in inference, the model output may comprise data for an image or video, e.g., image data defining values for pixels of a still or moving image, and the sequence of text in the multimodal input to the model may describe or characterize the image or video to be generated.

[0090] As another example the task may comprise a computer language text generation task. A task-specific training data item may comprise an image or video and a sequence of text in a computer language for generating the image or video. After training, when the model is used in inference, the model output may comprise text in the or another computer language for generating or rendering an image or video, e.g. a web page, plot, or chart.

[0091] In another example of a computer language text generation task a task-specific training data item may comprise an image or video and a sequence of text in a computer language for performing a task in relation to the image or video, e.g. a data processing task that involves analyzing the content of the image or video to provide a result of the analysis or, e.g., a search to search for information relating to the content of the image or video. The computer language in the model output may comprise computer language for invoking a function or calling one or more external APIs. Merely as one example, such an output may be formatted as a JSON object. As previously, the sequence of text in the multimodal input may define the task to be performed and the second modality input may comprise, e.g. an image or video in relation to which the task is to be performed, e.g. a task that involves manipulation of particular types of data that may benefit from access to an API such as mathematical data, date / time related data, scientific data, recent data that may post-date training of the model (that may be accessed by a search function or API), and so forth. After training, when the model is used in inference, the model output may comprise text in the or another computer language for performing a task, e.g. as described above, in relation to an image or video in the second modality input. The method may then include using the text in the computer language to perform the task.
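As a non-limiting illustration, a model output formatted as a JSON object could be used to invoke a function as sketched below. The JSON schema (the `name` and `arguments` keys), the `TOOLS` registry, and the `days_between` function are hypothetical and are used only to show the general pattern of using text in a computer language to perform a date-related task.

```python
import json
from datetime import date


def days_between(start: str, end: str) -> int:
    """Hypothetical date utility that a model output might invoke."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Hypothetical registry mapping function names to callables.
TOOLS = {"days_between": days_between}


def run_model_output(model_output: str):
    """Decode the JSON object and call the named function with its arguments."""
    call = json.loads(model_output)
    function = TOOLS[call["name"]]
    return function(**call["arguments"])


model_output = (
    '{"name": "days_between", '
    '"arguments": {"start": "2025-02-03", "end": "2025-02-10"}}'
)
result = run_model_output(model_output)
```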

[0092] In general where the model output comprises text this may be provided as speech representing the text.

[0093] As another example, the task can be an image generation task, i.e., a task of receiving input data and processing the input data to generate an image for the input data. For example, the input data can include text and / or an image. For example, the trained model 110 can be a visual language model (VLM) neural network.

[0094] As another example, the task can be a video generation task, i.e., a task of receiving input data and processing the input data to generate a video for the input data. For example, the input data can include text, an image, and / or a video.

[0095] As another example, the task can be an audio generation task, i.e., a task of receiving input data and processing the input data to generate audio for the input data. For example, the input data can include text, audio, an image, and / or a video.

[0096] As another example, the task can be a style transfer task to match a style of, e.g., images, video, audio, etc. specified by the input data. For example, the system 100 can be configured to output, e.g., sample images, sample videos, sample audio, etc., that match styles of the input data.

[0097] As another example, the task can be a transcription task in which the system 100 is configured to produce text transcriptions for, e.g., video, audio, etc. specified by the input data.

[0098] As another example, the task can be a summarization task in which the system 100 is configured to produce text summaries of, e.g., text, images, video, audio, etc. specified by the input data.

[0099] To generate an output for a machine learning task, the system 100 receives input data 102 for the machine learning task. The input data 102 can include, for example, text data, image data, video data, or audio data.

[0100] The system processes the input data 102 to generate a respective output 140 at each of one or more iterations using the trained model 110 and the task-specific model 120. In some examples, such as examples where the system 100 processes the input data in a single iteration, the output for the machine learning task is the output 140 for the iteration. In some examples, such as examples where the system 100 processes the input data 102 for more than one iteration, the output for the machine learning task can include the output 140 for more than one or all of the iterations.

[0101] At each of the one or more iterations, the system processes an input 104 for the iteration derived from the input data 102 using the trained model 110. In some examples, such as examples where the system 100 processes the input data in a single iteration, the input 104 for the iteration includes the input data 102. In some examples, such as examples where the system 100 processes the input data 102 for more than one iteration, such as for an autoregressive generation task, the input 104 for the iteration can include an output 140 from a previous iteration and, optionally, some or all of the input data 102.
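As a non-limiting illustration, the multi-iteration case can be sketched as a loop in which the input for each iteration is the input data followed by the outputs of previous iterations. The function `toy_step` is a hypothetical stand-in for the processing performed at one iteration, and the stop token and iteration limit are assumptions made for the sketch.

```python
from typing import Callable, List


def generate(input_data: List[str],
             step: Callable[[List[str]], str],
             max_iterations: int,
             stop_token: str = "<eos>") -> List[str]:
    """Run one step per iteration, feeding earlier outputs back in."""
    outputs: List[str] = []
    for _ in range(max_iterations):
        # Input for the iteration: the input data plus previous outputs.
        iteration_input = input_data + outputs
        output = step(iteration_input)
        if output == stop_token:
            break
        outputs.append(output)
    return outputs


def toy_step(tokens: List[str]) -> str:
    """Hypothetical stand-in for the per-iteration processing."""
    return "<eos>" if len(tokens) >= 5 else f"tok{len(tokens)}"


generated = generate(["a", "b"], toy_step, max_iterations=10)
```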

[0102] The trained model 110 can have been trained to perform one or more machine learning tasks. The trained model includes multiple trained layers 116a-k. The trained model 110 is configured to process the input 104 for the iteration using the trained layers 116a-k to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks.

[0103] The trained model 110 can have any appropriate architecture for performing the one or more machine learning tasks. For example, the trained model 110 can be a Transformer-based neural network, a generative model, a foundational model, a multimodal model, or a language model neural network. As particular examples, the trained model can be a Transformer-based language model neural network, a visual language model (VLM) neural network, or a vision transformer (ViT). In some examples, the trained model can include one or more attention layers. The trained model can include, in some examples, more than 100 million trained parameters, more than 1 billion trained parameters, more than 10 billion trained parameters, or more than 100 billion trained parameters.

[0104] As an example, a visual language model neural network may include an image encoder neural network and a text encoder.

[0105] As another example, the trained model 110 can have any appropriate neural network architecture that allows the model to map an input sequence of text tokens from a vocabulary to an output sequence of text tokens from the vocabulary or another vocabulary.

[0106] For example, the trained model 110 can have an encoder-decoder Transformer-based architecture.

[0107] As another example, the trained model 110 can have a decoder-only Transformer-based architecture.

[0108] In general, a Transformer-based architecture can be one which is characterized by having a succession of self-attention neural network layers. A self-attention neural network layer has an attention layer input for each element of the input and is configured to apply an attention mechanism over the attention layer input to generate an attention layer output for each element of the input. Generally, an attention mechanism maps respective queries and a set of key-value pairs to the attention layer outputs, where the queries, keys, and values are all vectors. Each attention layer output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility or similarity function, e.g. a dot product or scaled dot product, of the query with the corresponding key. There are many different attention mechanisms that may be used. One example is the query-key-value attention mechanism described in Vaswani et al. arXiv:1706.03762.
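As a non-limiting illustration, an attention mechanism that uses a scaled dot product as the compatibility function can be sketched as below. The sketch uses plain Python lists and a softmax normalization of the weights; it omits learned projections and multiple heads, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Normalize scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector]) -> List[Vector]:
    """Each output is a weighted sum of values; weights come from query-key scores."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product of the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
out = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```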

[0109] In particular, the trained model 110 can be an auto-regressive neural network that auto-regressively generates an output sequence of tokens by generating each particular token in the output sequence conditioned on a current input sequence that includes (i) the input sequence followed by (ii) any tokens that precede the particular token in the output sequence.

[0110] More specifically, to generate a particular token, the neural network can process the current input sequence to generate a score distribution, e.g., a probability distribution, that assigns a respective score, e.g., a respective probability, to each token in the vocabulary of tokens. The neural network can then select, as the particular token, a token from the vocabulary using the score distribution. For example, the neural network can greedily select the highest-scoring token or can sample, e.g., using top-k sampling, nucleus sampling or another sampling technique, a token from the distribution.
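As a non-limiting illustration, greedy selection and top-k sampling from a score distribution can be sketched as below. The toy vocabulary and probabilities are invented for the example, and the function names are hypothetical.

```python
import random
from typing import Dict


def greedy_select(probs: Dict[str, float]) -> str:
    """Return the highest-scoring token."""
    return max(probs, key=probs.get)


def top_k_sample(probs: Dict[str, float], k: int, rng: random.Random) -> str:
    """Sample from the k highest-scoring tokens, in proportion to their scores."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [score for _, score in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"cat": 0.6, "dog": 0.3, "fish": 0.1}
rng = random.Random(0)
greedy_token = greedy_select(probs)
sampled_token = top_k_sample(probs, k=2, rng=rng)
```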

[0111] As a particular example, the neural network can be an auto-regressive Transformer-based neural network that includes a plurality of layers that each apply a self-attention operation. The neural network can have any of a variety of Transformer-based neural network architectures. Examples of such architectures include those described in J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models, arXiv preprint arXiv:2203.15556, 2022; J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving. Scaling language models: Methods, analysis & insights from training Gopher. CoRR, abs/2112.11446, 2021; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019; Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020; and Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

[0112] The tokens in the vocabulary can be any appropriate text tokens, e.g., words, word pieces, punctuation marks, characters, bytes, and so on that represent elements of text in one or more natural languages and, optionally, numbers and other text symbols that are found in a corpus of text. For example, the system can tokenize a given sequence of words by applying a tokenizer, e.g., the SentencePiece tokenizer (Kudo et al., arXiv:1808.06226) or another tokenizer, to divide the sequence into tokens from the vocabulary.

[0113] Additionally, or alternatively, the vocabulary of tokens can include tokens that can represent data other than text. For example, the vocabulary of tokens can include image tokens that represent a discrete set of image patch embeddings of an image that can be generated by an image encoder neural network based on processing the image patches of the image. As another example, the vocabulary of tokens can include audio tokens that represent code vectors in a codebook of a quantizer, e.g., a residual vector quantizer.

[0114] The trained model 110 can be trained on a next token prediction task, e.g., a task that requires predicting, given a current sequence of tokens, the next token that follows the current sequence in the training data. For example, the trained model 110 can be trained on a maximum-likelihood objective.
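As a non-limiting illustration, a maximum-likelihood objective for next token prediction can be expressed as the mean negative log-likelihood that the model assigns to the tokens that actually follow in the training data. The toy distributions below are invented for the example, and the function name is hypothetical.

```python
import math
from typing import Dict, List


def next_token_nll(predicted: List[Dict[str, float]],
                   targets: List[str]) -> float:
    """Mean negative log-likelihood of the observed next tokens."""
    total = sum(math.log(dist[token]) for dist, token in zip(predicted, targets))
    return -total / len(targets)


# One predicted distribution per position, and the token that actually followed.
predicted = [{"the": 0.5, "a": 0.5}, {"cat": 0.25, "dog": 0.75}]
targets = ["the", "dog"]
loss = next_token_nll(predicted, targets)
```

Minimizing this quantity is equivalent to maximizing the likelihood of the training sequences under the model.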

[0115] As a particular example, the next token prediction task can be a language modeling task. The language modeling task can require, for each given unlabeled text sequence in a training data set, predicting a text sequence that followed the given unlabeled text sequence in a corresponding document. As a particular example, the trained model 110 can be pretrained on a maximum-likelihood objective on a large dataset of text, e.g., text that is publicly available from the Internet or another text corpus.

[0116] The trained model 110 can optionally have been fine-tuned, e.g., through supervised fine tuning (SFT) or instruction tuning. For example, the system can use a trained model 110 that has been fine-tuned for one or more second machine learning tasks to further improve the performance of the system 100.

[0117] The system 100 processes the input 104 for the iteration using the task-specific model 120 to generate a task-specific representation of the input 104 for the machine learning task.

[0118] The task-specific model 120 can have any appropriate architecture for performing the machine learning task. For example, the task-specific model can be a Transformer-based neural network, a generative model, a multimodal model, or a language model neural network. As particular examples, the task-specific model 120 can include a multilayer perceptron (MLP) model, embedding layers (e.g., embedding layers that transform tokens in the input 104, such as text or image tokens, into respective vector representations), and / or a convolutional neural network (CNN). The convolutional neural network can include, for example, multiple convolution layers, and batch normalization, ReLU, and max pooling operations. In some examples, the last layer can be flattened for generating the task-specific representation. For example, for a language processing task, the task-specific model 120 can include an MLP model with input embedding layers. For a computer vision task, the task-specific model 120 can include a CNN.

[0119] In some examples, the task-specific model 120 can have a smaller size than the trained model 110. For example, the task-specific model 120 can have a smaller number of trainable parameters or a smaller number of layers than the trained model.

[0120] In some examples, the task-specific model 120 and the trained model 110 can have different architectures. For example, the trained model 110 can include a ViT and the task-specific model 120 can include a CNN. As another example, the trained model 110 can include a language model neural network such as Gemini or PaLM (see Chowdhery et al. arXiv:2204.02311) and the task-specific model 120 can include an MLP.

[0121] In some examples, the task-specific model 120 and the trained model 110 can have similar architectures of different sizes. For example, the task-specific model 120 can include a language model neural network, and the trained model 110 can be a language model neural network of a larger size (e.g., more trainable parameters and / or more layers) than the task-specific model 120.

[0122] The task-specific model 120 can have been trained on fine-tuning data for the machine learning task. Training the task-specific model 120 is described in more detail below with reference to FIGS. 3-4.

[0123] For each adapting layer in a set of multiple adapting layers 130a-n, the system processes an adapting layer input to generate a candidate output 132 for the iteration. For example, as described below with reference to FIG. 1B, the adapting layer input for each adapting layer includes the task-specific representation generated by the task-specific model 120 and the respective intermediate output generated by the one or more corresponding trained layers for the adapting layer. Thus the system processes the respective intermediate output of each trained layer using the corresponding adapting layer for the trained layer.

[0124] As an example, when the task-specific model 120 includes a language model neural network or visual language model neural network, each of the adapting layers can include an MLP. The adapting layers and generating the candidate outputs 132a-n are described in more detail below with reference to FIG. 1B.

[0125] The system 100 generates the output 140 for the iteration from the candidate outputs 132a-n for the iteration generated by the adapting layers 130a-n.

[0126] For example, the system 100 can combine two or more of the candidate outputs 132a-n. For example, the candidate outputs 132a-n can be probability distributions over a set of possible outputs for the iteration. The system 100 can combine the probability distributions, and identify the possible output with the highest probability out of the combined probability distribution as the output 140 for the iteration.

[0127] As another example, the candidate outputs 132a-n can include regressed numerical values. The system 100 can combine the candidate outputs 132a-n, for example, by averaging the candidate outputs, and using the average as the output 140 for the iteration.

[0128] As another example, the system 100 can identify one of the candidate outputs 132a-n as the output 140. For example, the system 100 can identify the candidate output that has a highest confidence as the output 140.
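As a non-limiting illustration, the three ways of generating the output from the candidate outputs described above can be sketched as below. The sketch assumes that probability distributions are combined by a uniform average and that the confidence of a candidate is its largest probability; other combinations and confidence measures are possible, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def combine_distributions(candidates: List[Vector]) -> int:
    """Average the candidate distributions and pick the most probable output."""
    n = len(candidates)
    averaged = [sum(c[i] for c in candidates) / n
                for i in range(len(candidates[0]))]
    return argmax(averaged)


def combine_regression(candidates: Vector) -> float:
    """Average regressed numerical values."""
    return sum(candidates) / len(candidates)


def most_confident(candidates: List[Vector]) -> int:
    """Use the single candidate whose top probability is highest (an assumption)."""
    return argmax(max(candidates, key=max))


candidates = [[0.8, 0.2, 0.0], [0.2, 0.7, 0.1], [0.1, 0.7, 0.2]]
combined_choice = combine_distributions(candidates)
confident_choice = most_confident(candidates)
regressed = combine_regression([1.0, 2.0, 3.0])
```

In this toy example the two strategies disagree: averaging favors the second possible output, while selecting the most confident candidate favors the first.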

[0129] In some implementations, the system 100 can further process the candidate outputs 132a-n. For example, the system 100 can determine a measure of uncertainty for the candidate outputs 132a-n. For example, the system can compute the variance of the candidate outputs 132a-n to determine the measure of uncertainty. The measure of uncertainty can represent a confidence of the output 140. In some examples, the system 100 can provide the measure of uncertainty for presentation to a user.
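As a non-limiting illustration, the variance of the candidate outputs can be computed as below for regressed values. The use of the population variance is an assumption made for the sketch, and the function name is hypothetical.

```python
from statistics import pvariance
from typing import List


def uncertainty(candidates: List[float]) -> float:
    """Population variance of the candidate outputs; higher means less agreement."""
    return pvariance(candidates)


agreeing = uncertainty([2.0, 2.0, 2.0])
disagreeing = uncertainty([1.0, 2.0, 3.0])
```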

[0130] The system can thus leverage different views from the trained model 110 and the task-specific model 120. For example, by ensembling the trained model 110 and the task-specific model 120, the system 100 can obtain a general view (e.g., representation) from pre-training data for training the trained model 110 and a task-specialized view (e.g., representation) from fine-tuning data for training the task-specific model 120. In addition, by ensembling within the model, e.g., by generating candidate outputs for each adapting layer 130a-n that each correspond to different layers of the trained model 110, the system 100 can utilize different information from each intermediate layer of the trained model 110.

[0131] FIG. 1B shows the example trained model 110, the example task-specific model 120, and the example adapting layers 130a-n of FIG. 1A. In particular, FIG. 1B shows the processing of the input Xi 104 for one iteration. The trained model 110, the task-specific model 120, and the adapting layers 130a-n are also referred to as an ensemble model.

[0132] The system processes the input 104 for the iteration using the trained model 110. The trained model 110 is configured to process the input 104 using the trained layers 116a-k to generate, for each trained layer, a respective intermediate output 118a-k for the one or more machine learning tasks. For example, the trained model processes the input 104 using the trained layer 116a to generate the intermediate output 118a.

[0133] The system processes the input 104 using the task-specific model 120 to generate a task-specific representation 122 for the machine learning task. The task-specific representation 122 can include a representation of the input 104 that is specialized for the machine learning task.

[0134] Each adapting layer 130a-n corresponds to one or more of the trained layers 116a-k. For example, a particular adapting layer can correspond to a respective one of the trained layers 116a-k or multiple trained layers 116a-k. In the example of FIG. 1B, each of the adapting layers 130a-n can correspond to a different one of the trained layers 116a-k. That is, n and k can be the same integer.

[0135] The particular adapting layer can correspond to one or more early trained layers (i.e., one or more trained layers occurring relatively near to the start of the sequence of trained layers), or one or more later trained layers (i.e., one or more trained layers occurring after the one or more early trained layers). For example, early trained layers can contribute positively to performance on inputs that were not represented during training and later trained layers can contribute to performance on inputs that were represented during training.

[0136] In some examples, each of the adapting layers 130a-n can correspond to a different trained layer of the trained layers 116a-k, or different trained layers of the trained layers 116a-k.

[0137] For each of the adapting layers 130a-n, the system processes an adapting layer input 128a-n to generate a candidate output 132a-n. The adapting layer input 128a-n for the adapting layer 130a-n can include the task-specific representation 122 and the respective intermediate output generated by each corresponding trained layer. For example, the adapting layer input 128a-n can include a concatenation of the task-specific representation 122 and the respective intermediate output generated by each corresponding trained layer. In the example of FIG. 1B, the adapting layer input 128a includes the task-specific representation 122 and the intermediate output 118a.
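As a non-limiting illustration, forming each adapting layer input by concatenation and generating one candidate output per adapting layer can be sketched as below. The sketch assumes that each adapting layer is a single linear layer followed by a softmax and that each adapting layer corresponds to exactly one trained layer; the vectors and randomly initialized weights are invented stand-ins, not values produced by any real model.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapting_layer(task_repr: Vector, intermediate: Vector,
                   weights: List[Vector], bias: Vector) -> Vector:
    """Linear layer plus softmax applied to the concatenated input."""
    layer_input = task_repr + intermediate  # list concatenation of the two views
    logits = [sum(w * x for w, x in zip(row, layer_input)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)
task_repr = [0.5, -1.0]                 # stand-in for the task-specific representation
intermediates = [[0.1, 0.2, 0.3],       # stand-in for an early intermediate output
                 [0.4, 0.5, 0.6]]       # stand-in for a later intermediate output
num_outputs = 2
in_dim = len(task_repr) + len(intermediates[0])

candidates = []
for hidden in intermediates:
    # Each adapting layer has its own weights in this sketch.
    W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(num_outputs)]
    b = [0.0] * num_outputs
    candidates.append(adapting_layer(task_repr, hidden, W, b))
```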

[0138] In some examples, each of the adapting layers 130a-n can have a same set of weights. In some examples, each of the adapting layers 130a-n can have different sets of weights.

[0139] Each of the adapting layers 130a-n can have any appropriate network architecture for generating a candidate output 132a-n for the iteration from the adapting layer input. As an example, each adapting layer can include an MLP with one or more hidden layers, or a single linear layer optionally followed by a non-linear output layer.

[0140] In some implementations, the system can apply efficient ensembling techniques to reduce computational and memory costs used by the adapting layers 130a-n. For example, the system can represent the weights of each of the adapting layers 130a-n using a single-rank matrix, as described in further detail in Wen, Y., et al., BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning, arXiv preprint arXiv:2002.06715, 2020.
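As a non-limiting illustration, one way to realize such weight sharing, in the spirit of the cited BatchEnsemble approach, is to give all adapting layers one shared weight matrix and to give each adapting layer only two vectors whose outer product (a rank-one matrix) rescales the shared matrix elementwise. The sketch below assumes this formulation and shows that the per-layer product can be computed without materializing a per-layer matrix; the function names and values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def member_weight(shared: Matrix, r: Vector, s: Vector) -> Matrix:
    """Explicit per-layer weight: shared[a][b] * r[a] * s[b]."""
    return [[shared[a][b] * r[a] * s[b] for b in range(len(s))]
            for a in range(len(r))]


def member_forward(shared: Matrix, r: Vector, s: Vector, x: Vector) -> Vector:
    """Same product computed as r * (shared @ (s * x)), elementwise on vectors."""
    scaled_input = [si * xi for si, xi in zip(s, x)]
    return [ra * ya for ra, ya in zip(r, matvec(shared, scaled_input))]


shared = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
r, s = [0.5, 2.0], [1.0, -1.0, 0.5]   # per-layer vectors (hypothetical values)
x = [2.0, 1.0, 4.0]

explicit = matvec(member_weight(shared, r, s), x)
efficient = member_forward(shared, r, s, x)
```

Under this assumption, each additional adapting layer stores only the two vectors rather than a full weight matrix.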

[0142] FIG. 2 shows an example process 200 for generating outputs. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system for generating outputs, e.g., the system 100 of FIG. 1, appropriately programmed, can perform the process 200.

[0143] The system receives input data for a machine learning task (step 202). In some implementations, the system can receive the input data from a user device, e.g., through a user interface of the user device.

[0144] The system processes the input data to generate a respective output at each of one or more iterations. The system performs steps 204-210 at each of the one or more iterations.

[0145] The system processes an input for the iteration using a trained model (step 204). The input for the iteration is derived from the input data. In some examples, such as examples where the system processes the input data in a single iteration, the input includes the input data. In some examples, such as examples where the system processes the input data for more than one iteration, the input for the iteration can include an output from a previous iteration and, optionally, some or all of the input data.

[0146] The trained model can have been trained to perform one or more machine learning tasks. For example, the trained model can have been trained to perform one or more machine learning tasks other than the machine learning task.

[0147] The trained model includes multiple trained layers. The trained model is configured to process the input for the iteration using the multiple trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks.

[0148] In some implementations, the system pre-processes the input for the trained model. For example, the machine learning task can be a language-based recommendation task, and the input can include text specifying attribute values. In some examples, the trained model includes a language model neural network. The system can pre-process the input to generate a prompt that includes the attribute values in an input sentence.

[0149] The system processes the input for the iteration using a task-specific model to generate a task-specific representation of the input for the machine learning task (step 206).

[0150] In some implementations, the system pre-processes the input for the task-specific model. For example, the machine learning task can be a language-based recommendation task, and the input can include text specifying attribute values. In some examples, the task-specific model includes one or more embedding layers and an MLP. The system can pre-process the input to generate numerical features for the attribute values. The system can provide the numerical features as input to the one or more embedding layers.
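As a non-limiting illustration, the two kinds of pre-processing for a language-based recommendation task can be sketched as below: the same attribute values are written into an input sentence for a language model and mapped to integer identifiers for embedding layers. The prompt template, attribute names, and vocabularies are hypothetical.

```python
from typing import Dict, List


def to_prompt(attributes: Dict[str, str]) -> str:
    """Write the attribute values into an input sentence (hypothetical template)."""
    described = ", ".join(f"{name}: {value}" for name, value in attributes.items())
    return f"Given the attributes ({described}), recommend an item."


def to_numerical_features(attributes: Dict[str, str],
                          vocabularies: Dict[str, Dict[str, int]]) -> List[int]:
    """Map each attribute value to an integer id usable by an embedding layer."""
    return [vocabularies[name][value] for name, value in attributes.items()]


attributes = {"genre": "jazz", "decade": "1960s"}
vocabularies = {
    "genre": {"rock": 0, "jazz": 1},
    "decade": {"1950s": 0, "1960s": 1, "1970s": 2},
}

prompt = to_prompt(attributes)
features = to_numerical_features(attributes, vocabularies)
```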

[0151] Further details for pre-processing are described in Geng, S., et al., Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5), arXiv preprint arXiv:2203.13366, 2022.

[0152] For each adapting layer in a set of multiple adapting layers, the system processes an adapting layer input to generate a candidate output for the iteration (step 208). Each adapting layer corresponds to one or more of the trained layers.

[0153] The adapting layer input for an adapting layer includes the task-specific representation and the respective intermediate output generated by each corresponding trained layer.

[0154] Each adapting layer generates a candidate output using the task-specific representation generated by the task-specific model and the intermediate output generated by each corresponding trained layer of the trained model. Thus the system can perform ensembling over the two models layer-wise. Because the task-specific model is trained on fine-tuning data for the machine learning task, and the trained model is trained on more general training data, the system can leverage different views from the task-specific model and the trained model using the adapting layers, allowing for improved generalization at inference.

[0155] The system generates the output for the iteration from the candidate outputs for the iteration generated by the adapting layers (step 210). For example, as described above with reference to FIG. 1A, the system can combine two or more of the candidate outputs to generate the output, or identify one of the candidate outputs as the output for the iteration.

[0156] The system can thus perform ensembling using the candidate outputs generated by the adapting layers. For example, the system can use different information from each trained layer. For example, early layers can be more general and later layers can be more specific.

[0157] In examples where the system processes the input data in a single iteration, the system can generate the output for the machine learning task as the output for the iteration. In examples where the system processes the input data for more than one iteration, the system can generate the output for the machine learning task from the outputs for all of the iterations. For example, the output can include each of the tokens generated for all iterations. In some implementations, the system can provide data representing the output for the machine learning task for presentation on a user device.

[0158] FIG. 3 shows an example system 300 for generating an ensemble model 330. The system 300 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

[0159] The ensemble model 330 can be trained to perform machine learning tasks such as the machine learning tasks described above with reference to FIG. 1A.

[0160] The ensemble model 330 includes the trained model 110, the task-specific model 120, and the set of multiple adapting layers 130a-n described above with reference to FIGS. 1A-1B. The outputs of each of the adapting layers 130a-n serve as an ensemble of different models. The system 300 can train one or more of the task-specific model 120, the adapting layers 130a-n, or the trained model 110 for performing specific machine learning tasks, also referred to as second machine learning tasks. For example, a second machine learning task can be a machine learning task other than the machine learning tasks the trained model 110 was trained to perform. As particular examples, the second machine learning task can include a language processing task or a computer vision task.

[0161] In some implementations, the system 300 is part of the system 100 described with reference to FIGS. 1A-1B. For example, the system 100 can train the ensemble model 330 to perform a second machine learning task. The system 100 can then receive input data for the second machine learning task, and process the input data using the ensemble model to generate an output for the second machine learning task.

[0162] To generate the ensemble model 330, the system 300 obtains data specifying the trained model 110 and the task-specific model 120. In some examples, the model parameters of the task-specific model 120 can be randomly initialized parameters. The system 300 also obtains multiple training examples 302 for the one or more second machine learning tasks.

[0163] The system 300 includes the trained model 110 and the task-specific model 120 in the ensemble model 330. The system also includes the set of multiple adapting layers 130a-n in the ensemble model 330.

[0164] The system can train the ensemble model 330 using a training engine 320. The training engine 320 is configured to train any one or more of the trained model 110, the task-specific model 120, or the adapting layers 130a-n, e.g., by updating the current parameter values of the trained model 110, the task-specific model 120, or the adapting layers 130a-n at each of multiple training iterations.

[0165] For example, instead of updating all of the parameters of the ensemble model 330, the training engine 320 can only update the parameters of the task-specific model 120 and the adapting layers 130a-n and hold the parameters of the trained model 110 fixed during the training.
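A minimal sketch of such a selective update is given below, assuming the parameters are kept in named groups and updated with plain gradient descent. The group names, the function `sgd_update`, and all numeric values are hypothetical and serve only to illustrate holding one group fixed.

```python
def sgd_update(params, grads, trainable, lr=0.1):
    """Apply a gradient step only to the parameter groups listed in `trainable`."""
    updated = {}
    for group, values in params.items():
        if group in trainable:
            updated[group] = [p - lr * g for p, g in zip(values, grads[group])]
        else:
            updated[group] = list(values)  # frozen group: copied unchanged
    return updated


params = {
    "trained_model": [1.0, 2.0],
    "task_specific_model": [0.5],
    "adapting_layers": [0.0, 1.0],
}
grads = {
    "trained_model": [1.0, 1.0],
    "task_specific_model": [1.0],
    "adapting_layers": [2.0, -2.0],
}
new_params = sgd_update(
    params, grads, trainable={"task_specific_model", "adapting_layers"}, lr=0.5
)
```

Because the trained model group is excluded from `trainable`, its gradient is ignored and its values are carried over as they are.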

[0166] For example, the training engine 320 can apply a gradient descent with backpropagation training technique that uses, e.g., a stochastic gradient descent, RMSprop, or Adam optimizer, or another known or learned optimizer, to optimize an objective function that is appropriate for the second machine learning task of the training examples 302. Example loss functions are described below with reference to FIG. 4.

[0167] At each training step of the training process, the training engine 320 processes a batch of training examples from the training examples 302 in accordance with the current values of the parameters to generate a training output for each adapting layer 130a-n. The training engine 320 determines, with respect to the parameters of the ensemble model 330, a gradient of an objective function that measures the overall quality of the training outputs for the batch of training examples. At the end of each training step, the training engine 320 applies, e.g., through backpropagation, respective updates to at least some of the current values of the parameters of the ensemble model 330 using the gradient determined at the training step.
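The training step can be illustrated with a deliberately small, dependency-free sketch. It assumes a scalar input, two frozen trained layers, a one-parameter task-specific model, and two linear adapting layers; these shapes, the squared-error objective, and all names are hypothetical. For brevity the sketch estimates the gradient with central finite differences rather than backpropagation, which a practical implementation would use.

```python
def forward(x, theta, frozen=(2.0, 3.0)):
    """Return one candidate output per adapting layer for a scalar input x."""
    u, v1, c1, v2, c2 = theta   # trainable: task-specific weight, two adapting layers
    a1, a2 = frozen             # frozen weights of the trained model
    h1 = a1 * x                 # intermediate output of trained layer 1
    h2 = a2 * h1                # intermediate output of trained layer 2
    t = u * x                   # task-specific representation
    return [v1 * h1 + c1 * t, v2 * h2 + c2 * t]


def aggregated_loss(theta, batch):
    """Squared error summed over adapting layers, averaged over the batch."""
    total = sum((y_hat - y) ** 2 for x, y in batch for y_hat in forward(x, theta))
    return total / len(batch)


def numerical_grad(fn, theta, eps=1e-6):
    """Central finite-difference gradient (stand-in for backpropagation)."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def train_step(theta, batch, lr=0.005):
    """One gradient-descent update of the trainable parameters only."""
    grad = numerical_grad(lambda th: aggregated_loss(th, batch), theta)
    return [p - lr * g for p, g in zip(theta, grad)]


batch = [(1.0, 4.0), (2.0, 8.0)]  # toy regression data with target y = 4x
theta = [0.1, 0.1, 0.1, 0.1, 0.1]
loss_before = aggregated_loss(theta, batch)
for _ in range(200):
    theta = train_step(theta, batch)
loss_after = aggregated_loss(theta, batch)
```

The frozen weights never appear in `theta`, so only the task-specific model and the adapting layers are updated at each step.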

[0168] The system 300 can thus train the task-specific model 120 to learn necessary and specialized representations for the second machine learning tasks of the training examples 302. At inference, as described with reference to FIGS. 1A-2, the system can ensemble the information from the trained model 110 and the task-specific model 120.

[0169] FIG. 4 is a flow diagram of an example process 400 for generating an ensemble model for one or more second machine learning tasks. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system for generating an ensemble model, e.g., the system 300 of FIG. 3, appropriately programmed, can perform the process 400.

[0170] The system obtains data specifying a trained model (step 402). In some examples, the trained model can have been trained to perform machine learning tasks that do not include the second machine learning tasks.

[0171] In some examples, the trained model can have been trained to perform machine learning tasks that include the second machine learning tasks. For example, the trained model can have been further trained, e.g., through supervised fine tuning (SFT) or instruction tuning, on a fine-tuning dataset for the one or more second machine learning tasks. In some examples, the trained model can have been further trained to update some or all of the parameters for the one or more second machine learning tasks.

[0172] In some examples, the trained model can have been trained using training methods such as LoRA, which is described in further detail in Hu, E. J., et al., LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685, 2022, and Chen, M., et al., Generative pretraining from pixels, in International Conference on Machine Learning (ICML), volume 119, pp. 1691-1703, 2020. The system can thus achieve improved performance while improving the overall training efficiency.

[0173] The trained model, as described above, includes multiple trained layers. The trained model is configured to receive input data. At each of one or more iterations, the trained model processes an input for the iteration derived from the input data using the multiple trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks.

[0174] The system obtains data specifying the task-specific model (step 404). The task-specific model includes multiple model parameters. In some examples, the task-specific model can have randomly initialized parameters.

[0175] The task-specific model, as described above, is configured to, at each of the one or more iterations, process the input for the iteration in accordance with the model parameters to generate a task-specific representation of the input.

[0176] The system obtains multiple training examples for the second machine learning tasks (step 406). Each training example can include a training input and a target output for a second machine learning task.

[0177] The system generates the ensemble model for the one or more second machine learning tasks (step 408). The ensemble model includes the trained model, the task-specific model, and a set of multiple adapting layers. As described above, each adapting layer can correspond to one or more of the trained layers.

[0178] The ensemble model, as described above, is configured to, at each of the one or more iterations, for each of the adapting layers, process a respective adapting layer input to generate a candidate output for the iteration for the one or more second machine learning tasks. The adapting layer input for the adapting layer includes the task-specific representation and the respective intermediate output generated by each corresponding trained layer.
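One way to picture the adapting layer input is as a concatenation of the task-specific representation with the intermediate output of the corresponding trained layer. The sketch below assumes a single linear adapting layer whose weights are shared across trained layers; the adapting layers could instead be multilayer perceptrons or have separate weights. All names, shapes, and values are hypothetical.

```python
def linear(weights, bias, inputs):
    """Dense layer: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def adapting_layer_input(task_repr, intermediate_outputs):
    """Concatenate the task-specific representation with intermediate outputs."""
    combined = list(task_repr)
    for h in intermediate_outputs:
        combined.extend(h)
    return combined


task_repr = [1.0, 0.0]  # from the task-specific model
intermediates = {"layer_1": [0.5, 0.5], "layer_2": [2.0, -1.0]}

# One linear adapting layer (4 inputs, 1 output) applied per trained layer.
weights = [[1.0, 1.0, 2.0, 1.0]]
bias = [0.5]
candidates = {
    name: linear(weights, bias, adapting_layer_input(task_repr, [h]))
    for name, h in intermediates.items()
}
```

Each entry of `candidates` is the candidate output tied to one trained layer, ready to be ensembled.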

[0179] The system can generate the ensemble model by training the ensemble model on the training examples. For example, the system can train the ensemble model by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed. As a particular example, the system can train the set of multiple adapting layers and the task-specific model to minimize an aggregated loss function over losses for the candidate outputs.

[0180] An example aggregated loss function is shown below:

$$\mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \ell\left(\hat{y}_i^{j}, y_i\right)$$

where $m$ is the number of training examples, $n$ is the number of adapting layers, $\hat{y}_i^{j}$ is the candidate output from the adapting layer $j$ for the input data $i$, $y_i$ is the target output for the input data $i$, and $\ell$ is the loss for a single candidate output.

[0181] In some examples, the aggregated loss function can include a weighted sum of the losses. For example, the aggregated loss function can include a weighted sum of the losses for the candidate outputs $\hat{y}_i^{j}$. An example aggregated loss function is shown below:

$$\mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \lambda_j \, \ell\left(\hat{y}_i^{j}, y_i\right)$$

where $\lambda_j$ is the weight for the loss of each adapting layer $j$. In some examples, $\lambda_j$ can be a hyperparameter to tune the importance between different intermediate layers of the trained model. For example, to focus on more specific features, the weight can be higher for later layers. To focus on more general features, the weight can be higher for earlier layers.
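As a hedged illustration, an aggregated loss of this kind can be computed as follows, assuming the per-output loss is a squared error and that the total is averaged over the training examples (both are assumptions made for the sketch). The function names and the sample values are hypothetical.

```python
def squared_error(y_hat, y):
    """Per-output loss for one candidate output and its target."""
    return (y_hat - y) ** 2


def aggregated_loss(candidate_outputs, targets, layer_weights=None,
                    loss_fn=squared_error):
    """candidate_outputs[i][j] is the output of adapting layer j for example i."""
    n = len(candidate_outputs[0])
    weights = layer_weights if layer_weights is not None else [1.0] * n
    total = 0.0
    for per_layer, y in zip(candidate_outputs, targets):
        total += sum(w * loss_fn(y_hat, y) for w, y_hat in zip(weights, per_layer))
    return total / len(targets)


candidate_outputs = [[1.0, 2.0], [3.0, 5.0]]  # two examples, two adapting layers
targets = [2.0, 4.0]
unweighted = aggregated_loss(candidate_outputs, targets)
later_layer_emphasis = aggregated_loss(
    candidate_outputs, targets, layer_weights=[0.5, 1.5]
)
```

Passing larger weights for later adapting layers emphasizes more specific features, and larger weights for earlier layers emphasizes more general ones.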

[0182] Each loss may vary across different tasks, but typically, the loss function measures a quality of the training output, e.g., that measures a difference between the candidate output and the known, target output (or another target output that is derived from the known, target output) of the training example. A cross-entropy loss function, e.g., in the case of classification tasks, and a mean squared error (MSE) loss function, e.g., in the case of regression tasks, are examples of suitable loss functions that can be used during the training.
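For reference, the two loss functions named above can be sketched in a few lines. The cross-entropy version assumes the candidate output is a vector of unnormalized scores (logits) and the target is a class index; the function names and inputs are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target_index]


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


ce = cross_entropy([0.0, 0.0], 0)                  # two equally likely classes
mse = mean_squared_error([1.0, 3.0], [2.0, 5.0])   # (1 + 4) / 2
```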

[0183] In some examples, the system can also train the trained model on the training examples. For example, the system can train the set of multiple adapting layers, the task-specific model, and at least some parameters of the trained model. For example, the system can update all of the parameters, half of the parameters, or parameters of particular layers such as the last linear layer.

[0184] FIG. 5 shows the performance of a system for generating outputs. In particular, FIG. 5 shows the performance of a variety of techniques for generating outputs for a language processing task in terms of root-mean-square error, number of parameters, and FLOPs, which can be used to estimate required memory and computational costs.
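The metrics mentioned above can be computed along the following lines. This is a sketch under stated assumptions: the parameter and FLOP counts are shown for a single dense layer with a bias term, and the FLOP estimate uses the common convention of two operations (one multiply, one add) per weight. The function names are hypothetical, and the values are not taken from FIG. 5.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between predictions and targets."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


def dense_layer_cost(n_in, n_out):
    """Parameter count and approximate forward FLOPs per example for a dense layer."""
    params = n_in * n_out + n_out  # weights plus biases
    flops = 2 * n_in * n_out       # one multiply and one add per weight
    return params, flops


error = rmse([0.0, 0.0], [3.0, 3.0])
params, flops = dense_layer_cost(4, 3)
```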

[0185] As can be seen from FIG. 5, the system described in this specification (labeled as "LEVI" in FIG. 5) performs training and inference faster and using fewer or a comparable number of training parameters and inference parameters, while achieving better out-of-distribution generalization, i.e., generalization for inputs not seen during training.

[0186] In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions.

[0187] The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure.

[0188] The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics.

[0189] A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPGPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics.

[0190] In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.

[0191] The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases.

[0192] Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage.

[0193] Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence.

[0194] To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard, touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction.

[0195] Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.

[0196] Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.

[0197] The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP / IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities.

[0198] In addition to the embodiments described above, the following embodiments are also innovative:

[0199] Embodiment 1 is a computer-implemented method comprising: receiving input data for a machine learning task; processing the input data to generate a respective output at each of one or more iterations, comprising, at each of the one or more iterations: processing an input for the iteration derived from the input data using a trained model that has been trained to perform one or more machine learning tasks, wherein the trained model comprises a plurality of trained layers, and wherein the trained model is configured to: process the input for the iteration using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; processing the input for the iteration using a task-specific model to generate a task-specific representation of the input for the machine learning task; for each adapting layer in a set of multiple adapting layers, processing an adapting layer input to generate a candidate output for the iteration, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer; and generating the output for the iteration from the candidate outputs for the iteration generated by the adapting layers.

[0200] Embodiment 2 is the method of embodiment 1, wherein the trained model has been trained to perform one or more machine learning tasks other than the machine learning task.

[0201] Embodiment 3 is the method of any of embodiments 1-2, wherein generating the output for the iteration comprises combining two or more of the candidate outputs.

[0202] Embodiment 4 is the method of embodiment 3, wherein combining two or more of the candidate outputs comprises averaging the two or more of the candidate outputs.

[0203] Embodiment 5 is the method of any of embodiments 1-2, wherein generating the output for the iteration comprises identifying one of the candidate outputs as the output.

[0204] Embodiment 6 is the method of any of embodiments 1-5, wherein each of the multiple adapting layers has a same set of weights.

[0205] Embodiment 7 is the method of any of embodiments 1-6, wherein each of the multiple adapting layers corresponds to a different trained layer.

[0206] Embodiment 8 is the method of any of embodiments 1-7, wherein the adapting layer input is a concatenation of the task-specific representation and the respective intermediate output generated by each corresponding trained layer.

[0207] Embodiment 9 is the method of any of embodiments 1-8, wherein the input data comprises any one or more of: text data, image data, audio data, or video data.

[0208] Embodiment 10 is the method of any of embodiments 1-9, wherein the trained model is a Transformer-based neural network.

[0209] Embodiment 11 is the method of any of embodiments 1-10, wherein the task-specific model comprises one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

[0210] Embodiment 12 is the method of any of embodiments 1-11, wherein the task-specific model has a smaller size than the trained model.

[0211] Embodiment 13 is the method of embodiment 12, wherein the task-specific model has a smaller number of trainable parameters or a smaller number of layers than the trained model.

[0212] Embodiment 14 is the method of any of embodiments 1-13, wherein each of the one or more adapting layers comprises a multilayer perceptron.

[0213] Embodiment 15 is the method of any of embodiments 1-14, wherein the machine learning task comprises one of: a language processing task, or a computer vision task.

[0214] Embodiment 16 is the method of embodiment 15, wherein the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

[0215] Embodiment 17 is a computer-implemented method for generating an ensemble model for one or more second machine learning tasks, comprising: obtaining data specifying a trained model, wherein the trained model comprises a plurality of trained layers, wherein the trained model has been trained to perform one or more machine learning tasks, and wherein the trained model is configured to: receive input data; at each of one or more iterations, process an input for the iteration derived from the input data using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; obtaining data specifying a task-specific model that has a plurality of model parameters, wherein the task-specific model is configured to: at each of the one or more iterations, process the input for the iteration in accordance with the model parameters to generate a task-specific representation of the input; obtaining a plurality of training examples for the one or more second machine learning tasks; and generating the ensemble model for the one or more second machine learning tasks, wherein the ensemble model comprises the trained model, the task-specific model, and a set of multiple adapting layers, and wherein the ensemble model is configured to: at each of the one or more iterations, for each of the adapting layers, process a respective adapting layer input to generate a candidate output for the iteration for the one or more second machine learning tasks, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer, and wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model.

[0216] Embodiment 18 is the method of embodiment 17, wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed.

[0217] Embodiment 19 is the method of embodiment 17, wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers, the task-specific model, and the trained model.

[0218] Embodiment 20 is the method of any of embodiments 17-19, wherein the model parameters of the task-specific model specified by the obtained data are randomly initialized.

[0219] Embodiment 21 is the method of any of embodiments 17-20, wherein training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed comprises: training the set of multiple adapting layers and the task-specific model to minimize an aggregated loss function over losses for the candidate outputs.

[0220] Embodiment 22 is the method of embodiment 21, wherein the aggregated loss function comprises a weighted sum of the losses.

[0221] Embodiment 23 is the method of any of embodiments 17-22, wherein each adapting layer has a same set of weights.

[0222] Embodiment 24 is the method of any of embodiments 17-22, wherein each adapting layer has a different set of weights.

[0223] Embodiment 25 is the method of any of embodiments 17-24, wherein the input data comprises any one or more of: text data, image data, audio data, or video data.

[0224] Embodiment 26 is the method of any of embodiments 17-25, wherein the trained model is a Transformer-based neural network.

[0225] Embodiment 27 is the method of any of embodiments 17-26, wherein the trained model has been further trained on a fine-tuning dataset for the one or more second machine learning tasks.

[0226] Embodiment 28 is the method of any of embodiments 17-27, wherein the task-specific model comprises any one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

[0227] Embodiment 29 is the method of any of embodiments 17-28, wherein the task-specific model has a smaller size than the trained model.

[0228] Embodiment 30 is the method of any of embodiments 17-29, wherein each of the one or more adapting layers comprises a multilayer perceptron.

[0229] Embodiment 31 is the method of any of embodiments 17-30, wherein the one or more second machine learning tasks comprises any one or more of: a language processing task, or a computer vision task.

[0230] Embodiment 32 is the method of embodiment 31, wherein the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

[0231] Embodiment 33 is a system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any one of embodiments 1-32.

[0232] Embodiment 34 is one or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of the method of any one of embodiments 1-32.
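The aggregated loss of Embodiments 21 and 22 can be illustrated with a short sketch. The following is a minimal example, not the applicant's implementation: it assumes each candidate output is a probability vector over classes and that the per-candidate loss is cross-entropy, neither of which is stated in the embodiments. The names `cross_entropy` and `aggregated_loss`, and the default of uniform weights, are hypothetical choices made for illustration.

```python
import math


def cross_entropy(probs, label):
    """Loss for one candidate output (assumed here to be class probabilities)."""
    return -math.log(probs[label])


def aggregated_loss(candidates, label, weights=None):
    """Weighted sum of the per-candidate losses, one loss per adapting layer.

    `weights` is a hypothetical per-adapting-layer weighting; uniform weights
    are used when none are given (an assumption, not taken from the source).
    """
    if weights is None:
        weights = [1.0 / len(candidates)] * len(candidates)
    if len(weights) != len(candidates):
        raise ValueError("need exactly one weight per candidate output")
    losses = [cross_entropy(p, label) for p in candidates]
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total, losses


# Two candidate outputs for a two-class task whose true label is class 1.
candidates = [[0.5, 0.5], [0.25, 0.75]]
total, losses = aggregated_loss(candidates, label=1, weights=[0.5, 0.5])
```

In a training loop following Embodiment 21, a quantity like `total` would be minimized with respect to the parameters of the adapting layers and the task-specific model only, with the trained model held fixed; the gradient step itself is omitted from this sketch.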

[0233] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0234] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0235] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

[0236] What is claimed is:

Claims

Attorney Docket No. 45288-0431WO1

1. A computer-implemented method comprising: receiving input data for a machine learning task; processing the input data to generate a respective output at each of one or more iterations, comprising, at each of the one or more iterations: processing an input for the iteration derived from the input data using a trained model that has been trained to perform one or more machine learning tasks, wherein the trained model comprises a plurality of trained layers, and wherein the trained model is configured to: process the input for the iteration using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; processing the input for the iteration using a task-specific model to generate a task-specific representation of the input for the machine learning task; for each adapting layer in a set of multiple adapting layers, processing an adapting layer input to generate a candidate output for the iteration, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer; and generating the output for the iteration from the candidate outputs for the iteration generated by the adapting layers.

2. The method of claim 1, wherein the trained model has been trained to perform one or more machine learning tasks other than the machine learning task.

3. The method of any preceding claim, wherein generating the output for the iteration comprises combining two or more of the candidate outputs.

4. The method of claim 3, wherein combining two or more of the candidate outputs comprises averaging the two or more of the candidate outputs.

5. The method of any of claims 1-2, wherein generating the output for the iteration comprises identifying one of the candidate outputs as the output.

6. The method of any preceding claim, wherein each of the multiple adapting layers has a same set of weights.

7. The method of any preceding claim, wherein each of the multiple adapting layers corresponds to a different trained layer.

8. The method of any preceding claim, wherein the adapting layer input is a concatenation of the task-specific representation and the respective intermediate output generated by each corresponding trained layer.

9. The method of any preceding claim, wherein the input data comprises any one or more of: text data, image data, audio data, or video data.

10. The method of any preceding claim, wherein the trained model is a Transformer-based neural network.

11. The method of any preceding claim, wherein the task-specific model comprises one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

12. The method of any preceding claim, wherein the task-specific model has a smaller size than the trained model.

13. The method of claim 12, wherein the task-specific model has a smaller number of trainable parameters or a smaller number of layers than the trained model.

14. The method of any preceding claim, wherein each of the one or more adapting layers comprises a multilayer perceptron.

15. The method of any preceding claim, wherein the machine learning task comprises one of: a language processing task, or a computer vision task.

16. The method of claim 15, wherein the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

17. A computer-implemented method for generating an ensemble model for one or more second machine learning tasks, comprising: obtaining data specifying a trained model, wherein the trained model comprises a plurality of trained layers, wherein the trained model has been trained to perform one or more machine learning tasks, and wherein the trained model is configured to: receive input data; at each of one or more iterations, process an input for the iteration derived from the input data using the plurality of trained layers to generate, for each trained layer, a respective intermediate output for the one or more machine learning tasks; obtaining data specifying a task-specific model that has a plurality of model parameters, wherein the task-specific model is configured to: at each of the one or more iterations, process the input for the iteration in accordance with the model parameters to generate a task-specific representation of the input; obtaining a plurality of training examples for the one or more second machine learning tasks; and generating the ensemble model for the one or more second machine learning tasks, wherein the ensemble model comprises the trained model, the task-specific model, and a set of multiple adapting layers, and wherein the ensemble model is configured to: at each of the one or more iterations, for each of the adapting layers, process a respective adapting layer input to generate a candidate output for the iteration for the one or more second machine learning tasks, wherein each adapting layer corresponds to one or more of the trained layers, and the adapting layer input for the adapting layer comprises the task-specific representation and the respective intermediate output generated by each corresponding trained layer, and wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model.

18. The method of claim 17, wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed.

19. The method of claim 17, wherein generating the ensemble model comprises training the ensemble model on the plurality of training examples by training the set of multiple adapting layers, the task-specific model, and the trained model.

20. The method of any of claims 17-19, wherein the model parameters of the task-specific model specified by the obtained data are randomly initialized.

21. The method of any of claims 17-20, wherein training the ensemble model on the plurality of training examples by training the set of multiple adapting layers and the task-specific model while holding the trained model fixed comprises: training the set of multiple adapting layers and the task-specific model to minimize an aggregated loss function over losses for the candidate outputs.

22. The method of claim 21, wherein the aggregated loss function comprises a weighted sum of the losses.

23. The method of any of claims 17-22, wherein each adapting layer has a same set of weights.

24. The method of any of claims 17-22, wherein each adapting layer has a different set of weights.

25. The method of any of claims 17-24, wherein the input data comprises any one or more of: text data, image data, audio data, or video data.

26. The method of any of claims 17-25, wherein the trained model is a Transformer-based neural network.

27. The method of any of claims 17-26, wherein the trained model has been further trained on a fine-tuning dataset for the one or more second machine learning tasks.

28. The method of any of claims 17-27, wherein the task-specific model comprises any one or more of: a multilayer perceptron, an embedding layer, or a convolutional neural network.

29. The method of any of claims 17-28, wherein the task-specific model has a smaller size than the trained model.

30. The method of any of claims 17-29, wherein each of the one or more adapting layers comprises a multilayer perceptron.

31. The method of any of claims 17-30, wherein the one or more second machine learning tasks comprises any one or more of: a language processing task, or a computer vision task.

32. The method of claim 31, wherein the input data comprises an input image and the computer vision task comprises one or more of: (i) image classification, wherein the output is scores for each of a set of object categories, each score representing an estimated likelihood that the input image contains an image of an object belonging to the category; (ii) object detection, wherein the output data identifies locations in the input image at which particular types of objects are depicted; and (iii) image segmentation, wherein the output assigns each pixel of the input image to a category from a set of categories.

33. A system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the respective operations of the method of any one of claims 1-32.

34. One or more computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of the method of any one of claims 1-32.
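For orientation only, the data flow recited in claims 1, 4, 7 and 8 can be sketched as a single forward pass. This is an illustrative sketch under stated assumptions and forms no part of the claims: the trained model is stood in for by two small randomly initialized layers, the task-specific model by one layer, and each adapting layer by a single linear map followed by a softmax rather than a full multilayer perceptron. All function names, dimensions, and the use of plain Python lists are hypothetical.

```python
import math
import random


def init_linear(rng, n_in, n_out):
    """Random weights (one row per output unit) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(params, x):
    weights, bias = params
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_forward(x, trained_layers, task_model, adapting_layers):
    # Trained model: keep the intermediate output of every trained layer.
    intermediates, h = [], x
    for params in trained_layers:
        h = relu(linear(params, h))
        intermediates.append(h)
    # Task-specific model: produce the task-specific representation.
    task_repr = relu(linear(task_model, x))
    # Each adapting layer corresponds to one trained layer (claim 7) and
    # receives the concatenation of both inputs (claim 8).
    candidates = [
        softmax(linear(params, task_repr + inter))
        for params, inter in zip(adapting_layers, intermediates)
    ]
    # Combine the candidate outputs by averaging (claim 4).
    n = len(candidates)
    output = [sum(c[k] for c in candidates) / n for k in range(len(candidates[0]))]
    return output, candidates


rng = random.Random(0)
d_in, d_hidden, d_task, n_classes = 4, 6, 3, 2  # hypothetical sizes
trained_layers = [init_linear(rng, d_in, d_hidden), init_linear(rng, d_hidden, d_hidden)]
task_model = init_linear(rng, d_in, d_task)
adapting_layers = [init_linear(rng, d_task + d_hidden, n_classes) for _ in trained_layers]
output, candidates = ensemble_forward(
    [0.1, -0.2, 0.3, 0.4], trained_layers, task_model, adapting_layers
)
```

Here the task-specific stand-in has fewer layers and parameters than the trained-model stand-in, loosely mirroring claims 12 and 13. Returning a single selected entry of `candidates` instead of the average would correspond to the alternative of claim 5.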