Multi-reward direct preference optimization for model finetuning

Multi-reward DPO training addresses the limitations of single-metric DPO by efficiently training genAI models to generate high-quality digital content, leveraging diverse human preferences and criteria for improved performance.

WO2026127970A1 · PCT designated stage · Publication Date: 2026-06-18 · GOOGLE LLC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: GOOGLE LLC
Filing Date: 2024-12-13
Publication Date: 2026-06-18

Application Information

Patent Timeline

13 Dec 2024: Application
18 Jun 2026: Publication (WO2026127970A1)

IPC: G06N3/08

AI Tagging

Application Domain

Neural learning methods

AI Technical Summary

⚠Technical Problem

Existing generative artificial intelligence (genAI) model training methods, such as Direct Preference Optimization (DPO), are limited to single reward metrics, leading to inefficiencies and suboptimal performance in generating high-quality digital content.

⚗Method used

Implementing multi-reward Direct Preference Optimization (DPO) to train genAI models with multiple reward metrics, allowing for more nuanced and efficient training that aligns with diverse human preferences and criteria.

🎯Benefits of technology

The multi-reward DPO approach improves genAI model training by reducing the computational resources and training iterations required, while producing higher-quality digital content and improving generalization and robustness.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

Aspects of the disclosure are directed to generating training assets for training an artificial intelligence model using direct preference optimization. An artificial intelligence model may generate a set of digital content in response to a prompt. Attributes of pieces of digital content of the set of digital content may be determined. The attributes of each piece of digital content may be mapped to respective reward metrics. A set of training assets may be generated, with each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content.

Description

GOOGLE 3.4-4275

MULTI-REWARD DIRECT PREFERENCE OPTIMIZATION FOR MODEL FINETUNING

BACKGROUND

[0001] Direct Preference Optimization (DPO) is a machine-learning technique that leverages human feedback to refine generative artificial intelligence (genAI) model behavior. DPO presents the genAI model with pairs of outputs and prompts a human evaluator to select which output they prefer. The model then adjusts its parameters to increase the likelihood of generating outputs similar to the preferred output in future iterations.

[0002] Other model training techniques, such as reinforcement learning from human feedback (RLHF), are capable of handling diverse forms of human feedback. In this regard, RLHF does not rely only on a selection of preferred options from a human evaluator but can include other feedback, such as rankings of options. As such, RLHF can provide finer-grained control over the behavior of a genAI model. This flexibility makes RLHF useful for complex tasks that require nuanced human input.

[0003] DPO offers several advantages over other model training techniques, such as reinforcement learning from human feedback (RLHF). For instance, DPO simplifies the training process of AI models by eliminating the need for intricate reward modeling, instead learning directly from human preferences. As DPO does not require reward modeling, DPO is often much more efficient than RLHF, requiring fewer computational resources and fewer training iterations. Additionally, DPO often aligns genAI model behavior more closely with human preferences.

BRIEF SUMMARY

[0004] Aspects of the technology are directed to finetuning generative artificial intelligence (AI) models using Direct Preference Optimization (DPO) with multiple reward metrics to generate digital content. Typically, DPO is used to train AI models to generate digital content that humans tend to prefer by creating digital content similar to digital content previously identified by a human evaluator as being preferred. However, such training considers only a single metric, human preference, and does not consider multiple metrics. Multiple metrics, if considered, such as in the case of RLHF, tend to result in higher-quality digital content being generated by AI models. By using DPO with multiple reward metrics, genAI models may be trained to generate digital content that accounts for more than one reward metric. As such, genAI models trained using DPO with multiple reward metrics may be more finely tuned, similar to models trained using RLHF, while still benefiting from the advantages of DPO, including better efficiency, such as fewer computational resources and training iterations. Moreover, the model finetuned using DPO does not include a reward model, thereby resulting in faster and more efficient operation relative to RLHF-trained models.

[0005] Multi-reward DPO models, such as those models trained using multi-reward metrics as described herein, may also enhance generalization by leveraging diverse preferences from multiple sources, enabling better adaptation to new tasks. Such models can also improve diversity by capturing a wide range of user preferences represented by multiple reward metrics. Robustness is further achieved by balancing and joint optimization of conflicting objectives, reducing the risk of overfitting to any single preference. Overall, the multi-reward DPO models provide a unified framework for diverse and robust behavior while maintaining strong generalization capabilities.

[0006] An aspect of the disclosure is directed to a method, comprising: generating, by one or more processors using an artificial intelligence model, a set of digital content in response to a prompt; determining, by the one or more processors, attributes of pieces of digital content of the set of digital content; mapping, by the one or more processors, the attributes of each piece of digital content to respective reward metrics; generating, by the one or more processors, a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and training, by the one or more processors, the artificial intelligence model using the set of training assets.

[0007] In some instances, the prompt and the second prompt are the same prompt.

[0008] In some instances, the pieces of digital content of the set of digital content include all pieces of digital content in the set of digital content.

[0009] In some instances, the training further comprises using direct preference optimization (DPO) to train the artificial intelligence model using the set of training assets, wherein the set of training assets are sequentially input into the artificial intelligence model and parameters of the artificial intelligence model are adjusted to increase a probability of the artificial intelligence model generating new digital content with attributes that satisfy multiple criteria.

[0010] In some instances, generating the set of training assets further comprises: determining a subset of the digital content based on the attributes of the digital content relative to a collection of criteria; and generating the set of training assets using the subset of the digital content.

[0011] In some examples, determining the subset of the digital content based on the attributes of the digital content relative to the collection of criteria comprises, for each piece of digital content: determining whether the attributes of the respective piece of digital content satisfy all criteria of the collection of criteria, none of the criteria of the collection of criteria, or some criteria of the collection of criteria.

[0012] In some examples, the subset of the digital content comprises the pieces of digital content satisfying all criteria of the collection of criteria and none of the criteria of the collection of criteria.
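The all/none/some determination and the resulting subset can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the dictionary layout of a piece of digital content, and the representation of each criterion as a predicate over an attribute dictionary are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Assumption: a criterion is a predicate over a piece's attribute values.
Attributes = Dict[str, float]
Criteria = Dict[str, Callable[[Attributes], bool]]


def classify(attributes: Attributes, criteria: Criteria) -> str:
    """Return "all", "none", or "some" depending on how many criteria are met."""
    results = [check(attributes) for check in criteria.values()]
    if all(results):
        return "all"
    if not any(results):
        return "none"
    return "some"


def select_subset(pieces: List[dict], criteria: Criteria) -> List[dict]:
    """Keep pieces that satisfy every criterion or no criterion at all."""
    return [
        piece
        for piece in pieces
        if classify(piece["attributes"], criteria) in ("all", "none")
    ]
```

Pieces that satisfy only some of the criteria are dropped, so the retained subset holds only clearly good and clearly bad examples.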

[0013] Another aspect of the disclosure is directed to a system comprising: one or more processors; and one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for training a machine learning model with differentially private labels, the operations comprising: generating, using an artificial intelligence model, a set of digital content in response to a prompt; determining attributes of pieces of digital content of the set of digital content; mapping the attributes of each piece of digital content to respective reward metrics; generating a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and training, by the one or more processors, the artificial intelligence model using the set of training assets.

[0014] Another aspect of the disclosure is directed to a non-transitory computer-readable medium for storing instructions that, when executed by one or more processors, cause the one or more processors to: generate, using an artificial intelligence model, a set of digital content in response to a prompt; determine attributes of pieces of digital content of the set of digital content; map the attributes of each piece of digital content to respective reward metrics; generate a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and train the artificial intelligence model using the set of training assets.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a block diagram of an example process for finetuning a model using direct preference optimization according to aspects of the disclosure.

[0016] FIG. 2 is a block diagram of a training asset generation system, according to aspects of the disclosure.

[0017] FIG. 3 depicts a block diagram of an example environment for implementing a training asset generation system according to aspects of the disclosure.

[0018] FIG. 4 depicts a block diagram of one or more machine learning model architectures according to aspects of the disclosure.

[0019] FIG. 5 depicts a flow diagram of an example process for training a machine learning model using reinforcement learning according to aspects of the disclosure.

DETAILED DESCRIPTION

[0020] The technology generally relates to finetuning generative artificial intelligence (genAI) models through training using Direct Preference Optimization (DPO) with combined and multiple reward metrics, collectively referred to herein as multi-reward metrics. The genAI model may be finetuned using DPO with multiple reward metrics to generate digital content that accounts for more than one reward metric. The genAI model, therefore, benefits from the advantages of DPO, including better efficiency, such as fewer computational resources and training iterations, while producing higher quality digital content than previously possible with DPO trained on a single metric indicative of human preference.

[0021] The multiple reward metrics used to train genAI models using DPO with multiple reward metrics may be selected based on a use case of the trained genAI model. In this regard, the reward metrics may represent characteristics of high-quality digital content that satisfies the use case of the trained genAI model. Such reward metrics may be based on attributes of the digital content such as the performance of digital content, the quality of digital content, policies that digital content satisfies, human preferences for digital content, etc. Digital content that satisfies and fails to satisfy the use case of the trained genAI model may be mapped into training assets along with a prompt and reward metric indicators for training the genAI model using DPO.

[0022] The genAI model may be finetuned by training the genAI model using the generated training assets. FIG. 1 illustrates an example process 100 for finetuning a base model 106 using DPO 120 with training assets 110. The base model 106 may be a genAI model. As shown in FIG. 1, the training assets, each of which may include a prompt, a multi-reward metric, and digital content, may be fed into the base model 106. For each prompt, the base model 106 may generate multiple response options; a reward model, human, or other such third-party evaluation system may score the response options; and the parameters of the base model 106 may be updated to finetune the base model to increase the probability of generating high-quality digital content, such as content that satisfies a particular attribute(s), and decrease the probability of the base model generating low-quality digital content. This process 100 may be iterative, such that the base model 106 is trained on more than one training asset. With each iteration, the base model 106 may have new parameters. Alternatively, the process 100 may be performed one time, such that the base model 106 is trained on a single training asset. After training is complete, the base model 106 may be output as the finetuned model 114.

[0023] Although not illustrated, a copy of the base model 106 at its initial state, before any finetuning is performed during process 100, may also be used. The copy of the base model may remain unchanged throughout the process 100. The copy of the base model may also generate multiple response options for the prompts of the training assets simultaneously with the base model 106. The responses generated by the copy of the base model and the base model 106 may be scored and used to update the parameters of the base model 106 during finetuning. In this example, the base model 106 may have new parameters through each training iteration until completion, after which the base model 106 may be output as the finetuned model 114.

[0024] The response options generated by the base model 106 and, in some examples, by the copy of the base model may be digital content generated to satisfy the prompts of the training assets 110.

[0025] The generated digital content may be assigned scores indicative of the probability that the generated digital content satisfies the particular attribute(s) being finetuned. The base model may then be finetuned by adjusting its parameters according to a loss function that encourages the base model to increase the probability of generating digital content that satisfies the particular attribute(s) and decrease the probability of generating digital content that does not satisfy the particular attribute(s) being finetuned.
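The disclosure does not state the loss function itself. For orientation only, the sketch below shows the standard single-preference DPO loss from the published DPO literature, computed for one pair of responses. It compares log-probabilities from the model being finetuned against those from an unchanged reference copy. The function name, argument names, and the default value of `beta` are assumptions for illustration, and this is not presented as the multi-reward loss of the disclosure.

```python
import math


def dpo_pair_loss(
    policy_logp_preferred: float,
    policy_logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed illustrative value
) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where each term is a sequence log-probability.
    """
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (preferred_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss falls as the finetuned model assigns relatively more probability to the preferred response than the reference copy does, and relatively less to the rejected one.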

[0026] Digital content, as used herein, may be in the form of text, images, video, audio, and any combination of the preceding. Digital content may include, for example, informative information, entertainment, advertisements, etc. References to a piece of digital content may mean a single item of digital content or a portion of a single item of digital content (e.g., a frame of a video, a part of an image, a snippet of audio, etc.). The genAI model may be trained to generate digital content of different modalities, either as separate models or as one multimodal model. In examples where the genAI model is trained to generate digital content from one or more different modalities, the genAI model may receive input(s) or some indication as to whether to generate digital content as a combination of text, image, video, etc. The reward metrics used to train genAI models may be the same or different across different modalities.

Training Assets

[0027] FIG. 2 depicts a block diagram of a training asset generation system 200. The training asset generation system 200 can be implemented on one or more computing devices in one or more locations, such as one or more servers and / or hardware accelerators. The training asset generation system 200 may generate a collection of training assets for finetuning a genAI model using DPO. To generate the collection of training assets, attributes of digital content need to be collected. The digital content from which the attributes are collected may be generated by a genAI model.

[0028] The training asset generation system 200 includes sub-systems, including a digital content generation system 204, an attribute determination system 206, a reward metric mapping system 208, and an asset generation system 210. Although each sub-system is shown as being part of the training asset generation system 200, some sub-systems may be separate from the training asset generation system 200. In some examples, each sub-system may be distinct from the other sub-systems, such that the functionality of the training asset generation system 200 is implemented by the separate sub-systems rather than by a single system. Additionally, or alternatively, each sub-system may be combined with one or more other sub-systems. For instance, the digital content generation 204 and attribute determination 206 sub-systems may be combined into a single sub-system.
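One way to picture how the four sub-systems compose is a function that takes each stage as a callable. This is a hedged sketch only: the function name, the callable signatures, and the use of plain strings for digital content are assumptions for the example, and the reference numerals in the comments simply point back to the sub-systems named above.

```python
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, float]
Asset = Tuple[str, str, float]  # (prompt, digital content, reward metric)


def generate_training_assets(
    prompt: str,
    generate_content: Callable[[str], List[str]],
    determine_attributes: Callable[[str], Attributes],
    map_to_reward: Callable[[Attributes], float],
) -> List[Asset]:
    """Run one prompt through the four stages and collect training assets."""
    assets: List[Asset] = []
    for content in generate_content(prompt):        # digital content generation (204)
        attributes = determine_attributes(content)  # attribute determination (206)
        reward = map_to_reward(attributes)          # reward metric mapping (208)
        assets.append((prompt, content, reward))    # asset generation (210)
    return assets
```

Passing the stages in as callables mirrors the point that sub-systems may be separate, combined, or replaced without changing the overall flow.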

[0029] The training asset generation system 200 can be configured to receive prompts 202 for use in generating training assets. The prompts may provide instructions for generating digital content by a genAI model. The prompts provided to the digital content generation system 204 may all be the same prompt or may be different prompts. For example, the training asset generation system 200 can receive prompts 201 as part of a call to an application programming interface (API) exposing the training asset generation system 200 to one or more computing devices. The prompts 201 can also be provided to the training asset generation system 200 through a storage medium, such as remote storage connected to one or more computing devices over a network. In yet another example, the prompts 201 can also be provided as input through a user interface on a client computing device coupled to the training asset generation system 200. The user interface can include a natural language interface, such as one or more text boxes, and / or a graphical interface, such as one or more sliders, checkboxes, and / or templates. The user interface can be configured to receive input as natural language in a variety of different modalities, for example as text input to a text box and / or as an image, a video, and / or audio.

[0030] The digital content generation system 204 may execute one or more genAI models that may generate digital content in response to the prompts. The genAI model executed by the digital content generation system 204 may be the same genAI model that is to be finetuned using DPO. In other examples, the genAI model that generates the digital content may differ from the genAI model being finetuned. In some examples, more than one genAI model may generate the digital content. In some instances, digital content may be generated by a genAI model other than the genAI model(s) of the digital content generation system 204.

[0031] The digital content may be passed to the attribute determination system 206. The attribute determination system 206 may determine the attributes of the digital content. The digital content provided to the attribute determination system 206 may include some or all of the digital content generated by the genAI model(s) of the digital content generation system 204.

[0032] For instance, attribute determination system 206 may identify attributes such as click-through rates (CTR), conversion rates (CVR), and effective cost per mille (eCPM) for the collection of digital content. In some instances, the identified attributes may be predicted values, such as predicted click-through rates (pCTR), predicted conversion rates (pCVR), etc. Other attributes may include human preference, policy compliance, determined suitability of the content, etc.

[0033] The reward metric mapping system 208 may map attributes of respective pieces of digital content into reward metrics. As further described herein, the reward metrics may be a single value or multiple values representing how well the respective pieces of digital content perform across all attributes.

[0034] When selecting digital content to include in the training assets, digital content that performs well across all considered attributes may be selected along with digital content that fails to satisfy all considered attributes. Digital content that performs well across some attributes but poorly in others may be discarded and not included in the training assets. Thus, the resulting training assets may provide both good and bad examples of digital content for the genAI model to train on. The determination of which digital content to generate reward metrics for may be done by the attribute determination 206, the reward metric mapping 208, and / or by some other system.

[0035] The determination of how digital content performs across considered attributes may be based on the type of considered attribute. For example, determining the performance of digital content based on numerical attributes, such as CTR and CVR, may include comparing the CTR and CVR of the digital content to threshold values. Digital content with CTR that satisfies the CTR threshold value and CVR that satisfies the CVR threshold value may be determined to perform well. For subjective attributes, such as quality, a value or a score (e.g., A-F, 1-10, 1-100, 0.0-1, etc.) may be assigned to the digital content. The assigned value or score may be assigned by a human evaluator or by an AI model. Digital content with a value or score that satisfies a threshold value may be considered as satisfying the attribute.
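A threshold check of this kind might look like the sketch below. The function name and the attribute names used with it are hypothetical, and treating "satisfies the threshold" as "greater than or equal to" is an assumption, since the text does not fix the direction of the comparison. Binary indicators (for example, a yes/no result) are passed through unchanged.

```python
from typing import Dict, Union

AttributeValue = Union[float, int, bool]


def evaluate_attributes(
    attributes: Dict[str, AttributeValue],
    thresholds: Dict[str, float],
) -> Dict[str, bool]:
    """Return, per attribute, whether the piece of digital content satisfies it."""
    outcome: Dict[str, bool] = {}
    for name, value in attributes.items():
        if isinstance(value, bool):
            # Already a pass/fail indicator; no threshold is needed.
            outcome[name] = value
        else:
            # Assumption: a value satisfies its threshold when it is >= the threshold.
            outcome[name] = value >= thresholds[name]
    return outcome
```

The boolean check comes first because `bool` is a subclass of `int` in Python and would otherwise be compared numerically.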

[0036] For objective attributes, such as policy, a score or other indicator may be used to identify whether digital content meets the objective attribute. For instance, a policy may require that an image be included in the digital content. A human evaluator or an AI model may review the digital content, determine whether it includes an image, and assign a score or value (e.g., 0 or 1, yes or no, etc.). Digital content determined to satisfy the policy may be considered as satisfying the attribute.

Combined Reward Metric

[0037] The reward metric may be a single value representing how well a respective piece of digital content performs across all attributes. For instance, a piece of digital content in the collection of digital content that has high CTR, CVR, and eCPM values may have a higher reward metric than another piece of digital content in the collection that has similar CTR and CVR values but a lower eCPM value. The reward metric may be generated by an AI model. Such an AI model may take attributes of digital content as input and output a reward value. In another example, the attributes may be determined by normalizing the values of the attributes of the piece of digital content across different serving traffic.

[0038] Although the previous example describes generating a reward metric by the reward metric mapping system 208 using all attributes of the respective piece of digital content, a reward metric may be generated using only a subset of the attributes of the respective piece of digital content. For instance, and continuing the previous example, the reward metric may factor in only CTR and eCPM, thereby disregarding CVR. The attributes that are factored into the reward metric may be predetermined.
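The text leaves the mapping from attributes to a single reward value open (for example, it may be produced by an AI model). Purely as an illustration, the sketch below assumes one simple alternative: each selected attribute is min-max normalized across the collection and the normalized values are averaged. The function name, the equal weighting, and the choice of min-max normalization are all assumptions, not the disclosed method.

```python
from typing import Dict, List, Sequence


def combined_rewards(
    collection: List[Dict[str, float]],
    attributes: Sequence[str],
) -> List[float]:
    """Map each piece's selected attributes to one reward value in [0, 1]."""
    lows = {a: min(piece[a] for piece in collection) for a in attributes}
    highs = {a: max(piece[a] for piece in collection) for a in attributes}

    def normalized(piece: Dict[str, float], attribute: str) -> float:
        span = highs[attribute] - lows[attribute]
        # Assumption: when every piece has the same value, use a neutral 0.5.
        if span == 0:
            return 0.5
        return (piece[attribute] - lows[attribute]) / span

    return [
        sum(normalized(piece, a) for a in attributes) / len(attributes)
        for piece in collection
    ]


# Two pieces with matching CTR and CVR but different eCPM.
ads = [
    {"ctr": 0.08, "cvr": 0.03, "ecpm": 5.0},
    {"ctr": 0.08, "cvr": 0.03, "ecpm": 2.0},
]
all_attributes = combined_rewards(ads, ["ctr", "cvr", "ecpm"])
ctr_and_ecpm_only = combined_rewards(ads, ["ctr", "ecpm"])  # disregards CVR
```

With these inputs the piece with the higher eCPM receives the higher combined reward, and restricting the attribute list shows how a predetermined subset can be factored in while the rest are ignored.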

[0039] A training asset may be generated for all or a subset of pieces of digital content in the collection of digital content by the asset generation system 210. In this regard, the asset generation system 210 may generate training assets represented by a data structure including a prompt, digital content, and reward metric, such as <prompt, digitalcontent, rewardmetric>. For X number of training assets, the set of training assets may be represented as follows: <prompt, digitalcontent1, rewardmetric1>, <prompt, digitalcontent2, rewardmetric2>, ..., <prompt, digitalcontentX, rewardmetricX>. The training assets may be output by the training asset generation system 200, illustrated by arrow 212, for use in DPO finetuning of a genAI model.

Multidimensional Reward Metric

[0040] In other embodiments of the technology, the attributes of a respective piece of digital content may be mapped, by the reward metric mapping 208, into discrete reward metrics. The reward metric mapping 208 may generate a multidimensional reward metric using the discrete reward metrics. Unlike the case of a combined reward metric, where the combined reward metric represents how well the digital content performs across multiple attributes, a discrete reward metric may be generated for each attribute of a piece of digital content. Such discrete reward metrics may represent how well a piece of digital content satisfies or otherwise performs related to a respective attribute.

[0041] In an example, the digital content in the collection of digital content may have separate reward metrics for each attribute, including reward metrics for CTR, CVR, and eCPM. A training asset may be generated, by the asset generation 210 subsystem, for some or all pieces of digital content in the collection of digital content using the discrete reward metrics. Each training asset may be represented by a data structure having a reward metric space consistent with the number of reward metrics: <prompt, digitalcontent, rewardmetric1, rewardmetric2, ..., rewardmetricZ>, where Z is the number of reward metrics. For X number of training assets having a two-dimensional reward metric space, the set of training assets may be represented as follows: <prompt, digitalcontent1, rewardmetric1-1, rewardmetric2-1>, <prompt, digitalcontent2, rewardmetric1-2, rewardmetric2-2>, ..., <prompt, digitalcontentX, rewardmetric1-X, rewardmetric2-X>. The training assets may be output by the training asset generation system 200, illustrated by arrow 212, for use in DPO finetuning of a genAI model.
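The training asset tuples can be modeled with a small record type. In this sketch the class name, field names, and builder function are hypothetical. The reward metrics are stored as a tuple of length Z, so a combined reward metric is simply the Z = 1 case.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingAsset:
    """<prompt, digitalcontent, rewardmetric1, ..., rewardmetricZ>"""

    prompt: str
    digital_content: str
    reward_metrics: Tuple[float, ...]  # length Z; Z == 1 for a combined metric


def build_training_assets(
    prompt: str,
    contents: Sequence[str],
    rewards: Sequence[Dict[str, float]],
    metric_order: Sequence[str],
) -> List[TrainingAsset]:
    """Create one asset per piece, with metrics in a fixed order."""
    if len(contents) != len(rewards):
        raise ValueError("each piece of digital content needs its own reward metrics")
    return [
        TrainingAsset(prompt, content, tuple(reward[name] for name in metric_order))
        for content, reward in zip(contents, rewards)
    ]
```

Fixing `metric_order` once keeps every asset in the set aligned to the same reward metric space, which matters when the metrics are later consumed position by position.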

[0042] By training a genAI model using direct preference optimization (DPO) with the set of training assets, the parameters of the artificial intelligence model are adjusted to increase a probability of the artificial intelligence model generating new digital content with attributes that satisfy multiple criteria. Training genAI models to satisfy multiple criteria, by finetuning the parameters of the model, was not previously possible with DPO, as DPO was previously limited to binary inputs (e.g., good / bad, yes / no, etc.). Moreover, the DPO-trained genAI model may benefit from the other advantages of DPO training, including avoiding the need for reward modeling to account for the multiple criteria, which is required for RLHF. As such, training the genAI model with DPO is more efficient than training with RLHF, requiring fewer computational resources and fewer training iterations. Additionally, DPO training aligns genAI model behavior more closely with human preferences.

[0043] FIG. 3 depicts a block diagram of an example environment 300 for implementing a training asset generation system 318. The training asset generation system 318 can be implemented on one or more devices having one or more processors in one or more locations, such as in server computing device 302. Client computing device 304 and the server computing device 302 can be communicatively coupled to one or more storage devices 306 over a network 308. The storage devices 306 can be a combination of volatile and non-volatile memory and can be at the same or different physical locations than the computing devices 302, 304. For example, the storage devices 306 can include any type of non-transitory computer readable medium capable of storing information, such as a hard-drive, solid state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, write-capable, and read-only memories.

[0044] The server computing device 302 can include one or more processors 310 and memory 312. The memory 312 can store information accessible by the processors 310, including instructions 314 that can be executed by the processors 310. The memory 312 can also include data 316 that can be retrieved, manipulated, or stored by the processors 310. The memory 312 can be a type of transitory or non-transitory computer readable medium capable of storing information accessible by the processors 310, such as volatile and non-volatile memory. The processors 310 can include one or more central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and / or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).

[0045] The instructions 314 can include one or more instructions that, when executed by the processors 310, cause the one or more processors 310 to perform actions defined by the instructions 314. The instructions 314 can be stored in object code format for direct processing by the processors 310, or in other formats including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. The instructions 314 can include instructions for implementing a training asset generation system 318, which can correspond to the training asset generation system 200 as depicted in FIG. 2. The training asset generation system 318 can be executed using the processors 310, and / or using other processors remotely located from the server computing device 302.

[0046] The data 316 can be retrieved, stored, or modified by the processors 310 in accordance with the instructions 314. The data 316 can be stored in computer registers, in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents. The data 316 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode. Moreover, the data 316 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.

[0047] The client computing device 304 can also be configured similarly to the server computing device 302, with one or more processors 320, memory 322, instructions 324, and data 326. The client computing device 304 can also include a user input 328 and a user output 330. The user input 328 can include any appropriate mechanism or technique for receiving input from a user, such as keyboard, mouse, mechanical actuators, soft actuators, touchscreens, microphones, and sensors.

[0048] The server computing device 302 can be configured to transmit data to the client computing device 304, and the client computing device 304 can be configured to display at least a portion of the received data on a display implemented as part of the user output 330. The user output 330 can also be used for displaying an interface between the client computing device 304 and the server computing device 302. The user output 330 can alternatively or additionally include one or more speakers, transducers or other audio outputs, a haptic interface or other tactile feedback that provides non-visual and non-audible information to the platform user of the client computing device 304.

[0049] Although FIG. 3 illustrates the processors 310, 320 and the memories 312, 322 as being within the respective computing devices 302, 304, components described herein can include multiple processors and memories that can operate in different physical locations and not within the same computing device. For example, some of the instructions 314, 324 and the data 316, 326 can be stored on a removable SD card and others within a read-only computer chip. Some or all of the instructions 314, 324 and data 316, 326 can be stored in a location physically remote from, yet still accessible by, the processors 310, 320. Similarly, the processors 310, 320 can include a collection of processors that can perform concurrent and / or sequential operation. The computing devices 302, 304 can each include one or more internal clocks providing timing information, which can be used for time measurement for operations and programs run by the computing devices 302, 304.

[0050] The server computing device 302 can be connected over the network 308 to a data center 332 housing any number of hardware accelerators 334. The data center 332 can be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. Computing resources housed in the data center 332 can be specified for deploying models, such as for reward score generation, as described herein.

[0051] The server computing device 302 can be configured to receive requests to process data from the client computing device 304 on computing resources in the data center 332. For example, the environment 300 can be part of a computing platform configured to provide a variety of services to users, through various user interfaces and / or application programming interfaces (APIs) exposing the platform services. As an example, the variety of services can include generating reward scores for training machine learning models with reinforcement learning.

[0052] The client computing device 304 can transmit prompts for generating digital content for generating training assets. The training asset generation system 318 can receive the prompts, and in response, generate output data including training assets as described herein.

[0053] The server computing device 302 can maintain a variety of models in accordance with different constraints available at the data center 332. For example, the server computing device 302 can maintain different families for deploying models on various types of TPUs and / or GPUs housed in the data center 332 or otherwise available for processing.

[0054] FIG. 4 depicts a block diagram 400 illustrating one or more machine learning model 402 architectures, more specifically 402A-N for each architecture, for deployment in a datacenter 404 housing a hardware accelerator 406 on which the deployed machine learning models 402 will execute, such as for the variety of services as described herein. The hardware accelerator 406 can be any type of processor, such as a CPU, GPU, FPGA, or ASIC such as a TPU.

[0055] An architecture of a machine learning model 402 can refer to characteristics defining the model, such as characteristics of layers for the model, how the layers process input, or how the layers interact with one another. The architecture of the machine learning model 402 can also define types of operations performed within each layer. One or more machine learning model 402 architectures can be generated that can output results, such as for generating reward scores for training machine learning models with reinforcement learning. Example model architectures can correspond to generative models, such as language models, foundation models, and / or graphical models.

[0056] Referring back to FIG. 3, the devices 302, 304 and the data center 332 can be capable of direct and indirect communication over the network 308. For example, using a network socket, the client computing device 304 can connect to a service operating in the data center 332 through an Internet protocol. The devices 302, 304 can set up listening sockets that may accept an initiating connection for sending and receiving information. The network 308 can include various configurations and protocols including the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies. The network 308 can support a variety of short- and long-range connections. The short- and long-range connections may be made over different bandwidths, such as 2.402 GHz to 2.480 GHz, commonly associated with the Bluetooth® standard; 2.4 GHz and 5 GHz, commonly associated with the Wi-Fi® communication protocol; or with a variety of communication standards, such as the LTE® standard for wireless broadband communication. The network 308, in addition or alternatively, can also support wired connections between the devices 302, 304 and the data center 332, including over various types of Ethernet connection.
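The listening-socket arrangement described above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `serve_once` and `round_trip`, the loopback address, and the `received:` acknowledgement format are all hypothetical.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, read the full request, and send an acknowledgement."""
    conn, _addr = listener.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # the peer finished sending
                break
            chunks.append(data)
        conn.sendall(b"received: " + b"".join(chunks))


def round_trip(message: bytes) -> bytes:
    """Open a listening socket on loopback, connect to it, and exchange one message."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            client.shutdown(socket.SHUT_WR)  # signal end of request
            reply = b""
            while True:
                data = client.recv(1024)
                if not data:
                    break
                reply += data
    finally:
        worker.join(timeout=5)
        listener.close()
    return reply
```

Calling `round_trip(b"prompt")` plays both roles in one process: the worker thread stands in for a service in the data center 332, and the calling thread stands in for the client computing device 304.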

[0057] Although a single server computing device 302, client computing device 304, and data center 332 are shown in FIG. 3, it is understood that the aspects of the disclosure can be implemented according to a variety of different configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device connected to hardware accelerators configured for processing machine learning models, or any combination thereof.

[0058] FIG. 5 depicts a flow diagram of an example process 500 for generating training assets. The example process 500 can be performed by a training asset generation system, such as training asset generation system 318, on a system of one or more processors in one or more locations, such as server computing device 302 and / or data center 332, as depicted in FIG. 3.

[0059] As shown in block 510, a genAI model generates a set of digital content in response to prompts. The prompts may be the same prompts, different prompts, or a combination of both.

[0060] As shown in block 520, the training asset generation system 318 may determine attributes of pieces of digital content of the set of digital content. The attributes of the pieces of digital content in the set of digital content may be mapped to respective reward metrics, as further shown in block 530.
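A minimal sketch of the mapping in block 530 follows. It assumes, purely for illustration, that each reward metric is binary (1.0 when a criterion is satisfied, 0.0 otherwise). The attribute names (`word_count`, `toxicity`), the criteria names, and the thresholds are hypothetical; this passage does not specify how attributes are measured or scored.

```python
from typing import Callable, Mapping

Attributes = Mapping[str, float]
Criterion = Callable[[Attributes], bool]

# Hypothetical criteria; the names and thresholds are illustrative only.
CRITERIA: dict[str, Criterion] = {
    "concise": lambda a: a["word_count"] <= 50,
    "non_toxic": lambda a: a["toxicity"] < 0.2,
}


def map_to_reward_metrics(attributes: Attributes) -> dict[str, float]:
    """Return one reward metric per criterion: 1.0 if satisfied, else 0.0."""
    return {name: float(check(attributes)) for name, check in CRITERIA.items()}


metrics = map_to_reward_metrics({"word_count": 32, "toxicity": 0.5})
```

Here `metrics` records that the piece of content satisfies the `concise` criterion but not the `non_toxic` one. A graded score per criterion would fit the same interface.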

[0061] As shown in block 540, a set of training assets may be generated, with each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content. In some instances, training assets may be generated for only a subset of the digital content based on the attributes of the digital content relative to a collection of criteria. In this regard, training assets may be generated when it is determined that the attributes of the respective piece of digital content satisfy all criteria of the collection of criteria, none of the criteria of the collection of criteria, or some criteria of the collection of criteria.
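The asset generation and subset selection in block 540 can be sketched as follows. The sketch assumes binary reward metrics, where 1.0 means the criterion is satisfied. The names `TrainingAsset`, `criteria_bucket`, and `build_training_assets` are hypothetical, and the default of keeping the "all" and "none" groups is only one of the options the paragraph above allows.

```python
from dataclasses import dataclass


@dataclass
class TrainingAsset:
    # Hypothetical container for a prompt, one piece of content, and its metrics.
    prompt: str
    content: str
    reward_metrics: dict[str, float]


def criteria_bucket(reward_metrics: dict[str, float]) -> str:
    """Classify content as satisfying 'all', 'none', or 'some' of the criteria."""
    satisfied = sum(1 for value in reward_metrics.values() if value >= 1.0)
    if satisfied == len(reward_metrics):
        return "all"
    if satisfied == 0:
        return "none"
    return "some"


def build_training_assets(prompt, scored_content, keep=("all", "none")):
    """Build assets only for content whose bucket is listed in `keep`."""
    return [
        TrainingAsset(prompt, content, metrics)
        for content, metrics in scored_content
        if criteria_bucket(metrics) in keep
    ]


scored = [
    ("draft A", {"concise": 1.0, "non_toxic": 1.0}),
    ("draft B", {"concise": 1.0, "non_toxic": 0.0}),
    ("draft C", {"concise": 0.0, "non_toxic": 0.0}),
]
assets = build_training_assets("Describe a sunset.", scored)
```

With the default `keep`, drafts A and C become training assets and draft B is skipped. Passing `keep=("all", "none", "some")` retains every piece of content.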

[0062] The generated training assets may then be used to train an artificial intelligence model, as shown by block 550.

Aspects of this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, and / or in computer hardware, such as the structure disclosed herein, their structural equivalents, or combinations thereof. Aspects of this disclosure can further be implemented as one or more computer programs, such as one or more modules of computer program instructions encoded on a tangible non-transitory computer storage medium for execution by, or to control the operation of, one or more data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or combinations thereof. The computer program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
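Where the training of block 550 uses direct preference optimization (DPO), the per-pair objective can be sketched as below. This is the widely used single-pair DPO loss, not necessarily the multi-reward formulation of this disclosure. Pairing content that satisfies all criteria as "chosen" with content that satisfies none as "rejected" is an assumption made for illustration, as are the function name `dpo_loss` and the default `beta`.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Assumed pairing: chosen = content meeting all criteria, rejected = content meeting none.
loss_good = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy already prefers the chosen content
loss_bad = dpo_loss(-5.0, -1.0, -2.0, -2.0)   # policy prefers the rejected content
```

The loss is smaller when the policy, relative to the reference model, assigns more probability to the chosen content than to the rejected content, so minimizing it pushes the model toward content that satisfies the criteria.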

[0063] The term “configured” is used herein in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed thereon software, firmware, hardware, or a combination thereof that cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by one or more data processing apparatus, cause the apparatus to perform the operations or actions.

[0064] The term “data processing apparatus” or “data processing system” refers to data processing hardware and encompasses various apparatus, devices, and machines for processing data, including programmable processors, computers, or combinations thereof. The data processing apparatus can include special purpose logic circuitry, such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). The data processing apparatus can include code that creates an execution environment for computer programs, such as code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or combinations thereof.

[0065] The term “computer program” refers to a program, software, a software application, an app, a module, a software module, a script, or code. The computer program can be written in any form of programming language, including compiled, interpreted, declarative, or procedural languages, or combinations thereof. The computer program can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. The computer program can correspond to a file in a file system and can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code. The computer program can be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

[0066] The term “database” refers to any collection of data. The data can be unstructured or structured in any manner. The data can be stored on one or more storage devices in one or more locations. For example, an index database can include multiple collections of data, each of which may be organized and accessed differently.

[0067] The term “engine” refers to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. The engine can be implemented as one or more software modules or components or can be installed on one or more computers in one or more locations. A particular engine can have one or more computers dedicated thereto, or multiple engines can be installed and running on the same computer or computers.

[0068] The processes and logic flows described herein can be performed by one or more computers executing one or more computer programs to perform functions by operating on input data and generating output data. The processes and logic flows can also be performed by special purpose logic circuitry, or by a combination of special purpose logic circuitry and one or more computers.

[0069] A computer or special purpose logic circuitry executing the one or more computer programs can include a central processing unit, including general or special purpose microprocessors, for performing or executing instructions and one or more memory devices for storing the instructions and data. The central processing unit can receive instructions and data from the one or more memory devices, such as read only memory, random access memory, or combinations thereof, and can perform or execute the instructions. The computer or special purpose logic circuitry can also include, or be operatively coupled to, one or more storage devices for storing data, such as magnetic disks, magneto optical disks, or optical disks, for receiving data from or transferring data to the storage devices. The computer or special purpose logic circuitry can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS), or a portable storage device, e.g., a universal serial bus (USB) flash drive, as examples.

[0070] Computer readable media suitable for storing the one or more computer programs can include any form of volatile or non-volatile memory, media, or memory devices. Examples include semiconductor memory devices, e.g., EPROM, EEPROM, or flash memory devices, magnetic disks, e.g., internal hard disks or removable disks, magneto optical disks, CD-ROM disks, DVD-ROM disks, or combinations thereof.

[0071] Aspects of the disclosure can be implemented in a computing system that includes a back end component, e.g., as a data server, a middleware component, e.g., an application server, or a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app, or any combination thereof. The components of the system can be interconnected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[0072] The computing system can include clients and servers. A client and server can be remote from each other and interact through a communication network. The relationship of client and server arises by virtue of the computer programs running on the respective computers and having a client-server relationship to each other. For example, a server can transmit data, e.g., an HTML page, to a client device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device. Data generated at the client device, e.g., a result of the user interaction, can be received at the server from the client device.

[0073] Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as "such as," "including" and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.

Claims

1. A method, comprising: generating, by one or more processors using an artificial intelligence model, a set of digital content in response to a prompt; determining, by the one or more processors, attributes of pieces of digital content of the set of digital content; mapping, by the one or more processors, the attributes of each piece of digital content to respective reward metrics; generating, by the one or more processors, a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and training, by the one or more processors, the artificial intelligence model using the set of training assets.

2. The method of claim 1, wherein the prompt and the second prompt are the same prompt.

3. The method of any preceding claim, wherein the pieces of digital content of the set of digital content includes all pieces of digital content in the set of digital content.

4. The method of any preceding claim, wherein the training further comprises using direct preference optimization (DPO) to train the artificial intelligence model using the set of training assets, wherein the set of training assets are sequentially input into the artificial intelligence model and parameters of the artificial intelligence model are adjusted to increase a probability of the artificial intelligence model generating new digital content with attributes that satisfy multiple criteria.

5. The method of any preceding claim, wherein generating the set of training assets further comprises: determining a subset of the digital content based on the attributes of the digital content relative to a collection of criteria; and generating the set of training assets using the subset of the digital content.

6. The method of claim 5, wherein determining the subset of the digital content based on the attributes of the digital content relative to the collection of criteria comprises, for each piece of digital content: determining whether the attributes of the respective piece of digital content satisfy all criteria of the collection of criteria, none of the criteria of the collection of criteria, or some criteria of the collection of criteria.

7. The method of claim 6, wherein the subset of the digital content comprises the pieces of digital content satisfying all criteria of the collection of criteria and none of the criteria of the collection of criteria.

8. A system comprising: one or more processors; and one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for training a machine learning model with differentially private labels, the operations comprising: generating, using an artificial intelligence model, a set of digital content in response to a prompt; determining attributes of pieces of digital content of the set of digital content; mapping the attributes of each piece of digital content to respective reward metrics; generating a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and training, by the one or more processors, the artificial intelligence model using the set of training assets.

9. The system of claim 8, wherein the prompt and the second prompt are the same prompt.

10. The system of claim 8 or 9, wherein the pieces of digital content of the set of digital content includes all pieces of digital content in the set of digital content.

11. The system of any of claims 8-10, wherein the training further comprises using direct preference optimization (DPO) to train the artificial intelligence model using the set of training assets, wherein the set of training assets are sequentially input into the artificial intelligence model and parameters of the artificial intelligence model are adjusted to increase a probability of the artificial intelligence model generating new digital content with attributes that satisfy multiple criteria.

12. The system of any of claims 8-11, wherein generating the set of training assets further comprises: determining a subset of the digital content based on the attributes of the digital content relative to a collection of criteria; and generating the set of training assets using the subset of the digital content.

13. The system of claim 12, wherein determining the subset of the digital content based on the attributes of the digital content relative to the collection of criteria comprises, for each piece of digital content: determining whether the attributes of the respective piece of digital content satisfy all criteria of the collection of criteria, none of the criteria of the collection of criteria, or some criteria of the collection of criteria.

14. The system of claim 13, wherein the subset of the digital content comprises the pieces of digital content satisfying all criteria of the collection of criteria and none of the criteria of the collection of criteria.

15. A non-transitory computer-readable medium for storing instructions that, when executed by one or more processors, cause the one or more processors to: generate, using an artificial intelligence model, a set of digital content in response to a prompt; determine attributes of pieces of digital content of the set of digital content; map the attributes of each piece of digital content to respective reward metrics; generate a set of training assets, each training asset of the set of training assets comprising a second prompt, a piece of digital content of the set of digital content, and respective reward metrics of the piece of digital content; and train the artificial intelligence model using the set of training assets.

16. The non-transitory computer-readable medium of claim 15, wherein the prompt and the second prompt are the same prompt.

17. The non-transitory computer-readable medium of claim 15 or 16, wherein the pieces of digital content of the set of digital content includes all pieces of digital content in the set of digital content.

18. The non-transitory computer-readable medium of any of claims 15-17, wherein the training further comprises using direct preference optimization (DPO) to train the artificial intelligence model using the set of training assets, wherein the set of training assets are sequentially input into the artificial intelligence model and parameters of the artificial intelligence model are adjusted to increase a probability of the artificial intelligence model generating new digital content with attributes that satisfy multiple criteria.

19. The non-transitory computer-readable medium of any of claims 15-18, wherein generating the set of training assets further comprises: determining a subset of the digital content based on the attributes of the digital content relative to a collection of criteria; and generating the set of training assets using the subset of the digital content.

20. The non-transitory computer-readable medium of claim 19, wherein determining the subset of the digital content based on the attributes of the digital content relative to the collection of criteria comprises, for each piece of digital content: determining whether the attributes of the respective piece of digital content satisfy all criteria of the collection of criteria, none of the criteria of the collection of criteria, or some criteria of the collection of criteria.