Using reinforcement learning to select content items

A long-term engagement machine learning model, trained through reinforcement learning, predicts the long-term impact of presenting a content item on user engagement. This addresses the resource waste and declining user engagement of existing systems and enables a more efficient content presentation strategy.

CN117150127B · Active · Publication Date: 2026-06-19 · GOOGLE LLC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: GOOGLE LLC
Filing Date: 2017-07-14
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing reinforcement learning systems struggle to effectively predict the long-term impact of content presentation on user engagement, leading to wasted resources and decreased user participation.

Method used

A long-term engagement machine learning model, trained through reinforcement learning, predicts the future impact of presenting a content item on user engagement. The impact is evaluated before presentation, and the content item is presented only when the predicted negative impact is sufficiently small.

Benefits of technology

Increases long-term user engagement, reduces resource waste, avoids sending unnecessary content items, and improves the user experience.

Abstract

This invention relates to using reinforcement learning to select content items. One method includes: receiving first data representing a first context in which a first content item can be presented to a first user in a presentation environment; and providing the first data as input to a long-term engagement machine learning model, the model having been trained via reinforcement learning to: receive a plurality of inputs and process each of the plurality of inputs to generate a corresponding engagement score for each input, the corresponding engagement score representing the predicted, time-adjusted total number of selections by the corresponding user of future content items to be presented to the corresponding user in the presentation environment if the corresponding content item is presented in the corresponding context.

Description

[0001] Divisional Application Statement

[0002] This application is a divisional application of Chinese Invention Patent Application No. 201780047232.4, filed on July 14, 2017.

Technical Field

[0003] This specification relates to reinforcement learning.

Background

[0004] Reinforcement learning agents interact with their environment by receiving observations that represent the current state of the environment and, in response, performing actions from a predetermined set of actions. Some reinforcement learning agents use neural networks to select the action to perform in response to any given observation received.

[0005] A neural network is a machine learning model that uses one or more non-linear units to predict the output from a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to the output layer. The output of each hidden layer is used as the input to the next layer in the network (i.e., the next hidden layer or output layer). Each layer of the network generates an output from the received input based on the current values of its corresponding set of parameters.

Summary of the Invention

[0006] In general, this specification describes a system configured to use a machine learning model that has been trained through reinforcement learning to determine whether to present content items to a user.

[0007] In an exemplary embodiment described herein, a method executed by one or more computers is provided, the method comprising: receiving first data characterizing a first context in which a first content item can be presented to a first user in a presentation environment; and providing the first data as input to a long-term engagement machine learning model to obtain a first engagement score, the first engagement score representing the predicted, time-adjusted total number of selections by the first user of future content items to be presented to the first user in the presentation environment if the first content item is presented in the first context. The long-term engagement machine learning model may have been trained via reinforcement learning to: receive a plurality of inputs, each input including data characterizing a corresponding context in which a corresponding content item can be presented to a corresponding user in a presentation environment; and process each of the plurality of inputs to generate a corresponding engagement score for each input, the corresponding engagement score representing the predicted, time-adjusted total number of selections by the corresponding user of future content items to be presented to the corresponding user in the presentation environment if the corresponding content item is presented in the corresponding context. The method may further comprise: determining from the first engagement score whether to present the first content item to the first user in the first context.

[0008] Determining whether to present the first content item to the first user may include: providing second data representing the first context and the empty content item as input to a long-term engagement machine learning model. The long-term engagement machine learning model has been trained via reinforcement learning to treat the second data as an indication that the content item should not be presented to the first user in the first context and to generate an empty engagement score, which represents the total number of predicted, time-adjusted selections made by the first user for future content items to be presented to the first user in the presentation environment if the content item is not presented in the first context.

[0009] Determining whether to present the first content item to the first user may include: determining, from the empty engagement score and the first engagement score, the predicted impact on the user's engagement due to presenting the first content item in the first context; and using the predicted impact to determine whether to present the first content item to the first user in the first context. Determining whether to present the first content item may include: determining to present the first content item to the first user in the first context only if the predicted impact is greater than a threshold impact value. The method may further include: in response to determining to present the first content item, providing the first content item to be presented to the first user in the presentation environment or providing an instruction to an external system to provide the first content item to be presented to the first user in the presentation environment. The long-term engagement machine learning model may have been trained via reinforcement learning to determine the trained values of the parameters of the long-term engagement machine learning model. Data characterizing the first context may include data characterizing content items previously presented to the first user in the presentation environment. The presentation environment may be a response to a user-submitted search query, in which case the data characterizing the context includes the search query. Data characterizing the first context may include data characterizing content items previously presented to the first user in response to one or more search queries previously submitted by the first user. Data characterizing the first context may include data characterizing the quality of the first content item. Data characterizing the first context may include the predicted probability that the first user would select the first content item if it were presented to the first user in the first context.

[0010] A system of one or more computers can also be configured using software, firmware, hardware, or a combination thereof that causes the system to perform actions during operation. A system of one or more computer programs can also be configured using instructions that cause a data processing device to perform actions when executed by the device. One or more computer-readable storage media can be provided, comprising instructions that cause the one or more computers to perform actions when executed by the one or more computers.

[0011] Specific embodiments of the subject matter described herein can be implemented to achieve one or more of the following advantages. The systems described herein can effectively predict the extent to which presenting a content item to a specific user in a given context will have a negative long-term impact on that user's future engagement with subsequent content items. Therefore, if the negative long-term impact is too great, the system can determine not to present the current content item, thereby maintaining long-term user engagement. For example, in certain embodiments, user engagement with future content (such as informational materials related to user safety, e.g., product recalls and alerts, or courses and messages in a learning environment) may be important, and in these embodiments, the predictions made by the system can increase the likelihood that a user will engage with such future content. In addition to maintaining or increasing the long-term engagement of individual users, certain embodiments can, statistically across a user group, maintain or increase overall long-term user engagement with the content provided. Similarly, across a user group, these embodiments can ensure that content is not unnecessarily sent to users, thereby reducing wasted resources such as bandwidth, the energy usage of content servers, and the battery life of user devices.

[0012] Details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the following description. Other features, aspects, and advantages of this subject matter will become apparent from the description, the drawings, and the claims.

Brief Description of the Drawings

[0013] Figure 1 shows an example content item presentation system.

[0014] Figure 2 is a flowchart of an example process for determining whether to present a content item in a presentation environment.

[0015] Figure 3 is a flowchart of an example process for determining the predicted long-term impact of presenting a current content item on user engagement.

[0016] Figure 4 is a flowchart of an example process for training a long-term engagement machine learning model through reinforcement learning.

[0017] In the various figures, similar reference numerals and designations denote similar elements.

Detailed Description

[0018] Figure 1 shows an example content item presentation system 100.

[0019] The content item presentation system 100 is an example of a system implemented as a computer program on one or more computers located at one or more locations, where the systems, components and technologies described below may be implemented.

[0020] The content item presentation system 100 is a system for selecting the content items to be presented to the user in the presentation environment 110.

[0021] Specifically, the content item presentation system 100 receives data representing the current state of the presentation environment 110 (i.e., the state in which content item 114 can be presented to user 102 on user device 104), and in response, determines whether to present content item 114 to user 102 on user device 104 when the presentation environment 110 is in the current state.

[0022] In some implementations, content items are candidates to be presented in response to a search query, for example, as part of a search results webpage. That is, the presentation environment 110 is a search query response, such as a search results webpage, and different states of the presentation environment 110 correspond to different instances of the search query response, i.e., to responses presented for different search queries submitted by various users. In these implementations, content items may be search results that are candidates to be included in the response, or other content items that are candidates to be included in the response together with the search results.

[0023] In some other implementations, the content item may be an announcement. In other implementations, the content item may be a course within a learning environment. In some other implementations, the content item is a recommendation of content that may interest a user who is currently being presented with a piece of content. That is, the presentation environment 110 presents the user with a piece of content together with recommendations for other pieces of content that the user might be interested in. For example, the presentation environment 110 may present a video together with one or more previews, each preview identifying another video that the user might be interested in, for example, through a thumbnail and other identifying information for the video, and including a link to the video. As another example, the presentation environment 110 may present an image together with one or more previews, each preview identifying another image that the user might be interested in, for example, through a thumbnail of the image, and including a link to the image.

[0024] For example, user 102 of user device 104 can submit a request to presentation environment system 150 via data communication network 112, for example, by submitting a search query to an internet search engine or a video sharing website. The request may trigger an opportunity to present content item 114 to user 102 in presentation environment 110. As part of generating a response, presentation environment system 150 may submit a request to content item presentation system 100, which includes data characterizing the current context of presentation environment 110.

[0025] In response, the content item presentation system 100 uses the prediction subsystem 130 and the presentation subsystem 140 to determine whether to present the content item 114 to the user 102 in the presentation environment 110.

[0026] Typically, the prediction subsystem 130 is a long-term engagement machine learning model that has been trained through reinforcement learning to receive model inputs and generate a prediction output for each received model input. For example, the prediction subsystem 130 can be a linear regression model, a feedforward neural network, a recurrent neural network, or a long short-term memory (LSTM) neural network.

[0027] Specifically, each model input represents a context in which a corresponding content item is presented to a corresponding user in presentation environment 110, and prediction subsystem 130 has been trained such that the prediction output generated for the model input is an engagement score, which measures the user's predicted future user engagement with future content items presented to the corresponding user in presentation environment 110 if the corresponding content item is presented in the corresponding context. Specifically, in some embodiments, the engagement score represents the predicted, time-adjusted (e.g., time-discounted) total number of selections made by the corresponding user for future content items presented to the corresponding user in presentation environment 110 if the corresponding content item is presented in the corresponding context. In some other embodiments, the engagement score represents the predicted, time-adjusted (e.g., time-discounted) change in the percentage of future content items selected by the corresponding user in presentation environment 110 if the corresponding content item is presented in the corresponding context.
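As a small illustration of what a "time-adjusted (e.g., time-discounted) total number of selections" could mean, the sketch below sums future selections with a geometric discount. The function name and the discount factor `gamma` are assumptions for illustration; the text does not fix a particular discounting scheme.

```python
def discounted_selections(selections, gamma=0.9):
    """Time-discounted count of future selections.

    selections: sequence of 0/1 flags, one per future content item, in the
        order the items are presented (1 means the user selected the item).
    gamma: hypothetical time-discount factor in (0, 1]; an assumption, not a
        value given in the text.
    """
    # A selection that happens `step` presentations in the future is weighted
    # by gamma ** step, so later selections count less.
    return sum(flag * gamma ** step for step, flag in enumerate(selections))
```

With `gamma=0.5`, a user who selects the first and third of three future items contributes 1 + 0.25 = 1.25 to the score.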

[0028] Training the prediction subsystem 130 using reinforcement learning to generate engagement scores is discussed below with reference to Figure 4.

[0029] The presentation subsystem 140 interacts with the prediction subsystem 130 to determine whether to present content item 114 in the presentation environment 110 using the engagement score generated by the prediction subsystem 130. The use of engagement scores to determine whether to present content items in presentation environment 110 is discussed in more detail below with reference to Figure 2 and Figure 3.

[0030] If the presentation subsystem 140 determines that the content item 114 should be presented in the presentation environment 110, the content item presentation system 100 may send the content item 114 to be presented to the user 102 or send an instruction to the presentation environment system 150 to provide the content item 114 to be presented to the user 102 in the presentation environment 110.

[0031] If the content item presentation system 100 determines not to present the content item 114, the content item presentation system 100 refrains from sending the content item 114 for presentation to the user 102, or sends an instruction to the presentation environment system 150 indicating that the content item 114 should not be presented to the user 102 in the presentation environment 110.

[0032] Figure 2 is a flowchart of an example process 200 for determining whether to present a content item in a presentation environment. For convenience, process 200 is described as being performed by a system of one or more computers located in one or more locations. For example, a content item presentation system (e.g., the content item presentation system 100 of Figure 1), if properly programmed, can perform process 200.

[0033] The system receives data representing the current state of the presentation environment in which a specific content item may be presented to a specific user (step 202). That is, the system receives data representing the current context in which a specific content item may be presented to a specific user in the presentation environment.

[0034] Data characterizing the current context includes various features that characterize specific content items.

[0035] For example, the data may include scores that indicate the quality of content items, such as those determined by an external system.

[0036] As another example, the data may include a score, as determined by an external system, representing the predicted probability that a user would select a content item if it were presented.

[0037] As another example, if the presentation environment is a response to a user-submitted search query, the data may include a score, e.g., as determined by an external system, indicating the likelihood that the content item is a navigational content item relative to the search query. A navigational content item is the content item being sought by the search query. That is, the search query is a query seeking a single piece of content, and the content item is or identifies that single piece of content.

[0038] As another example, if the content item includes a link to a resource, the data may also include a score, as determined by an external system, representing the quality of the resource that the content item links to.

[0039] As another example, the data can also include information that identifies the current time and date.

[0040] As another example, the data may also include data that identifies the presentation location of a content item, such as where the content item might be presented in the presentation environment relative to other content (e.g., other content items or different content).

[0041] Optionally, the data characterizing the current context may also include various features of content items previously presented to a particular user in the presentation environment. For example, the features may include some or all of the features described above for a predetermined number of content items most recently presented to the particular user in the presentation environment, or for each content item presented to the particular user within a recent time window. Further optionally, for each previously presented content item, the data may also include data identifying whether the particular user selected the content item when it was presented in the presentation environment.

[0042] Additionally, when content items can be presented as part of a response to a search query, the data may optionally include the text of the search query and, more optionally, the text of one or more other search queries recently submitted by a particular user.
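The features listed above can be pictured as a simple record type. The following is only an illustrative sketch: all class and field names are hypothetical, and the text does not prescribe any particular encoding of the context data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentItemFeatures:
    """Features of one content item (all field names are illustrative)."""
    quality_score: float                       # quality of the content item
    predicted_select_prob: float               # predicted probability of selection
    navigational_score: Optional[float] = None       # only for search responses
    linked_resource_quality: Optional[float] = None  # only if the item links to a resource
    position: Optional[int] = None             # presentation location
    was_selected: Optional[bool] = None        # only for previously presented items


@dataclass
class PresentationContext:
    """Data characterizing a context in which a candidate item may be presented."""
    candidate: ContentItemFeatures
    timestamp: str                             # current time and date
    history: List[ContentItemFeatures] = field(default_factory=list)
    search_query: Optional[str] = None         # text of the current query, if any
    recent_queries: List[str] = field(default_factory=list)
```

A context with no presentation history and no search query would be built as `PresentationContext(candidate=ContentItemFeatures(0.7, 0.2), timestamp="2017-07-14T00:00:00")`.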

[0043] The system determines the predicted long-term impact on user engagement of presenting the specific content item while the presentation environment is in its current state, i.e., of presenting the specific content item in the current context (step 204). Specifically, the system uses a machine learning model that has been trained through reinforcement learning (e.g., the prediction subsystem 130 of Figure 1) to determine the predicted long-term impact. Determining the predicted long-term impact is described in more detail below with reference to Figure 3.

[0044] The system determines whether to present a specific content item based on the predicted long-term impact (step 206). Generally, the system determines to present the content item if the predicted long-term impact is not overly negative. Specifically, the system determines to present the content item when the predicted long-term impact exceeds a threshold.

[0045] In some implementations, the system receives data identifying the threshold from an external system or from a system administrator.

[0046] In some other implementations, the system receives data identifying a short-term value resulting from presenting the content item to the user and an average value resulting from each future selection of a content item by the user, and determines the threshold based on the received data. For example, the threshold value T may satisfy:

[0047] T = (k – STV) / AV

[0048] Where k is a constant value, STV is a short-term value resulting from the content items presented to the user, and AV is the average value resulting from each future selection of content items by the user.
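The threshold formula and the decision rule of step 206 can be sketched as follows. The function and parameter names are illustrative assumptions; the rule "present when the impact exceeds the threshold" follows the description of step 206.

```python
def impact_threshold(k, short_term_value, avg_value_per_selection):
    """Threshold T = (k - STV) / AV from the formula above."""
    if avg_value_per_selection == 0:
        raise ValueError("average value per selection must be non-zero")
    return (k - short_term_value) / avg_value_per_selection


def should_present(predicted_impact, threshold):
    """Present the content item only if the predicted impact exceeds T."""
    return predicted_impact > threshold
```

For instance, with k = 0, STV = 2 and AV = 4, the threshold is -0.5: an item with a predicted impact of -0.2 would be presented, while one with a predicted impact of -0.7 would not.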

[0049] If the system determines that it should present a content item, it may send the content item to the user or provide an instruction to an external system to provide the content item to the user in the presentation environment. If the system determines that it should not present a content item, it refrains from sending the content item to the user, or sends an instruction to an external system indicating that the content item should not be presented to the user.

[0050] Figure 3 is a flowchart of an example process 300 for determining the predicted long-term impact of presenting a current content item on user engagement. For convenience, process 300 is described as being performed by a system of one or more computers located in one or more locations. For example, a content item presentation system (e.g., the content item presentation system 100 of Figure 1), if properly programmed, can perform process 300.

[0051] The system provides data representing the current context in which the content item might be presented as input to the prediction subsystem (step 302). The prediction subsystem is a machine learning model trained to process this input to generate a current engagement score, which measures the user's predicted future engagement with future content items presented to the user in the presentation environment if the content item is presented in the current context. Specifically, in some implementations, the engagement score represents the predicted, time-adjusted total number of selections by the user of future content items presented to the user in the presentation environment if the content item is presented in the current context. In some other implementations, the engagement score represents the predicted, time-adjusted change in the percentage of future content items presented to the user in the presentation environment that the user selects if the content item is presented in the current context.

[0052] The system provides data representing the current context and an empty content item as input to the prediction subsystem (step 304). That is, the system provides data representing the current context, but replaces the data representing the current content item with data representing an empty content item. The data representing the empty content item is predetermined placeholder data, which indicates to the prediction subsystem that no content item is presented in the current context.

[0053] The prediction subsystem has been trained to treat data representing the current context and the empty content item as an indication that no content item will be presented to the user in the current context, and therefore generates an empty engagement score that measures the user's predicted future engagement with future content items to be presented to the user in the presentation environment if the content item is not presented in the current context. Specifically, in some embodiments, the empty engagement score represents the predicted, time-adjusted total number of selections by the user of future content items to be presented to the user in the presentation environment if the content item is not presented in the current context. In some other embodiments, the empty engagement score represents the predicted, time-adjusted change in the percentage of future content items presented to the user in the presentation environment that the user selects if the content item is not presented in the current context. Training the prediction subsystem to treat data representing (i) a context and (ii) the designated empty content item as an indication that no content item will be presented to the user in that context is described in more detail below with reference to Figure 4.

[0054] The system determines the predicted long-term impact of presenting the current content item on user engagement using the current engagement score and the empty engagement score (step 306). Specifically, the system subtracts the empty engagement score from the current engagement score to determine the predicted decrease in user engagement due to presenting the current content item; a negative difference indicates that presenting the item is predicted to reduce engagement. The system may treat the predicted decrease as the predicted long-term impact or may apply a scaling factor to the predicted decrease to generate the predicted long-term impact. In some implementations, if the predicted long-term impact is positive, i.e., greater than 0, the system sets the impact to 0.
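Steps 302 to 306 can be sketched as below. This is a minimal sketch under stated assumptions: the model is any callable `model(context, item)` returning an engagement score, `EMPTY_ITEM` is a hypothetical placeholder object, and the impact is taken as the current score minus the empty score, so that a harmful item yields a negative impact. That sign convention is the one consistent with clamping positive impacts to 0 and with presenting only when the impact exceeds a threshold.

```python
# Hypothetical placeholder standing in for "no content item is presented".
EMPTY_ITEM = {"is_empty": 1.0}


def predicted_long_term_impact(model, context, item, scale=1.0, clamp_positive=True):
    """Predicted long-term impact of presenting `item` in `context`.

    model: callable (context, item) -> engagement score; stands in for the
        prediction subsystem, which is not specified further here.
    scale: optional scaling factor applied to the difference.
    """
    current_score = model(context, item)        # step 302
    empty_score = model(context, EMPTY_ITEM)    # step 304
    # Step 306: negative when presenting the item is predicted to hurt engagement.
    impact = scale * (current_score - empty_score)
    if clamp_positive and impact > 0:
        impact = 0.0
    return impact
```

The two model calls differ only in the content item, so the difference isolates the effect of presenting the item in an otherwise identical context.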

[0055] In some implementations, the system performs process 300 online, i.e., when determining whether to present a given content item. In other implementations, the system performs process 300 offline for multiple possible combinations of content items and contexts, and stores data mapping each combination of content items and contexts to the predicted long-term effects for that combination. In these implementations, when making the determination whether to present a given content item, the system accesses maintained data to determine the predicted long-term effects. If no particular content item and context combination is present in the maintained data, the system can estimate the predicted long-term effects based on the predicted long-term effects for adjacent combinations present in the maintained data, for example, by averaging the predicted long-term effects for adjacent combinations.
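The offline variant can be sketched as a lookup table with a fallback for missing combinations. The text does not define what makes two combinations "adjacent"; the sketch below assumes, purely for illustration, that each combination is encoded as a numeric feature tuple and that adjacency means the `k` nearest stored keys by Euclidean distance.

```python
import math


def lookup_impact(table, key, k=2):
    """Return the stored impact for `key`, or average the k nearest stored keys.

    table: dict mapping numeric feature tuples (one per content item and
        context combination) to precomputed long-term impacts.
    key: feature tuple of the combination being looked up.
    """
    if key in table:
        return table[key]
    if not table:
        raise KeyError("no precomputed impacts available")
    # Assumed notion of adjacency: smallest Euclidean distance between keys.
    nearest = sorted(table, key=lambda other: math.dist(other, key))[:k]
    return sum(table[n] for n in nearest) / len(nearest)
```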

[0056] Figure 4 is a flowchart of an example process 400 for training a long-term engagement machine learning model using reinforcement learning. For convenience, process 400 is described as being performed by a system of one or more computers located in one or more locations. For example, a content item presentation system (e.g., the content item presentation system 100 of Figure 1), if properly programmed, can perform process 400.

[0057] The system receives a tuple that includes data representing a first context in the presentation environment in which a first content item is presented to the user, data identifying whether the user selected the first content item, and data representing a second, subsequent context in which a second content item is presented to the user (step 402). Typically, the second context immediately follows the first context; that is, the second content item is the next content item presented to the user in the environment after the first content item.

[0058] The system generates a reward based on whether the user selected the first content item to be presented when the environment is in the first state (step 404).

[0059] The way rewards are generated depends on what type of participation score the machine learning model has been trained to generate.

[0060] Specifically, if a machine learning model is being trained to generate an engagement score representing the total number of selections for future content items, adjusted over time, then if the user selects the first content item, the system can set the reward to a first predetermined numerical value, and if the user does not select the first content item, the reward can be set to a lower, second predetermined numerical value. For example, the first numerical value could be 1 and the second numerical value could be 0. As another example, the first numerical value could be .8 and the second numerical value could be .1.

[0061] If the machine learning model is being trained to generate a time-adjusted engagement score representing the rate at which a user selects future content items, then if the user selects the first content item, the system sets the reward to a first value, which depends on the predicted probability that the user will select the first content item, as determined by an external system, and sets the reward to 0 if the user does not select the first content item. For example, the first value could be 1 minus the predicted probability or 1 divided by the predicted probability.
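The two reward schemes of step 404 can be sketched as follows. Function names and the `mode` parameter are illustrative assumptions; the numerical choices mirror the examples given in the text.

```python
def selection_count_reward(selected, hit=1.0, miss=0.0):
    """Reward when the model predicts a time-adjusted count of selections.

    `hit` and `miss` are the first and second predetermined values,
    e.g. 1 and 0, or 0.8 and 0.1.
    """
    return hit if selected else miss


def selection_rate_reward(selected, predicted_prob, mode="one_minus"):
    """Reward when the model predicts a time-adjusted selection rate.

    predicted_prob: externally supplied probability that the user selects
        the first content item.
    mode: "one_minus" gives 1 - p, "inverse" gives 1 / p.
    """
    if not selected:
        return 0.0
    if mode == "one_minus":
        return 1.0 - predicted_prob
    if mode == "inverse":
        return 1.0 / predicted_prob
    raise ValueError(f"unknown mode: {mode}")
```

Both rate-based variants give a larger reward for selecting an item that was predicted to be unlikely to be selected.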

[0062] The system uses the long-term engagement machine learning model to process the data representing the first context, in accordance with the current values of the model parameters, to generate a first engagement score for the first context (step 406).

[0063] The system uses the long-term engagement machine learning model to process the data representing the second context, in accordance with the current values of the model parameters, to generate a second engagement score for the second context (step 408).

[0064] The system determines the error of the first engagement score using the reward and the second engagement score (step 410). The system can determine this error using any of various methods appropriate to the reinforcement learning training technique.

[0065] For example, the error E can be the temporal difference learning error that satisfies the following equation:

[0066] E = V(s_t) - (R + γ V(s_{t+1}))

[0067] where V(s_t) is the first engagement score, R is the reward, V(s_{t+1}) is the second engagement score, and γ is the time discount factor.

[0068] As another example, the error E can be an interpolation between the aforementioned temporal difference error and the Monte Carlo supervised learning error.

[0069] As yet another example, the error E can include Huber loss, which caps the magnitude of the error.
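The three error variants of step 410 can be sketched as below. The temporal difference error follows the equation above. The interpolation weight `lam`, the use of an observed discounted return `mc_return` as the Monte Carlo target, and the Huber threshold `delta` are assumptions for illustration; the text names these techniques without fixing their parameters.

```python
def td_error(v_first, reward, v_second, gamma):
    """Temporal difference error: E = V(s_t) - (R + gamma * V(s_{t+1}))."""
    return v_first - (reward + gamma * v_second)


def interpolated_error(v_first, reward, v_second, mc_return, gamma, lam):
    """Linear interpolation between the Monte Carlo error and the TD error.

    mc_return: observed time-discounted return from the first context onward
        (assumed to be available from logged data).
    lam: assumed interpolation weight in [0, 1]; 0 gives the pure TD error,
        1 gives the pure Monte Carlo error.
    """
    td = td_error(v_first, reward, v_second, gamma)
    mc = v_first - mc_return
    return lam * mc + (1.0 - lam) * td


def huber_loss(error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error * error
    return delta * (magnitude - 0.5 * delta)
```

Because the Huber loss grows only linearly for large errors, a single outlier tuple has a bounded effect on each gradient step.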

[0070] The system uses the error to adjust the current values of the parameters of the long-term engagement machine learning model (step 412). For example, the system can use backpropagation training techniques to perform gradient descent iterations to update the model's parameters in order to reduce the error.

[0071] The system can repeatedly perform process 400 on multiple different tuples to train the model to generate effective long-term engagement scores. While each tuple describes a content item presented to a single user, the multiple different tuples generally include tuples that collectively describe content items presented to many different users. For example, the system can repeatedly perform process 400 on tuples selected from a tuple database until convergence criteria for training the machine learning model are met.
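
Repeating the process over many tuples until a convergence criterion is met can be sketched as a simple loop. This sketch assumes an in-memory sequence of tuples in place of a tuple database, the same linear stand-in model as a substitute for the actual machine learning model, and a hypothetical convergence criterion (mean absolute error over one pass below `tolerance`).

```python
import random
from typing import List, Sequence, Tuple

# (first scenario features, reward, second scenario features)
TrainingTuple = Tuple[Sequence[float], float, Sequence[float]]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * x for w, x in zip(weights, features))


def train(tuples: Sequence[TrainingTuple], num_features: int, gamma: float = 0.9,
          learning_rate: float = 0.1, tolerance: float = 1e-3,
          max_epochs: int = 10_000, seed: int = 0) -> List[float]:
    """Apply the per-tuple update repeatedly until the assumed criterion holds."""
    if not tuples:
        raise ValueError("at least one training tuple is required")
    rng = random.Random(seed)
    weights = [0.0] * num_features
    for _ in range(max_epochs):
        epoch = list(tuples)
        rng.shuffle(epoch)  # visit the tuples in a random order each pass
        total_abs_error = 0.0
        for features, reward, next_features in epoch:
            target = reward + gamma * score(weights, next_features)
            error = score(weights, features) - target
            total_abs_error += abs(error)
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
        if total_abs_error / len(epoch) < tolerance:
            break  # convergence criterion met
    return weights
```

In a two-scenario toy example where the second scenario yields a reward of 1 and then ends, the learned score for the first scenario approaches the discount factor times the score of the second.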

[0072] To ensure that the machine learning model is also trained to generate null engagement scores, some of the tuples used in process 400 describe scenarios in which no content item was presented to the user. For these tuples, the data representing the scenario includes placeholder data representing a null content item. By including tuples in which no content item was presented to the user, the machine learning model is trained to generate both accurate null engagement scores and accurate engagement scores for actual content items.
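
One possible way to represent the placeholder data is sketched below. The encoding is an assumption made for illustration: a scenario is taken to be a flat feature vector made of context features, content item features, and a trailing flag, and the null content item is encoded as an all-zero item vector with the flag set to 0. The function name `encode_scenario` is hypothetical.

```python
from typing import List, Optional, Sequence


def encode_scenario(context_features: Sequence[float],
                    item_features: Optional[Sequence[float]],
                    item_dim: int) -> List[float]:
    """Build the model input for a scenario.

    Passing item_features=None encodes the null content item: a zero vector
    of length item_dim followed by a 0.0 "item presented" flag.
    """
    if item_features is None:
        return list(context_features) + [0.0] * item_dim + [0.0]
    if len(item_features) != item_dim:
        raise ValueError("item_features must have length item_dim")
    return list(context_features) + list(item_features) + [1.0]
```

Because both kinds of scenario share one input layout, the same model can score presenting an item and presenting nothing, which is what allows the two scores to be compared.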

[0073] The embodiments and functional operations of the subject matter described herein can be implemented in digital electronic circuit systems, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed herein and their structural equivalents), or in combinations thereof. Embodiments of the subject matter described herein can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by a data processing apparatus or for controlling the operation of such data processing apparatus. Alternatively or additionally, program instructions can be encoded on artificially generated propagated signals, such as machine-generated electrical, optical, or electromagnetic signals, generated to encode information for transmission to a suitable receiver device for execution by the data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination thereof. However, the computer storage medium is not a propagated signal.

[0074] The term "data processing apparatus" encompasses all devices, apparatuses, and machines used for processing data, including, for example, programmable processors, computers, or multiple processors or computers. The apparatus may include special-purpose logic circuitry, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits). In addition to hardware, the apparatus may also include code that creates an execution environment for the computer program under consideration, such as code constituting processor firmware, protocol stacks, database management systems, operating systems, or combinations thereof.

[0075] Computer programs (also referred to or described as programs, software, software applications, modules, software modules, scripts, or code) can be written in any programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as standalone programs or as modules, components, subroutines, or other units suitable for use in a computing environment. Computer programs may, but do not need to, correspond to files in a file system. Programs can be stored as a part of a file that holds other programs or data (e.g., in one or more scripts stored in a markup language document), or in a single file dedicated to the program in question, or in multiple co-located files (e.g., files storing one or more modules, subroutines, or portions of code). Computer programs can be deployed to execute on a single computer or on multiple computers located at a single site or distributed across multiple sites and interconnected via a communication network.

[0076] As used in this specification, "engine" or "software engine" refers to a software implementation of an input / output system that provides outputs distinct from its inputs. An engine can be a functional block of code, such as a library, platform, software development kit (SDK), or object. Each engine can be implemented on any suitable type of computing device, including one or more processors and computer-readable media, such as a server, mobile phone, tablet computer, laptop computer, music player, e-book reader, laptop or desktop computer, PDA, smartphone, or other fixed or portable device. Furthermore, two or more engines can be implemented on the same computing device or on different computing devices.

[0077] The processes and logic flows described herein can be executed by one or more programmable computers that execute one or more computer programs to perform functions by manipulating input data and generating outputs. The processes and logic flows can also be executed by special-purpose logic circuitry (e.g., FPGA (Field Programmable Gate Array) or ASIC (Application-Specific Integrated Circuit)), and the device can also be implemented as special-purpose logic circuitry (e.g., FPGA or ASIC).

[0078] Computers suitable for executing computer programs include, for example, those based on general-purpose or special-purpose microprocessors or both, or any other type of central processing unit. Typically, a central processing unit receives instructions and data from read-only memory or random access memory or both. The essential components of a computer are a central processing unit for performing or executing instructions, and one or more memory devices for storing instructions and data. Typically, a computer also includes one or more mass storage devices (e.g., magnetic disks, magneto-optical disks, or optical disks) for storing data, or is operatively coupled to receive data from or transfer data to such mass storage devices, or both. However, a computer does not need to have such devices. Furthermore, a computer can be embedded in another device, such as a mobile phone, personal digital assistant (PDA), mobile audio or video player, game console, GPS receiver, or portable storage device (e.g., a Universal Serial Bus (USB) flash drive), to name just a few.

[0079] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, CD-ROMs, and DVD-ROMs. The processor and memory may be supplemented by or incorporated into a dedicated logic circuit system.

[0080] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device for displaying information to the user (e.g., a CRT (cathode ray tube) monitor, an LCD (liquid crystal display) monitor, or an OLED display) and an input device by which the user can provide input to the computer (e.g., a keyboard, a mouse, or a presence-sensitive display or other surface). Other kinds of devices can also be used to provide for interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. Additionally, the computer can interact with the user by sending resources to and receiving resources from a device used by the user (e.g., sending a webpage to a web browser on the user's device in response to a request received from the web browser).

[0081] Embodiments of the subject matter described herein can be implemented in a computing system that includes back-end components (e.g., as a data processor), or middleware components (e.g., an application server), or front-end components (e.g., a client computer with a graphical user interface or a web browser through which a user can interact with embodiments of the subject matter described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected via digital data communication (e.g., a communication network) of any form or medium. Examples of communication networks include local area networks (“LANs”) and wide area networks (“WANs”), such as the Internet.

[0082] A computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0083] While this specification contains numerous specific implementation details, these details should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of a particular invention. Certain features described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

[0084] Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing can be advantageous. Furthermore, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0085] Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing can be advantageous.

Claims

1. A system for using a machine learning model to determine whether to present content items to a user, the system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving first data representing a first content item and a first scenario in which the first content item can be presented to a first user in a presentation environment; providing the first data as input to a long-term engagement machine learning model to obtain a first engagement score, the first engagement score representing a predicted, time-adjusted total number of selections received in a time window following a current time window if the first content item is presented in the first scenario in the current time window; generating, based at least on the first scenario, an estimate of a second engagement score, the second engagement score representing a predicted, time-adjusted total number of selections received in a time window following the current time window if the first content item is not presented in the first scenario in the current time window; and determining, based at least on the first engagement score and the second engagement score, whether to present the first content item to the first user in the first scenario.

2. The system according to claim 1, wherein: providing the first data as input to the long-term engagement machine learning model to obtain the first engagement score further includes providing, as input to the long-term engagement machine learning model, data specifying a first action of presenting the first content item to the first user in the first scenario; and generating the estimate of the second engagement score further includes providing, as input to the long-term engagement machine learning model, data specifying a second action of refraining from presenting the first content item to the first user in the first scenario.

3. The system of claim 1, wherein determining whether to present the first content item includes: determining to present the first content item to the first user in the first scenario only when the first engagement score is greater than the second engagement score.

4. The system of claim 1, wherein the operations further include: in response to determining that the first content item should be presented, providing the first content item for presentation to the first user in the presentation environment, or providing an instruction to an external system that causes the external system to provide the first content item for presentation to the first user in the presentation environment.

5. The system according to claim 1, wherein the long-term engagement machine learning model has been trained through reinforcement learning to determine trained values of the parameters of the long-term engagement machine learning model.

6. The system of claim 1, wherein the data characterizing the first scenario includes data characterizing content items previously presented to the first user in the presentation environment.

7. The system according to claim 1, wherein the presentation environment is a response to a search query submitted by the user, and the data characterizing the environment includes the search query.

8. The system of claim 1, wherein the first content item is a recommendation of content that the first user might be interested in.

9. The system of claim 1, wherein the data characterizing the first scenario includes data characterizing content items previously presented to the first user in response to one or more search queries previously submitted by the first user.

10. The system of any one of claims 1-9, wherein the data characterizing the first scenario includes data characterizing the quality of the first content item.

11. The system of any one of claims 1-9, wherein the data characterizing the first scenario includes the predicted probability that the first user would select the first content item if it were presented to the first user in the first scenario.

12. A non-transitory computer storage medium storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising: receiving first data representing a first content item and a first scenario in which the first content item can be presented to a first user in a presentation environment; providing the first data as input to a long-term engagement machine learning model to obtain a first engagement score, the first engagement score representing a predicted, time-adjusted total number of selections received in a time window following a current time window if the first content item is presented in the first scenario in the current time window; generating, based at least on the first scenario, an estimate of a second engagement score, the second engagement score representing a predicted, time-adjusted total number of selections received in a time window following the current time window if the first content item is not presented in the first scenario in the current time window; and determining, based at least on the first engagement score and the second engagement score, whether to present the first content item to the first user in the first scenario.

13. A method performed by one or more computers for determining whether to present a content item to a user using a machine learning model, the method comprising: receiving first data representing a first content item and a first scenario in which the first content item can be presented to a first user in a presentation environment; providing the first data as input to a long-term engagement machine learning model to obtain a first engagement score, the first engagement score representing a predicted, time-adjusted total number of selections received in a time window following a current time window if the first content item is presented in the first scenario in the current time window; generating, based at least on the first scenario, an estimate of a second engagement score, the second engagement score representing a predicted, time-adjusted total number of selections received in a time window following the current time window if the first content item is not presented in the first scenario in the current time window; and determining, based at least on the first engagement score and the second engagement score, whether to present the first content item to the first user in the first scenario.

14. The method of claim 13, wherein: providing the first data as input to the long-term engagement machine learning model to obtain the first engagement score further includes providing, as input to the long-term engagement machine learning model, data specifying a first action of presenting the first content item to the first user in the first scenario; and generating the estimate of the second engagement score further includes providing, as input to the long-term engagement machine learning model, data specifying a second action of refraining from presenting the first content item to the first user in the first scenario.

15. The method of claim 13, wherein determining whether to present the first content item includes: determining to present the first content item to the first user in the first scenario only when the first engagement score is greater than the second engagement score.

16. The method according to claim 13, wherein determining whether to present the first content item includes: determining to present the first content item to the first user in the first scenario only when the first engagement score is greater than a threshold.

17. The method of claim 13, wherein the method further includes: in response to determining that the first content item should be presented, providing the first content item for presentation to the first user in the presentation environment, or providing an instruction to an external system that causes the external system to provide the first content item for presentation to the first user in the presentation environment.

18. The method of any of claims 13-17, wherein the data characterizing the first scenario includes data characterizing content items previously presented to the first user in the presentation environment.

19. The method of any of claims 13-17, wherein the data characterizing the first scenario includes data characterizing the quality of the first content item.

20. The method of any of claims 13-17, wherein the data characterizing the first scenario includes the predicted probability that the first user would select the first content item if it were presented to the first user in the first scenario.