Diffusion model multi-person image generation
By combining the first and second generative machine learning models to generate multi-person images and using a diffusion model for image combination, the problem of expensive equipment and complex interaction in the existing technology for generating multi-person images is solved, and efficient and low-cost multi-person image generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SNAP INC
- Filing Date
- 2024-11-14
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies require expensive equipment and extensive user interaction when generating images of multiple people, and it is difficult to efficiently generate high-quality images in simulated scenarios, resulting in wasted resources and a decline in user experience.
Personalized images are generated using first and second generative machine learning models, respectively. These images are then combined with background information to generate multi-person images. Finally, a diffusion model is used to combine the images, reducing user interaction and improving generation efficiency.
It can quickly and automatically generate high-quality multi-person images in simulated scenes, reducing resource consumption and time costs, and improving user experience.
Figure CN122228526A_ABST
Description
[0001] Priority Claim
[0002] This application claims the benefit of priority to U.S. Provisional Application No. 63/600,451, filed November 17, 2023, which is incorporated herein by reference in its entirety.
Technical Field
[0003] This disclosure generally relates to the generation of images using diffusion models.
Background
[0004] A diffusion model is a form of generative machine learning model that generates artificial images given one or more inputs, including prompts. Generative machine learning models process the prompts and generate artificial images based on the instructions the prompts define. In some cases, these generative machine learning models can also generate artificial videos based on prompts.
Brief Description of the Drawings
[0005] In the accompanying drawings, which are not necessarily drawn to scale, the same reference numerals may describe similar parts in different views. For ease of identifying any particular element or action under discussion, the most significant digit or digits of a reference numeral indicate the figure in which that element is first introduced. Some non-limiting examples are shown in the accompanying drawings:
[0006] Figure 1 is a diagrammatic representation of a networked environment in which the present disclosure can be deployed, according to some examples.
[0007] Figure 2 is a diagrammatic representation of a messaging system that has both client-side and server-side functionality, according to some examples.
[0008] Figure 3 is a diagrammatic representation of data structures maintained in a database, according to some examples.
[0009] Figure 4 is a diagrammatic representation of a message, according to some examples.
[0010] Figure 5 is a diagrammatic representation of a multi-person image generation system, according to some examples.
[0011] Figure 6 is a diagrammatic representation of example inputs and outputs of the image generation system, according to some examples.
[0012] Figure 7 is a flowchart illustrating example operations and methods of the multi-person image generation system, according to some examples.
[0013] Figure 8 is a diagrammatic representation of a machine in the form of a computer system within which a set of instructions can be executed to cause the machine to perform any one or more of the methods discussed herein, according to some examples.
[0014] Figure 9 is a block diagram illustrating a software architecture within which examples can be implemented.
[0015] Figure 10 shows a system that includes a head-mounted device.
Detailed Description
[0016] The following description includes systems, methods, techniques, instruction sequences, and computer program products that embody illustrative examples of the present disclosure. In this description, numerous specific details are set forth for illustrative purposes in order to provide an understanding of various examples. However, it will be apparent to those skilled in the art that the examples can be practiced without these specific details. In general, well-known examples of instructions, protocols, structures, and techniques are not necessarily shown in detail.
[0017] Typically, various communication platforms allow users to share content and create images to send to other users. These images can be used to promote products or services and / or simply to represent different real-world objects in simulated or real-world environments. However, these systems require users to use expensive equipment and technology to create high-quality, engaging images. Furthermore, users may spend considerable effort carefully placing objects in different environments and manually adjusting lighting and other image properties to enhance their presentation. All these factors can add up to a significant expense in creating high-quality images, diminishing the overall usability and enjoyment of the system. Moreover, users may miss opportunities to share and present objects in ideal settings because they lack the resources needed to create high-quality images. Additionally, presenting lower-quality images of such objects may cause other users to overlook their value, wasting the resources used to create and display them.
[0018] In some cases, generative models can be used to generate artificial images depicting users in different simulated contexts and environments. These systems can reduce the time and cost required to generate content. However, these systems are often personalized for an individual user and generate artificial content featuring only that specific user. Applying these systems in a multi-person context is highly complex and requires retraining the system on each person who is to be represented in the artificial content. Such training involves a significant amount of time and effort, which detracts from the overall usability and enjoyment of the system.
[0019] The disclosed technology seeks to improve the efficiency of using electronic devices by intelligently and automatically generating images depicting multiple people in simulated or artificial scenes in a simple and intuitive manner. The disclosed technology creates realistic images or videos depicting multiple people in simulated scenes very quickly and efficiently with minimal user interaction or involvement. This can reduce the overall time and cost associated with developing high-quality images featuring objects or products such as shoes, shirts, or other fashion items.
[0020] For example, the disclosed technology accesses a first artificially personalized image generated by a first generative machine learning model and a second artificially personalized image generated by a second generative machine learning model. The first generative machine learning model is trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model is trained to generate the second artificially personalized image including a depiction of a second person. The disclosed technology generates a foreground image that combines the depiction of the first person in the first image with the depiction of the second person in the second image. The disclosed technology then accesses background information and generates a new artificial image that includes the foreground image on a background having visual attributes corresponding to the background information.
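The data flow described above can be illustrated with a minimal sketch. This is not the disclosed implementation: the two personalized generative models and the diffusion-based combination are replaced by trivial stand-ins, and every name here (`make_person_image`, `combine_foreground`, `composite`) is hypothetical. Images are modeled as small grids in which `None` marks a transparent pixel.

```python
from typing import List, Optional

# A toy image: rows of pixels, where None means "transparent".
Image = List[List[Optional[str]]]


def make_person_image(label: str, width: int, height: int, column: int) -> Image:
    """Stand-in for one personalized generative model (assumption):
    returns a transparent image with one column marked as the person."""
    return [[label if x == column else None for x in range(width)]
            for _ in range(height)]


def combine_foreground(first: Image, second: Image) -> Image:
    """Merge both depictions into one foreground image.
    Where both images have content, the first person is kept."""
    return [[a if a is not None else b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]


def composite(foreground: Image, background_value: str) -> Image:
    """Stand-in for the final combination step: fill every transparent
    pixel with a value derived from the background information."""
    return [[px if px is not None else background_value for px in row]
            for row in foreground]


first_image = make_person_image("A", width=4, height=2, column=0)
second_image = make_person_image("B", width=4, height=2, column=3)
foreground = combine_foreground(first_image, second_image)
new_image = composite(foreground, "beach")
```

In this sketch the composite is a simple fill. In the disclosure, a diffusion model performs the combination so that the background takes on visual attributes corresponding to the background information.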
[0021] In this way, the disclosed technology improves the overall user experience when using electronic devices and reduces the total amount of resources required to complete the task of producing high-quality images.
[0022] Networked computing environment
[0023] Figure 1 is a block diagram illustrating an example interactive system 100 for facilitating interactions over a network, such as exchanging text messages, conducting text, audio, and video calls, or playing games. Interactive system 100 includes multiple user systems 102, each hosting multiple applications, including interactive client 104 and other applications 106. Each interactive client 104 is communicatively coupled, via one or more communication networks including network 108 (e.g., the Internet), to other instances of interactive client 104 (e.g., hosted on respective other user systems 102), interactive server system 110, and third-party server 112. Interactive client 104 can also communicate with locally hosted applications 106 using application programming interfaces (APIs).
[0024] Each user system 102 may include multiple user devices, such as mobile devices 114, head-mounted devices 116, and computer client devices 118, which are communicatively connected to exchange data and messages.
[0025] Interactive client 104 interacts with other interactive clients 104 and with interactive server system 110 via network 108. The data exchanged between interactive clients 104 (e.g., interactions 120) and between interactive client 104 and interactive server system 110 includes functions (e.g., commands to invoke functions) and payload data (e.g., text, audio, video, or other multimedia data).
[0026] Interactive server system 110 provides server-side functionality to interactive client 104 via network 108. While some functions of interactive system 100 are described herein as being performed by interactive client 104 or interactive server system 110, whether a function resides within interactive client 104 or interactive server system 110 can be a design choice. For example, it may be technically preferable to initially deploy particular technology and functionality within interactive server system 110 and later migrate that technology and functionality to interactive client 104 where user system 102 has sufficient processing power.
[0027] The interactive server system 110 supports various services and operations provided to the interactive client 104. Such operations include sending data to and receiving data from the interactive client 104, and processing data generated by the interactive client 104. This data may include message content, client device information, geolocation information, media enhancements and overlays, message content persistence conditions, entity relationship information, and live event information. Data exchange within the interactive system 100 is initiated and controlled through functions available via the user interface of the interactive client 104.
[0028] Turning now specifically to interactive server system 110, API server 122 is coupled to interactive server 124 and provides a programming interface to interactive server 124, enabling interactive client 104, other applications 106, and third-party server 112 to access the functionality of interactive server 124. Interactive server 124 is communicatively coupled to database server 126, facilitating access to database 128, which stores data associated with the interactions processed by interactive server 124. Similarly, web server 130 is coupled to interactive server 124 and provides a web-based interface to interactive server 124. To this end, web server 130 handles incoming network requests via Hypertext Transfer Protocol (HTTP) and several other related protocols.
[0029] API server 122 receives and sends interactive data (e.g., command and message payloads) between interactive server 124 and user system 102 (and, for example, interactive client 104 and other applications 106) and third-party server 112. Specifically, API server 122 provides a set of interfaces (e.g., routines and protocols) that interactive client 104 and other applications 106 can call or query to invoke the functionality of interactive server 124. API server 122 exposes various functions supported by interactive server 124, including: account registration; login function; sending interactive data from one interactive client 104 to another interactive client 104 via interactive server 124; transmission of media files (e.g., images or videos) from interactive client 104 to interactive server 124; setting up collections of media data (e.g., stories); retrieval of a list of friends of a user of user system 102; retrieval of messages and content; adding and deleting entities (e.g., friends) in entity relationship graphs (e.g., entity graph 310); locating friends within entity relationship graphs; and opening application events (e.g., related to interactive client 104).
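As a rough illustration of an API server that exposes named functions of an underlying server to its callers, consider the following sketch. The class `ApiServer`, the function names, and the in-memory friend list are all assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Any, Callable, Dict, List


class ApiServer:
    """Toy dispatcher: maps function names to handlers it exposes."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def expose(self, name: str, handler: Callable[..., Any]) -> None:
        # Make a server-side function callable by clients under `name`.
        self._handlers[name] = handler

    def call(self, name: str, **payload: Any) -> Dict[str, Any]:
        # Route a command and its payload to the matching handler.
        handler = self._handlers.get(name)
        if handler is None:
            return {"ok": False, "error": f"unknown function: {name}"}
        return {"ok": True, "result": handler(**payload)}


# Hypothetical server-side state and functions.
friends: Dict[str, List[str]] = {"alice": ["bob"]}


def get_friend_list(user: str) -> List[str]:
    return sorted(friends.get(user, []))


def add_friend(user: str, friend: str) -> List[str]:
    current = friends.setdefault(user, [])
    if friend not in current:
        current.append(friend)
    return sorted(current)


api = ApiServer()
api.expose("get_friend_list", get_friend_list)
api.expose("add_friend", add_friend)
```

A client would then invoke, for example, `api.call("add_friend", user="alice", friend="carol")`, and a call to a function that was never exposed returns an error payload rather than raising.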
[0030] Interactive server 124 hosts multiple systems and subsystems, which are described below with reference to Figure 2.
[0031] Linked Applications
[0032] Returning to interactive client 104, the features and functionality of external resources (e.g., linked application 106 or applet) are available to the user through the interface of interactive client 104. In this context, "external" refers to the fact that application 106 or the applet is external to interactive client 104. External resources are typically provided by third parties, but may also be provided by the creator or provider of interactive client 104. Interactive client 104 receives the user's selection of an option to launch or access the features of such an external resource. An external resource can be application 106 installed on user system 102 (e.g., a "local application"), or a smaller version of an application (e.g., an "applet") hosted on user system 102 or remote from user system 102 (e.g., on third-party server 112). A smaller version of an application includes a subset of the features and functionality of the application (e.g., the full-scale, native version of the application) and is implemented using markup-language documents. In some examples, a smaller version of an application (e.g., an "applet") is a web-based markup-language version of the application and is embedded in interactive client 104. In addition to using markup-language documents (e.g., a .*ml file), an applet can incorporate a scripting language (e.g., a .*js file or a .json file) and a style sheet (e.g., a .*ss file).
[0033] In response to receiving a user's selection of an option to launch or access an external resource, the interactive client 104 determines whether the selected external resource is a web-based external resource or a locally installed application 106. In some cases, the locally installed application 106 on the user's system 102 can be launched independently of and separately from the interactive client 104, for example, by selecting the icon corresponding to application 106 on the user's system 102's home screen. A smaller version of such an application can be launched or accessed through the interactive client 104, and in some examples, no part of the smaller application can be accessed outside the interactive client 104, or only a limited portion of the smaller application can be accessed outside the interactive client 104. The smaller application can be launched by the interactive client 104, for example, by receiving and processing a markup language document associated with the smaller application from a third-party server 112.
[0034] In response to determining that the external resource is a locally installed application 106, the interactive client 104 instructs the user system 102 to launch the external resource by executing locally stored code corresponding to the external resource. In response to determining that the external resource is a web-based resource, the interactive client 104 communicates with a third-party server 112 to obtain a markup language document corresponding to the selected external resource. The interactive client 104 then processes the obtained markup language document to render the web-based external resource within the user interface of the interactive client 104.
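The launch decision described in the two preceding paragraphs can be sketched as a single branch. The `ExternalResource` type, the `launch` function, and the two injected callables are hypothetical names introduced only for this illustration, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExternalResource:
    name: str
    is_web_based: bool
    location: str  # local code path, or URL on a third-party server


def launch(resource: ExternalResource,
           run_local: Callable[[str], None],
           fetch_markup: Callable[[str], str]) -> str:
    """Launch a local application, or fetch and render a web-based resource."""
    if not resource.is_web_based:
        # Locally installed application: execute locally stored code.
        run_local(resource.location)
        return f"launched local application {resource.name}"
    # Web-based resource: obtain the markup document, then render it
    # inside the client's own user interface (rendering is elided here).
    markup = fetch_markup(resource.location)
    return f"rendered {resource.name} in client UI ({len(markup)} characters of markup)"


launched = []
local_result = launch(
    ExternalResource("maps", False, "/apps/maps"),
    run_local=launched.append,
    fetch_markup=lambda url: "",
)
web_result = launch(
    ExternalResource("game", True, "https://third-party.example/game.html"),
    run_local=launched.append,
    fetch_markup=lambda url: "<html></html>",
)
```

Passing the local launcher and the markup fetcher in as callables keeps the sketch self-contained. A real client would supply operating-system and network implementations.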
[0035] Interactive client 104 can notify users of user system 102 or other users (e.g., "friends") associated with such users of one or more external resources. For example, interactive client 104 can provide participants in a conversation (e.g., a chat session) within interactive client 104 with notifications related to the current or recent use of external resources by one or more members of a group of users. One or more users can be invited to join an active external resource or to activate (in a group of friends) a recently used but currently inactive external resource. External resources can provide participants in the conversation, each using the corresponding interactive client 104, with the ability to share items, statuses, states, or locations within the external resource with one or more members of a group of users during a chat session. Shared items can be interactive chat cards, which chat members can use to interact, for example, activate the corresponding external resource, view specific information within the external resource, or take chat members to a specific location or state within the external resource. Within a given external resource, response messages can be sent to users on interactive client 104. External resources can selectively include different media items in the response based on the current context of the external resource.
[0036] Interactive client 104 can present a list of available external resources (e.g., application 106 or applets) to the user to launch or access a given external resource. This list can be presented in a context-sensitive menu. For example, the icons representing different ones of application 106 (or applets) can vary based on how the user launches the menu (e.g., from a conversational interface or a non-conversational interface).
[0037] System Architecture
[0038] Figure 2 is a block diagram illustrating further details of the interactive system 100, according to some examples. Specifically, the interactive system 100 is shown as including an interactive client 104 and an interactive server 124. The interactive system 100 includes multiple subsystems supported on the client side by the interactive client 104 and on the server side by the interactive server 124. Example subsystems are discussed below and may include a multi-person image generation system 500 that generates artificial images of multiple users, people, or objects in an artificial or simulated background or environment. An illustrative implementation of the multi-person image generation system 500 is shown and described below in connection with Figure 5.
[0039] In some examples, these subsystems are implemented as microservices. A microservice subsystem (e.g., a microservice application) can have components that enable it to operate independently and communicate with other services. Example components of a microservice subsystem may include:
[0040] • Functional logic: Functional logic implements the functions of the microservice subsystem and represents the specific capabilities or functions provided by the microservice.
[0041] • API Interface: Microservices can communicate with other components using lightweight protocols such as REST or messaging through well-defined APIs or interfaces. The API interface defines the inputs and outputs of a microservice subsystem and how it interacts with other microservice subsystems of the interactive system 100.
[0042] • Data storage: The microservice subsystem can be responsible for its own data storage, which can be in the form of a database, cache, or other storage mechanisms (e.g., using database server 126 and database 128). This allows the microservice subsystem to operate independently of other microservices in the interactive system 100.
[0043] • Service discovery: Microservice subsystems can find and communicate with other microservice subsystems in the interactive system 100. The service discovery mechanism enables microservice subsystems to locate and communicate with other microservice subsystems in a scalable and efficient manner.
[0044] • Monitoring and logging: Microservice subsystems may need to be monitored and logged to ensure availability and performance. Monitoring and logging mechanisms enable the tracking of the health and performance of microservice subsystems.
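The components listed above can be pictured with a small in-memory sketch. It assumes a hypothetical `ServiceRegistry` that combines service discovery with basic health tracking and logging, plus one example microservice (`caption_service`) whose API accepts and returns dictionaries. None of these names come from the disclosure.

```python
from typing import Callable, Dict, List, Optional

Handler = Callable[[dict], dict]


class ServiceRegistry:
    """Hypothetical registry for service discovery and health tracking."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._healthy: Dict[str, bool] = {}
        self.log: List[str] = []  # simple monitoring and logging record

    def register(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler
        self._healthy[name] = True
        self.log.append(f"registered {name}")

    def report_health(self, name: str, healthy: bool) -> None:
        if name not in self._handlers:
            raise KeyError(name)
        self._healthy[name] = healthy
        self.log.append(f"{name} healthy={healthy}")

    def discover(self, name: str) -> Optional[Handler]:
        # Only registered, healthy services are discoverable.
        if self._healthy.get(name, False):
            return self._handlers[name]
        return None


def caption_service(request: dict) -> dict:
    # Functional logic of an example microservice (illustrative only).
    return {"caption": request["text"].upper()}


registry = ServiceRegistry()
registry.register("caption", caption_service)
```

A caller first asks the registry for a service by name and then invokes the returned handler. A service reported as unhealthy is no longer returned by `discover`.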
[0045] In some examples, the interactive system 100 may adopt a monolithic architecture, a service-oriented architecture (SOA), a function-as-a-service (FaaS) architecture, or a modular architecture.
[0046] The image processing system 202 provides various functions that enable users to capture and enhance (e.g., annotate or otherwise modify or edit) media content associated with a message.
[0047] The camera device system 204 includes (e.g., in a camera device application) control software that (e.g., directly or via operating system controls) interacts with and controls the camera device hardware of the user system 102 to modify and enhance real-time images captured and displayed through the interactive client 104.
[0048] Enhancement system 206 provides functionality related to the generation and publication of enhancements (e.g., media overlays) for images captured in real time by the camera device of user system 102 or obtained from the memory of user system 102. For example, enhancement system 206 is operable to select, present, and display media overlays (e.g., image filters or image lenses) to interactive client 104 to enhance real-time images received via camera device system 204 or stored images obtained from memory 1002 of user system 102 (shown in Figure 10). These enhancements are selected and presented to the user of the interactive client 104 by the enhancement system 206 based on multiple inputs and data, such as:
[0049] • The geographical location of user system 102; and
[0050] • Entity relationship information of users in user system 102.
[0051] Enhancements may include audio and visual content and visual effects. Examples of audio and visual content include images, text, logos, animations, and sound effects. Examples of visual effects include color overlays. Audio and visual content or visual effects may be applied to media content items (e.g., photos or videos) at user system 102 to be transmitted in messages, or applied to video content, such as a video content stream or feed transmitted from interactive client 104. Therefore, image processing system 202 can interact with and support various subsystems of communication system 208, such as messaging system 210 and video communication system 212.
[0052] Media overlays may include text or image data that can be superimposed on photographs taken by user system 102 or video streams generated by user system 102. In some examples, media overlays may be location overlays (e.g., Venice Beach), names of live events, or names of businesses (e.g., beach cafes). In other examples, image processing system 202 uses the geographic location of user system 102 to identify a media overlay that includes the name of a business at the geographic location of user system 102. Media overlays may include other tags associated with businesses. Media overlays may be stored in database 128 and accessed through database server 126.
[0053] Image processing system 202 provides a user-based publishing platform that allows users to select geographic locations on a map and upload content associated with those locations. Users can also specify which media overlays should be provided to other users. Image processing system 202 generates a media overlay that includes the uploaded content and associates the uploaded content with the selected geographic location.
[0054] The enhancement creation system 214 supports augmented reality (AR) developer platforms and includes an application that content creators (such as artists and developers) use to create and publish enhancements (e.g., AR experiences) for the interactive client 104. The enhancement creation system 214 provides content creators with a library of built-in features and tools, including, for example, custom shaders, tracking technologies, and templates.
[0055] In some examples, the enhancement creation system 214 provides a merchant-based publishing platform that allows merchants to select specific enhancements associated with a geographic location through a bidding process. For example, the enhancement creation system 214 associates the media overlay of the highest bidder with the corresponding geographic location for a predefined amount of time.
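The bidding behavior described above can be sketched as follows, under several assumptions: bids are compared by a single numeric amount, the first of several equal top bids wins, and time is expressed in seconds. The types `Bid` and `OverlayAssignment` and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Bid:
    merchant: str
    overlay: str
    amount: float


@dataclass(frozen=True)
class OverlayAssignment:
    location: str
    overlay: str
    expires_at: float  # time, in seconds, at which the association ends


def assign_overlay(location: str, bids: List[Bid], now: float,
                   duration_seconds: float) -> Optional[OverlayAssignment]:
    """Associate the highest bidder's overlay with a location for a fixed time."""
    if not bids:
        return None
    winner = max(bids, key=lambda bid: bid.amount)
    return OverlayAssignment(location, winner.overlay, now + duration_seconds)


def active_overlay(assignment: OverlayAssignment, now: float) -> Optional[str]:
    """Return the overlay while the predefined period has not yet elapsed."""
    return assignment.overlay if now < assignment.expires_at else None


bids = [Bid("cafe", "cafe_overlay", 5.0), Bid("surf_shop", "surf_overlay", 8.0)]
assignment = assign_overlay("venice_beach", bids, now=100.0, duration_seconds=60.0)
```

Passing the current time in explicitly, rather than reading a clock inside the functions, keeps the expiry logic easy to check.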
[0056] Communication system 208 is responsible for enabling and processing various forms of communication and interaction within interactive system 100, and includes messaging system 210, audio communication system 216, and video communication system 212. Messaging system 210 is responsible for enabling temporary or time-limited access to content by interactive client 104. Messaging system 210 incorporates multiple timers (e.g., within user management system 218) that selectively enable access (e.g., for presentation and display) to messages and associated content via interactive client 104 based on duration and display parameters associated with a message or message set (e.g., a story). Audio communication system 216 enables and supports audio communication (e.g., real-time audio chat) between multiple interactive clients 104. Similarly, video communication system 212 enables and supports video communication (e.g., real-time video chat) between multiple interactive clients 104.
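Time-limited access of the kind the messaging system provides can be sketched with a per-message duration check. The field names and the half-open access window (accessible from posting time up to, but not including, the expiry time) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EphemeralMessage:
    content: str
    posted_at: float  # seconds
    duration: float   # how long the message remains accessible, in seconds


def is_accessible(message: EphemeralMessage, now: float) -> bool:
    """A message is accessible from posting until its duration elapses."""
    return message.posted_at <= now < message.posted_at + message.duration


def visible_story(messages: List[EphemeralMessage], now: float) -> List[str]:
    """Return the contents of a message set (a story) still accessible at `now`."""
    return [m.content for m in messages if is_accessible(m, now)]


story = [
    EphemeralMessage("a", posted_at=0.0, duration=10.0),
    EphemeralMessage("b", posted_at=5.0, duration=10.0),
]
```

With this sketch, `visible_story(story, 7.0)` includes both messages, while at a later time only the message whose window is still open is returned.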
[0057] The user management system 218 is operationally responsible for managing user data and profiles, and for maintaining entity information about users and relationships between users in the interactive system 100 (e.g., stored in entity table 308, entity graph 310, and profile data 302 of Figure 3).
[0058] The collection management system 220 is operationally responsible for managing collections or sets of media (e.g., collections of text, image, video, and audio data). Collections of content (e.g., messages, including images, videos, text, and audio) can be organized into “event galleries” or “event stories.” Such collections can be made available for a specified time period (e.g., the duration of the event to which the content relates). For example, content related to a concert can be made available as a “story” for the duration of the concert. The collection management system 220 can also be responsible for publishing icons to the user interface of the interactive client 104, which provide notifications for specific collections. The collection management system 220 includes curation functionality that allows collection managers to manage and curate specific content collections. For example, the curation interface enables event organizers to curate collections of content related to a specific event (e.g., removing inappropriate content or redundant messages). Furthermore, the collection management system 220 employs machine vision (or image recognition technology) and content rules to automatically curate content collections. In some examples, users may be compensated for including user-generated content in the collection. In such cases, the collection management system 220 operates to automatically pay such users for the use of their content.
[0059] Map system 222 provides various geographic location (e.g., geolocation) functions and supports the presentation of map-based media content and messages by interactive client 104. For example, map system 222 enables the display of user icons or avatars (e.g., stored in profile data 302 of Figure 3) on a map to indicate the current or past location of the user's "friends" within the context of the map, as well as media content generated by such friends (e.g., a collection of messages including photos and videos). For example, a message posted by a user to the interactive system 100 from a specific geographic location can be displayed to the user's "friends" within the context of the map, at that specific location, on the map interface of the interactive client 104. The user can also share his or her location and status information with other users of the interactive system 100 through the interactive client 104 (e.g., using an appropriate status avatar), where the location and status information is similarly displayed to the selected users within the context of the map interface of the interactive client 104.
[0060] Game system 224 provides various game functions within the context of interactive client 104. Interactive client 104 provides a game interface that offers a list of available games that a user can launch and play with other users of interactive system 100 within the context of interactive client 104. Interactive system 100 also enables specific users to invite other users to participate in specific games by sending invitations from interactive client 104. Interactive client 104 also supports sending and receiving audio, video, and text messages (e.g., chat) within the context of playing the game, provides leaderboards for the game, and supports the provision of in-game rewards (e.g., game currency and items).
[0061] External resource system 226 provides interactive client 104 with an interface to communicate with remote servers (e.g., third-party server 112) to launch or access external resources such as applications or applets. Each third-party server 112 hosts applications or smaller versions of applications (e.g., games, utilities, payment, or ride-sharing apps) based on markup languages (e.g., HTML5). Interactive client 104 can launch web-based resources (e.g., applications) by accessing HTML5 files from the third-party server 112 associated with the web-based resource. The application hosted by third-party server 112 is programmed in JavaScript using a software development kit (SDK) provided by interactive server 124. The SDK includes APIs with functions that can be called or invoked by the web-based application. Interactive server 124 hosts a JavaScript library that gives a given external resource access to specific user data of interactive client 104. HTML5 is an example of a technology used for programming games, but applications and resources programmed using other technologies can be used.
[0062] To integrate the SDK's functionality into the web-based resource, the SDK is downloaded by third-party server 112 from interaction server 124, or otherwise received by third-party server 112. Once downloaded or received, the SDK is included as part of the application code of the web-based external resource. The code of the web-based resource can then call or invoke certain functions of the SDK to integrate the features of interaction client 104 into the web-based resource.
[0063] The SDK stored on the interactive server system 110 effectively provides a bridge between external resources (e.g., application 106 or applet) and the interactive client 104. This gives users a seamless experience of communicating with other users on the interactive client 104 while retaining the look and feel of the interactive client 104. To bridge communication between external resources and the interactive client 104, the SDK facilitates communication between a third-party server 112 and the interactive client 104. A bridge script running on the user system 102 establishes two unidirectional communication channels between the external resources and the interactive client 104. Messages are sent asynchronously between the external resources and the interactive client 104 through these communication channels. Each SDK function invocation is sent as a message and a callback. Each SDK function is implemented by constructing a unique callback identifier and sending a message with that callback identifier.
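The message-and-callback pattern described above can be sketched with two one-way queues. The disclosure describes a JavaScript bridge script, whereas this illustration is written in Python, and the `Bridge` class, its method names, and the `get_display_name` function are hypothetical.

```python
import itertools
from collections import deque
from typing import Any, Callable, Deque, Dict


class Bridge:
    """Two unidirectional channels with callbacks matched by identifier."""

    def __init__(self, client_functions: Dict[str, Callable[..., Any]]) -> None:
        self.to_client: Deque[dict] = deque()    # resource -> client channel
        self.to_resource: Deque[dict] = deque()  # client -> resource channel
        self._callbacks: Dict[str, Callable[[Any], None]] = {}
        self._ids = itertools.count(1)
        self._client_functions = client_functions

    def invoke(self, function: str, args: dict,
               callback: Callable[[Any], None]) -> str:
        # Construct a unique callback identifier and send it with the message.
        callback_id = f"cb-{next(self._ids)}"
        self._callbacks[callback_id] = callback
        self.to_client.append(
            {"function": function, "args": args, "callback_id": callback_id})
        return callback_id

    def pump_client(self) -> None:
        # Client side: handle pending requests and queue the replies.
        while self.to_client:
            msg = self.to_client.popleft()
            result = self._client_functions[msg["function"]](**msg["args"])
            self.to_resource.append(
                {"callback_id": msg["callback_id"], "result": result})

    def pump_resource(self) -> None:
        # Resource side: deliver each reply to the callback that requested it.
        while self.to_resource:
            msg = self.to_resource.popleft()
            self._callbacks.pop(msg["callback_id"])(msg["result"])


results = []
bridge = Bridge({"get_display_name": lambda user_id: f"user-{user_id}"})
callback_id = bridge.invoke("get_display_name", {"user_id": 7}, results.append)
pending_before_pump = list(results)
bridge.pump_client()
bridge.pump_resource()
```

Because the caller receives its result only when the reply message is processed, `invoke` returns immediately and the result list stays empty until both channels have been pumped.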
[0064] By using the SDK, not all information from the interactive client 104 is shared with the third-party server 112. The SDK limits what information is shared based on the needs of the external resource. Each third-party server 112 provides the interactive server 124 with an HTML5 file corresponding to the web-based external resource. The interactive server 124 can add a visual representation (such as box art or other graphics) of the web-based external resource to the interactive client 104. Once the user selects the visual representation or instructs the interactive client 104 to access the features of the web-based external resource through the graphical user interface (GUI), the interactive client 104 obtains the HTML5 file and instantiates the resource to access the features of the web-based external resource.
[0065] Interactive client 104 presents a GUI (e.g., a login page or title screen) for an external resource. During, before, or after presenting the login page or title screen, interactive client 104 determines whether the initiated external resource has previously been authorized to access interactive client 104's user data. In response to determining that the initiated external resource has previously been authorized to access interactive client 104's user data, interactive client 104 presents another GUI for the external resource, including its functionality and characteristics. In response to determining that the initiated external resource has not previously been authorized to access interactive client 104's user data, after displaying the external resource's login page or title screen for a threshold time period (e.g., 3 seconds), interactive client 104 slides up a menu for authorizing the external resource to access user data (e.g., animating the menu to appear from the bottom of the screen to the middle or other part of the screen). The menu identifies the type of user data that the external resource will be authorized to use. In response to receiving a user's selection of the accept option, interactive client 104 adds the external resource to the list of authorized external resources and allows the external resource to access user data from interactive client 104. Interactive client 104 authorizes the external resource to access user data under the OAuth 2 framework.
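The authorization decision described above can be sketched as a small state check. This is an illustrative sketch only: the names `ClientAuthState`, `screen_to_present`, and `accept_authorization` are hypothetical, and the OAuth 2 exchange itself is not modeled.

```python
from dataclasses import dataclass, field
from typing import Set

SPLASH_THRESHOLD_SECONDS = 3.0  # the text gives 3 seconds as an example threshold


@dataclass
class ClientAuthState:
    """Hypothetical record of which external resources may read user data."""
    authorized_resources: Set[str] = field(default_factory=set)


def screen_to_present(state: ClientAuthState, resource_id: str, seconds_on_title: float) -> str:
    """Pick the GUI to show for a launched external resource."""
    if resource_id in state.authorized_resources:
        return "resource_gui"          # previously authorized: show the resource's features
    if seconds_on_title >= SPLASH_THRESHOLD_SECONDS:
        return "authorization_menu"    # slide the menu up once the threshold has elapsed
    return "title_screen"              # keep showing the login page or title screen


def accept_authorization(state: ClientAuthState, resource_id: str) -> None:
    """The user selected the accept option: add the resource to the authorized list."""
    state.authorized_resources.add(resource_id)
```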
[0066] Interactive client 104 controls the type of user data shared with external resources based on the type of authorized external resource. For example, access to a first type of user data (e.g., 2D avatars of users with or without different avatar characteristics) is provided to external resources including full-scale applications (e.g., application 106). As another example, access to a second type of user data (e.g., payment information, 2D avatars of users, 3D avatars of users, and avatars with various avatar characteristics) is provided to external resources including smaller versions of applications (e.g., web-based versions of applications). Avatar characteristics include different ways of customizing the appearance and feel of an avatar, such as different poses, facial features, clothing, etc.
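One way to picture this per-type control is an allow-list keyed by the kind of external resource. The sketch below follows the two examples given in the paragraph; the enum members and the data keys (`avatar_2d`, `payment_info`, and so on) are hypothetical labels chosen for illustration.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ResourceKind(Enum):
    FULL_APPLICATION = "full_application"  # e.g., application 106
    WEB_APPLET = "web_applet"              # e.g., a web-based version of an application


# Allow-lists mirroring the first and second types of user data in the text.
SHAREABLE: Dict[ResourceKind, FrozenSet[str]] = {
    ResourceKind.FULL_APPLICATION: frozenset({"avatar_2d"}),
    ResourceKind.WEB_APPLET: frozenset(
        {"payment_info", "avatar_2d", "avatar_3d", "avatar_characteristics"}
    ),
}


def filter_user_data(kind: ResourceKind, user_data: dict) -> dict:
    """Return only the fields that the given kind of external resource may receive."""
    allowed = SHAREABLE[kind]
    return {key: value for key, value in user_data.items() if key in allowed}
```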
[0067] The advertising system 228 is operationally designed to enable third parties to purchase advertisements for presentation to end users via interactive client 104, and also handles the delivery and presentation of these advertisements.
[0068] Artificial intelligence and machine learning system 230 provides various services to different subsystems within interactive system 100. For example, artificial intelligence and machine learning system 230 operates in conjunction with image processing system 202 and camera device system 204 to analyze images and extract information such as objects, text, or faces. This information can then be used by image processing system 202 to enhance, filter, or manipulate the image. Enhancement system 206 can use artificial intelligence and machine learning system 230 to generate enhanced content, XR experiences, and AR experiences, such as adding virtual objects or animations to real-world images.
[0069] Communication system 208 and messaging system 210 can use artificial intelligence and machine learning system 230 to analyze communication patterns and provide insights into how users interact with each other, and provide intelligent message classification and tagging, such as classifying messages based on sentiment or topic. Artificial intelligence and machine learning system 230 can also provide chatbot functionality for message interactions 120 between user systems 102 and between user system 102 and interaction server system 110. Artificial intelligence and machine learning system 230 can also work with audio communication system 216 to provide speech recognition and natural language processing capabilities, allowing users to interact with interaction system 100 using voice commands.
[0070] In some cases, the artificial intelligence and machine learning system 230 can implement the multi-person image generation system 232, which is discussed in more detail in connection with Figure 5. The multi-person image generation system 232 accesses or includes a first generative machine learning model and a second generative machine learning model. The first generative machine learning model can be trained to generate a first artificially personalized image including a depiction of a first person. The second generative machine learning model can be trained to generate a second artificially personalized image including a depiction of a second person. The multi-person image generation system 232 generates a foreground image that combines the depiction of the first person in the first image with the depiction of the second person in the second image. The multi-person image generation system 232 accesses background information and generates a new artificial image including depictions of the first and second people on a background with visual attributes corresponding to the background information.
[0071] Data Architecture
[0072] Figure 3 is a schematic diagram illustrating a data structure 300 that can be stored in a database 304 of an interactive server system 110, according to certain examples. Although the contents of the database 304 are shown to include multiple tables, it should be understood that the data can be stored in other types of data structures (e.g., as an object-oriented database).
[0073] Database 304 includes message data stored in message table 306. For any given message, this message data includes at least message sender data, message receiver (or recipient) data, and a payload. Additional details regarding the information that can be included in a message, and that is contained within the message data stored in message table 306, are described below with reference to Figure 4.
[0074] Entity table 308 stores entity data and is linked (e.g., by reference) to entity graph 310 and profile data 302. Entities for which records are maintained within entity table 308 can include individuals, company entities, organizations, objects, locations, events, etc. Regardless of entity type, any entity whose data is stored by the interactive server system 110 can be an identifiable entity. Each entity is provided with a unique identifier and an entity type identifier (not shown).
[0075] Entity graph 310 stores information about relationships and associations between entities. As an example only, such relationships can be social, professional (e.g., working in a common company or organization), interest-based, or activity-based. Some relationships between entities can be one-way, such as an individual user subscribing to digital content from a business or publishing user (e.g., a newspaper or other digital media channel or brand). Other relationships can be two-way, such as the "friendship" relationship between individual users of interactive system 100.
[0076] Certain permissions and restrictions can be attached to each relationship, and also to each direction of the relationship. For example, a two-way relationship (e.g., a friendship between individual users) can include authorization for the publication of digital content items between individual users, but certain restrictions or filters can be imposed on the publication of such digital content items (e.g., based on content characteristics, location data, or time of day data). Similarly, a subscription relationship between an individual user and a business user can impose varying degrees of restrictions on the publication of digital content from the business user to the individual user, and can significantly restrict or prevent the publication of digital content from the individual user to the business user. A specific user, as an example of an entity, can record certain restrictions in a record for that entity within entity table 308 (e.g., via privacy settings). Such privacy settings can be applied to all types of relationships in the context of interactive system 100, or selectively applied to certain types of relationships.
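Per-direction permissions of this kind can be sketched as a directed graph whose edges carry restriction rules. The following is a minimal illustration under assumed names (`EntityGraph`, `relate`, `may_publish`) and an assumed content format (a dictionary with an `hour` field); it is not the storage layout of entity graph 310.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Rule = Callable[[dict], bool]  # returns True when the content item may be published


@dataclass
class EntityGraph:
    """Directed edges keyed by (source, target); a missing key means no permission."""
    edges: Dict[Tuple[str, str], List[Rule]] = field(default_factory=dict)

    def relate(self, source: str, target: str, *restrictions: Rule) -> None:
        """Authorize publication from source to target, subject to optional rules."""
        self.edges[(source, target)] = list(restrictions)

    def may_publish(self, source: str, target: str, content: dict) -> bool:
        """Check the direction-specific rules for one content item."""
        restrictions = self.edges.get((source, target))
        if restrictions is None:
            return False
        return all(rule(content) for rule in restrictions)


graph = EntityGraph()
# Two-way friendship, with a time-of-day filter on one direction only.
graph.relate("alice", "bob")
graph.relate("bob", "alice", lambda content: content.get("hour", 0) < 22)
# One-way subscription: the publisher may send to the subscriber, not the reverse.
graph.relate("newspaper", "alice")
```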
[0077] Profile data 302 stores various types of profile data about a specific entity. Profile data 302 can be selectively used and presented to other users of the interactive system 100 based on privacy settings specified by the specific entity. In the case of an individual, profile data 302 includes, for example, a username, phone number, address, settings (such as notification and privacy settings), and an avatar representation (or a set of such avatar representations) selected by the user. A specific user can then selectively include one or more of these avatar representations within the content of messages transmitted via the interactive system 100 and on a map interface displayed to other users by the interactive client 104. The set of avatar representations may include “status avatars,” which present a graphical representation of a status or activity that the user can choose to transmit at a specific time.
[0078] In the case that the entity is a group, in addition to the group name, members and various settings of the associated group (such as notifications), the group profile data 302 may similarly include one or more avatar representations associated with the group.
[0079] Database 304 also stores enhancement data, such as overlays or filters, in enhancement table 312. The enhancement data is associated with and applied to videos (whose data is stored in video table 314) and images (whose data is stored in image table 316).
[0080] In some examples, filters are overlays displayed as superimposed on images or videos during presentation to the recipient user. Filters can be of various types, including filters selected by the user from a set of filters presented to the sending user by the interactive client 104 when the sending user composes a message. Other types of filters include geolocation filters (also known as geographic filters), which can be presented to the sending user based on geographic location. For example, the interactive client 104 may present geolocation filters specific to nearby or particular locations within the user interface based on geolocation information determined by the Global Positioning System (GPS) unit of the user system 102.
[0081] Another type of filter is a data filter, which can be selectively presented to the sending user by the interactive client 104 based on other inputs or information collected by the user system 102 during the message creation process. Examples of data filters include the current temperature at a specific location, the sending user's current travel speed, the battery life of the user system 102, or the current time.
[0082] Other enhancement data that can be stored within image table 316 includes, for example, AR content items (e.g., corresponding to applying a “Lens” or an AR experience). AR content items can be real-time effects and sounds that can be added to images or videos.
[0083] Collection table 318 stores data related to collections of messages and associated image, video, or audio data compiled into collections (e.g., stories or galleries). The creation of a specific collection can be initiated by a specific user (e.g., each user for whom records are maintained in entity table 308). A user can create a "personal story" in the form of a collection of content that has already been created and sent / broadcast by that user. For this purpose, the user interface of interactive client 104 may include user-selectable icons to allow the sending user to add specific content to his or her personal story.
[0084] Collections can also constitute "live stories," which are collections of content from multiple users created manually, automatically, or using a combination of manual and automatic technologies. For example, a "live story" can constitute a curated stream of user-submitted content from different locations and events. For instance, users with location services enabled on their client devices and who are at a common location event at a specific time can be presented with the option to contribute content to a specific live story via the user interface of interactive client 104. Users can be identified by interactive client 104 based on their location. The end result is a "live story" told from a community perspective.
[0085] Another type of content collection is called a "location story," which allows users of user system 102 located in a specific geographic location (e.g., within a college or university campus) to contribute to a specific collection. In some examples, contributions to a location story may employ a second level of verification to confirm that the end user belongs to a specific organization or other entity (e.g., is a student on a university campus).
[0086] As described above, video table 314 stores video data, which in some examples is associated with messages whose records are maintained within message table 306. Similarly, image table 316 stores image data associated with messages whose message data is stored in entity table 308. Entity table 308 can associate various enhancements from enhancement table 312 with various images and videos stored in image table 316 and video table 314.
[0087] Database 304 also includes trained machine learning techniques 307, or generative machine learning models, which store the parameters of one or more machine learning models that have been trained during training of the multi-person image generation system 500 (Figure 5). For example, trained machine learning technique 307 stores the trained parameters of one or more artificial neural network machine learning models or techniques or diffusion models.
[0088] Data communication architecture
[0089] Figure 4 is a schematic diagram illustrating the structure of message 400 according to some examples, generated by interactive client 104 for transmission to another interactive client 104 via interactive server 124. The content of a particular message 400 is used to populate message table 306 within database 304 accessible by interactive server 124. Similarly, the content of message 400 is stored in memory as "in-transit" or "in-flight" data of user system 102 or interactive server 124. Message 400 is shown to include the following example components:
[0090] • Message Identifier 402: A unique identifier that identifies message 400.
[0091] • Message text payload 404: Text generated by the user via the user interface of user system 102 and included in message 400.
[0092] • Message image payload 406: Image data captured by the camera device component of user system 102 or obtained from the storage component of user system 102 and included in message 400. The image data for the sent or received message 400 can be stored in image table 316.
[0093] • Message video payload 408: Video data captured by the camera device component or obtained from the storage component of the user system 102 and included in the message 400. The video data for the sent or received message 400 can be stored in the image table 316.
[0094] • Message audio payload 410: Audio data captured by the microphone or obtained from the storage component of the user system 102 and included in message 400.
[0095] • Message enhancement data 412: Enhancement data (e.g., filters, labels, or other annotations or enhancements) representing enhancements to be applied to the message image payload 406, message video payload 408, or message audio payload 410 of message 400. Enhancement data for the sent or received message 400 can be stored in enhancement table 312.
[0096] • Message duration parameter 414: A parameter value in seconds indicating the amount of time that the content of the message (e.g., message image payload 406, message video payload 408, message audio payload 410) will be presented to the user via the interactive client 104 or is accessible to the user.
[0097] • Message geolocation parameter 416: Geographic location data (e.g., latitude and longitude coordinates) associated with the message's content payload. Multiple message geolocation parameter 416 values may be included in the payload, each of which is associated with a content item included in the content (e.g., a specific image within the message image payload 406 or a specific video within the message video payload 408).
[0098] • Message story identifier 418: Identifier values that identify one or more content collections (e.g., “stories” identified in collection table 318) associated with a specific content item in the message image payload 406 of message 400. For example, multiple images within the message image payload 406 may each be associated with multiple content collections using their respective identifier values.
[0099] • Message tag 420: Each message 400 can be labeled with multiple tags, each of which indicates the subject of the content included in the message payload. For example, where a specific image included in the message image payload 406 depicts an animal (e.g., a lion), a tag value indicating the relevant animal can be included within the message tag 420. Tag values can be manually generated based on user input, or can be automatically generated using, for example, image recognition.
[0100] • Message sender identifier 422: An identifier (e.g., messaging system identifier, email address, or device identifier) indicating the user of the user system 102 on which message 400 is generated and from which message 400 is sent.
[0101] • Message receiver identifier 424: An identifier (e.g., messaging system identifier, email address, or device identifier) indicating the user of the user system 102 to which message 400 is addressed.
[0102] The content (e.g., values) of each component of message 400 can be pointers to locations in tables where content data values are stored. For example, image values in message image payload 406 can be pointers to locations (or their addresses) within image table 316. Similarly, values in message video payload 408 can point to data stored in image table 316, values in message enhancement data 412 can point to data stored in enhancement table 312, values in message story identifier 418 can point to data stored in collection table 318, and values in message sender identifier 422 and message receiver identifier 424 can point to user records stored in entity table 308.
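The components of message 400 listed above, with payload values held as pointers into tables, might be modeled as follows. This is a sketch under assumptions: the field names, the use of string keys as pointers, and the helper `resolve_image_payload` are hypothetical, and the tables are plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Message:
    message_id: str                                           # 402
    sender_id: str                                            # 422, key into entity table 308
    receiver_id: str                                          # 424, key into entity table 308
    text: str = ""                                            # 404
    image_keys: List[str] = field(default_factory=list)       # 406, keys into image table 316
    video_keys: List[str] = field(default_factory=list)       # 408, keys into image table 316
    audio: bytes = b""                                        # 410
    enhancement_keys: List[str] = field(default_factory=list) # 412, keys into enhancement table 312
    duration_seconds: Optional[float] = None                  # 414
    geolocations: List[Tuple[float, float]] = field(default_factory=list)  # 416 (lat, lon)
    story_ids: List[str] = field(default_factory=list)        # 418, keys into collection table 318
    tags: List[str] = field(default_factory=list)             # 420


def resolve_image_payload(message: Message, image_table: Dict[str, bytes]) -> List[bytes]:
    """Follow each pointer in the image payload to the stored image data."""
    return [image_table[key] for key in message.image_keys]
```

Storing keys rather than raw media keeps the message record small and lets several messages refer to the same stored image.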
[0103] Multi-person image generation system
[0104] Figure 5 is a block diagram illustrating a multi-person image generation system 500 according to some examples. The multi-person image generation system 500 may include a first personalized generative model 510, a background image component 530, a second personalized generative model 520, and an image combination component 540. Together, these components enable the multi-person image generation system 500 to access the first and second generative machine learning models, the first being trained to generate a first artificially personalized image including a depiction of a first person, and the second being trained to generate a second artificially personalized image including a depiction of a second person. They also enable the multi-person image generation system 500 to generate a foreground image that combines the depiction of the first person in the first image with the depiction of the second person in the second image, to access background information, and to generate new artificial images including depictions of the first and second people on a background having visual attributes corresponding to the background information.
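The data flow between these components can be sketched with stand-in functions. The sketch below is not the disclosed method: images are tiny grids of labels, the two models are arbitrary callables, and the image combination step is replaced by a plain overlay instead of a diffusion model. All function names are hypothetical.

```python
from typing import Callable, List, Optional

Pixel = Optional[str]       # None marks a transparent pixel
Image = List[List[Pixel]]


def combine_foreground(first: Image, second: Image) -> Image:
    """Place the two person depictions side by side to form one foreground image."""
    if len(first) != len(second):
        raise ValueError("images must have the same height")
    return [row_a + row_b for row_a, row_b in zip(first, second)]


def composite(foreground: Image, background_pixel: str) -> Image:
    """Fill transparent pixels with the background (an overlay, not a diffusion step)."""
    return [[background_pixel if p is None else p for p in row] for row in foreground]


def generate_multi_person_image(
    first_model: Callable[[str], Image],
    second_model: Callable[[str], Image],
    cue: str,
    background_pixel: str,
) -> Image:
    """Run both personalized models, merge their outputs, then add the background."""
    first = first_model(cue)     # stands in for model 510
    second = second_model(cue)   # stands in for model 520
    return composite(combine_foreground(first, second), background_pixel)


# Usage with stub models that ignore the cue and return fixed 2x2 images.
result = generate_multi_person_image(
    lambda cue: [[None, "A"], ["A", "A"]],
    lambda cue: [["B", None], ["B", "B"]],
    "standing on a beach",
    ".",
)
```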
[0105] Specifically, the first personalized generative model 510 can receive input images, such as from a user, person, message, communication, and / or from a storage device or online database. In some cases, the first personalized generative model 510 can activate a rear or front camera of the user system 102. The activated camera can capture images or videos depicting a first real-world object (e.g., a first person wearing real-world fashion items) within a first real-world context. The first personalized generative model 510 can be trained to generate artificial images depicting the first real-world object in different poses and different backgrounds. The first personalized generative model 510 can be trained based on the first set of training operations discussed below.
[0106] In some examples, the first personalized generative model 510 receives input that includes or defines cues. In some cases, the first personalized generative model 510 accesses a pre-configured library of cues and randomly selects a given cue. In some cases, a third party may define and store one or more cues in the first personalized generative model 510. Each of these cues (defined by the user or selected from a set of predefined cues) includes a textual description of a target pose or an image depicting an object in a specific pose. The cues may define one or more visual or target aspects of a person depicted in the received image. In some cases, the cues may include a textual or image description of the background. The visual aspects of a person defined by the cues may include a textual description of clothing or fashion items and may indicate the type of clothing (e.g., coat, sweater, t-shirt, shirt, etc.) and may also optionally specify one or more attributes of the clothing, such as color, style, appearance, size, season, etc. In some examples, the textual description also includes a description of the background, such as specifying location, weather, scenery, environment, etc.
[0107] The first personalized generative model 510 receives a cue and, in some cases, an image depicting a first real-world object (e.g., a first person). The first personalized generative model 510 processes the cue to generate a first artificially personalized image depicting the first person in a pose defined by the cue, wearing clothing defined by the cue, and on an artificial background defined by the cue.
[0108] The first personalized generative model 510 can implement one or more machine learning models, such as diffusion models. The first personalized generative model 510 can be trained to process text from cues and generate artificial images depicting artificial elements that match the text. For example, the first personalized generative model 510 can be trained on a large dataset of images and their corresponding text descriptions to generate images from text. The first personalized generative model 510 learns the statistical relationship between the text descriptions and the corresponding images. During training, the first personalized generative model 510 learns to generate a sequence of image samples by iteratively refining the samples over multiple rounds of random diffusion steps. The first personalized generative model 510 starts with a random noise vector and applies a series of diffusion steps to iteratively refine the image. At each diffusion step, the first personalized generative model 510 applies random noise to the image and then computes the gradient of the image with respect to a loss function. The gradient is then used to update the image, and the image is further refined in the next diffusion step.
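The refinement loop described in this paragraph (start from noise, perturb, take a gradient step, repeat) can be illustrated with a toy example. The sketch assumes a simple squared-error loss toward a known target vector and a linearly shrinking noise level; a trained diffusion model would instead use a learned network conditioned on the cue. The function name `refine` and all parameter values are hypothetical.

```python
import random
from typing import List


def refine(target: List[float], steps: int = 200, step_size: float = 0.1,
           noise_scale: float = 0.05, seed: int = 0) -> List[float]:
    """Iteratively refine a random vector using noisy gradient steps."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]   # start from a random noise vector
    for t in range(steps):
        sigma = noise_scale * (1.0 - t / steps)     # noise level shrinks over the steps
        noisy = [x + rng.gauss(0.0, sigma) for x in image]
        # Gradient of the toy loss 0.5 * ||x - target||^2 with respect to the image.
        grad = [x - y for x, y in zip(noisy, target)]
        image = [x - step_size * g for x, g in zip(noisy, grad)]
    return image


# Usage: a four-value "image" that the loop pulls toward the target.
target = [0.2, 0.8, 0.5, 0.0]
refined = refine(target)
```

The shrinking noise schedule is a design choice for the toy: early steps explore, while late steps settle close to a minimum of the loss.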
[0109] The first personalized generative model 510 generates images by sampling from a sequence of image samples produced during the diffusion process. The first personalized generative model 510 uses a learned autoregressive model conditioned on the textual description of the image to generate each pixel of the image. Overall, the process of generating images from text using the first personalized generative model 510 involves training the first personalized generative model 510 on a large dataset of images and their corresponding textual descriptions, and then using the trained first personalized generative model 510 to generate images by sampling from a learned image distribution conditioned on the textual description.
[0110] Specifically, the first personalized generative model 510 preprocesses the text descriptions and images of the training set. This may involve tokenizing the text descriptions, resizing and normalizing the images, and splitting the data into training and validation sets. The first personalized generative model 510 is trained on the training data using an encoder that encodes the text descriptions into a low-dimensional vector space, a generator that generates images from noise vectors, and a discriminator that distinguishes between real and generated images. During training, the first personalized generative model 510 is trained to minimize a loss function that encourages generated images to match real images while also being consistent with the text descriptions. The loss function may consist of a combination of adversarial loss, reconstruction loss, and text consistency loss. Once the first personalized generative model 510 is trained, images can be generated from the text descriptions by sampling from a learned image distribution conditioned on the text descriptions. This involves encoding the text descriptions into low-dimensional vectors using the trained encoder, and then generating images by iteratively refining the noise vectors using the trained generator and a series of diffusion steps. At each diffusion step, the first personalized generative model 510 applies random noise to the image and computes the gradient of the image with respect to the loss function. The gradient is then used to update the image, which is further refined in the next diffusion step. Finally, the generated image can be post-processed to improve its visual quality. This could involve denoising the image, applying color correction, or performing other image processing operations. The result is a high-quality image closely aligned with the text description.
[0111] In some cases, the first personalized generative model 510 can use an input image depicting a first real-world object to adjust its cues to include a depiction of an artificial object corresponding to the first real-world object. That is, the first personalized generative model 510 can receive text cues describing fashion items and / or backgrounds, and can also receive an input image depicting a first real-world object (e.g., a first real-world person). In response, the first personalized generative model 510 can generate an artificial image depicting an artificial object that matches the first real-world object depicted in the input image and wears artificial fashion items matching the description in the cues, optionally on an artificial background matching the description in the cues and optionally in a pose defined by the cues. In this case, the first personalized generative model 510 can be trained based on training data including text samples and training images depicting the same first real-world object, as well as corresponding real-world images matching the descriptions of fashion items, backgrounds, and poses in the text. During the training process discussed above, the first personalized generative model 510 can be trained to generate artificial images that match the first real-world object depicted in the training images, while also being consistent with the textual descriptions of fashion items and poses.
[0112] In some cases, the first personalized generative model 510 performs a first set of training operations to train itself to generate personalized artificial images for a first real-world object (a first person). The first set of training operations includes: accessing a first set of training images, each depicting a first real-world object with different backgrounds and in different poses, and receiving images including depictions of the first real-world objects. The first personalized generative model 510 receives visual attributes of an individual background depicted in an individual training image within the first set of training images and a cue defining a target pose depicted in the individual training image. The first personalized generative model 510 processes the images and cue to generate an estimated artificial image depicting the first real-world object in the target pose and an artificial background with visual attributes defined by the cue. The first personalized generative model 510 calculates the deviation between the estimated artificial image and the individual training image and updates one or more parameters of the first personalized generative model 510 based on this deviation. The first personalized generative model 510 then repeats these operations for additional training images from the first set of training images. The first personalized generative model 510 can complete training when the stopping criteria are met or when all training images have been processed.
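The first set of training operations follows a familiar pattern: estimate, measure the deviation, update parameters, and repeat until a stopping criterion is met or the training images run out. The sketch below reduces that pattern to a single scalar weight so that it stays runnable; the model, the squared-error deviation, the tolerance, and the name `train_personalized_model` are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def train_personalized_model(
    samples: Iterable[Tuple[float, float]],
    weight: float = 0.0,
    learning_rate: float = 0.1,
    tolerance: float = 1e-6,
) -> Tuple[float, int]:
    """Single pass over (cue_feature, training_value) pairs with early stopping.

    Returns the trained weight and the number of samples that were processed.
    """
    processed = 0
    for cue_feature, training_value in samples:
        estimate = weight * cue_feature          # stand-in for the estimated artificial image
        deviation = estimate - training_value    # deviation from the individual training image
        processed += 1
        if deviation * deviation < tolerance:    # stopping criterion met
            break
        # Gradient step on the squared deviation with respect to the weight.
        weight -= learning_rate * 2.0 * deviation * cue_feature
    return weight, processed


# Usage: every sample says "feature 1.0 should map to 2.0", so the weight approaches 2.0.
trained_weight, samples_used = train_personalized_model([(1.0, 2.0)] * 100)
```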
[0113] In some examples, a first set of training images is generated by capturing multiple images of a first real-world object. Then, multiple regions depicting the first real-world object are extracted from the multiple images. A set of cues defining different visual properties of the background is obtained. The diffusion model can process the multiple regions and this set of cues to generate the first set of training images. In some cases, the first real-world object depicted in a portion of the multiple images is magnified, and after magnifying the first real-world object, the regions depicted in the magnified portion are extracted.
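The construction of the first set of training images can be sketched as a pairing of extracted regions with background cues. This sketch assumes that every region is paired with every cue, which the text does not state, and it takes the region extractor and the diffusion-based renderer as caller-supplied functions (`extract_region` and `render`, both hypothetical).

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence


def build_training_set(
    captured_images: Sequence[Any],
    extract_region: Callable[[Any], Any],
    background_cues: Sequence[str],
    render: Callable[[Any, str], Any],
) -> List[Dict[str, Any]]:
    """Extract the person region from each capture and render it against each cue."""
    regions = [extract_region(image) for image in captured_images]
    return [
        {"image": render(region, cue), "cue": cue}
        for region, cue in product(regions, background_cues)
    ]


# Usage with string stand-ins for images and trivial stub functions.
training_set = build_training_set(
    ["photo1", "photo2"],
    extract_region=lambda image: f"person({image})",
    background_cues=["beach", "city street"],
    render=lambda region, cue: f"{region} on {cue}",
)
```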
[0114] In some examples, the second personalized generative model 520 may receive input images, such as from a user, person, message, communication, and / or from a storage device or online database. In some cases, the second personalized generative model 520 may activate a rear or front camera of the user system 102. The activated camera may capture images or videos depicting a second real-world object (e.g., a second person wearing real-world fashion items) within a second real-world context. The second personalized generative model 520 may be trained to generate artificial images depicting second real-world objects in different poses and backgrounds. The second personalized generative model 520 may be trained based on a second set of training operations as described below. The second personalized generative model 520 may perform operations similar to those discussed above relative to the first personalized generative model 510, but for second real-world objects. In this way, the second personalized generative model 520 is personalized for second real-world objects to generate artificial images specifically depicting second real-world objects, while the first personalized generative model 510 is personalized for first real-world objects to generate artificial images specifically depicting first real-world objects.
[0115] For example, the second personalized generative model 520 receives a cue and, in some cases, an image depicting a second real-world object (e.g., a second person). The second personalized generative model 520 processes the cue to generate a second artificially personalized image depicting the second person in a pose defined by the cue, wearing clothing defined by the cue, and on an artificial background defined by the cue.
[0116] The second personalized generative model 520 can implement one or more machine learning models similar to the machine learning model implemented by the first personalized generative model 510, such as a diffusion model.
[0117] In some cases, the second personalized generative model 520 can use an input image depicting a second real-world object, in addition to its cues, so that its output includes a depiction of an artificial object corresponding to the second real-world object. That is, the second personalized generative model 520 can receive text cues describing fashion items and / or backgrounds, and can also receive an input image depicting a second real-world object (e.g., a second real-world person). In response, the second personalized generative model 520 can generate an artificial image depicting an artificial object that matches the second real-world object depicted in the input image, that is wearing artificial fashion items matching the description in the cues, that is optionally on an artificial background matching the description in the cues, and that is optionally in a pose defined by the cues. In this case, the second personalized generative model 520 can be trained based on training data that includes text samples and training images depicting the same second real-world object, together with corresponding real images matching the descriptions of fashion items, backgrounds, and poses in the text. During the training process discussed above, the second personalized generative model 520 can be trained to generate artificial images that match the second real-world object depicted in the training images, while also being consistent with the textual descriptions of fashion items and poses.
[0118] In some cases, the second personalized generative model 520 performs a second set of training operations to train itself to generate personalized artificial images for a second real-world object (a second person). The second set of training operations includes: accessing a second set of training images, each depicting the second real-world object with a different background and in a different pose; and receiving images including depictions of the second real-world object. The second personalized generative model 520 receives a second cue that defines visual attributes of a second individual background depicted in a second individual training image within the second set of training images, and defines a second target pose depicted in the second individual training image. The second personalized generative model 520 processes the images and the second cue to generate a second estimated artificial image that depicts the second real-world object in the second target pose and a second artificial background with the visual attributes defined by the second cue. The second personalized generative model 520 calculates a second deviation between the second estimated artificial image and the second individual training image, and updates one or more parameters of the second personalized generative model 520 based on the second deviation. The second personalized generative model 520 then repeats these operations for additional training images from the second set of training images. The second personalized generative model 520 can complete training when a stopping criterion is met or when all training images have been processed.
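The generate, compare, and update cycle of these training operations can be illustrated with a minimal sketch. It rests on loud assumptions: `ToyGenerator` is a hypothetical per-pixel linear stand-in and not a diffusion model, cues are reduced to lists of numbers, the deviation is taken to be a mean squared error (the disclosure does not name a deviation measure), and the stopping criterion is a hypothetical tolerance on the mean deviation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TrainingSample:
    """One individual training image and the cue describing it (hypothetical layout)."""
    cue: List[float]     # numeric stand-in for background attributes and target pose
    target: List[float]  # flattened pixel values of the individual training image


class ToyGenerator:
    """Stand-in for a personalized generative model: one learnable weight per pixel."""

    def __init__(self, size: int) -> None:
        self.weights = [0.0] * size

    def generate(self, cue: Sequence[float]) -> List[float]:
        # Estimated artificial image: element-wise product of weights and cue.
        return [w * c for w, c in zip(self.weights, cue)]


def deviation(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between the estimated image and the training image."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def train(model: ToyGenerator, samples: Sequence[TrainingSample],
          learning_rate: float = 0.1, tolerance: float = 1e-6,
          max_epochs: int = 500) -> float:
    """Repeat over the training images until the stopping criterion is met."""
    mean_deviation = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for sample in samples:
            estimate = model.generate(sample.cue)
            total += deviation(estimate, sample.target)
            n = len(sample.target)
            for i in range(n):
                # Gradient of the mean squared error with respect to weight i.
                gradient = 2.0 * (estimate[i] - sample.target[i]) * sample.cue[i] / n
                model.weights[i] -= learning_rate * gradient
        mean_deviation = total / len(samples)
        if mean_deviation < tolerance:  # hypothetical stopping criterion
            break
    return mean_deviation


samples = [TrainingSample(cue=[1.0, 2.0], target=[2.0, 6.0]),
           TrainingSample(cue=[2.0, 1.0], target=[4.0, 3.0])]
model = ToyGenerator(size=2)
final_deviation = train(model, samples)
```

In this toy setting the weights settle near 2 and 3, the values that reproduce both training targets; a real personalized model would update far more parameters through the same loop structure.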
[0119] In some examples, a second set of training images is generated by capturing multiple images of a second real-world object. Then, multiple regions depicting the second real-world object are extracted from the multiple images. A set of cues defining different visual properties of the background is obtained. The diffusion model can process the multiple regions and this set of cues to generate the second set of training images. In some cases, the second real-world object depicted in a portion of the multiple images is magnified, and after magnifying the second real-world object, the regions depicted in the magnified portion are extracted.
[0120] In some examples, a first artificially personalized image and a second artificially personalized image, generated by a first personalized generative model 510 and a second personalized generative model 520, are provided to an image combining component 540. In some cases, the image combining component 540 may include a background image component 530. The image combining component 540 can generate a new image that combines the foreground of the first artificially personalized image and the second artificially personalized image with a new background provided by the background image component 530. For example, the image combining component 540 can extract a portion of the first artificially personalized image depicting a first real-world object, and can extract a portion of the second artificially personalized image depicting a second real-world object. Specifically, the image combining component 540 obtains a foreground mask and a background mask for each of the first artificially personalized image and the second artificially personalized image. The image combining component 540 uses the foreground mask to blend the depiction of the foreground from the first artificially personalized image and the second artificially personalized image, and uses the background mask to patch the background of the blended depiction of the foreground from the first artificially personalized image and the second artificially personalized image. In some cases, the image combining component 540 applies one or more machine learning models to patch the background of a blended depiction of the foreground from the first and second artificially personalized images using a background mask.
[0121] In some examples, the image combining component 540 may place the extracted portion onto an empty image, a portion of a first artificially personalized image, and / or a portion of a second artificially personalized image to generate a foreground image. In some examples, the image combining component 540 generates a first segmentation of a first real-world object depicted in the first artificially personalized image. The image combining component 540 may apply one or more machine learning models to the first artificially personalized image to generate the first segmentation and extract a first region of the first artificially personalized image based on the first segmentation. The first region may include pixels falling within the first segmentation. The image combining component 540 may extract a second region of a second artificially personalized image based on the first segmentation. The second region of the second artificially personalized image may exclude pixels falling within the first segmentation. That is, the image combining component 540 may extract a portion of the second artificially personalized image that includes all pixels except for regions that match or include pixels falling within the first segmentation. This effectively creates an image that includes a depiction of a second real-world object and some background depicted in the second artificially personalized image, and has holes or missing portions whose size and shape match or correspond to the first real-world object or the first segmentation of the first real-world object.
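The region extraction described above can be sketched as follows. This is an illustration under stated assumptions rather than the component's actual implementation: plain lists stand in for images, `None` marks a hole, the first segmentation is hard-coded where a machine learning model would normally produce it, and the function names are hypothetical.

```python
from typing import List, Optional

Pixel = Optional[int]      # None marks a missing pixel (a hole)
Image = List[List[Pixel]]
Mask = List[List[bool]]    # True where a pixel falls within the segmentation


def extract_inside(image: Image, segmentation: Mask) -> Image:
    """First region: keep only pixels falling within the segmentation."""
    return [[pixel if inside else None for pixel, inside in zip(row, mask_row)]
            for row, mask_row in zip(image, segmentation)]


def extract_outside(image: Image, segmentation: Mask) -> Image:
    """Second region: keep everything except pixels within the segmentation."""
    return [[None if inside else pixel for pixel, inside in zip(row, mask_row)]
            for row, mask_row in zip(image, segmentation)]


def combine_regions(first_region: Image, second_region: Image) -> Image:
    """Fill the hole left in the second region with the first region's pixels."""
    return [[a if a is not None else b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first_region, second_region)]


first_image = [[11, 12], [13, 14]]
second_image = [[21, 22], [23, 24]]
first_segmentation = [[True, False], [True, False]]  # assumed model output

first_region = extract_inside(first_image, first_segmentation)
second_region = extract_outside(second_image, first_segmentation)
foreground_image = combine_regions(first_region, second_region)
```

Because both regions are cut with the same segmentation, the hole in the second region has exactly the shape of the first region, so the combination leaves no gaps.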
[0122] Image combining component 540 generates a foreground image by combining a first region and a second region. Image combining component 540 can generate a second segmentation of a second real-world object depicted in a second artificially personalized image, for example, by using a machine learning model trained to generate object segmentations corresponding to the type of real-world object sought to be extracted. Image combining component 540 then combines the first segmentation and the second segmentation to generate a combined segmentation. Image combining component 540 obtains a background from background image component 530. Background image component 530 can implement a diffusion model that generates a background corresponding to the description provided in the prompt. Image combining component 540 performs patching to combine the background provided by background image component 530 with the foreground image. This results in the creation or generation of a new artificial image by the multi-person image generation system 500.
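A minimal sketch of the combined segmentation and the patching step follows. It assumes the combined segmentation is the per-pixel union of the two segmentations and replaces model-based patching with a direct pixel substitution; both choices, and all function names, are illustrative assumptions rather than details taken from the disclosure.

```python
from typing import List

Image = List[List[int]]
Mask = List[List[bool]]


def combine_segmentations(first: Mask, second: Mask) -> Mask:
    """Union of the two segmentations: True where either object is depicted."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]


def patch_background(foreground_image: Image, combined_segmentation: Mask,
                     background: Image) -> Image:
    """Keep object pixels; take every other pixel from the new background."""
    return [[fg if keep else bg for fg, keep, bg in zip(fg_row, mask_row, bg_row)]
            for fg_row, mask_row, bg_row
            in zip(foreground_image, combined_segmentation, background)]


first_segmentation = [[True, False, False]]
second_segmentation = [[False, False, True]]
foreground_image = [[1, 9, 2]]   # 9 is leftover background from a source image
background = [[7, 7, 7]]         # stand-in for a generated background

combined = combine_segmentations(first_segmentation, second_segmentation)
new_image = patch_background(foreground_image, combined, background)
```

A diffusion-based patching model would additionally harmonize lighting and edges around the objects, which a plain substitution like this one cannot do.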
[0123] As shown in Figure 6, the first personalized generative model 510 can generate a first artificial personalized image 610 (as described above). The first artificial personalized image 610 includes a depiction of a first artificial background 614 that matches the textual description of the background in the prompt. The first artificial personalized image 610 also includes an artificially generated depiction of an artificial object 612 that matches or is similar in visual quality to a first real-world object (corresponding to a first person).
[0124] As described above, the second personalized generative model 520 can generate a second artificially personalized image 620. The second artificially personalized image 620 includes a depiction of a second artificial background 624 that matches the textual description of the background in the prompt. The second artificially personalized image 620 also includes an artificially generated depiction of an artificial object 622 that matches or is similar in visual quality to a second real-world object (corresponding to a second person).
[0125] The first artificial personalized image 610 and the second artificial personalized image 620 can be provided to a multi-person image generation system 630, which implements some or all of the multi-person image generation system 500. The multi-person image generation system 630 also accesses or receives background cues. The multi-person image generation system 630 then generates a new artificial image 640 that depicts artificial objects 612 and 622 on a newly generated background 644. In some cases, the new background 644 may be the same as or similar to the first artificial background 614 and / or the second artificial background 624, or some combination thereof.
[0126] The new artificial image 640 can be sent to one or more users. In some cases, the new artificial image 640 can be used to generate advertisements that promote a specific product or target individual users (e.g., the person depicted in the new artificial image 640). In some cases, the image can be used to provide a virtual or XR try-on experience to the user depicted in the new artificial image 640 on user system 102.
[0127] In some examples, the above process can be applied continuously to a video. That is, the above process can be applied to each frame of a video to continuously adjust artificially generated images depicting multiple people.
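Applying the process frame by frame can be sketched as a simple generator. The names `generate_video` and `generate_image` are hypothetical, and the per-frame callable stands in for the whole multi-person image generation process, which is not reproduced here.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")
Output = TypeVar("Output")


def generate_video(frames: Iterable[Frame],
                   generate_image: Callable[[Frame], Output]) -> Iterator[Output]:
    """Run the per-image process on each frame, yielding results in order."""
    for frame in frames:
        yield generate_image(frame)


# Strings stand in for frames; the lambda stands in for the full process.
result = list(generate_video(["frame0", "frame1"], lambda f: f + ":combined"))
```

Yielding frames lazily lets the process run on a live stream without buffering the whole video.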
[0128] Figure 7 is a flowchart of a process or method 700 performed by the multi-person image generation system 500, according to some examples. Although a flowchart can describe operations as a sequential process, many operations can be performed in parallel or concurrently. Furthermore, the order of operations can be rearranged. A process terminates when its operations are completed. A process can correspond to a method, program, etc. The steps of a method can be performed in whole or in part, can be combined with some or all of the steps in other methods, and can be performed by any number of different systems or any part thereof (e.g., a processor included in any system).
[0129] At operation 701, the multi-person image generation system 500 (e.g., user system 102 or server) accesses a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, the first generative machine learning model being trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model being trained to generate the second artificially personalized image including a depiction of a second person (as described above).
[0130] At operation 702, the multi-person image generation system 500 generates a foreground image that combines the depiction of the first person in the first image with the depiction of the second person in the second image (as described above).
[0131] At operation 703, the multi-person image generation system 500 accesses background information (as described above).
[0132] At operation 704, the multi-person image generation system 500 generates a new artificial image, which includes a foreground image on a background having visual attributes corresponding to the background information (as described above).
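Operations 701 through 704 can be read as a four-step pipeline. The sketch below only shows that ordering: each step is passed in as a callable, and every name (`run_method_700` and its parameters) is a hypothetical placeholder rather than an interface defined by the disclosure.

```python
from typing import Callable, Tuple, TypeVar

ImageT = TypeVar("ImageT")
BackgroundT = TypeVar("BackgroundT")


def run_method_700(
    access_images: Callable[[], Tuple[ImageT, ImageT]],
    generate_foreground: Callable[[ImageT, ImageT], ImageT],
    access_background: Callable[[], BackgroundT],
    generate_new_image: Callable[[ImageT, BackgroundT], ImageT],
) -> ImageT:
    first, second = access_images()                    # operation 701
    foreground = generate_foreground(first, second)    # operation 702
    background_info = access_background()              # operation 703
    return generate_new_image(foreground, background_info)  # operation 704


# Strings stand in for images and background information.
new_image = run_method_700(
    access_images=lambda: ("person1", "person2"),
    generate_foreground=lambda a, b: a + "+" + b,
    access_background=lambda: "beach",
    generate_new_image=lambda fg, bg: fg + " on " + bg,
)
```

Since operation 703 does not depend on operations 701 and 702, an implementation could fetch the background information concurrently, consistent with the note that operations may be reordered or run in parallel.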
[0133] Examples
[0134] Example 1. A method comprising: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, the first generative machine learning model being trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model being trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image combining the depiction of the first person in the first image with the depiction of the second person in the second image; accessing background information; and generating a new artificial image including the foreground image on a background having visual attributes corresponding to the background information.
[0135] Example 2. The method according to Example 1 further includes: accessing a cue including the background information; and processing the foreground image and the cue by a third generative machine learning model to generate the new artificial image.
[0136] Example 3. The method according to Example 2, wherein the cue includes an image depicting the background information.
[0137] Example 4. The method according to any one of Examples 2-3, wherein the cue includes a textual description of the background information.
[0138] Example 5. The method according to any one of Examples 2-4, wherein the first generative machine learning model, the second generative machine learning model and the third generative machine learning model each include a corresponding diffusion machine learning model.
[0139] Example 6. The method according to any one of Examples 1-5, wherein the first generative machine learning model and the second generative machine learning model each include a corresponding diffusion machine learning model.
[0140] Example 7. The method according to any one of Examples 1-6 further includes: obtaining a foreground mask and a background mask for each of the first artificially personalized image and the second artificially personalized image; using the foreground mask to blend the depiction of the foreground from the first artificially personalized image and the second artificially personalized image; and using the background mask to patch the background of the blended depiction of the foreground from the first artificially personalized image and the second artificially personalized image.
[0141] Example 8. The method according to any one of Examples 1-7 further includes: generating a first segmentation of the first person depicted in the first artificially personalized image.
[0142] Example 9. The method according to Example 8 further includes: extracting a first region of the first artificially personalized image based on the first segmentation, the first region including pixels falling within the first segmentation; extracting a second region of the second artificially personalized image based on the first segmentation, the second region of the second artificially personalized image excluding pixels falling within the first segmentation; and generating the foreground image by combining the first region and the second region.
[0143] Example 10. The method according to Example 9 further includes: generating a second segmentation of the second person depicted in the second artificially personalized image; and combining the first segmentation and the second segmentation to generate a combined segmentation.
[0144] Example 11. The method according to Example 10 further includes: performing background patching on the foreground image based on the combined segmentation to generate the new artificial image.
[0145] Example 12. The method according to Example 11, wherein the background is generated based on a cue.
[0146] Example 13. The method according to any one of Examples 1-12 further includes: receiving a first pose image, the first pose image comprising a depiction of an object in an individual pose; receiving a first cue defining a first set of visual attributes; and processing the first pose image and the first cue by the first generative machine learning model to generate the first artificially personalized image.
[0147] Example 14. The method according to Example 13 further includes: receiving a second pose image, the second pose image including a depiction of another object in another pose; receiving a second cue defining a second set of visual attributes; and processing the second pose image and the second cue by the second generative machine learning model to generate the second artificially personalized image.
[0148] Example 15. The method according to any one of Examples 1-14 further includes: training the first generative machine learning model by performing a first set of training operations, the first set of training operations including: accessing a first set of training images, each training image depicting a first person with a different background and in a different pose; receiving images including depictions of the first person; receiving a cue defining visual attributes of an individual background depicted in an individual training image within the first set of training images and defining a target pose depicted in the individual training image; processing the images and the cue by the first generative machine learning model to generate an estimated artificial image depicting the first person in the target pose and an artificial background having visual attributes defined by the cue; calculating a deviation between the estimated artificial image and the individual training image; and updating one or more parameters of the first generative machine learning model based on the deviation.
[0149] Example 16. The method according to Example 15 further includes: capturing multiple images of the first person; extracting multiple regions of the multiple images that depict the first person; obtaining a set of cues defining different visual attributes of the background; and processing the multiple regions and the set of cues by a diffusion model to generate the first set of training images.
[0150] Example 17. The method according to Example 16 further includes: zooming in on the first person in the plurality of images to extract the plurality of regions.
[0151] Example 18. The method according to any one of Examples 15-17 further includes: training the second generative machine learning model by performing a second set of training operations, the second set of training operations including: accessing a second set of training images, each training image depicting a second person with a different background and in a different pose; receiving a second image including a depiction of the second person; receiving a second cue, the second cue defining visual attributes of a second individual background depicted in a second individual training image in the second set of training images and defining a second target pose depicted in the second individual training image; processing the second image and the second cue by the second generative machine learning model to generate a second estimated artificial image, the second estimated artificial image depicting the second person in the second target pose and a second artificial background having the visual attributes defined by the second cue; calculating a second deviation between the second estimated artificial image and the second individual training image; and updating one or more parameters of the second generative machine learning model based on the second deviation.
[0152] Example 19. A system comprising: at least one processor; and at least one memory component storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations including: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, the first generative machine learning model being trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model being trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image combining the depiction of the first person in the first image and the depiction of the second person in the second image; accessing background information; and generating a new artificial image including the foreground image on a background having visual attributes corresponding to the background information.
[0153] Example 20. A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations, the operations including: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, the first generative machine learning model being trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model being trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image combining the depiction of the first person in the first image and the depiction of the second person in the second image; accessing background information; and generating a new artificial image including the foreground image on a background having visual attributes corresponding to the background information.
[0154] Machine architecture
[0155] Figure 8 is a schematic representation of machine 800, within which instructions 802 (e.g., software, programs, applications, applets, or other executable code) can be executed to cause machine 800 to perform any one or more of the methods discussed herein. For example, instructions 802 can cause machine 800 to perform any one or more of the methods described herein. Instructions 802 transform a general, unprogrammed machine 800 into a specific machine 800 programmed to perform the described and illustrated functions in the described manner. Machine 800 can operate as a standalone device or can be coupled (e.g., networked) to other machines. In a networked deployment, machine 800 can operate as a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. Machine 800 may include, but is not limited to, server computers, client computers, personal computers (PCs), tablets, laptops, netbooks, set-top boxes (STBs), personal digital assistants (PDAs), entertainment media systems, cellular phones, smartphones, mobile devices, wearable devices (e.g., smartwatches), smart home devices (e.g., smart appliances), other smart devices, web appliances, network routers, network switches, bridges, or any machine capable of sequentially or otherwise executing instructions 802 that specify actions to be taken by machine 800. Furthermore, while a single machine 800 is shown, the term "machine" should also be considered to include a collection of machines that individually or jointly execute instructions 802 to perform any one or more of the methods discussed herein. For example, machine 800 may include user system 102 or any of a plurality of server devices forming part of interactive server system 110. In some examples, machine 800 may also include both client and server systems, wherein certain operations of a particular method or algorithm are performed on the server side and certain operations of said particular method or algorithm are performed on the client side.
[0156] Machine 800 may include a processor 804, a memory 806, and an input / output (I / O) component 808 that can be configured to communicate with each other via a bus 810. In an example, processor 804 (e.g., a central processing unit (CPU), a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a radio frequency integrated circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, processors 812 and 814 that execute instructions 802. The term "processor" is intended to include multi-core processors, which may include two or more independent processors (sometimes referred to as "cores") capable of executing instructions simultaneously. Although Figure 8 shows multiple processors 804, machine 800 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.
[0157] Memory 806 includes main memory 816, static memory 818, and storage unit 820, all of which are accessible by processor 804 via bus 810. Main memory 816, static memory 818, and storage unit 820 store instructions 802 that embody any one or more of the methods or functions described herein. Instructions 802 may also reside wholly or partially within main memory 816, static memory 818, machine-readable medium 822 within storage unit 820, at least one processor of processor 804 (e.g., the processor's cache memory), or any suitable combination thereof, during execution by machine 800.
[0158] I / O component 808 may include a wide variety of components for receiving input, providing output, generating output, transmitting information, exchanging information, capturing measurement results, etc. The specific I / O component 808 included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones may include touch input devices or other such input mechanisms, while headless server machines may not include such touch input devices. It will be understood that I / O component 808 may include many other components not shown in Figure 8. In various examples, I / O component 808 may include user output component 824 and user input component 826. User output component 824 may include visual components (e.g., displays such as plasma display panels (PDPs), light-emitting diode (LED) displays, liquid crystal displays (LCDs), projectors, or cathode ray tubes (CRTs)), acoustic components (e.g., speakers), haptic components (e.g., vibrating motors, resistance mechanisms), other signal generators, etc. User input component 826 may include alphanumeric input components (e.g., keyboards, touchscreens configured to receive alphanumeric input, photo-optical keyboards, or other alphanumeric input components), point-based input components (e.g., mice, touchpads, trackballs, joysticks, motion sensors, or other pointing instruments), haptic input components (e.g., physical buttons, touchscreens that provide the location and force of touches or touch gestures, or other haptic input components), audio input components (e.g., microphones), etc. Any biometric features collected by biometric components are captured and stored with user approval and deleted upon user request.
[0159] Furthermore, such biometric data can be used for very limited purposes, such as identity verification. To ensure the restricted and authorized use of biometric information and other personally identifiable information (PII), access to this data is limited to authorized personnel, if permitted. Any use of biometric data can be strictly limited to identity verification purposes, and the data may not be shared or sold to any third party without the user's explicit consent. In addition, appropriate technical and organizational measures have been implemented to ensure the security and confidentiality of this sensitive information.
[0160] In another example, I / O component 808 may include biometric component 828, motion component 830, environmental component 832, or position component 834, as well as various other components. For example, biometric component 828 includes components for detecting expressions (e.g., hand gestures, facial expressions, vocal expressions, body posture, or eye tracking), measuring biosignals (e.g., blood pressure, heart rate, body temperature, sweating, or brainwaves), and recognizing a person (e.g., speech recognition, retinal recognition, facial recognition, fingerprint recognition, or EEG-based recognition). Biometric component 828 may include a brain-machine interface (BMI) system that allows communication between the brain and external devices or machines. This can be achieved by recording brain activity data, converting that data into a format that can be understood by a computer, and then using the resulting signals to control the device or machine.
[0161] Examples of BMI technology types include:
[0162] • Electroencephalography (EEG)-based BMI, which uses electrodes placed on the scalp to record electrical activity in the brain.
[0163] • Invasive BMI, which uses electrodes surgically implanted into the brain.
[0164] • Optogenetic BMI, which uses light to control the activity of specific nerve cells in the brain.
[0165] Motion component 830 includes acceleration sensor components (e.g., an accelerometer), gravity sensor components, and rotation sensor components (e.g., a gyroscope).
[0166] Environmental component 832 includes, for example, one or more camera devices (with still image / photograph and video capabilities), illuminance sensor components (e.g., photometers), temperature sensor components (e.g., one or more thermometers for detecting ambient temperature), humidity sensor components, pressure sensor components (e.g., barometers), acoustic sensor components (e.g., one or more microphones for detecting background noise), proximity sensor components (e.g., infrared sensors for detecting nearby objects), gas sensors (e.g., gas detection sensors for detecting hazardous gas concentrations or measuring pollutants in the atmosphere for safety purposes), or other components that can provide indications, measurements, or signals corresponding to the surrounding physical environment.
[0167] Regarding the camera device, the user system 102 may have a camera device system, which includes, for example, a front-facing camera on the front surface of the user system 102 and a rear-facing camera on the rear surface of the user system 102. The front-facing camera may be used, for example, to capture still images and videos (e.g., “selfies”) of the user of the user system 102, and then enhance these still images and videos with enhancement data (e.g., filters) as described above. The rear-facing camera may be used, for example, to capture still images and videos in a more conventional camera mode, which are similarly enhanced with enhancement data. In addition to the front-facing and rear-facing cameras, the user system 102 may also include a 360° camera for capturing 360° photos and videos.
[0168] Furthermore, the camera system of user system 102 may include dual rear cameras (e.g., a main camera and a depth sensing camera), or even triple, quadruple, or quintuple rear camera configurations, located on the front and rear sides of user system 102. These multi-camera systems may include, for example, wide-angle cameras, ultra-wide-angle cameras, telephoto cameras, macro cameras, and depth sensors.
[0169] The position component 834 includes a position sensor component (e.g., a GPS receiver component), an altitude sensor component (e.g., an altimeter or barometer that detects air pressure from which altitude can be obtained), a direction sensor component (e.g., a magnetometer), etc.
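As an illustration of how altitude can be obtained from detected air pressure, the following is a minimal sketch using the standard-atmosphere barometric formula. The function name, the sea-level reference pressure of 1013.25 hPa, and the formula constants are assumptions introduced for illustration and are not part of this disclosure.

```python
def altitude_from_pressure(pressure_hpa: float, sea_level_hpa: float = 1013.25) -> float:
    """Approximate altitude in meters from barometric pressure in hPa.

    Uses the common standard-atmosphere approximation, which is only
    reasonable in the lower atmosphere. The default reference pressure is an
    assumed value; a real device would calibrate it.
    """
    if pressure_hpa <= 0 or sea_level_hpa <= 0:
        raise ValueError("pressures must be positive")
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


# Lower measured pressure corresponds to higher altitude.
reading_low = altitude_from_pressure(1000.0)
reading_high = altitude_from_pressure(900.0)
```

The reference pressure varies with weather, so an altitude sensor component built this way would typically need periodic recalibration.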
[0170] Communication can be implemented using a wide variety of technologies. I / O component 808 also includes a communication component 836 operable to couple machine 800 to network 838 or device 840 via a suitable coupling or connection. For example, communication component 836 may include a network interface component or another suitable device that interfaces with network 838. In further examples, communication component 836 may include wired communication components, wireless communication components, cellular communication components, near field communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components that provide communication via other modes. Device 840 may be another machine or any peripheral device from a variety of peripheral devices (e.g., a peripheral device coupled via Universal Serial Bus (USB)).
[0171] Furthermore, the communication component 836 may detect identifiers or include components operable to detect identifiers. For example, the communication component 836 may include a radio frequency identification (RFID) tag reader component, an NFC smart tag detection component, an optical reader component (e.g., an optical sensor for detecting one-dimensional barcodes (e.g., Universal Product Code (UPC) barcodes), multi-dimensional barcodes (e.g., Quick Response (QR) codes, Aztec codes, Data Matrix, Dataglyph™, MaxiCode, PDF417, Ultra Code, UCC RSS-2D barcodes, and other optical codes)), or an acoustic detection component (e.g., a microphone for identifying audio signals from tags). Additionally, various information can be derived via the communication component 836, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location by detecting NFC beacon signals that can indicate a specific location, etc.
[0172] Various memories (e.g., main memory 816, static memory 818, and memory of processor 804) and storage units 820 may store one or more sets of instructions and data structures (e.g., software) used or embodied by any one or more of the methods or functions described herein. These instructions (e.g., instruction 802) cause various operations to implement the disclosed examples when executed by processor 804.
[0173] Instruction 802 can be sent or received via network 838, using a transmission medium, via a network interface device (e.g., a network interface component included in communication component 836), and using any of several known transmission protocols (e.g., Hypertext Transfer Protocol (HTTP)). Similarly, instruction 802 can be sent or received using a transmission medium via coupling to device 840 (e.g., peer-to-peer coupling).
[0174] Software Architecture
[0175] Figure 9 is a block diagram 900 illustrating a software architecture 902 that can be installed on any one or more of the devices described herein. The software architecture 902 is supported by hardware such as a machine 904 including a processor 906, memory 908, and I / O components 910. In this example, the software architecture 902 can be conceptualized as a stack of layers, each providing specific functionality. The software architecture 902 includes layers such as an operating system 912, libraries 914, frameworks 916, and applications 918. Operationally, application 918 invokes API calls 920 through the software stack and receives messages 922 in response to API calls 920.
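The layered flow of an API call through the software stack can be sketched as follows. This is a simplified illustration only: the `Layer` class, the `handle` method, and the call name `camera.capture` are hypothetical and do not correspond to any real operating system API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """One layer of the stack; forwards calls to the layer below it."""
    name: str
    below: Optional["Layer"] = None

    def handle(self, call: str, path: Optional[List[str]] = None) -> dict:
        # Record each layer the call passes through on its way down.
        path = (path or []) + [self.name]
        if self.below is None:
            # The lowest layer produces the message returned to the caller.
            return {"call": call, "status": "ok", "path": path}
        return self.below.handle(call, path)


def build_stack() -> Layer:
    """Assemble the four layers named in the text, top layer returned."""
    operating_system = Layer("operating system 912")
    libraries = Layer("libraries 914", below=operating_system)
    frameworks = Layer("frameworks 916", below=libraries)
    return Layer("applications 918", below=frameworks)


# An application-level call travels down the stack and a message comes back.
message = build_stack().handle("camera.capture")
```

In a real system a layer may satisfy a call itself rather than always forwarding it; the strict pass-through here is a simplifying assumption.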
[0176] Operating system 912 manages hardware resources and provides common services. Operating system 912 includes, for example, a kernel 924, services 926, and drivers 928. Kernel 924 serves as an abstraction layer between hardware and other software layers. For example, kernel 924 provides functions such as memory management, processor management (e.g., scheduling), component management, networking, and security settings. Services 926 can provide other common services to other software layers. Drivers 928 are responsible for controlling or interfacing with the underlying hardware. For example, drivers 928 may include display drivers, camera drivers, Bluetooth® or Bluetooth® Low Energy drivers, flash memory drivers, serial communication drivers (e.g., USB drivers), Wi-Fi® drivers, audio drivers, power management drivers, etc.
[0177] Library 914 provides common low-level infrastructure used by application 918. Library 914 may include system libraries 930 (e.g., the C standard library) that provide functions such as memory allocation, string manipulation, and mathematical functions. Furthermore, library 914 may include API libraries 932, such as media libraries (e.g., libraries for supporting the rendering and manipulation of various media formats, such as Moving Picture Experts Group-4 (MPEG4), Advanced Video Coding (H.264 or AVC), Moving Picture Experts Group Layer-3 (MP3), Advanced Audio Coding (AAC), Adaptive Multi-Rate (AMR) audio codecs, Joint Photographic Experts Group (JPEG or JPG), or Portable Network Graphics (PNG)), graphics libraries (e.g., the OpenGL framework for rendering graphic content on a display in 2D and 3D), database libraries (e.g., SQLite for providing various relational database functions), web libraries (e.g., WebKit for providing web browsing functionality to applications), etc. Library 914 may also include a wide variety of other libraries 934 to provide many other APIs to application 918.
[0178] Framework 916 provides common high-level infrastructure for use by Application 918. For example, Framework 916 provides various GUI functionalities, advanced resource management, and advanced location services. Framework 916 can provide a wide range of other APIs that can be used by Application 918, some of which may be specific to a particular operating system or platform.
[0179] In the example, application 918 may include home application 936, contact application 938, browser application 940, book reader application 942, location application 944, media application 946, messaging application 948, game application 950, and various other categories of applications, such as third-party application 952. Application 918 is a program that performs the functions defined in the program. One or more applications 918 can be created using various programming languages, such as object-oriented programming languages (e.g., Objective-C, Java, or C++) or procedural programming languages (e.g., C or assembly language). In a particular example, third-party application 952 (e.g., an application developed by an entity other than a platform vendor using the Android™ or iOS™ SDK) may be mobile software running on a mobile operating system (e.g., iOS™, Android™, Windows® Phone, or another mobile operating system). In this example, third-party application 952 may invoke API calls 920 provided by operating system 912 to assist in the functions described herein.
[0180] Systems with head-mounted devices
[0181] Figure 10 shows a system 1000 including a head-mounted device 116 with a selector input device, according to some examples. Figure 10 is a high-level functional block diagram of an example head-mounted device 116 that is communicatively coupled to a mobile device 114 and various server systems 1004 (e.g., interactive server system 110) via various networks 1016.
[0182] The head-mounted device 116 includes one or more camera devices, each of which may be, for example, a visible light camera 1006, an infrared emitter 1008, or an infrared camera 1010.
[0183] Mobile device 114 is connected to head-mounted device 116 using both low-power wireless connection 1012 and high-speed wireless connection 1014. Mobile device 114 is also connected to server system 1004 and network 1016.
[0184] The head-mounted device 116 also includes two image displays 1018 of the optical components. These two image displays 1018 include an image display associated with the left lateral side of the head-mounted device 116 and an image display associated with the right lateral side of the head-mounted device 116. The head-mounted device 116 also includes an image display driver 1020, an image processor 1022, a low-power circuitry system 1024, and a high-speed circuitry system 1026. The image displays 1018 of the optical components are used to present images and videos, including images that may include a GUI, to a user of the head-mounted device 116.
[0185] The image display driver 1020 commands and controls the image display 1018 of the optical components. The image display driver 1020 can directly transmit image data to the image display 1018 of the optical components for display, or it can convert the image data into a signal or data format suitable for transmission to the image display device. For example, the image data can be video data formatted according to a compression format (e.g., H.264 (MPEG-4 Part 10), HEVC, Theora, Dirac, RealVideo RV40, VP8, VP9, etc.), and still image data can be formatted according to a compression format (e.g., PNG, JPEG, Tagged Image File Format (TIFF), or Exchangeable Image File Format (Exif), etc.).
[0186] The head-mounted device 116 includes a frame and stems (or temples) extending laterally from the frame. The head-mounted device 116 also includes a user input device 1028 (e.g., a touch sensor or button), which includes an input surface on the head-mounted device 116. The user input device 1028 (e.g., a touch sensor or a pressed button) receives input selections from the user to manipulate a GUI of the presented image.
[0187] The components shown in Figure 10 for the head-mounted device 116 are located on one or more circuit boards (e.g., PCBs or flexible PCBs) in the rims or temples. Alternatively or additionally, the depicted components may be located in the chunks, frames, hinges, or bridge of the head-mounted device 116. The left and right visible light cameras 1006 may include digital camera elements, such as complementary metal-oxide-semiconductor (CMOS) image sensors, charge-coupled devices, camera lenses, or any other visible-light or light-capturing elements that can be used to capture data, including images of scenes with unknown objects.
[0188] The head-mounted device 116 includes a memory 1002 that stores instructions for performing a subset or all of the functions described herein. The memory 1002 may also include a storage device.
[0189] As shown in Figure 10, the high-speed circuit system 1026 includes a high-speed processor 1030, a memory 1002, and a high-speed wireless circuit system 1032. In some examples, an image display driver 1020 is coupled to the high-speed circuit system 1026 and operated by the high-speed processor 1030 to drive the left and right image displays in the image display 1018 of the optical components. The high-speed processor 1030 can be any processor capable of managing the high-speed communication and the operation of any general-purpose computing system required by the head-mounted device 116. The high-speed processor 1030 includes the processing resources required for managing high-speed data transmission over a high-speed wireless connection 1014 to a wireless local area network (WLAN) using the high-speed wireless circuit system 1032. In some examples, the high-speed processor 1030 executes the operating system of the head-mounted device 116, such as the LINUX operating system or another such operating system, and this operating system is stored in the memory 1002 for execution. Among other duties, the high-speed processor 1030, which executes the software architecture for the head-mounted device 116, manages data transmission with the high-speed wireless circuit system 1032. In some examples, the high-speed wireless circuit system 1032 is configured to implement the Institute of Electrical and Electronics Engineers (IEEE) 802.11 communication standard, also referred to herein as WiFi. In some examples, the high-speed wireless circuit system 1032 may implement other high-speed communication standards.
[0190] The low-power wireless circuitry system 1034 and high-speed wireless circuitry system 1032 of the head-mounted device 116 may include a short-range transceiver (e.g., Bluetooth™) and a wireless local-area or wide-area network transceiver (e.g., cellular or WiFi). The mobile device 114, including transceivers communicating via low-power wireless connection 1012 and high-speed wireless connection 1014, can be implemented using the architectural details of the head-mounted device 116, as can other components of the network 1016.
[0191] Memory 1002 includes any storage device capable of storing various data and applications, including camera data generated by the left and right visible light cameras 1006, the infrared camera 1010, and the image processor 1022, as well as images generated by the image display driver 1020 for display on the image display 1018 of the optical components, and so on. While memory 1002 is shown as integrated with the high-speed circuitry 1026, in some examples, memory 1002 may be a separate, independent component of the head-mounted device 116. In some such examples, electrical wiring may provide a connection from the image processor 1022 or the low-power processor 1036 to memory 1002 via a chip including the high-speed processor 1030. In some examples, the high-speed processor 1030 may manage addressing of memory 1002, such that the low-power processor 1036 will activate the high-speed processor 1030 whenever a read or write operation involving memory 1002 is required.
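The arrangement in which the high-speed processor manages memory addressing, so that the low-power processor activates it for every read or write, can be sketched as below. All class and method names are hypothetical, and the dictionary standing in for memory 1002 is a simplification, not the disclosed hardware design.

```python
class HighSpeedProcessor:
    """Stands in for high-speed processor 1030, which owns memory addressing."""

    def __init__(self) -> None:
        self.active = False
        self.wake_count = 0
        self._memory = {}  # simplified stand-in for memory 1002

    def activate(self) -> None:
        # Count only transitions from idle to active.
        if not self.active:
            self.active = True
            self.wake_count += 1

    def sleep(self) -> None:
        self.active = False

    def write(self, address: int, value: bytes) -> None:
        if not self.active:
            raise RuntimeError("high-speed processor must be active")
        self._memory[address] = value

    def read(self, address: int) -> bytes:
        if not self.active:
            raise RuntimeError("high-speed processor must be active")
        return self._memory[address]


class LowPowerProcessor:
    """Stands in for low-power processor 1036; it never touches memory directly."""

    def __init__(self, high_speed: HighSpeedProcessor) -> None:
        self.high_speed = high_speed

    def write(self, address: int, value: bytes) -> None:
        self.high_speed.activate()  # wake before any memory operation
        self.high_speed.write(address, value)

    def read(self, address: int) -> bytes:
        self.high_speed.activate()
        return self.high_speed.read(address)


hsp = HighSpeedProcessor()
lpp = LowPowerProcessor(hsp)
lpp.write(0x10, b"frame")
hsp.sleep()                  # the high-speed processor idles between accesses
stored = lpp.read(0x10)      # the read wakes it again
```

Routing every access through one processor keeps addressing logic in a single place, at the cost of a wake-up on each access from the low-power side.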
[0192] As shown in Figure 10, the low-power processor 1036 or high-speed processor 1030 of the head-mounted device 116 may be coupled to a camera device (visible light camera 1006, infrared emitter 1008, or infrared camera 1010), an image display driver 1020, a user input device 1028 (e.g., a touch sensor or button), and a memory 1002.
[0193] The head-mounted device 116 is connected to a host computer. For example, the head-mounted device 116 is paired with the mobile device 114 via a high-speed wireless connection 1014, or connected to a server system 1004 via a network 1016. The server system 1004 may be one or more computing devices as part of a service or network computing system, which includes, for example, a processor, memory, and a network communication interface to communicate with the mobile device 114 and the head-mounted device 116 via the network 1016.
[0194] Mobile device 114 includes a processor and a network communication interface coupled to the processor. The network communication interface allows communication via network 1016, low-power wireless connection 1012, or high-speed wireless connection 1014. Mobile device 114 may also store in its memory at least a portion of instructions for generating binaural audio content to implement the functions described herein.
[0195] The output components of the head-mounted device 116 include visual components, such as displays like LCDs, PDPs, LED displays, projectors, or waveguides. The image display of the optical components is driven by an image display driver 1020. The output components of the head-mounted device 116 also include acoustic components (e.g., speakers), haptic components (e.g., vibrating motors), other signal generators, etc. The input components (e.g., user input devices 1028) of the head-mounted device 116, mobile device 114, and server system 1004 may include alphanumeric input components (e.g., keyboards, touchscreens configured to receive alphanumeric input, photoelectric keyboards, or other alphanumeric input components), point-based input components (e.g., mice, touchpads, trackballs, joysticks, motion sensors, or other pointing instruments), haptic input components (e.g., physical buttons, touchscreens that provide position and force for touch or touch gestures, or other haptic input components), audio input components (e.g., microphones), etc.
[0196] The head-mounted device 116 may also include additional peripheral device elements. Such peripheral device elements may include display elements, additional sensors, or biometric sensors integrated with the head-mounted device 116. For example, peripheral device elements may include any I / O components, including output components, motion components, position components, or any other such components described herein.
[0197] For example, biometric components include those for detecting expressions (e.g., hand gestures, facial expressions, vocal expressions, body posture, or eye tracking), measuring biosignals (e.g., blood pressure, heart rate, body temperature, sweating, or brain waves), and identifying people (e.g., voice recognition, retinal recognition, facial recognition, fingerprint recognition, or EEG-based recognition). Biometric components may include a BMI system that allows communication between the brain and external devices or machines. This can be achieved by recording brain activity data, converting that data into a format that can be understood by a computer, and then using the resulting signals to control the device or machine.
[0198] Motion components include accelerometer components (e.g., accelerometers), gravity sensor components, rotation sensor components (e.g., gyroscopes), etc. Position components include position sensor components (e.g., GPS receiver components) for generating position coordinates, Wi-Fi or Bluetooth™ transceivers for generating positioning system coordinates, altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude can be obtained), orientation sensor components (e.g., magnetometers), etc. Such positioning system coordinates can also be received from mobile device 114 via low-power wireless circuit system 1034 or high-speed wireless circuit system 1032 through low-power wireless connection 1012 and high-speed wireless connection 1014.
[0199] Glossary
[0200] "Carrier signal" refers to any intangible medium, such as a medium capable of storing, encoding, or carrying machine-executable instructions and including digital or analog communication signals, or other intangible medium facilitating the transmission of such instructions. Instructions can be sent or received over a network using a transmission medium via a network interface device.
[0201] "Client device" means any machine that interfaces with a communication network to obtain resources from one or more server systems or other client devices. Client devices can be, but are not limited to, mobile phones, desktop computers, laptop computers, portable digital assistants (PDAs), smartphones, tablet computers, ultrabooks, netbooks, laptops, multiprocessor systems, microprocessor-based or programmable consumer electronics, game consoles, STBs, or any other communication device that a user can use to access the network.
[0202] "Communications network" means, for example, one or more parts of a network, which can be an ad hoc network, intranet, extranet, virtual private network (VPN), local area network (LAN), WLAN, wide area network (WAN), wireless WAN (WWAN), metropolitan area network (MAN), the Internet, a part of the Internet, a part of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, other types of networks, or a combination of two or more such networks. For example, a network or part of a network may include a wireless network or a cellular network, and the coupling may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile Communications (GSM) connection, or other types of cellular or wireless coupling. In this example, the coupling can implement any of various types of data transmission technology, such as Single Carrier Radio Transmission Technology (1xRTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, the 3rd Generation Partnership Project (3GPP) including 3G, fourth-generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High-Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standards, other data transmission technologies defined by various standards setting organizations, other long-distance protocols, or other data transmission technologies.
[0203] A "component" refers to a logical or physical entity or device, such as having boundaries, branch points, APIs, or other technical definitions that provide partitioning or modularity for a particular processing or control function, such as those defined by functions or subroutines. A component can be combined with other components via its interface to perform machine processing. A component can be a packaged functional hardware unit designed to be used with other components, or part of a program that typically performs a related function. A component can constitute a software component (e.g., code embodied on a machine-readable medium) or a hardware component.
[0204] A “hardware component” is a tangible unit capable of performing certain operations and can be configured or arranged in a physical manner. In various examples, one or more computer systems (e.g., standalone computer systems, client computer systems, or server computer systems) or one or more hardware components (e.g., processors or processor groups) of a computer system can be configured by software (e.g., applications or application portions) to operate to perform certain operations described herein.
[0205] A "hardware component" can be implemented mechanically, electronically, or in any suitable combination thereof. For example, a hardware component may include a dedicated circuit system or logic permanently configured to perform certain operations. A hardware component may be a dedicated processor, such as a field-programmable gate array (FPGA) or an ASIC. A hardware component may also include programmable logic or a circuit system temporarily configured by software to perform certain operations. For example, a hardware component may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, the hardware component becomes a specific machine (or a specific part of a machine) uniquely tailored to perform the configured function, and is no longer a general-purpose processor. It will be understood that the decision to implement a hardware component mechanically in a temporarily configured (e.g., software-configured) circuit system or in a dedicated and permanently configured circuit system may be driven by cost and time considerations. Therefore, the phrase "hardware component" (or "hardware-implemented component") should be understood to encompass tangible entities that are physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain way or perform certain operations described herein.
[0206] Consider an example where hardware components are temporarily configured (e.g., programmed), without requiring each of the hardware components to be configured or instantiated at any given time. For example, in cases where the hardware components include a general-purpose processor that becomes a dedicated processor through software configuration, this general-purpose processor can be configured at different times as (e.g., including different hardware components) different dedicated processors. The software accordingly configures one or more specific processors to constitute a particular hardware component at one time and different hardware components at different times. Hardware components can provide information to and receive information from other hardware components. Therefore, the described hardware components can be considered communicatively coupled. In cases where multiple hardware components exist simultaneously, communication can be achieved through signal transmission between or among two or more hardware components (e.g., via appropriate circuitry and buses). In examples where multiple hardware components are configured or instantiated at different times, such communication between hardware components can be achieved, for example, through the storage and retrieval of information in a storage structure accessible to the multiple hardware components. For example, a hardware component can perform an operation and store the output of that operation in a storage device communicatively coupled to it. Then, additional hardware components can access the storage device at a later time to obtain and process the stored output. The hardware components can also initiate communication with input or output devices and are capable of manipulating resources (e.g., collections of information). 
The various operations of the example methods described herein can be performed, at least in part, by (e.g., via software) one or more processors temporarily or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute components of a processor implementation that operates to perform one or more operations or functions described herein.
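The storage-mediated communication described above, in which hardware components configured at different times exchange information through a storage structure accessible to both, can be sketched as follows. The names `SharedStore`, `GeneralPurposeProcessor`, `resize_component`, and `label_component`, and the image-size data, are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, Optional


class SharedStore:
    """Storage structure accessible to every configured component."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Any:
        return self._data[key]


class GeneralPurposeProcessor:
    """One processor that software configures as different components over time."""

    def __init__(self) -> None:
        self.configured_as: Optional[Callable[[SharedStore], Any]] = None

    def configure(self, component: Callable[[SharedStore], Any]) -> None:
        self.configured_as = component

    def run(self, store: SharedStore) -> Any:
        if self.configured_as is None:
            raise RuntimeError("processor has not been configured")
        return self.configured_as(store)


def resize_component(store: SharedStore) -> None:
    # First component: performs an operation and stores its output.
    store.put("resized", [dimension // 2 for dimension in (640, 480)])


def label_component(store: SharedStore) -> str:
    # Later component: retrieves the stored output and processes it further.
    width, height = store.get("resized")
    return f"{width}x{height}"


store = SharedStore()
processor = GeneralPurposeProcessor()
processor.configure(resize_component)   # configured as one component now
processor.run(store)
processor.configure(label_component)    # configured as another component later
result = processor.run(store)
```

The two components never exist at the same time, so the store is their only channel, which matches the case the text distinguishes from direct signal transmission.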
[0207] As used herein, "processor-implemented component" refers to a hardware component implemented using one or more processors. Similarly, the methods described herein can be implemented at least partially by processors, where one or more specific processors are examples of hardware. For example, at least some of the operations of the methods can be performed by one or more processors or processor-implemented components. Furthermore, one or more processors can also operate to support the execution of related operations in a "cloud computing" environment or as a "Software as a Service" (SaaS) operation. For example, at least some of the operations can be performed by a group of computers (as an example of machines including processors), where these operations are accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., APIs). The execution of some operations can be distributed across processors, not residing within a single machine, but deployed across multiple machines. In some examples, the processor or processor-implemented component may reside in a single geographic location (e.g., in a home environment, office environment, or server cluster). In other examples, the processor or processor-implemented component may be distributed across multiple geographic locations.
[0208] "Computer-readable storage medium" refers to both, for example, machine storage media and transmission media. Therefore, these terms include both storage devices / media and carrier / modulated data signals. The terms "machine-readable medium," "computer-readable medium," and "device-readable medium" refer to the same thing and can be used interchangeably in this disclosure. "Temporary message" refers to a message that is accessible for a limited time period, for example. A temporary message can be text, an image, video, etc. The access time of a temporary message can be set by the message sender. Alternatively, the access time can be a default setting or a setting specified by the recipient. Regardless of the setting technique, the message is temporary.
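The temporary message described above can be sketched as a small data structure. The text states only that the access time may be set by the sender, by default, or by the recipient; the precedence order used here (sender, then recipient, then default), the 10-second default, and all field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ACCESS_SECONDS = 10.0  # assumed default; the text does not give a value


@dataclass
class TemporaryMessage:
    content: str
    sent_at: float                             # timestamp in seconds
    sender_seconds: Optional[float] = None     # set by the message sender
    recipient_seconds: Optional[float] = None  # specified by the recipient

    def access_seconds(self) -> float:
        # Assumed precedence: sender setting, then recipient setting, then default.
        if self.sender_seconds is not None:
            return self.sender_seconds
        if self.recipient_seconds is not None:
            return self.recipient_seconds
        return DEFAULT_ACCESS_SECONDS

    def is_accessible(self, now: float) -> bool:
        # The message is accessible only within its limited time window.
        return self.sent_at <= now < self.sent_at + self.access_seconds()


by_sender = TemporaryMessage("hello", sent_at=100.0, sender_seconds=5.0)
by_default = TemporaryMessage("hello", sent_at=100.0)
```

Whichever setting supplies the duration, the message stays temporary because `is_accessible` always enforces some finite window.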
[0209] "Machine storage medium" refers to one or more storage devices and media (e.g., centralized or distributed databases, and associated caches and servers) that store executable instructions, routines, and data. This term should be accordingly considered to include, but is not limited to, solid-state memory, as well as optical and magnetic media, including memory internal or external to the processor. Specific examples of machine storage media, computer storage media, and device storage media include: non-volatile memory, including, by way of example, semiconductor storage devices such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGAs, and flash memory devices; disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms "machine storage medium," "device storage medium," and "computer storage medium" refer to the same thing and may be used interchangeably in this disclosure.
[0210] The terms “machine storage medium,” “computer storage medium,” and “device storage medium” explicitly exclude carrier waves, modulated data signals, and other such media, at least some of which are encompassed within the term “signal medium.” “Non-transitory computer-readable storage medium” refers to, for example, a tangible medium capable of storing, encoding, or carrying instructions executable by a machine. “Signal medium” refers to, for example, any intangible medium capable of storing, encoding, or carrying instructions executable by a machine, and includes other intangible media such as digital or analog communication signals or auxiliary software or data communication. The term “signal medium” should be considered to include any form of modulated data signal, carrier wave, etc. The term “modulated data signal” means a signal whose characteristics are set or altered in a manner that encodes information in the signal. The terms “transmission medium” and “signal medium” refer to the same thing and may be used interchangeably in this disclosure.
[0211] "User equipment" means, for example, a device that is accessed, controlled, or owned by a user and with which the user interacts to perform actions or interactions, including interactions with other users or computer systems.
[0212] "Communication network" refers to one or more parts of a network, which can be an ad hoc network, intranet, extranet, VPN, LAN, WLAN, WAN, WWAN, MAN, the Internet, a part of the Internet, a part of the PSTN, POTS network, cellular telephone network, wireless network, Wi-Fi® network, other types of network, or a combination of two or more such networks. For example, a network or part of a network may include a wireless network or a cellular network, and coupling may be a CDMA connection, a GSM connection, or other types of cellular or wireless coupling. In this example, coupling can implement any data transmission technology of various types, such as 1xRTT, EVDO technology, GPRS technology, EDGE technology, 3GPP, UMTS, HSPA, WiMAX, LTE standards including 3G and 4G networks, other data transmission technologies defined by various standards setting organizations, other long-distance protocols, or other data transmission technologies.
[0213] Components can constitute software components (e.g., code embodied on a machine-readable medium) or hardware components. A “hardware component” is a tangible unit capable of performing certain operations and can be configured or arranged in some physical manner. In various examples, one or more computer systems (e.g., standalone computer systems, client computer systems, or server computer systems) or one or more hardware components (e.g., processors or processor groups) of a computer system can be configured by software (e.g., an application or application portion) to operate to perform certain operations described herein.
[0214] Hardware components can also be implemented mechanically, electronically, or in any suitable combination thereof. For example, a hardware component may include dedicated circuitry or logic that is permanently configured to perform certain operations. A hardware component may be a special-purpose processor, such as an FPGA or ASIC. A hardware component may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware component may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, the hardware component becomes a specific machine (or a specific part of a machine) uniquely tailored to perform the configured functions, and is no longer a general-purpose processor. It will be understood that the decision to implement a hardware component mechanically, in dedicated and permanently configured circuitry, or in temporarily configured (e.g., software-configured) circuitry may be driven by cost and time considerations. Therefore, the phrase “hardware component” (or “hardware-implemented component”) should be understood to encompass a tangible entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein.
[0215] The various operations of the example methods described herein can be performed at least in part by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented components that operate to perform one or more of the operations or functions described herein.
[0216] Changes and modifications may be made to the disclosed examples without departing from the scope of this disclosure. These and other changes or modifications are intended to be included within the scope of this disclosure as set forth in the appended claims.
Claims
1. A method comprising: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, respectively, wherein the first generative machine learning model is trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model is trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image that combines the depiction of the first person in the first artificially personalized image with the depiction of the second person in the second artificially personalized image; accessing background information; and generating a new artificial image, the new artificial image comprising the foreground image on a background having visual attributes corresponding to the background information.
2. The method according to claim 1, further comprising: accessing a prompt that includes the background information; and processing the foreground image and the prompt by a third generative machine learning model to generate the new artificial image.
3. The method according to any one of claims 1 to 2, wherein the prompt includes an image depicting the background information.
4. The method according to any one of claims 1 to 3, wherein the prompt includes a textual description of the background information.
5. The method according to any one of claims 1 to 4, wherein the first generative machine learning model, the second generative machine learning model, and the third generative machine learning model each include a corresponding diffusion machine learning model.
6. The method according to any one of claims 1 to 5, wherein the first generative machine learning model and the second generative machine learning model each include a corresponding diffusion machine learning model.
7. The method according to any one of claims 1 to 6, further comprising: obtaining a foreground mask and a background mask for each of the first and second artificially personalized images; blending the foreground depictions from the first and second artificially personalized images using the foreground masks; and inpainting, using the background masks, the background of the blended foreground depictions from the first and second artificially personalized images.
8. The method according to any one of claims 1 to 7, further comprising: generating a first segmentation of the first person depicted in the first artificially personalized image.
9. The method according to claim 8, further comprising: extracting, based on the first segmentation, a first region of the first artificially personalized image, the first region including pixels that fall within the first segmentation; extracting, based on the first segmentation, a second region of the second artificially personalized image, the second region excluding pixels that fall within the first segmentation; and generating the foreground image by combining the first region and the second region.
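By way of non-limiting illustration only, and not as part of the claims, the region extraction and combination recited in claim 9 can be sketched as a per-pixel selection. The function name `composite_foreground`, the nested-list image representation, the single-channel pixel values, and the boolean-mask form of the first segmentation are all assumptions made for this example.

```python
from typing import List

Pixel = int  # hypothetical single-channel pixel value, for illustration only
Image = List[List[Pixel]]
Mask = List[List[bool]]


def composite_foreground(first: Image, second: Image, first_seg: Mask) -> Image:
    """Combine two same-sized images using the first person's segmentation.

    Pixels inside ``first_seg`` are taken from ``first`` (the first region);
    pixels outside it are taken from ``second`` (the second region).
    """
    if not (len(first) == len(second) == len(first_seg)):
        raise ValueError("images and mask must have the same height")
    out: Image = []
    for row_a, row_b, row_m in zip(first, second, first_seg):
        if not (len(row_a) == len(row_b) == len(row_m)):
            raise ValueError("images and mask must have the same width")
        out.append([a if m else b for a, b, m in zip(row_a, row_b, row_m)])
    return out


# Tiny 2x3 example: 1s stand in for the first image, 2s for the second.
first_image = [[1, 1, 1], [1, 1, 1]]
second_image = [[2, 2, 2], [2, 2, 2]]
first_segmentation = [[True, False, False], [True, True, False]]
foreground = composite_foreground(first_image, second_image, first_segmentation)
```

A hard per-pixel selection is the simplest reading of "combining the first region and the second region"; a soft (alpha) blend at the segmentation boundary would be an alternative design.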
10. The method according to any one of claims 1 to 9, further comprising: generating a second segmentation of the second person depicted in the second artificially personalized image; and combining the first segmentation and the second segmentation to generate a combined segmentation.
11. The method according to any one of claims 1 to 10, further comprising: inpainting the background based on the combined segmentation of the foreground image to generate the new artificial image.
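By way of non-limiting illustration only, and not as part of the claims, the combined segmentation of claim 10 and the background region processed in claim 11 can be sketched with boolean masks. The claims do not state how the two segmentations are combined; a pixel-wise union is assumed here. The function names `combine_segmentations` and `background_region` are hypothetical, and the generative model that actually fills the background is not shown.

```python
from typing import List

Mask = List[List[bool]]


def combine_segmentations(first_seg: Mask, second_seg: Mask) -> Mask:
    """Pixel-wise union (an assumption): True where either person is present."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first_seg, second_seg)]


def background_region(combined: Mask) -> Mask:
    """Complement of the combined segmentation: the pixels left to be filled."""
    return [[not m for m in row] for row in combined]


# One-row example with the two people in different columns.
seg_first = [[True, False, False, False]]
seg_second = [[False, False, True, False]]
combined = combine_segmentations(seg_first, seg_second)
to_fill = background_region(combined)
```

In such a sketch, `to_fill` would be the mask handed to whatever background-generation step is used, so that the pixels of both people are preserved.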
12. The method according to any one of claims 1 to 11, wherein the background is generated based on the prompt.
13. The method according to any one of claims 1 to 12, further comprising: receiving a first pose image, the first pose image including a depiction of an object in an individual pose; receiving a first prompt defining a first set of visual attributes; and processing the first pose image and the first prompt by the first generative machine learning model to generate the first artificially personalized image.
14. The method of claim 13, further comprising: receiving a second pose image, the second pose image including a depiction of another object in another pose; receiving a second prompt defining a second set of visual attributes; and processing the second pose image and the second prompt by the second generative machine learning model to generate the second artificially personalized image.
15. The method according to any one of claims 1 to 14, further comprising: training the first generative machine learning model by performing a first set of training operations, the first set of training operations including: accessing a first set of training images, each of which depicts the first person with a different background and in a different pose; receiving an image including a depiction of the first person; receiving a prompt that defines visual attributes of an individual background depicted in an individual training image of the first set of training images and defines a target pose depicted in the individual training image; processing the image and the prompt by the first generative machine learning model to generate an estimated artificial image, the estimated artificial image depicting the first person in the target pose and an artificial background having the visual attributes defined by the prompt; computing a deviation between the estimated artificial image and the individual training image; and updating one or more parameters of the first generative machine learning model based on the deviation.
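By way of non-limiting illustration only, and not as part of the claims, the final two training operations of claim 15 (computing a deviation and updating parameters based on it) can be sketched with a toy one-parameter model. Everything here is an assumption made for the example: the model `estimate = weight * image` stands in for a generative model, prompt conditioning is omitted, mean squared error is used as the deviation, and plain gradient descent is used as the update rule.

```python
from typing import List


def mse(estimate: List[float], target: List[float]) -> float:
    """Mean squared deviation between two flattened images."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def train_step(weight: float, image: List[float],
               target: List[float], lr: float) -> float:
    """One gradient-descent update of the toy model ``estimate = weight * image``."""
    estimate = [weight * x for x in image]
    # d(MSE)/d(weight) = (2 / N) * sum((weight * x - t) * x)
    grad = 2.0 * sum((e - t) * x
                     for e, t, x in zip(estimate, target, image)) / len(target)
    return weight - lr * grad


image = [1.0, 2.0, 3.0]           # stand-in for the input image of the first person
training_image = [2.0, 4.0, 6.0]  # stand-in for the individual training image
weight = 0.0
losses = []
for _ in range(50):
    losses.append(mse([weight * x for x in image], training_image))
    weight = train_step(weight, image, training_image, lr=0.05)
```

In this toy setting the deviation shrinks on each step and `weight` approaches 2.0, the value at which the estimate matches the training image.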
16. The method of claim 15, further comprising: capturing a plurality of images of the first person; extracting, from the plurality of images, a plurality of regions that depict the first person; obtaining a set of prompts defining different visual attributes of backgrounds; and processing the plurality of regions and the set of prompts by a diffusion model to generate the first set of training images.
17. The method according to any one of claims 1 to 16, further comprising: magnifying the first person in the plurality of images to extract the plurality of regions.
18. The method according to any one of claims 1 to 17, further comprising: training the second generative machine learning model by performing a second set of training operations, the second set of training operations including: accessing a second set of training images, each of which depicts the second person with a different background and in a different pose; receiving a second image including a depiction of the second person; receiving a second prompt that defines visual attributes of a second individual background depicted in a second individual training image of the second set of training images and defines a second target pose depicted in the second individual training image; processing the second image and the second prompt by the second generative machine learning model to generate a second estimated artificial image, the second estimated artificial image depicting the second person in the second target pose and a second artificial background having the visual attributes defined by the second prompt; computing a second deviation between the second estimated artificial image and the second individual training image; and updating one or more parameters of the second generative machine learning model based on the second deviation.
19. A system comprising: at least one processor; and at least one memory component having instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to perform operations, the operations including: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, respectively, wherein the first generative machine learning model is trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model is trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image that combines the depiction of the first person in the first artificially personalized image with the depiction of the second person in the second artificially personalized image; accessing background information; and generating a new artificial image, the new artificial image comprising the foreground image on a background having visual attributes corresponding to the background information.
20. A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations, the operations comprising: accessing a first artificially personalized image and a second artificially personalized image generated by a first generative machine learning model and a second generative machine learning model, respectively, wherein the first generative machine learning model is trained to generate the first artificially personalized image including a depiction of a first person, and the second generative machine learning model is trained to generate the second artificially personalized image including a depiction of a second person; generating a foreground image that combines the depiction of the first person in the first artificially personalized image with the depiction of the second person in the second artificially personalized image; accessing background information; and generating a new artificial image, the new artificial image comprising the foreground image on a background having visual attributes corresponding to the background information.