Clothing segmentation
By segmenting and smoothing user clothing using machine learning techniques, the cost and complexity issues of depth sensors in existing augmented reality systems are solved, enabling efficient full-body background recognition and visual effects applications, thus improving the user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SNAP INC
- Filing Date
- 2022-04-12
- Publication Date
- 2026-06-23
Figure CN117157667B_ABST
Abstract
Description
[0001] Priority Declaration
[0002] This application claims the benefit of priority to U.S. Patent Application Serial No. 17/301,690, filed April 12, 2021, which is incorporated herein by reference in its entirety.
Technical Field
[0003] This disclosure generally relates to the use of messaging applications to provide augmented reality experiences.
Background Technology
[0004] Augmented reality (AR) is a modification of a virtual environment. For example, in virtual reality (VR), the user is fully immersed in a virtual world, while in AR, the user is immersed in a world where virtual objects are combined with or overlaid on the real world. AR systems are designed to generate and present virtual objects that interact realistically with the real-world environment and with each other. Examples of AR applications include single-player or multiplayer video games, instant messaging systems, and more.
Attached Figure Description
[0005] In the accompanying drawings (which are not necessarily drawn to scale), similar numbers may describe similar parts in different views. To facilitate identification of any particular element or action being discussed, one or more of the most significant digits in the reference numerals indicate the drawing number in which the element was first introduced. Some non-limiting examples are shown in the accompanying drawings:
[0006] Figure 1 is a graphical representation of a networked environment in which the present disclosure can be deployed, according to some examples.
[0007] Figure 2 is a graphical representation of a messaging client application, according to some examples.
[0008] Figure 3 is a graphical representation of the data structures maintained in the database, according to some examples.
[0009] Figure 4 is a graphical representation of a message, according to some examples.
[0010] Figure 5 is a block diagram illustrating an example segmentation estimation system, according to some examples.
[0011] Figures 6, 7A, 7B, 8, and 9 are graphical representations of the output of the segmentation estimation system, according to some examples.
[0012] Figures 10A and 10B are flowcharts illustrating example operations of the messaging application server, according to some examples.
[0013] Figure 11 is a graphical representation of a machine in the form of a computer system within which a set of instructions can be executed to cause the machine to perform any one or more of the methods discussed herein, according to some examples.
[0014] Figure 12 is a block diagram showing a software architecture in which examples can be implemented.
Detailed Implementation
[0015] The following description includes illustrative examples of systems, methods, techniques, instruction sequences, and computer program products that implement the contents of this disclosure. In the following description, numerous specific details are set forth for illustrative purposes to provide an understanding of the various examples. However, it will be apparent to those skilled in the art that the examples can be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques are not necessarily shown in detail.
[0016] Typically, virtual reality (VR) and augmented reality (AR) systems display an image representing a given user by capturing an image of the user and additionally using a depth sensor to obtain a depth map of the real-world body depicted in the image. By processing the depth map and the image together, VR and AR systems can detect the user's location within the image and can appropriately modify the user or background within the image. While such systems can work well, the requirement for a depth sensor limits their applications. This is because adding a depth sensor to the user device for image modification purposes increases the overall cost and complexity of the device, thus reducing their appeal.
[0017] Some systems do not require depth sensors to modify images. For example, some systems allow users to replace the background in a video conference where a user's face is detected. Specifically, such systems can use specialized techniques optimized for recognizing a user's face to identify the background in an image depicting the user's face. These systems can then replace only the pixels depicting the background, so that the real-world background is replaced by an alternative background in the image. However, such systems typically cannot recognize the user's entire body. Therefore, if the user is more than a threshold distance from the camera, such that more than just the user's face is captured by the camera, background replacement begins to fail. In this case, image quality is severely affected, and parts of the user's face and body may be unintentionally removed by the system because the system incorrectly identifies such parts as background rather than foreground. Furthermore, when more than one user is depicted in an image or video feed, such systems cannot properly replace the background. Because such systems typically cannot distinguish the entire body of a user in an image from the background, they also cannot apply visual effects to certain parts of the user's body, such as clothing.
[0018] The disclosed technique improves the efficiency of using electronic devices by segmenting clothing or garments depicted in an image. By segmenting the clothing or garments worn by a user or different corresponding users depicted in an image, the disclosed technique can apply one or more visual effects to the image, and particularly to the clothing depicted in the image. Specifically, the disclosed technique applies a first machine learning technique to generate a segmentation of the clothing worn by a user depicted in the image (e.g., to distinguish pixels corresponding to one or more garments worn by the user from pixels corresponding to the background of the image or body parts of the user). The disclosed technique then smooths, filters, or improves the generated segmentation based on an estimated segmentation of the clothing worn by the user depicted in a single image (e.g., the current frame of a video) generated by applying a second machine learning technique to previously received video frames of the user.
[0019] Specifically, a video depicting a user's body is received. The current frame of the video is processed using a first machine learning technique to segment the clothing worn by the user in the current frame. A set of previous frames (e.g., 1 to 2 seconds of video before the current frame) is processed using a second machine learning technique to estimate the clothing segmentation for subsequent frames (e.g., the current frame). The segmentation for the current frame generated by the first machine learning technique is compared with the estimated segmentation predicted based on previous frames using the second machine learning technique. Then, any deviations or differences between the two segments are corrected using the second machine learning technique to smooth, refine, and filter the segmentation for the current frame generated by the first machine learning technique.
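The compare-and-correct flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the disclosed method: the outputs of the two machine learning techniques are stood in for by plain lists of per-pixel probabilities, and the correction step is replaced by a simple weighted blend applied where the two masks disagree. The names `smooth_segmentation`, `blend`, and `tolerance` are hypothetical.

```python
from typing import List

Mask = List[List[float]]  # per-pixel clothing probability in [0, 1]


def smooth_segmentation(current: Mask, predicted: Mask,
                        blend: float = 0.5, tolerance: float = 0.2) -> Mask:
    """Pull the current-frame mask toward the mask predicted from earlier frames.

    Assumptions (not from the source): both masks have the same shape, and a
    pixel counts as a deviation when the two values differ by more than
    `tolerance`. Deviating pixels are blended; all others keep the
    current-frame value.
    """
    smoothed: Mask = []
    for cur_row, pred_row in zip(current, predicted):
        row = []
        for cur, pred in zip(cur_row, pred_row):
            if abs(cur - pred) > tolerance:
                # Deviation between the two segmentations: blend toward the prediction.
                row.append((1.0 - blend) * cur + blend * pred)
            else:
                row.append(cur)
        smoothed.append(row)
    return smoothed


# Stand-ins for the outputs of the first and second machine learning techniques.
current_mask = [[0.9, 0.1], [0.8, 0.0]]
predicted_mask = [[0.9, 0.9], [0.7, 0.0]]
smoothed_mask = smooth_segmentation(current_mask, predicted_mask)
```

Only the top-right pixel differs by more than the tolerance here, so it is the only value that changes. A real system would presumably learn the correction rather than use a fixed blend factor.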
[0020] In some cases, the disclosed techniques adjust the loss function used to train the first machine learning technique based on whether distinguishable attributes of clothing items (e.g., shoulder straps, collars, or sleeves) are detected. For example, if shirt sleeves are detected in a training image and its garment (e.g., shirt) segmentation, the loss function is adjusted during training to increase the weight of the parameter adjustments made to the first machine learning technique. That is, how strongly each training image influences parameter adjustments during training of the first machine learning technique can depend on whether distinguishable attributes of the clothing items or garments depicted in that image are detected or identified.
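One way to picture attribute-dependent weighting is a per-image weight on an ordinary segmentation loss. The sketch below is an assumption-laden illustration: the source does not specify the loss, so binary cross-entropy is used as a stand-in, and `weighted_batch_loss`, `has_attribute`, and `attribute_weight` are hypothetical names.

```python
import math
from typing import Sequence


def pixel_bce(pred: Sequence[float], target: Sequence[int]) -> float:
    """Mean binary cross-entropy over the pixels of one training image."""
    eps = 1e-7
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def weighted_batch_loss(preds: Sequence[Sequence[float]],
                        targets: Sequence[Sequence[int]],
                        has_attribute: Sequence[bool],
                        attribute_weight: float = 2.0) -> float:
    """Weighted average of per-image losses.

    Assumption: images in which a distinguishable attribute (e.g., a sleeve)
    was detected get weight `attribute_weight`; all other images get 1.0.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pred, target, flag in zip(preds, targets, has_attribute):
        weight = attribute_weight if flag else 1.0
        weighted_sum += weight * pixel_bce(pred, target)
        weight_total += weight
    return weighted_sum / weight_total


preds = [[0.9, 0.2], [0.6, 0.4]]   # predicted clothing probabilities per pixel
targets = [[1, 0], [1, 0]]         # ground-truth clothing labels per pixel
flags = [False, True]              # second image contains a detected sleeve
uniform_loss = weighted_batch_loss(preds, targets, flags, attribute_weight=1.0)
attribute_loss = weighted_batch_loss(preds, targets, flags, attribute_weight=2.0)
```

Because the second image is both flagged and less well predicted, raising its weight raises the batch loss, so its errors contribute more to the resulting parameter update.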
[0021] In this way, the disclosed techniques can apply one or more visual effects to clothing segmented in the current image and worn by a user. For example, the color or texture of a shirt depicted as being worn by the user in the image can be replaced with a different color, texture, or animation to provide the illusion that the user is wearing a different shirt than the one the user is actually wearing in the image. As another example, visual elements can be used to enhance or replace the boundaries of the clothing itself (e.g., adding glowing or shimmering effects only to a portion of the clothing or its boundaries). As yet another example, the segmentation of clothing worn by the user can be used to adjust the display of another virtual garment displayed in the image depicting the user. Specifically, the segmentation of clothing can be used to control occlusion or anti-occlusion mode effects applied to virtual clothing (e.g., trousers) or real-world clothing displayed in the user's image. In this way, real-world clothing (e.g., the pixel color of certain parts of the real-world clothing worn by the user) can cover specific parts of the virtual clothing (e.g., a portion of the pixels of the virtual clothing) rather than covering the entire virtual clothing, or vice versa. Therefore, a realistic display is provided, showing that the user is wearing a real-world garment (e.g., a shirt) while also wearing a virtual garment (e.g., trousers). As used herein, garments and clothing are used interchangeably and should be understood to have the same meaning. This improves the overall user experience when using electronic devices. Furthermore, by performing such segmentation without using a depth sensor, the total amount of system resources required to complete the task is reduced.
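The occlusion behavior described above can be illustrated with a small per-pixel compositing sketch. It assumes boolean masks and a flattened list of RGB pixels; the function name `composite_with_occlusion` and the rule that real clothing always wins where the two masks overlap are assumptions chosen for illustration (the text also allows the reverse).

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # RGB


def composite_with_occlusion(camera: List[Pixel],
                             virtual: List[Pixel],
                             virtual_mask: List[bool],
                             real_clothing_mask: List[bool]) -> List[Pixel]:
    """Draw a virtual garment, except where segmented real clothing covers it."""
    output: List[Pixel] = []
    for cam, virt, is_virtual, is_real in zip(camera, virtual,
                                              virtual_mask, real_clothing_mask):
        if is_virtual and not is_real:
            output.append(virt)   # virtual garment is visible at this pixel
        else:
            output.append(cam)    # background, body, or occluding real clothing
    return output


camera_row = [(10, 10, 10), (20, 20, 20), (30, 30, 30), (40, 40, 40)]
virtual_row = [(255, 0, 0)] * 4
virtual_mask = [False, True, True, False]        # where the virtual trousers would be drawn
real_clothing_mask = [False, False, True, True]  # segmented real shirt
composited_row = composite_with_occlusion(camera_row, virtual_row,
                                          virtual_mask, real_clothing_mask)
```

In this example the third pixel belongs to both masks, so the real shirt hides that part of the virtual garment while the second pixel shows the virtual garment.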
[0022] Networked computing environment
[0023] Figure 1 is a block diagram illustrating an example messaging system 100 for exchanging data (e.g., messages and associated content) over a network. The messaging system 100 includes multiple instances of client devices 102, each instance hosting multiple applications including a messaging client 104 and other external applications 109 (e.g., third-party applications). Each messaging client 104 is communicatively coupled via a network 112 (e.g., the Internet) to other instances of the messaging client 104 (e.g., hosted on corresponding other client devices 102), a messaging server system 108, and an external application server 110. The messaging client 104 can also communicate with the locally hosted third-party applications 109 using an application programming interface (API).
[0024] Messaging client 104 can communicate and exchange data with other messaging clients 104 and messaging server system 108 via network 112. The data exchanged between messaging clients 104, and between messaging client 104 and messaging server system 108, includes functions (e.g., commands for activating functions) and payload data (e.g., text, audio, video, or other multimedia data).
[0025] Messaging server system 108 provides server-side functionality to specific messaging clients 104 via network 112. While some functions of messaging system 100 are described herein as being performed by messaging client 104 or messaging server system 108, the location of certain functions, whether within messaging client 104 or messaging server system 108, may be a design choice. For example, it might be technically preferable to initially deploy certain technologies and functions within messaging server system 108 but later migrate them to messaging client 104 where client device 102 has sufficient processing capacity.
[0026] The messaging server system 108 supports various services and operations provided to the messaging client 104. Such operations include sending data to and receiving data from the messaging client 104, and processing data generated by the messaging client 104. As an example, this data may include message content, client device information, geolocation information, media augmentations and overlays, message content persistence conditions, social network information, and live event information. Data exchange within the messaging system 100 is invoked and controlled through functions available via the user interface (UI) of the messaging client 104.
[0027] Specifically, turning to messaging server system 108, application programming interface (API) server 116 is coupled to application server 114 and provides a programming interface to application server 114. Application server 114 is communicatively coupled to database server 120, which facilitates access to database 126, which stores data associated with messages processed by application server 114. Similarly, web server 128 is coupled to application server 114 and provides a web-based interface to application server 114. To this end, web server 128 processes incoming network requests via Hypertext Transfer Protocol (HTTP) and several other related protocols.
[0028] Application programming interface (API) server 116 receives and sends message data (e.g., commands and message payloads) between client device 102 and application server 114. Specifically, API server 116 provides a set of interfaces (e.g., routines and protocols) that can be invoked or queried by messaging client 104 to activate the functionality of application server 114. API server 116 exposes various functions supported by application server 114, including: account registration; login functionality; sending messages from one messaging client 104 to another messaging client 104 via application server 114; sending media files (e.g., images or videos) from messaging client 104 to messaging server 118 for possible access by another messaging client 104; setting up collections of media data (e.g., stories); retrieving the friend list of the user of client device 102; retrieving such collections; retrieving messages and content; adding and deleting entities (e.g., friends) in an entity graph (e.g., a social graph); locating friends in the social graph; and opening application events (e.g., associated with messaging client 104).
[0029] Application server 114 hosts multiple server applications and subsystems, including, for example, messaging server 118, image processing server 122, and social networking server 124. Messaging server 118 implements multiple messaging technologies and functions, particularly those related to the aggregation and other processing of content (e.g., text and multimedia content) included in messages received from multiple instances of messaging client 104. As will be described in further detail, text and media content from multiple sources can be aggregated into collections of content (e.g., referred to as stories or galleries). These collections are then made available to messaging client 104. Other processor- and memory-intensive processing of data can also be performed on the server side by messaging server 118, in view of the hardware requirements for such processing.
[0030] Application server 114 also includes image processing server 122, which is dedicated to performing various image processing operations, typically on images or videos in the payload of messages sent from or received at messaging server 118.
[0031] Image processing server 122 is used to implement the scanning function of augmentation system 208. The scanning function includes activating and providing one or more augmented reality experiences on client device 102 when an image is captured by client device 102. Specifically, messaging client 104 on client device 102 can be used to activate a camera device. The camera device displays one or more live images or videos and one or more icons or identifiers for one or more augmented reality experiences to the user. The user can select a given identifier from the identifiers to initiate the corresponding augmented reality experience or perform desired image modifications (e.g., replacing the clothing worn by the user in the video or recoloring the clothing worn by the user in the video).
[0032] Social networking server 124 supports various social networking functions and services and makes these functions and services available to messaging server 118. To this end, social networking server 124 maintains and accesses entity graph 308 (as shown in Figure 3) within database 126. Examples of functions and services supported by social networking server 124 include identifying other users of messaging system 100 with whom a specific user has a relationship or whom the specific user is “following”, and also identifying the interests of a specific user and other entities.
[0033] Returning to messaging client 104, the features and functionalities of external resources (e.g., third-party application 109 or applet) are made available to the user via the interface of messaging client 104. Messaging client 104 receives user selections regarding options for launching or accessing features of external resources (e.g., third-party resources) such as external application 109. External resources may be third-party applications (external application 109) installed on client device 102 (e.g., “native applications”), or smaller versions of third-party applications (e.g., “applets”) hosted on client device 102 or located remotely from client device 102 (e.g., on third-party server 110). The smaller version of a third-party application comprises a subset of the features and functionalities of the third-party application (e.g., a full-scale, native version of a third-party standalone application) and is implemented using a markup language document. In one example, the smaller version of a third-party application (e.g., “applet”) is a web-based markup language version of the third-party application and is embedded in messaging client 104. In addition to using markup language documents (e.g., .*ml files), applets can include scripting languages (e.g., .*js files or .json files) and style sheets (e.g., .*ss files).
[0034] In response to receiving a user selection of options for launching or accessing an external resource (external application 109), the messaging client 104 determines whether the selected external resource is a web-based external resource or a locally installed external application. In some cases, the external application 109, locally installed on the client device 102, can be launched independently of and separately from the messaging client 104, for example, by selecting an icon corresponding to the external application 109 on the home screen of the client device 102. A smaller version of such an external application can be launched or accessed via the messaging client 104, and in some examples, no part of the smaller external application can be accessed (or only a limited part can be accessed) outside of the messaging client 104. The smaller external application can be launched by receiving and processing a markup language document associated with the smaller external application from the external application server 110 via the messaging client 104.
[0035] In response to determining that the external resource is a locally installed external application 109, the messaging client 104 instructs the client device 102 to launch the external application 109 by executing locally stored code corresponding to the external application 109. In response to determining that the external resource is a web-based resource, the messaging client 104 communicates with the external application server 110 to obtain a markup language document corresponding to the selected resource. The messaging client 104 then processes the obtained markup language document to render the web-based external resource within the user interface of the messaging client 104.
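The branching described above (run local code for an installed application, otherwise fetch and render a markup language document) can be sketched as a small dispatcher. Everything named here is hypothetical: `ExternalResource`, `launch_external_resource`, and the three callbacks are illustrative stand-ins, not interfaces taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ExternalResource:
    name: str
    locally_installed: bool  # True for a native app, False for a web-based resource


def launch_external_resource(resource: ExternalResource,
                             run_local_code: Callable[[str], None],
                             fetch_markup: Callable[[str], str],
                             render_markup: Callable[[str], None]) -> str:
    """Launch a resource the way the messaging client is described as doing."""
    if resource.locally_installed:
        # Locally installed application: execute its locally stored code.
        run_local_code(resource.name)
        return "local"
    # Web-based resource: obtain the markup document, then render it in the client UI.
    markup = fetch_markup(resource.name)
    render_markup(markup)
    return "web"


calls: List[str] = []


def fake_run_local(name: str) -> None:
    calls.append(f"run:{name}")


def fake_fetch(name: str) -> str:
    calls.append(f"fetch:{name}")
    return f"<html>{name}</html>"


def fake_render(markup: str) -> None:
    calls.append(f"render:{markup}")


first = launch_external_resource(ExternalResource("game", True),
                                 fake_run_local, fake_fetch, fake_render)
second = launch_external_resource(ExternalResource("applet", False),
                                  fake_run_local, fake_fetch, fake_render)
```

Passing the three operations in as callbacks keeps the decision logic separate from how a given client actually executes code, talks to the external application server, or renders markup.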
[0036] The messaging client 104 can notify users of client device 102 or other users (e.g., "friends") associated with such users of one or more external resources. For example, the messaging client 104 can provide participants in a conversation (e.g., a chat session) within the messaging client 104 with notifications related to the current or recent use of external resources by one or more members of a user group. One or more users can be invited to join an active external resource or to activate a recently used but currently inactive external resource (within the friend group). External resources can provide participants in the conversation, each using the corresponding messaging client 104, with the ability to share items, conditions, states, or locations within the external resource with one or more members of the user group who have entered the chat session. Shared items can be interactive chat cards that chat members can use to interact with, for example, activate the corresponding external resource, view specific information within the external resource, or take chat members to a specific location or state within the external resource. Within a given external resource, response messages can be sent to users on the messaging client 104. External resources can selectively include different media items in the response based on the current context of the external resource.
[0037] The messaging client 104 can present a list of available external resources (e.g., third-party or external applications 109 or applets) to the user to launch or access a given external resource. This list can be presented in a context-sensitive menu. For example, the icons representing different ones of the external applications 109 (or applets) can vary based on how the user launches the menu (e.g., from a conversational interface or from a non-conversational interface).
[0038] System Architecture
[0039] Figure 2 is a block diagram illustrating further details of a messaging system 100 according to some examples. Specifically, the messaging system 100 is shown as including a messaging client 104 and an application server 114. The messaging system 100 includes multiple subsystems supported on the client side by the messaging client 104 and on the server side by the application server 114. These subsystems include, for example, a short-lived timer system 202, a collection management system 204, an augmentation system 208, a map system 210, a game system 212, and an external resource system 220.
[0040] The short-lived timer system 202 is responsible for enforcing temporary or time-limited access to content by the messaging client 104 and the messaging server 118. The short-lived timer system 202 includes multiple timers that selectively enable access to messages and associated content via the messaging client 104 (e.g., for rendering and display) based on duration and display parameters associated with a message or set of messages (e.g., a story). Further details regarding the operation of the short-lived timer system 202 are provided below.
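A time-limited access check of the kind described above might look like the following sketch. The record layout and the names `TimedMessage`, `is_accessible`, and `visible_messages` are assumptions for illustration; the source only says that access depends on duration and display parameters.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedMessage:
    message_id: str
    posted_at: float   # seconds since an arbitrary epoch (assumed representation)
    duration_s: float  # how long the message remains viewable


def is_accessible(message: TimedMessage, now: float) -> bool:
    """A message is viewable from its posting time until its duration elapses."""
    return message.posted_at <= now < message.posted_at + message.duration_s


def visible_messages(messages: List[TimedMessage], now: float) -> List[str]:
    """Return the identifiers of messages that may still be rendered at `now`."""
    return [m.message_id for m in messages if is_accessible(m, now)]


story = [
    TimedMessage("a", posted_at=0.0, duration_s=10.0),
    TimedMessage("b", posted_at=5.0, duration_s=10.0),
    TimedMessage("c", posted_at=20.0, duration_s=10.0),
]
visible_at_12 = visible_messages(story, now=12.0)
```

At time 12 the first message has expired and the third has not yet been posted, so only the second remains visible.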
[0041] The collection management system 204 is responsible for managing collections or sets of media (e.g., collections of text, image, video, and audio data). Collections of content (e.g., messages, including images, videos, text, and audio) can be organized into "event galleries" or "event stories." Such collections can be made available for a specified time period (e.g., the duration of an event related to the content). For example, content related to a concert can be made available as a "story" for the duration of the concert. The collection management system 204 can also be responsible for publishing icons that notify the user interface of the messaging client 104 of the existence of a specific collection.
[0042] Furthermore, the collection management system 204 includes a curation interface 206, which allows collection managers to manage and curate specific content collections. For example, the curation interface 206 enables an event organizer to curate collections of content related to a specific event (e.g., removing inappropriate content or redundant messages). Additionally, the collection management system 204 employs machine vision (or image recognition technology) and content rules to automatically curate content collections. In some examples, users may be compensated for including user-generated content in the collection. In such cases, the collection management system 204 operates to automatically pay such users for using their content.
[0043] Augmentation system 208 provides various functionalities that enable users to augment (e.g., annotate or otherwise modify or edit) media content associated with messages. For example, augmentation system 208 provides functionality related to generating and publishing media overlays for messages processed by messaging system 100. Augmentation system 208 is operable to provide media overlays or augmentations (e.g., image filters) to messaging client 104 based on the geographic location of client device 102. In another example, augmentation system 208 is operable to provide media overlays to messaging client 104 based on other information such as the social network information of the user of client device 102. Media overlays can include audio and visual content as well as visual effects. Examples of audio and visual content include images, text, logos, animations, and sound effects. Examples of visual effects include color overlays. Audio and visual content or visual effects can be applied to media content items (e.g., photographs) at client device 102. For example, media overlays can include text, graphic elements, or images that can be overlaid on photographs taken by client device 102. In another example, media overlays include location identifier overlays (e.g., Venice Beach), names of live events, or business names (e.g., Beach Cafe). In yet another example, augmentation system 208 uses the geolocation of client device 102 to identify media overlays that include the business name at the location of client device 102. Media overlays may include other tags associated with the business. Media overlays may be stored in database 126 and accessed through database server 120.
[0044] In some examples, augmentation system 208 provides a user-based publishing platform that allows users to select geographic locations on a map and upload content associated with those locations. Users can also specify the circumstances under which particular media overlays should be offered to other users. Augmentation system 208 generates media overlays that include the uploaded content and associates the uploaded content with the selected geographic locations.
[0045] In other examples, augmentation system 208 provides a merchant-based publishing platform that enables merchants to select specific media overlays associated with a geographic location via a bidding process. For example, augmentation system 208 associates the media overlay of the highest bidder with a corresponding geographic location for a predefined amount of time. Augmentation system 208 communicates with image processing server 122 to obtain augmented reality experiences and presents identifiers of such experiences in one or more user interfaces (e.g., as icons on live images or videos, or as thumbnails or icons in interfaces dedicated to the presented augmented reality experience). Once an augmented reality experience is selected, one or more images, videos, or augmented reality graphic elements are retrieved and presented as overlays on the images or videos captured by client device 102. In some cases, the camera is switched to a front-facing view (e.g., the front camera of client device 102 is activated in response to the activation of a specific augmented reality experience), and images from the front camera of client device 102, rather than the rear camera of client device 102, begin to appear on client device 102. One or more images, videos, or augmented reality graphic elements are retrieved and presented as overlays on images captured and displayed by the front-facing camera of client device 102.
[0046] In other examples, augmentation system 208 can communicate and exchange data with another augmentation system 208 on another client device 102 and a server via network 106. The exchanged data may include: a session identifier identifying the shared AR session; a transformation between the first client device 102 and the second client device 102 (e.g., multiple client devices 102, including the first device and the second device) to align the shared AR session to a common origin; a common coordinate system; functions (e.g., commands for activating functions); and other payload data (e.g., text, audio, video, or other multimedia data).
[0047] Augmentation system 208 sends a transformation to the second client device 102, allowing the second client device 102 to adjust its AR coordinate system based on the transformation. In this way, the first and second client devices 102 synchronize their coordinate systems and frames to display content in the AR session. Specifically, augmentation system 208 calculates the origin of the second client device 102 in the coordinate system of the first client device 102. Augmentation system 208 then determines an offset in the coordinate system of the second client device 102 based on the position of this origin in the coordinate system of the second client device 102 as viewed from its own perspective. This offset is used to generate a transformation that allows the second client device 102 to generate AR content based on a shared coordinate system with the first client device 102.
[0048] Augmentation system 208 can communicate with client device 102 to establish individual or shared AR sessions. Augmentation system 208 can also be coupled to messaging server 118 to establish electronic group communication sessions (e.g., group chat, instant messaging) for client device 102 within a shared AR session. The electronic group communication session can be associated with a session identifier provided by client device 102 to gain access to both the electronic group communication session and the shared AR session. In one example, client device 102 first gains access to the electronic group communication session and then obtains a session identifier within the electronic group communication session that allows client device 102 to access the shared AR session. In some examples, client device 102 can access the shared AR session without the assistance of augmentation system 208 in application server 114 or without communicating with augmentation system 208 in application server 114.
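As a rough illustration of aligning two devices to a common origin, the sketch below builds a mapping from the first device's coordinates to the second device's coordinates. It makes simplifying assumptions that are not in the source: coordinates are 2D, and the second device's frame differs from the first only by a translation (its origin as seen by the first device) and a counter-clockwise rotation `heading_rad`. The function name `make_first_to_second` is hypothetical.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def make_first_to_second(origin_of_second_in_first: Point,
                         heading_rad: float) -> Callable[[Point], Point]:
    """Build a transform from first-device coordinates to second-device coordinates.

    Assumed model: p_first = origin + R(heading) * p_second, so the inverse is
    p_second = R(-heading) * (p_first - origin).
    """
    ox, oy = origin_of_second_in_first
    cos_h, sin_h = math.cos(heading_rad), math.sin(heading_rad)

    def transform(p: Point) -> Point:
        dx, dy = p[0] - ox, p[1] - oy  # offset from the second device's origin
        # Apply the inverse rotation R(-heading) to the offset.
        return (cos_h * dx + sin_h * dy, -sin_h * dx + cos_h * dy)

    return transform


# Second device sits at (2, 1) in the first device's frame, rotated 90 degrees.
to_second = make_first_to_second((2.0, 1.0), math.pi / 2)
origin_in_second = to_second((2.0, 1.0))
anchor_in_second = to_second((2.0, 2.0))
```

With this mapping, content anchored at a shared location by the first device can be expressed in the second device's own frame; a full AR system would typically use 3D rigid transforms instead.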
[0049] Map system 210 provides various geolocation functions and supports the presentation of map-based media content and messages by messaging client 104. For example, map system 210 enables the display of user icons or avatars (e.g., stored in profile data 316) on a map to indicate the current or past locations of a user's "friends," as well as media content (e.g., a collection of messages including photos and videos) generated by such friends within the context of the map. For example, a message posted by a user from a specific geolocation to messaging system 100 can be displayed to a specific user's "friends" on the map interface of messaging client 104 within the context of that specific location on the map. A user can also share his or her location and status information with other users of messaging system 100 (e.g., using appropriate status avatars) via messaging client 104, where the location and status information is similarly displayed to the selected users within the context of the map interface of messaging client 104.
[0050] Game system 212 provides various game functions within the context of messaging client 104. Messaging client 104 provides a game interface that offers a list of available games (e.g., web-based games or web-based applications) that can be started by a user within the context of messaging client 104 and played with other users of messaging system 100. Messaging system 100 also enables specific users to invite other users to play a specific game by sending invitations from messaging client 104. Messaging client 104 also supports both voice and text messaging (e.g., chat) within the game context, provides leaderboards for the game, and supports in-game rewards (e.g., game currency and items).
[0051] External resource system 220 provides messaging client 104 with an interface for communicating with external application server 110 to launch or access external resources. Each external resource (application) server 110 hosts applications based on markup languages (e.g., HTML5) or smaller versions of external applications (e.g., games, utilities, payment, or ride-sharing applications outside of messaging client 104). Messaging client 104 can launch a web-based resource (e.g., an application) by accessing an HTML5 file from the external resource (application) server 110 associated with the web-based resource. In some examples, the application hosted by external resource server 110 is programmed in JavaScript using a software development kit (SDK) provided by messaging server 118. The SDK includes application programming interfaces (APIs) with functionality that can be called or activated by the web-based application. In some examples, messaging server 118 includes a JavaScript library that provides access to certain user data of messaging client 104 to a given third-party resource. HTML5 is used as an example technology for programming games, but applications and resources programmed based on other technologies can be used.
[0052] To integrate the SDK's functionality into a web-based resource, the external resource (application) server 110 downloads the SDK from the messaging server 118 or otherwise receives the SDK. Once downloaded or received, the SDK is included as part of the application code of the web-based external resource. The code of the web-based resource can then call or activate certain functions of the SDK to integrate the features of the messaging client 104 into the web-based resource.
[0053] The SDK stored on the messaging server 118 effectively bridges external resources (e.g., third-party or external applications 109 or applets) with the messaging client 104. This provides users with a seamless experience communicating with other users on the messaging client 104 while preserving the appearance of the messaging client 104. To bridge communication between external resources and the messaging client 104, in some examples, the SDK facilitates communication between the external resource server 110 and the messaging client 104. In some examples, a WebViewJavaScriptBridge running on the client device 102 establishes two unidirectional communication channels between the external resources and the messaging client 104. Messages are sent asynchronously between the external resources and the messaging client 104 via these communication channels. Each SDK function activation is sent as a message and a callback. Each SDK function is implemented by constructing a unique callback identifier and sending a message with that callback identifier.
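The callback-identifier messaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's actual interface: the class `CallbackBridge`, its method names, the message fields, and the `get_display_name` function are all hypothetical, and two in-memory queues stand in for the two unidirectional channels.

```python
from collections import deque
from itertools import count
from typing import Any, Callable, Deque, Dict


class CallbackBridge:
    """Illustrative stand-in for two one-way channels between an external
    resource and the messaging client. All names here are hypothetical."""

    def __init__(self, client_functions: Dict[str, Callable[[Any], Any]]) -> None:
        self._client_functions = client_functions
        self._pending: Dict[str, Callable[[Any], None]] = {}
        self._next_id = count(1)
        self._to_client: Deque[dict] = deque()    # external resource -> client
        self._to_resource: Deque[dict] = deque()  # client -> external resource

    def invoke(self, function: str, payload: Any,
               callback: Callable[[Any], None]) -> str:
        """Queue an SDK call as a message tagged with a unique callback id."""
        callback_id = f"cb-{next(self._next_id)}"
        self._pending[callback_id] = callback
        self._to_client.append(
            {"function": function, "payload": payload, "callback_id": callback_id}
        )
        return callback_id

    def pump_client(self) -> None:
        """Client side: handle queued requests and queue the replies."""
        while self._to_client:
            msg = self._to_client.popleft()
            result = self._client_functions[msg["function"]](msg["payload"])
            self._to_resource.append(
                {"callback_id": msg["callback_id"], "result": result}
            )

    def pump_resource(self) -> None:
        """Resource side: route each reply to the callback that is waiting."""
        while self._to_resource:
            msg = self._to_resource.popleft()
            self._pending.pop(msg["callback_id"])(msg["result"])


# Usage: the reply arrives only after both channels have been drained.
replies = []
bridge = CallbackBridge({"get_display_name": lambda _payload: "example-user"})
bridge.invoke("get_display_name", None, replies.append)
bridge.pump_client()
bridge.pump_resource()
```

Because the caller registers a callback and returns immediately, the request and the reply are decoupled in time, which is what allows the messages to be exchanged asynchronously.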
[0054] By using the SDK, not all information from messaging client 104 is shared with external resource server 110. The SDK limits which information is shared based on the needs of the external resource. In some examples, each external resource server 110 provides the messaging server 118 with an HTML5 file corresponding to the web-based external resource. Messaging server 118 can add a visual representation (e.g., box art or other graphics) of the web-based external resource in messaging client 104. Once the user selects the visual representation or instructs messaging client 104, through its GUI, to access features of the web-based external resource, messaging client 104 obtains the HTML5 file and instantiates the resources required to access the features of the web-based external resource.
[0055] The messaging client 104 presents a graphical user interface (GUI) for an external resource (e.g., a login page or title screen). During, before, or after presenting the login page or title screen, the messaging client 104 determines whether the initiated external resource has previously been authorized to access the messaging client 104's user data. In response to determining that the initiated external resource has previously been authorized to access the messaging client 104's user data, the messaging client 104 presents another GUI for the external resource, which includes the external resource's functionality and characteristics. In response to determining that the initiated external resource has not previously been authorized to access the messaging client 104's user data, after a threshold time period (e.g., 3 seconds) of displaying the external resource's login page or title screen, the messaging client 104 slides up a menu for authorizing the external resource to access user data (e.g., animates the menu so that it appears to rise from the bottom of the screen to the middle of the screen or another part of the screen). This menu identifies the type of user data that the external resource will be authorized to use. In response to receiving a user's selection of the accept option, messaging client 104 adds the external resource to the list of authorized external resources, enabling the external resource to access user data from messaging client 104. In some examples, messaging client 104 authorizes external resources to access user data according to the OAuth 2 framework.
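The decision logic of this authorization flow can be sketched as a small state holder. This is an illustrative sketch only: `AuthorizationGate`, its method names, and the screen labels are hypothetical, the 3-second value is the example threshold given above, and no part of the OAuth 2 exchange is modeled.

```python
from dataclasses import dataclass, field
from typing import Set

MENU_DELAY_SECONDS = 3.0  # the example threshold time period from the text


@dataclass
class AuthorizationGate:
    """Tracks which external resources may access user data (hypothetical)."""

    authorized: Set[str] = field(default_factory=set)

    def screen_for(self, resource_id: str, seconds_on_title_screen: float) -> str:
        """Return which GUI to present for the launched external resource."""
        if resource_id in self.authorized:
            return "resource_gui"          # previously authorized: full GUI
        if seconds_on_title_screen >= MENU_DELAY_SECONDS:
            return "authorization_menu"    # slide up the authorization menu
        return "title_screen"              # keep showing login/title screen

    def accept(self, resource_id: str) -> None:
        """User selected the accept option: add to the authorized list."""
        self.authorized.add(resource_id)


gate = AuthorizationGate()
first_view = gate.screen_for("example-game", seconds_on_title_screen=1.0)
```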
[0056] The messaging client 104 controls the type of user data shared with external resources based on the type of authorized external resource. For example, it provides access to a first type of user data (e.g., a two-dimensional avatar of a user, with or without different avatar characteristics) to external resources including full-scale external applications (e.g., third-party or external application 109). As another example, it provides access to a second type of user data (e.g., payment information, a user's two-dimensional avatar, a user's three-dimensional avatar, and avatars with various avatar characteristics) to external resources including smaller versions of external applications (e.g., a web-based version of a third-party application). Avatar characteristics include different ways of customizing the avatar's appearance (e.g., different poses, facial features, clothing, etc.).
[0057] The segmentation estimation system 224 segments the clothing worn by a user depicted in an image (or video), or the clothing worn by multiple users respectively depicted in an image (or video). An illustrative implementation of the segmentation estimation system 224 is shown and described below in connection with Figure 5.
[0058] Specifically, the segmentation estimation system 224 is a component accessible to an AR / VR application implemented on client device 102. The AR / VR application uses an RGB camera to capture monocular images of the user and one or more garments worn by the user. The AR / VR application applies various trained machine learning techniques to the captured images of the user wearing the garments, along with one or more previous frames depicting the user wearing the same garments, to segment the garments worn by the user in the images (e.g., shirts, jackets, trousers, dresses, etc.) and apply one or more visual effects to the captured images. Garment segmentation yields the outlines of the boundaries of the garments appearing in the image or video. Pixels within the boundaries of the segmented garments correspond to the garments or clothing worn by the user. The segmented garments are used to distinguish the clothes or clothing worn by the user from other objects or elements depicted in the image, such as the user's body parts (e.g., arms, head, legs, etc.) and the background of the image. In some implementations, the AR / VR application continuously captures images of the user wearing the garments in real-time or periodically to continuously or periodically update one or more applied visual effects. This allows users to move around in the real world and view one or more visual updates in real time.
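To illustrate how a garment segmentation separates clothing pixels from body and background pixels, the following sketch treats the segmentation as a boolean per-pixel mask, tints only the garment pixels, and derives the garment outline from the mask. The mask representation, the function names, and the tinting effect are assumptions chosen for illustration; the disclosure does not prescribe them.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]  # True where the pixel belongs to a garment


def tint_garment(image: Image, mask: Mask, tint: Pixel,
                 strength: float = 0.5) -> Image:
    """Blend `tint` into garment pixels only; other pixels are unchanged."""
    result = []
    for image_row, mask_row in zip(image, mask):
        row = []
        for pixel, is_garment in zip(image_row, mask_row):
            if is_garment:
                pixel = tuple(
                    round((1 - strength) * channel + strength * tint_channel)
                    for channel, tint_channel in zip(pixel, tint)
                )
            row.append(pixel)
        result.append(row)
    return result


def garment_boundary(mask: Mask) -> Mask:
    """Mark garment pixels that touch a non-garment pixel or the image edge."""
    height, width = len(mask), len(mask[0])

    def is_garment(r: int, c: int) -> bool:
        return 0 <= r < height and 0 <= c < width and mask[r][c]

    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [
        [
            mask[r][c]
            and not all(is_garment(r + dr, c + dc) for dr, dc in neighbours)
            for c in range(width)
        ]
        for r in range(height)
    ]
```

Applying `tint_garment` to each newly captured frame with a fresh mask is one simple way to keep a visual effect attached to the clothing as the user moves.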
[0059] To enable AR / VR applications to directly apply one or more visual effects based on captured RGB images, the AR / VR application obtains a first trained machine learning technique from the segmentation estimation system 224. The first trained machine learning technique processes the captured RGB image to generate a segmentation corresponding to the clothing worn by the user depicted in the captured RGB image. Although this disclosure discusses the application of the segmentation estimation system 224 in segmenting clothing worn by a single user, this disclosure is similarly applicable to detecting and segmenting multiple garments worn by corresponding multiple users simultaneously depicted in the same image, in order to apply corresponding visual effects to their clothing. The AR / VR application also obtains a second trained machine learning technique from the segmentation estimation system 224. The second trained machine learning technique processes one or more previously captured frames (e.g., 1 to 2 seconds of video frames immediately preceding the RGB image) to estimate or predict the segmentation of the clothing worn by the user for subsequent frames. A threshold number of seconds of video frames (which can be user-defined, previously specified, and / or dynamically determined) can be stored continuously or periodically in a buffer, so that the second trained machine learning technique can access the threshold number of seconds of video frames preceding the current RGB image. The output or prediction of the segmentation by the second trained machine learning technique is used to smooth, filter, or improve the segmentation of the garment generated by the first trained machine learning technique.
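The buffering and smoothing steps can be sketched as follows. The two trained techniques themselves are not modeled here. The sketch assumes that each produces a per-pixel garment probability, and it assumes a simple weighted blend as the smoothing rule; the disclosure says only that the prediction is used to smooth, filter, or improve the per-frame segmentation. `FrameBuffer`, `smooth`, `binarize`, and the `weight` parameter are hypothetical names.

```python
from collections import deque
from typing import Any, Deque, List

ProbabilityMask = List[List[float]]  # per-pixel garment probability in [0, 1]


class FrameBuffer:
    """Keeps only the most recent `seconds` worth of frames."""

    def __init__(self, fps: int, seconds: float) -> None:
        self._frames: Deque[Any] = deque(maxlen=max(1, int(fps * seconds)))

    def push(self, frame: Any) -> None:
        self._frames.append(frame)  # the oldest frame is dropped when full

    def frames(self) -> List[Any]:
        return list(self._frames)


def smooth(current: ProbabilityMask, predicted: ProbabilityMask,
           weight: float = 0.5) -> ProbabilityMask:
    """Blend the per-frame estimate with the prediction from prior frames.

    `weight` is the share given to the prediction (an assumed blending rule).
    """
    return [
        [(1 - weight) * c + weight * p for c, p in zip(current_row, predicted_row)]
        for current_row, predicted_row in zip(current, predicted)
    ]


def binarize(mask: ProbabilityMask, threshold: float = 0.5) -> List[List[bool]]:
    """Turn probabilities into a garment / non-garment decision per pixel."""
    return [[value >= threshold for value in row] for row in mask]
```

A larger `weight` suppresses frame-to-frame flicker more strongly but makes the segmentation lag behind fast movement, so in practice it would have to be tuned.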
[0060] During training, the segmentation estimation system 224 obtains a first plurality of input training images comprising depictions of one or more users wearing different clothing. These training images also provide ground truth information related to the segmentation of the clothing worn by the users depicted in each image. A first machine learning technique (e.g., a deep neural network) is trained based on features from the plurality of training images. Specifically, the first machine learning technique extracts one or more features from a given training image and estimates the segmentation of the clothing worn by the users depicted in the given training image. The first machine learning technique obtains ground truth information corresponding to the training images and adjusts or updates one or more coefficients to improve subsequent estimations of clothing segmentation.
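The estimate, compare, and update cycle described above can be illustrated with a deliberately tiny model. The sketch below swaps the deep neural network for a two-coefficient logistic classifier over a single hypothetical per-pixel feature, purely to show how coefficients are adjusted against ground truth labels; the feature, the loss, the learning rate, and all function names are assumptions.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, int]  # (per-pixel feature, ground truth: 1 = garment)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mean_loss(w: float, b: float, samples: List[Sample]) -> float:
    """Mean cross-entropy between estimates and ground truth labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in samples:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)


def train_step(w: float, b: float, samples: List[Sample],
               lr: float) -> Tuple[float, float]:
    """One gradient-descent update of the two coefficients."""
    n = len(samples)
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in samples) / n
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in samples) / n
    return w - lr * grad_w, b - lr * grad_b


def train(samples: List[Sample], steps: int = 2000,
          lr: float = 1.0) -> Tuple[float, float]:
    """Repeat the estimate / compare / update cycle for `steps` iterations."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = train_step(w, b, samples, lr)
    return w, b


# Toy ground truth: high feature values are garment pixels, low ones are not.
toy_samples = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]
w_fit, b_fit = train(toy_samples)
```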
[0061] During training, the segmentation estimation system 224 obtains a first plurality of input training videos (each having a certain number of frames corresponding to a threshold video duration, such as 1 to 2 seconds) depicting one or more users wearing different clothing. These training videos also provide ground truth information on the segmentation of clothing relative to subsequent frames of each video. That is, the first training video can be associated with ground truth information identifying the segmentation of clothing worn by users depicted in frames immediately following a given frame in the first training video. A second machine learning technique (e.g., a neural network) is trained based on features from the plurality of training videos. Specifically, the second machine learning technique extracts one or more features from a given training video and estimates or predicts the segmentation of clothing worn by users in subsequent frames relative to previous frames of the training video. The second machine learning technique obtains ground truth information corresponding to the training videos and adjusts or updates one or more coefficients to improve subsequent estimates of the segmentation of clothing worn by users depicted in subsequent videos.
[0062] Data Architecture
[0063] Figure 3 is a schematic diagram illustrating a data structure 300 that can be stored in a database 126 of a messaging server system 108, according to certain examples. Although the contents of database 126 are shown to include multiple tables, it should be understood that the data can be stored in other types of data structures (e.g., as an object-oriented database).
[0064] Database 126 includes message data stored in message table 302. For any given message, this message data includes at least message sender data, message receiver (or recipient) data, and payload. Further details regarding information that can be included in a message, and that is contained within the message data stored in message table 302, are provided below with reference to Figure 4.
[0065] Entity table 306 stores entity data and is linked (e.g., by reference) to entity graph 308 and profile data 316. Entities whose records are maintained in entity table 306 may include individuals, company entities, organizations, objects, locations, events, etc. Regardless of entity type, any entity about which messaging server system 108 stores data can be an identified entity. Each entity is provided with a unique identifier and an entity type identifier (not shown).
[0066] Entity graph 308 stores information about the relationships and associations between entities. As an example only, such relationships can be social, professional (e.g., working in the same company or organization), interest-based, or activity-based.
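As a rough illustration of how an entity table and an entity graph might fit together, the sketch below keeps entity records keyed by a unique identifier and stores typed, symmetric relationships between them. The class names, field names, and relationship labels are hypothetical; the disclosure does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: str    # unique identifier
    entity_type: str  # e.g. "individual", "organization", "location"


@dataclass
class EntityStore:
    """Entity records plus a graph of typed relationships (hypothetical)."""

    entities: Dict[str, Entity] = field(default_factory=dict)
    graph: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity
        self.graph.setdefault(entity.entity_id, set())

    def relate(self, a: str, b: str, kind: str) -> None:
        """Record a symmetric relationship such as "social" or "professional"."""
        if a not in self.entities or b not in self.entities:
            raise KeyError("both entities must exist before they are related")
        self.graph[a].add((b, kind))
        self.graph[b].add((a, kind))

    def related(self, entity_id: str, kind: str) -> List[str]:
        """Identifiers of entities linked to `entity_id` by `kind`, sorted."""
        return sorted(other for other, k in self.graph[entity_id] if k == kind)


store = EntityStore()
store.add_entity(Entity("u1", "individual"))
store.add_entity(Entity("u2", "individual"))
store.add_entity(Entity("org1", "organization"))
store.relate("u1", "u2", "social")
store.relate("u1", "org1", "professional")
```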
[0067] Profile data 316 stores various types of profile data about a specific entity. Based on privacy settings specified by the specific entity, profile data 316 can be selectively used and presented to other users of messaging system 100. In the case of an individual, profile data 316 includes, for example, a username, phone number, address, settings (e.g., notification and privacy settings), and an avatar representation (or a set of such avatar representations) selected by the user. A specific user can then selectively include one or more of these avatar representations within the content of messages transmitted via messaging system 100 and on map interfaces displayed to other users by messaging client 104. The set of avatar representations may include “status avatars,” which present a graphical representation of a status or activity that the user can choose to communicate at a specific time.
[0068] In the case that the entity is a group, in addition to the group name, members and various settings for the relevant group (e.g., notifications), the profile data 316 for the group may similarly include one or more avatars associated with the group.
[0069] Database 126 also stores enhancement data, such as overlays or filters, in enhancement table 310. The enhancement data is associated with and applied to videos (video data is stored in video table 304) and images (image data is stored in image table 312).
[0070] Database 126 can also store data related to individual and shared AR sessions. This data may include data transmitted between the AR session client controller of a first client device 102 and another AR session client controller of a second client device 102, as well as data transmitted between the AR session client controller and the enhancement system 208. The data may include a common coordinate system for establishing a shared AR scene, transformations between devices, session identifiers, images depicting the body, skeletal joint positions, wrist joint positions, feet, etc.
[0071] In one example, a filter is an overlay displayed on an image or video during presentation to the receiving user. Filters can be of various types, including user-selected filters from a set of filters presented to the sending user by the messaging client 104 as the sending user is composing a message. Other types of filters include geolocation filters (also known as geographic filters), which can be presented to the sending user based on geolocation. For example, a geolocation filter specific to a nearby or specific location can be presented by the messaging client 104 within the user interface based on geolocation information determined by the Global Positioning System (GPS) unit of the client device 102.
[0072] Another type of filter is a data filter, which can be selectively presented to the sending user by the messaging client 104 based on other inputs or information collected by the client device 102 during the message creation process. Examples of data filters include the current temperature at a specific location, the current speed of the sending user, the battery life of the client device 102, or the current time.
[0073] Other augmented data that can be stored in image table 312 includes augmented reality content items (e.g., items corresponding to an applied augmented reality experience). Augmented reality content items or augmented reality items can be real-time special effects and sounds that can be added to images or videos.
[0074] As described above, augmented data includes augmented reality content items, overlays, image transformations, AR images, and similar items involving modifications that can be applied to image data (e.g., video or images). This includes real-time modifications that modify images as they are captured by the device sensors (e.g., one or more cameras) of client device 102 and then display the modified image on the screen of client device 102. This also includes modifications to stored content (e.g., video clips in a gallery that can be modified). For example, in client device 102, which has access to multiple augmented reality content items, a user can use a single video clip with multiple augmented reality content items to see how different augmented reality content items will modify the stored clip. For example, by selecting different augmented reality content items for the content, multiple augmented reality content items with different pseudo-random motion models can be applied to the same content. Similarly, real-time video capture can be used with the illustrated modifications to show how the video image currently captured by the sensors of client device 102 will modify the captured data. Such data can be displayed on the screen without being stored in memory, or the content captured by the device's sensors can be recorded and stored in memory with or without modification (or both). In some systems, preview features can simultaneously show how different augmented reality content items look in different windows on the display. This can, for example, make it possible to view multiple windows with different pseudo-random animations on the display at the same time.
[0075] Accordingly, data and the various systems that use augmented reality content items, or other such transformation systems that use this data to modify content, can involve: the detection of objects (e.g., faces, hands, bodies, cats, dogs, surfaces, objects, etc.) in video frames; tracking such objects as they leave, enter, and move around within the field of view; and modifying or transforming such objects while they are being tracked. In various examples, different methods can be used to implement such transformations. Some examples may involve generating a 3D mesh model of one or more objects, and using transformations and animated textures of the model within the video to implement the transformation. In other examples, tracking points on the object can be used to place an image or texture (which can be two-dimensional or three-dimensional) at the tracked location. In yet another example, neural network analysis of video frames can be used to place images, models, or textures within content (e.g., frames of images or videos). Augmented reality content items therefore involve both the images, models, and textures used to create transformations within the content, and the additional modeling and analysis information required to implement such transformations using object detection, tracking, and placement.
[0076] Real-time video processing can be performed using any type of video data (e.g., video streams, video files, etc.) stored in the memory of any type of computerized system. For example, a user can load a video file and store it in the device's memory, or the device's sensors can be used to generate a video stream. Furthermore, computer-animated models can be used to process any object, such as a human face and body parts, animals, or inanimate objects (e.g., chairs, cars, or other objects).
[0077] In some examples, when a specific modification is selected along with the content to be transformed, the computing device identifies the elements to be transformed and then detects and tracks them if they are present in the frames of the video. The elements of the object are modified according to the modification request, thereby transforming the frames of the video stream. Different methods can be used to transform the frames of the video stream for different types of transformations. For example, for frame transformations that primarily involve changing the form of elements of an object, feature points of each element of the object are calculated (e.g., using an Active Shape Model (ASM) or other known methods). A mesh based on the feature points is then generated for each of the at least one element of the object. This mesh is used in subsequent stages of tracking the elements of the object in the video stream. During tracking, the mesh for each element is aligned with the position of that element. Additional points are then generated on the mesh. A set of first points is generated for each element based on the modification request, and a set of second points is generated for each element based on the set of first points and the modification request. The frames of the video stream can then be transformed by modifying the elements of the object based on the set of first points, the set of second points, and the mesh. In this method, the background of the modified object can also be changed or distorted by tracking and modifying the background.
[0078] In some examples, transformations that alter some regions of an object using its elements can be performed by calculating characteristic points for each element of the object and generating a mesh based on those calculated characteristic points. Points are generated on the mesh, and then various regions are generated based on those points. The elements of the object are then tracked by aligning the region for each element with the position of each of the at least one element, and the properties of the regions can be modified based on the modification request, thereby transforming frames of the video stream. Depending on the specific modification request, the properties of the mentioned regions can be transformed in different ways. Such modifications may involve changing the color of the region; removing at least some portions of the region from the frames of the video stream; including one or more new objects into the region based on the modification request; and modifying or distorting the elements of the region or object. In various examples, any combination of such modifications or other similar modifications may be used. For certain models to be animated, some feature points can be selected as control points in the entire state space to determine options for animating the model.
[0079] In some examples of computer animation models that use face detection to transform image data, a specific face detection algorithm (e.g., Viola-Jones) is used to detect faces in the image. The Active Shape Model (ASM) algorithm is then applied to the facial regions of the image to detect facial feature reference points.
[0080] Other methods and algorithms suitable for face detection can be used. For example, in some examples, landmarks are used to locate features; landmarks represent distinguishable points present in most of the images considered. For example, for facial landmarks, the location of the left pupil could be used. If the initial landmarks are not identifiable (e.g., in the case of a person wearing an eye patch), secondary landmarks can be used. Such a landmark identification process can be used for any such object. In some examples, a set of landmarks forms a shape. The shape can be represented as a vector using the coordinates of the points in the shape. One shape is aligned with another shape using a similarity transformation (which allows for translation, scaling, and rotation) that minimizes the average Euclidean distance between the points of the shapes. The mean shape is the average of the aligned training shapes.
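The shape alignment step can be sketched for two-dimensional landmarks as follows. As an assumption made for the sake of a closed-form solution, the sketch minimizes the mean squared distance between corresponding points (a common least-squares variant) rather than the average Euclidean distance itself, and it represents each point as a complex number so that rotation and uniform scaling become a single complex multiplication. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def align_similarity(shape: Sequence[Point],
                     target: Sequence[Point]) -> List[Point]:
    """Translate, rotate, and uniformly scale `shape` onto `target`.

    Least-squares fit over corresponding points (same order, same count).
    """
    z = [complex(x, y) for x, y in shape]
    w = [complex(x, y) for x, y in target]
    n = len(z)
    mean_z, mean_w = sum(z) / n, sum(w) / n
    z_centered = [p - mean_z for p in z]
    w_centered = [p - mean_w for p in w]
    energy = sum(abs(p) ** 2 for p in z_centered)
    if energy == 0:
        raise ValueError("shape must contain at least two distinct points")
    # A single complex factor encodes both the rotation and the scale.
    a = sum(p.conjugate() * q for p, q in zip(z_centered, w_centered)) / energy
    aligned = [a * p + mean_w for p in z_centered]
    return [(p.real, p.imag) for p in aligned]


def mean_shape(aligned_shapes: Sequence[Sequence[Point]]) -> List[Point]:
    """Point-wise average of shapes that have already been aligned."""
    count = len(aligned_shapes)
    return [
        (sum(s[i][0] for s in aligned_shapes) / count,
         sum(s[i][1] for s in aligned_shapes) / count)
        for i in range(len(aligned_shapes[0]))
    ]
```

In a training setting, each shape would typically be aligned to a common reference with `align_similarity` before `mean_shape` is applied to the aligned set.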
[0081] In some examples, a landmark search begins with an average shape aligned with the position and size of a face determined by a global face detector. This search then repeats the following steps: proposing a provisional shape by template matching the image texture around each point to adjust the position of the shape points, and then conforming the provisional shape to a global shape model until convergence occurs. In some systems, individual template matching is unreliable, and the shape model pools the results of weak template matching to form a stronger overall classifier. The entire search is repeated at each level of the image pyramid from coarse to fine resolution.
[0082] The transformation system can capture image or video streams on a client device (e.g., client device 102) and perform complex image manipulations locally on client device 102 while maintaining an appropriate user experience, computation time, and power consumption. Complex image manipulations can include size and shape changes, mood shifts (e.g., changing a face from frowning to smiling), state shifts (e.g., aging a subject, reducing apparent age, changing gender), style shifts, application of graphic elements, and any other suitable image or video manipulations implemented by a convolutional neural network that has been configured to perform efficiently on client device 102.
[0083] In some examples, a computer animation model for transforming image data can be used by a system in which a user can capture an image or video stream (e.g., a selfie) using a client device 102 with a neural network, wherein the neural network operates as part of a messaging client 104 operating on the client device 102. A transformation system operating within the messaging client 104 determines the presence of a face within the image or video stream and provides a modification icon associated with the computer animation model used to transform the image data, or the computer animation model may exist in association with the interface described herein. This modification icon includes changes that may be the basis for modifying the user's face within the image or video stream as part of a modification operation. Once a modification icon is selected, the transformation system initiates processing to transform the user's image to reflect the selected modification icon (e.g., generating a smiling face for the user). Once the image or video stream is captured and the specified modification is selected, the modified image or video stream can be presented in a graphical user interface displayed on the client device 102. The transformation system may implement a complex convolutional neural network for a portion of the image or video stream to generate and apply the selected modification. In other words, users can capture image or video streams, and once a modification icon is selected, the modified result can be presented to the user in real-time or near real-time. Furthermore, the modification can persist while the video stream is being captured and the selected modification icon remains toggled on. Machine learning neural networks can be used to achieve such modifications.
[0084] The graphical user interface (GUI) presenting the modifications performed by the transformation system can provide users with additional interactive options. Such options can be based on the interface used to initiate the selection and content capture of a specific computer-animated model (e.g., from a content creator user interface). In various examples, the modifications can be persistent after the modification icon is initially selected. Users can turn the modification on or off by tapping or otherwise selecting the face being modified by the transformation system, and can save it for later viewing or browse to other areas of the imaging application. In cases where multiple faces have been modified by the transformation system, users can globally toggle the modifications on or off by tapping or selecting a single face modified and displayed within the GUI. In some examples, individual faces within a group of multiple faces can be modified separately, or such modifications can be toggled individually by tapping or selecting a single face or a series of faces displayed within the GUI.
[0085] Story table 314 stores data relating to collections of messages and associated image, video, or audio data, compiled into collections (e.g., stories or galleries). The creation of a specific collection can be initiated by a specific user (e.g., each user whose records are maintained in entity table 306). A user can create a “personal story” in the form of a collection of content that has already been created and sent / broadcast by that user. For this purpose, the user interface of messaging client 104 may include user-selectable icons that allow the sending user to add specific content to his or her personal story.
[0086] The collection can constitute a "live story," which is a collection of content from multiple users created manually, automatically, or using a combination of manual and automatic technologies. For example, a "live story" can constitute a curated stream of user-submitted content from various locations and events. Users whose client devices have location services enabled and are at a common location event at a specific time can be presented with the option to contribute content to a specific live story, for example, via the user interface of messaging client 104. The messaging client 104 can identify a live story to a user based on their location. The end result is a "live story" told from a community perspective.
[0087] Another type of content collection is called a "location story," which allows users whose client devices 102 are located in a specific geographic location (e.g., on a college or university campus) to contribute to a specific collection. In some implementations, contributing to a location story may require secondary authentication to verify that the end user belongs to a specific organization or other entity (e.g., a student on a university campus).
[0088] As mentioned above, video table 304 stores video data, which, in one example, is associated with messages whose records are maintained within message table 302. Similarly, image table 312 stores image data associated with messages whose message data is stored in entity table 306. Entity table 306 can associate various enhancements from enhancement table 310 with various images and videos stored in image table 312 and video table 304.
[0089] The trained machine learning technique 307 stores parameters that have been trained during the training of the segmentation estimation system 224. For example, the trained machine learning technique 307 stores trained parameters of one or more neural network machine learning techniques.
[0090] The segmentation training images 309 store multiple images, each depicting one or more users wearing different clothing. The stored images include various depictions of one or more users wearing different clothing, together with segmentations of the clothing that indicate which pixels in the image correspond to clothing and which pixels correspond to the background or to body parts of the user. That is, the segmentation provides the boundaries of the clothing depicted in the image. These segmentation training images 309 are used by the segmentation estimation system 224 to train a first machine learning technique to generate segmentations of one or more garments depicted in received RGB monocular images. In some cases, the segmentation training images 309 include ground truth skeletal keypoints of one or more bodies depicted in the corresponding training monocular images to enhance segmentation performance for various distinguishing attributes of the clothing (e.g., shoulder straps, collars, or sleeves). In some cases, the segmentation training images 309 include multiple image resolutions of the bodies depicted in the images. The segmentation training images 309 may include labeled and unlabeled image and video data. The segmentation training images 309 may include a full-body depiction of a specific user, an image lacking a depiction of any user (e.g., a negative image), depictions of multiple users wearing different clothing, and depictions of users wearing clothing at different distances from the image capture device.
[0091] The segmentation training images 309 store multiple videos (video clips of 1 to 2 seconds) depicting one or more users wearing different clothing. The multiple videos also include ground truth information identifying the segmentation of clothing depicted in the current frame or subsequent frames relative to previous frames in each of the multiple videos. These segmentation training images 309 are used by the segmentation estimation system 224 to train a second machine learning technique to predict clothing segmentation for subsequent frames based on received RGB monocular videos of users wearing clothing.
[0092] Data communication architecture
[0093] Figure 4 is a schematic diagram illustrating the structure of a message 400, according to some examples, generated by a messaging client 104 for transmission to another messaging client 104 or to the messaging server 118. The content of a particular message 400 is used to populate the message table 302 stored within the database 126, which is accessible by the messaging server 118. Similarly, the content of a message 400 is stored in memory as "in-transit" or "in-flight" data of the client device 102 or the application server 114. The message 400 is shown to include the following example components.
[0094] • Message Identifier 402: A unique identifier that identifies message 400.
[0095] • Message text payload 404: Text generated by the user via the user interface of the client device 102 and included in message 400.
[0096] • Message image payload 406: Image data captured by the camera component of the client device 102 or retrieved from the memory component of the client device 102 and included in the message 400. The image data for the sent or received message 400 can be stored in the image table 312.
[0097] • Message video payload 408: Video data captured by the camera device component or retrieved from the memory component of the client device 102 and included in message 400. The video data for the sent or received message 400 can be stored in video table 304.
[0098] • Message audio payload 410: Audio data captured by the microphone or retrieved from the memory component of the client device 102 and included in message 400.
[0099] • Message enhancement data 412: Enhancement data (e.g., filters, labels, or other annotations or enhancements) representing enhancements to be applied to the message image payload 406, message video payload 408, or message audio payload 410 of message 400. Enhancement data for sent or received message 400 can be stored in enhancement table 310.
[0100] • Message duration parameter 414: A parameter value, in seconds, indicating the amount of time that the content of the message (e.g., message image payload 406, message video payload 408, message audio payload 410) will be presented to the user or made accessible to the user via the messaging client 104.
[0101] • Message geolocation parameter 416: Geographic location data (e.g., latitude and longitude coordinates) associated with the message's content payload. The payload may include multiple message geolocation parameter 416 values, each of which is associated with a content item included in the content (e.g., a specific image within the message image payload 406, or a specific video within the message video payload 408).
[0102] • Message Story Identifier 418: An identifier value that identifies one or more sets of content (e.g., “Stories” identified in Story Table 314) associated with a specific content item in the message image payload 406 of message 400. For example, the identifier value can be used to associate multiple images within the message image payload 406 with multiple sets of content, respectively.
[0103] • Message Tag 420: Each message 400 can be labeled with multiple tags, each of which indicates the subject of the content included in the message payload. For example, in the case where a specific image in the message image payload 406 depicts an animal (e.g., a lion), a tag value indicating the relevant animal can be included in the message tag 420. Tag values can be manually generated based on user input, or can be automatically generated using, for example, image recognition.
[0104] • Message sender identifier 422: An identifier (e.g., a messaging system identifier, email address, or device identifier) that indicates the user of the client device 102 on which message 400 is generated and from which message 400 is sent.
[0105] • Message receiver identifier 424: An identifier (e.g., a messaging system identifier, email address, or device identifier) that indicates the user of the client device 102 to which message 400 is addressed.
[0106] The content (e.g., values) of each component of message 400 can be pointers to locations in tables where content data values are stored. For example, an image value in message image payload 406 can be a pointer to a location (or an address of a location) within image table 312. Similarly, a value in message video payload 408 can point to data stored in video table 304, a value stored in message enhancement data 412 can point to data stored in enhancement table 310, a value stored in message story identifier 418 can point to data stored in story table 314, and values stored in message sender identifier 422 and message receiver identifier 424 can point to user records stored in entity table 306.
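As an illustration only, the message layout and the pointer-style references described above might be sketched as follows. The class name, field names, and the dictionary standing in for image table 312 are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical in-memory stand-in for image table 312 (key -> image bytes).
image_table: Dict[int, bytes] = {}


@dataclass
class Message:
    """Illustrative layout of message 400; all field names are assumptions."""
    message_id: str                                       # message identifier 402
    text: str = ""                                        # message text payload 404
    image_refs: List[int] = field(default_factory=list)   # payload 406: keys into image_table
    video_refs: List[int] = field(default_factory=list)   # payload 408: keys into a video table
    duration_s: Optional[float] = None                    # duration parameter 414, in seconds
    tags: List[str] = field(default_factory=list)         # message tag 420
    sender_id: str = ""                                   # sender identifier 422
    receiver_id: str = ""                                 # receiver identifier 424


def resolve_images(msg: Message, table: Dict[int, bytes]) -> List[bytes]:
    """Follow each stored reference to the image data it points to."""
    return [table[ref] for ref in msg.image_refs]


# Usage: the message carries a reference (7), not the image bytes themselves.
image_table[7] = b"jpeg-bytes"
msg = Message(message_id="m-1", text="hi", image_refs=[7], duration_s=10.0,
              tags=["lion"], sender_id="alice", receiver_id="bob")
```

Storing keys rather than payload bytes keeps the message small and lets the same stored image be referenced from several messages or collections.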
[0107] Segmentation estimation system
[0108] Figure 5 is a block diagram illustrating an example segmentation estimation system 224 according to some examples. The segmentation estimation system 224 includes a set of components 510 that operate on a set of input data (e.g., monocular images 501 depicting a real-world user's body in clothing, segmentation training image data 502, monocular video 503 depicting a user in clothing, and clothing segmentation training video data 504). During the training phase, this set of input data is obtained from the segmentation training images 309 stored in the database (Figure 3); when an AR / VR application is in use, for example by the messaging client 104, it is obtained from the RGB camera device of the client device 102. The segmentation estimation system 224 includes a first machine learning technique module 512, a skeletal keypoint module 511, a segmentation module 514, a second machine learning technique module 517, a smoothing segmentation module 516, an image modification module 518, a visual effects selection module 519, a 3D body tracking module 513, a full-body segmentation module 515, and an image display module 520.
[0109] During training, the segmentation estimation system 224 receives a given training image from the segmentation training image data 502 (e.g., a monocular image 501 depicting the real-world body of a user wearing clothing). The image may depict a user wearing an upper garment (e.g., a short-sleeved shirt, T-shirt, long-sleeved shirt, jacket, vest, or sweater), a lower garment (e.g., trousers or a skirt), a full-body garment (e.g., a dress or overcoat), or any suitable combination thereof, or may depict multiple users simultaneously wearing corresponding combinations of upper, lower, or full-body garments. The segmentation estimation system 224 applies one or more machine learning techniques to the given training image using the first machine learning technique module 512. The first machine learning technique module 512 extracts one or more features from the given training image to estimate the segmentation of the clothing worn by the user depicted in the image. For example, the first machine learning technique module 512 obtains a given training image depicting a user wearing a shirt. The first machine learning technique module 512 extracts features from the image and segments or specifies which pixels in the image correspond to the shirt worn by the user, and which pixels correspond to the background or to body parts of the user. That is, the segmentation output by the first machine learning technique module 512 identifies the boundaries of the clothing (e.g., the shirt) worn by the user in the given training image.
[0110] The first machine learning technique module 512 retrieves clothing segmentation information associated with the given training image. The first machine learning technique module 512 compares the estimated segmentation (which, in the case of multiple users in the image, may include identifiers of multiple garments worn by the corresponding users in the image) with the ground truth clothing segmentation provided as part of the segmentation training image data 502. Based on a difference threshold or deviation of the comparison, the first machine learning technique module 512 updates one or more coefficients and obtains one or more additional segmentation training images. After processing a specified number of epochs or batches of training images and / or when the difference threshold or deviation reaches a specified value, the first machine learning technique module 512 completes training, and the parameters and coefficients of the first machine learning technique module 512 are stored in the trained machine learning technique 307.
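The compare, update, and stop cycle described above can be sketched with a deliberately tiny stand-in model. The disclosure refers to neural network techniques; the single-feature logistic classifier, the cross-entropy loss, and all names and hyperparameters below are assumptions chosen only to keep the sketch self-contained.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w: float, b: float, xs, ys) -> float:
    """Deviation between estimated and ground truth labels (binary cross-entropy)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train(xs, ys, max_epochs=2000, loss_target=0.05, lr=1.0):
    """Update coefficients until the epoch budget or the loss target is reached."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        if mean_loss(w, b, xs, ys) <= loss_target:
            break  # deviation reached the specified value
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
        grad_b = sum(errors) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data: one feature per pixel; label 1 = clothing, 0 = background or body.
pixel_features = [0.9, 0.8, 0.1, 0.2]
ground_truth = [1, 1, 0, 0]
w, b = train(pixel_features, ground_truth)
```

The returned coefficients play the role of the parameters that would be stored once training completes.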
[0111] In some examples, the first machine learning technique module 512 implements multiple segmentation models of the first machine learning technique. Each segmentation model of the first machine learning technique can be trained on a different set of training images associated with a specific resolution. That is, one of the segmentation models can be trained to estimate clothing segmentation for an image with a first resolution (or a first resolution range). A second of the segmentation models can be trained to estimate clothing segmentation for an image with a second resolution (or a second resolution range different from the first resolution range). In this way, versions of the first machine learning technique with different complexities can be trained and stored. When a given device with certain capabilities uses an AR / VR application, a corresponding clothing segmentation model from the various clothing segmentation models can be provided to perform clothing segmentation matching the capabilities of the given device. In some cases, multiple clothing segmentation models of each machine learning technique implemented by the segmentation estimation system 224 can be provided, each clothing segmentation model being configured to operate at a different level of complexity. The segmentation model with the appropriate level of complexity can then be provided to the client device 102 for segmenting clothing depicted in one or more images.
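One way to picture matching a model to a device is the selection rule sketched below. The model names, the relative cost figures, and the notion of a numeric device budget are hypothetical; the disclosure does not specify how capability is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentationModel:
    name: str
    input_size: int  # side length, in pixels, of the images the model was trained on
    cost: int        # hypothetical relative compute cost


def pick_model(models: List[SegmentationModel], device_budget: int) -> SegmentationModel:
    """Return the highest-resolution model the device can afford, else the cheapest one."""
    affordable = [m for m in models if m.cost <= device_budget]
    if affordable:
        return max(affordable, key=lambda m: m.input_size)
    return min(models, key=lambda m: m.cost)


MODELS = [
    SegmentationModel("small", 128, 1),
    SegmentationModel("medium", 256, 4),
    SegmentationModel("large", 512, 16),
]
```

For example, `pick_model(MODELS, 5)` selects the "medium" model, because the "large" model exceeds the assumed budget.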
[0112] In some examples, during training, the first machine learning technique module 512 receives 2D skeletal joint information from the skeletal keypoint module 511. The skeletal keypoint module 511 tracks the skeletal keypoints (e.g., head joints, shoulder joints, hip joints, leg joints, etc.) of a user depicted in a given training image and provides 2D or 3D coordinates of the skeletal keypoints. The first machine learning technique module 512 uses this information to identify distinguishing attributes of clothing depicted in the training images. A process for detecting and tracking skeletal keypoints is discussed in commonly owned U.S. Patent Application No. 16 / 949,607 to Assouline et al., filed November 6, 2020, which is incorporated herein by reference in its entirety.
[0113] The clothing segmentation generated by the first machine learning technique module 512 is provided to the segmentation module 514. The segmentation module 514 can determine that the elbow joint output by the skeletal keypoint module 511 lies within a threshold distance of a given edge of the boundary of the clothing segmentation. In response, the segmentation module 514 can determine that the clothing corresponds to a T-shirt or short-sleeved shirt, and that the given edge corresponds to the sleeve of the shirt. In this case, the segmentation module 514 can adjust the loss function, or the weights used to update the parameters of the first machine learning technique module 512, to improve the segmentation of the upper-body clothing (e.g., a shirt). More specifically, the segmentation module 514 can determine, based on a comparison of skeletal joint positions with the boundary of the generated clothing segmentation, that a given distinguishing attribute is present in the clothing segmentation. In this case, the segmentation module 514 adjusts the loss function or the weights used to update the parameters of the first machine learning technique module 512 for training images depicting clothing with that distinguishing attribute. Similarly, the segmentation module 514 can adjust the loss or parameter weights based on the difference between the clothing segmentation and the pixels corresponding to the background of the image.
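The elbow-to-boundary test described above can be sketched as follows. The binary mask representation, the Euclidean distance measure, and the doubling of a per-sample loss weight are assumptions for illustration; the disclosure does not state how the loss adjustment is computed.

```python
from math import hypot


def boundary_pixels(mask):
    """Garment pixels with at least one non-garment 4-neighbour (image edge counts)."""
    h, w = len(mask), len(mask[0])
    out = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if rr < 0 or rr >= h or cc < 0 or cc >= w or not mask[rr][cc]:
                    out.append((r, c))
                    break
    return out


def elbow_near_boundary(mask, elbow, threshold):
    """True if the elbow keypoint (row, col) lies within `threshold` of the boundary."""
    return any(hypot(r - elbow[0], c - elbow[1]) <= threshold
               for r, c in boundary_pixels(mask))


def sample_loss_weight(mask, elbow, threshold, base=1.0, boost=2.0):
    """Hypothetical rule: up-weight samples that appear to show a short sleeve."""
    return base * boost if elbow_near_boundary(mask, elbow, threshold) else base


# 5x5 mask with a 3x3 garment region in the middle.
MASK = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
```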
[0114] During training, segmentation estimation system 224 receives a given training video (e.g., a monocular video 503 depicting a user wearing clothing (or a combination of clothing) or multiple users simultaneously wearing the same clothing (or a combination of clothing) in a video) from segmentation training image data 502. Segmentation estimation system 224 applies one or more machine learning techniques to the given training video using a second machine learning technique module 517. The second machine learning technique module 517 extracts one or more features from the given training video to predict clothing segmentation for frames following the current or previous frame of the video. For example, the second machine learning technique module 517 obtains a given training video depicting the movement of a user wearing clothing (or a combination of clothing) across a set of frames in a 1- to 2-second video. The second machine learning technique module 517 extracts features from the video to predict clothing segmentation for clothing (or combinations of clothing) worn by one or more users in frames following the current or previous frame of the video.
[0115] The second machine learning technique module 517 predicts clothing segmentation for one or more subsequent frames following a given training video frame. For example, the second machine learning technique module 517 may process frames 2 through 25 of a given video to predict the clothing segmentation for clothing worn by a user depicted in frame 26 of the same video. The second machine learning technique module 517 compares the predicted clothing segmentation with a ground truth clothing segmentation provided as part of the segmentation training image data 502. The ground truth clothing segmentation may be provided for frame 26 with respect to the movement of the clothing worn by the user depicted in frames 2 through 25. Based on a difference threshold or deviation of the comparison, the second machine learning technique module 517 updates one or more coefficients and obtains one or more additional segmentation training videos. After processing a specified number of epochs or batches of training videos and / or when the difference threshold or deviation reaches a specified value, the second machine learning technique module 517 completes training, and the parameters and coefficients of the second machine learning technique module 517 are stored in the trained machine learning technique 307.
[0116] Specifically, the second machine learning technique module 517 processes a sequence of video frames spanning the 1 to 2 seconds immediately preceding the current frame. The second machine learning technique module 517 analyzes the movement of the clothing segmentation within this sequence of video frames to predict the clothing segmentation in the current frame or in frames following the current frame.
[0117] After training, the segmentation estimation system 224 receives an input image 501 as a single RGB image (e.g., a monocular image depicting a user wearing clothing or multiple users wearing corresponding clothing) from the client device 102. The segmentation estimation system 224 applies a first trained machine learning technique module 512 to the received input image 501 to extract one or more features representing the segmentation of one or more garments depicted in image 501. The segmentation estimation system 224 also receives a set of frames depicting the same user or multiple users wearing corresponding clothing in monocular video 503 captured prior to image 501. The segmentation estimation system 224 applies a second trained machine learning technique module 517 to the received monocular video 503 to generate a prediction or estimate of clothing segmentation in subsequent frames (e.g., generating a prediction of clothing segmentation in the current frame (e.g., in image 501)). That is, the segmentation estimation system 224 generates clothing segmentation based on the current input image and generates predictions of segmentation occurring in the current input image based on frames of one or more previously received images or videos.
[0118] The smoothing segmentation module 516 compares the estimated clothing segmentation for the current frame generated by the first machine learning technique module 512 with the predicted clothing segmentation generated by the second machine learning technique module 517. The smoothing segmentation module 516 adjusts, smooths, or filters any differences between the two clothing segmentations to generate a smoothed clothing segmentation for the current frame. In some cases, the smoothing segmentation module 516 generates a clothing segmentation boundary based on the smoothed segmentation. The clothing segmentation boundary indicates a set of pixels located at the boundaries or edges of the clothing segmentation. That is, clothing segmentation boundary pixels correspond to pixels between the background and the clothing, and pixels between the clothing and the body parts of the user depicted in the image. In some cases, the clothing segmentation boundary has a specified width (e.g., 3 to 4 pixels), in which case the clothing segmentation boundary represents the edge of the clothing segmentation with that specified width. In some cases, the smoothing segmentation module 516 applies a guided filter to the smoothed segmentation to improve the quality of the portion of the smoothed clothing segmentation that lies within a specified number of pixels of the edges of the smoothed segmentation.
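A minimal sketch of the two operations described above is given below. The disclosure does not specify the smoothing operation; a weighted per-pixel average followed by thresholding is assumed here purely for illustration, and the function and parameter names are hypothetical. The guided filter step is not shown.

```python
def smooth_segmentation(current, predicted, alpha=0.5, threshold=0.5):
    """Blend per-pixel garment probabilities from the two estimates, then binarise.

    `current` is the per-frame estimate and `predicted` is the estimate derived
    from earlier frames; `alpha` weights the current frame (an assumed parameter).
    """
    return [
        [1 if alpha * c + (1 - alpha) * p >= threshold else 0
         for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, predicted)
    ]


def boundary_band(mask, width=1):
    """Garment pixels within `width` pixels (Chebyshev distance) of a non-garment pixel.

    Positions outside the image are treated as non-garment.
    """
    h, w = len(mask), len(mask[0])
    band = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for rr in range(r - width, r + width + 1):
                for cc in range(c - width, c + width + 1):
                    inside = 0 <= rr < h and 0 <= cc < w
                    if not inside or not mask[rr][cc]:
                        band.add((r, c))
    return band


current = [[0.9, 0.8], [0.2, 0.6]]
predicted = [[1.0, 0.6], [0.1, 0.2]]
smoothed = smooth_segmentation(current, predicted)

square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
```

In this toy input the lower-right pixel is garment in the current estimate (0.6) but not in the prediction (0.2); the blend (0.4) falls below the threshold, so the disagreement is resolved toward background.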
[0119] The number of previous video frames analyzed by the second machine learning technique module 517, or the duration of the previously received video segment, can be set by the user, can be predetermined, or can be dynamically adjusted. In some cases, if the amount of filtering or correction applied to the segmentation determined by the first machine learning technique module 512 exceeds a specified threshold, the number of previous video frames analyzed can be increased (e.g., from 1 second of video to 2 seconds of video). In some cases, the number of previous video frames analyzed can be increased (e.g., from 1 second of video to 2 seconds of video) based on the distance between the user and the camera device exceeding a specified threshold.
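The dynamic adjustment described above might be expressed as the rule below. The threshold values, the use of a changed-pixel fraction as the measure of correction, and the frame rate are all assumptions; the disclosure gives only the 1 second and 2 second example durations.

```python
def choose_history_seconds(correction_fraction, user_distance_m,
                           base_seconds=1.0, extended_seconds=2.0,
                           correction_threshold=0.10, distance_threshold_m=3.0):
    """Extend the analysed history when smoothing corrects a lot or the user is far away.

    `correction_fraction` is a hypothetical measure: the fraction of pixels whose
    label was changed by smoothing in the most recent frame.
    """
    if (correction_fraction > correction_threshold
            or user_distance_m > distance_threshold_m):
        return extended_seconds
    return base_seconds


def frames_to_keep(seconds, fps=30):
    """Number of buffered frames for the chosen history (fps is an assumption)."""
    return int(round(seconds * fps))
```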
[0120] In some cases, the first machine learning module 512 is trained to generate segmentation of upper body clothing. In this case, when only an image of the user's legs is presented, the first machine learning module 512 does not generate clothing segmentation. In such examples, the first machine learning module 512 is trained using training images depicting a user wearing upper body clothing. Similarly, the second machine learning module 517 is trained to generate predictions for upper body clothing segmentation. In such examples, the second machine learning module 517 is trained using training videos depicting a user wearing upper body clothing. In these examples, when an image depicting the user's full body is received, the segmentation estimation system 224 only provides segmentation of the upper body clothing (e.g., a shirt or jacket) worn by the user and ignores or does not provide segmentation of any other clothing worn by the user (e.g., pants).
[0121] Figure 6 is a graphical representation of the output of the segmentation estimation system 224 according to some examples. Specifically, Figure 6 shows multiple garment segments 600 generated by the smoothing segmentation module 516. In one example, the smoothing segmentation module 516 generates a first garment segment 610, which represents the pixel positions of a dress worn by the user. In another example, the smoothing segmentation module 516 generates a second garment segment 612, which represents the pixel positions of a short-sleeved shirt worn by the user. In yet another example, the smoothing segmentation module 516 generates a third garment segment 614, which represents the pixel positions of a jacket worn by the user.
[0122] Referring back to Figure 5, the visual effects selection module 519 receives a selection of a virtualization mode from the client device 102. For example, a list of mode options may be presented to a user of an AR / VR application. In response to receiving a user's selection of a given mode option from the list, the given mode is provided to the visual effects selection module 519 as the selection of the virtualization mode. Mode options may include: replacing clothing with another clothing option; recoloring pixels of the clothing (e.g., replacing the color of each pixel in the image that falls within the clothing segment with a target pixel color, texture, or animation); applying an animation or video to the region within the clothing segment (e.g., replacing each pixel in the image that falls within the clothing segment with a target animation or video); rendering ripples, flashes, or grains on the boundary or portions of the boundary of the clothing segment (e.g., applying one or more augmented reality elements to the region of the image corresponding to the boundary of the clothing segment); removing clothing (e.g., setting all pixel values within the clothing segment to a specified value, such as black or white); applying contour effects to the clothing segment; and adjusting the display position and occlusion mode of virtual clothing (e.g., trousers) displayed adjacent to the clothing (e.g., a shirt) corresponding to the clothing segment. The virtualization mode selection controls how the segmentation of the clothing worn by the user affects the way visual elements are displayed in the image relative to the user. Figure 7A, Figure 7B, Figure 8, and Figure 9 show illustrative outputs of one or more of the options that can be selected by the visual effects selection module 519.
[0123] The image modification module 518 can adjust the image captured by the camera device based on the mode selected by the visual effects selection module 519 and the smoothed clothing segmentation received from the smoothing segmentation module 516. For example, the image modification module 518 adjusts how the clothing worn by the user is presented in the image by changing the color or occlusion mode of the clothing based on the clothing segmentation. The image display module 520 combines the adjustments made by the image modification module 518 into the received monocular image depicting the user's body. This image is provided by the image display module 520 to the client device 102 and can then be sent to another user or stored for later access and display.
[0124] In some examples, the image modification module 518 receives 3D body tracking information representing the 3D position of the user depicted in the image from the 3D body tracking module 513. The 3D body tracking module 513 generates the 3D body tracking information by processing the image 501 and the monocular video 503 using a fifth machine learning technique and a sixth machine learning technique. As an example, the fifth trained machine learning technique processes the captured RGB image to generate skeletal joint positions corresponding to the body depicted in the captured RGB image. The sixth trained machine learning technique processes one or more previously captured frames (e.g., video frames immediately preceding the RGB image by 1 to 2 seconds) to estimate or predict skeletal joint positions for subsequent frames. Video frames at a threshold number of seconds (which may be user-defined, previously specified, and / or dynamically determined) can be stored continuously or periodically in a buffer, allowing access to video frames preceding the current RGB image via the sixth trained machine learning technique. The output or prediction of skeletal joint positions from the sixth trained machine learning technique is used to smooth, filter, or improve the skeletal joint positions generated by the fifth trained machine learning technique. This generates the 3D skeletal joint positions of the user depicted in the image.
[0125] The image modification module 518 can also receive a full-body segmentation, which indicates which pixels in the image correspond to the user's entire body. The full-body segmentation can be received from the full-body segmentation module 515. The full-body segmentation module 515 generates the full-body segmentation by processing the image 501 and the monocular video 503 using a third machine learning technique and a fourth machine learning technique. As an example, the third trained machine learning technique processes the captured RGB image to generate a segmentation corresponding to the body depicted in the captured RGB image. The fourth trained machine learning technique processes one or more previously captured frames (e.g., video frames immediately preceding the RGB image by 1 to 2 seconds) to estimate or predict the segmentation for subsequent frames. Video frames spanning a threshold number of seconds (which may be user-defined, previously specified, and / or dynamically determined) can be stored continuously or periodically in a buffer, allowing the fourth trained machine learning technique to access video frames preceding the current RGB image. The output or prediction of the segmentation by the fourth trained machine learning technique is used to smooth, filter, or improve the segmentation generated by the third trained machine learning technique. This produces a full-body segmentation for the user.
[0126] As an example, the image modification module 518 can control the display of another virtual or augmented reality element based on the clothing segmentation provided by the smoothing segmentation module 516 and based on the user's 3D body tracking position and the user's full-body segmentation. Specifically, the image modification module 518 can control the occlusion mode of the virtual clothing relative to the real-world clothing corresponding to the clothing segmentation. For example, the image modification module 518 can adjust or modify the image captured by the client device 102 in real time using a layered, multi-part approach. In the first layer or first part, the image modification module 518 can use the user's 3D body tracking position to adjust the pose of the virtual (or augmented reality) clothing to be displayed over the user depicted in the received image. That is, if the user's left leg is angled relative to the user's torso in a given direction, the virtual clothing is similarly angled to cover the user's leg. In the second layer or second part, the user's full-body segmentation is used to place the pixels of the virtual clothing on the corresponding parts of the user's body; for example, when the virtual clothing corresponds to pants, it is used to place the pixels of the virtual clothing on the user's legs. The full-body segmentation is also used to blend the pixels of the virtual clothing with the background portion adjacent to the user's body part (e.g., legs) corresponding to the virtual clothing (e.g., pants). In the third layer or third part, the image modification module 518 can use the clothing segmentation corresponding to the real-world clothing (e.g., a shirt) depicted in the received image to occlude the virtual clothing according to an occlusion mode (e.g., real-world clothing occludes virtual clothing, or virtual clothing occludes real-world clothing). That is, the image modification module 518 determines which part of the virtual clothing is occluded by the pixels of the real-world clothing, and / or determines which part of the real-world clothing pixels is occluded by the pixels of the virtual clothing.
[0127] For example, the visual effects selection module 519 can receive input selecting a virtual pants garment to be added to a captured image or video. In response, the image modification module 518 can access 3D body tracking information about the user depicted in the image from the 3D body tracking module 513 to position the virtual pants garment so that it matches the currently depicted pose of the user's legs. Based on the full-body segmentation received from the full-body segmentation module 515, the image modification module 518 positions the virtual pants garment over the user's real-world pants in a manner that blends with the background of the image.
[0128] In one example, the image modification module 518 can select an occlusion mode for the virtual pants garment based on the current occlusion mode of the real-world clothing worn by the user. That is, the image modification module 518 can determine that the real-world pants worn by the user cover a real-world shirt in the image. In this case, the image modification module 518 sets the occlusion mode such that the virtual pants garment covers the portion of the real-world shirt corresponding to the clothing segmentation received from the smoothing segmentation module 516. As another example, the image modification module 518 can select an occlusion mode for the virtual pants garment based on the occlusion mode associated with the selected virtual pants garment. That is, the image modification module 518 can determine that the virtual pants garment is associated with an occlusion mode in which the virtual pants cover a real-world shirt in the image. In this case, the image modification module 518 sets the occlusion mode such that the virtual pants garment covers the portion of the real-world shirt corresponding to the clothing segmentation received from the smoothing segmentation module 516. As yet another example, the image modification module 518 can determine that the virtual pants garment is associated with an occlusion mode in which the virtual pants are occluded by a real-world shirt in the image. In this case, the image modification module 518 sets the occlusion mode so that the virtual pants garment is covered by the portion of the real-world shirt corresponding to the clothing segmentation received from the smoothing segmentation module 516.
[0129] Image modification module 518 determines which subset of pixels of the real-world shirt covers a subset of pixels of the virtual clothing. If the occlusion mode indicates that the virtual trousers clothing occludes the real-world shirt clothing, then image modification module 518 replaces the subset of pixels of the real-world shirt clothing with the subset of pixels of the virtual clothing. If the occlusion mode indicates that the real-world shirt clothing occludes the virtual trousers clothing, then image modification module 518 replaces the subset of pixels of the virtual trousers clothing with the subset of pixels of the real-world shirt clothing.
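The pixel replacement described above can be illustrated with a minimal sketch. It assumes the frame, the rendered virtual garment, and both masks are same-sized 2D lists; the names `OcclusionMode` and `composite_virtual_garment` are hypothetical and do not come from this disclosure.

```python
from enum import Enum


class OcclusionMode(Enum):
    # Hypothetical names for the two occlusion modes described in the text.
    VIRTUAL_OVER_REAL = "virtual_over_real"  # virtual trousers cover the real shirt
    REAL_OVER_VIRTUAL = "real_over_virtual"  # real shirt covers the virtual trousers


def composite_virtual_garment(frame, virtual, virtual_mask, clothing_mask, mode):
    """Return a new frame with the virtual garment drawn per the occlusion mode.

    frame, virtual: 2D lists of pixel values with identical shape.
    virtual_mask: True where the rendered virtual garment has a pixel.
    clothing_mask: True where the clothing segmentation marks real clothing.
    """
    out = [row[:] for row in frame]  # copy so the input frame is not mutated
    for y, row in enumerate(frame):
        for x in range(len(row)):
            if not virtual_mask[y][x]:
                continue  # the virtual garment does not touch this pixel
            overlaps_real = clothing_mask[y][x]
            if overlaps_real and mode is OcclusionMode.REAL_OVER_VIRTUAL:
                continue  # keep the real-world clothing pixel on top
            out[y][x] = virtual[y][x]
    return out
```

Only the overlapping subset of pixels depends on the mode; virtual garment pixels outside the clothing segmentation are drawn in both cases.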
[0130] Figures 7A, 7B, 8, and 9 are graphical representations of the output of the segmentation estimation system according to some examples. In one example, as shown in Figure 7A, segmentation estimation system 224 can modify the color of pixels in a monocular image or video that correspond to real-world clothing worn by a user. Specifically, segmentation estimation system 224 can receive monocular image 700. As discussed above, segmentation estimation system 224 can generate a smoothed segmentation (e.g., smoothed segmentation 610). Segmentation estimation system 224 can determine that the clothing segmentation of the user's clothing corresponds to pixel set 710. Segmentation estimation system 224 can receive user input navigating user interface elements to modify the color of the clothing pixels. In response, segmentation estimation system 224 generates an image in which the pixel set 710 corresponding to the clothing (e.g., pixels within the clothing segmentation) has been replaced with another color 712.
[0131] Specifically, as shown in Figure 7A, the visualization mode selected by the user may include a selection of a recoloring option. In response to receiving a selection of this option, the segmentation estimation system 224 identifies pixels (e.g., pixel set 710) in the monocular image that correspond to the clothing in the monocular image based on the clothing segmentation (e.g., segmentation 610). The segmentation estimation system 224 replaces the pixel set 710 of the monocular image with a different color 712, image, or video.
[0132] In some cases, the visualization mode selected by the user may also include texture modification options. In this case, in addition to, or as an alternative to, replacing the pixel color of pixels within the clothing segmentation, the segmentation estimation system 224 also replaces the texture of the pixels depicting the user's clothing in the image. Specifically, as shown in Figure 7B, the segmentation estimation system 224 identifies a set of pixels 720 in the monocular image corresponding to the clothing worn by the user based on a segmentation (e.g., segmentation 610). The segmentation estimation system 224 modifies the texture of one or more portions of the pixel set 720 to another texture 722. Specifically, the segmentation estimation system 224 replaces the pixel set 720 in the monocular image corresponding to the clothing worn by the user with target pixel values corresponding to a different image, video, or texture 722. In some cases, the target pixel values to which the portion of the pixel set is modified are selected by the user, for example, by navigating user interface elements.
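As a rough illustration of the recoloring and texture options, the sketch below replaces every pixel inside a clothing mask with either a single color or a tiled texture. The function name `replace_clothing_pixels`, the RGB-tuple pixel format, and the tiling behavior are assumptions made for this example; the disclosure does not specify how the target pixel values are mapped onto the clothing.

```python
def replace_clothing_pixels(image, clothing_mask, fill):
    """Return a copy of image with masked pixels replaced by fill.

    image: 2D list of (r, g, b) tuples.
    clothing_mask: 2D list of booleans, True inside the clothing segmentation.
    fill: either one (r, g, b) tuple (recolor) or a 2D list of tuples
          (texture), which is tiled across the image (an assumption).
    """
    out = [row[:] for row in image]
    is_texture = isinstance(fill, list)
    for y, mask_row in enumerate(clothing_mask):
        for x, inside in enumerate(mask_row):
            if not inside:
                continue  # pixels outside the segmentation are left untouched
            if is_texture:
                out[y][x] = fill[y % len(fill)][x % len(fill[0])]
            else:
                out[y][x] = fill
    return out
```

Calling it once with a color tuple and once with a small texture grid covers both options with the same masking logic.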
[0133] In some cases, the user-selected visualization mode may also include occlusion options. In this case, in addition to, or as an alternative to, replacing the pixels of the clothing, the segmentation estimation system 224 adds one or more graphic elements to the pixels of the clothing depicted in the image. Specifically, the segmentation estimation system 224 identifies a set 710 of pixels in the monocular image corresponding to the clothing worn by the user based on a segmentation (e.g., segmentation 610). The segmentation estimation system 224 retrieves graphic elements (e.g., augmented reality wings, horns, or equipment, or other user-selected graphic elements). The segmentation estimation system 224 selects a display position for the graphic elements based on user input or based on the type of the retrieved graphic elements. The segmentation estimation system 224 identifies a subset of pixels within the set of pixels corresponding to the clothing worn by the user. The segmentation estimation system 224 adds the retrieved graphic elements near or in place of the identified subset of pixels. For example, the segmentation estimation system 224 adds wings to a set of pixels associated with the sleeves of the clothing segmentation.
[0134] In some cases, in response to receiving a selection of a contour option, the segmentation estimation system 224 accesses the segmentation boundaries of the clothing worn by the user depicted in the image. In this case, the segmentation estimation system 224 may add graphical elements extending from the segmentation boundaries. For example, the segmentation estimation system 224 adds shadows or glows around the segmentation boundaries of the clothing worn by the user depicted in the image. This results in glows, flashes, ripples, grains, or shadows appearing around or behind the edges of the clothing worn by the user depicted in the image. Specifically, the segmentation estimation system 224 identifies the boundaries of the clothing worn by the user based on the edges of the pixel set of the clothing segmentation. The segmentation estimation system 224 adds glow, flash, ripple, grain, or shadow graphical elements along the boundaries of the clothing worn by the user.
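One possible way to realize the contour option is sketched below: boundary pixels are taken to be mask pixels with at least one 4-connected neighbor outside the mask, and a one-pixel glow is painted on the background pixels touching that boundary. The function names, the 4-connectivity rule, and the one-pixel glow width are illustrative assumptions rather than details taken from this disclosure.

```python
NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connected neighborhood


def segmentation_boundary(mask):
    """Return the set of (y, x) mask pixels touching a non-mask pixel or the image edge."""
    height, width = len(mask), len(mask[0])
    boundary = set()
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                outside_image = not (0 <= ny < height and 0 <= nx < width)
                if outside_image or not mask[ny][nx]:
                    boundary.add((y, x))
                    break
    return boundary


def add_glow(image, mask, glow_value):
    """Return a copy of image with glow_value painted just outside the boundary."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in image]
    for y, x in segmentation_boundary(mask):
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width and not mask[ny][nx]:
                out[ny][nx] = glow_value  # only background pixels are changed
    return out
```

A wider or softer glow could be approximated by repeating the dilation step with decreasing intensity.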
[0135] Figure 8 shows an example in which virtual clothing is displayed alongside real-world clothing according to a first occlusion mode. For example, a first image 800 depicts a user in a first pose. The user is wearing real-world clothing 810, and a segmentation estimation system 224 generates a clothing segmentation for real-world clothing 810. A visual effects selection module 519 can receive input selecting virtual trousers clothing 820 to be added to image 800. In response, an image modification module 518 can access 3D body tracking information about the user depicted in image 800 from a 3D body tracking module 513 to position the virtual trousers clothing 820 in a manner similar to the currently depicted pose of the user's legs corresponding to the virtual trousers clothing. Based on the full-body segmentation received from a full-body segmentation module 515, the image modification module 518 positions the virtual trousers clothing 820 above the user's legs in a manner that blends with the background of image 800.
[0136] Image modification module 518 can select an occlusion mode for the virtual trousers clothing in which the virtual trousers clothing covers the real-world shirt in the image. Specifically, image modification module 518 sets the occlusion mode such that the virtual trousers clothing 820 covers the portion of the real-world clothing 810 (e.g., a short-sleeved shirt) corresponding to the clothing segmentation received from smoothing segmentation module 516. Image modification module 518 determines which subset 812 of pixels in the real-world clothing 810 covers a subset of pixels in the virtual clothing 820. Image modification module 518 replaces the subset 812 of pixels in the covered area of the real-world clothing 810 within the clothing segmentation of the real-world clothing 810 with the subset of pixels in the virtual clothing 820. This gives the illusion that the real-world clothing 810 is tucked into the virtual clothing 820.
[0137] As the user moves around in the video, the positioning and occlusion modes change in real time. For example, when a subsequent frame of the video is received as the second image 801, the position of the real-world clothing 810 in the second image 801 changes relative to the first image 800. As a result, an updated clothing segmentation is generated and used to control the occlusion mode of the real-world clothing 810 relative to the virtual clothing 820. Specifically, the image modification module 518 can access 3D body tracking information about the user depicted in image 801 from the 3D body tracking module 513 to update the pose of the virtual trousers clothing 820 in a manner similar to the currently depicted pose of the user's legs corresponding to the virtual trousers clothing. Based on the full-body segmentation received from the full-body segmentation module 515, the image modification module 518 positions the virtual trousers clothing 820 above the user's legs in a manner that blends with the background of image 801.
[0138] Image modification module 518 updates the occlusion mode so that from the first image 800 to the second image 801, the virtual trousers clothing 820 still appears to cover the real-world clothing 810. That is, in the second image 801, the virtual trousers clothing 820 covers the portion of the real-world clothing 810 (e.g., a short-sleeved shirt) corresponding to the clothing segmentation received from the smoothing segmentation module 516. In this case, image modification module 518 determines which subset 814 of pixels of the real-world clothing 810 covers the subset of pixels of the virtual clothing 820 in the second image 801 relative to the first image 800. Image modification module 518 replaces the subset of pixels 814 of the real-world clothing 810 within the covered area of the clothing segmentation of the real-world clothing 810 with the subset of pixels of the virtual clothing 820.
[0139] Figure 9 shows an example of displaying virtual clothing alongside real-world clothing according to a second occlusion mode. For example, a first image 900 depicts a user in a first pose. The user is wearing real-world clothing 910, and a segmentation estimation system 224 generates a clothing segmentation for real-world clothing 910. A visual effects selection module 519 can receive input selecting virtual trousers clothing 920 to be added to image 900. In response, an image modification module 518 can access 3D body tracking information about the user depicted in image 900 from a 3D body tracking module 513 to position the virtual trousers clothing 920 in a manner similar to the currently depicted pose of the user's legs corresponding to the virtual trousers clothing. Based on the full-body segmentation received from a full-body segmentation module 515, the image modification module 518 positions the virtual trousers clothing 920 above the user's legs in a manner that blends with the background of image 900.
[0140] Image modification module 518 can select an occlusion mode for virtual trousers clothing 920 in which real-world clothing 910 covers virtual trousers clothing 920. Specifically, image modification module 518 sets the occlusion mode such that virtual trousers clothing 920 is covered by a portion of real-world clothing 910 (e.g., a short-sleeved shirt) corresponding to the clothing segmentation received from smoothing segmentation module 516. Image modification module 518 determines which subset of pixels of real-world clothing 910 covers a subset of pixels 922 of virtual clothing 920. Image modification module 518 replaces the subset of pixels 922 of virtual trousers clothing 920 with the subset of pixels of real-world clothing 910 within the covered area of the clothing segmentation of real-world clothing 910. This gives the illusion that real-world clothing 910 is draped over virtual clothing 920.
[0141] As the user moves around in the video, the positioning and occlusion modes change in real time. For example, when a subsequent frame of the video is received as the second image 901, the position of the real-world clothing 910 in the second image 901 changes relative to the first image 900. Therefore, an updated clothing segmentation is generated and used to control the occlusion mode of the real-world clothing 910 relative to the virtual clothing 920. Specifically, the image modification module 518 can access 3D body tracking information about the user depicted in image 901 from the 3D body tracking module 513 to update the pose of the virtual trousers clothing 920 in a manner similar to the currently depicted pose of the user's legs corresponding to the virtual trousers clothing. Based on the full-body segmentation received from the full-body segmentation module 515, the image modification module 518 positions the virtual trousers clothing 920 above the user's legs in a manner that blends with the background of image 901.
[0142] Image modification module 518 updates the occlusion mode so that from the first image 900 to the second image 901, the virtual trousers clothing 920 still appears to be covered by the real-world clothing 910. That is, in the second image 901, the virtual trousers clothing 920 is covered by the portion of the real-world clothing 910 (e.g., a short-sleeved shirt) corresponding to the clothing segmentation received from the smoothing segmentation module 516. In this case, image modification module 518 determines which subset of pixels 924 of the virtual trousers clothing 920 is covered by a subset of pixels of the real-world clothing 910 in the second image 901 relative to the first image 900. Image modification module 518 replaces the subset of pixels 924 of the virtual trousers clothing 920 with the subset of pixels of the real-world clothing 910 within the covered area of the clothing segmentation of the real-world clothing 910.
[0143] Figure 10A is a flowchart of a process 1000 for generating a clothing segmentation of a user depicted in an image, according to some examples. Although a flowchart can describe operations as a sequential process, many of these operations can be performed in parallel or simultaneously. Furthermore, the order of operations can be rearranged. The process terminates when its operations are completed. A process can correspond to a method, a program, etc. The steps of a method can be performed in whole or in part, can be combined with some or all of the steps from other methods, and can be performed by any number of different systems or any part thereof (e.g., a processor included in any system).
[0144] At operation 1001, client device 102 receives a monocular image that includes a depiction of a user wearing clothing. For example, segmentation estimation system 224 may capture images depicting one or more users (e.g., multiple users) wearing corresponding clothing.
[0145] At operation 1002, client device 102 generates a segmentation of the clothing worn by the user in the monocular image. As an example, segmentation estimation system 224 can generate multiple segments of multiple garments worn by the user in the image by applying the first machine learning technique module 512 to the image.
[0146] At operation 1003, as discussed above, client device 102 accesses a video feed comprising multiple monocular images received prior to the monocular image. For example, segmentation estimation system 224 may apply second machine learning technology module 517 to multiple previously received images, each depicting clothing worn by the user, to predict clothing segmentation of the user's clothing for the currently received image.
[0147] At operation 1004, as discussed above, client device 102 uses a video feed to smooth the segmentation of the user's clothing generated from the monocular image to provide a smoothed clothing segmentation. For example, segmentation estimation system 224 can calculate the deviation between one or more segments of one or more garments generated by the first machine learning technology module 512 and the predicted segmentation generated by the second machine learning technology module 517. Segmentation estimation system 224 can modify the segmentation generated by the first machine learning technology module 512 based on the deviation.
[0148] At operation 1005, as discussed above, client device 102 applies one or more visual effects to a monocular image based on the smoothed clothing segmentation. For example, segmentation estimation system 224 may replace the clothing with another garment, recolor the pixels of the clothing, apply animation or video to the region within the clothing segmentation, render ripples, flashes, or grains to the boundaries or portions of the clothing segmentation, remove the clothing, apply contour effects to the clothing segmentation, and adjust the display position and occlusion mode of virtual clothing (e.g., trousers) displayed adjacent to or next to the clothing (e.g., shirt) corresponding to the clothing segmentation.
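The smoothing step of operation 1004 can be sketched as follows. The disclosure states only that a deviation between the generated and predicted segmentations is computed and that the generated segmentation is modified based on that deviation. The per-pixel probability representation, the mean absolute deviation, and the capped blending weight used here are assumptions chosen for illustration, as are the function names.

```python
def mean_abs_deviation(current, predicted):
    """Mean absolute per-pixel difference between two segmentation maps."""
    total = 0.0
    count = 0
    for cur_row, pred_row in zip(current, predicted):
        for c, p in zip(cur_row, pred_row):
            total += abs(c - p)
            count += 1
    return total / count


def smooth_segmentation(current, predicted, max_blend=0.5):
    """Blend the current segmentation toward the predicted one.

    current: per-pixel clothing probabilities from the current frame
             (standing in for the output of the first module).
    predicted: per-pixel probabilities predicted from earlier frames
               (standing in for the output of the second module).
    max_blend: hypothetical cap on how far the result may move toward
               the prediction.
    """
    deviation = mean_abs_deviation(current, predicted)
    weight = min(max_blend, deviation)  # larger deviation, stronger correction
    return [
        [(1.0 - weight) * c + weight * p for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, predicted)
    ]
```

With this choice, a frame that agrees with the prediction passes through unchanged, while a frame that jumps away from it is pulled partway back, which is one way to reduce flicker between frames.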
[0149] Figure 10B is a flowchart of a process 1010 for generating a clothing segmentation of clothing depicted in an image, according to some examples. Although a flowchart can describe operations as a sequential process, many of these operations can be performed in parallel or simultaneously. Furthermore, the order of operations can be rearranged. The process terminates when its operations are completed. A process can correspond to a method, a program, etc. The steps of a method can be performed in whole or in part, can be combined with some or all of the steps from other methods, and can be performed by any number of different systems or any part thereof (e.g., a processor included in any system).
[0150] At operation 1011, as discussed above, client device 102 receives a monocular image that includes a depiction of a user wearing clothing.
[0151] At operation 1012, as discussed above, client device 102 generates a segmentation of the clothing worn by the user based on the monocular image.
[0152] At operation 1013, as discussed above, client device 102 receives input to select a visualization mode.
[0153] At operation 1014, as discussed above, client device 102 applies one or more visual effects corresponding to the visualization mode to the monocular image based on clothing segmentation.
[0154] Machine structure
[0155] Figure 11 is a schematic representation of machine 1100, within which instructions 1108 (e.g., software, program, application, app, or other executable code) can be executed to cause machine 1100 to perform any one or more of the methods discussed herein. For example, instructions 1108 can cause machine 1100 to perform any one or more of the methods described herein. Instructions 1108 transform the general, unprogrammed machine 1100 into a specific machine 1100 programmed to perform the described and illustrated functions in the described manner. Machine 1100 can operate as a standalone device or can be coupled (e.g., networked) to other machines. In a networked deployment, machine 1100 can operate as a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. Machine 1100 may include, but is not limited to: server computers, client computers, personal computers (PCs), tablet computers, laptop computers, netbooks, set-top boxes (STBs), personal digital assistants (PDAs), entertainment media systems, cellular phones, smartphones, mobile devices, wearable devices (e.g., smartwatches), smart home devices (e.g., smart appliances), other smart devices, web devices, network routers, network switches, network bridges, or any machine capable of sequentially or otherwise executing instructions 1108 specifying actions to be taken by machine 1100. Furthermore, although only a single machine 1100 is shown, the term "machine" should also be considered to include a collection of machines that individually or jointly execute instructions 1108 to perform any one or more of the methods discussed herein. For example, machine 1100 may include client device 102 or any of several server devices forming part of message transceiver server system 108.
In some examples, machine 1100 may also include both client and server systems, wherein certain operations of a particular method or algorithm are performed on the server side and certain operations of said particular method or algorithm are performed on the client side.
[0156] Machine 1100 may include a processor 1102, a memory 1104, and an input/output (I/O) component 1138 that can be configured to communicate with each other via a bus 1140. In an example, processor 1102 (e.g., a central processing unit (CPU), a reduced instruction set computing (RISC) processor, a complex instruction set computing (CISC) processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a radio frequency integrated circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, processor 1106 and processor 1110 that execute instructions 1108. The term "processor" is intended to include multi-core processors, which may include two or more independent processors (sometimes referred to as "cores") capable of executing instructions simultaneously. Although Figure 11 shows multiple processors 1102, machine 1100 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.
[0157] Memory 1104 includes main memory 1112, static memory 1114, and storage unit 1116, all of which are accessible by processor 1102 via bus 1140. Main memory 1112, static memory 1114, and storage unit 1116 store instructions 1108 embodying any one or more of the methods or functions described herein. During execution by machine 1100, instructions 1108 may also reside wholly or partially in main memory 1112, in static memory 1114, in machine-readable medium 1118 within storage unit 1116, in at least one processor of processor 1102 (e.g., in the processor's cache memory), or in any suitable combination thereof.
[0158] I/O component 1138 may include various components for receiving input, providing output, generating output, sending information, exchanging information, capturing measurement results, etc. The specific I/O component 1138 included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones may include touch input devices or other such input mechanisms, while headless server machines will likely not include such touch input devices. It should be recognized that I/O component 1138 may include many other components not shown in Figure 11. In various examples, I/O component 1138 may include user output component 1124 and user input component 1126. User output component 1124 may include visual components (e.g., displays such as plasma display panels (PDPs), light-emitting diode (LED) displays, liquid crystal displays (LCDs), projectors, or cathode ray tubes (CRTs)), acoustic components (e.g., speakers), haptic components (e.g., vibration motors, resistance mechanisms), other signal generators, etc. User input component 1126 may include alphanumeric input components (e.g., keyboards, touchscreens configured to receive alphanumeric input, photoelectric keyboards, or other alphanumeric input components), point-based input components (e.g., mice, touchpads, trackballs, joysticks, motion sensors, or other pointing instruments), haptic input components (e.g., physical buttons, touchscreens that provide the position and force of a touch or touch gesture, or other haptic input components), audio input components (e.g., microphones), etc.
[0159] In other examples, I/O component 1138 may include biometric component 1128, motion component 1130, environmental component 1132, or positioning component 1134, as well as various other components. For example, biometric component 1128 includes components for detecting expressions (e.g., hand expressions, facial expressions, voice expressions, body posture, or eye tracking), measuring biosignals (e.g., blood pressure, heart rate, body temperature, sweating, or brain waves), and identifying people (e.g., voice recognition, retinal recognition, facial recognition, fingerprint recognition, or EEG-based recognition). Motion component 1130 includes accelerometer components (e.g., accelerometers), gravity sensor components, and rotation sensor components (e.g., gyroscopes).
[0160] The environmental component 1132 includes, for example, one or more camera devices (with still image / photograph and video capabilities), lighting sensor components (e.g., photometers), temperature sensor components (e.g., one or more thermometers for detecting ambient temperature), humidity sensor components, pressure sensor components (e.g., barometers), acoustic sensor components (e.g., one or more microphones for detecting background noise), proximity sensor components (e.g., infrared sensors for detecting nearby objects), gas sensors (e.g., gas detection sensors for detecting the concentration of hazardous gases or measuring pollutants in the atmosphere for safety purposes), or other components that can provide indications, measurements, or signals corresponding to the surrounding physical environment.
[0161] Regarding the camera device, client device 102 may have a camera device system including, for example, a front-facing camera on the front surface of client device 102 and a rear-facing camera on the rear surface of client device 102. The front-facing camera may be used, for example, to capture still images and videos (e.g., "selfies") of the user of client device 102, which can then be enhanced with the aforementioned enhancement data (e.g., filters). For example, the rear-facing camera may be used to capture still images and videos in a more conventional camera device mode, wherein these images are similarly enhanced with enhancement data. In addition to the front-facing and rear-facing cameras, client device 102 may also include a 360° camera device for capturing 360° photos and videos.
[0162] Furthermore, the camera system of the client device 102 may include dual rear cameras (e.g., a main camera and a depth-sensing camera), or even include triple, quadruple, or quintuple rear camera configurations on the front and rear sides of the client device 102. For example, these multiple camera systems may include wide-angle cameras, ultra-wide-angle cameras, telephoto cameras, macro cameras, and depth sensors.
[0163] The positioning component 1134 includes a position sensor component (e.g., a GPS receiver component), an altitude sensor component (e.g., an altimeter or barometer that detects air pressure and can determine altitude based on air pressure), an orientation sensor component (e.g., a magnetometer), etc.
[0164] A wide variety of technologies can be used to implement communication. I/O component 1138 also includes communication component 1136, which is operable to couple machine 1100 to network 1120 or device 1122 via a corresponding coupling or connection. For example, communication component 1136 may include a network interface component or other suitable device that interfaces with network 1120. In further examples, communication component 1136 may include wired communication components, wireless communication components, cellular communication components, near field communication (NFC) components, low-power wireless components, and other communication components that provide communication via other modes. Device 1122 can be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via USB).
[0165] Furthermore, the communication component 1136 may detect identifiers or include components capable of operating to detect identifiers. For example, the communication component 1136 may include a radio frequency identification (RFID) tag reader component, an NFC smart tag detection component, an optical reader component (e.g., an optical sensor for detecting one-dimensional barcodes such as Universal Product Code (UPC) barcodes, multi-dimensional barcodes such as Quick Response (QR) codes, Aztec codes, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D barcodes, and other optical codes), or an acoustic detection component (e.g., a microphone for identifying the tagged audio signal). Additionally, various information can be derived via the communication component 1136, such as location via Internet Protocol (IP) geolocation, etc. Location can be determined by signal triangulation or by detecting NFC beacon signals that indicate a specific location.
[0166] Various memories (e.g., main memory 1112, static memory 1114, and the memory of processor 1102) and storage unit 1116 may store one or more sets of instructions and data structures (e.g., software) embodying or used by any one or more of the methods or functions described herein. When executed by processor 1102, these instructions (e.g., instructions 1108) cause various operations to implement the disclosed examples.
[0167] Instructions 1108 can be sent or received over network 1120 using a transmission medium via a network interface device (e.g., the network interface component included in communication component 1136) and using any of several known transmission protocols (e.g., Hypertext Transfer Protocol (HTTP)). Similarly, instructions 1108 can be sent or received via a transmission medium through coupling with device 1122 (e.g., peer-to-peer coupling).
[0168] Software Architecture
[0169] Figure 12 is a block diagram 1200 illustrating a software architecture 1204 that can be installed on any one or more of the devices described herein. The software architecture 1204 is supported by hardware, such as a machine 1202 including a processor 1220, memory 1226, and I/O components 1238. In this example, the software architecture 1204 can be conceptualized as a stack of layers, where each layer provides a specific function. The software architecture 1204 includes layers such as an operating system 1212, libraries 1210, frameworks 1208, and applications 1206. Operationally, application 1206 invokes API calls 1250 via the software stack and receives messages 1252 in response to API calls 1250.
[0170] Operating system 1212 manages hardware resources and provides common services. Operating system 1212 includes, for example, a kernel 1214, services 1216, and drivers 1222. Kernel 1214 serves as an abstraction layer between hardware and other software layers. For example, kernel 1214 provides functions such as memory management, processor management (e.g., scheduling), component management, networking, and security settings. Services 1216 can provide other common services to other software layers. Driver 1222 is responsible for controlling the underlying hardware or interfacing with the underlying hardware. For example, driver 1222 may include display drivers, camera drivers, low-power wireless drivers, flash memory drivers, serial communication drivers (e.g., USB drivers), audio drivers, power management drivers, etc.
[0171] Library 1210 provides a shared low-level infrastructure used by application 1206. Library 1210 may include system library 1218 (e.g., the C standard library) that provides functions such as memory allocation, string manipulation, and mathematical functions. Furthermore, library 1210 may include API library 1224, such as media libraries (e.g., libraries for supporting the rendering and manipulation of various media formats, such as Moving Picture Experts Group-4 (MPEG4), Advanced Video Coding (H.264 or AVC), Moving Picture Experts Group Layer-3 (MP3), Advanced Audio Coding (AAC), Adaptive Multi-Rate (AMR) audio codecs, Joint Photographic Experts Group (JPEG or JPG), or Portable Network Graphics (PNG)), graphics libraries (e.g., OpenGL frameworks for rendering graphic content on a display in two-dimensional (2D) and three-dimensional (3D) formats), database libraries (e.g., SQLite providing various relational database functions), web libraries (e.g., WebKit providing web browsing capabilities), and so on. Library 1210 may also include various other libraries 1228 to provide many other APIs to application 1206.
[0172] Frameworks 1208 provide a common high-level infrastructure used by applications 1206. For example, frameworks 1208 provide various graphical user interface (GUI) functions, high-level resource management, and high-level location services. Frameworks 1208 can provide a wide range of other APIs that can be used by applications 1206, some of which may be specific to a particular operating system or platform.
[0173] In an example, applications 1206 may include a home application 1236, a contact application 1230, a browser application 1232, a book reader application 1234, a location application 1242, a media application 1244, a messaging application 1246, a game application 1248, and a wide variety of other applications such as an external application 1240. Applications 1206 are programs that execute the functions defined in those programs. One or more of the applications 1206 can be created using various programming languages, such as object-oriented programming languages (e.g., Objective-C, Java, or C++) or procedural programming languages (e.g., C or assembly language). In a particular example, external application 1240 (e.g., an application developed using an ANDROID™ or iOS™ software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as iOS™, ANDROID™, or another mobile operating system. In this example, external application 1240 can invoke API calls 1250 provided by operating system 1212 to facilitate the functions described herein.
[0174] Glossary
[0175] "Carrier signal" refers to any intangible medium capable of storing, encoding, or carrying instructions for machine execution, and includes digital or analog communication signals or other intangible media to facilitate the transmission of such instructions. Instructions can be sent or received over a network via a transmission medium using network interface devices.
[0176] "Client device" refers to any machine that interfaces with a communication network to obtain resources from one or more server systems or other client devices. A client device can be, but is not limited to, a mobile phone, desktop computer, laptop computer, portable digital assistant (PDA), smartphone, tablet computer, ultrabook, netbook, multiprocessor system, microprocessor-based or programmable consumer electronics device, game console, set-top box, or any other communication device that a user can use to access a network.
[0177] "Communication network" refers to one or more parts of a network. The network can be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), the Internet, a part of the Internet, a part of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, another type of network, or a combination of two or more such networks. For example, a network or part of a network may include a wireless network or a cellular network, and the coupling may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile Communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling can implement any of various types of data transmission technology, such as Single Carrier Radio Transmission Technology (1xRTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, the 3rd Generation Partnership Project (3GPP) including 3G, fourth-generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High-Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), the Long Term Evolution (LTE) standard, other data transmission technologies defined by various standards-setting organizations, other long-range protocols, or other data transmission technologies.
[0178] A "component" refers to a device, physical entity, or logic with boundaries defined by function or subroutine calls, branch points, APIs, or other technologies that provide partitioning or modularity for specific processing or control functions. Components can be combined with other components via their interfaces to carry out machine processes. A component can be a packaged functional hardware unit designed for use with other components, or a part of a program that typically performs a particular function among related functions.
[0179] A component can constitute a software component (e.g., code implemented on a machine-readable medium) or a hardware component. A “hardware component” is a tangible unit capable of performing certain operations and which can be configured or arranged in a certain physical manner. In various examples, one or more computer systems (e.g., standalone computer systems, client computer systems, or server computer systems) or one or more hardware components (e.g., processors or processor groups) of a computer system can be configured by software (e.g., an application or application portion) as a hardware component that operates to perform certain operations described herein.
[0180] Hardware components can also be implemented mechanically, electronically, or in any suitable combination thereof. For example, a hardware component may include dedicated circuitry or logic permanently configured to perform certain operations. A hardware component may be a dedicated processor, such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). A hardware component may also include programmable logic or circuitry temporarily configured by software to perform certain operations. For example, a hardware component may include software executed by a general-purpose processor or other programmable processor. Once configured by such software, the hardware component becomes a specific machine (or a specific part of a machine) uniquely tailored to perform the configured function, and is no longer a general-purpose processor. It will be appreciated that the decision to implement a hardware component mechanically in a dedicated and permanently configured circuit or in a temporarily configured (e.g., software-configured) circuit may be made for cost and time considerations. Therefore, the phrase “hardware component” (or “hardware-implemented component”) should be understood to include tangible entities, i.e., entities physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain way or perform certain operations described herein.
[0181] In examples where hardware components are temporarily configured (e.g., programmed), each of the hardware components need not be configured or instantiated at any one instant in time. For instance, where a hardware component includes a general-purpose processor that is configured by software to become a dedicated processor, that general-purpose processor can be configured as different dedicated processors (e.g., constituting different hardware components) at different times. The software accordingly configures one or more specific processors to constitute a specific hardware component at one time and a different hardware component at a different time.
[0182] Hardware components can provide information to and receive information from other hardware components. Therefore, the described hardware components can be considered communicatively coupled. In the presence of multiple hardware components, communication can be achieved through signal transmission (e.g., via appropriate circuitry and buses) between or among two or more hardware components. In an example where multiple hardware components are configured or instantiated at different times, such communication between hardware components can be achieved, for example, by storing information in a memory structure accessible to the multiple hardware components and retrieving information from that memory structure. For example, a hardware component can perform an operation and store the output of that operation in a memory device communicatively coupled to it. Other hardware components can then access that memory device at a subsequent time to retrieve and process the stored output. Hardware components can also initiate communication with input or output devices and can operate on resources (e.g., collections of information).
[0183] Various operations of the example methods described herein can be performed, at least in part, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented components that operate to perform one or more operations or functions described herein. As used herein, "processor-implemented component" refers to a hardware component implemented using one or more processors. Similarly, the methods described herein can be at least partially processor-implemented, where a particular processor or processors are an example of hardware. For example, at least some operations of the methods can be performed by one or more processors 1102 or processor-implemented components. Furthermore, the one or more processors can also operate to support the execution of the relevant operations in a "cloud computing" environment or as "Software as a Service" (SaaS). For example, at least some of the operations can be performed by a group of computers (as an example of machines including processors), where these operations can be accessed via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., APIs). The execution of some operations can be distributed among processors, not residing solely within a single machine but deployed across multiple machines. In some examples, the processors or processor-implemented components may be located in a single geographic location (e.g., within a home environment, an office environment, or a server cluster). In other examples, the processors or processor-implemented components may be distributed across multiple geographic locations.
[0184] "Computer-readable storage medium" refers to both machine-readable storage media and transmission media. Therefore, the term includes both storage devices/media and carrier waves/modulated data signals. The terms "machine-readable medium," "computer-readable medium," and "device-readable medium" refer to the same thing and can be used interchangeably in this disclosure.
[0185] An "ephemeral message" is a message that can be accessed for a limited time. An ephemeral message can be text, an image, a video, and so on. The access time for an ephemeral message can be set by the message sender. Alternatively, the access time can be a default setting or a setting specified by the recipient. Regardless of how the access time is set, the message is transient.
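As a minimal illustration of the time-limited access described in this definition, the following Python sketch models a message whose access window can come from the sender, the recipient, or a default. The class name, field names, the default value, and the precedence among the three settings are all hypothetical assumptions made for illustration; the disclosure does not specify them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical default; the disclosure does not give a value.
DEFAULT_ACCESS_SECONDS = 10.0


@dataclass
class TimeLimitedMessage:
    content: str
    sent_at: float                              # seconds on any monotonic clock
    sender_duration: Optional[float] = None     # access time set by the sender
    recipient_duration: Optional[float] = None  # access time set by the recipient

    def access_duration(self) -> float:
        # Assumed precedence: sender setting, then recipient setting, then default.
        if self.sender_duration is not None:
            return self.sender_duration
        if self.recipient_duration is not None:
            return self.recipient_duration
        return DEFAULT_ACCESS_SECONDS

    def is_accessible(self, now: float) -> bool:
        # The message can be accessed only inside its limited time window.
        return self.sent_at <= now < self.sent_at + self.access_duration()
```

For instance, a message sent at time 100.0 with a sender-defined duration of 5.0 seconds would be accessible at time 104.0 and no longer accessible at time 105.0.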
[0186] "Machine storage medium" refers to one or more storage devices and media (e.g., centralized or distributed databases, and associated caches and servers) that store executable instructions, routines, and data. Therefore, this term should be considered to include, but is not limited to, solid-state memory and optical and magnetic media, including memory internal or external to the processor. Specific examples of machine storage media, computer storage media, and device storage media include: non-volatile memory, including, for example, semiconductor memory devices such as erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGAs, and flash memory devices; disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms "machine storage medium," "device storage medium," and "computer storage medium" mean the same thing and may be used interchangeably in this disclosure. The terms "machine storage medium," "computer storage medium," and "device storage medium" expressly exclude carrier waves, modulated data signals, and other such media, at least some of which are covered by the term "signal medium."
[0187] "Non-transitory computer-readable storage medium" refers to a tangible medium capable of storing, encoding, or carrying instructions that can be executed by a machine.
[0188] "Signal medium" means any intangible medium capable of storing, encoding, or carrying instructions executable by a machine, and includes digital or analog communication signals or other intangible media that facilitate the communication of software or data. The term "signal medium" should be considered to include any form of modulated data signal, carrier wave, etc. The term "modulated data signal" refers to a signal in which one or more of its characteristics are set or altered in a manner that encodes information in the signal. The terms "transmission medium" and "signal medium" refer to the same thing and may be used interchangeably in this disclosure.
[0189] Changes and modifications may be made to the disclosed examples without departing from the scope of this disclosure. Such and other changes or modifications are intended to be included within the scope of this disclosure as set forth in the appended claims.
Claims
1. A method comprising: receiving, by one or more processors, a monocular image that includes a depiction of a user wearing clothing; generating, by the one or more processors, a segmentation of the clothing worn by the user in the monocular image; obtaining skeletal key points associated with the user; determining that a position of a joint represented by the skeletal key points is within a threshold distance from a given edge of a boundary of the segmentation of the clothing worn by the user; in response to determining that the position of the joint is within the threshold distance from the given edge of the boundary of the segmentation of the clothing worn by the user, determining a type of the clothing worn by the user; accessing a video feed comprising a plurality of monocular images received prior to the monocular image; smoothing, using the video feed, the segmentation of the clothing worn by the user to provide a smoothed segmentation of the clothing worn by the user; and applying one or more visual effects to the monocular image based on the smoothed segmentation of the clothing worn by the user and the determined type of the clothing worn by the user.
2. The method according to claim 1, wherein the clothing worn by the user includes an upper garment comprising clothing that appears on an upper region of the body, and wherein the monocular image includes a depiction of multiple users wearing respective clothing, the multiple users including the user, the method further comprising: generating multiple segmentations of the respective clothing of the multiple users.
3. The method according to claim 1 or 2, wherein the monocular image is a first frame of a video, the method further comprising: generating the segmentation using a first machine learning technique, wherein smoothing the segmentation of the clothing worn by the user includes comparing the generated segmentation with a previous segmentation generated based on a second frame of the video using a second machine learning technique.
4. The method according to claim 3, wherein the first machine learning technique includes a first deep neural network.
5. The method according to claim 4, further comprising training the first deep neural network by performing operations comprising: receiving training data including a plurality of training monocular images and a ground truth segmentation for each of the plurality of training monocular images; applying the first deep neural network to a first training monocular image of the plurality of training monocular images to estimate a segmentation of clothing worn by a given user depicted in the first training monocular image; computing a deviation between the estimated segmentation and the ground truth segmentation associated with the first training monocular image; updating parameters of the first deep neural network based on the computed deviation; and repeating the applying, computing, and updating steps for each of the plurality of training monocular images.
6. The method according to claim 5, wherein the type of the clothing worn by the user includes a short-sleeved shirt, the joint includes an elbow joint, the given edge corresponds to a sleeve of the short-sleeved shirt, the training data includes a depiction of an upper garment worn by a training user, and the plurality of training monocular images include ground truth skeletal key points of one or more bodies of training users depicted in the respective training monocular images, the method further comprising: identifying, based on the ground truth skeletal key points of the given user depicted in the first training monocular image, a sleeve of the clothing worn by the given user; and in response to identifying the sleeve, adjusting weights associated with parameters of a loss function used to update the parameters of the first deep neural network.
7. The method according to claim 5 or 6, wherein the plurality of training monocular images include a plurality of image resolutions, the method further comprising generating a plurality of segmentation models based on the first deep neural network, wherein a first segmentation model of the plurality of segmentation models is trained based on training monocular images having a first image resolution of the plurality of image resolutions, and a second segmentation model of the plurality of segmentation models is trained based on training monocular images having a second image resolution of the plurality of image resolutions.
8. The method according to claim 5 or 6, wherein the plurality of training monocular images include a plurality of labeled and unlabeled image and video data.
9. The method according to claim 5 or 6, wherein the plurality of training monocular images include a full-body depiction of a particular user, an image lacking a depiction of any user, a depiction of multiple users, and depictions of users at different distances from an image capture device.
10. The method according to claim 1, wherein smoothing the segmentation includes applying a second machine learning technique to the video feed to predict one or more segmentations of the clothing based on depictions of the clothing in respective ones of the plurality of monocular images.
11. The method according to claim 10, wherein the second machine learning technique includes a second deep neural network, and wherein smoothing the segmentation includes comparing the one or more segmentations of the clothing predicted by the second deep neural network with the segmentation of the clothing in the received monocular image generated by the first machine learning technique.
12. The method according to claim 11, further comprising training the second deep neural network by performing operations comprising: receiving training data including a plurality of training videos and a ground truth segmentation of clothing depicted in each of the plurality of training videos; applying the second deep neural network to a first training video of the plurality of training videos to predict a segmentation of the clothing in frames following the first training video; computing a deviation between the predicted segmentation of the clothing in the frames following the first training video and the ground truth segmentation of the clothing depicted in the frames following the first training video; updating parameters of the second deep neural network based on the computed deviation; and repeating the applying, computing, and updating steps for each of the plurality of training videos.
13. The method according to claim 1 or 2, wherein the plurality of monocular images are received in the seconds before the monocular image is received.
14. The method according to claim 1 or 2, further comprising: determining capabilities of one or more client devices used to capture the monocular image; and selecting a segmentation model, based on the capabilities of the one or more client devices, to generate the segmentation.
15. The method according to claim 1 or 2, wherein applying the one or more visual effects includes: replacing the clothing of the user with virtual clothing; and recoloring the clothing by replacing pixels of the clothing of the user that fall within the smoothed segmentation with a target pixel value.
16. The method according to claim 15, wherein the target pixel value corresponds to a pixel representing a texture or color of an animation or target image.
17. The method according to claim 1 or 2, wherein applying the one or more visual effects includes: determining a full-body pose of the user based on data representing three-dimensional full-body position information of the user depicted in the monocular image; determining that the segmentation of the clothing of the user corresponds to upper body clothing; in response to determining that the segmentation of the clothing of the user corresponds to upper body clothing, accessing virtual lower body clothing; adjusting a pose of the virtual lower body clothing based on the full-body pose of the user; receiving a full-body segmentation of the user depicted in the monocular image; blending pixels of the virtual lower body clothing with pixels of a background of the image based on the full-body segmentation of the user depicted in the monocular image; and selecting an occlusion mode between the clothing of the user and the virtual lower body clothing.
18. The method according to claim 17, further comprising: determining that a portion of the segmentation of the clothing of the user covers a first portion of pixels of the virtual lower body clothing; and in response to determining that the portion of the segmentation of the clothing of the user covers the first portion of the pixels of the virtual lower body clothing, replacing the first portion of the pixels of the virtual lower body clothing with a second portion of pixels of the clothing of the user corresponding to the portion of the segmentation.
19. A system comprising: a processor; and a memory component storing instructions that, when executed by the processor, cause the processor to perform operations comprising: receiving a monocular image that includes a depiction of a user wearing clothing; generating a segmentation of the clothing worn by the user in the monocular image; obtaining skeletal key points associated with the user; determining that a position of a joint represented by the skeletal key points is within a threshold distance from a given edge of a boundary of the segmentation of the clothing worn by the user; in response to determining that the position of the joint is within the threshold distance from the given edge of the boundary of the segmentation of the clothing worn by the user, determining a type of the clothing worn by the user; accessing a video feed comprising a plurality of monocular images received prior to the monocular image; smoothing, using the video feed, the segmentation of the clothing worn by the user to provide a smoothed segmentation of the clothing worn by the user; and applying one or more visual effects to the monocular image based on the smoothed segmentation of the clothing worn by the user and the determined type of the clothing worn by the user.
20. A non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform operations comprising: receiving a monocular image that includes a depiction of a user wearing clothing; generating a segmentation of the clothing worn by the user in the monocular image; obtaining skeletal key points associated with the user; determining that a position of a joint represented by the skeletal key points is within a threshold distance from a given edge of a boundary of the segmentation of the clothing worn by the user; in response to determining that the position of the joint is within the threshold distance from the given edge of the boundary of the segmentation of the clothing worn by the user, determining a type of the clothing worn by the user; accessing a video feed comprising a plurality of monocular images received prior to the monocular image; smoothing, using the video feed, the segmentation of the clothing worn by the user to provide a smoothed segmentation of the clothing worn by the user; and applying one or more visual effects to the monocular image based on the smoothed segmentation of the clothing worn by the user and the determined type of the clothing worn by the user.
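For illustration only, and without limiting the claims, the post-segmentation steps recited in claims 1, 6, and 15 can be sketched in Python with NumPy. The sketch assumes a segmentation mask and skeletal key points have already been produced elsewhere. All function names, the 4-neighbour definition of the segmentation boundary, the pixel-based threshold, and the single elbow rule are hypothetical assumptions. The smoothing step uses a plain exponential moving average over earlier frames as a stand-in; it is not the machine-learning-based smoothing described in claims 10 to 12.

```python
import numpy as np


def boundary_pixels(mask: np.ndarray) -> np.ndarray:
    """Return (row, col) coordinates of mask pixels that touch a non-mask pixel."""
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior when its four direct neighbours are all inside the mask.
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(mask & ~interior)


def joint_near_boundary(mask: np.ndarray, joint_rc, threshold: float) -> bool:
    """True when the joint lies within `threshold` pixels of the mask boundary."""
    edge = boundary_pixels(mask)
    if edge.size == 0:
        return False
    distances = np.linalg.norm(edge - np.asarray(joint_rc, dtype=float), axis=1)
    return bool(distances.min() <= threshold)


def determine_clothing_type(mask: np.ndarray, elbow_rc, threshold: float) -> str:
    # Assumed rule, following claim 6: an elbow close to the garment edge
    # (the sleeve) suggests a short-sleeved shirt.
    if joint_near_boundary(mask, elbow_rc, threshold):
        return "short-sleeved shirt"
    return "undetermined"


def smooth_segmentation(current: np.ndarray, previous, alpha: float = 0.4) -> np.ndarray:
    """Exponential moving average over earlier masks (oldest first), then threshold.

    A simple stand-in for the video-based smoothing; `alpha` is an assumed weight.
    """
    smoothed = None
    for frame in list(previous) + [current]:
        frame = frame.astype(float)
        smoothed = frame if smoothed is None else alpha * frame + (1 - alpha) * smoothed
    return smoothed >= 0.5


def recolor(image: np.ndarray, mask: np.ndarray, target_rgb) -> np.ndarray:
    """Replace pixels inside the mask with a target pixel value."""
    out = image.copy()
    out[mask] = target_rgb
    return out
```

In a toy 10x10 frame whose garment mask covers rows 2 to 7 and columns 3 to 6, an elbow at (5, 8) lies 2 pixels from the nearest boundary pixel, so a threshold of 2.5 pixels yields "short-sleeved shirt". With the assumed weight of 0.4, a pixel that appears in the mask for a single frame is suppressed by the moving average, while a pixel that drops out for a single frame is retained.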