Generating video from a static image using interpolation
By applying filters and an image interpolator to pairs of static images and using machine learning models to generate intermediate images, the invention addresses the trade-off between quality and efficiency that existing technologies face when generating realistic video, achieving video generation that is both accurate and efficient.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GOOGLE LLC
- Filing Date
- 2021-12-30
- Publication Date
- 2026-06-16
Figure CN116324989B_ABST
Abstract
Description
[0001] Cross-references to related applications
[0002] This application claims priority to U.S. Provisional Patent Application No. 63/190,234, filed May 18, 2021, entitled “Using Interpolation to Generate a Video from Static Images,” the entire contents of which are incorporated herein by reference.
Background Technology
[0003] Users of devices such as smartphones or other digital cameras capture and store large amounts of photos and videos in their image libraries. Users utilize such libraries to view their photos and videos to recall various events such as birthdays, weddings, vacations, trips, and more. Users can have large image libraries containing tens of thousands of images taken over long periods of time.
[0004] The background description provided herein is for the purpose of generally presenting the context of this disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against this disclosure.
Summary of the Invention
[0005] A method includes selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account. The method further includes applying a filter to select a specific image pair from the candidate image pairs. The method further includes generating one or more intermediate images based on the specific image pairs using an image interpolator. The method further includes generating a video comprising three or more frames arranged in a sequence, wherein a first frame of the sequence is a first still image, a last frame of the sequence is a second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
[0006] In some embodiments, the filter includes a temporal filter that excludes a candidate image pair when the time difference between corresponding timestamps associated with a first still image and a second still image in one or more candidate image pairs is greater than a time threshold. In some embodiments, each of the one or more intermediate images is associated with a corresponding timestamp having a value between the timestamps of the first and second still images, and wherein the position of each intermediate image in the sequence is based on the corresponding timestamp. In some embodiments, the time threshold is 2 seconds. In some embodiments, the filter includes a motion filter that excludes one or more candidate image pairs by: estimating motion between the first and second still images; and determining that the motion between the first and second still images is less than a minimum motion threshold. In some embodiments, the filter further excludes one or more candidate image pairs by determining that the motion between the first and second still images exceeds a maximum motion threshold. In some embodiments, the filter includes a filter machine learning module that excludes one or more candidate image pairs from the candidate image pairs by: generating feature vectors representing a first static image and a second static image in each candidate pair; and excluding one or more candidate image pairs corresponding to corresponding feature vectors, wherein the distance between corresponding feature vectors is greater than a threshold vector distance, wherein the feature vectors are mathematical representations, and wherein the mathematical representations of similar images are closer in vector space than the mathematical representations of dissimilar images. 
In some embodiments, the feature vector is a first feature vector, and the filter machine learning module is further operable to: receive an intermediate image from the one or more intermediate images as input; generate one or more second feature vectors corresponding to the intermediate images; and exclude one or more intermediate images for which the distance between the feature vector of the intermediate image and the feature vector of the corresponding candidate image pair is greater than a threshold vector distance. In some embodiments, the image interpolator includes an interpolation machine learning model that receives the first static image and the second static image as input and generates one or more intermediate images. In some embodiments, generating one or more intermediate images based on a specific image pair includes: generating a plurality of candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video including a first still image as a first frame, a candidate intermediate image as a second frame, and a second still image as a third frame; and selecting a candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain frame interpolation faults. In some embodiments, the method further includes determining an interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein a candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
In some embodiments, the method further includes: determining that an interpolation fault has occurred if a filter excludes one or more intermediate images; and generating one or more additional intermediate images in response to the occurrence of an interpolation fault. In some embodiments, generating a video includes: generating a three-dimensional representation of a scene in a first static image using a depth machine learning model based on a prediction of the depth of the first static image, wherein the depth machine learning model is a classifier that receives the first static image as input, and wherein the video includes camera effects generated based on the three-dimensional representation of the scene. In some embodiments, the camera effects include at least one of the following: zoom, pan, or rotation.
[0007] This specification advantageously describes a method for creating video from a pair of images by synthesizing movement between the images and filling the gaps with newly generated frames. This specification describes a media application that filters candidate image pairs, performs frame interpolation on specific image pairs to generate intermediate images, and generates video from the specific image pairs and the intermediate images. This specification advantageously describes a method for balancing a series of filters that offer high recall and fast computation against a series of filters that offer high accuracy but slower computation.
Attached Figure Description
[0008] Figure 1 is a block diagram of an example network environment according to some embodiments described herein.
[0009] Figure 2 is a block diagram of an example computing device according to some embodiments described herein.
[0010] Figure 3 is a detailed block diagram illustrating a filtering module and an image interpolator according to some embodiments described herein.
[0011] Figure 4 illustrates examples of different filters applied to image pairs according to some embodiments described herein.
[0012] Figure 5 illustrates examples of computer-generated interpolation, including two intermediate images between a first image and a second image, according to some embodiments described herein.
[0013] Figure 6 illustrates an example of a three-dimensional rotation of a static image according to some embodiments described herein.
[0014] Figures 7A to 7B illustrate a flowchart for generating video from image pairs according to some embodiments described herein.
Detailed Implementation
[0015] Example Environment 100
[0016] Figure 1 shows a block diagram of an example environment 100. In some embodiments, environment 100 includes a media server 101, user equipment 115a, and user equipment 115n, all coupled to network 105. Users 125a and 125n may be associated with corresponding user equipment 115a and 115n. In some embodiments, environment 100 may include other servers or devices not shown in Figure 1, or media server 101 may not be included. In Figure 1 and the other accompanying drawings, a letter following a reference numeral (e.g., "115a") indicates a reference to the element having that particular reference numeral. A reference numeral without a following letter (e.g., "115") indicates a general reference to embodiments of the element bearing that reference numeral.
[0017] Media server 101 may include a processor, memory, and network communication hardware. In some embodiments, media server 101 is a hardware server. Media server 101 is communicatively coupled to network 105 via signal line 102. Signal line 102 may be a wired connection, such as Ethernet, coaxial cable, fiber optic cable, etc., or a wireless connection, for example ... or other wireless technologies. In some embodiments, media server 101 transmits data to, and receives data from, one or more of user devices 115a, 115n via network 105. Media server 101 may include media application 103a and database 199.
[0018] Media application 103a may include code and routines (including one or more trained machine learning models) that operate to enable a user interface to generate video with motion from at least two still images. In some embodiments, media application 103a may be implemented using hardware including a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a machine learning processor / coprocessor, any other type of processor, or a combination thereof. In some embodiments, media application 103a may be implemented using a combination of hardware and software.
[0019] Database 199 can store collections of media (e.g., still images; images with movement, such as animated GIFs or motion graphics; videos; etc.) associated with user accounts. Database 199 can store indexed media items associated with the identity of user 125 on user device 115. Database 199 can also store social network data associated with user 125, user preferences of user 125, etc.
[0020] User equipment 115 may be a computing device including memory and a hardware processor. For example, user equipment 115 may include a desktop computer, mobile device, tablet computer, mobile phone, wearable device, head-mounted display, mobile email device, portable game player, portable music player, reader device, or another electronic device capable of accessing network 105.
[0021] In the illustrated embodiment, user equipment 115a is coupled to network 105 via signal line 108, and user equipment 115n is coupled to network 105 via signal line 110. Media application 103 can be stored on user equipment 115a as media application 103b, or on user equipment 115n as media application 103c. Signal lines 108 and 110 can be wired connections, such as Ethernet, coaxial cable, fiber optic cable, etc., or wireless connections, such as … or other wireless technologies. User equipment 115a and 115n are accessed by users 125a and 125n, respectively. User equipment 115a and 115n in Figure 1 are used as examples. Although Figure 1 shows two user equipments 115a and 115n, this disclosure applies to system architectures having one or more user equipments 115.
[0022] In some embodiments, a user account includes an image collection. For example, a user captures images and videos from their camera (e.g., a smartphone or other camera), uploads images from a digital SLR (DSLR) camera, adds images captured by another user shared with them to their image collection, etc.
[0023] Although the example below uses still image pairs, media application 103 can generate video using at least two still images. Media application 103 selects candidate image pairs from an image collection. Each image pair may include a first still image and a second still image from a library.
[0024] Media application 103 applies a filter to select specific image pairs from candidate image pairs. For example, the filter may exclude candidate pairs that are not close copies, candidate pairs without artifacts, candidate pairs of poor quality, a large number of candidate pairs, candidate pairs that are not close in temporal order, candidate pairs outside an acceptable time range (e.g., more than 0.1 seconds apart within two seconds of the captured images), candidate pairs that are dissimilar (having less than a threshold similarity), candidate pairs with too much motion (having motion above a threshold level), candidate pairs lacking bidirectional optical flow, and/or visually distinguishable candidate pairs (e.g., one image has a visual enhancement or a filter applied that renders it differently from the other). For example, media application 103 excludes pairs where the motion estimate between the first and second still images exceeds a maximum motion threshold (e.g., there is too much motion between the two still images to be depicted via interpolation) and/or the motion estimate between the first and second still images is less than a minimum motion threshold (e.g., there is too little motion between the two still images). In some embodiments, media application 103 selects specific image pairs wherein the motion between the images of the pair is greater than the minimum motion threshold and less than the maximum motion threshold, wherein the images depict certain types of objects (e.g., faces, pets, humans, etc.), wherein the images meet quality thresholds (e.g., not blurry, well-lit, etc.), wherein the images depict emotions of interest (e.g., smiling faces), and/or wherein the images depict specific types of activities (e.g., sports, dancing, etc.).
[0025] Media application 103 can use an image interpolator to generate one or more intermediate images based on two still images. Each of the one or more intermediate images can be associated with a corresponding timestamp having a value between the timestamps of the first and second still images. The position of each image in the sequence can be based on the corresponding timestamp.
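As a minimal sketch of the ordering described above, the following Python assigns timestamps to intermediate images and arranges the frames by timestamp. The `Frame` class, the function names, and the even spacing of the timestamps are illustrative assumptions; the description only requires each intermediate timestamp to lie between the timestamps of the two still images.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    """A frame with a capture (or assigned) timestamp in seconds."""
    label: str
    timestamp: float


def intermediate_timestamps(t_first: float, t_last: float, count: int) -> List[float]:
    """Evenly space `count` timestamps strictly between t_first and t_last.

    Even spacing is an assumption made for this sketch.
    """
    step = (t_last - t_first) / (count + 1)
    return [t_first + step * (i + 1) for i in range(count)]


def build_sequence(first: Frame, last: Frame, intermediates: List[Frame]) -> List[Frame]:
    """Place the first still first, the second still last, and sort the rest by timestamp."""
    for frame in intermediates:
        if not first.timestamp < frame.timestamp < last.timestamp:
            raise ValueError(f"{frame.label} falls outside the pair's time range")
    return [first, *sorted(intermediates, key=lambda f: f.timestamp), last]


first = Frame("first_still", 10.0)
last = Frame("second_still", 11.5)
stamps = intermediate_timestamps(first.timestamp, last.timestamp, 2)
mids = [Frame(f"intermediate_{i}", t) for i, t in enumerate(stamps)]
# Pass the intermediates in reverse to show that ordering comes from timestamps.
sequence = build_sequence(first, last, list(reversed(mids)))
```

Sorting by timestamp rather than by generation order keeps the sequence correct even if the interpolator returns frames out of order.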
[0026] In some embodiments, generating one or more intermediate images based on two still images includes: generating a plurality of candidate intermediate images and evaluating each candidate intermediate image by: generating a candidate video including a first still image as a first frame, a candidate intermediate image as a second frame, and a second still image as a third frame; and selecting a candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain frame interpolation faults.
[0027] In some embodiments, the image interpolator is an interpolation machine learning model (e.g., a generative model) that receives a first still image and a second still image as input and generates one or more intermediate images to simulate motion between the first and second images. The interpolation machine learning model can be trained on a training set of videos that include motion. For example, the interpolation machine learning model can receive a subset of frames from a video as input and generate one or more missing frames as output. A cost function based on the difference between the generated frames and their corresponding original frames (excluded from the subset) can be used to train the model. Examples of such cost functions include, but are not limited to, pixel-wise L2 or L1 loss between one or more generated frames and their corresponding original frames. Training can include applying an optimization routine (such as stochastic gradient descent) to the cost function to determine updates to the parameters of the interpolation machine learning model. The optimization routine can be applied until a threshold condition is met. The threshold condition can include a number of training iterations and/or a threshold performance reached on a test dataset. For example, a model can be considered sufficiently trained when the cost function is minimized, such as when the generated frames and their corresponding original frames are indistinguishable. A classifier can be trained to perform comparisons between the generated frames and their corresponding original frames. The training set of videos, obtained with user permission, can include various types of videos, such as videos depicting facial movements (such as smiling or opening/closing eyes); body movements (such as walking, dancing, jumping, etc.); pet movements; etc.
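To make the pixel-wise L1 and L2 losses named above concrete, here is a small Python sketch that compares a generated frame with its held-out original. The nested-list grayscale image representation, the function names, and the choice to average over pixels (rather than sum) are assumptions made for illustration, not details taken from this description.

```python
from typing import Sequence

Image = Sequence[Sequence[float]]  # rows of grayscale pixel intensities (assumed format)


def l1_loss(generated: Image, original: Image) -> float:
    """Mean absolute difference between corresponding pixels."""
    diffs = [
        abs(g - o)
        for g_row, o_row in zip(generated, original)
        for g, o in zip(g_row, o_row)
    ]
    return sum(diffs) / len(diffs)


def l2_loss(generated: Image, original: Image) -> float:
    """Mean squared difference between corresponding pixels."""
    diffs = [
        (g - o) ** 2
        for g_row, o_row in zip(generated, original)
        for g, o in zip(g_row, o_row)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical 2x2 frames: the held-out original and the model's output.
original_frame = [[0.0, 1.0], [2.0, 3.0]]
generated_frame = [[0.0, 2.0], [2.0, 1.0]]

l1 = l1_loss(generated_frame, original_frame)
l2 = l2_loss(generated_frame, original_frame)
```

Because L2 squares each difference, it penalizes the single large error (3.0 versus 1.0) more heavily than L1 does.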
[0028] Media application 103 can use a depth machine learning model to generate a 3D representation of a scene in a first image based on a prediction of depth in the first image. For example, the depth machine learning model could be a classifier that takes the first image as input and outputs a prediction of depth in the first image. The depth prediction could include depth coordinates of objects/features in the image, such as the z-axis coordinates of objects in the image, where the image is in the xy plane. The video could include camera effects generated based on the 3D representation of the scene. Camera effects could include zoom and/or panning.
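The relationship between a depth prediction and a 3D representation can be sketched as follows. This is only an illustration under stated assumptions: the depth prediction is taken to be a per-pixel map, larger values are taken to mean "farther away", and the name `lift_to_3d` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[int, int, float]  # (x, y) pixel position plus predicted z


def lift_to_3d(depth_map: Sequence[Sequence[float]]) -> List[Point3D]:
    """Attach a predicted z coordinate to every (x, y) position in the image plane."""
    return [
        (x, y, z)
        for y, row in enumerate(depth_map)
        for x, z in enumerate(row)
    ]


# Hypothetical 2x2 depth prediction for a tiny image.
predicted_depth = [[1.0, 1.5], [2.0, 4.0]]
points = lift_to_3d(predicted_depth)
```

A renderer could then move a virtual camera relative to these points to produce effects such as zoom or panning; that rendering step is outside this sketch.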
[0029] In some embodiments, the depth machine learning model includes a neural network. In some embodiments, the neural network includes a convolutional neural network. For example, a convolutional neural network can extract features from an input image and create a 3D image by producing a low-resolution version and iteratively refining it. In some embodiments, the depth machine learning (ML) model can be trained using a training set of images and their corresponding depth maps. For example, the depth ML model can be trained to predict the depth of images in the training set, and the prediction can be compared with the ground truth in the depth map, where the difference is used as feedback to update the parameters of the depth ML model during training. The comparison of the predicted depth and the ground truth depth map can be performed using a loss function, such as an L1 or L2 loss between the predicted depth and the corresponding ground truth depth. Optimization routines can be applied to the loss function to determine parameter updates.
[0030] Media application 103 can enable the display of a user interface that includes video. Media application 103 can also provide users with notifications that video is available. Media application 103 can generate video for any images not excluded by filters. Media application 103 can generate video periodically (e.g., monthly, weekly, daily, etc.).
[0031] Example Computing Device 200
[0032] Figure 2 is a block diagram of an example computing device 200 that can be used to implement one or more features described herein. The computing device 200 can be any suitable computer system, server, or other electronic or hardware device. In one example, the computing device 200 is a user device 115 for implementing media application 103. In another example, the computing device 200 is a media server 101. In yet another example, media application 103 is partly on user device 115 and partly on media server 101.
[0033] One or more methods described herein can be used as standalone programs (which can execute on any type of computing device), programs running on web browsers, or mobile applications (“apps”) running on mobile computing devices (e.g., cellular phones, smartphones, smart displays, tablets, wearable devices (wristwatches, armbands, jewelry, headbands, virtual reality goggles or glasses, augmented reality goggles or glasses, head-mounted displays, etc.), laptops, etc.). In the primary example, all computation is performed within the mobile application on the mobile computing device. However, a client / server architecture can also be used, whereby the mobile computing device sends user input data to a server device and receives final output data from the server for output (e.g., for display). In another example, computation can be split between the mobile computing device and one or more server devices.
[0034] In some embodiments, the computing device 200 includes a processor 235, a memory 237, an I / O interface 239, a display 241, a camera 243, and a storage device 245. The processor 235 can be coupled to the bus 218 via signal line 222, the memory 237 can be coupled to the bus 218 via signal line 224, the I / O interface 239 can be coupled to the bus 218 via signal line 226, the display 241 can be coupled to the bus 218 via signal line 228, the camera 243 can be coupled to the bus 218 via signal line 230, and the storage device 245 can be coupled to the bus 218 via signal line 232.
[0035] Processor 235 may be one or more processors and/or processing circuits that execute program code and control the basic operations of computing device 200. "Processor" includes any suitable hardware system, mechanism, or component that processes data, signals, or other information. A processor may include a system having a general-purpose central processing unit (CPU) with one or more cores (e.g., in a single-core, dual-core, or multi-core configuration), multiple processing units (e.g., in a multiprocessor configuration), a graphics processing unit (GPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a complex programmable logic device (CPLD), dedicated circuitry for implementing functions, a dedicated processor for implementing neural network-based processing, neural circuits, a processor optimized for matrix computations (e.g., matrix multiplication), or other systems. In some embodiments, processor 235 may include one or more coprocessors that implement neural network processing. In some embodiments, processor 235 may be a processor that processes data to produce probabilistic outputs; for example, the output produced by processor 235 may be imprecise or may be accurate within a range of expected outputs. Processing need not be limited to a specific geographical location or have time constraints. For example, a processor may perform its functions in real-time, offline, in batch mode, etc. The processing can be performed at different times and in different places by different (or the same) processing systems. The computer can be any processor that communicates with memory.
[0036] Memory 237 is typically provided in computing device 200 for access by processor 235 and can be any suitable processor-readable storage medium, such as random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, etc., adapted to store instructions executed by the processor or set of processors, and located separately from and/or integrated with the processor 235. Memory 237 may store software operated by processor 235 on computing device 200, including media application 103.
[0037] The memory 237 may include an operating system 262, other applications 264, and application data 266. Other applications 264 may include, for example, camera applications, image library applications, image management applications, image gallery applications, media display applications, communication applications, web hosting engines or applications, mapping applications, media sharing applications, etc. One or more methods disclosed herein can operate in various environments and platforms, for example, as a standalone computer program that can run on any type of computing device, as a web application with web pages, as a mobile application (“app”) running on a mobile computing device, etc.
[0038] Application data 266 may be data generated by other applications 264 or the hardware of computing device 200. For example, application data 266 may include images captured by camera 243, user actions identified by other applications 264 (e.g., social networking applications), etc.
[0039] I / O interface 239 provides functionality that enables computing device 200 to interface with other systems and devices. The interfaced devices may be included as part of computing device 200, or they may be separate and communicate with computing device 200. For example, network communication devices, storage devices (e.g., memory 237 and / or database 199), and input / output devices may communicate via I / O interface 239. In some embodiments, I / O interface 239 may connect to interface devices such as input devices (keyboard, pointing device, touchscreen, microphone, camera, scanner, sensor, etc.) and / or output devices (display device, speaker device, printer, monitor, etc.). For example, when a user provides touch input, I / O interface 239 transmits data to media application 103.
[0040] Some examples of devices that can be connected to the I / O interface 239 may include a display 241, which can be used to display content such as images, videos, and / or user interfaces for output applications as described herein, as well as to receive touch (or gesture) input from a user. For example, the display 241 may be used to display a user interface that includes a subset of candidate image pairs. The display 241 may include any suitable display device, such as a liquid crystal display (LCD), a light-emitting diode (LED) or plasma display, a cathode ray tube (CRT), a television, a monitor, a touch screen, a 3D display, or other visual display device. For example, the display 241 may be a flat panel display provided on a mobile device, multiple displays embedded in eyeglasses or headphone devices, or a monitor screen for a computer device.
[0041] Camera 243 can be any type of image capture device capable of capturing images and / or video. In some embodiments, camera 243 captures images or video transmitted to media application 103 via I / O interface 239.
[0042] Storage device 245 stores data related to media application 103. For example, storage device 245 may store image collections associated with user accounts, training sets for machine learning models, videos, etc. In embodiments where media application 103 is part of media server 101, storage device 245 may be the same as database 199 in Figure 1.
[0043] Example Media Application 103
[0044] Figure 2 illustrates an example media application 103 that includes a filtering module 202, an image interpolator 204, and a user interface module 206.
[0045] The filtering module 202 applies filters to select specific image pairs from candidate image pairs. In some embodiments, the filtering module 202 includes a set of instructions executable by the processor 235 to apply the filters. In some embodiments, the filtering module 202 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
[0046] In some embodiments, the filtering module 202 includes a temporal filter that excludes candidate image pairs whose images are too far apart in time, where time refers to the capture time of the images. For example, the temporal filter excludes a candidate image pair when the time difference between the corresponding timestamps associated with the first still image and the second still image is greater than a time threshold. The time threshold can be any time value, such as two seconds, one minute, one day, etc.
[0047] In some embodiments, the temporal filter excludes candidate image pairs that are not in chronological order. In some embodiments, the temporal filter changes the order of any candidate pairs that are not in chronological order and resubmits them to the filtering module 202 for analysis. For example, even if the candidate image pairs are in chronological order, the filtering module 202 may exclude candidate image pairs, such as if the time between the first still image and the second still image exceeds a time threshold.
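The temporal filtering described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each candidate pair is represented only by the two capture timestamps in seconds; the `Pair` alias and the function name are hypothetical, and the two-second default is just one of the example threshold values mentioned in the text.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # capture timestamps (seconds) of the two still images


def temporal_filter(pairs: List[Pair], time_threshold: float = 2.0) -> List[Pair]:
    """Reorder each pair chronologically, then drop pairs that are too far apart in time."""
    kept = []
    for t_a, t_b in pairs:
        # Swap out-of-order pairs so the earlier capture becomes the first still image.
        first, second = (t_a, t_b) if t_a <= t_b else (t_b, t_a)
        if second - first <= time_threshold:
            kept.append((first, second))
    return kept


candidates = [
    (100.0, 101.5),  # 1.5 s apart, already in order: kept
    (205.0, 204.0),  # 1.0 s apart, out of order: reordered and kept
    (300.0, 360.0),  # 60 s apart: excluded
]
selected = temporal_filter(candidates)
```

This sketch follows the reordering embodiment; an embodiment that simply excludes out-of-order pairs would replace the swap with a `continue`.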
[0048] In some embodiments, the filtering module 202 includes a motion filter that excludes candidate image pairs with too much or too little motion. For example, the motion filter may estimate the motion between a first still image and a second still image, and exclude the candidate image pair if the motion between the first still image and the second still image is less than a minimum motion threshold. In another example, the motion filter may also exclude the candidate image pair when the motion between the first still image and the second still image exceeds a maximum motion threshold.
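A motion filter with both a minimum and a maximum threshold can be sketched as below. The motion estimate used here, mean absolute pixel difference, is a deliberately crude stand-in chosen for illustration; the description does not specify how motion is estimated, and the threshold values and function names are likewise assumptions.

```python
from typing import Sequence

Image = Sequence[Sequence[float]]  # rows of grayscale intensities in [0, 1] (assumed format)


def estimate_motion(first: Image, second: Image) -> float:
    """Crude motion proxy: mean absolute pixel difference between the two images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(first, second):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def passes_motion_filter(
    first: Image,
    second: Image,
    min_motion: float = 0.05,  # hypothetical minimum motion threshold
    max_motion: float = 0.5,   # hypothetical maximum motion threshold
) -> bool:
    """Keep a pair only if its estimated motion lies between the two thresholds."""
    motion = estimate_motion(first, second)
    return min_motion <= motion <= max_motion


still = [[0.0, 0.0], [0.0, 0.0]]
slight_change = [[0.0, 0.4], [0.0, 0.4]]
large_change = [[1.0, 1.0], [1.0, 1.0]]

keeps_slight = passes_motion_filter(still, slight_change)
keeps_identical = passes_motion_filter(still, still)
keeps_large = passes_motion_filter(still, large_change)
```

In practice an optical-flow-based estimate would likely replace `estimate_motion`, but the two-sided threshold check would stay the same.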
[0049] In some embodiments, the filtering module 202 includes a quality filter that excludes candidate image pairs with a quality less than a quality threshold. For example, the quality filter excludes candidate image pairs where one or both images in the pair are blurry, noisy, violate the rule of thirds (where the image is divided into three parts and the subject should be in one-third of the image), etc.
[0050] In some embodiments, the filtering module 202 includes a semantic filter that excludes candidate image pairs that do not have a topic determined to be of interest to the user. For example, user interests may be determined based on: (a) explicit preferences provided by the user (e.g., a particular person or pet the user is interested in); and (b) user behavior (where permitted, determining the individuals most frequently depicted in images captured by the user, images frequently viewed by the user, and images for which the user has provided an indication of approval (e.g., like, thumbs up, +1, etc.)), wherein such determination includes a technical comparison of the current image with known user interest attributes (e.g., depicting individual A). The filtering module 202 makes this determination only subject to the user's permission for the collection of user data.
[0051] The filtering module 202 may include a list of acceptable topics that are personalized or more generalized for the user. For example, the user can consistently indicate approval for certain topics, such as the user's daughter, dog, landscape, etc. In another example, the semantic module may automatically exclude images of topics such as receipts, screenshots of memos, etc.
[0052] In some embodiments, the filtering module 202 also excludes images with incompatible image sizes. For example, the filtering module 202 may exclude images that fail to meet a predetermined image size (e.g., images below a specific resolution, such as images that are not at least 400 pixels wide and at least 500 pixels high).
[0053] Turning to Figure 3, a detailed example 300 of the filtering module 202 and the image interpolator 204 is illustrated. In some embodiments, the filtering module 202 includes one or both of the filter 302 and the filter machine learning module 304. In some embodiments, the filter 302 includes one or more of the following: a temporal filter, a motion filter, a quality filter, a semantic filter, etc.
[0054] In some embodiments, the filter machine learning module 304 includes a machine learning model trained to generate feature vectors from candidate images and to filter candidate images based on these feature vectors. In some embodiments, the filter machine learning module 304 includes a set of instructions executable by the processor 235 to generate feature vectors. In some embodiments, the filter machine learning module 304 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
[0055] In some embodiments, the filter machine learning module 304 can generate feature vectors representing the first and second static images in each candidate pair in a multidimensional feature space. Images with similar features can have similar feature vectors; for example, the feature vectors are mathematical representations, and the mathematical representations of similar images are closer in the vector space than the mathematical representations of dissimilar images. The vector space can be a function of various factors of the image, such as the described subject (objects detected in the image), the composition of the image, color information, image orientation, image metadata, specific objects identified in the image (e.g., known faces with user permission), etc.
[0056] In some embodiments, the filter machine learning module 304 may use training data (licensed for training purposes) to generate a trained model, specifically a filter machine learning model. For example, the training data may include ground truth data in the form of image pairs associated with descriptions of visual similarity between the images of each pair. In some embodiments, the descriptions of visual similarity may include feedback from a user regarding whether the image pairs are related. In some embodiments, descriptions of visual similarity may be added automatically through image analysis. Training data may be obtained from any source, such as a data repository specifically tagged for training, data licensed for use as training data for machine learning, etc. In some embodiments, training may occur on a media server 101 that directly provides training data to user device 115, training may occur locally on user device 115, or a combination of both.
[0057] In some embodiments, training data may include synthetic data generated for training purposes, such as data not based on activities in the training context, for example, data generated from simulated or computer-generated images / videos. In some embodiments, the filter machine learning module 304 uses weights that are taken from another application and are unedited / transferred. For example, in these embodiments, a trained model may be generated, for example, on a different device and provided as part of media application 103. In various embodiments, a trained model may be provided as a data file including a model structure or form (e.g., defining the number and type of neural network nodes, the connectivity between nodes, and the organization of nodes into multiple layers) and associated weights. The filter machine learning module 304 may read the data file for the trained model and implement a neural network with node connectivity, layers, and weights based on the model structure or form specified in the trained model.
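As an illustration of a trained model delivered as a data file, the sketch below reads a structure and its weights and then runs a forward pass. The JSON layout, the key names, and the use of ReLU on every layer are hypothetical choices for this example and are not specified by the description.

```python
import json

# Hypothetical data file: layer sizes plus one weight matrix and one bias
# vector per connection between consecutive layers.
MODEL_JSON = """
{
  "layers": [2, 2, 1],
  "weights": [[[0.5, -0.5], [1.0, 1.0]], [[1.0, 1.0]]],
  "biases": [[0.0, 0.0], [0.5]]
}
"""


def load_model(text):
    """Parse the data file and verify the weights match the declared structure."""
    spec = json.loads(text)
    sizes = spec["layers"]
    if len(spec["weights"]) != len(sizes) - 1 or len(spec["biases"]) != len(sizes) - 1:
        raise ValueError("number of weight/bias blocks does not match the layers")
    for i, (w, b) in enumerate(zip(spec["weights"], spec["biases"])):
        n_in, n_out = sizes[i], sizes[i + 1]
        if len(w) != n_out or len(b) != n_out or any(len(row) != n_in for row in w):
            raise ValueError(f"layer {i} does not match the declared structure")
    return spec


def forward(model, x):
    """Run the input through each layer: weighted sum, bias, then ReLU."""
    for w, b in zip(model["weights"], model["biases"]):
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(w, b)]
    return x
```

Validating the weights against the declared structure at load time surfaces a mismatched file immediately rather than as a silent error during inference.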
[0058] The filter machine learning module 304 generates a trained model, referred to herein as the filter machine learning model. In some embodiments, the filter machine learning module 304 is configured to apply the filter machine learning model to data such as application data 266 (e.g., candidate image pairs) to identify one or more features in each candidate image and generate a corresponding feature vector (embedding) representing the candidate image pair. In some embodiments, the filter machine learning module 304 may include software code to be executed by the processor 235. In some embodiments, the filter machine learning module 304 may specify a circuit configuration (e.g., for a programmable processor, for a field-programmable gate array (FPGA), etc.) that enables the processor 235 to apply the filter machine learning model. In some embodiments, the filter machine learning module 304 may include software instructions, hardware instructions, or a combination thereof. In some embodiments, the filter machine learning module 304 may provide an application programming interface (API) that can be invoked by the operating system 262 and / or other applications 264, for example, to apply the filter machine learning model to application data 266 to output a corresponding feature vector for the candidate image pair. In some embodiments, the mathematical representations of the candidate image pairs retained by the filter machine learning module 304 are closer in vector space than the mathematical representations of the candidate image pairs excluded by the filter machine learning module 304.
[0059] In some embodiments, the filter machine learning model includes a classifier that takes candidate image pairs as input. Examples of classifiers include neural networks, support vector machines, k-nearest neighbors, logistic regression, Naive Bayes, decision trees, perceptrons, etc.
[0060] In some embodiments, the filter machine learning model may include one or more model forms or structures. For example, the model form or structure may include any type of neural network, such as a linear network, a deep neural network that implements multiple layers (e.g., "hidden layers" between the input layer and the output layer, where each layer is a linear network), a convolutional neural network (CNN) (e.g., a network that splits or divides input data into multiple parts or tiles, processes each tile individually using one or more neural network layers, and aggregates the results of the processing from each tile), a sequence-to-sequence neural network (e.g., a network that takes sequential data such as words in a sentence, frames in a video, etc., as input and produces a sequence of results as output), and so on.
[0061] The model form or structure can specify the connectivity between individual nodes and the organization of the nodes into layers. For example, nodes in the first layer (e.g., the input layer) can receive data as input data or application data. This data can include, for example, one or more pixels per node, such as when the filter machine learning model is used to analyze input images, such as candidate image pairs. Subsequent intermediate layers can receive the outputs of nodes in the previous layer as input, depending on the connectivity specified in the model form or structure. These layers can also be referred to as hidden layers. The final layer (e.g., the output layer) produces the output of the filter machine learning model. For example, the output can be a feature vector of a candidate image pair. In some embodiments, the model form or structure also specifies the number and / or type of nodes in each layer.
[0062] Features output by the filter machine learning module 304 may include the subject (e.g., sunset and a specific person); colors present in the image (green mountains and blue lake); color balance; light source, angle, and intensity; the position of objects in the image (e.g., following the rule of thirds); the position of objects relative to each other (e.g., depth of field); the location where the image was taken; focus (foreground and background); or shadows. While the aforementioned features are human-understandable, it should be understood that the feature output may be an embedding or other mathematical value that represents the image and is not human-interpretable (e.g., no individual feature value corresponds to a specific feature, such as the presence of color, object position, etc.); however, the trained model is robust to images such that it outputs similar features for similar images and outputs corresponding dissimilar features for images with significant dissimilarity. Examples of such models include the encoder of an autoencoder model.
[0063] In some embodiments, the model form is a CNN with network layers, where each network layer extracts image features at a different level of abstraction. A CNN used to identify features in an image can be used for image classification. The model architecture can include combinations and orders of layers consisting of multidimensional convolutions, average pooling, max pooling, activation functions, normalization, regularization, and other layers and modules of deep neural networks used in practice for applications.
[0064] In various embodiments, the filter machine learning model may include one or more models. One or more of these models may include multiple nodes arranged in layers according to the model structure or form. In some embodiments, a node may be a computational node without memory, for example, configured to process an input unit to produce an output unit. The computation performed by the node may include, for example, multiplying each of the multiple node inputs by a weight to obtain a weighted sum, and adjusting the weighted sum using a bias or intercept value to produce the node output. For example, the filter machine learning module 304 may adjust the corresponding weights based on feedback in response to automatically updating one or more parameters of the filter machine learning model.
[0065] In some embodiments, the computation performed by the node may further include applying a step / activation function to the adjusted weighted sum. In some embodiments, the step / activation function may be a nonlinear function, such as a ReLU function, a sigmoid function, a tanh function, etc. In various embodiments, such computation may include operations such as matrix multiplication. In some embodiments, computations performed by multiple nodes may be performed in parallel, for example, using multiple processor cores of a multi-core processor, using individual processing units of a graphics processing unit (GPU), or using dedicated neural circuitry. In some embodiments, the node may include memory, for example, being able to store and use one or more earlier inputs while processing subsequent inputs. For example, a node with memory may include a Long Short-Term Memory (LSTM) node. LSTM nodes can use memory to maintain states that allow the node to act like a finite state machine (FSM). Models with such nodes may be useful in processing sequential data, such as words in a sentence or paragraph, a series of images, frames in a video, speech or other audio, etc. For example, a heuristic-based model used in a gating model may store one or more previously generated features corresponding to previous images.
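The computation of a memoryless node (a weighted sum of the inputs, adjusted by a bias, then passed through an activation function) can be sketched as follows. The function names are illustrative only; a production model would typically express the same arithmetic as matrix multiplications.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clamps negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic function mapping any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias, activation=relu):
    """Multiply each input by its weight, sum, add the bias, apply the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)
```

For example, `node_output([1.0, 2.0], [0.5, 1.0], 0.25)` computes 0.5 + 2.0 + 0.25 and, because the result is positive, ReLU returns it unchanged. Passing `activation=math.tanh` selects the tanh function mentioned in the text.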
[0066] In some embodiments, a filter machine learning model may include embeddings or weights for individual nodes. For example, the filter machine learning model may be initialized as a layer of nodes organized as specified by the model form or structure. During initialization, appropriate weights may be applied to the connections between each pair of nodes connected by each model form (e.g., nodes in consecutive layers of a neural network). For example, the appropriate weights may be randomly assigned or initialized to default values. The filter machine learning model may then be trained, for example, using a training set of image pairs to produce results. In some embodiments, a subset of the overall architecture may be reused from other machine learning applications as a transfer learning method to leverage the pre-trained weights.
[0067] For example, training may include applying supervised learning techniques. In supervised learning, training data may include multiple inputs (e.g., image pairs from a collection of images associated with a user account) and a corresponding expected output for each image pair (e.g., image embeddings for the image pairs). The values of the weights are automatically adjusted based on a comparison between the output of the filter machine learning model and the expected output, for example, to increase the probability that the filter machine learning model will produce the expected output when given similar inputs. The comparison can be performed using a loss function, where the adjusted weight values are determined by applying an optimization routine to the loss function.
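The weight-adjustment loop described here can be illustrated with a deliberately small stand-in: a single linear node trained by gradient descent on a mean-squared-error loss. The scalar inputs and targets, the learning rate, and the epoch count are assumptions for the sketch; the filter machine learning model itself would operate on images and embeddings.

```python
def mse(samples, w, b):
    """Loss function: mean squared difference between output and expected output."""
    return sum(((w * x + b) - y) ** 2 for x, y in samples) / len(samples)


def train_linear(samples, lr=0.1, epochs=500):
    """Adjust w and b step by step to reduce the loss on (input, expected) pairs."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = (w * x + b) - y        # comparison of output with expected output
            grad_w += 2.0 * err * x / n  # derivative of the loss with respect to w
            grad_b += 2.0 * err / n      # derivative of the loss with respect to b
        w -= lr * grad_w                 # optimization step: move against the gradient
        b -= lr * grad_b
    return w, b
```

Trained on samples drawn from y = 2x + 1, the loop moves `w` toward 2 and `b` toward 1, which is the sense in which the adjusted weights make the expected output more likely for similar inputs.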
[0068] In some embodiments, training may include applying unsupervised learning techniques. In unsupervised learning, only input data (e.g., image pairs from a collection of images associated with a user account) may be provided, and a filter machine learning model may be trained to distinguish the data, for example, clustering image pairs into different groups.
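One common way to group unlabeled inputs is k-means clustering, sketched below on small 2-D vectors that stand in for image-pair embeddings. The description does not name a clustering algorithm, so k-means, the deterministic initialization, and the iteration count are all assumptions of this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=10):
    """Cluster points into k groups; returns (labels, centers)."""
    # Deterministic initialization for the sketch: use the first k points.
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centers
```

No expected outputs are supplied; the grouping emerges only from distances between the inputs.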
[0069] In various embodiments, the trained model includes a set of weights corresponding to the model structure. In embodiments where the training set is omitted, the filter machine learning module 304 may generate a filter machine learning model based on (e.g., by the developer of the filter machine learning module 304, by a third party, etc.) a previously trained filter machine learning model. In some embodiments, the filter machine learning model may include a fixed set of weights (e.g., downloaded from a server that provides the weights).
[0070] In some embodiments, the filter machine learning module 304 may be implemented offline. Implementing the filter machine learning module 304 may include using a static training set that does not include updates when the data in the static training set changes. This advantageously results in increased processing efficiency performed by the computing device 200 and reduced power consumption of the computing device 200. In these embodiments, a filter machine learning model may be generated in a first stage and provided as part of the filter machine learning module 304. In some embodiments, small updates to the filter machine learning model may be implemented online, including updates to the training data as part of training the filter machine learning model. Small updates are updates smaller than a size threshold. The size of the update is related to the number of variables in the filter machine learning model affected by the update. In such embodiments, an application that invokes the filter machine learning module 304 (e.g., one or more of operating system 262, other applications 264, etc.) may utilize image embeddings of candidate image pairs to identify visually similar clusters. The filter machine learning module 304 may also periodically (e.g., hourly, monthly, quarterly, etc.) generate system logs, which may be used to update the filter machine learning model, for example, to update the embeddings of the filter machine learning model.
[0071] In some embodiments, the filter machine learning module 304 may be implemented in a manner that adapts to the specific configuration of the computing device 200 on which the filter machine learning module 304 is executed. For example, the filter machine learning module 304 may determine a computation graph utilizing available computing resources (e.g., processor 235). For example, if the filter machine learning module 304 is implemented as a distributed application across multiple devices, such as where media server 101 comprises multiple instances of media server 101, the filter machine learning module 304 may determine the computations to be performed on individual devices in a computationally optimized manner. In another example, the filter machine learning module 304 may determine that processor 235 includes a GPU with a specific number of GPU cores (e.g., 1000), and implement the filter machine learning module 304 accordingly (e.g., as 1000 individual processes or threads).
[0072] In some embodiments, the filter machine learning module 304 can integrate trained models. For example, the filter machine learning model may include multiple trained models, each applicable to the same input data. In these embodiments, the filter machine learning module 304 may select a specific trained model, for example, based on available computing resources, the success rate of previous inferences, etc.
[0073] In some embodiments, the filter machine learning module 304 can execute multiple trained models. In these embodiments, the filter machine learning module 304 can combine the outputs from the applied individual models, for example, using a voting technique that scores the individual outputs from each applied trained model, or by selecting one or more specific outputs. In some embodiments, such a selector is part of the model itself and serves as a connection layer between the trained models. Furthermore, in these embodiments, the filter machine learning module 304 can apply a time threshold (e.g., 0.5 ms) for applying the individual trained models and utilize only those individual outputs available within the time threshold. Outputs not received within the time threshold may not be utilized, for example, they may be excluded. This approach may be suitable, for example, when the operating system 262 or one or more applications 264 specify a time limit when invoking the filter machine learning module 304. In this way, the maximum time spent by the filter machine learning module 304 performing a task (e.g., identifying one or more features in a candidate image pair and generating a corresponding feature vector (embedding) representing that candidate image pair) can be bounded, which improves the responsiveness of the media application 103 and allows the filter machine learning module 304 to provide a real-time guarantee with best-effort classification.
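The combination of a time threshold with voting can be sketched as below. To keep the example deterministic, each model's result is represented as a precomputed (label, latency) pair rather than a live call; that representation, the majority-vote rule, and the function name are assumptions for illustration.

```python
from collections import Counter


def combine_within_deadline(results, time_threshold_ms):
    """Majority vote over the outputs that arrived within the time threshold.

    results: list of (label, latency_ms) pairs, one per trained model.
    Returns the winning label, or None if no output arrived in time.
    """
    on_time = [label for label, latency_ms in results if latency_ms <= time_threshold_ms]
    if not on_time:
        return None
    # most_common(1) yields the label with the highest vote count.
    return Counter(on_time).most_common(1)[0][0]
```

Late outputs are simply dropped, so the caller's wait is bounded by the threshold even when some models are slow, at the cost of deciding with fewer votes.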
[0074] In some embodiments, the operation of the filter machine learning module 304 causes one or more candidate image pairs to be excluded, namely those for which the distance between the corresponding feature vectors of the first and second static images is greater than a threshold vector distance.
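A minimal sketch of this exclusion rule follows, assuming Euclidean distance as the vector distance and a dictionary keyed by a pair identifier; neither choice is stated in the description.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retain_similar_pairs(pairs, threshold):
    """Return the ids of pairs whose two feature vectors are within the threshold.

    pairs: dict mapping a pair id to (vector_of_first_image, vector_of_second_image).
    Pairs with a distance greater than the threshold are excluded.
    """
    return [pid for pid, (u, v) in pairs.items() if euclidean(u, v) <= threshold]
```

Other distance measures, such as cosine distance, could be substituted by replacing `euclidean`.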
[0075] In some embodiments, the filter machine learning module 304 receives feedback. For example, the filter machine learning module 304 may receive feedback from a user or a set of users via the user interface module 206. Feedback may include, for example, that candidate image pairs are too dissimilar to be used to generate a video. If a single user provides feedback, the filter machine learning module 304 provides feedback to a filter machine learning model, which uses the feedback to update the parameters of the filter machine learning model to modify the output image embedding of the cluster of candidate image pairs. In the case of feedback provided by a set of users, the filter machine learning module 304 provides aggregated feedback to a filter machine learning model, which uses the aggregated feedback to update the parameters of the filter machine learning model to modify the output image embedding of the cluster of candidate image pairs. For example, aggregated feedback may include a subset of videos and how the user reacts to the subset of videos by watching only one video and refusing to watch the rest, watching all videos in the subset, sharing the video, providing an indication of approval or disapproval of the video (e.g., thumbs up / thumbs down, like, +1, etc.). The filter machine learning module 304 may modify the cluster of candidate image pairs based on updating the parameters of the filter machine learning model.
[0076] In some embodiments, the filtering module 202 determines a subset of different filters to apply to candidate image pairs. Figure 4 illustrates different options 400 of filters to be applied to the image pairs. Filters can be categorized as early filters, mid-term filters, and late filters. Early filters may perform analysis of image metadata, such as temporal filters that exclude candidate image pairs that are too far apart in time. Mid-term filters may perform analysis of image data, such as motion filters that exclude candidate image pairs with too much motion. Late filters may analyze image data and test renders. The choice among filters may involve a trade-off between filters that execute quickly but have high recall and filters that are slower but have high precision. In some embodiments, the filtering module 202 may determine the subset of different filters based on how quickly the video needs to be generated. For example, if videos are generated monthly, processing time may be less important. However, if the video is requested by a user, it may be necessary to minimize processing time to provide a response to the user within a short timeframe, so that the application providing the video is perceived as responsive.
[0077] In some embodiments, the filtering module 202 selects a specific image pair from the candidate image pairs. For example, the specific image pair is the image pair that was not excluded by the filtering module 202. In some embodiments, the specific image pair is selected based on the first image pair among the candidate image pairs that was not excluded by the filtering module 202.
[0078] In some embodiments, the filter machine learning module 304 receives as input, from the image interpolator 204, an intermediate image of the one or more intermediate images, generates a feature vector corresponding to the intermediate image, and excludes the intermediate image if it is too dissimilar to the corresponding first static image or second static image. For example, the filter machine learning module 304 generates a feature vector from the intermediate image, compares that feature vector with the feature vectors of the corresponding candidate image pair, and excludes the intermediate image if the distance between the feature vector of the intermediate image and any of the feature vectors of the candidate image pair is greater than a threshold vector distance.
[0079] Image interpolator 204 generates one or more intermediate images based on specific image pairs. In some embodiments, image interpolator 204 includes a set of instructions executable by processor 235 to generate one or more intermediate images. In some embodiments, image interpolator 204 is stored in memory 237 of computing device 200 and is accessible and executable by processor 235.
[0080] In some embodiments, the image interpolator 204 generates one or more intermediate images that represent intermediate steps, such that inserting the one or more intermediate images as frames between a first static frame and a second static frame provides smooth animation when these frames are displayed sequentially as video. For example, Figure 5 includes an example of interpolation comprising two intermediate frames 550a and 550b between the first frame 500 and the second frame 575. This example illustrates the first frame 500 with open eyes and a closed mouth, and the second frame 575 with partially closed eyes and a partially open mouth with a smiling expression. The image interpolator 204 generates the intermediate frames 550a and 550b to include the movement of the closing eyes and the moving mouth. When displayed sequentially (in the order of 500, 550a, 550b, 575) as video, the intermediate frames 550a and 550b allow the observer to perceive smooth motion between the first still image 500 and the second still image 575.
[0081] Although Figure 5 shows two intermediate images, in different implementations the image interpolator 204 generates one, two, three, or more intermediate images. In some embodiments, the number of intermediate images can be a function of the total motion between frames 500 and 575, where more motion results in more intermediate images. In some embodiments, the image interpolator 204 determines the number of intermediate images based on available computing capacity, where more intermediate images are generated if high capacity exists, and fewer intermediate images are generated if low capacity exists. In some embodiments, the image interpolator 204 can determine the number of intermediate images based on image resolution, where the image interpolator 204 generates more intermediate images if the selected image pair has high resolution. In some embodiments, the image interpolator 204 determines the number of intermediate images based on the refresh rate of the user device 115. For example, a higher refresh rate (e.g., 120Hz) may require more intermediate images than a lower refresh rate (e.g., 50Hz).
[0082] In some embodiments, the image interpolator 204 receives multiple consecutive image pairs and generates multiple intermediate images. For example, the image interpolator 204 may receive images a, b, c, and d, such that a, b; b, c; and c, d each constitute a specific image pair. The image interpolator 204 may generate one or more intermediate images for each specific image pair.
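The chaining of consecutive pairs can be sketched as follows. The `interpolate` argument is a hypothetical callable standing in for the image interpolator; it is assumed to return a list of intermediate images for one pair.

```python
def consecutive_pairs(images):
    """Turn [a, b, c, d] into [(a, b), (b, c), (c, d)]."""
    return list(zip(images, images[1:]))


def build_sequence(images, interpolate):
    """Interleave intermediate images between each consecutive pair.

    interpolate(first, second) is assumed to return a list of intermediates.
    Shared images (b and c above) appear only once in the result.
    """
    if not images:
        return []
    frames = [images[0]]
    for first, second in consecutive_pairs(images):
        frames.extend(interpolate(first, second))
        frames.append(second)
    return frames
```

Appending only the second image of each pair avoids duplicating the frame that two neighboring pairs have in common.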
[0083] In some embodiments, each intermediate image is associated with a corresponding timestamp having a value between the timestamp of the first still image and the timestamp of the second still image. In some embodiments, the image interpolator 204 organizes the video based on the corresponding timestamps of the first still image, one or more intermediate images, and the second still image.
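One simple way to satisfy this timestamp constraint is to space the intermediate images evenly between the two still images and then order all frames by timestamp. Even spacing is an assumption of this sketch; the description requires only that each intermediate timestamp fall between the two endpoints.

```python
def intermediate_timestamps(t_first, t_second, count):
    """Return `count` evenly spaced timestamps strictly between the endpoints."""
    step = (t_second - t_first) / (count + 1)
    return [t_first + step * (i + 1) for i in range(count)]


def order_frames(frames):
    """Arrange (timestamp, image) tuples into a frame sequence sorted by timestamp."""
    return [image for _, image in sorted(frames, key=lambda f: f[0])]
```

Sorting on the timestamp alone keeps the ordering independent of how the image objects themselves compare.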
[0084] In some embodiments, the image interpolator 204 generates one or more intermediate images by: generating candidate intermediate images; evaluating each candidate intermediate image by: generating a candidate video comprising a first still image as a first frame, a candidate intermediate image as a second frame, and a second still image as a third frame; and selecting a candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain an interpolation fault. An interpolation fault may occur in response to providing an intermediate image to the filtering module 202 or based on a failure detected by the image interpolator 204, as discussed in more detail below with reference to the discriminator machine learning module 308.
[0085] In some embodiments, the image interpolator 204 sends each intermediate image to the filtering module 202 (e.g., the filter machine learning module 304) to ensure that the intermediate image is sufficiently similar to a particular image pair. If the filtering module 202 does not exclude an intermediate image, the image interpolator 204 generates a video. If the filter machine learning module 304 excludes an intermediate image, the exclusion is considered a frame interpolation failure, and the image interpolator 204 generates one or more additional intermediate images.
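The generate, evaluate, and retry flow can be sketched as a bounded loop. The `generate` and `has_fault` callables and the attempt limit are hypothetical; they stand in for the interpolation model and for whichever check (the filtering module or a discriminator) reports a frame interpolation failure.

```python
def pick_intermediate(first, second, generate, has_fault, max_attempts=3):
    """Return the first candidate intermediate image whose test video is fault-free.

    generate(first, second, attempt) -> candidate intermediate image.
    has_fault(video) -> True if the three-frame candidate video shows a failure.
    Returns None if every attempt is rejected.
    """
    for attempt in range(max_attempts):
        candidate = generate(first, second, attempt)
        # Candidate video: first still image, candidate, second still image.
        candidate_video = [first, candidate, second]
        if not has_fault(candidate_video):
            return candidate
    return None
```

Capping the number of attempts keeps the loop from running indefinitely when a pair cannot be interpolated cleanly.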
[0086] Figure 3 shows a detailed example 300 of the filtering module 202 and the image interpolator 204. In some embodiments, the image interpolator 204 includes an interpolation machine learning module 306, a discriminator machine learning module 308, a deep machine learning module 310, and a video generator 312.
[0087] In some embodiments, the interpolation machine learning module 306, the discriminator machine learning module 308, and the deep machine learning module 310 may each be a layer / block in a neural network, or each of them may be a separate neural network. For example, the interpolation machine learning module 306 may receive a specific image pair as input and output one or more intermediate images to the discriminator machine learning module 308. If the one or more intermediate images do not contain frame interpolation faults, the discriminator machine learning module 308 may output the one or more intermediate images, which may then be input to the video generator 312. The deep machine learning module 310 may also receive a first static image as input and output a 3D representation of the scene to the video generator 312. Other embodiments are possible. For example, the interpolation machine learning module 306 and the discriminator machine learning module 308 may be layers in a neural network, or the interpolation machine learning module 306 and the deep machine learning module 310 may be layers in a neural network. In yet another example, the interpolation machine learning module 306 may operate independently and directly provide one or more intermediate images as output to the video generator 312.
[0088] In some embodiments, the interpolation machine learning module 306 includes an interpolation machine learning model trained to receive a first still image and a second still image as input and generate one or more intermediate images as output. The interpolation machine learning module 306 may include any type of generative machine learning model trained to generate images from input image pairs. In some embodiments, the interpolation machine learning module 306 includes a set of instructions executable by the processor 235 to generate one or more intermediate images. In some embodiments, the interpolation machine learning module 306 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
[0089] In some embodiments, the interpolation machine learning module 306 may use training data (licensed for training purposes) to generate a trained model, specifically an interpolation machine learning model. For example, the training data may include ground truth data in the form of image pairs and intermediate images associated with descriptions of visual similarity between image pairs and intermediate images. In some embodiments, descriptions of visual similarity may be automatically added through image analysis. The training data may be obtained from any source, such as a data repository specifically tagged for training, data licensed to it for use as training data for machine learning, etc. In some embodiments, training may occur on a media server 101 that directly provides training data to user device 115, training may occur locally on user device 115, or a combination of both.
[0090] In some embodiments, training data may include synthetic data generated for training purposes, such as data not based on activities in the training context, for example, data generated from simulated or computer-generated images / videos. In some embodiments, the interpolation machine learning module 306 uses weights that are taken from another application and are unedited / transferred. For example, in these embodiments, a trained model may be generated, for example, on a different device and provided as part of media application 103. In various embodiments, a trained model may be provided as a data file including a model structure or form (e.g., defining the number and type of neural network nodes, the connectivity between nodes, and the organization of nodes into multiple layers) and associated weights. The interpolation machine learning module 306 may read the data file for the trained model and implement a neural network with node connectivity, layers, and weights based on the model structure or form specified in the trained model.
[0091] In some embodiments, the interpolation machine learning module 306 is configured to apply an interpolation machine learning model to data, such as application data 266 (e.g., selected image pairs), and generate one or more intermediate images that can roughly estimate the positions of different objects between a first static image and a second static image. For example, in the presence of a first static image and a second static image, the interpolation machine learning module 306 outputs a first intermediate image and then outputs a series of intermediate images. In this example, the first and second static images depict a toddler and an infant. The heads of the two children are rotated differently between the first and second static images. In this example, the interpolation machine learning module 306 generates an intermediate image with the child's head positioned between the first and second static images. In some embodiments, the interpolation machine learning module 306 generates additional intermediate images with the child's head positioned between the first and second static images.
[0092] In some embodiments, the interpolation machine learning module 306 may include software code to be executed by the processor 235. In some embodiments, the interpolation machine learning module 306 may specify a circuit configuration (e.g., for a programmable processor, for a field-programmable gate array (FPGA), etc.) that enables the processor 235 to apply an interpolation machine learning model. In some embodiments, the interpolation machine learning module 306 may include software instructions, hardware instructions, or a combination thereof. In some embodiments, the interpolation machine learning module 306 may provide an application programming interface (API) that can be invoked by the operating system 262 and / or other applications 264, for example, to apply an interpolation machine learning model to application data 266 to output one or more intermediate images.
[0093] In some embodiments, the interpolation machine learning model includes a classifier that takes selected image pairs as input. Examples of classifiers include neural networks, support vector machines, k-nearest neighbors, logistic regression, Naive Bayes, decision trees, perceptrons, etc.
[0094] In some embodiments, the interpolation machine learning model may include one or more model forms or structures. For example, the model form or structure may include any type of neural network, such as a linear network, a deep neural network that implements multiple layers (e.g., "hidden layers" between the input layer and the output layer, where each layer is a linear network), a convolutional neural network (CNN) (e.g., a network that splits or divides input data into multiple parts or tiles, processes each tile individually using one or more neural network layers, and aggregates the results of the processing from each tile), a sequence-to-sequence neural network (e.g., a network that takes sequential data such as words in a sentence, frames in a video, etc., as input and produces a sequence of results as output), and so on.
[0095] The model form or structure can specify the connectivity between individual nodes and organize the nodes into layers. For example, nodes in a first layer (e.g., an input layer) can receive data as input data or application data. This data may include, for example, one or more pixels per node, such as when the interpolation machine learning model is used to analyze input images (e.g., selected image pairs). Subsequent intermediate layers may receive the outputs of nodes in the previous layer as input, depending on the connectivity specified in the model form or structure. For example, a first intermediate image between a first still image and a second still image may be part of a first intermediate layer. These layers may also be referred to as hidden layers. The final layer (e.g., an output layer) produces the output of the interpolation machine learning model. For example, the output may be a series of intermediate images based on the first still image, the second still image, and the first intermediate image. In some embodiments, the model form or structure also specifies the number and / or type of nodes in each layer.
[0096] In some embodiments, the model form is a CNN with network layers, where each network layer extracts image features at a different level of abstraction. A CNN used to identify features in an image can be used for image classification. The model architecture can include combinations and orders of layers consisting of multidimensional convolutions, average pooling, max pooling, activation functions, normalization, regularization, and other layers and modules of deep neural networks used in practice for applications.
[0097] In various embodiments, the interpolation machine learning model may include one or more models. One or more of these models may include multiple nodes arranged in layers according to the model structure or form. In some embodiments, a node may be a computational node without memory, for example, configured to process an input unit to produce an output unit. Computations performed by the node may include, for example, multiplying each of the multiple node inputs by a weight to obtain a weighted sum, and adjusting the weighted sum using a bias or intercept value to produce a node output. For example, the interpolation machine learning module 306 may adjust the corresponding weights based on feedback in response to automatically updating one or more parameters of the interpolation machine learning model.
[0098] In some embodiments, the computation performed by the node may further include applying a step / activation function to an adjusted weighted sum. In some embodiments, the step / activation function may be a nonlinear function, such as a ReLU function, a sigmoid function, a tanh function, etc. In various embodiments, such computation may include operations such as matrix multiplication. In some embodiments, computations performed by multiple nodes may be performed in parallel, for example, using multiple processor cores of a multi-core processor, individual processing units using a graphics processing unit (GPU), or dedicated neural circuits. In some embodiments, the node may include memory, for example, being able to store and use one or more earlier inputs while processing subsequent inputs. For example, a node with memory may include a Long Short-Term Memory (LSTM) node. LSTM nodes can use memory to maintain states that allow the node to act like a finite state machine (FSM). Models with such nodes may be useful in processing sequential data, such as words in a sentence or paragraph, a series of images, frames in a video, speech or other audio, etc. For example, a heuristic-based model used in a gating model may store one or more previously generated features corresponding to previous images.
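The node computation described above (a weighted sum of the inputs, adjusted by a bias, then passed through an activation function) can be illustrated with a minimal sketch. The function names, the default choice of ReLU, and the sample values are assumptions made for illustration and do not come from the specification.

```python
import math
from typing import Callable, Sequence


def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, clamps negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(
    inputs: Sequence[float],
    weights: Sequence[float],
    bias: float,
    activation: Callable[[float], float] = relu,  # assumed default
) -> float:
    """Memoryless node: weighted sum of inputs, plus bias, then activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# Hypothetical values: two inputs feeding one node.
positive_case = node_output([1.0, 2.0], [0.5, 0.25], bias=0.0)
negative_case = node_output([1.0, 2.0], [0.5, -1.0], bias=0.25)
tanh_case = node_output([1.0, 2.0], [0.0, 0.0], bias=0.0, activation=math.tanh)
```

A node with memory, such as an LSTM node, would additionally carry state between calls; that is outside the scope of this sketch.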
[0099] In some embodiments, the interpolation machine learning model may include embeddings or weights for individual nodes. For example, the interpolation machine learning model may be initialized as a plurality of nodes organized into layers as specified by the model form or structure. During initialization, corresponding weights may be applied to the connections between each pair of nodes connected by each model form (e.g., nodes in consecutive layers of a neural network). For example, the corresponding weights may be randomly assigned or initialized to default values. The interpolation machine learning model may then be trained, for example, using a training set of image pairs to produce results. In some embodiments, a subset of the overall architecture may be reused from other machine learning applications as a transfer learning method to leverage the pre-trained weights.
[0100] For example, training may include applying supervised learning techniques. In supervised learning, training data may include multiple inputs (e.g., image pairs from a collection of images associated with a user account) and a corresponding expected output for each image pair (e.g., one or more intermediate images). The weights are automatically adjusted based on a comparison between the output of the interpolation machine learning model and the expected output, for example, in a way that increases the probability that the interpolation machine learning model will produce the expected output when given similar inputs.
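As a rough illustration of adjusting weights based on a comparison between the model output and the expected output, the following sketch trains a single linear node by gradient descent on a squared error. The specification does not name a loss function, an optimizer, or a learning rate, so all three are assumptions here, as are the function name and the toy data.

```python
from typing import List, Sequence, Tuple


def train_step(
    weights: Sequence[float],
    bias: float,
    inputs: Sequence[float],
    expected: float,
    learning_rate: float = 0.1,  # assumed value
) -> Tuple[List[float], float, float]:
    """One supervised update for a single linear node.

    Returns the updated weights, the updated bias, and the squared error
    measured before the update.
    """
    predicted = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = predicted - expected  # comparison of output and expected output
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, error ** 2


# Toy training example (hypothetical): one input vector and its expected output.
weights, bias = [0.0, 0.0], 0.0
losses = []
for _ in range(20):
    weights, bias, loss = train_step(weights, bias, [1.0, 2.0], expected=3.0)
    losses.append(loss)
```

Each step moves the weights in the direction that reduces the error, so the recorded losses shrink from one iteration to the next for this toy example.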
[0101] In some embodiments, training may include applying unsupervised learning techniques. In unsupervised learning, only input data (e.g., image pairs from a collection of images associated with a user account) may be provided, and an interpolation machine learning model may be trained to distinguish the data, for example, clustering image pairs into different groups.
[0102] In various embodiments, the trained model includes a set of weights corresponding to the model structure. In embodiments where the training set is omitted, the interpolation machine learning module 306 may generate an interpolation machine learning model based on a previously trained interpolation machine learning model (e.g., one trained by the developer of the interpolation machine learning module 306, by a third party, etc.). In some embodiments, the interpolation machine learning model may include a fixed set of weights (e.g., downloaded from a server that provides weights).
[0103] In some embodiments, the interpolation machine learning module 306 may be implemented offline. Implementing the interpolation machine learning module 306 may include using a static training set that does not include updates when the data in the static training set changes. This advantageously results in increased processing efficiency performed by the computing device 200 and reduced power consumption of the computing device 200. In these embodiments, an interpolation machine learning model may be generated in a first stage and provided as part of the interpolation machine learning module 306. In some embodiments, small updates to the interpolation machine learning model may be implemented online, wherein updates to the training data are included as part of training the interpolation machine learning model. A small update is an update with a size less than a size threshold. The size of the update is related to the number of variables in the interpolation machine learning model affected by the update. In such embodiments, an application that invokes the interpolation machine learning module 306 (e.g., one or more of operating system 262, other applications 264, etc.) may utilize image embeddings of clusters of candidate image pairs to identify visually similar clusters. The interpolation machine learning module 306 may also periodically (e.g., hourly, monthly, quarterly, etc.) generate system logs, which may be used to update the interpolation machine learning model, for example, to update the embeddings of the interpolation machine learning model.
[0104] In some embodiments, the interpolation machine learning module 306 may be implemented in a manner that adapts to a specific configuration of the computing device 200 on which it is executed. For example, the interpolation machine learning module 306 may determine a computation graph utilizing available computing resources (e.g., processor 235). For example, if the interpolation machine learning module 306 is implemented as a distributed application across multiple devices, such as where media server 101 comprises multiple instances of media server 101, the interpolation machine learning module 306 may determine the computations to be performed on individual devices in a computationally optimized manner. In another example, the interpolation machine learning module 306 may determine that processor 235 includes a GPU with a specific number of GPU cores (e.g., 1000), and implement the interpolation machine learning module 306 accordingly (e.g., as 1000 individual processes or threads).
[0105] In some embodiments, the interpolation machine learning module 306 can integrate trained models. For example, the interpolation machine learning model may include multiple trained models, each applicable to the same input data. In these embodiments, the interpolation machine learning module 306 may select a specific trained model, for example, based on available computing resources, the success rate of previous inferences, etc.
[0106] In some embodiments, the interpolation machine learning module 306 may execute multiple trained models. In these embodiments, the interpolation machine learning module 306 may combine outputs from individual applied models, for example, using a voting technique that scores the individual outputs from each applied trained model, or by selecting one or more specific outputs. In some embodiments, such a selector is part of the model itself and serves as a connection layer between trained models. Furthermore, in these embodiments, the interpolation machine learning module 306 may apply a time threshold (e.g., 0.5 ms) for applying individual trained models and utilize only those individual outputs available within the time threshold. Outputs not received within the time threshold may not be utilized (e.g., may be excluded). This approach may be suitable, for example, when there is a time limit specified by the operating system 262 or one or more applications 264 when invoking the interpolation machine learning module 306. In this way, the maximum time spent by the interpolation machine learning module 306 performing a task (e.g., identifying one or more features in a selected image pair and generating one or more intermediate images) can be bounded, which improves the responsiveness of the media application 103 and results in the interpolation machine learning module 306 providing a real-time guarantee for the best-effort generation of one or more intermediate images.
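The combination of a time threshold with a voting technique can be sketched as follows. To keep the example deterministic, model latencies are supplied as recorded values rather than measured from live execution; the `ModelResult` type, the function name, and the simple majority vote are all illustrative assumptions rather than details taken from the specification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelResult:
    model_name: str    # hypothetical identifier of a trained model
    output: str        # hypothetical identifier of the image it produced
    latency_ms: float  # time the model took to produce its output


def combine_within_deadline(
    results: Sequence[ModelResult],
    time_threshold_ms: float = 0.5,  # example threshold from the text
) -> Optional[str]:
    """Discard late outputs, then pick the output with the most votes."""
    on_time = [r for r in results if r.latency_ms <= time_threshold_ms]
    if not on_time:
        return None  # best effort: nothing arrived within the threshold
    votes = Counter(r.output for r in on_time)
    return votes.most_common(1)[0][0]


results = [
    ModelResult("model_a", "img_a", 0.2),
    ModelResult("model_b", "img_a", 0.4),
    ModelResult("model_c", "img_b", 0.3),
    ModelResult("model_d", "img_b", 0.9),  # late, ignored
    ModelResult("model_e", "img_b", 1.2),  # late, ignored
]
winner = combine_within_deadline(results)
nothing = combine_within_deadline(results, time_threshold_ms=0.1)
```

Counting all five outputs would favor `img_b`, but only three arrive within the threshold, so the vote is taken over those three alone.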
[0107] In some embodiments, the interpolation machine learning module 306 receives feedback. For example, the interpolation machine learning module 306 may receive feedback from a user or a set of users via the user interface module 206. Feedback may include, for example, that an intermediate image is too dissimilar to a specific pair of images to be used to generate a video. If a single user provides feedback, the interpolation machine learning module 306 provides feedback to an interpolation machine learning model, which uses the feedback to update its parameters to modify one or more intermediate images in the output. When a set of users provides feedback, the interpolation machine learning module 306 provides aggregated feedback to an interpolation machine learning model, which uses the aggregated feedback to update its parameters to modify the intermediate images in the output. For example, aggregated feedback may include a subset of videos and how the user reacts to the subset of videos by watching only one video and refusing to watch the rest, watching all videos in the subset, sharing the video, providing instructions to approve or disapprove the video (e.g., thumbs up / thumbs down, like, +1, etc.).
[0108] In some embodiments, the discriminator machine learning module 308 includes a discriminator machine learning model trained to receive one or more intermediate images and one or more of a first still image and a second still image as input and to output the probability that the one or more intermediate images are generated images. In some embodiments, the discriminator machine learning module 308 includes a set of instructions executable by the processor 235 to output the probability that one or more intermediate images are generated images. In some embodiments, the discriminator machine learning module 308 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
[0109] In some embodiments, the discriminator machine learning module 308 may use training data (licensed for training purposes) to generate a trained model, specifically a discriminator machine learning model. For example, the training data may include image pairs associated with descriptions of whether intermediate images are generated images and ground truth data in the form of intermediate images. In some embodiments, descriptions of intermediate images may be automatically added through image analysis. Training data may be obtained from any source, such as a data repository specifically tagged for training, data licensed to it for use as training data for machine learning, etc. In some embodiments, training may occur on a media server 101 that directly provides training data to user device 115, training may occur locally on user device 115, or a combination of both. In some embodiments, the discriminator model may be jointly trained with an interpolation machine learning model using generative adversarial methods.
[0110] In some embodiments, the discriminator machine learning module 308 is configured to apply a discriminator machine learning model to data, such as application data 266 (e.g., an intermediate image and one or more of a first still image and a second still image), and generate the probability that the intermediate image is visually indistinguishable from one or more of the first still image and the second still image. In some embodiments, if the probability does not meet a threshold, the intermediate image is excluded. In some embodiments, if the probability exceeds a threshold, depending on the embodiment, the intermediate image is accepted and provided as input to the deep machine learning module 310 or the video generator 312.
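The threshold test on the discriminator output can be sketched in a few lines. The probabilities below are made-up values standing in for the output of a trained discriminator, and the threshold of 0.5 and all names are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def accept_intermediate_images(
    scored_images: Sequence[Tuple[str, float]],
    threshold: float = 0.5,  # assumed value; the text does not specify one
) -> List[str]:
    """Keep images whose probability of being visually indistinguishable
    from the still images exceeds the threshold; exclude the rest."""
    return [image for image, probability in scored_images if probability > threshold]


# Hypothetical (image identifier, probability) pairs.
scored = [("mid_1", 0.91), ("mid_2", 0.32), ("mid_3", 0.50), ("mid_4", 0.77)]
accepted = accept_intermediate_images(scored)
```

The accepted images would then be passed on to the next stage, while the excluded ones are dropped.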
[0111] The above description provides additional information about how the discriminator machine learning module 308 uses training data, employs the processor 235, and is used as a different type of machine learning model, and will not be repeated here.
[0112] In some embodiments, the deep machine learning module 310 includes a deep machine learning model trained to receive a first still image as input and output a 3D representation of a scene in the first still image based on a prediction of the depth of the first still image. In some embodiments, the deep machine learning module 310 includes a set of instructions executable by the processor 235 to output the 3D representation of the scene. In some embodiments, the deep machine learning module 310 is stored in the memory 237 of the computing device 200 and is accessible and executable by the processor 235.
[0113] In some embodiments, the deep machine learning module 310 may use training data (licensed for training purposes) to generate a trained model, specifically a deep machine learning model. For example, the training data may include images of a scene and ground truth data in the form of a three-dimensional representation of that scene. The training data may be obtained from any source, such as a data repository specifically tagged for training, data licensed to it for use as training data for machine learning, etc.
[0114] In some embodiments, the deep machine learning module 310 is configured to apply a deep machine learning model to data, such as application data 266 (e.g., a first still image), and generate a three-dimensional representation of the scene. In some embodiments, the deep machine learning model is a classifier that receives the first still image as input and generates a three-dimensional representation of the scene. The three-dimensional representation of the scene may include camera effects such as scaling, panning, rotation, or combination.
[0115] Figure 6 shows an example of 3D scaling of a static image. The first image 600 can be equivalent to the first static image received as input to a deep machine learning model as described above. The deep machine learning model generates a 3D representation of the scene with a scaled-down camera effect as output. This is illustrated as a second example 650 and a third example 675. In the example of Figure 6, the sequence from 600 to 675 provides the observer with a video in which the people and leaves get closer and closer in the frame, thus emphasizing the depth in the image.
[0116] The above describes how the deep machine learning module 310 uses training data, employs processor 235, and is used as a different type of machine learning model, and these descriptions will not be repeated here.
[0117] Video generator 312 generates a video from a first still image, a second still image, and one or more intermediate images. In some embodiments, three or more frames are arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
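A minimal sketch of the frame ordering follows. It places the first still image first, the second still image last, and the intermediate images in between; ordering the intermediate images by timestamp follows claim 2, while the `Frame` type, the strict bounds check, and the sample timestamps are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Frame:
    label: str        # hypothetical identifier for the image
    timestamp: float  # seconds, hypothetical unit


def assemble_sequence(
    first: Frame, second: Frame, intermediates: Sequence[Frame]
) -> List[Frame]:
    """Return the frame order: first still, sorted intermediates, second still."""
    if not intermediates:
        raise ValueError("a video needs at least one intermediate image")
    for frame in intermediates:
        # Assumption: "between" is treated as strictly between the two stills.
        if not (first.timestamp < frame.timestamp < second.timestamp):
            raise ValueError(f"{frame.label} is outside the still-image interval")
    ordered = sorted(intermediates, key=lambda frame: frame.timestamp)
    return [first, *ordered, second]


first_still = Frame("first", 0.0)
second_still = Frame("second", 1.0)
mids = [Frame("mid_b", 0.66), Frame("mid_a", 0.33)]
sequence = assemble_sequence(first_still, second_still, mids)
labels = [frame.label for frame in sequence]
```

Encoding the ordered frames into an actual video file is left out, since the text does not specify a container or codec.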
[0118] User interface module 206 generates the user interface. In some embodiments, user interface module 206 includes a set of instructions executable by processor 235 to generate the user interface. In some embodiments, user interface module 206 is stored in memory 237 of computing device 200 and is accessible and executable by processor 235.
[0119] Example Flowchart
[0120] Figures 7A to 7B show a flowchart illustrating an example method 700 for generating video from candidate image pairs according to some embodiments. The method shown in flowchart 700 can be executed by the computing device 200 of Figure 2.
[0121] Method 700 may begin at box 702. In box 702, a request for access to a collection of media items associated with a user account is generated. In some embodiments, this request is generated by user interface module 206. Box 702 may be followed by box 704.
[0122] At box 704, a license interface element is displayed. For example, user interface module 206 may display a user interface that includes a license interface element that requests permission from the user to access a collection of media items. Box 704 may be followed by box 706.
[0123] At box 706, it is determined whether the user has granted permission to access the media item collection. In some embodiments, box 706 is performed by user interface module 206. If the user does not grant permission, the method ends. If the user does grant permission, box 706 may be followed by box 708.
[0124] At box 708, candidate image pairs are selected from the image set. For example, candidate pairs are selected as those that occurred during a bounded time period (such as last week, last month, etc.). Alternatively, candidate pairs can be received when they are created after the user captures images from camera 243. Box 708 may be followed by box 710.
[0125] At box 710, it is determined whether a filter excludes candidate image pairs. Filters may include temporal filters, motion filters, etc. If the determination is yes, candidate image pairs are excluded. If the determination is no, any remaining candidate image pairs can be considered as a specific image pair. Box 710 may be followed by box 712.
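The temporal and motion filters applied at box 710 can be sketched as simple predicates. The 2-second time threshold is the value recited in claim 3, and the minimum and maximum motion thresholds correspond to claims 7 and 8; the numeric motion limits, the scalar motion estimate, and all names below are hypothetical placeholders, since the text does not specify how motion is measured.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CandidatePair:
    first_timestamp_s: float
    second_timestamp_s: float
    estimated_motion: float  # hypothetical scalar, e.g. mean pixel displacement


def passes_time_filter(pair: CandidatePair, time_threshold_s: float = 2.0) -> bool:
    """A pair is excluded when its time difference is greater than the threshold."""
    return abs(pair.second_timestamp_s - pair.first_timestamp_s) <= time_threshold_s


def passes_motion_filter(
    pair: CandidatePair,
    min_motion: float = 1.0,   # assumed value
    max_motion: float = 50.0,  # assumed value
) -> bool:
    """A pair is excluded when motion is below the minimum or above the maximum."""
    return min_motion <= pair.estimated_motion <= max_motion


def select_specific_pairs(candidates: Sequence[CandidatePair]) -> List[CandidatePair]:
    """Keep only the candidate pairs that no filter excludes."""
    return [p for p in candidates if passes_time_filter(p) and passes_motion_filter(p)]


candidates = [
    CandidatePair(0.0, 1.5, 10.0),  # kept
    CandidatePair(0.0, 3.0, 10.0),  # excluded: more than 2 s apart
    CandidatePair(0.0, 1.0, 0.2),   # excluded: too little motion
    CandidatePair(0.0, 1.0, 80.0),  # excluded: too much motion
    CandidatePair(0.0, 2.0, 1.0),   # kept: exactly on both limits
]
selected = select_specific_pairs(candidates)
```

Pairs that sit exactly on a threshold are kept here, because the claims exclude only pairs that are greater than or less than the respective limits.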
[0126] At box 712, the image interpolator generates one or more intermediate images based on a specific image pair. Box 712 may be followed by box 714.
[0127] At box 714, it is determined whether the filter or image interpolator excludes one or more intermediate images. If the determination is yes, the one or more intermediate images are excluded. If the determination is no, the one or more intermediate images are provided to the video generator. Box 714 may be followed by box 716.
[0128] At box 716, a video comprising three or more frames arranged in a sequence is generated, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
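The flow from box 706 through box 716 can be summarized in a short sketch. A linear cross-fade stands in for the image interpolator purely so the example runs; it is not the interpolation machine learning model described above. Images are represented as flat lists of pixel values, and every name and parameter here is an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # hypothetical representation: flattened grayscale pixels
Pair = Tuple[Image, Image]


def cross_fade(first: Image, second: Image, t: float) -> Image:
    """Stand-in interpolator: per-pixel linear blend at position t in (0, 1)."""
    return [(1 - t) * a + t * b for a, b in zip(first, second)]


def generate_videos(
    permission_granted: bool,
    candidate_pairs: Sequence[Pair],
    pair_filter: Callable[[Image, Image], bool],
    keep_intermediate: Callable[[Image], bool],
    num_intermediate: int = 3,  # assumed count
) -> List[List[Image]]:
    """Return one frame sequence per image pair that survives both checks."""
    if not permission_granted:  # box 706: the method ends without permission
        return []
    videos = []
    for first, second in candidate_pairs:  # box 708
        if not pair_filter(first, second):  # box 710
            continue
        steps = [(i + 1) / (num_intermediate + 1) for i in range(num_intermediate)]
        intermediates = [cross_fade(first, second, t) for t in steps]  # box 712
        intermediates = [m for m in intermediates if keep_intermediate(m)]  # box 714
        if not intermediates:
            continue
        videos.append([first, *intermediates, second])  # box 716
    return videos


pairs = [([0.0, 0.0], [4.0, 8.0])]
videos = generate_videos(True, pairs, lambda a, b: True, lambda m: True)
denied = generate_videos(False, pairs, lambda a, b: True, lambda m: True)
```

Passing the filters in as callables keeps the sketch independent of how a particular embodiment implements the temporal, motion, or discriminator checks.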
[0129] In addition to the descriptions above, users can be provided with controls that allow them to choose whether and when a system, program, or feature described herein can collect user information (e.g., information about the user's social networks, social behaviors or activities, profession, preferences, or current location), and whether the user is sent content or communications from a server. Furthermore, certain data can be processed in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity can be processed so that no personally identifiable information about the user can be determined, or, where location information is obtained, the user's geographic location can be generalized (e.g., to a city, zip code, or state level) so that the user's specific location cannot be determined. Users can therefore control what information about themselves is collected, how that information is used, and what information is provided to them.
[0130] In the foregoing description, numerous specific details have been set forth for purposes of explanation in order to provide a thorough understanding of this specification. However, it will be apparent to those skilled in the art that this disclosure may be practiced without these specific details. In some instances, structures and devices are shown in block diagram form to avoid obscuring the description. For example, embodiments may be described above primarily with reference to the user interface and specific hardware. However, embodiments can be applied to any type of computing device capable of receiving data and commands, as well as any peripheral devices providing services.
[0131] References to "some embodiments" or "some examples" in this specification mean that a particular feature, structure, or characteristic described in connection with an embodiment or example may be included in at least one of the described implementations. The phrase "some embodiments" appearing in various places in this specification does not necessarily refer to the same embodiment.
[0132] Some portions of the detailed description above are presented in terms of algorithms and symbolic representations of operations on data bits within computer memory. These algorithmic descriptions and representations are the means by which those skilled in the data processing arts most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulation of physical quantities. Typically, though not always, these quantities take the form of electrical or magnetic data that can be stored, transmitted, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these data as bits, values, elements, symbols, characters, terms, numbers, etc.
[0133] However, it should be remembered that all these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities. Unless specifically stated otherwise, as is apparent from the following discussion, it should be understood that throughout the description, the use of terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” refers to the actions and processes of a computer system or similar electronic computing device that manipulates data represented as physical (electronic) quantities in the registers and memories of the computer system and transforms them into other data similarly represented as physical quantities in the computer system’s memories or registers or other such information storage, transmission, or display devices.
[0134] Embodiments of this specification may also relate to a processor for performing one or more steps of the methods described above. The processor may be a dedicated processor selectively activated or reconfigured by a computer program stored in a computer. Such a computer program may be stored in a non-transitory computer-readable storage medium, including but not limited to any type of disk, including optical discs, ROMs, CD-ROMs, magnetic disks, RAM, EPROMs, EEPROMs, magnetic or optical cards, flash memory, including USB keys with non-volatile memory, or any type of medium suitable for storing electronic instructions, each coupled to a computer system bus.
[0135] The description may take the form of some entirely hardware embodiments, some entirely software embodiments, or some embodiments that include both hardware and software elements. In some embodiments, this specification is implemented in software, which includes, but is not limited to, firmware, resident software, microcode, etc.
[0136] Furthermore, the description may take the form of a computer program product accessible from a computer-usable or computer-readable medium that provides program code for use by or in conjunction with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any means that can contain, store, deliver, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
[0137] A data processing system suitable for storing or executing program code will include at least one processor directly or indirectly coupled to memory elements via a system bus. Memory elements may include local memory used during the actual execution of the program code, mass storage devices, and cache memory that provides temporary storage for at least some of the program code to reduce the number of times code must be retrieved from the mass storage device during execution.
Claims
1. A computer-implemented method for generating video, comprising: selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account; applying a filter to select a specific image pair from the candidate image pairs, wherein the filter includes a time filter that excludes one or more of the candidate image pairs when the time difference between the corresponding timestamps associated with the first still image and the second still image in the pair is greater than a time threshold; generating, using an image interpolator, one or more intermediate images based on the specific image pair; and generating a video comprising three or more frames arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
2. The method according to claim 1, wherein each of the one or more intermediate images is associated with a corresponding timestamp having a value between the timestamp of the first still image and the timestamp of the second still image, and the position of each intermediate image in the sequence is based on the corresponding timestamp.
3. The method according to claim 1, wherein the time threshold is 2 seconds.
4. The method according to claim 1, wherein the image interpolator includes an interpolation machine learning model that receives the first still image and the second still image as input and generates the one or more intermediate images.
5. The method according to claim 1, wherein generating the one or more intermediate images based on the specific image pair includes: generating multiple candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video, wherein the candidate video includes the first still image as the first frame, the candidate intermediate image as the second frame, and the second still image as the last frame; and if the candidate video does not contain frame interpolation faults, selecting the candidate intermediate image as one of the one or more intermediate images.
6. The method of claim 5, further comprising: determining the frame interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein the candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
7. A computer-implemented method for generating video, comprising: selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account; applying a filter to select a specific image pair from the candidate image pairs, wherein the filter includes a motion filter that excludes one or more of the candidate image pairs by: estimating motion between the first still image and the second still image of the candidate pair; and determining that the motion between the first still image and the second still image of the candidate pair is less than a minimum motion threshold; generating, using an image interpolator, one or more intermediate images based on the specific image pair; and generating a video comprising three or more frames arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
8. The method according to claim 7, wherein the filter further excludes one or more of the candidate image pairs by determining that the motion between the first still image and the second still image in the candidate pair exceeds a maximum motion threshold.
9. The method according to claim 7, wherein, The image interpolator includes an interpolation machine learning model that receives the first static image and the second static image as input and generates the one or more intermediate images.
10. The method according to claim 7, wherein generating the one or more intermediate images based on the specific image pair includes: generating a plurality of candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video that includes the first still image as a first frame, the candidate intermediate image as a second frame, and the second still image as a last frame; and selecting the candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain a frame interpolation fault.
11. The method of claim 10, further comprising: determining the frame interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein the candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
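By way of non-limiting illustration, the candidate evaluation of claims 10 and 11 can be sketched as follows. The sketch assumes the discriminator is available as a callable that returns a score in [0, 1] indicating how strongly a frame appears artificially generated, and treats a score at or above a threshold as a frame interpolation fault. The callable interface, the threshold, and the placeholder `toy_discriminator` are all hypothetical and are not taken from the claims.

```python
from typing import Callable, List, Sequence

Frame = Sequence[Sequence[float]]        # grayscale rows (illustrative format)
Discriminator = Callable[[Frame], float]  # hypothetical interface: score in [0, 1]


def has_interpolation_fault(
    video: List[Frame], discriminator: Discriminator, threshold: float = 0.5
) -> bool:
    """Check only the interior frames; the first and last frames are the stills."""
    return any(discriminator(frame) >= threshold for frame in video[1:-1])


def select_intermediate_images(
    first: Frame,
    second: Frame,
    candidates: List[Frame],
    discriminator: Discriminator,
    threshold: float = 0.5,
) -> List[Frame]:
    """Keep each candidate whose three-frame candidate video shows no fault."""
    selected = []
    for candidate in candidates:
        candidate_video = [first, candidate, second]
        if not has_interpolation_fault(candidate_video, discriminator, threshold):
            selected.append(candidate)
    return selected


def toy_discriminator(frame: Frame) -> float:
    """Placeholder only: flags frames with intensities outside [0, 1]."""
    out_of_range = any(v < 0.0 or v > 1.0 for row in frame for v in row)
    return 1.0 if out_of_range else 0.0
```

In use, `toy_discriminator` would be replaced by a trained model; the sketch shows only the control flow of building a candidate video per candidate image and keeping the candidates that pass.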
12. A computer-implemented method for generating video, comprising: selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account; applying a filter to select a specific image pair from the candidate image pairs, wherein the filter includes a filter machine learning model that excludes one or more of the candidate image pairs by: generating feature vectors representing the first still image and the second still image in each candidate pair; and excluding one or more of the candidate pairs for which the distance between the respective feature vectors is greater than a threshold vector distance, wherein the feature vectors are mathematical representations, and wherein the mathematical representations of similar images are closer in vector space than the mathematical representations of dissimilar images; generating, using an image interpolator, one or more intermediate images based on the specific image pair; and generating a video comprising three or more frames arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
13. The method according to claim 12, wherein the feature vectors are first feature vectors, and the filter machine learning model is further operable to: receive an intermediate image from the one or more intermediate images as input; generate one or more second feature vectors corresponding to the intermediate image; and exclude one or more intermediate images for which the distance between the respective second feature vector of the intermediate image and the first feature vectors of the corresponding candidate image pair is greater than the threshold vector distance.
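By way of non-limiting illustration, the feature-vector filtering of claims 12 and 13 can be sketched as follows. The sketch assumes the filter machine learning model is exposed as a hypothetical `embed` callable that maps an image to a fixed-length vector, and assumes Euclidean distance as the vector distance; neither choice is stated in the claims. Comparing an intermediate image against both still images of the pair is likewise one possible reading.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

ImageT = TypeVar("ImageT")
Vector = Sequence[float]


def filter_pairs(
    pairs: List[Tuple[ImageT, ImageT]],
    embed: Callable[[ImageT], Vector],  # hypothetical embedding model
    max_distance: float,                # threshold vector distance
) -> List[Tuple[ImageT, ImageT]]:
    """Exclude pairs whose feature vectors are farther apart than the threshold."""
    return [
        (first, second)
        for first, second in pairs
        if math.dist(embed(first), embed(second)) <= max_distance
    ]


def filter_intermediates(
    first: ImageT,
    second: ImageT,
    intermediates: List[ImageT],
    embed: Callable[[ImageT], Vector],
    max_distance: float,
) -> List[ImageT]:
    """Exclude intermediate images that drift too far from either still image."""
    first_vec, second_vec = embed(first), embed(second)
    kept = []
    for image in intermediates:
        vec = embed(image)
        if (math.dist(vec, first_vec) <= max_distance
                and math.dist(vec, second_vec) <= max_distance):
            kept.append(image)
    return kept
```

The same threshold is reused for pairs and for intermediate images because claim 13 refers back to "the threshold vector distance" of claim 12.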
14. The method according to claim 12, wherein the image interpolator includes an interpolation machine learning model that receives the first still image and the second still image as input and generates the one or more intermediate images.
15. The method according to claim 12, wherein generating the one or more intermediate images based on the specific image pair includes: generating a plurality of candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video that includes the first still image as a first frame, the candidate intermediate image as a second frame, and the second still image as a last frame; and selecting the candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain a frame interpolation fault.
16. The method of claim 15, further comprising: determining the frame interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein the candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
17. A computer-implemented method for generating video, comprising: selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account; applying a filter to select a specific image pair from the candidate image pairs; generating, using an image interpolator, one or more intermediate images based on the specific image pair; determining that a frame interpolation failure has occurred if the filter excludes one or more of the intermediate images; in response to the frame interpolation failure occurring, generating one or more additional intermediate images; and generating a video comprising three or more frames arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame.
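By way of non-limiting illustration, the failure handling of claim 17 can be sketched as a retry loop. The sketch assumes hypothetical `interpolate` and `passes_filter` callables, regenerates the whole batch of intermediate images on failure, and bounds the number of attempts; the claim itself does not specify whether all or only the excluded images are regenerated, nor any attempt limit.

```python
from typing import Callable, List, Optional, TypeVar

ImageT = TypeVar("ImageT")

# Hypothetical interpolator signature: (first, second, count, attempt) -> images.
Interpolator = Callable[[ImageT, ImageT, int, int], List[ImageT]]


def generate_video_with_retry(
    first: ImageT,
    second: ImageT,
    interpolate: Interpolator,
    passes_filter: Callable[[ImageT], bool],
    num_intermediate: int = 1,
    max_attempts: int = 3,  # assumption: the claim does not bound the retries
) -> Optional[List[ImageT]]:
    """Return [first, *intermediates, second], or None if every attempt fails."""
    for attempt in range(max_attempts):
        intermediates = interpolate(first, second, num_intermediate, attempt)
        # A frame interpolation failure occurs if the filter excludes any image.
        failure = any(not passes_filter(image) for image in intermediates)
        if not failure:
            return [first, *intermediates, second]
        # Otherwise loop again to generate additional intermediate images.
    return None
```

Passing the attempt index to the interpolator lets a caller vary a seed or model setting between attempts, which is one way additional images could differ from the excluded ones.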
18. The method according to claim 17, wherein the image interpolator includes an interpolation machine learning model that receives the first still image and the second still image as input and generates the one or more intermediate images.
19. The method of claim 17, wherein generating the one or more intermediate images based on the specific image pair includes: generating a plurality of candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video that includes the first still image as a first frame, the candidate intermediate image as a second frame, and the second still image as a last frame; and selecting the candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain a frame interpolation fault.
20. The method of claim 19, further comprising: determining the frame interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein the candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
21. A computer-implemented method for generating video, comprising: selecting candidate image pairs from a set of images associated with a user account, wherein each pair includes a first still image and a second still image from the user account; applying a filter to select a specific image pair from the candidate image pairs; generating, using an image interpolator, one or more intermediate images based on the specific image pair; and generating a video comprising three or more frames arranged in a sequence, wherein the first frame of the sequence is the first still image, the last frame of the sequence is the second still image, and each of the one or more intermediate images is a corresponding intermediate frame in the sequence between the first frame and the last frame; wherein generating the video includes: using a depth machine learning model to generate a three-dimensional representation of a scene in the first still image based on a prediction of the depth of the first still image, wherein the depth machine learning model is a classifier that receives the first still image as input, and wherein the video includes camera effects generated based on the three-dimensional representation of the scene.
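By way of non-limiting illustration, one way a predicted depth map could yield a three-dimensional representation and a camera effect, as recited in claim 21, is sketched below. The sketch assumes a pinhole camera with a hypothetical focal length and a principal point at the image center, takes the depth map as already predicted, and models the camera effect as a small sideways camera shift; none of these details come from the claim.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def back_project(depth: Sequence[Sequence[float]], focal: float) -> List[Point3D]:
    """Lift each pixel (u, v) with depth z to a 3D point under a pinhole model."""
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # assumed principal point
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    return points


def project(
    points: Sequence[Point3D],
    focal: float,
    cx: float,
    cy: float,
    camera_x: float = 0.0,  # sideways camera shift that produces parallax
) -> List[Pixel]:
    """Re-project 3D points into a camera translated by camera_x along the x axis."""
    return [
        (focal * (x - camera_x) / z + cx, focal * y / z + cy)
        for x, y, z in points
    ]
```

With this model a pixel at depth z moves horizontally by focal * camera_x / z, so nearer scene content shifts more than distant content, which is the parallax cue behind a simple pan effect.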
22. The method according to claim 21, wherein the image interpolator includes an interpolation machine learning model that receives the first still image and the second still image as input and generates the one or more intermediate images.
23. The method according to claim 21, wherein generating the one or more intermediate images based on the specific image pair includes: generating a plurality of candidate intermediate images; and evaluating each candidate intermediate image by: generating a candidate video that includes the first still image as a first frame, the candidate intermediate image as a second frame, and the second still image as a last frame; and selecting the candidate intermediate image as one of the one or more intermediate images if the candidate video does not contain a frame interpolation fault.
24. The method of claim 23, further comprising: determining the frame interpolation fault using a discriminator machine learning model trained to determine whether an input image is a generated image, wherein the candidate intermediate image is selected if the discriminator machine learning model determines that the candidate intermediate image is indistinguishable from the generated image.
25. A system for generating video, comprising one or more processors and a memory storing computer-readable instructions that, when executed by the one or more processors, cause the system to perform the operations according to any one of claims 1 to 24.
26. A computer program product comprising computer-readable instructions that, when executed by a computing device, cause the computing device to perform the operations according to any one of claims 1 to 24.